REPEATED SubstringS SieveS

Three variants of a substring sieve are presented:

Cross-Repetitions : repetitions across two strings
Self-Repetitions : repetitions within the same string
Staged Repetitions : repetitions within the first and across the second string

These sieves are inspired by the Karp-Rabin algorithm; in particular they use the Rabin fingerprint modulo M as rolling hash. A minimalistic hash table reduced to M one-bit slots will be used to keep track of the fingerprints.

► The sieves operate on substrings of pre-defined length;
whether they repeate one or multiple times is irrelevant.

Sieving a given string S of length N, consists in successively reducing the residue r, defined by the set of n substrings of length L. As illustrated by the example below, the initial residue contains n = N - L + 1 items, and ideally in the end, the final residue will contain repeating items only.

L is critical as it controls the resulting amount of residue: if it is too wide, the sieve will be empty, if L is too narrow, an unmanageable quantity of repetitions will survive.

In telecommunications L may be interpreted as a noise level, in the sense that the shorter and more frequent the repetitions the less information they provide. However, by suppressing too much noise, one risks to miss valuable information.

Example: Consider the string S of length N= (4*3) + (3*7) = 33 :
S = “abc1234567def1234567ghi1234567jkl”.
With L = 5 and moving forward from left to right, the initial forward residue is :
r = { “abc12”, “bc123”, “c1234”, … , “567jk”, “67jkl” }.
After sieving, the resulting residue will (implicitly) contain for each repetition the items:
“12345”, “23456” and “34567”,
with the left most ones representing the prefixes of length L of the repeated substrings.
Similarly, the backward residue is obtained from the reversed S.

In the literature the overlapping residue items, illustrated in the above example, are also called "shingles". Using this terminology the principle of the method can be stated as follows:

1. 1. sieve unique shingles: eliminate those shingles that produce unique fingerprints
  2. extract start shingles: identify those shingles in the resulting residue that mark beginnings of repetitions (prefixes)
  3. pair the start shingles: find the matching start shingles in the resulting residue

In the following we shall concentrate on two quantities:

1) E_R : the expected number of repeated items in the final residue

2) E_M : the required number of iterations until all unique items are eliminated

E_R in function of L allows to set a limit on the amount of the final residue;
E_Mserves as a hardware independent measure of performance.

This table attempts to show certain analogies between a mechanical sieve and its algorithmic counterpart:

Cross-Repetitions Sieve

Given two IID distributed byte strings S₁, S₂ of lengths N₁, N₂ and a parameter L, after a few rounds, the sieve will only contain those substrings of length L that have at least one repetition in both strings S₁and S₂.

the bytes of strings S₁, S₂ are taken from the same alphabet
repetitions within S₁as well as repetitions within S₂ are irrelevant
S₁, S₂ may overlap at most on L-1 bytes, with a concatenated length N_c ≥ N₁+N₂ - (L-1)

In telecommunications such a sieve may be used to identify repeating information transmitted on correlated channels.

The cross-sieve operates forward and backward:

During the forward movement the fingerprints of residue r₁, originating from S₁, are mapped to the hash table. The residue r₂ originating from S₂ is then cross-checked against the table and reduced by the items that are not found.

For the backward movement the roles of S₁and S₂ are inversed.

1.1 Number of Repeated Substrings E_R (S₁x S₂)

Each byte of the strings occurring with equal probability p= 1/256, the expected number of cross-repetitions of length L between S₁and S₂ is given by

E_R (S₁ x S₂) = E_R(S₁+ S₂) - E_R(S₁) - E_R(S₂)

= ½ p^L ( ((N_c-L)² + (N_c-L)) - ((N₁-L)² + (N₁-L)) - ((N₂-L)² + (N₂-L)) )

≈ ½ p^L ( (N₁+N₂)² - N₁² - N₂² ) , N₁,N₂ → ∞

≈ p^L N₁N₂ = p^L N² k, with N₁ = N and N₂ = kN

With the formula of section 2.1 we can write

E_R (S₁ x S₂) ≈ E_R (S) * 2k

Hence, to obtain the expected repetitions, the values in the table of section 2.1 have to be multiplied by 2k. For example with L= 5, N= 2²¹ (N₁ = 2 MB) and k= 100 (N₂= 200 MB), we can expect about 2 * 2k = 400 repetitions.

1.2 Required Number of Iterations E_M (S₁ x S₂)

Mapping the fingerprints of residue r₁ to the M slots of our hash table, the probability of the event E₁ that a slot remains empty after mapping the n₁ items of S₁ is

P(E₁) = (1- 1/M)ⁿ¹, M: number of slots of the hash table = modulus of the fingerprints

Now, probing our hash table with r₂ that contains n₂ items,

n_e2= n₂* P(E₁)

items can expect to find the corresponding slot empty.

As these items are unique with respect to r₁, they can be eliminated from r₂ and we obtain a new

n₂ ← n₂ - n_e2

At this point there are two options to continue the sieving process:

a) either rehash r₁,

b) or reverse the hashing direction starting with r₂:

P(E₂) = (1- 1/M)ⁿ²

n_e1= n₁ * P(E₂)

n₁ ← n₁ – n_e1

As long as n₂stays above a certain threshold, for example n₂> N₁, option (a) remains superior. Thereafter, forward/backward sieving according to (b) becomes more effective.

The process end is reached, when either n₁ or n₂drop below E_R (S₁ x S₂), the number of expected cross-repetitions.

Finally, the number of required iterations to reach the halting state is assigned to E_M (S₁ x S₂).

1.3 Required Memory

In addition to the space occupied by S₁, S₂, the sieve requires three bit-vectors u₁, u₂ and z:

u₁ and u₂ are of respective length N₁ , N₂ , they are used to flag the substrings of S₁, S₂ that are eliminated from the sieve
z of length M, holds the hash table slots that are used to keep track of mapped fingerprints

Note that in contrast to u, the hash vector z is unpredictably accessed and as such depends critically on the availability of large and fast random access memory.

E_M (S₁ x S₂) allows to predict the number of required sieve movements / iterations in function of the size of the hash vector.

1.4 Experimental Results

Various relations between M, N1 and N2 on the one hand and the number of required sieve movements and processing time on the other hand, have been investigated, using:

Hardware: Inspiron 5748, Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz; 8,00 GB RAM
Software: available on GitHub

Recall that M, the modulus of the fingerprints, implies the number of bits required by the minimalistic hash table.

A) Variable Modulus M ; Fixed equal string lengths : N1 = N2 = const.

Influence of the modulus on the sieving time of equal length strings (Fig. 1.1)

B) Variable Modulus M ; Variable equal string lengths : N1 = N2 = M

influence of the problem size M = N1 = N2

on the required sieve movements (Fig. 1.2)
on the required sieving time (Fig. 1.3)

C) Fixed equal modulus and string length M = N1 ; Variable string length N2 = k * N1

influence of the relation N2 / N1 with modulus M = N1

on the required sieve movements (Fig. 1.4)
on the required sieving time (Fig. 1.5)

Fig. 1.1 Influence of the modulus on the sieving time of equal length strings

Fig. 1.2 Influence of the problem size M = N1 = N2 on the required sieve movements

Fig. 1.3 Influence of the problem size M = N1 = N2 on the required sieving time

Fig. 1.4 Influence of the relation N2/N1 with modulus M=N1 on the required sieve movements

Fig. 1.5 Influence of the relation N2/N1 with modulus M=N1 on the required sieving time

2. Self-Repetitions Sieve

Given an IID distributed byte string S of length N, and a parameter L << N, after a few rounds, the sieve will only contain those substrings of length L that have at least one (possibly overlapping) repetition in S.

In telecommunications such a sieve may be used to identify information sequences as for example syncs, preambles, delimiters, headers or even entire message blocks that repeat in a data stream.

As opposed to the cross-string sieve, the self-string sieve operates from beginning in forward/backward mode, where the hash is alternatively rolled from left to right and from right to left.

2.1 Number of Repeated Substrings E_R (S)

Each byte of S occurring with equal probability p= 1/256, the expected number of repetitions of length L, is given by the formula

E_R (S) = ½ p^L ((N-L)² + (N-L))

≈ ½ p^L N² = ½ 4^i-4L, N= 2ⁱ →∞

For example with L= 4 and N= 2²⁰ (1 MB) we expect about 128 repetitions, or
with L=7 and N= 2³⁰ (1 GB) the sieving process should end with approximately 8 repeated substrings, as illustrated in the table below:

2.2 Required Number of Iterations E_M (S)

Let us divide the sieve´s residue in two complementary subsets:

u contains the items with fingerprints that have been mapped without collision
v contains the items that encountered a collision

At the beginning of the sieving cycle, u contains all N - L + 1 items of S and v is empty.

After mapping the fingerprints of u to a refreshed table, the situation presents itself as follows:

P(E_u) = (1- 1/M)^nu, the probability that a slot remains empty
n_T = M (1 - P(E_u)) , the expected number of collision free mappings
n_C = n_u - n_T , the expected number of colliding mappings n_C = n_u - M (1 - P(E_u))

Probing this table with the n_v items of v,

n_E = n_v* P(E_u)

of them can expect to find a slot empty.

Drawing the balance, the new sets contain:

n_u ← n_T and
n_v ← n_v + n_C - n_E items.

By switching between u and v

u ↔ v,

the cycle is repeated until either n_u or n_vdrop below E_R(S), the number of expected repetitions.

Finally, the number of required iterations to reach the halting state is assigned to E_M (S).

2.3 Required Memory

In addition to the space occupied by S, the sieve requires three bit-vectors u, v and z:

u and v are both of length N, they are used to flag the substrings of S;
the flags are set when the corresponding substring encounters:
- an empty slot (u), that triggers a new potential repetition
- an occupied slot (v), which indicates a collision with an existing potential repetition
z of length M, holds the hash table slots
that are used to keep track of mapped fingerprints

Note that in contrast to u and v, the hash vector z is unpredictably accessed and as such depends critically on the availability of large and fast random access memory.

E_M (S) allows to predict the number of required sieve movements/iterations in function of the size of the hash vector.

2.4 Experimental Results

Various relations between M and N on the one hand and the number of required sieve movements and processing time on the other hand, have been investigated, using:

Hardware: Inspiron 5748, Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz; 8,00 GB RAM
Software: available on GitHub

Recall that M, the modulus of the fingerprints, implies the number of bits required by the minimalistic hash table.

A) Variable Modulus M ; Fixed string length : N = const.

Influence of the modulus on the sieving time (Fig. 2.1)

B) Variable Modulus M ; Variable string length : N = M

influence of the problem size M = N

on the required sieve movements (Fig. 2.2)
on the required sieving time (Fig. 2.3)

Fig. 2.1 Influence of the modulus on the sieving time

Fig. 2.2 Influence of the problem size M = N on the required sieve movements

REPEATED SubstringS SieveS

Cross-Repetitions Sieve

2. Self-Repetitions Sieve

further chapters to follow ...