Three variants of a substring sieve are presented:
Cross-Repetitions : repetitions across two strings
Self-Repetitions : repetitions within the same string
Staged Repetitions : repetitions within the first and across the second string
These sieves are inspired by the Karp-Rabin algorithm; in particular they use the Rabin fingerprint modulo M as rolling hash. A minimalistic hash table reduced to M one-bit slots will be used to keep track of the fingerprints.
► The sieves operate on substrings of pre-defined length;
whether they repeate one or multiple times is irrelevant.
Sieving a given string S of length N, consists in successively reducing the residue r, defined by the set of n substrings of length L. As illustrated by the example below, the initial residue contains n = N - L + 1 items, and ideally in the end, the final residue will contain repeating items only.
L is critical as it controls the resulting amount of residue: if it is too wide, the sieve will be empty, if L is too narrow, an unmanageable quantity of repetitions will survive.
In telecommunications L may be interpreted as a noise level, in the sense that the shorter and more frequent the repetitions the less information they provide. However, by suppressing too much noise, one risks to miss valuable information.
Example: Consider the string S of length N= (4*3) + (3*7) = 33 :
S = “abc1234567def1234567ghi1234567jkl”.
With L = 5 and moving forward from left to right, the initial forward residue is :
r = { “abc12”, “bc123”, “c1234”, … , “567jk”, “67jkl” }.
After sieving, the resulting residue will (implicitly) contain for each repetition the items:
“12345”, “23456” and “34567”,
with the left most ones representing the prefixes of length L of the repeated substrings.
Similarly, the backward residue is obtained from the reversed S.
In the literature the overlapping residue items, illustrated in the above example, are also called "shingles". Using this terminology the principle of the method can be stated as follows:
sieve unique shingles: eliminate those shingles that produce unique fingerprints
extract start shingles: identify those shingles in the resulting residue that mark beginnings of repetitions (prefixes)
pair the start shingles: find the matching start shingles in the resulting residue
In the following we shall concentrate on two quantities:
1) ER : the expected number of repeated items in the final residue
2) EM : the required number of iterations until all unique items are eliminated
ER in function of L allows to set a limit on the amount of the final residue;
EM serves as a hardware independent measure of performance.
This table attempts to show certain analogies between a mechanical sieve and its algorithmic counterpart:
Given two IID distributed byte strings S1, S2 of lengths N1, N2 and a parameter L, after a few rounds, the sieve will only contain those substrings of length L that have at least one repetition in both strings S1 and S2.
the bytes of strings S1, S2 are taken from the same alphabet
repetitions within S1 as well as repetitions within S2 are irrelevant
S1, S2 may overlap at most on L-1 bytes, with a concatenated length Nc ≥ N1+N2 - (L-1)
In telecommunications such a sieve may be used to identify repeating information transmitted on correlated channels.
The cross-sieve operates forward and backward:
During the forward movement the fingerprints of residue r1, originating from S1, are mapped to the hash table. The residue r2 originating from S2 is then cross-checked against the table and reduced by the items that are not found.
For the backward movement the roles of S1 and S2 are inversed.
1.1 Number of Repeated Substrings ER (S1 x S2)
Each byte of the strings occurring with equal probability p= 1/256, the expected number of cross-repetitions of length L between S1 and S2 is given by
ER (S1 x S2) = ER(S1 + S2) - ER(S1) - ER(S2)
= ½ pL ( ((Nc-L)2 + (Nc-L)) - ((N1-L)2 + (N1-L)) - ((N2-L)2 + (N2-L)) )
≈ ½ pL ( (N1+N2)2 - N12 - N22 ) , N1,N2 → ∞
≈ pL N1N2 = pL N2 k, with N1 = N and N2 = kN
With the formula of section 2.1 we can write
ER (S1 x S2) ≈ ER (S) * 2k
Hence, to obtain the expected repetitions, the values in the table of section 2.1 have to be multiplied by 2k. For example with L= 5, N= 221 (N1 = 2 MB) and k= 100 (N2 = 200 MB), we can expect about 2 * 2k = 400 repetitions.
1.2 Required Number of Iterations EM (S1 x S2)
Mapping the fingerprints of residue r1 to the M slots of our hash table, the probability of the event E1 that a slot remains empty after mapping the n1 items of S1 is
P(E1) = (1- 1/M)n1 , M: number of slots of the hash table = modulus of the fingerprints
Now, probing our hash table with r2 that contains n2 items,
ne2 = n2 * P(E1)
items can expect to find the corresponding slot empty.
As these items are unique with respect to r1, they can be eliminated from r2 and we obtain a new
n2 ← n2 - ne2
At this point there are two options to continue the sieving process:
a) either rehash r1,
b) or reverse the hashing direction starting with r2:
P(E2) = (1- 1/M)n2
ne1 = n1 * P(E2)
n1 ← n1 – ne1
As long as n2 stays above a certain threshold, for example n2 > N1, option (a) remains superior. Thereafter, forward/backward sieving according to (b) becomes more effective.
The process end is reached, when either n1 or n2 drop below ER (S1 x S2), the number of expected cross-repetitions.
Finally, the number of required iterations to reach the halting state is assigned to EM (S1 x S2).
1.3 Required Memory
In addition to the space occupied by S1, S2, the sieve requires three bit-vectors u1, u2 and z:
u1 and u2 are of respective length N1 , N2 , they are used to flag the substrings of S1, S2 that are eliminated from the sieve
z of length M, holds the hash table slots that are used to keep track of mapped fingerprints
Note that in contrast to u, the hash vector z is unpredictably accessed and as such depends critically on the availability of large and fast random access memory.
EM (S1 x S2) allows to predict the number of required sieve movements / iterations in function of the size of the hash vector.
1.4 Experimental Results
Various relations between M, N1 and N2 on the one hand and the number of required sieve movements and processing time on the other hand, have been investigated, using:
Hardware: Inspiron 5748, Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz; 8,00 GB RAM
Software: available on GitHub
Recall that M, the modulus of the fingerprints, implies the number of bits required by the minimalistic hash table.
A) Variable Modulus M ; Fixed equal string lengths : N1 = N2 = const.
Influence of the modulus on the sieving time of equal length strings (Fig. 1.1)
B) Variable Modulus M ; Variable equal string lengths : N1 = N2 = M
influence of the problem size M = N1 = N2
on the required sieve movements (Fig. 1.2)
on the required sieving time (Fig. 1.3)
C) Fixed equal modulus and string length M = N1 ; Variable string length N2 = k * N1
influence of the relation N2 / N1 with modulus M = N1
on the required sieve movements (Fig. 1.4)
on the required sieving time (Fig. 1.5)
Fig. 1.1 Influence of the modulus on the sieving time of equal length strings
Fig. 1.2 Influence of the problem size M = N1 = N2 on the required sieve movements
Fig. 1.3 Influence of the problem size M = N1 = N2 on the required sieving time
Fig. 1.4 Influence of the relation N2/N1 with modulus M=N1 on the required sieve movements
Fig. 1.5 Influence of the relation N2/N1 with modulus M=N1 on the required sieving time
Given an IID distributed byte string S of length N, and a parameter L << N, after a few rounds, the sieve will only contain those substrings of length L that have at least one (possibly overlapping) repetition in S.
In telecommunications such a sieve may be used to identify information sequences as for example syncs, preambles, delimiters, headers or even entire message blocks that repeat in a data stream.
As opposed to the cross-string sieve, the self-string sieve operates from beginning in forward/backward mode, where the hash is alternatively rolled from left to right and from right to left.
2.1 Number of Repeated Substrings ER (S)
Each byte of S occurring with equal probability p= 1/256, the expected number of repetitions of length L, is given by the formula
ER (S) = ½ pL ((N-L)2 + (N-L))
≈ ½ pL N2 = ½ 4i-4L, N= 2i →∞
For example with L= 4 and N= 220 (1 MB) we expect about 128 repetitions, or
with L=7 and N= 230 (1 GB) the sieving process should end with approximately 8 repeated substrings, as illustrated in the table below:
2.2 Required Number of Iterations EM (S)
Let us divide the sieve´s residue in two complementary subsets:
u contains the items with fingerprints that have been mapped without collision
v contains the items that encountered a collision
At the beginning of the sieving cycle, u contains all N - L + 1 items of S and v is empty.
After mapping the fingerprints of u to a refreshed table, the situation presents itself as follows:
P(Eu) = (1- 1/M)nu, the probability that a slot remains empty
nT = M (1 - P(Eu)) , the expected number of collision free mappings
nC = nu - nT , the expected number of colliding mappings nC = nu - M (1 - P(Eu))
Probing this table with the nv items of v,
nE = nv * P(Eu)
of them can expect to find a slot empty.
Drawing the balance, the new sets contain:
nu ← nT and
nv ← nv + nC - nE items.
By switching between u and v
u ↔ v,
the cycle is repeated until either nu or nv drop below ER(S), the number of expected repetitions.
Finally, the number of required iterations to reach the halting state is assigned to EM (S).
2.3 Required Memory
In addition to the space occupied by S, the sieve requires three bit-vectors u, v and z:
u and v are both of length N, they are used to flag the substrings of S;
the flags are set when the corresponding substring encounters:
an empty slot (u), that triggers a new potential repetition
an occupied slot (v), which indicates a collision with an existing potential repetition
z of length M, holds the hash table slots
that are used to keep track of mapped fingerprints
Note that in contrast to u and v, the hash vector z is unpredictably accessed and as such depends critically on the availability of large and fast random access memory.
EM (S) allows to predict the number of required sieve movements/iterations in function of the size of the hash vector.
2.4 Experimental Results
Various relations between M and N on the one hand and the number of required sieve movements and processing time on the other hand, have been investigated, using:
Hardware: Inspiron 5748, Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz; 8,00 GB RAM
Software: available on GitHub
Recall that M, the modulus of the fingerprints, implies the number of bits required by the minimalistic hash table.
A) Variable Modulus M ; Fixed string length : N = const.
Influence of the modulus on the sieving time (Fig. 2.1)
B) Variable Modulus M ; Variable string length : N = M
influence of the problem size M = N
on the required sieve movements (Fig. 2.2)
on the required sieving time (Fig. 2.3)
Fig. 2.1 Influence of the modulus on the sieving time
Fig. 2.2 Influence of the problem size M = N on the required sieve movements
Fig. 2.3 Influence of the problem size M = N on the required sieving time