Dieharder test descriptions

Original Die Hard tests

0 - "Birthdays" test (modified).

Each test determines the number of matching intervals from 512 "birthdays" (by default) drawn on a 24-bit "year" (by default).  This is repeated 100 times (by default) and the results cumulated in a histogram.  Repeated intervals should be distributed in a Poisson distribution if the underlying generator is random enough, and a a chisq and p-value for the test are evaluated relative to this null hypothesis.

It is recommended that you run this at or near the original 100 test samples per p-value with -t 100.  Two additional parameters have been added. In diehard, nms=512 but this CAN be varied and all Marsaglia's formulae still work.  It can be reset to different values with -x nmsvalue. Similarly, nbits "should" 24, but we can really make it anything we want that's less than or equal to rmax_bits = 32.  It can be reset to a new value with -y nbits. Both default to diehard's values if no -x or -y options are used.

1 - Overlapping 5-Permutations Test

This is the OPERM5 test.  It looks at a sequence of one million 32-bit random integers.  Each set of five consecutive integers can be in one of 120 states, for the 5! possible orderings of five numbers.  Thus the 5th, 6th, 7th, ... numbers  each provide a state. As many thousands of state transitions  are observed,  cumulative counts are made of the number of  occurences of each state.  Then the quadratic form in the weak inverse of the 120x120 covariance matrix yields a test equivalent to the likelihood ratio test that the 120 cell counts came from the specified (asymptotically) normal distribution with the specified 120x120 covariance matrix (with rank 99).  This version uses 1,000,000 integers, twice. Note that Dieharder runs the test 100 times, not twice, by default.

2 - 32x32 Binary Rank Test

This is the BINARY RANK TEST for 32x32 matrices. A random 32x32 binary matrix is formed, each row a 32-bit random integer.  The rank is determined. That rank can be from 0 to 32, ranks less than 29 are rare, and their counts are pooled with those for rank 29.  Ranks are found for 40,000 such random matrices and a chisquare test is performed on counts for ranks  32, 31, 30 and <=29.<

As always, the test is repeated and a KS test applied to the resulting p-values to verify that they are approximately uniform.

3 - 6x8 Binary Rank Test

This is the BINARY RANK TEST for 6x8 matrices.  From each of six random 32-bit integers from the generator under test, a specified byte is chosen, and the resulting six bytes form a 6x8 binary matrix whose rank is determined.  That rank can be from 0 to 6, but ranks 0,1,2,3 are rare; their counts are pooled with those for rank 4. Ranks are found for 100,000 random matrices, and a chi-square test is performed on counts for ranks 6,5 and <=4.


As always, the test is repeated and a KS test applied to the resulting p-values to verify that they are approximately uniform.

4 - 6x8 Binary Rank Test

This is the BINARY RANK TEST for 6x8 matrices.  From each of six random 32-bit integers from the generator under test, a specified byte is chosen, and the resulting six bytes form a 6x8 binary matrix whose rank is determined.  That rank can be from 0 to 6, but ranks 0,1,2,3 are rare; their counts are pooled with those for rank 4. Ranks are found for 100,000 random matrices, and a chi-square test is performed on  counts for ranks 6,5 and <=4.

As always, the test is repeated and a KS test applied to the resulting p-values to verify that they are approximately uniform.

5 -Overlapping Pairs Sparse Occupance (OPSO)

The OPSO test considers 2-letter words from an alphabet of 1024 letters.  Each letter is determined by a specified ten bits from a 32-bit integer in the sequence to be tested. OPSO generates  2^21 (overlapping) 2-letter words  (from 2^21+1 "keystrokes")  and counts the number of missing words---that is 2-letter words which do not appear in the entire sequence.  That count should be very close to normally distributed with mean 141,909, sigma 290. Thus (missingwrds-141909)/290 should be a standard normal variable. The OPSO test takes 32 bits at a time from the test file and uses a designated set of ten consecutive bits. It then restarts the file for the next designated 10 bits, and so on.

Note 2^21 = 2097152, tsamples cannot be varied.

6 -Overlapping Quadruples Sparce Occupancy (OQSO) Test

Similar, to OPSO except that it considers 4-letter words from an alphabet of 32 letters, each letter determined by a designated string of 5 consecutive bits from the test file, elements of which are assumed 32-bit random integers.  The mean number of missing words in a sequence of 2^21 four-letter words,  (2^21+3 "keystrokes"), is again 141909, with sigma = 295.  The mean is based on theory; sigma comes from extensive simulation.

 

Note 2^21 = 2097152, tsamples cannot be varied.

7 - DNA Test.

The DNA test considers an alphabet of 4 letters::  C,G,A,T, determined by two designated bits in the sequence of random integers being tested.  It considers 10-letter words, so that as in OPSO and OQSO, there are 2^20 possible words, and the mean number of missing words from a string of 2^21  (overlapping)  10-letter  words (2^21+9 "keystrokes") is 141909.   The standard deviation sigma=339 was determined as for OQSO by simulation.  (Sigma for OPSO, 290, is the true value (to three places), not determined by simulation.

Note 2^21 = 2097152

Note also that we don't bother with overlapping keystrokes (and sample more rands -- rands are now cheap).

8 - Count the 1s (stream) (modified) Test.

Consider the file under test as a stream of bytes (four per 32 bit integer).  Each byte can contain from 0 to 8 1's, with probabilities 1,8,28,56,70,56,28,8,1 over 256.  Now let the stream of bytes provide a string of overlapping  5-letter words, each "letter" taking values A,B,C,D,E. The letters are determined by the number of 1's in a byte::  0,1,or 2 yield A, 3 yields B, 4 yields C, 5 yields D and 6,7 or 8 yield E. Thus we have a monkey at a typewriter hitting five keys with various probabilities (37,56,70,56,37 over 256).  There are 5^5 possible 5-letter words, and from a string of 256,000 (overlapping) 5-letter words, counts are made on the frequencies for each word.   The quadratic form in the weak inverse of the covariance matrix of the cell counts provides a chisquare test::  Q5-Q4, the difference of the naive Pearson sums of (OBS-EXP)^2/EXP on counts for 5- and 4-letter cell counts.

9 - Count the 1s Test (byte) (modified).

This is the COUNT-THE-1's TEST for specific bytes.  Consider the file under test as a stream of 32-bit integers.  From each integer, a specific byte is chosen , say the left-most::  bits 1 to 8. Each byte can contain from 0 to 8 1's, with probabilitie 1,8,28,56,70,56,28,8,1 over 256.  Now let the specified bytes from successive integers provide a string of (overlapping) 5-letter words, each "letter" taking values  A,B,C,D,E. The letters are determined  by the number of 1's,  in that byte::  0,1,or 2 ---> A, 3 ---> B, 4 ---> C, 5 ---> D, and  6,7 or 8 ---> E.  Thus we have a monkey at a typewriter  hitting five keys with with various probabilities::  37, 56, 70, 56, 37 over 256. There are 5^5 possible 5-letter words, and from a string of 256,000 (overlapping) 5-letter words, counts are made on the frequencies for each word. The quadratic form in the weak inverse of the covariance matrix of the cell counts provides a chisquare test::  Q5-Q4, the difference of the naive Pearson  sums of (OBS-EXP)^2/EXP on counts for 5- and 4-letter cell counts.

Note: We actually cycle samples over all 0-31 bit offsets, so that if there is a problem with any particular offset it has a chance of being observed.  One can imagine problems with odd offsets but not even, for example, or only with the offset 7. tsamples and psamples can be freely varied, but you'll likely need tsamples >> 100,000 to have enough to get a reliable kstest result.

10 - Parking Lot Test (modified).

This tests the distribution of attempts to randomly park a square car of length 1 on a 100x100 parking lot without crashing.  We plot n (number of attempts) versus k (number of attempts that didn't "crash"  because the car squares overlapped and compare to the expected result from a perfectly random set of parking coordinates.  This is, alas, not really known on theoretical grounds so instead we compare to n=12,000 where k should average 3523 with sigma 21.9 and is very close to normally distributed.  Thus (k-3523)/21.9 is a standard # normal variable, which converted to a uniform p-value, provides input to a KS test with a default 100 samples.

11 - Minimum Distance (2d Circle) Test

It does this 100 times::   choose n=8000 random points in a square of side 10000.  Find d, the minimum distance between the (n^2-n)/2 pairs of points.  If the points are truly independent uniform, then d^2, the square of the minimum distance should be (very close to) exponentially distributed with mean .995 .  Thus 1-exp(-d^2/.995) should be uniform on [0,1) and  a KSTEST on the resulting 100 values serves as a test of uniformity for random points in the square. Test numbers=0 mod 5 are printed but the KSTEST is based on the full set of 100 random choices of 8000 points in the 10000x10000 square.


This test uses a fixed number of samples -- tsamples is ignored. It also uses the default value of 100 psamples in the final KS test, for once agreeing precisely with Diehard.

12 -3d Sphere (Minimum Distance) Test

Choose  4000 random points in a cube of edge 1000.  At each point, center a sphere large enough to reach the next closest point. Then the volume of the smallest such sphere is (very close to) exponentially distributed with mean 120pi/3.  Thus the radius cubed is exponential with mean 30. (The mean is obtained by extensive simulation).  The 3DSPHERES test generates 4000 such spheres 20 times.  Each min radius cubed leads to a uniform variable by means of 1-exp(-r^3/30.), then a KSTEST is done on the 20 p-values.

This test ignores tsamples, and runs the usual default 100 psamples to use in the final KS test.

13 - Squeeze Test.

Random integers are floated to get uniforms on [0,1). Starting with k=2^31=2147483647, the test finds j, the number of iterations necessary to reduce k to 1, using the reduction k=ceiling(k*U), with U provided by floating integers from the file being tested.  Such j's are found 100,000 times, then counts for the number of times j was <=6,7,...,47,>=48  are used to provide a chi-square test for cell frequencies.

14 - Sums Test

Integers are floated to get a sequence U(1),U(2),... of uniform [0,1) variables.  Then overlapping sums, S(1)=U(1)+...+U(100), S2=U(2)+...+U(101),... are formed. The S's are virtually normal with a certain covariance matrix.  A linear transformation of the S's converts them to a sequence of independent standard normals, which are converted to uniform variables for a KSTEST. The  p-values from ten KSTESTs are given still another KSTEST.

Comments

At this point I think there is rock solid evidence that this test is completely useless in every sense of the word. It is broken, and it is so broken that there is no point in trying to fix it.  The problem is that the transformation above is not linear, and doesn't work. Don't use it. 

For what it is worth, rgb_lagged_sums with ntuple 0 tests for exactly the same thing, but scalably and reliably without the complication of overlapping samples and covariance. Use it instead.

15 - Runs Test

This is the RUNS test. It counts runs up, and runs down, in a sequence of uniform [0,1) variables, obtained by floating the 32-bit integers in the specified file. This example shows how runs are counted: .123, .357, .789, .425, .224, .416, .95 contains an up-run of length 3, a down-run of length 2 and and up-run of (at least) 2, depending on the next values. The covariance matrices for the runs-up and runs-down are well known, leading to chisquare tests for quadratic forms in the weak inverses of the covariance matrices. Runs are counted for sequences of length 10,000. This is done ten times. Then repeated.

In Dieharder sequences of length tsamples = 100000 are used by default, and 100 p-values thus generated are used in a final KS test.

16 - Craps Test

This is the CRAPS TEST. It plays 200,000 games of craps, finds the number of wins and the number of throws necessary to end each game. The number of wins should be (very close to) a normal with mean 200000p and variance 200000p(1-p), with p=244/495. Throws necessary to complete the game can vary from 1 to infinity, but counts for all>21 are lumped with 21.  A chi-square test is made on the no.-of-throws cell counts.  Each 32-bit integer from the test file provides the value for the throw of a die, by floating to [0,1), multiplying by 6 and taking 1 plus the integer part of the result.

17 - Marsaglia and Tsang GCD Test

10^7 tsamples (default) of uint rands u, v are generated and two statistics are generated: their greatest common divisor (GCD) (w) and the number of steps of Euclid's Method required to find it (k). Two tables of frequencies are thus generated -- one for the number of times each value for k in the range 0 to 41 (with counts greater than this range lumped in with the endpoints).  The other table is the frequency of occurrence of each GCD w.  k is be distributed approximately binomially, but this is useless for the purposes of performing a stringent test. Instead four "good" RNGs (gfsr4, mt19937_1999, rndlxs2, taus2) were used to construct a simulated table of high precision probabilities for k (a process that obviously begs the question as to whether or not THESE generators are "good" wrt the test). At any rate, they produce very similar tables and pass the test with each other's tables (and are otherwise very different RNGs). The table of probabilities for the gcd distribution is generated dynamically per test (it is easy to compute). Chisq tests on both of these binned distributions yield two p-values per test, and 100 (default) p-values of each are accumulated and subjected to final KS tests and displayed in a histogram.

STS Tests

100 - Monobit Test

Very simple. Counts the 1 bits in a long string of random uints.  Compares to expected number, generates a p-value directly from erfc(). Very effective at revealing overtly weak generators; Not so good at determining where stronger ones eventually fail.

101 - Runs Test

Counts the total number of 0 runs + total number of 1 runs across a sample of bits. Note that a 0 run must begin with 10 and end with 01. Note that a 1 run must begin with 01 and end with a 10.  This test, run on a bitstring with cyclic boundary conditions, is absolutely equivalent to just counting the 01 + 10 bit pairs.  It is therefore totally redundant with but not as good as the rgb_bitdist() test for 2-tuples, which looks beyond the means to the moments, testing an entire histogram of 00, 01, 10, and 11 counts to see if it is binomially distributed with p = 0.25.

102 - Serial Test

Accumulates the frequencies of overlapping n-tuples of bits drawn from a source of random integers. The expected distribution of n-bit patterns is multinomial with p = 2^(-n) e.g. the four 2-bit patterns 00 01 10 11 should occur with equal probability. The target distribution is thus a simple chisq with 2^n - 1 degrees of freedom, one lost due to the constraint that:

 p_00 + p_01 + p_01 + p_11 = 1

With overlap, though the test statistic is more complex. For example, given a bit string such as 0110100111000110 without overlap, it becomes 01|10|10|01|11|00|01|10 and we count 1 00, 3 01s, 3 10s, and 1 11.  WITH overlap we get all of these patterns as well as (with cyclic wrap): 0|11|01|00|11|10|00|11|0 and we count 4 00s, 4 01s, 4 10s, and 3 11s.  There is considerable covariance in the bit frequencies and a simple chisq test no longer suffices. The STS test uses target statistics that are valid for overlapping samples but which require multiple orders to generate.

It is much easier to write a test that doesn't use overlapping samples and directly checks to ensure that the distribution of bit ntuples is consistent with a multinomial distribution with uniform probability p = 1/2^n, e.g. 1/8 for n = 3 bit, 1/16 for n = 4 bit NON-overlapping samples, and the rgb_bitdist is just such a test. This test doesn't require comparing different orders. An open research question is whether or not test sensitivity significantly depends on managing overlap testing software RNGs where it is presumed that generation is cheap and unlimited. This question pertains to related tests, such as overlapping permutations tests (where non-overlapping permutation tests are isomorphic to non-overlapping frequency tests, fairly obviously).

This test does all the possible bitlevel tests from n=1 to n=24 bits (where n=1 is basically sts_monobit, and n=2 IMO is redundant with sts_runs). However, if I understand things correctly it is not possible to fail a 2 bit test and pass a 24 bit test, as if 2 bits are biased so that (say) 00 occurs a bit too often, then 24 bit strings containing 00's MUST be imbalanced as well relative to ones that do not, so we really only need to check n=24 bit results to get all the rest for free, so to speak.

RGB tests

200 - Bit Distribution Test

Accumulates the frequencies of all n-tuples of bits in a list of random integers and compares the distribution thus generated with the theoretical (binomial) histogram, forming chisq and the associated p-value. In this test n-tuples are selected without WITHOUT overlap (e.g. 01|10|10|01|11|00|01|10) so the samples are independent. Every other sample is offset modulus of the sample index and ntuple_max.

This test must be run with -n ntuple for ntuple > 0. Note that if ntuple > 12, one should probably increase tsamples so that each of the 2^ntuple bins should end up with an average of around 30 occurrences. Note also that the memory requirements and CPU time requirements will get quite large by e.g. ntuple = 20 -- use caution when sampling the distribution of very large ntuples.

201 - The Generalized Minimum Distance Test

This is the generalized minimum distance test, based on the paper of M. Fischler in the doc directory and private communications. This test utilizes correction terms that are essential in order for the test not to fail for large numbers of trials. It replaces both diehard_2dsphere.c and diehard_3dsphere.c, and generalizes the test itself so that it can be run for any d = 2,3,4,5. There is no fundamental obstacle to running it for d = 1 or d>5, but one would need to compute the expected overlap integrals (q) for the overlapping d-spheres in the higher dimensions. Note that in this test there is no real need to stick to the parameters of Marsaglia. The test by its nature has three controls: n (the number of points used to sample the minimum distance) which determines the granularity of the test -- the approximate length scale probed for an excess of density; p, the usual number of trials; and d the dimension. As Fischler points out, to actually resolve problems with a generator that had areas 20% off the expected density (consistently) in d = 2, n = 8000 (Marsaglia's parameters) would require around 2500 trials, where p = 100 (the old test default) would resolve only consistent deviations of around 1.5 times the expected density. By making both of these user selectable parameters, dieharder should be able to test a generator pretty much as thoroughly as one likes subject to the generous constraints associated with the eventual need for still higher order corrections as n and p are made large enough.

202 - Permutations Test

This is a non-overlapping test that simply counts order permutations of random numbers, pulled out n at a time. There are n! permutations and all are equally likely. The samples are independent, so one can do a simple chisq test on the count vector with n! - 1 degrees of freedom. This is a poor-man's version of the overlapping permutations tests, which are much more difficult because of the covariance of the overlapping samples.

203 - Lagged Sums Test

This package contains many very lovely tests. Very few of them, however, test for lagged correlations -- the possibility that the random number generator has a bitlevel correlation after some fixed number of intervening bits.

The lagged sums test is therefore very simple. One simply adds up uniform deviates sampled from the rng, skipping lag samples in between each rand used. The mean of tsamples samples thus summed should be 0.5*tsamples. The standard deviation should be sqrt(tsamples/12).  The experimental values of the sum are thus converted into a p-value (using the erf()) and a ks-test applied to psamples of them.

204 - The Kolmogorov-Smirnov Test Test

This test generates a vector of tsamples uniform deviates from the selected rng, then applies an Anderson-Darling or Kuiper KS test to it to directly test for uniformity. The AD version has been symmetrized to correct for weak left bias for small sample sets; Kuiper is already ring-symmetric on the interval. The AD code corresponds roughly to what is in R (thanks to a correction sent in by David Bauer).  As always, the test is run pvalues times and the (same) KS test is then used to generate a final test pvalue, but the real purpose of this test is to test ADKS and KKS, not to test rngs. This test clearly reveals that kstests run on only 100 test values (tsamples, herein) are only approximately accurate; their pvalues are distinctly high-biased (but less so than Kuiper or KS before the fix). This bias is hardly visible for less than 1000 trivals (psamples, herein) but will constently cause failure for -t 100, -p 10000 or higher. For -t 1000, it is much more difficult to detect, and the final kstest is approximately valid for the test in question.

205 - DAB Byte Distribution Test

Extract n independent bytes from each of k consecutive words. Increment indexed counters in each of n tables. (Total of 256 * n counters.) Currently, n=3 and is fixed at compile time. If n>=2, then the lowest and highest bytes will be used, along with n-2 bytes from the middle. If the generator's word size is too small, overlapped bytes will be used. Current, k=3 and is fixed at compile time. Use a basic chisq fitting test (chisq_pearson) for the p-value. Previous version also used a chisq independence test (chisq2d); it was found to be slightly less sensitive. I envisioned this test as using a small number of samples and large number of separate tests. Experiments so far show that keeping -p 1 and increasing -t performs best.

206 - DCT (Frequency Analysis) Test

This test performs a Discrete Cosine Transform (DCT) on the output of the RNG. More specifically, it performs tsamples transforms, each over an independent block of ntuple words. If tsamples is large enough, the positions of the maximum (absolute) value in each transform are recorded and subjected to a chisq test for uniformity/independence. [1] (A standard type II DCT is used.)

If tsamples is smaller than or equal to 5 times ntuple then a fallback test will be used, whereby all DCT values are converted to p-values and tested for uniformity via a KS test. This version is significantly less sensitive, and is not recommended.

Power: With the right parameters, this test catches more GSL generators than any other; however, that count is biased by each of the randomNNN generators having three copies.

Limitations: ntuple is required to be a power of 2, because a radix 2 algorithm is used to calculate the DCT.

False positives: targets are (mostly) calculated exactly, however it will still return false positives when ntuple is small and tsamples is very large. For the default ntuple value of 256, I get bad scores with about 100 million or more tsamples (psamples set to 1).

[1] The samples are taken as unsigned integers, and the DC coefficient is adjusted to compensate for this.

207 - DAB Fill Tree Test

This test fills small binary trees of fixed depth with words from the the RNG. When a word cannot be inserted into the tree, the current count of words in the tree is recorded, along with the position at which the word would have been inserted.

The words from the RNG are rotated (in long cycles) to better detect RNGs that may bias only the high, middle, or low bytes.

The test returns two p-values. The first is a Pearson chi-sq test against the expected values (which were estimated empirically). The second is a Pearson chi-sq test for a uniform distribution of the positions at which the insert failed.

Because of the target data for the first p-value, ntuple must be kept at the default (32).

208 - DAB Fill Tree 2 Test

Bit version of Fill Tree test. This test fills small binary trees of fixed depth with "visited" markers. When a marker cannot be placed, the current count of markers in the tree and the position that the marker would have been inserted, if it hadn't already been marked.

For each bit in the RNG input, the test takes a step right (for a zero) or left (for a one) in the tree.  If the node hasn't been marked, it is marked, and the path restarts. Otherwise, the test continues with the next bit.

The test returns two p-values. The first is a Pearson chi-sq test against the expected values (which were estimated empirically. The second is a Pearson chi-sq test for a uniform distribution of the positions at which the insert failed.

Because of the target data for the first p-value, ntuple must be kept at the default (128).

209 - DAB Monobit 2 Test

Block-monobit test.  Since we don't know what block size to use, try multiple block sizes. In particular, try all block sizes of 2^k words, where k={0..n}. The value of n is calculated from the word size of the generator and the sample size used, and is shown as ntuple.

Comments