CUDA Sieve of Eratosthenes

I've recently coded a sieve that is quite fast so I thought I'd share it here. It is a segmented sieve. Each thread block sieves a different sub-interval of the desired interval and then counts the primes in that sub-interval. (A final count over all of the sub-intervals occurs in host code.) Within each thread block, 256 threads simultaneously sieve 8192 bytes of shared memory. Each thread is responsible for sieving a small set of primes. Race conditions are avoided by eliminating read-modify-write operations and simply doing byte writes instead. Except for the smallest primes which unroll a pre-computed table over a bit-packed copy of the shared sieve block. The speed comes from parallelization of the blocks due to segmentation of the sieve interval, parallization of the threads into shared memory, use of a "mod 6" wheel, hand-unrolled inner loops, and the small-prime bit-packed optimization.

Input intervals are limited to 10e9, but, for example, sieving a range from 0 to 1e9 takes 38 milliseconds using a M2050 Tesla card. I'd be interested to know how (or even *if*) the code runs on other hardware.

This thread on mersenneforum.org captures most of the development work and history of this effort.