GPU-based inference in Bayesian networks requires computing a tensor product of several functions and then summing (contracting) over their shared dimensions.
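To make the operation concrete, here is a minimal CPU sketch (not the paper's kernel) of such a sum-product step: two factors f(a,b) and g(b,c) are multiplied pointwise and contracted over their shared variable b, giving h(a,c) = sum_b f(a,b) * g(b,c). The cardinalities NA, NB, NC are arbitrary small values chosen for illustration.

```c
#include <stddef.h>

/* Illustrative only: contract two factors over their shared dimension b.
   h(a,c) = sum_b f(a,b) * g(b,c) */
enum { NA = 2, NB = 3, NC = 2 };

void sum_product(const double f[NA][NB],
                 const double g[NB][NC],
                 double h[NA][NC])
{
    for (size_t a = 0; a < NA; ++a)
        for (size_t c = 0; c < NC; ++c) {
            double acc = 0.0;
            for (size_t b = 0; b < NB; ++b)   /* contraction over b */
                acc += f[a][b] * g[b][c];
            h[a][c] = acc;
        }
}
```

With more factors and higher-dimensional index sets, the same pattern generalizes, which is what makes the memory layout and access order nontrivial on a GPU.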

The challenge is to recognize the seemingly irregular memory-access pattern at runtime and to prefetch the right data, guided by a pre-processing step.
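One way to picture this idea (a hypothetical sketch, not the paper's actual scheme): a pre-processing pass analyzes the contraction once and flattens its irregular accesses into an index table; the runtime loop then streams through that table sequentially, so the data needed next is predictable and can be prefetched or staged in fast memory. The stride-based pattern below is invented purely for illustration.

```c
#include <stddef.h>

/* Offline pre-processing: record, for each output element, the flat
   indices of the input slots that feed it (example pattern only). */
void build_gather_table(size_t n_out, size_t stride,
                        size_t table[], size_t n_in_per_out)
{
    for (size_t o = 0; o < n_out; ++o)
        for (size_t k = 0; k < n_in_per_out; ++k)
            table[o * n_in_per_out + k] = o + k * stride;
}

/* Runtime: a regular, sequential walk over the precomputed table,
   in place of the original irregular indexing. */
void gather_sum(const double in[], double out[],
                const size_t table[], size_t n_out, size_t n_in_per_out)
{
    for (size_t o = 0; o < n_out; ++o) {
        double acc = 0.0;
        for (size_t k = 0; k < n_in_per_out; ++k)
            acc += in[table[o * n_in_per_out + k]];
        out[o] = acc;
    }
}
```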

The paper on this work was presented at ICS'08 (the International Conference on Supercomputing) in Greece.
Here is the talk I gave on this subject at Microsoft Research, Redmond.

The following archive contains the updated version of the sum-product kernel, implemented with NVIDIA CUDA and OpenMP.

Note that the code is experimental and is no longer maintained.