posted Jan 24, 2009 11:42 PM by Nicolas Pinto
[
updated Jan 24, 2009 11:45 PM
]
M. Strengert, C. Müller, C. Dachsbacher, and T. ErtlVisualization Research Center (VISUS), University of Stuttgart
Eurographics Symposium on Parallel Graphics and Visualization (2008)
Abstract:
We present an extension to the CUDA programming language which extends parallelism to multi-GPU systems and GPU-cluster environments. Following the existing model, which exposes the internal parallelism of GPUs, our extended programming language provides a consistent development interface for additional, higher levels of parallel abstraction from the bus and network interconnects. The newly introduced layers provide the key features specific to the architecture and programmability of current graphics hardware while the underlying communication and scheduling mechanisms are completely hidden from the user. All extensions to the original programming language are handled by a self-contained compiler which is easily embedded into the CUDA compile process. We evaluate our system using two different sample applications and discuss scaling behavior and performance on different system architectures
The paper: http://www.vis.uni-stuttgart.de/~dachsbcn/download/egpgv08.pdf
The slides: http://vidi.cs.ucdavis.edu/EGPGV08/pages/strengert.pdf
|
posted Jan 15, 2009 9:59 PM by Nicolas Pinto
[
updated Jan 15, 2009 10:03 PM
]
N. Satish, M. Harris, and M. Garland.
Designing efficient sorting algorithms for manycore GPUs.
Proc. 23rd IEEE Int’l Parallel & Distributed Processing
Symposium, May 2009. To appear.
{download}
Abstract:
We describe the design of high-performance parallel radix sort and
merge sort routines for manycore GPUs, taking advantage of the full
programmability offered by CUDA. Our radix sort is the fastest GPU
sort and our merge sort is the fastest comparison-based sort
reported in the literature. Our radix sort is up to 4 times faster
than the graphics-based GPUSort and 2 times faster than other
CUDA-based radix sorts. It is also highly competitive with even the
most carefully optimized multi-core CPU sorting routines.
To achieve this performance, we carefully design our algorithms to
expose substantial fine-grained parallelism and decompose the
computation into independent tasks that perform minimal global
communication. We exploit the high-speed on-chip shared memory
provided by NVIDIA’s Tesla architecture and efficient data-parallel
primitives, particularly parallel scan. While targeted at GPUs,
these algorithms should also be well-suited for other manycore
processors.
|
posted Dec 18, 2008 1:01 AM by Nicolas Pinto
[
updated Dec 18, 2008 1:02 AM
]
posted Dec 18, 2008 1:00 AM by Nicolas Pinto
[
updated Dec 31, 2008 10:01 AM
]
Authors:
|
Akira Nukada
(Tokyo Institute of Technology)
|
|
Yasuhiko Ogata
(Tokyo Institute of Technology)
|
|
Toshio Endo
(Tokyo Institute of Technology)
|
|
Satoshi Matsuoka
(Tokyo Institute of Technology)
|
Abstract:
Most GPU performance "hypes" have focused around tightly-coupled
applications with small memory bandwidth requirements e.g., N-body, but
GPUs are also commodity vector machines sporting substantial memory
bandwidth; however, effective programming methodologies thereof have
been poorly studied. Our new 3-D FFT kernel, written in NVidia CUDA,
achieves nearly 80 GFLOPS on a top-end GPU, being more than three times
faster than any existing FFT implementations on GPUs including CUFFT.
Careful programming techniques are employed to fully exploit modern GPU
hardware characteristics while overcoming their limitations, including
on-chip shared memory utilization, optimizing the number of threads and
registers through appropriate localization, and avoiding low-speed
stride memory accesses. Our kernel applied to real applications
achieves orders of magnitude boost in power&cost vs. performance
metrics. The off-card bandwidth limitation is still an issue, which
could be alleviated somewhat with application kernels confinement
within the card, while ideal solution being facilitation of faster GPU
interfaces.
|
posted Dec 18, 2008 12:59 AM by Nicolas Pinto
[
updated Dec 31, 2008 10:04 AM
]
Authors:
|
Naga Govindaraju
(Microsoft Corporation)
|
|
Brandon Lloyd
(Microsoft Corporation)
|
|
Yuri Dotsenko
(Microsoft Corporation)
|
|
Burton Smith
(Microsoft Corporation)
|
|
John Manferdelli
(Microsoft Corporation)
|
Papers Session
Abstract:
We present novel algorithms for computing Fourier transforms with high
performance on GPUs. We present hierarchical, mixed radix FFT
algorithms for both power-of-two and non-power-of-two sizes. Our
hierarchical FFT algorithms efficiently exploit shared memory on GPUs
using a Stockham formulation. We reduce the memory transpose overheads
in hierarchical algorithms by combining the transposes into a
block-based multi-FFT algorithm. For non-power-of-two sizes, we use a
combination of mixed radix FFTs of small primes and Bluestein's
algorithm. We use modular arithmetic in Bluestein's algorithm to
improve the accuracy. We implemented our algorithms using the NVIDIA
CUDA API and compared their performance with NVIDIA's CUFFT library and
an optimized CPU-implementation (Intel's MKL) on a high-end quad-core
CPU. On an NVIDIA GPU, we obtained up to 250 GFlops, a 3x performance
improvement over CUFFT, and 5-25x improvement over MKL on large sizes.
|
posted Dec 18, 2008 12:58 AM by Nicolas Pinto
[
updated Dec 31, 2008 10:05 AM
]
Authors:
|
Vasily Volkov
(University of California, Berkeley)
|
|
James W. Demmel
(University of California, Berkeley)
|
Papers Session
Abstract:
We present performance results for dense linear algebra using the
8-series NVIDIA GPUs. Our GEMM routine runs 60% faster than the vendor
implementation and approaches the peak of hardware capabilities. Our
LU, QR and Cholesky factorizations achieve up to 80-90% of the peak
GEMM rate. Our parallel LU running on two GPUs achieves up to ~300
Gflop/s. These results are accomplished by challenging the accepted
view of the GPU architecture and programming guidelines. We argue that
modern GPUs should be viewed as multithreaded multicore vector units.
We exploit register blocking to optimize GEMM and heterogeneity of the
system (compute both on GPU and CPU). This study includes detailed
benchmarking of the GPU memory system that reveals sizes and latencies
of caches and TLB. We present a couple of algorithmic optimizations
aimed at increasing parallelism and regularity in the problem that
provide us with slightly higher performance.
|
posted Dec 18, 2008 12:54 AM by Nicolas Pinto
The flexibility and power needed in the data channel for a computer display are considered. To work efficiently, such a channel must have a sufficient number of instructions that it is best understood as a small processor rather than a powerful channel. As it was found that successive improvements to the display processor design lle on a circular path, by making improvements one can return to the original simple design plus one new general purpose computer for each trip around. The degree of physical separation between display and parent computer is a key factor in display processor design
http://cva.stanford.edu/classes/cs99s/papers/myer-sutherland-design-of-display-processors.pdf
|
|