Lectures


Lecture #01 : Kick-Off


Lecture #02 : CUDA Basics #1

  • CUDA Basics #1 (Nicolas Pinto, MIT)
    {slideshare} {pdf}
    Note that slides were borrowed from Matthew Bolitho (Johns Hopkins), Mike Houston (Stanford), and NVIDIA.


Lecture #03 : CUDA Basics #2

  • CUDA Basics #2 (Nicolas Pinto, MIT)
    {slideshare} {pdf}
    Note that slides were borrowed from Matthew Bolitho (Johns Hopkins), Dr. Kirk and Prof. Hwu (UIUC), and NVIDIA.


Lecture #04 : CUDA Advanced #1

  • CUDA Advanced #1 (Nicolas Pinto, MIT)
    {slideshare} {pdf}
    Note that slides were borrowed from NVIDIA.


Lecture #05 : Theory #1

  • Wheel of Reincarnation (Steven Johnson, MIT)
    {slideshare} {pdf}
  • Theory #1 (Steven Johnson, MIT)
    (blackboard)


Lecture #06 : Theory #2


Lecture #07 : CUDA Advanced #2

  • CUDA Advanced #2 (Nicolas Pinto, MIT)
    {slideshare} {pdf}
    Note that slides were borrowed from NVIDIA.


Guest Lectures : Case Studies

  • Utilizing Graphics Hardware to Address Critical Problems in Medical Imaging

    Prof. David Kaeli, NEU

    {slides}

    Abstract: 

    Graphics Processing Units (GPUs) have been growing in popularity due to their impressive processing capabilities and, equipped with general-purpose programming languages such as NVIDIA's CUDA, are becoming the platform of choice in the scientific computing community.

    This talk will focus on two topics: how to utilize GPUs to accelerate critical medical image reconstruction algorithms, and how best to utilize multiple GPUs to accelerate a range of applications.  Previous studies that used GPUs focused on obtaining significant performance gains from execution on a single GPU.  These studies employed low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution.

    In this talk, we consider the benefits of running on multiple (parallel) GPUs to provide further orders of magnitude of speedup.  Our approach attempts to reduce or eliminate the need to apply low-level fine tuning to extract performance from a GPU. Our methodology allows developers to accurately predict execution time for GPU applications while varying the number and configuration of the GPUs, and the size of the input data set.

    We believe this is a natural next step in GPU computing because it allows researchers to determine the most appropriate GPU configuration for an application without having to purchase hardware or write customized code for a multiple-GPU implementation.
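
    To make the multi-GPU setting concrete, here is a minimal sketch of the data-parallel pattern such studies build on: split the input evenly across devices, launch the same kernel on each, and synchronize. The kernel and sizes are placeholder assumptions, not code from the talk, and the single-host-thread cudaSetDevice pattern assumes a CUDA runtime recent enough to drive several devices from one thread.

        // Minimal multi-GPU dispatch sketch; the kernel is a placeholder.
        #include <cuda_runtime.h>

        __global__ void process(float* chunk, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) chunk[i] *= 2.0f;  // stand-in for the real per-element work
        }

        int main()
        {
            const int n = 1 << 24;
            int ndev = 0;
            cudaGetDeviceCount(&ndev);
            if (ndev == 0) return 1;
            int per = n / ndev;  // assume n divides evenly, for brevity

            // Kernel launches are asynchronous, so this loop starts work on
            // every device before waiting on any of them.
            for (int d = 0; d < ndev; ++d) {
                cudaSetDevice(d);  // subsequent calls target device d
                float* chunk;
                cudaMalloc(&chunk, per * sizeof(float));
                process<<<(per + 255) / 256, 256>>>(chunk, per);
            }
            for (int d = 0; d < ndev; ++d) {
                cudaSetDevice(d);
                cudaDeviceSynchronize();  // wait for device d to finish
                // real code would copy results back and cudaFree here
            }
            return 0;
        }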

  • CUDA Tricks and High-Performance Computational Physics

    Kipton Barros, BU

    {slideshare} {pdf}

    Abstract:

    In this talk I will discuss advanced tricks to maximize CUDA performance, taking examples from my physics research. The topics to be covered include: how to maximize device bandwidth, the (unofficial) CUDA disassembler decuda, and the optimizations performed by the CUDA compiler.  I will also mention various gotchas, pitfalls, tips, and tricks that I have encountered. An interactive discussion format is encouraged.
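
    As a flavor of the bandwidth topic, the sketch below is a generic illustration (not code from the talk) that times a coalesced copy against a strided one using CUDA events; the strided version typically reaches only a fraction of the device bandwidth. The kernel names and the STRIDE value are assumptions for the example.

        // Coalesced vs. strided global-memory access, timed with CUDA events.
        #include <cstdio>
        #include <cuda_runtime.h>

        #define N (1 << 22)
        #define STRIDE 32  // hypothetical stride that breaks coalescing

        // Adjacent threads read adjacent words: few transactions per warp.
        __global__ void copy_coalesced(const float* in, float* out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = in[i];
        }

        // Adjacent threads read words STRIDE apart: many transactions per warp.
        __global__ void copy_strided(const float* in, float* out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            out[i] = in[(i * STRIDE) % N];
        }

        int main()
        {
            float *d_in, *d_out;
            cudaMalloc(&d_in,  N * sizeof(float));
            cudaMalloc(&d_out, N * sizeof(float));

            dim3 block(256), grid(N / 256);
            cudaEvent_t t0, t1;
            cudaEventCreate(&t0); cudaEventCreate(&t1);

            copy_coalesced<<<grid, block>>>(d_in, d_out);  // warm-up, not timed
            cudaDeviceSynchronize();

            cudaEventRecord(t0);
            copy_coalesced<<<grid, block>>>(d_in, d_out);
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);
            float ms;
            cudaEventElapsedTime(&ms, t0, t1);
            // bytes moved = N reads + N writes; GB/s = bytes / ms / 1e6
            printf("coalesced: %.3f ms (%.1f GB/s)\n", ms, 2.0f * N * sizeof(float) / ms / 1e6f);

            cudaEventRecord(t0);
            copy_strided<<<grid, block>>>(d_in, d_out);
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);
            cudaEventElapsedTime(&ms, t0, t1);
            printf("strided:   %.3f ms (%.1f GB/s)\n", ms, 2.0f * N * sizeof(float) / ms / 1e6f);

            cudaFree(d_in); cudaFree(d_out);
            return 0;
        }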

  • Out-of-Core Programming with NVIDIA's CUDA

    Prof. Gene Cooperman, NEU

    {slideshare} {pdf} {cuda overview}

    Abstract:

    The word "core" in this title has a double meaning. The older term core refers to an ancient implementation of RAM. The newer term core refers to a CPU or GPU core. For example, each NVIDIA SM (streaming multiprocessor) currently has eight cores. The amount of on-chip memory, or cache, on an SM is some small number of kilobytes. We will abuse the term out-of-core to refer to data that lies off-chip (outside the SM).

    The key to efficiency in many CUDA algorithms is to efficiently move data between on-chip cache (for in-core programming) and off-chip global memory on the video board (for out-of-core programming). As the dual use of the term core implies, CUDA programming is not the first setting in which skill in out-of-core programming has been important. This talk will clarify this by abstracting the issue of out-of-core programming. It will then discuss some principles that we have found useful in our own lab, and their application both to CUDA programming and to disk-based programming.
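
    A minimal sketch of the staging pattern the abstract describes, using an illustrative kernel of my own invention (tile-wise array reversal), not code from the talk: each block copies a tile from off-chip global memory into on-chip shared memory, works on it there, and writes the result back with coalesced accesses.

        // Stage a tile in (out-of-core -> in-core), compute, stage it out.
        #include <cuda_runtime.h>

        #define TILE 256

        // Reverse each TILE-sized segment: all global-memory reads and writes
        // are coalesced; the irregular (reversed) addressing happens on-chip.
        __global__ void reverse_tiles(const float* in, float* out)
        {
            __shared__ float tile[TILE];
            int base = blockIdx.x * TILE;

            tile[threadIdx.x] = in[base + threadIdx.x];  // stage in
            __syncthreads();                             // tile now resident on-chip
            out[base + threadIdx.x] = tile[TILE - 1 - threadIdx.x];  // compute + stage out
        }

        int main()
        {
            const int n = 1 << 20;
            float *d_in, *d_out;
            cudaMalloc(&d_in,  n * sizeof(float));
            cudaMalloc(&d_out, n * sizeof(float));
            reverse_tiles<<<n / TILE, TILE>>>(d_in, d_out);
            cudaDeviceSynchronize();
            cudaFree(d_in); cudaFree(d_out);
            return 0;
        }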

  • Radar Pulse Compression using Modern NVIDIA GPUs 

    Stephen Bash, MIT

    {slides} {paper} {poster}

    Abstract:

    Over the past several years, graphics processing units (GPUs) have gained interest as general-purpose, highly parallel coprocessors.  Early adopters were forced to use traditional 3D graphics application programming interfaces (APIs) in order to access the computational power of the GPU.  This process of recasting general-purpose problems into graphical terms can be time-consuming and can produce obscure code.  NVIDIA's Compute Unified Device Architecture (CUDA) framework, a C-language development environment for NVIDIA GPUs, is designed to ease the burden placed on the general-purpose GPU programmer.  In parallel with the CUDA release, NVIDIA also released implementations of the BLAS and FFT libraries for the GPU under the names CUBLAS and CUFFT, respectively.

    Previous research has shown the vast computational power of GPUs for signal processing.  Modern radar signal processing is a data-parallel operation that benefits from parallel processing architectures.  This investigation will focus on the real-world benefit of GPUs for radar pulse compression.  First, the performance of 1D and 2D FFTs on a GPU via CUFFT will be compared to a modern multi-core CPU implementation using FFTW.  These performance results will then inform the implementation of two surrogate radar pulse compression chains of differing processing complexity, which will in turn be benchmarked in the same way as the FFTs.
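
    For readers unfamiliar with pulse compression: it is essentially matched filtering, which CUFFT makes straightforward. Forward-transform the received signal and the reference pulse, multiply pointwise by the conjugate of the reference spectrum, and inverse-transform. The sketch below is a generic illustration under assumed names and sizes, not the benchmark code from the talk.

        // FFT-based pulse compression (matched filter) with CUFFT.
        // Build with: nvcc -lcufft
        #include <cufft.h>
        #include <cuda_runtime.h>

        #define N 4096  // hypothetical number of range samples per pulse

        // out[i] = sig[i] * conj(ref[i]), scaled by 1/N to undo CUFFT's
        // unnormalized inverse transform.
        __global__ void multiply_conj(const cufftComplex* sig,
                                      const cufftComplex* ref,
                                      cufftComplex* out)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < N) {
                cufftComplex a = sig[i], b = ref[i];
                out[i].x = (a.x * b.x + a.y * b.y) / N;
                out[i].y = (a.y * b.x - a.x * b.y) / N;
            }
        }

        void pulse_compress(cufftComplex* d_signal, cufftComplex* d_reference)
        {
            cufftHandle plan;
            cufftPlan1d(&plan, N, CUFFT_C2C, 1);

            cufftExecC2C(plan, d_signal,    d_signal,    CUFFT_FORWARD);
            cufftExecC2C(plan, d_reference, d_reference, CUFFT_FORWARD);

            multiply_conj<<<(N + 255) / 256, 256>>>(d_signal, d_reference, d_signal);

            cufftExecC2C(plan, d_signal, d_signal, CUFFT_INVERSE);  // compressed pulse
            cufftDestroy(plan);
        }

        int main()
        {
            cufftComplex *d_sig, *d_ref;
            cudaMalloc(&d_sig, N * sizeof(cufftComplex));
            cudaMalloc(&d_ref, N * sizeof(cufftComplex));
            // ...fill d_sig with received data and d_ref with the reference pulse
            pulse_compress(d_sig, d_ref);
            cudaDeviceSynchronize();
            cudaFree(d_sig); cudaFree(d_ref);
            return 0;
        }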

  • The Present and Future of GPU Computing

    Kurt Keville, MIT

    {slides}

    Abstract:

    My presentation will be on alternate GPU topics, focusing on the various hardware differences and future feature sets.

    In this lecture I hope to give an overview of contemporary GPGPU topics including a survey of architecture-agnostic languages (OpenCL, BGSP, etc), a discussion of hand-coded algorithms (Folding@Home, for instance), GPU overclocking hacks, vendor hardware options and roadmaps, and publicly accessible multi-GPU systems to test out your massively parallel codes.

  • High-Productivity Supercomputing: Metaprogramming GPUs

    Andreas Klöckner, Brown

    {slides} {pycuda webpage}

    Abstract:

    Tuning high-performance computational kernels relies on detailed machine knowledge, is error-prone and often tedious. It is thus an attractive target for automation. This is "metaprogramming": Programs write and tune other programs.

    After a brief introduction to the ideas behind modern, high-productivity scripting languages, I will discuss PyCuda, a toolkit for making CUDA-based GPUs accessible from Python, one such language. PyCuda allows the easy creation of high-performance script+GPU hybrid computational codes. In addition, PyCuda provides a vehicle for metaprogramming of GPUs.

    In the final part of the talk, I will outline how we used these tools to implement a self-tuning GPU-based Discontinuous Galerkin solver. On real-world 3D electromagnetic scattering problems, a single GPU running this solver achieves speedup factors between 40 and 60 over a current-generation CPU.
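
    The core idea can be seen even without the scripting layer. Below is a hedged CUDA C++ sketch, with a hypothetical kernel and tuning parameter of my own, of baking a parameter into the kernel at compile time so that many variants can be generated, timed, and the fastest kept; PyCuda does the analogous thing by generating and compiling kernel source strings at run time.

        // A tuning parameter (per-thread unroll factor) fixed at compile time,
        // so each instantiation is a separately tuned kernel variant.
        #include <cuda_runtime.h>

        template <int UNROLL>
        __global__ void scale(float* data, float alpha, int n)
        {
            int base = (blockIdx.x * blockDim.x + threadIdx.x) * UNROLL;
        #pragma unroll
            for (int k = 0; k < UNROLL; ++k)  // trip count known at compile time
                if (base + k < n)
                    data[base + k] *= alpha;
        }

        int main()
        {
            const int n = 1 << 20;
            float* d;
            cudaMalloc(&d, n * sizeof(float));

            // An autotuner would time each candidate and keep the fastest.
            scale<1><<<(n + 255) / 256, 256>>>(d, 2.0f, n);
            scale<4><<<(n / 4 + 255) / 256, 256>>>(d, 2.0f, n);

            cudaDeviceSynchronize();
            cudaFree(d);
            return 0;
        }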

  • GPUs for Computer Vision: Overview, Examples & Opportunities

    Joe Stam, NVIDIA

    {slides} {stereo vision white paper} {stereo vision code}

    Abstract:

    Flexibly programmable graphics processors usher in a new era for computer vision and image processing.  Many vision algorithms map extremely well to the GPU's massively parallel architecture, perhaps even as well as graphics algorithms themselves. Techniques previously limited to off-line experimentation or expensive supercomputers can now be deployed for real-time use in consumer machines.  This talk will introduce computer vision on the GPU and highlight some features especially well suited to vision tasks.  We'll show examples of optical flow and stereo vision, two computationally intensive algorithms enabled for real-time use on the GPU.  Finally, we'll conclude with a discussion of interesting future directions and exciting opportunities for research and products.
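
    As a taste of how directly such algorithms map to the GPU, here is a generic winner-take-all block-matching stereo kernel, one thread per output pixel; this is an illustrative sketch under assumed window and search-range parameters, not the optimized kernel from the white paper linked above.

        // For each left-image pixel, search MAX_DISP horizontal shifts in the
        // right image and keep the disparity with the smallest sum of absolute
        // differences (SAD) over a small window.
        #include <cuda_runtime.h>

        #define WIN 3        // half-width of the matching window (assumed)
        #define MAX_DISP 64  // disparity search range (assumed)

        __global__ void sad_stereo(const unsigned char* left,
                                   const unsigned char* right,
                                   unsigned char* disp, int w, int h)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x < MAX_DISP + WIN || x >= w - WIN || y < WIN || y >= h - WIN)
                return;  // skip borders where the window or search would fall off

            int best_d = 0, best_cost = 1 << 30;
            for (int d = 0; d < MAX_DISP; ++d) {      // candidate disparities
                int cost = 0;
                for (int dy = -WIN; dy <= WIN; ++dy)  // SAD over the window
                    for (int dx = -WIN; dx <= WIN; ++dx)
                        cost += abs(left [(y + dy) * w + (x + dx)]
                                  - right[(y + dy) * w + (x + dx - d)]);
                if (cost < best_cost) { best_cost = cost; best_d = d; }
            }
            disp[y * w + x] = (unsigned char)best_d;
        }

        int main()
        {
            const int w = 640, h = 480;
            unsigned char *d_l, *d_r, *d_d;
            cudaMalloc(&d_l, w * h);
            cudaMalloc(&d_r, w * h);
            cudaMalloc(&d_d, w * h);
            // ...upload a rectified stereo pair into d_l and d_r here
            dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
            sad_stereo<<<grid, block>>>(d_l, d_r, d_d, w, h);
            cudaDeviceSynchronize();
            cudaFree(d_l); cudaFree(d_r); cudaFree(d_d);
            return 0;
        }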


  • CUDA Optimization, an Image Processing Case Study

    Joe Stam, NVIDIA

    {slides}
    {code}

    Abstract:

    Graphics processors can be easily programmed to provide significant acceleration in many common parallel tasks.  However, with additional architecture knowledge and an understanding of optimization strategies, a savvy programmer can unleash the full potential of the GPU's massive memory bandwidth and ensure the processing resources are utilized to their fullest extent.  In this talk, we'll explore several different approaches to a very simple but ubiquitous image processing algorithm, the convolution.  A naive approach shows the detrimental impact of poorly written code, a simple approach achieves decent results with little effort or code complexity, and a few highly optimized techniques realize the GPU's full power for the most demanding tasks.  The techniques explored in this simple but illustrative example will serve as a base for understanding the optimization strategies to apply to more complex algorithms.
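
    A hedged sketch of two of the approaches the talk contrasts, reduced to a 1D convolution for brevity (the talk works with 2D images); the kernel names and sizes are illustrative assumptions, not the talk's code.

        // Naive vs. shared-memory-tiled 1D convolution.
        #include <cuda_runtime.h>

        #define RADIUS 8
        #define BLOCK  256

        __constant__ float c_kernel[2 * RADIUS + 1];  // filter taps, cached on-chip

        // Naive: every thread re-reads 2*RADIUS+1 neighbors from global memory.
        __global__ void conv1d_naive(const float* in, float* out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float acc = 0.0f;
            for (int k = -RADIUS; k <= RADIUS; ++k) {
                int j = min(max(i + k, 0), n - 1);  // clamp at the borders
                acc += in[j] * c_kernel[k + RADIUS];
            }
            out[i] = acc;
        }

        // Tiled: the block stages its tile plus a halo into shared memory once,
        // so each input element is read from global memory only once per block.
        // Assumes a launch with exactly BLOCK threads per block.
        __global__ void conv1d_tiled(const float* in, float* out, int n)
        {
            __shared__ float tile[BLOCK + 2 * RADIUS];
            int i = blockIdx.x * blockDim.x + threadIdx.x;

            int j = min(max(i - RADIUS, 0), n - 1);
            tile[threadIdx.x] = in[j];               // left halo + body
            if (threadIdx.x < 2 * RADIUS) {          // right halo
                int h = min(max(i - RADIUS + BLOCK, 0), n - 1);
                tile[threadIdx.x + BLOCK] = in[h];
            }
            __syncthreads();

            if (i >= n) return;
            float acc = 0.0f;
            for (int k = 0; k <= 2 * RADIUS; ++k)
                acc += tile[threadIdx.x + k] * c_kernel[k];
            out[i] = acc;
        }

        int main()
        {
            const int n = 1 << 20;
            float h_k[2 * RADIUS + 1];
            for (int k = 0; k <= 2 * RADIUS; ++k)
                h_k[k] = 1.0f / (2 * RADIUS + 1);    // box filter, for the demo
            cudaMemcpyToSymbol(c_kernel, h_k, sizeof(h_k));

            float *d_in, *d_out;
            cudaMalloc(&d_in,  n * sizeof(float));
            cudaMalloc(&d_out, n * sizeof(float));
            conv1d_naive<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
            conv1d_tiled<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
            cudaDeviceSynchronize();
            cudaFree(d_in); cudaFree(d_out);
            return 0;
        }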

  • Unlocking Biologically-Inspired Computer Vision: a High-Throughput Approach

    David Cox, Harvard | James DiCarlo, Nicolas Pinto, MIT

    {slideshare}

    Abstract:

    The study of biological vision and the creation of artificial vision systems are naturally intertwined – exploration of the neuronal substrates of visual processing provides clues and inspiration for artificial systems, and artificial systems, in turn, serve as important generators of new ideas and working hypotheses.  However, while systems neuroscience has provided inspiration for some of the "broad-stroke" properties of the visual system, much is still unknown. Even for those qualitative properties that most biologically-inspired models share, experimental data currently provide little constraint on their key parameters.  Consequently, it is difficult to truly evaluate a set of computational ideas, since the performance of a model depends strongly on its particular instantiation – the size of the pooling kernels, the number of units per layer, exponents in normalization operations, etc.

    To pave a way forward, we have developed a high-throughput approach to more expansively explore the possible range of biologically-inspired models, including models of larger, more realistic scale, leveraging recent advances in commodity stream processing hardware, particularly high-end NVIDIA GPUs.  In analogy to high-throughput screening approaches in molecular biology and genetics, we generated and trained thousands of potential network architectures and parameter instantiations, and "screened" the visual representations produced by these models using an object recognition task.  From these candidate models, the most promising were selected for further analysis. We have shown that this approach can yield significant, reproducible gains in performance on a basic object recognition task, and that it can offer insight into which computational ideas are most important for achieving this performance.

    In this talk, I'll also highlight how the application of flexible programming tools, such as high-level scripting and template metaprogramming, can enable large performance gains, while managing complexity for the developer.  As the scale of available computational power continues to expand, our approach holds great potential both for accelerating progress in artificial vision, and for generating new, experimentally-testable hypotheses for the study of biological vision.

