TRITON - GPU language
Triton is a language and compiler for parallel programming. It provides a Python-based programming environment for writing custom DNN compute kernels that can run at maximal throughput on GPUs. Triton can be used to write GPU code as an alternative to CUDA, and it simplifies the development of specialized kernels. OpenAI open-sourced Triton on July 28, 2021.
Programming Model
In Triton, kernels are defined as decorated Python functions (@triton.jit) and launched concurrently with different program_ids on a grid of instances. In this respect, Triton is similar to Numba (a JIT compiler for Python) in how kernels are written. However, Triton performs operations on blocks (small arrays whose dimensions are powers of two) instead of following a SIMT (Single Instruction, Multiple Threads) execution model.
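The grid-of-instances model above can be sketched in plain Python. This is not real Triton code (a real kernel needs the triton package and a GPU); it is a minimal analogy in which each "program instance" is identified by a pid and processes one block of the data:

```python
# A plain-Python sketch (not real Triton code) of the blocked programming
# model: a grid of program instances, each identified by a program id,
# processes one whole block of the data.

BLOCK_SIZE = 4  # block dimensions in Triton are powers of two


def add_kernel(x, y, out, pid, block_size):
    """One program instance: adds one block of x and y into out."""
    start = pid * block_size
    # Operate on a whole block at once, rather than one scalar per thread.
    for i in range(start, min(start + block_size, len(x))):
        out[i] = x[i] + y[i]


def launch(x, y):
    """Launch the kernel on a 1D grid, one instance per block."""
    out = [0] * len(x)
    grid = (len(x) + BLOCK_SIZE - 1) // BLOCK_SIZE  # number of instances
    for pid in range(grid):  # on a GPU these instances run concurrently
        add_kernel(x, y, out, pid, BLOCK_SIZE)
    return out


print(launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # [11, 22, 33, 44, 55]
```

In real Triton, the launch loop is replaced by the GPU grid, and each instance reads its pid with tl.program_id.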
How does the @triton.jit decorator work?
The @triton.jit decorator walks the AST (Abstract Syntax Tree) of the Python function and uses an SSA (Static Single Assignment) construction algorithm to generate Triton-IR on the fly.
The Triton compiler backend simplifies, optimizes, and auto-parallelizes the Triton-IR code, converting it into high-quality LLVM-IR (Low Level Virtual Machine Intermediate Representation).
libLLVM then converts the LLVM-IR to PTX (Parallel Thread eXecution) code for execution on GPUs.
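The first step of this pipeline, walking the function's AST, can be illustrated with Python's standard ast module. This sketch only shows the AST-walking part; the SSA construction and Triton-IR emission that @triton.jit performs next are not shown:

```python
import ast

# Source of a toy kernel; in real Triton the decorator recovers this
# from the decorated function itself (e.g. via the inspect module).
source = """
def kernel(x, y):
    z = x + y
    return z
"""

# Parse the source into an AST, the first step before SSA construction
# and Triton-IR generation (not shown here).
tree = ast.parse(source)

# Walk the tree and record the kind of every node encountered; a compiler
# front end dispatches on these node kinds to emit IR.
node_kinds = [type(node).__name__ for node in ast.walk(tree)]

print("FunctionDef" in node_kinds)  # True: the kernel definition itself
print("BinOp" in node_kinds)        # True: the x + y operation
```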
Compiler Backend
The use of blocked program representations via Triton-IR allows the Triton compiler to automatically optimize important programs. Triton programs can be auto-parallelized:
Across SMs (Streaming Multiprocessors), by executing different kernel instances concurrently, and
Within SMs, by analyzing the iteration space of each block-level operation and partitioning it adequately across different SIMD units.
CUDA vs TRITON
CUDA Programming Model follows a Scalar Program, Blocked Threads approach
Triton Programming Model follows a Blocked Program, Scalar Threads approach
As Triton aims to be broadly applicable, it allows manual scheduling of work (e.g. tiling, inter-SM synchronization) across SMs (Streaming Multiprocessors). All other tasks are automatically optimized by Triton, enabling developers to focus on the high-level logic of their parallel code.
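The two approaches can be contrasted with a plain-Python sketch (not real CUDA or Triton code; all names are illustrative). In the CUDA style, each thread runs a scalar program on one element; in the Triton style, each program instance performs one array-valued operation on a whole block:

```python
# CUDA: "Scalar Program, Blocked Threads" - each thread executes a scalar
# program on a single element; threads are grouped into blocks.
def cuda_style_add(x, y, threads_per_block=4):
    out = [0] * len(x)
    n_blocks = (len(x) + threads_per_block - 1) // threads_per_block
    for block_id in range(n_blocks):
        for thread_id in range(threads_per_block):  # each "thread" is scalar
            i = block_id * threads_per_block + thread_id
            if i < len(x):                 # bounds check, as in real CUDA
                out[i] = x[i] + y[i]       # one scalar operation per thread
    return out


# Triton: "Blocked Program, Scalar Threads" - each program instance
# operates on a whole block as a single block-level operation.
def triton_style_add(x, y, block_size=4):
    out = [0] * len(x)
    n_programs = (len(x) + block_size - 1) // block_size
    for pid in range(n_programs):
        lo, hi = pid * block_size, min((pid + 1) * block_size, len(x))
        out[lo:hi] = [a + b for a, b in zip(x[lo:hi], y[lo:hi])]  # block op
    return out
```

Both produce the same result; the difference is the unit the programmer (and compiler) reasons about: a scalar per thread versus a block per program.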
Block Representation
Triton's block-based representation approach is highly beneficial, as it leads to block-structured iteration spaces that:
Allow programmers to manually handle load balancing as they wish, giving them more flexibility when implementing sparse operations, and
Allow compilers to optimize programs for data locality and parallelism
Work scheduling refers to how the work done by each program instance should be partitioned for efficient execution on GPUs. To address this challenge, the Triton compiler uses a block-level data-flow analysis technique that schedules iteration blocks based on the control- and data-flow structure of the target program. This is how the Triton compiler can automatically optimize many tasks.
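What a block-structured iteration space looks like can be sketched in a few lines of Python. This is an illustration, not compiler code: a 2D iteration space is partitioned into tiles, the shape the scheduler reasons about when distributing work:

```python
# Illustrative sketch: partition a rows x cols iteration space into
# block_m x block_n tiles (edge tiles may be smaller). Each tile is the
# unit of work a program instance would handle.
def tile_iteration_space(rows, cols, block_m, block_n):
    """Yield (row_range, col_range) tiles covering the whole space."""
    for m0 in range(0, rows, block_m):
        for n0 in range(0, cols, block_n):
            yield (range(m0, min(m0 + block_m, rows)),
                   range(n0, min(n0 + block_n, cols)))


tiles = list(tile_iteration_space(6, 6, 4, 4))
print(len(tiles))  # 4 tiles: one 4x4, one 4x2, one 2x4, one 2x2
```

Because the tiles are regular, a compiler can reason about which data each one touches and schedule them for locality and parallelism.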
Matrix Multiplication Kernels
In Triton, we can write FP16 matrix multiplication kernels on par with cuBLAS in just ~25 lines of code. Triton matrix multiplication kernels can also be customized to accommodate fused transformations of their inputs (e.g., slicing) and outputs (e.g., Leaky ReLU).
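The structure of such a kernel can be sketched in pure Python. This is not the real ~25-line Triton kernel (which uses tl.load/tl.dot/tl.store on a GPU); it is an illustrative blocked matmul with a fused Leaky ReLU epilogue, mirroring the tiling and fusion ideas above:

```python
# Pure-Python sketch of a blocked matrix multiplication with a fused
# Leaky ReLU output transformation. Each (m0, n0) tile corresponds to
# one program instance; the k0 loop accumulates partial products.
def blocked_matmul(A, B, block=2, leaky_slope=0.5):
    M, K = len(A), len(A[0])
    assert len(B) == K, "inner dimensions must match"
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, block):          # output tile rows
        for n0 in range(0, N, block):      # output tile cols
            for k0 in range(0, K, block):  # accumulate over K in blocks
                for m in range(m0, min(m0 + block, M)):
                    for n in range(n0, min(n0 + block, N)):
                        for k in range(k0, min(k0 + block, K)):
                            C[m][n] += A[m][k] * B[k][n]
    # Fused epilogue: apply Leaky ReLU to the output before "storing" it,
    # instead of launching a separate kernel for the activation.
    for m in range(M):
        for n in range(N):
            if C[m][n] < 0:
                C[m][n] *= leaky_slope
    return C
```

Fusing the epilogue into the matmul is the point of the customization mentioned above: the activation is applied while the output tile is still in registers, avoiding an extra pass over memory.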
Limitations
Triton currently supports only NVIDIA GPUs, but support for CPUs and AMD GPUs is expected to follow with the help of contributions from the community.
I am Sri Lakshmi, AI Practitioner, Developer & Technical Content Producer.