TRITON - GPU language

Triton is a language and compiler for parallel programming. It provides a Python-based programming environment for writing custom DNN compute kernels that can run at maximal throughput on GPUs. The Triton language can be used to write GPU code instead of CUDA, and it simplifies the development of specialized kernels. OpenAI open-sourced Triton on July 28, 2021.

Programming Model

In Triton, kernels are defined as decorated Python functions (@triton.jit) and launched concurrently with different program_id's on a grid of instances. In this respect Triton resembles Numba, another JIT compiler for Python kernels. However, Triton performs operations on blocks (small arrays whose dimensions are powers of two) rather than following the SIMT (Single Instruction, Multiple Thread) execution model, as the sketch below illustrates.
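
As a concrete illustration, here is a minimal vector-addition kernel written in the style of the official Triton tutorials (the names add_kernel and BLOCK_SIZE are illustrative, not mandated by Triton):

    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one block of BLOCK_SIZE elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements          # guard against out-of-bounds accesses
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

Note that the kernel body manipulates whole blocks of offsets and values at once; there is no per-thread index anywhere in the code.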


How does the @triton.jit decorator work?

  • The @triton.jit decorator walks the AST (Abstract Syntax Tree) of the Python function and uses an SSA (Static Single Assignment) construction algorithm to generate Triton-IR on the fly

  • The Triton compiler backend simplifies, optimizes, and auto-parallelizes the Triton-IR code, converting it into high-quality LLVM-IR (Low Level Virtual Machine - Intermediate Representation)

  • libLLVM then converts the LLVM-IR to PTX (Parallel Thread eXecution) code for execution on the GPU
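
These stages can be inspected directly. In Triton 2.x, the handle returned by a kernel launch caches the intermediate artifacts; the exact attribute names vary across versions, so treat the snippet below as an assumption about the API rather than a stable interface:

    # Assumes add_kernel from the earlier sketch has been launched once, e.g.:
    #     compiled = add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    # In Triton 2.x the returned handle exposes an .asm dict keyed by stage
    # (names may differ in other versions):
    print(compiled.asm["ttir"])   # Triton-IR generated from the Python AST
    print(compiled.asm["llir"])   # LLVM-IR produced by the compiler backend
    print(compiled.asm["ptx"])    # PTX handed to the GPU driver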


Compiler Backend

The use of a blocked program representation via Triton-IR allows the Triton compiler to automatically optimize a wide range of programs. Triton programs can be auto-parallelized, as the launch sketch after this list illustrates:

  • Across SMs by executing different kernel instances concurrently, and

  • Within SMs by analyzing the iteration space of each block-level operation and partitioning it adequately across different SIMD units
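
The across-SM parallelism comes directly from the launch grid: each grid entry becomes an independent kernel instance. A minimal sketch, reusing add_kernel from above and assuming a CUDA device is available:

    import torch
    import triton

    x = torch.rand(98432, device="cuda")
    y = torch.rand_like(x)
    out = torch.empty_like(x)
    n = x.numel()

    # One program instance per block of 1024 elements; the instances run
    # concurrently across SMs, while Triton automatically partitions the
    # block-level operations inside each instance across SIMD units.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)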


CUDA vs TRITON


The CUDA programming model follows a Scalar Program, Blocked Threads approach.

The Triton programming model follows a Blocked Program, Scalar Threads approach.

Because Triton aims to be broadly applicable, it leaves the scheduling of work across SMs (Streaming Multiprocessors), such as tiling and inter-SM synchronization, to the developer. Everything else is automatically optimized by Triton, enabling developers to focus on the high-level logic of their parallel code.
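
The difference between the two models is easiest to see on element-wise indexing. Below is a hedged sketch: the CUDA side is shown in comments, and the Triton side as a small kernel (scale_kernel and its parameters are illustrative names):

    import triton
    import triton.language as tl

    # CUDA (scalar program, blocked threads): each of many threads computes
    # a single element index, e.g.
    #     int i = blockIdx.x * blockDim.x + threadIdx.x;
    #     if (i < n) z[i] = scale * x[i];
    #
    # Triton (blocked program, scalar threads): one program instance owns a
    # whole block of indices, and the compiler maps it onto hardware threads.
    @triton.jit
    def scale_kernel(x_ptr, z_ptr, n, scale, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        idx = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # a block of indices
        mask = idx < n                                     # block-level bounds check
        tl.store(z_ptr + idx, tl.load(x_ptr + idx, mask=mask) * scale, mask=mask)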


Block Representation


Triton's block-based representation is highly beneficial because it leads to block-structured iteration spaces that:

  • Allow programmers to handle load-balancing manually as they see fit, giving them more flexibility when implementing sparse operations

  • Allow compilers to optimize programs for data locality and parallelism

Work scheduling refers to how the work done by each program instance should be partitioned for efficient execution on GPUs. To address this challenge, the Triton compiler uses block-level data-flow analysis, a technique that schedules iteration blocks based on the control- and data-flow structure of the target program. This is how the Triton compiler can automatically optimize so many tasks.


Matrix Multiplication Kernels

In Triton, an FP16 matrix multiplication kernel that performs on par with cuBLAS can be written in roughly 25 lines of code. Such kernels can also be customized to accommodate fused transformations of their inputs (e.g., slicing) and outputs (e.g., leaky ReLU), as in the sketch below.
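
Here is a condensed sketch in the spirit of the official Triton matmul tutorial. The kernel name, block sizes, and the 0.01 leaky-ReLU slope are illustrative choices, and C is assumed to be an FP16 tensor:

    import triton
    import triton.language as tl

    @triton.jit
    def matmul_leaky_relu_kernel(
        a_ptr, b_ptr, c_ptr, M, N, K,
        stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
    ):
        # Each program instance computes one BLOCK_M x BLOCK_N tile of C.
        pid_m = tl.program_id(axis=0)
        pid_n = tl.program_id(axis=1)
        rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
        rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
        rk = tl.arange(0, BLOCK_K)
        a_ptrs = a_ptr + rm[:, None] * stride_am + rk[None, :] * stride_ak
        b_ptrs = b_ptr + rk[:, None] * stride_bk + rn[None, :] * stride_bn
        acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
        for k in range(0, K, BLOCK_K):
            a = tl.load(a_ptrs, mask=(rm[:, None] < M) & (rk[None, :] + k < K), other=0.0)
            b = tl.load(b_ptrs, mask=(rk[:, None] + k < K) & (rn[None, :] < N), other=0.0)
            acc += tl.dot(a, b)              # block-level matrix product
            a_ptrs += BLOCK_K * stride_ak    # advance along K
            b_ptrs += BLOCK_K * stride_bk
        acc = tl.where(acc >= 0, acc, 0.01 * acc)  # fused leaky-ReLU epilogue
        c_ptrs = c_ptr + rm[:, None] * stride_cm + rn[None, :] * stride_cn
        tl.store(c_ptrs, acc.to(tl.float16), mask=(rm[:, None] < M) & (rn[None, :] < N))

Launching this kernel on a grid of (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N)) instances assigns one output tile per instance; reaching cuBLAS-level performance additionally requires tuning the block sizes, for example with triton.autotune.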


Limitations

Triton currently supports only NVIDIA GPUs; support for CPUs and AMD GPUs is expected to follow with the help of contributions from the community.

