2:00-2:30pm Section 1: Inference Overview (30m)
(1) Primer on LLM inference
(2) Primer on DL hardware accelerators
(3) Inference optimization opportunities
2:30-2:40pm Section 2: Structured Transformer architectures (10m)
(1) Grouped Query Attention
(2) Mixture of Experts
2:40-3:20pm Section 3: System Optimization (40m)
(1) Optimize the forward computation
(2) Improve device compute utilization
(3) Leverage LLM inference characteristics
(4) LLM Inference serving systems
3:20-3:30pm Break (10m)
3:30-4:15pm Section 4: Model compression (45m)
(1) Quantization
(2) Pruning
(3) Distillation
4:15-4:30pm Section 5: Fast decoding (15m)
(1) Speculative decoding
4:30-4:50pm Section 6: Case Studies (20m)
(1) Speculative decoding on Trn1
(2) INT8 quantization on Trn1
4:50-5:00pm Q&A (10m)