2:00-2:30pm Section 1: Inference Overview (30m)
(1) Primer on LLM inference
(2) Primer on DL hardware accelerators
(3) Inference optimization opportunities
2:30-2:40pm Section 2: Structured Transformer architectures (10m)
(1) Grouped Query Attention
(2) Mixture of Experts
2:40-3:20pm Section 3: System Optimization (40m)
(1) Optimize the forward computation
(2) Improve device compute utilization
(3) Leverage LLM inference characteristics
(4) LLM Inference serving systems
3:20-3:30pm Break (10m)
3:30-4:15pm Section 4: Model compression (45m)
(1) Quantization
(2) Pruning
(3) Distillation
4:15-4:30pm Section 5: Fast decoding (15m)
(1) Speculative decoding
4:30-4:50pm Section 6: Case Studies (20m)
(1) Speculative decoding on Trn1
(2) INT8 quantization on Trn1
4:50-5:00pm Q&A (10m)