As part of the AI group at AMD, working on improving throughput (tokens/sec) of emerging LLM/SLM models on AMD Instinct GPUs across a variety of frameworks (PyTorch, Megatron, JAX).
- Spearheaded a mixed-precision effort focused on studying the impact of precision on different operators
and designing graph-level mixed-precision optimizations for an MLIR-based machine learning compiler.
- Served as compiler lead in bringing up customer-critical models on the custom compiler stack,
ensuring models were successfully lowered to the accelerator with reasonable performance and accuracy.
- Working on efficiently mapping large ML models onto the underlying hardware accelerator to
maximize performance and accuracy while enhancing the robustness of the compiler stack.
- Conducted an ILP limit study and built a performance-modeling analyzer to study the impact of various micro-architectural features across a diverse range of workloads.
- Studied performance bottlenecks of Vector Packet Processing (VPP) on Cavium processors.
- Developed an intelligent mode that fine-tunes various knobs to reduce the runtime of Design Compiler.
- Developed Firmleak, a pre-silicon design technique for runtime leakage-power characterization of processors.