I am a Staff ML Compiler Engineer and Tech Lead at Cruise working on an MLIR-enabled compilation stack to radically simplify ML model deployment onto a variety of commodity and in-house ML accelerators. I tech-lead the compiler frontend, providing seamless connectivity from our SOTA PyTorch models to the MLIR ecosystem so they can be optimized to target our heterogeneous compute platforms efficiently for latency-critical inference on the AV.
In my previous role at Xilinx (now AMD) I worked at the intersection of compilers and algorithms for deep learning optimizations such as quantization-aware-training and pruning. Our group developed advanced algorithms and compilers for machine learning acceleration on embedded and cloud platforms for low-latency, high-throughput applications. Here I worked on efficient training and inference of state-of-the-art CNN / LSTM networks including fixed-point modeling of deep neural nets, low precision training, quantization (dynamic fixed point, trained quantization, etc.), pruning, compression and graph optimizations. Prior to this I was with Oracle where I worked on building custom, high-TFLOP architectures to accelerate deep learning applications (training & inference) on cloud. I also developed and trained deep learning models to solve interesting problems in Integrated Circuit design. A long time ago I worked at Texas Instruments in the ultra low-power Mixed Signal Microcontrollers group (MSP430).
In 2016 I received my Master's degree in Electrical Engineering from Stanford University, where I was a Research Assistant in the EXtreme Environment Microsystems Lab (XLab) led by Prof. Debbie Senesky. Here my focus was on nanoscale engineering of wide-bandgap semiconductors such as gallium nitride (GaN) and silicon carbide (SiC) for sensing in harsh environments. I also took (and thoroughly enjoyed) Stanford CS offerings like CS229 (Machine Learning) by Andrew Ng, and CS231N (Convolutional Neural Networks for Visual Recognition) by Andrej Karpathy, Justin Johnson & Fei-Fei Li.
In 2012 I received my Bachelor's degree with the highest honor in Electrical and Electronics Engineering from the National Institute of Technology, Trichy. I also spent my undergraduate summers working on fascinating problems with eminent research groups at the Technical University of Munich and Indian Institute of Technology Madras.
2021 - present: Staff ML Compiler Engineer, Tech Lead at Cruise
2017 - 2021: Staff Machine Learning Engineer, Tech Lead at Xilinx/AMD
2016 - 2017: Senior Machine Learning Engineer at Oracle
2014 - 2016: M.S. candidate in Electrical Engineering at Stanford University | Research Assistant at XLab, Advisers: Prof. Dr. Debbie G. Senesky and Prof. Dr. Krishna Saraswat
2012 - 2014: Design Engineer at Texas Instruments
Summer 2011: Research Intern at TU Munich | Electronic Design Automation Group, Adviser: Prof. Dr.-Ing. Ulf Schlichtmann
Summer 2010: Research Assistant at IIT Madras | Dynamics and Control Lab, Adviser: Prof. Dr. Arun D. Mahindrakar
2008 - 2012: B.Tech. candidate in Electrical and Electronics Engineering at NIT Trichy | Bachelor Thesis Adviser: Prof. Dr. C. Nagamani
KC Mahindra Fellow (2014): One of the top 3 students from India chosen by the KC Mahindra Education Trust for an interest-free loan scholarship of INR 800,000 to pursue post-graduate studies at Stanford University
President Gold Medal (2008-12): Awarded for the overall highest CGPA in a batch of 689 undergraduates at the 8th convocation of NIT Trichy
RECAL Alumni Award (2012): Recognized as the outstanding student of the year 2011-2012 by RECAL, the alumni association of REC/NIT Trichy
DAAD-WISE Scholarship (2011): Selected by the German Academic Exchange Service (DAAD) to conduct academic research at the Technical University of Munich during the summer of 2011
GATE AIR-93 (2011): Obtained All-India-Rank 93 / 72680 in the Graduate Aptitude Test in Engineering (GATE-2011) in Electrical Engineering, conducted by IIT Madras
"Sambhav has a keen understanding of compiler internals and the skillset to develop maintainable software that lasts. Most of all, he has a ‘sixth sense‘ for identifying improvements that make the entire team more productive. Sambhav exhibits extreme ownership, always jumping in to help users & colleagues. This created a positive and collaborative environment for the team." - Sanjoy Das
"Sambhav is extremely passionate about high engineering quality in his work and leading cultural changes in teams across the board in valuing software quality and craftsmanship. He consistently impressed in all areas of software engineering from deep compiler work, setting up good repo practices, testing and CI and more." - Ravi Narayanaswami
"Sambhav really exemplifies #worktogether. He is gracious with his time, provides great PR feedback and in some cases helped catch subtle issues due to his close attention to detail. He identifies opportunities for large scale quality improvements and is able to create momentum behind these efforts, involving the whole team in the adoption. He has great connections to the PyTorch/MLIR communities which enables him to effectively clear roadblocks while also making many contributions upstream, greatly benefiting the Cruise brand." - Srinath Avadhanula
"Sambhav has the archetype of someone who likes to own and build the capability of a substantial piece of technology, and is very hands-on about it, leading from the front for both the broad design and the detailed implementation work. His methodical approach to planning and communicating the effort ahead of time for feedback means that I've never encountered any conflict around work he's driving. He's able to self-start and find important domains that need solutions, and drive them forward." - Suraj Sudhir
Cruise has invested extensively in its deep learning (DL) optimization platform for systematic and composable optimizations such as quantization-aware training (QAT) and structured pruning, offering fast and accurate inference on NVIDIA GPUs making up the autonomous vehicle stack. We developed a DL compiler for graph mode QdQ insertions that utilize the new explicit precision QAT mode in TensorRT 8. We further extend the search for highly accurate quantized representations through techniques like trained quantization thresholds. In addition, the DL compiler is equipped to allow efficient and automated search for promising pruning configurations through one-shot techniques.
We propose a method of training quantization thresholds (TQT) for uniform symmetric quantizers using standard backpropagation and gradient descent. Contrary to prior work, we show that a careful analysis of the straight-through estimator for threshold gradients allows for a natural range-precision trade-off leading to better optima. Our quantizers are constrained to use power-of-2 scale-factors and per-tensor scaling of weights and activations to make it amenable for hardware implementations. We present analytical support for the general robustness of our methods and empirically validate them on various CNNs for ImageNet classification. We are able to achieve near-floating-point accuracy on traditionally difficult networks such as MobileNets with less than 5 epochs of quantized (8-bit) retraining. Finally, we present Graffitist, a framework that enables automatic quantization of TensorFlow graphs for TQT.
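As a rough illustration of the quantizer described above, the following is a minimal sketch of the forward pass of a uniform symmetric quantizer with a power-of-2 scale factor. It is not taken from Graffitist: the function name `quantize_pow2`, the log2-domain threshold argument, the choice to round the threshold up to the next power of 2, and the signed 8-bit range are all assumptions made for the example. The threshold gradient through the straight-through estimator, which is the core of TQT, is not shown.

```python
import math

def quantize_pow2(x, log2_t, bits=8):
    """Forward pass of a uniform symmetric quantizer with a power-of-2 scale.

    log2_t is the (hypothetical) threshold parameter, held in the log2 domain.
    """
    # Assumption: round the log-domain threshold up so the scale is an exact power of 2.
    scale = 2.0 ** (math.ceil(log2_t) - (bits - 1))
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Scale, round to the nearest integer, saturate to the signed range, de-quantize.
    q = min(max(round(x / scale), q_min), q_max)
    return q * scale

values = [0.3, -1.7, 5.0]
quantized = [quantize_pow2(v, log2_t=1.0) for v in values]
print(quantized)  # 5.0 saturates because it lies beyond the threshold of 2.0
```

Raising `log2_t` widens the representable range at the cost of a coarser step, which is the range-precision trade-off that the trained threshold navigates.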
Routing is a complex spatial optimization problem in the physical design of integrated circuits that is NP-complete. The task is to connect different circuit segments (pins) spanning multiple layout hierarchies and multiple wire classes, while complying with a strict set of design rules dictated by the foundry’s process design kit (PDK). We present a deep convolutional neural network (CNN) that learns to route a circuit layout ‘net’ given its pin locations. The 15-layer network is trained on a dataset with 50k training and 10k validation samples that are generated based on pre-defined layout constraints. Input to the network is a layout sample with pin locations only. The network outputs 8 layers, corresponding to one pin layer, four route layers and three via layers, which are then decoded to obtain the final layout with predicted routes. Precision, recall and F1 score metrics are used to track the training progress. Our network achieves F1=97% on the train set and F1=92% on the validation set. We use PyTorch for training and implementation of the network.
We present a deep, fully convolutional neural network that learns to route a circuit layout net with appropriate choice of metal tracks and wire class combinations. Inputs to the network are the encoded layouts containing spatial location of pins to be routed. After 15 fully convolutional stages followed by a score comparator, the network outputs 8 layout layers (corresponding to 4 route layers, 3 via layers and an identity-mapped pin layer) which are then decoded to obtain the routed layouts. We formulate this as a binary segmentation problem on a per-pixel per-layer basis, where the network is trained to correctly classify pixels in each layout layer to be 'on' or 'off'. To demonstrate learnability of layout design rules, we train the network on a dataset of 50,000 train and 10,000 validation samples that we generate based on certain pre-defined layout constraints. Precision, recall and F1 score metrics are used to track the training progress. Our network achieves F1≈97% on the train set and F1≈92% on the validation set. We use PyTorch for implementing our model.
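The per-pixel metrics mentioned above can be sketched as follows. This is only an illustrative computation for a single binary layout layer, assuming masks are given as nested lists of 0/1; the function name `precision_recall_f1` and the tiny example masks are hypothetical and do not come from the paper's code or dataset.

```python
def precision_recall_f1(pred, target):
    """Per-pixel precision, recall and F1 for two binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # pixel predicted 'on' and truly 'on'
            elif p and not t:
                fp += 1      # pixel predicted 'on' but truly 'off'
            elif t and not p:
                fn += 1      # pixel predicted 'off' but truly 'on'
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical 2x3 layer: predicted route pixels vs. ground-truth route pixels.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
precision, recall, f1 = precision_recall_f1(pred, target)
```

For a multi-layer output such as the 8 layout layers described above, the same counts could be accumulated over all layers before computing the ratios.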
Sensors and electronics with robust operation under extreme harsh environments are required for a host of chemical, optical, physical and radiative applications. Gallium nitride (GaN) based sensors and electronics, particularly the high electron mobility transistor (HEMT) configuration, are emerging as candidates for gathering and processing local situational data from such high temperature, high radiation and corrosive environments. Here, a discussion of the material properties of the GaN platform and state-of-the-art sensor and electronics is provided, as well as future outlook for monolithic integration of components for high-temperature working environments.
In this paper, the electron mobility and sheet density of the two-dimensional electron gas (2DEG) in both air and argon environments at 600 °C were measured intermittently over a 5 h duration using unpassivated and Al2O3-passivated AlGaN/GaN (with 3 nm GaN cap) van der Pauw test structures. The unpassivated AlGaN/GaN heterostructures annealed in air showed the smallest decrease (∼8%) in 2DEG electron mobility while Al2O3-passivated samples annealed in argon displayed the largest drop (∼70%) based on the Hall measurements. Photoluminescence and atomic force microscopy showed that minimal strain relaxation and surface roughness changes have occurred in the unpassivated samples annealed in air, while those with Al2O3 passivation annealed in argon showed significant microstructural degradations. This suggests that cracks developed in the samples annealed in air were healed by oxidation reactions. To further confirm this, Auger electron spectroscopy was conducted on the unpassivated samples after the anneal in air and results showed that extra surface oxides have been generated, which could act as a dislocation pinning layer to suppress the strain relaxation in AlGaN. On the other hand, similar 2DEG sheet densities were observed in passivated and unpassivated AlGaN/GaN samples at the end of the 5-h anneal in air or argon due to the combined impact of strain relaxation and changes in the ionized electronic states. The results support the use of unpassivated GaN-capped AlGaN/GaN heterostructures as the material platform for high-temperature electronics and sensors used in oxidizing environmental conditions.
A microscale soot-particulate sensor using interdigitated platinum-gallium nitride (Pt-GaN) Schottky interfaces was developed to monitor fine soot particles within high-temperature environments (e.g., combustion exhausts and flues). Upon exposure to soot particles (30 to 50 nm in diameter) from an experimental chimney, an increased current (∼43.6%) is observed through the back-to-back Schottky contact to n-type GaN. This is attributed to a reduction in the effective Schottky barrier height (SBH) of ∼10 meV due to the electric field from the charged soot particles in the depletion region and exposed GaN surface. Furthermore, the microfabricated sensor was shown to recover sensitivity and regenerate the sensing response (∼11 meV SBH reduction) after exposure to temperature as high as 550 °C. This study supports the feasibility of a simple and reliable soot sensor to meet the increasing market demand for particulate matter sensing in harsh environments.
In the context of recent advancements in 3-phase phase-locked loop (PLL) structures to tackle grid imperfections, this paper attempts to shift focus towards dynamic response optimization for fast tracking of disturbed grids, as opposed to Wiener optimization, a trade-off between filtering characteristic and dynamic response. In this respect, an ingenious self-consistent model (SCM) based approach is proposed which explores filter design in the presence of frequency shifts and phase jumps, and facilitates the analytical computation of unique loop filter parameters. Trial and error in filter parameter selection is inconvenient, but more importantly, even rigorous trials would be insufficient in qualifying the non-existence of a better design. Having eliminated trial and error, this novel technique limits transients to user specifications while fixing on an optimum damping ratio, to yield the best fit. The design methodology is applied to three existing 3-phase PLL structures modelled in MATLAB/Simulink, and the proposed method is further evaluated through extensive simulations and performance comparisons with the traditional Wiener approach. To enhance the understanding of model behaviour and the feasibility of practical implementation, comprehensive three-dimensional (3-D) lookup tables are presented. They enable the study of optimized filter parameter variations for a range of grid disturbances, and broaden the application to filter optimization in real-time. In the interest of the reader, this paper is structurally split in two parts. Part 1 covers the premise and theory that explicates the proposed SCM methodology. The detailed analysis and verification of the SCM is covered in Part 2.
Systems and methods for training a neural network model include providing a quantization function including a quantization log threshold parameter associated with a log value of a quantization threshold. Quantization training is performed on a neural network model to generate quantized neural network parameters. The quantization training includes: generating first values with a first precision for the neural network parameters; performing a first optimization process to generate an updated quantization log threshold parameter; and generating quantized values with a second precision lower than the first precision for the neural network parameters by applying the quantization function with the updated quantization log threshold parameter to the first values. The neural network model with the quantized values for the neural network parameters is provided for performing a task.
We explore wide bandgap semiconductors such as III-Nitrides (AlN, GaN, AlGaN) and 4H-SiC as they are better suited for harsh environment sensors and electronics. The goal of this study is to understand defect kinetics in 4H-SiC single crystal substrates annealed in micro-gravity for improved device performance. We conduct 1) microstructural analysis through X-ray diffraction (XRD), scanning/transmission electron microscopy (SEM/TEM), atomic force microscopy (AFM), Auger electron spectroscopy (AES), Raman spectroscopy, and secondary ion mass spectroscopy (SIMS), 2) electrical characterization using Schottky/ohmic contact evolution studies, 3) thermal conductivity studies using phonon-defect scattering experiments, and 4) molecular dynamic simulations. Interesting results indicate reduction in higher order defects (e.g. stacking faults) in space annealed samples.
Aging of integrated circuits is a major reliability concern in digital circuit design. Continuous scaling of transistor dimensions without proportional downscaling of the supply voltage may lead to the degradation of circuit performance with time. Thus it is possible that a digital circuit that fulfilled the timing specifications right after manufacturing fails the specifications before the end of its specified lifetime. Negative Bias Temperature Instability (NBTI) is one such aging phenomenon that affects PMOS transistors stressed under Negative Bias (Vgs = -Vdd) at an elevated Temperature. An aging-aware optimization technique is therefore needed to reduce the safety margins under which a circuit is designed, improve the aged gate performance and enable the circuit to operate at higher frequencies. Joint Logic Restructuring is a scheme based on the concept of functional symmetry, aimed at reducing NBTI-induced performance degradation. This work focuses on the implementation of an algorithm to detect functional symmetries by identifying supergates in a circuit netlist. To achieve more realistic results than the pessimistic method that assumes a constant stress on all PMOS transistors in the design, the values of stress probability (SP) and transition density (TD) are extracted in real time from a memory dump created during the execution of instructions on a MIPS processor.
A single-pass assembler is designed for a specific instruction set. It reads assembly code, converts it to object code with symbols and their references, fetches the values and addresses pertaining to each symbol, and formulates the final symbol table. The implementation makes extensive use of hashing, parsing and file handling in C++. Symbol tables generated by single-pass assemblers are used for locating and relocating symbolic definitions in the input, which is needed during linking.
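To make the idea of a symbol table with references concrete, here is a minimal sketch in Python (the original project was written in C++). The toy syntax is an assumption made purely for illustration: a line ending in `:` defines a label, `;` starts a comment, and every other line is a one-word instruction with an optional operand. The function name `build_symbol_table` and the example program are hypothetical and unrelated to the actual instruction set. Python dictionaries stand in for the hash tables.

```python
def build_symbol_table(lines):
    """Single pass over assembly lines; returns (symbols, references).

    Assumed toy syntax: 'label:' defines a symbol at the current address,
    and every other non-empty line is a one-word instruction 'OP [operand]'.
    """
    symbols = {}      # symbol name -> address where it is defined
    references = {}   # symbol name -> addresses of instructions that use it
    address = 0
    for raw in lines:
        line = raw.split(";")[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if line.endswith(":"):
            symbols[line[:-1]] = address
            continue
        parts = line.split()
        # A non-numeric operand is treated as a symbol reference; it may be a
        # forward reference whose definition has not been seen yet.
        if len(parts) > 1 and not parts[1].lstrip("-").isdigit():
            references.setdefault(parts[1], []).append(address)
        address += 1
    return symbols, references

program = [
    "start:",
    "  LOAD 5",
    "  JMP end      ; forward reference",
    "  ADD 1",
    "end:",
    "  JMP start",
]
symbols, references = build_symbol_table(program)
```

Because references are recorded separately from definitions, a later step (or a linker) can patch each referencing address once the symbol's final location is known.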
Site inspired by Andrej Karpathy's webpage.