About:
Santoshkumar Tongli is a passionate and skilled professional specializing in Machine Learning, Embedded Intelligent Systems, Computer Vision, and C/C++ system-level kernel development for AI accelerators. He holds a Bachelor's degree in Electronics and Communications and is currently pursuing a Master's degree in Computer Science at Colorado State University. He has extensive experience in compiler optimization techniques, parallel programming, and distributed AI model training, and has worked on diverse Deep Learning projects during both his bachelor's and master's studies, further strengthening his expertise in Machine Learning and Computer Vision algorithms. His combined academic and professional experience equips him with a unique skill set that bridges low-level system development and advanced AI applications, making him a valuable contributor to the field of artificial intelligence.
Work Experience
Software Research Intern: San Jose, California, USA
Developed a web-based software tool for Machine Learning (ML) developers to visualize and analyze trained AI models, enhancing model interpretability and debugging capabilities.
Created a Python-based backend library to parse Large Language Models (LLMs) and perform layer-wise analysis, memory-footprint calculation, and roofline performance estimation, offering AI model developers actionable insights for optimizing model development and deployment.
Integrated profiling tools to monitor performance bottlenecks in ML workloads, helping optimize memory usage and execution time.
Developed a Visual Studio Code (VS Code) extension for the software tool, enabling it to function independently as a standalone application and improving developer accessibility.
Contributed to the development of a VS Code extension tool for a client, providing UI support to interact with remotely connected AI Accelerators to train, test, compile, deploy, and debug AI models through an intuitive interface.
Skills: Frontend Development | TypeScript | Flask | REST API | VS Code APIs | NET Licensing | Python-based Backend Development.
Graduate Teaching Assistant: Fort Collins, Colorado, USA.
Colorado State University
COMPUTER SYSTEM FOUNDATIONS:
Conducted recitation classes for undergraduate students, covering topics such as numerical systems, computer architecture, memory hierarchy, networking protocols, and data storage algorithms.
Facilitated in-class learning experiences, enhancing student understanding through interactive sessions and practical examples.
Managed course logistics, including grading assignments, assisting students with coursework, and overseeing the course software portal.
Senior Software Engineer:
Led a team of five, managing Git repositories, CI/CD pipelines, and customer deliverables, and overseeing project tasks in JIRA to ensure timely and efficient project execution.
Optimized deep learning kernels for various AI accelerators, including Digital Signal Processors (DSPs) using SIMD instructions and SRAM-based in-memory computing chips (Custom Neural Processors/ASICs), enhancing computational performance.
Increased the throughput of the BERT model's convolution layer by 3.5x by scaling and reengineering the data pipeline, significantly improving model efficiency.
Scaled ResNet-50, MobileNet, and SSD models for multi-core systems, focusing on parallelizing operations to achieve double the frames per second (FPS) for ResNet-50 inference.
Developed custom MLIR-based compiler pipelines for hardware-specific optimization and lowering processes, enabling efficient deployment of machine learning models on specialized accelerators.
Developed and implemented an optimal selection strategy for kernels and configurations in MLIR programs, enabling the compiler to automatically select the most efficient implementation for enhanced computational performance.
Possess strong fundamentals in MLIR dialects, LLVM, and AI-related compilers, with hands-on experience in compiler integration and optimization techniques.
Software Engineer:
Expertise in training and optimizing AI models using TensorFlow and PyTorch, employing techniques like quantization and pruning to enhance model efficiency and performance.
Designed and implemented high-performance C/C++ kernels for Deep Learning SDKs on multiple AI accelerators, contributing to improved computational throughput and hardware utilization.
Developed Python-based API functions and compiler components, including MLIR dialects, unit tests, and parser functions, to streamline model inference within machine learning frameworks.
Optimized Machine Learning SDKs with hardware-specific enhancements, reducing model latency by 25% for classification and object detection models such as YOLOX-L and YOLACT through targeted optimizations.
Proficient in memory profiling and model optimization tools, utilizing ONNX Runtime for model format conversion and optimization; experienced with compilers such as OpenVINO, TVM, and OpenXLA for efficient model deployment.
Skills: C | C++ | Python | TensorFlow | PyTorch | Quantization | Pruning | MLIR | LLVM | AI Models | ONNX Runtime | OpenVINO | OpenXLA | TVM | AI Accelerators | GPU | CUDA.
Education
Master of Computer Science: Fort Collins, USA.
Colorado State University
Coursework:
Machine Learning
Fault Tolerance Computing
Embedded Systems for Machine Learning
Parallel Programming
Computer Security
Compiler for High-Performance Program Generation
Image Computing (Computer Vision)
Research work - Optimizing sparse matrix-vector computations using polyhedral affine transformations to facilitate compiler optimization, compressed storage, and efficient execution.
Bachelor in Electronics and Communication: Hubballi, India.
KLE Technological University
Coursework included C Programming, Circuit Analysis, Computer Architecture, Computer Vision, Control Systems, Data Structures and Algorithms, Digital and Analog Circuits, Machine Learning, Microcontrollers, Operating Systems, and Signals and Communication Systems.
ACADEMIC PROJECTS:
Intrusion Detection using Deep Learning
Developed a TabNet-based model for network intrusion detection using PyTorch, training the model on GPU with cuDNN. The model provides detailed statistical insights into its decision-making process by analyzing input feature variations relative to general trends, enhancing the interpretability of its predictions and allowing a deeper understanding of the factors contributing to potential intrusions. Employed data analytics techniques to visualize and interpret model behavior, significantly improving the effectiveness of network security measures.
Tile Selection for Tensor Processing Units using Graph Neural Networks
Developed a Graph Neural Network (GNN) model using PyTorch, PyTorch Lightning, and PyTorch Geometric to predict the top-K tiles for efficient tensor computation on Tensor Processing Units (TPUs). Using the Google TPUGraphs dataset, the project aimed to enhance the computational efficiency of neural network graphs on TPUs by reducing execution times. Implemented distributed training with Horovod, achieving 98% model accuracy.
Smart HomeGuard: Real-time Anomaly Detection for Domestic Safety
Developed a real-time anomaly detection framework for video streams to enhance domestic security, utilizing a 3D Convolutional Neural Network (3D CNN) architecture. Trained on 140 GB of video data (12 GB of processed video frames), the model effectively identifies unusual activities and alerts users via a mobile application. Implemented using TensorFlow and OpenCV, the solution was deployed on Raspberry Pi 3 and leveraged Google Cloud Platform/Firebase for data storage and analysis. This implementation achieves efficient real-time performance in processing and analyzing video feeds, significantly improving home security monitoring systems.
Refining sparse 3D point clouds through deep learning models
Developed a Convolutional Neural Network (CNN)-based model to upsample sparse 3D heritage point clouds using edge-aware points, significantly enhancing the density and accuracy of 3D reconstructions for cultural heritage preservation. This deep learning approach addressed the challenges of refining low-density point clouds, resulting in improved detail and fidelity in 3D models. Presented this work at SIGGRAPH Asia 2020, showcasing the innovative application of CNNs in enhancing sparse 3D point cloud data.
Framework to categorize crowd-sourced data using incremental learning
As part of a Sponsored Research Project (SRP) with Indian Heritage in Digital Space (IHDS), developed a framework for categorizing and selecting unlabeled data. The framework employed nested clustering algorithms and incremental learning, training nuisance classifiers for each newly identified class. This approach enhanced the classification and selection of unlabeled data, improving the efficiency and accuracy of data management within the project.
Offline Caffe2 parser for the ARM NN framework (Samsung R&D, PRISM Program)
Developed an offline Caffe2 parser for the ARM NN framework as part of Samsung R&D's PRISM Program, receiving the Best Project Award and a prize of ₹40,000. Implemented the MobileNetV2 parser to convert Caffe2 models into the ARM NN format, storing the parsed output with FlatBuffers for efficient serialization. Key contributions included writing parser functions for layers such as Convolution, Softmax, Max Pool, and ReLU, and designing the FlatBuffers schema to store the parsed ARM NN graph. This work extended the ARM NN framework's support for Caffe2 models, enabling efficient deployment of neural networks on ARM-based devices.
Achievements
For implementing a convolutional kernel supporting high depth.
For implementing a bilinear kernel ahead of the deadline.
For upgrading the entire padding kernel from the v1 to the v2 hardware SDK within a day.
Santoshkumar Tongli
630 C, S Whitcomb Street,
Fort Collins, Colorado, USA - 80521.