I am a Ph.D. scholar advised by Dr. Saket Anand in the Department of Computer Science and Engineering at IIIT Delhi, India. We are affiliated with The Vision Lab at the Infosys Centre for Artificial Intelligence. I am honored to have received the prestigious Qualcomm Innovation Fellowship and the CHANAKYA Doctoral Fellowship, and to have been supported by the Google AI for Social Good Program.
My research interests lie in Machine Learning and Deep Learning, particularly their application to complex real-world problems.
My Ph.D. research is centered on computer vision, primarily on exploiting structural priors for visual recognition tasks. These priors can be represented as taxonomy trees or graphs that capture semantic and geometric relationships, and encoding them injects domain knowledge as an inductive bias. I have used them to develop algorithms for robust object recognition, multi-modal multi-object tracking, visual re-identification, and active learning.
I have joined Mr. Alok Talekar's team at Google DeepMind as a Student Researcher to work on "AI for Agricultural Remote Sensing".
I was invited as a guest lecturer for “Spatial Statistics and Spatial Econometrics” by Dr. Gaurav Arora.
I was invited by Dr. Chiranjib Choudhuri to deliver a talk at Qualcomm (ADAS team) on “Hierarchy-Aware Feature Representations to Reduce Severity of Mistakes for Robust Visual Perception”.
I was awarded the prestigious Qualcomm Innovation Fellowship and the CHANAKYA Doctoral Fellowship.
Our work on fine-grained visual recognition of wildlife species was deployed to process 50 million camera trap images for the All India Tiger Estimation (Prof. Qamar Qureshi), which was recognized as the largest camera trap survey in the world.
Our work on AI for Agriculture using remote sensing led to a first-of-its-kind dataset, SICKLE, accepted as an Oral Presentation at WACV 2024.
Our work on “Graph-Based Statistical Analysis of Entire Scenes by Combining Multi-Sensor, Multi-Perspective Video Streams” was one of the 30 projects supported by NSF-DST Workshop 2023.
Our work on “High-Resolution Satellite Imagery for Modeling the Impact of Aridification on Crop Production” was recognized for its methods at Google AI4SG Workshop 2021.
Winner of the Samsung AI Hackathon 2020 for the product “On-Device Dynamic Emoji Generation”.
Google DeepMind, Bangalore, India
Student Researcher | Advisor: Mr. Alok Talekar
Aug 2025 — Dec 2025
Florida State University, Tallahassee, Florida, USA
Visiting Student Researcher | Advisor: Dr. Anuj Srivastava
May 2023 — June 2023
Samsung Research Institute, Noida, Uttar Pradesh, India
Research Engineer | Manager: Mr. Saurabh Garg
Sep 2019 — Aug 2021
Legal Raasta Tech. Pvt. Ltd., New Delhi, India
Full Stack Developer | Manager: Mr. Shubham Jain
Mar 2019 — Sep 2019
Ministry of Electronics and Information Technology, New Delhi, India
Intern | Advisor: Mr. Nishit Gupta
Jun 2018 — Jul 2018
Artificial Intelligence, IIIT Delhi, India
Head Teaching Assistant | Faculty: Dr. Saket Anand
Monsoon 2024
Computer Vision, IIIT Delhi, India
Head Teaching Assistant | Faculty: Dr. Saket Anand
Winter 2023
Machine Learning, IIIT Delhi, India
Teaching Assistant | Faculty: Dr. Saket Anand
Monsoon 2022
Mentees: Ojaswani Sharma (M. Tech.), Pranav Bansal (RA), Vivaswan Nawani (B. Tech.), Jeet Mukherjee (RA), Yatin Phalak (M. Tech.), Divyanshu Bhati (RA), Harsh Bindal (RA), Mehar Khurana (B. Tech.), Atharv Goel (B. Tech.), Anirudh Iyer (B. Tech.), Prakhar Rai (B. Tech.), Sourabh Saini (B. Tech.), Harsh Kumar Agarwal (B. Tech.) and Parichya Sirohi (RA)
Reviewed Journals / Conferences: IJCV, CVPR, ICCV, ECCV and NCC
Organizing Committee Member: ICVGIP Data Challenge 2021, Anveshan 2.0 LiDAR Segmentation Challenge and Pasteur’s Quadrant Seminar Series
Served as a mentor for Competitive Data Engineering and Artificial Intelligence (CoDE-AI), a student group that competes in data challenges.
Active Learning for Animal Re-Identification with Ambiguity-Aware Sampling, Under Review, 2025
Animal re-identification (Re-ID) has recently gained substantial attention in the AI research community due to its high impact on biodiversity monitoring and unique research challenges arising from environmental factors. Subtle distinguishing patterns such as stripes or spots, the need to handle new species, and the inherent open-set nature of the task make the problem even harder. To address these complexities, foundation models trained on labeled, large-scale, multi-species animal Re-ID datasets have recently been introduced to enable zero-shot Re-ID. However, our benchmarking reveals significant gaps in their zero-shot Re-ID performance for both known and unknown species. While this highlights the need for collecting labeled data in new domains, exhaustive annotation for Re-ID is laborious and requires domain expertise. Our analyses also show that existing unsupervised (USL) and active learning (AL) Re-ID methods underperform for animal Re-ID. To address these limitations, we introduce a novel AL Re-ID framework that leverages complementary clustering methods to uncover and target structurally ambiguous regions in the embedding space, mining pairs of samples that are both informative and broadly representative of the visual space. Oracle feedback on these pairs, in the form of must-link and cannot-link constraints, facilitates a simple annotation interface that naturally integrates with existing USL methods through our proposed constrained clustering refinement algorithm. Through extensive experiments, we demonstrate that, by utilizing only 0.1% of all possible annotations, our approach consistently outperforms existing foundational, USL and AL baselines for animal Re-ID. Specifically, we report an average improvement of 10.49%, 11.19% and 3.99% (mAP) on 13 wildlife datasets over foundational, USL and AL methods, respectively, while attaining state-of-the-art performance on each dataset. Furthermore, we also show an improvement of 11.09%, 8.2% and 2.06% (AUC ROC) for unknown individuals in an open-world setting. For completeness, we also present a comparative analysis on 2 publicly available person Re-ID datasets, showing average gains of 7.96% and 2.86% (mAP) over existing USL and state-of-the-art AL Re-ID methods. For reproducibility, we will open-source our code and models upon acceptance.
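As a rough illustration of the ambiguity-aware sampling idea, the sketch below mines pairs on which two complementary clusterings disagree; the clustering choices, budget, and function names are placeholders and this is not the method released with the paper.

```python
# Hypothetical sketch of ambiguity-aware pair mining for active Re-ID labeling.
# Clustering choices and prioritization are illustrative only.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def mine_ambiguous_pairs(embeddings: np.ndarray, n_clusters: int, budget: int):
    """Return index pairs on which two complementary clusterings disagree."""
    labels_a = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    labels_b = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeddings)

    # Normalize embeddings so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            # A pair is "structurally ambiguous" if the two clusterings disagree
            # on whether i and j belong to the same cluster.
            if (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j]):
                sim = float(normed[i] @ normed[j])
                pairs.append((sim, i, j))

    # Prioritize disputed pairs that are closest in embedding space.
    pairs.sort(key=lambda t: -t[0])
    return [(i, j) for _, i, j in pairs[:budget]]

# Oracle feedback on each returned pair becomes a must-link or cannot-link
# constraint that a constrained clustering refinement step can then respect.
```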
Traditional classifiers often treat categories as independent and can therefore fail drastically, making severe mistakes. Learning hierarchy-aware representations based on a given taxonomy mitigates this problem by capturing semantic correlations between categories. Existing methods attempt to resolve this problem either through a post-hoc adaptation strategy or during the training phase by using carefully designed loss functions. The quality of hierarchical representations is often indirectly measured via evaluation metrics like Mistake Severity (MS), Average Hierarchical Distance (AHD) and their variants, which compute a distance between the predicted and the true categories using some inter-node distance, such as the Lowest Common Ancestor (LCA), from the taxonomy tree. In this paper, we make two key contributions to hierarchical classification: 1) We propose a novel framework, Hierarchical Composition of Orthogonal Subspaces (Hier-COS), which learns to map deep feature embeddings from any neural network backbone into a vector space that is, by design, consistent with the structure of a given taxonomy tree and therefore reduces the severity of mistakes. 2) We highlight important shortcomings in existing evaluation metrics like MS and AHD, and argue for a ranking-based metric by proposing the Hierarchically Ordered Preference Score (HOPS), which demonstrably overcomes these limitations. We benchmark our method on four challenging datasets, including tieredImageNet-H with a deep 12-level hierarchy and iNat-19, a fine-grained recognition dataset with a 7-level hierarchy. Through extensive experiments, we demonstrate that Hier-COS achieves state-of-the-art performance across all hierarchical metrics, including the newly proposed HOPS, on every dataset, while simultaneously beating the top-1 accuracy of existing methods in all but one case. We also demonstrate that a frozen Vision Transformer (ViT) backbone alone can be used within the Hier-COS framework and yields substantial gains in hierarchical classification performance.
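For readers unfamiliar with LCA-based metrics such as Mistake Severity, the toy snippet below shows one common way to compute the LCA distance between a predicted and a true label on a small taxonomy; it is purely illustrative and is not the Hier-COS or HOPS implementation.

```python
# Illustrative LCA-based mistake severity on a toy taxonomy (one common variant:
# number of edges from the true label up to the lowest common ancestor).
def ancestors(node, parent):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def mistake_severity(pred, true, parent):
    anc_pred = ancestors(pred, parent)
    anc_true = ancestors(true, parent)
    lca = next(n for n in anc_true if n in anc_pred)  # lowest common ancestor
    return anc_true.index(lca)                        # edges from true label to LCA

# Toy taxonomy: carnivora -> {felidae -> {tiger, leopard}, canidae -> {wolf}}
parent = {"tiger": "felidae", "leopard": "felidae", "wolf": "canidae",
          "felidae": "carnivora", "canidae": "carnivora"}
print(mistake_severity("leopard", "tiger", parent))  # 1 (sibling confusion)
print(mistake_severity("wolf", "tiger", parent))     # 2 (more severe mistake)
```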
Recent progress in open-source object detection techniques has significantly advanced Multi-Object Tracking (MOT) methodologies, primarily under the tracking-by-detection paradigm. To enhance the robustness and reliability of MOT systems, recent research has proposed integrating information gathered from diverse sensors. However, many Kalman filter-based MOT approaches assume the independence of object trajectories, overlooking potential inter-object relationships. While some efforts have been made to incorporate these relationships, they often concentrate on learning feature representations to facilitate better association. Moreover, existing filter-based methods for estimating graphs from noisy data are unsuitable for online MOT applications. To alleviate these problems, we introduce the Sensor Agnostic Graph-Aware (SAGA) Kalman filter, the first online state estimation technique designed to fuse multi-modal graphs derived from noisy multi-sensor data. We validate the effectiveness of our proposed framework through extensive experiments conducted on both synthetic data and a real-world driving dataset (nuScenes). Our results showcase an improvement in MOTA and a reduction in estimated position errors (MOTP) and identity switches (IDS) for tracked objects using SAGA-KF. More details are available here.
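For context, the snippet below sketches the standard constant-velocity Kalman filter predict/update cycle that tracking-by-detection pipelines build on; SAGA-KF extends this kind of filter with graph-aware, multi-sensor fusion, which is not shown here.

```python
# A minimal constant-velocity Kalman filter step (illustrative baseline only).
import numpy as np

def kf_step(x, P, z, dt=0.1, q=1e-2, r=1e-1):
    """One predict/update cycle for state x = [px, py, vx, vy], measurement z = [px, py]."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # constant-velocity motion model
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # only position is observed
    Q = q * np.eye(4)                           # process noise covariance
    R = r * np.eye(2)                           # measurement noise covariance

    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q

    # Update
    y = z - H @ x_pred                          # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new
```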
The availability of well-curated datasets has driven the success of Machine Learning (ML) models. Despite greater access to earth observation data in agriculture, there is a scarcity of curated and labelled datasets, which limits their use in training ML models for remote sensing (RS) in agriculture. To this end, we introduce a first-of-its-kind dataset called SICKLE, which constitutes a time series of multi-resolution imagery from 3 distinct satellites: Landsat-8, Sentinel-1 and Sentinel-2. The dataset spans multi-spectral, thermal and microwave sensors over the January 2018 to March 2021 period. We construct each temporal sequence by considering the cropping practices followed by farmers primarily engaged in paddy cultivation in the Cauvery Delta region of Tamil Nadu, India, and annotate the corresponding imagery with key cropping parameters at multiple resolutions (i.e. 3m, 10m and 30m). Our dataset comprises 2,370 season-wise samples from 388 unique plots, having an average size of 0.38 acres, for classifying 21 crop types across 4 districts in the Delta, which amounts to approximately 209,000 satellite images. Out of the 2,370 samples, 351 paddy samples from 145 plots are annotated with multiple crop parameters, such as the variety of paddy, its growing season and productivity in terms of per-acre yield. Ours is also among the first studies to consider growing-season activities pertinent to crop phenology (sowing, transplanting and harvesting dates) as parameters of interest. We benchmark SICKLE on three tasks: crop type, crop phenology (sowing, transplanting, harvesting) and yield prediction. More details are available here.
Label hierarchies are often available a priori as part of biological taxonomies or language datasets such as WordNet. Several works exploit these hierarchies to learn hierarchy-aware features so that the classifier makes semantically meaningful mistakes while maintaining or reducing the overall error. In this paper, we propose a novel approach for learning Hierarchy Aware Features (HAF) that leverages classifiers at each level of the hierarchy which are constrained to generate predictions consistent with the label hierarchy. The classifiers are trained by minimizing a Jensen-Shannon Divergence with target soft labels obtained from the fine-grained classifiers. Additionally, we employ a simple geometric loss that constrains the feature space geometry to capture the semantic structure of the label space. HAF is a training-time approach that reduces the severity of mistakes while maintaining top-1 error, thereby addressing the limitation of the cross-entropy loss, which treats all mistakes as equal. We evaluate HAF on three hierarchical datasets and achieve state-of-the-art results on the iNaturalist-19 and CIFAR-100 datasets. The source code is available here.
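As an illustration of the divergence term mentioned above, the following sketch computes a Jensen-Shannon divergence between a level classifier's predicted distribution and a soft target; it assumes PyTorch and is not the released HAF training code.

```python
# Hedged sketch of a Jensen-Shannon divergence loss between a level classifier's
# logits and a soft-label target distribution (illustrative only).
import torch
import torch.nn.functional as F

def js_divergence(logits_p: torch.Tensor, target_probs: torch.Tensor) -> torch.Tensor:
    """JSD between softmax(logits_p) and a given soft-label distribution."""
    p = F.softmax(logits_p, dim=-1)
    q = target_probs
    m = 0.5 * (p + q)
    # F.kl_div expects log-probabilities as the first argument.
    kl_pm = F.kl_div(m.log(), p, reduction="batchmean")  # KL(p || m)
    kl_qm = F.kl_div(m.log(), q, reduction="batchmean")  # KL(q || m)
    return 0.5 * (kl_pm + kl_qm)

# Example: align a coarse-level classifier with soft labels aggregated elsewhere.
logits = torch.randn(8, 20)           # batch of coarse-level logits
soft_targets = F.softmax(torch.randn(8, 20), dim=-1)
loss = js_divergence(logits, soft_targets)
```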
Using the proposed approach, we developed a tool, CaTRAT (Camera Trap Data Repository and Analysis Tool), which is now used for the “All India Tiger Estimation” by the Government of India.
The integration of modern Machine Learning (ML) models into remote sensing has expanded the scope of satellite imagery applications in agriculture. In this paper, we present how the accuracy of crop type identification improves as we move from medium-spatiotemporal-resolution (MSTR) to high-spatiotemporal-resolution (HSTR) satellite images. We further demonstrate that high spectral resolution in satellite imagery can improve prediction performance for low-spatial-and-temporal-resolution (LSTR) images. The F1-score increases by 7% when using multispectral data from MSTR images compared to the best results obtained from HSTR images. Similarly, when a crop-season-based time series of multispectral data is used, we observe an increase of 1.2% in the F1-score. These outcomes motivate further advancements in the field of synthetic band generation.
Emojis are a succinct and visual way to express feelings, emotions, and thoughts during text conversations. With the growing use of social media, emoji usage has increased drastically. Various techniques exist for automating emoji prediction using contextual information, temporal information, and user-based features. However, the problem of personalised and dynamic emoji recommendation persists. This paper proposes personalised emoji recommendation using time and location parameters. It presents a new annotated conversational dataset and investigates the impact of time and location on emoji prediction. The methodology comprises a hybrid model that combines neural networks with score-based metrics: semantic and cosine similarity. Our approach differs from existing studies and improves emoji prediction accuracy to up to 73.32% using BERT.
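A toy sketch of the cosine-similarity scoring component is shown below; it assumes message and emoji embeddings (e.g., from BERT) are already available and omits the neural and time/location components of the full model.

```python
# Illustrative cosine-similarity ranking of emoji candidates against a message
# embedding; embeddings are assumed to be precomputed (e.g., with BERT).
import numpy as np

def rank_emojis(message_vec: np.ndarray, emoji_vecs: dict, top_k: int = 3):
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    scores = {emoji: cosine(message_vec, vec) for emoji, vec in emoji_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Usage with dummy vectors in place of real sentence embeddings:
rng = np.random.default_rng(0)
candidates = {e: rng.normal(size=16) for e in ["😀", "🎉", "☕", "🌧️"]}
print(rank_emojis(rng.normal(size=16), candidates, top_k=2))
```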
This paper presents a device that implements a voice-command system as an intelligent personal assistant. The services provided by the device depend on the user's voice commands and on its ability to access information from a variety of online sources, such as checking the weather, telling the time, or playing music through online applications. The voice-driven device uses a Raspberry Pi as its main hardware. A speech-to-text engine converts the voice command into plain text, which is then processed using natural language processing (NLP) to interpret the intended meaning of the user's command. Finally, text-to-speech conversion is used to deliver the response as speech. The device can provide a platform for visually impaired users to perform day-to-day tasks more easily, such as listening to music, checking weather conditions, checking the time, or doing simple mathematical calculations. Experiments and results are documented.
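For illustration, a minimal version of the described speech-to-text, query-processing, and text-to-speech loop could look like the sketch below, assuming the off-the-shelf speech_recognition and pyttsx3 libraries rather than the original implementation.

```python
# Minimal sketch of the voice-assistant loop: listen -> transcribe -> interpret -> speak.
# Library choices and the rule-based intent step are illustrative assumptions.
import datetime
import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
tts = pyttsx3.init()

with sr.Microphone() as source:                       # capture one voice command
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)

command = recognizer.recognize_google(audio).lower()  # speech-to-text

# Tiny rule-based "query processing" step; a real system would use NLP here.
if "time" in command:
    reply = datetime.datetime.now().strftime("The time is %H:%M")
else:
    reply = "Sorry, I did not understand that."

tts.say(reply)                                        # text-to-speech output
tts.runAndWait()
```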