Dr. Ila Gokarn is an experienced technology leader and researcher with a strong foundation in emerging technologies such as embodied AI, machine visual perception, and edge-computing-based intelligent systems. She has a proven track record of leading AI research, as well as technical pre-sales and network automation projects across Asia Pacific in the IoT, financial services, defense, smart city, and telecom domains. She is skilled in cross-functional collaboration and in delivering impactful real-world AI solutions.
Currently, Dr. Gokarn is a Postdoctoral Associate working at the intersection of edge computing, spatial awareness, and generative AI for next-generation immersive workplaces. She is advised by Prof Sanjay Sarma from MIT and Prof Archan Misra from SMU. Dr. Gokarn holds a PhD in Computer Science from Singapore Management University.
[Resume] (As of January 2025) [LinkedIn] [Google Scholar]
[July 2025] Our ongoing work on Embodied AI in the field of task replanning and dynamic scene adaptation was presented at the MIT-Singapore Symposium on Embodied and Scalable AI at CREATE, Singapore.
[December 2024] Our paper titled "RA-MOSAIC: Resource Adaptive Edge AI Optimization over Spatially Multiplexed Video Streams" was accepted at ACM Transactions on Multimedia Computing, Communications and Applications (TOMM).
[May 2024] I am now a Postdoctoral Associate at Mens, Manus and Machina (M3S), an interdisciplinary research group at the Singapore-MIT Alliance for Research and Technology (SMART) Centre. Find me at ila.gokarn@smart.mit.edu.
[May 2024] 🎓 I defended my PhD Thesis - "Enabling criticality-aware optimized machine perception at the edge" - Access it here: https://ink.library.smu.edu.sg/etd_coll/609/ 🎓
[April 2024] I'm thrilled to share that I have been selected as a MobiSys 2024 Rising Star 🌟
[April 2024] Our poster "Profiling Event Vision Processing on Edge Devices" has been accepted for publication at MobiSys 2024.
[March 2024] Our paper "JIGSAW: Edge-based Streaming Perception over Spatially Overlapped Multi-Camera Deployments" has been selected for publication at ICME 2024.
[January 2024] Our paper "Algorithms for Canvas-based Attention Scheduling with Resizing" has been accepted for publication at RTAS 2024.
[January 2024] Our demo titled "Demonstrating Canvas-based Processing of Multiple Camera Streams at the Edge" won the "Best Demo Award" at COMSNETS 24!
[December 2023] Our demo paper titled "Demonstrating Canvas-based Processing of Multiple Camera Streams at the Edge" has been accepted for publication at COMSNETS 24.
[August 2023] Our chapter, "Lightweight Collaborative Perception at the Edge," is now published by Springer in the book "Artificial Intelligence for Edge Computing".
[June 2023] I interned with the Pervasive Systems Research Group at Nokia Bell Labs for the summer of 2023.
[May 2023] Our work titled "Underprovisioned GPUs: On Sufficient Capacity for Real-Time Mission-Critical Perception" has been accepted for publication at ICCCN 2023.
[March 2023] Our work titled "MOSAIC: Spatially-Multiplexed Edge AI Optimization over Multiple Concurrent Video Sensing Streams" has been accepted for publication at ACM Multimedia Systems (MMSys) 2023.
[July 2021] We are presenting our work "VibranSee: Enabling Simultaneous Visible Light Communication and Sensing" at SECON 2021.
[June 2021] Awarded the N2Women Young Researcher Fellowship for SECON 2021.
[March 2021] We are presenting a short paper on Adaptive & Simultaneous Visible Light Sensing and Communication at PerCom 2021.
[January 2021] Awarded the "Best Research Demo Award" at COMSNETS 2021 for our work "Demonstrating Simultaneous Visible Light Sensing and Communication".
Continuous tracking of eye movement dynamics plays a significant role in developing a broad spectrum of human-centered applications, such as cognitive skills modeling, biometric user authentication, and foveated rendering. Recently, neuromorphic cameras have garnered significant interest in the eye-tracking research community, owing to their sub-microsecond latency in capturing intensity changes resulting from eye movements. Nevertheless, existing approaches for event-based eye tracking suffer from several limitations: dependence on RGB frames, label sparsity, and training on datasets collected in controlled lab environments that do not adequately reflect real-world scenarios. To address these limitations, in this paper, we propose a dynamic graph-based approach that uses the event stream for high-fidelity tracking of pupillary movement. We first present EyeGraph, a large-scale, multi-modal near-eye tracking dataset collected from 40 participants using a wearable event camera attached to a head-mounted device; the dataset was curated to mimic in-the-wild settings, with variations in user movement and ambient lighting conditions. Subsequently, to address the issue of label sparsity, we propose an unsupervised topology-aware spatio-temporal graph clustering approach as a benchmark. We show that our unsupervised approach achieves performance comparable to more onerous supervised approaches while consistently outperforming conventional clustering-based unsupervised baselines.
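For readers curious about the mechanics, here is a minimal Python sketch of unsupervised spatio-temporal clustering over raw events. It is an illustrative simplification, not the EyeGraph implementation: it builds a k-nearest-neighbour graph over (x, y, t) events and applies generic modularity-based community detection from NetworkX; the neighbourhood size, the time scaling, and the cluster_events helper are assumptions for illustration.

```python
# Illustrative sketch only (not the EyeGraph implementation): cluster raw
# events (x, y, t) by building a spatio-temporal k-NN graph and running a
# generic modularity-based community detection. Thresholds, the time scaling,
# and the helper name `cluster_events` are assumptions for illustration.
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree
from networkx.algorithms.community import greedy_modularity_communities

def cluster_events(events, k=8, time_scale=1e-3):
    """events: (N, 3) array of (x, y, t_us); returns a list of event-index clusters."""
    pts = events.astype(float)
    pts[:, 2] = pts[:, 2] * time_scale        # bring time into a pixel-comparable range
    tree = cKDTree(pts)
    dists, nbrs = tree.query(pts, k=k + 1)    # the first neighbour is the point itself

    g = nx.Graph()
    g.add_nodes_from(range(len(pts)))
    for i in range(len(pts)):
        for d, j in zip(dists[i, 1:], nbrs[i, 1:]):
            g.add_edge(i, int(j), weight=1.0 / (1.0 + d))  # nearby events couple strongly

    communities = greedy_modularity_communities(g, weight="weight")
    return [sorted(c) for c in communities]

# Synthetic example: x, y in pixels of a 346x260 sensor, t in microseconds.
events = np.column_stack([np.random.randint(0, 346, 500),
                          np.random.randint(0, 260, 500),
                          np.sort(np.random.randint(0, 10_000, 500))])
clusters = cluster_events(events)
print(f"{len(clusters)} clusters; largest has {max(len(c) for c in clusters)} events")
```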
Eye-tracking technology has gained significant attention in recent years due to its wide range of applications in human-computer interaction, virtual and augmented reality, and wearable health. Traditional RGB camera-based eye-tracking systems often struggle with poor temporal resolution and computational constraints, limiting their effectiveness in capturing rapid eye movements. To address these limitations, we propose EyeTrAES, a novel approach that uses neuromorphic event cameras for high-fidelity tracking of natural pupillary movement exhibiting significant kinematic variance. One of EyeTrAES's highlights is a novel adaptive windowing/slicing algorithm that ensures just the right amount of descriptive asynchronous event data accumulates within an event frame, across a wide range of eye movement patterns. EyeTrAES then applies lightweight image processing functions over accumulated event frames from just a single eye to perform pupil segmentation and tracking (as opposed to gaze-based techniques that require simultaneous tracking of both eyes). We show that these two techniques boost pupil tracking fidelity by more than 6%, achieving an IoU of approximately 92%, while incurring at least 3x lower latency than competing purely event-based eye-tracking alternatives. We additionally demonstrate that the microscopic pupillary motion captured by EyeTrAES exhibits distinctive variations across individuals and can thus serve as a biometric fingerprint. For robust user authentication, we train a lightweight per-user Random Forest classifier using a novel feature vector of short-term pupillary kinematics, comprising a sliding window of pupil (location, velocity, acceleration) triples. Experimental studies with two different datasets (capturing eye movement across a range of environmental contexts) demonstrate that the EyeTrAES-based authentication technique can simultaneously achieve high authentication accuracy (approximately 0.82) and low processing latency (approximately 12 ms), significantly outperforming multiple state-of-the-art baselines.
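As a rough illustration of the adaptive windowing/slicing idea (not EyeTrAES's actual algorithm), the following Python sketch closes an event window once enough per-pixel change has accumulated, so fast saccades yield short windows and slow fixations long ones; the adaptive_slices helper, the activity threshold, and the sensor resolution are assumptions.

```python
# Illustrative sketch of adaptive event slicing (not EyeTrAES's exact algorithm):
# a window closes once enough per-pixel change has accumulated, so rapid eye
# movement yields short windows and fixation yields long ones. The sensor size,
# activity threshold, and helper name `adaptive_slices` are assumptions.
import numpy as np

def adaptive_slices(events, sensor_wh=(346, 260), activity_thresh=1500):
    """events: iterable of (x, y, t, polarity); yields (event_frame, t_start, t_end)."""
    w, h = sensor_wh
    frame = np.zeros((h, w), dtype=np.int32)
    activity, t_start, t = 0, None, None
    for x, y, t, _pol in events:
        if t_start is None:
            t_start = t
        if frame[y, x] == 0:              # count newly touched pixels as "activity"
            activity += 1
        frame[y, x] += 1
        if activity >= activity_thresh:   # enough descriptive change: emit this slice
            yield frame, t_start, t
            frame = np.zeros((h, w), dtype=np.int32)
            activity, t_start = 0, None
    if t_start is not None:
        yield frame, t_start, t           # flush the trailing partial window
```

Each emitted frame could then feed lightweight image processing (e.g., thresholding followed by ellipse fitting) for single-eye pupil segmentation and tracking, along the lines the abstract describes.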
As RGB camera resolutions and frame rates improve, their increased energy requirements make it challenging to deploy fast, efficient, and low-power applications on edge devices. Newer classes of sensors, such as the biologically inspired neuromorphic event-based camera, capture only per-pixel changes in light intensity, achieving operational superiority over traditional RGB camera streams in sensing latency (O(μs)), energy consumption (O(mW)), dynamic range (140 dB), and task accuracy (e.g., in object tracking). However, highly dynamic scenes can yield an event rate of up to 12 million events per second, the processing of which could overwhelm resource-constrained edge devices. Efficient processing of high volumes of event data is crucial for ultra-fast machine vision on edge devices. In this poster, we present a profiler that processes simulated event streams from RGB videos into six variants of framed representations for DNN inference on an NVIDIA Jetson Orin AGX, a representative edge device. The profiler evaluates the trade-offs between the volume of events evaluated, the quality of the processed event representation, and processing time, presenting the design choices available to an edge-scale event-camera-based application observing the same RGB scenes. We believe this analysis opens up the exploration of novel system designs for real-time, low-power event vision on edge devices.
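The sketch below gives a flavour of such profiling (assuming PyTorch and torchvision are available; the three representations shown stand in for the six variants evaluated in the poster, and all helper names are invented for illustration): each representation is built from the same synthetic event batch and timed end-to-end through a small DNN.

```python
# Illustrative profiling sketch (assumes PyTorch and torchvision; the three
# representations here stand in for the six variants in the poster, and all
# helper names are invented for illustration).
import time
import numpy as np
import torch
from torchvision.models import mobilenet_v3_small

def count_frame(ev, w, h):                 # per-pixel event counts, normalized
    f = np.zeros((h, w), np.float32)
    np.add.at(f, (ev[:, 1], ev[:, 0]), 1.0)
    return f / max(float(f.max()), 1.0)

def binary_frame(ev, w, h):                # did any event touch this pixel?
    f = np.zeros((h, w), np.float32)
    f[ev[:, 1], ev[:, 0]] = 1.0
    return f

def time_surface(ev, w, h, tau=3e4):       # exponentially decayed recency per pixel
    f = np.zeros((h, w), np.float32)
    f[ev[:, 1], ev[:, 0]] = np.exp(-(ev[-1, 2] - ev[:, 2]) / tau)
    return f

device = "cuda" if torch.cuda.is_available() else "cpu"
model = mobilenet_v3_small(weights=None).eval().to(device)

# Synthetic 33 ms batch of 50k events: columns are x, y, t (microseconds).
ev = np.column_stack([np.random.randint(0, 346, 50_000),
                      np.random.randint(0, 260, 50_000),
                      np.sort(np.random.randint(0, 33_000, 50_000))])

for name, fn in [("count", count_frame), ("binary", binary_frame), ("time_surface", time_surface)]:
    t0 = time.perf_counter()
    x = torch.from_numpy(fn(ev, 346, 260)).repeat(3, 1, 1).unsqueeze(0).to(device)
    with torch.no_grad():
        model(x)                           # representation build + inference, end to end
    print(f"{name}: {1e3 * (time.perf_counter() - t0):.1f} ms for {len(ev)} events")
```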
Sustaining real-time, high-fidelity AI-based vision perception on edge devices is challenging due to both the high computational overhead of increasingly "deeper" Deep Neural Networks (DNNs) and the increasing resolution/quality of camera sensors. Such high-throughput vision perception is even more challenging in multi-tenancy systems, where video streams from multiple such high-quality cameras need to share the same GPU resource on a single edge device. Criticality-aware canvas-based processing is a promising paradigm that decomposes multiple concurrent video streams into Regions of Interest (RoIs) and spatially channels the limited computational resources to selected RoIs at higher "resolution", thereby moderating the trade-off between computational load, task fidelity, and processing throughput. RA-MOSAIC (Resource Adaptive MOSAIC) employs such canvas-based processing, while further tuning the incoming video streams and available resources on demand so that the system can adapt to dynamic changes in workload (often arising from variations in the number or size of relevant objects observed by individual cameras). RA-MOSAIC utilizes two distinct and synergistic concepts. First, at the camera sensor, a lightweight Bandwidth Aware Camera Transmission (BACT) method applies differential down-sampling to create mixed-resolution frames that preferentially preserve resolution for critical RoIs before transmission to the edge node. Second, at the edge, the BACT video streams received from multiple cameras are decomposed into multi-scale RoI tiles and spatially packed, using a novel workload-adaptive bin-packing strategy, into a single 'canvas frame'. Notably, the canvas frame itself is dynamically sized so that the edge device can opportunistically provide higher processing throughput for selected high-priority tiles during periods of lower aggregate workload. To demonstrate RA-MOSAIC's gains in processing throughput and perception fidelity, we evaluate it on a single NVIDIA Jetson TX2 edge device for two benchmark tasks: Drone-based Pedestrian Detection and Automatic License Plate Recognition. In a bandwidth-constrained wireless environment, RA-MOSAIC employs a batch size of 1 to pack up to 6 concurrent video streams on a dynamically sized canvas frame and provides (i) a 14.3% gain in object detection accuracy and (ii) an 11.11% gain in throughput on average (up to 20 FPS per camera, cumulatively 120 FPS), over our previous work MOSAIC, a naïve canvas-based baseline. Compared to prior state-of-the-art baselines, such as batched inference over extracted RoIs, RA-MOSAIC provides a very significant 29.6% gain in accuracy at comparable throughput. Similarly, RA-MOSAIC dramatically outperforms bandwidth-adaptive baselines such as FCFS (≤1% accuracy gain but a 5.6x, or 566.67%, throughput gain) and uniform grid packing (17% accuracy improvement and 5% throughput gain).
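As a toy illustration of the BACT idea (not the paper's implementation), the following Python/OpenCV sketch down-samples the background of a frame while pasting critical RoIs back at native resolution before the frame would be encoded and transmitted; the 4x background factor and the mixed_resolution_frame helper are assumptions.

```python
# Toy illustration of BACT-style differential down-sampling (not the paper's
# implementation): the background is down-sampled and restored coarsely while
# critical RoIs keep native resolution. The 4x factor and the helper name
# `mixed_resolution_frame` are assumptions.
import cv2
import numpy as np

def mixed_resolution_frame(frame, rois, bg_factor=4):
    """frame: HxWx3 uint8; rois: list of (x, y, w, h) boxes kept at full resolution."""
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (w // bg_factor, h // bg_factor), interpolation=cv2.INTER_AREA)
    mixed = cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)   # coarse background
    for (x, y, bw, bh) in rois:
        mixed[y:y + bh, x:x + bw] = frame[y:y + bh, x:x + bw]           # full-res critical RoI
    return mixed

frame = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)
out = mixed_resolution_frame(frame, rois=[(100, 200, 320, 240), (900, 500, 256, 256)])
print(out.shape)   # same geometry as the input, but mostly low-fidelity content
```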
We present JIGSAW, a novel system that performs edge-based streaming perception over multiple video streams while additionally factoring in the redundancy offered by the spatial overlap often exhibited in urban multi-camera deployments. To ensure high streaming throughput, JIGSAW extracts and spatially multiplexes multiple regions of interest from different camera frames into a smaller canvas frame. Moreover, to ensure that perception stays abreast of evolving object kinematics, JIGSAW includes a utility-based weighted scheduler that preferentially prioritizes, and even skips, object-specific tiles extracted from an incoming stream of camera frames. Using the CityFlowV2 traffic surveillance dataset, we show that JIGSAW can simultaneously process 25 cameras on a single Jetson TX2 with a 66.6% increase in accuracy and a simultaneous 18x (1800%) gain in cumulative throughput (475 FPS), far outperforming competitive baselines.
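The following Python sketch conveys the spirit of a utility-weighted tile scheduler (it is not JIGSAW's actual policy): each RoI tile receives a utility that grows with object speed and staleness, overly stale tiles are skipped, and the highest-utility tiles are selected for the next canvas; the weights, the drop threshold, and the schedule helper are illustrative assumptions.

```python
# Illustrative utility-weighted tile scheduler (not JIGSAW's actual policy):
# utility grows with object speed and staleness, overly stale tiles are skipped,
# and the top tiles fill the next canvas. Weights, the drop threshold, and the
# helper names are assumptions.
import heapq
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class Tile:
    neg_utility: float
    camera_id: int = field(compare=False)
    object_id: int = field(compare=False)
    arrival: float = field(compare=False)

def utility(speed_px_s, staleness_s, w_speed=1.0, w_stale=2.0):
    return w_speed * speed_px_s + w_stale * staleness_s

def schedule(tiles, canvas_capacity=16, drop_after_s=0.5):
    """tiles: (camera_id, object_id, speed_px_s, arrival_time); returns the chosen Tiles."""
    now = time.monotonic()
    heap = []
    for cam, obj, speed, arrival in tiles:
        staleness = now - arrival
        if staleness > drop_after_s:
            continue                        # skip: the underlying frame is already outdated
        heapq.heappush(heap, Tile(-utility(speed, staleness), cam, obj, arrival))
    return [heapq.heappop(heap) for _ in range(min(canvas_capacity, len(heap)))]

now = time.monotonic()
batch = schedule([(0, 3, 120.0, now - 0.05), (1, 7, 10.0, now - 0.6), (2, 1, 45.0, now - 0.1)])
print([(t.camera_id, t.object_id) for t in batch])   # the stale camera-1 tile is skipped
```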
Sustaining high fidelity and high throughput of perception tasks over vision sensor streams on edge devices remains a formidable challenge, especially given the continuing increase in image sizes (e.g., generated by 4K cameras) and the complexity of DNN models. One promising approach involves criticality-aware processing, where computation is directed selectively to "critical" portions of individual image frames. We introduce MOSAIC, a novel system for such criticality-aware concurrent processing of multiple vision sensing streams that provides a multiplicative increase in achievable throughput with negligible loss in perception fidelity. MOSAIC determines critical regions from images received from multiple vision sensors and spatially bin-packs these regions, using a novel multi-scale Mosaic Across Scales (MoS) tiling strategy, into a single 'canvas frame', sized such that the edge device can retain sufficiently high processing throughput. Experimental studies using benchmark datasets for two tasks, Automatic License Plate Recognition and Drone-based Pedestrian Detection, show that MOSAIC, executing on a Jetson TX2 edge device, provides dramatic gains in the throughput vs. fidelity tradeoff. For instance, for drone-based pedestrian detection with a batch size of 4, MOSAIC can pack input frames from 6 cameras to achieve 4.75x (475%) higher throughput (23 FPS per camera, cumulatively 138 FPS) with ≤1% accuracy loss, compared to a First Come First Serve (FCFS) processing paradigm.
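For intuition, here is a minimal shelf-packing sketch of the canvas idea in Python (illustrative only; MOSAIC's MoS strategy packs tiles across multiple scales, which is not reproduced here): tiles are placed left-to-right on shelves of a fixed-size canvas, tallest first.

```python
# Minimal shelf-packing sketch of the canvas idea (illustrative only; MOSAIC's
# MoS strategy packs tiles across multiple scales, which is not reproduced here).
# Tiles are (width, height, tag) boxes placed left-to-right on shelves.
def pack_canvas(tiles, canvas_w=1280, canvas_h=1280):
    """Returns {tag: (x, y)} placements for the tiles that fit on the canvas."""
    placements = {}
    x, y, shelf_h = 0, 0, 0
    for w, h, tag in sorted(tiles, key=lambda t: -t[1]):   # tallest first
        if x + w > canvas_w:                               # start a new shelf
            x, y, shelf_h = 0, y + shelf_h, 0
        if y + h > canvas_h:
            break                                          # canvas full; remaining tiles wait
        placements[tag] = (x, y)
        x, shelf_h = x + w, max(shelf_h, h)
    return placements

tiles = [(320, 240, "cam0/obj3"), (160, 160, "cam1/obj7"), (640, 360, "cam2/obj1")]
print(pack_canvas(tiles))   # e.g. {'cam2/obj1': (0, 0), 'cam0/obj3': (640, 0), 'cam1/obj7': (960, 0)}
```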
Visible Light Communication (VLC) goodput (application-level throughput) and Visible Light Sensing (VLS) accuracy or coverage exhibit a natural trade-off governed by the duty cycle of the light source. Intuitively, VLS ideally assumes the use of a strobing source with an infinitesimally small duty cycle, whereas VLC goodput is directly proportional to the active period of each individual pulse and is maximized when the duty cycle is 100%. We used this insight to design two mechanisms that moderate the tradeoff: a time-multiplexed single-strobe architecture and a harmonic multi-strobe architecture. Building on these, we designed VibranSee, an adaptive mechanism that further fine-tunes the tradeoff between VLC and VLS, and set up experiments on inexpensive commodity pervasive devices (Arduino and Raspberry Pi).
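A toy numerical model of this duty-cycle tradeoff (assumed functional forms, not VibranSee's measured behaviour) makes the tension concrete: goodput scales linearly with the active fraction of each pulse, while a simple sensing-quality proxy favours short strobes.

```python
# Toy model of the duty-cycle tradeoff (assumed functional forms, not
# VibranSee's measured curves): VLC goodput scales with the active fraction of
# each pulse, while a simple VLS-quality proxy favours short strobes.
def vlc_goodput(duty_cycle, peak_rate_bps=10_000):
    return peak_rate_bps * duty_cycle            # proportional to the active period

def vls_quality(duty_cycle):
    return max(0.0, 1.0 - duty_cycle)            # best as the strobe width shrinks

for d in (0.05, 0.25, 0.5, 0.75, 1.0):
    print(f"duty={d:.2f}  goodput={vlc_goodput(d):7.0f} bps  sensing quality={vls_quality(d):.2f}")
```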
Gokarn, I., Hu, Y., Abdelzaher, T., & Misra, A. (2025). RA-MOSAIC: Resource adaptive edge AI optimization over spatially multiplexed video streams. ACM Transactions on Multimedia Computing, Communications, and Applications.
Bandara, N., Kandappu, T., Sen, A., Gokarn, I., & Misra, A. (2024). EyeGraph: Modularity-aware spatio-temporal graph clustering for continuous event-based eye tracking. Advances in Neural Information Processing Systems, 37.
Sen, A., Bandara, N. S., Gokarn, I., Kandappu, T., & Misra, A. (2024). EyeTrAES: Fine-grained, low-latency eye tracking via adaptive event slicing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies.
Gokarn, I., Hu, Y., Abdelzaher, T., & Misra, A. (2024). JIGSAW: Edge-based streaming perception over spatially overlapped multi-camera deployments. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) (pp. 1–6).
Gokarn, I. (2024). Criticality aware canvas-based visual perception at the edge. In Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications, and Services (MobiSys).
Gokarn, I., & Misra, A. (2024). Poster: Profiling event vision processing on edge devices. In Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications, and Services (MobiSys) (pp. 672–673).
Hu, Y., Gokarn, I., Liu, S., Misra, A., & Abdelzaher, T. (2024). Algorithms for canvas-based attention scheduling with resizing. In Proceedings of the 2024 IEEE 30th Real-Time and Embedded Technology and Applications Symposium (RTAS) (pp. 348–359).
Hu, Y., Gokarn, I., Liu, S., Misra, A., & Abdelzaher, T. (2023). Work-in-progress: Algorithms for canvas-based attention scheduling with resizing. In Proceedings of the 44th IEEE Real-Time Systems Symposium (RTSS) (pp. 435–438).
Hu, Y., Gokarn, I., Liu, S., Misra, A., & Abdelzaher, T. (2023). Underprovisioned GPUs: On sufficient capacity for real-time mission-critical perception. In Proceedings of the 32nd International Conference on Computer Communications and Networks (ICCCN).
Gokarn, I., Sabella, H., Hu, Y., Abdelzaher, T., & Misra, A. (2023). MOSAIC: Spatially-multiplexed edge AI optimization over multiple concurrent video sensing streams. In Proceedings of the 14th ACM Multimedia Systems Conference (MMSys) (pp. 278–288).
Gokarn, I., & Misra, A. (2021). VibranSee: Enabling simultaneous visible light communication and sensing. In Proceedings of the 18th IEEE International Conference on Sensing, Communication, and Networking (SECON).
Gokarn, I., & Misra, A. (2021). Adaptive & simultaneous pervasive visible light communication and sensing. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops) (pp. 344–347).
Gokarn, I., Swapna, G., & Shankararaman, V. (2015). Analyzing educational comments for topics and sentiments: A text analytics approach. In Proceedings of the 2015 IEEE Frontiers in Education Conference (FIE) (pp. 1–9).
Gokarn, I., & Phua, C. (2015). Understanding characteristics of insider threats by using feature extraction. SAS Ambassador Award 2015 Proceedings.
Gokarn, I., Jayarajah, K., & Misra, A. (2023). Lightweight collaborative perception at the edge. In Artificial Intelligence for Edge Computing (pp. 265–296). Springer, Cham.
Gokarn, I., Sabella, H., Hu, Y., Abdelzaher, T., & Misra, A. (2024). Demonstrating canvas-based processing of multiple camera streams at the edge. In Proceedings of the 16th International Conference on COMmunication Systems & NETworkS (COMSNETS) (pp. 297–299).
Gokarn, I., & Misra, A. (2021). Demonstrating high-performance simultaneous visible light communication and sensing.
June 2024 - MobiSys 2024 Rising Star
January 2024 - Best Research Demo at COMSNETS 2024 for Demonstrating Canvas-based Processing of Multiple Camera Streams at the Edge
June 2021 - N2Women Young Researcher Fellowship for SECON 2021
January 2021 - Best Research Demo at COMSNETS 2021 for Demonstrating Simultaneous Visible Light Sensing and Communication
August 2019 - PhD Full Scholarship, Singapore Management University
Singapore-MIT Alliance for Research and Technology (SMART)
Massachusetts Institute of Technology
Singapore Management University
University of Illinois Urbana-Champaign
Nokia Bell Labs
Living Analytics Research Center
Arista Networks
Cisco Systems
I am a trained Bharatanatyam dancer and am now also pursuing the Odissi form with Ethos Odissi. I am actively involved in the fine arts community in Singapore, and I mentor young girls interested in STEM research and industry.
Reach me at: ila(dot)gokarn(AT)smart(dot)mit(dot)edu or on LinkedIn