Identifying when elderly individuals struggle to rise from a chair and offering aid based on their posture is highly beneficial.
We integrated human pose estimation into a smart robot to assist seniors in their daily tasks.
From real-time video, we analyzed the pixels in each frame to understand the shapes, positions, and key points of the human body.
Human Pose Estimation is a computer vision task that involves estimating the 3D positions and orientations of body joints and bones from 2D images or videos.
The goal is to reconstruct the 3D pose of a person. Accuracy is measured by the Mean Per Joint Position Error (MPJPE), which is 72.4 mm for the HybrIK (Hybrid Inverse Kinematics) model.
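MPJPE is simply the average Euclidean distance between predicted and ground-truth joint positions, typically reported in millimeters. A minimal pure-Python sketch (the toy joint coordinates are illustrative):

```python
import math

def mpjpe(pred, gt):
    """Mean Per Joint Position Error: the average Euclidean distance
    between predicted and ground-truth joint coordinates."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

# Toy example with two 3D joints (units are whatever the dataset uses, e.g. mm).
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt   = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # (0 + 5) / 2 = 2.5
```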
The research project, 'Supporting Aging-In-Place Through Multimodal Sensing And Reasoning', is a real-life, industry collaborative project to support the elderly population through a smart robot environment.
One of the modules is Embedded Systems Deployment where we use Jetson Orin to put human pose estimation models into action.
We improved inference performance by 2x with Intel's OpenVINO and 10x on NVIDIA's Jetson, using hardware-acceleration methods such as model compression and post-training quantization to achieve low-latency outcomes.
The pipeline starts by capturing data from the camera, processes the frames at 30 FPS, and ends with an accurate, fast model deployed to the Jetson Orin.
To accelerate local deep neural network inference, the first step was to use Intel's Neural Compute Stick 2 (NCS2) and the OpenVINO toolkit (part of Intel's oneAPI).
Deploying high-performance Deep Learning Inference was made possible in 3 simple steps:
Convert the PyTorch/TensorFlow model to optimized Intermediate Representation (IR)
Load this representation model instead of the original model
Replace inference calls with the OpenVINO inference engine
Converting your model to an optimized OpenVINO model is straightforward.
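The three steps above can be sketched with OpenVINO's Python API. This is a minimal sketch, not the project's exact code: the ONNX file name and input tensor are placeholders, and targeting the NCS2 (the "MYRIAD" device) requires OpenVINO 2022.3 or earlier, whereas the API below is the current one.

```python
import openvino as ov

# 1. Convert the model (here an ONNX export) to the optimized
#    Intermediate Representation (IR).
model = ov.convert_model("pose_model.onnx")
ov.save_model(model, "pose_model.xml")  # writes the .xml/.bin IR pair

# 2. Load the IR model instead of the original PyTorch/TensorFlow model.
core = ov.Core()
ir_model = core.read_model("pose_model.xml")
compiled = core.compile_model(ir_model, "CPU")  # "MYRIAD" targeted the NCS2

# 3. Replace inference calls with the OpenVINO runtime.
#    `input_frame` stands in for a preprocessed camera frame (NCHW array).
result = compiled([input_frame])[compiled.output(0)]
```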
As the throughput from Intel was not high enough for the autonomous robot, we moved on to NVIDIA's TensorRT platform, which includes a deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference.
TensorRT, built on the NVIDIA CUDA parallel programming model, enables you to optimize inference using techniques such as quantization, layer and tensor fusion, kernel tuning, and others on NVIDIA GPUs.
We used floating-point 16 (FP16) post-training quantization to deploy the human pose models.
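Building an FP16 engine from an ONNX export can be sketched with the TensorRT 8.x Python API. This is an illustrative sketch, not the project's exact code; the file names are placeholders, and it must run on a machine with TensorRT and an NVIDIA GPU (e.g. the Jetson Orin).

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the ONNX export of the pose model into a TensorRT network.
with open("pose_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

# Enable FP16 post-training quantization: TensorRT picks FP16 kernels
# wherever the hardware supports them.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Build and save the serialized engine for deployment.
engine_bytes = builder.build_serialized_network(network, config)
with open("pose_model.engine", "wb") as f:
    f.write(engine_bytes)
```

The same result can be obtained from the command line with `trtexec --onnx=pose_model.onnx --fp16 --saveEngine=pose_model.engine`.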
Some of the datasets we worked with:
Microsoft Common Objects in Context (MS COCO) (200K poses)
MPI-INF-3DHP (1.3 Million frames from 14 cameras)
Human3.6M (3.6 Million poses!)
NTU RGB+D (60 action classes; 120 in NTU RGB+D 120)
This project helped me build an understanding of deploying optimized computer vision models on edge devices, leveraging network quantization for resource-constrained platforms.