A Tactile Sensor Prediction Model
This project focuses on reconstructing 3D surface geometry and contact regions from tactile sensor images. Using GelSight/DIGIT tactile data, this system learns to predict contact masks and depth (heightmaps) from raw tactile images using deep neural networks.
The goal is to build an inverse sensor model that converts tactile visual input into meaningful physical interaction data, enabling more reliable robotic grasping and object manipulation.
Python
PyTorch
OpenCV
NumPy
Matplotlib
Jupyter Notebook
Deep Learning
Robotics Perception
Tactile Sensing
Tactile sensors provide rich physical feedback about surface geometry, contact pressure, and material interaction. While many recent approaches focus on modeling tactile deformation, this project emphasizes predicting contact regions and depth directly from sensor images.
This work builds on several key ideas:
GelSight tactile sensing — a vision‑based sensor that uses a soft gel and internal lighting to capture extremely detailed surface impressions, almost like taking a high‑resolution photograph of touch.
DIGIT sensor design — a compact, low‑cost tactile sensor developed for robotics that captures touch information as images, making it easy to integrate with machine‑learning models.
Multi‑scale depth‑prediction networks — neural network architectures that learn depth at different spatial scales, helping the model understand both fine textures and larger surface shapes.
Together, these concepts motivate a system that can interpret tactile images as meaningful 3D contact information.
Figure 1. A vision‑based tactile sensor captures fine surface impressions as high‑resolution images.
Figure 2. Example contacts illustrating the resolution tactile sensors achieve on various objects.
This project aims to translate raw tactile images into meaningful 3D contact information by training a deep learning system that can understand and reconstruct touch.
This model is designed to:
Predict binary contact regions, identifying where the sensor physically touches an object
Predict dense depth maps that capture local surface geometry
Learn visual representations directly from tactile sensor images
The system uses a two‑stage network architecture:
Coarse Contact Network — estimates the initial contact mask
Fine Depth Network — predicts detailed depth, conditioned on the contact output
To train the model effectively, tactile images and their paired depth data need to be organized, cleaned, and converted into a format suitable for deep learning. A custom PyTorch Dataset was implemented to handle this process and ensure that each tactile image is correctly aligned with its corresponding depth sample.
Key Preprocessing Steps:
Sorting file indices to keep tactile images and depth maps aligned
Loading RGB tactile images from disk
Handling missing or placeholder depth values
Applying consistent transforms for training and testing
Using a PyTorch DataLoader for efficient batch loading
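The index-sorting step above can be sketched as follows; the filename pattern is hypothetical, but the idea is to sort by the embedded integer index so tactile images and depth maps stay paired:

```python
import re

def sorted_by_index(filenames):
    # Sort files like "tactile_10.png" numerically rather than lexically,
    # so tactile and depth files line up by the same integer index
    def index_of(name):
        match = re.search(r'(\d+)', name)
        return int(match.group(1)) if match else -1
    return sorted(filenames, key=index_of)

tactile_files = sorted_by_index(['tactile_10.png', 'tactile_2.png', 'tactile_1.png'])
# tactile_files == ['tactile_1.png', 'tactile_2.png', 'tactile_10.png']
```

A plain lexical sort would place `tactile_10.png` before `tactile_2.png`, silently misaligning the image/depth pairs.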
The custom dataset handles image loading, applies transformations, and returns each sample in a structured format that the model can use during training.
Python:
from torch.utils.data import Dataset
from PIL import Image

class TactileDataset(Dataset):
    def __getitem__(self, idx):
        # Load the tactile image for this index and apply the shared transform
        tactile_sample = Image.open(self.image_paths[idx]).convert('RGB')
        tactile_sample = self.transform(tactile_sample)
        return {'tactile': tactile_sample}
This dataset class is the foundation of the entire training pipeline. It ensures that every tactile image is:
Loaded consistently
Converted into a normalized tensor
Prepared with the same preprocessing steps the model expects
Without this stage, the network wouldn’t receive clean, aligned data, and the predictions for contact and depth would be unreliable.
The system uses a dual‑network design that separates coarse contact prediction from fine‑grained depth reconstruction. This structure allows the model to first understand where contact occurs, then refine that information into detailed surface geometry.
ContactNet (Coarse Network) — predicts a binary contact mask using stacked convolutional layers
TactileDepthNet (Fine Network) — predicts a dense depth map, conditioned on both tactile features and the contact mask
Design Highlights:
Multi‑layer CNN feature extraction
Stable weight initialization for reliable training
Bilinear up-sampling to recover full‑resolution outputs
Feature fusion between the coarse and fine networks
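A minimal sketch of this design is shown below. The layer counts and channel sizes are hypothetical, but it illustrates the key highlights: CNN feature extraction, channel-concatenation fusion of the contact mask into the depth network, and bilinear upsampling back to full resolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContactNet(nn.Module):
    """Coarse network: tactile image -> single-channel contact mask."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        # Sigmoid keeps the mask in [0, 1]
        return torch.sigmoid(self.head(self.features(x)))

class TactileDepthNet(nn.Module):
    """Fine network: tactile image + contact mask -> dense depth map."""
    def __init__(self):
        super().__init__()
        # 4 input channels: 3 RGB + 1 contact mask (fusion by concatenation)
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(32, 1, 1)

    def forward(self, tactile, contact):
        fused = torch.cat([tactile, contact], dim=1)
        feats = self.encoder(fused)
        # Bilinear upsampling recovers the full input resolution
        feats = F.interpolate(feats, size=tactile.shape[-2:],
                              mode='bilinear', align_corners=False)
        return self.head(feats)
```

Concatenating the contact mask as an extra input channel is one simple way to realize the coarse-to-fine conditioning; the real networks may fuse features at deeper layers instead.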
The two networks operate sequentially: the contact prediction guides the depth network, helping it focus on regions where meaningful tactile interaction occurs.
Python:
# Stage 1: the coarse network predicts the contact mask from the tactile image
contact_model_output = contact_model(tactile_input)
# Stage 2: the fine network predicts depth, conditioned on the contact mask
depth_output = tactile_depth_model(tactile_input, contact_model_output)
This forward‑pass structure is the core of the architecture. By conditioning depth prediction on the contact mask, the model learns to reconstruct geometry only where the sensor actually touched the object. This improves accuracy, reduces noise, and mirrors how tactile sensing works in real physical interactions.
Figure 3. Tactile input is processed through coarse and fine networks to generate contact and depth reconstructions.
Both networks were trained with Mean Squared Error (MSE) loss and optimized with the Adam optimizer. MSE encourages the model to produce smooth, accurate predictions for both contact masks and depth maps, while Adam provides stable and efficient parameter updates during training.
Adam adaptively adjusts how much each weight changes as the model learns. It combines momentum (to keep moving in a consistent direction) with per‑parameter adaptive learning rates (to choose smarter step sizes), allowing the networks to train faster and more reliably than with plain gradient descent.
Training Setup:
Separate loss tracking for contact and depth predictions
GPU acceleration when available
DataParallel for scalable multi‑GPU training
Reliable optimization is essential for learning from tactile data, which can be noisy and highly variable. Using MSE and Adam helps the model converge smoothly and produce consistent contact and depth reconstructions.
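A minimal training-step sketch under these choices follows; the two single-layer stand-in networks, the learning rate, and the variable names are hypothetical, but the structure (separate loss tracking, a shared Adam optimizer, MSE on both outputs) matches the setup described above:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two networks
contact_model = nn.Conv2d(3, 1, 1)   # tactile -> contact mask
depth_model = nn.Conv2d(4, 1, 1)     # tactile + mask -> depth map

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(
    list(contact_model.parameters()) + list(depth_model.parameters()), lr=1e-4
)

def train_step(tactile, contact_gt, depth_gt):
    optimizer.zero_grad()
    contact_pred = torch.sigmoid(contact_model(tactile))
    depth_pred = depth_model(torch.cat([tactile, contact_pred], dim=1))
    # Track the two losses separately, but optimize their sum jointly
    contact_loss = criterion(contact_pred, contact_gt)
    depth_loss = criterion(depth_pred, depth_gt)
    (contact_loss + depth_loss).backward()
    optimizer.step()
    return contact_loss.item(), depth_loss.item()
```

For the multi-GPU setup mentioned above, each model could additionally be wrapped in `nn.DataParallel` before training.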
Figure 4. The loss curves show how the model’s predictions improve over training iterations.
The contact model was trained to identify where the sensor makes physical contact with an object based solely on the tactile image input. During training, the model gradually learned to produce spatially consistent contact patterns that aligned with the underlying geometry.
Observations:
Training loss decreased steadily and stabilized after the early iterations
The model learned coherent, structured contact regions rather than noisy or scattered predictions
Learning stable contact patterns is essential for the second stage of the system. The depth network relies on accurate contact predictions to focus on meaningful regions of the tactile image, improving the quality of the final 3D reconstruction.
Figure 5. The training loss curves demonstrate how the contact model improves throughout multiple iterations of training.
The depth model used the predicted contact regions as guidance to generate refined heightmaps. By conditioning on contact, the network focused on meaningful areas of the tactile image, leading to smoother and more stable depth predictions.
Observations:
Training converged steadily over time
Output depth maps showed improved spatial smoothness
Conditioning on contact significantly improved depth stability
Depth reconstruction is sensitive to noise, especially in regions where the sensor barely touches the object. Using the contact mask as a prior helps the model avoid guessing in irrelevant areas, resulting in cleaner and more accurate heightmaps.
Figures 6 & 7. The training loss curves show how the depth model improves throughout training.
The trained models were evaluated on a set of validation tactile images to assess how well they generalized beyond the training data. For each sample, the system produced three outputs that together describe the full tactile interaction:
The raw tactile image
The predicted contact mask
The predicted depth map
Depth predictions were normalized and exported to make the heightmaps easier to interpret visually.
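The normalization step can be sketched as a simple min-max rescaling to [0, 1] (the exact export format used in the project may differ):

```python
import numpy as np

def normalize_depth(depth):
    # Rescale a raw depth map to [0, 1] for visualization/export;
    # guard against a constant map to avoid division by zero
    d_min, d_max = depth.min(), depth.max()
    if d_max - d_min < 1e-8:
        return np.zeros_like(depth)
    return (depth - d_min) / (d_max - d_min)
```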
During evaluation, the contact model and depth model were run sequentially, mirroring the structure used during training.
Python:
with torch.no_grad():  # gradients are not needed during evaluation
    contact_output = contact_model(tactile_input)
    depth_output = tactile_depth_model(tactile_input, contact_output)
Visualizing the outputs side‑by‑side makes it easy to see how the model interprets touch: where contact occurs, how that contact is shaped, and what 3D structure the sensor likely encountered. These evaluations help verify that the system produces stable, meaningful predictions across different objects and contact conditions.
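The side-by-side comparison can be sketched with Matplotlib as follows; the function name and panel titles are illustrative, not the project's actual plotting code:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so figures can be saved without a display
import matplotlib.pyplot as plt
import numpy as np

def show_prediction(tactile, contact, depth):
    # Plot the raw tactile image, predicted contact mask, and
    # predicted depth map side by side for qualitative inspection
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    panels = [(tactile, 'Tactile image'), (contact, 'Contact mask'), (depth, 'Depth map')]
    for ax, (img, title) in zip(axes, panels):
        # RGB images keep their colors; single-channel maps get a colormap
        ax.imshow(img, cmap=None if img.ndim == 3 else 'viridis')
        ax.set_title(title)
        ax.axis('off')
    return fig

# Demo with random arrays standing in for real model outputs
fig = show_prediction(np.random.rand(8, 8, 3), np.random.rand(8, 8), np.random.rand(8, 8))
```

The returned figure can then be saved with `fig.savefig(...)` or displayed in a notebook.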
A reusable prediction pipeline was implemented to load the trained model weights and generate contact and depth outputs for new tactile inputs. This allows the system to be applied to unseen data without retraining, making it easy to run evaluations or integrate the model into downstream applications.
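Loading the trained weights follows the standard PyTorch checkpoint pattern, sketched below with a single-layer stand-in model and a temporary checkpoint path (both hypothetical):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for one trained network
contact_model = nn.Conv2d(3, 1, 1)

# Save the trained weights to disk...
ckpt_path = os.path.join(tempfile.mkdtemp(), 'contact_model.pth')
torch.save(contact_model.state_dict(), ckpt_path)

# ...then restore them into a fresh instance and switch to inference mode
restored = nn.Conv2d(3, 1, 1)
restored.load_state_dict(torch.load(ckpt_path))
restored.eval()  # disables training-time behavior such as dropout
```

Saving only the `state_dict` (rather than the whole module) keeps checkpoints portable across code changes.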
The inference function mirrors the two‑stage architecture used during training: the contact model runs first, and its output guides the depth model.
Python:
def predict(tactile_image):
    # Run both stages without tracking gradients (inference only)
    with torch.no_grad():
        contact_output = contact_model(tactile_image)
        depth_output = tactile_depth_model(tactile_image, contact_output)
    return contact_output, depth_output
A clean inference pipeline ensures that the model can be deployed consistently and reliably. By encapsulating the full prediction process in a single function, the system can process new tactile images, generate contact masks, and reconstruct depth maps with minimal overhead.
This project was completed as part of a team research assignment, accompanied by a full LaTeX technical report detailing the model architecture, training methodology, and evaluation metrics.
Skills Demonstrated:
Multi‑stage deep learning pipeline design
Sensor‑driven perception modeling
Feature fusion between coarse and fine networks
Depth estimation from tactile imagery
Practical machine‑learning applications for robotics
Key Takeaways:
Contact prediction significantly improves downstream depth accuracy
Multi‑scale CNNs capture both global structure and fine‑grained detail
Tactile sensing can serve as a strong alternative to vision‑only depth estimation