A Tactile Sensor Prediction Model
This project focuses on reconstructing 3D surface geometry and contact regions from tactile sensor images. Using GelSight/DIGIT tactile data, this system learns to predict contact masks and depth (heightmaps) from raw tactile images using deep neural networks.
The goal is to build an inverse sensor model that converts tactile visual input into meaningful physical interaction data, enabling more reliable robotic grasping and object manipulation.
Python
PyTorch
OpenCV
NumPy
Matplotlib
Jupyter Notebook
Deep Learning
Robotics Perception
Tactile Sensing
Tactile sensors provide rich physical feedback about surface geometry, contact pressure, and material interaction. While many recent approaches focus on modeling tactile deformation, this project emphasizes predicting contact regions and depth directly from sensor images.
This work builds on several key ideas:
GelSight tactile sensing — a vision‑based sensor that uses a soft gel and internal lighting to capture extremely detailed surface impressions, almost like taking a high‑resolution photograph of touch.
DIGIT sensor design — a compact, low‑cost tactile sensor developed for robotics that captures touch information as images, making it easy to integrate with machine‑learning models.
Multi‑scale depth‑prediction networks — neural network architectures that learn depth at different spatial scales, helping the model understand both fine textures and larger surface shapes.
Together, these concepts motivate a system that can interpret tactile images as meaningful 3D contact information.
Figure 1. A vision‑based tactile sensor captures fine surface impressions as high‑resolution images.
Figure 2. Example contacts illustrating the resolution tactile sensors achieve on various objects.
This project aims to translate raw tactile images into meaningful 3D contact information by training a deep learning system that can understand and reconstruct touch.
This model is designed to:
Predict binary contact regions, identifying where the sensor physically touches an object
Predict dense depth maps that capture local surface geometry
Learn visual representations directly from tactile sensor images
The system uses a two‑stage network architecture:
Coarse Contact Network — estimates the initial contact mask
Fine Depth Network — predicts detailed depth, conditioned on the contact output
To train the model effectively, tactile images and their paired depth data need to be organized, cleaned, and converted into a format suitable for deep learning. A custom PyTorch Dataset was implemented to handle this process and ensure that each tactile image is correctly aligned with its corresponding depth sample.
Key Preprocessing Steps:
Sorting file indices to keep tactile images and depth maps aligned
Loading RGB tactile images from disk
Handling missing or placeholder depth values
Applying consistent transforms for training and testing
Using a PyTorch DataLoader for efficient batch loading
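The index-sorting step above can be sketched as follows; the filename pattern is hypothetical, but the idea is to sort by the embedded integer index so tactile images and depth maps stay paired:

```python
import re

def sorted_by_index(filenames):
    # Sort files like "tactile_10.png" numerically rather than lexically,
    # so tactile and depth files line up by the same integer index
    def index_of(name):
        match = re.search(r'(\d+)', name)
        return int(match.group(1)) if match else -1
    return sorted(filenames, key=index_of)

tactile_files = sorted_by_index(['tactile_10.png', 'tactile_2.png', 'tactile_1.png'])
# tactile_files == ['tactile_1.png', 'tactile_2.png', 'tactile_10.png']
```

A plain lexical sort would place `tactile_10.png` before `tactile_2.png`, silently misaligning the image/depth pairs.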
The custom dataset handles image loading, applies transformations, and returns each sample in a structured format that the model can use during training.
Python:
from torch.utils.data import Dataset
from PIL import Image

class TactileDataset(Dataset):
    def __getitem__(self, idx):
        # Load the tactile image for this index and apply the shared transform
        tactile_sample = Image.open(self.image_paths[idx]).convert('RGB')
        tactile_sample = self.transform(tactile_sample)
        return {'tactile': tactile_sample}
This dataset class is the foundation of the entire training pipeline. It ensures that every tactile image is:
Loaded consistently
Converted into a normalized tensor
Prepared with the same preprocessing steps the model expects
Without this stage, the network wouldn’t receive clean, aligned data, and the predictions for contact and depth would be unreliable.
The system uses a dual‑network design that separates coarse contact prediction from fine‑grained depth reconstruction. This structure allows the model to first understand where contact occurs, then refine that information into detailed surface geometry.
ContactNet (Coarse Network) — predicts a binary contact mask using stacked convolutional layers
TactileDepthNet (Fine Network) — predicts a dense depth map, conditioned on both tactile features and the contact mask
Design Highlights:
Multi‑layer CNN feature extraction
Stable weight initialization for reliable training
Bilinear up-sampling to recover full‑resolution outputs
Feature fusion between the coarse and fine networks
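A minimal sketch of this design is shown below. The layer counts and channel sizes are hypothetical, but it illustrates the key highlights: CNN feature extraction, channel-concatenation fusion of the contact mask into the depth network, and bilinear upsampling back to full resolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContactNet(nn.Module):
    """Coarse network: tactile image -> single-channel contact mask."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(16, 1, 1)

    def forward(self, x):
        # Sigmoid keeps the mask in [0, 1]
        return torch.sigmoid(self.head(self.features(x)))

class TactileDepthNet(nn.Module):
    """Fine network: tactile image + contact mask -> dense depth map."""
    def __init__(self):
        super().__init__()
        # 4 input channels: 3 RGB + 1 contact mask (fusion by concatenation)
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(32, 1, 1)

    def forward(self, tactile, contact):
        fused = torch.cat([tactile, contact], dim=1)
        feats = self.encoder(fused)
        # Bilinear upsampling recovers the full input resolution
        feats = F.interpolate(feats, size=tactile.shape[-2:],
                              mode='bilinear', align_corners=False)
        return self.head(feats)
```

Concatenating the contact mask as an extra input channel is one simple way to realize the coarse-to-fine conditioning; the real networks may fuse features at deeper layers instead.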
The two networks operate sequentially: the contact prediction guides the depth network, helping it focus on regions where meaningful tactile interaction occurs.
Python:
# Stage 1: the coarse network predicts the contact mask from the tactile image
contact_model_output = contact_model(tactile_input)
# Stage 2: the fine network predicts depth, conditioned on the contact mask
depth_output = tactile_depth_model(tactile_input, contact_model_output)
This forward‑pass structure is the core of the architecture. By conditioning depth prediction on the contact mask, the model learns to reconstruct geometry only where the sensor actually touched the object. This improves accuracy, reduces noise, and mirrors how tactile sensing works in real physical interactions.
Figure 3. Tactile input is processed through coarse and fine networks to generate contact and depth reconstructions.
Both networks were trained with Mean Squared Error (MSE) loss and optimized with the Adam optimizer. MSE encourages the model to produce smooth, accurate predictions for both contact masks and depth maps, while Adam provides stable and efficient parameter updates during training.
Adam adaptively adjusts how much each weight changes as the model learns. It combines momentum (to keep moving in a consistent direction) with per‑parameter adaptive learning rates (to choose smarter step sizes), allowing the networks to train faster and more reliably than with plain gradient descent.
Training Setup:
Separate loss tracking for contact and depth predictions
GPU acceleration when available
DataParallel for scalable multi‑GPU training
Reliable optimization is essential for learning from tactile data, which can be noisy and highly variable. Using MSE and Adam helps the model converge smoothly and produce consistent contact and depth reconstructions.
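A minimal training-step sketch under these choices follows; the two single-layer stand-in networks, the learning rate, and the variable names are hypothetical, but the structure (separate loss tracking, a shared Adam optimizer, MSE on both outputs) matches the setup described above:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two networks
contact_model = nn.Conv2d(3, 1, 1)   # tactile -> contact mask
depth_model = nn.Conv2d(4, 1, 1)     # tactile + mask -> depth map

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(
    list(contact_model.parameters()) + list(depth_model.parameters()), lr=1e-4
)

def train_step(tactile, contact_gt, depth_gt):
    optimizer.zero_grad()
    contact_pred = torch.sigmoid(contact_model(tactile))
    depth_pred = depth_model(torch.cat([tactile, contact_pred], dim=1))
    # Track the two losses separately, but optimize their sum jointly
    contact_loss = criterion(contact_pred, contact_gt)
    depth_loss = criterion(depth_pred, depth_gt)
    (contact_loss + depth_loss).backward()
    optimizer.step()
    return contact_loss.item(), depth_loss.item()
```

For the multi-GPU setup mentioned above, each model could additionally be wrapped in `nn.DataParallel` before training.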
Figure 4. The loss curves show how the model’s predictions improve over training iterations.
The contact model was trained to identify where the sensor makes physical contact with an object based solely on the tactile image input. During training, the model gradually learned to produce spatially consistent contact patterns that aligned with the underlying geometry.
Observations:
Training loss decreased steadily and stabilized after the early iterations
The model learned coherent, structured contact regions rather than noisy or scattered predictions
Learning stable contact patterns is essential for the second stage of the system. The depth network relies on accurate contact predictions to focus on meaningful regions of the tactile image, improving the quality of the final 3D reconstruction.
Figure 5. The training loss curves demonstrate how the contact model improves throughout multiple iterations of training.
The depth model used the predicted contact regions as guidance to generate refined heightmaps. By conditioning on contact, the network focused on meaningful areas of the tactile image, leading to smoother and more stable depth predictions.
Observations:
Training converged steadily over time
Output depth maps showed improved spatial smoothness
Conditioning on contact significantly improved depth stability
Depth reconstruction is sensitive to noise, especially in regions where the sensor barely touches the object. Using the contact mask as a prior helps the model avoid guessing in irrelevant areas, resulting in cleaner and more accurate heightmaps.
Figures 6 & 7. The training loss curves show how the depth model improves throughout training.
The trained models were evaluated on a set of validation tactile images to assess how well they generalized beyond the training data. For each sample, the system produced three outputs that together describe the full tactile interaction:
The raw tactile image
The predicted contact mask
The predicted depth map
Depth predictions were normalized and exported to make the heightmaps easier to interpret visually.
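The normalization step can be sketched as a simple min-max rescaling to [0, 1] (the exact export format used in the project may differ):

```python
import numpy as np

def normalize_depth(depth):
    # Rescale a raw depth map to [0, 1] for visualization/export;
    # guard against a constant map to avoid division by zero
    d_min, d_max = depth.min(), depth.max()
    if d_max - d_min < 1e-8:
        return np.zeros_like(depth)
    return (depth - d_min) / (d_max - d_min)
```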
During evaluation, the contact model and depth model were run sequentially, mirroring the structure used during training.
Python:
with torch.no_grad():  # gradients are not needed during evaluation
    contact_output = contact_model(tactile_input)
    depth_output = tactile_depth_model(tactile_input, contact_output)
Visualizing the outputs side‑by‑side makes it easy to see how the model interprets touch: where contact occurs, how that contact is shaped, and what 3D structure the sensor likely encountered. These evaluations help verify that the system produces stable, meaningful predictions across different objects and contact conditions.
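The side-by-side comparison can be sketched with Matplotlib as follows; the function name and panel titles are illustrative, not the project's actual plotting code:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so figures can be saved without a display
import matplotlib.pyplot as plt
import numpy as np

def show_prediction(tactile, contact, depth):
    # Plot the raw tactile image, predicted contact mask, and
    # predicted depth map side by side for qualitative inspection
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    panels = [(tactile, 'Tactile image'), (contact, 'Contact mask'), (depth, 'Depth map')]
    for ax, (img, title) in zip(axes, panels):
        # RGB images keep their colors; single-channel maps get a colormap
        ax.imshow(img, cmap=None if img.ndim == 3 else 'viridis')
        ax.set_title(title)
        ax.axis('off')
    return fig

# Demo with random arrays standing in for real model outputs
fig = show_prediction(np.random.rand(8, 8, 3), np.random.rand(8, 8), np.random.rand(8, 8))
```

The returned figure can then be saved with `fig.savefig(...)` or displayed in a notebook.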
A reusable prediction pipeline was implemented to load the trained model weights and generate contact and depth outputs for new tactile inputs. This allows the system to be applied to unseen data without retraining, making it easy to run evaluations or integrate the model into downstream applications.
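Loading the trained weights follows the standard PyTorch checkpoint pattern, sketched below with a single-layer stand-in model and a temporary checkpoint path (both hypothetical):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for one trained network
contact_model = nn.Conv2d(3, 1, 1)

# Save the trained weights to disk...
ckpt_path = os.path.join(tempfile.mkdtemp(), 'contact_model.pth')
torch.save(contact_model.state_dict(), ckpt_path)

# ...then restore them into a fresh instance and switch to inference mode
restored = nn.Conv2d(3, 1, 1)
restored.load_state_dict(torch.load(ckpt_path))
restored.eval()  # disables training-time behavior such as dropout
```

Saving only the `state_dict` (rather than the whole module) keeps checkpoints portable across code changes.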
The inference function mirrors the two‑stage architecture used during training: the contact model runs first, and its output guides the depth model.
Python:
def predict(tactile_image):
    # Run both stages without tracking gradients (inference only)
    with torch.no_grad():
        contact_output = contact_model(tactile_image)
        depth_output = tactile_depth_model(tactile_image, contact_output)
    return contact_output, depth_output
A clean inference pipeline ensures that the model can be deployed consistently and reliably. By encapsulating the full prediction process in a single function, the system can process new tactile images, generate contact masks, and reconstruct depth maps with minimal overhead.
This project was completed as part of a team research assignment, accompanied by a full LaTeX technical report detailing the model architecture, training methodology, and evaluation metrics.
Skills Demonstrated:
Multi‑stage deep learning pipeline design
Sensor‑driven perception modeling
Feature fusion between coarse and fine networks
Depth estimation from tactile imagery
Practical machine‑learning applications for robotics
Key Takeaways:
Contact prediction significantly improves downstream depth accuracy
Multi‑scale CNNs capture both global structure and fine‑grained detail
Tactile sensing can serve as a strong alternative to vision‑only depth estimation