Realtime Monocular Depth Mapping

A PyTorch-based depth mapping project using the NYU dataset. This implementation includes various neural network architectures for single image depth prediction with custom loss functions and data augmentation.

Summary

This project focuses on monocular depth mapping from single RGB images using deep neural networks. Depth mapping is a critical task in computer vision with applications in robotics, autonomous driving, 3D reconstruction, and scene understanding. This work implements and compares multiple network architectures (Autoencoder and UNet) trained on the NYU Depth dataset. Our approach combines multiple loss functions including SSIM loss and gradient-based depth loss to improve depth prediction quality. The implemented methods achieve competitive accuracy with efficient computational performance suitable for real-time applications.

Features

Multiple Model Architectures: Support for Autoencoder and UNet models
NYU Dataset Support: Built-in data loaders for NYU Depth dataset
Custom Loss Functions: SSIM and gradient-based depth loss functions
Data Augmentation: Random horizontal flipping and other augmentations
TensorBoard Integration: Real-time training visualization
GPU Support: CUDA-enabled training with multi-GPU support
Checkpoint Management: Model saving and loading capabilities

Research Focus

This project explored decoder architecture design for monocular depth mapping, with emphasis on:

Bilinear upsampling vs transposed convolution
Artifact behavior in depth reconstruction
Skip-connection effectiveness
Real-time deployment tradeoffs
Memory/performance optimization

Read Full Technical Analysis

I wrote a detailed technical breakdown covering:

Bilinear upsampling
Transposed convolution
Checkerboard artifacts
Decoder design tradeoffs
Modern depth estimation architectures
Monodepth2, DPT, MiDaS, and Depth Anything

➡️ Read the full blog post

Dependencies

PyTorch
TorchVision
NumPy
TensorboardX
OpenCV
Matplotlib
ImageIO

Key Findings

UNet outperforms Autoencoder on all depth accuracy metrics, with 15% improvement in MAE
Trade-off between quality and speed: UNet provides better results but with ~15% slower inference
GPU acceleration critical: GPU inference is ~18x faster than CPU for both networks
Memory efficiency: Autoencoder uses 31% less GPU memory, suitable for resource-constrained environments
Skip connections beneficial: UNet’s skip connections significantly improve depth detail preservation

Supported Architectures

Autoencoder: Encoder-decoder architecture for depth mapping
UNet: U-Net architecture with skip connections for depth prediction

Loss Functions

The training process uses multiple loss functions:

SSIM Loss: Structural similarity index loss for perceptual quality
Depth Gradient Loss: Gradient-based loss for preserving depth discontinuities
L1 Loss: Mean absolute error for pixel-wise accuracy

Training Features

Real-time Monitoring: TensorBoard integration for loss and metric visualization
Checkpoint Saving: Automatic model and optimizer state saving
Data Augmentation: Random horizontal flips and other augmentations
Multi-GPU Support: Configurable CUDA device selection