D3-RSMDE: 40× Faster and High-Fidelity
Remote Sensing Monocular Depth Estimation
AAAI 2026

Overview

Real-time, high-fidelity monocular depth estimation from remote sensing imagery is crucial for numerous applications, yet existing methods face a stark trade-off between accuracy and efficiency. Methods built on Vision Transformer (ViT) backbones for dense prediction are fast but often exhibit poor perceptual quality; conversely, diffusion models offer high fidelity at a prohibitive computational cost. To overcome these limitations, we propose Depth Detail Diffusion for Remote Sensing Monocular Depth Estimation (D3-RSMDE), an efficient framework designed to achieve an optimal balance between speed and quality. Our framework first leverages a ViT-based module to rapidly generate a high-quality preliminary depth map, which serves as a structural prior and effectively replaces the time-consuming initial structure-generation stage of diffusion models. Based on this prior, we propose a Progressive Linear Blending Refinement (PLBR) strategy, which uses a lightweight U-Net to refine details in only a few iterations. The entire refinement step operates efficiently in a compact latent space supported by a Variational Autoencoder (VAE). Extensive experiments demonstrate that D3-RSMDE achieves a notable 11.85% reduction in the Learned Perceptual Image Patch Similarity (LPIPS) perceptual metric over leading models such as Marigold, while also achieving over a 40× speedup in inference and maintaining VRAM usage comparable to lightweight ViT models.
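
Below is a minimal sketch of the inference pipeline described above, assuming PyTorch-style modules. The module names (`vit_depth`, `vae`, `unet`), their call signatures, and the linear blending schedule are illustrative assumptions for exposition, not the released implementation.

```python
import torch

@torch.no_grad()
def d3_rsmde_infer(image, vit_depth, vae, unet, num_steps=4):
    """image: (B, 3, H, W) remote sensing RGB tensor in [0, 1]."""
    # 1) Fast ViT pass produces the preliminary depth map (structural prior),
    #    replacing the slow initial structure-generation stage of diffusion.
    prior_depth = vit_depth(image)                      # (B, 1, H, W)

    # 2) Encode the prior into the compact VAE latent space so the refinement
    #    U-Net operates on a much smaller tensor. (Replicating the depth map
    #    to 3 channels for an image VAE is an assumption.)
    z = vae.encode(prior_depth.repeat(1, 3, 1, 1))      # (B, C, h, w) latent

    # 3) Progressive Linear Blending Refinement (assumed form): at each of a
    #    few iterations the lightweight U-Net proposes a detail-refined latent,
    #    which is linearly blended with the current latent using a weight that
    #    grows toward 1 over the schedule.
    for step in range(num_steps):
        alpha = (step + 1) / num_steps                  # linear blending weight
        z_refined = unet(z, torch.tensor([step], device=z.device))
        z = (1.0 - alpha) * z + alpha * z_refined

    # 4) Decode back to image space and reduce to a single-channel depth map.
    depth = vae.decode(z).mean(dim=1, keepdim=True)     # (B, 1, H, W)
    return depth
```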

Gallery

The Datasets We Used

The figure shows the locations of the five datasets we used and qualitative generation results of the model. They are: Japan + Korea (2,650 pairs, coastal mountainous terrain, 30 m resolution, J&K); Southeast Asia (7,000 pairs, plains and hills, 30 m resolution, SA); Mediterranean (29,225 pairs, desert and plateau, 30 m resolution, Med); Australia (1,249 pairs, plains, 5 m resolution, Ast); and Switzerland (4,827 pairs, mountains, 2 m resolution, Swi).

Experimental Results Comparison

Figure panels: Inference vs. Training Time; Inference vs. Training VRAM.

The inference/training time and VRAM consumption of our D3-RSMDE are far lower than those of other diffusion-based models and are already comparable to those of lightweight ViT models.

Quantitative Comparison

| Model | MAE ↓ (J&K) | MAE ↓ (SA) | MAE ↓ (Med) | MAE ↓ (Swi) | MAE ↓ (Ast) | δ³ ↑ (J&K) | δ³ ↑ (SA) | δ³ ↑ (Med) | δ³ ↑ (Swi) | δ³ ↑ (Ast) | PSNR ↑ (J&K) | PSNR ↑ (SA) | PSNR ↑ (Med) | PSNR ↑ (Swi) | PSNR ↑ (Ast) | LPIPS ↓ (J&K) | LPIPS ↓ (SA) | LPIPS ↓ (Med) | LPIPS ↓ (Swi) | LPIPS ↓ (Ast) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adabins | 16.7 | 28.4 | 29.0 | 19.6 | 44.3 | 79.9 | 68.9 | *86.3* | 93.1 | *73.2* | 22.3 | 17.4 | 17.7 | 21.2 | 14.7 | 0.181 | 0.405 | 0.367 | 0.127 | 0.528 |
| DPT | 17.3 | 34.2 | 29.7 | 30.9 | 43.5 | 77.3 | 62.5 | 81.6 | 84.6 | 72.7 | 22.2 | 16.7 | 17.8 | 17.6 | 14.9 | 0.313 | 0.604 | 0.520 | 0.204 | 0.579 |
| Omnidata | 20.1 | 30.7 | 28.0 | 19.2 | 42.6 | 61.9 | 67.8 | 80.9 | 90.8 | 72.1 | 21.2 | 18.5 | 18.2 | 21.6 | 15.0 | 0.354 | 0.482 | 0.479 | 0.135 | 0.553 |
| Pix2pix | 24.5 | 39.3 | 39.4 | 38.9 | 44.3 | 68.9 | 55.1 | 72.0 | 76.8 | 69.3 | 18.6 | 15.2 | 15.1 | 15.5 | 14.3 | 0.450 | 0.485 | 0.434 | 0.775 | 0.937 |
| Marigold | 14.2 | 23.7 | 24.7 | 21.3 | *40.0* | *83.1* | *71.7* | 85.8 | 89.6 | 72.8 | *24.3* | 19.6 | 19.3 | 21.4 | *15.7* | **0.162** | *0.326* | 0.329 | 0.144 | **0.488** |
| EcoDepth | 26.4 | 49.0 | 49.0 | 37.4 | 43.3 | 65.7 | 49.4 | 67.4 | 77.9 | 69.9 | 17.9 | 13.3 | 13.2 | 16.0 | 14.5 | 0.461 | 0.428 | 0.563 | 0.265 | 0.702 |
| D3-RSMDE (VA_VAE) | *13.6* | *21.7* | *23.4* | *14.1* | 41.7 | 79.1 | 70.0 | 85.9 | *93.3* | 67.8 | 23.7 | *20.1* | *20.0* | *24.2* | 15.1 | 0.203 | 0.366 | *0.301* | *0.107* | 0.574 |
| D3-RSMDE (AEKL) | **12.7** | **20.5** | **22.1** | **13.4** | **36.1** | **83.3** | **73.9** | **88.1** | **94.3** | **76.1** | **24.5** | **20.6** | **20.4** | **24.8** | **16.2** | *0.180* | **0.318** | **0.290** | **0.104** | *0.511* |

Quantitative comparison with SOTA methods on five datasets. The best result in each column is highlighted in bold and the second-best is in italics.
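
For reference, here is a minimal NumPy sketch of two of the metrics reported above: MAE and the δ³ threshold accuracy (the fraction of pixels with max(pred/gt, gt/pred) below 1.25³). PSNR and LPIPS follow their standard definitions (LPIPS is typically computed with the `lpips` package). The function names and the `eps` guard are illustrative, not the exact evaluation code.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth depth maps."""
    return np.abs(pred - gt).mean()

def delta3(pred, gt, eps=1e-6):
    """Threshold accuracy δ³: fraction of pixels with max(pred/gt, gt/pred) < 1.25**3."""
    ratio = np.maximum(pred / (gt + eps), gt / (pred + eps))
    return (ratio < 1.25 ** 3).mean()
```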