MASt3R-SfM: Retrieval-Based Sparse View Pose Estimation for 3D Reconstruction
Introduction
Following my deep dive into MASt3R, I was introduced to MASt3R-SfM by a PhD student in my computer vision research lab. This retrieval-augmented variant integrates a transformer-based matching model with image retrieval, and it's easily delivered some of the best pose estimations and sparse reconstructions I've encountered.
Unlike the original MASt3R, which is designed for stereo and dense matching, MASt3R-SfM is purpose-built for Structure-from-Motion (SfM). It's particularly effective in sparse-view or casually captured image sets, where traditional SfM pipelines struggle due to viewpoint inconsistency or missing metadata.
What Makes MASt3R-SfM Special
MASt3R-SfM combines the DUSt3R backbone and MASt3R's matching module with an image retrieval pipeline that allows it to:
- Retrieve semantically and geometrically relevant images
- Improve pose prediction through contextualized reference views
- Produce robust pose estimations in unstructured, low-overlap datasets
This leads to:
- Better feature correspondences
- More stable camera trajectories
- Cleaner sparse reconstructions
MASt3R-SfM GitHub Repo
My GitHub Repo
Architecture Overview
MASt3R-SfM uses transformers to learn dense, global correspondences across image pairs. Before running pose estimation, its retrieval system selects the most relevant reference views from within the image set, so matching is only attempted between views that are likely to overlap. This improves geometric consistency.
This helps overcome classical SfM limitations like:
- Viewpoint divergence
- Sparse feature overlap
- Metadata unavailability
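To make the retrieval step described above more concrete, here is a minimal sketch of retrieval-driven pair selection. It is not the MASt3R-SfM implementation: the `select_pairs` function and the toy similarity scores are hypothetical, and the sketch simply assumes some retrieval model has already produced a score for every pair of images.

```python
def select_pairs(similarity, k=2):
    """Link each image to its k most similar images and return unique pairs.

    `similarity` is a hypothetical N x N list of retrieval scores, where a
    higher score means two images are more likely to show the same content.
    """
    n = len(similarity)
    pairs = set()
    for i in range(n):
        # Rank every other image by its similarity to image i, best first.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: similarity[i][j],
            reverse=True,
        )
        for j in others[:k]:
            pairs.add((min(i, j), max(i, j)))  # store each pair only once
    return sorted(pairs)


# Toy scores for four images: 0 and 1 overlap strongly, as do 2 and 3.
toy_similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(select_pairs(toy_similarity, k=1))  # [(0, 1), (2, 3)]
```

The point of the design is cost: matching every image against every other grows quadratically with the number of images, while keeping only the top-k neighbors per image keeps the number of pairs roughly linear.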
It can export COLMAP-style camera pose data, making it easy to plug into other pipelines like Gaussian Splatting, NeRF, or Meshroom.
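For readers who have not worked with COLMAP's text format, the sketch below shows roughly what a pose entry in `images.txt` looks like. COLMAP stores each pose as a world-to-camera rotation quaternion (`QW QX QY QZ`) plus translation (`TX TY TZ`). The input convention used here (a camera-to-world rotation and a camera center) is an assumption for illustration, and `colmap_image_line` is a hypothetical helper, not part of MASt3R-SfM or COLMAP.

```python
import math


def rotation_to_quaternion(R):
    """Convert a 3x3 rotation matrix (nested lists) to a quaternion (w, x, y, z)."""
    trace = R[0][0] + R[1][1] + R[2][2]
    if trace > 0.0:
        s = 2.0 * math.sqrt(trace + 1.0)
        return (0.25 * s,
                (R[2][1] - R[1][2]) / s,
                (R[0][2] - R[2][0]) / s,
                (R[1][0] - R[0][1]) / s)
    if R[0][0] > R[1][1] and R[0][0] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[0][0] - R[1][1] - R[2][2])
        return ((R[2][1] - R[1][2]) / s,
                0.25 * s,
                (R[0][1] + R[1][0]) / s,
                (R[0][2] + R[2][0]) / s)
    if R[1][1] > R[2][2]:
        s = 2.0 * math.sqrt(1.0 + R[1][1] - R[0][0] - R[2][2])
        return ((R[0][2] - R[2][0]) / s,
                (R[0][1] + R[1][0]) / s,
                0.25 * s,
                (R[1][2] + R[2][1]) / s)
    s = 2.0 * math.sqrt(1.0 + R[2][2] - R[0][0] - R[1][1])
    return ((R[1][0] - R[0][1]) / s,
            (R[0][2] + R[2][0]) / s,
            (R[1][2] + R[2][1]) / s,
            0.25 * s)


def colmap_image_line(image_id, cam_to_world_R, cam_center, camera_id, name):
    """Format one pose line in the layout of COLMAP's images.txt.

    Assumes the pose is given as a camera-to-world rotation and a camera
    center in world coordinates; COLMAP expects the inverse transform.
    """
    # The world-to-camera rotation is the transpose of camera-to-world.
    R = [[cam_to_world_R[j][i] for j in range(3)] for i in range(3)]
    # The world-to-camera translation is t = -R * C.
    t = [-sum(R[i][j] * cam_center[j] for j in range(3)) for i in range(3)]
    q = rotation_to_quaternion(R)
    numbers = " ".join(f"{v:.6f}" for v in (*q, *t))
    return f"{image_id} {numbers} {camera_id} {name}"


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(colmap_image_line(1, identity, [1, 2, 3], 1, "img.jpg"))
```

In a real `images.txt`, each pose line is followed by a second line listing that image's 2D keypoints, which may be left empty when only the poses are needed.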
Setup & Installation
The MASt3R-SfM setup is similar to MASt3R, with one key addition: ASMK retrieval.
Environment Setup
git clone --recursive https://github.com/naver/mast3r
cd mast3r
conda create -n mast3r python=3.11 cmake=3.14.0
conda activate mast3r
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
pip install -r dust3r/requirements.txt
pip install -r dust3r/requirements_optional.txt
ASMK Retrieval Setup
pip install cython
git clone https://github.com/jenicek/asmk
cd asmk/cython/
cythonize *.pyx
cd ..
pip install .
cd ..
CUDA Optimization (Optional)
cd dust3r/croco/models/curope/
python setup.py build_ext --inplace
cd ../../../../
Download Checkpoints
mkdir -p checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_trainingfree.pth -P checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_codebook.pkl -P checkpoints/
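After the steps above, a quick preflight check can catch a missing package or a failed download before a long run. This is only a sketch: the module names (`torch`, `torchvision`, `asmk`) are assumed to be the import names installed by the commands above, and the checkpoint filenames are taken from the two `wget` lines.

```python
import importlib.util
from pathlib import Path

# Filenames from the wget commands above.
CHECKPOINTS = [
    "MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_trainingfree.pth",
    "MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_codebook.pkl",
]
# Assumed import names for the packages installed during setup.
MODULES = ["torch", "torchvision", "asmk"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_files(directory, filenames):
    """Return the filenames that are absent from `directory`."""
    return [f for f in filenames if not (Path(directory) / f).is_file()]


if __name__ == "__main__":
    print("missing modules:", missing_modules(MODULES))
    print("missing checkpoints:", missing_files("checkpoints", CHECKPOINTS))
```

Run it from the `mast3r` directory with the `mast3r` conda environment active; two empty lists mean everything the script looks for is in place.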
MASt3R vs MASt3R-SfM
Although both MASt3R and MASt3R-SfM share the same transformer backbone and DUSt3R foundation, they are optimized for different tasks:
Pose Output Compatibility
- MASt3R: Focuses on dense stereo but does not export COLMAP-style poses.
- MASt3R-SfM: Exports COLMAP-style poses, which makes it ideal for NeRF, Gaussian Splatting, or MVS.
Retrieval-Based View Selection
- MASt3R-SfM introduces a retrieval module to select semantically/geometrically relevant views, improving performance in sparse or noisy datasets.
In short: MASt3R is best for dense stereo, while MASt3R-SfM excels at sparse SfM in real-world scenes.
Reflections on Camera Pose Estimation
While GaussianObject is designed to work without COLMAP, accurate camera poses still play a crucial role in reconstruction quality. Even the COLMAP-free pipeline depends on transformer-based estimators like DUSt3R or MASt3R.
After using MASt3R, I became particularly interested in MASt3R-SfM, which adds retrieval-based conditioning for even better pose accuracy.
Why Retrieval-Based Conditioning Matters
Multi-view pose estimation often struggles when:
- Input views have large viewpoint differences
- Overlap is limited
- Metadata is missing
MASt3R-SfM addresses this by:
- Retrieving relevant reference views
- Conditioning pose estimation on stronger view context
- Producing more stable, accurate, and consistent camera poses
This makes it well-suited for unstructured, real-world datasets.
My Results So Far
I tested MASt3R-SfM on several object-level datasets and found:
- More accurate pose estimates vs. MASt3R and COLMAP
- Improved downstream reconstructions (e.g., GaussianObject, NeRF)
- Better view consistency and reduced reconstruction artifacts
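One common way to put a number on "more accurate pose estimates" is the angle between an estimated rotation and a reference rotation. The sketch below shows that metric only; it is not the evaluation protocol behind the results above, and `rotation_error_deg` and `rot_z` are hypothetical helpers. Poses from two different reconstructions usually live in different coordinate frames, so in practice the metric is applied to relative rotations between camera pairs, or after aligning the two reconstructions.

```python
import math


def rotation_error_deg(R_est, R_ref):
    """Angle in degrees of the rotation separating R_est from R_ref.

    Both inputs are 3x3 rotation matrices given as nested lists.
    """
    # trace(R_est^T R_ref) equals the sum of elementwise products.
    trace = sum(R_est[i][j] * R_ref[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def rot_z(angle_deg):
    """Rotation matrix for a rotation of `angle_deg` degrees about the z axis."""
    a = math.radians(angle_deg)
    return [
        [math.cos(a), -math.sin(a), 0.0],
        [math.sin(a), math.cos(a), 0.0],
        [0.0, 0.0, 1.0],
    ]


# An estimate that is off by 5 degrees about the z axis.
print(round(rotation_error_deg(rot_z(35.0), rot_z(30.0)), 3))  # 5.0
```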
What I'm Exploring Next
Given MASt3R-SfM's success in pose estimation, my next steps are:
- Integrating with NeRF Studio
- Trying Instant-NGP from NVIDIA
- Exploring classical dense MVS techniques
- Incorporating depth maps post-pose estimation
COLMAP dense hasn't worked for me due to GPU constraints, but I'll try newer pipelines that accept external poses.
Final Thoughts
MASt3R-SfM has transformed my approach to sparse-view reconstruction. Whether using NeRF, Gaussian Splatting, or simple rendering, its pose estimates are the most stable and accurate I've seen.
If you're curious, check out:
Resources & Links
- MASt3R-SfM GitHub (NAVER Labs)
- My MASt3R-SfM GitHub Repo
- DUSt3R (CVPR 2024)
- MASt3R-SfM Paper
- Instant-NGP (NVIDIA)
- NeRF Studio Docs
- GaussianObject Project Page
References
Leroy et al., Grounding Image Matching in 3D with MASt3R, arXiv 2024.
Wang et al., DUSt3R: Geometric 3D Vision Made Easy, CVPR 2024.