MASt3R-SfM: Retrieval-Based Sparse View Pose Estimation for 3D Reconstruction

3 minute read

Introduction 📸

Following my deep dive into MASt3R, I was introduced to MASt3R-SfM by a PhD student in my computer vision research lab. This retrieval-augmented variant integrates a transformer-based matching model with image retrieval, and it’s easily delivered some of the best pose estimations and sparse reconstructions I’ve encountered.

Unlike the original MASt3R, which is designed for stereo and dense matching, MASt3R-SfM is purpose-built for Structure-from-Motion (SfM). It’s particularly effective in sparse-view or casually captured image sets, where traditional SfM pipelines struggle due to viewpoint inconsistency or missing metadata.

What Makes MASt3R-SfM Special 💡

MASt3R-SfM combines the DUSt3R backbone and MASt3R’s matching module with an image retrieval pipeline that allows it to:

Retrieve semantically and geometrically relevant images
Improve pose prediction through contextualized reference views
Produce robust pose estimations in unstructured, low-overlap datasets

This leads to:

Better feature correspondences
More stable camera trajectories
Cleaner sparse reconstructions

🔗 MASt3R-SfM GitHub Repo
🔗 My GitHub Repo

Architecture Overview 🧠

MASt3R-SfM uses transformers to learn dense, global correspondences across image pairs. Its retrieval system selects optimal reference views from within the image set before running pose estimation — improving geometric consistency.

This helps overcome classical SfM limitations like:

Viewpoint divergence
Sparse feature overlap
Metadata unavailability

It can export COLMAP-style camera pose data, making it easy to plug into other pipelines like Gaussian Splatting, NeRF, or Meshroom.

Setup & Installation ⚙️

The MASt3R-SfM setup is similar to MASt3R, with one key addition: ASMK retrieval.

🔧 Environment Setup

git clone --recursive https://github.com/naver/mast3r
cd mast3r

conda create -n mast3r python=3.11 cmake=3.14.0
conda activate mast3r

conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia

pip install -r requirements.txt
pip install -r dust3r/requirements.txt
pip install -r dust3r/requirements_optional.txt

🔁 ASMK Retrieval Setup

pip install cython

git clone https://github.com/jenicek/asmk
cd asmk/cython/
cythonize *.pyx
cd ..
pip install .
cd ..

⚙️ CUDA Optimization (Optional)

cd dust3r/croco/models/curope/
python setup.py build_ext --inplace
cd ../../../../

📦 Download Checkpoints

mkdir -p checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_trainingfree.pth -P checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_codebook.pkl -P checkpoints/

MASt3R vs MASt3R-SfM 🔄

Although both MASt3R and MASt3R-SfM share the same transformer backbone and DUSt3R foundation, they are optimized for different tasks:

🔹 Pose Output Compatibility

MASt3R: Focuses on dense stereo but does not export COLMAP-style poses.
MASt3R-SfM: Exports COLMAP-style poses — ideal for NeRF, Gaussian Splatting, or MVS.

🔹 Retrieval-Based View Selection

MASt3R-SfM introduces a retrieval module to select semantically/geometrically relevant views, improving performance in sparse or noisy datasets.

👉 In short: MASt3R is best for dense stereo, while MASt3R-SfM excels at sparse SfM in real-world scenes.

🔍 Reflections on Camera Pose Estimation

While GaussianObject is designed to work without COLMAP, accurate camera poses still play a crucial role in reconstruction quality. Even the COLMAP-free pipeline depends on transformer-based estimators like DUSt3R or MASt3R.

After using MASt3R, I became particularly interested in MASt3R-SfM, which adds retrieval-based conditioning for even better pose accuracy.

🔁 Why Retrieval-Based Conditioning Matters

Multi-view pose estimation often struggles when:

Input views have large viewpoint differences
Overlap is limited
Metadata is missing

MASt3R-SfM addresses this by:

Retrieving relevant reference views
Conditioning pose estimation on stronger view context
Producing more stable, accurate, and consistent camera poses

This makes it well-suited for unstructured, real-world datasets.

🧪 My Results So Far

I tested MASt3R-SfM on several object-level datasets and found:

More accurate pose estimates vs. MASt3R and COLMAP
Improved downstream reconstructions (e.g., GaussianObject, NeRF)
Better view consistency and reduced reconstruction artifacts

What I’m Exploring Next 🧪

Given MASt3R-SfM’s success in pose estimation, my next steps are:

Integrating with NeRF Studio
Trying Instant-NGP from NVIDIA
Exploring classical dense MVS techniques
Incorporating depth maps post-pose estimation

COLMAP dense hasn’t worked for me due to GPU constraints — but I’ll try newer pipelines that accept external poses.

Final Thoughts 🧠

MASt3R-SfM has transformed my approach to sparse-view reconstruction. Whether using NeRF, Gaussian Splatting, or simple rendering, its pose estimates are the most stable and accurate I’ve seen.

If you’re curious, check out:

Resources & Links 📚

References
Leroy et al., Grounding Image Matching in 3D with MASt3R, arXiv 2024.
Wang et al., DUSt3R: Geometric 3D Vision Made Easy, CVPR 2024.

Share on

X Facebook LinkedIn Bluesky

Anekha Sokhal