
Introduction 📸

Following my deep dive into MASt3R, I was introduced to MASt3R-SfM by a PhD student in my computer vision research lab. This variant augments the transformer-based matching model with image retrieval, and it has delivered some of the best pose estimates and sparse reconstructions I've encountered.

Unlike the original MASt3R, which is designed for stereo and dense matching, MASt3R-SfM is purpose-built for Structure-from-Motion (SfM). It's particularly effective in sparse-view or casually captured image sets, where traditional SfM pipelines struggle due to viewpoint inconsistency or missing metadata.


What Makes MASt3R-SfM Special 💡

MASt3R-SfM combines the DUSt3R backbone and MASt3R's matching module with an image retrieval pipeline (a toy sketch follows the lists below) that allows it to:

  • Retrieve semantically and geometrically relevant images
  • Improve pose prediction through contextualized reference views
  • Produce robust pose estimations in unstructured, low-overlap datasets

This leads to:

  • Better feature correspondences
  • More stable camera trajectories
  • Cleaner sparse reconstructions
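
To make the retrieval step concrete, here's a toy Python sketch that ranks reference views by cosine similarity between global image descriptors. The real pipeline uses ASMK over MASt3R's local features, so treat the descriptor array and the top-k heuristic here as illustrative assumptions, not the actual implementation:

import numpy as np

def top_k_reference_views(descriptors, query_idx, k=3):
    """Rank images by cosine similarity to a query and return the top-k indices.

    descriptors: (N, D) array with one global descriptor per image, a
    stand-in for ASMK's aggregated local-feature representation.
    """
    # Normalize rows so dot products become cosine similarities.
    normed = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]
    sims[query_idx] = -np.inf  # never retrieve the query itself
    return np.argsort(sims)[::-1][:k]

# Toy usage: 5 images with random 128-D descriptors.
rng = np.random.default_rng(0)
descriptors = rng.standard_normal((5, 128))
print(top_k_reference_views(descriptors, query_idx=0, k=2))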

🔗 MASt3R-SfM GitHub Repo
🔗 My GitHub Repo


Architecture Overview 🧠

MASt3R-SfM uses transformers to learn dense, global correspondences across image pairs. Its retrieval system selects optimal reference views from within the image set before running pose estimation, improving geometric consistency.

This helps overcome classical SfM limitations like:

  • Viewpoint divergence
  • Sparse feature overlap
  • Metadata unavailability

It can export COLMAP-style camera pose data, making it easy to plug into other pipelines like Gaussian Splatting, NeRF, or Meshroom.
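
Because the export follows COLMAP conventions, downstream tools only need to read the standard images.txt format, where each image record stores a world-to-camera rotation as a quaternion (QW QX QY QZ) plus a translation. Here's a minimal reader sketch; it assumes the standard COLMAP text layout (nothing MASt3R-specific) and that every image record is followed by its 2D-point line:

import numpy as np

def quat_to_rotmat(qw, qx, qy, qz):
    """Convert a unit quaternion (COLMAP order: w, x, y, z) to a 3x3 rotation matrix."""
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])

def read_colmap_images_txt(path):
    """Yield (name, R, t, camera_center) for each image in a COLMAP images.txt."""
    with open(path) as f:
        lines = [ln.strip() for ln in f if ln.strip() and not ln.startswith("#")]
    # Image records alternate with their 2D-point lines; poses sit on even entries.
    for record in lines[::2]:
        fields = record.split()
        qw, qx, qy, qz, tx, ty, tz = map(float, fields[1:8])
        R = quat_to_rotmat(qw, qx, qy, qz)   # world-to-camera rotation
        t = np.array([tx, ty, tz])           # world-to-camera translation
        yield fields[9], R, t, -R.T @ t      # camera center in world coordinates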


Setup & Installation ⚙️

The MASt3R-SfM setup is similar to MASt3R, with one key addition: ASMK retrieval.

🔧 Environment Setup

git clone --recursive https://github.com/naver/mast3r
cd mast3r

conda create -n mast3r python=3.11 cmake=3.14.0
conda activate mast3r

conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia

pip install -r requirements.txt
pip install -r dust3r/requirements.txt
pip install -r dust3r/requirements_optional.txt
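
Before going further, I'd recommend a quick sanity check that PyTorch actually sees your GPU, since everything downstream assumes CUDA works:

# Quick sanity check after the installs above: confirm PyTorch sees the GPU.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))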

๐Ÿ” ASMK Retrieval Setup

pip install cython

git clone https://github.com/jenicek/asmk
cd asmk/cython/
cythonize *.pyx
cd ..
pip install .
cd ..
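
A quick import test confirms the build worked (assuming the package installs under the name asmk, as in the upstream repo):

# If this import succeeds, the cythonized ASMK package is on your path.
import asmk
print("ASMK installed at:", asmk.__file__)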

โš™๏ธ CUDA Optimization (Optional)

cd dust3r/croco/models/curope/
python setup.py build_ext --inplace
cd ../../../../

📦 Download Checkpoints

mkdir -p checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_trainingfree.pth -P checkpoints/
wget https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_codebook.pkl -P checkpoints/
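
If you also plan to run the plain MASt3R demo, the base checkpoint (MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric.pth, from the same download location) is needed as well. A small Python check that the retrieval files landed where expected:

from pathlib import Path

# Local check only; file names match the wget commands above.
for name in [
    "MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_trainingfree.pth",
    "MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_codebook.pkl",
]:
    p = Path("checkpoints") / name
    if p.exists():
        print(f"{p}: OK ({p.stat().st_size / 1e6:.1f} MB)")
    else:
        print(f"{p}: MISSING")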

MASt3R vs MASt3R-SfM 🔄

Although both MASt3R and MASt3R-SfM share the same transformer backbone and DUSt3R foundation, they are optimized for different tasks:

🔹 Pose Output Compatibility

  • MASt3R: Focuses on dense stereo but does not export COLMAP-style poses.
  • MASt3R-SfM: Exports COLMAP-style poses, ideal for NeRF, Gaussian Splatting, or MVS.

🔹 Retrieval-Based View Selection

  • MASt3R-SfM introduces a retrieval module to select semantically/geometrically relevant views, improving performance in sparse or noisy datasets.

👉 In short: MASt3R is best for dense stereo, while MASt3R-SfM excels at sparse SfM in real-world scenes.


๐Ÿ” Reflections on Camera Pose Estimation

While GaussianObject is designed to work without COLMAP, accurate camera poses still play a crucial role in reconstruction quality. Even the COLMAP-free pipeline depends on transformer-based estimators like DUSt3R or MASt3R.

After using MASt3R, I became particularly interested in MASt3R-SfM, which adds retrieval-based conditioning for even better pose accuracy.


๐Ÿ” Why Retrieval-Based Conditioning Matters

Multi-view pose estimation often struggles when:

  • Input views have large viewpoint differences
  • Overlap is limited
  • Metadata is missing

MASt3R-SfM addresses this by:

  • Retrieving relevant reference views
  • Conditioning pose estimation on stronger view context
  • Producing more stable, accurate, and consistent camera poses

This makes it well-suited for unstructured, real-world datasets.
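
One way to picture the benefit: instead of matching all O(N²) image pairs, retrieval scores let you keep only the most promising pairs while guaranteeing every image stays connected. The sketch below uses a maximum spanning tree over a similarity matrix plus the top-scoring extra pairs; this exact graph construction is my illustrative assumption, not MASt3R-SfM's precise algorithm:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def select_pairs(sim, extra_pairs=10):
    """Pick image pairs to match: a spanning tree for connectivity,
    plus the highest-similarity remaining pairs.

    sim: symmetric (N, N) retrieval similarity matrix with a zero diagonal.
    """
    n = sim.shape[0]
    # Maximum spanning tree = minimum spanning tree on negated similarities.
    mst = minimum_spanning_tree(-sim).toarray()
    pairs = {tuple(sorted(p)) for p in zip(*np.nonzero(mst))}
    # Then add the best remaining pairs by retrieval score.
    candidates = [(sim[i, j], (i, j)) for i in range(n) for j in range(i + 1, n)
                  if (i, j) not in pairs]
    for _, pair in sorted(candidates, reverse=True)[:extra_pairs]:
        pairs.add(pair)
    return sorted(pairs)

# Toy usage: random symmetric similarities for 6 images.
rng = np.random.default_rng(0)
s = rng.random((6, 6))
s = (s + s.T) / 2
np.fill_diagonal(s, 0)
print(select_pairs(s, extra_pairs=3))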


🧪 My Results So Far

I tested MASt3R-SfM on several object-level datasets and found:

  • More accurate pose estimates vs. MASt3R and COLMAP
  • Improved downstream reconstructions (e.g., GaussianObject, NeRF)
  • Better view consistency and reduced reconstruction artifacts

What I'm Exploring Next 🧪

Given MASt3R-SfM's success in pose estimation, my next step is dense reconstruction on top of its poses. COLMAP's dense stage hasn't worked for me due to GPU constraints, so I'll try newer pipelines that accept external poses.


Final Thoughts 🧠

MASt3R-SfM has transformed my approach to sparse-view reconstruction. Whether using NeRF, Gaussian Splatting, or simple rendering, its pose estimates are the most stable and accurate I've seen.

If you're curious, check out the MASt3R-SfM and project repos linked above.


References
Leroy et al., Grounding Image Matching in 3D with MASt3R, arXiv 2024.
Wang et al., DUSt3R: Geometric 3D Vision Made Easy, CVPR 2024.
Duisterhof et al., MASt3R-SfM: a Fully-Integrated Solution for Unconstrained Structure-from-Motion, arXiv 2024.