ICG-MVSNet: Learning Intra-View and Cross-View Relationships for Guidance in Multi-View Stereo
Abstract—Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks overlook the [...] We propose an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to enhance robust cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently utilizes the contextual information from volume correlations to guide regularization. Our method is evaluated on the DTU dataset and Tanks and Temples benchmark, consistently achieving competitive performance against state-of-the-art works, while requiring lower computational resources.

Index Terms—Multi-View Stereo, 3D Reconstruction

Fig. 1. Comparison with state-of-the-art methods in runtime and GPU consumption on DTU [12]. Our method achieves state-of-the-art performance while maintaining efficient inference time and low memory usage. (Both panels plot the Overall error (mm), against Runtime (s) and Memory Consumption (GB), for CasMVSNet, EPP-MVSNet, NP-CVP-MVSNet, TransMVSNet, GeoMVSNet, PatchmatchNet, CDS-MVSNet, UniMVSNet, MVSTER, and Ours.)

[...] raw RGB image into the Feature Pyramid Network (FPN) [11] results in an overly large and resource-intensive model.
Fig. 2. The overall architecture. Our method is a coarse-to-fine framework that estimates depths from low resolution (stage ℓ) to high resolution (stage ℓ+1),
where ℓ = 0, 1, 2, resulting in a total of 4 stages. Features of reference and source images {F_i}_{i=0}^{N} are extracted by a feature pyramid network with the help
of Intra-View Fusion (IVF), whose details are illustrated in (a). The source image features are warped into the D frustum planes of the reference camera and
an element-wise multiplication is used to correlate each source image with the reference image. These correlations are aggregated into a single cost volume
C. In finer stages (stage 1, 2, and 3), both current and previous stage correlations are used in Cross-View Aggregation (CVA), whereas in stage 0, the cost
volume is not updated due to the absence of contextual correlations from a previous stage. Details of this process are illustrated in (b) and (c). Regularization
(3D CNN) yields the probability volume P , from which the depth hypothesis with the highest probability is selected for the final depth map. Depth maps
from multiple viewpoints are fused into a point cloud, in a non-learnable process.
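To make the cost-volume construction in the caption concrete, the following PyTorch-style sketch spells out the standard differentiable plane-sweep warping, element-wise correlation, and winner-take-all depth selection that such coarse-to-fine pipelines build on. The function names, tensor layout, and plain averaging over source views are illustrative assumptions rather than the authors' implementation; the IVF, CVA, and 3D-CNN blocks of Fig. 2 are intentionally omitted.

```python
# PyTorch-style sketch of the plane-sweep construction described in Fig. 2:
# source features are warped into the D frustum planes of the reference camera,
# correlated with the reference features by element-wise multiplication, and
# averaged into a single cost volume C. Illustrative only; the 3D-CNN
# regularization that produces the probability volume P is omitted.
import torch
import torch.nn.functional as F


def homo_warp(src_feat, src_proj, ref_proj, depth_values):
    """Warp src_feat [B,C,H,W] into the reference frustum.

    src_proj, ref_proj: [B,4,4] projection matrices (intrinsics @ extrinsics).
    depth_values: [B,D] depth hypotheses. Returns [B,C,D,H,W].
    """
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]

    proj = src_proj @ torch.inverse(ref_proj)            # reference -> source
    rot, trans = proj[:, :3, :3], proj[:, :3, 3:4]

    y, x = torch.meshgrid(
        torch.arange(H, dtype=torch.float32, device=src_feat.device),
        torch.arange(W, dtype=torch.float32, device=src_feat.device),
        indexing="ij")
    ones = torch.ones_like(x)
    xyz = torch.stack((x, y, ones)).view(3, -1).unsqueeze(0).expand(B, -1, -1)

    # Back-project every reference pixel onto each depth plane, then project
    # the resulting 3D points into the source view.
    pts = (rot @ xyz).unsqueeze(2) * depth_values.view(B, 1, D, 1) + trans.unsqueeze(2)
    xy = pts[:, :2] / pts[:, 2:3].clamp(min=1e-6)        # [B,2,D,H*W]

    grid = torch.stack((2 * xy[:, 0] / (W - 1) - 1,      # normalize for grid_sample
                        2 * xy[:, 1] / (H - 1) - 1), dim=-1).view(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.view(B, C, D, H, W)


def build_cost_volume(ref_feat, src_feats, ref_proj, src_projs, depth_values):
    """Aggregate element-wise correlations of all source views into one volume."""
    cost = 0.0
    for src_feat, src_proj in zip(src_feats, src_projs):
        warped = homo_warp(src_feat, src_proj, ref_proj, depth_values)
        cost = cost + ref_feat.unsqueeze(2) * warped     # element-wise correlation
    return cost / len(src_feats)                         # [B,C,D,H,W]


def wta_depth(prob_volume, depth_values):
    """Pick the depth hypothesis with the highest probability (prob_volume: [B,D,H,W])."""
    idx = prob_volume.argmax(dim=1)                      # [B,H,W]
    return depth_values.gather(1, idx.flatten(1)).view(idx.shape)
```

In the full pipeline of Fig. 2, the IVF-refined features would feed this construction and the CVA block would adjust the resulting volume before the 3D CNN.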
[...] different viewpoints under various lighting conditions. Tanks and Temples (T&T) [16] is a challenging and realistic dataset, which presents a valuable resource for evaluating MVS methods in demanding real-world scenarios. BlendedMVS [17] is a large-scale synthetic training dataset with 17,000+ images and precise ground truth 3D structures.

TABLE I
QUANTITATIVE POINT CLOUD EVALUATION ON DTU [12].

Method              Acc.↓ (mm)  Comp.↓ (mm)  Overall↓ (mm)  Time↓ (s)  GPU↓ (GB)
PatchmatchNet [21]  0.427       0.277        0.352          0.25       2.9
EPP-MVSNet [22]     0.413       0.296        0.355          0.52       8.2
CDS-MVSNet [23]     0.352       0.280        0.316          0.40       4.5
NP-CVP-MVSNet [24]  0.356       0.275        0.315          1.20       6.0
UniMVSNet [25]      0.352       0.278        0.315          0.29       3.3
TransMVSNet [5]     0.321       0.289        0.305          0.99       3.8
MVSTER∗ [6]         0.340       0.266        0.303          0.17       4.5
GeoMVSNet [9]       0.331       0.259        0.295          0.26       5.9
Ours                0.327       0.251        0.289          0.13       3.2
B. Implementation Details

Following the established paradigm, our model is trained on the DTU training set [12] and evaluated on the DTU testing set, using the same data split and view selection criteria as [1], [6], [9], [14] for comparability. Additionally, we fine-tune our model on BlendedMVS [17] and evaluate on T&T [16].

Training. We configure the number of input images to be N = 5 for the DTU [12], each with a resolution of 640 × 512 pixels. For the BlendedMVS [17], we employ a total of N = 7 images, each with a resolution of 768 × 576 pixels. We train the model with an initial learning rate of 0.001 for 15 epochs on one NVIDIA RTX 4090 GPU, with a batch size of 4.

Evaluation. For the DTU [12] evaluation, the images are cropped to a resolution of 1600 × 1152, and the number of [...]

TABLE II
DEPTH MAP ERRORS ON DTU [12]. THE ade REPRESENTS THE AVERAGE ABSOLUTE DEPTH ERROR (MM), WHILE tde(X) INDICATES THE PERCENTAGE OF PIXELS WITH AN ERROR ABOVE X MM. THE COLORS INDICATE RANKINGS, WITH RED REPRESENTING THE TOP POSITION, ORANGE INDICATING SECOND PLACE, AND YELLOW MARKING THIRD.

Method            ade↓     tde(1)↓  tde(2)↓  tde(4)↓  tde(8)↓  tde(16)↓
MVSNet [14]       14.7356  28.22    20.01    16.19    14.00    12.18
CasMVSNet [1]     8.4086   25.41    17.80    14.06    11.42    8.97
CVP-MVSNet [3]    6.9875   26.75    18.80    14.10    10.22    6.75
TransMVSNet [5]   15.1320  26.09    18.03    15.17    13.18    11.57
MVSTER [6]        9.3494   26.94    19.36    15.84    13.41    11.03
UniMVSNet [25]    11.5872  28.90    19.66    13.73    11.02    8.92
GeoMVSNet [9]     11.6586  24.06    16.53    13.28    11.08    9.21
Ours              6.6450   24.38    16.81    13.20    10.55    8.05
C. Benchmark Performance

DTU. We compare our results with traditional methods and learning-based methods. We selected some representative visualizations, as shown in Fig. 3. The depth map estimated by our method has a more complete and continuous surface and also has clearer outlines at the edges. Therefore, we can obtain accurate and complete point clouds, especially for the structures of the subject, even without rich textures such as the bird's head and statue edges. Using the official codes for point cloud evaluation, we present results in Tab. I. Our method ranks first among the methods evaluated. Methods with similar scores need more inference time and GPU resources, while those with similar resource use perform worse.
Fig. 3. Qualitative comparison with other methods on the DTU [12] dataset (scan4, scan10, and scan114; columns: Reference View, CasMVSNet (CVPR2020), UniMVSNet (CVPR2022), TransMVSNet (CVPR2022), MVSTER (ECCV2022), GeoMVSNet (CVPR2023), and Ours). The depth map estimated by our method has a more complete and continuous surface and also has clearer outlines at the edges.
TABLE III
QUANTITATIVE COMPARISON ON TANKS AND TEMPLES [16]. THE COLORS INDICATE RANKINGS, WITH RED REPRESENTING THE TOP POSITION, ORANGE INDICATING SECOND PLACE, AND YELLOW MARKING THIRD.
Method Mean↑ Family Francis Horse L.H. M60 Panther P.G. Train
COLMAP [20] 42.14 50.41 22.25 25.63 56.43 44.83 46.97 48.53 42.04
CasMVSNet [1] 56.42 76.36 58.45 46.20 55.53 56.11 54.02 58.17 46.56
PatchmatchNet [21] 53.15 66.99 52.64 43.24 54.87 52.87 49.54 54.21 50.81
UniMVSNet [25] 64.36 81.20 66.43 53.11 63.46 66.09 64.84 62.23 57.53
TransMVSNet [5] 63.52 80.92 65.83 56.94 62.54 63.06 60.00 60.20 58.67
MVSTER [6] 60.92 80.21 63.51 52.30 61.38 61.47 58.16 58.98 51.38
GeoMVSNet [9] 65.89 81.64 67.53 55.78 68.02 65.49 67.19 63.27 58.22
Ours 65.53 81.73 68.92 56.59 66.10 64.86 64.41 62.33 59.26
TABLE IV
ABLATIONS FOR PROPOSED MODULES. THE MODEL IS EVALUATED ON DTU [12]. BOTH MODULES CAN IMPROVE THE ACCURACY AND COMPLETENESS OF THE POINT CLOUD.

Method        Sec. II-A  Sec. II-B  Acc.↓ (mm)  Comp.↓ (mm)  Overall↓ (mm)
baseline                            0.350       0.276        0.313
+ intra view  ✓                     0.333       0.257        0.295
+ cross view             ✓          0.327       0.255        0.291
proposed      ✓          ✓          0.327       0.251        0.289
Point cloud fusion involves various tricks to optimize benchmark results, but this contradicts the essence of MVS. To demonstrate that our method produces better depth maps rather than achieving high scores through parameter tuning, we analyze the estimated depth maps directly. We calculate the mean absolute error (mm) and the percentage of pixels with errors greater than 1mm, 2mm, etc., as shown in Tab. II. Our depth maps have lower average errors and more pixels within specific error thresholds. While GeoMVSNet [9] slightly outperforms us at the 1mm and 2mm thresholds, it performs worse at other levels and has larger absolute depth errors, indicating potential reliance on point cloud fusion to eliminate outliers. This explains why their depth maps are either highly accurate or significantly off, yet achieve strong point cloud scores.
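For reference, the depth-map statistics reported in Tab. II can be computed as in the short sketch below; the function name, the validity-mask convention (ground-truth depth 0 marks invalid pixels), and the threshold list are assumptions for illustration, not the authors' evaluation script.

```python
# Sketch of the Tab. II metrics: ade is the mean absolute depth error (mm) over
# valid ground-truth pixels, and tde(X) is the percentage of valid pixels whose
# absolute error exceeds X mm. Mask handling and names are illustrative.
import numpy as np


def depth_error_metrics(pred, gt, thresholds=(1, 2, 4, 8, 16)):
    """pred, gt: HxW depth maps in millimetres; gt == 0 marks invalid pixels."""
    valid = gt > 0
    abs_err = np.abs(pred[valid] - gt[valid])
    metrics = {"ade": float(abs_err.mean())}
    for x in thresholds:
        metrics[f"tde({x})"] = 100.0 * float((abs_err > x).mean())
    return metrics
```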
Tanks and Temples. We further validate the generalization capability of our method on the T&T [16] and report quantitative results in Tab. III. Overall, our method achieves comparable precision and recall to GeoMVSNet [9], but with the added benefit of reduced GPU memory usage: GeoMVSNet [9] needs 8.85 GB, whereas our method only requires 7.37 GB.

D. Ablation Study

Tab. IV and Tab. V show the ablation results of our method. The baseline MVSTER [6] method is re-implemented without the monocular depth estimator.

Intra-View Fusion. As shown in Tab. IV, Intra-View Fusion, which encodes both long-range dependencies and feature channel relationships, can significantly improve the accuracy and completeness. This is because our method effectively captures coordinate information along with precise positional details, which is crucial for accurate understanding.
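As an illustration of the kind of operation this paragraph describes, the block below is a minimal coordinate-attention-style sketch: it pools features along the height and width axes to capture long-range dependencies per direction and re-weights channels with the resulting positional descriptors. It is only one plausible instantiation under these assumptions, not the paper's IVF module, and all names are hypothetical.

```python
# Hypothetical coordinate-attention-style block: encodes long-range dependencies
# along each spatial axis plus channel relationships, then re-weights the input.
# Illustrative sketch only; not the IVF module from the paper.
import torch
import torch.nn as nn


class CoordFusion(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        hidden = max(channels // reduction, 4)
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Direction-aware pooling: one descriptor per row and per column.
        pool_h = x.mean(dim=3, keepdim=True)                       # [B,C,H,1]
        pool_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # [B,C,W,1]
        y = self.conv1(torch.cat([pool_h, pool_w], dim=2))         # [B,hidden,H+W,1]
        y_h, y_w = torch.split(y, [h, w], dim=2)
        # Per-row and per-column attention maps, broadcast over the image.
        a_h = torch.sigmoid(self.conv_h(y_h))                      # [B,C,H,1]
        a_w = torch.sigmoid(self.conv_w(y_w)).permute(0, 1, 3, 2)  # [B,C,1,W]
        return x * a_h * a_w
```

In the pipeline of Fig. 2(a), a block of this kind would sit inside the feature pyramid network, refining each stage's features before cost matching.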
Cross-View Aggregation. As shown in Tab. IV, the cost volume aggregation equipped with the cross-view correlation derived from the previous stage can effectively improve the score, because the correlations offer relationships that describe the positions of pixels, which means that some adjacent pixels are implicitly constrained to lie on a plane; this especially helps in texture-less regions. Besides, the contextual information extracted from correlations highlights the effect of various viewpoints, which enhances the robustness of the method to handle unreliable inputs.
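As a rough illustration of how a coarser-stage correlation could guide the current cost volume, the sketch below upsamples a single-channel correlation from the previous stage, concatenates it with a single-channel summary of the current volume (mirroring the 1+1 channel setting reported as final in Tab. V), and turns the pair into a gating map applied before 3D regularization. The module structure and the sigmoid gating are assumptions, not the paper's CVA design.

```python
# Hypothetical cross-view aggregation sketch: a single-channel correlation from
# the previous (coarser) stage is upsampled and combined with a single-channel
# summary of the current cost volume to modulate it before 3D regularization
# (Num_c = Num_f = 1). Illustrative only; not the paper's CVA module.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossViewAggregation(nn.Module):
    def __init__(self):
        super().__init__()
        # Tiny 3D conv head turning the 2-channel guidance into a gating map.
        self.gate = nn.Sequential(
            nn.Conv3d(2, 8, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(8, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, cost, prev_corr):
        """cost: [B,C,D,H,W] current volume; prev_corr: [B,1,D',H',W'] from stage l."""
        # Bring the coarser-stage correlation to the current volume's resolution.
        prev_up = F.interpolate(prev_corr, size=cost.shape[2:],
                                mode="trilinear", align_corners=False)
        cur = cost.mean(dim=1, keepdim=True)   # single-channel summary of current stage
        g = self.gate(torch.cat([prev_up, cur], dim=1))
        return cost * g                        # contextual re-weighting of the volume
```

At stage 0 no previous-stage correlation exists, which matches the note in Fig. 2 that the coarsest cost volume is left unmodified.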
The number of channels in cross-view aggregation. As shown in Tab. V, the cross-view correlation between the current stage and the previous stage can provide additional information to enhance performance. We conducted ablations by varying the number of channels used for correlation, testing up to 4 channels within the bounds of the original volume. We found that more channels did not significantly improve completeness but slightly decreased accuracy, resulting in a minor overall score reduction. This suggests that while correlation data captures essential trends across viewpoints, additional channels may introduce noise without significant benefit, and increase memory usage. Therefore, we optimized our model by limiting the channels to 1.
TABLE V
ABLATIONS FOR THE NUMBER OF CHANNELS OF CROSS-VIEW AGGREGATION. THE MODEL IS EVALUATED ON DTU [12]. THE Num_c IS THE NUMBER OF CHANNELS FROM THE PREVIOUS STAGE AND THE Num_f IS THE NUMBER OF CHANNELS FROM THE CURRENT STAGE.

Method               Num_c  Num_f  Acc.↓ (mm)  Comp.↓ (mm)  Overall↓ (mm)
baseline+intra-view  0      0      0.333       0.257        0.295
                     0      1      0.3271      0.2585       0.2928
                     1      0      0.3300      0.2582       0.2941
                     2      2      0.3296      0.2492       0.2894
                     3      3      0.3323      0.2503       0.2913
                     4      4      0.3280      0.2544       0.2912
proposed             1      1      0.3270      0.2510       0.2890
IV. CONCLUSION

In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we construct Intra-View Fusion, which leverages the positional knowledge within a single image, enhancing robust cost matching without incorporating complicated external dependencies. Besides, we introduce a lightweight Cross-View Aggregation that efficiently utilizes the cross-scale contextual information from correlations. The proposed method is extensively evaluated on public datasets, consistently achieving competitive performance against the state of the art, while requiring no extra input and lower computational resources. Our method still needs large labeled datasets to train the model. Future work could focus on developing unsupervised methods to overcome these limitations.
REFERENCES

[1] Xiaodong Gu, Zhiwen Fan, et al., "Cascade cost volume for high-resolution multi-view stereo and stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2495–2504.
[2] Shuo Cheng, Zexiang Xu, et al., "Deep stereo using adaptive thin volume representation with uncertainty awareness," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2524–2534.
[3] Jiayu Yang, Wei Mao, et al., "Cost volume pyramid based depth inference for multi-view stereo," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4877–4886.
[4] Tianqi Liu, Xinyi Ye, et al., "When epipolar constraint meets non-local operators in multi-view stereo," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 18088–18097.
[5] Yikang Ding, Wentao Yuan, et al., "Transmvsnet: Global context-aware multi-view stereo network with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8585–8594.
[6] Xiaofeng Wang, Zheng Zhu, et al., "Mvster: Epipolar transformer for efficient multi-view stereo," in European Conference on Computer Vision. Springer, 2022, pp. 573–591.
[7] Chenjie Cao, Xinlin Ren, et al., "Mvsformer: Multi-view stereo by learning robust image features and temperature-based depth," Transactions of Machine Learning Research, 2023.
[8] Xinlin Ren, Chenjie Cao, et al., "Mvsformer++: Revealing the devil in transformer's details for multi-view stereo," in International Conference on Learning Representations (ICLR), 2024.
[9] Zhe Zhang, Rui Peng, et al., "Geomvsnet: Learning multi-view stereo with geometry perception," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21508–21518.
[10] Jiang Wu, Rui Li, et al., "Gomvs: Geometrically consistent cost aggregation for multi-view stereo," in CVPR, 2024.
[11] Tsung-Yi Lin, Piotr Dollár, et al., "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[12] Rasmus Jensen, Anders Dahl, et al., "Large scale multi-view stereopsis evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 406–413.
[13] Xiaoyang Guo, Kai Yang, et al., "Group-wise correlation stereo network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3273–3282.
[14] Yao Yao, Zixin Luo, et al., "Mvsnet: Depth inference for unstructured multi-view stereo," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 767–783.
[15] Yao Yao, Zixin Luo, et al., "Recurrent mvsnet for high-resolution multi-view stereo depth inference," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5525–5534.
[16] Arno Knapitsch, Jaesik Park, et al., "Tanks and temples: Benchmarking large-scale scene reconstruction," ACM Transactions on Graphics, vol. 36, no. 4, 2017.
[17] Yao Yao, Zixin Luo, et al., "Blendedmvs: A large-scale dataset for generalized multi-view stereo networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1790–1799.
[18] Jianfeng Yan, Zizhuang Wei, et al., "Dense hybrid recurrent multi-view stereo net with dynamic consistency checking," in European Conference on Computer Vision. Springer, 2020, pp. 674–689.
[19] Silvano Galliani, Katrin Lasinger, et al., "Massively parallel multiview stereopsis by surface normal diffusion," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 873–881.
[20] Johannes L Schonberger and Jan-Michael Frahm, "Structure-from-motion revisited," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4104–4113.
[21] Fangjinhua Wang, Silvano Galliani, et al., "Patchmatchnet: Learned multi-view patchmatch stereo," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14194–14203.
[22] Xinjun Ma, Yue Gong, et al., "Epp-mvsnet: Epipolar-assembling based depth prediction for multi-view stereo," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5732–5740.
[23] Khang Truong Giang, Soohwan Song, et al., "Curvature-guided dynamic scale networks for multi-view stereo," arXiv:2112.05999, 2021.
[24] Jiayu Yang, Jose M Alvarez, et al., "Non-parametric depth distribution modelling based depth inference for multi-view stereo," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8626–8634.
[25] Rui Peng, Rongjie Wang, et al., "Rethinking depth estimation for multi-view stereo: A unified representation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8645–8654.
V. SUPPLEMENTARY MATERIAL

In the supplementary material, we present more details that are not included in the main text, including:
• Depth map fusion method which integrates depth maps into 3D point clouds.
• Extended comparison with additional recent works based on reviewer suggestions.
• Depth map visualizations: comparisons between our method and other approaches.
• Point cloud visualizations: comparisons with other methods and our final reconstructed point clouds on DTU [12] and Tanks and Temples [16].
A. Depth Map Fusion

In previous Multi-View Stereo (MVS) approaches, various fusion techniques have been utilized to integrate predicted depth maps from multiple viewpoints into a coherent point cloud. In this study, we adopt the dynamic checking strategy proposed in [18] for depth filtering and fusion.

On the DTU [12] dataset, we filter the confidence map of the final stage using a confidence threshold to assess photometric consistency. For geometric consistency, we apply the following criteria:

err_c < thresh_c,   err_d < thresh_d     (12)
Here, err_c and err_d denote the reprojection coordinate error and the relative reprojection depth error, respectively; thresh_c and thresh_d indicate the corresponding thresholds.
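A minimal sketch of such a two-view geometric check is given below, following the common reproject-and-compare scheme with nearest-neighbor depth lookup; the helper names and threshold defaults are illustrative assumptions, and this is not the exact dynamic checking implementation of [18].

```python
# Sketch of the geometric consistency test in Eq. (12): reference depths are
# projected into a source view, the source depth is looked up and projected
# back, and a pixel is kept only if the reprojection coordinate error err_c and
# the relative depth error err_d are below their thresholds. Illustrative only.
import numpy as np


def project(pts_cam, K):
    """Camera-space points [3,N] -> pixel coords [2,N] and depths [N]."""
    uv = K @ pts_cam
    return uv[:2] / np.clip(uv[2:], 1e-6, None), pts_cam[2]


def backproject(uv, depth, K):
    """Pixel coords [2,N] + depths [N] -> camera-space points [3,N]."""
    pix = np.vstack([uv, np.ones_like(uv[:1])])
    return np.linalg.inv(K) @ pix * depth.reshape(1, -1)


def geometric_consistency(depth_ref, depth_src, K_ref, K_src, R, t,
                          thresh_c=1.0, thresh_d=0.01):
    """R, t map reference-camera coordinates to source-camera coordinates."""
    h, w = depth_ref.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    uv_ref = np.stack([x.ravel(), y.ravel()]).astype(np.float64)   # [2,HW]

    # Reference pixels -> 3D -> source image.
    pts_ref = backproject(uv_ref, depth_ref.ravel(), K_ref)
    uv_src, _ = project(R @ pts_ref + t.reshape(3, 1), K_src)

    # Nearest-neighbor lookup of the source depth at the projected location.
    u = np.clip(np.round(uv_src[0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv_src[1]).astype(int), 0, h - 1)
    d_src = depth_src[v, u]

    # Source pixels -> 3D -> back into the reference view.
    pts_src = backproject(uv_src, d_src, K_src)
    uv_back, d_back = project(R.T @ (pts_src - t.reshape(3, 1)), K_ref)

    err_c = np.hypot(uv_back[0] - uv_ref[0], uv_back[1] - uv_ref[1]).reshape(h, w)
    err_d = (np.abs(d_back - depth_ref.ravel()) /
             np.clip(depth_ref.ravel(), 1e-6, None)).reshape(h, w)
    return (err_c < thresh_c) & (err_d < thresh_d)                 # Eq. (12) mask
```

In practice, per-view masks of this kind are typically combined with the photometric confidence filter described above before the surviving pixels are back-projected into the fused point cloud.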
On the Tanks and Temples [16] benchmark, we adjust hyperparameters for each scene following the approach outlined in [25], including confidence thresholds and geometric criteria. Our model is fine-tuned on the BlendedMVS dataset [17] for [...]

B. Extended Comparison

[...] significantly better accuracy, completeness, and overall performance. (2) GoMVS relies on additional normal inputs. In particular, it requires normal map estimation as a preprocessing step, which imposes a strong planar constraint that undoubtedly improves its benchmark performance. However, this also means that the results are highly dependent on the accuracy of the normal estimation method. Moreover, even if we disregard the preprocessing time required to obtain normal maps, GoMVS still consumes more than twice the memory and takes over three times longer to run compared to the proposed method. (3) MVSFormer++ relies heavily on a pre-trained model. While its overall score is better, it is crucial to note that MVSFormer++ benefits from extensive engineering and experimental optimizations. It not only leverages a pre-trained DINOv2 model for feature extraction but also incorporates efficiency-focused components like FlashAttention. However, these optimizations come at the cost of significantly increased model complexity: MVSFormer++ has over 30 times more parameters than our model (783 MB vs. 20 MB). Moreover, its training is highly resource-intensive, requiring four A6000 GPUs (48 GB each) for a full day, whereas our model can be trained in less than a day using just a single RTX 4090. In addition, MVSFormer++ demands more GPU memory and longer inference time, as shown in Table VI, making it less efficient in practical deployment.

TABLE VI
QUANTITATIVE COMPARISON ON DTU. THE POINT CLOUD METRICS ARE TAKEN FROM THE RESPECTIVE PAPERS. ALL RUNTIME AND MEMORY USAGE DATA ARE MEASURED ON THE SAME HARDWARE (RTX 4090) FOR A FAIR COMPARISON. NOTE THAT CT-MVSNET DOES NOT PROVIDE A PRE-TRAINED MODEL, WITHOUT WHICH WE COULD NOT TEST ON OUR MACHINE.

Method   Acc.↓ (mm)   Comp.↓ (mm)   Overall↓ (mm)   Time↓ (s)   GPU↓ (GB)
[...]
Fig. 4. Qualitative comparison with other methods on DTU [12]. The depth map estimated by our method has a more complete and continuous surface
and also has clearer outlines at the edges.
Fig. 5. Point cloud error comparison of state-of-the-art methods (MVSTER (ECCV2022), GeoMVSNet (CVPR2023), and Ours) on the Tanks and Temples dataset [16]. τ is the scene-relevant distance threshold determined officially, and darker means larger error. The first row shows Precision and the second row shows Recall. Taking the Horse in the intermediate subset as an example, our method is able to reduce large amounts of outliers while ensuring completeness.
Fig. 6. Point clouds reconstructed for all scenes from the DTU [12] dataset.
Fig. 7. Point clouds reconstructed for all scenes from the Tanks and Temples [16] dataset.