ICG-MVSNet: Learning Intra-view and Cross-view

Relationships for Guidance in Multi-View Stereo


Yuxi Hu^1, Jun Zhang^1, Zhe Zhang^2, Rafael Weilharter^1,
Yuchen Rao^1, Kuangyi Chen^1, Runze Yuan^1, Friedrich Fraundorfer^1
^1 Graz University of Technology   ^2 Peking University

arXiv:2503.21525v1 [cs.CV] 27 Mar 2025

Abstract—Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks overlook the geometric information embedded in features and correlations, leading to weak cost matching. In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we develop an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to enhance robust cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently utilizes the contextual information from volume correlations to guide regularization. Our method is evaluated on the DTU dataset and the Tanks and Temples benchmark, consistently achieving competitive performance against state-of-the-art works while requiring lower computational resources.

Index Terms—Multi-View Stereo, 3D Reconstruction

Fig. 1. Comparison with state-of-the-art methods in runtime and GPU memory consumption on DTU [12] (Overall error in mm vs. runtime in s, and vs. memory consumption in GB; methods shown: CasMVSNet, EPP-MVSNet, NP-CVP-MVSNet, TransMVSNet, GeoMVSNet, PatchmatchNet, CDS-MVSNet, UniMVSNet, MVSTER, Ours). Our method achieves state-of-the-art performance while maintaining efficient inference time and low memory usage.

I. INTRODUCTION

Multi-view Stereo (MVS) is a fundamental area in computer vision that aims to reconstruct 3D geometry from an array of overlapping images. This research domain has evolved significantly over the years, driving progress in autonomous driving and virtual reality. Existing methods typically utilize deep learning to construct cost volumes from multiple camera views and estimate depth maps, thereby simplifying the complex reconstruction task into manageable steps. This depth map-based strategy enhances flexibility and robustness through per-view depth estimation and point cloud fusion.

Recent progress in MVS has featured cascade-based architectures, which adopt a hierarchical approach. Notable examples include [1]–[4], which refine predictions progressively from coarse to fine, gradually narrowing down depth hypotheses to optimize computational efficiency. Other strategies, including transformer-based techniques (e.g., [5]–[8]), employ carefully designed external structures to enhance feature extraction. However, they typically miss out on valuable contextual priors and primarily concentrate on pixel-level depth attributes. Several recent papers have attempted to incorporate geometric information into MVS [9], [10]. In particular, GeoMVSNet [9] proposes using Gaussian mixture models to represent geometric scene information. However, the feature fusion module they design, which embeds the depth map and raw RGB image into the Feature Pyramid Network (FPN) [11], results in an overly large and resource-intensive model.

Diverging from current approaches, we propose investigating the information embedded in single-view features and cross-view correlations. Specifically, we leverage positional information in the FPN [11] and capture dependencies in one coordinate direction while retaining positional information in the other. Meanwhile, coarse cross-view correlations with abundant geometric information from different views are exploited to guide the correlation distributions of the finer stages. We found that the pair-wise correlations within cost volumes carry rich information from multiple perspectives. Connections exist not only between feature channels under the same depth hypothesis and the same feature channel under different depth hypotheses, but also between different channels at different depth hypotheses. Therefore, we design a cross-view aggregation scheme to directly process correlations across stages, depth hypotheses, and features, resulting in lightweight but robust cost matching.

Our main contributions are summarized as follows.
• We propose ICG-MVSNet, which includes Intra-View Fusion (IVF), a module that embeds positional information along two coordinate directions into the feature maps of a single image for robust matching,
• and Cross-View Aggregation (CVA), a lightweight cross-view aggregation scheme that efficiently utilizes the prior guidance from previous correlations.
• We conduct extensive experimental comparisons against state-of-the-art methods. The results demonstrate that the proposed method achieves competitive performance with an optimal balance between effectiveness and computational efficiency, as shown in Fig. 1.

This work was supported by the China Scholarship Council (ID: 202208440157). Correspondence to: [email protected], [email protected]. Code and supplementary material are available at: https://github.com/YuhsiHu/ICG-MVSNet
Fig. 2. The overall architecture. Our method is a coarse-to-fine framework that estimates depths from low resolution (stage ℓ) to high resolution (stage ℓ+1), where ℓ = 0, 1, 2, resulting in a total of 4 stages. Features of the reference and source images {F_i}_{i=0}^{N} are extracted by a feature pyramid network with the help of Intra-View Fusion (IVF), whose details are illustrated in (a). The source image features are warped into the D frustum planes of the reference camera, and an element-wise multiplication is used to correlate each source image with the reference image. These correlations are aggregated into a single cost volume C. In the finer stages (stages 1, 2, and 3), both the current and previous stage correlations are used in Cross-View Aggregation (CVA), whereas in stage 0 the cost volume is not updated due to the absence of contextual correlations from a previous stage. Details of this process are illustrated in (b) and (c). Regularization (3D CNN) yields the probability volume P, from which the depth hypothesis with the highest probability is selected for the final depth map. Depth maps from multiple viewpoints are fused into a point cloud in a non-learnable post-processing step.

II. METHOD

Given a set of images, I_0 denotes the reference image for which the depth is to be estimated, while {I_i}_{i=1}^{N} represents the N source images, adjacent images that serve as auxiliary input for estimating the depth of the reference view. Our network estimates the depth map with width W and height H.

We employ a coarse-to-fine network for depth estimation. For each pixel, we uniformly sample D discrete depth values within the range defined by the minimum and maximum depths [d_min, d_max]. The depth hypothesis d with the highest probability is selected and also used as the center for the next stage. A narrower range of depth hypotheses is then generated around this center, enabling progressively finer depth estimation that converges toward the true depth value. For the number of depth hypothesis planes, we set D to 8, 8, 4, 4 for the successive stages ℓ, ensuring an appropriate balance between precision and computational efficiency.
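For illustration, the sketch below shows one plausible way to generate the per-pixel depth hypotheses described above; the uniform spacing and the range-shrinking rule are assumptions for this sketch, not the released implementation.

```python
import torch

def sample_depth_hypotheses(prev_depth, d_min, d_max, num_planes, shrink=0.5):
    """Sketch of coarse-to-fine depth sampling (assumed schedule, not the authors' exact code).

    prev_depth: (B, H, W) depth map from the previous stage, or None at stage 0.
    Returns depth hypotheses broadcastable to (B, D, H, W).
    """
    if prev_depth is None:
        # Stage 0: D uniform samples over the full range [d_min, d_max].
        d = torch.linspace(d_min, d_max, num_planes)
        return d.view(1, num_planes, 1, 1)
    # Finer stages: a narrower band centered on the previous estimate.
    half_range = 0.5 * shrink * (d_max - d_min)          # assumed shrinking rule
    offsets = torch.linspace(-1.0, 1.0, num_planes, device=prev_depth.device)
    hyps = prev_depth.unsqueeze(1) + offsets.view(1, num_planes, 1, 1) * half_range
    return hyps.clamp(min=d_min, max=d_max)

# With D = 8, 8, 4, 4 over the four stages, each call uses the previous stage's
# depth (upsampled to the current resolution) as the center of the new hypotheses.
```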
The overall architecture of our network is illustrated in Fig. 2. Image features {F_i}_{i=0}^{N} of both the reference and source images are first extracted by the FPN [11]. Then the features of the source views are warped into the D fronto-parallel planes of the reference camera frustum, denoted as {V_i}_{i=1}^{N}. The feature volume of the reference view, V_0, is obtained by expanding the reference feature F_0 to the D depth hypotheses. Element-wise multiplication is performed to obtain the correlation between each warped source volume V_i and the reference volume V_0. After that, the multiple feature correlation volumes are aggregated into a single cost volume C ∈ R^{G×D×H×W}, where G is the group-wise correlation channel [13]. In the finer stages, the cross-view aggregation module uses both the correlation of the current stage and that of the previous stage. Afterward, a 3D CNN is applied to obtain the probability volume P ∈ R^{D×H×W}, from which the depth hypothesis with the largest probability is selected at each pixel to obtain the final depth map. With the depth map, the 3D point cloud is generated via fusion.

A. Feature Extraction

Existing works [14], [15] extract deep features {F_i}_{i=0}^{N} from the input images {I_i}_{i=0}^{N} using the FPN [11], but do not fully explore the intra-view knowledge. GeoMVSNet [9] highlights the value of pixel coordinates by integrating depth and RGB images for feature extraction, but this increases GPU usage and runtime due to the added channels and separate convolutions. Moreover, GeoMVSNet [9] assumes that the absolute values of coordinates contribute to feature extraction, an assumption that may lack generalization. We instead focus on the relative importance of positions, which is sufficient for capturing coordinate information. Unlike previous methods that rely on Transformers or pre-trained models [5], [7], we exploit intra-view relationships by multiplying features with two attention maps that capture long-range dependencies.

Intra-View Fusion (IVF). The Intra-View Fusion block is a lightweight unit that aims to enhance the expressive ability of the features and stands out for its simplicity and efficiency. It encodes both long-range dependencies and feature-channel relationships with positional information to capture valuable information in images. For each feature F_i in {F_i}_{i=0}^{N}, we omit the index i and denote the pyramid level by ℓ. At level ℓ, the feature F^ℓ has lower resolution, while at level ℓ+1, the feature F^{ℓ+1} has twice the height H and width W. We first calculate the weights T_h and T_w using Eqn. 1. For a feature x in the c-th channel, the pooling operation at height h and width w is calculated as follows:

T_h(h) = (1/W) Σ_{i=0}^{W} x(h, i),   T_w(w) = (1/H) Σ_{j=0}^{H} x(j, w).   (1)

Following this, the weights T_h(h) and T_w(w) are concatenated and passed through a convolution layer for a coordinate fusion, which integrates the height and width importance into a unified representation. The representation is then split again to produce two separate attention maps A_h and A_w, which encode dependencies along one coordinate direction and preserve positional information along the other. Next, the attention maps A_h and A_w are broadcast to match the shape of F^{ℓ+1} and applied to the feature F^{ℓ+1} using pixel-wise multiplication. Subsequently, the feature F^ℓ is upsampled and merged through addition. The fused intra-view feature at the finer stage, F^{ℓ+1}_{fused}, is formulated as:

F^{ℓ+1}_{fused} = F^{ℓ}_{↑} ⊕ F^{ℓ+1}_{A},   (2)

F^{ℓ+1}_{A} = A_h ⊙ A_w ⊙ F^{ℓ+1},   (3)

where ↑, ⊙, and ⊕ represent upsampling, pixel-wise multiplication, and addition, respectively. F^{ℓ+1}_{A} is the updated feature after applying the attention maps. This simple adjustment constructs features based on coordinates, preserving robust constraints and establishing a foundation for the subsequent multi-view correlation calculations.
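A minimal PyTorch sketch of Eqns. (1)–(3) is given below. The bottleneck width, the sigmoid gating, and the exact layer choices are assumptions made for illustration rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class IntraViewFusion(nn.Module):
    """Sketch of the IVF block (Eqns. 1-3); layer sizes here are assumptions."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.conv_fuse = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, f_coarse_up, f_fine):
        # f_fine: F^{l+1} of shape (B, C, H, W); f_coarse_up: F^l upsampled to (B, C, H, W).
        b, c, h, w = f_fine.shape
        t_h = f_fine.mean(dim=3, keepdim=True)                  # Eq. (1): pool along width  -> (B, C, H, 1)
        t_w = f_fine.mean(dim=2, keepdim=True)                  # Eq. (1): pool along height -> (B, C, 1, W)
        y = torch.cat([t_h, t_w.transpose(2, 3)], dim=2)        # concatenate the two directions
        y = self.conv_fuse(y)                                   # joint coordinate fusion
        y_h, y_w = torch.split(y, [h, w], dim=2)                # split back into the two directions
        a_h = torch.sigmoid(self.conv_h(y_h))                   # attention map A_h, (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))   # attention map A_w, (B, C, 1, W)
        f_att = f_fine * a_h * a_w                              # Eq. (3), broadcast multiplication
        return f_coarse_up + f_att                              # Eq. (2), addition with the upsampled F^l
```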
B. Aggregation

With the features extracted, homography warping is performed to transform the source view features into the reference perspective. Each source feature F ∈ R^{G×H×W} is warped into the D fronto-parallel planes of the reference viewpoint to construct the feature volumes {V_i}_{i=1}^{N}. The D depth hypotheses enable matching: the correct depth yields a high feature correlation. The differentiable homography H_i(d), which warps each pixel at depth d, is given by:

H_i(d) = K_i · R_i · (I − (t_0 − t_i) · n_0^⊤ / d) · R_0^⊤ · K_0^{−1},   (4)

where {K_i, R_i, t_i}_{i=0}^{N} denote the intrinsics, rotations, and translations, respectively, n_0 represents the normal axis of the reference viewpoint, and I is the identity matrix.
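As a concrete illustration, the sketch below warps a source feature map into the reference frustum with differentiable bilinear sampling. It assumes full 4×4 projection matrices (intrinsics composed with extrinsics) per view rather than composing Eq. (4) term by term; shapes and conventions follow common MVSNet-style implementations and may differ from the authors' code.

```python
import torch
import torch.nn.functional as F

def homography_warp(src_feat, src_proj, ref_proj, depth_values):
    """Warp src_feat (B, C, H, W) into the reference frustum for D depth planes.

    src_proj, ref_proj: (B, 4, 4) projection matrices (intrinsics @ extrinsics),
    depth_values: (B, D) depth hypotheses. Returns a volume of shape (B, C, D, H, W).
    """
    B, C, H, W = src_feat.shape
    D = depth_values.shape[1]
    # Relative transform from the reference camera to the source camera.
    proj = src_proj @ torch.inverse(ref_proj)
    rot, trans = proj[:, :3, :3], proj[:, :3, 3:4]

    # Homogeneous pixel grid of the reference view.
    y, x = torch.meshgrid(torch.arange(H, dtype=torch.float32, device=src_feat.device),
                          torch.arange(W, dtype=torch.float32, device=src_feat.device),
                          indexing="ij")
    xyz = torch.stack([x, y, torch.ones_like(x)]).view(3, -1).unsqueeze(0).expand(B, 3, H * W)

    # Back-project onto each depth plane, then project into the source view.
    rot_xyz = rot @ xyz                                             # (B, 3, H*W)
    pts = rot_xyz.unsqueeze(2) * depth_values.view(B, 1, D, 1)      # (B, 3, D, H*W)
    pts = pts + trans.view(B, 3, 1, 1)
    grid = pts[:, :2] / (pts[:, 2:3] + 1e-6)                        # pixel coordinates in the source view

    # Normalize to [-1, 1] for grid_sample and gather the warped features.
    gx = grid[:, 0] / ((W - 1) / 2.0) - 1.0
    gy = grid[:, 1] / ((H - 1) / 2.0) - 1.0
    grid = torch.stack([gx, gy], dim=-1).view(B, D * H, W, 2)
    warped = F.grid_sample(src_feat, grid, mode="bilinear",
                           padding_mode="zeros", align_corners=True)
    return warped.view(B, C, D, H, W)
```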
Then, element-wise multiplication is applied to the reference feature volume V_0 and each warped source feature volume {V_i}_{i=1}^{N} to obtain the pair-wise correlation:

Corr_i = V_0 ⊙ V_i.   (5)

Therefore, we get the 3D feature correlations Corr ∈ R^{G×D×H×W} between one pair of the reference image and a source image. Typically, those 3D feature correlations {Corr_i}_{i=1}^{N} are then aggregated into a single cost volume C ∈ R^{G×D×H×W}, which reflects the overall correlations of reference pixels and source pixels on the discrete depth planes. To further enhance the robustness and precision of the depth estimation, we integrate cross-view context into the cost volume. We use correlations from both the previous and current stages as guidance to extract contextual priors.

Cross-View Aggregation (CVA). In the first stage, we use the basic aggregation commonly employed in previous works [6], [9]. The pair-wise attention weight W_i is normalized by the number of channels and a temperature scaling factor ϵ, and the correlations are aggregated into one cost volume C ∈ R^{G×D×H×W}:

W_i = softmax( (Σ_{g=1}^{G} Corr_i[g]) / ϵ ),   (6)

C = ( Σ_{i=1}^{N} W_i ⊙ Corr_i ) / ( Σ_{i=1}^{N} W_i ).   (7)

Although this operation assigns a weight to each source view based on the sum of the feature values in that view, it fails to integrate low-level semantic information from the previous stage or to capture the global connections between feature channels and depth hypotheses across multiple views. Therefore, we design a lightweight module to extract cross-scale information from the cross-view correlation.
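For reference, the baseline first-stage aggregation of Eqns. (5)–(7) can be sketched as follows. The group-wise reduction follows [13], while the softmax axis (taken here over the depth dimension), the channel normalization via the mean, and the default ϵ are assumptions for illustration and may differ from the released code.

```python
import torch

def groupwise_correlation(ref_vol, src_vol, G):
    """Eq. (5) combined with group-wise correlation [13]:
    (B, C, D, H, W) x (B, C, D, H, W) -> (B, G, D, H, W)."""
    B, C, D, H, W = ref_vol.shape
    prod = (ref_vol * src_vol).view(B, G, C // G, D, H, W)
    return prod.mean(dim=2)

def first_stage_aggregation(corrs, eps=1.0):
    """Baseline aggregation of Eqns. (6)-(7) over per-source correlations (a sketch).

    corrs: list of tensors, each (B, G, D, H, W).
    """
    weights = []
    for corr in corrs:
        # Eq. (6): mean over G implements the channel normalization; softmax over depth (assumed axis).
        w = torch.softmax(corr.mean(dim=1, keepdim=True) / eps, dim=2)   # (B, 1, D, H, W)
        weights.append(w)
    numerator = sum(w * c for w, c in zip(weights, corrs))               # Eq. (7), element-wise
    denominator = sum(weights)
    return numerator / (denominator + 1e-6)
```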
As shown in Fig. 2(b), the feature channel G and the depth channel D of the correlations are integrated to capture global correlation distribution information. Previous approaches such as GeoMVSNet [9] process these channels separately using additional 3D CNNs, neglecting the interdependencies between the depth and feature dimensions. Besides, GeoMVSNet incorporates geometry information by combining the probability volume and the cost volume, which leads to computational overhead. In contrast, our method employs a 2D CNN to efficiently compress the cross-scale contextual knowledge into a single channel. This approach significantly enhances efficiency.

We flatten the feature channels G and the depth hypotheses D into a single dimension. This operation reformulates the correlation tensor into a shape of (G × D, H, W), where each flattened channel encodes combined information from both the feature and depth domains. This joint representation allows for the simultaneous processing of spatial and depth-aware features, enhancing the model's ability to capture multi-dimensional correlations. The contextual knowledge embedded in the cross-view correlations extracted by the 2D CNN is concatenated with the original correlation to form a unified cost volume C. This procedure can be described as:

C_1 = Conv2D(C^ℓ),   C_2 = Conv2D(C^{ℓ+1}),   (8)

C_update = Concat(C, C_1, C_2).   (9)

Here, C^ℓ represents the cost volume from the previous stage, C^{ℓ+1} is the cost volume from the current stage, Conv2D denotes a 2D convolution with batch normalization and ReLU activation, and Concat refers to concatenation along the G×D dimension. The 2D CNN extracts information from the cost volume regarding the connections between features and depths. The coarse branch encodes prior information, while the fine branch captures details from the current stage, enabling the model to capture contextual relationships across stages, depth hypotheses, and feature channels.
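A minimal PyTorch sketch of Eqns. (8)–(9) follows. The kernel sizes, the single-channel guidance, and the bilinear upsampling of the coarser-stage volume to the current resolution are assumptions for illustration; the released module may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossViewAggregation(nn.Module):
    """Sketch of CVA: flatten (G, D) into one channel axis and extract 2D guidance."""

    def __init__(self, G, D_prev, D_cur, guide_channels=1):
        super().__init__()
        self.conv_prev = nn.Sequential(
            nn.Conv2d(G * D_prev, guide_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(guide_channels), nn.ReLU(inplace=True))
        self.conv_cur = nn.Sequential(
            nn.Conv2d(G * D_cur, guide_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(guide_channels), nn.ReLU(inplace=True))

    def forward(self, cost_cur, cost_prev):
        # cost_cur: (B, G, D_cur, H, W); cost_prev: (B, G, D_prev, H', W') from the coarser stage.
        B, G, D, H, W = cost_cur.shape
        cur_flat = cost_cur.flatten(1, 2)                              # (B, G*D_cur, H, W)
        prev_flat = F.interpolate(cost_prev.flatten(1, 2), size=(H, W),
                                  mode="bilinear", align_corners=False)
        c1 = self.conv_prev(prev_flat)                                 # Eq. (8), coarse branch
        c2 = self.conv_cur(cur_flat)                                   # Eq. (8), fine branch
        updated = torch.cat([cur_flat, c1, c2], dim=1)                 # Eq. (9), concat along G*D
        return updated                                                 # passed on to regularization
```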
C. Regularization

We employ a lightweight regularization network similar to [9]. The network consists of 3D CNNs that progressively downsample the input volumes, followed by transposed convolutions that upsample the volumes back to the original resolution. Skip connections are employed to combine features from corresponding layers during downsampling and upsampling. The final layer produces a probability volume P ∈ R^{D×H×W} indicating the confidence in each depth hypothesis.

Subsequently, we adopt a winner-takes-all strategy to derive the depth map. After applying softmax to P, the index of the highest probability among the D depth hypotheses is selected for each pixel. The corresponding depth value is then assigned as the pixel's depth. This depth value is used both to compute the loss for the current stage and as the center for defining the depth hypotheses of the next stage.
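The winner-takes-all readout can be sketched as follows; returning the per-pixel maximum probability as a confidence map for later fusion filtering is an assumption based on common practice.

```python
import torch

def winner_takes_all(prob_volume, depth_hypotheses):
    """Select the depth whose hypothesis has the highest probability (a sketch).

    prob_volume: (B, D, H, W) softmax probabilities over the D hypotheses.
    depth_hypotheses: (B, D, H, W) or (B, D, 1, 1) hypothesis values.
    """
    idx = prob_volume.argmax(dim=1, keepdim=True)                 # (B, 1, H, W)
    hyps = depth_hypotheses.expand_as(prob_volume)
    depth = torch.gather(hyps, dim=1, index=idx).squeeze(1)       # (B, H, W)
    confidence = prob_volume.max(dim=1).values                    # assumed use: fusion filtering
    return depth, confidence
```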
D. Loss Functions

The loss functions are similar to those in previous methods [6], [9]. Specifically, for the probability volume P, a softmax operation is applied along the depth dimension D to obtain a distribution of probabilities for each pixel. Each depth hypothesis corresponds to a specific depth value, and the classification is based on selecting the hypothesis with the highest probability. Ground-truth labels are encoded using a one-hot scheme, where the depth hypothesis closest to the true depth value is marked as 1 and all others are set to 0. For simplicity, we use only the pixel-wise cross-entropy term, without additional components:

Loss_pw = Σ_{z∈Ψ} ( −P_gt(z) log[P(z)] ),   (10)

where Ψ represents the set of valid pixels with ground truth, P denotes the estimated probability for each specific depth, and P_gt corresponds to the probability volume of the ground truth. The overall loss is a weighted sum of Loss_pw as in Eqn. 11, where λ^ℓ = 1 for every stage in our experiments:

Loss = Σ_{ℓ=0}^{L} λ^ℓ Loss_pw^ℓ.   (11)
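With a one-hot target, the per-stage term in Eq. (10) reduces to a masked cross-entropy against the index of the closest hypothesis; a sketch follows, with the clamping constant and masking convention as assumptions.

```python
import torch
import torch.nn.functional as F

def pixelwise_ce_loss(prob_volume, depth_gt, depth_hypotheses, mask):
    """Eq. (10) as masked cross-entropy against a one-hot depth label (a sketch).

    prob_volume: (B, D, H, W) softmax probabilities; depth_hypotheses: (B, D, H, W) or (B, D, 1, 1);
    depth_gt: (B, H, W); mask: (B, H, W) boolean validity of the ground truth.
    """
    # One-hot target: the hypothesis closest to the ground-truth depth.
    gt_index = (depth_hypotheses - depth_gt.unsqueeze(1)).abs().argmin(dim=1)   # (B, H, W)
    log_p = torch.log(prob_volume.clamp(min=1e-8))
    ce = F.nll_loss(log_p, gt_index, reduction="none")                          # per-pixel -log P(z_gt)
    return ce[mask].mean()

# Eq. (11): total loss = sum over stages with lambda_l = 1, e.g.
# loss = sum(pixelwise_ce_loss(P[l], d_gt[l], hyps[l], mask[l]) for l in range(num_stages))
```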
III. EXPERIMENTS

A. Dataset

DTU [12] is an indoor dataset comprising 124 distinct objects, with each scene meticulously captured from 49 different viewpoints under various lighting conditions.
Tanks and Temples (T&T) [16] is a challenging and realistic dataset, which presents a valuable resource for evaluating MVS methods in demanding real-world scenarios.
BlendedMVS [17] is a large-scale synthetic training dataset with 17,000+ images and precise ground-truth 3D structures.

B. Implementation Details

Following the established paradigm, our model is trained on the DTU training set [12] and evaluated on the DTU testing set, using the same data split and view selection criteria as [1], [6], [9], [14] for comparability. Additionally, we fine-tune our model on BlendedMVS [17] and evaluate on T&T [16].
Training. We configure the number of input images to be N = 5 for DTU [12], each with a resolution of 640 × 512 pixels. For BlendedMVS [17], we employ a total of N = 7 images, each with a resolution of 768 × 576 pixels. We train the model with an initial learning rate of 0.001 for 15 epochs on one NVIDIA RTX 4090 GPU, with a batch size of 4.
Evaluation. For the DTU [12] evaluation, the images are cropped to a resolution of 1600 × 1152, and the number of input views remains N = 5. For T&T [16], we resize the height of the images to 1024, while the width is retained at either 1920 or 2048, depending on the specific scene under evaluation. In this case, following previous methods, we increase the number of input views to N = 11. For depth estimation on DTU [12], the inference runtime is 0.13 seconds and the model utilizes 3.21 GB of memory. On the T&T [16] dataset, it executes within 0.6 seconds and consumes 7.37 GB of memory. For depth fusion, we adopt a dynamic fusion strategy [18], similar to previous approaches, to achieve dense point cloud integration.
Metrics. For DTU [12], we use distance metrics of point clouds to evaluate accuracy and completeness. Accuracy (Acc.) measures the distance from the reconstructed 3D points to the closest points in the ground truth, while completeness (Comp.) measures the distance from the ground truth to the reconstruction. To demonstrate that our method's advancements are due to the increased accuracy of the depth map rather than fusion tricks, we also directly assess the depth map errors of various methods. For T&T [16], we adopt percentage-based metrics for accuracy and completeness and use the official online evaluation platform for standardized assessments. For the qualitative results, we used the official code and pre-trained weights of the compared methods to generate the depth maps for visualization (Fig. 3) and error analysis (Table II). Quantitative scores were taken from the original papers, except for memory and inference speed, which we re-evaluated on identical hardware for fairness.

TABLE I
Quantitative comparison on DTU [12]. * means MVSTER [6] is trained on full-resolution images. The colors indicate rankings, with red representing the top position, orange indicating second place, and yellow marking third.

Method | Acc.↓ (mm) | Comp.↓ (mm) | Overall↓ (mm) | Time↓ (s) | GPU↓ (GB)
Gipuma [19] | 0.283 | 0.873 | 0.578 | - | -
COLMAP [20] | 0.400 | 0.664 | 0.532 | - | -
R-MVSNet [15] | 0.383 | 0.452 | 0.417 | - | -
CasMVSNet [1] | 0.325 | 0.385 | 0.355 | 0.49 | 5.4
CVP-MVSNet [3] | 0.296 | 0.406 | 0.351 | - | -
PatchmatchNet [21] | 0.427 | 0.277 | 0.352 | 0.25 | 2.9
EPP-MVSNet [22] | 0.413 | 0.296 | 0.355 | 0.52 | 8.2
CDS-MVSNet [23] | 0.352 | 0.280 | 0.316 | 0.40 | 4.5
NP-CVP-MVSNet [24] | 0.356 | 0.275 | 0.315 | 1.20 | 6.0
UniMVSNet [25] | 0.352 | 0.278 | 0.315 | 0.29 | 3.3
TransMVSNet [5] | 0.321 | 0.289 | 0.305 | 0.99 | 3.8
MVSTER* [6] | 0.340 | 0.266 | 0.303 | 0.17 | 4.5
GeoMVSNet [9] | 0.331 | 0.259 | 0.295 | 0.26 | 5.9
Ours | 0.327 | 0.251 | 0.289 | 0.13 | 3.2

TABLE II
Depth map errors on DTU [12]. ade is the average absolute depth error (mm), while tde(X) is the percentage of pixels with an error above X mm. The colors indicate rankings, with red representing the top position, orange indicating second place, and yellow marking third.

Method | ade↓ | tde(1)↓ | tde(2)↓ | tde(4)↓ | tde(8)↓ | tde(16)↓
MVSNet [14] | 14.7356 | 28.22 | 20.01 | 16.19 | 14.00 | 12.18
CasMVSNet [1] | 8.4086 | 25.41 | 17.80 | 14.06 | 11.42 | 8.97
CVP-MVSNet [3] | 6.9875 | 26.75 | 18.80 | 14.10 | 10.22 | 6.75
TransMVSNet [5] | 15.1320 | 26.09 | 18.03 | 15.17 | 13.18 | 11.57
MVSTER [6] | 9.3494 | 26.94 | 19.36 | 15.84 | 13.41 | 11.03
UniMVSNet [25] | 11.5872 | 28.90 | 19.66 | 13.73 | 11.02 | 8.92
GeoMVSNet [9] | 11.6586 | 24.06 | 16.53 | 13.28 | 11.08 | 9.21
Ours | 6.6450 | 24.38 | 16.81 | 13.20 | 10.55 | 8.05
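For reference, the point-cloud distance metrics reported above can be computed as nearest-neighbour distances between the reconstruction and the ground truth; the sketch below uses a plain mean reduction, whereas the official DTU evaluation additionally applies observability masks and outlier thresholds not shown here.

```python
import numpy as np
from scipy.spatial import cKDTree

def point_cloud_metrics(pred_pts, gt_pts):
    """Sketch of the Acc./Comp. distance metrics on DTU-style point clouds.

    pred_pts: (N, 3) reconstructed points; gt_pts: (M, 3) ground-truth points.
    """
    acc = cKDTree(gt_pts).query(pred_pts)[0].mean()    # reconstruction -> ground truth
    comp = cKDTree(pred_pts).query(gt_pts)[0].mean()   # ground truth -> reconstruction
    overall = 0.5 * (acc + comp)
    return acc, comp, overall
```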
Fig. 3. Qualitative comparison with other methods (CasMVSNet, UniMVSNet, TransMVSNet, MVSTER, GeoMVSNet, and ours) on the DTU [12] dataset for scan4, scan10, and scan114. The depth maps estimated by our method have a more complete and continuous surface and also have clearer outlines at the edges.

TABLE III
Quantitative comparison on Tanks and Temples [16]. The colors indicate rankings, with red representing the top position, orange indicating second place, and yellow marking third.

Method | Mean↑ | Family | Francis | Horse | L.H. | M60 | Panther | P.G. | Train
COLMAP [20] | 42.14 | 50.41 | 22.25 | 25.63 | 56.43 | 44.83 | 46.97 | 48.53 | 42.04
CasMVSNet [1] | 56.42 | 76.36 | 58.45 | 46.20 | 55.53 | 56.11 | 54.02 | 58.17 | 46.56
PatchmatchNet [21] | 53.15 | 66.99 | 52.64 | 43.24 | 54.87 | 52.87 | 49.54 | 54.21 | 50.81
UniMVSNet [25] | 64.36 | 81.20 | 66.43 | 53.11 | 63.46 | 66.09 | 64.84 | 62.23 | 57.53
TransMVSNet [5] | 63.52 | 80.92 | 65.83 | 56.94 | 62.54 | 63.06 | 60.00 | 60.20 | 58.67
MVSTER [6] | 60.92 | 80.21 | 63.51 | 52.30 | 61.38 | 61.47 | 58.16 | 58.98 | 51.38
GeoMVSNet [9] | 65.89 | 81.64 | 67.53 | 55.78 | 68.02 | 65.49 | 67.19 | 63.27 | 58.22
Ours | 65.53 | 81.73 | 68.92 | 56.59 | 66.10 | 64.86 | 64.41 | 62.33 | 59.26

C. Benchmark Performance

DTU. We compare our results with traditional and learning-based methods. We selected some representative visualizations, as shown in Fig. 3. The depth map estimated by our method has a more complete and continuous surface and also has clearer outlines at the edges. Therefore, we can obtain accurate and complete point clouds, especially for the structures of the subject, even without rich textures, such as the bird's head and the statue edges. Using the official codes for point cloud evaluation, we present results in Tab. I. Our method ranks first among the methods evaluated. Methods with similar scores need more inference time and GPU resources, while those with similar resource use perform worse.

Point cloud fusion involves various tricks to optimize benchmark results, but this contradicts the essence of MVS. To demonstrate that our method produces better depth maps rather than achieving high scores through parameter tuning, we analyze the estimated depth maps directly. We calculate the mean absolute error (mm) and the percentage of pixels with errors greater than 1 mm, 2 mm, etc., as shown in Tab. II. Our depth maps have lower average errors and more pixels within specific error thresholds. While GeoMVSNet [9] slightly outperforms us at the 1 mm and 2 mm thresholds, it performs worse at the other levels and has larger absolute depth errors, indicating a potential reliance on point cloud fusion to eliminate outliers. This explains why their depth maps are either highly accurate or significantly off, yet achieve strong point cloud scores.

Tanks and Temples. We further validate the generalization capability of our method on T&T [16] and report quantitative results in Tab. III. Overall, our method achieves comparable precision and recall to GeoMVSNet [9], but with the added benefit of reduced GPU memory usage: GeoMVSNet [9] needs 8.85 GB, whereas our method only requires 7.37 GB.

D. Ablation Study

Tab. IV and Tab. V show the ablation results of our method. The baseline is the MVSTER [6] method re-implemented without the monocular depth estimator.

TABLE IV
Ablations for the proposed modules, evaluated on DTU [12]. Both modules improve the accuracy and completeness of the point cloud.

Method | Sec. II-A | Sec. II-B | Acc.↓ (mm) | Comp.↓ (mm) | Overall↓ (mm)
baseline | | | 0.350 | 0.276 | 0.313
+ intra view | ✓ | | 0.333 | 0.257 | 0.295
+ cross view | | ✓ | 0.327 | 0.255 | 0.291
proposed | ✓ | ✓ | 0.327 | 0.251 | 0.289

Intra-View Fusion. As shown in Tab. IV, Intra-View Fusion, which encodes both long-range dependencies and feature-channel relationships, significantly improves the accuracy and completeness. This is because our method effectively captures coordinate information along with precise positional details, which is crucial for accurate understanding.
Cross-View Aggregation. As shown in Tab. IV, the cost volume aggregation equipped with the cross-view correlation derived from the previous stage effectively improves the score, because the correlations offer relationships that describe the positions of pixels; some adjacent pixels are thus implicitly constrained to lie on a plane, which especially helps in texture-less regions. Besides, the contextual information extracted from the correlations highlights the effect of the various viewpoints, which enhances the robustness of the method to unreliable inputs.

The number of channels in cross-view aggregation. As shown in Tab. V, the cross-view correlation between the current stage and the previous stage provides additional information that enhances performance. We conducted ablations by varying the number of channels used for correlation, testing up to 4 channels within the bounds of the original volume. We found that more channels did not significantly improve completeness but slightly decreased accuracy, resulting in a minor overall score reduction. This suggests that while the correlation data captures essential trends across viewpoints, additional channels may introduce noise without significant benefit and increase memory usage. Therefore, we optimized our model by limiting the number of channels to 1.

TABLE V
Ablations for the number of channels of cross-view aggregation, evaluated on DTU [12]. Num_c is the number of channels from the previous stage and Num_f is the number of channels from the current stage.

Method | Num_c | Num_f | Acc.↓ (mm) | Comp.↓ (mm) | Overall↓ (mm)
baseline + intra-view | 0 | 0 | 0.333 | 0.257 | 0.295
 | 0 | 1 | 0.3271 | 0.2585 | 0.2928
 | 1 | 0 | 0.3300 | 0.2582 | 0.2941
 | 2 | 2 | 0.3296 | 0.2492 | 0.2894
 | 3 | 3 | 0.3323 | 0.2503 | 0.2913
 | 4 | 4 | 0.3280 | 0.2544 | 0.2912
proposed | 1 | 1 | 0.3270 | 0.2510 | 0.2890
IV. CONCLUSION

In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we construct Intra-View Fusion, which leverages the positional knowledge within a single image, enhancing robust cost matching without incorporating complicated external dependencies. Besides, we introduce a lightweight Cross-View Aggregation that efficiently utilizes the cross-scale contextual information from correlations. The proposed method is extensively evaluated on public datasets, consistently achieving competitive performance against the state of the art, while requiring no extra input and lower computational resources. Our method still needs large labeled datasets to train the model. Future work could focus on developing unsupervised methods to overcome this limitation.

REFERENCES

[1] Xiaodong Gu, Zhiwen Fan, et al., "Cascade cost volume for high-resolution multi-view stereo and stereo matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2495–2504.
[2] Shuo Cheng, Zexiang Xu, et al., "Deep stereo using adaptive thin volume representation with uncertainty awareness," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2524–2534.
[3] Jiayu Yang, Wei Mao, et al., "Cost volume pyramid based depth inference for multi-view stereo," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4877–4886.
[4] Tianqi Liu, Xinyi Ye, et al., "When epipolar constraint meets non-local operators in multi-view stereo," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 18088–18097.
[5] Yikang Ding, Wentao Yuan, et al., "TransMVSNet: Global context-aware multi-view stereo network with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8585–8594.
[6] Xiaofeng Wang, Zheng Zhu, et al., "MVSTER: Epipolar transformer for efficient multi-view stereo," in European Conference on Computer Vision. Springer, 2022, pp. 573–591.
[7] Chenjie Cao, Xinlin Ren, et al., "MVSFormer: Multi-view stereo by learning robust image features and temperature-based depth," Transactions on Machine Learning Research, 2023.
[8] Chenjie Cao, Xinlin Ren, et al., "MVSFormer++: Revealing the devil in transformer's details for multi-view stereo," in International Conference on Learning Representations (ICLR), 2024.
[9] Zhe Zhang, Rui Peng, et al., "GeoMVSNet: Learning multi-view stereo with geometry perception," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21508–21518.
[10] Jiang Wu, Rui Li, et al., "GoMVS: Geometrically consistent cost aggregation for multi-view stereo," in CVPR, 2024.
[11] Tsung-Yi Lin, Piotr Dollár, et al., "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[12] Rasmus Jensen, Anders Dahl, et al., "Large scale multi-view stereopsis evaluation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 406–413.
[13] Xiaoyang Guo, Kai Yang, et al., "Group-wise correlation stereo network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3273–3282.
[14] Yao Yao, Zixin Luo, et al., "MVSNet: Depth inference for unstructured multi-view stereo," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 767–783.
[15] Yao Yao, Zixin Luo, et al., "Recurrent MVSNet for high-resolution multi-view stereo depth inference," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5525–5534.
[16] Arno Knapitsch, Jaesik Park, et al., "Tanks and Temples: Benchmarking large-scale scene reconstruction," ACM Transactions on Graphics, vol. 36, no. 4, 2017.
[17] Yao Yao, Zixin Luo, et al., "BlendedMVS: A large-scale dataset for generalized multi-view stereo networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1790–1799.
[18] Jianfeng Yan, Zizhuang Wei, et al., "Dense hybrid recurrent multi-view stereo net with dynamic consistency checking," in European Conference on Computer Vision. Springer, 2020, pp. 674–689.
[19] Silvano Galliani, Katrin Lasinger, et al., "Massively parallel multiview stereopsis by surface normal diffusion," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 873–881.
[20] Johannes L. Schonberger and Jan-Michael Frahm, "Structure-from-motion revisited," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4104–4113.
[21] Fangjinhua Wang, Silvano Galliani, et al., "PatchmatchNet: Learned multi-view patchmatch stereo," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14194–14203.
[22] Xinjun Ma, Yue Gong, et al., "EPP-MVSNet: Epipolar-assembling based depth prediction for multi-view stereo," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5732–5740.
[23] Khang Truong Giang, Soohwan Song, et al., "Curvature-guided dynamic scale networks for multi-view stereo," arXiv:2112.05999, 2021.
[24] Jiayu Yang, Jose M. Alvarez, et al., "Non-parametric depth distribution modelling based depth inference for multi-view stereo," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8626–8634.
[25] Rui Peng, Rongjie Wang, et al., "Rethinking depth estimation for multi-view stereo: A unified representation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8645–8654.
V. SUPPLEMENTARY MATERIAL

In the supplementary material, we present more details that are not included in the main text, including:
• the depth map fusion method, which integrates depth maps into 3D point clouds;
• an extended comparison with additional recent works based on reviewer suggestions;
• depth map visualizations: comparisons between our method and other approaches;
• point cloud visualizations: comparisons with other methods and our final reconstructed point clouds on DTU [12] and Tanks and Temples [16].

A. Depth Map Fusion

In previous Multi-View Stereo (MVS) approaches, various fusion techniques have been utilized to integrate the predicted depth maps from multiple viewpoints into a coherent point cloud. In this study, we adopt the dynamic checking strategy proposed in [18] for depth filtering and fusion.
On the DTU [12] dataset, we filter the confidence map of the final stage using a confidence threshold to assess photometric consistency. For geometric consistency, we apply the following criteria:

err_c < thresh_c,   err_d < thresh_d,   (12)

where err_c and err_d represent the reprojection coordinate error and the relative error of the reprojected depth, respectively, and thresh_c and thresh_d indicate the corresponding thresholds.
On the Tanks and Temples [16] benchmark, we adjust the hyperparameters for each scene following the approach outlined in [25], including the confidence thresholds and the geometric criteria. Our model is fine-tuned on the BlendedMVS dataset [17] for reconstructing the scenes in this benchmark.
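The geometric check in Eq. (12) can be sketched as follows; the nearest-neighbour depth lookup and the default thresholds are placeholders, not the tuned per-dataset values.

```python
import numpy as np

def geometric_consistency(depth_ref, depth_src, K_ref, K_src, T_src_from_ref,
                          thresh_c=1.0, thresh_d=0.01):
    """Sketch of the reprojection test in Eq. (12).

    depth_ref, depth_src: (H, W) depth maps; K_ref, K_src: 3x3 intrinsics;
    T_src_from_ref: 4x4 transform from the reference to the source camera.
    Returns a boolean (H, W) mask of geometrically consistent pixels.
    """
    H, W = depth_ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)

    # Lift reference pixels to 3D, move them into the source frame, and project.
    pts_ref = np.linalg.inv(K_ref) @ (pix * depth_ref.reshape(1, -1))
    pts_src = T_src_from_ref[:3, :3] @ pts_ref + T_src_from_ref[:3, 3:4]
    proj = K_src @ pts_src
    u_src, v_src = proj[0] / proj[2], proj[1] / proj[2]

    # Read the source depth (nearest neighbour) and reproject back to the reference view.
    us = np.clip(np.rint(u_src).astype(int), 0, W - 1)
    vs = np.clip(np.rint(v_src).astype(int), 0, H - 1)
    d_src = depth_src[vs, us]
    pts_back = np.linalg.inv(K_src) @ (np.stack([u_src, v_src, np.ones_like(u_src)]) * d_src)
    T_ref_from_src = np.linalg.inv(T_src_from_ref)
    pts_ref_back = T_ref_from_src[:3, :3] @ pts_back + T_ref_from_src[:3, 3:4]
    proj_back = K_ref @ pts_ref_back

    # Eq. (12): coordinate error of the round trip and relative depth error.
    u_back, v_back = proj_back[0] / proj_back[2], proj_back[1] / proj_back[2]
    d_back = pts_ref_back[2]
    err_c = np.hypot(u_back - pix[0], v_back - pix[1])
    err_d = np.abs(d_back - depth_ref.reshape(-1)) / np.maximum(depth_ref.reshape(-1), 1e-6)
    return ((err_c < thresh_c) & (err_d < thresh_d)).reshape(H, W)
```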


B. Depth Map Visualizations

The depth map visualizations illustrate the performance of our method compared to other approaches. These visualizations (Fig. 4) highlight the accuracy and completeness of the predicted depth maps, showcasing improvements in fine details and in the handling of challenging regions such as edges and textureless areas.

C. Extended Comparison with Additional Recent Works

Our focus is on uncovering the core essence of the problem, so we prefer not to rely on pre-trained models or additional inputs as priors, which is why we specifically compare with works that train from scratch to ensure a fair evaluation. Nonetheless, we include the quantitative results of CT-MVSNet (MMM2024), MVSFormer++ (ICLR2024), and GoMVS (CVPR2024) on the DTU dataset in Table VI as requested, to provide additional comparisons and references for potential readers. Overall, our proposed method achieves competitive performance against the state of the art while requiring no extra input and lower computational resources. (1) Our method consistently outperforms CT-MVSNet, achieving significantly better accuracy, completeness, and overall performance. (2) GoMVS relies on additional normal inputs. In particular, it requires normal map estimation as a preprocessing step, which imposes a strong planar constraint that undoubtedly improves its benchmark performance. However, this also means that its results are highly dependent on the accuracy of the normal estimation method. Moreover, even if we disregard the preprocessing time required to obtain the normal maps, GoMVS still consumes more than twice the memory and takes over three times longer to run compared to the proposed method. (3) MVSFormer++ relies heavily on a pre-trained model. While its overall score is better, it is crucial to note that MVSFormer++ benefits from extensive engineering and experimental optimizations. It not only leverages a pre-trained DINOv2 model for feature extraction but also incorporates efficiency-focused components like FlashAttention. However, these optimizations come at the cost of significantly increased model complexity: MVSFormer++ has over 30 times more parameters (783 MB) than our model (20 MB). Moreover, its training is highly resource-intensive, requiring four A6000 GPUs (48 GB each) for a full day, whereas our model can be trained in less than a day using just a single RTX 4090. In addition, MVSFormer++ demands more GPU memory and longer inference time, as shown in Table VI, making it less efficient in practical deployment.

TABLE VI
Quantitative comparison on DTU. The point cloud metrics are taken from the respective papers. All runtime and memory usage data are measured on the same hardware (RTX 4090) for a fair comparison. Note that CT-MVSNet does not provide a pre-trained model, without which we could not test it on our machine.

Method | Acc.↓ (mm) | Comp.↓ (mm) | Overall↓ (mm) | Time↓ (s) | GPU↓ (GB)
CT-MVSNet | 0.341 | 0.264 | 0.302 | - | -
MVSFormer++ | 0.309 | 0.252 | 0.281 | 0.30 | 5.6
GoMVS | 0.347 | 0.227 | 0.287 | 0.45 | 7.7
Ours | 0.327 | 0.251 | 0.289 | 0.13 | 3.2

D. Point Cloud Visualizations

Fig. 5 illustrates the error comparison of the point clouds. Taking the Horse scene in the Tanks and Temples [16] dataset as an example, our method is able to reduce large amounts of outliers while ensuring completeness.
Besides, we show all point clouds reconstructed using our method on the DTU [12] dataset and the Tanks and Temples [16] dataset, respectively. As illustrated in Fig. 6 and Fig. 7, our method demonstrates robust reconstruction capabilities across various scales, effectively handling both small objects and large-scale scenes.
Fig. 4. Qualitative comparison with other methods (CasMVSNet, UniMVSNet, TransMVSNet, MVSTER, GeoMVSNet, and ours) on DTU [12] for scans 1, 9, 15, 32, 48, 49, 75, 77, and 110. The depth maps estimated by our method have a more complete and continuous surface and also have clearer outlines at the edges.
Fig. 5. Point cloud error comparison of state-of-the-art methods (MVSTER, GeoMVSNet, and ours) on the Tanks and Temples dataset [16]. τ is the officially determined scene-relevant distance threshold, and darker means larger error. The first row shows Precision and the second row shows Recall. Taking the Horse scene in the intermediate subset as an example, our method is able to reduce large amounts of outliers while ensuring completeness.
Fig. 6. Point clouds reconstructed for all scenes from DTU [12] dataset.

Fig. 7. Point clouds reconstructed for all scenes from Tanks and Temples [16] dataset.
