VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment
Abstract
Reviews and Discussion
This paper proposes a new method for 3DGS-based surface reconstruction. It enforces multi-view color and feature consistency, as well as image-gradient-guided local normal consistency and normal smoothness. The results show that the proposed method achieves better performance on both reconstruction and rendering.
Strengths and Weaknesses
Strengths:
- The novel contribution in this paper is mainly the multi-view feature-level consistency, which learns from the MVS pipeline to align the high-dimensional features.
Weaknesses:
- Significance: The results show no significant performance improvement over the previous state-of-the-art PGSR. According to Table 4, the proposed feature alignment only improves F1 Score by 0.01, and without which the performance becomes similar to PGSR. Therefore, I struggle to find the real significance of this work.
Questions
- What's the difference between the proposed Multi-View Photometric Alignment and the one in PGSR? An ablation study to illustrate the difference would be helpful.
- In Equation 4, an ablation study to evaluate the effectiveness of the edge-aware coefficient is needed, since it's the main difference from the existing methods.
Limitations
Please see the weaknesses and questions above.
Final Justification
The authors showed more detailed results and I think overall this paper provides some further improvements over PGSR.
Formatting Concerns
N.A.
1. Concern about significance and the lack of a clear performance improvement over PGSR.
- Firstly, we would like to clarify that our method is not a direct extension of PGSR with feature alignment added on top. Instead, our framework introduces a new combination of single-view alignment (including edge-aware image reconstruction and normal-based geometry supervision) and multi-view alignment (photometric and feature alignment), which are structurally and conceptually different from PGSR.
- As shown in the ablations of Table 4, even without feature alignment, our method already achieves better performance than PGSR. This demonstrates that our core design, particularly single-view and multi-view photometric supervision, is effective. The additional feature alignment further improves the results, validating our design choices. Even when the improvement appears numerically small on the TNT dataset, reaching the same or slightly higher performance than PGSR still represents state-of-the-art performance.
- Moreover, on the DTU dataset, the contribution of the feature alignment loss is more significant. As shown in the following table, removing it increases the Chamfer distance from 0.49 to 0.52, which is still better than PGSR’s reported 0.53. This shows that our method offers consistent geometric improvement across datasets.
- Finally, our method also outperforms PGSR on the Mip-NeRF 360 dataset for novel view synthesis, as shown in Table 3. We achieve better scores on all metrics, indicating that our method is not merely comparable to PGSR but surpasses it in reconstruction/synthesis quality and generalizability across varied benchmarks. The ablation experiment in Question 2 also shows that the strategy in our method can provide better performance.
| | w/o feature alignment | Ours | PGSR |
|---|---|---|---|
| Chamfer distance | 0.52 | 0.49 | 0.53 |
2. The difference between the proposed Multi-View Photometric Alignment and that in PGSR.
Our Multi-View Photometric Alignment differs from PGSR’s in several key aspects:
- Optimization complexity: PGSR couples its photometric consistency loss with geometric consistency regularization, which minimizes the forward–backward reprojection error of neighboring views; this introduces more variables into backpropagation and a more complex gradient computation. In contrast, our method simplifies the backpropagation path by aligning Gaussian orientations first and then applying image/feature-level constraints, which leads to better optimization efficiency.
- Multi-view formulation: PGSR computes photometric consistency between one reference view and one source view. Our framework uses one reference view and multiple source views (three by default, varied in ablation), which introduces richer multi-view constraints and improves geometric supervision (a minimal sketch of the plane-induced homography warping behind such alignment is given after the comparison table below).
- Gaussian flattening: PGSR flattens 3D Gaussians using scale regularization of the Gaussian ellipsoid to mimic planar surfaces. In our method, we found this strategy ineffective through the ablation in Table 4, and therefore designed our supervision differently.
- Efficiency and performance: As shown in Table 2, our method achieves better reconstruction quality across more scenes on real outdoor scenes of the TNT dataset while requiring less training time than PGSR.
- Occlusion modeling: We explicitly define a visibility term and occlusion weight to mitigate the effect of outlier pixels caused by occlusion or misalignment. We give their motivation and clearly show the derivation of their formulas in the method section.
To further clarify the differences, we provide a cross-method ablation comparison on the DTU and TNT datasets in the following table. These results show that our alignment strategy is both more effective and more efficient than PGSR’s when integrated into either framework. “PGSR+Ours” means adding our solution to PGSR and vice versa.
| | PGSR | PGSR+Ours | Ours | Ours+PGSR |
|---|---|---|---|---|
| Chamfer distance on DTU | 0.53 | 0.51 | 0.49 | 0.56 |
| F1-score on TNT | 0.52 | 0.53 | 0.54 | 0.48 |
| Time on TNT (minutes) | 25.5 | 19.7 | 20.8 | 22.1 |
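For concreteness, the plane-induced homography that underlies this kind of patch-based multi-view photometric alignment can be sketched as follows. The function names, tensor shapes, and the subsequent NCC/L1 comparison step are illustrative assumptions, not our exact implementation.

```python
import torch

def plane_homography(K_ref, K_src, R, t, n, d):
    """Plane-induced homography mapping reference-view pixels to a source view.

    K_ref, K_src: 3x3 intrinsics; (R, t): relative pose from reference to source;
    n: unit plane normal in the reference camera frame (3,); d: plane-to-camera distance.
    H = K_src (R - t n^T / d) K_ref^{-1}
    """
    return K_src @ (R - torch.outer(t, n) / d) @ torch.inverse(K_ref)

def warp_pixels(H, ref_pix):
    """Map (N, 2) reference pixel coordinates through H; the warped coordinates are
    then used to sample the source image or features for an NCC / L1 comparison."""
    homo = torch.cat([ref_pix, torch.ones_like(ref_pix[:, :1])], dim=1)  # (N, 3)
    mapped = (H @ homo.T).T
    return mapped[:, :2] / mapped[:, 2:3]  # perspective divide
```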
3. Add an ablation study of the edge-aware coefficient in Equation 4.
Thank you for the helpful suggestion. To evaluate its impact, we conducted an ablation study where we removed this coefficient and kept the rest of the loss formulation unchanged. The table below shows that the edge-aware coefficient improves reconstruction performance. In boundary regions, Gaussian primitives often exhibit ambiguous or noisy normal directions, which can lead to incorrect supervision signals. The coefficient reduces the loss contribution from these areas, allowing the network to focus learning on more reliable surface regions. While the improvement is modest, it reflects the fact that shape boundaries constitute a relatively small proportion of the scene, and thus affect only a small number of sampled points during evaluation.
| | Precision | Recall | F1-score |
|---|---|---|---|
| w/o edge-aware weighting | 0.50 | 0.59 | 0.53 |
| Ours | 0.51 | 0.60 | 0.54 |
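As a point of reference, one common form of an edge-aware coefficient is an exponential of the image gradient magnitude; the sketch below uses this generic form, which is not necessarily identical to Eq. 4 in the paper, and `lam` is a hypothetical sharpness parameter.

```python
import torch
import torch.nn.functional as F

def edge_aware_weight(image, lam=10.0):
    """Illustrative edge-aware coefficient: close to 1 in flat regions and close to 0
    near strong image gradients, so boundary pixels contribute less to the loss.
    image: (1, 3, H, W) tensor in [0, 1]."""
    gray = image.mean(dim=1, keepdim=True)                          # (1, 1, H, W)
    gx = F.pad(gray[..., :, 1:] - gray[..., :, :-1], (0, 1, 0, 0))  # horizontal gradient
    gy = F.pad(gray[..., 1:, :] - gray[..., :-1, :], (0, 0, 0, 1))  # vertical gradient
    grad_mag = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)
    return torch.exp(-lam * grad_mag)
```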
Dear Reviewer 78eD,
Thank you again for your time and valuable feedback on our paper. We sincerely hope to follow up and see if you have any further comments or questions regarding our rebuttal. We would greatly appreciate any additional feedback you could provide during the discussion phase.
Thanks for your detailed reply. I think your response has addressed most of my questions, and I would like to raise my score to borderline accept.
The authors address the problem of surface reconstruction from a 3D Gaussian Splatting scene. They introduce several losses using geometrical and multi-view cues to improve the reconstruction. The method achieves better reconstruction performance, albeit at a longer optimization time. The evaluation is conducted by comparing the surface reconstruction with a broad selection of related methods, including NeRF and Gaussian splatting approaches on several commonly accepted datasets.
Strengths and Weaknesses
Strengths: The paper presents a clear mathematical formulation of the proposed loss terms, accompanied by a well-reasoned justification for how these additions enhance the geometric refinement of the Gaussians. The authors argue convincingly that the introduced terms address the limitations of purely photometric supervision by enforcing geometric consistency. The method is thoroughly evaluated through both quantitative and qualitative comparisons against a broad set of related approaches.
Weaknesses: The paper suffers from several issues that affect its clarity and coherence.
For example:
- In Section 5.2, a scale loss is introduced in the ablation study without any prior mention or explanation in the method section.
- Placing result tables within the method section disrupts the logical flow of the paper and blurs the line between methodological description and experimental validation.
In terms of performance, the proposed method reports improved Chamfer distances in 9 out of 14 scenes on the DTU dataset, as shown in Table 1. However, these gains come at the cost of significantly longer training times - 15.5 minutes compared to 3.4 minutes for 3DGS, which diminishes the impact of the reported improvements, especially when they are relatively modest.
Moreover, although the title emphasizes multi-view alignment, the actual performance gains attributed to multi-view supervision are minimal. The increase in Precision, Recall, and F1-score when scaling from a single view (N=1) to four views (N=4) is only +0.02, which raises doubts about the effectiveness of the method’s multi-view strategy and the strength of its central claim.
Questions
Could you explain how the recall in Table 4 is determined? Is some classification involved (e.g., a distance threshold), and if so, which one was used? Even if this follows previous work, it would make the paper more complete if you explained how you calculated it and why it matters.
How is the ablation in 5.2 done? Are the other terms rescaled, or is each term just dropped? Are the iterations the same?
Line 221: “Additionally, due to lighting variations, the color of the same surface point may differ across views, making photometric consistency unreliable.” - Why is this a problem for geometry? Don't spherical harmonics address exactly this?
Limitations
The primary limitation is the optimization time required by the method. One might question whether such a small gain in reconstruction quality is worth the resources spent on optimization.
Final Justification
The paper presents a clear formulation of the proposed loss terms, accompanied by a well-reasoned justification for how these additions improve the geometric refinement of the Gaussians. The authors argue convincingly that the introduced terms address the limitations of purely photometric supervision by enforcing geometric consistency. The method is thoroughly evaluated through both quantitative and qualitative comparisons against a broad set of related approaches. The authors adequately addressed all the questions raised during the review.
Formatting Concerns
None
1. The scale loss in Section 5.2 is introduced without any prior mention.
In Lines 131–135 of Section 3 (“Normal and Depth Estimation from Gaussians”), we describe how the Gaussian scale matrix encodes the ellipsoid shape and how its smallest eigenvalue determines the normal vector (“... encodes the scale along these directions ... the smallest scale factor as the normal of the Gaussian.”). However, we agree that the scale loss itself was not explicitly defined, which may affect clarity.
The scale loss is a common regularization strategy in recent 3D Gaussian-based methods, often employed to encourage ellipsoids to flatten into local planes, thereby improving the estimation of surface normals and tangent geometry. Our purpose in including this loss in the ablation study (Section 5.2) was to show that, unlike in prior work, this commonly used regularization is unnecessary in our framework, due to the effectiveness of our view-alignment and geometry-aware supervision. In other words, the added flattening has no additional benefit. We will add a formal description of this loss and clarify its context in the final version.
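For reference, a typical form of this flattening regularizer, as used in several Gaussian-based surface methods, is sketched below; the ablated baseline's exact formulation may differ.

```python
import torch

def scale_flatten_loss(scales):
    """Common flattening regularizer in Gaussian-based surface methods: push the
    smallest scale axis of every Gaussian toward zero so the ellipsoid degenerates
    into a disk aligned with the local surface.
    scales: (N, 3) positive per-Gaussian scale factors."""
    return torch.min(scales, dim=1).values.mean()
```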
2. Tables placed in the method section disrupt logical flow.
Thank you for the helpful suggestion. In the revised version, we will relocate these tables to the experiment section to improve logical flow and distinguish between method description and validation.
3. The performance gains come at the cost of significantly longer training time.
- While our training time is longer than that of vanilla 3DGS, the added cost stems from the newly introduced geometry supervision and view-alignment mechanisms, which directly improve surface accuracy. These components are essential for multi-view reconstruction, especially in challenging scenes. While the gains over the latest baselines may appear modest in some scenes, these differences represent noticeable geometric improvements.
- Moreover, increased time is a common trade-off among geometry-enhanced 3DGS variants, where introducing geometric priors and view constraints improves reconstruction quality at the cost of runtime (see Tables 1 & 2). The original 3DGS remains the fastest due to its lightweight constraints, but also exhibits the least accurate geometry. In future work, we plan to explore adaptive Gaussian pruning to further reduce training cost without sacrificing accuracy.
- Our goal in this work is not to accelerate or compress 3DGS, but to enhance its geometric representation. On DTU, we reduce Chamfer distance from 1.96 to 0.49; on TNT, we raise F1-score from 0.09 to 0.54, a substantial improvement over 3DGS; on Mip-NeRF 360, we also observe consistent gains across all metrics. As shown in the table below, when using only the edge-aware image reconstruction loss on TNT, our method runs faster than 3DGS (5.9 min vs. 7.5 min) and achieves a higher F1-score (0.13 vs. 0.09).
- In terms of practicality, our training time on TNT (~20 min) is production-viable for many offline applications. In contrast, neural implicit SDF methods often require 10+ hours of training (see Tables 1 & 2). We believe that accuracy is often prioritized over marginal runtime gains in such scenarios, especially given the rapid advancement of hardware.
| | 3DGS | Ours (image reconstruction loss only) |
|---|---|---|
| F1-score | 0.09 | 0.13 |
| Time (m) | 7.5 | 5.9 |
4. Ablation from N=1 to N=4 casts doubt on the effectiveness of the multi-view strategy.
We believe this is based on a misunderstanding of the ablation experiment in Section 5.2. Specifically, N denotes the number of source views, not the total number of images used. As stated in Line 192, our framework takes both a reference view and N source views. Thus, N=1 still uses 2 input images and does not correspond to a single-view setup.
To directly evaluate the effectiveness of the multi-view strategy, we conducted an additional ablation by removing all multi-view-based losses, i.e., the photometric alignment loss and the feature alignment loss (Eqs. 7 and 12), while retaining only the single-view terms: image reconstruction, normal consistency, and normal smoothing. As shown in the table below, the average F1-score on TNT drops from 0.54 to 0.36, confirming that our multi-view strategy plays a critical role.
Finally, we emphasize that our central contribution includes not only multi-view alignment but also single-view alignment (edge-aware image loss and normal-based supervision). As shown in Table 4, each component contributes meaningfully to the overall performance.
| | Precision | Recall | F1-score |
|---|---|---|---|
| w/o multi-view losses | 0.33 | 0.40 | 0.36 |
| Ours | 0.51 | 0.60 | 0.54 |
5. Explanation of how the recall in Table 4 is determined.
The results in Table 4 are based on the Tanks & Temples (TNT) benchmark, and we follow the official evaluation protocol and code provided by the dataset authors, which is publicly available on GitHub. The precision, recall, and F1-score are computed as follows.
- Let $G$ denote the ground-truth point set and $R$ the reconstructed point set.
- For each ground-truth point $g \in G$, we compute its distance to the closest reconstructed point, $d_{g \to R} = \min_{r \in R} \lVert g - r \rVert$.
- Given a distance threshold $\tau$, recall is defined as $\mathrm{Recall} = |\{\, g \in G : d_{g \to R} < \tau \,\}| \,/\, |G|$.
- Precision is defined symmetrically by measuring distances from reconstructed points to the ground truth. The F1-score is their harmonic mean: $F_1 = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall} \,/\, (\mathrm{Precision} + \mathrm{Recall})$.
The distance threshold defines a successful match between reconstruction and ground truth. Importantly, it is not a fixed value; it is defined per scene within the TNT evaluation code, based on each scene’s scale and ground-truth annotations.
Although prior works on TNT typically report only the F1-score, which is the harmonic mean of precision and recall, we additionally report precision and recall separately in Table 4 to provide a more comprehensive ablation study. This helps assess the impact of each component (e.g., loss terms, view counts) on reconstruction accuracy (precision) and completeness (recall). We will clarify this detail in the revised version for completeness.
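For completeness, the protocol described above can be summarized by the simplified sketch below; the official TNT evaluation code additionally performs alignment, cropping, and downsampling of the point clouds, which are omitted here.

```python
import numpy as np
from scipy.spatial import cKDTree

def tnt_f_score(gt_pts, rec_pts, tau):
    """Precision/recall/F1 in the style of the Tanks & Temples protocol.

    gt_pts: (N, 3) ground-truth points; rec_pts: (M, 3) reconstructed points;
    tau: per-scene distance threshold."""
    d_gt_to_rec = cKDTree(rec_pts).query(gt_pts)[0]   # nearest-neighbor distances
    d_rec_to_gt = cKDTree(gt_pts).query(rec_pts)[0]
    recall = np.mean(d_gt_to_rec < tau)               # completeness
    precision = np.mean(d_rec_to_gt < tau)            # accuracy
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```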
6. More explanation of the ablations in Section 5.2.
In all ablation studies reported in Section 5.2, we adopt a consistent and controlled protocol to ensure fair comparisons:
- When ablating a loss term, we drop the corresponding term entirely from the optimization objective.
- The remaining loss weights are kept unchanged, and we do not re-normalize or re-scale other terms to preserve consistent training dynamics.
- All experiments are conducted with the same number of training iterations and identical optimization settings (e.g., learning rate, batch size, data split).
- For the hyperparameter ablations, we similarly vary only the parameter under study (e.g., source view count, threshold value), while keeping all other parameters and training settings fixed.
This ensures that any performance change can be attributed directly to the presence or absence of the specific loss being tested. We will clarify these implementation details in the revised version.
7. Why is the lighting variation a problem for geometry if spherical harmonics model illumination?
Although spherical harmonics (SH) can approximate lighting variations in many scenarios, their reconstruction is inherently low-frequency—suitable for modeling large-scale diffuse illumination but insufficient for high-frequency variations such as specular highlights, cast shadows, and fine-scale reflectance changes. Geometry inference based solely on photometric consistency requires accurate alignment of surface appearance across views. However, when lighting varies (due to different exposures, shadowing, or specularity), the corresponding pixel intensities may diverge even when geometry is correct. This breaks the photometric consistency assumptions common in MVS pipelines, degrading normal and depth estimation accuracy.
Moreover, SH assumes Lambertian, low-frequency lighting and cannot recover sharp contrast variations. In uncontrolled multi-view datasets, lighting conditions may vary significantly, and SH-based alignment often fails to support robust geometric supervision. Therefore, while SH is useful in graphics/rendering pipelines, it does not guarantee photometric invariance for geometry reconstruction in real-world multi-view imagery. Our method relies on view alignment and multi-view geometric cues (edge-aware, normal-based, and feature alignment supervision) to remain robust to photometric inconsistency.
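As a concrete illustration of this band limit, standard 3DGS represents view-dependent color with a truncated SH expansion of degree $L = 3$ (16 coefficients per color channel), which can only capture smooth angular variation:

```latex
c(\mathbf{d}) \;\approx\; \sum_{l=0}^{L} \sum_{m=-l}^{l} c_{lm}\, Y_l^{m}(\mathbf{d}), \qquad L = 3 \ \text{by default in 3DGS}
```

Sharp specular lobes, cast shadows, and exposure changes require much higher angular frequencies than this truncation can represent, which is why SH alone cannot guarantee cross-view photometric consistency for geometry supervision.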
Thanks to the authors for their diligent effort. I'm satisfied with the response and will keep my score.
The paper proposes a method to enhance surface reconstruction in 3D Gaussian Splatting (3DGS) by introducing geometry-aware constraints into the rendering process. The authors incorporate four novel constraints: (1) an edge-aware image reconstruction loss that sharpens object boundaries, (2) a normal-based geometry alignment loss that refines surface orientation while suppressing noise and over-smoothing, (3) a multi-view photometric alignment loss that enforces visibility-aware color consistency across views using local homographies, and (4) a feature alignment loss based on deep features to handle variations in appearance (e.g., due to illumination change). The final objective combines all these losses to jointly optimize geometry and appearance. The method achieves state-of-the-art results on standard benchmarks in both surface reconstruction and novel view synthesis.
Strengths and Weaknesses
--- Strengths ---
- The paper is well-executed, with clear ideas and thorough descriptions.
- The proposed loss functions are sensible and appear to contribute meaningfully to improving geometric accuracy.
- The experimental section convincingly demonstrates the benefits of the proposed method on standard datasets.
--- Weaknesses ---
I did not identify any major flaws in the paper.
- The most notable weakness is the significantly increased run-time compared to standard 3DGS. This is expected, given the complexity of the proposed losses. It would be valuable to provide an approximate run-time for each loss. For instance, according to Table 4, L_f has only a marginal impact on geometric accuracy, so understanding its computational cost would help assess the trade-off.
- The ablation study reports only geometric errors; including photometric errors would provide a more complete evaluation.
- Figure 2 is not informative in its current form. It is unclear where the process begins or how the components interact. I recommend redrawing the figure to make the pipeline clearer.
Questions
I am mostly interested in the run-time cost of each introduced loss.
Limitations
Yes
Final Justification
The authors addressed my concerns in the rebuttal.
Formatting Concerns
No concern
1. The run-time cost of each loss.
- We provide the runtime of each loss ablation on the TNT dataset in the table below. With all losses enabled, training takes 20.9 minutes on an RTX 4090. Removing the feature alignment loss reduces training time to 15.94 minutes, yielding a ~5-minute saving while still preserving strong performance. By comparison, removing the photometric alignment loss saves more time (8.3 min) but sacrifices more reconstruction quality. When only the image reconstruction loss is used, our method has a lower time cost than the original 3DGS (5.9 min vs. 7.5 min) and provides a better F1-score than 3DGS (0.13 vs. 0.09).
- We agree that the feature alignment loss brings modest improvements on the TNT dataset. It improves general robustness, and its benefit under low-texture or lighting-variant scenes is not fully reflected by the F1-score metric alone. Removing it often leads to over-smoothed meshes in challenging regions (an illustrative sketch of such a feature-level term is given after the table below).
- Moreover, the contribution of the feature alignment loss is more significant on the DTU dataset: removing it increases the Chamfer distance from 0.49 to 0.52. We attribute this to dataset characteristics: DTU scenes benefit more from learned feature-level consistency due to cleaner lighting and smoother structure, whereas TNT contains diverse indoor/outdoor scenes and severe lighting variation. There, the expressive power of off-the-shelf image features is limited, which may partially underutilize the feature alignment loss. We plan to explore stronger feature extractors in future work.
| | 3DGS | Image reconstruction only | w/o | w/o | w/o photometric alignment | w/o feature alignment | Full |
|---|---|---|---|---|---|---|---|
| F1-score | 0.09 | 0.13 | 0.52 | 0.51 | 0.50 | 0.53 | 0.54 |
| Time (m) | 7.5 | 5.9 | 20.3 | 17.1 | 12.6 | 15.9 | 20.9 |
| Time Gap | - | 15.0 | 0.6 | 3.8 | 8.3 | 5.0 | 0.0 |
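To make the role of this term concrete, an illustrative feature-level consistency loss is sketched below; the feature extractor, warping step, and weighting are assumptions and do not reproduce the paper's exact Eq. 12.

```python
import torch
import torch.nn.functional as F

def feature_alignment_loss(feat_ref, feat_src_warped, visibility):
    """Illustrative feature-level consistency term.

    feat_ref, feat_src_warped: (C, H, W) deep features of the reference view and of
    a source view warped into the reference frame; visibility: (H, W) weights that
    down-weight occluded or misaligned pixels."""
    sim = F.cosine_similarity(feat_ref, feat_src_warped, dim=0)      # (H, W)
    return ((1.0 - sim) * visibility).sum() / visibility.sum().clamp(min=1e-8)
```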
2. Include photometric errors in ablation studies.
Thank you for the suggestion. As shown in the table below, we provide additional photometric evaluation in our extended experiments. On the TNT dataset, we compute image reconstruction errors (PSNR) for several ablation settings. The results indicate that photometric differences across configurations follow a similar trend as geometric accuracy, and removing normal or alignment losses degrades image fidelity.
| | w/o edge-aware weighting | w/o | w/o | w/o | w/o | w/ scale loss | Full |
|---|---|---|---|---|---|---|---|
| F1-score | 0.53 | 0.52 | 0.51 | 0.50 | 0.53 | 0.54 | 0.54 |
| PSNR | 24.23 | 24.36 | 24.38 | 24.31 | 24.16 | 24.18 | 24.41 |
3. Figure 2 is not informative in its current form.
We will redesign the figure to clearly indicate the data flow in the final version, including:
- the starting point (posed images → Gaussian initialization),
- how each loss module operates on intermediate outputs,
- the sequential or parallel interactions among modules during training,
- labels for key intermediate outputs (e.g., rendered normals, projected patches),
- arrows and modular blocks that better convey the pipeline structure.
I am happy with the answers. I will keep my rating.
This paper proposes VA-GS, a method for enhancing the geometric representation of 3D Gaussian Splatting via view alignment. The approach incorporates five key loss functions to improve surface reconstruction: edge-aware image reconstruction, normal-based geometry alignment, visibility-aware multi-view photometric alignment, and multi-view feature alignment. The method addresses limitations in boundary delineation and illumination-induced artifacts that affect surface quality in standard 3DGS. Experiments on DTU and TNT datasets demonstrate improved surface reconstruction metrics while maintaining novel view synthesis quality.
Strengths and Weaknesses
Strengths
- The paper addresses important limitations of 3DGS by incorporating multiple complementary constraints - edge awareness, normal consistency, photometric alignment, and feature-level consistency - creating a well-rounded solution for geometric enhancement.
- The method integrates cleanly with existing 3DGS frameworks without requiring architectural changes, making it easily adoptable. For instance, the visibility-aware photometric alignment is particularly well-designed for handling occlusions.
- Comprehensive evaluation on standard benchmarks (DTU, TNT, Mip-NeRF 360) shows consistent improvements in surface reconstruction metrics while maintaining competitive novel view synthesis performance.
Weaknesses
- Novelty concern. The proposed technique resembles prior methods such as PGSR and PSDF. While the combination is effective, individual components (edge-aware losses, normal consistency, photometric alignment) are well-established techniques.
- Missing comparison with PSDF. Despite being directly related in spirit, the paper does not compare VA-GS with more advanced methods such as PSDF.
Questions
I suggest the author to include comparisons with PSDF to better position the contribution and justify the novelty concern.
Limitations
Please refer to Questions and Weakness.
Final Justification
As the rebuttal has addressed most of my concerns, I would like to keep my positive recommendation.
Formatting Concerns
No Formatting Concerns.
1. Novelty concern.
Unlike PGSR’s planar‑based splatting approach and PSDF’s neural implicit surface framework, our main innovation lies in the Gaussian-based view alignment strategy, which explicitly aligns Gaussian primitive orientations and projections across multiple views and applies feature-level constraints. This design enables consistent geometry reconstruction even in high-curvature or complex lighting scenarios, which is verified by qualitative comparison results.
In contrast, PGSR introduces planar splatting of Gaussians and couples photometric consistency with geometric regularization via plane-based homography. It relies on scale regularization for flattening Gaussian ellipsoids and employs supervision with more complex gradient backpropagation paths (embedding geometry into homography). PGSR uses a standard combination of L1 and SSIM losses to compute the difference between rendered and reference images, without incorporating our newly proposed edge-aware term. In addition, it does not include our newly introduced normal smoothing loss in its normal-related constraints. On DTU, we reduce Chamfer distance from 0.53 to 0.49; on TNT, we raise F1-score from 0.52 to 0.54; on Mip-NeRF 360, we observe consistent substantial improvements across all metrics (See Tables 1, 2 & 3).
On the other hand, PSDF is built on a neural implicit SDF representation, leveraging external MVS priors and importance sampling for multi-view consistency. It does not operate on explicit Gaussian primitives, nor does it align oriented Gaussians across views as we do. On DTU, we reduce Chamfer distance from 0.55 to 0.49; on TNT, we raise F1-score from 0.53 to 0.54 (See Question 2 for details).
In summary, although our work shares some conceptual goals with prior methods, our implementation and integration of these ideas are fundamentally different in both design and effect. Our contributions are novel and distinct: we operate at the level of Gaussian primitive orientation alignment, consolidate single-view and multi-view alignment mechanisms in a unified framework, and achieve performance that exceeds both PGSR and PSDF in quality and efficiency across multiple datasets.
2. Comparison with PSDF.
We appreciate the reviewer’s suggestion. We have now included a quantitative comparison with PSDF on both the DTU and TNT datasets. As shown in the following tables, our method outperforms PSDF on most scenes and categories, achieving a lower average Chamfer distance on DTU (0.49 vs. 0.55) and a higher mean F1-score on TNT (0.54 vs. 0.53). This confirms the effectiveness of our method for high-fidelity surface reconstruction. While PSDF slightly outperforms our method on a few specific categories, we see this as valuable feedback and plan to improve feature extraction and robustness to large view variations in future work. Note that the results of PSDF are taken from the original paper, as its source code is not publicly available. We will cite this work and include these comparisons in the revised version.
Additionally, our method is significantly more efficient. As reported by the PSDF authors, training on TNT takes 17.05 hours on an RTX 3090, whereas our approach completes training in 21 minutes on an RTX 4090. This highlights the practical advantage of our explicit Gaussian-based design over neural implicit SDF frameworks, which typically require long training times (more than ten hours, as shown in Tables 1 & 2 of our paper).
Quantitative comparison of Chamfer distances on the DTU dataset (lower is better).
| | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSDF | 0.36 | 0.60 | 0.35 | 0.36 | 0.70 | 0.61 | 0.49 | 1.11 | 0.89 | 0.60 | 0.47 | 0.57 | 0.30 | 0.40 | 0.37 | 0.55 |
| Ours | 0.32 | 0.49 | 0.32 | 0.30 | 0.77 | 0.68 | 0.43 | 1.05 | 0.61 | 0.57 | 0.36 | 0.52 | 0.28 | 0.33 | 0.30 | 0.49 |
Quantitative comparison of F1-scores on the TNT dataset (higher is better).
| | Barn | Caterpillar | Courthouse | Ignatius | Meetingroom | Truck | Mean |
|---|---|---|---|---|---|---|---|
| PSDF | 0.62 | 0.39 | 0.42 | 0.79 | 0.47 | 0.53 | 0.53 |
| Ours | 0.71 | 0.45 | 0.21 | 0.82 | 0.40 | 0.64 | 0.54 |
The authors have convincingly justified the novelty of the work in their response. I remain positive about the paper and keep my favorable recommendation.
In this paper, the authors presented a new method for 3D Gaussian Splatting (3DGS) by view alignment to enhance the geometric representation. Specifically, edge-aware image cues, photometric alignment, and cross-view consistency are incorporated to enhance the geometric representations. Experiments on standard benchmark datasets show the effectiveness of the proposed method.
This paper received review comments from four expert reviewers, with initially divergent ratings that reached a consensus after the rebuttal: 1 Accept and 3 Borderline Accept. The reviewers appreciate the motivation and idea, as well as the strong performance. Although there were some concerns about the novelty, missing comparisons to related work, and some clarity issues, the major concerns were addressed during the rebuttal and discussion phase. As a result, the AC is pleased to recommend Accept, but the authors are advised to incorporate all further clarifications from the discussion and rebuttal into their final version.