H3D-DGS: Exploring Heterogeneous 3D Motion Representation for Deformable 3D Gaussian Splatting
Streaming dynamic scene reconstruction with a novel 3D motion representation: accurate, fast, and compact.
Abstract
Reviews and Discussion
This paper proposes H³D-DGS, a novel framework for dynamic 3D scene reconstruction using Deformable 3D Gaussian Splatting. The key idea is to decouple the observable and unobservable components of 3D motion by leveraging optical flow to directly compute motion projected onto the image plane and learning the residual (unobservable) motion via gradient-based methods. This results in Heterogeneous 3D (H3D) control points, which serve as a more efficient and physically grounded representation of 3D motion.
Strengths and Weaknesses
Strengths: The paper introduces a novel combination of scene graphs and object-level latent vectors, along with a deformable guidance strategy that balances realism and generalizability. Extensive experiments across four datasets, including ablations, support its effectiveness. The method is clearly presented with helpful figures such as Figs. 2 and 3.
Weaknesses:
- Equation (6): Euler2Quat is not defined or referenced, leaving readers unclear on its role.
- Equation (11): The loss function is incomplete—no variables or inputs are specified in the expression.
- Table 1 highlights advantages like “Large Move,” “Fast Train,” “Compact Rep,” “Motion Manip,” but the experiments do not systematically correspond to these capabilities.
- In Tables 3 and 4, the meaning of GoS1, GoS5, GoS10, GoS2 is not explained, which obstructs full understanding of results.
Questions
- The paper introduces several key components—motion decoupling, control point interpolation, object-wise manipulation, and residual compensation—but no ablation study is conducted to isolate the contribution of each module.
- The loss function (Eq. 11) adopts a standard L1 + D-SSIM formulation, but there is no analysis of its necessity or comparison with other regularization terms (e.g., temporal smoothness, deformation consistency, or sparsity) commonly used in dynamic 3D reconstruction.
- How does the method perform in cluttered or partially labeled 3D scenes, where object-wise segmentation may be noisy or incomplete?
---Update after rebuttal--- The author rebuttal has addressed most of my concerns, so I will increase my score to borderline accept.
Limitations
See the weaknesses
Justification for Final Rating
Thank you for providing a clear and comprehensive response. Your rebuttal has addressed most of my concerns, so I will increase my score to borderline accept.
Formatting Issues
NA
W1: Equation (6) — Undefined use of Euler2Quat
Euler2Quat is not defined or referenced, leaving readers unclear on its role.
A1:
Thank you for pointing this out. The function Euler2Quat converts Euler angles to quaternions. Euler angles are more intuitive and interpretable for decoupling and controlling rotation components, which is why we adopt them during motion decomposition. However, the underlying 3D Gaussian Splatting framework operates using quaternions to represent rotation. Thus, we convert the decoupled Euler angle outputs into quaternions before feeding them into the Gaussian representation.
We will add a brief explanation and reference in the revised version to clarify this conversion step.
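For completeness, a minimal sketch of such a conversion is given below. It is our own illustration, not the paper's code: the "xyz" Euler order and the (x, y, z, w) quaternion layout returned by SciPy are assumptions, and the actual convention used in Eq. (6) may differ.

```python
# Hypothetical sketch of the Euler2Quat step in Eq. (6); rotation order and
# quaternion layout are assumptions, not taken from the paper.
import numpy as np
from scipy.spatial.transform import Rotation as R

def euler2quat(euler_angles: np.ndarray) -> np.ndarray:
    """Convert per-control-point Euler angles (N, 3), in radians, to quaternions (N, 4)."""
    return R.from_euler("xyz", euler_angles).as_quat()  # SciPy returns (x, y, z, w)

# Example: a 10-degree rotation about the z-axis.
q = euler2quat(np.array([[0.0, 0.0, np.deg2rad(10.0)]]))
```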
W2: Equation (11) — Incomplete loss function
The loss function is incomplete—no variables or inputs are specified in the expression.
A2:
Thank you for your suggestion. The loss function in Equation (11) is computed between the rendered image and the ground truth image. We agree that the variables and inputs should be made explicit for clarity.
In the camera-ready version, we will revise the equation to include these input terms and ensure the notation clearly reflects the role of each variable in the loss computation.
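For reference, one standard way of writing such a loss, following the original 3DGS formulation, is sketched below; the symbols (rendered image, ground-truth image, and weight lambda) are our notation and may differ from the camera-ready version.

```latex
% Hedged sketch of Eq. (11); the notation \hat{I} (rendered image), I (ground truth),
% and \lambda (weighting factor) is assumed, not taken from the paper.
\mathcal{L} = (1 - \lambda)\,\bigl\lVert \hat{I} - I \bigr\rVert_1
            + \lambda \,\bigl(1 - \mathrm{SSIM}(\hat{I},\, I)\bigr)
```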
W3: Table 1 claims vs. experimental validation
Table 1 highlights advantages like "Large Move," "Fast Train," "Compact Rep," "Motion Manip," but the experiments do not systematically correspond to these capabilities.
A3:
Thank you for pointing this out. We clarify that each of the capabilities listed in Table 1 is supported by specific experimental evidence:
- Large Move: The CMU-Panoptic dataset contains large-scale motions that challenge many existing dynamic 3DGS methods. As described in Lines 220–221 and 251–253, our method successfully reconstructs scenes in this setting, demonstrating robustness to large motion.
- Fast Train: We discuss training efficiency in Table 2 and Lines 270–272. Our method converges faster than prior approaches, validating the "Fast Train" claim.
- Compact Rep: Table 5 shows that our H3D control points form a compact representation of motion. Unlike dense per-frame residuals, our control points use fewer attributes and less than 2% of the parameter count — achieving efficient motion encoding.
- Motion Manip: The mechanism by which our H3D control points influence Gaussians is described in Lines 174–180. Similar to SC-GS, our control points are designed to support object-level manipulation. As the motion is modeled explicitly through these control points, they naturally enable motion editing and manipulation.
W4: Unclear meaning of GoS1, GoS5, GoS10, GoS2 in Tables 3 and 4
In Tables 3 and 4, the meaning of GoS1, GoS5, GoS10, GoS2 is not explained, which obstructs full understanding of results.
A4:
Thank you for raising this point. We provide a detailed explanation of the GoS settings in Lines 199–206 of the paper. Briefly, GoS (Group of Splatting) refers to the frequency of residual compensation used during dynamic reconstruction.
A GoS-N setting means that for every group of N frames, there is one key frame and N–1 non-key frames. The key frame is fully optimized, while the non-key frames rely more heavily on motion propagation from the key frame using our H3D control point mechanism.
A smaller N (e.g., GoS1) corresponds to more frequent optimization and higher reconstruction fidelity, while a larger N (e.g., GoS10) increases the challenge by reducing supervision frequency, thereby placing greater emphasis on the robustness of the motion representation.
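To make the grouping concrete, here is a small, self-contained illustration of the GoS-N key-frame schedule; it is our own sketch of the indexing only, not the authors' training code.

```python
# Hypothetical illustration of GoS-N grouping (ours, not the authors' code):
# the first frame of each group of N frames is a key frame and is fully optimized;
# the remaining N-1 frames rely mainly on motion propagated via H3D control points.
def is_key_frame(t: int, N: int) -> bool:
    return t % N == 0

# Example for GoS-10: frames 0, 10, 20, ... are key frames.
schedule = [(t, "key" if is_key_frame(t, 10) else "non-key") for t in range(22)]
```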
Q1: Missing ablation for key components
The paper introduces several key components—motion decoupling, control point interpolation, object-wise manipulation, and residual compensation—but no ablation study is conducted to isolate the contribution of each module.
A5:
Thank you for the detailed observation.
The term “control point interpolation” appears to be a typo and likely refers to the process of generating and placing H3D control points, which form the core of our proposed motion representation. This novel mechanism is central to reconstruction quality. We evaluate the impact of control point density in Appendix Figure 8 (Lines 483–491), which supports our choice of a compact yet expressive control structure.
We also isolate the effect of motion decoupling and residual compensation in our ablation study (Figure 6). These experiments demonstrate that both components are critical for handling complex motion: the decoupled motion provides a strong prior from optical flow, while the residual term corrects the unobservable dynamics, leading to significant improvements in temporal consistency and reconstruction accuracy.
1. Effectiveness of motion decoupling
We conduct an ablation analysis to isolate the contributions of the observable and unobservable components in our decoupled motion representation. The blue curve represents the setting without any motion modeling, leading to the fastest degradation in reconstruction quality. The orange curve reflects the effect of using only the fixed, flow-derived observable motion, which significantly slows error accumulation. The red curve corresponds to the full model that additionally incorporates the learned unobservable component, yielding the best performance. This progression clearly demonstrates that both components are essential and complementary for achieving stable and accurate dynamic reconstruction.
2. Effectiveness of residual compensation
As shown in Figure 6, we provide an ablation across different timesteps to assess the residual compensation module. Without this module, PSNR drops progressively over time (see the red curve across timesteps 0–9, 10–19, and 20–29), indicating error accumulation due to drift from imperfect motion estimation or appearance changes. When residual compensation is enabled, scene fidelity remains consistently high (see the recovery at timesteps 9–10 and 19–20), demonstrating the module's effectiveness in maintaining reconstruction quality.
Regarding object-wise manipulation, this feature serves primarily as a qualitative validation of our motion representation. It emerges naturally from the H3D control point mechanism (Lines 174–180), enabling semantically meaningful and interpretable manipulations. While it is difficult to isolate quantitatively, its effectiveness is reflected in the structured spatial allocation of control points and their semantic correspondence across the scene.
Q2: Lack of analysis on loss function design
The loss function (Eq. 11) adopts a standard L₁-DSSIM formulation, but there is no analysis of its necessity or comparison with other regularization terms (e.g., temporal smoothness, deformation consistency, or sparsity) commonly used in dynamic 3D reconstruction.
A6:
Thank you for raising this point. In the current version, we deliberately adopt a streamlined loss function—based on L₁ and DSSIM—to emphasize the core contributions of our method: motion representation and structural regularization. Despite its simplicity, this formulation achieves robust performance across a range of datasets and scenarios, as demonstrated in our experimental results.
We agree that integrating additional regularization terms, such as temporal smoothness, deformation consistency, or sparsity, could further improve the stability and fidelity of dynamic reconstruction, particularly under extreme motion or occlusion. We view this as a valuable direction for future work and are encouraged by the potential for the community to build upon our framework with more sophisticated loss formulations.
Q3: Performance in cluttered or partially labeled 3D scenes
How does the method perform in cluttered or partially labeled 3D scenes, where object-wise segmentation may be noisy or incomplete?
A7:
In cluttered or partially labeled 3D scenes, our residual compensation module helps mitigate the effects of noisy or incomplete segmentation by refining motion estimation beyond what is captured by initial segmentation labels. This contributes to improved robustness in moderately challenging conditions.
However, the overall performance may still degrade under severely erroneous segmentation, as inaccurate object boundaries can lead to improper control point grouping and motion estimation, resulting in suboptimal reconstruction.
Fortunately, modern 2D segmentation techniques exhibit strong generalization and are effective in many practical scenarios, which helps provide reliable segmentation masks. We also recognize the importance of advancing 3D semantic understanding and view this as a promising direction for future work—one that would directly enhance the effectiveness of methods like ours.
This paper proposes a method to improve 4D Gaussian reconstruction. Moving objects are segmented from the background, and their motion is modeled using control points. Optical flow is incorporated to enhance motion reconstruction accuracy.
Strengths and Weaknesses
4D reconstruction is a key task in computer vision. The proposed approach is technically sound and demonstrates good efficiency, as shown in the experiments. However, the results still contain some artifacts in the background, and the experimental evaluation could be further strengthened. In terms of methodology, while some of the components, such as foreground-background segmentation and control point modeling, are inspired by prior work, the integration of optical flow appears to improve performance. Therefore, I would consider the paper to offer moderate novelty. In addition, the paper is easy to follow, and the code release of the proposed method would also be valuable to the community.
Questions
- Some methods are only compared in Tables 3 and 4, but lack corresponding qualitative comparisons in the video.
- Important baselines are missing, such as Spacetime Gaussian, which is mentioned in the Related Work section.
- Jittering artifacts are noticeable in the background (e.g., windows in the video from 0:00–0:08 and 0:45–1:20). Is this caused by inaccurate segmentation or reflectance-related issues?
- Tables 3 and 4 should include citations for the compared methods.
- The organization of Section 3 could be improved by adding an overview paragraph before introducing symbols.
Limitations
Yes
Justification for Final Rating
Although the paper has some limitations, the proposed method outperforms state-of-the-art approaches, and the authors have committed to improving the paper. Therefore, I maintain a generally positive view of the paper.
Formatting Issues
No major formatting issues are found
Q1: Qualitative comparisons missing for some methods in Tables 3 and 4
Some methods are only compared in Tables 3 and 4, but lack corresponding qualitative comparisons in the video.
A1:
Thank you for the suggestion. While the supplemental video includes representative sequences to illustrate key differences, it does not currently cover all evaluated methods from Tables 3 and 4 due to space constraints.
We agree that a complete qualitative comparison is valuable. In the final version, we will include additional qualitative results covering all methods to provide a fair and comprehensive visualization of reconstruction quality across the full benchmark suite.
Q2: Missing baseline
Important baselines are missing, such as Spacetime Gaussian, which is mentioned in the Related Work section.
A2:
We appreciate the reviewer’s comment. We carefully selected baselines that align with the core architectural and functional design of our method. Specifically:
- Control-point-based motion representation: We compare against SC-GS and SP-GS, which also use control points to guide Gaussian motion.
- Streaming reconstruction: We include DynamicGS as a baseline due to its online, frame-wise reconstruction pipeline.
- Optical flow as a motion prior: We include MA-GS, which incorporates optical flow similarly to our method.
Regarding Spacetime Gaussian (STG), we expand on this comparison in Lines 227–228. While STG is a strong recent method, several key differences make a direct comparison challenging:
- Variable vs. fixed primitives: STG uses temporally varying opacity to allow Gaussians to appear and disappear over time. This increases flexibility but results in a much larger effective number of Gaussians for longer sequences. Our method maintains a fixed and compact set of H3D control points, enabling more efficient and interpretable motion modeling within a constrained representation space.
- Training efficiency: STG reports 40–60 minutes of training for a 50-frame sequence on a single A6000 GPU. Extrapolating this to the full Neural 3DV dataset (2700 frames) implies more than 45 hours of training time. In contrast, our method reconstructs the same dataset in just 1.9 hours.
We acknowledge the significance of STG and will clarify its relation to our work in the final version. However, given the substantial differences in assumptions, runtime complexity, and representation scale, we believe our current selection of baselines provides the most meaningful and fair evaluation of our method's contributions.
Q3: Jittering artifacts in the background
Jittering artifacts are noticeable in the background (e.g., windows in the video from 0:00–0:08 and 0:45–1:20). Is this caused by inaccurate segmentation or reflectance-related issues?
A3:
Thank you for the observation. The jittering artifacts in the background primarily arise from the characteristics of our streaming reconstruction framework. Unlike global optimization methods that jointly optimize across all frames, our method updates each frame incrementally based on the previous frame and the current input.
While this design significantly improves flexibility for real-time or streaming scenarios—allowing for low-latency processing and online reconstruction—it also introduces susceptibility to cumulative noise and temporal drift. These factors contribute to the observed background artifacts such as floaters and jittering.
Addressing these limitations is a valuable direction for future work, potentially by integrating lightweight temporal smoothing or consistency mechanisms without compromising streaming efficiency.
Q4: Missing citations in Tables 3 and 4
Tables 3 and 4 should include citations for the compared methods.
A4:
Thank you for the suggestion. We acknowledge this oversight and will include proper citations for all baseline and compared methods in Tables 3 and 4 in the camera-ready version to ensure clarity and proper attribution.
Q5: Section 3 organization improvement
The organization of Section 3 could be improved by adding an overview paragraph before introducing symbols.
A5:
Thank you for your suggestion. We plan to add the following overview paragraph at the beginning of Section 3 to improve clarity and guide the reader:
“In this section, we introduce our method for dynamic scene reconstruction with Gaussians. First, we present the notations and symbols used throughout the section to facilitate understanding. We then begin with the key idea of our approach—motion decoupling—and provide its detailed formulation. Next, we introduce the control point action pattern, where motion information is constructed using the proposed motion decoupling strategy. Based on the resulting H3D control points, we describe the full pipeline for dynamic scene reconstruction. Finally, we present the associated loss functions used to optimize the model.”
We believe this will improve the readability and logical flow of the section.
Thank you for the detailed rebuttal. It addressed many of my concerns and helped clarify several aspects of the method and experimental results.
Thank you for your thoughtful review and feedback. I’m glad the rebuttal helped address your concerns and clarify our work.
This paper proposes H3D control points, a heterogeneous 3D motion representation that separates observable motion (estimated via optical flow) from unobservable motion (learned via gradients). The method achieves faster convergence and higher reconstruction quality on dynamic scenes compared to prior 3D Gaussian Splatting approaches.
Strengths and Weaknesses
Pros:
- Decoupling the motion into two parts seems reasonable, and the paper presents a good analysis of this in the introduction.
- The proposed approach achieves clear improvement compared to the baseline methods.
Cons:
- The paper seems to lack experimental analysis on the unobservable and observable parts. For example, could you find the corresponding observable parts contributed by optical flow, and how the proposed method contributes to the unobservable parts?
- You should include the citations in Table 3.
- Since you have utilized a third-party optical flow estimation method, could you provide the corresponding inference time for reconstruction? I think it is better to provide a comparison with other methods for the content in L270–L272.
- Importantly, to some extent, this paper focuses on leveraging motion flow to facilitate deformable 3D Gaussian reconstruction. However, this paper does not discuss previous similar approaches, neither in the introduction nor in the related work. In L70–L92, this paper merely elaborates on previous works, rather than discussing the relations between this submission and previous approaches, e.g., MotionGS, Gaussian-Flow, and so on. For the performance comparison, this paper also did not compare with those approaches. Could you provide any explanations?
Questions
- Could you provide any evidence for the decoupling between the observable and unobservable components?
- In Figure 5, we can see that the proposed method looks better than 4D-GS, while its numbers are worse. Do you have any explanation?
- Please explain the results in Tab. 3 in detail, especially why the proposed method on coffee_martini is significantly worse, while the proposed approach is significantly better.
- I think this paper should be compared with recent motion-based approaches that use external optical flow, e.g., Motion-GS, or at least provide a discussion of them. Therefore, in the pre-rebuttal phase I give the current rating, though I think the idea makes sense and the performance is fine.
Limitations
Yes
Justification for Final Rating
I think the response has partly addressed my concerns. For the comparison with previous similar work, the response claims that MotionGS is for single-view input, while this paper is for multi-view reconstruction. I did not find this a sufficient explanation. Meanwhile, there are only explanations and no quantitative comparisons. I thus maintain my rating.
Formatting Issues
n/a
W1&Q1: Decoupling observable and unobservable motion
The paper seems to lack experimental analysis on the unobservable and observable parts. For example, could you find the corresponding observable parts contributed by optical flow, and how the proposed method contributes to the unobservable parts? Could you provide any evidence for the decoupling between the observable and unobservable components?
A1:
We appreciate the reviewer’s thoughtful question. The decoupling between observable and unobservable motion is not only a conceptual design but is explicitly enforced in the architecture of the proposed H3D control points.
The observable component is derived directly from optical flow and remains fixed throughout the optimization process. In contrast, the unobservable component is learned through gradient-based optimization and is responsible for compensating motion not recoverable from 2D flow cues alone — such as depth ambiguity or occluded motion.
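As a rough illustration of this separation (our own sketch in PyTorch, not the authors' implementation; the tensor shapes, naming, and the decision to keep the components separate rather than combined here are assumptions), the observable part can live in a fixed buffer while the unobservable part is a learnable parameter:

```python
# Hypothetical parameter layout for an H3D control point set (ours, not the paper's code).
import torch
import torch.nn as nn

class H3DControlPoints(nn.Module):
    def __init__(self, flow_derived: torch.Tensor):
        super().__init__()
        # Observable component: computed once from optical flow (and depth), never
        # updated by gradients, so it is registered as a buffer.
        self.register_buffer("observable", flow_derived)                   # e.g. (N, 3)
        # Unobservable component: learned via backpropagation.
        self.unobservable = nn.Parameter(torch.zeros_like(flow_derived))   # e.g. (N, 3)

    def components(self):
        # How the two parts are combined into full 6-DoF motion follows the paper's
        # formulation (Eq. 6) and is intentionally not reproduced here.
        return self.observable, self.unobservable
```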
We provide two pieces of evidence to support the effectiveness of this decoupling strategy:
1. Structural argument
Since the observable part is fixed and derived solely from optical flow, any failure in the decoupling mechanism would directly impair 3D motion estimation. This would manifest as degraded reconstruction quality both qualitatively and quantitatively. However, our method achieves strong performance across all benchmarks (Tables 1–3), indicating that the unobservable component complements the flow-based motion effectively and validates the robustness of the decoupling design.
2. Ablation study (Figure 6)
We conduct an ablation analysis to isolate the contributions of the observable and unobservable components in our decoupled motion representation. The blue curve represents the setting without any motion modeling, leading to the fastest degradation in reconstruction quality. The orange curve reflects the effect of using only the fixed, flow-derived observable motion, which significantly slows error accumulation. The red curve corresponds to the full model that additionally incorporates the learned unobservable component, yielding the best performance. This progression clearly demonstrates that both components are essential and complementary for achieving stable and accurate dynamic reconstruction.
Summary
The decoupling is structurally enforced and empirically validated through benchmark performance and targeted ablation. Our design ensures both interpretability and effectiveness in capturing complex 3D motion.
W2: Missing citations in Table 3
You should include the citations in Table 3.
A2:
Thank you for pointing this out. We acknowledge the omission and will include proper citations for all baseline methods in Table 3 in the camera-ready version to ensure clarity and credit to prior work.
W3: Runtime of optical flow estimation
Since you have utilized a third-party optical flow estimation method, could you provide the corresponding inference time for reconstruction? I think it is better to provide a comparison with other methods for the content in L270–L272.
A3:
Thank you for the suggestion. The inference time of the optical flow estimation module is as follows:
- 0.03 seconds for DIS on CPU
- 0.03 seconds for SpyNet on an RTX 4070 GPU
- 0.04 seconds for PWC-Net on an RTX 4070 GPU
We will include this information in the revised version for clarity and fairness in runtime comparisons.
Importantly, the optical flow is computed only once per frame and does not incur any iterative optimization overhead, making it efficient to integrate into our pipeline. In addition, we will expand our discussion in Lines 270–272 to provide a more complete comparison of overall runtime performance with baseline methods.
W4: Missing discussion of related approaches
Importantly, to some extent, this paper focuses on leveraging motion flow to facilitate deformable 3D Gaussian reconstruction. However, this paper does not discuss previous similar approaches, neither in the introduction nor in the related work. In L70–L92, this paper merely elaborates on previous works, rather than discussing the relations between this submission and previous approaches, e.g., MotionGS, Gaussian-Flow, and so on. For the performance comparison, this paper also did not compare with those approaches. Could you provide any explanations?
A4:
We appreciate the reviewer’s feedback and agree that a thorough comparison with related work is essential. Our selection of baseline methods was guided by architectural and methodological alignment to ensure fair and meaningful evaluation:
- Control point–based motion: Our method leverages sparse control points to represent motion. Accordingly, we compare against SP-GS and SC-GS, which similarly rely on control point clustering for deformable Gaussian control.
- Streaming reconstruction: Our pipeline operates in an online, streaming setting. Therefore, we include Dynamic3DGS, which is designed for streaming-based Gaussian reconstruction.
- Optical flow as prior: We utilize external optical flow to guide motion inference, akin to MA-GS. Though our integration strategy differs significantly, MA-GS serves as the most appropriate baseline for fair comparison under this design paradigm.
We also acknowledge the relevance of more recent approaches and will clarify these relationships in the revised manuscript:
- MotionGS: This method shares conceptual similarities with MA-GS in its use of optical flow as supervision. However, its official implementation is limited to monocular input, whereas our pipeline is designed for multi-view dynamic reconstruction. Given this, we believe that comparison with MA-GS offers a more appropriate and fair evaluation of the effectiveness of our approach.
- Gaussian-Flow: This recent method adopts a dense modeling approach, using a mixture of trigonometric and polynomial functions to represent both motion and appearance in the Gaussian field. In contrast, our proposed H3D control points offer a sparse, interpretable, and significantly more compact representation for dynamic motion.
Qualitative differences
As illustrated in Figure 4 of the Gaussian-Flow paper, sequences like coffee_martini still exhibit visible motion artifacts such as blurred hands. Our Figure 5, which visualizes the same sequence, demonstrates clearer structural fidelity, validating the effectiveness of our motion representation.
Efficiency and scalability
Gaussian-Flow reconstructs only the first 300 frames of each sequence, requiring 30k iterations to achieve 30.5 PSNR. By contrast, our model reconstructs more challenging sequences up to 1200 frames with a higher PSNR of 30.9, using fewer parameters and faster convergence. This highlights both the scalability and the compactness of our design.
We appreciate the reviewer’s suggestion and will revise the Related Work and Experimental sections to explicitly address these comparisons.
Q2: Discrepancy between visual quality and metrics in Figure 5
In Figure 5, we can find that the proposed method is better than 4D-GS, while the numbers of the method are worse. Do you have any explanations?
A5:
Yes — our method achieves higher perceptual quality in dynamic regions, despite reporting lower global metrics. This discrepancy arises from limitations in the scene and the evaluation protocol.
As discussed in Lines 255–260, the coffee_martini sequence suffers from limited camera coverage, especially over static regions such as the background. These areas are poorly reconstructed and introduce floating artifacts and noise, which negatively affect global metrics like PSNR and SSIM.
However, our proposed heterogeneous motion representation enables high-fidelity reconstruction of the dynamic foreground (e.g., hands), which is clearly superior to that of 4D-GS in Figure 5. Unfortunately, standard metrics weigh all pixels equally, so background artifacts dominate the score — even when foreground quality improves significantly.
This case highlights a known mismatch between human perceptual quality and pixel-level error metrics, especially in unbalanced scenes. We will clarify this issue more explicitly in the revised paper.
Q3: Clarification of results in Table 3
Please explain the results in detail in Table 3, especially why the proposed method on coffee_martini is significantly worse, while the proposed approach is significantly better.
A6:
The lower metrics on coffee_martini are caused by poor initial static reconstruction, while our method still achieves strong dynamic reconstruction quality.
Our method follows a streaming paradigm, where the final performance depends heavily on the initial reconstruction quality. As shown in Appendix Figure 10 (Lines 495–504), suboptimal geometry early in the sequence limits the effectiveness of later motion modeling. In the coffee_martini sequence, limited camera viewpoints and scene complexity yield a weaker initial reconstruction, degrading the overall PSNR and SSIM metrics.
Despite this, our heterogeneous motion representation enables accurate reconstruction of dynamic regions — particularly visible in the motion of the hands. This highlights the strength of our approach in modeling deformable motion, even when the static base is suboptimal.
We believe this observation underscores a promising direction for future work: improving the robustness of static initialization to further enhance dynamic scene performance.
The authors propose a new method to model Gaussian dynamics. Unlike previous gradient-based methods, they derive the motion of Gaussian control points in a hybrid style: of the full 6 DoF (Vx, Vy, Vz and Wx, Wy, Wz), half are estimated from optical flow in closed form, while the other half are optimized with backpropagation. Finally, each Gaussian is deformed according to its K-nearest control points. Other modules, including residual compensation and 3D segmentation, are also introduced to further enhance the robustness of the proposed method.
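To illustrate the "deform each Gaussian from its K-nearest control points" pattern mentioned in this summary, here is a generic, hedged sketch (our own code; the value of K, the inverse-distance weighting, and the restriction to translations are assumptions, not the paper's exact scheme):

```python
# Hypothetical sketch of K-nearest-control-point deformation (ours, not the paper's code).
import numpy as np

def deform_gaussians(gauss_xyz, ctrl_xyz, ctrl_motion, K=4, eps=1e-8):
    """gauss_xyz: (M, 3) Gaussian centers; ctrl_xyz: (N, 3) control-point positions;
    ctrl_motion: (N, 3) per-control-point translations. Returns deformed centers (M, 3)."""
    d = np.linalg.norm(gauss_xyz[:, None, :] - ctrl_xyz[None, :, :], axis=-1)  # (M, N) distances
    knn = np.argsort(d, axis=1)[:, :K]                                         # K nearest control points
    w = 1.0 / (np.take_along_axis(d, knn, axis=1) + eps)                       # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)                                          # normalize per Gaussian
    return gauss_xyz + np.einsum("mk,mkc->mc", w, ctrl_motion[knn])            # blended translation
```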
Strengths and Weaknesses
Good:
- AFAIK, it's the first work that tries to directly infer (part of) the 3D motion of control points in closed form, compared with most other works that leverage optical flow as a differentiable loss or regularization.
- The proposed hybrid motion estimation greatly reduces the optimization space of Gaussians' motion, hence achieving fast per-frame optimization. This could be promising for real-time dynamic reconstruction in future work.
Weaknesses:
- Inappropriate selection of baseline methods. SP-GS is a very weak baseline and SC-GS/DynamicGS are relatively early works. There are plenty of other works that also try to model Gaussian dynamics, including but not limited to Shape-of-Motion and Gaussian-Flow.
- Inappropriate selection of evaluation task. Since it is rather a motion modeling method than a 3D dynamic reconstruction one, the proposed method is expected to be evaluated on a dynamic point tracking task instead of merely dynamic NVS.
- The ablation study is somewhat limited. Though the hybrid motion estimation reduces the optimization space, the introduced residual compensation is somewhat "cheating" by introducing extra optimization space. To validate the effectiveness of the proposed method, an ablation study of this residual block is necessary.
Questions
- Confusing per-frame optimization cost: it seems that keyframes cost 40 s and non-keyframes 2 s to optimize. Considering the method is evaluated in the GoS-10 setup, how can the average per-frame optimization cost in Figure 1 be less than 4 s?
- It is still uncertain how you select control points from optical flow. From my understanding, optical flow is a dense map; did you apply any subsampling in the pixel space?
- Is the depth information Z required when estimating the control points' motion? If it is, how do you acquire it?
Limitations
Null
Justification for Final Rating
Overall the paper presents an innovative method that directly infers Gaussian motion attributes from 2D trajectories, indicating a clear difference from previous optimization-dominant methods. Though I am still not satisfied with the evaluation section, given the absence of point tracking tasks as in Dynamic3DGS, I understand it is hard for the authors to implement this during the rebuttal period. **Yet I will highly recommend that the authors add such task evaluations in their camera-ready version.**
Formatting Issues
Null
W1: Inappropriate selection of baseline methods.
SP-GS is a very weak baseline and SC-GS/DynamicGS are relatively early works. There are plenty of other works that also try to model Gaussian dynamics, including but not limited to Shape-of-Motion and Gaussian-Flow.
A1:
We appreciate the reviewer’s suggestion. We are confident that our selection of baseline methods is both adequate and fair given the design goals of our approach, and we have conducted further analysis of the two suggested methods — Shape-of-Motion and Gaussian-Flow — to better clarify our positioning.
Our response is structured as follows:
1. Adequacy and fairness of the selected baselines
Our method is designed around a compact, control-point-based motion representation in a streaming paradigm. Based on these principles, we selected the following baseline methods:
- SC-GS and SP-GS, both of which adopt control point clustering for Gaussian management and serve as representative approaches in this design space.
- DynamicGS, which supports online reconstruction and dynamic updates in a streaming setting.
- MA-GS, which incorporates optical flow as a prior for estimating Gaussian motion.
These methods align closely with the structure, input setting, and objectives of our work, enabling a focused and fair comparison.
2. Additional analysis of Shape-of-Motion and Gaussian-Flow
While we agree that recent methods such as Shape-of-Motion and Gaussian-Flow are valuable to discuss, they present limitations that hinder fair empirical comparisons:
- Shape-of-Motion addresses a different task — reconstructing dynamic scenes from a single monocular video. Our method, in contrast, is designed for multi-view dynamic reconstruction. Furthermore, its official implementation does not support multi-view settings, making direct comparison infeasible under our task formulation.
- Gaussian-Flow adopts a dense modeling strategy that differs significantly from ours. It fits a motion field using a mixture of trigonometric functions and polynomials applied to a large number of Gaussians. Our method instead uses sparse H3D control points, resulting in a more compact, interpretable, and scalable design.
Detailed comparison: In Figure 4 of the Gaussian-Flow paper, sequences such as coffee_martini exhibit noticeable blurring in hand regions. Our Figure 5 shows clearer reconstructions on the same sequence, indicating improved fidelity. In terms of efficiency, Gaussian-Flow reconstructs only the first 300 frames with 30k iterations, achieving 30.5 PSNR. Our method reconstructs the full 1200-frame sequences and achieves 30.9 PSNR, demonstrating stronger performance with fewer Gaussians and faster convergence.
Summary
In summary, while we acknowledge the significance of newer works, our chosen baselines remain the most fair and informative given the task alignment and technical feasibility. We also provide further comparison and analysis to clarify distinctions and support our methodological decisions.
W2: Inappropriate selection of evaluation task.
Since it is rather a motion modeling method than a 3D dynamic reconstruction one, the proposed method is expected to be evaluated on a dynamic point tracking task instead of merely dynamic NVS.
A2:
We appreciate the reviewer’s insight. While we agree that dynamic point tracking could provide a direct validation of motion estimation, our method is designed and evaluated primarily as a dynamic 3D scene reconstruction system. Accordingly, we believe dynamic novel view synthesis (NVS) remains the most appropriate and fair evaluation protocol for our objective, consistent with prior work such as 4D-GS and MA-GS.
1. Task alignment
Our method focuses on reconstructing dynamic 3D scenes from multi-view videos, and the quality of novel views directly reflects the fidelity of the learned motion and geometry. This makes NVS a natural evaluation target and aligns with the standard benchmarks used by closely related methods.
2. Motion modeling insight
To further reveal the behavior of our motion representation, we include schematic visualizations of the inferred 3D motion field in Appendix Figure 9. These qualitative results demonstrate that our hybrid motion estimation produces temporally coherent and spatially structured motion fields, offering interpretability beyond raw NVS performance.
3. Broader applicability
We view our motion representation as a generalizable building block that can serve various downstream tasks. Although our current focus is on improving reconstruction performance, the representation is expressive enough to potentially support point tracking or motion segmentation with minor adaptation.
Summary
Our evaluation strategy centers on dynamic 3D reconstruction, for which dynamic NVS is the most appropriate benchmark. We agree that evaluating on point tracking is meaningful, and we believe future extensions of our method could naturally enable such tasks.
W3: The ablation study is somewhat limited.
Though the hybrid motion estimation reduces the optimization space, the introduced residual compensation is somewhat "cheating" by introducing extra optimization space. To validate the effectiveness of the proposed method, an ablation study of this residual block is necessary.
A3:
We thank the reviewer for this valuable suggestion. We agree that the residual compensation module introduces additional flexibility, and we have conducted an ablation study to examine its contribution in isolation.
1. Effectiveness of residual compensation
As shown in Figure 6, we provide an ablation across different timesteps to assess the residual compensation module. Without this module, PSNR drops progressively over time (see the red curve across timesteps 0–9, 10–19, and 20–29), indicating error accumulation due to drift from imperfect motion estimation or appearance changes. When residual compensation is enabled, scene fidelity remains consistently high (see the recovery at timesteps 9–10 and 19–20), demonstrating the module's effectiveness in maintaining reconstruction quality.
2. Role and justification
While it does expand the representation space, residual compensation is not a standalone deformation model — it refines the output of hybrid motion estimation rather than replacing it. We designed it to address small misalignments and accumulated errors that arise naturally in long sequences. Its role is analogous to a corrective layer that stabilizes motion over time.
Summary
We agree that ablation of the residual compensation module is important and have already included such a study (Fig. 6). The results show that it meaningfully improves temporal stability without undermining the goal of compact motion modeling.
Q1: Confusing per-frame optimization cost
It seems that keyframes cost 40 s and non-keyframes 2 s to optimize. Considering the method is evaluated in the GoS-10 setup, how can the average per-frame optimization cost in Figure 1 be less than 4 s?
A4:
We appreciate the reviewer for highlighting this inconsistency. There was indeed a typographical error in the originally stated optimization times. The correct optimization cost is approximately 10 seconds for keyframes and 1.5 seconds for non-keyframes, as also reported in Table 2.
1. Linearity in per-frame optimization cost
Despite the alternating update scheme between keyframes and non-keyframes, the optimization time per frame is roughly proportional to the number of training iterations it receives. This results in a near-linear (roughly 5:1) trend in per-frame cost when averaged over the entire sequence; see the worked arithmetic below.
2. Correction in reporting
We will correct the stated values in the camera-ready version to avoid future confusion and ensure consistency across all sections of the paper.
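For reference, a back-of-the-envelope check under the GoS-10 setting with the corrected timings (one key frame at roughly 10 s plus nine non-key frames at roughly 1.5 s per group):

```latex
% Average per-frame optimization cost under GoS-10 with the corrected timings.
\frac{1 \times 10\,\mathrm{s} + 9 \times 1.5\,\mathrm{s}}{10}
  = \frac{23.5\,\mathrm{s}}{10} \approx 2.35\,\mathrm{s\ per\ frame}
```

This is consistent with the sub-4 s average reported in Figure 1.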
Q2: Optical flow subsampling
From my understanding, optical flow is a dense map; did you apply any subsampling in the pixel space?
A5:
Yes, we perform subsampling as part of our control point selection strategy.
1. Subsampling strategy
As described in the main paper (L237–240) and the appendix (L471–482), we sample control points in a regular grid pattern over the image space to reduce redundancy and computational overhead (a minimal sketch of this grid sampling is given after this answer).
2. Per-view control
Each camera view independently samples its own set of control points from the corresponding optical flow field. These control points are used to extract motion cues in the local image space.
3. Aggregation and usage
The sampled control points across views are aggregated and jointly used to guide the motion of the Gaussians via our control-based deformation framework.
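The grid sampling referenced in item 1 can be illustrated with a short, self-contained sketch (our own code; the 16-pixel stride and the array shapes are arbitrary assumptions, not values from the paper):

```python
# Hypothetical sketch (not the authors' code) of sampling control points on a
# regular grid from a dense optical-flow map.
import numpy as np

def grid_sample_flow(flow: np.ndarray, stride: int = 16):
    """flow: dense optical flow of shape (H, W, 2); returns pixel coords and their flow."""
    H, W, _ = flow.shape
    ys, xs = np.meshgrid(np.arange(0, H, stride),
                         np.arange(0, W, stride), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1)   # (N, 2) pixel positions (x, y)
    flows = flow[ys.ravel(), xs.ravel()]                  # (N, 2) 2D motion at those pixels
    return coords, flows

# Example: a 480x640 flow field with a 16-pixel stride gives 30 * 40 = 1200 control points.
coords, flows = grid_sample_flow(np.zeros((480, 640, 2), dtype=np.float32), stride=16)
```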
Q3: Use of depth information
Is the depth information Z required when estimating the control points' motion? If it is, how do you acquire it?
A6:
Yes, depth information from the previous frame (kept fixed in the optimization process of the current frame) is required to estimate the 3D positions and motion of control points.
1. Purpose of depth
Control points are distributed on the object surface in the previous frame and are used to warp the corresponding regions to the next frame. To compute their 3D motion, we must first recover their 3D positions, which requires depth information; computing the control points' motion itself likewise depends on depth (see the formulas after this answer).
2. Source of depth
We obtain this depth using the diff-gaussian-rasterization-w-depth module provided by Dynamic 3D Gaussian. This module renders both the RGB image and the depth map for each frame. The depth values are computed via alpha-blending over Gaussians, consistent with volumetric rendering formulations. This approach ensures smooth and physically meaningful depth estimation from the current Gaussian representation.
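For reference, the two depth-related steps above can be written out as follows (our notation, not the paper's: K denotes camera intrinsics, d the rendered depth at pixel (u, v), and the Gaussians along a ray are assumed sorted front to back):

```latex
% (1) Back-projecting a control point at pixel (u, v) with rendered depth d (camera frame):
\mathbf{X} = d \, K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
% (2) Alpha-blended depth along the ray, analogous to the color compositing in 3DGS:
D = \sum_{i} d_i \, \alpha_i \, T_i, \qquad T_i = \prod_{j < i} \bigl(1 - \alpha_j\bigr)
```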
For W3/A3, did you mean that the "residual block" is your implementation of "the other half of motion attributes other than the projected motion attributes"? Since you just said Fig. 6 is your ablation of this residual block.
We appreciate the reviewer’s question and would like to clarify that the residual compensation module is not a direct implementation of the "unobserved motion" component in our motion decoupling strategy. Instead, it serves as an additional optimization that refines the overall attributes of the 3D Gaussians.
While our projected motion components (derived from optical flow) are designed to be orthogonal to the learned motion components, any error in the projected motion cannot be eliminated by adjusting the learned motion alone. This structural orthogonality motivates the need for residual compensation to maintain high-fidelity reconstruction.
Figure 6 provides a composite ablation analysis that demonstrates the effectiveness of our decoupled motion representation. We encourage the reviewer to interpret the results as follows:
- Blue curve: No motion applied — this setting serves as a baseline and shows the fastest degradation in reconstruction quality.
- Orange curve: Only fixed, projected motion is applied via H3D control points. This leads to better performance than the blue curve, validating the role of the observable motion component.
- Red curve: Full model with both projected motion and residual compensation. This curve achieves the best performance, demonstrating the critical role of the residual module in stabilizing long-term reconstruction.
By examining the gap between curves over each segment (e.g., steps 0–9, 10–19, 20–29), one can observe the progressive contribution of each component. Moreover, focusing solely on the red curve, which corresponds to the GoS-10 setting in our main experiment, one can evaluate how reconstruction quality evolves over time in a streaming setup. As small errors accumulate frame-by-frame (e.g., 0–9, 10–19, 20–29), the residual compensation module applied at keyframes (e.g., 9–10, 19–20) effectively corrects these accumulated errors and restores high-quality reconstruction.
Please feel free to let us know if we have misunderstood your point or if you have any additional questions.
I see, the jump of the red line at every keyframe speaks for itself.
Overall the paper presents an innovative method that directly infers Gaussian motion attributes from 2D trajectories, indicating a clear difference from previous optimization-dominant methods. Though I am still not satisfied with the evaluation section, given the absence of point tracking tasks as in Dynamic3DGS, I understand it is hard for the authors to implement this during the rebuttal period. **Yet I will highly recommend that the authors add such task evaluations in their camera-ready version.**
I have raised my score to accept. Have a good night.
Thanks so much for the thoughtful feedback and the updated score — really appreciate it! We’re glad the core idea came through clearly, and totally agree that adding point tracking would strengthen the work. Have a great night!
This paper proposes a new fast dynamic Gaussian splatting method based on heterogeneous 3D control points. The main idea is to reduce the optimization space utilizing optical-flow back-projection. All reviewers agreed that this new hybrid representation is novel and innovative. All reviewers also agreed that the efficiency of the representation is convincing. Besides the "easy" questions that were quickly resolved in the rebuttal (e.g., requests for more ablation studies, which were already present in the appendix, and some minor technical questions), there were two main issues discussed during the discussion phase:
- Inappropriate comparison: After some discussion, it was shown that some of the recent methods are under single-view settings, which are not directly comparable to multi-view settings like the proposed method. Similarly, other methods also had some issues for direct comparisons (e.g., based on sparse vs. dense control points or different working regions). After the discussion, the comparison issues were mostly resolved.
- Background artifacts: The authors argued that this is mainly due to the streaming setting of the proposed method. The reviewer accepted this explanation.
AC agrees that the hybrid representation is novel and the efficiency is convincing, and also agrees on the conclusions of the above two issues. Accordingly, an accept decision is recommended. One of the reviewers suggested including evaluations on motion modeling itself, which AC also believes to be valuable if it is included in the final version.