PaperHub
Overall score: 6.8 / 10
Decision: Poster · 4 reviewers
Ratings: 5, 4, 4, 4 (min 4, max 5, std 0.4)
Confidence: 3.5
Originality: 3.0 · Quality: 2.8 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

ROSE: Remove Objects with Side Effects in Videos

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29

Abstract

Keywords
video inpainting; diffusion transformer; video generation

Reviews and Discussion

Official Review
Rating: 5

Removing objects in videos is challenging because not only the objects themselves but also their associated effects must be removed. The authors address this issue by proposing the ROSE framework. Since real-world data containing both objects and their side effects is scarce, they construct a novel synthetic dataset using a rendering tool. This dataset enables effective training of the ROSE framework. To handle side effects, the method incorporates mask region guidance into a diffusion-based backbone. Additionally, to account for potential human errors in object masking, they introduce mask augmentation techniques during training. With these contributions, the proposed framework achieves effective and robust video object removal.

Strengths and Weaknesses

Strengths

  1. A novel dataset that contains various inpainting scenarios with object masks and multi-view scenes.
  2. Effectively removing not only the target object but also its side effects reflects the natural human expectation of comprehensive object removal in video content.

Weaknesses

  1. While the proposed method shows promising performance, an analysis of computational efficiency (e.g., number of parameters, FLOPs) is missing. A comparison of resource usage with baseline methods is necessary to demonstrate that the framework achieves its results without excessive computational overhead.
  2. Ablation analysis is insufficient. It is important to provide a detailed breakdown of how each module contributes to the overall performance improvement, based on the ablation results.

Questions

  1. The paper requires a comparative analysis of efficiency with baseline methods such as ProPainter and DiffuEraser. In particular, metrics such as FLOPs and per-frame runtime during inference should be reported. Additionally, as mentioned in the limitations, it would be valuable to show the difference in per-frame inference time between long and short clips.

  2. The motivation in lines 156–159 clearly justifies the necessity of MRG, and its effectiveness is well demonstrated through the "+MRG" results in Tab. 1. Given this, it would be interesting to see whether similarly dramatic improvements can be observed when mask region guidance is applied to other methods as well. Do baseline models like ProPainter or DiffuEraser exhibit the same performance gains when MRG is incorporated?

  3. Mask augmentation, despite being presented as a key component of the proposed training strategy, shows only marginal performance improvements in Tab. 1 and in some cases even leads to decreased performance. What are the results when only MRG and DMP are applied without MA? As discussed in Section 4.3, the authors should provide concrete examples and analysis of specific scenarios where this augmentation brings actual benefits, along with visual or quantitative evidence to support its effectiveness.

  4. I'm curious whether masking only part of an object results in side effects being removed only for that region, or if the effect of mask augmentation is strong enough to eliminate all related side effects. For example, in a scene where a person and a horse are jumping over a fence, what would the result look like if only the person is masked out?

Limitations

yes

Final Justification

By reading the rebuttal, I found that the paper has become more complete in terms of necessary content. In particular, the detailed explanations regarding the ablation studies and the inclusion of computational resource analysis provide more objective metrics for evaluating the paper’s performance. Several other concerns have also been resolved. If the points raised in the rebuttal are properly incorporated into the final version, I believe the paper will offer thorough and well-supported explanations. Accordingly, I am raising my score to Accept.

Formatting Issues

No issues found.

Author Response

Thank you for your constructive suggestions. We will follow your recommendations to polish our paper.

W1 & Q1. While the proposed method shows promising performance, an analysis of computational efficiency (e.g., number of parameters, FLOPs) is missing. A comparison of resource usage with baseline methods is necessary to demonstrate that the framework achieves its results without excessive computational overhead

Firstly, in practical deployment scenarios, we evaluated our model on a single NVIDIA H800 GPU. For a 6-second video (15 fps, 720×480 resolution) using float16 dtype, the inference process requires approximately 20-22GB of VRAM, with a per-frame inference time of 1.39 seconds.

Also, detailed runtime characteristics and comparative benchmarks are presented in the table below. While diffusion-based mechanisms are widely adopted for their superior generation capabilities, our model demonstrates competitive performance in real-world applications. As shown in the table, our model's per-frame inference time and memory usage are comparable to current state-of-the-art inpainting models, making it practical for production environments. In real-world applications, we can also adopt a temporal tiling window to handle longer videos, and the processing time is proportional to the number of frames.

Settings: input video of 65 frames at 720×480 resolution; GPU: NVIDIA H800; data type: torch.float16.

Inference Efficiency

| Method | All Parameters (M) | Average Runtime (s/frame) | Average GPU Memory (GB) |
|---|---|---|---|
| DiffuEraser | 1952.3 | 1.02 | 26.12 |
| ProPainter | 39.4 | 0.10 | 11.75 |
| FuseFormer | 64.1 | 0.13 | 16.30 |
| FloED | 1346.8 | 1.63 | 31.41 |
| FGT | 42.3 | 1.43 | 14.38 |
| ROSE (Ours) | 1564.4 | 1.39 | 21.05 |
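For illustration, below is a minimal sketch of the temporal tiling-window idea for long videos; the callable `inpaint_window` and the window/overlap sizes are placeholders rather than our exact implementation. The point is only that runtime grows linearly with the number of frames while peak memory stays bounded by the window size.

```python
import torch

def tiled_inference(frames, inpaint_window, window=65, overlap=8):
    """Run a per-window video inpainting callable over a long clip.

    frames: tensor of shape (T, C, H, W). inpaint_window: any function that
    takes a window of frames and returns inpainted frames of the same shape.
    Frames covered by more than one window are averaged so that window
    boundaries stay smooth.
    """
    T = frames.shape[0]
    out = torch.zeros_like(frames)
    weight = torch.zeros(T, *([1] * (frames.dim() - 1)), dtype=frames.dtype)
    start = 0
    while start < T:
        end = min(start + window, T)
        out[start:end] += inpaint_window(frames[start:end])
        weight[start:end] += 1.0
        if end == T:
            break
        start = end - overlap  # advance, re-processing `overlap` shared frames
    return out / weight
```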

W2 & Q3. Ablation analysis is insufficient. It is important to provide a detailed breakdown of how each module contributes to the overall performance improvement, based on the ablation results. As discussed in Section 4.3, the authors should provide concrete examples and analysis of specific scenarios where this augmentation brings actual benefits, along with visual or quantitative evidence to support MA’s effectiveness.

Thank you for your valuable feedback. We sincerely apologize for the lack of clarity in our explanation of Tab.1. To clarify, the results presented in the table are cumulative, meaning that each module, i.e. MRG, MA, and DMP, was added sequentially. Therefore, the specific contribution of the DMP module can be observed by comparing the last two columns of the table.

Firstly, to further demonstrate the standalone accuracy and effectiveness of the DMP module, we conducted additional experiments, as shown in the first table below. The baseline setting employs the "mask-and-inpaint" paradigm without incorporating any of the proposed modules or strategies.

From the experimental results, we can clearly observe the quantitative improvements achieved by introducing the difference mask predictor (DMP) module to supervise the training process. This further validates the effectiveness of our proposed approach.

Secondly, to validate the impact of our synthetic dataset and assess the role of the backbone model, we conducted an ablation study on DiffuEraser. Specifically, we compared its performance before and after training with our dataset. The results (shown in the second table below) indicate that while our dataset contributes to measurable improvements in DiffuEraser’s performance, the overall enhancement remains limited. This observation motivated us to explore a more robust backbone model to achieve higher performance.

Thirdly, to support MA's effectiveness, we conducted an inference experiment comparing the output videos obtained with ground-truth masks versus imperfect masks generated by SAM2. The quantitative results can be seen in the third table below. From the table we can see that, by adopting MA during training, our model remains robust even when the input masks are imperfect. As for visual results demonstrating this robustness, we are sorry that we cannot provide them at the moment due to the conference committee's requirements; we will supplement them later on our open-source project page.
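To make the goal of mask augmentation more concrete, the following is a hypothetical sketch of the kind of mask corruption such an augmentation could apply during training (random dilation/erosion plus a small shift); it is illustrative only and does not reproduce the exact augmentation recipe used in the paper.

```python
import numpy as np
import cv2

def perturb_mask(mask, max_kernel=15, max_shift=10, rng=None):
    """Corrupt a clean binary mask to mimic an imperfect user/SAM2 mask.

    mask: uint8 array of shape (H, W) with values in {0, 255}. The mask is
    randomly dilated or eroded (over-/under-segmentation) and shifted by a
    few pixels (misalignment), producing the kind of imperfect input the
    model should stay robust to.
    """
    rng = rng or np.random.default_rng()
    k = int(rng.integers(3, max_kernel)) | 1            # odd structuring-element size
    kernel = np.ones((k, k), np.uint8)
    op = cv2.dilate if rng.random() < 0.5 else cv2.erode
    noisy = op(mask, kernel, iterations=1)
    dx, dy = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    shift = np.float32([[1, 0, dx], [0, 1, dy]])         # small affine translation
    return cv2.warpAffine(noisy, shift, (mask.shape[1], mask.shape[0]))
```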

Table 1: Ablation studies on each module

| Category | Metric | Base | w/ MRG | w/ MA | w/ DMP |
|---|---|---|---|---|---|
| Common | PSNR | 32.58 | 35.24 | 33.54 | 35.68 |
| | SSIM | 0.937 | 0.950 | 0.949 | 0.943 |
| | LPIPS | 0.053 | 0.040 | 0.046 | 0.045 |
| Shadow | PSNR | 30.65 | 33.29 | 31.63 | 32.85 |
| | SSIM | 0.914 | 0.920 | 0.922 | 0.920 |
| | LPIPS | 0.081 | 0.061 | 0.072 | 0.064 |
| Light | PSNR | 24.99 | 30.37 | 28.89 | 30.13 |
| | SSIM | 0.894 | 0.923 | 0.911 | 0.922 |
| | LPIPS | 0.112 | 0.074 | 0.082 | 0.077 |
| Reflection | PSNR | 25.39 | 27.71 | 26.97 | 27.41 |
| | SSIM | 0.836 | 0.843 | 0.845 | 0.841 |
| | LPIPS | 0.131 | 0.109 | 0.111 | 0.110 |
| Mirror | PSNR | 22.63 | 28.45 | 26.50 | 27.65 |
| | SSIM | 0.905 | 0.941 | 0.932 | 0.932 |
| | LPIPS | 0.142 | 0.076 | 0.086 | 0.092 |
| Translucent | PSNR | 27.43 | 30.98 | 31.14 | 31.24 |
| | SSIM | 0.925 | 0.949 | 0.948 | 0.946 |
| | LPIPS | 0.087 | 0.052 | 0.056 | 0.059 |
| Mean | PSNR | 27.28 | 30.84 | 29.77 | 30.82 |
| | SSIM | 0.902 | 0.918 | 0.916 | 0.917 |
| | LPIPS | 0.101 | 0.071 | 0.076 | 0.074 |

Table 2: Ablation studies on the backbone model

| Category | Metric | DiffuEraser (before training) | DiffuEraser (after training) | Wan (baseline settings) | ROSE (Ours) |
|---|---|---|---|---|---|
| Mean | PSNR | 26.5024 | 27.0162 | 27.2863 | 31.1221 |
| | SSIM | 0.8981 | 0.9010 | 0.9025 | 0.9170 |
| | LPIPS | 0.1284 | 0.1106 | 0.1013 | 0.0772 |

Table 3: Experiments on the input masks during inference

| Category | Metric | Ground-truth masks | Imperfect masks (SAM2) |
|---|---|---|---|
| Common | PSNR | 36.60 | 36.23 |
| | SSIM | 0.952 | 0.951 |
| | LPIPS | 0.041 | 0.040 |
| Shadow | PSNR | 33.79 | 33.67 |
| | SSIM | 0.923 | 0.919 |
| | LPIPS | 0.063 | 0.071 |
| Light | PSNR | 30.07 | 29.95 |
| | SSIM | 0.921 | 0.921 |
| | LPIPS | 0.086 | 0.087 |
| Reflection | PSNR | 27.73 | 27.69 |
| | SSIM | 0.872 | 0.867 |
| | LPIPS | 0.113 | 0.118 |
| Mirror | PSNR | 28.35 | 27.35 |
| | SSIM | 0.933 | 0.926 |
| | LPIPS | 0.088 | 0.097 |
| Translucent | PSNR | 31.43 | 31.42 |
| | SSIM | 0.947 | 0.947 |
| | LPIPS | 0.060 | 0.061 |
| Mean | PSNR | 31.12 | 31.05 |
| | SSIM | 0.923 | 0.920 |
| | LPIPS | 0.077 | 0.079 |
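For reference, the per-category numbers above follow the standard definitions of PSNR, SSIM, and LPIPS. A minimal per-frame sketch using common scikit-image and lpips implementations is given below (assuming recent library versions; the helper name is illustrative and not part of our released code).

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance network

def frame_metrics(pred, gt):
    """Compute PSNR / SSIM / LPIPS for one pair of uint8 RGB frames (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1]
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lp
```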

Q2. The motivation in lines 156–159 clearly justifies the necessity of MRG, and its effectiveness is well demonstrated through the "+MRG" results in Tab. 1. Given this, it would be interesting to see whether similarly dramatic improvements can be observed when mask region guidance is applied to other methods as well. Do baseline models like ProPainter or DiffuEraser exhibit the same performance gains when MRG is incorporated?

Thank you for your insightful observation. Indeed, we initially explored the possibility of using the models you mentioned. However, we encountered several challenges that led us to reconsider their suitability for our framework.

First, ProPainter relies on an image propagation approach, which requires a masked video as input. This design is fundamentally incompatible with our mask region guidance paradigm, making further experimentation with ProPainter impractical for our specific use case.

As for DiffuEraser, its dual-branch architecture depends on a pretrained image inpainting model (BrushNet). Specifically, its inpainting ability relies on the pretrained BrushNet, and adapting MRG for training would require completely redesigning BrushNet's architecture and introducing substantial additional components, which makes a fair comparison experiment difficult to conduct. Given these constraints, we decided against pursuing further experiments with these two models.

Q4. I’m curious whether masking only part of an object results in side effects being removed only for that region, or if the effect of mask augmentation is strong enough to eliminate all related side effects. For example, in a scene where a person and a horse are jumping over a fence, what would the result look like if only the person is masked out?

Thank you for raising this interesting question. The effect of partial masking depends on the specific context of the image and the relationships between objects.

For instance, in the scenario you described (Fig.7, line 2) where a person and horse are jointly jumping over a fence, masking only the person would successfully remove just the human figure while preserving the horse. This demonstrates that when objects are distinctly separate entities, partial masking can produce localized effects.

However, when objects are intrinsically connected, the outcome differs. As shown in Fig.1 of the supplementary material featuring a cow with a neck knot, masking just the cow's body while leaving the knot unmasked produces the same result as masking both elements. This occurs because the knot is perceived as an integral part of the cow's anatomy by the model.

Comment

Thank you for the response. The rebuttal has provided sufficient clarification to the questions I raised, and I appreciate the thoughtful answers even to those posed out of pure curiosity. I trust that the content of the rebuttal will be faithfully reflected in the final version, and I am increasing my score accordingly.

Comment

Thank you very much for your suggestions. We greatly value the feedback you provided, and we will follow your suggestions to make this paper clearer and better.

Comment

Hello, a gentle reminder for discussions. Authors have provided a rebuttal, and it would be great if you could acknowledge it and comment whether it addresses your concerns!

Official Review
Rating: 4

This work proposes to create a video removal dataset that focuses on removing the objects and the side effects of the objects. Firstly, the authors investigate five types of side effects. Then, a 3D rendering engine is used to generate the samples and ensure the quality of side effects in the videos. In the training stage, an auxiliary branch is used to achieve additional supervision, which could help the model learn to recognize the region of interest. Also, the ROSE-Bench is used to evaluate the performance of video object erasing models.

Strengths and Weaknesses

Strengths:

  1. The construction of a high-quality video removal dataset is necessary for the community, and the authors summarize six side-effect categories, which could guide follow-up works.
  2. The 3D data successfully solves the problems of side effects in previous datasets through the proposed construction pipeline.

Weaknesses:

  1. Fig. 1 is not clear when I zoom in; please use PDF images.
  2. The synthetic dataset only includes videos with camera motion and without object motion, which limits the capability of processing videos with moving objects. This point is only mentioned in Sec. 6, and no visual results are shown in the paper.
  3. The side effects are divided into six types, but only three types are shown in Fig. 7.
  4. There is a lack of experiments investigating the effectiveness of the mask-based filtering.
  5. Sec. 4.3 claims that user-provided masks are often imperfect and thus proposes mask augmentation to mitigate this issue, but there is no such case in the experiments showing how ROSE deals with imperfect masks.
  6. The visual comparisons are superior in Fig. 7, but the PSNR value in Tab. 2 is lower than ProPainter's.

Questions

  1. Does ROSE-Bench include similar imperfect masks as well to better support the function of mask augmentation?
  2. Does the low PSNR in Tab. 2 mean that ROSE cannot generalize well to the real data due to the synthetic training dataset?

Limitations

The suggestions are provided above.

Final Justification

The authors provide clear explanations and directly address my concerns. All of the concerns are carefully resolved through the listed experimental results and analysis. What I care about most is the performance on real-world video datasets and the capability of handling imperfect masks. The authors explain why the model trained on videos with only camera motion can generalize well to real videos, and compare the performance when dealing with GT masks versus SAM2 masks. It's a pity that I cannot see the visualization results now. Still, the intrinsic limitation of the synthetic dataset and the data construction pipeline (only camera motion) could affect performance. Overall, I raise my score from Borderline reject to Borderline accept.

Formatting Issues

NA

Author Response

Thank you for your constructive suggestions. We will follow your recommendations to polish our paper.

W1. Fig. 1 is not clear when I zoom in, please use PDF images.

We apologize for the quality of the figure and will replace it with a clear PDF version in the revision.

W2. The synthetic dataset only includes videos with camera motion and without object motion, which limits the capability of processing videos with moving objects. This point is only mentioned in Sec. 6, no visual results are shown in the paper.

Firstly, although the synthetic dataset lacks videos with moving objects, ROSE still shows strong performance on erasing moving objects, as shown in Fig. 1 of the supplementary material. This can be explained as follows: the main purpose of ROSE is to guide the video model to identify the side-effect area of the selected object and erase it jointly. Erasing the object itself, which requires filling the object area with proper video dynamics, is not the main challenge for ROSE. Therefore, since the object masks are provided, ROSE generalizes to both moving and static objects, focusing on erasing the side-effect areas. Furthermore, videos with camera motion induce relative object motion, which also enhances the model's performance on moving objects.

Secondly, we show the limitations of ROSE in Fig. 1 (lines 1–8) of the supplementary material, showcasing objects with large or non-rigid motions. Although flickering occasionally appears, the overall performance of ROSE remains reasonable in many scenarios with substantial motion.

Thirdly, for future improvement, since constructing synthetic videos with object motion is challenging, combining pseudo real-world data or generated data into the dataset could be explored to further improve the generalization of ROSE.

W3. The side effects are divided into six types, but only three types are shown in Fig. 7.

Thanks for the suggestion. In Fig.7 of the main paper, we show four types of side effects (shadow, common artifacts, reflection, and light source). In Fig.2 of the supplementary material, we present further visual comparisons covering reflection, shadow, light source, and mirror effects. We acknowledge that the translucent category was not included. We have prepared the visualization results on this category. Since no images can be attached in the rebuttal, we will add them in the revised version.

W4. The lack of experiments on investigating the effectiveness of the mask-based filtering.

Thank you for your observation regarding the mask-based filtering method presented in Fig.2. Below, we provide a detailed explanation of its purpose and implementation:

Objective of Mask-Based Filtering

In our synthetic dataset construction pipeline using Unreal Engine, efficiency is critical due to the large-scale generation of video pairs. A key challenge is ensuring the quality of these pairs by identifying and filtering out problematic cases, such as those with severe occlusion.

Implementation Details

(1) We first render videos alongside their corresponding object masks.

(2) By calculating the ratio of masked pixels to total pixels in each frame, we quantitatively assess occlusion levels.

(3) Camera trajectories resulting in occlusion ratios beyond a predefined threshold are automatically discarded.
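A minimal sketch of this rule-based filter is given below; the threshold values and the helper name are illustrative, not the exact ones used in our pipeline.

```python
import numpy as np

def keep_trajectory(mask_frames, max_occlusion=0.6, min_visibility=0.01):
    """Decide whether a rendered camera trajectory should be kept.

    mask_frames: sequence of binary object masks (H, W), one per frame.
    A trajectory is discarded if the object covers too much of any frame
    (severe occlusion) or disappears entirely (the target object must
    remain visible in the video).
    """
    for mask in mask_frames:
        ratio = float(np.count_nonzero(mask)) / mask.size
        if ratio > max_occlusion or ratio < min_visibility:
            return False
    return True
```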

Deterministic Nature of the Process

Since the filtering mechanism is entirely rule-based and deterministic, the outcome is fully predictable given the input masks and thresholds, which ensures that the target object remains visible in the video.

This approach significantly accelerates dataset construction by eliminating manual inspection, while ensuring consistent quality across all generated video pairs. We appreciate your attention to this methodological detail and hope this clarification addresses your question.

W5. Sec. 4.3 claims that the user-provided masks are often imperfect and thus proposes mask augmentation to mitigate this issue. But there is no such case in the experiment to show how ROSE could deal with the imperfect masks.

Thank you for your valuable feedback. We appreciate your observation regarding the handling of imperfect masks in our experiments. Due to the conference committee's requirements, we were unable to include specific examples demonstrating ROSE's performance with imperfect masks in the current submission. However, we plan to supplement these cases in future work to provide a more comprehensive evaluation.

Real-World Application

In practice, our framework leverages SAM2, a state-of-the-art segmentation model, to generate high-quality masks based on user input. SAM2's robustness ensures that most masks are accurate, but even for imperfect cases, ROSE is designed to:

(1) Automatically infer the complete object mask from partial or noisy inputs.

(2) Produce plausible and stable results despite mask inaccuracies.

Experimental Validation

To quantitatively assess ROSE's resilience to imperfect masks, we conducted an additional experiment comparing results using ground-truth masks and results using intentionally imperfect masks generated by SAM2. The quantitative results (shown in the table below) confirm that ROSE maintains strong performance even with imperfect mask inputs, underscoring its practical applicability.

We hope this clarification addresses your concern, and we welcome further discussion on this point.

Table 1: Experiments on the input masks during inference

| Category | Metric | Ground-truth masks | Imperfect masks (SAM2) |
|---|---|---|---|
| Common | PSNR | 36.60 | 36.23 |
| | SSIM | 0.952 | 0.951 |
| | LPIPS | 0.041 | 0.040 |
| Shadow | PSNR | 33.79 | 33.67 |
| | SSIM | 0.923 | 0.919 |
| | LPIPS | 0.063 | 0.071 |
| Light | PSNR | 30.07 | 29.95 |
| | SSIM | 0.921 | 0.921 |
| | LPIPS | 0.086 | 0.087 |
| Reflection | PSNR | 27.73 | 27.69 |
| | SSIM | 0.872 | 0.867 |
| | LPIPS | 0.113 | 0.118 |
| Mirror | PSNR | 28.35 | 27.35 |
| | SSIM | 0.933 | 0.926 |
| | LPIPS | 0.088 | 0.097 |
| Translucent | PSNR | 31.43 | 31.42 |
| | SSIM | 0.947 | 0.947 |
| | LPIPS | 0.060 | 0.061 |
| Mean | PSNR | 31.12 | 31.05 |
| | SSIM | 0.923 | 0.920 |
| | LPIPS | 0.077 | 0.079 |

W6 & Q2. The visual comparisons are superior in Fig. 7, but the PSNR value in Tab. 2 is lower than ProPainter. Does the low PSNR in Tab. 2 mean that ROSE cannot generalize well to the real data due to the synthetic training dataset?

The evaluation data in Tab. 2 is constructed using a "copy-and-paste" approach to create real-world video pairs, so the results only reflect the performance of erasing the object itself. ROSE shows comparable results on PSNR and better results on SSIM and LPIPS, which demonstrates that ROSE can effectively remove objects in real videos.

However, the results in Tab. 2 do not reflect side-effect erasing, which is the core objective of ROSE. Through extensive visual comparisons and quantitative experiments in Fig. 2 and Tab. 3, we have demonstrated that our model excels in this key capability.

Also, while ProPainter achieves a marginally higher PSNR score, our model demonstrates superior overall performance by outperforming it on both SSIM and LPIPS. This balance across all three standard evaluation criteria confirms that our method delivers more structurally accurate results, produces visually superior outputs, and maintains competitive fidelity.

To sum up, through a comprehensive comparison on the three evaluation subsets of ROSE-Bench, we demonstrate that ROSE achieves satisfactory performance on both side-effect and object removal in real-world and synthetic videos.

Q1. Does ROSE-Bench include similar imperfect masks as well to better support the function of mask augmentation?

Thank you for your question regarding our mask augmentation approach.

We sincerely apologize that we are unable to provide visual demonstrations of the mask augmentation effects at this time due to the requirements of the conference committee.

To substantiate the effectiveness of our mask augmentation method, we have included comprehensive quantitative results in the table under W5's answer. The data clearly demonstrates our model's robust performance when processing imperfect masks. The empirical evidence confirms that our augmentation strategy successfully enhances the model's capability to handle various mask imperfections while maintaining stable output quality.

Comment

The authors provide clear explanations and directly clarify my concerns. Thanks for your detailed response. I think the addition of the imperfect cases of the input mask in the final version may emphasize the contribution of the mask and make this work look more practical. I would like to see the visual results of how your model processes the imperfect mask and outputs the precise mask when dealing with real-world videos. Looking forward to your datasets and codes.

Comment

Thank you very much for your suggestions. We greatly value the feedback you provided, and we will follow the discussion to make this paper clearer and better.

Comment

Hello, a gentle reminder for discussions. Authors have provided a rebuttal, and it would be great if you could acknowledge it and comment whether it addresses your concerns!

Official Review
Rating: 4

The paper proposes a video object removal framework that also removes associated side effects like shadows, reflections, and lighting. To address the lack of real paired data, the authors use a 3D rendering engine to generate synthetic video pairs. ROSE is a diffusion-based inpainting model that uses full video context and introduces a difference mask predictor for explicit supervision. A new benchmark, ROSE-Bench, is created to evaluate various side-effect scenarios. Experiments show that ROSE outperforms prior methods in both accuracy and realism.

Strengths and Weaknesses

Strengths

  1. The paper addresses a practical problem in video inpainting: removing not just objects but also their environmental side effects such as shadows and reflections. Its approach is well-motivated, with a clear design that combines a diffusion-based inpainting model, full video context input, and a difference mask predictor for explicit supervision.

  2. The use of a 3D rendering pipeline to generate paired synthetic videos is effective and well-documented. Additionally, the introduction of ROSE-Bench provides a useful benchmark for evaluating side-effect removal.

  3. This paper is well-written and easy to follow.

Weaknesses

  1. While the paper presents strong quantitative and qualitative comparisons, it lacks video-based results, which are crucial for evaluating temporal consistency and perceptual quality in video inpainting. The current comparisons rely mostly on metrics and static frame visualizations, which may not fully reflect performance on dynamic content. Including sample videos would strengthen the evaluation.

  2. Additionally, the paper does not report inference time or model size, making it difficult to assess the method’s efficiency and scalability compared to prior work.

  3. To better validate the effectiveness of the constructed dataset, the authors should consider training baseline video inpainting models (e.g., DiffuEraser or ProPainter) on the same synthetic data and comparing their improvements. Since ROSE builds on top of Wan, a powerful generative model, it's unclear how much of the performance gain comes from the dataset versus the backbone.

Questions

See weaknesses.

Limitations

yes

Formatting Issues

no

Author Response

Thank you for your constructive suggestions. We will follow your recommendations to polish our paper.

W1. While the paper presents strong quantitative and qualitative comparisons, it lacks video-based results, which are crucial for evaluating temporal consistency and perceptual quality in video inpainting. The current comparisons rely mostly on metrics and static frame visualizations, which may not fully reflect performance on dynamic content. Including sample videos would strengthen the evaluation.

Thanks for your suggestions. We mainly show temporally sequential results in Fig. 1 of the main paper and Figs. 1 & 2 of the supplementary material, which briefly reflect the erasing performance on dynamic content. These frame sequences demonstrate the temporal consistency of ROSE's outputs and high-quality video reconstruction. Due to conference committee requirements, we are unable to share full video demonstrations now, and we will present more results in both the revised version and the open-source project page. We appreciate your understanding and welcome further discussion on these results.

W2. Additionally, the paper does not report inference time or model size, making it difficult to assess the method’s efficiency and scalability compared to prior work.

Firstly, in practical deployment scenarios, we evaluated our model on a single NVIDIA H800 GPU. For a 6-second video (15 fps, 720×480 resolution) using float16 dtype, the inference process requires approximately 20-22GB of VRAM, with a per-frame inference time of 1.39 seconds.

Secondly, detailed runtime characteristics and comparative benchmarks are presented in the table below. While diffusion-based mechanisms are widely adopted for their superior generation capabilities, our model demonstrates competitive performance in real-world applications. As shown in the table, our model's per-frame inference time and memory usage are comparable to current state-of-the-art inpainting models, making it practical for production environments. In real-world applications, we can also adopt a temporal tiling window to handle longer videos, and the processing time is proportional to the number of frames.

Settings: input video of 65 frames at 720×480 resolution; GPU: NVIDIA H800; data type: torch.float16.

Inference Efficiency

| Method | All Parameters (M) | Average Runtime (s/frame) | Average GPU Memory (GB) |
|---|---|---|---|
| DiffuEraser | 1952.3 | 1.02 | 26.12 |
| ProPainter | 39.4 | 0.10 | 11.75 |
| FuseFormer | 64.1 | 0.13 | 16.30 |
| FloED | 1346.8 | 1.63 | 31.41 |
| FGT | 42.3 | 1.43 | 14.38 |
| ROSE (Ours) | 1564.4 | 1.39 | 21.05 |

W3. To better validate the effectiveness of the constructed dataset, the authors should consider training baseline video inpainting models (e.g., DiffuEraser or ProPainter) on the same synthetic data and comparing their improvements. Since ROSE builds on top of Wan, a powerful generative model, it’s unclear how much of the performance gain comes from the dataset versus the backbone.

Thank you for your valuable suggestion. To validate the effectiveness of our synthetic dataset, we conducted preliminary experiments by training DiffuEraser on the same data and comparing its performance with the original model. The results demonstrated that the trained DiffuEraser exhibited some capability in mitigating object removal side effects.

However, due to the conference committee's requirements, we are currently unable to include the visual comparison results in this submission. We plan to supplement these details in the final version of the paper.

During our experiments, we observed that the performance of the trained DiffuEraser, while improved, did not fully meet our expectations. Since our focus is on removing side effects, the trained DiffuEraser did not give satisfactory visual results in this regard. We also provide a table below that quantitatively shows the improvement of DiffuEraser after training on the synthetic dataset. This led us to explore alternative base models, and we ultimately adopted Wan due to its stronger generative capabilities. The switch to Wan enhanced the model's overall performance, confirming that the choice of backbone architecture plays a critical role in this task. The table also shows the improvement made by changing the backbone model to Wan; in this table, Wan was trained with the baseline setting, which uses the "mask-and-inpaint" paradigm without any of the strategies or modules we designed.

To further boost the model's robustness and inpainting quality, we introduced additional techniques, including mask region guidance, mask augmentation, and a difference mask predictor. These innovations collectively address the limitations observed in earlier experiments and contribute to the superior results presented in our work.

Table 1: Ablation studies on the backbone model testing on synthetic paired data

| Category | Metric | DiffuEraser (before training) | DiffuEraser (after training) | Wan (baseline settings) | ROSE (Ours) |
|---|---|---|---|---|---|
| Mean | PSNR | 26.5024 | 27.0162 | 27.2863 | 31.1221 |
| | SSIM | 0.8981 | 0.9010 | 0.9025 | 0.9170 |
| | LPIPS | 0.1284 | 0.1106 | 0.1013 | 0.0772 |

Comment

Hello, a gentle reminder for discussions. Authors have provided a rebuttal, and it would be great if you could acknowledge it and comment whether it addresses your concerns!

Official Review
Rating: 4

This paper introduces ROSE (Remove Objects with Side Effects), a framework for video object removal that explicitly models and removes object-induced visual side effects such as shadows, reflections, lighting changes, and translucency. ROSE combines a novel synthetic dataset generation pipeline using Unreal Engine with a reference-based inpainting model that incorporates a difference mask predictor to guide side-effect localization. The authors also present ROSE-Bench, a benchmark tailored for evaluating side-effect-aware object removal, and demonstrate that ROSE significantly outperforms prior methods across synthetic and real-world datasets.

Strengths and Weaknesses

Strengths:

  • Timely Problem Formulation: The paper targets a practical and underexplored problem (removing not only objects but their visible environmental effects) which is both novel and relevant.
  • Robust Synthetic Dataset: The large-scale, automatically generated dataset provides high-quality, temporally aligned video pairs, enabling strong supervision for training and evaluation.
  • Architectural Design: The use of full-reference video input and a difference mask predictor enhances temporal consistency and helps isolate object-related changes effectively.
  • Comprehensive Evaluation: Experiments span synthetic, realistic, and unpaired real-world videos, with consistent improvements across standard metrics and strong ablation studies that isolate component contributions.

Weaknesses:

  • Limited Real-World Validation: The model’s performance on truly in-the-wild videos is insufficiently analyzed. Quantitative comparisons for real-world unpaired settings are lacking, and failure modes, particularly flickering under motion, are noted but not deeply explored.
  • Computational Demands: Both training and inference are resource-intensive due to the diffusion-based backbone and large-scale rendering pipeline. Practical deployment concerns are not addressed.
  • Ablation Gaps: While key components are evaluated, there is limited analysis of the difference mask predictor’s standalone effectiveness or its per-category contributions. The backbone model’s role is also not isolated.
  • Unclear Sim-to-Real Transfer: The domain gap between synthetic training data and complex real-world scenarios is acknowledged but not quantitatively assessed. The generalization capacity of ROSE remains uncertain.

Questions

  1. Domain Gap: How does ROSE's performance degrade when applied to uncurated, real-world videos with side effects outside the defined categories? Have you considered domain adaptation techniques to bridge this gap?
  2. Failure Cases: Can you provide quantitative metrics and visual examples of failure modes, such as flickering or artifacts in scenes with large motion or complex lighting?
  3. Difference Mask Evaluation: What is the standalone accuracy or effectiveness of the difference mask predictor across different types of side effects?
  4. Inference Efficiency: What are the model’s runtime characteristics (e.g., FPS, memory usage) on long videos? Are there strategies for reducing compute requirements?
  5. Generalizability: How transferable is ROSE to new video domains, such as mobile or drone footage, or to settings with unseen environmental side effects?

Limitations

Yes, the authors list the main limitations and societal impacts.

Final Justification

The authors' rebuttal and the subsequent discussion have strengthened the paper. They successfully resolved my concerns regarding computational costs by providing concrete inference metrics. They also addressed questions about generalizability and failure modes by supplying new quantitative VBench data for new domains and known failure cases.

The primary unresolved issue is the scope of the "in-the-wild" evaluation. The authors' claim that the limitations of their "copy-and-paste" benchmark are not significant remains an unsupported assertion.

Despite this, the paper's strengths and the new data provided are sufficient to warrant a recommendation for acceptance. The unresolved point is a notable limitation but does not invalidate the core contributions.

Formatting Issues

None.

Author Response

Thank you for your constructive suggestions. We will follow your recommendations to polish our paper.

W1 & Q2. Limited Real-World Validation: The model’s performance on truly in-the-wild videos is insufficiently analyzed. Quantitative comparisons for real-world unpaired settings are lacking, and failure modes, particularly flickering under motion, are noted but not deeply explored.

In Tab. 4, we evaluated real-world unpaired data with quantitative metrics from VBench. Since the real-world unpaired dataset lacks ground-truth edited videos, we cannot evaluate it with metrics like PSNR, SSIM, and LPIPS. To comprehensively evaluate ROSE's performance in real-world scenarios, we constructed a real-world paired dataset, as mentioned in Sec. 5.1, which provides the ground-truth edited videos needed for evaluation. We also apologize that, due to the conference committee's requirements, we currently cannot provide visual examples of failure cases in which the output videos flicker under very large motion. We will supplement more examples, including failure cases, in the paper and on our open-source project page.

W2 & Q4. Computational Demands: Both training and inference are resource-intensive due to the diffusion-based backbone and large-scale rendering pipeline. Practical deployment concerns are not addressed. What are the model’s runtime characteristics (e.g., FPS, memory usage) on long videos? Are there strategies for reducing compute requirements?

Firstly, in practical deployment scenarios, we evaluated our model on a single NVIDIA H800 GPU. For a 6-second video (15 fps, 720×480 resolution) using float16 dtype, the inference process requires approximately 20-22GB of VRAM, with a per-frame inference time of 1.39 seconds.

Secondly, the rendering pipeline is specifically designed for dataset preparation and does not contribute to runtime computational costs. Rendering a 6-second video (15 fps, 1920×1080 resolution) takes approximately 9 seconds.

Thirdly, detailed runtime characteristics and comparative benchmarks are presented in the table below. While diffusion-based mechanisms are widely adopted for their superior generation capabilities, our model demonstrates competitive performance in real-world applications. As shown in the table, our model's per-frame inference time and memory usage are comparable to current state-of-the-art inpainting models, making it practical for production environments. In real-world applications, we can also adopt a temporal tiling window to handle longer videos, and the processing time is proportional to the number of frames.

Settings: input video of 65 frames at 720×480 resolution; GPU: NVIDIA H800; data type: torch.float16.

Inference Efficiency

| Method | All Parameters (M) | Average Runtime (s/frame) | Average GPU Memory (GB) |
|---|---|---|---|
| DiffuEraser | 1952.3 | 1.02 | 26.12 |
| ProPainter | 39.4 | 0.10 | 11.75 |
| FuseFormer | 64.1 | 0.13 | 16.30 |
| FloED | 1346.8 | 1.63 | 31.41 |
| FGT | 42.3 | 1.43 | 14.38 |
| ROSE (Ours) | 1564.4 | 1.39 | 21.05 |

W3 & Q3. Ablation Gaps: While key components are evaluated, there is limited analysis of the difference mask predictor’s standalone effectiveness or its per-category contributions. What is the standalone accuracy or effectiveness of the difference mask predictor across different types of side effects? The backbone model’s role is also not isolated.

Thank you for your valuable feedback. We sincerely apologize for the lack of clarity in our explanation of Tab.1. To clarify, the results presented in the table are cumulative, meaning that each module, i.e. MRG, MA, and DMP, was added sequentially. Therefore, the specific contribution of the DMP module can be observed by comparing the last two columns of the table.

Firstly, to further demonstrate the standalone accuracy and effectiveness of the DMP module, we conducted additional experiments, as shown in the first table below. The baseline setting employs the "mask-and-inpaint" paradigm without incorporating any of the proposed modules or strategies.

From the experimental results, we can clearly observe the quantitative improvements achieved by introducing the difference mask predictor (DMP) module to supervise the training process. This further validates the effectiveness of our proposed approach.

Secondly, to validate the impact of our synthetic dataset and assess the role of the backbone model, we conducted an ablation study on DiffuEraser. Specifically, we compared its performance before and after training with our dataset. The results (shown in the second table below) indicate that while our dataset contributes to measurable improvements in DiffuEraser’s performance, the overall enhancement remains limited. This observation motivated us to explore a more robust backbone model to achieve higher performance.

Table 1: Ablation studies on each module

| Category | Metric | Base | w/ MRG | w/ MA | w/ DMP |
|---|---|---|---|---|---|
| Common | PSNR | 32.58 | 35.24 | 33.54 | 35.68 |
| | SSIM | 0.937 | 0.950 | 0.949 | 0.943 |
| | LPIPS | 0.053 | 0.040 | 0.046 | 0.045 |
| Shadow | PSNR | 30.65 | 33.29 | 31.63 | 32.85 |
| | SSIM | 0.914 | 0.920 | 0.922 | 0.920 |
| | LPIPS | 0.081 | 0.061 | 0.072 | 0.064 |
| Light | PSNR | 24.99 | 30.37 | 28.89 | 30.13 |
| | SSIM | 0.894 | 0.923 | 0.911 | 0.922 |
| | LPIPS | 0.112 | 0.074 | 0.082 | 0.077 |
| Reflection | PSNR | 25.39 | 27.71 | 26.97 | 27.41 |
| | SSIM | 0.836 | 0.843 | 0.845 | 0.841 |
| | LPIPS | 0.131 | 0.109 | 0.111 | 0.110 |
| Mirror | PSNR | 22.63 | 28.45 | 26.50 | 27.65 |
| | SSIM | 0.905 | 0.941 | 0.932 | 0.932 |
| | LPIPS | 0.142 | 0.076 | 0.086 | 0.092 |
| Translucent | PSNR | 27.43 | 30.98 | 31.14 | 31.24 |
| | SSIM | 0.925 | 0.949 | 0.948 | 0.946 |
| | LPIPS | 0.087 | 0.052 | 0.056 | 0.059 |
| Mean | PSNR | 27.28 | 30.84 | 29.77 | 30.82 |
| | SSIM | 0.902 | 0.918 | 0.916 | 0.917 |
| | LPIPS | 0.101 | 0.071 | 0.076 | 0.074 |

Table 2: Ablation studies on the backbone model

| Category | Metric | DiffuEraser (before training) | DiffuEraser (after training) | Wan (baseline settings) |
|---|---|---|---|---|
| Mean | PSNR | 26.5024 | 27.0162 | 27.2863 |
| | SSIM | 0.8981 | 0.9010 | 0.9025 |
| | LPIPS | 0.1284 | 0.1106 | 0.1013 |

W4. Unclear Sim-to-Real Transfer: The domain gap between synthetic training data and complex real-world scenarios is acknowledged but not quantitatively assessed. The generalization capacity of ROSE remains uncertain.

We acknowledge the potential domain gap between synthetic training data and complex real-world scenarios. However, our model demonstrates strong generalization capabilities in the diverse real-world scenarios we tested.

To rigorously assess our model’s real-world performance, we conducted quantitative evaluations using both paired and unpaired real-world data:

  • Paired Data (Tab. 2): Using a "copy-and-paste" approach to construct paired real-world data, we measured the model’s ability to remove objects and inpaint the affected regions. The metrics confirm high performance in these tasks.
  • Unpaired Data (Tab. 4): For scenarios lacking ground-truth videos, we employed VBench metrics (e.g., video quality, temporal consistency) to evaluate the model’s output. These results further validate our model’s robustness and generalization capacity in real-world domains.

While the domain gap remains a challenge in sim-to-real transfer, our extensive visual and quantitative results demonstrate that ROSE generalizes effectively to the tested real-world scenarios. We will continue to explore broader conditions to further bridge this gap.

Q1. Domain Gap: How does ROSE’s performance degrade when applied to uncurated, real-world videos with side effects outside the defined categories? Have you considered domain adaptation techniques to bridge this gap?

Thank you for your insightful question regarding the domain gap and ROSE's performance on uncurated real-world videos.

First, as demonstrated in Fig. 1 and Fig. 7, ROSE exhibits strong performance across both synthetic and real-world domains, outperforming baseline models in most scenarios. During testing, we observed that the model adapts well to real-world videos, though it encounters challenges with highly complex side effects such as ripples (Fig. 1, rows 2–4) where the results, while improved, may not be flawless.

While we have not yet implemented advanced domain adaptation techniques, we are open to suggestions. Your recommendations on specific methods to bridge this gap would be highly valuable.

We appreciate your feedback and would welcome any further insights you might have.

Q5. Generalizability: How transferable is ROSE to new video domains, such as mobile or drone footage, or to settings with unseen environmental side effects?

Thanks for your suggestion. First, we evaluate ROSE across more real-world video domains, including mobile, drone footage, and underwater scenes. The results indicate robust performance of ROSE in these scenarios. Due to conference committee requirements, we are currently unable to share visualization results, and will present them in the revised version.

Second, while the model handles known side effects effectively, its performance on unseen types remains an area for further exploration. For example, in scenarios where objects have more complex physical interactions with the environment, such as ripples generated by a duck or water vapor rising from hot food, our model may fail to perceive those intricate interactions and thus cannot remove those side effects together with the objects. Based on our current experience, we will include such limitations in the revised version, and we will leverage a wider range of synthesis technologies to address these issues in future work.

Comment

To the authors,

Thank you for the clarifications on computational costs and the backbone model. However, key concerns about the empirical validation of your model's performance and limitations remain. I would appreciate your response on the following points:

  1. DMP Contribution: Your new ablation study (rebuttal, Table 1) indicates that the DMP module degrades SSIM and LPIPS scores for several categories. Could you please explain this negative impact on key perceptual metrics?

  2. Failure Modes and Generalizability: Given the restrictions on sharing new visual results for failure cases (e.g., flickering) and new domains (e.g., drone footage), could you provide a more detailed quantitative breakdown to help assess the model's practical limits? For example, under what conditions was temporal flickering most pronounced in your VBench evaluation?

  3. "In-the-Wild" Evaluation Scope: Your "copy-and-paste" benchmark avoids complex, naturally occurring side effects like water ripples. How significant is this limitation for the model's performance on truly "in-the-wild" videos?

Comment

Q1. DMP Contribution: Your new ablation study (rebuttal, Table 1) indicates that the DMP module degrades SSIM and LPIPS scores for several categories. Could you please explain this negative impact on key perceptual metrics?

We sincerely appreciate your careful review of our work. Regarding Table 1 in the new ablation study, we'd like to kindly clarify some points that may have caused potential misunderstanding.

In Table 1, we systematically evaluate the individual contributions of three key components: mask region guidance (MRG), mask augmentation (MA), and difference mask predictor (DMP). These were tested incrementally against our base model, which implements the "mask-and-inpaint" approach without additional modules. We're pleased to report that all three components show measurable performance improvements over the base setting, with DMP in particular showing no negative effects.

While the degree of improvement does vary between modules, which we believe is quite natural, we'd be happy to elaborate on these differences. For example, MRG appears particularly effective in certain categories, likely because its direct use of the original video input enables more accurate modeling of object-environment interactions, thereby enhancing overall performance.

Q2. Failure Modes and Generalizability: Given the restrictions on sharing new visual results for failure cases (e.g., flickering) and new domains (e.g., drone footage), could you provide a more detailed quantitative breakdown to help assess the model's practical limits? For example, under what conditions was temporal flickering most pronounced in your VBench evaluation?

Thank you for your question.

Firstly, as mentioned in the rebuttal, ROSE exhibits strong performance across both synthetic and real-world domains and also shows strong generalization ability in new domains such as drone footage and underwater scenes. Due to the restrictions, those visual results will be included in the revised version. Since these real-world scenarios are unpaired, we evaluate them quantitatively on VBench.

Secondly, since we also cannot provide failure cases at the moment, we evaluate them with VBench as well; by examining the relevant indicators of those failure cases, such as temporal flickering, we can give a rough estimate of how failure cases relate to the VBench indicators. Qualitatively speaking, videos with large motion are the most likely to exhibit temporal flickering, as can be seen from the examples that will be provided in the revised version.

| Domain | Motion Smoothness ↑ | Background Consistency ↑ | Temporal Flickering ↓ | Subject Consistency ↑ | Imaging Quality ↑ |
|---|---|---|---|---|---|
| Drone footage | 0.979 | 0.923 | 0.930 | 0.907 | 0.643 |
| Underwater scenes | 0.985 | 0.924 | 0.928 | 0.908 | 0.632 |
| Failure cases | 0.966 | 0.922 | 0.946 | 0.907 | 0.626 |
| ROSE realistic unpaired benchmark | 0.975 | 0.923 | 0.936 | 0.908 | 0.630 |

Q3. "In-the-Wild" Evaluation Scope: Your "copy-and-paste" benchmark avoids complex, naturally occurring side effects like water ripples. How significant is this limitation for the model's performance on truly "in-the-wild" videos?

Thank you for your detailed consideration.

Due to the lack of high-quality paired video data in real-world scenarios, we constructed a synthetic benchmark to evaluate our model’s side effect removal capability. To assess our model’s performance in real-world applications, we employ three key evaluation strategies.

Firstly, we provide visual comparisons in Fig. 1 & 7 (main paper) and Fig. 1 & 2 (supplementary material), demonstrating that our model achieves superior performance over existing baselines. Additional visual results will be made available on our project page.

Secondly, to comprehensively assess output quality, including temporal flickering and motion smoothness, we utilize VBench. As shown in Tab. 4, our model performs competitively on these metrics.

Thirdly, to specifically evaluate object removal in real-world settings, we introduce a "copy-and-paste" benchmark. Tab. 2 confirms our model’s strong ability in this aspect. While this benchmark does not account for complex side effects, its limitations do not significantly impact our model’s performance on truly in-the-wild videos.

This multi-faceted evaluation ensures a thorough and robust assessment of our model’s capabilities.

Comment

To the authors,

Thank you for the productive discussion. Your clarification on the DMP ablation study and the new quantitative data for failure modes have addressed my main concerns.

My only remaining question is regarding the assertion that the "copy-and-paste" benchmark's limitations are not significant for "in-the-wild" performance, which seems unsubstantiated.

Overall, your detailed responses have been very helpful for my final assessment. Thank you for your engagement.

Comment

Thank you very much for your suggestions. To further substantiate the assertion that the "copy-and-paste" benchmark's limitations are not significant for in-the-wild performance, we will provide more visual results in the revised version. We greatly value the feedback you provided, and we will follow the discussion to make this paper clearer and better.

Comment

Hello, a gentle reminder for discussions. Authors have provided a rebuttal, and it would be great if you could acknowledge it and comment whether it addresses your concerns!

Final Decision

All reviewers are in agreement to accept the paper. The rebuttal added substantial content that must be included in the final version. One remaining concern, shared also by the AC and other fellow reviewers, is the real-world performance, given the synthetic training setup. The qualitative results seem to demonstrate that this may be fine, but it is somewhat hard to validate this, given that qualitative results can always be cherry-picked. Still, as the authors provide code release, this probably can be validated as the work becomes public. The AC thus follows this unanimous recommendation to accept the paper.

One reviewer went missing after the initial review and has thus been flagged for this behavior.