WMAdapter: Adding WaterMark Control to Latent Diffusion Models
A diffusion watermark plugin offering an improved accuracy-quality tradeoff
Abstract
Reviews and Discussion
This paper proposes a method called WMAdapter, which introduces an external component into the latent diffusion model to add a contextual watermark to the generated images. The experiments show that WMAdapter achieves bitwise accuracy similar to other methods while maintaining better image quality. The robustness of the proposed method is also evaluated against several image post-processing operations.
Strengths
- The proposed method maintains better image quality than the existing ones.
- The robustness of the proposed watermarking method is evaluated on several commonly used image processing operations.
- The contextual watermarking idea is interesting.
Weaknesses
- Though the authors claim that they keep the diffusion model intact to achieve better image quality, their watermarking method changes the internal embedding within the VAE decoder. I think both changing the VAE decoder's parameters and changing the internal embedding will hugely affect the quality of the generated image.
- The robustness of the proposed method against adversarial attacks is worrying.
- The reason why hybrid finetuning works is not clarified.
Questions
- I do not think changing the VAE decoder's parameters and changing the internal embedding of the image are mathematically different. Both of them will affect the quality of the generated images. Given the results in Table 2, in which WMAdapter achieves worse robustness than other methods, does this mean that the better image quality of WMAdapter is simply a trade-off between image quality and robustness?
- In Figure 5, WMAdapter achieves bad robustness results against adversarial perturbations. Specifically, the bitwise accuracy decreases to almost 50% when the PSNR is around 30, which means the watermark can be easily removed by the attacker in the real-world scenario.
- For hybrid finetuning, it is confusing that finetuning both the VAE decoder and the watermark module and using only the watermark module during the inference achieves better results than other variants. I think detailed explanations and more experiments should be included for this part.
- For Figure 8 in the Appendix, the proposed method also shows worse robustness compared to Stable Signature, which again suggests that the improvement in image quality is only a trade-off.
We sincerely thank the reviewer for their critical insights and thoughtful suggestions. However, we would like to politely point out that some statements in the review appear to be factually incorrect. We aim to clarify these points and are open to further discussion.
Argument 1: “I think both changing the VAE decoder's parameters and changing the internal embedding will hugely affect the quality of the generated image.”
This argument contradicts the facts. Our WMAdapter demonstrates that changing the internal embedding does not significantly affect the quality of the generated image. This is supported by the numerical image quality metrics provided in Table 2, as well as the visualizations in Figures 10, 11, and 12.
Argument 2: “I do not think changing the VAE decoder's parameters and changing the internal embedding of the image are mathematically different.”
They are indeed mathematically different. Intuitively, pretrained knowledge (including how to generate high-quality images) is stored in the VAE decoder’s parameters. Fine-tuning the VAE decoder disrupts these parameters and significantly damages the pretrained knowledge. In contrast, changing the internal embedding leaves all network parameters intact, thus preserving the pretrained knowledge and avoiding catastrophic forgetting. This approach effectively utilizes the pretrained VAE as a regularizer, maintaining better image quality while optimizing the internal embeddings.
In simplified mathematical terms:
- Let $f_\theta$ denote a layer of the VAE with parameters $\theta$, and let $z$ denote its intermediate features.
- Tuning parameters: $z' = f_{\theta + \Delta\theta}(z)$ disrupts the pretrained knowledge stored in $\theta$.
- Changing embeddings: $z' = f_\theta(z + \Delta z)$ keeps $f_\theta$ as a regularizer, preserving the pretrained knowledge in $\theta$.
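To make this distinction concrete, below is a minimal PyTorch sketch of the two options. All module names, layer shapes, and the message-injection scheme are illustrative stand-ins, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained VAE decoder: conv -> activation -> conv.
decoder = nn.Sequential(
    nn.Conv2d(4, 64, 3, padding=1),   # illustrative "conv_in"
    nn.SiLU(),
    nn.Conv2d(64, 3, 3, padding=1),   # illustrative "conv_out"
)

# Option 1 (tuning parameters): theta itself is updated, overwriting the
# pretrained knowledge stored in the weights.
finetune_opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)

# Option 2 (changing embeddings): freeze theta and train only a lightweight
# adapter that adds a watermark-dependent offset to the features z.
for p in decoder.parameters():
    p.requires_grad = False            # theta stays intact, acting as a regularizer

adapter = nn.Conv2d(64 + 48, 64, 1)    # (features, 48-bit message) -> delta z

def decode_with_watermark(latent: torch.Tensor, bits: torch.Tensor) -> torch.Tensor:
    z = decoder[1](decoder[0](latent))                          # intermediate features z
    msg = bits.view(1, -1, 1, 1).expand(latent.size(0), -1, *z.shape[-2:])
    z = z + adapter(torch.cat([z, msg], dim=1))                 # z' = z + delta z
    return decoder[2](z)                                        # frozen layers finish decoding
```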
Examples from other domains further support this distinction:
- Vision-language fine-tuning techniques: Adding and fine-tuning newly introduced modules (changing the embedding) while freezing pretrained components is a widely adopted strategy in multi-modal fine-tuning. Typical examples include Flamingo [6] and BLIP-2 [7].
- Video diffusion models: Many video diffusion methods, such as AnimateDiff [4] and Tune-A-Video [5], are based on well-pretrained image diffusion models. These methods only fine-tune newly introduced temporal layers while keeping all other parameters intact, thereby preserving the pretrained knowledge of image diffusion models to ensure high-quality spatial appearance generation.
Q3: Is WMAdapter simply a trade-off between image quality and robustness?
Comprehensive Evaluation of the Trade-off: All watermarking methods inherently involve a trade-off between image quality and robustness. A more comprehensive and fair evaluation considers these two attributes as independent dimensions. In Figure 1, we present such a two-dimensional comparison, where points closer to the top-left corner represent a better quality-robustness trade-off. The results show that our method achieves a superior trade-off compared to other methods. Notably, we push up the trade-off boundary instead of merely moving along it.
Value and Insights: Our method not only provides better image quality in numerical metrics but also offers critical insights for diffusion-native methods to address the challenge of visually perceptible artifacts introduced by watermarks. This is an important and long-standing problem that prior methods have not adequately addressed (Figure 13). Our solution is both unique and novel in this regard.
Different Methods Have Different Strengths: It is important to recognize that different watermarking methods excel under different types of attacks. For example, while Stable Signature may perform better under certain image transformations, our method demonstrates greater robustness against regeneration attacks, as detailed in the next question.
Q4: Bad robustness against adversarial perturbations.
Black-box Settings: We would like to point out that under black-box settings, WMAdapter is robust to adversarial attacks (An'24), as shown in Figure 5. Only regeneration-based attacks can cause significant decreases in bit accuracy while maintaining a PSNR around 30.
Regeneration Attacks: Regeneration attacks are a well-known limitation faced by all invisible pixel-level watermarks, as theoretically proven by [3]. These attacks successfully remove pixel-level watermarks with limited image quality degradation. Despite this inherent limitation, WMAdapter exhibits stronger robustness against regeneration attacks compared to other invisible watermarking methods like Stable Signature. As shown in Figure 5, regeneration attacks typically require a 4–6 dB PSNR reduction to remove WMAdapter’s watermark, whereas only a 2 dB reduction is sufficient to remove Stable Signature’s watermark.
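To make the threat model concrete, a regeneration attack in the spirit of [3] simply re-encodes and re-decodes the image with an off-the-shelf autoencoder, discarding the high-frequency residual that carries a pixel-level watermark. A minimal sketch using diffusers (the checkpoint name is one public example, not the specific attack configuration evaluated in Figure 5):

```python
import torch
from diffusers import AutoencoderKL

# An autoencoder round-trip is the simplest regeneration attack: the watermark
# lives in fine pixel details that the autoencoder does not reconstruct.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def regenerate(image: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W) scaled to [-1, 1]; returns the regenerated image."""
    latents = vae.encode(image).latent_dist.sample()
    return vae.decode(latents).sample.clamp(-1.0, 1.0)
```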
Limitations and Focus: Different types of watermarking methods face different limitations, and there is currently no perfect watermarking solution. We want to emphasize that the primary focus of this paper is on improving the image quality of pixel-level watermarks and suppressing artifacts, rather than specifically addressing robustness against regeneration attacks.
Q5: Clarify the reason why hybrid fine-tuning works.
Underlying Mechanism: This question is related to Argument 2. The VAE decoder plays a critical role in determining the quality of the generated images. Conceptually, keeping the VAE decoder intact ensures that the pretrained knowledge remains preserved, resulting in high-quality image generation. The hybrid fine-tuning strategy, which uses the original VAE decoder, effectively avoids catastrophic forgetting caused by fine-tuning the VAE decoder, thereby achieving better image quality than joint fine-tuning strategies. Furthermore, the adapters only introduce a minimal 48-bit watermark and are well-adapted to the pretrained VAE decoder during the first stage of training. As a result, replacing the fine-tuned decoder with the pretrained one does not significantly disrupt its intermediate features.
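For clarity, the two-stage procedure can be summarized in a schematic sketch; the `train_step` callback is a stand-in for the actual optimization loop over images and 48-bit messages, which is not shown here:

```python
import copy
import torch.nn as nn

def hybrid_finetune(adapter: nn.Module, pretrained_decoder: nn.Module, train_step):
    """Schematic of hybrid fine-tuning; not the exact training script."""
    # Stage 1: decoder frozen, adapter only -- the adapter learns to inject the
    # watermark into features the pretrained decoder already understands.
    for p in pretrained_decoder.parameters():
        p.requires_grad = False
    train_step([adapter], pretrained_decoder)

    # Stage 2: joint fine-tuning with a *copy* of the decoder, which boosts
    # bit accuracy but degrades the copy's image quality.
    tuned_decoder = copy.deepcopy(pretrained_decoder)
    for p in tuned_decoder.parameters():
        p.requires_grad = True
    train_step([adapter, tuned_decoder], tuned_decoder)

    # Inference: keep the stage-2 adapter, discard the tuned decoder, and
    # decode with the intact pretrained decoder to avoid its artifacts.
    return adapter, pretrained_decoder
```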
Empirical Evidence: Table 5 and Figure 6 provide detailed comparisons of various fine-tuning strategies. The results clearly show that hybrid fine-tuning achieves superior image quality compared to other strategies.
Additional Insights: In Appendix 10, we present further investigations into the lens flare artifacts introduced by fine-tuning the VAE decoder parameters. These experiments confirm that disruptions to the VAE decoder parameters lead to noticeable image quality degradation, further supporting the mechanism behind the success of hybrid fine-tuning.
Q6: Figure 8 shows worse robustness compared to Stable Signature. The improvement in image quality is only a trade-off.
WMAdapter shows a better quality-robustness tradeoff than Stable Signature (Fig. 1) under a more comprehensive evaluation. Specifically, WMAdapter-I achieves a 22% FID improvement with only a 3% accuracy gap. Our method pushes up the quality-robustness tradeoff boundary, achieving a better overall balance. Please also refer to the response to Q3.
It is also worth emphasizing the unique advantages of WMAdapter that Stable Signature does not offer. Unlike Stable Signature, WMAdapter provides scalability by encoding watermarks directly into the adapter, eliminating the need to fine-tune the VAE decoder for each different watermark. Furthermore, as shown in the new experiments in Figure 5, WMAdapter demonstrates better robustness against regeneration attacks compared to Stable Signature.
In summary, our method achieves a better quality-robustness tradeoff than Stable Signature when considering both image quality and robustness. Additionally, WMAdapter offers unique advantages, including scalability and enhanced robustness to regeneration attacks, further demonstrating its distinctive strengths.
[1] Fernandez, Pierre, et al. "The stable signature: Rooting watermarks in latent diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Xiong, Cheng, et al. "Flexible and secure watermarking for latent diffusion model." Proceedings of the 31st ACM International Conference on Multimedia. 2023.
[3] Zhao, Xuandong, et al. "Invisible image watermarks are provably removable using generative ai." arXiv preprint arXiv:2306.01953 (2023).
[4] Guo, Yuwei, et al. "Animatediff: Animate your personalized text-to-image diffusion models without specific tuning." arXiv preprint arXiv:2307.04725 (2023).
[5] Wu, Jay Zhangjie, et al. "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[6] Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." Advances in neural information processing systems 35 (2022): 23716-23736.
[7] Li, Junnan, et al. "Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." International conference on machine learning. PMLR, 2023.
Dear Reviewer 7bru,
Thank you for taking the time to review our submission and for providing thoughtful feedback. We hope our rebuttal has adequately addressed your concerns. If there are any remaining questions or points requiring clarification, we would be happy to provide further details.
If our responses have resolved your concerns, we would be sincerely grateful if you could enhance the score to reflect this. We greatly appreciate your time and effort in supporting the review process.
Best regards,
The Authors
Thanks for the rebuttal provided by the authors. However, I still have some concerns about this paper.
They are indeed mathematically different. Intuitively, pretrained knowledge (including how to generate high-quality images) is stored in the VAE decoder’s parameters. Fine-tuning the VAE decoder disrupts these parameters and significantly damages the pretrained knowledge. In contrast, changing the internal embedding leaves all network parameters intact, thus preserving the pretrained knowledge and avoiding catastrophic forgetting. This approach effectively utilizes the pretrained VAE as a regularizer, maintaining better image quality while optimizing the internal embeddings.
Does this imply that the output image quality remains unchanged regardless of modifications to the internal embedding, provided that the parameters of the VAE remain intact? This behavior seems counterintuitive—could you provide evidence or references to support this claim?
WMAdapter shows a better quality-robustness tradeoff than Stable Signature (Fig. 1) under a more comprehensive evaluation. Specifically, WMAdapter-I achieves a 22% FID improvement with only a 3% accuracy gap. Our method pushes up the quality-robustness tradeoff boundary, achieving a better overall balance. Please also refer to the response to Q3.
Comparing two methods becomes challenging when one outperforms the other in one metric but underperforms in another. A more effective approach might be to tune the hyperparameters to ensure comparable image quality for robustness comparisons, or comparable robustness for image quality evaluations.
We sincerely appreciate your thoughtful feedback and are pleased to provide the following responses to address the concerns raised.
Q7: Does this imply that the output image quality remains unchanged regardless of modifications to the internal embedding, provided that the parameters of the VAE remain intact? This behavior seems counterintuitive—could you provide evidence or references to support this claim?
No, this does not imply that the output image quality remains entirely unchanged when the internal embedding is modified. Modifying the internal embedding and fine-tuning the parameters of the VAE both impact the output image quality. This just means that optimizing the internal embedding is a significantly better approach than fine-tuning the VAE parameters, as it preserves the pretrained knowledge and results in much higher image quality.
For evidence, please refer to the image quality metrics comparison in Table 5, Table 2 and the visual results in Figure 6. Both Adapter-F and Adapter-I (modifying the embedding) achieve substantially better PSNR, SSIM, and FID scores than Adapter-V and StableSignature (fine-tuning the parameters). Furthermore, Adapter-I suppresses visual artifacts more effectively.
Q8: Tune the hyperparameters to ensure comparable image quality for robustness comparisons or comparable robustness for image quality evaluations.
Thank you for the suggestion. Evaluating all methods at comparable visual quality or robustness levels is an effective approach. However, in practice, most watermarking methods lack explicit hyperparameters to directly control the final trained model to achieve a predetermined visual quality or robustness level. Additionally, the trade-offs between image quality and robustness vary significantly across different methods. Forcing these methods to achieve the same level of quality or robustness would likely result in suboptimal solutions for most. Consequently, controlling variables for a direct comparison is challenging, which is why prior works typically compare the final model results as they are.
In Figure 1, we present an alternative, more fair and practical approach: directly comparing the two-dimensional robustness-quality trade-off of the final models. Our method achieves a better trade-off than other methods. Importantly, we push up the trade-off boundary rather than merely moving along it, clearly demonstrating the improvements enabled by our approach.
Thanks for your response. This brings me back to my original concern. Given that modifying either the VAE or the internal embedding impacts image quality, why does modifying the internal embedding result in better image quality than modifying the VAE? Your experimental results show that your method of modifying the internal embedding achieves better image quality (albeit with worse robustness) compared to the method used by Stable Signature to modify the VAE. But why does this occur?
Q9: Why does modifying the internal embedding achieve better image quality than modifying the VAE?
As explained in our response to Argument 2, optimizing the embedding treats the pretrained VAE as a regularizer, preserving the pretrained knowledge stored in its parameters and avoiding catastrophic forgetting. In contrast, fine-tuning the VAE disrupts this pretrained knowledge, leading to a decline in image quality. Similar approaches are used in other areas, such as multimodal LLMs and video diffusion models, to achieve superior generation quality by preserving pretrained components.
Dear Reviewer 7bru,
As the deadline for the final decision is approaching, we want to kindly check if you have any further concerns or questions regarding our rebuttal. Please let us know if there is any additional clarification or information we can provide to address your concerns. We appreciate your efforts and understanding, especially during this busy period.
Best regards,
The Authors
Watermarking serves as a prominent means for actively identifying generated content, extensively applied in diffusion model image detection and attribution. Existing methods integrate watermark embedding with image generation but have not considered the relationship between watermark information and image semantics. This paper proposes a watermarking method for diffusion models called WMAdapter, which designs a Contextual Adapter to fuse watermark information with the VAE features in the diffusion model. To further improve the visual quality of watermarked images, the paper also proposes a Hybrid Finetuning strategy. This strategy first fine-tunes the adapter, then simultaneously fine-tunes both the adapter and the VAE decoder, using the fine-tuned adapter with the original VAE decoder during final image generation. Experimental results show that WMAdapter can effectively reduce artifacts.
Strengths
- The paper is well-written and clearly explained, making it easy for readers to understand.
- A Hybrid Finetuning Strategy is proposed, which further reduces artifacts in the watermark through multiple rounds of fine-tuning.
- WMAdapter is lightweight and easy to deploy, with low training overhead.
Weaknesses
- The WMAdapter integrates features from the VAE with watermark information, representing a contextual watermark. However, the authors state in line 413 that WADIFF "uses a context-less structure to encode the watermark." I do not believe that WADIFF is a context-less watermark, as it combines latent representations with watermark information and passes them through a trainable UNet layer. Therefore, both WMAdapter and WADIFF can be considered contextual watermarks. I encourage the authors to provide a more comprehensive comparison between the two methods.
- Based on the experimental results (Table 2 and Figure 8), compared to Stable Signature, this paper sacrifices some robustness in order to achieve better visual quality.
Questions
- I believe the experimental setup for the FID metric needs to be reconsidered. FID is used to measure the quality of generated images, and in related work, it is typically calculated between watermarked images and real images from the dataset to assess the impact of the watermark on the model's generation quality. This approach would be more reasonable.
- In section 4.3, "Robustness to More Attacks," this paper only presents results for WMAdapter-F against the Regeneration attack and the Surrogate detector attack. To better evaluate the robustness of WMAdapter, the results for WMAdapter-I should also be included.
- The Hybrid Finetuning strategy is a highlight of this paper; however, the final implementation uses the original VAE decoder instead of the finetuned VAE decoder. While Figure 6 intuitively demonstrates the effectiveness of this approach, I hope the authors can further investigate the reasons behind the occurrence of lens flare artifacts.
Q1: WADIFF is a contextual watermark.
You are correct that WADIFF is a contextual watermark. We apologize for the incorrect citation in line 413, which may have caused confusion. This has been corrected, and the citation has been removed in the revised manuscript. In the introduction (line 90), we briefly compared WADIFF and WMAdapter. Here, we provide a more detailed comparison.
Comparison between WADIFF and WMAdapter:
- Adapter Structure: WADIFF uses ControlNet as its adapter structure, which is heavyweight, with approximately 700M parameters. In contrast, WMAdapter is specifically designed to be lightweight, with only 1.3M parameters, greatly reducing memory cost.
- Computational Cost: WADIFF integrates the adapter into the diffusion UNet backbone, requiring a 1000-step denoising sampling process during training, which incurs significant computational overhead. WMAdapter, on the other hand, introduces the adapter at the VAE decoder level, avoiding the denoising sampling process during training and greatly reducing computational cost.
- Visual Effects: WADIFF alters the layout of the generated images, while WMAdapter maintains the original layout and is an imperceptible watermark.
- Numerical Results: WMAdapter achieves better tracing accuracy than WADIFF (~1.000 vs 0.934 when tracing keys), as demonstrated in Table 3.
Q2: Compared to Stable Signature, WMAdapter sacrifices robustness to achieve better visual quality.
All watermarking methods inherently involve a tradeoff between image quality and robustness. A more comprehensive and fair evaluation considers these two attributes as separate dimensions, directly comparing the quality-robustness tradeoff in a two-dimensional figure. In Figure 1, we present such a comparison, where points closer to the top-left corner indicate a better balance. The results show that our method achieves a superior quality-robustness tradeoff compared to Stable Signature. Specifically, WMAdapter-I achieves a 22% FID improvement with only a 3% accuracy gap. Unlike merely moving along the existing quality-robustness tradeoff boundary, our method pushes up the boundary, achieving a better overall balance.
It is also worth emphasizing the unique advantages of WMAdapter that Stable Signature does not offer. Unlike Stable Signature, WMAdapter provides scalability by encoding watermark patterns directly into the adapter, eliminating the need to fine-tune the VAE decoder for each different watermark. Furthermore, as shown in the new experiments in Figure 5, WMAdapter demonstrates better robustness against regeneration attacks compared to Stable Signature. Regeneration attacks typically cost 4–6 dB of PSNR to remove our watermark, whereas removing Stable Signature's watermark costs only about 2 dB.
In summary, our method achieves a better quality-robustness tradeoff than Stable Signature when considering both image quality and robustness. Additionally, WMAdapter offers unique advantages, including scalability and enhanced robustness to regeneration attacks, further demonstrating its distinctive strengths.
Q3: Reconsider setup for the FID metric.
Thank you for your careful inspection. We apologize for the misleading description in the original manuscript. To clarify, we did calculate the FID between the watermarked images and the real images from the COCO validation set (referred to as "the images before watermarking" in the previous version). We have updated the description in the revised manuscript to avoid any confusion.
Q4: Evaluate WMAdapter-I against adaptive attacks.
We have updated Figure 5 and the corresponding text to include WMAdapter-I and another baseline method, Stable Signature, for a more comprehensive comparison under different adaptive attacks (regeneration attacks, black/white-box adversarial attacks, black-box query-based attacks). All three methods are robust to black-box adversarial attacks and vulnerable to white-box adversarial attacks. For regeneration attacks, the results show that attackers need to incur a 4–6 dB PSNR reduction to remove our watermark (both WMAdapter-I and WMAdapter-F), whereas only a 2 dB reduction is sufficient to remove the watermark of Stable Signature. This demonstrates that our method exhibits significantly better robustness to regeneration attacks. Although the query-based attack can successfully attack all three methods, it causes image quality to drop significantly, to about 8 dB PSNR.
Q5: Further exploration on the occurrence of lens flare artifacts
We conducted several experiments to investigate the cause of lens flare artifacts.
Initial Experiment:
- We collected 80 images containing lens flare artifacts and manually replaced the small regions with artifacts using the corresponding regions from the original images (without artifacts).
- We then compared the decoded watermark bit accuracy before and after the replacement.
- The results show that lens flare artifacts do not affect the watermark bit accuracy under various attacks. This indicates that lens flare artifacts do not carry any useful watermark information for attack resistance and can be safely removed as harmless visual artifacts.
Source of Lens Flare Artifacts:
Lens flare artifacts have been observed in watermarking methods that fine-tune the VAE decoder, such as Stable Signature and FSW. This suggests that changes in the VAE decoder parameters contribute to the artifacts. To explore this further, we examined how different layers of the VAE decoder affect the occurrence of lens flare artifacts:
- The VAE decoder structure consists of: conv_in + mid_block + 4× up_blocks (0,1,2,3) + conv_norm_out + conv_out.
- Starting from the final layer (conv_out), we sequentially replaced layers of the fine-tuned VAE decoder with the corresponding layers from the pretrained decoder. After each replacement, we regenerated the watermarked images and observed the changes in the lens flare artifacts (Figure 10).
- The results show that after replacing the last four layers (conv_out, conv_norm_out, up_blocks.3, and up_blocks.2), the lens flare artifacts nearly disappeared. This suggests that these layers play a significant role in the occurrence of lens flare artifacts.
- Further experiments revealed that the artifacts are primarily caused by changes in the weight parameters rather than the bias parameters of these layers.
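This layer-replacement probe can be sketched as follows; the fine-tuned checkpoint path is hypothetical, while the layer names follow the standard diffusers VAE decoder:

```python
import copy
from diffusers import AutoencoderKL

pretrained = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
finetuned = AutoencoderKL.from_pretrained("path/to/finetuned_vae")  # hypothetical path

# Walk back from the output layer, restoring pretrained weights one block at a
# time; after each restoration, regenerate images and check for lens flares.
probe = copy.deepcopy(finetuned)
for name in ["conv_out", "conv_norm_out", "up_blocks.3", "up_blocks.2"]:
    pretrained_layer = pretrained.decoder.get_submodule(name)
    probe.decoder.get_submodule(name).load_state_dict(pretrained_layer.state_dict())
    # ... decode watermarked latents with `probe` and inspect the artifacts
```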
Conclusion and Future Work:
Based on these findings, we hypothesize that lens flare artifacts arise from insufficient constraints in the image consistency loss used by current methods, which fails to adequately regularize the parameter space of the VAE decoder.
- Future research could explore stronger image quality losses to better constrain VAE training or impose explicit regularization to limit the deviation between the fine-tuned and pretrained VAE decoders.
- Gradient-based methods could also be used for more precise identification of the factors contributing to lens flare artifacts.
These findings have been detailed in Appendix 10 of the revised manuscript.
Dear Reviewer mTvH,
Thank you for taking the time to review our submission and for providing thoughtful feedback. We hope our rebuttal has adequately addressed your concerns. If there are any remaining questions or points requiring clarification, we would be happy to provide further details.
If our responses have resolved your concerns, we would be sincerely grateful if you could enhance the score to reflect this. We greatly appreciate your time and effort in supporting the review process.
Best regards,
The Authors
Dear Reviewer mTvH,
As the deadline for the final decision is approaching, we want to kindly check if you have any further concerns or questions regarding our rebuttal. Please let us know if there is any additional clarification or information we can provide to address your concerns. We appreciate your efforts and understanding, especially during this busy period.
Best regards,
The Authors
The paper proposes a method for generating watermarked images without visible artifacts by introducing a module called the contextual adapter and a targeted fine-tuning strategy.
Strengths
This method offers a dynamic watermarking mechanism, allowing watermarking of a given image with any arbitrary key (binary message) without needing retraining to adapt to a new key.
Weaknesses
- Lack of Baseline Comparisons:
- Post-hoc watermarking: The authors have shown some comparisons with frameworks like StegaStamp [1] and RivaGAN [2]. However, a more comprehensive evaluation could have been conducted to establish stronger baseline comparisons.
- Native watermarking: Stable Messenger [3], a similar steganography framework to RoSteALS, demonstrates a better trade-off between accuracy and image quality. However, the authors did not provide any comparisons with this approach.
Confusing Contribution: The authors claim hybrid fine-tuning as a key contribution of this paper. However, in Table 5, the results of no extra fine-tuning and hybrid fine-tuning primarily demonstrate a trade-off between accuracy and image quality, a common issue in this field. This leaves the exact contribution of this fine-tuning strategy unclear.
Unclear Robustness Evaluation: While it's beneficial to see the method's robustness against regeneration attacks, Figure 5 does not clearly demonstrate if WMAdapter is more robust to these attacks than other watermarking methods such as Stable Signature or AquaLoRA.
References:
[1] Tancik, Matthew, Ben Mildenhall, and Ren Ng. "Stegastamp: Invisible hyperlinks in physical photographs." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
[2] Zhang, Kevin Alex, et al. "Robust invisible video watermarking with attention." arXiv preprint arXiv:1909.01285 (2019).
[3] Nguyen, Quang, et al. "Stable Messenger: Steganography for Message-Concealed Image Generation." arXiv preprint arXiv:2312.01284 (2023).
Questions
Please refer to the Weaknesses section.
Q1: Lack of Baseline Comparisons
Comparison with StegaStamp: We have included an additional comparison with StegaStamp. Our results show that both WMAdapter variants achieve better image quality and generally stronger robustness than StegaStamp. Note that we do not include error correction codes (ECC) for a fair comparison.
| Method | PSNR | FID | TPR | None | JPEG | Crop | Brightness | Combine |
|---|---|---|---|---|---|---|---|---|
| StegaStamp | 29.3 | 9.9 | 1.0 | 0.96 | 0.96 | 0.49 | 0.94 | 0.49 |
| WMAdapter-F | 33.1 | 2.7 | 1.0 | 0.99 | 0.92 | 0.99 | 0.99 | 0.92 |
| WMAdapter-I | 34.8 | 2.5 | 1.0 | 0.98 | 0.90 | 0.97 | 0.97 | 0.90 |
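For reference, the bit accuracy above is the fraction of the 48 embedded bits recovered after each attack, and TPR counts an image as detected when enough bits match. A minimal sketch, where the decoder output format and the threshold value are hypothetical:

```python
import torch

def bit_accuracy(decoded_bits: torch.Tensor, message: torch.Tensor) -> float:
    """Fraction of the 48 embedded bits recovered by the watermark decoder."""
    return (decoded_bits == message).float().mean().item()

def detect(decoded_bits: torch.Tensor, message: torch.Tensor, tau: int = 38) -> bool:
    """Flag an image as watermarked if at least `tau` of the 48 bits match;
    tau = 38 is a hypothetical threshold chosen to bound the false positive rate."""
    return int((decoded_bits == message).sum().item()) >= tau
```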
Comparison with Stable Messenger: Stable Messenger is concurrent work, and since its code is not open-sourced and it was trained on different datasets, performing a direct numerical comparison is challenging. Instead, we provide a comparison from the perspective of method design.
- Motivation: Stable Messenger primarily focuses on improving message (word) accuracy and introduces a novel Loss-Sum-Exponential (LSE) Loss and a latent-aware message encoder to achieve this goal. In contrast, WMAdapter focuses on improving image quality and suppressing generated artifacts.
- Solution and Model Design: Both works adopt different solutions tailored to their objectives. While the latent-aware message encoder in Stable Messenger is somewhat similar to our contextual adapter, our adapter is specifically designed to take multi-level features from the VAE decoder as contexts, whereas the latent-aware message encoder relies solely on single-level latent features.
- Robustness: Stable Messenger, as noted in its limitations section, is vulnerable to geometric transformations. In contrast, WMAdapter demonstrates robustness to such transformations, including crop and resize operations.
These differences in motivation, solution, model design, and robustness characteristics distinguish WMAdapter from Stable Messenger.
Updates:
- We have included a comparison with Stable Messenger in the updated Related Work section.
- The results for StegaStamp have been added to Table 2 in the updated manuscript.
Q2: Confusing Contribution – No extra fine-tuning and hybrid fine-tuning demonstrate a trade-off between accuracy and image quality, a common issue in this field
Methodological Perspective: While it is true that "no extra fine-tuning" and different fine-tuning strategies result in varying numerical trade-offs, we want to emphasize that hybrid fine-tuning is a novel and unique strategy. It is the only approach that effectively suppresses visual artifacts beyond the numerical trade-offs. As shown in Figure 6, both "no extra fine-tuning" and other fine-tuning strategies introduce perceptible artifacts. Hybrid fine-tuning is neither a trivial solution nor merely a trade-off mechanism that sacrifices accuracy for improved numerical quality metrics.
Value and Insights: Hybrid fine-tuning provides crucial insights into how diffusion-native methods can eliminate visually perceptible artifacts introduced by watermarks. This addresses an important and long-standing problem that prior methods (see the artifacts generated by other methods in Figure 13, even though some of them achieve good numerical metrics) or other fine-tuning strategies have not resolved.
Q3: Unclear Robustness Evaluation – Evaluating other watermarking methods under regeneration attacks
Comparison with Stable Signature: We have included a side-by-side comparison with Stable Signature and WMAdapter-I in Figure 5. For regeneration attacks (Cheng’20, Balle’18, Zhao’23), we observe that removing the watermark imprinted by WMAdapter requires a PSNR drop of 4–6 dB. In contrast, the watermark of Stable Signature can be removed with only a 2 dB reduction in image quality. This demonstrates that WMAdapter exhibits significantly better robustness against regeneration attacks compared to Stable Signature.
In terms of AquaLoRA: AquaLoRA greatly alters the layout of the original image, resulting in a very low PSNR of 8.6 dB, which deviates substantially from imperceptible watermarks like Stable Signature and WMAdapter, both of which maintain PSNR around 30 dB. This significant difference makes it challenging to include AquaLoRA in Figure 5 for a fair comparison on PSNR changes under different levels of attacks.
The paper proposes WMAdapter, a watermarking solution for latent diffusion models that embeds user-specific watermarks without altering the underlying diffusion components, maintaining image quality. Key innovations include a Contextual Adapter that adapts watermark embeddings based on image content and a Hybrid Finetuning strategy to enhance visual quality while preserving watermark robustness.
Strengths
The paper is generally well-written and easy to follow.
The idea of content-aware watermarking is interesting and different from previous work: the Contextual Adapter adapts the watermark embedding based on the image's content, improving the concealment and robustness of watermarks while minimizing visual impact.
High Image Quality Preservation.
Weaknesses
Limited novelty in watermarking pipeline.
The evaluation of robustness in adaptive settings is not comprehensive.
Dependency on Pretrained Components.
Questions
The watermarking pipeline shares similarities with Stable Signature, particularly in its foundational approach. However, while Stable Signature fine-tunes the Diffusion U-Net, this study emphasizes modifications to the VAE decoder. Additionally, the concept of integrating an external adaptor is not novel. It is recommended to explicitly highlight the distinctions between this work and prior methods, such as Stable Signature and ControlNet, to better establish its unique contributions.
For the robustness evaluation, most perturbations considered in the experiments are common image post-processing methods, such as JPEG and Crop. Apart from the surrogate detector attack and the regeneration attack, there are more adaptive attacks, such as query-based black-box attacks or white-box attacks. It is important to consider a comprehensive setting.
In terms of robustness, it is observed that the bit accuracy drops to approximately 0.9 when JPEG compression at a quality level of 80 is applied. Is this reduction sufficient for practical robustness? Given that JPEG 80 is considered a moderate perturbation, it would be valuable to analyze the system's response to more severe JPEG compression levels to understand the potential vulnerabilities if an attacker deploys more aggressive degradation techniques.
Is there existing literature that defines PSNR values of 32 or 30 as indicative of poor visual quality? This question is relevant since some adaptive attacks result in significant decreases in bit accuracy even when the PSNR remains above 30.
Q1: Dependency on Pre-trained Components
Our method can work with any public or privately trained watermark decoders. In principle, it also supports joint fine-tuning of the watermark decoder during training. Using a pretrained decoder aligns with the conventions of other watermarking methods, such as Stable Signature, SSL, and WADIFF. Additionally, it facilitates direct and fair comparisons with the baseline method, Stable Signature.
Q2: Compare with Stable Signature and ControlNet to highlight unique contributions
Comparison with Stable Signature: Stable Signature fine-tunes the VAE decoder to embed a fixed watermark, requiring a new fine-tuning process each time a different watermark is embedded. This fine-tuning, especially on small datasets, often degrades image quality and introduces artifacts. In contrast, WMAdapter trains an adapter capable of embedding any 48-bit message, offering scalability and eliminating the need for repeated fine-tuning. Compared to other external watermark adapters [1][2], WMAdapter introduces a contextual structure and employs a hybrid fine-tuning strategy, avoiding modifications to pretrained VAE parameters and maintaining high image quality while embedding watermarks.
Comparison with ControlNet: ControlNet is a heavyweight adapter attached to the diffusion UNet, with a parameter size of approximately 700M. In contrast, WMAdapter is a lightweight adapter attached to the VAE decoder, with only 1.3M parameters. Watermarking methods using ControlNet, such as WADIFF [3], alter the layout of generated images and require an additional computationally expensive training process with 1000 steps of diffusion sampling. WMAdapter, however, does not alter the image layout, and its training process is simple and low-cost, avoiding the lengthy diffusion sampling steps. In addition, WMAdapter achieves better tracing accuracy than WADIFF, as shown in Table 3.
Summary: WMAdapter fundamentally differs from both Stable Signature and ControlNet in methodology, providing a scalable, lightweight, and efficient solution while maintaining high-quality image generation.
Q3: Include additional adaptive attacks
Per the reviewer's request, we have added experiments for both query-based black-box attacks and white-box attacks. We also compared the performance of WMAdapter-I and StableSignature under these new attack scenarios. The results and relevant descriptions have been updated in Figure 5 and the corresponding experimental section. Please refer to the manuscript for detailed explanations.
Below is a summary table of the results for WMAdapter-F under the newly introduced adaptive attacks.
- For the query-based black-box attack, we adopted the WEvade-B-Q [4] method and set the detection threshold (in bits) to control the false positive rate. For the white-box attack, we adopted a targeted attack with random labels.
- The results show that both the black-box query-based attack and the white-box attack were successful. However, the black-box query-based attack significantly degraded image quality, with an average PSNR of approximately 8 dB.
- Another key observation is that defending against white-box attacks remains challenging for watermarks, as this issue applies generally to any other classifiers. Incorporating adversarial training with PGD augmentation could potentially improve watermark robustness against white-box attacks. However, resistance to white-box attacks is not the main focus of current watermarking methods.
Comparison Table
| Attack Method | PSNR | Bit Accuracy |
|---|---|---|
| Black-Box Regeneration Attack | 29.2 | 0.56 |
| Black-Box Adversarial Attack | 32.5 | 0.97 |
| White-Box Adversarial Attack | 32.2 | 0.51 |
| Black-Box Query-based Attack | 8.2 | 0.77 |
Q4: More severe JPEG compression levels
Figure 8 in the Appendix provides a more detailed performance evaluation across various attack intensities and types. Notably, under a more severe JPEG compression rate (JPEG 30), WMAdapter still achieves a bit accuracy of approximately 0.8. Additionally, WMAdapter demonstrates strong robustness under other types of image transformations as well.
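Such a sweep is straightforward to reproduce. Below is a minimal sketch using Pillow for compression; `decode_fn` and `message` are stand-ins for the watermark decoder and the embedded 48-bit message:

```python
import io
from PIL import Image

def jpeg_attack(img: Image.Image, quality: int) -> Image.Image:
    """Round-trip an image through JPEG at the given quality level."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def robustness_sweep(watermarked: Image.Image, decode_fn, message):
    """Map JPEG quality -> fraction of the 48 message bits decoded correctly."""
    accs = {}
    for q in (80, 60, 40, 30, 20):
        bits = decode_fn(jpeg_attack(watermarked, q))
        accs[q] = sum(int(b == m) for b, m in zip(bits, message)) / len(message)
    return accs
```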
Q5: Some adaptive attacks result in significant decreases in bit accuracy even when the PSNR remains above 30.
A PSNR above 30 does not necessarily indicate poor visual quality. For black-box attacks, only regeneration-based attacks can cause significant decreases in bit accuracy while maintaining a PSNR above 30.
Regeneration attacks are a well-known limitation for all invisible pixel-level watermarks, as theoretically proven by [5]. These attacks can successfully remove pixel-level invisible watermarks with limited impact on image quality. Despite this inherent limitation, WMAdapter demonstrates stronger robustness against regeneration attacks compared to other invisible watermarking methods such as Stable Signature. As shown in Figure 5, regeneration attacks generally require a 4-6 dB PSNR reduction to remove our watermark, whereas they require only about a 2 dB reduction to remove the watermark in Stable Signature.
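For reference, the PSNR values quoted here follow the standard definition
$$\mathrm{PSNR}(x, \hat{x}) = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}(x, \hat{x})},$$
where MAX is the peak pixel value (255 for 8-bit images). Because the scale is logarithmic, a 4–6 dB drop corresponds to a 2.5–4× increase in mean squared error, a substantially more visible degradation than the roughly 2 dB required to remove Stable Signature's watermark.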
Different types of watermarking methods face different limitations, and there is currently no perfect watermarking solution. We would like to emphasize that the focus of this paper is on improving the image quality of pixel-level watermarks and suppressing artifacts, rather than addressing robustness against regeneration attacks.
[1] Bui, Tu, et al. "Rosteals: Robust steganography using autoencoder latent space." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[2] Xiong, Cheng, et al. "Flexible and secure watermarking for latent diffusion model." Proceedings of the 31st ACM International Conference on Multimedia. 2023.
[3] Min, Rui, et al. "A watermark-conditioned diffusion model for ip protection." arXiv preprint arXiv:2403.10893 (2024).
[4] Jiang, Zhengyuan, Jinghuai Zhang, and Neil Zhenqiang Gong. "Evading watermark based detection of AI-generated content." Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 2023.
[5] Zhao, Xuandong, et al. "Invisible image watermarks are provably removable using generative ai." arXiv preprint arXiv:2306.01953 (2023).
Dear Reviewer zjqX,
Thank you for taking the time to review our submission and for providing thoughtful feedback. We hope our rebuttal has adequately addressed your concerns. If there are any remaining questions or points requiring clarification, we would be happy to provide further details.
If our responses have resolved your concerns, we would be sincerely grateful if you could enhance the score to reflect this. We greatly appreciate your time and effort in supporting the review process.
Best regards,
The Authors
Thanks for the detailed responses. Some of my concerns have been addressed. I decide to maintain the current positive score.
We sincerely thank the reviewer for recognizing and appreciating our work. Your positive feedback is highly encouraging. We are grateful for your valuable time and constructive suggestions, which help us improve and refine our work.
This paper presents WMAdapter, a plug-in solution for diffusion-based watermarking. The authors propose training a contextual adapter and introduce a hybrid fine-tuning scheme to mitigate grid artifacts. Experimental results highlight WMAdapter's advantages, particularly in terms of visual quality.
Strengths
- WMAdapter is compatible with various diffusion models and watermarking decoders, enhancing its adaptability for real-world applications.
- The hybrid fine-tuning scheme is innovative, effectively reducing artifacts.
Weaknesses
The mechanism behind hybrid fine-tuning is not fully explained. After fine-tuning the adapter jointly with the VAE decoder, it is unclear why it achieves better quality with the original decoder (Adapter-I in Table 5), yet produces degraded reconstructions with the fine-tuned decoder (Adapter-V in Table 5). Further clarification is needed.
Although WMAdapter demonstrates superior visual quality, its robustness is weaker than that of StableSignature (Figure 8) and SSL (Table 2). Given the trade-off between distortion and robustness, a more balanced comparison could involve evaluating all methods at comparable visual quality levels to assess robustness fairly.
The bit length used by methods like RoSteALS is not specified. Do these methods also embed a consistent 48 bits? If so, are they trained on this bit length for a fair comparison?
Questions
Please refer to the Weaknesses section.
Q1: The mechanism behind hybrid fine-tuning is not fully explained
The core idea of hybrid fine-tuning is to preserve as much knowledge as possible that the pretrained VAE decoder has learned from large-scale datasets, such as reconstructing high-quality images, as the quality of the VAE decoder significantly impacts the final image generation quality.
We argue that fine-tuning the VAE decoder can easily disrupt its finely tuned pretrained state due to the following factors: (1) fine-tuning involves hyperparameters such as learning rate, batch size, and data distribution, all of which differ from those used in pre-training and may cause a large shift in the parameter state during fine-tuning; (2) fine-tuning on a small dataset risks overfitting to the specific data domain. These challenges often lead to a significant drop in generation quality, introducing additional artifacts. Despite using various image consistency regularizations, it remains difficult to match the reconstruction quality achieved during pretraining. Such phenomena have been observed in watermarking methods like StableSignature and FSW [1] that fine-tune VAE decoders. In our experiments, Adapter-V, which uses a fine-tuned VAE decoder, shows a drop in image quality similar to StableSignature, with comparable PSNR scores (29.9 vs. 29.7).
By substituting the fine-tuned VAE decoder with the original pretrained VAE decoder (as done in Adapter-I), we effectively mitigate the quality degradation caused by fine-tuning. Since the adapter adapts well to the original VAE decoder during the first stage of training and introduces only a minimal 48-bit watermark, it does not significantly compromise the generation quality in our design. This explains why Adapter-I achieves better quality than Adapter-V in our experiments, even though this outcome might seem counterintuitive at first.
Q2: The robustness is weaker than that of StableSignature and SSL, evaluating all methods at comparable visual quality levels to assess robustness fairly.
Thanks for the suggestion. We agree that all watermarking methods inherently involve a tradeoff between image quality and robustness. Evaluating all methods at comparable visual quality levels is indeed a fairer approach. However, in practice, most watermarking methods lack explicit hyperparameters to directly control the final trained model to achieve a predetermined, unified visual quality level (e.g., in terms of PSNR, FID, and visual artifacts). As a result, controlling variables for a direct comparison is challenging, which is why prior works typically compare the final model results as they are.
Another approach for a fair comparison is to treat quality and robustness as two dimensions and directly compare the quality-robustness tradeoff in a two-dimensional figure. We present the relevant results in Figure 1, where points closer to the top-left corner represent a better robustness-quality tradeoff. Our method achieves a superior tradeoff compared to both SSL and StableSignature:
- Compared to SSL, WMAdapter-I achieves an 83% improvement in FID and a 2% improvement in accuracy under combined attacks.
- Compared to StableSignature, WMAdapter-I achieves a 22% improvement in FID with only a 3% accuracy gap.
- Rather than simply moving along the quality-robustness tradeoff boundary, our method pushes up this boundary to achieve a better balance.
Additionally, it is worth highlighting the unique advantages of WMAdapter that neither StableSignature nor SSL can provide.
- Compared to StableSignature, WMAdapter offers scalability, eliminating the need to fine-tune the VAE decoder for each new watermark. As shown in the new experiments in Figure 5, WMAdapter also demonstrates superior robustness against regeneration attacks.
- Compared to SSL, WMAdapter introduces no visual artifacts beyond numerical quality metrics (Figure 12) and avoids the need for per-image optimization.
Q3: The bit length used by methods like RoSteALS is not specified.
- All methods embed a consistent 48-bit message. We train RoSteALS and WOUAF from scratch on this bit length for a fair comparison.
[1] Xiong, Cheng, et al. "Flexible and secure watermarking for latent diffusion model." Proceedings of the 31st ACM International Conference on Multimedia. 2023.
I thank the authors for the detailed reply. My concerns regarding Q2 and Q3 have been well addressed. While the explanation for Q1 primarily focuses on why fine-tuning may cause quality degradation, it does not fully answer why the proposed hybrid fine-tuning works.
Actually, during joint fine-tuning of the adapter and the VAE decoder, the adapter's latent space shifts away from that of the original VAE. Theoretically, this shift means the fine-tuned adapter might not match the original VAE decoder. While the experimental results suggest that this mismatch is minor for SD's VAE decoder, the robustness of hybrid fine-tuning across other VAE variants remains uncertain without a theoretical justification.
I encourage the authors to provide a deeper explanation of the mechanism underlying hybrid fine-tuning to address this concern. If the mechanism cannot be adequately explained, I would consider hybrid fine-tuning to be a specific trick tailored for the VAE in SD, rather than a broadly applicable technique for generative models.
We thank the reviewer for the insightful observation regarding the potential mismatch between the fine-tuned watermark adapter and the pretrained VAE decoder in the hybrid fine-tuning strategy. While it is reasonable to expect the latent space learned by the adapter during joint fine-tuning might diverge from that of the pretrained VAE decoder, our empirical results demonstrate that the pretrained VAE decoder is highly robust to such mismatches, resulting in superior image quality compared to using the jointly fine-tuned VAE decoder.
We hypothesize that this is due to the pretrained VAE decoder’s strong optimization for reconstructing high-quality images from its latent space. Its robustness to slight perturbations enables effective integration with the fine-tuned adapter. In contrast, during joint fine-tuning, the VAE decoder may overfit to patterns introduced by the adapter, leading to reduced generalization and degraded image quality.
The further improvement in image quality with hybrid fine-tuning likely arises from partial adaptability. Specifically, hybrid fine-tuning allows the VAE decoder to adjust slightly alongside the adapter, improving alignment in the latent space while preserving the pretrained VAE’s robust reconstruction capability. This nuanced balance between adaptability and generalization may also be the key to hybrid fine-tuning’s superior performance.
To validate this observation, we evaluated hybrid fine-tuning across various VAE variants from popular latent diffusion models. The results (see the table below; Adapter-B denotes no extra fine-tuning) show that hybrid fine-tuning consistently improves image quality, with the most significant gains observed on SD1.5 and SD2.1, and notable improvements on SDXL and DiT.
| PSNR | SD1.5 | SD 2.1 | SDXL | DiT |
|---|---|---|---|---|
| Adapter-B | 32.1 | 32.8 | 31.2 | 32.4 |
| Adapter-V (joint) | 29.3 | 29.9 | 30.5 | 28.9 |
| Adapter-I (hybrid) | 35.0 | 34.8 | 32.6 | 33.1 |
Exploring the theoretical foundations of this phenomenon is an exciting direction for future research. We are happy to discuss and welcome further suggestions from the reviewer on how to develop these theoretical insights.
Dear Reviewer bnPU,
Thank you for taking the time to review our submission and for providing thoughtful feedback. We hope our rebuttal has adequately addressed your concerns. If there are any remaining questions or points requiring clarification, we would be happy to provide further details.
If our responses have resolved your concerns, we would be sincerely grateful if you could enhance the score to reflect this. We greatly appreciate your time and effort in supporting the review process.
Best regards,
The Authors
Thank the authors for the additional feedback. While PSNR has been evaluated across different diffusion models, decoding accuracy has not been reported, leaving the robustness of the hybrid fine-tuning approach uncertain. Nonetheless, I believe hybrid fine-tuning is a valuable technique for current diffusion models, so I stand by my original positive score.
We sincerely thank the reviewer for recognizing and appreciating our work. Your positive feedback is highly encouraging. We are grateful for your valuable time and constructive suggestions, which help us improve and refine our work.
Dear Reviewers,
We sincerely thank all the reviewers for their valuable feedback and constructive suggestions, which have greatly helped us improve the quality of our work. Based on your insights, we have carefully revised the manuscript to address your comments and clarify key points. We look forward to further discussions and are happy to provide additional explanations or clarifications as needed.
Best Regards, Paper 4765 Authors
This paper proposes a watermarking method for latent diffusion models, denoted WMAdapter, which uses a contextual adapter to fuse watermark information with the VAE features in the diffusion model. The paper also proposes a hybrid fine-tuning strategy to further improve the visual quality of watermarked images: it first fine-tunes the adapter, then fine-tunes both the adapter and the VAE decoder, and uses the fine-tuned adapter with the original VAE decoder during final image generation. Experimental results show that WMAdapter can effectively reduce artifacts. The reviewers agree that: (1) the paper is well-written and easy to follow; (2) the proposed WMAdapter is compatible with various diffusion models and watermarking decoders; (3) the hybrid fine-tuning scheme can reduce artifacts to some extent. However, reviewers remained concerned about the trade-off between image quality and watermark robustness: the proposed method demonstrates worse robustness than Stable Signature, although it preserves better image quality, and the work does not adequately demonstrate the advantages of the proposed method. One reviewer raised this concern, which other reviewers with positive scores also acknowledged during the discussion. Meanwhile, the effect of modifying the internal features of the pretrained VAE is also not well explained. We have to reject this work due to these concerns and encourage resubmission of an improved version.
Additional Comments on Reviewer Discussion
All reviewers provided solid comments in the first round of reviewing; three reviewers provided feedback on the authors' responses and joined the discussion after the rebuttal. One reviewer was not convinced by the responses and detailed their concerns. Two reviewers acknowledged these concerns after the discussion.
Reject