Towards Lightweight Deep Watermarking Framework
Abstract
Reviews and Discussion
This paper addresses the limitations of current deep learning-based watermarking models used for copyright protection, which often have too many parameters, making them impractical for deployment, and lack robustness and invisibility. The authors identify a key issue: a mismatch between commonly used decoding losses and the actual decoding objectives, leading to redundant parameters. To tackle this, they propose two innovations: a Decoding-oriented surrogate loss (DO), which redesigns the loss function to minimize the impact of irrelevant optimization directions, and a Detachable projection head (PH), which adds a redundant module during training to handle irrelevant directions and is removed during inference. Additionally, the paper introduces a new watermarking framework with five submodules that allow for parameter reduction in each component. The proposed model achieves high efficiency, invisibility, and robustness, using only 2.2% of the parameters of current state-of-the-art frameworks, making it well-suited for resource-constrained environments.
Strengths
- The experiments are comprehensive, yielding satisfactory results, particularly under lightweight structural conditions (where the number of model parameters is significantly reduced), with performance surpassing the baselines.
- The motivation is fairly strong.
Weaknesses
- The contribution of the paper primarily relies on the proof and explanation of Equation (3). However, the explanations for the three types of loss lack intuition, or the intuition is not sufficiently evident, which undermines the technical rigor of the paper.
- The proposed framework does not introduce a clear distinction from existing methods, and its novelty is not apparent.
- The writing in the methods section (Section 3) is poor, making the purpose of adding the Detachable Projection Head unclear. The writing approach is problematic; you should first clarify the goal of this method (is it based on insights or analysis from Equation (3)?) and then explain how it achieves this goal. Currently, the explanation is difficult to follow.
- Furthermore, the discussion of the gap is limited to issues with MSE loss; the paper should also address cases with non-MSE loss functions.
Minor Points:
- Ensure that the full term accompanies the abbreviation the first time it appears. In Figure 1, the abbreviations are challenging to understand without prior context.
Questions
Questions and Suggestions for the Authors:
- **Clarification of Equation (3) and Loss Functions**: Could you provide a more intuitive explanation of the three loss functions associated with Equation (3)? While Equation (3) appears central to the paper's contribution, the intuition behind the chosen loss functions is not immediately clear. Offering a deeper rationale or illustrative examples could significantly strengthen the technical clarity and impact of the work.
- **Differentiation of Framework from Existing Methods**: How does the proposed framework fundamentally differ from other established methods? Currently, the distinctions between this framework and prior work are unclear, making it difficult to assess the novelty of your approach. A discussion contrasting your framework's unique aspects or providing a comparative analysis would be helpful.
- **Purpose and Rationale for the Detachable Projection Head**: Can you elaborate on the motivation for including the Detachable Projection Head in your framework? The methods section does not clarify its purpose or the anticipated effect on the model's performance. It would be beneficial to specify whether its addition is grounded in the analysis of Equation (3) or derived from other empirical insights. Furthermore, explaining how the Detachable Projection Head contributes to the objectives of the model would improve readability and comprehension.
- **Addressing Non-MSE Loss Functions**: Are there insights or considerations regarding loss functions beyond MSE? The current discussion appears to focus solely on MSE loss, yet it would be valuable to understand how the proposed framework performs or could adapt with non-MSE loss functions. Addressing this could broaden the applicability and relevance of your approach.
- **Terminology and Abbreviations in Figures**: In Figure 1 and throughout the text, could you ensure that each abbreviation is first accompanied by its full term? This will greatly enhance readability, especially for readers unfamiliar with the specific terminology.
Details of Ethics Concerns
N/A
Weakness 4 & Question 4: Addressing Non-MSE Loss Functions
Thank you for your insightful suggestion. In the manuscript, we discussed the most commonly used decoding losses in watermarking tasks: mean squared error (MSE) and binary cross-entropy (BCE) loss. Given that the analysis and decomposition of MSE and BCE losses share similar principles, and due to space constraints, we prioritized presenting the analysis for MSE loss in Section 2.2, "The Gap between Two Objectives." The analysis of BCE loss is provided in Appendix A to address this aspect comprehensively. We hope this clarifies your concern. Furthermore, we believe that our decomposition method is broadly applicable, and we are open to further discussion and to extending it to other loss functions in future work.
We hope our responses to your comments sufficiently address your concerns and further highlight the contributions of our work. Your thoughtful feedback has been instrumental in improving the quality and clarity of our manuscript. If you have any additional questions or suggestions, we remain open and ready to make further refinements.
Weakness 3 & Question 3: Purpose and Rationale for the Detachable Projection Head
Thank you for your valuable question and kind suggestion. We will address your question in two parts.
Purpose for Detachable Projection Head
The motivation for introducing the Detachable Projection Head stems from our decomposition and analysis in Section 2.2, "The Gap between Two Objectives" (explained in more detail in our response to Weakness 1 & Question 1). As discussed in our analysis of Equation (3), the first component is the part that aligns well with the decoding objective. However, directly using this component alone leads to instability during model training (as reported in Table 5). Therefore, to reduce the redundant optimization directions in the MSE loss while ensuring stable model training, we propose two solutions. The first, direct method is to modify the loss function: for the incorrectly decoded bits, we apply the full penalty, and for the correctly decoded bits, we only penalize those that are too close to the decision boundary. The second method still uses the MSE loss, but during the training phase we introduce an additional Detachable Projection Head to absorb the redundant optimization directions in the MSE loss. This ensures the stability of the model during training, and after training we can discard the Detachable Projection Head to obtain a lightweight model capable of correct decoding.
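To make the first solution concrete, here is a minimal PyTorch sketch of a decoding-oriented surrogate loss in the spirit described above. It is an illustration under our stated assumptions (message bits in {-1, +1}, decision boundary at 0, safe distance `eps`), not the paper's exact formulation:

```python
import torch

def do_surrogate_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Illustrative decoding-oriented loss, not the paper's exact equation.

    pred:   decoder outputs, shape (B, L)
    target: ground-truth message bits in {-1, +1}, shape (B, L)
    eps:    safe distance from the decision boundary 0
    """
    margin = pred * target              # > 0 iff the bit is decoded correctly
    wrong = margin <= 0                 # incorrectly decoded bits: full penalty
    near = (~wrong) & (margin < eps)    # correct bits too close to the boundary
    loss = ((pred - target) ** 2)[wrong].sum() + (eps - margin)[near].sum()
    return loss / pred.numel()
```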
Rationale for the Detachable Projection Head
From the forward direction of the Decoder during training, the projection head does not care about the magnitude of the backbone model's outputs but requires these outputs to be distinguishable. If the outputs of the backbone model are indistinguishable, the projection head cannot correctly project them, making it impossible to minimize the MSE loss. Therefore, from the backward optimization direction, when using MSE loss to optimize the overall model, the MSE loss forces the projection head's outputs to move closer to their label values (either -1 or 1). In turn, the projection head forces the backbone model's outputs to become more distinguishable, so that the lightweight backbone's output aligns well with the decoding goal. The result is that the output of the projection head is densely concentrated around -1 and 1, while the output of the backbone model, although more spread out, remains distinguishable. A detailed presentation of the distributions of the decoded messages from both the backbone model and the projection head is available in Appendix E.8.
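As a structural illustration (class and variable names below are ours, not the paper's), the training-time decoder composes the lightweight backbone with the detachable head, and inference keeps only the backbone:

```python
import torch
import torch.nn as nn

class TrainingDecoder(nn.Module):
    """Sketch: during training, the MSE loss is applied to head(backbone(x))."""
    def __init__(self, backbone: nn.Module, projection_head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.projection_head = projection_head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projection_head(self.backbone(x))

# At inference, the projection head is discarded; the backbone's outputs,
# though more spread out, are distinguishable and decoded by sign alone:
#   message = torch.sign(backbone(noised_watermarked_image))
```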
We hope this explanation addresses your concerns and clarifies the purpose and rationale for the Detachable Projection Head.
Weakness 2 & Question 2: Differentiation of Framework from Existing Methods
Thank you for your insightful question. Previous works, such as MBRS and CIN, often treat decoders as integrated design units, which tends to overlook finer-grained parameter allocation and model design. This holistic approach makes it difficult to further reduce model parameters and allocate limited parameters efficiently. In contrast, our contribution introduces a novel perspective on model design by decoupling the internal structure of the decoder into two distinct modules: the Noised Watermarked Image Preprocessing (NWIP) module and the Message Extraction (ME) module. This structural separation provides us with greater flexibility to reduce parameters for each module individually, without the need for them to share the same number of channels. As a result, this design enables a more focused and efficient decoder architecture. This distinction is crucial for lightweight model design, as it allows for more effective parameter reduction and more targeted model size reduction.
Additionally, we would like to emphasize that the most important innovation in our work is the identification of a key reason that limits the performance of existing watermarking frameworks: a mismatch between commonly used decoding losses (e.g., mean squared error and binary cross-entropy loss) and the actual decoding goal, which leads to parameter redundancy. We also propose two innovative solutions: DO and PH. Comprehensive and thorough experiments (Tables 2, 3, 4, 7, 8, and 20) have demonstrated the effectiveness of our two methods in improving the performance of lightweight models.
We also believe that these training methods (DO and PH) are highly applicable to various lightweight model architectures, and we are confident that they will be increasingly adopted to help other researchers achieve better performance in lightweight watermarking models in the future.
Dear Reviewer CaL5, thank you for your thoughtful and constructive feedback on our submission 6590, as well as for your recognition of the motivation behind our work. We truly appreciate the time you have taken to review our manuscript and provide such detailed insights. Below, we will try our best to address each of the concerns and questions you raised.
Weakness 5 & Question 5: Terminology and Abbreviations in Figures
We sincerely appreciate your valuable suggestions, and we apologize for any inconvenience this may have caused. In the latest version of our manuscript, we have ensured that all abbreviations are spelled out the first time they appear in both Figure 1 and the text. We hope this revision can address your concern.
Weakness 1 & Question 1: Clarification of Equation (3) and Loss Functions
Thank you for your important question. In the original manuscript, we tried to explain the three components in Equation (3) from a functional perspective. This time, we will provide a concrete example to better illustrate the point.
Assume that the watermark has a length of 4 bits. After the encoder embeds the watermark and the noise layer distorts the watermarked image, we obtain the noised watermarked image. After passing through the decoder, we get the extracted watermark $\hat{m}$, which we assume to be $\hat{m} = [0.3, -0.4, -0.6, 0.2]$, with ground-truth message $m = [-1, -1, 1, 1]$.
For the MSE loss, clearly, none of the four values perfectly reaches 1 or -1, so each of them incurs a loss and penalizes the model. However, in the decoding objective, the magnitude of the decoded bits is not of concern. We only care that the decoded values are distinguishable (bigger or smaller than the boundary value 0), so we should only penalize the bits in $\hat{m}$ that are decoded incorrectly (i.e., the first and third bits in this case). This inconsistency in the penalization range leads the model to assign some parameters to optimize values that are unrelated to decoding accuracy (in this case, the second and fourth values in $\hat{m}$), which results in redundant model parameters.
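The mismatch can be checked numerically on this example; the following snippet uses the values above (our own illustration, not code from the paper):

```python
import numpy as np

m_hat = np.array([0.3, -0.4, -0.6, 0.2])   # extracted watermark from the example
m     = np.array([-1., -1.,  1.,  1.])     # ground-truth message bits

wrong = np.sign(m_hat) != np.sign(m)       # decoding errors: only bits 1 and 3
mse_per_bit = (m_hat - m) ** 2             # yet MSE penalizes all four bits

print(wrong)        # [ True False  True False]
print(mse_per_bit)  # [1.69 0.36 2.56 0.64]
```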
To further investigate which part of the MSE loss causes this inconsistency, we break down the MSE loss in Equation (3) (detailed proof in Appendix A). We found that the part that aligns well with the decoding objective (i.e., the part that only penalizes the incorrectly decoded bits in $\hat{m}$) is the first component. On the other hand, the second and third components affect the correctly decoded parts of $\hat{m}$, which is unnecessary for the decoding objective. However, as we discussed in Section 2.2, "The Gap between Two Objectives," even though the second and third components are unrelated to the decoding objective, they are important for the model's training stability. Directly using the first component alone as the training loss would lead to training instability. This analysis is also validated in Section 4.4, "Ablation Study: Impact of MSE Components on Model Performance."
To further clarify the calculation process of the decomposed loss, we will demonstrate it using the example. For the extracted watermark $\hat{m} = [0.3, -0.4, -0.6, 0.2]$, for convenience, we denote correctly decoded bits as R (Right) and incorrectly decoded bits as W (Wrong). Hence, the decoding result is [W, R, W, R]. Additionally, based on the signs of $\hat{m}$, we can record its sign information as [+, -, -, +]. Combining the decoding correctness and the sign information, we obtain the final sequence [W+, R-, W-, R+].
According to the breakdown in Equation (3):
- **First component**: All the incorrectly decoded bits (whether + or -) are fully assigned to this component. In this case, the first (0.3) and third (-0.6) bits are assigned to it. Its value is always greater than or equal to 0.
- **Second component**: The correctly decoded bits (whether + or -) are assigned to this component. In this case, the second (-0.4) and fourth (0.2) bits are assigned to it. Its value is always less than or equal to 0.
- **Third component**: The correctly decoded bits are also assigned to this component; in this case, the second (-0.4) and fourth (0.2) bits are assigned to it as well.
We hope that our further explanation, along with the example and calculations, clarifies the concerns and addresses your questions.
The paper provides a proper exploration of parameter reduction for watermarking models. It addresses the mismatch between the actual decoding objective and the optimization objective of commonly used decoding losses. The solution is approached by adding projection blocks and a surrogate loss, and the impact of each block on robustness is discussed at fine granularity after subdividing the watermarking framework.
优点
For the first time, the paper identifies the mismatch between the optimization objectives of commonly used decoding losses (e.g., mean squared error and binary cross-entropy loss) and the actual decoding objectives, and confirms the existence of such a mismatch and its impact through ablation studies, which provides a new perspective for model optimization. The proposed detachable projection head (PH) and decoding-oriented surrogate loss (DO) effectively mitigate the negative impact of irrelevant optimization directions, allowing the lightweight model to achieve SOTA performance. The lightweight model outperforms existing models in terms of invisibility, robustness, and efficiency for domains with limited computational resources, and minimizes performance loss by further compressing the model with a fine-grained deep watermarking framework.
缺点
- The authors do not seem to have considered the issue of capacity; we suggest discussing this aspect.
- There are some spelling problems, e.g., "robust- ness" in Table 9 at line 1005.
- We know that robustness depends mainly on the type and intensity of noise added during the training phase, i.e., the NWIP module, but the paper does not give the detailed experimental parameters for this; it only describes the noise parameter settings in the testing phase.
问题
- In terms of comparison experiments: the main concerns in watermarking are capacity, imperceptibility, robustness, and efficiency. This paper mainly focuses on improving efficiency, but the other parameters should be kept the same for a fair comparison. HiDDeN's algorithm can embed a 52-bit message in a grayscale image of size 16×16, giving an embedding capacity of up to 0.203 BPP. The images used in this paper are 3×128×128 color images with an embedded message length of 64 bits, so the capacity is much lower than HiDDeN's; such a comparison could be considered unfair, please explain why. Besides, the embedding capacities of the several other compared algorithms are not consistent; how are they compared?
- Discriminative networks have been used in image watermarking frameworks since HiDDeN and have shown advantages in enhancing the invisibility of watermarking frameworks. The authors of this paper did not consider this module in the newly proposed framework. Please explain the reason: is it because high invisibility can be achieved without discriminative networks, was it omitted to reduce the parameter count, or is there another reason?
- Regarding the decomposition of the MSE loss function in the paper, we see the role of the different loss components. Is this decomposition first proposed in this paper, or is there already a readily available scheme in the field of knowledge distillation that performs a similar decomposition of the loss function? This will affect the reassessment of the paper's innovativeness.
Thank you for your thorough review and the constructive comments on our submission 6590. We highly value your feedback. Below, we address your comments to clarify and improve our manuscript:
Weakness 2: Spelling Problems
Thank you for carefully pointing out this issue. We apologize for any inconvenience caused during your initial reading. We have carefully revised this issue in the latest version.
Weakness 3: Experiment Settings in Noise Layer
Thank you for your valuable feedback. We provided a more accurate and detailed description of the training phase Noise Layer settings in Appendix E. For your convenience, here is a summary:
The combined noise during the training phase includes seven types of distortions:
- Gaussian Blur (GB): with a standard deviation of 2.0 and a kernel size of 7.
- Median Blur (MB): with a kernel size of 7.
- Gaussian Noise (GN): with a variance of 0.05 and a mean of 0.
- Salt & Pepper Noise (S&P): with a noise ratio of 0.1.
- Dropout (DP): with a drop ratio of 0.6.
- JPEG Compression (JPEG): with a quality factor of 50.
- JPEGSS: with a quality factor of 50 (simulated differentiable JPEG distortion).
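For reference, these settings can be summarized as a configuration mapping; the structure below is a hypothetical illustration of ours, not the project's actual configuration file:

```python
# Training-phase combined Noise Layer settings (values from the list above)
TRAIN_NOISE_SETTINGS = {
    "gaussian_blur":   {"std": 2.0, "kernel_size": 7},
    "median_blur":     {"kernel_size": 7},
    "gaussian_noise":  {"mean": 0.0, "variance": 0.05},
    "salt_and_pepper": {"noise_ratio": 0.1},
    "dropout":         {"drop_ratio": 0.6},
    "jpeg":            {"quality_factor": 50},
    "jpegss":          {"quality_factor": 50},  # simulated differentiable JPEG
}
```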
We hope this clarification addresses your concerns.
Question 2: Why not use Discriminative networks?
Thank you for your important question. For watermarking tasks, discriminative networks are an additional module. In our work, to clearly demonstrate and validate the effectiveness of our proposed DO and PH methods, we chose to eliminate other factors that could potentially impact the model's visual quality and robustness. Therefore, we did not use discriminative networks in our work. Moreover, the impact of the discriminative network on our method is indeed minimal. Below, we present experiments where we used the discriminative network from the MBRS work:
The Effect of Discriminator (DIS) on the Proposed Lightweight Model
| Method | PSNR (dB) ↑ | Dropout (%) | JPEG (%) | GN (%) | S&P (%) | GB (%) | MB (%) | Ave (%) |
|---|---|---|---|---|---|---|---|---|
| DO w/o DIS | 41.70 | 100 | 99.12 | 97.40 | 100 | 100 | 99.64 | 99.36 |
| DO w DIS | 41.21 | 100 | 98.57 | 97.79 | 99.99 | 100 | 99.70 | 99.34 |
| PH w/o DIS | 41.67 | 99.99 | 98.92 | 97.21 | 99.99 | 99.96 | 99.59 | 99.28 |
| PH w DIS | 41.07 | 100 | 98.73 | 97.64 | 100 | 100 | 99.49 | 99.31 |
Benchmark Comparisons on Visual Quality Based on Five Different Metrics with and without Discriminator (DIS)
| Method | PSNR (dB) ↑ | SSIM ↑ | LPIPS ↓ | l₂ ↓ | l∞ ↓ |
|---|---|---|---|---|---|
| DO w/o DIS | 41.70 | 0.97 | 0.001 | 3.46 | 0.06 |
| DO w DIS | 41.21 | 0.97 | 0.001 | 3.74 | 0.08 |
| PH w/o DIS | 41.67 | 0.97 | 0.002 | 3.36 | 0.07 |
| PH w DIS | 41.07 | 0.97 | 0.002 | 3.84 | 0.09 |
As shown in the table, the visual quality and robustness of our methods with or without the discriminative network are almost identical.
Additionally, as you mentioned, reducing the number of parameters is also one of our considerations. To clarify this, we tested the parameter size and computational complexity of the discriminative network:
Comparison of Parameter Size and FLOPs Between the Lightweight Model and the Discriminator
| Method | Size | FLOPs |
|---|---|---|
| Discriminator | 113.15K | 1.86G |
| Lightweight Model | 16.59K | 0.22G |
As shown in the table, the parameter size and computational complexity of a single discriminative network are already 6.8 times larger and 8.5 times more computationally expensive than the entire lightweight model. For the sake of maintaining the lightweight nature of our model, we chose not to use the discriminative network. This discussion has been included in Appendix E.7 of our revised manuscript.
Question 3: Innovation of the Decomposition Method in the Paper
Thank you for your important question. To the best of our knowledge, our work is the first to identify the mismatch between existing decoding losses and decoding objectives. It is also the first to propose this decomposition method and leverage this decomposition to achieve lightweight watermarking.
We hope these revisions to our submission adequately address your concerns and enhance its overall contributions. Your thoughtful feedback has been crucial in elevating the quality and coherence of our manuscript. Should you have additional comments or concerns, we remain receptive and prepared to undertake any further modifications.
Weakness 1 & Question 1: Regarding Capacity
Thank you for your insightful question. Following your suggestion, we revisited the settings in the original HiDDeN paper. HiDDeN divides the tests for Capacity (Section 4.1) and Robustness (Section 4.2) into two distinct parts:
- **Capacity Testing**: According to the original description, the experimental setup is: "We train our model to encode binary messages of length L = 52 in grayscale images of size 16×16, giving our trained model a capacity of 52/(16×16) ≈ 0.203 BPP." Although the BPP is 0.203, the training and testing were conducted in a noiseless (no Noise Layer) environment.
- **Robustness Testing**: For robustness, the setup in the original paper states: "We train our model on YUV color images of size C × H × W = 3×128×128 with message length L = 30," which implies a BPP of 0.00061, far smaller than the BPP when tested under noiseless conditions.
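For concreteness, both BPP figures (and those in the capacity table below) follow the same convention of message length divided by the total number of image values:

$$\mathrm{BPP} = \frac{L}{C \times H \times W}, \qquad \frac{52}{1 \times 16 \times 16} \approx 0.203, \qquad \frac{30}{3 \times 128 \times 128} \approx 0.00061.$$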
In real-world applications, noiseless conditions are almost impossible, which is why our primary focus in testing is on robustness under combined noise conditions.
In our robustness experiments, to ensure fairness in the comparisons, we used the same experimental settings for all tested models (message length of 64 bits, cover image size of 128×128 with three channels) and the same Noise Layer. For the other models (HiDDeN, MBRS, FIN, CIN), we used the official public code, which natively supports 64-bit watermark embedding in 3×128×128 images without any modifications. We believe this ensures a fair comparison.
Furthermore, your suggestion to consider the issue of capacity is very valuable. We followed HiDDeN’s assumptions for capacity testing and also tested our model’s accuracy under various BPPs in a noiseless environment.
| Method | Message Length | Bits Per Pixel | PSNR (dB) | Accuracy (%) |
|---|---|---|---|---|
| PH | 64 | 0.0013 | 43.15 | 100 |
| PH | 256 | 0.0052 | 42.14 | 100 |
| PH | 1024 | 0.0208 | 41.11 | 100 |
| PH | 4096 | 0.0833 | 40.41 | 100 |
| PH | 16384 | 0.3333 | 39.12 | 100 |
| DO | 64 | 0.0013 | 42.91 | 100 |
| DO | 256 | 0.0052 | 42.14 | 100 |
| DO | 1024 | 0.0208 | 41.03 | 100 |
| DO | 4096 | 0.0833 | 40.12 | 100 |
| DO | 16384 | 0.3333 | 39.01 | 100 |
This demonstrates that our model has good capacity under noiseless conditions. Thank you for your constructive suggestion, and this discussion has been added to our Appendix E.3.
This article discusses the need for efficient deep learning-based watermarking models for copyright protection. It identifies a mismatch between common decoding losses and actual goals as a key issue, leading to parameter redundancy. The authors propose two solutions: a redesigned loss function and a detachable module for training. They introduce a new framework with submodules for parameter reduction, achieving better efficiency, invisibility, and robustness with significantly fewer parameters, making it suitable for resource-constrained applications. The proposed methods are designed for easy integration into future lightweight models.
优点
- The paper is well-written and well-organized.
- The idea is interesting to the digital watermarking community. The method is simple and easy to deploy.
- The evaluation is comprehensive, thoroughly considering robustness against multiple distortions.
缺点
- The proposed end-to-end framework doesn't seem to have many modifications compared to existing frameworks. Why does this framework reduce parameters?
- I understand that due to space limitations, the authors have placed a lot of content in the Appendix. However, some important content should be in the main text, such as some visual results and the implementation of the projection blocks.
- Although the discussion in the evaluation section is thorough, I think it is necessary to summarize the advantages and disadvantages of the two proposed methods, PH and DO, in the conclusion.
问题
- The Decoding-Oriented Surrogate Loss is derived from the MSE loss. Why not use the BCE loss derived in Eq. (21)?
- What are A and B in Eq. (4)? Are they learnable? I did not find details about A and B in Appendix C.
- Why does the PSNR reach as high as 72.64 and 67.75 in Table 6 and Table 8?
- The paper claims that NWIP mitigates the impact of distortions on the watermarked image. However, NWIP and the ME module are deeply coupled and trained end-to-end. How can you determine the function of each module when they function together?
Question 1: Why not use the BCE loss derived in Eq. (21) for DO?
Thank you for your insightful question. As described in Section 2.2, the analysis and decomposition of BCE loss and MSE loss share commonalities. The idea behind the DO method is widely applicable, and both MSE loss and BCE loss can be transformed into corresponding DO losses. We provide a more detailed explanation of the BCE-based DO method in Appendix E.1. For convenience, we present the comparison results below:
Comparison of Performance between DO (MSE-based) and DO (BCE-based) Methods
| Method | PSNR (dB) ↑ | Dropout (%) | JPEG (%) | GN (%) | S&P (%) | GB (%) | MB (%) | Ave (%) |
|---|---|---|---|---|---|---|---|---|
| DO (MSE) | 41.70 | 100 | 99.12 | 97.40 | 100 | 100 | 99.64 | 99.36 |
| DO (BCE) | 41.10 | 100 | 98.11 | 97.96 | 100 | 99.94 | 99.74 | 99.29 |
As shown in the table, the performance of DO (MSE-based) and DO (BCE-based) is quite similar.
Question 3: Why does the PSNR reach as high as 72.64 and 67.75?
Thank you for your valuable question. The two PSNR values you mentioned correspond to the noise layers of Dropout and Salt & Pepper distortions. In Dropout, the selected pixels in the watermarked images are replaced with the corresponding pixels from the cover image. In Salt & Pepper noise, selected pixels are randomly replaced with 0 or 1. Both of these attacks are sparse, pixel-wise perturbations. Because convolutional neural networks have local receptive fields and spatial invariance, we believe these pixel-wise attacks have a limited impact on overall feature extraction. More specifically, for our DO and PH methods, the decoding process does not require the precise magnitude of the decoder's outputs. We only need the decoded values to be distinguishable (bigger or smaller than the boundary value 0). This means that less information is necessary for decoding, enabling the encoder to embed less information while preserving good decoding performance. Consequently, the watermarked image achieves higher visual quality, and the decoder maintains satisfactory performance under these challenging conditions.
Question 4: Functionality Separation in Decoder
Thank you for your insightful question. The purpose of the separation within the decoder is to allocate parameters more efficiently, thus further reducing the model size. The separation of the two modules is based on a specific criterion: layers with a stride of 1, which do not reduce the dimensionality of the input, are designated as data processing layers, while layers with a stride of 2 or greater, which reduce the input dimension, are designated as data extraction layers. Previous methods, such as MBRS and CIN, often mix or alternate these two types of layers. In our framework, however, these two types of layers are grouped into two distinct modules: NWIP and ME. This division is primarily a structural separation. As you pointed out, since NWIP and ME are trained simultaneously and share the same objective function, a complete functional distinction between them is not feasible. However, the advantage of this structural separation is that the two modules do not need to share the same number of channels. This allows us to independently reduce the parameters of each part, enabling a more focused study and design of the decoder.
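A minimal structural sketch of this split is below; the channel widths are our assumptions for illustration (the 12 intermediate channels follow the NWIP description later in this thread):

```python
import torch.nn as nn

# NWIP: stride-1 "processing" layers that keep the spatial size.
nwip = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=3, stride=1, padding=1), nn.LeakyReLU(),
    nn.Conv2d(12, 12, kernel_size=3, stride=1, padding=1), nn.LeakyReLU(),
)

# ME: stride-2 "extraction" layers that reduce dimensionality. The two modules
# are free to use different channel widths, which is the point of the split.
me = nn.Sequential(
    nn.Conv2d(12, 16, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(),
    nn.Conv2d(16, 1, kernel_size=3, stride=2, padding=1),
)
```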
We hope our response addresses your question and concerns. Your insightful and kind comments and suggestions are valuable to us in enhancing the quality and coherence of our work. Should you have additional comments or concerns, we remain receptive and ready to make any necessary modifications.
Thank you for the authors' responses. Most of my concerns have been addressed. I still have questions about the parameter reduction aspect of your proposed five-module framework. In your rebuttal, you state that "by decoupling, we can independently design each module and allocate parameters more effectively, which further reduces the model's parameters." However, I cannot locate this analysis in the paper. Could you elaborate on how you specifically designed each module independently to achieve parameter reduction?
Thank you for your continued engagement and valuable feedback. We are delighted that most of your concerns have been addressed. We sincerely appreciate this opportunity to elaborate on the parameter reduction aspect of our framework.
How Can We Reduce the Number of Parameters?
In this paper, our primary focus was not on targeting specific frameworks for parameter reduction. Instead, we analyzed the structural requirements of the watermarking system and aimed to design watermarking networks with the minimal number of parameters needed to meet users' specific robustness requirements. Based on our analysis, we identified five essential components:
- Image Preprocessing (IP) module
- Message Preprocessing (MP) module
- Feature Fusion (FF) module
- Noised Watermarked Image Preprocessing (NWIP) module
- Message Extraction (ME) module
Components 1, 2, 3, and 5 are always necessary for the watermarking task. However, the NWIP module (which is related to noise processing) is not required in scenarios where robustness is not a concern, such as in noiseless environments.
The core framework consists of modules 1, 2, 3, and 5, which together form the basic building blocks of the watermarking model. The number of parameters in this minimal configuration can be adjusted based on the specific requirements of the task. Therefore, our model design follows a small-to-large approach: once the model meets the necessary requirements, the expansion stops. For instance, in more challenging environments, such as those with combined noise, we can expand the model to ensure better robustness. However, this expansion is done incrementally and selectively, only expanding the necessary modules. For example:
- In the encoder, the MP module (used for matching shapes) does not need to be expanded. Instead, we extend the IP and FF modules to enhance image feature understanding and fusion capabilities.
- In the decoder, the ME module (also used for shape matching) remains unchanged, but we increase the depth of the NWIP module to improve feature extraction and handle more complex distortions.
By following this incremental expansion approach, we can design an architecture that adapts to the specific demands of each use case while maintaining a minimal number of parameters. The modular and decoupled nature of our framework not only facilitates independent parameter allocation but also allows users to efficiently validate each module's design within a small model. Starting small and scaling up only as needed, our approach offers both flexibility and efficiency.
Thank you for the authors' detailed responses. I have updated my rating to 8.
We sincerely thank you for actively engaging with our rebuttal and thoughtfully considering our responses. We are delighted to hear that our rebuttal has successfully addressed your concerns.
We sincerely appreciate your detailed review and valuable feedback on our submission. Thank you for dedicating time to assess our work and provide constructive suggestions. Below, we respond to your comments with the aim of clarifying and enhancing our manuscript.
Weakness 3: Summarize the advantages and disadvantages of the two proposed methods in the conclusion.
Thank you for your valuable suggestion. Initially, we included the limitations of the DO and PH methods in the Appendix's Limitation section. Following your advice, we will move this part to the conclusion in the latest version to further enhance the completeness of the paper.
Weakness 2 & Question 2: More visual results and the implementation details in main text.
Thank you for your constructive comment! Following your suggestion, we have moved the visual comparison of watermarked images into the main text. Additionally, we have provided a new structural diagram of the projection block and included it in the main text, hoping that it will address your questions about the projection block and better meet your expectations.
To facilitate your reference, here is a summary of the projection block and Eq. (4). Each projection block contains two learnable deep learning modules, A and B, which are both involved in optimization. A and B have identical structures and inputs, but their parameters are independent and not shared. The computation process of A and B is as shown in Eq. (4). The input and output tensors of A (and B) have the same shape, and the internal structure is as follows: four transposed convolution layers are used to upsample the input with shape 1×8×8 to 32×128×128, followed by four convolution layers to downsample it back to 1×8×8. The activation function used between each pair of layers is LeakyReLU. We hope that this explanation resolves your concerns.
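Based on this description, a sketch of one such module is below; the intermediate channel schedule (other than the 32-channel peak) is our assumption:

```python
import torch.nn as nn

class ProjectionSubmodule(nn.Module):
    """Sketch of module A (B is identical in structure; parameters not shared):
    upsample 1x8x8 -> 32x128x128 with four transposed convolutions, then
    downsample back to 1x8x8 with four convolutions, LeakyReLU in between."""
    def __init__(self):
        super().__init__()
        widths = [1, 4, 8, 16, 32]  # assumed channel schedule peaking at 32
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):        # 8 -> 128 spatial
            layers += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.LeakyReLU()]
        for c_in, c_out in zip(widths[:0:-1], widths[-2::-1]):  # 128 -> 8 spatial
            layers += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                       nn.LeakyReLU()]
        self.net = nn.Sequential(*layers[:-1])  # drop the final activation

    def forward(self, x):      # x: (B, 1, 8, 8)
        return self.net(x)     # -> (B, 1, 8, 8)
```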
Furthermore, we commit to releasing the code for this project, which will enable other researchers to facilitate their future work.
Weaknesses 1: Why does this framework reduce parameters?
Thank you for your insightful question. We would like to address your concern from two perspectives:
-
Model Design Perspective: Overall, our Lightweight Model still follows the Encoder-Noise Layer-Decoder (END) structure. However, from the model design perspective, we are the first to further decouple the internal structure of the Decoder and divide it into the NWIP module and the ME module. In previous decoder architectures, these two modules were often deeply coupled, requiring uniform dimensions for the intermediate features. By decoupling, we can independently design each module and allocate parameters more effectively, which further reduces the model’s parameters.
-
Redundant Parameters Perspective: As we analyzed in Section 2.2 "The Gap between Two Objectives" regarding MSE loss and BCE loss, previous models had a large number of redundant parameters used to maintain training stability, which did not directly reduce the decoding error. To reduce redundant parameters without affecting the model’s performance, we proposed two methods: DO and PH. In summary, the DO method directly reduces parameters at the optimization level, allowing for a lightweight model from the design stage. The PH method reduces parameters indirectly by using a projection head during training for normalization, which is discarded in the inference phase, thus reducing parameters at the inference stage.
The paper analyzes the limitation of existing deep watermarking approaches in the gap between the losses used in the training and decoding phases. The authors propose a Decoding-Oriented Surrogate Loss and a detachable projection head, which maintain watermarking performance while requiring much lower computational cost.
优点
- The motivation of the paper is grounded in analysis and the experimental results sufficiently validate the claim.
- They also considered the robustness against diffusion-based attacks.
- The paper is well-written.
缺点
- The citation format should be fixed. At least the items should be wrapped in a bracket.
问题
- I'm wondering about the robustness against clipping.
- What is the motivation behind the light-weight deep watermarking architecture design?
伦理问题详情
N/A
Question 2: What is the motivation behind the lightweight deep watermarking architecture design?
Thank you for your interest in the lightweight deep learning framework. The motivation behind the lightweight deep watermarking architecture stems from the practical demands of deploying models in resource-constrained environments. The motivations can be summarized as follows:
1. **Practical Need for Lightweight Models in Real-World Applications.** Digital watermarking plays a crucial role in protecting intellectual property across various domains, such as images, videos, and 3D content. However, high-performance models are often impractical for deployment due to their:
- Large parameter sizes, which increase storage requirements.
- High computational demands, which exceed the capacity of many real-world systems.
This challenge is particularly acute in scenarios like video streaming and online education, where devices must operate efficiently within strict computational and energy constraints. High-performance lightweight models are essential to enable such deployment while remaining effective for copyright protection.
2. **Importance of Lightweight Design in SoC (System-on-Chip) Architectures.** In SoC-based environments, lightweight design is not just beneficial but imperative:
- Storage Constraints: The storage capacity in embedded devices is limited, making parameter-efficient models essential.
- Computational Efficiency: SoCs prioritize energy efficiency and lack significant computational power. Lightweight models reduce inference time and energy consumption, enabling practical deployment on edge devices.
By focusing on minimizing computational complexity (e.g., FLOPs) and reducing storage requirements, our architecture is suitable for environments where resources are scarce, but performance must remain competitive.
We hope these revisions and additional analyses can address your concerns and strengthen the contribution of our work. We appreciate the constructive guidance you have provided, which has substantially improved the manuscript. If you have additional comments or concerns, we welcome your input and are ready to make any necessary adjustments.
Thank you for your thorough review and insightful comments on our submission. We appreciate the time you invested in evaluating our work and offering constructive feedback. Below, we address your comments to clarify and improve our manuscript.
Weakness 1: The citation format should be fixed.
Thank you for your valuable suggestions, and we really apologize for any inconvenience caused by the citation format. We have carefully reviewed this part and corrected the issue in the revised version.
Question 1: Robustness Against Clipping
Taking your valuable advice, we expanded our experiments to include comparisons under clipping distortions. Specifically, we considered two types of clipping:
- **Pixel-Based Clamp Distortion**: The original pixel intensity range of the images is [0, 1]. In clamp distortion, we progressively shrink the pixel range to [c, 1-c] by adjusting the clamping strength c.
- **Geometry-Based Crop Distortion**: We adopted the publicly available implementation of crop distortion from MBRS for our tests, where p controls the area of the retained region.
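Under the symmetric [c, 1-c] reading above (our reconstruction of the garbled range), the clamp distortion is a one-liner:

```python
import torch

def clamp_distortion(img: torch.Tensor, c: float) -> torch.Tensor:
    """Shrink pixel intensities from [0, 1] to [c, 1 - c]; c is the clamping strength."""
    return img.clamp(min=c, max=1.0 - c)
```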
Clamp Distortion:
| Method | PSNR | c=0.1 | c=0.2 | c=0.3 |
|---|---|---|---|---|
| MBRS | 54.62 | 94.86 | 87.87 | 75.37 |
| FIN | 54.82 | 99.75 | 98.17 | 91.41 |
| DO | 55.86 | 99.79 | 98.29 | 91.90 |
| PH | 55.37 | 99.88 | 98.49 | 90.31 |
Crop Distortion:
| Method | PSNR | p=0.75 | p=0.5 | p=0.25 |
|---|---|---|---|---|
| MBRS | 54.62 | 99.90 | 99.12 | 84.86 |
| FIN | 54.82 | 100 | 99.78 | 92.49 |
| DO | 55.86 | 100 | 100 | 94.24 |
| PH | 55.37 | 100 | 100 | 94.13 |
From the results, it can be observed that our proposed DO method achieves the best visual quality (PSNR) and robustness (decoding accuracy) under both clamp and crop distortions. For the PH method, its decoding accuracy is slightly lower than FIN's only under the strongest clamp distortion (c = 0.3). These results demonstrate the wide applicability and effectiveness of our DO and PH methods.
This paper examines the loss function in deep watermarking models and introduces two innovative methods. The first method, the Detachable Projection Head, enhances decoding accuracy by aligning outputs with the intended classification through projection close to the decision boundary. The second method, Decoding-Oriented Surrogate Loss, ensures a stable training process and emphasizes decoding accuracy by maintaining a safe distance for the deflation and inflation parts of the loss function.
优点
Discusses a very noteworthy issue: watermarking of several types of data.
缺点
Currently, the safe distance ($\epsilon$) in the DO loss is manually set. Could this parameter be improved with automated tuning or adaptive methods to enhance performance and ease of use?
Geometric distortions are common in real-world scenarios, yet PH and DO do not achieve the best performance under these conditions. Can the methods be improved to better handle such distortions?
While PH is not used during inference, it increases the complexity of the training process. How large is the PH in terms of parameter count and computational demand? Is substantial computational power required for its training, and could its size be optimized?
问题
Refer to Weakness.
Weakness 3: Size and Computational Complexity of PH? & How to Further Reduce PH?
Thank you for your constructive questions. The PH used in this paper has 165.89K parameters, with 0.56G FLOPs. Even with the addition of PH during the training phase, the overall model size (182.48K) and computational complexity (0.78G) remain significantly smaller than those of other models. Therefore, our PH method still offers advantages in efficiency.
The Projection Head used in our work consists of 4 identical projection blocks, each with 32 intermediate channels. To explore further reduction of PH, we considered two approaches:
- Reducing the number of projection blocks while keeping the intermediate channel number fixed at 32 to reduce the number of parameters. This experiment is presented in Table 15 from Appendix E.8.
- Reducing the number of intermediate channels while keeping the number of projection blocks fixed at 4. We conducted additional experiments for this approach, and for your convenience, we present the Average Accuracy below. The detailed accuracy for each distortion is shown in Appendix E.8.
| Channel Numbers | PSNR (dB) | Ave (%) |
|---|---|---|
| 4 | 40.44 | 98.17 |
| 8 | 40.56 | 98.30 |
| 16 | 41.43 | 99.09 |
| 32 | 41.67 | 99.28 |
From these two experiments, we can conclude that there is a trade-off between the size of PH and the performance of the Lightweight Model. We have conducted detailed experiments and provided these results to help other users choose the appropriate PH size based on their computational resources and practical requirements. Even with the full PH, our method still achieves significantly lower model size and computational complexity than other models.
We hope that the revisions made to our submission effectively address your concerns and strengthen its overall contributions. Your insightful feedback has played a significant role in improving the quality and clarity of our manuscript. If you have any further comments or concerns, we remain open to additional suggestions and are ready to make further improvements.
Weakness 2: Robustness under Geometric Distortions
Thank you for your insightful and practical suggestions. As we mentioned in Appendix E.11 and the LIMITATIONS section, geometric distortions are relatively complex, and the originally proposed Lightweight Model, with its limited number of parameters, does not achieve optimal results on them (it performs just below MBRS). Following your recommendation, we tested an extended Lightweight Model (called Lightweight Model +) to improve its robustness against geometric distortions.
Specifically, since the noised watermarked image preprocessing (NWIP) module is the first layer interacting directly with the noise, we enhanced this module by adding SE blocks and increasing the number of channels in the intermediate layers from 12 to 32. This modification strengthens the model's feature extraction ability for noised watermarked images. After the expansion, the total number of parameters in the model is 56.01K, which remains significantly smaller than other watermarking models: 12.33% of HiDDeN, 7.49% of FIN, 0.27% of MBRS, and 0.16% of CIN. This constructive discussion has been included in Appendix E.11 of our revised manuscript.
The performance of this enhanced model is summarized in the table below:
| Method | PSNR (dB) | RandomPerspective (%) | RandomAffine (%) | RandomElasticTransform (%) | Avg (%) |
|---|---|---|---|---|---|
| HiDDeN | 37.08 | 66.65 | 66.30 | 69.63 | 67.53 |
| MBRS | 48.09 | 98.05 | 98.44 | 99.71 | 98.73 |
| FIN | 42.05 | 73.63 | 74.45 | 99.51 | 82.53 |
| PH with Lightweight Model | 42.33 | 83.26 | 83.24 | 99.72 | 88.74 |
| DO with Lightweight Model | 43.66 | 84.23 | 84.69 | 99.68 | 89.53 |
| PH with Lightweight Model + | 48.14 | 98.07 | 98.56 | 99.62 | 98.75 |
| DO with Lightweight Model + | 48.42 | 98.79 | 99.51 | 99.83 | 99.38 |
As shown in the table, the enhanced Lightweight Model shows significant improvement in robustness against geometric distortions. The DO method achieves the best performance in both visual quality and the three types of distortions. PH performs slightly worse than MBRS only in RandomElasticTransform but achieves better results in both visual quality and average accuracy.
Overall, the motivation of our work is to explore the feasibility of lightweight deep learning watermarking models and propose broadly applicable methods to help improve the performance of such lightweight models. Therefore, we primarily focused on validating the feasibility and effectiveness of the two methods (DO and PH) on a very simple, lightweight model with a small number of parameters. However, we would like to emphasize that the two proposed training methods (DO and PH) are widely applicable to lightweight models with different architectures. We believe they will be more widely adopted in future research and will help other researchers improve the performance of their lightweight watermarking models.
Thank you for your detailed review and valuable feedback on our submission. We truly appreciate the time you dedicated to evaluating our work and providing thoughtful comments. Below, we address your points to clarify and enhance our manuscript.
Weakness 1: Can the safe distance ($\epsilon$) in DO be tuned automatically?
Thank you for your insightful suggestion. Following your recommendation, we attempted to make the safe distance ($\epsilon$) a learnable parameter and incorporate it into the model's optimization process. We initialized $\epsilon$ with a value of 0.1 and kept the other training settings unchanged from Section 4.2. During the training process, we continuously monitored the model's decoding accuracy (ACC) and the value of $\epsilon$. We found that as training progressed, $\epsilon$ continuously decreased and eventually degraded to 0. In the later stages of training, the model became unstable, and ACC rapidly dropped to around 50%, making the model unable to decode.
We analyzed this behavior and identified that in the DO method, the role of $\epsilon$ is to penalize values that are too close to the boundary (0). The larger $\epsilon$ is, the broader the penalty range; as $\epsilon$ approaches 0, the penalty range shrinks, and at $\epsilon = 0$, no penalty is applied. A shortcut to minimizing the loss is therefore to set $\epsilon = 0$, causing the DO method to degrade to its first component alone (the term that only penalizes incorrectly decoded bits). This situation, where only the first component is present, aligns with the analysis in Section 2.2 ("The Gap between Two Objectives") of our original paper. In this case, model training becomes unstable, as further verified and reported in Table 5 of Section 4.4 (Ablation Study).
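A sketch of the attempted setup, reusing the illustrative surrogate loss from earlier in this thread (again our own illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class LearnableSafeDistance(nn.Module):
    """eps as a trainable parameter, initialized to 0.1 as in the experiment above."""
    def __init__(self):
        super().__init__()
        self.eps = nn.Parameter(torch.tensor(0.1))

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        margin = pred * target
        wrong = margin <= 0
        near = (~wrong) & (margin < self.eps)
        # Driving eps -> 0 empties `near` and zeroes its penalty, so the optimizer
        # takes this shortcut, leaving only the wrong-bit term (unstable training).
        loss = ((pred - target) ** 2)[wrong].sum() + (self.eps - margin)[near].sum()
        return loss / pred.numel()
```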
We regret that we could not directly incorporate $\epsilon$ as a learnable parameter into the training, but your suggestion is highly valuable. As a compromise, we conducted extensive experiments on the model's performance under different values of $\epsilon$, which are presented in Appendix E.10. These experiments provide valuable insights for selecting an appropriate value of $\epsilon$. We also plan to explore more automated and convenient methods in future research. We hope this explanation addresses your concerns and contributes to a clearer understanding of the challenges we faced. We remain open to any additional comments or suggestions.
Dear reviewers,
We sincerely thank you for your thoughtful feedback, which has greatly contributed to improving the quality of our work. Below, we summarize the key revisions we made in response to your suggestions, as reflected in both the rebuttal and the revised manuscript:
- Enhanced Visual Results in Main Text: We have incorporated more visual results directly into the main text, such as Figure 3 (Visual comparison of watermarked images) and Figure 2 (Structure of the projection block).
- More Detailed Analysis of the Detachable Projection Head (PH): We added a structural diagram of the projection block (Figure 2) in the main text and conducted further analysis on the optimization of PH size, presented in Appendix E.8 (Table 15 and 16).
- Improved Robustness to Geometric Distortions: We expanded discussions on lightweight model extensibility and the broad applicability of our DO and PH methods, along with enhanced results for geometric distortions in Appendix E.11 (Table 20).
- Capacity Evaluation: We included experiments evaluating capacity, detailed in Appendix E.3 (Table 8).
- Impact of Discriminative Networks: We analyzed the influence of discriminative networks on model performance and efficiency, as shown in Appendix E.7 (Tables 11, 12, and 13).
- Further Discussion on BCE Loss: Additional discussions on BCE loss, along with clarifications and experiments for the DO method based on BCE loss, were added in Appendix A and E.1 (Table 6).
- Revised Writing and Explanations: We further clarified the motivation and concept behind the proposed Detachable Projection Head and lightweight watermarking architecture design, providing more detailed examples in the rebuttal to address your specific questions.
We deeply value your time and expertise. As the discussion deadline approaches, we would greatly appreciate your prompt response. We remain open to further suggestions and are ready to make additional adjustments as needed.
Thank you again for your guidance and support throughout this process.
Warm regards,
Authors of Submission 6590
In this paper, motivated by the observations that "many high-performance models are limited in practical deployment due to their large number of parameters" and "the robustness and invisibility performance of existing lightweight models are unsatisfactory," the authors proposed a lightweight deep watermarking framework.
As also described in the first paragraph of the Introduction, "Robustness" is an important requirement that watermarking methods must satisfy.
So, in the first paragraph of Section 4, the authors "chose Combined Noise, which incorporates six different distortions: Gaussian Blur (GB) with a standard deviation of 2.0 and a kernel size of 7, Median Blur (MB) with a kernel size of 7, Gaussian Noise (GN) with a variance of 0.05 and a mean of 0, Salt & Pepper Noise (S&P) with a noise ratio of 0.1, JPEG Compression (JPEG) with a quality factor of 50, and Dropout (DP) with a drop ratio of 0.6" for the robustness test.
In order to train a "robust watermarking" framework, the concept of a "Noise Layer" previously proposed in HiDDeN was used, but this also constrains the capability of resisting a broad range of attacks. In fact, only a very limited set of selected attacks/distortions was adopted in the Noise Layer.
Although most reviewers are satisfied with the responses from the authors, it is found that this paper weighs "lightweight" over "robustness" and does not advance the achievement of robustness against, in particular, geometric distortions.
Reviewer h3SZ also raised concerns about the limited robustness against geometric distortions.
Therefore, sufficient robustness evaluation was overlooked by most reviewers and the authors. This is unacceptable. The authors are encouraged to take the traditional benchmark StirMark into consideration for their robustness evaluation. Since robustness is a crucial requirement, treating it seriously can actually achieve a better and more meaningful trade-off among all the watermarking requirements, including fidelity, capacity, robustness, and lightweight design.
Additional Comments from Reviewer Discussion
Robustness is a critical requirement in conventional and deep learning-based digital watermarking methods.
The authors' response to Reviewer h3SZ's comment on "Robustness under Geometric Distortions" is still unsatisfactory. According to the AC's experience in digital watermarking, this paper indeed weighs "lightweight" over "robustness" and does not advance the achievement of robustness against geometric distortions.
Reject