Dear Reviewer kwvu,

Thank you very much for your careful reading and recognition of our work. We address each point in turn below.

W1: Lacks some theoretical analysis. In the experiments, do you need to use the same VAE model for the attack as you have used during watermarking training? If not, then an analysis may be needed as to why it is still robust against attacks with different VAE models.

A1: In the experiment, it is not necessary to use the same VAE model for attacks as the one used for watermark training.

Essentially, adding information will inevitably alter some aspects of the image, and our goal is to ensure that this information remains as intact as possible during attacks. As demonstrated in Section 5.3 and Table 3, different optimization methods exhibit distinct characteristics.

Directly embedding watermarks in the pixel or latent domain makes them highly susceptible to various traditional attacks, because an individual bit of information may reside in only one or a few pixels. In contrast, embedding watermarks in the frequency domain effectively resists multiple traditional attacks because this method distributes the information across the entire image.
Similarly, watermarks embedded in the pixel space are weak against regeneration attacks, whereas embedding watermarks in the latent space significantly enhances robustness against such attacks. This is due to the stronger correlation between the optimized latent embeddings and the image semantics, making the watermark more resistant to disruption by regeneration attacks.

FreqMark embeds watermarks in the latent frequency space, combining these two aspects to produce a synergistic effect, making FreqMark robust against traditional attacks and regeneration attacks.
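The point about frequency-domain embedding spreading information across the whole signal can be illustrated with a minimal sketch. It is not FreqMark's implementation: the 64x64 array is a hypothetical stand-in for one channel of a VAE latent, and the edited coefficient and its magnitude are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one channel of a VAE latent; not the paper's encoder output.
latent = rng.standard_normal((64, 64))

# Spatial-domain edit: change a single element.
spatial_edit = latent.copy()
spatial_edit[10, 20] += 1.0
spatial_touched = np.count_nonzero(~np.isclose(spatial_edit, latent))

# Frequency-domain edit: change one Fourier coefficient and its conjugate
# partner, so the inverse transform stays real-valued.
spectrum = np.fft.fft2(latent)
spectrum[3, 5] += 50.0
spectrum[-3, -5] += 50.0
freq_edit = np.fft.ifft2(spectrum).real
freq_touched = np.count_nonzero(~np.isclose(freq_edit, latent))

print(f"spatial edit changed {spatial_touched} of {latent.size} positions")
print(f"frequency edit changed {freq_touched} of {latent.size} positions")
```

The spatial edit touches one position, while the frequency edit adds a low-amplitude cosine pattern that changes nearly every position, so a local distortion cannot erase it by hitting a few elements.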

Furthermore, because FreqMark optimizes each image directly, it is even more robust against regeneration attacks that use the same VAE.

Bit Accuracy of regeneration attack using the same VAE

PSNR after VAE attack	31.43	30.31	28.98	27.39	25.82
Bit Acc	1.000	1.000	0.998	0.990	0.975

W2: Since the proposed method requires case-by-case optimization, what is the watermarking time for each case, and how does it compare to other competing methods?

A2: We employed half-precision training to reduce watermarking time and GPU memory usage without compromising performance. In our current experiments on a single A100 GPU with 40 GB of memory, processing four 512x512 images in parallel for 400 steps takes about 0.75 minutes per image. Increasing the batch size will improve overall efficiency.

As shown in Figure 7 in the Appendix, FreqMark still performs well at 200 steps, and even at 100 steps, so the number of steps can be reduced as needed to save time. According to our experiments, processing four 512x512 images in parallel for 100 steps takes about 12 seconds per image.

Bit Accuracy with different number of steps

Bit Acc	JPEG	Gauss noise	VAE-B	VAE-C	Diffusion
400 steps	1.000	0.934	0.925	0.897	0.945
200 steps	1.000	0.930	0.923	0.885	0.933
100 steps	0.987	0.922	0.921	0.881	0.888
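As a quick check on the timings quoted in this answer, the arithmetic below derives the per-step cost and the speedup from 400 to 100 steps. The two input figures come from the text; the variable names are illustrative and the derived values are only as precise as the rounded inputs.

```python
# Figures quoted in the response; everything below is derived arithmetic.
seconds_per_image_400 = 0.75 * 60  # 0.75 minutes per image at 400 steps
seconds_per_image_100 = 12.0       # about 12 seconds per image at 100 steps

per_step_400 = seconds_per_image_400 / 400
per_step_100 = seconds_per_image_100 / 100
speedup = seconds_per_image_400 / seconds_per_image_100

print(f"{per_step_400:.4f} s per step at 400 steps")
print(f"{per_step_100:.4f} s per step at 100 steps")
print(f"{speedup:.2f}x faster per image at 100 steps")
```

The per-step cost is nearly constant, which suggests that watermarking time scales roughly linearly with the number of optimization steps.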

To use Stable Signature, one must first spend a day training the watermark extractor on 8 GPUs. Subsequently, for any specific hidden message, Stable Signature requires approximately 1 minute to fine-tune the VAE decoder.

SSL directly optimizes the image pixels, which makes it faster: under the same experimental settings, it takes approximately 1 second per image. However, FreqMark is significantly more robust than SSL.

We will further attempt to reduce the optimization time without compromising performance.

W3: Why do we need a set of pre-trained N-dimensional direction vectors, but not directly produce the message?

A3: Our method of extracting hidden watermark information along predefined vector directions follows SSL. Overall, this approach is more suitable for post-generation self-supervised methods, providing stronger robustness for hidden watermarks while maintaining image quality. Additionally, it allows the hidden watermark information to be recovered without training an additional extraction network.
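A minimal sketch of direction-based extraction follows. It assumes each bit is read from the sign of the feature's projection onto one fixed direction, which is how we understand SSL-style decoding; the response does not spell out the rule. The names `dim`, `num_bits`, `directions`, and `decode` are hypothetical, and the random `feature` vector stands in for the output of a real image encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_bits = 256, 48  # hypothetical feature dimension and message length

# Orthonormal directions: columns of Q from a reduced QR decomposition.
q, _ = np.linalg.qr(rng.standard_normal((dim, num_bits)))
directions = q.T  # shape (num_bits, dim)

def decode(feature, directions):
    """Read one bit per direction from the sign of the projection."""
    return (directions @ feature > 0).astype(int)

message = rng.integers(0, 2, size=num_bits)

# Stand-in for an optimized feature: pushed along +/- each direction,
# plus small noise. A real system would obtain this from an image encoder.
signs = 2 * message - 1
feature = signs @ directions + 0.1 * rng.standard_normal(dim)

recovered = decode(feature, directions)
bit_accuracy = float(np.mean(recovered == message))
print(f"bit accuracy: {bit_accuracy:.3f}")
```

Because the directions are fixed in advance, decoding is a single matrix-vector product and no extractor network has to be trained. Orthogonality keeps the projection for one bit from leaking into another.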

W4: In Equation (4), why ? N is a number, not a set, so is not appropriate. Besides, using to denote a vector set and to represent a vector can lead to confusion. In Figure 2, there are vectors in the set, I think is the bit length.

A4: Thank you for your careful reading and for pointing out the errors. This section of the paper does contain some incorrect expressions that caused confusion, and we will make the necessary corrections.

Each pre-defined vector is N-dimensional. For clarity, we define the number of pre-defined vectors as K (we will also correct Vector 0 to Vector 1 in Figure 2). To avoid interference between different bits of information, the pre-defined vectors should be orthogonal to each other, so K must be less than or equal to N. This means that at most N bits of information can be embedded with N-dimensional vectors.

Therefore, the correct statement is that the pre-defined vectors form a set of K N-dimensional vectors.

And Equation 4 should be corrected to:

We will continue to carefully review the paper and correct any incorrect or confusing statements.

Thanks again for your suggestions and corrections!