PaperHub
7.8/10 · Poster · 4 reviewers (NeurIPS 2025)
Ratings: 5, 4, 5, 5 (min 4, max 5, std 0.4)
Confidence: 3.8 · Novelty: 2.8 · Quality: 3.3 · Clarity: 3.3 · Significance: 3.0

Watermarking Autoregressive Image Generation

OpenReview · PDF
Submitted: 2025-05-05 · Updated: 2025-10-29
TL;DR

We study token-level watermarking in the context of autoregressive image generation models.

Keywords
watermarking, autoregressive, text, image, LLM, multimodal

Reviews and Discussion

Review
Rating: 5

This paper presents the first approach to watermarking autoregressive image generation models at the token level, adapting techniques from language model watermarking. The key challenge identified is the lack of reverse cycle-consistency (RCC), where re-tokenizing generated image tokens significantly alters the token sequence, potentially erasing the watermark. To address this, the authors propose two main solutions: (1) a custom tokenizer-detokenizer finetuning procedure that improves RCC by optimizing the decoder and a replica encoder to maintain token consistency, and (2) a watermark synchronization mechanism that enhances robustness to geometric transformations by embedding localized synchronization signals. Extensive experiments on models like TAMING and CHAMELEON demonstrate that this approach enables robust watermark detection while preserving generation quality, showing resilience against various transformations and attacks including diffusion purification and neural compression. The method also extends to multimodal scenarios, allowing joint watermarking of interleaved text and image generations with theoretically grounded p-values. This work represents a significant step towards reliable provenance tracking for autoregressive generative models, addressing a previously unexplored but important area in the field of AI watermarking.

Strengths and Weaknesses

Strengths:

  • Watermarking for auto-regressive image generation models is a novel and timely topic.
  • The paper identifies a key challenge in directly applying language watermarking techniques, namely the lack of reverse cycle-consistency, and proposes effective solutions: a tokenizer-detokenizer finetuning procedure and watermark synchronization.

Weaknesses:

  • The paper misses some concurrent works [1,2]. It would be beneficial to include comparisons with these concurrent approaches.

[1] A Watermark for Auto-Regressive Image Generation Models. (Wu et al.)

[2] Training-Free Watermarking for Autoregressive Image Generation. (Tong et al.)

Questions

See the weaknesses part.

Limitations

Yes

Final Justification

This paper represents pioneering work in watermarking auto-regressive image generation models. It is technically sound and intellectually stimulating. The authors have addressed my earlier concern regarding comparisons with concurrent work by noting that these works were released after the NeurIPS submission deadline. Based on these considerations, I believe the paper merits a score of 5.

Formatting Concerns

None

Author Response

We thank the reviewer for their feedback. We are glad that they recognize our work as novel and timely and consider our solutions to be effective. We reply to the question raised by the reviewer below.

Q1: Why does the paper not include comparisons to concurrent works [1] and [2]?

We remark that it was not possible for us to include these in our submission as both of these works were first made public after the NeurIPS submission deadline. We however appreciate the references and will look at them in detail. We certainly agree that relevant concurrent work should be prominently discussed, and will add a detailed paragraph about all such works to the Related Work section in the next revision.

As this resolves the only point raised by the reviewer as a weakness of our work and the remainder of the review is positive, we respectfully ask the reviewer to consider updating their score.

Comment

Thanks for your clarifications, I've updated the score.

Review
Rating: 4

This paper is the first to propose a method for watermarking the outputs of autoregressive image generation models at the token level. The authors identify a key challenge in directly applying watermarking techniques from language models to the image domain: a lack of reverse cycle-consistency (RCC). RCC refers to the inconsistency observed when a sequence of image tokens is decoded into an image and then re-encoded back into tokens, which can erase the watermark. To address this, the paper introduces a finetuning procedure for the image tokenizer's encoder and decoder to improve RCC. Furthermore, a post-hoc watermark synchronization step is proposed to enhance robustness against spatial distortions. The paper presents comprehensive experiments demonstrating that the proposed method effectively mitigates the RCC problem, thereby improving the watermark's effectiveness and robustness while maintaining high fidelity in the generated images. The authors also validate the method's applicability to multimodal watermarking and its potential extension to the audio domain.
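
For intuition, here is a minimal sketch of what one RCC finetuning step could look like (all names are hypothetical; the paper's actual objective is its Equation 6, which balances RCC with a perceptual quality regularizer):

```python
import torch
import torch.nn.functional as F

def rcc_finetune_step(decoder, frozen_decoder, replica_encoder, codebook,
                      tokens, lpips, lam=0.1):
    """Hypothetical RCC finetuning step: decode tokens, re-encode the result,
    and pull the re-encoded latents back toward the originals, while a
    perceptual loss keeps outputs close to the original decoder's."""
    z = codebook(tokens)              # latents for the generated token sequence
    image = decoder(z)                # de-tokenize with the trainable decoder
    z_hat = replica_encoder(image)    # re-encode the generated image
    loss_rcc = F.mse_loss(z_hat, z)   # reverse cycle-consistency term
    with torch.no_grad():
        ref = frozen_decoder(z)       # reference output of the original decoder
    loss_quality = lpips(image, ref).mean()  # perceptual quality regularizer
    return loss_rcc + lam * loss_quality
```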

Strengths and Weaknesses

Strengths

  • The paper is well-written, clear, and easy to follow.
  • It provides a thorough and comprehensive analysis of the key challenge it identifies, reverse cycle-consistency (RCC).
  • The experimental evaluation is extensive and convincing. The results clearly show that the finetuning process improves RCC and that this enhancement directly translates to better watermark detectability and robustness against a variety of transformations.
  • The paper includes valuable extension experiments that demonstrate the method's effectiveness in joint multimodal (text-image) watermarking and its applicability to the audio domain, broadening the potential impact of the work.

Weaknesses

  • The paper does not investigate whether introducing a watermark at the token-level during generation affects the diversity of the output images.
  • The evaluation of image quality degradation relies on FID, which may not be sufficient for a fine-grained assessment. A comparison of the similarity (e.g., using PSNR and SSIM) between images decoded by the original tokenizer and the finetuned tokenizer from the exact same token sequence would provide a more direct measure of the visual impact of finetuning the decoder.
  • The use of the synchronization layer (SYNC) to improve robustness to geometric transforms comes at the cost of reduced visual quality and added computational expense. It is not made sufficiently clear why this approach is superior to simply using a robust post-hoc watermarking method or a localized watermark directly for provenance.
  • The necessity of finetuning the tokenizer may present a significant overhead, potentially limiting the method's practical applicability. The paper could strengthen its argument by providing a more direct comparison of the trade-offs (e.g., cost vs. robustness) against less computationally intensive post-hoc watermarking alternatives.

Questions

  1. Could you provide any metrics to evaluate whether the proposed watermarking scheme reduces the diversity of the generated images?
  2. Following on from Weakness #2, could you provide metrics such as PSNR and SSIM to compare the images decoded from the identical token sequence by the original decoder versus the finetuned decoder? This would help quantify the perceptual changes introduced solely by the finetuning process.
  3. In Figure 6, the SYNC variant shows improved robustness to cropping. Could you elaborate on how the synchronization mechanism effectively improves the robustness?
  4. Regarding Weakness #3, could you further justify the use of the SYNC, which degrades image quality, over applying a robust post-hoc localized watermark directly? A more detailed discussion on the trade-offs in terms of robustness, security (e.g., provable p-values ), and visual quality would be beneficial.
  5. Regarding Weakness #4, could you elaborate on the practicality of the tokenizer finetuning step? What are the key advantages of this approach that justify the additional computational cost compared to applying existing robust post-hoc watermarks?

Limitations

Yes

Final Justification

Thanks for the response. I keep my score.

Formatting Concerns

No

Author Response

We thank the reviewer for their detailed feedback and for recognizing our work as well-written and clear, our experiments as extensive and convincing, and our multimodal and audio experiments as valuable extensions. We respond to the reviewer’s questions below, numbered Q1-Q4. We kindly ask the reviewer to let us know if their concerns were addressed, and we are happy to continue the discussion further if there are follow-up questions.

Q1: Can you quantify the effect of the watermark on diversity?

We note that FID, which we report in the paper, captures both fidelity and diversity. However, prompted by this question, we also measure diversity via Recall, following [R1] and in the same setup as in Sec. 4. We observe no significant degradation of Recall for either Chameleon or Taming, reinforcing our conclusion that our method does not harm generation. We provide complete results below and will include these in the next revision.

Recall       BASE   FT     FT+AUGS   FT+AUGS+SYNC
Taming       0.60   0.60   0.59      0.58
Chameleon    0.37   0.37   0.37      0.37

[R1] Kynkäänniemi et al., 2019, “Improved Precision and Recall Metric for Assessing Generative Models”

Q2: Can you compare PSNR between original and finetuned decoders on the same tokens?

We agree that a more fine-grained quality analysis on top of FID is beneficial, and for this reason already report PSNR in Appendix F.4. We will make this experiment more prominent in the next revision of the paper, adding explicit pointers from the main text. As seen there, PSNR remains high for both models: after full RCC finetuning, the mean PSNR between images decoded from the same token sequences by the original and the finetuned decoder is 49.7 dB for Taming and 48.0 dB for Chameleon. This demonstrates that decoder finetuning introduces minimal perceptual changes.

Q3: Can you clarify how synchronization improves robustness to crops?

Certainly. As detailed in Appendix E.2 and illustrated in Figure 8 in the appendix, we consider crops from the top-left corner of the image followed by upscaling to the generative model’s original generation size. As each image token loosely corresponds to a region in the image, this can result in completely different tokens, erasing our token-level watermark.

Resolving this via RCC finetuning is not possible, as it would teach the encoder to assign the same token ID to fundamentally different content (e.g., orange fish scales and black background in Figure 8). Instead, as shown in the figure, our synchronization layer estimates the extent of the crop, allowing us to revert it by downscaling and padding appropriately. This (approximately) recovers original tokens within the area that was not cropped, which are often sufficient to detect the image as watermarked.
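
For concreteness, the reversal can be sketched as follows, where `crop_frac` stands for the synchronization layer's estimate of the crop extent (names and details are hypothetical, not the paper's implementation):

```python
import torch.nn.functional as F

def revert_topleft_crop(image, crop_frac):
    """Undo a top-left crop that was upscaled back to full size: shrink the
    image to the estimated cropped extent, then pad right/bottom so the
    surviving content realigns with the original token grid."""
    _, _, h, w = image.shape                            # (B, C, H, W)
    ch, cw = round(h * crop_frac), round(w * crop_frac)
    small = F.interpolate(image, size=(ch, cw), mode="bilinear",
                          align_corners=False)
    return F.pad(small, (0, w - cw, 0, h - ch))         # pad (left, right, top, bottom)
```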

We hope this helps clarify the mechanism; please follow up if you have additional questions.

Q4: How does your approach compare to post-hoc watermarks from the angles of robustness, security, visual quality and efficiency?

Our approach (RCC finetuning + token-level watermark + synchronization layer) has several important advantages compared to post-hoc methods and, importantly, is not fundamentally inferior in any major way. Extending the discussion in the paper, we provide concrete arguments below. While we certainly agree that research on better post-hoc watermarks is also important, we believe these arguments clearly motivate the value of our approach.

Robustness: As Table 2 shows, our method is robust to valuemetric and geometric changes, as well as common neural compression and purification attacks. In contrast, 2/4 post-hoc watermarks in Table 2 are removable by geometric changes, and none of the 4 are robust to attacks, with the best evaluated baseline (TrustMark) still on average being outperformed by our method. We hypothesize that their brittleness to attacks is due to anchoring the watermark in an original clean image that the attacks can approximately recover. In contrast, for our token-level method there is no original clean image, as the image is already watermarked when the generation finishes.

Security: As discussed on L182 and L165, our token-level watermark gives theoretically rigorous p-values directly inherited from LLM watermarking, which can minimize the risk of false positives, e.g., unjustly blaming actors for covert use of AI tools. In contrast, post-hoc watermarks are generally based on neural extractors, meaning that their p-values are derived from the expected behavior of a neural net on non-watermarked images, which is hard to rigorously prove and may introduce bias (see our references in the paper).
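
To make this concrete, a p-value of this kind reduces to a simple binomial test (a sketch assuming a KGW-style scheme where, under the null of no watermark, each token is independently "green" with probability gamma):

```python
from scipy.stats import binomtest

def watermark_p_value(num_green, num_tokens, gamma=0.5):
    """Probability of observing at least num_green green-list tokens among
    num_tokens if the content were NOT watermarked."""
    return binomtest(num_green, num_tokens, gamma, alternative="greater").pvalue
```

Joint multimodal detection (Sec. 4.3) can pool green counts across text and image tokens into a single test of this form.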

Visual quality: As our FID results (main paper and Appendix F.3) and PSNR measurements (Appendix F.4) show, our watermark does not significantly degrade visual quality.

Efficiency: Our tokenizer finetuning step is practical and incurs no significant overhead. As noted on L214, in our experiments finetuning took at most 32 GPU hours. This is a small one-time-per-model cost for a practitioner, who can then use the resulting tokenizer in generation and detection of any number of watermarked images. In comparison, the training of the Chameleon model itself took >850000 GPU hours (see Table 2 in [10]). Our watermark detection is not expensive: we refer the reviewer to our answer to Q2 of Rev. s9Sh above where we demonstrate detection latency comparable to other generation-time watermarks.

Cross-modal watermarking: Additionally, our token-level approach uniquely enables cross-modal watermarking. In particular, as we show in Sec 4.3, we can easily apply the same underlying scheme to a set of modalities that can be tokenized (e.g., text, images, audio), and then run joint detection to obtain a single rigorous p-value across multimodal content. In contrast, post-hoc methods for text, images and audio generally differ in the literature and we are not aware of attempts to unify them across modalities.

Review
Rating: 5

This paper aims to watermark autoregressive image models at the token level. That is, the watermark introduces a pattern in the choice of tokens, which are then converted into an image. To detect the watermark in an image, one first converts it into tokens and then checks for the watermark pattern. A significant challenge in this approach is the fact that token recovery is unreliable for image models. That is, if one takes a generated image and converts it into tokens, these tokens are often not the same as the ones that were actually used to generate it. Reverse cycle consistency (RCC) is the property that nearly all of the tokens are correctly recovered in this scenario; this paper aims to use fine-tuning to improve RCC.
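
Concretely, RCC can be quantified as a token match rate over the decode/re-encode round trip; a minimal sketch with hypothetical tokenizer callables operating on PyTorch token-ID tensors:

```python
def token_match_rate(tokens, detokenize, tokenize):
    """Fraction of tokens recovered after decoding to pixels and re-encoding;
    perfect reverse cycle-consistency would give 1.0."""
    image = detokenize(tokens)        # tokens -> pixels
    tokens_hat = tokenize(image)      # pixels -> tokens
    return (tokens_hat == tokens).float().mean().item()
```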

While RCC is necessary for this approach to work, it is not sufficient in the presence of image modifications. For example, cropping the image may introduce further error into re-tokenization. To address issues introduced by modifications, this paper uses localization, which attempts to recognize the modification in order to reverse it.

This paper shows empirically that the watermark is detectable, and that it withstands certain image modifications such as blur, rotation, and brightening. The end watermark with synchronization is additionally robust to geometric transformations such as flipping and cropping.

Strengths and Weaknesses

Strengths

  • The paper uses finetuning to improve token recovery in image models, and shows empirically that with this improved recovery, a token-level watermark performs reasonably well for autoregressive image models.
  • This paper is the first to design a watermark for autoregressive image models.

Weaknesses

  • The base watermark (FT+AUGS) is not robust to transformations that seem to move around the tokens, such as flipping and cropping. However, the end watermark (FT+AUGS+SYNC) is.
  • While localization helps with robustness to certain transformations, it is unclear how much it will generalize to more complicated transformations.

This paper does a good job of combining existing tools: KGW watermarking, fine-tuning to improve RCC, and localization. However, it does not really introduce new techniques, so its significance is somewhat limited: it is more of an engineering than a conceptual contribution. That said, the experiments seem to be high quality, and the paper gives insights on problems encountered in making the method work (mostly RCC issues). In my view, the paper is solid and clear but not hugely novel or surprising.

Questions

  • Do you have intuition for why the watermark is robust to certain transformations but not others?
  • What was the reasoning for using KGW as a token-level watermark? Were other schemes considered?
  • Token-level text watermarks’ performance varies greatly depending on the amount of entropy in each token. How is the entropy of image tokens typically distributed? Are there notable differences from LLM tokens?

Limitations

Yes

Final Justification

The paper does a good job of combining existing tools: KGW watermarking, fine-tuning to improve RCC, and localization. However, it does not really introduce new techniques. The experiments are comprehensive, though, and the authors satisfactorily answered all questions, and also included interesting new entropy experiments. The authors cleared up my confusion about robustness to geometric transformations.

While its techniques are limited in novelty, this paper is the first to consider watermarking autoregressive image models. This is a solid contribution, and it is well done. I recommend accept, as it is a solid preliminary study of watermarks for autoregressive image models, and it opens the door to further technical development.

Formatting Concerns

None

Author Response

We thank the reviewer for their detailed comments and for recognizing our work as solid, clear and of high quality. We respond to all questions raised by the reviewer below, numbered Q1-Q4. We kindly ask the reviewer to let us know if their concerns were addressed, and we are happy to continue the discussion further if there are follow-up questions.

Q1: Is the watermark defeated entirely by flips and crops?

No; we believe there might be a slight misunderstanding here. Our final method, denoted FT+AUGS+SYNC in Table 2, is quite robust to geometric transformations (Geo.) which include flips, crops and rotations: we get robustness scores of 0.82 on Taming and 0.64 on Chameleon. The above rows in the same table refer to ablations of our method with some features disabled.

Q2: Why is localization necessary for geometric robustness and what kind of transformations does this support?

Due to the nature of image tokenizers we study, each token is loosely tied to regions in the image. After e.g. a flip, most regions contain completely different content and thus result in different tokens, erasing our token-level watermark. The same is true for rotation and crops followed by a resize to the canonical image size (see L195). This makes it difficult for any token-level watermark to be natively robust to geometric changes. While we focus on existing commonly used tokenizers, another way around this issue would be to develop spatially invariant (e.g., semantic) tokenizers, which is a research question orthogonal to our work on watermarking.

For this reason, we introduce a synchronization layer that can add geometric robustness to token-level watermarks. Our current implementation successfully estimates and reverts all geometric transformations that we explicitly consider. This (approximately) recovers original tokens that we can then detect as watermarked. The same approach could be easily extended to generalize to a broader range of transformations of interest, potentially using ideas we discuss in Sec. 6. We find this to be a fruitful research topic for future work with many interesting directions. For example, for our follow-up work we are currently experimenting with a synchronization implementation that can directly estimate a homography matrix defined by 4 coordinates, and are seeing promising first results.

Q3: Can you motivate the choice of KGW as the underlying scheme?

We focus on KGW as arguably the most prominent scheme in the field. As we are the first to study watermarking for autoregressive image models, choosing a relatively simple, well-studied and well-understood scheme allowed us to focus on key issues particular to our setting (e.g., RCC). We do think that it would be interesting to explore other schemes such as AAR [1] or KTH [44] in this setting, as well as more elaborate schemes built on top of KGW. We note that our key contributions (RCC finetuning, synchronization) could be directly used in those cases, as they are orthogonal to the underlying scheme choice. We are looking forward to such exploration in future work.
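
For reference, a minimal sketch of the KGW green-list step as it would apply to image tokens (seeding and hashing details are simplified; gamma is the green fraction and delta the logit boost, as in the original scheme):

```python
import torch

def kgw_bias_logits(logits, prev_token, key, gamma=0.25, delta=2.0):
    """KGW-style watermark step: derive a pseudorandom green list from the
    previous token and a secret key, then boost green-token logits."""
    vocab = logits.shape[-1]
    g = torch.Generator().manual_seed(key ^ int(prev_token))
    green = torch.randperm(vocab, generator=g)[: int(gamma * vocab)]
    out = logits.clone()
    out[..., green] += delta          # detection later counts green tokens
    return out
```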

Q4: How does the entropy of next-token distributions differ for text and images?

This is an interesting question. Prompted by this, we ran a new experiment where we measured the entropy in each modality in the same setting as our Joint Watermarking experiment (Sec. 4.3) on 20 outputs (~20k logits). As we can’t include plots in this rebuttal, we summarize key results below. We will add full results of this experiment to the next revision of our paper.

Both entropy distributions have a high peak close to 0, but the entropy is generally higher and less concentrated for image tokens. In particular, the (mean, std) of entropy is (0.45, 0.74) for text and (2.93, 1.90) for images. While to make reliable conclusions we would need to also consider quality preservation and variance between tokenizers, these results provide some evidence that image tokens may be fundamentally easier to watermark.

In both cases, entropy consistently decreases as a function of token position, i.e., as more content is generated. For images, there is an interesting phenomenon: the entropy has a periodic pattern during generation due to the row-major generation order. In particular, generating a token that starts a new image row always comes with unusually high entropy.
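
For reference, per-token entropies of this kind can be read directly off the model's logits (a generic sketch, not the exact measurement code used for the experiment above):

```python
import torch.nn.functional as F

def token_entropies(logits):
    """Entropy (in nats) of each next-token distribution along the sequence;
    expects logits of shape (seq_len, vocab_size)."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)
```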

Comment

Thank you for your responses.

Q1/Q2: I originally viewed FT+AUGS+SYNC as a bit of a patch: by anticipating certain transformations that might be made, one can use synchronization to revert these transformations in detection. This is why I stated that the watermark (which I considered to be FT+AUGS) is not robust to cropping and flips. I view this as a bit of a cat-and-mouse game. I worry that the watermark can be made robust to specific geometric transformations, but if an attacker knows what these transformations are, they can choose a different transformation. That being said, I understand that geometric robustness is difficult and many works in the field are not geometrically robust to more complicated transformations.

Q3: This makes sense. I think it is worth adding this justification to the paper.

Q4: Thank you for the interesting additional experiments, and I am glad you will add them to the paper. Following Q3, it would also be nice to add discussion about how one could design token-level watermarks better tailored to autoregressive image models, by using these properties of the entropy.

Review
Rating: 5
  • The paper presents the first method for applying watermarking for outputs of autoregressive image generation models, which, unlike diffusion models, have not really been a focus of watermarking research.
  • Major contribution: Identification of a technical barrier, the lack of RCC. They show that while image tokenizers are designed for forward consistency, the reverse is not true. Re-tokenizing a de-tokenized image leads to a different token sequence, with a low baseline token match rate that gets even worse under common image transformations. This causes erasure of any watermark embedded in the original token sequence.
  • To address the challenge the authors propose a two part solution.
    • A lightweight fine-tuning procedure for the model's de-tokenizer and replica encoder which optimizes for RCC by minimizing the distance between the original and re-encoded latent representations. This improves robustness against valuemetric transformations and attacks like diffusion purification.
    • To handle geometric transformations that change token sequences, a watermark synchronization step is proposed. This repurposes a localized watermark as a synchronization signal to detect and revert these transformations before applying the primary watermark detector.
  • The paper demonstrates that the approach produces robust and quality preserving watermark through multiple experiments on TAMING and CHAMELEON models.

Strengths and Weaknesses

STRENGTHS

  • Novelty: The paper addresses a new and important area of research, generation-time watermarking for autoregressive models.
  • Principled Method:
    • Identification of RCC is an important insight in the paper.
    • The two part solution that is proposed is sound. The RCC fine-tuning procedure is principled with a clear objective function (Equation 6) that balances the goal of RCC with the preservation of generation quality using a perceptual loss regularizer.
  • Comprehensive Experimentation:
    • The method is validated on multiple autoregressive models (TAMING and CHAMELEON) and a comprehensive ablation study across four variants of their approach (BASE, FT, FT+AUGS, FT+AUGS+SYNC) is shown to isolate the impact of each component.
    • Robustness testing covers multiple valuemetric and geometric transformations. Tests are performed on advanced as well as realistic removal attacks including diffusion purification and multiple neural compressors, showing strong performance as seen in Table 2 and Figure 6.
    • Extensions to other modalities (e.g., audio) in Section 5 strengthen the paper by showcasing the flexibility and generalizability of the core framework.

WEAKNESSES

  • It would be nice to explain trade offs for practitioners who would need to choose between geometric robustness and valuemetric robustness. A clear explanation of the conditions under which the synchronization layer might harm detection would be valuable.
  • It would be nice to have an analysis of the computational cost and specific latency figures (e.g., milliseconds/image) for watermark detection. This will be useful to assess the practical feasibility of the method for large-scale deployment.

Questions

Please see the weaknesses section above, which also contains questions/comments for the authors.

Limitations

Yes

Final Justification

The authors have addressed my questions. I will maintain my rating of 5.

Formatting Concerns

N/A

Author Response

We thank the reviewer for providing valuable feedback and recognizing our work as novel, principled, and comprehensive. We respond to the questions Q1 and Q2 below. We are happy to continue the discussion further if the reviewer has follow-up questions.

Q1: Can you elaborate how synchronization can harm detection in the presence of valuemetric transformations?

Certainly. Expanding on our discussion on L256 in the paper, we recognize three distinct modes:

  • Under weak valuemetric transformations (e.g., JPEG 90), the synchronization signal remains mostly intact. We estimate that no geometric transformation was applied, and the detection behaves as if there was no synchronization.
  • Under strong valuemetric transformations (e.g., JPEG 30), the synchronization signal disappears. As we are not able to estimate which geometric transformation was applied, the detection again behaves as if there was no synchronization.
  • Finally, for some moderate strengths, the synchronization signal is disrupted. In Fig. 8 (bottom row) in the appendix we see one such example for JPEG 70. While in this example we still successfully estimate that no geometric transformation was applied, such cases often lead to an estimate of a nonexistent transform (e.g., -5 degree rotation). Our algorithm then rotates the image by 5 degrees in an attempt to revert this, which changes the tokens and degrades the watermark signal. To make this clearer, we will add examples of such failures in the next revision of the paper.

The last case leads to a small drop in valuemetric robustness seen in Table 2 (~0.1), traded off for a big jump in geometric robustness (from ~0 to ~0.80/0.65). We believe this tradeoff is generally desirable as it leaves no simple way to reliably and consistently remove the watermark. This can change if the practitioner has a very specific prior, e.g., they only care about JPEG robustness as it is inherent to their use-case.

More importantly, the only cause of this is the insufficient valuemetric robustness of the localized embedder that we use; it is not a fundamental limitation of our approach. As discussed in Sec. 6 and confirmed by the early experiments done in our follow-up work, replacing the off-the-shelf embedder with a targeted synchronization model can reduce this effect.

Q2: Can you provide experimental results on detection latency?

Prompted by the reviewer’s question, we benchmarked our detection on Taming generations on a single H100 GPU. We obtain consistent times averaging 837ms for synchronization, 6ms for subsequent tokenization, and 450ms for watermark detection. This makes the detection practical for large-scale deployments, and on the same order of magnitude as SOTA diffusion model watermarks, e.g., detecting Tree-Ring [88] took ~2s in our experiment. Note that we did not explicitly optimize the latency nor experiment with parallelization, which should be simple in this case and would further improve speed.
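
Latency figures of this kind can be reproduced with a standard GPU timing harness, sketched below (`stage_fn` stands for any of the three detection stages; this is generic benchmarking code, not the authors' script):

```python
import time
import torch

def time_stage(stage_fn, *args, warmup=3, iters=20):
    """Average wall-clock latency of one detection stage in milliseconds,
    with warmup and CUDA synchronization for accurate GPU timing."""
    for _ in range(warmup):
        stage_fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        stage_fn(*args)
    torch.cuda.synchronize()
    return 1000.0 * (time.perf_counter() - start) / iters
```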

Comment

Thank you for addressing the questions. I will maintain my current rating.

Final Decision

This paper introduces the first method for watermarking autoregressive image generation models, addressing the lack of reverse cycle-consistency (RCC) issue in this problem, and proposing fine-tuning and synchronization techniques to enable robust, token-level watermark detection.

Strengths: The problem is novel and timely, the identification of RCC is an important insight, and the proposed solutions are principled, experimentally validated across multiple models, and extendable to multimodal settings. Weaknesses: The work is primarily an engineering contribution that combines existing techniques, and one reviewer noted that the trade-offs between robustness and computational cost are underexplored.

Reviewers agreed that the RCC insight and the strong empirical results make this a significant contribution, and the rebuttal effectively resolved concerns raised by reviewers. I recommend acceptance as this pioneering and technically rigorous paper establishes a foundational approach to watermarking autoregressive image models, supported by convincing experiments.