WMarkGPT: Watermarked Image Understanding via Multimodal Large Language Models
We first propose WMarkGPT, a multimodal large language model (MLLM) that enhances watermark content understanding by integrating visibility prediction and textual description, surpassing mainstream MLLMs in effectiveness and depth.
Abstract
Reviews and Discussion
The paper introduces WMarkGPT, a multimodal large language model (MLLM) designed to understand watermarked images without requiring access to the original images. Specifically, it integrates a visual encoder, learnable queries, a visual abstractor, and an LLM to generate detailed descriptions of watermarks and predict their visibility. This work proposes a three-stage training pipeline that progressively enhances the model's ability to understand object positioning, watermark characteristics, and semantic corruption in watermarked images. The authors construct three visual question-answering (VQA) datasets: an object location-aware dataset, a synthetic watermarking dataset, and a real watermarking dataset. Extensive experiments demonstrate that WMarkGPT significantly outperforms existing MLLMs in terms of watermark description relevance and visibility prediction accuracy.
Questions for Authors
- What is the specific network structure of the Visual Abstractor? Is it a transformer block or just a single attention layer?
- Have the baseline models compared by the authors been fine-tuned on the constructed WQA datasets? If not, I would suggest evaluating these models after fine-tuning to provide a more comprehensive comparison.
Claims and Evidence
The claims made in the paper are well-supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria are appropriate for the problem.
Theoretical Claims
The paper does not present any formal theoretical claims or proofs. The focus is on empirical improvements and model design. Therefore, this aspect is not applicable.
Experimental Design and Analysis
The experimental designs and analyses appear sound and valid:
- The authors conducted extensive experiments on both synthetic and real watermarking datasets, demonstrating the model's effectiveness.
- The use of multiple evaluation metrics provides a comprehensive assessment of the model's performance.
- The ablation studies provide insights into the impact of different training stages and dataset sizes on model performance.
Supplementary Material
No additional supplementary materials.
Relation to Prior Literature
The key contributions of the paper are well-grounded in the broader scientific literature:
- The use of multimodal large language models (MLLMs) for image understanding aligns with recent advancements in vision-language models (e.g., Qwen, LLaVA, VILA).
- The focus on watermark understanding addresses a significant gap in existing evaluation methods, which rely on pixel-wise metrics and require access to original images.
- The proposed datasets and training pipeline build upon prior work in VQA and multimodal learning, providing a new benchmark for watermark security.
Missing Important References
The paper has cited relevant works in multimodal learning, watermarking, and corresponding evaluation metrics.
Other Strengths and Weaknesses
Strengths:
- The authors carried out thorough ablation experiments to assess the effectiveness of the proposed method.
- The manuscript’s expression and structure enhance its readability, making it easy for readers to understand.
- This paper is the first to propose using an MLLM to evaluate the content and visibility of watermarks, which is an interesting and meaningful topic.
Weaknesses:
- The authors do not appear to have conducted any ablation experiments to demonstrate the effectiveness of the learnable query setting for this task. I wonder whether similar performance could be achieved through data-efficient multi-stage SFT training alone (as in the paradigm of general multimodal understanding LLMs).
- Since the WMarkGPT model can directly predict the specific location of the watermark, the benchmark seems to lack an evaluation metric for watermark position prediction. It would be helpful to design an evaluation metric, similar to those used in object detection, that includes precision and recall to assess the model's effectiveness in watermark location prediction.
Other Comments or Suggestions
Please refer to Weaknesses and Questions.
Thanks for the valuable comments and recognition of the novelty and meaningfulness of our research. The reviewer acknowledged the clarity and thoroughness of our experimental design, as well as the strong evidence backing our claims. They highlighted WMarkGPT’s superior performance in watermark description and visibility prediction. The reviewer also thought that our datasets and training pipeline provided a new benchmark for watermark security. In addition, the reviewer commended our comprehensive ablation studies and found the manuscript well-structured and easy to follow.
Q1: Effectiveness of the learnable queries
Answer: The learnable queries are employed to extract high-level semantic features from images and filter out redundant visual noise. We conducted an experiment in which we removed the learnable queries and applied multi-stage SFT directly to the model, which resulted in considerable performance drops. This shows that multi-stage SFT alone is not sufficient, as it lacks the targeted feature abstraction provided by the learnable queries.
| WQA-Synthetic | BLEU-1 | ROUGE-L | LLM-Score | ACC |
|---|---|---|---|---|
| w/o Learnable Query | 0.474 | 0.433 | 82.981 | 0.613 |
| WMarkGPT | 0.488 | 0.446 | 87.751 | 0.645 |
| WQA-Real | BLEU-1 | ROUGE-L | LLM-Score | ACC |
|---|---|---|---|---|
| w/o Learnable Query | 0.401 | 0.394 | 69.788 | 0.516 |
| WMarkGPT | 0.424 | 0.418 | 71.950 | 0.546 |
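For reference, BLEU-1 and ROUGE-L in these tables are standard text-overlap metrics between generated and ground-truth descriptions. A generic sketch of how they can be computed with common libraries is shown below; the choice of nltk and rouge_score, and the example strings, are assumptions for illustration rather than the authors' actual evaluation script.

```python
# Generic sketch of BLEU-1 and ROUGE-L computation for watermark descriptions.
# Libraries and example strings are assumptions, not the authors' exact setup.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def bleu1(reference: str, candidate: str) -> float:
    # BLEU-1 uses unigram precision only, hence weights (1, 0, 0, 0).
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference.split()], candidate.split(),
                         weights=(1.0, 0, 0, 0), smoothing_function=smooth)

def rouge_l(reference: str, candidate: str) -> float:
    # ROUGE-L F-measure based on the longest common subsequence.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, candidate)["rougeL"].fmeasure

ref = "A semi-transparent text watermark covers the top-left corner."
cand = "A faint text watermark appears in the upper-left corner."
print(bleu1(ref, cand), rouge_l(ref, cand))
```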
Q2: Evaluating watermark position prediction
Answer: We agree that evaluating watermark position prediction is crucial. As suggested, we conducted additional evaluations to verify this. We employed the following prompt with GPT to assess the position predictions of WMarkGPT:
Task: Determine whether the following two sentences describe approximately the same watermark position, allowing for minor variations in phrasing or slight positional bias.
Sentence 1: {predicted description} Sentence 2: {ground truth}
Judgment Criteria: 1. Do they describe the same general location, even if there are small differences? 2. Are the differences within an acceptable range (e.g., slight shifts in coordinates or wording variations like "top-left" vs. "upper-left")? 3. Is there any ambiguity that affects the interpretation? 4. Does the predicted description avoid mentioning any position information, as the watermark is inherently invisible?
Expected Output: Just Final Consistency Judgment: (Yes/No)
Precision, recall, and accuracy of this consistency classification are computed with respect to the visibility/invisibility of the watermark:
| Dataset | Precision | Recall | ACC |
|---|---|---|---|
| WQA-Synthetic | 0.851 | 0.830 | 0.823 |
| WQA-Real | 0.792 | 0.763 | 0.734 |
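For concreteness, a minimal sketch of how such a GPT-based consistency judge and the resulting precision, recall, and accuracy could be computed is given below. The judge model name, API usage, and the mapping of visible/invisible watermarks onto positive/negative classes are assumptions for illustration, not the authors' exact evaluation code.

```python
# Sketch of GPT-based position-consistency judging and metric computation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Task: Determine whether the following two sentences describe approximately "
    "the same watermark position, allowing for minor variations in phrasing or "
    "slight positional bias.\n"
    "Sentence 1: {pred}\nSentence 2: {gt}\n"
    "Expected Output: Just Final Consistency Judgment: (Yes/No)"
)

def judge_position(pred: str, gt: str) -> bool:
    """Ask the judge model whether predicted and ground-truth positions agree."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(pred=pred, gt=gt)}],
        temperature=0,
    )
    return "yes" in resp.choices[0].message.content.lower()

def position_metrics(samples):
    """samples: iterable of (pred_text, gt_text, watermark_visible) triples."""
    tp = fp = fn = tn = 0
    for pred, gt, visible in samples:
        consistent = judge_position(pred, gt)
        if visible:    # visible watermark: a consistent description counts as a true positive
            tp += consistent
            fn += not consistent
        else:          # invisible watermark: correctly omitting a position counts as a true negative
            tn += consistent
            fp += not consistent
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    accuracy = (tp + tn) / max(tp + fp + fn + tn, 1)
    return precision, recall, accuracy
```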
Q3: Structure of Visual Abstractor
Answer: The visual abstractor comprises six transformer layers, each employing cross-attention between the extracted visual features and the learnable queries. Through this cross-attention, the model progressively refines visual feature extraction, ensuring the queries capture high-level semantic information while filtering out irrelevant visual cues. This design also strengthens the subsequent alignment between image features and textual descriptions. We will release the code after acceptance to clarify this further.
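A minimal PyTorch sketch of such a query-based cross-attention abstractor is shown below for concreteness; the feature dimension, number of queries, head count, and module layout are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AbstractorLayer(nn.Module):
    """One block: learnable queries cross-attend to visual features, then an FFN refines them."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries, visual_feats):
        # queries: (B, Q, D), visual_feats: (B, N, D)
        attn_out, _ = self.cross_attn(queries, visual_feats, visual_feats)
        queries = self.norm1(queries + attn_out)
        queries = self.norm2(queries + self.ffn(queries))
        return queries

class VisualAbstractor(nn.Module):
    """Six stacked layers; a fixed set of learnable queries distills the image features."""
    def __init__(self, dim: int = 1024, num_queries: int = 64, num_layers: int = 6):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.layers = nn.ModuleList(AbstractorLayer(dim) for _ in range(num_layers))

    def forward(self, visual_feats):
        q = self.queries.expand(visual_feats.size(0), -1, -1)
        for layer in self.layers:
            q = layer(q, visual_feats)
        return q  # compact visual tokens passed on to the LLM

# Example: 257 patch tokens from a ViT encoder condensed into 64 query tokens
feats = torch.randn(2, 257, 1024)
print(VisualAbstractor()(feats).shape)  # torch.Size([2, 64, 1024])
```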
Q4: The experimental comparisons
Answer: We appreciate the reviewer’s concern regarding the experimental setup of baseline models. To clarify, all baseline models were fine-tuned on our proposed datasets rather than directly evaluated in a zero-shot manner.
The authors innovatively propose a new multimodal large language model, WMarkGPT, for watermarked image understanding. The paper points out that traditional methods rely on indicators such as PSNR, require the original image, and cannot fully evaluate the influence of the watermark on content. WMarkGPT predicts the visibility of watermarks and generates detailed descriptions without the need for the original image. In addition, the authors constructed three VQA datasets and designed a three-stage training process. The experiments show that WMarkGPT is superior to existing MLLMs in understanding watermarked images on synthetic and real datasets.
Update after rebuttal
The authors' rebuttal has resolved my concerns, and I will keep my rating. Thanks a lot.
Questions for Authors
- Some training details are briefly described in the main text and partially supplemented in the appendix, but reproducibility still needs to be further refined.
Claims and Evidence
Yes, the proposed methods and evaluation criteria are highly relevant and appropriate for the problem of watermark image understanding. The authors introduce WMarkGPT, a multi-modal large language model designed specifically to predict watermark visibility and generate detailed descriptions without requiring the original image. This approach effectively addresses the limitations of traditional methods like PSNR and SSIM, which rely on the original image and fail to comprehensively evaluate the semantic impact of watermarks. Additionally, the construction of three high-quality VQA datasets, including real watermark datasets, provides valuable resources for evaluating and improving the model's performance.
Methods and Evaluation Criteria
Yes, the claims made in the submission are well-supported by clear and convincing evidence. The authors provide comprehensive experiments demonstrating WMarkGPT's superior performance compared to existing multi-modal large language models on watermark image understanding tasks. The three-stage training process is validated through ablation studies, which clearly show the contribution of each stage to the final performance. The use of both synthetic and real datasets for evaluation ensures that the model's effectiveness is tested across diverse scenarios.
Theoretical Claims
N/A
Experimental Design and Analysis
- The authors construct three high-quality VQA datasets; in particular, the annotation of the real watermarking dataset provides a valuable resource for research in this field. The commitment to publicly release the datasets will greatly facilitate subsequent research.
- The three-stage progressive training strategy proposed by the authors is reasonably designed, and the ablation experiments fully verify the contribution of each stage to the final performance.
Supplementary Material
Yes, Related Works, Templates of Object Location-aware Dataset, and Implementation Details.
Relation to Prior Literature
WMarkGPT innovatively addresses the limitations of traditional watermark evaluation methods by proposing a novel multi-modal model that predicts watermark visibility and generates descriptions without needing the original image. This contribution, along with the creation of high-quality VQA datasets and a robust training strategy, significantly advances the field of watermark detection and multi-modal understanding.
Missing Important References
N/A
Other Strengths and Weaknesses
Strengths:
- The paper is comprehensive in content and clearly written.
- This paper points out the limitations of traditional watermark evaluation metrics (such as PSNR and SSIM), which rely on the original image and cannot comprehensively evaluate the visibility and semantic impact of watermarks. It proposes a concrete solution with clear practical significance that notably improves the practicality of watermark evaluation.
Weaknesses:
- The real watermark dataset contains only 2.5k samples, and the watermark generation methods are limited to a small number of algorithms (such as HiNet and Safe-SD). Does the small data size limit the model's generalization? The authors should further explain the scope of its applicability.
- It is suggested to add detailed explanations of the three-stage training strategy, such as why the visual encoder and abstractor are optimized in stages, so as to more comprehensively demonstrate the model's ability to focus attention on the watermarked region.
Other Comments or Suggestions
- The parameter size, operational efficiency, training and testing time and resource usage of the model need to be further elucidated.
- Watermark visibility prediction is also one of the model's important tasks. Is the preceding watermark-related description beneficial for the final visibility prediction?
- Please add a discussion section to further analyze the limitations of the method and suggest ways to improve it in the future.
Thanks for your valuable comments and recognition of our innovation, comprehensive experiments, superior performance and practical significance. We appreciate that you highlighted WMarkGPT’s ability to predict watermark visibility without the original images, the newly constructed datasets, our progressive training strategy, and the benefits for subsequent research.
Q1: The WQA-Real dataset and its applicability
Answer: We adopt a three-stage training pipeline to progressively enable WMarkGPT to comprehend watermarked images. In the first stage, the model is trained on a large-scale object location-aware QA dataset based on natural images. In the second stage, it is further optimized on a synthetic watermarking QA dataset, bridging the gap from natural to watermarked images. Finally, the model is fine-tuned on a real watermarking QA dataset. Because the training sets in the first two stages are large, only a relatively small and high-quality real dataset is needed to align the model with practical data distribution. The significance of dataset quality over quantity has been well-established in MLLM research. To ensure that the watermarked images closely mirror real scenarios, we conducted a thorough investigation and selected typical image watermarking algorithms—including Hidden, BalujaNet, WengNet, HiNet, and Safe-SD—to build our real dataset. Experimentally, we expanded the WQA-Real dataset step by step and found that performance stabilized at around 2.5K samples.
| WQA-Real (training size) | BLEU-1 | ROUGE-L | LLM-Score | ACC |
|---|---|---|---|---|
| 1000 | 0.299 | 0.329 | 57.654 | 0.412 |
| 1500 | 0.367 | 0.379 | 62.035 | 0.437 |
| 2000 | 0.404 | 0.400 | 64.654 | 0.502 |
| 2250 | 0.423 | 0.410 | 71.654 | 0.551 |
| 2500 | 0.424 | 0.418 | 71.950 | 0.546 |
Originally, we designed WMarkGPT to evaluate watermarked images more precisely and comprehensively, particularly when original images are unavailable, as in generative watermarking. Its success can facilitate the development of MLLMs for other tasks involving mixed visual patterns, such as deepfake detection and edited image comprehension. In addition, our dataset collection paradigm and training strategy can serve as references for related work.
Q2: Effect of optimizing the vision encoder and abstractor in stages
Answer: In brief, the staged optimization of the vision encoder and visual abstractor enables the model to progressively refine its focus on watermarked regions. The vision encoder captures low- to mid-level semantics, providing the basis for further abstraction and alignment with the language model. The visual abstractor refines the visual features into high-level representations through a set of learnable queries. By aggregating crucial image features and filtering out noise, it delivers more compact and meaningful visual embeddings for subsequent processing by the language model. We will add more explanations in the final paper.
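To make the staged-optimization idea concrete, the sketch below toggles which modules receive gradients in each stage. The attribute names (vision_encoder, visual_abstractor, learnable_queries, llm) and the exact per-stage choice of trainable modules are hypothetical illustrations, not the paper's specification.

```python
# Illustrative sketch of stage-wise optimization; module names and per-stage
# trainability are assumptions for illustration only.
def set_trainable(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage: int):
    # Stage 1: object location-aware QA on natural images (ground vision in language).
    # Stage 2: synthetic watermarking QA (adapt to watermark patterns).
    # Stage 3: small real watermarking QA set (align with practical distributions).
    set_trainable(model.vision_encoder, stage == 1)
    set_trainable(model.visual_abstractor, stage in (1, 2))
    model.learnable_queries.requires_grad = True   # queries are refined in every stage
    set_trainable(model.llm, stage in (2, 3))
```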
Q3: Efficiency of WMarkGPT
Answer: As detailed in Section 3.1 Model Architecture, WMarkGPT is built upon the LLaMA-2-7B framework (7 billion parameters). The complete model, including the vision encoder, visual abstractor and learnable queries, has a total of 8.198 billion parameters. We have made a detailed summary of training and testing costs in the response to Reviewer oNvm Q7.
Q4: Effect of previous watermark related description
Answer: Our experiments show that predicting watermark visibility without first extracting a watermark-related description results in lower accuracy on both the WQA-Synthetic and WQA-Real datasets. This indicates that capturing key features—such as spatial structure and texture—beforehand is crucial for accurately estimating watermark visibility.
| ACC | WQA-Synthetic | WQA-Real |
|---|---|---|
| w/o Prefix | 0.639 | 0.541 |
| WMarkGPT | 0.645 | 0.546 |
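To illustrate the two settings compared above, the prompts might differ roughly as follows; the exact wording used in the paper is not reproduced here, so these strings are assumptions.

```python
# Hypothetical prompt variants for the "w/o Prefix" ablation; wording is illustrative.
PROMPT_WITH_DESCRIPTION = (
    "Describe the watermark's location, content, and effect on the image semantics, "
    "then predict the watermark's visibility level."
)
PROMPT_WITHOUT_DESCRIPTION = (
    "Predict the watermark's visibility level directly, without any description."
)
```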
Q5: Limitations and future works
Answer: The current WMarkGPT model is trained exclusively on image datasets, whereas video watermarking and its more precise evaluations remain crucial yet underexplored aspects of copyright protection. Future research will focus on collecting video watermarking datasets and developing video-based MLLMs to understand watermarked videos, further advancing digital watermarking technology.
Q6: More training details
Answer: We provide the complete configurations for the three training phases in the table below to ensure full reproducibility. After acceptance, we will release all the training code.
Environment: CUDA 12.1.105, Torch 2.4.1

| Training | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| Batch size | 32 | 16 | 16 |
| Learning rate | | | |
| Epochs | 3 | 5 | 5 |
| Optimizer | AdamW | AdamW | AdamW |
| Scheduler | CosineAnnealingLR | CosineAnnealingLR | CosineAnnealingLR |
| Warmup ratio | 0 | 0.03 | 0.03 |
| Precision | bf16 | bf16 | bf16 |
This paper constructs datasets comprising watermarked images with different levels of watermark visibility. The paper trains a model specialized for describing watermark patterns and evaluating watermark visibility. The authors compare their model with several existing multimodal large language models on their datasets and achieve better performance on four metrics. The paper is also clear and well written.
Questions for Authors
No
Claims and Evidence
Not really. Please check the weaknesses.
Methods and Evaluation Criteria
Partially. Please check the weaknesses.
Theoretical Claims
NA
Experimental Design and Analysis
Partially. Please check the weaknesses.
Supplementary Material
No
Relation to Prior Literature
Yes
Missing Important References
No
Other Strengths and Weaknesses
Paper Strengths
- Well written and easy to follow.
- The authors make comprehensive comparisons with other models and they achieve better results.
Paper Weaknesses
- The authors mentioned that when constructing the “real” dataset, they randomly selected images generated by existing models. Essentially, this still relies on AI to create the dataset, meaning it is not truly a real dataset.
- In the “real” dataset constructed by the authors, only invisible watermarks are used. Given this, how do the authors later describe the position, content, and other relevant information of these watermarks?
- The description of WMarkGPT is not clear. Which LLM is used and what is its parameter size?
- Some of the evaluation metrics used by the authors require reference texts; however, the authors do not explicitly explain how these reference texts are obtained.
- The ablation study is not very meaningful, as the maximum dataset size is only 50K images, which is relatively small. It is predictable that increasing the dataset size would improve the model’s performance. Similarly, it is evident that training and fine-tuning the model would enhance its capability.
- The model comparison is unfair. It would be more reasonable to compare WMarkGPT with other models that have been fine-tuned on watermark description and evaluation.
- The authors do not discuss the efficiency of their model: how long does it take to evaluate an image, and how long does it take to train the model?
- The authors do not provide the download links of their datasets.
- In fig 5, step 2, there is a misspelling 'obiect', which should be 'object'.
Other Comments or Suggestions
No
Thanks for your insightful feedback and recognition of our clear presentation, comprehensive comparisons, and strong results.
Q1: The explanation of the "real" dataset
Answer: Compared to our WQA-Synthetic dataset, which employs pseudo watermarking processes to generate watermarked images, WQA-Real uses genuine watermarking algorithms (e.g., Hidden, BalujaNet) to produce watermarked images that more closely match real-world data distributions. Additionally, all QA pairs and watermark visibility assessments in WQA-Real were manually annotated by trained human evaluators (Sec. 2.2 & Appendix E), ensuring that both descriptions and scores reflect genuine human perception rather than synthetic labels. WMarkGPT is proposed to comprehensively understand these watermarked images without accessing original images and to further advance the development of watermarking algorithms. We use the term "real" to emphasize the distinction between the WQA-Synthetic and WQA-Real datasets.
Q2: Invisible watermark description
Answer: Ideally, watermarking algorithms generate watermarked images with fully invisible watermarks. However, in practice, existing methods often produce images with varying degrees of watermark visibility. To advance the research, we propose WMarkGPT for precise and comprehensive evaluations of these watermarked images. Our WQA-Real dataset thus includes both visible and invisible watermarks (see Fig. 4 for distribution). For genuinely invisible ones, we label them simply as “invisible” without adding spatial descriptions (Appendix E).
Q3: Details of WMarkGPT
Answer: As detailed in Section 3.1 Model Architecture, WMarkGPT is built upon the LLaMA-2-7B framework (7 billion parameters). The complete model, including the vision encoder, visual abstractor and learnable queries, has a total of 8.198 billion parameters. We will make it clear in the final paper.
Q4: The reference texts used in LLM-Score
Answer: As detailed in Appendix G, our evaluation metric LLM-Score [1][2] uses the following prompt template: "Evaluate the relevance between the following description and ground truth on a scale from 0 to 4. Higher scores indicate better relevance. Return only the numeric score. - Description: candidates - Ground Truth: references."
[1] Lu, Y., Yang, X., Li, X., et al. LLMScore: Unveiling the power of large language models in text-to-image synthesis evaluation. NeurIPS 2023.
[2] Huang, K., Sun, K., Xie, E., et al. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. NeurIPS 2023.
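A minimal sketch of scoring a single description with this template is given below; the judge model name and API usage are assumptions rather than the authors' exact setup, and a corpus-level LLM-Score would then be aggregated over all test pairs.

```python
# Sketch of computing an LLM-Score for one (candidate, reference) pair.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LLM_SCORE_PROMPT = (
    "Evaluate the relevance between the following description and ground truth "
    "on a scale from 0 to 4. Higher scores indicate better relevance. "
    "Return only the numeric score. - Description: {cand} - Ground Truth: {ref}"
)

def llm_score(candidate: str, reference: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user",
                   "content": LLM_SCORE_PROMPT.format(cand=candidate, ref=reference)}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())
```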
Q5: The size of the WQA-Synthetic dataset
Answer: We conducted additional experiments by increasing the WQA-Synthetic dataset size. As shown in the table below, adding more training data only resulted in slight performance improvements. This might be due to the complexity of watermarked image understanding, where the model needs to predict watermark visibility and produce detailed textual descriptions of its location, content, and effect on image semantics. Our 7B-parameter LLM backbone may have reached its limit at this dataset scale. In our final paper, we will include experiments and analysis with a more powerful LLM backbone and larger datasets.
| WQA-Synthetic | BLEU-1 | ROUGE-L | LLM-Score | ACC |
|---|---|---|---|---|
| 50K | 0.488 | 0.446 | 87.751 | 0.645 |
| 60K | 0.488 | 0.448 | 87.723 | 0.638 |
| 70K | 0.490 | 0.453 | 87.758 | 0.649 |
Q6: Model comparison
Answer: We clarify that all baseline models were fine-tuned on our proposed datasets rather than directly evaluated in a zero-shot manner.
Q7: Efficiency of WMarkGPT
Answer: The training and inference costs of our model are as follows (2× Intel(R) Xeon(R) Gold 6330 CPUs @ 2.00 GHz):
- Training: 12 hours (Training Stage 1) and 6 hours each (Training Stages 2-3) on 8×NVIDIA RTX 6000 Ada GPUs.
- Inference: 3.395 seconds for a single image (avg.) on a single NVIDIA RTX 6000 Ada GPU.
Q8: Download links of the datasets
Answer: As stated in the abstract, we will publicly release both the code and dataset upon acceptance. To comply with the double-blind review policy, we provide an anonymous download link here: https://drive.google.com/drive/folders/1JoRq91b0UAbTU4SEblCycRVwCpUgkeg2?usp=drive_link.
Q9: Typo in Fig. 5
Answer: We apologize for the typo and will correct it in the final paper.
Dear authors,
Thank you for your response. The answers from Q3-Q9 are very helpful. However, I still have concerns on the problem setting. Could you please help to motivate me on the practical usage of this type of studies? Thank you.
Thank you very much for your thoughtful follow-up and for acknowledging our previous responses. We appreciate the opportunity of further clarifying the motivation and practical significance of our study.
As we know, research on image watermarking is critical for copyright protection and information steganography. It can be divided into two main groups: post-processing-based watermarking and text-guided generative watermarking, where watermarks are directly embedded into the generation procedure and original images are not available. The core motivation of our work stems from the limitations of traditional watermarking evaluation metrics (e.g., PSNR, SSIM, MAE), which play a fundamental role in advancing watermarking algorithms. These numerical metrics rely on original images and fail to precisely reflect human perception such as watermark visibility (referring to Figure 2 and Figure 3), which is an important indicator of watermarking efficacy. Their limitations have also been demonstrated by other research [1, 2]. Besides, they do not provide any assessment of the embedded watermarks or of the disruptions to image content. More importantly, these metrics cannot be used to evaluate generative watermarking since the original images are not accessible.
To address these limitations, we propose a reference-free, MLLM-inspired evaluation paradigm that better aligns with human perception. Our model WMarkGPT only leverages watermarked images to support question-answering and provide textual descriptions about watermark visibility, location, content, and impact on image semantics, enabling a more nuanced comprehension of watermarked images. These precise and detailed descriptions facilitate the watermarking measurement and the development of advanced algorithms.
In addition, our study has potential implications for developing MLLMs for other tasks involving complex and mixed visual patterns, such as deepfake detection, edited image comprehension, and multimodal content verification. Our dataset collection strategies and progressive training pipeline for aligning vision-language domains on fine-grained perceptual tasks offer general references for the community.
We sincerely hope our explanation helps clarify the motivations and the practical significance of our work. Please let us know if you have any further questions.
[1] Patwari, K., Chuah, C. N., Lyu, L., et al. PerceptAnon: Exploring the human perception of image anonymization beyond pseudonymization for GDPR. ICML 2024.
[2] Fu, S., Tamir, N., Sundaram, S., et al. DreamSim: Learning new dimensions of human visual similarity using synthetic data. NeurIPS 2023.
This paper proposes using an MLLM to detect watermarked images; the model architecture is adapted from mPLUG-Owl2, and three training stages are proposed to progressively fine-tune the model. Experimental results show improved performance.
Questions for Authors
No question
Claims and Evidence
The paper may overstate the performance gain, i.e., Section 4.1 and Table 1, the paper does not mention whether those baselines such as Qwen2-vl are trained with the same datasets as those used in WMarkGPT. If the baseline models are zero-shot while the WMarkGPT is fine-tuned, the experimental results are unfair.
Methods and Evaluation Criteria
See claims and evidence.
Theoretical Claims
NA
Experimental Design and Analysis
See claims and evidence.
Supplementary Material
Yes, all
Relation to Prior Literature
This paper contributes to extending MLLMs' ability to the watermark detection field.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths:
- The paper is well-structured and easy to follow.
Weaknesses:
- The experimental designs are unfair in Section 4.1 if the baseline performance is obtained in zero-shot scenario.
- To thoroughly evaluate the effectiveness of progressive training, it is essential to introduce an additional baseline: a model trained simultaneously on all three datasets—the object position dataset, the synthetic watermark dataset, and the real watermark dataset. This approach will provide a comprehensive comparison, allowing us to assess whether progressive training offers distinct advantages over a more conventional, unified training strategy.
- The model architecture is unchanged; more task-specific design could be introduced to fit the watermark understanding task.
Other Comments or Suggestions
No
Thanks for your valuable comments and recognition of our clear presentation and our contribution of extending MLLMs' ability to watermarked image understanding.
Q1: Experimental setup of baseline models
Answer: We clarify that all baseline models were fine-tuned on our proposed datasets rather than directly evaluated in a zero-shot manner.
Q2: The effectiveness of progressive training strategy
Answer: We conducted an experiment where the model was trained on all three datasets jointly in a unified manner using the same training configurations as in the progressive setting. The experimental results below indicate that the progressive training approach consistently yields much better performance compared to the unified training strategy. For example, on the WQA-Synthetic dataset, progressive training achieves an improvement of +0.064 in BLEU-1 and +5.367 in LLM-Score relative to unified training. Similar trends are observed on the WQA-Real dataset. These results suggest that the progressive training strategy effectively mitigates the domain adaptation challenges inherent in joint training, leading to improved generalization across diverse data domains.
| WQA-Synthetic | BLEU-1 | ROUGE-L | LLM-Score | ACC |
|---|---|---|---|---|
| Unified Training | 0.424 | 0.388 | 82.384 | 0.635 |
| Progressive Training | 0.488 | 0.446 | 87.751 | 0.645 |
| WQA-Real | BLEU-1 | ROUGE-L | LLM-Score | ACC |
|---|---|---|---|---|
| Unified Training | 0.375 | 0.376 | 69.547 | 0.522 |
| Progressive Training | 0.424 | 0.418 | 71.950 | 0.546 |
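For clarity, the two schedules compared in these tables differ only in how the three QA datasets are presented to the model; a schematic sketch is shown below, where the fine-tuning routine and dataset objects are placeholders rather than the authors' actual code.

```python
# Schematic contrast between progressive and unified training schedules.
# `finetune` and the three dataset objects are placeholders for illustration.
from torch.utils.data import ConcatDataset

def progressive_training(model, location_qa, synthetic_qa, real_qa, finetune):
    # Three sequential stages: natural-image grounding -> synthetic watermarks -> real watermarks.
    for stage_data in (location_qa, synthetic_qa, real_qa):
        finetune(model, stage_data)
    return model

def unified_training(model, location_qa, synthetic_qa, real_qa, finetune):
    # A single joint stage over the concatenation of all three datasets.
    finetune(model, ConcatDataset([location_qa, synthetic_qa, real_qa]))
    return model
```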
Q3: Architectural enhancement for watermarked image understanding
Answer: We thank the reviewer for this valuable suggestion. In the MLLM research literature [1, 2, 3], the focus has primarily been on rigorous multimodal data collection and innovative training strategies. As the first work on watermarked image understanding using MLLMs, we also place our main emphasis on the novel WQA-Synthetic/WQA-Real datasets and a three-stage training pipeline.
That said, we recognize that specialized architectural enhancements can further improve watermark understanding. To address the challenges arising from mixed image and watermark patterns, we have integrated Mixture-of-Experts (MoE) into the LLM backbone. This addition enables more effective processing of watermark-specific features alongside natural image semantics. Experiments have shown that this modification boosts performance compared to our baseline model, and we will include a detailed analysis in the final paper.
| WQA-Synthetic | BLEU-1 | ROUGE-L | LLM-Score | ACC |
|---|---|---|---|---|
| WMarkGPT | 0.488 | 0.446 | 87.751 | 0.645 |
| WMarkGPT-MoE | 0.491 | 0.453 | 87.841 | 0.658 |
[1] Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., and Gao, J. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 2023.
[2] Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., and Han, S. VILA: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[3] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 2023.
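To make the MoE addition above concrete, here is a simplified top-1 gated mixture-of-experts feed-forward layer in PyTorch; the expert count, gating scheme, and placement inside the LLM blocks are illustrative assumptions rather than the authors' exact WMarkGPT-MoE design.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Top-1 gated mixture-of-experts FFN; a simplified stand-in for the MoE layer mentioned above."""
    def __init__(self, dim: int, hidden: int, num_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (B, T, D)
        scores = self.gate(x).softmax(dim=-1)   # routing probabilities, (B, T, E)
        weight, idx = scores.max(dim=-1)        # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                     # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask]) * weight[mask].unsqueeze(-1)
        return out

# Example: route 16 tokens of dimension 1024 through 4 experts
tokens = torch.randn(2, 8, 1024)
print(MoEFeedForward(1024, 4096)(tokens).shape)  # torch.Size([2, 8, 1024])
```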
This paper introduces WMarkGPT, a multimodal large language model (MLLM) designed to understand watermarked images without requiring access to the original images. The authors have done a good job in the rebuttal. After the rebuttal, the paper received scores of 3, 3, 3, 4. All the reviewers are positive about the paper, commenting that it is well written and easy to follow, and that the authors make comprehensive comparisons with other models and achieve better results. The AC agrees that the application of the MLLM developed in this paper is interesting; on the other hand, the novelty of the MLLM design itself also seems quite limited. Overall, given that all the reviewers are positive about the paper and the application of MLLMs in this paper is indeed unique, the AC would like to recommend acceptance as well.