GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
This paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL
Abstract
Reviews and Discussion
This paper presents GuardReasoner-VL, a novel VLM guard system that employs a two-stage process of reasoning followed by moderation. The paper also curates GuardReasoner-VLTrain, a reasoning corpus for VLM guard models comprising 123K samples with 631K reasoning steps across text, image, and text-image inputs. Extensive ablation studies demonstrate the contributions of the proposed safety-aware data concatenation, dynamic clipping parameter, and length-aware safety reward.
Strengths and Weaknesses
Strengths:
- The first attempt to introduce reasoning steps into safeguarding, achieving strong performance.
- Creating a brand-new dataset for VLM guard models.
- Enhancing the model's reasoning via online RL, integrating safety-aware data concatenation, dynamic clipping, and length-aware safety rewards.
Weaknesses:
- I would like to see more details about the training process, such as the batch size and learning rate.
- To be honest, I cannot fully understand why you design the length-aware safety reward this way. Please give more justification.
- I would like to see more choices of the constraint parameter and their corresponding effects on performance.
- You should also clarify the choice of your base model.
- I notice that the response length of GuardReasoner-VL-Eco-7B decreases with more training steps; could you give more analysis of this?
- Is the safeguarding problem limited to a binary classification format? If not, can your model solve those kinds of problems?
Questions
See weaknesses.
Limitations
Yes
Final Justification
I tend to keep my positive score for this paper.
Formatting Concerns
No paper formatting concerns
Thanks for your constructive and insightful reviews. We carefully respond to your concerns as follows.
Details
Thanks for your question. We provide the details regarding the training process in Section A.5.2 of our original paper. In the R-SFT stage, the total batch size is set to 192 and the initial learning rate is set to 5e-5. In the RL stage, the batch size for the actor model is set to 256 and the initial learning rate is set to 1e-6.
Length-aware Safety Reward
We design a safety reward to guide our guard model to complete two guardrail tasks, i.e., prompt harmfulness detection and response harmfulness detection. First, the model should produce output in the correct format so that the predicted results can be extracted correctly. Then, given a correct format, we compute the correctness of the predicted results against the ground truth of these two tasks and combine them linearly, as shown in Formulations (8) and (9) in the original paper.
To balance performance and token efficiency, we incorporate the length of the reasoning process into the reward. The basic idea is that when the model fails to complete the guardrail tasks correctly, it is encouraged to improve its accuracy by scaling up the reasoning length, while remaining within a constraint on the normalized reasoning length, where a cut-off hyper-parameter alleviates over-thinking. Note that the numerator of the length term is constrained to be non-positive. Thus, when the model fails to complete all tasks correctly, it is encouraged to improve its accuracy by increasing the reasoning length, subject to the length constraint.
Basically, we design our length-aware reward to encourage the model to increase the reasoning length when it fails to solve the task correctly. However, this increase in reasoning length is constrained, preventing the model from continuously expanding the reasoning process.
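For intuition, the sketch below shows one minimal way such a reward could be written in Python. It is only an illustration under our own assumptions: the function name, the equal weighting of the two tasks, and the cut-off value are not the exact form of Formulations (8) and (9) in the paper.

```python
def length_aware_safety_reward(
    format_ok: bool,
    prompt_correct: bool,
    response_correct: bool,
    norm_length: float,   # reasoning length normalized to [0, 1]
    cutoff: float = 0.5,  # illustrative cut-off to alleviate over-thinking
    alpha: float = 0.5,   # illustrative weight for the linear combination
) -> float:
    """Hedged sketch of a length-aware safety reward (not the paper's exact formula)."""
    # 1) Format gate: if the output cannot be parsed, no task reward is given.
    if not format_ok:
        return 0.0

    # 2) Linear combination of correctness on the two guardrail tasks
    #    (prompt harmfulness detection and response harmfulness detection).
    accuracy = alpha * float(prompt_correct) + (1 - alpha) * float(response_correct)

    # 3) Length term with a non-positive numerator (accuracy - 1 <= 0):
    #    when the model is not fully correct, longer reasoning (up to the
    #    cut-off) shrinks the penalty, so the model is encouraged to reason
    #    more; once fully correct, the term vanishes and longer reasoning
    #    brings no extra reward.
    length_term = (accuracy - 1.0) * (1.0 - min(norm_length, cutoff))

    return accuracy + length_term
```

For example, with accuracy 0.5 the reward rises from 0.0 at zero reasoning length to 0.25 at the cut-off, while a fully correct answer always receives 1.0 regardless of length.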
Base Model & More Choice of Constraint Parameters
- We select Qwen2.5-VL-Instruct 3B and Qwen2.5-VL-Instruct 7B as our base models, for the following reasons. 1) During the preparation of this paper, the Qwen2.5-VL models were the leading open-source vision-language models, and we believe we can achieve better performance by building on them. 2) We only select the 3B and 7B models because of limited computational resources. 3) Guardrail tasks are time-sensitive, which calls for small and efficient models.
- For a wider choice of constraint parameters, we will conduct more experiments on models of different scales in the future; we could not do so here due to limited computational resources.
Response Length
First of all, we designed GuardReasoner-VL-Eco-7B to improve the token efficiency of the original GuardReasoner-VL-7B model. Therefore, it is both reasonable and expected that the reasoning length decreases with more training steps. The main reason is our length-aware reward, as outlined in the Length-aware Safety Reward section above. By adjusting the constraint (cut-off) parameter, we can control the reasoning length, allowing us to strike a balance between performance and token efficiency. More details of the hyper-parameter settings can be found in Section A.5.2 of the original paper.
Binary Classification
Conventional methods, such as LLaMA Guard, typically perform binary classification or fixed multi-class classification over predefined harmful categories. In contrast, our model goes beyond binary classification by also outputting the reasoning process. This reasoning serves as the justification for the final guardrail decision, including the identification of open-ended harmful categories. We believe our proposed model already addresses this problem.
Thanks for your reply to my review. Your response dispels my concerns!
Your choice of base model should be written in your paper.
We are glad that our response dispels your concerns.
Actually, we have already detailed the choices of base models in our original paper (see Lines 621-622 of the original paper).
If your concerns are solved, could you raise your rating to support our work?
If you have any other concerns or further suggestions for improving the quality of our paper, feel free to discuss them during the rebuttal period. Thanks for your insightful reviews again!
Dear Reviewer Wv3A,
We hope this message finds you well.
As two reviewers have decided to increase the score (DY9e, 7Lz3), and one reviewer (VpvK) has kept a score of 5 to support accepting this paper, we would greatly appreciate it if you could kindly respond to our rebuttal for paper NeurIPS 12986. We are looking forward to receiving any further questions or suggestions for improvement. If your concerns have been resolved, could you raise your rating to support our work?
Thank you very much for your support.
Kindest regards, Authors of NeurIPS 12986
Dear Reviewer Wv3A,
We hope this message finds you well.
As the rebuttal period deadline is approaching, we would greatly appreciate it if you could kindly acknowledge and respond to our rebuttal for paper NeurIPS 12986. We are looking forward to receiving any further questions or suggestions for improvement. If your concerns have been resolved, could you raise your score to support our work?
Thank you very much for your support.
Kindest regards, Authors of NeurIPS 12986
This paper introduces the GuardReasoner-VL model and its associated training dataset. The training procedure includes a cold-start SFT phase and a subsequent online RL stage. The authors also propose a novel controllable length-aware safety reward to obtain the eco version of GuardReasoner-VL. Built on Qwen2.5-VL-3B-Instruct and Qwen2.5-VL-7B-Instruct, GuardReasoner-VL exhibits superior judgment ability in various tasks and settings.
Strengths and Weaknesses
Strengths
- The paper is well-organized.
- GuardReasoner-VL achieves SOTA performance on several tasks.
- Guard models are important for advancing research in safety, especially given the previous lack of an open-source and strong guard model in the VLM domain. GuardReasoner-VL serves as a valuable resource for further studies on VLM safety.
Weaknesses
- On some text-only benchmarks, the performance of GuardReasoner-VL is weaker than GuardReasoner.
- The data augmentation strategy simply concatenates two text queries, which may lead to semantic ambiguity and potentially hinder the model training.
Typo
- Rejection Samping -> Rejection Sampling (Figure 2)
Questions
- How does GuardReasoner-VL perform on more challenging tasks, such as MIS-hard [1], MSSBench [2], or SIUO [3], which aim to elicit unsafe responses from models through safe image-text inputs?
- The safety reward in Eq. 8 contains both the reward for prompts and responses. Can GuardReasoner-VL assess the safety of both the prompt and the response within a single inference? The case study only presents examples of evaluating them separately.
- The performance of GuardReasoner-VL-Eco and GuardReasoner-VL looks similar. Can the token efficiency of reasoning be further improved by adjusting the length-constraint hyper-parameter?
I will further increase my rating in support of acceptance if most of my concerns are addressed.
[1] Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
[2] Multimodal situational safety
[3] Cross-modality safety alignment
Limitations
yes
Final Justification
Most of my concerns have been addressed. I’m surprised by the model’s strong performance on challenging out-of-distribution benchmarks. To my knowledge, there is no large-scale multi-image dataset available for training, yet GuardReasoner-VL generalizes well to MIS.
Since the authors addressed my concerns, I raised my rating in further support of acceptance based on the following reasons. (i) The novel idea and training techniques, e.g., controllable length-aware safety reward and safety data augmentation. (ii) The SOTA performance, providing a reliable open-sourced guard model for evaluation, which is important for the community. (iii) Lots of empirical analysis and insights. It's nice work.
Formatting Concerns
N/A
Thanks for your constructive and insightful reviews. We carefully respond to your concerns as follows.
Performance of GuardReasoner and GuardReasoner-VL
We acknowledge that on some text-only benchmarks our GuardReasoner-VL performs worse than GuardReasoner. However, we think this is reasonable, since GuardReasoner is a text-only guard model while GuardReasoner-VL must handle multi-modal inputs, including text, image, and text-image modalities. In the future, we believe this gap can be closed by scaling up the training data and the parameters of the VLMs.
Data Augmentation Strategy
We fully understand your concern. It is possible for the data augmentation to introduce semantic ambiguity. However, the motivation of our designed augmentation is to increase the difficulty of the training samples, thereby making the training process more challenging and enhancing performance. As shown in Figure 5, we observe that the proposed data augmentation strategy improves performance across various datasets, and we do not find that it hinders model training.
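For illustration, a minimal sketch of this kind of concatenation is given below, assuming each sample carries its own harmfulness label. The field names and the label-merging rule (harmful if either part is harmful) are our assumptions for exposition, not the paper's exact procedure.

```python
def concat_samples(a: dict, b: dict) -> dict:
    """Hedged sketch: merge two text samples into one harder training sample.

    Assumed fields: 'prompt' (str) and 'harmful' (bool). The merged sample is
    labeled harmful if either component is harmful; the paper's exact
    concatenation and labeling rule may differ.
    """
    return {
        "prompt": a["prompt"].strip() + "\n" + b["prompt"].strip(),
        "harmful": a["harmful"] or b["harmful"],
    }

# Example: a benign query concatenated with a harmful one yields a harder,
# harmful-labeled sample.
merged = concat_samples(
    {"prompt": "How do I bake bread?", "harmful": False},
    {"prompt": "<some harmful request>", "harmful": True},
)
assert merged["harmful"] is True
```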
More Challenging Task
The benchmarks we use already contain the kind of data you mention, i.e., eliciting unsafe responses from models through seemingly safe inputs, a.k.a. jailbreak or adversarial attacks. Following your suggestion, we tested performance on the benchmarks you list, i.e., MIS-hard (F1 score), the MSSBench Embodied task (accuracy), and SIUO (F1 score) [1-3]. The results are listed in the following table. Our GuardReasoner-VL achieves promising performance on these challenging benchmarks compared with the baselines reported in their original papers. We acknowledge that these benchmarks are more challenging, and we will further improve our models on them in the future. We will discuss these high-quality datasets and report our results in the final version of the paper.
| Benchmark | MIS-hard (F1) | MSSBench Embodied (ACC) | SIUO (F1) |
|---|---|---|---|
| GuardReasoner-VL | 67.10% | 62.50% | 39.42% |
[1] Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
[2] Multimodal situational safety
[3] Cross-modality safety alignment
Single Inference
Yes, our GuardReasoner-VL can handle both tasks in a single inference. As shown in Equation (2), our guard model is designed to handle both the prompt harmfulness detection task and the response harmfulness detection task. Regarding the case study, sorry for the confusion: for clarity, each figure demonstrates only one task.
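To make this concrete, here is a hedged sketch of how a single reasoning-then-moderation generation covering both tasks could be parsed. The tag names ("Request", "Response") and the output layout are illustrative assumptions, not the model's actual output format, which is defined in the paper.

```python
import re

def parse_guard_output(text: str) -> dict:
    """Hedged sketch: extract both guardrail labels from one generation.

    Assumes the model emits its reasoning followed by two labeled verdict
    lines, e.g. 'Request: harmful' and 'Response: unharmful'.
    """
    def find_label(field: str) -> str | None:
        m = re.search(rf"{field}\s*:\s*(harmful|unharmful)", text, re.IGNORECASE)
        return m.group(1).lower() if m else None

    reasoning = text.split("Request:")[0].strip() if "Request:" in text else text.strip()
    return {
        "reasoning": reasoning,
        "prompt_harmfulness": find_label("Request"),
        "response_harmfulness": find_label("Response"),
    }
```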
Further Improvement
Yes, we believe both performance and token efficiency can be further improved by adjusting the length-constraint hyper-parameter. However, due to limited computational resources, we did not try many values of this hyper-parameter. In the future, we can further improve the performance and token efficiency of our model by tuning it.
Typo
Thanks for your reminder. We will fix it in the final version.
Thank you for your responses. Most of my concerns have been addressed. I’m surprised by the model’s strong performance on challenging out-of-distribution benchmarks. To my knowledge, there is no large-scale multi-image dataset available for training, yet GuardReasoner-VL generalizes well to MIS.
Since the authors addressed my concerns, I raised my rating in further support of acceptance based on the following reasons. (i) The novel idea and training techniques, e.g., controllable length-aware safety reward and safety data augmentation. (ii) The SOTA performance, providing a reliable open-sourced guard model for evaluation, which is important for the community. (iii) Lots of empirical analysis and insights. It's nice work. Good luck!
We are glad that our response resolves your questions. Thanks for your support of our paper!
The paper presents GuardReasoner-VL, a novel vision-language model guard that first generates explicit multi-step reasoning before making moderation decisions on text, image, or text-image inputs. To support this, the authors construct GuardReasoner-VLTrain and cold-start the guard via supervised fine-tuning. They then further improve safety detection through online reinforcement learning, introducing safety-aware data concatenation, a dynamic clipping schedule for exploration–exploitation trade-off, and a length-aware reward to balance accuracy with token efficiency. Extensive experiments across 14 benchmarks show GuardReasoner-VL outperforms state-of-the-art VLM and LLM guards.
Strengths and Weaknesses
Strengths:
- The paper is generally well-written and well-structured, with clear motivation and explanations.
- The combination of safety-aware data concatenation, dynamic clipping, and length-aware rewards demonstrably improves performance over SFT alone.
- The paper benchmarks on 14 diverse datasets, including text-only, image-only, and text-image tasks.
Weaknesses:
- The dataset includes three types of inputs (text, image, and text-image), which are trained jointly. However, training on text-only documents is significantly easier than training on samples that contain images.
- In Section 2.3.1, the authors concatenate pairs of incorrect examples. Does this result in all pairwise combinations? If so, it would substantially increase the training set size. I recommend incorporating a sample selection strategy to manage this.
- There is also a concern regarding potential imbalance when concatenating examples—for instance, differing image sizes or varying context lengths. In such cases, one modality may dominate the other, affecting learning dynamics.
- From Figure 5, the RL method only shows a 2% improvement over R-SFT. Given the additional computational cost of reinforcement learning, the marginal gain seems inefficient.
- The value of the dynamic clipping parameter decays very rapidly within just a few thousand steps; it effectively becomes zero, which prematurely removes the allowance for exploration during training.
Questions
- The reasoning paths are entirely generated by the LLM (GPT-4o) without any human involvement.
- In Table 1, the best result is not highlighted in bold.
- In Figure 5, the F1 score of R-SFT (image) is significantly lower than the other methods. I suggest the authors conduct further evaluation and analysis.
Limitations
Yes
Final Justification
Most of my concerns have been addressed. However, for some of my questions, the authors responded that they are left as future work. I believe these points would be better included in the paper, but nonetheless I have raised my score to borderline accept since the authors addressed most of my concerns.
Formatting Concerns
No
Thanks for your constructive and insightful reviews. We carefully respond to your concerns as follows.
Data Types
We appreciate your point. However, VLM guard models are specifically designed to safeguard VLMs, which process inputs of various modalities, such as text, images, and text-image pairs. Therefore, training with text, image, and text-image data is essential and necessary.
Combinations
No, we do not use all possible combinations. Instead, we sample the combinations randomly; please see the details in Lines 630-632 of the original paper. In the future, we plan to develop more sophisticated sampling strategies to further improve the model's performance.
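As a hedged illustration of the point above, the snippet below samples a fixed budget of random pairs instead of enumerating every possible combination. The helper name and budget are ours for exposition; the actual sampling procedure is described in Lines 630-632 of the paper.

```python
import random

def sample_pairs(samples: list, budget: int, seed: int = 0) -> list[tuple]:
    """Draw a fixed budget of random pairs rather than all pairwise
    combinations, whose count grows quadratically with the pool size."""
    rng = random.Random(seed)
    return [tuple(rng.sample(samples, 2)) for _ in range(budget)]

# Usage (illustrative): pairs = sample_pairs(candidate_samples, budget=10_000)
```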
Imbalance
Thank you for raising this concern. We are aware that concatenating examples from different modalities could lead to potential imbalances, especially when dealing with varying image sizes or context lengths. One possible solution is to design sampling strategies that ensure a balance between different samples. We plan to incorporate this approach into our model in the future. Besides, although the potential imbalance problem exists, we observe that our data augmentation strategy works well and does improve the RL performance.
Efficiency of RL
We believe that a 2% performance improvement is not marginal, and the RL process is not inefficient, especially considering that we only use 12K training samples for RL compared to 123K training samples for SFT. In the future, we plan to scale up our prompt set and enhance its efficiency, which should further boost the performance of the RL process.
Hyper-parameter
Thanks for your suggestion. We will consider adjusting the decay rate or implementing more gradual adjustments to ensure adequate exploration throughout the training process.
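As a purely illustrative sketch of what a more gradual schedule could look like (the paper's actual dynamic clipping schedule and its constants are not reproduced here; the start and end values below are assumptions):

```python
import math

def clipping_epsilon(step: int, total_steps: int,
                     eps_start: float = 0.3, eps_end: float = 0.05) -> float:
    """Illustrative cosine decay for a dynamic clipping parameter.

    Decays smoothly from eps_start to eps_end over the whole run instead of
    collapsing to zero within a few thousand steps, preserving some room
    for exploration late in training.
    """
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return eps_end + 0.5 * (eps_start - eps_end) * (1.0 + math.cos(math.pi * t))
```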
Human Involvement
Thank you for your question. Due to limited human resources, we were unable to have humans check the entire set of reasoning paths. However, we conduct case studies to ensure the quality of the reasoning. Additionally, for the trained model, we perform human case studies, which are illustrated in Figures 12, 13, and 14. In the future, we plan to recruit more human labelers to further improve our model.
Best Result in Table 1
Thanks for your suggestion. In this table, we highlight only the average performance for clarity. We will highlight all the best results in the final version.
Performance of R-SFT (image)
Note that this variant refers to conducting R-SFT using only image data. Without the text data, the foundational abilities of the VLM guard model degrade rapidly. We have already explained this issue in Lines 193-194 of the original paper.
Thank you for your responses. Could you please clarify the following:
- You train on text, image, and text-image samples jointly, which raises the risk of the model over-relying on the textual modality. Could you provide analyses showing that joint training does not introduce a bias toward text?
- Could you also show that the imbalance does not degrade the overall data quality?
Dear Reviewer 7Lz3,
As the other three reviewers (VpvK, DY9e, Wv3A) all agree to accept this work, we are looking forward to receiving any further questions or suggestions to improve the quality of our paper further.
Thank you once again for your valuable contributions to the conference and our paper.
Best Regards,
Authors of NeurIPS 12986
Thanks for your response. I have raised my score.
We are glad that our response resolves your questions. Thanks for your support of our paper!
Dear Reviewer 7Lz3,
We hope this message finds you well.
As the rebuttal period deadline is approaching, we would greatly appreciate it if you could kindly acknowledge and respond to our rebuttal for paper NeurIPS 12986. We are looking forward to receiving any further questions or suggestions for improvement. If your concerns have been resolved, could you raise your score to support our work?
Thank you very much for your support.
Kindest regards, Authors of NeurIPS 12986
Dear Reviewer 7Lz3,
Thanks for your quick response and insightful reviews. We respond to your further comments as follows.
You train on text, image, and text-image samples jointly, which raises the risk of the model over-relying on the textual modality. Could you provide analyses showing that joint training does not introduce a bias toward text?
Thanks for your question.
- For this concern, we have already conducted experiments separating the different modalities of the training data (see Figure 5). From these results, R-SFT (which trains the model with text, image, and text-image data) achieves better performance than R-SFT (text). This suggests that introducing the image modality enhances performance, demonstrating that the model does not over-rely on the textual modality. We would like to add this analysis to the final version of our paper.
- Besides, we also analyze the performance of our model on data of different modalities. For example, in Table 1, our model achieves promising performance on HarmImageTest (which contains only images) and SPA-VL-Eval (which contains only image-text pairs). These results also verify that our model generalizes to the image and text-image modalities and does not risk over-relying on the textual modality.
Following your suggestions, we added these discussions and highlighted them in red in the revised paper. Please refer to the original anonymous link in our paper.
Could you also show that the imbalance does not degrade the overall data quality?
Thanks. We admit that the imbalance could degrade the overall data quality. However, the overall performance improvement shows the effectiveness of our augmented data. We think this problem can be addressed by designing a new sampling strategy, i.e., discarding the corresponding samples when an imbalance occurs. Besides, we carefully checked our data augmentation and did not find any imbalance problem; example cases can be found in Figure 4 of the original paper.
Following your suggestions, we have added these discussions and highlighted them in red in the revised paper. Please refer to the original anonymous link in our paper.
Overall, we thank you for your constructive and insightful reviews and comments. We are looking forward to receiving any further questions or suggestions for improvement. If your concerns have been resolved, could you raise your score to support our work?
Dear Reviewer 7Lz3,
We have updated our response. Concretely, we have added more analyses of potential over-reliance on text data and of data imbalance, and we have incorporated them into our revised paper. Please refer to the original anonymous link in our paper.
We appreciate your constructive and insightful reviews and comments. We are looking forward to receiving any further questions or suggestions for improvement.
Best Regards, Authors
The paper aims to develop a VLM guard model to prevent VLMs from being attacked by malicious prompts. Existing approaches train another VLM to classify the prompt, i.e., taking the prompt as input and outputting a binary label. However, this approach lacks interpretability. To this end, the paper builds a model that aims to improve interpretability by outputting reasoning traces.
There are three key ideas presented in the paper. To solve the issue of data limitations, the authors collect 123K samples and 631K reasoning steps with a mixture of text, image, and text-image samples. To solve the issue of insufficient performance, they use an "online" RL approach to train the model. A couple of details here: first, they expand the dataset with harmful samples; second, there is a "dynamic clipping parameter" to trade off between the SFT signal and the RL training signal, letting the model explore in the early stage and exploit in the later stage. Finally, to solve the token inefficiency issue, they create a reward function that encourages the model to generate shorter outputs.
Strengths and Weaknesses
Strengths
- The paper is well written and easy to understand. For instance, in paragraph line 63, the definition of the task is clear, and the reader can easily understand the formulation in the prior work and in this work.
- The two-step process of training the VLM makes sense to me. In the first step, the paper uses SFT to train the model given the input prompt, reasoning traces, and prediction labels. Then, in the second step, the model is trained with RL on the data the authors collected, with more diverse and harder samples. The explanation of the reward function and the training process is clear.
- The evaluation and benchmark for the baselines and the proposed model are well executed. The proposed model (GuardReasoner) achieves the best results.
- The ablation study proves that the two-step training process works.
- The efficiency experiment in Table 3 is good. This adds value for running the model in a production system.
Weaknesses
- The reasoning data itself comes from another VLM, GPT-4o. I am concerned about whether there are any legal issues, as the way the model learns from another model is essentially "distillation". What if we just prompt GPT-4o to output reasoning traces for detecting malicious prompts; do we get similar performance?
- The idea of having a model learn from reasoning traces to improve prediction performance is similar to chain-of-thought, which has been a research direction for a long time.
- The baselines compared in Table 2 are small models. It would be nice to see whether more powerful models can achieve better performance, for instance, by directly prompting GPT-4o for such tasks.
Overall, I think this paper is well written and well executed, and meets the bar of NeurIPS. There is nothing fundamentally new in the paper, but the execution is nice. It would be nice to hear the authors address my concerns and questions.
Questions
- Could we conduct some human evaluation to check whether the reasoning traces make sense or not? I am wondering if the reasoning traces actually make sense for driving the correct prediction.
- The paper does not explain why providing reasoning traces helps improve performance. It would be nice if the authors gave some intuition for this and added it to the final paper.
Limitations
yes
Final Justification
After reading the other reviews, I recommend accepting this paper.
Formatting Concerns
no
Thanks for your constructive and insightful reviews. We carefully respond to your concerns as follows.
GPT-4o
- Legal Issue. We agree with your point; this learning process is a form of distillation. In fact, many well-known models use this kind of technique, e.g., DeepSeek and Qwen. However, our model is for research only and not for commercial use, which avoids the legal issue. We merely study a new paradigm for training a reasoning-based guardrail model; if companies plan to train such a model, they can use our techniques and may label the reasoning data themselves.
- Prompting GPT-4o Directly. Actually, we have tried prompting GPT-4o to conduct the guardrail tasks directly (note that this is different from prompting GPT-4o to generate the intermediate reasoning processes). However, when we prompt GPT-4o directly to conduct harmfulness detection, performance is unpromising, since it often rejects our requests. In contrast, when we prompt GPT-4o only to generate the intermediate reasoning processes for the guardrail tasks, it works well and outputs high-quality reasoning data. We find this interesting and attribute it to the prompt acting as a kind of jailbreak. We will add this observation to the final version of our paper.
Chain-of-Thought
We agree with your point. This is a kind of chain-of-thought (CoT) technique, which was first proposed in 2022. However, our method belongs to the family of learning-to-reason techniques, i.e., teaching the model to learn to reason, popularized by the o1 model in 2024. Besides, our proposed GuardReasoner-VL is the first reasoning-based VLM guardrail model.
Baselines
As shown in Table 2, the baselines include not only small models but also advanced moderation APIs, such as the OpenAI Moderation API and the Azure Content Safety API. Regarding directly prompting GPT-4o, we explain it in the GPT-4o section above: when we directly prompt GPT-4o to accomplish the guardrail task, it often denies our requests, leading to unpromising performance.
Human Evaluation
Actually, we have already conducted some human evaluation in our original paper. As shown in Figures 12, 13, and 14, we find that our proposed reasoning-based guardrail model achieves better performance, and the reasoning process does help improve both accuracy and interpretability.
Intuition
We think that reasoning traces help the model better analyze both the user's input and the model's response. By providing a step-by-step trace of the reasoning process, the model can identify key information, make more informed decisions, and improve its overall performance. This transparency also helps refine the decision-making process and makes the model's behavior more interpretable. In addition, the reasoning process can be regarded as a form of test-time scaling, which improves the model's performance. Some intuition can be found in our case studies, and we will add these insights to the final version.
Thanks for the response! It certainly addresses my concern. I will recommend accepting this paper.
Dear Reviewer VpvK,
Thanks again for your insightful and constructive reviews. We would like to carefully modify our paper following your suggestion in the final version.
Thanks for your support of our work!
Best Regards,
Authors of NeurIPS 12986
Dear Reviewer VpvK,
We hope this message finds you well.
As the rebuttal period deadline is approaching, we would greatly appreciate it if you could kindly acknowledge and respond to our rebuttal for paper NeurIPS 12986. We are looking forward to receiving any further questions or suggestions for improvement. If your concerns have been resolved, could you raise your score to support our work?
Thank you very much for your support.
Kindest regards, Authors of NeurIPS 12986
Dear Reviewer VpvK,
We hope this message finds you well.
As the rebuttal period deadline is fast approaching (in less than 2 days), we would kindly appreciate it if you could acknowledge and respond to our rebuttal for paper NeurIPS 12986. We have responded to and solved your concerns. We are eager to address any further questions or suggestions you may have for improvement. If your concerns have been resolved, we would be grateful if you could consider raising your score or your confidence in support of our work.
Thank you very much for your time and support.
Kindest regards, Authors of NeurIPS 12986
Dear Authors and Reviewers,
I would like to thank the authors for providing detailed rebuttal messages. I would also like to thank reviewers DY9e and Wv3A for already engaging in further discussion.
For the other reviewers, I would like to encourage you to carefully read all other reviews and the author responses and engage in an open exchange with the authors. Please post your first response as soon as possible within the discussion time window, so there is time for back and forth discussion with the authors. Ideally, all reviewers will respond to the authors, so that the authors know their rebuttal has been read.
Best regards,
AC
This paper proposes a reasoning-based guard model for VLMs. Specifically, the authors first construct a reasoning dataset for this purpose, and then the guard model is trained with a cold-start SFT stage and online RL. After the discussion, all the reviewers reached a unanimous agreement on the acceptance of this paper. Therefore, the AC also recommends accepting this paper.