PaperHub
Score: 6.3/10 (Poster · 4 reviewers · min 6, max 7, std 0.4)
Individual scores: 6, 6, 6, 7
Confidence: 3.5
COLM 2025

Customize Multi-modal RAI Guardrails with Precedent-based Predictions

Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

We propose a precedent-based approach to enhance the adaptability and interpretability of multimodal guardrails, enabling customizable policy enforcement without extensive retraining.

Abstract

Keywords
Customizable Guardrail

Reviews and Discussion

Official Review
6

This paper proposes a new framework for customizing multi-modal Responsible AI (RAI) guardrails via precedent-based reasoning. Instead of relying on static policy definitions or extensive fine-tuning, the authors introduce a precedent database—a repository of reasoning examples that the model can condition on. These precedents are generated via a critique-revise mechanism, wherein the model self-analyzes and revises its predictions without human intervention. Experiments on the UnsafeBench dataset demonstrate superior performance in both few-shot and full-dataset scenarios, with significant improvements over baselines, especially in generalization to novel policies.

Reasons to Accept

  1. The use of precedents to drive policy-violation detection is novel and inspired, borrowing legal metaphors to increase interpretability and flexibility.

  2. This framework avoids the need for exhaustive re-training and supports adaptation to user-specific and evolving RAI policies, a highly practical and underexplored problem.

  3. Self-supervised precedent construction via critique-revise is effective and human-free, which makes the approach cost-efficient and easy to scale.

  4. The authors explore robustness across various configurations, including number of policies, scale of data, and retrieval strategies (Table 4), providing a holistic view of system behavior.

Reasons to Reject

  1. While the critique-revise loop is effective, comparisons to human-annotated precedents or other automated methods for precedent construction would be valuable.

  2. Though the authors mention potential extensions to video/audio, all experiments focus solely on images. A concrete demonstration on at least a second modality would strengthen the generality claim.

  3. The authors mention multiple prompt templates were tried but only “best” ones reported. Some transparency around variance across prompts would help understand reliability.

  4. The pipeline is relatively complex (critique, revise, retrieve, fine-tune) and while effective, it would be useful to include computational overhead analysis or training cost comparison.

Questions to Authors

  1. Could the critique-revise mechanism be extended to text-only RAI tasks (e.g., hate speech detection)? Would it work without images?

  2. Have you evaluated how often the retrieved precedent actually comes from the same policy as the test image? Are there errors due to cross-policy retrieval?

  3. How sensitive is the system to errors in the precedent rationale? Could the model be misled by faulty past reasoning?

  4. Could this precedent approach be combined with existing symbolic logic or rule-based systems for hybrid moderation?

  5. What are the limits of this method in open-world settings where even precedents may fail to capture unseen or abstract concepts?

Comment

We thank Reviewer 3C7s for the positive comments and valuable suggestions. We are motivated to see that the reviewer finds our work (1) is novel and inspired, borrowing legal metaphors to increase interpretability and flexibility, (2) avoids the need for exhaustive re-training and supports adaptation to user-specific and evolving RAI policies, (3) is effective and human-free, which makes the approach cost-efficient and easy to scale, and (4) explores robustness across various configurations, including number of policies, scale of data, and retrieval strategies (Table 4), providing a holistic view of system behavior.

Q1. While the critique-revise loop is effective, comparisons to human-annotated precedents or other automated methods for precedent construction would be valuable.

A1. Thank you for the valuable suggestion. We are not aware of alternative automated methods that can be directly applied to precedent-based guardrail construction, as we are, to our knowledge, the first to propose this approach. Moreover, collecting a large set of human-annotated precedents would be prohibitively expensive. However, to better understand the potential upper bound of the method, we consider a human-in-the-loop framework where the critique-revise loop constructs precedents initially, and human effort is dedicated to annotating the remaining challenging cases that lack valid explanations. We report the results below.

Method | % of training data utilized | F1
Baseline (LLaVA) | 74.4 | 0.552
+critique-revise (RAG) | 88.6 | 0.613
+human-in-the-loop (RAG) | 100.0 | 0.646

From the table, one can see that our proposed framework can be combined with minimal human effort (labeling only around 12% of the data), offering a practical and scalable solution for constructing precedents in real-world RAI guardrail systems.
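For concreteness, the construction loop can be sketched as follows. This is a hypothetical illustration, not the paper's implementation; `predict`, `critique`, and `revise` are assumed wrappers around the VLM, and the data layout is invented for the example.

```python
def build_precedents(dataset, vlm, max_rounds=1):
    """Construct precedents via critique-revise; route failures to humans.

    dataset yields (image, policy, gold_label) triples; vlm is any
    vision-language model wrapper exposing predict/critique/revise-style
    calls (hypothetical names, not from the paper).
    """
    precedents, needs_human = [], []
    for image, policy, gold_label in dataset:
        label, rationale = vlm.predict(image, policy)
        for _ in range(max_rounds):
            if label == gold_label:
                break
            # The model self-analyzes its incorrect prediction ...
            feedback = vlm.critique(image, policy, label, rationale)
            # ... and revises the label and rationale accordingly.
            label, rationale = vlm.revise(image, policy, rationale, feedback)
        if label == gold_label:
            precedents.append({"image": image, "policy": policy,
                               "label": label, "rationale": rationale})
        else:
            # Remaining hard cases (~12% of the data in the table above)
            # are routed to human annotators.
            needs_human.append((image, policy, gold_label))
    return precedents, needs_human
```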

Q2. Though the authors mention potential extensions to video/audio, all experiments focus solely on images. A concrete demonstration on at least a second modality would strengthen the generality claim.

A2. We appreciate the reviewer’s suggestion. Our original intention in the conclusion (Lines 314–316) was to indicate that extending the framework to other modalities such as video or audio is an important direction for future work. In this paper, we focus exclusively on images to establish and validate the core methodology. We acknowledge that the original phrasing may cause ambiguity, and we will revise the text to clarify that video and audio applications are beyond the scope of this work and will be explored in future research.

Q3. The authors mention multiple prompt templates were tried but only “best” ones reported. Some transparency around variance across prompts would help understand reliability.

A3. Thank you for the suggestion. In our preliminary analysis with proprietary models, we studied two factors: (1) whether including a list of RAI categories in the prompt affects performance, and (2) performance variance across different wording variations. We found that including the list of categories is critical—on GPT-4o, performance was 0.726 with the category list and 0.579 without it. For the paraphrasing experiments, we varied the wording of two parts: the initial instruction ("An image is considered unsafe if it contains any of the following unsafe categories") and the query ("Based on the above criteria, does the image contain unsafe content?"). We sampled five paraphrased prompts and evaluated the model's performance. The mean F1 score across these prompts was 0.715 with a variance of 4.57e-5. The highest F1 score, 0.726, corresponds to the prompt reported in Figure 8. We will include this variance analysis in the revised manuscript to enhance transparency.
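To illustrate, the reported statistics reduce to a small computation over the per-prompt scores (a sketch; whether the population or sample variance was used is not stated in the rebuttal):

```python
import statistics

def summarize_prompt_variance(f1_scores: list[float]) -> None:
    """Report mean, variance, and best F1 across paraphrased prompt templates."""
    mean = statistics.mean(f1_scores)
    # Population variance; swap in statistics.variance for the sample variance.
    var = statistics.pvariance(f1_scores)
    print(f"mean F1 = {mean:.3f}, variance = {var:.2e}, "
          f"best F1 = {max(f1_scores):.3f}")

# Usage: pass the five per-prompt F1 scores from the evaluation runs;
# the rebuttal reports mean 0.715, variance 4.57e-5, and best 0.726.
```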

Comment

Q7. How sensitive is the system to errors in the precedent rationale? Could the model be misled by faulty past reasoning?

A7. Great question! The system’s performance can indeed be influenced by the quality of the retrieved rationales. If the rationales are incorrect or misleading, they may propagate errors to the final predictions. Currently, there is no automatic method to evaluate the faithfulness of rationales. To address this, we manually sampled 10 precedents from each of the 11 categories (110 images in total) and assessed their faithfulness. We found that only 7 samples exhibited hallucinations or missed critical objects. Importantly, these flawed samples were not retrieved during inference and therefore did not mislead the model.

Furthermore, we observed that errors are less about reasoning flaws in the rationales and more about perceptual errors in the captions of precedents. In these cases, the model’s incorrect reasoning stems from a fundamental perception issue rather than a reasoning error. This behavior is consistent with findings from recent studies [3], which suggest that incorporating uncertainty estimation methods can help flag potentially hallucinated captions. Such techniques could be integrated to filter unreliable precedents and further mitigate the risk of erroneous reasoning.

[3] VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information, Kamoi, 2024.

Q8. Could this precedent approach be combined with existing symbolic logic or rule-based systems for hybrid moderation?

A8. We appreciate the insightful suggestion. Combining our precedent-based approach with symbolic logic or rule-based systems is indeed a promising direction. One way to achieve this is by integrating precedents with probabilistic programming or program synthesis methods, as demonstrated in [4, 5]. Specifically, textual precedents can be mapped into grounded programs or logical expressions, enabling seamless combination with existing rule-based systems.

[4] From word models to world models: Translating from natural language to the probabilistic language of thought, Wong et al., 2023.

[5] TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks, Wang et al., 2024.

Q9. What are the limits of this method in open-world settings where even precedents may fail to capture unseen or abstract concepts?

A9. We acknowledge that the performance of our framework is bounded by the capabilities of current vision-language models (VLMs). If the model fails to associate presented objects with relevant RAI concepts, particularly for novel or abstract cases, it may not reliably identify unsafe content. Nonetheless, recent studies have shown the potential for leveraging existing knowledge to achieve compositional generalization to unseen concepts or patterns [6, 7], suggesting promising directions for future research. We will also add a limitations section in future revisions to explicitly discuss these challenges.

[6] Verbalized Representation Learning for Interpretable Few-Shot Generalization, Yang et al., 2024.

[7] DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery, Mall et al., 2025.

Comment

Q4. The pipeline is relatively complex (critique, revise, retrieve, fine-tune) and while effective, it would be useful to include computational overhead analysis or training cost comparison.

A4. We are happy to clarify this issue. The critique and revise stages primarily involve inference with vision-language models (VLMs). As detailed in Appendix A, we leverage SGLang, which incorporates optimizations such as RadixAttention for key-value (KV) cache reuse, significantly improving inference speed and reducing GPU memory usage. We benchmarked the full pipeline on a single NVIDIA A6000 GPU with 2000 queries, achieving a total processing time of under 300 seconds. Retrieval involves only a single forward pass through the image or text encoder, making it highly efficient; each retrieval completes within approximately 10 seconds.
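As a rough illustration, wall-clock numbers of this kind can be reproduced with a harness of the following shape (a sketch; `run_guardrail_pipeline` is a hypothetical stand-in for one critique-revise query against the served VLM):

```python
import time

def benchmark(queries, run_guardrail_pipeline):
    """Time the guardrail pipeline end to end and report throughput.

    run_guardrail_pipeline is a hypothetical callable handling one query;
    in practice queries are batched through the serving engine so the
    KV cache can be reused across similar prompts.
    """
    start = time.perf_counter()
    results = [run_guardrail_pipeline(q) for q in queries]
    elapsed = time.perf_counter() - start
    print(f"{len(queries)} queries in {elapsed:.1f}s "
          f"({len(queries) / elapsed:.1f} queries/s)")
    return results
```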

Regarding training cost, our reflective fine-tuning does not introduce much overhead compared to standard supervised fine-tuning (SFT) with LoRA. The number of training samples used for reflective fine-tuning is comparable to or smaller than that used for SFT. Moreover, our experiments show that even a single round of reflective fine-tuning yields substantial performance improvements, keeping the training cost low.

We will include the above discussion in the revised manuscript.

Q5. Could the critique-revise mechanism be extended to text-only RAI tasks (e.g., hate speech detection)? Would it work without images?

A5. Absolutely. The critique-revise mechanism can be extended to text-only RAI tasks. In fact, reflection mechanisms were originally developed in the text-only domain [1, 2]. To adapt our framework to text-only tasks, one would replace the vision-language model (VLM) with a large language model (LLM) trained on reasoning data, which are widely available today (e.g., LLaMA, Qwen). We want to emphasize that our contribution lies not only in integrating the reflection mechanism into a precedent-based guardrail system but also in addressing challenges that arise where keyword-based filtering or fixed-policy guardrails are insufficient, particularly under customizable, multi-modal conditions.

[1] Reflexion: Language Agents with Verbal Reinforcement Learning, Shinn et al., 2023.

[2] Re-ReST: Reflection-Reinforced Self-Training for Language Agents, Dou et al., 2024.

Q6. Have you evaluated how often the retrieved precedent actually comes from the same policy as the test image? Are there errors due to cross-policy retrieval?

A6. Yes, we have evaluated this. We find that the retrieval threshold plays a crucial role in ensuring that the retrieved precedent matches the policy of the test image (please refer to Table 4). With a threshold set at 0.8, the retrieval accuracy—defined as the proportion of precedents from the same policy as the test image—is 92%, and 80% of test images successfully retrieve a precedent. At a lower threshold of 0.7, retrieval accuracy drops to 86%, though the coverage rate increases to 92%. We also observe a corresponding 2% drop in final testing accuracy when the lower threshold is applied, indicating the importance of retrieving relevant precedents.
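To make the two metrics concrete, here is a minimal sketch of how coverage and retrieval accuracy can be computed at a given threshold (array names are illustrative, not from the paper):

```python
import numpy as np

def retrieval_metrics(sims, precedent_policies, test_policies, threshold):
    """Coverage and retrieval accuracy at a similarity threshold.

    sims: (n_test, n_precedents) similarity matrix (NumPy array)
    precedent_policies / test_policies: policy label per precedent / test image
    """
    best = sims.argmax(axis=1)                    # closest precedent per test image
    best_sim = sims[np.arange(sims.shape[0]), best]
    retrieved = best_sim >= threshold             # test images that retrieve anything
    coverage = retrieved.mean()
    same_policy = precedent_policies[best] == test_policies
    accuracy = same_policy[retrieved].mean()      # same-policy rate among retrieved
    return coverage, accuracy

# At threshold 0.8 the rebuttal reports coverage 0.80 and accuracy 0.92;
# at 0.7, coverage 0.92 and accuracy 0.86.
```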

However, we would like to note that our current framework is a prototype and employs a relatively simple retrieval method. Future work can leverage advances in retrieval-augmented generation (RAG) systems to further enhance retrieval precision and robustness.

Comment

Dear Reviewer 3C7s,

Thank you again for your thoughtful comments and for recognizing the novelty, flexibility, and scalability of our work. As the discussion period nears its conclusion, we would like to gently follow up and see whether our responses have addressed your valuable suggestions.

In particular, during the rebuttal, we have:

  • Clarified the unique contribution of our precedent-based framework and explored a human-in-the-loop extension, showing that minimal manual annotation (12%) can further enhance performance (A1).

  • Clarified that video/audio extensions are proposed for future work, and are not within the current experimental scope (A2).

  • Detailed a variance analysis across prompt formulations to improve transparency on prompt reliability and performance stability (A3).

  • Provided a detailed breakdown of the pipeline’s computational efficiency and training cost, including runtime benchmarks and GPU usage (A4).

  • Explained the applicability of our framework to text-only RAI tasks and the rationale for using VLMs in this instantiation (A5).

  • Reported retrieval precision across policy boundaries (A6) and the effect of flawed precedents (A7), and discussed faithfulness evaluation and future safeguards.

  • Discussed integrating symbolic reasoning, noting future opportunities for hybrid moderation (A8), and outlined current limitations in open-world generalization (A9).

We sincerely appreciate your insights, which helped us clarify and strengthen the manuscript. If any further clarification would be helpful, we are happy to provide it.

Best regards,

All Authors

Official Review
6

This paper tackles the issue of customizing user-defined RAI (Responsible AI) policies by introducing a method that stores precedent examples for each independent policy and retrieves them during inference. This method provides models with task-related examples to improve performance, and extensive experiments demonstrate its effectiveness in identifying policy-violating examples in various situations.

Reasons to Accept

  • Evaluation shows the effectiveness of the proposed method for improving the performance of identifying policy-violating examples.
  • This method can be integrated with different models in various situations.
  • It reduces the effect of noise from the input distribution.

Reasons to Reject

  • Retrieving relevant precedents is the key step during inference, but the paper does not provide much detail about it.
  • The retrieval method might affect model performance, so additional experiments are recommended to verify whether the choice of retrieval method affects the results.
  • The goal is to improve the flexibility of customizing user-defined RAI policies, but experiments cover only two scenarios (one-policy and eleven-policy cases), which might weaken confidence in the method's flexibility.
Comment

We appreciate Reviewer zikp’s valuable suggestions and their recognition of the contributions of our work in the following aspects: (1) improving the performance of identifying policy-violating examples, (2) being adaptable for integration with different models in various situations, and (3) reducing the impact of noise in the input distribution. Please see our detailed responses to each issue below.

Q1. Retrieving relevant precedents is the key step during inference, but the paper does not provide much detail about it.

A1. We are happy to clarify this issue. As described in Section 3.2, we retrieve precedents based either on the similarity between the images of the precedents and the test input, or between the captions of the precedents and the caption of the test image. For image retrieval, we use the visual encoder from CLIP with ViT-L/14; for text retrieval, we use the Contriever [1] model, extracting sentence embeddings via mean pooling. In addition, we would like to emphasize that even with this relatively simple retrieval method, our framework still achieves strong performance by leveraging relevant precedents for the guardrail system. We will update the description of the models used in future revisions.

[1] Unsupervised Dense Information Retrieval with Contrastive Learning, Izacard et al., 2021.
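For reference, the two encoders can be used roughly as follows with their Hugging Face checkpoints (a sketch; the paper's exact preprocessing and pooling details may differ):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

# Image retrieval: CLIP ViT-L/14 visual encoder.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_image(image: Image.Image) -> torch.Tensor:
    inputs = clip_proc(images=image, return_tensors="pt")
    with torch.no_grad():
        feat = clip.get_image_features(**inputs)
    return torch.nn.functional.normalize(feat, dim=-1)

# Text retrieval: Contriever with mean pooling over token embeddings.
tok = AutoTokenizer.from_pretrained("facebook/contriever")
contriever = AutoModel.from_pretrained("facebook/contriever")

def embed_caption(caption: str) -> torch.Tensor:
    inputs = tok(caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = contriever(**inputs).last_hidden_state   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)      # (1, seq_len, 1)
    emb = (out * mask).sum(dim=1) / mask.sum(dim=1)    # masked mean pooling
    return torch.nn.functional.normalize(emb, dim=-1)

# Similarity between a test input and each stored precedent is then a dot
# product of the normalized embeddings.
```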

Q2. The retrieval method might affect model performance, so additional experiments are recommended to verify whether the choice of retrieval method affects the results.

A2. We agree with the reviewer that the retrieval configuration might affect final performance. We therefore conduct an ablation analysis exploring variations in retrieval subject (text vs. image) and thresholding (whether a minimum similarity threshold is applied or simply the most similar precedent is retrieved). We report the results in Table 4 of the original manuscript and describe our findings in L275-L280. We reproduce Table 4 here for clarity.

Config ID | Subject | Th | Policy | Rationale | F1 (Δ)
1 | - | - | | | 0.552
2 | text | closest | | | +0.112
3 | text | 0.7 | | | +0.136
4 | text | 0.8 | | | +0.150
5 | text | 0.8 | | | +0.170
6 | text | 0.8 | | | +0.154
7 | img | 0.7 | | | +0.190
8 | img | 0.8 | | | +0.214

We find that even the least effective RAG configuration (config 2) improves the baseline by 11%. Moreover, setting a higher similarity threshold ensures the model only retrieves relevant precedents, explaining the performance gain between configs 3 and 4. Our findings also indicate that using images as the retrieval subject (e.g., comparing configs 5 and 8) consistently yields better results.

Q3. The goal is to improve the flexibility of customizing user-defined RAI policies, but experiments cover only two scenarios (one-policy and eleven-policy cases), which might weaken confidence in the method's flexibility.

A3. We would like to clarify that we evaluate the flexibility of our framework from three dimensions.

(1) Adaptability to novel policies (L233-L246), where the model is trained on 10 categories (160 images per policy) and adapted to a novel 11th category using few-shot data (16 images), with results reported in Table 2.

(2) Scalability of the number of RAI policies (L247-L259), where we test the model's robustness against an increasing number of policies. The results are shown in Table 3.

(3) Scalability of the number of training images (L260-L274), where we evaluate the impact of scaling up the training data and plot the results in Figure 4.

Across these experiments, our method demonstrates superior robustness to novel policies and scales well with an increasing number of policies and training images, which verifies the flexibility of our proposed framework.

Comment

Dear Reviewer zikp,

Thank you for your time and thoughtful feedback on our paper. As the discussion period nears its conclusion, we would like to kindly follow up and ask whether our responses have clarified your concerns. Specifically, during the rebuttal phase, we have:

  • Clarified the retrieval process originally described in Section 3.2, detailing the models used for image (CLIP ViT-L/14) and text (Contriever) retrieval.

  • Reiterated our ablation study on retrieval configurations, previously reported in Table 4 and Section 4.4, which examines the impact of different retrieval strategies on model performance.

  • Highlighted that our evaluation of framework flexibility spans three dimensions — adaptability to novel policies, scalability to more policies, and scaling with additional training data — as detailed in Tables 2, 3, and Figure 4.

We hope these clarifications help reaffirm that the concerns you raised have been addressed within the paper, and we kindly ask you to consider these points in your evaluation. Please let us know if there is anything else we can further clarify.

Best regards,

All Authors

Official Review
6

This paper proposes to customize a guardrail system by providing "precedents", or few-shot examples of a particular type of outputs to be filtered. The paper is well organized and clearly written. The problem is somewhat motivated, and the experimental results are encouraging.

Reasons to Accept

  1. As the content created by a genAI system becomes more and more complex and diverse, we might need such a few-shot, example-based guardrail system.
  2. The method is reasonably and carefully designed.
  3. The experiments are comprehensive and the empirical results are encouraging.

Reasons to Reject

  1. At the current stage, a keyword-based filtering customization system seems sufficient. However, I cannot rule out the usefulness of such a method in the future. Therefore, I describe it as "somewhat motivated".
  2. The word "user" is a bit confusing. It seems to refer to the service provider that outsources the filter system, instead of end users.

Questions to Authors

N/A

Comment

We thank Reviewer Hnir for the positive comments and constructive suggestions. We are encouraged to see that the reviewer finds our work (1) to address the need for a few-shot, example-based guardrail system, (2) to be reasonably and carefully designed, and (3) to include comprehensive experiments with encouraging empirical results. Please see our responses below for each raised issue.

Q1. At the current stage, a keyword-based filtering customization system seems sufficient. However, I cannot rule out the usefulness of such a method in the future. Therefore, I describe it as "somewhat motivated".

A1. Because we filter images rather than text, unsafe content often manifests through subtle visual cues, such as coded symbols or context, that keyword lists miss. As harmful imagery evolves, precedents act as principles that guide our system to adapt instantly, without rebuilding keyword sets through human effort. These advantages allow our precedent-based guardrail to scale and adapt beyond basic filters.

Q2. The word "user" is a bit confusing. It seems to refer to the service provider that outsources the filter system, instead of end users.

A2. We would like to clarify that in this context, "user" refers to the service provider that intends to "use" the guardrail system. However, to avoid ambiguity, we will replace "user" with "service provider" in future revisions.

Comment

My primary concern is that example images, as a description of the filter, are highly ambiguous, since an image contains many elements. For example, in Fig. 3(a), the "drug bottle" image doesn't seem to promote drug use in my eyes. If the user provides such an example image, chances are the system may reject many harmless images with similar visual appearance but different semantics (e.g., bottles of medicine). In contrast, text-based specifications are much less ambiguous.

Comment

We would like to clarify that the model does not make decisions based solely on the reference image. Instead, it operates in a multi-modal context, jointly reasoning over both the textual policy definition (e.g., "Drug use, theft, vandalism, and other illegal activities" as in Fig. 3(a)) and the provided reference image. The precedents combine both modalities, allowing the model to align visual and textual cues to distinguish nuanced cases. In the example you mentioned, if the image depicts medicine use in a healthcare setting rather than drug abuse, the model is expected to recognize this distinction and not filter out the image. This multi-modal reasoning not only mitigates the ambiguity that can arise from relying on visual information alone but also allows images to supplement textual definitions by capturing nuances that are difficult to specify in text alone.
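To illustrate how the two modalities enter the judgment, the query can be assembled along these lines (a hypothetical template, not the paper's actual prompt; the field names are invented):

```python
def build_judgment_prompt(policy_text: str, precedent: dict) -> str:
    """Assemble the textual side of a multi-modal guardrail query.

    The test image is attached separately as the visual input; this
    wording is illustrative only.
    """
    return (
        f"Policy: {policy_text}\n"
        f"Precedent (label: {precedent['label']}): {precedent['rationale']}\n"
        "Considering the policy and the precedent above, does the attached "
        "image violate the policy? Explain your reasoning, then answer."
    )
```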

Official Review
7

This paper considers guardrails in the context of vision-language models, where the goal is to filter out image content based on user-defined policies. This helps avoid hateful, harmful, and incorrect information. To this end, this work focuses on the scalability of multi-modal guardrails, aiming to extend to novel user standards with minimal training. The work proposes to condition the model's judgment on "precedents", i.e., the reasoning processes for prior data points similar to the given input. This approach is shown to be effective in different scenarios.

Reasons to Accept

  • The problem of the scalability of multi-modal guardrails is reasonable and useful. This work discusses the limitations of existing fine-tuning methods and training-free methods. The former is hard to generalize to novel standards, while the latter struggles with limited context lengths. To address both issues, this work proposes to retrieve similar data points, which is sound to me.
  • The framework is clear and easy to follow. The precedent construction and utilization are clearly illustrated. Both steps are reasonable.
  • The experimental results show the proposed method can bring improvements over several baselines, such as LLaVA. Table 2 also shows the adaptability of the proposed method to novel standards.

Reasons to Reject

This submission is overall good. However, it would be better to include a paragraph discussing potential limitations and future directions.

Questions to Authors

One important question: Will the Precedent Database be made publicly available?

Comment

We thank Reviewer oTxU for the positive comments and constructive remarks. We are excited to see that the reviewer finds our work (1) reasonable and useful, (2) clear and easy to follow, and (3) able to bring improvements over several baselines. Please see our responses below for each raised issue.

Q1. It would be better to include a paragraph discussing potential limitations and future directions.

A1. Thanks for the suggestion! In our original manuscript, due to the page limit, we briefly discuss limitations in the conclusion. We acknowledge that the performance of our framework is bounded by the capabilities of retrieval methods and current vision-language models (VLMs). If the system fails to retrieve relevant precedents or if the VLM fails to associate presented objects with relevant RAI concepts—particularly in the case of novel or abstract content—its ability to identify unsafe content may be compromised.

In addition to the future directions mentioned in the original conclusion, we note that Reviewer 3C7s also suggested several promising extensions, including the integration with symbolic or rule-based systems (see our response to Q8) and addressing scenarios where objects are unseen or abstract to the VLM (see our response to Q9). We will include a dedicated section discussing limitations and future work in the revised version.

Q2. Will the Precedent Database be made publicly available?

A2. We will release the database in the future. In addition, we've provided the scripts to generate these precedents in the supplementary materials.

Comment

Thanks for the response. I maintain my original score of accept.

Comment

Dear Reviewers,

We hope this message finds you well. As the deadline for the discussion period approaches, we would like to kindly remind you and inquire if there are any outstanding points you would like us to address. We would be grateful for the opportunity to provide any additional information or clarification and are more than happy to continue the conversation.

Best regards,

All authors

Final Decision

Applying guardrails for content moderation in real world deployments of generative AI often requires adapting guardrails to changing policies that define acceptable vs unacceptable content. The authors, recognizing this challenge, develop an approach to condition the judgements of guardrail models on precedents generated from training data of the guardrail model, augmented with a rationale that explains the judgement. By doing so, the authors are able to ground the judgements better in rationales, thereby allowing adaptation to changing moderation policies. The authors call attention to an important problem and provide an interesting and practical approach to solve it. While the reviewers bring up valid concerns on the comprehensiveness of evaluation, I believe the COLM audience would benefit from the presentation of this paper, that brings attention to an important and understudied problem, while also proposing a promising solution that can spark further work.

[Automatically added comment] At least one review was discounted during the decision process due to quality.