Robust SuperAlignment: Weak-to-Strong Robustness Generalization for Vision-Language Models
We propose the first robust weak-to-strong generalization framework to elicit robust knowledge from a strong student VLM in an unsupervised scheme.
Abstract
Reviews and Discussion
This paper studies the robustness of vision-language models in the context of weak-to-strong generalization. The authors reveal that weak teacher models cannot provide robustness through simple distillation alone, and emphasize the importance of incorporating adversarial samples during training. In addition, the paper proposes Entropy-Guided Uncertainty Re-Weighting and Teacher Guidance Refinement via Inverse Adversarial Examples to improve the robustness of the student model.
Strengths and Weaknesses
Strengths:
- Clear and Logical Writing.
- To the best of my knowledge, this is the first paper to address the robustness of vision-language models specifically in the context of weak-to-strong generalization.
- The theoretical justification that increasing the teacher model’s confidence leads to greater robustness in the student model is highly logical, and the proposal of inverse adversarial examples is well-reasoned.
Weaknesses: In weak-to-strong generalization loss function, replacing the constant balancing factor with an adaptive parameter was already explored in [1], which balances the weight utilizing the confidences of both the teacher and student models. I wonder what the advantages of EUR compared to the adaptive weight [1] and how the performance would change if EUR were replaced by the adaptive weight [1].
[1] Guo, Jianyuan, et al. "Vision superalignment: Weak-to-strong generalization for vision foundation models." arXiv preprint arXiv:2402.03749 (2024).
Questions
As described in the Weaknesses, I wonder how EUR compares to the adaptive weighting approach proposed in [1]. Specifically, what are the unique benefits of EUR, and how does the model's performance change when EUR is substituted with the adaptive weight from [1]?
[1] Guo, Jianyuan, et al. "Vision superalignment: Weak-to-strong generalization for vision foundation models." arXiv preprint arXiv:2402.03749 (2024).
Limitations
Yes, the authors report the limitations and broader impacts in the Appendix.
Final Justification
This paper presents a novel study on the robustness of superalignment in Vision-Language Models — a topic that has not been explored in prior research — and supports its claims with both theoretical analysis and extensive experimental evidence. My concerns have been satisfactorily addressed during the discussion.
Formatting Concerns
No formatting concerns.
Response to Reviewer zCJ1
We thank the reviewer for constructive feedback.
1. (Weakness-1) Comparison of EUR with adaptive weighting in [1]
We thank the reviewer for pointing this out. While [1] introduces an adaptive weighting based on confidence for weak-to-strong generalization, it focuses solely on single-modal vision models. Our work addresses robust weak-to-strong generalization for multimodal Vision-Language Models (VLMs), and our Entropy-guided Uncertainty Re-weighting (EUR) extends beyond raw confidence by leveraging prediction entropy, theoretically justified via Theorems 1 & 2 (feature norm → classification margin → robustness). In contrast, [1] provides only empirical justification.
To quantify the impact, we also implement the adaptive confidence weighting from [1] as a variant and compare it with our EUR below (average over 14 zero-shot datasets):
| Method | Clean (%) | Robust (%) |
|---|---|---|
| Confidence Re-Weighting [1] | 67.17 | 46.92 |
| Entropy Uncertainty Re-Weighting (EUR) | 68.75 | 48.38 |
Our Adv-W2S method with EUR achieves higher robustness while also improving clean accuracy, validating the benefits of entropy-based adaptive scaling. We will incorporate this comparison and clarification into the revision.
[1] Guo, Jianyuan, et al. "Vision superalignment: Weak-to-strong generalization for vision foundation models." arXiv preprint arXiv:2402.03749 (2024).
2. (Question-1) Unique benefits of EUR beyond [1].
The empirical comparison of EUR with the adaptive weighting in [1] is reported above. Here, we emphasize the theoretical contributions unique to EUR:
- [1] balances teacher–student alignment purely via prediction confidence yet lacks a theoretical link to generalization/robustness properties.
- Our EUR is built on prediction entropy, which we theoretically show (Theorems 1 & 2) to correlate with feature norm and classification margin, thereby providing a direct mechanism for robustness improvement. This connection enables a principled adaptive weighting scheme: low-entropy predictions (confident teacher) lead to stronger alignment, while high-entropy predictions defer to student self-refinement, improving robust weak-to-strong generalization.
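To make the mechanism concrete, below is a minimal sketch (PyTorch-style, with our own variable names and a hypothetical threshold; the exact mapping in Eq. (6) of the paper may differ) of how a teacher-entropy-based weight can be computed:

```python
import torch


def eur_weight(teacher_logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Per-sample alignment weight from teacher prediction entropy.

    Low entropy (confident teacher)  -> weight close to 1 (follow the teacher).
    High entropy (uncertain teacher) -> weight close to 0 (student self-refines).
    `tau` is a hypothetical threshold, e.g. the median teacher entropy of the batch.
    """
    probs = teacher_logits.softmax(dim=-1)                         # teacher class probabilities
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # per-sample prediction entropy
    return torch.sigmoid(tau - entropy)                            # below-threshold entropy -> weight > 0.5


# Usage: per-sample blend of teacher alignment and student self-consistency, e.g.
#   loss = w * teacher_alignment_loss + (1 - w) * self_consistency_loss
```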
This theoretical grounding explains why EUR improves both clean and robust performance, as confirmed by the empirical results. We will add this discussion to the revised manuscript.
[1] Guo, Jianyuan, et al. "Vision superalignment: Weak-to-strong generalization for vision foundation models." arXiv preprint arXiv:2402.03749 (2024).
Thank you for your clear and thorough rebuttal, which has effectively addressed my concerns. I am satisfied that the issues have been resolved and I maintain a positive position on this paper.
Esteemed Reviewer,
We sincerely thank you for engaging with our rebuttal, and for your valuable comments. Rest assured we will incorporate all your suggestions in the draft. Meantime, kindly let us know if there is anything else we can answer, clarify or improve.
Best regards,
Authors
The paper proposes a new framework called Adv-W2S (Adversarial Weak-to-Strong) to improve the robustness of VLMs against adversarial attacks. It makes several key contributions: it provides theoretical analysis that explains the shortcomings of traditional weak-to-strong generalization methods in terms of robustness, attributing this issue to the lack of adversarial samples in the learning objective; it introduces a new method based on adversarial sample generation, allowing robust knowledge to be extracted from strong student models in an unsupervised way; and it validates the effectiveness of Adv-W2S through extensive experiments, showing its superior performance in zero-shot classification tasks as well as its applicability to various downstream tasks such as image captioning, visual question answering, and CoT reasoning. These contributions offer valuable insights for addressing current vulnerabilities in the security and robustness of VLMs.
Strengths and Weaknesses
Strengths:
① The writing is excellent, and the paper is logically structured.
② New method design: The authors propose a weak-to-strong generalization framework called Adv-W2S, based on adversarial sample generation, which can extract robust knowledge from a strong student model in an unsupervised manner.
③ Comprehensive experimental validation: The authors conduct extensive experiments across multiple network architectures and evaluation scenarios, demonstrating the effectiveness of Adv-W2S.
Weaknesses:
① The paper introduces the parameter β in the CLIP weak-to-strong formulation (i.e., Equation (4)). However, there is limited discussion later in the paper regarding the specific value of β. It would be beneficial to include a brief analysis or ablation study on how β affects the experimental results.
② The authors could refer to the PGR formulation used in Burns' work to evaluate the robust generalization performance of their method. Additionally, introducing a "strong ceiling" — that is, showing the upper bound of performance when directly training the strong model — would make the robustness comparison more convincing.
Questions
When I first started reading the paper, a question came to mind: The term "superalign" usually refers to the field of AI alignment, where the focus is on aligning highly intelligent, complex, or enhanced models with human values and intentions. It primarily concerns how to ensure that superhuman-level AI systems (such as models that perform beyond human capabilities) remain consistent with human values and objectives.
CLIP clearly does not fall into this category. However, in the later experimental sections, the paper integrates CLIP into large-scale VLMs such as LLaVA, which partially addresses this issue. It would be helpful if the authors could clarify this point more explicitly at the beginning of the paper. However, this does not detract from the fact that this is still a good paper.
Furthermore, based on this point, since the authors likely have sufficient GPU resources, expanding the experiments to even larger models would better reflect the "superalignment" scenario.
other comments see weaknesses.
Limitations
Yes.
Final Justification
This article is well-developed in terms of topic selection, experimental analysis, and findings. The rebuttal also addressed my questions, and I believe it is a good paper.
Formatting Concerns
No.
Response to Reviewer m9zw
We thank the reviewer for constructive feedback.
1. (Weaknesses-1) Analysis and impact of the parameter β in the CLIP weak-to-strong formulation (i.e., Eq. (4)).
We thank the reviewer for raising this point. The trade-off hyperparameter β controls the balance between weak-to-strong teacher-student alignment and self-prediction consistency in Eq. (4) and is similarly applied in our adversarial baseline (Eq. (5)). In our experiments, β is selected via 1% of the training data and either fixed or linearly warmed up. To further clarify the role of β, we report the performance of the baseline adversarial weak-to-strong formulation (Eq. (5)) under different fixed values of β in the table below:
| β | Clean (%) | Robust (%) |
|---|---|---|
| 0.3 | 64.10 | 41.65 |
| 0.5 | 63.45 | 42.43 |
| 0.7 | 63.27 | 41.96 |
However, a fixed β cannot adapt to varying teacher reliability, particularly for uncertain or hard samples. To address this, our final method introduces Entropy-guided Uncertainty Re-weighting (EUR) (Eq. (6)), which adaptively scales the alignment weight based on teacher confidence, leading to our unified objective in Eq. (13). We report average clean and robust accuracy across 14 zero-shot datasets to quantify this effect:
| Method | Clean | Robust |
|---|---|---|
| Baseline (Eq. (5)) | 63.45 | 42.43 |
| Adv-W2S w/o EUR | 67.09 | 46.79 |
| Adv-W2S | 68.75 | 48.38 |
These results demonstrate that entropy-based adaptive weighting notably improves robustness generalization while also enhancing clean performance. We will incorporate this clarification and the ablation results into the revised manuscript.
2. (Weakness-2) PGR evaluation from Burns et al. [a]
We thank the reviewer for this helpful suggestion. We here estimate the "strong ceiling" by performing robust fine-tuning of the strong model directly on each downstream dataset, which approximates the upper bound of achievable robustness. This provides a clearer perspective on the gap between robust weak-to-strong generalization in the zero-shot setup and task-specific robust training. To better contextualize our results, we consider the Performance Gap Recovered (PGR) [a] in both clean performance and adversarial robustness across diverse tasks. The table below summarizes the PGR results (%) of our method (average classification accuracy against clean/adversarial examples across zero-shot datasets, captioning on the COCO dataset, and VQA on the VQAv2 dataset):
| Task | Clean PGR | Robust PGR |
|---|---|---|
| Classification | 86 | 79 |
| COCO Captioning | 81 | 75 |
| VQAv2 | 83 | 78 |
These results indicate that Adv-W2S recovers the majority of the performance gap to the strong ceiling in both clean and robust settings, validating its effectiveness in achieving robust weak-to-strong generalization without task-specific training. We will include this analysis in the revised manuscript.
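For completeness, the PGR metric of Burns et al. [a] used above measures the fraction of the weak-to-strong performance gap recovered relative to the strong ceiling, computed separately for clean accuracy and adversarial robustness:

$$
\mathrm{PGR} \;=\; \frac{\text{weak-to-strong performance} \;-\; \text{weak teacher performance}}{\text{strong ceiling performance} \;-\; \text{weak teacher performance}}.
$$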
[a] Burns et al., Weak-to-strong generalization: Eliciting strong capabilities with weak supervision (ICML 2024).
3. (Question-1) Clarifying “superalign” terminology.
We thank the reviewer for this observation. We agree that "superalign" is commonly used in AI safety to describe aligning superhuman-level models with human values. Our work uses the term in the weak-to-strong generalization sense, following Burns et al. [a], where the goal is to transfer robustness from a weaker model to a stronger one — an analogous alignment challenge in multimodal robustness.
- We will revise the introduction to explicitly acknowledge the conventional meaning of "superalign" and clarify our intended usage.
- We will highlight that our method starts from CLIP (a foundational vision-text model) and extends to large-scale VLMs such as LLaVA, emphasizing that our formulation addresses robustness transfer rather than human-value alignment.
- This clarification will be added early in the paper to prevent confusion and situate our contributions within the broader literature.
[a] Burns et al., Weak-to-strong generalization: Eliciting strong capabilities with weak supervision (ICML 2024).
4. (Question-2) Scalability to larger model backbones
To validate scalability, we here extend experiments to larger CLIP backbones (ViT-H/14 and ViT-g/14), which are ~3× larger than ViT-B/16 and widely used in recent VLMs. Average zero-shot classification accuracy (%) across 14 datasets, under the adversarial perturbation radius used in our main evaluation, is summarized below:
| Backbone | Method | Clean | Robust |
|---|---|---|---|
| ViT-L → ViT-H | TeCoA | 66.95 | 48.63 |
| | PMG | 68.96 | 50.31 |
| | FARE | 70.55 | 50.25 |
| | TGA | 65.91 | 49.14 |
| | Adv-W2S | 72.89 | 53.27 |
| ViT-H → ViT-g | TeCoA | 67.41 | 49.20 |
| | PMG | 69.71 | 51.95 |
| | FARE | 71.62 | 51.68 |
| | TGA | 66.99 | 50.44 |
| | Adv-W2S | 73.94 | 54.09 |
Our method maintains consistent gains on both clean and robust accuracy, with especially strong improvements in robustness as model capacity increases. We will include this analysis and also reference robustness results of other large-scale VLMs in other vision-language tasks reported in Response 2 to Reviewer 7iVd.
The authors' answer solved my question. Since my initial rating was already high enough, I maintain this rating and have improved some minor scores. Good luck!
Esteemed Reviewer,
Many thanks for engaging with our rebuttal and for all your valuable comments. Rest assured, we will incorporate all your suggestions into the final paper.
Meantime, if there is anything else we can improve or answer, kindly do let us know.
Best regards,
Authors
In this work, the weak-to-strong robustness generalization problem is studied, and the key finding is that the strong student VLM remains vulnerable to adversarial attack because the training objective is not attack-aware. To this end, alignment re-weighting and reference refining are proposed to enhance the weak-to-strong alignment and improve the model's adversarial robustness. In detail, besides using a fixed weight to balance the trade-off between teacher-student alignment and student self-refinement, a teacher-entropy measure of uncertainty is proposed and used to guide the adaptive weights. The proposed approach is validated on multiple datasets with multiple tasks like linear probing, image captioning, VQA, hallucination, etc.
Strengths and Weaknesses
Strengths
- The paper is well formatted and written, with both theoretical analysis and experiments.
- Sufficient experimental coverage: the proposed method is validated on multiple vision-language tasks, such as zero-shot classification, image captioning, VQA, and hallucination.
- The pretrained vision encoder is validated with multiple open-source base models, such as LLaVA 1.5 and OpenFlamingo.
Weaknesses
- The studied adversarial attacks seem limited: 1) only the visual encoder is studied; 2) only perturbation-based attacks are considered.
- More recent VLMs such as LLaVA 1.6 and Qwen-VL are not tested.
Questions
- This work mostly focuses on perturbation-based adversarial attacks on the vision encoder, which is fine. But one follow-up question would be: can this approach be generalized to other adversarial attacks? For example, perturbation-based attacks have also been studied in text-only language models [1]; in principle, the entropy-based approach seems to generalize to this case. Another possibility is to extend the approach to adversarial attacks that are not perturbation-based, such as prompt-based attacks [2].
- Another hypothesis is that stronger VLM-based models will be more robust, so it would be good to see how robust more recent VLMs are, such as Qwen-VL, LLaVA-OneVision, or LLaVA 1.6 (NeXT).

[1] Kim, Jaehyung, et al. "RoAST: Robustifying Language Models via Adversarial Perturbation with Selective Training." EMNLP (Findings). 2023.
[2] Xu, Xilie, et al. "An LLM can Fool Itself: A Prompt-Based Adversarial Attack." ICLR. 2024.
Limitations
Yes, limitations are discussed in appendix.
Final Justification
Given the comprehensive response, with additional discussion and evidence showing the effectiveness of the approach on 1) language models and 2) more recent VLMs with stronger performance, I have updated my rating accordingly to reflect these discussions.
Formatting Concerns
N/A
Response to Reviewer 7iVd
We thank the reviewer for constructive feedback.
1. (Weakness-1) (Question-1) Generalization beyond perturbation-based visual adversarial attacks (e.g., text perturbation attacks, text prompt attacks).
We thank the reviewer for this question. While our main experiments focus on perturbation-based attacks on the visual encoder (dominant in multimodal robustness studies), our approach is not restricted to this setting.
- Connection to text-based perturbations: Work [1] studies continuous embedding-level perturbations for LLMs, while [2] addresses discrete text modifications to achieve prompt-based attacks. Importantly, both types of text-based adversarial attacks are applicable for evaluating the robustness of our method as well as other adversarial fine-tuning approaches.
- Extended evaluation: To demonstrate generality, we additionally evaluate robustness against text perturbation [1] and prompt-based attacks [2]; results are summarized below:
| Method | Clean | Robust Acc. (Text Perturbation Attack) | Robust Acc. (Text Prompt Attack) |
|---|---|---|---|
| TeCoA | 61.56 | 41.22 | 39.97 |
| PMG | 64.46 | 42.37 | 40.82 |
| FARE | 65.50 | 40.28 | 38.58 |
| TGA | 62.11 | 41.54 | 40.11 |
| Adv-W2S | 68.75 | 45.30 | 44.06 |
These results confirm that our Adv-W2S generalizes beyond image-level perturbations and remains robust under diverse modalities. We will add this discussion and cite [1,2] in the revised manuscript.
[1] Kim, Jaehyung, et al. "RoAST: Robustifying Language Models via Adversarial Perturbation with Selective Training." EMNLP (Findings). 2023.
[2] Xu, Xilie, et al. "An LLM can Fool Itself: A Prompt-Based Adversarial Attack." ICLR. 2024.
2. (Weakness-2) (Question-2) Evaluation on recent VLMs (LLaVA 1.6 (next), Qwen VL)
We thank the reviewer for this helpful suggestion. To address it, we extend our evaluation to recent, stronger VLMs, namely Qwen-VL 7B and LLaVA 1.6 (NeXT) 7B, both using the ViT-L vision encoder for a fair comparison. Results on COCO captioning (CIDEr) and VQAv2 (VQA accuracy) are summarized below. Adv-W2S continues to provide significant robustness gains while maintaining competitive clean performance, indicating that the method generalizes well to recent VLM architectures.
| Model (7B, ViT-L) | Method | COCO Captioning (Clean/Robust) | VQAv2 (Clean/Robust) |
|---|---|---|---|
| Qwen-VL | Standard | 116.2/4.3 | 78.5/1.1 |
| | TeCoA | 102.5/49.1 | 72.8/36.4 |
| | PMG | 108.4/53.0 | 74.1/37.3 |
| | FARE | 109.0/48.6 | 75.2/35.7 |
| | TGA | 105.7/51.3 | 72.5/36.9 |
| | Adv-W2S | 114.8/57.7 | 77.9/40.5 |
| LLaVA 1.6 (NeXT) | Standard | 118.7/5.1 | 80.2/1.8 |
| | TeCoA | 105.4/51.3 | 74.6/37.8 |
| | PMG | 110.1/54.8 | 75.5/38.6 |
| | FARE | 111.5/50.8 | 76.0/36.9 |
| | TGA | 107.3/53.5 | 74.8/37.5 |
| | Adv-W2S | 116.9/60.2 | 79.8/42.3 |
These results show that Adv-W2S retains its robustness advantage on newer, stronger VLM backbones, scaling effectively while maintaining competitive clean accuracy. We will add this discussion and highlight that the robustness improvements persist even as model capacity increases.
I would like to thank the authors for the comprehensive response, with additional discussion and evidence showing the effectiveness of the approach on 1) language models and 2) more recent VLMs with stronger performance. I have updated my rating accordingly to reflect these discussions.
Esteemed Reviewer,
We sincerely thank you for engaging with our rebuttal, and for your valuable comments. Rest assured we will incorporate all your suggestions in the draft. Meantime, kindly let us know if there is anything else we can answer, clarify or improve.
Best regards,
Authors
This paper studies the problem of weak-to-strong (W2S) generalisation for vision-language models (VLMs) in the context of adversarial robustness, a capability that standard W2S approaches fail to transfer. The authors identify the root cause as a mismatch in the data used for alignment: while clean examples suffice for natural accuracy, they fall short for robustness.
To address this, the paper proposes Adv-W2S, an unsupervised, adversarially guided W2S generalisation framework that: 1) generates adversarial examples for the strong student and inverse adversarial examples for the weak teacher; 2) re-weights teacher guidance using an entropy-based uncertainty score; 3) aligns the student on adversarial inputs to the teacher's refined predictions. The paper provides theoretical analysis (Theorems 1-3) to elucidate how low-entropy supervision encourages larger feature norms and margins, which underpin the improved robustness of the proposed method. Extensive experiments on 20 datasets across diverse tasks, including zero-shot classification, image captioning, VQA, and hallucination, show that Adv-W2S consistently improves both clean accuracy and adversarial robustness over strong baselines. Comprehensive ablations further validate the effectiveness of Adv-W2S.
Strengths and Weaknesses
Strengths:
- This paper demonstrates a clear problem formulation and empirical evidence that vanilla W2S cannot transfer robustness, even with a robust weak teacher.
- Adv-W2S is simple yet novel: coupling adversarial and inverse-adversarial views with entropy-based curriculum elegantly addresses noisy teacher signals.
- A practical plug-and-play replacement vision encoder for downstream VLMs.
- Extensive experiments and ablations empirically validate the efficacy of Adv-W2S across a wide range of tasks and benchmarks.
- The paper is clearly structured and easy to follow.
Weaknesses:
- The authors present several theoretical results; it is suggested to provide intuitions following each result to improve clarity.
- Table 11 shows that maximising the teacher-student gap outperforms student-only perturbations, but no intuition is provided beyond “better transfer”.
- Theorem 2 establishes an ℓ2 robustness radius; however, it seems that the empirical evaluations rely on ℓ∞ attacks. The theoretical bound may not be applicable to the evaluation setting without further clarification.
- It is suggested to provide computational cost analysis.
- Writing can be improved (See questions).
Questions
- How were the one-hot labels based on the strong model's predictions derived in Eq. (4)? (I.e., how was the threshold determined to derive the one-hot labels given the predictions of the student model?)
- All reported robustness results are against PGD/AutoAttack with a single ℓ∞ budget ε. It remains unclear how Adv-W2S performs under other perturbation types, such as larger ε, ℓ2-norm attacks, or text-prompt-based adversaries.
- Since the threshold in Eq. (6) is set to the median teacher entropy, what happens when teacher predictions are close to uniform?
- Can Adv-W2S handle open-set classes unseen by the teacher? How is the teacher pseudo-label defined in that case?
- It seems that Theorem 2 assumes the image encoder is globally L-Lipschitz under the ℓ2 norm. How does this assumption play out in practice when the method is implemented with a standard CLIP backbone, which is not Lipschitz-controlled?
- Several typos found: e.g., ‘Robutness’ in Section 3.2 Heading, ‘satifies’ in Line 202, ‘psedolabels’ in Line 222.
Limitations
The authors have adequately addressed the limitations. One suggestion to consider is discussing how the presented theoretical analysis relates to practice.
Formatting Concerns
No major formatting concerns.
Response to Reviewer uQC8
We thank the reviewer for constructive feedback.
1. (Weakness-1) Intuitions of theoretical results for clarity.
Absolutely. We will include clearer and intuitive explanations in the revision. Our theoretical results are intended to explain why entropy-guided re-weighting and inverse adversarial refinement improve robustness transfer. Intuitive remarks will be added after each theorem:
- Theorem 1 shows that low prediction entropy (i.e., a highly certain prediction) enforces a large vision feature norm, whereas features of high-entropy predictions can have small norms. The model is therefore demonstrably able to separate high-confidence samples from low-confidence ones. This is reminiscent of the organization of the feature space in hyperbolic encoders, where a small feature norm indicates high prediction uncertainty and a large norm indicates low uncertainty [a].
- Theorem 2 builds on Theorem 1, showing that larger feature norms lead to wider classification margins, improving robustness since small perturbations are less likely to change predictions.
- Theorem 3 explains that the loss between adversarial student and inverse adversarial teacher upper bounds that of the clean student, implying that improvements on adversarial samples also benefit clean predictions.
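Schematically (in our notation here, not the paper's exact statement), Theorem 3 can be read as

$$
\mathcal{L}\big(f_s(x),\, p_t(x)\big) \;\le\; \mathcal{L}\big(f_s(x^{\mathrm{adv}}),\, p_t(x^{\mathrm{inv}})\big),
$$

where $x^{\mathrm{adv}}$ denotes the student's adversarial input and $x^{\mathrm{inv}}$ the teacher's inverse adversarial input, so minimizing the alignment loss on the adversarial/inverse-adversarial pair also controls the clean alignment loss.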
These clarifications directly link the theory to our design choices, improving readability without sacrificing rigor.
[a] Nickel et al., Poincare embeddings for learning hierarchical representations, NeurIPS 2017
2. (Weakness-2) Intuition of maximising teacher-student gap (Table 11).
The rationale for maximizing the teacher–student prediction gap is to align adversary generation with weak-to-strong generalization:
- Target worst-case disagreements: Perturbations focus on where student predictions deviate most from the teacher, forcing learning from challenging cases.
- Consistent optimization: This keeps inner (adversary generation) and outer (parameter optimization) steps aligned toward reducing the gap rather than optimizing different objectives.
- Selective guidance: The student aligns with reliable teacher predictions while refining uncertain ones. In other words, rather than fully imitating the teacher, the student should maximize agreement with the teacher on stable predictions while relying on their own refinement when the teacher is uncertain. Optimizing adversaries that maximize the teacher–student gap directly exposes these disagreement regions, allowing the student to correct them and inherit both natural performance and robustness more effectively than student-only perturbations, thereby improving transfer performance.
We will add this explanation in the revised manuscript for clarity.
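For illustration, a minimal PGD-style sketch of this gap-maximizing adversary generation (our own variable names and placeholder hyperparameters, not the paper's exact settings) is given below:

```python
import torch
import torch.nn.functional as F


def gap_maximizing_attack(student, teacher, images, eps=4 / 255, alpha=1 / 255, steps=10):
    """PGD-style loop that perturbs inputs to maximize the teacher-student
    prediction gap (KL divergence) instead of the student's own loss.
    Hyperparameter names/values are illustrative placeholders.
    """
    delta = torch.zeros_like(images, requires_grad=True)
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(images), dim=-1)          # fixed teacher reference
    for _ in range(steps):
        student_logp = F.log_softmax(student(images + delta), dim=-1)
        gap = F.kl_div(student_logp, teacher_probs, reduction="batchmean")  # disagreement
        gap.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                      # ascend on the gap
            delta.clamp_(-eps, eps)                                 # stay in the l_inf ball
        delta.grad = None
    return (images + delta).detach()
```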
3. (Weakness-3) Theorem 2 provides an ℓ2 robustness bound, but evaluations use ℓ∞ attacks.
The bound in Theorem 2 was presented in the ℓ2 norm for notational simplicity, yet the Lipschitz continuity argument is norm-agnostic and directly extends to ℓ∞. Specifically, robustness bounds can be converted via standard norm inequalities, yielding an ℓ∞ robustness radius of at least r/√d from an ℓ2 radius r for d-dimensional inputs (see below). Therefore, our theoretical framework is fully compatible with ℓ∞-based robustness. We will clarify this generality in the revision to ensure consistency between the theory and the ℓ∞-based evaluations.
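Spelled out for d-dimensional inputs, the conversion follows from the standard inequality between the two norms:

$$
\|\delta\|_2 \;\le\; \sqrt{d}\,\|\delta\|_\infty
\quad\Longrightarrow\quad
\|\delta\|_\infty \le \tfrac{r}{\sqrt{d}} \;\Rightarrow\; \|\delta\|_2 \le r,
$$

so every ℓ∞ perturbation with budget at most r/√d lies inside the ℓ2 ball of radius r certified by Theorem 2.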
4. (Weakness-4) Computational cost analysis.
We report the average accuracy (%) across 14 datasets together with the training time per epoch in the table below. We also show that reducing the number of adversary generation steps to 5 further enhances efficiency while preserving most of the natural performance and robustness.
| Method | Clean | Robust | Time |
|---|---|---|---|
| TeCoA | 61.56 | 43.26 | 3.1 h |
| PMG | 64.46 | 45.74 | 4.3 h |
| FARE | 65.50 | 42.97 | 3.8 h |
| TGA | 62.11 | 45.19 | 4.0 h |
| Adv-W2S (10 Steps) | 68.75 | 48.38 | 5.3 h |
| Adv-W2S (5 Steps) | 66.49 | 47.13 | 2.9 h |
This demonstrates that Adv-W2S can maintain state-of-the-art robustness with substantially lower computational overhead and scales well to efficiency-sensitive settings.
5. (Question-1) How was the one‑hot label in Eq. 4 derived from strong model predictions?
In Eq. (4), the one-hot label is obtained by a simple argmax operation that converts the student's soft prediction into a one-hot vector by selecting the index of the maximum logit. No additional threshold is used. The one-hot label directly corresponds to the predicted class with the highest probability. We will clarify this notation in the revision.
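A minimal sketch of this conversion (variable names are ours, and `student_logits` is a hypothetical placeholder for the student's CLIP similarity logits):

```python
import torch
import torch.nn.functional as F

# `student_logits`: hypothetical (batch, num_classes) tensor of similarity logits.
student_logits = torch.randn(4, 10)
pred = student_logits.argmax(dim=-1)                                     # most likely class per sample
one_hot = F.one_hot(pred, num_classes=student_logits.size(-1)).float()   # one-hot pseudo-label, no threshold
```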
6. (Question-2) Robustness beyond ℓ∞ attacks (larger ε, ℓ2-norm attacks, or text-prompt-based adversaries)
We report average accuracy (%) under larger ℓ∞ perturbation radii ε in the table below:
| ε | Method | Clean | Robust |
|---|---|---|---|
| 6/255 | TeCoA | 52.31 | 25.78 |
| | PMG | 55.23 | 27.12 |
| | FARE | 55.63 | 26.35 |
| | TGA | 53.10 | 26.57 |
| | Adv-W2S | 58.25 | 30.18 |
| 8/255 | TeCoA | 47.45 | 20.88 |
| | PMG | 50.21 | 22.51 |
| | FARE | 50.94 | 21.59 |
| | TGA | 47.75 | 21.13 |
| | Adv-W2S | 54.83 | 26.07 |
We also validate the robustness against ℓ2 perturbations:
| Method | Clean | Robust |
|---|---|---|
| TeCoA | 59.26 | 41.57 |
| PMG | 62.20 | 42.95 |
| FARE | 62.90 | 42.17 |
| TGA | 60.11 | 41.83 |
| Adv-W2S | 66.33 | 46.28 |
Robustness against text-prompt-based adversaries can be found in Response 1 to Reviewer 7iVd. We show that our Adv-W2S method consistently outperforms others under diverse perturbation types/scenarios.
7. (Question-3) What happens when teacher predictions are close to uniform?
When teacher predictions are uniform (high entropy), the weighting function in Eq. (6) naturally degenerates toward a constant weight (0.5) across all instances. In this case, the teacher cannot provide informative guidance. If teacher predictions are close to uniform but still contain useful information, the weights can be post-processed by a sigmoid with a sharp slope to boost the teacher's signal. In our experiments, the weak teacher is adversarially fine-tuned and typically exhibits non-uniform predictions, so this scenario rarely occurs in practice. We will clarify this behavior in the revision.
8. (Question-4) Can Adv-W2S handle open-set classes unseen by the teacher? How is the teacher pseudo-label defined in that case?
In our current formulation, teacher–student alignment during training assumes a shared label space, where the teacher's pseudo-label distribution is defined as its predictions over this space via the cosine similarity between the image feature of each sample and the text features of the candidate categories. This enables efficient alignment without ground-truth labels, consistent with CLIP-style training.
For an open-set scenario during training, where class candidates may be unknown or unbounded, we envision three possible extensions:
- Large-scale virtual label space: Following the original CLIP pretraining approach, we can construct the pseudo-label distribution by computing cosine similarity between the image feature of each sample and a large set of text embeddings (e.g., descriptive phrases). While feasible, this approach incurs a higher computational cost; using a subset of the descriptive properties can partially improve computational efficiency.
- Prototype-based virtual labels: A more efficient variant is to cluster the text embedding space into representative prototypes, then define pseudo-labels via cosine similarity between image features and these prototypes, forming soft pseudo-label distributions for alignment (see the sketch after this list).
- Feature-level alignment: Alternatively, adversarial weak-to-strong generalization can be adapted to perform directly at the feature level, potentially with a projector to resolve dimensionality differences between weak and strong models.
While our current experiments focus on shared-label training, these strategies highlight that our framework can be extended to open-set training scenarios; we will clarify this point in the revision.
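As one concrete illustration of the prototype-based variant mentioned above (a sketch under our own assumptions: the function name, the simplified random-subset "clustering", and the temperature are placeholders rather than the paper's implementation):

```python
import torch
import torch.nn.functional as F


def prototype_pseudo_labels(image_feats, text_bank, num_prototypes=512, temperature=0.01):
    """Form soft pseudo-labels from image-prototype cosine similarity.

    `text_bank` is a large set of text embeddings (e.g., descriptive phrases);
    for simplicity the "clustering" here is a random subset instead of k-means.
    """
    idx = torch.randperm(text_bank.size(0))[:num_prototypes]
    prototypes = F.normalize(text_bank[idx], dim=-1)        # prototype text embeddings
    image_feats = F.normalize(image_feats, dim=-1)
    sims = image_feats @ prototypes.t()                     # cosine similarity to prototypes
    return F.softmax(sims / temperature, dim=-1)            # soft pseudo-label distribution
```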
9. (Question-5) Assumption of Global Lipschitzness in Theorem 2.
The global L-Lipschitz assumption was made for analytical clarity. Naturally, local Lipschitzness in the vicinity of the datapoints is sufficient for our margin–robustness argument. In Eq. (13), we align the predictions of perturbed inputs with the predictions of their clean counterparts, which implicitly teaches the CLIP student local Lipschitzness on the data, as observed in practice. However, since we assume Lipschitzness on the features rather than the predictions, we also verified results by imposing an explicit Lipschitz regularization term that enforces bounded variations in feature space (one plausible form is sketched after the table below). This approach directly optimizes the Lipschitz constraint and yields similar robustness trends (table below), confirming that the theoretical assumption is not restrictive.
| Method | Clean | Robust |
|---|---|---|
| Adv-W2S (w/ Lipschitzness) | 68.29 | 48.43 |
| Adv-W2S | 68.75 | 48.38 |
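One plausible instantiation of the feature-space Lipschitz penalty referenced above (the exact equation appears in the paper; this form is our assumption for illustration) is

$$
\mathcal{L}_{\mathrm{Lip}} \;=\; \mathbb{E}_{x}\!\left[\max\!\left(0,\; \frac{\big\|f_\theta(x+\delta)-f_\theta(x)\big\|_2}{\|\delta\|_2} \;-\; L\right)\right],
$$

which penalizes feature variations that exceed a target Lipschitz constant L in the neighborhood of each input.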
Moreover, our analysis focuses on explaining the mechanistic relation between margin and robustness rather than requiring exact constants. Although standard CLIP backbones are not explicitly Lipschitz-controlled, we empirically estimated the local gradient norm of ViT-L/14 on 1,000 random ImageNet samples and observed bounded values around 0.22–0.29. These small values indicate local Lipschitz behavior around natural data, making the approximation reasonable and consistent with the robustness trends confirmed by our experiments.
10. (Weakness-5) Writing. (Question-6) Several typos.
We will thoroughly proofread the manuscript and correct all typographical errors to improve overall clarity in the revised version.
I thank the authors for their detailed response. I remain positive about the paper and look forward to seeing the planned revision included in the final paper.
Esteemed Reviewer,
Thank you for engaging with our rebuttal, and for all valuable comments. Rest assured we will incorporate all your suggestions into the final paper.
Meantime, if there is anything else we can improve or answer, kindly do let us know.
Best regards,
Authors
Dear reviewers,
For those who have not responded yet, please take a look at the authors’ rebuttal and update your final scores.
Best wishes,
AC
There is a general consensus among the reviewers that the paper, in its current form, is suitable for publication. After careful consideration of the feedback and the paper itself, the recommendation is to accept this submission.