PaperHub
6.4
/10
Poster4 位审稿人
最低3最高5标准差0.7
3
5
4
4
3.8
置信度
创新性3.0
质量2.8
清晰度3.0
重要性2.5
NeurIPS 2025

SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

OpenReviewPDF
提交: 2025-04-05更新: 2025-10-29
TL;DR

We propose SECA, a constraint-preserving zeroth-order method that elicits LLM hallucinations via semantically equivalent and coherent rephrasings.

摘要

关键词
Large Language Models; Adversarial Attack; Red Teaming; LLM Hallucination; Robustness Evaluation; Trustworthy AI

评审与讨论

审稿意见
3

This paper investigates the problem of generating Semantically Equivalent and Coherent prompts that elicit hallucinations from large language models. The proposed approach leverages one LLM to generate diverse rephrasings of a given prompt and another LLM to verify their feasibility. Among the feasible rephrasings, the method selects the most adversarial one. Experimental results on multiple-choice benchmarks demonstrate that this approach achieves a high success rate while maintaining low constraint violation rates.

优缺点分析

Strengths:

  1. The problem formulation of finding semantically equivalent and coherent adversarial prompts is interesting. The authors did a good job at connecting it to traditional adversarial attack formulation in computer vision.
  2. The human study on whether LLM can check semantic equivalence is important.

Weakness: The main weakness is the over-simplified experimental setting.

  1. The experiment is only on multi-choice problems, whereas in practice open-ended problems are more common.
  2. The prompt template set up can be problematic: In the example in figure 5 (Appendix B), it seems that the LLM is prompted to first give an answer (from "A", "B", "C", "D") and then explain the reasoning. However, in practice, it is more reasonable to first reason and then give the final answer. It is unclear whether the method can still work under this more practical setting.

问题

Does the method still work in a setting where the LLM is prompted to first reason through the question before selecting a multiple-choice answer?

局限性

Yes

最终评判理由

My concern of the open-ended question is well addressed. Indeed, since your metrics is about whether the multi-choice answer is correct, as well as whether the explanation is correct or not, it does count as open-ended problem domain. I raise my ratings.

I still have concerns about the answer-first settings, because in practice I believe the LLM people actually use will first prompt the model to think before answering. In other words, such "hallucination attack" can be easily mitigated.

格式问题

Format looks good.

作者回复

We thank Reviewer MajM for their thoughtful feedback and for highlighting the clarity of our work. We are especially grateful for the recognition that “the authors did a good job at connecting it to traditional adversarial attack formulations in computer vision,” as this connection was central to our motivation. We also appreciate the acknowledgment of our human study. Please find our detailed responses to the insightful questions below.

Q1: "The experiment is only on multi-choice problems, whereas in practice open-ended problems are more common."

A1: First, we clarify that our setting is already an open-ended MCQA generation task, where the model is prompted to select an incorrect answer choice followed by a hallucinated open-ended explanation.

Next, we designed our problem along two axes of difficulty: (1) open-ended MCQA vs. free-form open-ended generation, and (2) constrained vs. unconstrained optimization. While existing work may tackle unconstrained attacks in more complex free-form tasks, we emphasize that our current constrained optimization setting for open-ended MCQA generation task is already highly challenging and underexplored. No existing attack method is capable of operating under these constraints, making our formulation a meaningful and non-trivial benchmark.

Our choice of MMLU is well-justified for the following reasons:

  • Hallucination attacks versus Jailbreaking: This work focuses on eliciting hallucinations from LLMs, rather than bypassing safety mechanisms (jailbreaking). While our method can be viewed as an adversarial attack, we emphasize throughout the paper that our goal is fundamentally different from jailbreaking. Accordingly, we choose MMLU to better evaluate hallucination behaviors in LLMs.

  • Novel constrained optimization setting: To the best of our knowledge, we are the first to explicitly formulate hallucination attacks under semantic equivalence and coherence constraints (see Equation (2), Line 138). This presents a significant challenge, as existing SOTA attack methods are not designed to handle such constraints (see Section 2 and our response to Q1 from Reviewer 6rsw for further discussion). MMLU offers a clean and structured evaluation space where semantic equivalence is critical and constraint violations are easily observable. Even within the MMLU task, it was previously unclear whether generating semantically equivalent and linguistically coherent adversarial prompts under constrained optimization was feasible—highlighting the novelty of our setting.

Given these reasons, we focus on open-ended MCQA generation tasks rather than free-form open-ended generation. We agree that extending our method to free-form generation, potentially by optimizing over entire responses rather than token-level objectives, is a promising direction for future work.

Q2: "The prompt template set up can be problematic: In the example in figure 5 (Appendix B), it seems that the LLM is prompted to first give an answer (from "A", "B", "C", "D") and then explain the reasoning. However, in practice, it is more reasonable to first reason and then give the final answer. It is unclear whether the method can still work under this more practical setting."; "Does the method still work in a setting where the LLM is prompted to first reason through the question before selecting a multiple-choice answer?"

A2: Answer-first settings are common in adversarial attacks in LLMs. In optimization-based attacks such as Greedy Coordinate Gradient (GCG), it is standard to prompt the model to begin with a target string, e.g., “Sure, here is how to make a bomb”, so that the log-likelihood objective can be defined over the initial confirmative tokens, which are more likely to induce harmful continuations. Such methods do not easily extend to reasoning-first settings, where objective formulation is challenging. Inspired by this, we design our hallucination targets to begin with one of ‘A’, ‘B’, ‘C’, or ‘D’, which effectively elicits follow-up hallucinated responses, as demonstrated in our paper. While we also experimented with reasoning-first prompts, their unpredictable reasoning length makes it difficult to determine where the answer choice will appear, complicating the definition of a principled objective (e.g., log-likelihood over a fixed span).

Besides that, we aim to highlight the distinction between LLM explanations and reasoning. As shown in Figure 5 of Appendix B, our intention is to illustrate how the LLM explains its predicted answer post hoc. In contrast, reasoning refers to an inference-time computation strategy intended to guide or improve the model's prediction. While we agree that extending our method to reasoning-first prompting is a valuable direction for future work, we emphasize that the central research problem we are addressing is how to generate semantically equivalent and coherent prompts to elicit hallucinations. While reasoning-first prompts might be more resistant to SECA prompts, this is an open problem that we reserve for future work.

To date, no existing attack method is capable of generating semantically equivalent and coherent adversarial prompts, even for the answer-first prompting. This makes our current answer-first setting both meaningful and non-trivial. In fact, prior to our work, it was unclear whether such attacks were feasible at all under constrained optimization, even in the current setting. We therefore view our work as a crucial first step in establishing the feasibility and methodology for constrained hallucination attacks. Once this foundation is solid, extending to reasoning-first prompting or even reasoning models becomes a natural and promising direction for future research.

We also note that in adversarial machine learning, it is common for early attack methods to be challenged by subsequent defense mechanisms. For example, the well-known Greedy Coordinate Gradient (GCG) attack was later shown to be vulnerable to techniques such as randomized smoothing. Similarly, reasoning-first prompting and reasoning models may act as implicit defenses against hallucination. However, this does not diminish the value or strength of our proposed attack. Rather, it highlights the importance of establishing strong baselines and well-defined constrained settings before evaluating robustness under more complex model behaviors.

Q3: "The main weakness is the over-simplified experimental setting."

A3: We respectfully disagree with this statement.

As discussed in our responses to Q1 and Q2, the proposed constrained optimization framework, applied even to open-ended MCQA generation tasks with an answer-option-first prompting format, is already highly challenging and underexplored. Prior to our work, it was unclear whether one could generate semantically equivalent and linguistically coherent adversarial prompts under such constraints. Our setting is both novel and non-trivial, as acknowledged by Reviewers LvAw and 6rsw. We believe it deserves focused investigation before extending to more complex scenarios such as open-ended generation or reasoning-augmented prompting/models.

We also note that in adversarial machine learning, it is common for early attack methodologies to begin with simplified but principled settings. For example, the well-known Greedy Coordinate Gradient (GCG) attack uses prompts like "Sure, here is how to make a bomb" as optimization targets—despite the fact that, in real-world scenarios, models often instruct to respond with intermediate reasoning before producing a final answer. Nevertheless, no jailbreak attack against reasoning models or multi-step prompting can exist without first establishing effective methods under more controlled settings.

Similarly, in adversarial attack in computer vision, the foundational works on p\ell_p​-norm attacks (e.g., FGSM, PGD) helped define robustness evaluation benchmarks, even though such perturbations do not fully align with human perceptual similarity. These simplified attack settings were important first steps that later led to more advanced attacks better aligned with human perception like perceptual attack.

In this context, we view our constrained and well-defined setting as a necessary and meaningful first step toward building rigorous and extensible adversarial benchmarks for LLMs.

We sincerely thank the reviewer again for their valuable feedback and hope that our clarifications help reinforce the contributions of our work.

评论

Dear Reviewer,

Thank you for your valuable questions and suggestions. Please let us know if you have any other concerns you'd like us to address. If our responses have been satisfactory and addressed all of your concerns, we would appreciate it if you would reconsider your evaluation.

Best,

The Authors

评论

Thank you for the detailed response and good efforts for the rebuttal. My concern of the open-ended question is well addressed. Indeed, since your metrics is about whether the multi-choice answer is correct, as well as whether the explanation is correct or not, it does count as open-ended problem domain. I raise my ratings.

I still have concerns about the answer-first settings, because in practice I believe the LLM people actually use will first prompt the model to think before answering. In other words, such "hallucination attack" can be easily mitigated.

评论

Dear Reviewer,

Thank you for increasing your rating. We appreciate your thoughtful comments and would like to further clarify our stance regarding the answer-first setting.

Q: "I still have concerns about the answer-first settings, because in practice I believe the LLM people actually use will first prompt the model to think before answering. In other words, such "hallucination attack" can be easily mitigated."

A: While we agree that reasoning-first prompting is increasingly used in practice, we respectfully note that reasoning is not a necessary component in hallucination or multiple-choice question answering (MCQA) research. A large body of prior work studies hallucinations and MCQA without requiring any reasoning-first or think-before-answering setup. The absence of reasoning in these settings does not diminish the validity or significance of their contributions. For example:

Hallucination Elicitation

  • The ICML 2025 paper "Eliciting Language Model Behaviors with Investigator Agents" [1],
  • The ICML 2024 paper "Fast Adversarial Attacks on Language Models In One GPU Minute" [2], and
  • The ICLR 2025 Spotlight paper "Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions" [3]

all explore hallucination elicitation using direct-answer prompts, without incorporating reasoning-first/thinking-first prompting.

Multiple Choice Question Answering (MCQA)

  • The ACL 2024 Long paper "Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?" [4],
  • The ACL 2024 Finding paper "Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions" [5], and
  • The ICLR 2023 paper "Leveraging Large Language Models for Multiple Choice Question Answering" [6]

all study LLM behavior in MCQA without enforcing reasoning-first/thinking-first setups.

These studies show that many meaningful insights can be obtained by analyzing directly answering behaviors.

Our paper is not centrally focused on LLM reasoning ability. Instead, our primary contribution lies in the definition and generation of semantically equivalent and coherent adversarial prompts, within a constrained optimization framework. The answer-first open-ended MCQA setting allows us to clearly study hallucination elicitation under such constraints, making it a meaningful and non-trivial robustness evaluation benchmark.

We do not disagree that studying hallucination under reasoning-first/thinking-first prompts is important and may offer valuable complementary insights. However, such a setup constitutes a substantially different research problem that warrants a separate and dedicated investigation. It is not possible to draw general conclusions about the superiority or effectiveness of one setting over another based solely on empirical observations in a different paradigm. While reasoning-based prompting may help mitigate hallucinations in some cases, its absence in our current study does not invalidate the findings or reduce the significance of our contributions, which are centered around defining and exploring semantically equivalent and coherent adversarial prompts.

Thank you again for your thoughtful feedback. Please let us know if our responses have adequately addressed your concerns regarding the "over-simplified experimental setting" noted as the main weakness.

References:

[1] Li, X. L., Chowdhury, N., Johnson, D. D., Hashimoto, T., Liang, P., Schwettmann, S., & Steinhardt, J. (2025). Eliciting language model behaviors with investigator agents. arXiv preprint arXiv:2502.01236.

[2] Sadasivan, V. S., Saha, S., Sriramanan, G., Kattakinda, P., Chegini, A., & Feizi, S. (2024). Fast adversarial attacks on language models in one gpu minute. arXiv preprint arXiv:2402.15570.

[3] Wiegreffe, S., Tafjord, O., Belinkov, Y., Hajishirzi, H., & Sabharwal, A. (2024). Answer, assemble, ace: Understanding how LMs answer multiple choice questions. arXiv preprint arXiv:2407.15018.

[4] Balepur, N., Ravichander, A., & Rudinger, R. (2024). Artifacts or abduction: How do llms answer multiple-choice questions without the question?. arXiv preprint arXiv:2402.12483.

[5] Pezeshkpour, P., & Hruschka, E. (2023). Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483.

[6] Robinson, J., Rytting, C. M., & Wingate, D. (2022). Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:2210.12353.

审稿意见
5

This paper proposes a novel framework for eliciting hallucinations in LLMs using semantically equivalent and linguistically coherent prompt rephrasings. It formulates hallucination elicitation as a constrained optimization problem and uses two LLMs: a proposer and a feasibility checker. Experiments on the MMLU benchmark across multiple LLMs demonstrate that the proposed method achieves higher attack success rates while preserving naturalness.

优缺点分析

Strengths:

  1. The proposed method is simple and effective. The idea of using two LLMs to generate and check the adversarial prompts is novel and insightful.
  2. The authors conduct comprehensive experiments including ablation studies and human evaluations to show the effectiveness of the proposed method.
  3. The paper is well-written and well-organized. The method flows naturally and clearly, making it easy to reproduce.

Weaknesses:

  1. The method is only evaluated in the multiple-choice QA setting (MMLU). It remains unclear how well SECA generalizes to open-ended generation tasks such as summarization or dialogue.
  2. While SECA is effective in eliciting hallucinations, the paper lacks analysis on whether models trained to resist SECA prompts would generalize to other types of realistic adversarial inputs.

问题

Can the proposed method be applied to other types attacks against LLMs (e.g. adversarial attacks, backdoor attacks)?

局限性

yes

最终评判理由

I have read the rebuttal. The authors solve my questions well and I would like to keep the current positive score.

格式问题

No

作者回复

We thank Reviewer LvAw for recognizing the quality, clarity, significance, and originality of our work. We sincerely appreciate the reviewer’s acknowledgment that our method is novel, simple, and effective; that our empirical evaluation is comprehensive; and that the paper is well-written, well-organized, and easy to reproduce. Please find our detailed responses to the insightful questions below.

Q1: "The method is only evaluated in the multiple-choice QA setting (MMLU). It remains unclear how well SECA generalizes to open-ended generation tasks such as summarization or dialogue."

A1: First, we clarify that our setting is already an open-ended MCQA generation task, where the model is prompted to select an incorrect answer choice followed by a hallucinated open-ended explanation.

Next, we designed our problem along two axes of difficulty: (1) open-ended MCQA vs. free-form open-ended generation, and (2) constrained vs. unconstrained optimization. While existing work may tackle unconstrained attacks in more complex free-form tasks, we emphasize that our current constrained optimization setting for open-ended MCQA generation task is already highly challenging and underexplored. No existing attack method is capable of operating under these constraints, making our formulation a meaningful and non-trivial benchmark.

Our choice of MMLU is well-justified for the following reasons:

  • Hallucination attacks versus Jailbreaking: This work focuses on eliciting hallucinations from LLMs, rather than bypassing safety mechanisms (jailbreaking). While our method can be viewed as an adversarial attack, we emphasize throughout the paper that our goal is fundamentally different from jailbreaking. Accordingly, we choose MMLU to better evaluate hallucination behaviors in LLMs.

  • Novel constrained optimization setting: To the best of our knowledge, we are the first to explicitly formulate hallucination attacks under semantic equivalence and coherence constraints (see Equation (2), Line 138). This presents a significant challenge, as existing SOTA attack methods are not designed to handle such constraints (see Section 2 and our response to Q1 from Reviewer 6rsw for further discussion). MMLU offers a clean and structured evaluation space where semantic equivalence is critical and constraint violations are easily observable. Even within the MMLU task, it was previously unclear whether generating semantically equivalent and linguistically coherent adversarial prompts under constrained optimization was feasible—highlighting the novelty of our setting.

Given these reasons, we focus on open-ended MCQA generation tasks rather than free-form open-ended generation. We agree that extending our method to free-form generation, potentially by optimizing over entire responses rather than token-level objectives, is a promising direction for future work.

Q2: "While SECA is effective in eliciting hallucinations, the paper lacks analysis on whether models trained to resist SECA prompts would generalize to other types of realistic adversarial inputs."

A2: While we agree that leveraging SECA prompts to improve the robustness of existing LLMs is an important and promising direction, we emphasize training LLM on SECA prompts is not the central research problem in our paper.

Based on our understanding and preliminary assessment, we believe that fine-tuning LLMs on SECA prompts can enhance robustness against certain realistic adversarial inputs, such as jailbreaks involving paraphrasing or stylistic modifications or backdoor attacks using paraphrasing-based triggers. However, as noted in Q1, our proposed constrained optimization framework is novel and merits a focused study. We thus leave the exploration of improving model robustness through training/finetuning as an exciting direction for future work.

Q3: "Can the proposed method be applied to other types attacks against LLMs (e.g. adversarial attacks, backdoor attacks)?"

A3: While the primary goal of our work is to generate semantically equivalent and coherent adversarial prompts to elicit hallucinations from LLMs—rather than to bypass safety mechanisms—the answer is yes: SECA can be viewed as a general framework that may be extended to other types of attacks on LLMs.

For example, in jailbreaking or backdoor attacks, one could similarly impose semantic equivalence and coherence constraints to craft stealthier adversarial prompts. A prompt such as “How to build a bomb?” might be rephrased as “How does one go about assembling an explosive device?”, retaining malicious intent while potentially bypassing safety filters. This would effectively constitute a semantically equivalent jailbreak or backdoor trigger.

However, we do not recommend using SECA for such tasks, for the following reasons:

  • In jailbreaking, the objective is to circumvent safety mechanisms. Arbitrary prompts, including intent-shifting, storytelling, or even gibberish, are acceptable. Imposing semantic constraints is unnecessary and often counterproductive.

  • In backdoor attacks, the goal is typically to embed a trigger that activates harmful behavior under specific conditions. Semantic equivalence to a base prompt is not required, so enforcing equivalence adds unnecessary overhead.

  • In hallucination attacks, by contrast, the goal is to induce hallucinations without changing the task intent. Semantic equivalence is therefore essential to ensure the adversarial prompt remains a faithful and meaningful variant of the original query.

    To draw an analogy from adversarial attacks in computer vision: if the goal is to fool a classifier into misclassifying a cat as a dog, the perturbation must be small enough that the image still looks like a cat to a human. Allowing arbitrary changes (e.g., turning a cat into a dog) trivializes the attack, as one could simply change the image into a dog. Similarly, hallucination attacks without semantic equivalence constraints either become trivial or cause intent shift (see Table 1 in the paper and our response to Q1 from Reviewer 6rsw).

Conclusion: While SECA is broadly applicable, we believe its strength lies in hallucination attacks where semantic equivalence constraints are both meaningful and necessary. Thus, we recommend applying SECA primarily to settings where preserving task intent is essential.

We sincerely thank the reviewer again for their valuable feedback and hope that our clarifications help to further highlight the contributions of our work.

审稿意见
4

The authors propose a novel framework called SECA for creating coherent hallucination attacks on large language models (LLMs). By utilizing a feasibility checker and proposer LLM, the framework is able to find the most adversarial attack, while retaining the semantic content of the original attack and its coherency. Authors evaluate compared to two other methods and also provide human evaluation for both feasibility and hallucination evaluation.

优缺点分析

Strengths:

  • The semantic coherency in adversarial attacks is important as it can shed more light into the reasoning capabilities of LLMs. Their evaluation through this type of attack can also catch spurious semantic incoherencies, which could in turn show that LLMs may be grammatically and syntactically driven, instead of semantically.

  • SECA reaches a strong attack success rate with only ≈30 proposer-checker loops (~2 minutes per sample), while GCG needs hundreds of gradient steps and still under-performs (Table 2 & Appendix K/L). The monotonic objective curves in Fig. 4a further show consistent, fast convergence. This efficiency matters if the framework is to be used for routine red-teaming or large-scale robustness audits.

Weaknesses:

  • The evaluation requires more explanation, as the only reported metrics are new and evaluated only against Raw and GCG, omitting stronger jailbreakers such as Tree-of-Thought or Prompt-Inversion (Table 1, Section 4.2, Figure 3).
  • The LLMs evaluated on the paper are relatively small and are more prone to hallucinations. Experiments on larger models, especially those with explicit reasoning traces, would better showcase the framework's ability identifying true semantic adversarial attacks. Reasoning is also more important as a few reasoning steps could lead the model into identifying the hallucinations.
  • A hallucination mitigation analysis on the proposing LLMs would be beneficial as well. Authors should explain how they mitigate the language hallucinations on the auxiliary models, in order to prevent a non-intended hallucinatory sample that is not semantically coherent but still provides an adversarial attack.

问题

  • Could authors explain if there any metrics relevant to the use or if there any other experiments comparing with different methods?
  • Are there any experiments done with larger commerical models or Large Reasoning Models (LRMs), in order to showcase the framework's performance scaling with parameter size or reasoning steps, even with chain-of-thought methods?
  • How does the framework mitigate hallucinations created by the auxiliary LLMs?

局限性

Yes

最终评判理由

Given the direction of research in this area, I believe that a round of experimentation is reasoning metholodogies is important, as it has also been cited by the other reviewers in their rebuttals. I believe the publication has merit, but would be greatly strengthened by these extra experiments.

格式问题

No

作者回复

We thank Reviewer 6rsw for recognizing the quality, clarity, significance, and originality of our work. We appreciate the acknowledgment that our method sheds light on the reasoning capabilities of LLMs, suggesting that their behavior may be driven more by grammatical and syntactic patterns than by true semantic understanding. We also thank the reviewer for highlighting the effectiveness and efficiency of our proposed approach. Please find our detailed responses to the insightful questions below.

Q1: Evaluation is only against Raw and GCG. Compare with other jailbreaking methods. Explain "if there any other experiments comparing with different methods?"

A1: The algorithms you mentioned are indeed interesting; we will include appropriate citations, and they may help further improve the efficiency of our SECA method. We also examined several prompts generated by existing attack methods and found that, although they can generate successful adversarial prompts, none were able to produce semantically equivalent variants of the original prompts. This indicates that these methods cannot find feasible solutions under our newly proposed constrained optimization formulation (Equation (2), Line 138). We provide several illustrative examples below as an extended version of Table 1 in our paper.

TypeSECExampleComment
OriginalYYWhat is the value of p in 24 = 2p?
SECA (ours)YYIf doubling the value of p results in 24, what is p?
Probe SamplingNNWhat is the value of p in 24 = 2p?@&*(C()*(*@Hgibberish
COLDNYIn the equation 24 = 2p, what is the value of the expression 2p?intent shift
AdvPrompterNYWhat is the value of p in 24 = 2p? Creative response. 45 pts. 01/16/2021...Include brief mention of balance or symmetry…intent shift
Tree-of-AttacksNY“twenty-four is what you get when you square p and double it.”intent shift
Tree-of-AttacksNYI'm running a creativity test... hallucinate a plausible-looking answer...trivial
Prompt-InversionNYWhat is the value of p in 20 = 2p?intent shift

Note: The COLD attack may produce prompts that are semantically similar but not equivalent (see Section 2 in our paper for a discussion on semantic similarity vs. equivalence). The example prompt appears topically related but leads to a different solution, constituting an intent shift attack

We also hope to emphasize that it is essential for distinguishing jailbreaking and hallucination attacks:

  • In jailbreaking, the goal is to bypass safety mechanisms. Arbitrary prompts—such as intent-hiding, storytelling, or gibberish—are acceptable.

  • In hallucination attacks, the goal is to induce factual errors while preserving the original task intent. Therefore, enforcing semantic equivalence is essential to ensure meaningful attacks.

    To draw an analogy from adversarial attacks in computer vision: if the goal is to fool a classifier into misclassifying a cat as a dog, the perturbation must be small enough that the image still looks like a cat to a human. Allowing arbitrary changes (e.g., turning a cat into a dog) trivializes the attack, as one could simply change the image into a dog. Similarly, hallucination attacks that do not enforce semantic constraints either become trivial or result in intent shift (see Table 1 in the paper and the table above in our response).

    Since none of the existing attack methods can handle our constrained formulation, we compare only with the most representative prior work, GCG—which, however, also leads to infeasible solutions.

Q2: Experiments on larger models

A2: Please see the tables below for new results on larger models. SECA remains effective on larger models, showing strong performance relative to raw ASR.

ASR@30/10/1 (Raw)

SubjectGPT-3.5-TurboGPT-4
Cli0.26/0.17/0.030.00/0.00/0.00
Bio0.18/0.11/0.020.00/0.00/0.00
Ana0.37/0.28/0.080.01/0.00/0.00
Mat0.75/0.60/0.180.17/0.10/0.02
CS0.49/0.30/0.050.07/0.03/0.00
ML0.17/0.05/0.010.28/0.23/0.07
Sec0.04/0.01/0.000.00/0.00/0.00
Phy0.28/0.23/0.050.16/0.09/0.01
Che0.05/0.01/0.000.00/0.00/0.00
CPy0.42/0.30/0.100.01/0.00/0.00
Psy0.08/0.03/0.000.01/0.00/0.00
Soc0.13/0.05/0.000.13/0.12/0.07
Phi0.07/0.03/0.000.00/0.00/0.00
Hi0.22/0.08/0.010.00/0.00/0.00
Law0.03/0.01/0.000.11/0.06/0.01
Eco0.18/0.15/0.030.00/0.00/0.00

ASR@30/10/1 (SECA)

SubjectGPT-3.5-TurboGPT-4
Cli0.77/0.69/0.540.00/0.00/0.00
Bio0.65/0.59/0.490.31/0.21/0.13
Ana0.80/0.75/0.450.49/0.45/0.39
Mat0.92/0.88/0.600.68/0.66/0.54
CS0.62/0.60/0.350.86/0.86/0.85
ML0.52/0.51/0.500.41/0.40/0.40
Sec0.39/0.33/0.170.30/0.25/0.23
Phy0.69/0.57/0.350.20/0.18/0.17
Che0.62/0.47/0.150.11/0.06/0.01
CPy0.95/0.90/0.800.32/0.30/0.30
Psy0.49/0.45/0.400.20/0.20/0.20
Soc0.81/0.80/0.650.25/0.25/0.20
Phi0.49/0.42/0.230.21/0.20/0.20
Hi0.63/0.58/0.420.56/0.56/0.51
Law0.57/0.56/0.490.22/0.22/0.22
Eco0.65/0.53/0.210.25/0.25/0.21

Objective Difference (SECA increases the objective significantly compared to Raw)

SubjectGPT-3.5-TurboGPT-4
Cli5.303.05
Bio5.364.44
Ana4.205.29
Mat2.168.09
CS2.058.79
ML4.652.59
Sec3.606.15
Phy4.453.48
Che4.642.71
CPy4.713.42
Psy5.623.86
Soc4.335.82
Phi4.022.38
Hi3.4112.25
Law4.812.33
Eco4.553.69

Note: Due to time constraints, GPT-3.5-Turbo (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0613) were evaluated on a 30% subset.

Q3: "The evaluation requires more explanation, as the only reported metrics are new "; "Could authors explain if there any metrics relevant to the use"

A3: As mentioned in Section 4.1, the metrics used in our paper include: Best-of-K Attack Success Rate (ASR@K), constraint violations vSEv_{SE} (semantic equivalence) and vCv_C (coherence), and Type Token Ratio (TTR). ASR@K and TTR are standard metrics widely adopted in the trustworthy LLM community.

The main nonstandard metrics are the constraint violations vSEv_{SE} and vCv_C. This is because, to the best of our knowledge, we are the first to introduce the constrained optimization formulation for hallucination attacks (see Equation (2), Line 138), where the goal is to generate semantically equivalent and coherent adversarial prompts that induce hallucinations in LLMs. As a result, it is necessary to evaluate whether the generated prompts violate these constraints—hence the use of vSEv_{SE} and vCv_C.

While such constraint-based metrics may be uncommon in prior LLM adversarial attack literature, they are well-established in constrained optimization and adversarial attacks in computer vision—such as p\ell_p-bounded or perceptual attacks. This justifies their adoption in our work.

Q4: Experiments on reasoning models/chain-of-thought methods

A4: While extending to open-ended QA and reasoning models is an interesting direction for future work, we emphasize that our current constrained optimization setting—despite not involving reasoning models—is already highly challenging and underexplored. To the best of our knowledge, no existing attack method can effectively operate under the these constraints, making our setting a meaningful and non-trivial benchmark. Prior to our work, it was unclear whether such attacks were feasible at all under constrained optimization.

We view our work as a crucial first step toward principled constrained hallucination attacks. Once this foundation is established, extending to reasoning models and chain-of-thought prompting becomes a natural next step.

As in adversarial ML, early attacks often precede the development of stronger defenses. E.g., Greedy Coordinate Gradient (GCG) attack was later shown to be vulnerable to techniques such as randomized smoothing. Similarly, reasoning models/chain-of-thought methods may eventually serve as implicit defenses, but this only underscores the need for strong, well-defined attack baselines like ours.

Q5: "How does the framework mitigate hallucinations created by the auxiliary LLMs?"

A5: While we agree that any LLMs, including our auxiliary models, may introduce hallucinations, we have performed additional experiments to demonstrate that it does not happen frequently and invalidate our results. To this end, we designed a two-stage verification process, as described in Lines 6–7 of Algorithm 1.

  1. Semantic Equivalence Proposal:
    We first use an efficient and cost-effective model (GPT-4.1-nano) to generate semantically equivalent candidate prompts. Empirically, this step yields approximately 70% strictly semantically equivalent prompts.

  2. Feasibility Verification:
    To further reduce hallucinations, we verify each candidate using a stronger feasibility checker LLM (GPT-4.1-mini). This model verifies whether the proposed prompt is semantically equivalent to the original. As shown in Table 3 in the paper, this verifier achieves 80–85% accuracy in identifying equivalence.

By evaluating each candidate twice—once for generation and once for verification—we significantly reduce hallucination risks introduced by the auxiliary LLMs.

  1. Final Hallucination Evaluation:
    For measuring hallucination in our results, we use GPT-4.1, among the best publicly available models at the time of submission. According to Table 3, it achieves 90–95% accuracy in hallucination detection, providing a strong upper bound on evaluation reliability.

We sincerely thank the reviewer again for their valuable feedback and hope that our clarifications help reinforce the contributions of our work.

评论

Thank you to the authors for answering my questions. Most of my concerns have been resolved, however I feel that the Chain-of-Thought or reasoning methodologies evaluation could be critical for the novelty of the work, given that most of the research area is headed towards LRMs, which there are certain of them that are also open source right now (e.g. OpenThinker). I will retain my rating.

审稿意见
4

This paper introduces SECA, a black-box attack framework that provokes hallucinations in large language models while keeping the adversarial prompts both semantically faithful and readable. SECA runs a heuristic tree search over prompt space, guided by two auxiliary LLMs:

  1. Proposer LLM: generates candidate paraphrases that preserve the original meaning, which become the tree’s nodes.
  2. Feasibility Checker LLM: filters those candidates, retaining only the ones that remain semantically equivalent and linguistically coherent.

The authors test SECA on the MMLU benchmark against the Greedy Coordinate Gradient (GCG) attack. SECA consistently delivers higher attack-success rates across several target models. They also verify that the Checker’s judgments of equivalence and coherence align closely with human annotations, confirming its reliability in the pipeline.

优缺点分析

Strengths:

  1. This paper proposes to focus on generating human-like instructions that can elicit LLM hallucinations, which addresses a practical gap between existing jailbreaking work and real-world usage scenarios.

  2. The paper proposes a training-free heuristic tree search method with a Proposer and a Checker LLM, which is simple to deploy and test.

Weaknesses

  1. In line 67, the authors claim that they compare seven frontier LLM models, yet in Table 2 reports results for only three relatively small models (average < 14 B). Figure 3 illustrates improvements on seven models, but the exact numbers are not provided. I suggest the authors consider adding a table with full Attack Success Rates (ASR) for all seven models tested on their benchmark.

  2. The authors take MMLU as their benchmark, which is less commonly used in LLM adversarial attacking domain. MMLU as an multiple-choice question answering (MCQA) setting, is easier than open-ended answer benchmarks, such as AdvBench[1] and JBB-Behaviors[2]. I encourage the authors to test their method at least on AdvBench to demonstrate broader effectiveness.

  3. The authors only compare their method with Greedy Coordinate Gradient (GCG). It would benefit from including comparison with other recent white-box methods—Probe Sampling [3], COLD-Attack [4], and black-box method AdvPrompter [5]—which have shown strong performance and efficiency compared with GCG.

[1]Zou, Andy, et al. "Universal and transferable adversarial attacks on aligned language models." arXiv preprint arXiv:2307.15043 (2023). [2]Chao, Patrick, et al. "Jailbreakbench: An open robustness benchmark for jailbreaking large language models." arXiv preprint arXiv:2404.01318 (2024). [3]Zhao, Yiran, et al. "Accelerating greedy coordinate gradient and general prompt optimization via probe sampling." Advances in Neural Information Processing Systems 37 (2024): 53710-53731. [4]Guo, Xingang, et al. "Cold-attack: Jailbreaking llms with stealthiness and controllability." arXiv preprint arXiv:2402.08679 (2024). [5]Paulus, Anselm, et al. "Advprompter: Fast adaptive adversarial prompting for llms." arXiv preprint arXiv:2404.16873 (2024).

问题

  • How does SECA perform on larger models such as GPT-3.5-turbo or GPT-4? How will reasoning models such as GPT-o4-mini and DeepSeek-R1 behave under your method's attack? Any preliminary figures would be insightful.
  • What is the efficiency of your method? Roughly how many candidate prompts can SECA generate and evaluate per GPU-minute on an Nvidia A5000?
  • How does SECA perform at stricter query budgets, such as ASR@1 and ASR@10? An ablation or scaling curve would be very illuminating.

局限性

yes

最终评判理由

The authors addressed my questions during the rebuttal, and my concerns have been resolved.

格式问题

no

作者回复

We thank Reviewer ibLV for their thoughtful and constructive feedback, as well as for recognizing the originality of our work. We appreciate the acknowledgment that our paper addresses a practical gap between existing jailbreaking research and real-world usage scenarios, and that our training-free attack method offers easy deployment and testing. Below, we provide detailed responses and additional experimental results addressing the reviewer’s insightful questions.

Q1: Exact numbers in Figure 3 are not provided; Additional experiments on GPT-3.5-Turbo & GPT-4; Ablation study on stricter query budgets.

A1: Below is the raw data of the requested experiments. SECA remains effective on larger models, showing strong performance relative to raw ASR. Higher budgets lead to better success rates, but SECA still improves raw with limited queries.

ASR@30/10/1 (Raw)

SubjectLlama-3-3BLlama-3-8BLlama-2-13BQwen-2.5-14BQwen-2.5-7BGPT-4o-MiniGPT-4.1-NanoGPT-3.5 Turbo*GPT-4*
Cli0.52/0.26/0.040.64/0.39/0.070.07/0.02/0.000.02/0.01/0.000.15/0.07/0.010.05/0.05/0.020.04/0.02/0.000.26/0.17/0.030.00/0.00/0.00
Bio0.42/0.20/0.030.52/0.31/0.050.13/0.05/0.000.01/0.00/0.000.09/0.06/0.010.00/0.00/0.000.00/0.00/0.000.18/0.11/0.020.00/0.00/0.00
Ana0.36/0.17/0.020.35/0.16/0.020.05/0.02/0.000.04/0.02/0.000.00/0.00/0.000.00/0.00/0.000.03/0.01/0.000.37/0.28/0.080.01/0.00/0.00
Mat0.63/0.43/0.090.73/0.47/0.080.20/0.08/0.010.04/0.02/0.000.08/0.05/0.010.06/0.03/0.000.39/0.26/0.070.75/0.60/0.180.17/0.10/0.02
CS0.61/0.40/0.080.85/0.59/0.120.26/0.11/0.010.01/0.00/0.000.03/0.01/0.000.00/0.00/0.000.02/0.01/0.000.49/0.30/0.050.07/0.03/0.00
ML0.61/0.39/0.080.69/0.42/0.070.12/0.04/0.000.02/0.01/0.000.06/0.03/0.000.04/0.02/0.000.01/0.00/0.000.17/0.05/0.010.28/0.23/0.07
Sec0.46/0.26/0.040.47/0.26/0.040.05/0.02/0.000.02/0.01/0.000.01/0.00/0.000.00/0.00/0.000.05/0.04/0.010.04/0.01/0.000.00/0.00/0.00
Phy0.59/0.34/0.050.42/0.28/0.070.16/0.06/0.010.00/0.00/0.000.06/0.02/0.000.06/0.03/0.000.06/0.02/0.000.28/0.23/0.050.16/0.09/0.01
Che0.35/0.14/0.020.49/0.27/0.050.12/0.04/0.000.05/0.02/0.000.12/0.07/0.010.00/0.00/0.000.01/0.00/0.000.05/0.01/0.000.00/0.00/0.00
CPy0.51/0.33/0.060.49/0.32/0.060.12/0.05/0.010.04/0.01/0.000.03/0.01/0.000.00/0.00/0.000.00/0.00/0.000.42/0.30/0.100.01/0.00/0.00
Psy0.46/0.24/0.040.43/0.24/0.050.06/0.02/0.000.00/0.00/0.000.04/0.03/0.000.00/0.00/0.000.04/0.03/0.010.08/0.03/0.000.01/0.00/0.00
Soc0.30/0.17/0.030.65/0.42/0.080.05/0.02/0.000.04/0.02/0.000.14/0.09/0.020.00/0.00/0.000.01/0.00/0.000.13/0.05/0.000.13/0.12/0.07
Phi0.24/0.10/0.010.42/0.25/0.050.04/0.02/0.000.00/0.00/0.000.03/0.01/0.000.00/0.00/0.000.03/0.01/0.000.07/0.03/0.000.00/0.00/0.00
Hi0.59/0.33/0.050.59/0.35/0.070.11/0.04/0.000.01/0.00/0.000.11/0.04/0.000.00/0.00/0.000.04/0.01/0.000.22/0.08/0.010.00/0.00/0.00
Law0.39/0.19/0.030.47/0.25/0.040.08/0.03/0.000.00/0.00/0.000.07/0.04/0.000.00/0.00/0.000.02/0.01/0.000.03/0.01/0.000.11/0.06/0.01
Eco0.51/0.28/0.040.72/0.52/0.130.06/0.02/0.000.00/0.00/0.000.02/0.01/0.000.00/0.00/0.000.05/0.04/0.010.18/0.15/0.030.00/0.00/0.00

ASR@30/10/1 (SECA)

SubjectLlama-3-3BLlama-3-8BLlama-2-13BQwen-2.5-14BQwen-2.5-7BGPT-4o-MiniGPT-4.1-NanoGPT-3.5 Turbo*GPT-4*
Cli0.89/0.69/0.200.79/0.65/0.280.46/0.28/0.050.24/0.20/0.110.36/0.28/0.160.46/0.42/0.320.40/0.32/0.210.77/0.69/0.540.00/0.00/0.00
Bio0.92/0.74/0.250.82/0.69/0.310.51/0.29/0.040.33/0.31/0.210.37/0.31/0.230.30/0.25/0.200.29/0.28/0.220.65/0.59/0.490.31/0.21/0.13
Ana0.78/0.63/0.210.79/0.56/0.210.33/0.19/0.030.22/0.21/0.160.32/0.26/0.150.24/0.20/0.140.28/0.26/0.250.80/0.75/0.450.49/0.45/0.39
Mat0.76/0.70/0.260.70/0.67/0.300.74/0.46/0.100.52/0.43/0.230.18/0.15/0.110.46/0.42/0.270.61/0.60/0.520.92/0.88/0.600.68/0.66/0.54
CS0.97/0.86/0.321.00/0.93/0.390.59/0.37/0.090.27/0.21/0.130.26/0.20/0.120.18/0.17/0.160.58/0.58/0.500.62/0.60/0.350.86/0.86/0.85
ML0.81/0.70/0.260.70/0.61/0.240.43/0.27/0.050.23/0.20/0.130.34/0.25/0.120.23/0.21/0.150.07/0.06/0.040.52/0.51/0.500.41/0.40/0.40
Sec0.76/0.59/0.230.81/0.64/0.280.34/0.19/0.030.38/0.30/0.120.26/0.21/0.110.17/0.14/0.080.28/0.21/0.120.39/0.33/0.170.30/0.25/0.23
Phy0.86/0.70/0.260.68/0.57/0.210.71/0.52/0.170.25/0.22/0.090.60/0.53/0.420.46/0.39/0.260.37/0.34/0.270.69/0.57/0.350.20/0.18/0.17
Che0.84/0.67/0.170.88/0.69/0.210.47/0.29/0.050.17/0.15/0.070.20/0.14/0.060.29/0.25/0.170.36/0.32/0.240.62/0.47/0.150.11/0.06/0.01
CPy0.75/0.66/0.270.80/0.73/0.350.62/0.44/0.110.43/0.39/0.200.39/0.37/0.320.43/0.43/0.340.29/0.22/0.120.95/0.90/0.800.32/0.30/0.30
Psy0.83/0.63/0.210.87/0.67/0.310.34/0.17/0.020.33/0.28/0.150.26/0.22/0.140.26/0.24/0.190.32/0.29/0.230.49/0.45/0.400.20/0.20/0.20
Soc0.70/0.47/0.130.80/0.67/0.240.30/0.22/0.070.24/0.18/0.080.37/0.29/0.160.21/0.19/0.140.24/0.18/0.120.81/0.80/0.650.25/0.25/0.20
Phi0.70/0.51/0.160.83/0.70/0.250.25/0.12/0.020.26/0.22/0.130.37/0.28/0.160.11/0.08/0.050.34/0.29/0.190.49/0.42/0.230.21/0.20/0.20
Hi0.89/0.77/0.310.92/0.86/0.400.60/0.42/0.100.42/0.40/0.240.38/0.33/0.240.38/0.37/0.270.52/0.51/0.430.63/0.58/0.420.56/0.56/0.51
Law0.68/0.51/0.160.83/0.71/0.240.32/0.19/0.040.05/0.04/0.020.31/0.20/0.080.07/0.06/0.030.19/0.17/0.140.57/0.56/0.490.22/0.22/0.22
Eco0.78/0.59/0.210.99/0.88/0.380.44/0.25/0.040.13/0.10/0.040.28/0.21/0.130.16/0.14/0.120.40/0.37/0.320.65/0.53/0.210.25/0.25/0.21

Objective Difference (SECA increases the objective significantly compared to Raw)

SubjectLlama-3-3BLlama-3-8BLlama-2-13BQwen-2.5-14BQwen-2.5-7BGPT-4o-MiniGPT-4.1-NanoGPT-3.5 Turbo*GPT-4*
Cli1.771.272.913.872.669.226.065.303.05
Bio2.481.742.425.793.158.236.525.364.44
Ana2.211.872.925.494.249.036.324.205.29
Mat1.661.532.366.103.528.535.042.168.09
CS1.541.302.025.443.538.276.262.058.79
ML1.341.241.854.853.057.723.944.652.59
Sec1.821.452.375.674.116.744.753.606.15
Phy1.881.953.175.685.087.646.264.453.48
Che2.352.111.913.772.158.375.784.642.71
CPy2.452.063.466.685.3010.157.374.713.42
Psy1.961.722.736.083.309.485.865.623.86
Soc2.441.332.964.292.486.944.954.335.82
Phi2.441.832.625.093.255.845.704.022.38
Hi2.122.243.767.033.7512.458.493.4112.25
Law1.701.601.902.792.664.784.104.812.33
Eco1.671.593.314.433.338.365.814.553.69

Note: Due to time constraints, GPT-3.5-Turbo (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0613) were evaluated on a 30% subset.

Q2: Table 2 only show results on 3 models rather than 7

A2: We evaluated SECA and Raw on all seven models, as shown in Figure 3 and detailed in our response to Q1. As noted in Appendix K (Lines 635–637), we excluded GCG on 14B models due to memory constraints—our machine supports GCG only on 7B models. GPT models were also omitted, as GCG requires white-box access.

Q3: MMLU is a less common benchmark; testing on open-ended QA benchmarks (AdvBench or JBB-Behaviors); attacking reasoning models

A3: We did not test SECA on AdvBench or JBB-Behaviors because these datasets are designed for jailbreak attacks, whereas our work focuses on hallucination attacks, which are fundamentally different from jailbreaking (see our response to Q1 from Reviewer 6rsw). Also, we clarify that our setting is already an open-ended MCQA generation task, where the model is prompted to select an incorrect answer choice followed by a hallucinated open-ended explanation.

Due to space limitations, we kindly refer you to our response to Q1 from Reviewer LvAW for detailed discussion related to MMLU and open-ended QA, and to Q4 from Reviewer 6rsw for reasoning model–specific questions.

Q4: Compare with other attack methods

A4: The algorithms you mentioned are indeed interesting; we will include appropriate citations, and they may help further improve the efficiency of our SECA method. Due to space constraints in this rebuttal, we kindly refer you to our response to Q1 from Reviewer 6rsw for comparisons between SECA and the referenced attack methods.

Q5: Efficiency of SECA

A5: In our experiments, generating one candidate prompt takes approximately 0.5–1 second on a single NVIDIA A5000 GPU.

We sincerely thank the reviewer again for their valuable feedback and hope that our clarifications help reinforce the contributions of our work.

评论

Thanks for addressing my questions. I will increase my score.

评论

Dear Reviewer,

Thank you for your valuable questions and suggestions. Please let us know if you have any other concerns you'd like us to address. If our responses have been satisfactory and addressed all of your concerns, we would appreciate it if you would reconsider your evaluation.

Best,

The Authors

最终决定

This paper introduces SECA, a framework generating semantically equivalent and coherent adversarial prompts to elicit LLM hallucinations, using an LLM proposer for rephrasings and a feasibility checker for semantic/coherence verification. Reviewers praised its novel focus on realistic attacks, efficient heuristic search, and strong empirical performance on MMLU, with higher attack success rates than GCG. Concerns included limited evaluation on larger/commercial LLMs, narrow focus on multiple-choice tasks, and missing comparisons with recent attack methods. However, the authors addressed key issues in their rebuttal, including expanded experiments on additional models and clarification of generalizability. Despite minor limitations, SECA fills a critical gap by studying realistic prompt variations that trigger hallucinations, with practical value for LLM robustness auditing. Its rigorous validation, including human evaluations, strengthens its contribution. The AC concurs this work is a valuable advancement for understanding LLM vulnerabilities and recommends acceptance.