PaperHub
Overall: 5.8/10 · Poster · 4 reviewers (min 5, max 6, std dev 0.4)
Ratings: 5, 6, 6, 6 · Confidence: 3.5
COLM 2025

Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing

OpenReview · PDF
Submitted: 2025-03-21 · Updated: 2025-08-26
TL;DR

This paper introduces Prompt-Reverse Inconsistency (PRIN), where Large Language Models give conflicting answers when identifying correct versus incorrect responses, raising concerns about their logical reliability.

Abstract

Keywords
Large Language Model, Natural Language Process, Inconsistency of LLMs, Prompt-Reverse Inconsistency, Randomness Inconsistency, Paraphrase Inconsistency

Reviews and Discussion

Review
Rating: 5

In this work, the authors define and explore the Prompt-Reverse Inconsistency (PRIN), analyzing how existing models show this behavior in multiple-choice question settings. The authors also experimented with how other types of inconsistency affect PRIN, approaches to mitigate PRIN, how PRIN scores correlate to rates of other types of inconsistency, how to apply PRIN to boost task performance, and how PRIN varies with different sizes of options.

I think the work deals with interesting issues and approaches. At the same time, the work could have been more comprehensive if the authors 1) explored a more generalizable concept for PRIN, 2) considered a wider set of decoding approaches, and 3) were more comprehensive in examining and combining existing approaches.

Considering these issues and the overall impact of the work, I am slightly below the bar for accepting the work.

Reasons to Accept

  • Relevance to COLM: The work is relevant to COLM as it deals with issues and approaches that could be impactful to the use of LLMs, especially for LLM-as-Judge use cases.
  • Overall clear writing
  • Comprehensive experiments, specifically over many models

Reasons to Reject

  • Narrow definition of PRIN: The first concern is that the authors might have defined “PRIN” too narrowly, while there can be other types of tasks than multiple-choice questions where PRIN could apply. For instance, randomness and paraphrase inconsistencies are not limited to multiple-choice questions. Here, I can imagine one case of PRIN that is not a multiple-choice question: if a math question asks “how many apples are taken by Tim”, the reverse question can be “how many apples are left?” Here, the question is essentially asked the other way around, and the answer changes accordingly, but logically we can recover the original answer just by seeing the reverse answer. It would have been worthwhile if the authors could explore a broader concept (which would result in more generalizability), or at least discuss these expanding directions.

  • Lack of details/considerations on decoding approaches: There was also a methodological concern: I was curious why the authors did not explain their decoding approaches. There are a number of different decoding approaches—I guess the authors might have used a sampling approach, but there are other options like greedy decoding or beam search. For use cases that require high accuracy, greedy decoding might have been the most adequate approach to adopt, as it would output the LLM’s most confident output token series. While this decoding might only be doable for open-source models, investigating it would still have been valuable for the work. However, the authors did not employ it, and I also could not tell what kind of decoding approach the authors actually used. For example, did they use the same temperature or repetition penalty for different models (of course, if they can specify them)?

  • Limited exploration on paraphrasing prompts: Regarding the investigation of paraphrasing (in Table 4), the authors conducted paraphrasing by swapping one word. Given that the authors claim that variability due to prompt paraphrasing is minor, I was not sure whether paraphrasing by swapping only one word would be enough. Perhaps doing more drastic paraphrasing (e.g., changing the sentence structure) could have been beneficial.

  • Not examining/comparing to existing compelling approaches: The authors conducted a comparison study with CoT and self-consistency, but I was curious why the authors did not experiment with reasoning models. At the time of COLM submission, several reasoning models were available (to my knowledge, o1 and DeepSeek-R1). Moreover, I was also curious how the performance would look if the authors combined multiple approaches (e.g., CoT + PRIN). Would it result in a further performance boost? Knowing that would likely help the community decide whether to adopt the PRIN-based approach alongside already known techniques.

  • Minor presentation issue: Lastly, a minor point on presentation: a line plot might not be the best visualization for Figure 1, as this type of chart is better suited to sequential data. Bar charts might be more appropriate.

(I raised the score from 4 to 5 after reading the author response.)

Questions for the Authors

It would be great to know the authors' responses about the above issues.

Comment

1. Narrow definition of PRIN

We appreciate the reviewer’s suggestion and agree that exploring PRIN beyond multiple-choice tasks is an exciting direction. However, we believe the example provided (e.g., “how many taken” vs. “how many left”) still fundamentally fits within our defined PRIN framework, as it represents a complementary prompt pair where the reverse logically determines the original answer, similar to our setup. Our current work focuses on the cleanest case by separating prompts and candidate answers, but we agree that broader forms, such as compositional or embedded reverse reasoning, are valuable extensions.


2. Lack of details/considerations on decoding approaches

We used the default decoding settings for each model — typically sampling-based decoding with their default temperature and repetition penalty — because this reflects the most common usage patterns and provides a more direct and realistic warning to users about how PRIN manifests in everyday applications. While we agree that exploring alternative decoding methods like greedy or beam search (particularly on open-source models) could yield interesting insights, our primary goal here was to assess the inconsistency that users are most likely to encounter. We will clarify these decoding choices in the paper and note this as a valuable direction for follow-up work.
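As an aside for readers who want to probe the decoding question themselves, here is a minimal sketch of comparing greedy decoding with sampling on an open-source model (the model name and hyperparameters are illustrative, not the settings used in the paper):

```python
# Sketch: greedy vs. sampling decoding on an open-source model.
# (Model choice and hyperparameters are illustrative, not the paper's setup.)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Which of the candidate answers are correct? ..."
inputs = tok(prompt, return_tensors="pt")

# Greedy decoding: deterministic, the model's most confident continuation.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=64)

# Sampling: the "default usage" regime discussed in the response above.
sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    max_new_tokens=64,
)

print(tok.decode(greedy[0], skip_special_tokens=True))
print(tok.decode(sampled[0], skip_special_tokens=True))
```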


3. Limited exploration on paraphrasing prompts

We understand how Table 4 may give the impression that we only tested word-level paraphrasing, but as detailed in Section 4.4, we actually expanded the paraphrasing analysis by generating two additional paraphrased versions using GPT-4, creating a total of five distinct paraphrased prompts, including more substantial, global rephrasings beyond simple word swaps. Our results showed that even under these stronger paraphrasing variations, PRIN remained relatively stable, suggesting that it reflects a deeper systematic issue, not just sensitivity to prompt wording.


4. Not examining/comparing to existing compelling approaches

We agree that comparing with reasoning-focused models like o1 and DeepSeek-R1, as well as exploring combinations like CoT + PRIN, would be valuable. However, we faced two practical constraints: (1) o1 requires additional financial costs beyond what we already invested in GPT-4o, and (2) DeepSeek-R1 was not available for approved use in the authors’ country during our study period. Despite these limitations, we believe our focus on widely accessible general-purpose models establishes a solid foundation, and we see integrating PRIN with advanced reasoning models or combined techniques as promising future work.

Comment

I want to first thank the authors for their response.

  1. For the scoping, in the revised version, it would be good if the authors could indicate the possible extensions of PRIN while acknowledging that the current version is rather a tightly scoped version of a "clean" case.
  2. While I agree that sampling is the baseline approach of how we use LLMs, it would still be really nice if the authors explored this direction a bit.
  3. I think I might have missed this portion. I want to thank the authors for pointing this out.
  4. I understand the authors' practical concerns.

I buy some of the authors' points; hence, I will increase the score to 5. I still think that some of the concerns I had would be good to explore.

Review
Rating: 6

This paper introduces and investigates Prompt-Reverse Inconsistency (PRIN), a novel type of LLM self-inconsistency that occurs when models give conflicting responses to logically opposite questions. Specifically, when presented with a question and multiple answer candidates, LLMs often provide contradictory judgments when asked "Which are correct answers?" versus "Which are incorrect answers?". The authors conduct extensive experiments across multiple LLMs (both closed and open-source) and various mathematical/scientific tasks to analyze PRIN, examining its relationship with other inconsistency types, mitigation strategies, and potential applications. Their findings demonstrate that PRIN is prevalent across all tested LLMs, can be mitigated through explicit reasoning and negation clarification, and may actually be leveraged to improve model performance in some cases.
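For concreteness, here is a minimal sketch of the direct/reverse prompt pair and the consistency check described above (the prompt wording and the `query_llm` helper are illustrative placeholders, not the paper's exact templates or API):

```python
# Minimal sketch of the PRIN setup: ask for correct answers, ask for
# incorrect answers, and check whether the two selections are complementary.
# (Prompt wording and query_llm are illustrative, not the paper's exact setup.)

def query_llm(prompt: str) -> set[str]:
    """Placeholder for an LLM call that returns the option labels it selects."""
    raise NotImplementedError  # plug in any chat/completions client here

def prin_consistent(question: str, candidates: dict[str, str]) -> bool:
    """True if the direct and reverse judgments are logically consistent."""
    options = "\n".join(f"{label}. {text}" for label, text in candidates.items())
    direct = query_llm(f"{question}\n{options}\nWhich of the candidate answers are correct?")
    reverse = query_llm(f"{question}\n{options}\nWhich of the candidate answers are incorrect?")
    # Logical complementarity: the reverse selection must be exactly the
    # complement of the direct selection over the full candidate set.
    return reverse == set(candidates) - direct
```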

Reasons to Accept

  • Comprehensive analysis: The authors examine PRIN across multiple models, datasets, and experimental settings, providing a solid empirical foundation for their findings.
  • Application potential: Beyond identifying a problem, the authors show how PRIN can actually be leveraged to improve model performance in certain scenarios.

Reasons to Reject

  1. Limited theoretical explanation: While the paper thoroughly documents the PRIN phenomenon, it lacks a deeper theoretical explanation of why LLMs exhibit this behavior. Specifically, the authors position PRIN as distinct from Randomness Inconsistency and Paraphrase Inconsistency, but the paper doesn't sufficiently establish that PRIN is fundamentally different rather than a special manifestation of existing inconsistency types in the LLM-as-a-judge setting. The correlation with negation handling (shown in Figure 1) is suggestive but insufficient as a comprehensive explanation of the underlying mechanisms.

  2. Incomplete exploration of mitigation impacts: While the authors demonstrate that their proposed mitigation methods reduce PRIN, they don't thoroughly analyze potential tradeoffs. For instance, does adding reasoning paths or negation explanations introduce other issues like increased latency, reduced performance on certain task types, or creation of new inconsistency patterns?

  3. Training-based solutions: The paper focuses on prompt-based mitigations, but doesn't discuss potential training-based approaches. Would fine-tuning with logical constraints or adversarial examples targeting negation help reduce PRIN? This would complement the current prompt-based solutions.

Questions for the Authors

  1. Have you explored whether PRIN occurs in more diverse task domains beyond mathematical reasoning, such as common sense reasoning or factual QA?
  2. Your work demonstrates a correlation between PRIN and negation handling abilities. Have you investigated other cognitive or linguistic factors that might predict or explain PRIN?
  3. The paper shows that PRIN can improve performance for high-capability models like GPT-4 but not for weaker models. Is there a performance threshold where PRIN becomes beneficial rather than detrimental?
  4. Have you considered how PRIN might interact with other established techniques like few-shot prompting or self-consistency beyond what's covered in Section 4.5?
  5. Since your findings suggest that LLMs struggle with logical contradictions, what implications do you see for applications requiring formal logical reasoning?
  6. Could PRIN be considered a special case of Randomness Inconsistency that occurs specifically in discriminative settings? What evidence supports PRIN being a fundamentally distinct phenomenon rather than a context-specific manifestation of existing inconsistency types?
  7. Have you investigated whether models explicitly trained on logical reasoning tasks or with logical constraints exhibit lower PRIN? This could help determine if PRIN is an inherent limitation of current architectures or a training deficiency.
Comment

1. PRIN in domains beyond math (e.g., commonsense, factual QA)?

That’s an excellent question. In this paper, we focused exclusively on mathematical domains. Our observations suggest that a model’s ability to handle negation plays a key role in triggering PRIN. Given this, we believe PRIN could plausibly arise in other domains, such as commonsense reasoning or factual QA, where similar linguistic challenges exist. However, due to time constraints, we were unable to conduct direct experiments on datasets from those domains. Investigating whether PRIN manifests more broadly across diverse task domains is a promising direction for future work.


2. Other cognitive/linguistic factors related to PRIN?

The concept of PRIN emerged from our observation that model performance systematically diverges depending on the presence or absence of negation in prompts. Our study primarily focused on this negation sensitivity. However, to account for potential confounds, such as unfamiliarity with specific lexical choices, we conducted experiments using multiple reworded versions of the prompts.

As shown in Section 4.2 and Table 5, while LLMs exhibit some sensitivity to prompt formulation, the observed PRIN patterns remain consistent across different rewordings. This suggests that PRIN is not merely an artifact of superficial linguistic variation but reflects deeper issues in how models handle negation. Thus, we did explore one class of linguistic factors, rewording, but have not yet extended our analysis to other cognitive factors. Investigating additional predictors of PRIN beyond negation remains an open and important avenue for future work.


3. Is there a capability threshold where the benefits of PRIN outweigh its drawbacks?

That’s a very insightful point. Based on the results from the three models evaluated in our study, we observed that, unlike the GPT-4 series, models like LLaMA-3 do not consistently benefit from PRIN; in fact, performance can even degrade. While our current experiments are insufficient to pinpoint an exact performance threshold where PRIN shifts from being detrimental to beneficial, our findings suggest that a model’s negation handling capability plays a critical role. With more targeted experiments measuring this capability across a broader range of models, it may be possible to quantify such a threshold. This would offer valuable guidance for both LLM development and evaluation going forward.


4. Interaction with few-shot prompting or self-consistency?

While Section 4.5 explores combining PRIN with self-consistency, we acknowledge that deeper interactions with techniques like few-shot prompting remain underexplored. Importantly, few-shot prompting may reduce or mask PRIN by biasing the model toward specific demonstration patterns, potentially hiding its raw inconsistency rather than resolving it.


5. Implications of LLM struggles with contradictions for logical reasoning?

We agree that our findings raise important implications: PRIN highlights that even advanced LLMs struggle with maintaining consistent logical complements, suggesting they are currently ill-suited for tasks requiring strict formal logical reasoning without additional safeguards. Applications in fields like law, formal verification, or logic-based reasoning will likely require specialized training, architectural adjustments, or post-hoc correction mechanisms to ensure reliability. We see addressing this gap as a critical avenue for future research.


6. Is PRIN a case of Randomness Inconsistency in discriminative settings, or something distinct?

We argue that PRIN is fundamentally distinct from Randomness Inconsistency. As shown in Section 4.4, models like Llama-3 and Falcon exhibit low randomness inconsistency (i.e., they produce stable outputs across runs) yet still show high PRIN, meaning they consistently fail to align direct and reverse prompts. This suggests PRIN arises not from sampling variability but from deeper issues in how models process negation and complementarity in discriminative reasoning. We see PRIN as capturing a unique failure mode that existing randomness-based metrics cannot explain.
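To make the distinction concrete, here is a small sketch of how the two rates could be measured independently (the metric definitions are illustrative simplifications, not the paper's exact formulas):

```python
# Sketch: Randomness Inconsistency vs. PRIN measured independently.
# (Both definitions are illustrative simplifications of the paper's metrics.)

def randomness_inconsistency(runs: list[set[str]]) -> float:
    """Fraction of repeated runs of the SAME prompt that differ from the first run."""
    return sum(r != runs[0] for r in runs[1:]) / max(len(runs) - 1, 1)

def prin_rate(direct: set[str], reverse: set[str], candidates: set[str]) -> float:
    """Fraction of candidates on which the direct and reverse judgments conflict."""
    mismatched = reverse ^ (candidates - direct)  # symmetric difference
    return len(mismatched) / len(candidates)

# A model can be perfectly stable across runs (zero Randomness Inconsistency)
# yet still contradict itself across the direct/reverse prompt pair (high PRIN).
runs = [{"A", "C"}, {"A", "C"}, {"A", "C"}]
print(randomness_inconsistency(runs))                            # 0.0
print(prin_rate({"A", "C"}, {"A", "D"}, {"A", "B", "C", "D"}))   # 0.5
```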


7. Do models trained on logical tasks show reduced PRIN (architecture vs. training)?

While we did not explicitly test specialized models trained solely on logical reasoning tasks, our study includes GPT-4, which, as a top-performing, closed-source model, was very likely pretrained on diverse reasoning and logic tasks. Yet, as our results show, even GPT-4 exhibits notable PRIN, suggesting that PRIN is not just a training deficiency but may reflect deeper limitations in current architectures or learning frameworks. We agree that further testing on specialized logic-constrained models would be valuable and plan to explore this in future work.

Comment

Thank you for the response.

I will improve my score to 6 since the authors made some clarifications on the details of the experiments. However, my concerns about the implications and generalizability remain. If the paper is accepted, I suggest that more cross-domain evaluations and other prompting methods would strengthen the paper and benefit future researchers.

Review
Rating: 6

This work poses an interesting and novel question that is slightly different from the generator-validator/discriminator gap in consistency: namely, how consistent a model is with regard to the answer and the complement of the answer. To my knowledge this is the first work to tackle this consistency problem, so it is quite original. Generally, the quality and clarity of this paper are quite good, aside from a few minor things mentioned below.

Significance: I can see this type of metric being used in future language model evaluations. The paper's significance is limited due to reliance on multiple choice datasets (or ones that can be made into multiple choice using the method of sampling answers they pointed out). However, this doesn't factor into my review scores as it seems unavoidable and the method for evaluating consistency is still valid and usable.

Reasons to Accept

This is a new type of consistency measure that is well validated and interesting to the community for further analyzing the consistency of language models. For ensuring the logical reasoning capabilities of language models, we certainly want to be sure that the answer and the complement of the answer match; this paper provides a discussion and analysis of how to think about validating this.

Reasons to Reject

(1) Domain Reliability

There is only one concern I have that is a reason to reject which is the validity of the results outside of the math domain. I do think the authors need to add datasets outside of the math domain in order to be able to show the results hold generally for LLMs. There are many multiple choice data outside of the math domain. I would recommend the authors add a few of these to further validate their results. This can be addressed in the rebuttal period.

The Rest

These concerns are not necessarily reasons to reject but simply points of clarity that could improve the paper.

(2) Prompt Reverse is a very confusing name

It took me, the reader, a long time to get used to PRIN as prompt reverse was quite confusing to me. I would recommend something more descriptive since many things could be a reverse prompt. Perhaps "Answer Complement Inconsistency" or something like this.

(3) F1 is not enough for the reader

For the research question we are also interested in looking at precision and recall, since they tell us much more about the degree of "completeness" of the answer complement/reverse prompt.

(4) What about if we just cared about correctness or consistency and not completeness?

Related to the above, it would be good to add an analysis that ignored completeness or full overlap or full complement that showed the reader when the "Reverse Prompt/Answer Complement" simply agreed with the Original prompt (intersection >= 1). I would further divide this analysis into when this is correct and incorrect according to the task accuracy metric.

Comment

1. **Domain Reliability** There is only one concern I have that is a reason to reject which is the validity of the results outside of the math domain. I do think the authors need to add datasets outside of the math domain in order to be able to show the results hold generally for LLMs. There are many multiple choice data outside of the math domain. I would recommend the authors add a few of these to further validate their results. This can be addressed in the rebuttal period.

We fully agree that extending PRIN evaluation beyond the math domain is an exciting and valuable direction. For this initial study, we intentionally focused on math and science domains because they provide clear ground truth for correctness, enabling rigorous PRIN measurement. While we did not include non-math multiple-choice datasets in this version, we believe the core mechanism underlying PRIN (namely, the probabilistic inconsistency between direct and reverse prompts) is model-internal and not domain-specific. Preliminary tests we conducted on non-math QA datasets showed qualitatively similar inconsistencies, although measuring PRIN reliably outside math is more challenging due to the lack of objective correctness labels. We will emphasize this as a key avenue for future work and greatly appreciate the reviewer’s suggestion.


2. **Prompt Reverse is a very confusing name** It took me, the reader, a long time to get used to PRIN as prompt reverse was quite confusing to me. I would recommend something more descriptive since many things could be a reverse prompt. Perhaps "Answer Complement Inconsistency" or something like this.

Thank you for this thoughtful suggestion. We understand that “Prompt-Reverse Inconsistency (PRIN)” may initially sound unfamiliar or confusing. We chose this name specifically to emphasize the inconsistency between the direct prompt (asking for correct answers) and the reverse prompt (asking for incorrect answers), which directly reflects our experimental setup. That said, we appreciate the reviewer’s point that alternatives like “Answer Complement Inconsistency” could offer a more intuitive framing, and we will carefully consider renaming in future revisions or follow-up work to improve clarity.


3. **F1 is not enough for the reader** For the research question, we are also interested in looking at precision and recall, since they tell us much more about the degree of "completeness" of the answer complement/reverse prompt.

We agree that reporting precision and recall alongside F1 could provide a more detailed picture of the completeness and balance between direct and reverse prompt outputs. In this work, we focused on F1 as a summary measure to keep the analysis streamlined, but we appreciate the reviewer’s point and will include detailed precision and recall breakdowns in the next revision or as supplementary material to offer deeper insights into where the complementarity gaps arise.
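For illustration, here is one way such a breakdown could be computed, assuming the scores compare the direct selection with the complement of the reverse selection (the paper's exact definition may differ):

```python
# Sketch: precision/recall/F1 of direct-vs-reverse agreement.
# (Assumes the comparison is between the direct selection and the complement
#  of the reverse selection; the paper's exact definition may differ.)

def complement_prf(direct: set[str], reverse: set[str], candidates: set[str]):
    implied = candidates - reverse  # what the reverse prompt implies is correct
    tp = len(direct & implied)
    precision = tp / len(implied) if implied else 0.0
    recall = tp / len(direct) if direct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(complement_prf({"A", "C"}, {"B"}, {"A", "B", "C", "D"}))  # ≈ (0.67, 1.0, 0.8)
```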


4. **What about if we just cared about correctness or consistency and not completeness?** Related to the above, it would be good to add an analysis that ignored completeness or full overlap or full complement and showed the reader when the "Reverse Prompt/Answer Complement" simply agreed with the Original prompt (intersection >= 1). I would further divide this analysis into when this is correct and incorrect according to the task accuracy metric.

We agree that analyzing partial agreement (e.g., intersection ≥ 1) and linking it to task accuracy would add valuable perspective beyond completeness. We will add it in the next version.
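A minimal sketch of the analysis the reviewer proposes, splitting partial agreement by task correctness (the definitions here are illustrative, not the paper's implementation):

```python
# Sketch: partial agreement (intersection >= 1) split by task correctness.
# (Illustrative definitions; not the paper's implementation.)

def partial_agreement(direct: set[str], reverse: set[str],
                      candidates: set[str], gold: set[str]) -> str:
    implied = candidates - reverse                # reverse prompt's implied correct set
    agrees = len(direct & implied) >= 1           # "intersection >= 1" criterion
    correct = direct == gold                      # per the task accuracy metric
    if agrees and correct:
        return "agree-correct"
    return "agree-incorrect" if agrees else "disagree"

print(partial_agreement({"A"}, {"B", "C"}, {"A", "B", "C"}, {"A"}))  # agree-correct
```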

Comment

Thank you for the response.

I am keeping the score because I believe the authors' claim that "While we did not include non-math multiple-choice datasets in this version, we believe the core mechanism underlying PRIN (namely, the probabilistic inconsistency between direct and reverse prompts) is model-internal and not domain-specific" should be tested empirically to complete the contribution. That said, if other reviewers and the meta-reviewer don't see this as a particular issue and the authors can be trusted with a more extensive generic-domain follow-up, then I am fine to err on the side of a full recommendation for acceptance.

If the paper is accepted, I would really like to see more generic-domain evaluations for the camera-ready, even if it's only one or two additional appendix experiments, to turn the authors' beliefs into a tested claim (I also believe the mechanism is not domain-specific).

Review
Rating: 6

This paper introduces and analyzes a novel phenomenon in LLMs called Prompt-Reverse Inconsistency (PRIN). While past work has studied Randomness Inconsistency and Paraphrase Inconsistency, PRIN captures a different failure mode in LLMs' discriminative reasoning. Specifically, when an LLM is asked to identify correct answers versus incorrect answers from a set of candidates, its responses are often inconsistent across these logically complementary prompts.

Through extensive experiments on multiple datasets (MATH, MathQA, EquInfer) and a range of open-source and closed-source LLMs (e.g., GPT-4, LLaMA-3, Falcon), the authors quantify the PRIN effect, relate it to known inconsistency types, and propose mitigation strategies. The work is well-scoped and structured around six clear research questions.

Reasons to Accept

Novel contribution: Introduces and rigorously studies a previously unexamined inconsistency phenomenon (PRIN).

Strong empirical analysis: Evaluates PRIN across a wide range of models and tasks.

Practical mitigation: Proposes simple but effective techniques (e.g., CoT with negation explanation) to reduce PRIN.

Well-structured and reproducible: Clear research questions, public datasets/models, and promised code release.

Reasons to Reject

Limited theoretical framing: The paper lacks a deeper formalization or explanation of PRIN from first principles (e.g., logic, learning dynamics).

Dense presentation: Some results (especially in Q4) could be made more digestible; currently requires significant effort to parse.

Focus on math/logic tasks: The scope is somewhat narrow; a broader range of tasks (e.g., commonsense or legal reasoning) would strengthen generality.

Questions for the Authors

  • Could PRIN be grounded more formally in logical consistency (e.g., monotonicity, Boolean closure)? Are there known logical frameworks where such behavior is characterized?

  • Do you expect PRIN to appear in non-mathematical domains (e.g., QA, ethics, or summarization)? Have you observed this elsewhere?

  • Do you believe PRIN arises from objective misalignment, dataset artifacts, or model architecture? Any empirical ablations on these fronts?

  • Are there specific patterns in prompts or candidate answer sets that systematically trigger high PRIN?

  • Your negation-based fix works for small-scale tasks—does it generalize to long contexts, real-world tasks, or multimodal prompts?

Details of Ethics Concerns

NA

Comment

1. Could PRIN be grounded more formally in logical consistency (e.g., monotonicity, Boolean closure)? Are there known logical frameworks where such behavior is characterized?

Thank you for this thoughtful question. PRIN arises in probabilistic neural language models, not symbolic systems. As shown in Table 1, LLMs like GPT-4 generate responses by sampling from conditional probability distributions, without enforcing strict logical complementarity between prompts (e.g., “correct” vs. “incorrect” answers). This makes PRIN an emergent behavioral inconsistency unique to probabilistic generation, not a violation of formal symbolic rules. While existing logical frameworks don’t yet capture such behavior, we see our work as an empirical foundation that can motivate future formalization.
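As a hedged illustration of what such a formalization might look like (notation is ours, not the paper's): for a candidate set $C$, let $A_{\mathrm{dir}}$ and $A_{\mathrm{rev}}$ be the subsets selected under the direct and reverse prompts, respectively. Logical complementarity and a PRIN-style violation score could then be written as:

```latex
% Illustrative formalization (notation ours, not taken from the paper).
% C: candidate answers; A_dir, A_rev: selections under the direct
% ("which are correct?") and reverse ("which are incorrect?") prompts.
\[
  \text{Consistency condition:} \quad A_{\mathrm{rev}} = C \setminus A_{\mathrm{dir}}
\]
\[
  \text{PRIN-style violation:} \quad
  \frac{\lvert A_{\mathrm{rev}} \,\triangle\, (C \setminus A_{\mathrm{dir}}) \rvert}{\lvert C \rvert} > 0
\]
```

Here $\triangle$ denotes symmetric difference; the score is zero exactly when the two judgments are logical complements.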


2. Do you expect PRIN to appear in non-mathematical domains (e.g., QA, ethics, or summarization)? Have you observed this elsewhere?

Thank you for this insightful question. While our study focused on mathematical and scientific tasks to ensure clear ground truth, we believe PRIN is a general behavioral property of LLMs and expect it to appear in non-mathematical domains like QA, ethics, or summarization — especially wherever prompts and their reversals (e.g., relevant vs. irrelevant facts, ethical vs. unethical actions) are involved. Preliminary observations outside math suggest similar inconsistencies, though systematically measuring PRIN in open-ended tasks is more challenging due to the lack of clear correctness labels. We see this as an exciting direction for future research.


3. Do you believe PRIN arises from objective misalignment, dataset artifacts, or model architecture? Any empirical ablations on these fronts?

We believe PRIN likely arises from a combination of factors: the probabilistic nature of LLM generation, imperfect alignment objectives (which don’t explicitly enforce logical complementarity), and possibly artifacts in training data where correct/incorrect distinctions are inconsistently represented. While our current work focuses on empirically characterizing PRIN’s behavioral patterns, we agree that ablation studies (e.g., probing fine-tuning objectives, data subsets, or architectural components) are valuable next steps, and we highlight this as promising future work.


4. Are there specific patterns in prompts or candidate answer sets that systematically trigger high PRIN?

Please refer to Tables 5 and 6. Our experimental results across the three datasets indicate that answer sets involving complex mathematical expressions, such as those in EquInfer, tend to systematically trigger high PRIN. This suggests that the structural complexity and symbolic density of certain candidate answer sets are strongly associated with increased PRIN values.


5. Your negation-based fix works for small-scale tasks—does it generalize to long contexts, real-world tasks, or multimodal prompts?

Thank you for the great question. Based on our negation-based fix experiments using CONDAQA (Ravichander et al., 2022), we observed that, with the exception of high-performance models such as GPT-4, many models failed to reliably follow prompts, even producing empty responses in some cases. Given that CONDAQA is a large-scale benchmark with realistic passages and complex negation phenomena, these failures suggest that our fix does not yet generalize well to longer contexts or real-world tasks, and would likely face similar limitations in multimodal settings.

Final Decision

This paper proposes evaluating LLMs by checking whether they give the two answers that are complements when a question is inverted. Experiments are conducted on multiple math datasets and analysis considers a range of suitable questions.

Overall, I agree with the consensus of the reviewers that this is a novel approach that is informative, but there is one key weakness in the paper: experiments are only on math datasets (noted by three reviewers). The author response, that non-math datasets can be too ambiguous, is not sufficient. Some settings certainly have this issue, but there are many tasks that do not. Adding non-math tasks would show whether this is a generalizable technique or not.

Reviewers raised other concerns that could be addressed with small changes to the writing:

  • The name 'Prompt reverse' is confusing. While only raised by one reviewer, this is a concern I shared when reading the paper.

  • Only F1 is presented. This hides potentially important variations. It is not necessary to include precision and recall for every experiment, but they should be added in Table 3 and could be discussed anywhere else there are interesting variations.

Other issues may reasonably be considered beyond the scope of this work, but are valuable to consider as ways to strengthen the contribution:

  • There is limited theoretical framing. This concern was raised by two reviewers, though I would note that not every paper has to contribute both theoretical and empirical results.

  • Limited discussion of other possible solutions and trade-offs in metrics. This was raised by two reviewers and would be valuable to explore if the authors could make space for it.

[comment from PCs] The AC raises a good point with regard to the naming. We strongly recommend to change the title and the name of the method.