ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search
Abstract
Reviews and Discussion
This paper proposes ReliabilityRAG, a defense framework for Retrieval-Augmented Generation (RAG) systems that leverages document reliability signals to improve robustness against retrieval-time adversarial attacks. The approach combines a graph-theoretic method based on Maximum Independent Set (MIS) with a scalable weighted sampling framework, and provides both theoretical guarantees and empirical evidence of improved robustness and utility.
Strengths and Weaknesses
Strengths:
- Well-motivated and timely problem: The paper tackles a concrete and increasingly important vulnerability in retrieval-augmented generation (RAG) systems—namely, the corruption of retrieved documents via prompt injection and corpus poisoning. Framing this challenge in the context of search-based AI applications (e.g., Google’s AI Overviews) grounds the work in a highly relevant and practical setting.
- Principled algorithmic contribution: The authors introduce a graph-theoretic defense mechanism based on finding a maximum independent set (MIS) over contradiction relationships among retrieved documents. This approach is both intuitive and theoretically grounded, with robustness guarantees under natural assumptions.
- Theoretical rigor: The authors formalize a threat model and provide provable robustness guarantees for their algorithms. These are nontrivial results that bolster the paper’s credibility and potential impact.
Weaknesses:
- Reliance on NLI performance: The MIS approach depends critically on a Natural Language Inference (NLI) model to detect contradictions between isolated LLM responses. While the authors use a strong pre-trained model, performance in open-domain or domain-specific contexts may vary considerably, which could undermine the robustness guarantee in practice.
- Limited attack diversity: The evaluation focuses on relatively narrow attack types (prompt injection and corpus poisoning at specific positions). More diverse or adaptive attack strategies—e.g., coordinated multi-position attacks or adversarial noise in high-ranked documents—are not explored in sufficient depth.
- Heuristic design choices in sampling: Parameters in the sampling-based framework (e.g., decay rate γ, number of rounds T, context size m) are set heuristically. It would be valuable to either provide theoretical justification for these choices or to analyze their sensitivity more thoroughly.
- Utility-robustness tradeoff under-explored: Although the authors attempt to mitigate benign utility loss (e.g., via “I don’t know” filtering), the risk of filtering relevant but noisy documents remains. The effect of such filtering in more ambiguous or low-resource scenarios is not fully addressed.
Questions
- How robust is the approach to NLI degradation? If the NLI model used for contradiction detection underperforms—e.g., on non-English or technical content—how significantly does this affect the reliability of the MIS defense?
- Could the sampling parameters be learned or adapted? Is it feasible to tune the sampling distribution or context size in a data-driven manner, rather than fixing γ, T, and m heuristically?
- Can the MIS selection be integrated with document reranking? For example, would combining MIS with re-ranking via learned relevance scores further enhance robustness or utility?
- What happens when no consistent majority exists? In cases where most retrieved documents are noisy, off-topic, or benignly contradictory, how does the system avoid excessive pruning or failure?
Limitations
The paper discusses limitations in Appendix D.3, but a few areas warrant further elaboration:
- The approach presumes the existence of a coherent, contradiction-free majority of retrieved documents. This assumption may not hold in highly ambiguous or multi-perspective queries, or when the retriever surfaces highly diverse sources.
- The scalability of the full MIS procedure is limited to relatively small k (e.g., ≤ 20). While the sampling extension alleviates this, its effectiveness is tied to careful parameter tuning, which may not generalize across domains or tasks.
- The reliance on search engine ranking as a proxy for reliability may not transfer to settings outside of web search, such as academic corpora, social media, or enterprise knowledge bases.
- The filtering mechanism (based on “I don’t know” responses) introduces dependence on LLM behavior, which could be brittle or model-specific.
Final Justification
The authors’ rebuttal and clarifications resolved some of my concerns, while a few issues remain; my final score reflects this balance.
Formatting Issues
No.
We appreciate the reviewer for pointing out the potential weaknesses of our paper, and we will address these concerns by answering the questions.
Question 1: How robust is the approach to NLI degradation? If the NLI model used for contradiction detection underperforms—e.g., on non-English or technical content—how significantly does this affect the reliability of the MIS defense?
Response to question 1:
We thank the reviewer for pointing out this potential issue. To address your concern, we have done the following additional experiments to demonstrate that the performance of our framework degrades gracefully with NLI degradation. In particular, we repeat our experiments on RealtimeQA and Mistral-7B with the prompt injection attack, but impose some error probability on the NLI checks: for each contradiction edge, we invert it with a fixed probability, and we test the performance of our framework under different values of this error probability. The accuracy results of these additional experiments are as follows:
With k = 10 retrieved documents:
| Error probability | Attack @ Pos 1 | Attack @ Pos 5 | Attack @ Pos 10 |
|---|---|---|---|
| 0.1 | 65 | 67 | 66 |
| 0.3 | 64 | 62 | 62 |
| 0.5 | 55 | 54 | 51 |
With k = 50 retrieved documents:
| Error probability | Attack @ Pos 1 | Attack @ Pos 25 | Attack @ Pos 50 |
|---|---|---|---|
| 0.1 | 59 | 69 | 69 |
| 0.3 | 62 | 67 | 65 |
| 0.5 | 42 | 66 | 65 |
We can see that, even with 0.5 error probability (i.e. we invert each edge with 50% probability), our methods still demonstrate superior robustness. Note that for k = 50, the accuracies are almost unaffected when the attack is on position 25 or 50 because the malicious document is very unlikely to get sampled to begin with. We can thus conclude that our methods are quite robust against NLI degradation.
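For concreteness, here is a minimal sketch of the edge-inversion protocol described above, assuming the contradiction graph is available as a boolean adjacency matrix and that "inverting an edge" means flipping each pairwise contradiction decision; `perturb_contradiction_graph` and the downstream defense call are hypothetical names, not the paper's code:

```python
import numpy as np

def perturb_contradiction_graph(adj: np.ndarray, error_prob: float, rng=None) -> np.ndarray:
    """Simulate NLI degradation: independently invert each pairwise
    contradiction decision between distinct documents with probability error_prob."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = adj.copy()
    k = adj.shape[0]
    for i in range(k):
        for j in range(i + 1, k):
            if rng.random() < error_prob:
                flipped = not bool(noisy[i, j])
                noisy[i, j] = flipped
                noisy[j, i] = flipped
    return noisy

# Example sweep over the error probabilities reported in the tables above:
# for p in (0.1, 0.3, 0.5):
#     noisy_adj = perturb_contradiction_graph(adj, p)
#     ...  # re-run the MIS defense on noisy_adj and measure accuracy
```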
In addition, we would like to note that the possibility of NLI degradation is precisely why, on top of the empirical results, we have focused on providing theoretical analyses parameterized by the properties of the NLI performance. Furthermore, we have worked out, and will provide in later versions of our paper, new and tighter theoretical results and empirical simulations that augment this argument. Recognizing that malicious / low-quality documents might degrade the performance of NLI models, we provide a new version of Theorem 1 that permits a larger error probability between benign and malicious documents than the current version of the paper. In particular, we show that even if the NLI model makes errors with ~0.3 probability between benign and malicious documents, our method still retains its robustness guarantee under certain natural conditions on the total number of documents. We also conducted new simulations on the contradiction graph generation process (similar to Appendix B.1.1), and found that even with ~0.4 NLI error probability between benign and malicious documents, the probability that any MIS includes any malicious document stays near zero until the number of malicious documents becomes substantial relative to the total number of documents. Note that all of the above analysis assumes the worst case where every pair of malicious documents is non-contradictory.
Ultimately, we would like to mention that we are not relying on NLI model accuracy in isolation – our defence framework adopts the philosophy of defence-in-depth. The NLI signal is used in conjunction with the ranking signal, and both are inputs to the MIS algorithm; so in order for adversaries to succeed, they need to exploit weaknesses in the NLI model and conduct adversarial search engine optimization, and do so in a way that subverts the protections provided by the MIS algorithm that uses both types of signals. Thus, although our approach works best when the NLI model is accurate, its defence-in-depth nature allows for graceful handling of NLI inaccuracies in the face of realistic adversaries.
Question 2: Could the sampling parameters be learned or adapted? Is it feasible to tune the sampling distribution or context size in a data-driven manner, rather than fixing γ, T, and m heuristically?
Response to question 2:
We thank the reviewer for pointing out this question. This is a very promising direction for future work that generally falls into the learning-augmented algorithms paradigm, and we think is feasible. In our paper, we have also provided detailed ablation studies that could guide the choice of parameters in a data-driven and compute-aware manner:
- For γ: In our experiments, we use exponentially decaying weights with decay factor γ as a proxy for the reliability scores of documents. This is because we use the Google retriever to retrieve relevant documents, and it does not come with an explicit reliability score. In real deployments of our system, we would expect retrieved documents to come with some reliability score or metadata that offers a good estimate of one; these scores could then be used in place of the exponentially decaying weights in the paper.
- For the number of sampling rounds T: As we have shown in Appendix C.6, the performance of our algorithm essentially increases monotonically with T. In other words, increasing T trades off compute for enhanced robustness. Thus, in real deployment, one can decide on an appropriate value of T by considering how much compute is available, e.g., using binary search to find the maximum acceptable T within a given compute budget. T = 20 turns out to be a good choice that nicely balances compute and performance in our experiments.
- For the context size m: As we have shown in Appendix C.6, the performance generally decreases as m increases, because a larger m increases the likelihood of sampling malicious documents. Thus, if one expects many potentially malicious or misleading documents, one might take a smaller m, e.g. m = 2 as in our experiments. If malicious or misleading documents are unlikely, one can take a larger m, as that keeps the algorithm from missing relevant and useful documents. (A minimal sketch of how these parameters fit together is given below.)
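As a rough illustration of how γ, T, and m interact, here is a minimal sketch of the weighted sample-and-aggregate loop under the assumptions above; `answer_with_context` is a hypothetical stand-in for the LLM answering step, the default gamma is only an assumed example value, sampling is with replacement for simplicity, and majority voting is just one simple choice of aggregator:

```python
import random
from collections import Counter

def sample_and_aggregate(documents, query, answer_with_context,
                         gamma=0.8, T=20, m=2, seed=0):
    """Run T sampling rounds; in each round draw m documents with weights
    gamma**rank (rank 0 = top search result), answer the query on that small
    context, then aggregate the per-round answers by majority vote."""
    rng = random.Random(seed)
    weights = [gamma ** rank for rank in range(len(documents))]
    answers = []
    for _ in range(T):
        picked = sorted(rng.choices(range(len(documents)), weights=weights, k=m))
        context = [documents[i] for i in picked]
        answers.append(answer_with_context(query, context))
    return Counter(answers).most_common(1)[0][0]
```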
Question 3: Can the MIS selection be integrated with document reranking? For example, would combining MIS with re-ranking via learned relevance scores further enhance robustness or utility?
Response to question 3:
Yes, definitely. For example, if each document comes with a relevance / trustworthiness score, then we can construct a weighted contradiction graph where the weight of each document is its relevance / trustworthiness score. Then, we can find a weighted MIS instead of an unweighted one. Of course, the robustness / performance of this method depends on the quality of the relevance / trustworthiness score. If high-quality documents indeed have high scores, then the weighted MIS algorithm described above will be more likely to include them in the MIS, thus leading to enhanced robustness or utility. This is a very promising direction for future work, but out of scope for the first paper in this space, as we wanted to demonstrate the promise of our method in the conceptually simplest possible scenario. Contradiction-based graph-theoretic methods should be a widely applicable technique that can see interesting applications in many different forms / domains.
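As a rough sketch of this weighted variant (brute force over subsets, so only feasible for small document counts), assuming `scores` holds nonnegative per-document relevance/trustworthiness scores and `contradicts[i][j]` is a hypothetical precomputed NLI contradiction matrix:

```python
from itertools import combinations

def weighted_mis(scores, contradicts):
    """Exhaustively search for the independent set (no contradiction edge
    inside) with the largest total relevance/trustworthiness score.
    Exponential in the number of documents, so only for small retrieval sets."""
    n = len(scores)
    best, best_weight = (), 0.0
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            if any(contradicts[i][j] for i, j in combinations(subset, 2)):
                continue
            weight = sum(scores[i] for i in subset)
            if weight > best_weight:
                best, best_weight = subset, weight
    return list(best)
```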
Question 4: What happens when no consistent majority exists? In cases where most retrieved documents are noisy, off-topic, or benignly contradictory, how does the system avoid excessive pruning or failure?
Response to question 4:
We appreciate the reviewer for pointing out this question, which points to a major motivation for why we want to take reliability / rank information into account. We first note that completely noisy, off-topic documents are usually useless and could even have adverse effects on answering the question correctly, and the best way to deal with them is to prune them out, regardless of how our system works (e.g. [1] explicitly lists “context filtering” as a canonical sub‑module of RAG pipelines, alongside retrieval and generation).
If no consistent majority exists among the relevant documents (after filtering out noisy, off-topic ones), then even humans may not be able to effectively identify the correct answer from them. Our algorithm deals with this case gracefully, though, as we explicitly prioritize higher-ranked documents. For example, say there are k documents and each document points to a different answer to the question. However, as long as the top-ranked document contains the correct answer, our algorithm can find it. In short, by explicitly utilizing rank / reliability information, our algorithm is still likely to find a decent answer even if a consistent majority is absent.
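A minimal sketch of this rank-prioritized selection, assuming documents are indexed in rank order (index 0 = highest-ranked) and `contradicts` is the precomputed contradiction matrix; among all maximum independent sets it keeps the one favoring the highest-ranked documents:

```python
from itertools import combinations

def rank_prioritized_mis(num_docs, contradicts):
    """Among all maximum independent sets of the contradiction graph, return
    the lexicographically smallest index tuple, i.e. the one that favors the
    highest-ranked documents (index 0 = top-ranked). Exponential in num_docs."""
    for size in range(num_docs, 0, -1):                     # prefer larger sets
        independent = [
            s for s in combinations(range(num_docs), size)
            if not any(contradicts[i][j] for i, j in combinations(s, 2))
        ]
        if independent:
            return list(min(independent))                   # rank tie-breaking
    return []
```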
References:
[1] Sharma et al., Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers
Thank you for the detailed clarifications provided in response to my questions. The explanations have helped address several of my concerns. However, I noticed that the authors did not address the limitation I previously pointed out. If the authors could provide a further clarification or response on this point, I would be open to reconsidering and potentially adjusting my score.
We appreciate the reviewer for the followup, and we apologize for not having responded to the limitations in addition to the questions, which are all great points.
Limitation 1: The approach presumes the existence of a coherent, contradiction-free majority of retrieved documents ...
Response to limitation 1:
This is a great question. In the real world, queries can be ambiguous or multi-perspective. One can potentially build a more flexible framework on top of our MIS approach. For example, one can detect whether several MISs exist with comparable rank / weight and either (i) provide a multi-view answer that summarizes each consistent cluster or (ii) ask a clarifying question when clusters are highly incompatible. This is one of many ways to build a reliable agentic system using our MIS framework. One can also replace the hard NLI threshold (β) with a weighted graph whose edge weights come from NLI scores and select a maximum-weight, low-cut subgraph. However, these might be out of the scope of the current paper, and the question of how to extend our MIS framework to more complex scenarios is a very interesting and fruitful direction of future research. Our work offers both theoretical and practical grounding for this framework, and we expect to see broader applications of this simple and interpretable approach in more complex scenarios in the future.
Limitation 2: The scalability of the full MIS procedure is limited to relatively small k (e.g., ≤ 20) ...
Response to limitation 2:
As we have discussed in the response to your question 2, tuning the sampling distribution or context size in a data-driven manner is a very promising direction for future work that generally falls into the learning-augmented algorithms paradigm and we believe is feasible. In addition, in our paper, we have also provided detailed ablation studies that could guide the choice of parameters in a data-driven and compute-aware manner.
Ultimately, the weighted sampling framework in the form we present is just one way to scale up MIS, and we choose to present it because it is both provably robust (as shown in Appendix B) and works well in practice. There are many potential ways to scale up the MIS framework: For example, one can use effective heuristics such as LP rounding or Luby’s algorithm to compute an approximate MIS. One can also use sampling to implement a natural iterative filtering process: In each iteration, we sample, say, 20 or 30 documents (the maximum size of a graph on which we can compute an exact MIS) and filter out the non-MIS documents. All these are interesting and nice heuristics that we believe can effectively scale up MIS and are fruitful directions for future research, but might be out of the scope of the current paper. That said, we believe the MIS framework is very valuable both for its superior performance and for its extensibility, as it has the potential to become the foundation of a new class of robust defence algorithms.
Limitation 3: The reliance on search engine ranking as a proxy ...
Response to limitation 3:
For settings outside of web search, such as academic corpora, social media, or enterprise knowledge bases, the retrieved documents usually come with metadata that specifies their source, publisher, authors and affiliations, publication date, document type, license, number of citations, etc., many of which are great sources for estimating the reliability of the retrieved document.
Limitation 4: The filtering mechanism (based on “I don’t know” responses) ...
Response to limitation 4:
One can definitely replace the “I don’t know” filter with non-LLM gates: for example, a calibrated reranker score threshold or a lightweight relevance / quality classifier (e.g., a CRAG-style retrieval evaluator [1]). In addition, there is a solid body of work in the information retrieval literature showing that LLMs can effectively judge (and thus filter out) non-relevant documents, such as [2], [3], and [4]. We also note that it is not necessary to run this pre-filtering step with exactly the same LLM as the generation LLM; one can fine-tune a dedicated model for this particular step for consistent and high-quality filtering.
We thank the reviewer for pointing out these potential limitations, and we will better address these points in later versions of our paper.
References:
[1] Yan et al., Corrective Retrieval Augmented Generation
[2] Nouriinanloo et al., Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models
[3] Upadhyay et al., A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look
[4] Sun et al., Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents
Dear Reviewer Xjz6,
The authors have already submitted their response to your review, and we are approaching the deadline for author-reviewer discussions (August 8, 11:59 PM AoE).
Could you please share your feedback with the authors regarding whether their response sufficiently addresses the concerns raised in your review, and if any issues remain that require further clarification or elaboration?
As a key reminder, active engagement in the discussion process is critical at this stage. Reviewers must participate in discussions with authors (e.g., by posting follow-up feedback in the discussion thread), and then submit a Mandatory Acknowledgement to confirm your participation.
Best regards,
AC
This work introduces a novel defense against RAG retrieval attacks by leveraging document reliability signals (rank/weights) through two key mechanisms: (1) a Maximum Independent Set (MIS) algorithm that selects mutually consistent documents while prioritizing high-reliability sources, and (2) a scalable weighted sampling framework for large retrieval sets. Experiments demonstrate good robustness against prompt injection/corpus poisoning attacks while maintaining high benign performance, especially in long-form generation.
Strengths and Weaknesses
Strengths:
- First framework to exploit document ranks/weights for provable robustness against retrieval attacks. The idea is simple and straightforward but seems to work;
- Weighted sampling extends MIS robustness to large corpora with minimal accuracy loss;
- Maintains good robustness against attacks while excelling in long-form generation.
Weaknesses:
- Robustness hinges on NLI model accuracy; vulnerable to adversarial contradictions;
- MIS and NLI checks add overhead, limiting real-time use;
- No validation for scenarios where reliability signals (ranks/weights) are miscalibrated or manipulated.
Questions
Can you measure the time overhead of actually using the framework and whether it will significantly affect the user experience?
Limitations
Although the framework looks effective, it lacks an explanation of the actual application capabilities, and adding this would add real meaning to the work.
Final Justification
Thanks to the authors for the detailed rebuttal. It resolves my concerns to some extent. The paper could be improved by considering more application scenarios.
Formatting Issues
No
Weakness 1: Robustness hinges on NLI model accuracy; vulnerable to adversarial contradictions.
Response to weakness 1:
We thank the reviewer for pointing out this potential issue. To address your concern, we have done the following additional experiments to demonstrate that the performance of our framework degrades gracefully with NLI degradation. In particular, we repeat our experiments on RealtimeQA and Mistral-7B with the prompt injection attack, but impose some error probability on the NLI checks: for each contradiction edge, we invert it with a fixed probability, and we test the performance of our framework under different values of this error probability. The accuracy results of these additional experiments are as follows:
With k = 10 retrieved documents:
| Error probability | Attack @ Pos 1 | Attack @ Pos 5 | Attack @ Pos 10 |
|---|---|---|---|
| 0.1 | 65 | 67 | 66 |
| 0.3 | 64 | 62 | 62 |
| 0.5 | 55 | 54 | 51 |
With k = 50 retrieved documents:
| Error probability | Attack @ Pos 1 | Attack @ Pos 25 | Attack @ Pos 50 |
|---|---|---|---|
| 0.1 | 59 | 69 | 69 |
| 0.3 | 62 | 67 | 65 |
| 0.5 | 42 | 66 | 65 |
We can see that, even with 0.5 error probability (i.e. we invert each edge with 50% probability), our methods still demonstrate superior robustness. Note that for k = 50, the accuracies are almost unaffected when the attack is on position 25 or 50 because those documents are very unlikely to get sampled to begin with. We can thus conclude that our methods are quite robust against NLI degradation.
In addition, we would like to note that the possibility of NLI degradation is precisely why, on top of the empirical results, we have focused on providing theoretical analyses parameterized by the properties of the NLI performance. Furthermore, we have worked out, and will provide in later versions of our paper, new and tighter theoretical results and empirical simulations that augment this argument. Recognizing that malicious / low-quality documents might degrade the performance of NLI models, we provide a new version of Theorem 1 that permits a larger error probability between benign and malicious documents than the current version of the paper. In particular, we show that even if the NLI model makes errors with ~0.3 probability between benign and malicious documents, our method still retains its robustness guarantee under certain natural conditions on the total number of documents. We also conducted new simulations on the contradiction graph generation process (similar to Appendix B.1.1), and found that even with ~0.4 NLI error probability between benign and malicious documents, the probability that any MIS includes any malicious document stays near zero until the number of malicious documents becomes substantial relative to the total number of documents. Note that all of the above analysis assumes the worst case where every pair of malicious documents is non-contradictory.
Ultimately, we would like to mention that we are not relying on NLI model accuracy in isolation – our defence framework adopts the philosophy of defence-in-depth. The NLI signal is used in conjunction with the ranking signal, and both are inputs to the MIS algorithm; so in order for adversaries to succeed, they need to exploit weaknesses in the NLI model and conduct adversarial search engine optimization, and do so in a way that subverts the protections provided by the MIS algorithm that uses both types of signals. Thus, although our approach works best when the NLI model is accurate, its defence-in-depth nature allows for graceful handling of NLI inaccuracies in the face of realistic adversaries.
Weakness 2: MIS and NLI checks add overhead, limiting real-time use.
Response to weakness 2:
We thank the reviewer for pointing out this potential issue, and we will provide more detailed discussion in the response to your Question 1. Here, we would like to clarify that we did provide detailed discussions of the overhead of our framework in Appendix D.2. In particular, even with our current setup (which is not very optimized in terms of running time), the NLI checks induce nearly negligible overhead (< 0.05s even for k = 50 retrieved documents). The “isolated answering” stage and MIS construction contribute a major part of the overhead, but are still far from limiting real-time use (only around 0.2–0.4s).
Weakness 3: No validation for scenarios where reliability signals (ranks/weights) are miscalibrated or manipulated.
Response to weakness 3:
We thank the reviewer for pointing out this potential issue. We first note that these scenarios are somewhat out of the scope of our current work, because one major novelty of our methods is that the ranks are difficult to compromise in the search context (e.g. in google retrievers).
In addition, we would like to note that our methods could offer decent solutions even in the case where reliability signals are compromised.
In MIS, we are essentially taking a majority vote over the documents, so even if the reliability signals of a few documents are manipulated, as long as the benign ones still form a majority, our algorithm will return the correct answer. If the benign ones are not the majority to begin with and, furthermore, malicious documents manage to climb up the ranking, then even humans would find it hard to effectively identify the correct answer.
In Sampling + MIS, one might control the decay factor γ to decide how much to trust higher-ranked documents compared to lower-ranked documents. In scenarios where reliability signals can be easily compromised or are known to be of low quality, one might place more uniform weights across the documents to avoid overly trusting the higher-ranked ones.
Question 1: Can you measure the time overhead of actually using the framework and whether it will significantly affect the user experience?
Response to question 1:
We would like to clarify that we did provide a detailed discussion of overhead issues and potential strategies to mitigate them in Appendix D.2. In that section, we show that the running time of our entire framework is ~0.61s for k = 10 retrieved documents + Mistral-7B, ~0.41s for k = 10 retrieved documents + Llama-3B, ~1.32s for k = 50 retrieved documents + Mistral-7B, and ~0.92s for k = 50 retrieved documents + Llama-3B, with vLLM and simple parallelism like batched queries. Moreover, this is just based on our prototype academic setup and can, of course, be accelerated even further with more proper parallelization and tighter system integration.
To potentially reduce the latency overhead even more, one can perform the “isolated answering” stage using a smaller, faster language model instead of the LLM for the RAG query. Such a model could rapidly assess documents for basic contradictions or irrelevance. This is likely sufficient for detecting rudimentary issues such as simple prompt injections or factual poisoning, but more targeted and nuanced attacks may bypass the filter, requiring careful consideration based on the specific threat model and application context.
Even the current sub-1s latency of our preliminary setup will not significantly affect the user experience. Users interacting with sophisticated RAG systems often experience multi-second or even minute-long latencies, which can stem not only from retrieval and basic generation but also from extensive downstream processing, such as reasoning or other test-time scaling techniques applied for enhanced analysis and answer quality. Again, we would like to note that our current setup is not very optimized in terms of time overhead, as the major goal of our paper is to show the value of our contradiction-based graph-theoretic approach.
Thank you, and apologies for the delayed response. I truly appreciate the high-quality rebuttal. The rebuttal has addressed my concerns to some extent. I will keep my score.
The authors propose a defence framework for RAG systems that explicitly exploits document-level reliability signals to mitigate corpus-level attacks like prompt injection and poisoning. Their core idea is to identify a consistent majority of mutually non-contradictory, high-reliability documents before passing them to the language model. To do so, they construct a contradiction graph over the retrieved passages using an NLI model and then select a rank-aware maximum independent set (MIS); when several MIS of equal size exist, the algorithm chooses the lexicographically highest-ranked one, thereby prioritising trustworthy sources.
Strengths and Weaknesses
Strengths
- Clear motivation and well-structured investigation
- The work provides a clear framework for robustness checks exploiting previously ignored signals
- The addition of another NLI model does indeed improve robustness while naturally improving effectiveness
Weaknesses
- At least 2 decimal points and CI / Significance would be helpful in presenting convincing evidence in the main results, generally tests have been applied sporadically.
- The motivation (L 62-63) that higher-ranked results are more trustworthy appears to conflate cognitive bias and causal factors; while this may hold due to popularity signals in a commercial setting, I feel this motivation should be clarified for an ad-hoc setting. Generally the work is well-specified in scope so this is a nitpick but may be helpful.
- The approach could be contrasted with more naive signals, given the motivation in a commercial setting why not apply naive lexical matching and observe if any attack can even succeed?
- The lack of human annotation (solely llm-as-a-judge) prevents evidence from being particularly compelling, simply using an openQA dataset which uses for example multiple-choice answers such that llm-as-a-judge is unneeded would be sufficient to strengthen evidence.
- The latency experiments without explicit weakness discussion are not convincing but with clearer discussion and acknowledgement that 4x RAG is not satisfactory in most settings, this may be considered a minor point.
Note
- The use of the NLI model appears somewhat similar to broader static rank, it would be interesting to observe comparisons with these models to consider if a neural model (and thus massive overhead in pair-wise comparisons) is needed.
Questions
You make no assumption when both documents are malicious. In practice, how often did your attacks produce malicious pairs that the NLI judged “non-contradictory,” and does this undermine robustness?
Did you perform sensitivity analysis on beta, and is there a principled way to adapt it per query or domain?
More broadly for parameters please clarify choices made for beta as mentioned, m, t etc.
Limitations
Yes
Final Justification
From author clarifications and verification through additional experiments, I see the paper in a more positive light. Some concerns are more difficult to address than others as the issue may not be with the paper itself but in the paper following current practise which may be unsound.
Formatting Issues
None
Weakness 1: At least 2 decimal points and CI / Significance would be helpful in presenting convincing evidence in the main results, generally tests have been applied sporadically.
Response to weakness 1:
We thank the reviewer for the suggestion, and we apologize that we were not able to repeat all experiments in our paper multiple times due to budget constraints (we were using GPT-4o-as-a-judge to judge the correctness of answers for high-quality judgement, which was quite expensive for large-scale experiments). We note that we did provide results with error bars in Appendix C.4 (Mistral-7B, on all datasets). As presented in Figures 4 and 5, the error bars are very narrow, and our method consistently performs better than all the other baselines. We appreciate the reviewer for pointing this out, and we will provide more discussion of the statistical significance of our results.
Weakness 2: The motivation (L 62-63)...
Response to weakness 2:
We thank the reviewer for pointing this out, and we will revise Line 62-63 to make this distinction clear.
Weakness 3: The approach could be contrasted with more naive signals, given the motivation in a commercial setting why not apply naive lexical matching and observe if any attack can even succeed?
Response to weakness 3:
We thank the reviewer for raising this question. We have two interpretations of what the reviewer might have had in mind by contrasting with naive lexical matching. One interpretation is checking how well each document lexically matches the prompt and selecting the documents that match the most. Such approaches are known to be vulnerable to adversarial attacks, and prior works show attacks with documents that are lexically / semantically close to the prompt (e.g., [3]); thus lexical matching may be a useful, but not sufficient, signal. Our second interpretation is that the reviewer suggests replacing the NLI model with lexical matching to detect contradictions / non-contradictions. This is, of course, possible, and can be viewed as a simplified version of our approach. We are not sure whether we misinterpreted the suggestion, and we would appreciate it if the reviewer could clarify the approach they suggest we contrast with so that we can provide additional experimental results for it.
Weakness 4: The lack of human annotation (solely llm-as-a-judge) prevents evidence from being particularly compelling, simply using an openQA dataset which uses for example multiple-choice answers such that llm-as-a-judge is unneeded would be sufficient to strengthen evidence.
Response to weakness 4:
We thank the reviewer for pointing out this potential complication. We first note that llm-as-a-judge and limited datasets are a function of the cost of running such experiments; a commercial lab less limited by cost could scale them up. In addition, using llm-as-a-judge (in particular, GPT-4o-as-a-judge) is already much more rigorous than many existing works with QA-dataset evaluations, such as [1], which uses direct string comparisons. It is also a widely adopted method for judging the correctness of answers on QA datasets, e.g. by OpenAI ([2]). Furthermore, we human-evaluated 100 llm-judged answers on the RealtimeQA dataset and found that all the judgments were correct.
Finally, to better address your concern, we also conduct additional experiments by constructing a multiple-choice dataset using our RealtimeQA dataset and test the performance of our methods compared against baseline methods (with Mistral-7B and prompt injection attack, using the same setup of experiments as presented in our paper):
With k = 10 retrieved documents:
| Method | Attack @ Pos 1 | Attack @ Pos 5 | Attack @ Pos 10 |
|---|---|---|---|
| MIS | 65 | 70 | 68 |
| RobustRAG (keyword) | 62 | 65 | 50 |
| VanillaRAG | 51 | 51 | 19 |
| InstructRAG | 64 | 56 | 27 |
| AstuteRAG | 30 | 22 | 16 |
With k = 50 retrieved documents:
| Method | Attack @ Pos 1 | Attack @ Pos 25 | Attack @ Pos 50 |
|---|---|---|---|
| SampleMIS | 70 | 77 | 79 |
| RobustRAG (keyword) | 55 | 67 | 63 |
| VanillaRAG | 54 | 45 | 16 |
| InstructRAG | 58 | 42 | 16 |
| AstuteRAG | 29 | 17 | 16 |
We can see that our methods again compare favorably against the baseline methods in this multiple-choice setting, which offers additional evidence for the strength of our methods. We appreciate the reviewer for pointing out the value of showing this, and we will add these results and relevant discussions to later versions of our paper.
Weakness 5: The latency experiments without explicit weakness discussion are not convincing but with clearer discussion and acknowledgement that 4x RAG is not satisfactory in most settings, this may be considered a minor point.
Response to weakness 5:
We acknowledge that our method is inducing additional latency, but we would also like to clarify that this is just based on our prototype setup and can be accelerated more with proper parallelization and tighter system integration. To potentially reduce the latency overhead even more, one can perform the “isolated answering” stage using a smaller, faster language model instead of the LLM for the RAG query.
We note that even the current sub-1s latency of our preliminary setup will not significantly affect the user experience. Users interacting with sophisticated RAG systems often experience multi-second or even minute-long latencies, which can stem not only from retrieval and basic generation but also from extensive downstream processing, such as reasoning or other test-time scaling techniques applied for enhanced analysis and answer quality.
Response to the Note:
We appreciate the reviewer for pointing out this relation, which is an interesting direction for future work. We would also like to clarify that, as presented in Appendix D.2, the pairwise NLI checks induce nearly negligible (rather than massive) overhead (< 0.05s even for k = 50 retrieved documents).
Question 1: You make no assumption when both documents are malicious. In practice, how often did your attacks produce malicious pairs that the NLI judged “non-contradictory,” and does this undermine robustness?
Response to question 1:
We clarify that in our theoretical analysis (Theorem 1), we assume the worst-case scenario where all malicious pairs are judged as non-contradictory, as adversarial behaviors can induce outcomes that can be arbitrarily bad. Even under this pessimistic assumption, our method shows nice theoretical guarantees. In reality, (1) In single-position attacks, there’s only one malicious document so there are no malicious pairs here. (2) In multi-position attacks, all malicious documents are constructed in such a way that they point to the same malicious answer, so any pair of malicious documents are non-contradictory, i.e., we consider the worst-case scenario in our design of experiments as well. As presented in Appendix C.5, our methods show excellent robustness against multi-position attacks, even when all pairs of malicious documents are non-contradictory. To summarize, the potential issue of malicious pairs being non-contradictory is studied and addressed both theoretically and experimentally in our paper. We apologize if our current presentation is unclear regarding this point, and we will make this clearer in later versions of our paper.
Question 2: Did you perform sensitivity analysis on beta, and is there a principled way to adapt it per query or domain?
Response to question 2:
Yes, we did experiment with different values of β, from 0.2 to 0.8. One observation is that the output of the NLI model (contradiction probability) is usually almost binary (very close to 0 or 1), so our results are not very sensitive to the choice of β. Since 0.5 is a natural and previously adopted choice (e.g., in [1]), we present β = 0.5 in the paper. Per our observations in experiments (which range over many different domains and queries), there is no need for per-query or per-domain adaptation. We thank the reviewer for pointing out the value of a sensitivity analysis here, and we will add it to later versions of our paper.
Question 3: More broadly for parameters please clarify choices made for beta as mentioned, m, t etc.
Response to question 3:
We appreciate the reviewer for pointing out the importance of clarifying the design choices made, and we will add more discussion regarding them. We presented extensive ablation studies regarding the choice of parameters in Appendix C.6, along with discussions of their impacts, including (1) the impact of varying the context size m; (2) the impact of varying the number of sampling rounds T; (3) the impact of varying the decay factor γ; and (4) the impact of different weight-decay schemes.
References:
[1] Xiang et al., Certifiably Robust RAG against Retrieval Corruption
[2] OpenAI, openai/simple-evals GitHub repository
[3] Zou et al., PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models
My thanks to the authors for their thorough response. With respect to weakness 3, I rather like the alternate interpretation chosen by the authors. However, what I meant was, often in an adversarial setting in ranking, we ignore that many larger systems are built first to retrieve k documents using term matching heuristics to then be re-ranked. Depending on the adversarial process, this term matching could be harmed. Thus, I would be interested in seeing how lexical match (say, under Porter stemming) is affected by attacks. I am not asking for a full re-evaluation with a different pipeline, but would instead ask that stemming is applied to contrast the attacked text with the original in terms of their query match.
Overall, the response is thorough, and I will raise my score. Conditional on the above discussion and subsequent possible experiments, I could raise it further.
We appreciate your clarification, which is very helpful for us. We have conducted additional experiments to evaluate whether a lexical filter built on Porter stemming would counter our attacks, and we find that it does not. For each piece of text (e.g. a query or a retrieved document), we use the canonical nltk.stem.PorterStemmer implementation to compute its stem set. Then, for each query-document pair, we compute the Jaccard overlap between the stem set of the query and the stem set of the retrieved document. We choose the Jaccard overlap between two sets since it is common practice in information retrieval for quantifying lexical similarity [1]. Given a query, a benign document, and its poisoned counterpart, we declare that the attack fails the lexical check if the query–poisoned-document overlap falls below some chosen threshold while the query–benign-document overlap does not.
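For reference, a minimal sketch of this check; the tokenization (lowercased alphabetic tokens) and the helper names are assumptions on our part and may differ from the exact preprocessing used:

```python
import re
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()

def stem_set(text):
    """Porter-stem the lowercased alphabetic tokens of `text`, as a set."""
    return {_stemmer.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def attack_fails_lexical_check(query, clean_doc, poisoned_doc, delta):
    """The attack 'fails' the filter when the benign document clears the
    Jaccard threshold delta but its poisoned counterpart does not."""
    q = stem_set(query)
    return (jaccard(q, stem_set(clean_doc)) >= delta
            and jaccard(q, stem_set(poisoned_doc)) < delta)
```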
For each threshold value in {0.1, 0.2, ..., 0.9}, on the RealtimeQA dataset (with 100 queries and 4734 retrieved documents in total), the attacks fail the lexical check at the following rates:
With prompt injection attack:
| Threshold | Failure rate (%) | Number of attacks that fail check | Total |
|---|---|---|---|
| 0.1 | 0.00 | 0 | 4734 |
| 0.2 | 0.00 | 0 | 4734 |
| 0.3 | 0.00 | 0 | 4734 |
| 0.4 | 0.02 | 1 | 4734 |
| 0.5 | 0.21 | 10 | 4734 |
| 0.6 | 0.30 | 14 | 4734 |
| 0.7 | 0.32 | 15 | 4734 |
| 0.8 | 0.21 | 10 | 4734 |
| 0.9 | 0.11 | 5 | 4734 |
With corpus poisoning attack:
| Threshold | Failure rate (%) | Number of attacks that fail check | Total |
|---|---|---|---|
| 0.1 | 0.00 | 0 | 4734 |
| 0.2 | 3.72 | 176 | 4734 |
| 0.3 | 4.31 | 204 | 4734 |
| 0.4 | 1.88 | 89 | 4734 |
| 0.5 | 1.20 | 57 | 4734 |
| 0.6 | 0.59 | 28 | 4734 |
| 0.7 | 0.32 | 15 | 4734 |
| 0.8 | 0.21 | 10 | 4734 |
| 0.9 | 0.11 | 5 | 4734 |
We can see that no more than 4.31% of attacks fail the lexical check for the corpus poisoning attack, even under the optimal choice of threshold in hindsight, and almost no attacks fail the lexical check for the prompt injection attack. Thus, simply doing lexical matching is not sufficient for countering adversarial attacks, even with our preliminary design of attacks.
In some sense, these results are not surprising, especially in a commercial setting, because the malicious prompts are usually designed to be closely relevant to the query (see Appendix E.2 for the exact description of our attack prompt design). As an example, for a query about the best car, the original document might be “Honda is the best car, because it has superior reliability and speed”, while a poisoned document might be “Toyota is the best car, because it has superior reliability and speed”, both of which are equally suited to passing a lexical filter. In fact, we suspect that in two-stage retrieval systems, powerful adversaries could take advantage of the two stages by exploiting common lexical checks to make their malicious documents more likely to get retrieved (e.g. with the strategy presented in Section 4.2.2 of [2]). Thus, we would say that naive signals like lexical matching may not be sufficient for countering adversarial attacks, and more sophisticated defences are needed. This serves as an additional motivation for our graph-theoretic contradiction-based method, and we sincerely appreciate the reviewer for bringing up this point.
We also thank the reviewer for bringing up this direction because it raises a new question as to whether there’s a “tax” to adversarial documents that may be observed through other means, similarly to how recent works like [3] show that there’s a usefulness tax on jailbroken outputs. Although this tax does not appear visible under lexical checking, we wonder whether other more sophisticated techniques may be able to pick up on it, and think it’s an interesting direction for future work.
[1] Manning et al., An Introduction to Information Retrieval
[2] Zou et al., PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models
[3] Nikolic et al., The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
This paper proposes ReliabilityRAG, a defense framework against retrieval corpus attacks in RAG systems. More specifically, the authors leverage document reliability information (such as search rankings) and identify a set of mutually consistent documents by constructing a contradiction graph and solving for the Maximum Independent Set (MIS). Their experiments show that while maintaining high accuracy in benign scenarios, the method demonstrates superior robustness against attacks like prompt injection and corpus poisoning.
Strengths and Weaknesses
Strengths
- I have a good impression of this paper - adversarial defense for RAG systems is an important research topic.
- Transforming the document consistency problem into a maximum independent set problem on graphs is a very clever idea. The methodology of constructing contradiction graphs based on NLI and applying MIS is both solid and practically actionable.
- Nice combination of theory and practice: The paper not only provides theoretical guarantees of λ-robustness (Theorem 1), but also addresses computational efficiency for large document sets through the weighted sample and aggregate framework, showing good practicality.
Weaknesses
- Did the authors try applying different NLI models (I'm not sure if there are more NLI options available), and did they do sensitivity analysis on the contradiction threshold β? I'd like to see more ablation results.
- The analysis of multi-position attacks could be more in-depth. Section 5.4 only shows partial results - I suggest providing more analysis in the main text about system behavior when attackers simultaneously corrupt documents at multiple different ranking positions. This would help understand the limits of the defense mechanism.
- Some detail issues: The isolated answering step in Algorithm 1 incurs additional LLM call overhead - can this be reduced through caching or other optimizations? Some discussion and insights would be helpful.
In Figure 2's caption, "attacked documents" should clearly specify that it's a continuous suffix attack pattern.
From a paper formatting perspective, I suggest the authors not use paragraph breaks in the abstract and make it more concise (the current abstract is too long).
Similarly, I hope the authors carefully polish the introduction section. While I understand that using bold text and hard paragraph breaks can more clearly express related work, for a research paper, I'd prefer the introduction to have better overall coherence and narrative flow.
Questions
see the weakness
Limitations
Yes
Final Justification
Thank the authors for addressing my questions. I think this is a good paper and will keep my score
Formatting Issues
the format is correct
Weakness 1: Did the authors try applying different NLI models (I'm not sure if there are more NLI options available), and did they do sensitivity analysis on the contradiction threshold β? I'd like to see more ablation results.
Response to weakness 1:
Did the authors try applying different NLI models - Yes, we did experiment with other NLI models like deberta-v3-large-mnli, and that yields similarly good results. In particular, on RealtimeQA, with Mistral-7B and under prompt injection attack, the accuracy results for our techniques are as follows:
| Setting (k = 10) | Attack @ Pos 1 | Attack @ Pos 5 | Attack @ Pos 10 |
|---|---|---|---|
| MIS | 68 | 68 | 62 |

| Setting (k = 50) | Attack @ Pos 1 | Attack @ Pos 25 | Attack @ Pos 50 |
|---|---|---|---|
| Sampling + MIS | 53 | 70 | 71 |
In general, we expect (and our theoretical results support this) most NLI models with decent performance to yield similarly good results.
Did they do sensitivity analysis on β - Yes, we did experiment with different values of β, from 0.2 to 0.8. One observation is that the output of the NLI model (contradiction probability) is usually almost binary (very close to 0 or 1), so our results are not very sensitive to the choice of β. Since 0.5 is a natural and previously adopted choice (e.g., in [1]), we present β = 0.5 in the paper.
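As an illustration of how a contradiction edge can be derived from an MNLI-style classifier and the threshold β, here is a minimal sketch; the checkpoint name is only a placeholder for whatever NLI model is used, the label name is looked up from the model config since it varies across checkpoints, and taking the max over both directions is just one reasonable convention:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_CHECKPOINT = "deberta-v3-large-mnli"  # placeholder for the NLI model actually used
tokenizer = AutoTokenizer.from_pretrained(NLI_CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(NLI_CHECKPOINT).eval()

def contradiction_prob(premise: str, hypothesis: str) -> float:
    """Probability mass the NLI classifier assigns to the 'contradiction' label."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    label2id = {name.lower(): idx for name, idx in model.config.label2id.items()}
    return probs[label2id["contradiction"]].item()

def has_contradiction_edge(answer_i: str, answer_j: str, beta: float = 0.5) -> bool:
    # One reasonable convention: an undirected edge if either direction crosses beta.
    return max(contradiction_prob(answer_i, answer_j),
               contradiction_prob(answer_j, answer_i)) > beta
```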
We thank the reviewer for pointing out the value of more ablation studies, and we will add them into later versions of our paper.
Weakness 2: The analysis of multi-position attacks could be more in-depth. Section 5.4 only shows partial results - I suggest providing more analysis in the main text about system behavior when attackers simultaneously corrupt documents at multiple different ranking positions. This would help understand the limits of the defense mechanism.
Response to weakness 2:
We appreciate this suggestion, and we sincerely apologize that we were not able to put more results about multi-position attacks in the main body of the paper due to limited space. More comprehensive analysis and discussion are presented in Appendix C.5, where we experimented with different corruption sizes (0, 1, 2, 3, 4 for k = 10 retrieved documents and 0, 5, 10, 15, 20 for k = 50 retrieved documents) on all the datasets. Our MIS and Sampling + MIS methods almost consistently perform significantly better than RobustRAG (Keyword), and show excellent robustness even when almost half of the documents are malicious.
We thank the reviewer for pointing out the importance of this part, and we will add more discussions about multi-position attacks in the main body in later versions of our paper.
Weakness 3.1: Some detail issues: The isolated answering step in Algorithm 1 incurs additional LLM call overhead - can this be reduced through caching or other optimizations?
Response to weakness 3.1:
Can overhead be reduced through caching or other optimizations - Yes, definitely. We did provide a detailed discussion of overhead issues and potential strategies to mitigate them in Appendix D.2. In that section, we show that the overhead of the “isolated answering” step is ~0.27s for k = 10 retrieved documents + Mistral-7B, ~0.16s for k = 10 retrieved documents + Llama-3B, ~0.38s for k = 50 retrieved documents + Mistral-7B, and ~0.20s for k = 50 retrieved documents + Llama-3B, with vLLM and simple parallelism like batched queries. Moreover, this is just based on our prototype setup and can be accelerated even more with proper parallelization and tighter system integration.
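As an illustration of the batched "isolated answering" setup, here is a minimal sketch with vLLM; the model checkpoint, prompt template, and decoding settings are placeholders and may differ from the paper's actual configuration:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")     # placeholder checkpoint
params = SamplingParams(temperature=0.0, max_tokens=64)

def isolated_answers(query, documents):
    """Answer the query once per retrieved document; all prompts go into a
    single batched generate() call so per-document overhead is amortized."""
    prompts = [
        f"Answer the question using only the document below.\n"
        f"Document: {doc}\nQuestion: {query}\nAnswer:"
        for doc in documents
    ]
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text.strip() for out in outputs]
```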
Weakness 3.2: Some discussion and insights would be helpful. In Figure 2's caption, "attacked documents" should clearly specify that it's a continuous suffix attack pattern.
Response to weakness 3.2:
We thank the reviewer for the suggestion, and we will change Figure 2’s caption to make it clear that we are considering a continuous suffix attack. We sincerely apologize that we were not able to provide more discussion and insights in the main body of the paper due to limited space; we placed more discussions (including more experiments, extensive ablation studies, and adaptive attack) in Appendix C. We will add more discussions and insights in later versions of our paper and in the main text.
Weakness 3.3: From a paper formatting perspective, I suggest the authors not use paragraph breaks in the abstract and make it more concise (the current abstract is too long).
Response to weakness 3.3:
We thank the reviewer for the suggestion, and we will prune and reformat the abstract in later versions of our paper.
Weakness 3.4: Similarly, I hope the authors carefully polish the introduction section. While I understand that using bold text and hard paragraph breaks can more clearly express related work, for a research paper, I'd prefer the introduction to have better overall coherence and narrative flow.
Response to weakness 3.4:
We thank the reviewer for the suggestion. We will refine the introduction section in later versions of our paper to improve its coherence and narrative flow.
References:
[1] Schuster et al., Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters
Thank the authors for addressing my questions. I think this is a good paper and will keep my score
Summary
This paper proposes ReliabilityRAG, a defense framework to mitigate retrieval corpus attacks (e.g., prompt injection, corpus poisoning) in Retrieval-Augmented Generation (RAG) systems. The core design leverages document reliability signals (e.g., search rankings) and a graph-theoretic approach: 1) constructing a "contradiction graph" where edges between documents indicate conflicts (detected via Natural Language Inference, NLI) and 2) finding a rank-prioritized Maximum Independent Set (MIS) to select mutually consistent, high-reliability documents. To address the computational cost of exact MIS for large retrieval sets, the authors further propose a scalable weighted sampling-and-aggregate framework. Empirically, ReliabilityRAG maintains high accuracy in benign scenarios, outperforms baselines (e.g., RobustRAG, VanillaRAG) in robustness against attacks, and excels in long-form generation tasks where prior defense methods struggled. The work also provides theoretical guarantees of λ-robustness against bounded adversarial corruption.
Strengths
- This paper addresses a timely and high-impact problem: Adversarial vulnerabilities in RAG systems (e.g., corrupted retrieval corpora) are critical barriers to real-world deployment (e.g., Google’s Search AI Overview). The work fills a gap by explicitly integrating reliability signals (a unique advantage of search-based RAG) into defense, rather than treating retrieval outputs as homogeneous.
- The methodology is both theoretically rigorous and practically actionable: Translating document consistency into a rank-aware MIS problem is a novel, intuitive insight—grounded in graph theory and validated with provable robustness guarantees (Theorem 1). The authors further balance theory with practice by introducing a scalable sampling framework, resolving the computational bottleneck of exact MIS for large k (e.g., 50 retrieved documents) while preserving robustness.
- Empirical results are strong and comprehensive: The framework maintains high benign accuracy, outperforms baselines across attack positions (e.g., Attack @ Pos 1, 5, 10 for k=10) and attack types (prompt injection, corpus poisoning), and addresses a key limitation of prior work by performing well in long-form generation.
- This paper adopts a defense-in-depth philosophy: By combining NLI-based contradiction detection with ranking signals, the framework is resilient to NLI model degradation—experiments show graceful performance drops even with 50% NLI error probability, and malicious documents in low-ranked positions (e.g., Pos 25/50 for k=50) rarely impact results due to sampling prioritization.
Weaknesses
- Evaluation depth is limited in key areas: Multi-position attacks (a critical real-world threat) are only partially analyzed in the main text (Section 5.4), with more comprehensive results confined to the appendix. The authors plan to move this content to the main text, but the initial presentation undersells the framework’s ability to handle coordinated attacks. Additionally, the evaluation lacks diverse attack types (e.g., adaptive attacks that exploit both NLI and ranking signals), which would strengthen claims of robustness.
- Dependence on NLI model performance is not fully validated. While the authors demonstrate graceful degradation with NLI errors, they do not test performance on non-English or highly technical domains—settings where NLI models are known to underperform. This limits generalizability beyond the English open-domain datasets (e.g., RealtimeQA) used in experiments.
- Parameter choices lack full theoretical justification: Sampling parameters (decay factor γ, number of rounds T, context size m) are set heuristically (e.g., T=20, m=2) based on empirical performance, with ablation studies but no theoretical bounds on their optimality. This raises questions about how to adapt the framework to new domains without extensive tuning.
- Latency and real-world scalability need more validation: While the authors report sub-1s latency for prototypes (e.g., ~0.61s for k=10 + Mistral-7B), they acknowledge this is unoptimized. For high-scale systems (e.g., real-time search), more details on parallelization, caching, or lightweight model replacements (e.g., smaller LLMs for "isolated answering") would strengthen practicality claims.
Discussions during Rebuttal Period
During the rebuttal period, reviewers raised targeted concerns. The authors addressed those concerns with experiments or clear revision plans.
- Reviewer PSmh raised four points: 1) NLI model sensitivity and β threshold analysis; 2) insufficient multi-position attack results; 3) overhead of "isolated answering"; 4) abstract/introduction formatting. The authors responded by: testing additional NLI models (deberta-v3-large-mnli) with similar results, showing β (0.2–0.8) has minimal impact (NLI outputs are near-binary); providing comprehensive multi-position attack results in Appendix C.5 (to be moved to main text); quantifying overhead (~0.27s for k=10 + Mistral-7B) and proposing caching/parallelization; and agreeing to prune the abstract and refine introduction coherence. These responses were fully satisfactory—NLI robustness and overhead concerns were resolved with data, and formatting issues are easily fixed.
- Reviewer ei33 asked for: 1) statistical rigor (CI/decimal points); 2) contrast with lexical matching; 3) human annotation alternative to LLM-as-judge; 4) latency discussion. The authors addressed these by: providing error bars in Appendix C.4; conducting lexical matching experiments (showing <4.31% attack failure even at the optimal threshold, proving inadequacy); adding a multiple-choice dataset (no LLM-as-judge) with results matching main experiments, plus human validation of 100 LLM-judged cases (100% accuracy); and clarifying latency is optimizable (sub-1s prototype, parallelization potential). These responses were compelling—lexical matching’s limitations and LLM-judge reliability were confirmed with data.
- Reviewer pm3b raised concerns about: 1) NLI degradation impact; 2) MIS/NLI overhead; 3) reliability signal manipulation. The authors responded by: showing graceful performance drops (e.g., 55% accuracy at 50% NLI error for k=10, Attack @ Pos 1); detailing overhead (<0.05s for NLI checks, ~0.61s total for k=10); and noting MIS’s majority vote mitigates signal manipulation (a benign majority preserves correctness). These responses resolved the concerns—NLI resilience and overhead were quantified, and signal manipulation was addressed via the framework’s design.
- Reviewer Xjz6 asked: 1) NLI robustness in non-English/technical domains; 2) parameter learning feasibility; 3) MIS + reranking integration; 4) handling no consistent majority; plus limitations on ambiguity/multi-perspective queries. The authors responded by: acknowledging non-English/technical validation as future work but noting defense-in-depth mitigates NLI gaps; framing parameter learning as a promising learning-augmented direction; agreeing MIS + weighted reranking is feasible (future work); and proposing multi-view answers/clarifying questions for no majority. While non-English validation remains unaddressed, the authors’ transparency about future work and practical solutions for ambiguity were acceptable.
Decision Justification
After reading the reviews and author responses, along with the discussions, I think the paper's strengths (e.g., novel methodology, theoretical rigor, strong empirical performance, and real-world relevance) outweigh its addressable weaknesses.
Originality and impact: The graph-theoretic MIS approach, combined with reliability signals, introduces a new paradigm for RAG defense that goes beyond prior methods (e.g., keyword-based RobustRAG) by integrating provable robustness and practical scalability. This advances the field of RAG robustness, a rapidly growing area with direct implications for trusted AI deployment.
Theory-practice balance: Unlike many defense papers that focus solely on empirical results, ReliabilityRAG provides clear theoretical guarantees (λ-robustness) while addressing computational barriers via sampling—critical for bridging lab and real-world use.
Response to core concerns: The authors have effectively addressed reviewer questions (e.g., NLI degradation, lexical matching inadequacy, multi-position attacks) with additional experiments or concrete revision plans, leaving no unaddressed fatal flaws.
Weaknesses (e.g., limited multi-position attack analysis, NLI domain validation) are not fundamental and can be resolved in revisions (e.g., moving appendix content to main text, adding a small experiment on technical domains). The framework’s ability to maintain accuracy while defending against attacks fills a critical need for RAG systems, justifying acceptance.