/10

Poster4 位审稿人

最低3最高4标准差0.5

ICML 2025

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

Yung-Sung Chuang,Benjamin Cohen-Wang,Zejiang Shen,Zhaofeng Wu,Hu Xu,Xi Victoria Lin,James R. Glass,Shang-Wen Li,Wen-tau Yih

OpenReview PDF

提交: 2025-01-23更新: 2025-07-26

TL;DR

We designed a self-supervised reward to align LLMs for generating better citations to attribute the context when answering to questions, without human supervision.

摘要

关键词

Large Language ModelsLLMsAlignmentPreference OptimizationContext AttributionCitation

评审与讨论

审稿意见

评分: 42025-03-04

This submission proposes SelfCite, an approach for LLMs to improve their citation of sentences from the input context to support their responses. SelfCite evaluates the quality of a citation in terms of the LLM's probability of generating the same response without the cited sentences and with only the cited sentences. This self-supervised reward is used for best-of-N sampling, which is in turn used to generate training data for preference optimization to improve the LLM's intrinsic citation capability. SelfCite is evaluated on the LongBench-Cite benchmark (composed of 5 datasets) against baselines that use prompting, fine-tuning, or contributive context attribution, showing superior citation F1 scores. The paper also presents several studies of ablations/alternatives.

Update after rebuttal

The rebuttal provided additional discussions, clarifications, and a latency table that I trust will be added to further improve the paper.

给作者的问题

Please see Weaknesses under Other Strengths and Weaknesses.

论据与证据

Yes, I think the experimental design and results are sound.

方法与评估标准

The proposed approach makes sense as it judges the quality of cited sentences based on their contribution to the LLM's probability of generating the given response, which is mostly aligned with the degree to which the cited sentences support the response in a general sense (not from the LLM's point of view). The evaluation is done on LongBench-Cite, which is a recognized benchmark for citation quality.

理论论述

The submission does not make theoretical claims.

实验设计与分析

I read the experimental sections in full and did not find issues with soundness.

My only major comment is that the discussion of experimental results should also address the different notions of "citation" targeted by ContextCite on the one hand and evaluated by LongBench-Cite (and similar benchmarks) on the other hand. The former is a contributive context attribution method, i.e., one that aims to find the sources "that a model actually uses when generating a statement" (lines 423-424), whereas the latter is based on GPT-4o annotations of whether context sentences "support" a statement in a general sense. I think the paper in general could discuss this distinction more clearly or more often. Regarding Section 3.4 specifically, given these somewhat different objectives, I think it is expected that the LongCite fine-tuned models in Table 1 are already slightly better than ContextCite in terms of LongBench-Cite performance. Similarly, this might explain the decent but inferior results of SFT (supervised fine-tuning) on ContextCite. What I find interesting and a bit unexpected is that using contributive attribution to re-rank citations can then improve LongBench-Cite performance, despite the (small?) mismatch in objectives.

补充材料

I read Appendix B on the use of ContextCite and Appendix D on the comparison with Claude Citations.

与现有文献的关系

Section 5 discusses the relationships with 1) other work on teaching LLMs to generate citations as well as with 2) contributive context attribution and 3) self-supervised alignment in general. The main distinction with respect to 1) is the use of the techniques of 2) to further improve citation quality in a self-supervised manner. A longer-term goal, which the submission reports some results on, is to produce high-quality citations (as measured by benchmarks like LongBench-Cite) in a completely self-supervised manner.

遗漏的重要参考文献

I cannot think of essential references that were not discussed, but please see "Other Comments or Suggestions" for additional references on contributive context attribution.

其他优缺点

Strengths

I think (as mentioned above) that it is a very interesting idea and finding that leveraging a somewhat different notion of "citation", namely importance in causing a model to generate a certain statement, can improve performance with respect to the evaluated notion of citation, namely logically supporting a statement in a general sense.
The paper reports on many ablations and alternatives: the fully self-supervised case of SFT on ContextCite, and all the subsections of Section 4 (different rewards, citation length balancing, preference optimization vs. SFT, etc.). The set of baselines considered (shown in Table 1) is also comprehensive.

Weaknesses

Some aspects of best-of-N sampling (Section 2.3) are not clear to me:

I am wondering why the text implies that best-of-N sampling has to be done "after generating the full response" (lines 154-155). Could it not also be done after generating each statement $r_i$ ?
In re-sampling citation sequences in position $e_i$ , are the future statements and citations $r_{i+1}, e_{i+1}, ...$ removed? The notation in Algorithm 1 suggests yes?
Lines 132-134, right column, "additional inference cost of generating candidates and re-ranking": This inference cost should be quantified more precisely somewhere.

其他意见或建议

Minor questions and comments:

What constitutes a statement $r_i$ in the response? How is the response divided into statements?
Lines 112-113, right column, "we exclude the BoN candidates that cites more than 384 tokens in total": I believe this refers to the set of context sentences $c_{e_i^1}, ..., c_{e_i^m}$ , not the sequence of identifiers $e_i^1, ..., e_i^m$ , but it is not completely clear.
Equation (1): Were weighted sums of the probability drop and probability hold metrics also considered?
Table 2: Are the numbers in the second Llama-3.1-8B-Instruct row better than the ones in the first Llama-3.1-8B-Instruct row because the second one is for answering with citations (i.e. generating citations also improves answer correctness)?
The Limitations section could also note that SelfCite assumes access to the LLM's predicted probabilities, which may not always be available.

Additional references on contributive context attribution:

G. Sarti et al. "Inseq: An Interpretability Toolkit for Sequence Generation Models." ACL 2023.
L. Monteiro Paes et al. "Multi-Level Explanations for Generative Language Models." https://arxiv.org/abs/2403.14459

作者回复

2025-03-31

We thank Reviewer PWPX for the constructive comments!

…different notions of "citation" targeted by ContextCite on the one hand and evaluated by LongBench-Cite (and similar benchmarks) on the other hand.

Thanks for pointing out the mismatch between the objectives of corroborative (sources that support a statement, e.g., LongCite & LongBench-Cite benchmark) and contributive context attribution (e.g., ContextCite). As you said, SelfCite applies contributive alignment (using context ablations) to a method for corroborative evaluation (LongBench-Cite). Our intuitions are:

Among citation candidates proposed by LongCite, the candidates actually “used” by the model are also more likely to “support” the statement—other candidates may just be semantically related. Although such support is certainly not always guaranteed, the two objectives are still aligned to some extent.
Current corroborative methods (LongCite) have a big room to improve, even by a contributive method, despite their discrepancy in goals. In other words, if LongCite were already near perfect, enforcing it to be “more contributive” may not help much.

We will discuss this nuanced point more clearly throughout our paper!

…why the text implies that best-of-N sampling has to be done "after generating the full response" (lines 154-155). Could it not also be done after generating each statement $r_i$ ?

Yes, BoN only needs each statement $r_i$ before sampling citation $e_i$ . Our implementation here is mostly for convenience. We generate a full response to get $r_1, ..., r_S$ first, then re-sample each $e_i$ , considering the weak dependence between $r_{>i}$ and $e_i$ .

In re-sampling citation sequences in position $e_i$ , are the future statements and citations $r_{i+1}, e_{i+1}$ ,… removed? The notation in Algorithm 1 suggests yes?

Yes, future statements aren't used when sampling citation $e_i$ .

Lines 132-134, right column, "additional inference cost of generating candidates and re-ranking": This inference cost should be quantified…

We measured latency per example (8*A100 GPUs, batch size 1, model parallelism) on LongBench-Cite. Direct decoding of LongCite-8B vs. SelfCite SimPO shows similar latency. BoN sampling+reranking is ~7x slower. We'll add this to our paper.

Method	Avg latency (s)
LongCite-8B	24.3
SelfCite:
BoN sampling	149.0
BoN reranking	34.0
SimPO model	26.2

What constitutes a statement $r_i$ in the response? How is the response divided into statements?

In LongCite-45k, statements are split based on semantic integrity in data generation, so fine-tuned models will naturally learn to produce statements. We'll clarify this in our paper.

Lines 112-113, right column, "we exclude the BoN candidates that cite more than 384 tokens in total": I believe this refers to the set of context sentences $c_{e_i^1}, ..., c_{e_i^m}$ , not the sequence of identifiers $e_i^1, ..., e_i^m$ …

Yes, the length limit was applied on the cited texts, not identifiers. We'll clarify this in our paper.

Equation (1): Were weighted sums of the probability drop and probability hold metrics also considered?

Good suggestion! We only tested 1:1 weights for simplicity but will explore this further.

Table 2: Are the numbers in the second Llama-3.1-8B-Instruct row better than the ones in the first Llama-3.1-8B-Instruct row because the second one is for answering with citations?

After carefully checking our experiments, we found the second Llama-3.1-8B-Instruct row (avg 71.7) was actually mistakenly taken from the ContextCite result, which uses greedy decoding and answering without citations, thus is not directly comparable.

We reran the experiments and show full results in the table below. The first Llama-3.1-8B-Instruct row in Table 2 of our paper should be updated with row (2) below (avg 68.9). Its original scores with "†" (avg 60.2) in Table 2 of our paper are taken from Table 3 in LongCite paper and thus have some prompt/implementation differences (they didn't open-source this part of code). We will update it to our own results for now. The second Llama-3.1-8B-Instruct row should be updated with row (4) (avg 63.3).

In summary, answering with citations hurts accuracy (68.9 -> 63.3), which is expected and consistent with the same trend from Table 3 in the LongCite paper (all non-LongCite models have such degradation when asked to answer with citations.) We'll update Table 2 of our paper.

	Long.	Multi.	Hot.	Dur.	Gov.	Avg
Answering without citations
(1) Greedy (CC)	67.4	87.9	73.5	67.8	62.1	71.7
(2) Sampling	66.0	83.7	65.8	62.8	66.1	68.9
Answering with citations
(3) Greedy	61.2	79.0	68.8	60.0	54.9	64.8
(4) Sampling	58.4	75.3	67.3	59.3	56.4	63.3

The Limitations section could also note that SelfCite assumes access to the LLM's predicted probabilities…

Additional references on contributive attribution: …

We'll add these points and references to our paper. Thanks again for the valuable suggestions!

审稿人评论

2025-04-04

Thanks to the authors for their responses. I trust that the additional discussions, clarifications, and latency table will be added to the paper.

Regarding the correction to Table 2, it seems then that there is no longer a degradation in answer correctness due to SFT on ContextCite data (lines 283-285).

作者评论

2025-04-04

We sincerely thank Reviewer PWPX for the very detailed and thoughtful review and the reply. Your in-depth questions and constructive feedback greatly helped us improve every detail of the paper!

We confirm that we will include all the above additional discussions, clarifications, and the latency table in the final version of our paper. Regarding the correction to Table 2, you’re right—there is no longer a degradation in answer correctness due to SFT on ContextCite data, and we will revise lines 283–285 accordingly to reflect this correction.

Thank you again for your time and valuable insights throughout the reviewing process!

审稿意见

评分: 32025-03-11

This paper proposes an attributable response generation strategy, SelfCite, which cites relevant sentences in the context supportive of the generated response. SelfCite can both operate during inference or during training. For inference, SelfCite pick the Best-of-N using a newly designed reward/score composed by probability drop and probability hold via ablating the reference. For training, the reward signal can be leveraged to curate DPO preference data to train the generation model. Experimental results on LongBench-Cite demonstrate the effectiveness of the proposed strategy in terms of citation quality.

给作者的问题

Refer to the above sections

论据与证据

Yes, the claim that "using the combination of probability drop and probability hold during inference and training can boost the citation quality" is supported by the experimental results.

方法与评估标准

The proposed method makes sense, but not entirely novel, as the ablation technique has been proposed by Cohen-Wang et al. in ContextCite. The authors adopt this technique to produce a reward or score to guide the selection of the best generation or to be used in preference optimization.
The evaluation mostly focuses on citation quality, but overlooks the answer accuracy which is also quite important. From Table 2, it seems the answer accuracy drops substantially on Llama-3 which is concerning.
Given many RL algorithms can be explored to improve the generation quality (both citation and answer accuracy), it is more beneficial to investigate these RL algorithms for this problem.
In the field of attributable generation, there are also other benchmark datasets, such as the ALCE benchmark (Gao et al., 2023) and ExpertQA (Malaviya et al., 2023). Is the proposed method applicable in these benchmarks?

Gao et al., 2023: "Enabling large language models to generate text with citations". Malaviya et al., 2023: "Expertqa: Expert-curated questions and attributed answers".

理论论述

There is no theoretical claim

实验设计与分析

Yes, I have checked the experimental designs and analyses. There are no issues.

补充材料

Yes, I have reviewed all the supplementary materials.

与现有文献的关系

The idea is quite related to the field of attributable generation in general. The methodology is inspired from the work of ContextCite.

遗漏的重要参考文献

No.

其他优缺点

Strengths:

The proposed approach using self rewards instead of manual annotations is meaningful. The results in terms of citation quality show the effectiveness of the proposed reward.
Extensive experiments are conducted over a well-known benchmark on both inference-only and training-based scenarios. Comprehensive analysis on model components, length balancing, data size and iteration is provided.

Weaknesses:

There is lack of detailed discussion on the comparison against existing approaches such as Zhang et al., 2024 and Cohen-Wang et al., 2024. How does SelfCite differentiate itself from these studies?
Given that many existing works have already investigated attributable generation, there is a lack of empirical comparison against these baselines. From related work, both Huang et al., 2024 and Gao et al., 2023, for example, have proposed methods to tackle this problem, but there are no comparisons in experiments.
Besides LongBench-Cite, what about other benchmarks such as the ALCE (Gao et al., 2023) datasets?
One key limitation is the degradation of answer accuracy. Admitting that the citation quality is crucial for reliable generation, we should not expect trading enhanced citation quality with decreased answer capability.

Gao et al., 2023: "Enabling large language models to generate text with citations".

其他意见或建议

作者回复

2025-03-31

We thank Reviewer Cy66 for the constructive comments!

Ours vs ContextCite

The proposed method makes sense, but not entirely novel, ablation technique has been proposed by ContextCite

While inspired by ContextCite (CC)’s context ablation (L029, right column), our key contribution differs notably: SelfCite enables an LLM to directly generate its own citations, while CC does not.

CC is post-hoc, relying on external linear models trained from scratch for each new example, requiring heavy inference per example (32 to 256 ablated generations) after a response was produced.

In contrast, SelfCite directly teaches LLMs to produce accurate citations in responses with a few tokens in one pass. The context ablation signals become internalized capabilities of LLMs after SimPO; no context ablations needed at inference.

Extra distinctions:

CC measures citation “necessity” by “prob drop”; SelfCite adds “sufficiency” by “prob hold” to catch missing citations.
CC’s linear model assumes context sentences independently impact responses; SelfCite directly learns from nonlinear “prob drop/hold” rewards, handling sentence interactions.

CC’s details were in Appendix B due to space. We'll move them to the main text.

Answer Accuracy Drops?

Table 2, it seems answer accuracy drops on Llama-3 which is concerning. (avg 71.7 vs 64.6)

Good catch! We carefully checked our experiments and found the high score (avg 71.7) of Llama3 baseline under “Answering with citations” was mistakenly copied from our ContextCite (CC) experiment. We reran the experiments to get a corrected baseline of avg 63.3 in row (4) of the table below. Compared to this correct baseline, both our + SFT on CC (avg 64.6) and + SimPO (Ours) (avg 64.7) in fact show slightly higher accuracy. We apologize for the mistake.

	Long.	Multi.	Hot.	Dur.	Gov.	Avg
Answering without citations
(1) Greedy (CC: wrong baseline)	67.4	87.9	73.5	67.8	62.1	71.7
(2) Sampling	66.0	83.7	65.8	62.8	66.1	68.9
Answering with citations
(3) Greedy	61.2	79.0	68.8	60.0	54.9	64.8
(4) Sampling (true baseline)	58.4	75.3	67.3	59.3	56.4	63.3
Table 2 from paper
+ SFT on CC	58.8	83.4	65.8	57.8	57.5	64.6
+ SimPO (Ours)	56.8	80.9	65.3	59.5	60.9	64.7

Why is CC higher? (avg 71.7)

CC’s higher score (71.7) comes from:

CC uses greedy decoding; we follow LongCite [1] to use sampling (top_p=0.7; temp=0.95). See rows (1) vs (2): avg 71.7 vs 68.9
CC’s citations are post-hoc, so it’s “answering without citations”, making answer generation easier and more accurate. See rows (1) vs (3): avg 71.7 vs 64.8

The correct baseline to be used is Sampling + Answering with citations in row (4): avg 63.3, which is slightly better than the same results from LongCite ([1], Table 3, row Llama-3.1-8B, column C’s). And it confirms no accuracy drop after our fine-tuning (avg 64.6 & 64.7). We’ll update Table 2 in our paper.

[1] LongCite: https://arxiv.org/pdf/2409.02897

More RL algorithms?

…beneficial to investigate these RL algorithms...

Our main goal is to validate a novel "reward" for citation. Prior work ([2], Figures 4 & 5) shows Best-of-N (BoN) closely approximates the upper-bound scores of RL without training artifacts to make comparisons more confounded. Following established practices [3, 4], we used BoN as main evaluation, and further verified it using training-based alignment, SimPO, achieving the same improvement of BoN. While extra RL algorithms may offer improvements, we believe they wouldn’t qualitatively change our observations given BoN results.

We also acknowledge SelfCite doesn’t aim to boost answer accuracy (but not to decrease it either). Combining it with answer-matching rewards to jointly improve citations & answers is an exciting direction we want to explore in future work!

[2] Controlled Decoding from Language Models, Mudgal et al., ICML 2024

[3] Scaling laws for reward model overoptimization, Gao et al., ICML 2023

[4] Let’s Verify Step by Step, Lightman et al., ICLR 2024

More Benchmarks/Baselines?

…other benchmark datasets, such as ALCE... …both Huang et al., 2024 and Gao et al., 2023 have proposed … there are no comparisons in experiments.

Following your advice, we evaluated on ALCE and compared to Huang et al. 2024 & Gao et al. 2023. Due to space, please see our rebuttal to Reviewer WRpv. Spoiler: SelfCite outperforms baselines on ALCE even on cross-domain transfer!

More Discussion?

lack of detailed discussion on comparison against existing approaches such as Zhang et al., 2024 (LongCite) & Cohen-Wang et al., 2024 (ContextCite).

Comparison with ContextCite is at the top of this reply; will be added to our paper.

Comparison with LongCite is in our Section 5. Briefly, LongCite uses data from proprietary APIs for SFT only. SelfCite performs further alignment steps without external supervision.

We also made a table to contrast key distinctions among prior works; see our rebuttal to Reviewer ci8i.

审稿人评论

2025-04-03

I appreciate the authors' responses to my questions with additional experiments. They have mostly addressed my concern and I have raised my score accordingly.

作者评论

2025-04-04

We sincerely thank Reviewer Cy66 for the thoughtful and constructive feedback and for taking the time to revisit our response and raising the score. We’re very glad to hear that the additional experiments already addressed your concerns!

If there are any remaining concerns, we’d be happy to clarify. Thanks again for your valuable input throughout the process!

p.s. We also wanted to remind Reviewer WRpv who mainly raised the same questions of adding more baselines of Huang et al., (2024) and Gao et al., (2023) on the ALCE benchmark. Since those additional experiments have been added in our rebuttal and already acknowledged by Reviewer Cy66, we hope our responses were helpful to resolve the concerns from Reviewer WRpv as well!

审稿意见

评分: 42025-03-24

This paper proposes a method ("Self-cite") to automatically evaluate cited text using context ablation -- i.e. changing the context and compare probability of generating a given sentence. It then proposes to use this signal to enhance citation quality as the reward model for two approaches: (1) Best-of-N sampling and (2) preference learning. Experiments show that it is possible to enhance the citation F1 for both LongCite-8B and Llama-3.1-8B-Instruct.

给作者的问题

Since the method (especially with BoN) aims to improve citation quality while keeping the answer unchanged, are there cases where the generated answer is not faithful (and thus can't be supported by the context) and how does SelfCite deal with that?

论据与证据

Experiments results on LongBench-Cite with two models (LongCite and Llama-3.1-8B) demonstrate that the proposed reward signal does improve citation quality.

方法与评估标准

Proposed method: The propose method leverages probability difference in generating a given answer when a piece of cited text is included / not included in the context, which is an intuitive method to approximate citation quality.
Dataset : The experiments are mainly conducted on the LongBench-Cite benchmark, which adopts sentence-level citations. However, my understanding is that proposed method is not limited to sentence-level citation by design and can be adopted to chunk-level citation. Therefore, it would be better to also include experiments are datasets such as ALCE [0].

[0] Enabling Large Language Models to Generate Text with Citations. Gao et al., EMNLP 2023.

理论论述

N/A

实验设计与分析

Baseline chosen: Experiments are conducted against three baselines: a prompting baseline, ContextCite and Fine-tuned models. I think the ContextCite baseline is not appropriate. More specifically, ContextCite is a method that is used to attribute generated response towards the context (i.e. generating citation), and thus it is applied on Llama-3.1-8B-Instruct, while the proposed method is a method that is used to improve citation quality, and is applied to model that is already fine-tuned to generate citation (either using the LongCite-SFT data or data generated with ContextCite in the experiment setting). Thus, it is a bit unfair to compare these two methods. On the other hand, previous method [1] has been proposed to leverage NLI models to measure citation precision / recall, which seems to bit a more appropriate baseline for the proposed method. While it is true that this method requires an external NLI model whereas the proposed method is "self-supervised", it would be helpful to compare these two approaches (if there is a gap, how big is it?).

[1] Training Language Models to Generate Text with Citations via Fine-grained Rewards. Huang et al., 2024 ACL.

补充材料

I briefly skim through the submitted code.

与现有文献的关系

The proposed method contributes to the line of work that enables language model to generate citation / attribution to its generation, which is an important research direction. The proposed "self-supervised" method is interesting in that it leverages context ablation to improve citation quality.

遗漏的重要参考文献

N/A

其他优缺点

Strength: Overall I think the proposed method is an effective approach to improve citation quality given an answer generated. The SimPO then BoN results are pretty strong on both LongCite-8B and Llama-3.1-8B SFT on ContextCite.
Weakness: While the proposed method is "self-supervised", it would be helpful to compare it with some "supervised" methods -- for instance, the NLI model as reward as mentioned before, or SFT with the data from LongCite that is used to create the preference-pair. It is ok if SelfCite does not out-perform these methods, but it will be helpful to understand the gap (if there is any).

其他意见或建议

N/A

作者回复

2025-03-31

We thank Reviewer WRpv for the constructive comments!

A Better Baseline: SimPO with NLI Rewards (Also for Reviewer Cy66)

…ContextCite baseline is not appropriate. … [1] has been proposed to leverage NLI models to measure citation precision / recall, which seems to bit a more appropriate baseline

We agree that ContextCite’s mechanism is much different from SelfCite, so we will change our framing and treat its scores mainly for reference. We follow your advice to adopt NLI rewards from Huang et al. 2024 [1] as baseline. For fair comparison, we reuse our SelfCite SimPO training pipeline (initializing from LongCite-8B + trained with LongCite-45k data), but only change the reward function to be the NLI-based citation recall/precision proposed in Huang et al. 2024 [1]. We ignore the correctness reward in [1] as we don’t have ground truth answers from LongCite-45k.

We compare this method (SimPO w/ NLI Rewards) with ours (SimPO w/ SelfCite) on both LongBench-Cite (table below) and ALCE (table in the next section). Both results show SimPO w/ NLI Rewards improves citation quality over LongCite (except for MultifieldQA & HotpotQA), but still consistently outperformed by SelfCite, further verifying the effectiveness of SelfCite. We will include this baseline in our final version paper.

Metric: Citation F1	Longbench-Chat	MultifieldQA	HotpotQA	Dureader	GovReport	Avg
LongCite-8B	66.6	79.9	64.1	73.7	84.5	73.8
+ SimPO w/ NLI Rewards	69.8	77.4	63.2	77.2	87.5	75.0
+ SimPO w/ SelfCite	69.1	81.0	71.5	78.9	89.1	77.9

Evaluation on Chunk-level Citation Benchmark ALCE (Also for Reviewer Cy66)

previous method [1] can be adopted to chunk-level citation. Therefore, it would be better to also include experiments are datasets such as ALCE [0].

We follow your advice to test our models on ALCE and show the results in the table below. We found that our baseline LongCite-8B already achieves much better citation recall/precision than the prompting method of Gao et al. (2023). The baseline “SimPO w/ NLI Rewards” (using the rewards from Huang et al., (2024) above) performs slightly better than LongCite-8B. Our method, “SimPO w/ SelfCite”, further brings substantial improvements over both baselines.

The bottom row is the best result from the supervised method of Huang et al. (2024). It differs from other model settings and was trained on in-distribution data. Its numbers are thus incomparable with the other rows and we only include them for reference. Specifically, the differences are:

They train the models only on the “in-distribution” training sets of QA datasets in ALCE, with the exact same chunk-level setting of ALCE, while SelfCite was trained on “out-of-distribution” LongCite-45k data with sentence-level citations.
They directly use the same NLI evaluator used in ALCE benchmark (google/t5_xxl_true_nli_mixture) to provide rewards for citation recall/precision, essentially optimizing the benchmark scores of ALCE directly.
They also do distillation from ChatGPT.

Despite the fact of a cross-domain & cross-setting transfer setting, SelfCite still achieves good performance much better than baselines: LongCite-8B & SimPO w/ NLI Rewards, showing its effectiveness. We will include this result in our paper.

	ASQA			ELI5
	EM Rec.	Cite Rec.	Cite Prec.	Correct	Cite Rec.	Cite Prec.
Gao et al. 2023
Llama-2-13B-chat	34.66	37.48	39.62	12.77	17.13	17.05
Llama-3.1-8B-Instruct	42.68	50.64	53.08	13.63	34.66	32.08
Finetuned on LongCite-45k
LongCite-8B	42.11	62.27	57.00	15.37	30.54	29.15
+ SimPO w/ NLI Rewards	41.20	65.65	60.20	15.30	33.06	31.05
+ SimPO w/ SelfCite	42.57	71.68	62.05	15.17	37.09	35.62
Finetuned on ALCE train set
Huang et al. 2024	40.05	77.83	76.33	11.54	60.86	60.23

How does SelfCite handle unfaithful answers?

are there cases where the generated answer is not faithful (and thus can't be supported by the context) and how does SelfCite deal with that?

We did sometimes (but not often) find that the answer can be a slight misunderstanding of the cited information, and SelfCite will still cite such text that it is based on. Stepping back, this is a common theme for all methods that post-hoc generate citations, where the prevailing philosophy is that “citations” are for the traceability and verifiability of answers, which enable a user to double-check the answer correctness easily, in case the answer is wrong. We will include this discussion in our final version paper.

审稿意见

评分: 32025-03-26

The paper presents SelfCite, a self-supervised method for improving citation accuracy in Large Language Models (LLMs). The key innovation lies in using context ablation to compute a self-rewarding signal based on necessity and sufficiency scores, which are then used to enhance citation quality through best-of-N sampling and preference optimization (SimPO). The method achieves up to 5.3 F1 improvement on the LongBench-Cite benchmark without requiring human annotations.

给作者的问题

Strength:

The paper introduces a novel self-supervised approach for citation alignment in LLMs, eliminating the need for human annotations. The combination of necessity and sufficiency scores to derive a reward function is well-motivated and provides a principled way to improve citation quality.
The method is designed to be lightweight, leveraging a model’s own probability estimates rather than requiring expensive external annotations. This makes it applicable to large-scale citation tasks in real-world settings, such as research assistants or fact-checking systems.
Unlike previous methods that rely on human annotations or costly API calls, SelfCite autonomously improves citation quality using a reward function derived from context ablation, making it highly scalable and cost-efficient. Weakness:
The paper should more explicitly differentiate SelfCite from previous work, especially in how it improves over ContextCite and other contributive context attribution methods. A comparison table summarizing key differences could be helpful.
The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.
Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.
The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

论据与证据

Yes

方法与评估标准

Strength:

The paper introduces a novel self-supervised approach for citation alignment in LLMs, eliminating the need for human annotations. The combination of necessity and sufficiency scores to derive a reward function is well-motivated and provides a principled way to improve citation quality.
The method is designed to be lightweight, leveraging a model’s own probability estimates rather than requiring expensive external annotations. This makes it applicable to large-scale citation tasks in real-world settings, such as research assistants or fact-checking systems.
Unlike previous methods that rely on human annotations or costly API calls, SelfCite autonomously improves citation quality using a reward function derived from context ablation, making it highly scalable and cost-efficient. Weakness:
The paper should more explicitly differentiate SelfCite from previous work, especially in how it improves over ContextCite and other contributive context attribution methods. A comparison table summarizing key differences could be helpful.
The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.
Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.
The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

理论论述

Strength:

The paper introduces a novel self-supervised approach for citation alignment in LLMs, eliminating the need for human annotations. The combination of necessity and sufficiency scores to derive a reward function is well-motivated and provides a principled way to improve citation quality.
The method is designed to be lightweight, leveraging a model’s own probability estimates rather than requiring expensive external annotations. This makes it applicable to large-scale citation tasks in real-world settings, such as research assistants or fact-checking systems.
Unlike previous methods that rely on human annotations or costly API calls, SelfCite autonomously improves citation quality using a reward function derived from context ablation, making it highly scalable and cost-efficient. Weakness:
The paper should more explicitly differentiate SelfCite from previous work, especially in how it improves over ContextCite and other contributive context attribution methods. A comparison table summarizing key differences could be helpful.
The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.
Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.
The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

实验设计与分析

Strength:

The paper introduces a novel self-supervised approach for citation alignment in LLMs, eliminating the need for human annotations. The combination of necessity and sufficiency scores to derive a reward function is well-motivated and provides a principled way to improve citation quality.
The method is designed to be lightweight, leveraging a model’s own probability estimates rather than requiring expensive external annotations. This makes it applicable to large-scale citation tasks in real-world settings, such as research assistants or fact-checking systems.
Unlike previous methods that rely on human annotations or costly API calls, SelfCite autonomously improves citation quality using a reward function derived from context ablation, making it highly scalable and cost-efficient. Weakness:
The paper should more explicitly differentiate SelfCite from previous work, especially in how it improves over ContextCite and other contributive context attribution methods. A comparison table summarizing key differences could be helpful.
The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.
Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.
The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

补充材料

Strength:

The paper introduces a novel self-supervised approach for citation alignment in LLMs, eliminating the need for human annotations. The combination of necessity and sufficiency scores to derive a reward function is well-motivated and provides a principled way to improve citation quality.
The method is designed to be lightweight, leveraging a model’s own probability estimates rather than requiring expensive external annotations. This makes it applicable to large-scale citation tasks in real-world settings, such as research assistants or fact-checking systems.
Unlike previous methods that rely on human annotations or costly API calls, SelfCite autonomously improves citation quality using a reward function derived from context ablation, making it highly scalable and cost-efficient. Weakness:
The paper should more explicitly differentiate SelfCite from previous work, especially in how it improves over ContextCite and other contributive context attribution methods. A comparison table summarizing key differences could be helpful.
The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.
Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.
The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

与现有文献的关系

Strength:

The paper introduces a novel self-supervised approach for citation alignment in LLMs, eliminating the need for human annotations. The combination of necessity and sufficiency scores to derive a reward function is well-motivated and provides a principled way to improve citation quality.
The method is designed to be lightweight, leveraging a model’s own probability estimates rather than requiring expensive external annotations. This makes it applicable to large-scale citation tasks in real-world settings, such as research assistants or fact-checking systems.
Unlike previous methods that rely on human annotations or costly API calls, SelfCite autonomously improves citation quality using a reward function derived from context ablation, making it highly scalable and cost-efficient. Weakness:
The paper should more explicitly differentiate SelfCite from previous work, especially in how it improves over ContextCite and other contributive context attribution methods. A comparison table summarizing key differences could be helpful.
The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.
Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.
The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

遗漏的重要参考文献

Strength:

The paper introduces a novel self-supervised approach for citation alignment in LLMs, eliminating the need for human annotations. The combination of necessity and sufficiency scores to derive a reward function is well-motivated and provides a principled way to improve citation quality.
The method is designed to be lightweight, leveraging a model’s own probability estimates rather than requiring expensive external annotations. This makes it applicable to large-scale citation tasks in real-world settings, such as research assistants or fact-checking systems.
Unlike previous methods that rely on human annotations or costly API calls, SelfCite autonomously improves citation quality using a reward function derived from context ablation, making it highly scalable and cost-efficient. Weakness:
The paper should more explicitly differentiate SelfCite from previous work, especially in how it improves over ContextCite and other contributive context attribution methods. A comparison table summarizing key differences could be helpful.
The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.
Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.
The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

其他优缺点

Strength:

The paper introduces a novel self-supervised approach for citation alignment in LLMs, eliminating the need for human annotations. The combination of necessity and sufficiency scores to derive a reward function is well-motivated and provides a principled way to improve citation quality.
The method is designed to be lightweight, leveraging a model’s own probability estimates rather than requiring expensive external annotations. This makes it applicable to large-scale citation tasks in real-world settings, such as research assistants or fact-checking systems.
Unlike previous methods that rely on human annotations or costly API calls, SelfCite autonomously improves citation quality using a reward function derived from context ablation, making it highly scalable and cost-efficient. Weakness:
The paper should more explicitly differentiate SelfCite from previous work, especially in how it improves over ContextCite and other contributive context attribution methods. A comparison table summarizing key differences could be helpful.
The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.
Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.
The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

其他意见或建议

Strength:

The paper introduces a novel self-supervised approach for citation alignment in LLMs, eliminating the need for human annotations. The combination of necessity and sufficiency scores to derive a reward function is well-motivated and provides a principled way to improve citation quality.
The method is designed to be lightweight, leveraging a model’s own probability estimates rather than requiring expensive external annotations. This makes it applicable to large-scale citation tasks in real-world settings, such as research assistants or fact-checking systems.
Unlike previous methods that rely on human annotations or costly API calls, SelfCite autonomously improves citation quality using a reward function derived from context ablation, making it highly scalable and cost-efficient. Weakness:
The paper should more explicitly differentiate SelfCite from previous work, especially in how it improves over ContextCite and other contributive context attribution methods. A comparison table summarizing key differences could be helpful.
The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.
Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.
The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

作者回复

2025-03-31

We thank Reviewer ci8i for the constructive comments!

A Comparison Table (Also for Reviewer Cy66)

The paper should more explicitly differentiate SelfCite from previous work… A comparison table summarizing key differences could be helpful.

We follow your suggestion to make a table and point out the key differences among the previous work. We will include it in our final version paper.

Method	Sentence-level citations?	One pass generation?	Preference optimization?	Handle 128K long-context?	External supervision?
ACLE	❌(chunk-level)	✅	❌(prompting)	❌(8K)	2-shot prompting
Huang et al. 2024	❌(chunk-level)	✅	✅	❌(8K)	NLI + ground truth
ContextCite	✅	❌(at least 32 calls)	❌(not generative)	✅	N/A
LongCite	✅	✅	❌(SFT only)	✅	SFT data
SelfCite (Ours)	✅	✅	✅	✅	N/A

A Toy Example of Necessity and Sufficiency Scores

The necessity and sufficiency scores are well-motivated, but additional theoretical justification or a toy example demonstrating their individual impact could strengthen the argument.

The necessity and sufficiency scores are designed based on human’s common preference: a citation has to be necessary and sufficient, which also matches the metrics of citation precision (necessary) and recall (sufficient) commonly exist in the evaluation benchmarks. Here we follow your advice to show a simple toy example demonstrating their individual impacts:

Document:

[1] Alice traveled to France in 2020.
[2] Bob visited the famous National Museum in Tokyo, Japan in 2019.
[3] Chloe visited the Louvre Museum in Paris, France in 2018.
…

Query:
"Which famous museum could Alice have visited?"

Response:
"Alice could have visited the Louvre Museum."

Citation Candidates:

[1,2] (Incorrect):
- Necessity: Probability drops (since removing [1] prevents “Alice traveled to Paris”). (✅ high necessity due to [1])
- Sufficiency: Probability doesn’t hold. [1,2] alone cannot fully support “visited the Louvre Museum” since [2] is irrelevant and [3] is missing. (❌ low sufficiency)
[1,3] (Correct):
- Necessity: Probability drops more; removing [1,3] loses essential details (“Alice traveled to Paris,” “visited Louvre”). (✅ high necessity)
- Sufficiency: Probability holds. [1,3] Fully supports the response. (✅ high sufficiency)

This toy example clearly shows the individual contributions of necessity and sufficiency scores.

Discussion on Efficiency

Since best-of-N sampling increases inference-time costs, a discussion on its efficiency trade-offs and potential ways to reduce overhead (e.g., pruning low-quality candidates early) would be beneficial.

We calculate the latency per example on the LongBench-Cite dataset. On average, direct decoding from LongCite-8B/SelfCite SimPO models have similar latency. When using SelfCite BoN, the sampling + reranking steps in total take roughly 7x longer time compared to the direct decoding methods. All experiments are done with 8*A100 GPUs with batch size 1 and model parallel.

Method	Avg latency (s)
LongCite-8B	24.3
SelfCite BoN sampling	149.0
SelfCite BoN reranking	34.0
SelfCite SimPO model	26.2

Also, because we only sample the citation sequence, not the whole responses, the number of generated tokens is very limited, usually within 5~10 tokens. The strategy of pruning low-quality candidates early may not help, as they are mainly for saving time in generating long responses.

Latency of BoN is not a major concern

In fact, we do not have concerns on the longer latency or extra inference cost of BoN, because we also have the SelfCite SimPO model that can achieve the same performance of BoN in one pass generation, without any additional inference cost. If the users have the need for best efficiency, the best solution is to directly use our SimPO model, instead of using our BoN and trying to optimize BoN.

Exploring Hyperparameters

The effect of hyperparameters like N in best-of-N sampling and the choice of probability thresholds for necessity/sufficiency scores should be explored more systematically.

There are no “probability thresholds” for necessity/sufficiency scores in SelfCite. We use the raw probability changes (probability drop and probability hold) during context ablation directly as the reward. There is no need to tune any thresholds in our reward design.

For N in best-of-N sampling, as we mentioned in Line 217 (left column), after deduplicating repeated citation candidates, on average there are only 4.8 candidates (std=3.2) left per statement. It is due to the fact that we only do sampling within the citation sequences, and keep the statements in the response unchanged. When generating citations, usually only a limited number of relevant sentences can support the statements, resulting in a limited possibility of citations to be generated. Given the low diversity of citation candidates, increasing N to be larger than 10 would have a very limited impact on the BoN results.

最终决定Accept (poster)

2025-05-01

SelfCite is a novel self-supervised framework designed to improve the accuracy of citations made by Large Language Models (LLMs) when referencing provided context. The core idea involves using context ablation—measuring the change in response probability when cited sentences are removed or isolated—to generate necessity and sufficiency scores, creating a self-reward signal without human labels. This reward signal is then effectively utilized both during inference via best-of-N sampling and for training via preference optimization (like SimPO), leading to significant improvements in citation quality on benchmarks such as LongBench-Cite. Reviewers generally agreed on the merits of the proposed method. We highly recommend authors incorporate these feedbacks into the revision.