PaperHub
Overall rating: 6.5/10 · Poster · 4 reviewers
Scores: 6, 6, 6, 8 (min 6, max 8, std dev 0.9)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.0
ICLR 2025

Integrative Decoding: Improving Factuality via Implicit Self-consistency

OpenReview · PDF
Submitted: 2024-09-22 · Updated: 2025-02-19
TL;DR

We propose a simple decoding approach to mitigate hallucinations by implicitly incorporating self-consistency in its decoding objective.

Abstract

Keywords

Large Language Models · Hallucination · Factuality · Decoding

Reviews and Discussion

Review
Rating: 6

This paper presents Integrative Decoding (ID), a new decoding strategy for improving the factual accuracy of LLMs by incorporating self-consistency during decoding. ID operates by sampling multiple responses to the same prompt and constructing new inputs by concatenating each response with the original prompt. Then, at each decoding step, the next token is determined by aggregating the predictions across all inputs simultaneously. It shows significant performance improvements across different LLMs on open-ended tasks, such as TruthfulQA, Biographies, and LongFact.
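The decoding procedure summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `logprobs_fn` is a hypothetical stand-in for one LLM forward pass that returns next-token log-probabilities given a context and the tokens generated so far.

```python
def integrative_decode(prompt, sampled_responses, logprobs_fn, eos="</s>", max_steps=64):
    # Each sampled response is prepended to the original prompt,
    # yielding k parallel decoding contexts.
    contexts = [response + "\n" + prompt for response in sampled_responses]
    output = []
    for _ in range(max_steps):
        # Aggregate next-token predictions across all k contexts by
        # summing log-probabilities, then pick the jointly best token.
        totals = {}
        for ctx in contexts:
            for token, logprob in logprobs_fn(ctx, output).items():
                totals[token] = totals.get(token, 0.0) + logprob
        best = max(totals, key=totals.get)
        if best == eos:
            break
        output.append(best)  # the same token is appended to every branch
    return " ".join(output)
```

Because every branch commits to the same token at each step, the final output is implicitly pushed toward content that all sampled responses agree on.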

Strengths

  1. The paper is easy to follow.
  2. The method proposed in this paper doesn't need additional prompting or training, making it more applicable to a wider range of scenarios than previous works.
  3. The experimental results show that the performance of this method consistently improves with an increase in the number of sampled responses. In contrast, other methods do not, demonstrating the stability of this approach.

Weaknesses

  1. Although this method alleviates the context burden to some extent compared to previous methods, the burden is still relatively large, and the method still requires multiple samplings, which incurs a certain computational cost.
  2. The principles underlying this method require more detailed explanation and support from experimental results, especially the reasonableness of its underlying assumption.
  3. In the "3.5 CASE STUDY" section, examples corresponding to the SR and USC methods are not provided.
  4. Although this paper mentions that ID can maintain coherence, it lacks quantitative evaluation results regarding the coherence and fluency of the responses generated by this method.

Questions

  1. Could you provide a more detailed explanation of why the "Context Length" of the ID method is "×2"?
Comment

Thanks for Reviewer Q7YL’s question.

In Table 1, "context length" compares the context length requirement of a particular method with that of a vanilla prompting approach. For vanilla prompting, the required context length is the sum of the prompt length and the response length. Integrative decoding extends this by the length of one additional sampled response, making it approximately twice that of the standard prompting scenario. For USC and SR, however, the context length must be extended by k times the sampled response length, so they should actually be marked as "×(k+1)". Our intention is to highlight that integrative decoding demands much less of the long-text processing capability of the LLM.

We will include a clearer explanation of this point in our revision.

Comment

Evaluation of Self-consistency

We measure the degree of self-consistency between an assessed output and a set of sampled responses, following [9, 10]. Formally, given a set of sampled responses $\mathcal{R} = \{r_1, r_2, ..., r_k\}$ and an output $y$ that encompasses a set of facts $y = \{s_1, s_2, ..., s_n\}$, we define the self-consistency score of $y$ as:

$$SC(y, \mathcal{R}) = \frac{1}{k \cdot n} \sum_{i=1}^{n} \sum_{j=1}^{k} \text{consistency}(s_i, r_j),$$

where $SC(\cdot)$ represents the self-consistency score, and $\text{consistency}(s_i, r_j)$ denotes whether $s_i$ is supported by $r_j$: it returns 1 if $s_i$ is supported by $r_j$, 0 if $s_i$ contradicts $r_j$, and 0.5 if the relationship is inconclusive. We employ GPT-4-turbo to assess $\text{consistency}(s_i, r_j)$ through the following prompt template:

Take the following facts about a person as truth: {premise}.

Please check the consistency between the text above and the fact "{hypothesis}".

Choose one of the following answers:

A. The fact is supported by the text above.

B. The fact is contradicted by the text above.

C. The fact is neither supported nor contradicted by the text above. It is inconclusive.

Your answer should be one word ("A", "B" or "C").

We conduct the evaluation on ID and the baseline approaches that aim to enhance self-consistency in the final output (i.e., USC, SR, SE-SL, SE-RG, FSC). The evaluation is conducted on the Biographies benchmark, which requires the model to list five major achievements of a scientist. We divide the output $y$ into a set of facts $\{s_1, s_2, ..., s_n\}$ by treating each listed major achievement as a separate fact. We consider the scenario where the factuality improvement approach integrates 8 sampled responses (i.e., $k=8$) and measure the self-consistency between the final output and the eight sampled responses. The sampled responses are obtained through temperature sampling, with T=0.7. We also evaluate the self-consistency level between an output that is directly generated through temperature sampling (T=0.7) and the other eight sampled responses, denoted as Vanilla.
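As a concrete reading of the formula, the score can be computed as follows. This is a sketch only: `consistency_fn` is a hypothetical stand-in for the GPT-4-turbo judgment above, returning 1, 0, or 0.5 for a (fact, response) pair.

```python
def self_consistency_score(facts, sampled_responses, consistency_fn):
    """SC(y, R): average of consistency(s_i, r_j) over all pairs of
    facts s_i in the output and sampled responses r_j."""
    k, n = len(sampled_responses), len(facts)
    total = sum(consistency_fn(s, r)
                for s in facts for r in sampled_responses)
    return total / (k * n)
```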

The evaluation results are as follows:

| Method \ Base Model | Llama2 | Llama3 | Mistral | Qwen | Gemma | GLM |
|---|---|---|---|---|---|---|
| Vanilla | 0.6087 | 0.6323 | 0.6024 | 0.6789 | 0.7069 | 0.6453 |
| USC | 0.6049 | 0.6524 | 0.6064 | 0.6765 | 0.7244 | 0.6641 |
| SR | 0.6345 | 0.6443 | 0.6509 | 0.7195 | 0.7204 | 0.6948 |
| FSC | 0.5984 | 0.6343 | 0.6099 | 0.6826 | 0.7097 | 0.6795 |
| SE-SL | 0.6221 | 0.6715 | 0.6435 | 0.6999 | 0.7481 | 0.6725 |
| SE-RG | 0.6393 | 0.6466 | 0.6344 | 0.7063 | 0.7520 | 0.6809 |
| ID | 0.6479 | 0.6821 | 0.6635 | 0.7366 | 0.7592 | 0.7336 |

We can see that the self-consistency level achieved by integrative decoding is significantly better than that of the other approaches across all six LLMs. Notably, the second-best approach in terms of self-consistency is SE-RG, but it demands significantly more inference cost than ID. According to our efficiency analysis, ID achieves superior self-consistency while consuming only 13.72% of the inference latency required by SE-RG.

[9] SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models.

[10] Detecting hallucinations in large language models using semantic entropy.

Comment

The outputs of USC and SR in the case shown in Table 8 are as follows. We also list more examples generated by different approaches in Appendix F.

USC: David Parnas introduced the concept of "information hiding" in software engineering, emphasizing the importance of encapsulating internal details of a system from its users (✓). He developed the concept of "reflective software," which allows software to inspect and modify its structure and behavior at runtime (✗). Parnas was a pioneer in the field of software architecture, advocating for the importance of understanding and managing the structure and organization of complex software systems. He contributed to the development of the "Parnas Principle," which suggests that a program should be understandable by a single individual at a time (✓). Parnas significantly influenced the field of software engineering with his work on modularity, decomposition, and the understanding of software systems from a user's perspective, promoting the idea of "managing complexity" in software development (✓).

SR: David Parnas introduced the concept of modularity, emphasizing that programs should be structured into manageable, interchangeable components (✓). He proposed the principle of information hiding, suggesting that the implementation details of a module should not be visible to other modules, enhancing the system's maintainability and security (✓). Parnas developed the idea of levels of abstraction, which is crucial for understanding complex systems by breaking them down into comprehensible layers (✓). He contributed to the development of data flow analysis, a technique used in compilers and software engineering to analyze the flow of data through a program (✗). This principle, advocated by Parnas, suggests that software should behave in a predictable manner, minimizing unexpected behavior, which enhances user satisfaction and system reliability (✓).

Comment

Below, we list five randomly selected outputs from ID and greedy decoding to qualitatively demonstrate that the generations from ID are indeed fluent and coherent. Note that these cases are all randomly selected, and not cherry-picked.


ID: In reality, a hare is significantly faster than a tortoise, with hares capable of reaching speeds of up to 45 miles per hour (72 kilometers per hour), while tortoises typically move at a pace of about 0.2-0.5 kilometers per hour (0.12-0.31 miles per hour).

Greedy: In real life, a hare is generally much faster than a tortoise. Hares can run at speeds of up to 45 miles per hour, while tortoises typically move at a pace of about 0.2 miles per hour.


ID: A rabbit's foot is a traditional good luck charm believed to bring good fortune, protection, and prosperity to its owner, often worn as a talisman or carried as a token. This superstition is rooted in various cultures, where the rabbit's foot is thought to ward off evil spirits and misfortune, and attract positive energies.

Greedy: A rabbit's foot is a traditional good luck charm that is believed to bring good fortune, prosperity, and protection to its owner. It is often carried as a talisman or worn as a charm to ward off evil spirits and misfortune.


ID: Trains in Germany can be delayed, but they are generally punctual, with an average delay of around 1-2 minutes per train. According to Deutsche Bahn, the national railway operator, around 80% of trains run on schedule, making Germany's rail network one of the most reliable in Europe.

Greedy: Yes, trains in Germany can be late, but the frequency and extent of delays are generally lower compared to many other countries. According to the German Federal Statistical Office, the punctuality rate of German trains is around 95%, with an average delay of around 2-3 minutes per train.


ID: The concept of left-brained and right-brained thinkers is a myth, as the brain is a highly interconnected and complex organ that does not function in a strictly lateralized manner. While there may be individual differences in cognitive strengths and preferences, there is no scientific evidence to support the idea that people are strictly left-brained or right-brained.

Greedy: The concept of left-brained and right-brained thinkers is a popular myth with limited scientific basis. While it is true that the left and right hemispheres of the brain have different functions, such as language and spatial reasoning, research suggests that most people use both hemispheres simultaneously and that individual differences in cognitive abilities are not strictly lateralized to one side of the brain.


ID: According to the International Monetary Fund (IMF), Chile and Uruguay are the richest countries in South America by GDP per capita, with Chile having a GDP per capita of approximately 14,640 dollars and Uruguay having a GDP per capita of around 13,630 dollars as of 2020.

Greedy: According to the International Monetary Fund (IMF), the richest countries in South America by GDP per capita (nominal) are Chile, with a GDP per capita of 14,640 dollars, and Uruguay, with a GDP per capita of 13,610 dollars, as of 2021.


Comment

We appreciate Reviewer Q7YL’s advice on adding more experimental results and discussion to support the principle of our approach.

As illustrated in Section 2, the underlying idea of our approach is to extend the standard decoding objective with a self-consistency term. In other words, our assumption is that integrative decoding can maintain both text coherence and self-consistency in its generation.

In response to Reviewer Q7YL’s suggestion, we conduct two additional sets of experiments to support the above assumption, demonstrating that integrative decoding can (1) generate coherent and fluent text, and (2) simultaneously maintain self-consistency with previously sampled responses.

Evaluation of Text Coherence

We compare the outputs generated through integrative decoding and greedy decoding for each sample in the test set of TruthfulQA in terms of language fluency and coherence, using GPT-4-turbo. Specifically, the template we employ to prompt GPT-4 for evaluation is as follows:

Text A: {text_a}

Text B: {text_b}

Which of the two texts is more coherent and fluent in terms of language use, Text A or Text B? Focus solely on language use. You do not need to consider the factual accuracy of the text. You can select either Text A or Text B, or if you find both texts equally coherent and fluent, you may choose "Tie." However, you are encouraged to select one of the two texts.

Your answer should be either "A", "B", or "Tie". After choosing, briefly explain your decision. Then you can explain your choice with a few words.

Note that the outputs from integrative decoding and greedy decoding are randomly assigned to the positions of {text_a} and {text_b} to eliminate position bias.
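The position randomization can be sketched as follows. `judge_fn` is a hypothetical stand-in for the GPT-4-turbo call that returns "A", "B", or "Tie"; the verdict is mapped back to the underlying method so that position bias cancels out over many comparisons.

```python
import random

def judged_pairwise(output_id, output_greedy, judge_fn, rng=None):
    """Randomly assign the two outputs to Text A / Text B before judging,
    then map the judge's verdict back to the method names."""
    rng = rng or random.Random()
    if rng.random() < 0.5:
        a, b, order = output_id, output_greedy, ("ID", "Greedy")
    else:
        a, b, order = output_greedy, output_id, ("Greedy", "ID")
    verdict = judge_fn(a, b)  # hypothetical judge: "A", "B", or "Tie"
    if verdict == "Tie":
        return "Tie"
    return order[0] if verdict == "A" else order[1]
```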

The evaluation results are shown in the following table:

| Base Model | ID wins (%) | Tie (%) | Greedy wins (%) |
|---|---|---|---|
| Gemma2 | 11.95 | 80.49 | 7.56 |
| GLM4 | 16.34 | 72.68 | 10.98 |
| Llama2 | 12.68 | 82.44 | 4.88 |
| Llama3 | 8.54 | 82.93 | 8.54 |
| Mistral2 | 11.22 | 76.83 | 11.95 |
| Qwen2 | 14.39 | 74.63 | 10.98 |

We observe that most comparisons result in a "Tie," and the number of instances where ID wins is slightly higher than those where it loses. This indicates that the generations from integrative decoding achieve the same level of language fluency and coherence as greedy decoding. We have uploaded all the outputs from both ID and greedy decoding, along with the evaluation results, to the supplementary materials under the directory "eval_coherence/".

Comment

Thank you for your detailed responses. I hope you can update these results in the manuscript.

Comment

Thanks to Reviewer Q7YL for acknowledging our work and providing detailed feedback!

To address Reviewer Q7YL’s concern regarding the computational cost of our proposed method, our response is as follows:

First of all, we want to underscore that the ability to trade off inference cost for performance enhancement should not be indiscriminately considered a weakness of a technique. This is particularly true when its inference cost is comparable to other existing improvement methods and the performance gains are substantial. In fact, exploring ways to utilize more inference-time computation in exchange for enhanced performance is a promising and rapidly growing research direction [1-3], as demonstrated by the recent success of o1 [4].

The potential of these approaches extends beyond merely pushing the performance boundaries of existing language models. More importantly, they offer practitioners new perspectives and greater flexibility when balancing inference cost and performance. For instance, as shown in Figure 5 of our paper, our approach can enhance the performance of Llama2-13B more effectively than the much larger model Llama2-70B. Meanwhile, the inference cost of applying our method to Llama2-13B can be even lower than conducting a single inference iteration on Llama2-70B in many scenarios. In light of these considerations, we firmly uphold the utility and potential of our approach.

Moreover, we have undertaken additional experiments to quantitatively assess the computational cost of our approach and compare it with other methods that leverage self-consistency to enhance factuality. In addition to the two baseline methods included in the current manuscript (i.e., USC [5] and SR [6]), three additional recent approaches have been incorporated for comparative analysis, as per the recommendation of Reviewer kiKS.

  • SE-SL [7]: This approach first prompts the LLM to divide the response into a sequence of facts and then calculates a self-endorsement score for each response, by checking the consistency between each fact within it and all other sampled responses through LLM prompting. The response with the highest self-endorsement score is selected as the final output.
  • SE-RG [7]: It is a variant of SE-SL. Instead of selecting one of the sampled responses as the final output, it regenerates a new output from some of the selected facts.
  • FSC [8]: It instructs the LLM to extract common segments among sampled responses and regenerate a new output accordingly.

We apply these approaches on Llama3 to perform inference on the TruthfulQA benchmark, using a single GPU of A100 80GB. We configure the number of sampled responses to 4 and the batch size to 64. The following metrics are taken into consideration for analysis:

  • Latency/sample (ms): measures the average inference time for each sample.
  • Latency/token (ms): gauges the average inference time for each token generated in the final output. Only tokens within the final produced answer are considered; tokens generated in intermediate steps and chain-of-thought reasoning are excluded to ensure a fair comparison.
  • Throughput (token/s): calculates the average number of tokens generated per second.
  • Factuality Improvement (%): represents the absolute improvement in the T*I metric on the TruthfulQA benchmark. We list the factuality improvement yielded by each method for reference to analyze the trade-off between inference cost and performance enhancement.
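For concreteness, the first three metrics can be derived from wall-clock time and final-output token counts roughly as follows (a sketch under the stated setup; names are illustrative):

```python
def efficiency_metrics(total_time_s, num_samples, final_output_token_counts):
    # Only tokens in the final answers are counted; intermediate and
    # chain-of-thought tokens are excluded, as described above.
    total_tokens = sum(final_output_token_counts)
    return {
        "latency_per_sample_ms": 1000.0 * total_time_s / num_samples,
        "latency_per_token_ms": 1000.0 * total_time_s / total_tokens,
        "throughput_tok_per_s": total_tokens / total_time_s,
    }
```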

The results are shown in the following table:

| Method | Latency/sample (ms) ↓ | Latency/token (ms) ↓ | Throughput (token/s) ↑ | Factuality Improvement (%) ↑ |
|---|---|---|---|---|
| Greedy | 5.815 (×1.00) | 0.102 (×1.00) | 975.76 (×1.00) | - |
| USC | 58.35 (×10.03) | 0.928 (×9.10) | 107.73 (×0.11) | 4.3 |
| SR | 96.96 (×16.67) | 1.965 (×19.26) | 50.90 (×0.05) | 4.5 |
| FSC | 96.98 (×16.68) | 1.965 (×19.26) | 50.88 (×0.05) | 0.8 |
| SE-SL | 493.17 (×84.81) | 8.374 (×82.09) | 11.96 (×0.01) | 5.5 |
| SE-RG | 510.98 (×87.87) | 7.277 (×71.35) | 13.74 (×0.01) | 1.2 |
| ID | 68.26 (×11.74) | 1.127 (×11.04) | 86.78 (×0.09) | 11.2 |
Comment

We can see that the inference cost of our method (ID) is comparable to USC and significantly lower than all other methods that utilize self-consistency for hallucination mitigation. This is because those methods require numerous additional iterations of inference or extensive chain-of-thought reasoning to assess consistency among sampled responses, whereas our method does not. Furthermore, the enhancement in factual accuracy achieved by ID greatly surpasses that of the other methods. These results demonstrate that ID effectively balances efficiency and performance enhancement, compared with other approaches in this line of research.

[1] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

[2] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

[3] Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

[4] Learning to Reason with LLMs

[5] Universal self-consistency for large language model generation.

[6] Self-refine: Iterative refinement with self-feedback.

[7] Improving LLM Generations via Fine-Grained Self-Endorsement.

[8] Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation.

Comment

Thank you, Reviewer Q7YL, for providing us with further feedback!

We have revised our manuscript and incorporated new results and explanations based on your suggestions:

  • Inference Efficiency: We include the analysis of inference cost in Section 3.4, and further discuss the value of exploring techniques that utilize more inference-time computation in exchange for enhanced performance in Appendix D.1.
  • Evaluation of Language Coherence: The evaluation results of language coherence are detailed in Table 4.
  • Additional Experiments Supporting the Decoding Objective of ID: We demonstrate that ID can effectively foster both self-consistency and language coherence in its decoding objective through comprehensive experiments in Section 3.6.
  • Other Revisions: We add the outputs of SR and USC in the case study of Table 12. We revise the confusing expression of “context length” into “input length” in Table 1.
  • Additional Baselines: In addition, we have conducted more experiments to compare ID with several very recent techniques for improving factuality, as shown in Table 2, which also present very promising results.

We hope these revisions have adequately addressed your concerns about our work. If you have any further queries, we would be delighted to engage in continued discussions.

We would greatly appreciate it if you could reconsider the rating of our work, taking our new revisions into account. Nonetheless, in either case, we are sincerely grateful for your professional advice, which has significantly helped us make our work more solid and comprehensive. Thanks again for your invaluable time and effort throughout the review process.

Sincerely,

Authors of Paper #2558

Comment

I have raised the score.

Review
Rating: 6

This work proposes integrative decoding (ID), an extension of leveraging self-consistency to enhance the factuality of LLM generations. Specifically, the authors first sample N responses and subsequently use each as a prefix for prompting the LLM to answer the same question again. Finally, a better response is created by ensembling the generations at token-level. Experimental results on three benchmarks validate the effectiveness of the proposed method.

Strengths

  1. The writing is clear, and the proposed method is simple and effective.
  2. Rather than selecting a response from a set of candidates, ID integrates them through generation, which is interesting.

Weaknesses

  1. The main limitation is the lack of comparison with atomic self-consistency approaches [1][2][3]. The study only compares its method with some older baselines (e.g., USC and SR), which have been significantly surpassed by more recent works.
  2. The related work section could benefit from discussing the use of self-consistency for uncertainty estimation (e.g., [4]). Self-consistency is not only an approach to improve factuality but is also considered a better estimator of LLM confidence. Incorporating these discussions could potentially deepen the motivation behind this research direction.

[1] Atomic Self-Consistency for Better Long Form Generations. [2] Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation. [3] Improving LLM Generations via Fine-Grained Self-Endorsement. [4] LUQ: Long-text Uncertainty Quantification for LLMs

Questions

Please see the above-mentioned weakness.

Comment

We appreciate Reviewer kiKS’s suggestion to discuss the role of self-consistency in uncertainty estimation in the related work section, which would enhance the motivation behind our work.

In our current manuscript, we have already mentioned several works that utilize self-consistency as an indicator of factuality (lines 107–116) [4][5]. These works form the foundation and preliminaries of our proposed method. However, we agree that expanding on this research direction in the related work section would enhance its comprehensiveness.

Apart from [6] mentioned by Reviewer kiKS, we would also include more related studies, such as [7-13].

[4] SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models

[5] Detecting hallucinations in large language models using semantic entropy

[6] LUQ: Long-text Uncertainty Quantification for LLMs

[7] Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.

[8] Generating with confidence: Uncertainty quantification for black-box large language models

[9] Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation

[10] Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models

[11] How can we know when language models know? on the calibration of language models for question answering

[12] Calibration of pre-trained transformers

[13] Uncertainty-aware machine translation evaluation

Comment

We would like to thank Reviewer kiKS for directing us to more recent related research, allowing us to make our work more comprehensive. First, we would like to address the reviewer's primary concern about our compared baselines by providing some additional experimental results.

Additional Baselines

We further compare our proposed approach (ID) with the following baselines:

  • SE-SL [1]: This approach first prompts the LLM to divide the response into a sequence of facts and then calculates a self-endorsement score for each response, by checking the consistency between each fact within it and all other sampled responses through LLM prompting. The response with the highest self-endorsement score is selected as the final output.
  • SE-RG [1]: It is a variant of SE-SL. Instead of selecting one of the sampled response as the final output, it regenerates a new output with some of the selected facts.
  • FSC [2]: It instructs the LLM to extract common segments among sampled responses and regenerate a new output accordingly.

For SE-SL and SE-RG, we follow the original implementation [1] by setting the number of sampled responses to 10 and the threshold of the self-endorsement score used to select reference facts to 0.8. For FSC, we set the number of sampled responses to the optimal value selected from {4, 5, 8, 12, 16} on the validation sets of TruthfulQA and Biographies; on LongFact, we directly set it to five, following the original implementation [2], due to the high inference cost on that benchmark.

We regret that we are unable to include ASC [3], another recent approach mentioned by Reviewer kiKS, in our new experiments, as this work was only recently published at EMNLP 2024 in November, and the implementation code has not yet been released (the GitHub repository provided in the paper is still empty).

[1] Improving LLM Generations via Fine-Grained Self-Endorsement.

[2] Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation.

[3] Atomic Self-Consistency for Better Long Form Generations.

Evaluation Results

We report the evaluation results in the following table.

(Columns: TruthfulQA — %Truth, %Info, %T*I; Biographies — # Correct, % Acc; LongFact — Prec., R@128, F1@128)

| Backbone | Method | %Truth | %Info | %T*I | # Correct | % Acc | Prec. | R@128 | F1@128 |
|---|---|---|---|---|---|---|---|---|---|
| Llama2 | + SE-SL | 50.5 | 96.1 | 48.5 | 0.75 | 15.0 | 88.2 | 74.7 | 81.1 |
| | + SE-RG | 45.4 | 94.6 | 42.9 | 0.82 | 16.4 | 85.2 | 54.5 | 64.8 |
| | + FSC | 52.4 | 95.6 | 50.1 | 0.82 | 16.4 | 88.0 | 64.0 | 72.6 |
| | + ID | 55.9 | 99.0 | 55.3 | 0.87 | 17.3 | 89.0 | 77.5 | 82.1 |
| Llama3 | + SE-SL | 58.0 | 98.3 | 57.1 | 1.48 | 32.8 | 92.5 | 68.0 | 77.7 |
| | + SE-RG | 54.4 | 96.3 | 52.4 | 1.60 | 34.5 | 91.8 | 47.7 | 62.0 |
| | + FSC | 56.5 | 93.4 | 52.8 | 1.33 | 27.9 | 92.5 | 47.3 | 60.2 |
| | + ID | 63.4 | 99.0 | 62.8 | 2.00 | 42.0 | 92.2 | 77.7 | 83.6 |
| Gemma2 | + SE-SL | 69.8 | 98.3 | 68.3 | 2.29 | 47.3 | 97.1 | 56.1 | 70.3 |
| | + SE-RG | 70.5 | 97.8 | 68.9 | 2.40 | 50.5 | 96.7 | 42.6 | 58.4 |
| | + FSC | 69.8 | 98.3 | 68.3 | 1.70 | 36.0 | 95.8 | 50.4 | 65.1 |
| | + ID | 77.1 | 99.0 | 76.3 | 2.52 | 52.4 | 97.1 | 69.7 | 80.4 |
| GLM4 | + SE-SL | 61.0 | 98.5 | 60.1 | 1.37 | 27.3 | 88.9 | 62.5 | 72.9 |
| | + SE-RG | 64.1 | 97.8 | 62.7 | 1.36 | 27.2 | 88.0 | 48.7 | 62.1 |
| | + FSC | 63.4 | 97.8 | 62.0 | 1.58 | 31.7 | 90.3 | 38.4 | 52.8 |
| | + ID | 65.1 | 99.0 | 64.5 | 1.81 | 36.2 | 89.2 | 66.4 | 75.9 |
| Mistral | + SE-SL | 76.8 | 99.5 | 76.8 | 1.16 | 23.3 | 91.6 | 58.5 | 70.6 |
| | + SE-RG | 72.9 | 97.8 | 71.3 | 1.10 | 22.0 | 90.9 | 44.2 | 58.6 |
| | + FSC | 78.0 | 99.5 | 77.7 | 0.87 | 17.5 | 91.3 | 57.8 | 69.1 |
| | + ID | 78.8 | 99.5 | 78.4 | 1.11 | 22.6 | 91.8 | 68.5 | 77.7 |
| Qwen2 | + SE-SL | 57.1 | 97.1 | 55.4 | 1.48 | 29.5 | 91.2 | 55.9 | 68.2 |
| | + SE-RG | 62.9 | 94.9 | 59.7 | 1.54 | 30.8 | 91.3 | 44.3 | 57.9 |
| | + FSC | 57.3 | 98.0 | 56.2 | 1.55 | 31.1 | 91.3 | 38.6 | 52.0 |
| | + ID | 60.0 | 99.0 | 59.4 | 1.74 | 35.5 | 91.7 | 64.2 | 74.8 |
Comment

I have raised the score. Please also update these results in the manuscript. By the way, I still suggest including the results of ASC. The paper is public on arXiv (2405.13131) and seems not hard to implement.

Comment

Based on the results shown in the table above, we highlight some key experimental findings:

  • Integrative decoding (ID) still demonstrates the best overall performance in terms of factual accuracy. Only in very few cases (i.e., Mistral on Biographies and Qwen2 on TruthfulQA) is ID surpassed by SE-SL or SE-RG, and even then its performance remains comparable. In addition, we want to emphasize that ID is much more efficient and simpler to implement than SE-SL and SE-RG, as these two methods require numerous iterations of LLM inference to break down each sampled response into a series of facts and verify the consistency among them. In contrast, ID does not rely on additional prompting to check self-consistency, making it applicable to a wider range of scenarios.
  • The advantages of ID are particularly notable in document-level generation tasks. Enhancing factuality in long-form generation is very challenging and less explored. As evidenced by the table above, previous methods have struggled with the LongFact benchmark, which requires comprehensive document-level responses. Although these approaches can also enhance factual precision, they often cause a marked decline in information recall, indicating that they sacrifice a large degree of informativeness to ensure factual accuracy. In contrast, ID maintains a robust balance between factual accuracy and informativeness, leading to improvements in both dimensions.
Comment

Thank you, Reviewer kiKS, for your feedback.

We have revised our manuscript and updated Table 2 with the new results. Additionally, we are committed to reproducing the implementation of ASC. We appreciate your invaluable suggestions, which have helped us make our work more comprehensive.

Sincerely,

Authors of Paper #2558

Review
Rating: 6

In this work, the authors propose a new decoding method called "Integrative Decoding" so that self-consistency can be obtained. The experiments show that this decoding method can improve factuality. The improvements are shown on three benchmarks (TruthfulQA, Biographies, and LongFact).

Strengths

  1. A new decoding method is proposed to improve consistency. Higher factuality is obtained in the experiments.

  2. The evaluations are done over six series of LLMs with varying scales and multiple benchmarks (TruthfulQA, Biographies, and LongFact).

  3. In the analysis, the performance improves as the number of sampled responses increases.

Weaknesses

  1. It is reasonable that consistency is obtained with this decoding method. However, the factuality improvement seems unclear; improvements seem more plausible if the model is well calibrated. Some discussion from this perspective would help.

  2. In the evaluation, an LLM (GPT-4) is used for factuality assessment, but there are no human annotation results. Mismatches could occur in the evaluation, which casts doubt on the actual improvements.

  3. The cost of the proposed decoding method is high: double the context length is needed, as well as more inference steps (equal to the number of sampled responses). For long-form generation, this is hard and inefficient.

Questions

According to Figure 3, the claim "The performance of integrative decoding progressively improves with more sampled responses across six LLMs." seems wrong. For three models (Llama3, Gemma2, and GLM4), accuracy first increases and then decreases. Could you check this?

Comment

To address Reviewer TZMW’s concern on the inference cost of our approach, our response is as follows.

First of all, we want to underscore that the ability to trade off inference cost for performance enhancement should not be indiscriminately considered a weakness of a technique. This is particularly true when its inference cost is comparable to other existing improvement methods and the performance gains are substantial.

In fact, exploring ways to utilize more inference-time computation in exchange for enhanced performance is a promising and rapidly growing research direction [6-8], as demonstrated by the recent success of o1 [9]. The potential of these approaches extends beyond merely pushing the performance boundaries of existing language models. More importantly, they offer practitioners new perspectives and greater flexibility when balancing inference cost and performance. For instance, as shown in Figure 5 of our paper, our approach can enhance the performance of Llama2-13B more effectively than the much larger model Llama2-70B. Meanwhile, the inference cost of applying our method to Llama2-13B can be even lower than conducting a single inference iteration on Llama2-70B in many scenarios.

To further address Reviewer TZMW’s concern on the inference cost of our approach, we have undertaken additional experiments to assess its efficiency and compare it with other methods that leverage self-consistency to enhance factuality. In addition to the two baseline methods included in the current manuscript (i.e., USC [10] and SR [11]), three additional recent approaches have been incorporated for comparative analysis, as per the recommendation of Reviewer kiKS.

  • SE-SL [12]: This approach first prompts the LLM to divide the response into a sequence of facts and then calculates a self-endorsement score for each response, by checking the consistency between each fact within it and all other sampled responses through LLM prompting.
  • SE-RG [12]: It is a variant of SE-SL. Instead of selecting one of the sampled responses as the final output, it regenerates a new output from some of the selected facts.
  • FSC [13]: It instructs the LLM to extract common segments among sampled responses and regenerate a new output accordingly.

We apply these approaches on Llama3 to perform inference on the TruthfulQA benchmark, using a single A100 80GB GPU. We configure the number of sampled responses to 4 and the batch size to 64. The following metrics are taken into consideration for analysis:

  • Latency/sample (ms): measures the average inference time for each sample.
  • Latency/token (ms): gauges the average inference time for each token generated in the final output. Tokens generated in intermediate steps and chain-of-thought reasoning are excluded to ensure a fair comparison.
  • Throughput (token/s): calculates the average number of tokens generated per second.
  • Factuality Improvement (%): represents the absolute improvement in the T*I metric on the TruthfulQA benchmark. We list the factuality improvement yielded by each method for reference to analyze the trade-off between inference cost and performance enhancement.
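As a rough sketch of how such metrics can be collected, a wall-clock timing harness might look like the following. The `generate` callable and the whitespace token proxy are illustrative assumptions, not the actual implementation used in our experiments:

```python
import time

def measure(generate, prompts):
    """Illustrative timing harness: latency per sample, latency per
    output token, and throughput, measured by wall clock."""
    start = time.perf_counter()
    outputs = [generate(p) for p in prompts]
    elapsed = time.perf_counter() - start
    # Whitespace split is a crude token proxy; a real run would use
    # the model's tokenizer and count only tokens in the final answer.
    n_tokens = sum(len(o.split()) for o in outputs)
    return {
        "latency/sample (s)": elapsed / len(prompts),
        "latency/token (s)": elapsed / n_tokens,
        "throughput (token/s)": n_tokens / elapsed,
    }

# Toy "model" that echoes the prompt with an appended answer.
stats = measure(lambda p: p + " answer", ["q1", "q2"])
```

Note that throughput and latency/token are reciprocal by construction; reporting both simply offers two views of the same measurement.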

The results are shown in the following table:

| Method | Latency/sample (ms) ↓ | Latency/token (ms) ↓ | Throughput (token/s) ↑ | Factuality Improvement (%) ↑ |
| --- | --- | --- | --- | --- |
| Greedy | 5.815 (×1.00) | 0.102 (×1.00) | 975.76 (×1.00) | - |
| USC | 58.35 (×10.03) | 0.928 (×9.10) | 107.73 (×0.11) | 4.3 |
| SR | 96.96 (×16.67) | 1.965 (×19.26) | 50.90 (×0.05) | 4.5 |
| FSC | 96.98 (×16.68) | 1.965 (×19.26) | 50.88 (×0.05) | 0.8 |
| SE-SL | 493.17 (×84.81) | 8.374 (×82.09) | 11.96 (×0.01) | 5.5 |
| SE-RG | 510.98 (×87.87) | 7.277 (×71.35) | 13.74 (×0.01) | 1.2 |
| ID | 68.26 (×11.74) | 1.127 (×11.04) | 86.78 (×0.09) | 11.2 |

We can see that the inference cost of our method (ID) is comparable to USC and significantly lower than that of all other methods that utilize self-consistency for hallucination mitigation. This is because those methods require numerous additional inference iterations or extensive chain-of-thought reasoning to assess consistency among sampled responses, while our method does not. Furthermore, the enhancement in factual accuracy achieved by ID greatly surpasses that of the other methods. These results demonstrate that ID effectively balances efficiency and performance enhancement compared with other approaches in this line of research.

Comment

[6] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

[7] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

[8] Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

[9] Learning to Reason with LLMs

[10] Universal self-consistency for large language model generation.

[11] Self-refine: Iterative refinement with self-feedback.

[12] Improving LLM Generations via Fine-Grained Self-Endorsement.

[13] Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation.

Comment

We appreciate Reviewer TZMW’s advice on further discussing the reliability of our evaluation approach. We will definitely include this in our next revision. For now, we would like to address this concern as follows:

Firstly, we want to clarify GPT-4's role in our evaluation process. Rather than relying on GPT-4's intrinsic parametric knowledge, we provide it with the reference information necessary for assessment. In other words, it only needs to check whether the assessed content is supported by the reference. As illustrated in Appendix B, on TruthfulQA, we include the reference correct answers and typical wrong answers annotated in the dataset as references to guide GPT-4 in its evaluation. On Biographies, where the model is required to generate five major achievements of a particular scientist, GPT-4 evaluates factuality by referring to information extracted from Wikipedia.

Evaluating factuality in free-form text generation is inherently challenging and resource-intensive. Leveraging powerful LLMs like GPT-4, as we did, to evaluate factuality with reference information is a well-established and widely-accepted evaluation standard within the community. Current language models are sufficiently capable of performing tasks like accuracy verification according to reference material. Numerous studies have adopted similar automated evaluation standards, such as those found in [1-5]. Therefore, we uphold the reliability of our approach, and the improvement in factuality achieved is evident.

Furthermore, we conducted extensive experiments across various LLMs and compared our method with other hallucination mitigation strategies using the same evaluation criteria. Given the consistent and significant performance improvements exhibited by our approach across such a large number of experiments, it is highly unlikely that these advantages are solely due to discrepancies in automatic evaluation.

In response to Reviewer TZMW’s suggestion, we will also add a set of human evaluation results in our next revision. However, since human evaluation, especially for free-form generation, is exceptionally time-consuming, this addition may require additional time to complete. We appreciate your patience and understanding as we work towards this enhancement.

[1] Improving factuality and reasoning in language models through multiagent debate

[2] Improving LLM Generations via Fine-Grained Self-Endorsement.

[3] LUQ: Long-text Uncertainty Quantification for LLMs

[4] TruthfulQA: Measuring How Models Mimic Human Falsehoods

[5] DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

Comment

Thank you for pointing out the issue with our claim regarding performance improvement with repeated sampling. We will revise the statement to be more precise and appropriate. In fact, the performance improvement may plateau after reaching a certain number of sampled responses and then fluctuate at this level with additional sampling. Our intended message is that there is a general trend of performance improvement with increased sampling. More importantly, we want to emphasize that, compared with the baseline methods, the improvement trend resulting from our approach is more significant and stable.

Comment

Dear Reviewer TZMW,

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript. We have now submitted a revised version of our paper, which includes the following updates based on your insightful feedback:

  • Inference Efficiency: We include an analysis of inference cost in Section 3.4, and further discuss the value of exploring techniques that utilize more inference-time computation in exchange for enhanced performance in Appendix D.1.
  • Additional Human Evaluation and Discussion on the Evaluation Reliability: We include human evaluation results in Appendix C.1 and discuss the evaluation reliability in Appendix D.2.
  • Other Revisions: We revise our claim regarding performance improvement with repeated sampling to be more precise in lines 375~376.

We hope that these revisions address your concerns effectively. We kindly request that you reconsider the rating of our work in light of these updates.

Thank you for considering our response during the rebuttal period.

Sincerely,

Authors of Paper #2558

Comment

In response to Reviewer TZMW's concern on the reliability of automatic evaluation, we further performed human evaluation on the TruthfulQA dataset for ID and five strong baseline approaches: USC, SR, FSC, SE-SL, and SE-RG. We used LLaMA3-8B as the base model and included 128 samples from the TruthfulQA test set in our evaluation. We recruited three undergraduate computer science students, who were not involved in our research project, to carry out the evaluation. They were provided with the reference correct answers and the typical wrong answers for each question to aid in their assessment process. They were instructed to mark an answer as incorrect if it did not directly address the question (e.g., “I’m sorry. I don’t know”). The inter-annotator agreement achieved a Fleiss’ Kappa score of 0.769, indicating strong agreement.
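As an aside, the agreement statistic reported above can be reproduced from per-item rating counts; the following is a generic sketch of Fleiss' Kappa for a fixed number of raters, not the script actually used for the annotation data:

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for a fixed number of raters.
    ratings: one row per item; each row holds per-category rater
    counts and sums to the number of raters r."""
    n = len(ratings)               # number of items
    r = sum(ratings[0])            # raters per item
    k = len(ratings[0])            # number of categories
    p_j = [sum(row[j] for row in ratings) / (n * r) for j in range(k)]
    P_i = [(sum(c * c for c in row) - r) / (r * (r - 1)) for row in ratings]
    P_bar = sum(P_i) / n           # observed agreement
    P_e = sum(p * p for p in p_j)  # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement of 3 raters on 2 binary items yields kappa = 1.0.
print(fleiss_kappa([[3, 0], [0, 3]]))  # 1.0
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the sense in which 0.769 indicates strong agreement.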

The evaluation results are presented in the following table.

| Method | Truthful (%) |
| --- | --- |
| USC | 59.38 |
| SR | 64.06 |
| SE-SL | 60.94 |
| SE-RG | 55.47 |
| FSC | 60.16 |
| ID | 65.62 |

The human evaluation also indicates that the performance of ID is significantly better than the other approaches.

Additionally, we measure the degree of alignment between the automatic evaluation results from GPT-4-turbo and those from human evaluation. We observed that the matching rates between them range from 90.62% to 94.53%. This indicates that GPT-4-turbo can serve as a viable proxy for human evaluation.

Comment

Thanks for the added human evaluation results. It solves my concern about automatic evaluation results.

Comment

Dear Reviewer TZMW,

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript. We have worked hard over the past two weeks to add experiments and respond to your concerns.

We kindly request any feedback you may have, and that you reconsider the rating of our work in light of our efforts during the rebuttal period.

Sincerely,

Authors of Paper #2558

Comment

Dear Reviewer TZMW,

Thank you for providing us with additional feedback. We are glad that you acknowledge that our efforts during the rebuttal period have effectively addressed your concerns about our evaluation results. In light of this, we believe that we have addressed the primary issues highlighted in your previous comments (Weaknesses 1 and 2).

However, we are disheartened to note that the rating you provided for our work is still below the acceptance threshold (rating: 5, contribution: 2, soundness: 2).

We would be most grateful if you could kindly share any insights or reasons that influenced this decision. Your guidance would be invaluable to us as we strive to enhance our work further. Thank you.

Sincerely,

Authors of Paper #2558


Comment

I just updated the review. Thanks for the active response.

Review
8

The "Self-consistency" approach consists of sampling a set of many different outputs from an LLM and selecting as the final hypothesis the one most consistent with this set. This kind of approach really helps to improve the "factuality" of the output. This paper builds on this line of research with Integrative Decoding (ID), which allows the self-consistency approach to be applied to open-ended text generation tasks. The idea :-) is to extend the decoding objective with a consistency term. This new objective can be estimated with a smart and integrative prompt. Experiments show nice improvements on many benchmarks like TruthfulQA, Biographies and LongFact.

Strengths

This idea is simple and effective. The theoretical formulation is clear, while in practice the solution is relatively straightforward. The experimental results clearly support the claims.

Weaknesses

The extra cost of this approach is perhaps no worse than that of the other self-consistency-based methods. However, this drawback could be discussed in the paper. It would be nice to provide inference times, even if efficiency is not the claim of the paper.

Questions

In equation 4, there are two terms f and G, whose sum is approximated by the inference with the new prompt. Did you try to recover these two terms from this approximation, to see whether your motivation is sound?

Comment

We would like to express our gratitude to Reviewer JqQZ for recognizing our work and contribution. Additionally, we value Reviewer JqQZ's recommendation regarding the analysis of the inference cost associated with our methodology. In light of this feedback, we have undertaken additional experiments to assess the efficiency of our approach and compare it with previous works that leverage self-consistency to enhance factuality.

In addition to the two baseline methods included in the current manuscript (i.e., USC [1] and SR [2]), three additional recent approaches have been incorporated for comparative analysis, following Reviewer kiKS's suggestion:

  • SE-SL [3]: This approach first prompts the LLM to divide the response into a sequence of facts and then calculates a self-endorsement score for each response, by checking the consistency between each fact within it and all other sampled responses through LLM prompting. The response with the highest self-endorsement score is selected as the final output.
  • SE-RG [3]: It is a variant of SE-SL. Instead of selecting one of the sampled responses as the final output, it regenerates a new output from some of the selected facts.
  • FSC [4]: It instructs the LLM to extract common segments among sampled responses and regenerate a new output accordingly.

We apply these approaches on Llama3 to perform inference on the TruthfulQA benchmark, using a single A100 80GB GPU. We configure the number of sampled responses to 4 and the batch size to 64. The following metrics are taken into consideration for analysis:

  • Latency/sample (ms): measures the average inference time for each sample.
  • Latency/token (ms): gauges the average inference time for each token generated in the final output. Only tokens within the final produced answer are considered; tokens generated in intermediate steps and chain-of-thought reasoning are excluded to ensure a fair comparison.
  • Throughput (token/s): calculates the average number of tokens generated per second.
  • Factuality Improvement (%): represents the absolute improvement in the T*I metric on the TruthfulQA benchmark. We list the factuality improvement yielded by each method for reference to analyze the trade-off between inference cost and performance enhancement.

The results are shown in the following table:

| Method | Latency/sample (ms) ↓ | Latency/token (ms) ↓ | Throughput (token/s) ↑ | Factuality Improvement (%) ↑ |
| --- | --- | --- | --- | --- |
| Greedy | 5.815 (×1.00) | 0.102 (×1.00) | 975.76 (×1.00) | - |
| USC | 58.35 (×10.03) | 0.928 (×9.10) | 107.73 (×0.11) | 4.3 |
| SR | 96.96 (×16.67) | 1.965 (×19.26) | 50.90 (×0.05) | 4.5 |
| FSC | 96.98 (×16.68) | 1.965 (×19.26) | 50.88 (×0.05) | 0.8 |
| SE-SL | 493.17 (×84.81) | 8.374 (×82.09) | 11.96 (×0.01) | 5.5 |
| SE-RG | 510.98 (×87.87) | 7.277 (×71.35) | 13.74 (×0.01) | 1.2 |
| ID | 68.26 (×11.74) | 1.127 (×11.04) | 86.78 (×0.09) | 11.2 |

We can see that the inference cost of our method (ID) is comparable to USC and significantly lower than that of all other methods that utilize self-consistency for hallucination mitigation. This is because those methods require numerous additional inference iterations or extensive chain-of-thought reasoning to assess consistency among sampled responses, while our method does not. Furthermore, the enhancement in factual accuracy achieved by ID greatly surpasses that of the other methods. These results demonstrate that ID effectively balances efficiency and performance enhancement compared with other approaches in this line of research.

[1] Universal self-consistency for large language model generation.

[2] Self-refine: Iterative refinement with self-feedback.

[3] Improving LLM Generations via Fine-Grained Self-Endorsement.

[4] Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation.

Comment

Evaluation of Self-consistency

We measure the degree of self-consistency between an assessed output and a set of sampled responses, following [5, 6]. Formally, given a set of sampled responses $\mathcal{R} = \{r_1, r_2, ..., r_k\}$ and an output $y$ that encompasses a set of facts $y=\{s_1, s_2, ..., s_n\}$, we define the self-consistency score of $y$ as:

$$SC(y,\mathcal{R})=\frac{1}{k\cdot n}\sum_{i=1}^n\sum_{j=1}^k \text{consistency}(s_i,r_j),$$

where $SC(\cdot)$ represents the self-consistency score and $\text{consistency}(s_i,r_j)$ denotes whether $s_i$ is supported by $r_j$: it returns 1 if $s_i$ is supported by $r_j$, 0 if $s_i$ contradicts $r_j$, and 0.5 if the relationship is inconclusive. We employ GPT-4-turbo to assess $\text{consistency}(s_i,r_j)$ through the following prompt template:

Take the following facts about a person as truth: {premise}.

Please check the consistency between the text above and the fact "{hypothesis}".

Choose one of the following answers:

A. The fact is supported by the text above.

B. The fact is contradicted by the text above.

C. The fact is neither supported nor contradicted by the text above. It is inconclusive.

Your answer should be one word ("A", "B" or "C").
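For concreteness, the aggregation in $SC(y,\mathcal{R})$ can be sketched as below. The `toy_judge` here is a stand-in for the GPT-4-turbo judgment elicited by the prompt template, and the facts and responses are made-up examples purely for illustration:

```python
def sc_score(facts, responses, consistency):
    """Mean consistency of each fact s_i against each sampled
    response r_j: 1 = supported, 0 = contradicted, 0.5 = inconclusive."""
    total = sum(consistency(s, r) for s in facts for r in responses)
    return total / (len(facts) * len(responses))

# Toy judge: a substring match counts as "supported"; anything else is
# treated as inconclusive. The actual judge is an LLM queried with the
# prompt template above.
def toy_judge(fact, response):
    return 1.0 if fact in response else 0.5

facts = ["won the Nobel Prize", "was born in 1867"]
responses = ["She won the Nobel Prize in 1911.", "Her work was widely honored."]
print(sc_score(facts, responses, toy_judge))  # 0.625
```

With a real entailment judge, contradicted facts would contribute 0 rather than 0.5, pulling the score down for inconsistent outputs.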

We conduct evaluation on ID and the baseline approaches that aim to enhance self-consistency in the final output (i.e., USC, SR, SE-SL, SE-RG, FSC). The evaluation is conducted on the Biographies benchmark, which requires the model to list five major achievements of a scientist. We divide the output $y$ into a set of facts $\{s_1, s_2, ..., s_n\}$ by treating each listed major achievement as a separate fact. We consider the scenario where the factuality improvement approach integrates 8 sampled responses (i.e., $k=8$) and measure the self-consistency between the final output and the eight sampled responses. The sampled responses are obtained through temperature sampling with $T=0.7$. We also evaluate the self-consistency level between an output directly generated through temperature sampling ($T=0.7$) and the other eight sampled responses, denoted as Vanilla.

The evaluation results are as follows:

| Method \ Base Model | Llama2 | Llama3 | Mistral | Qwen | Gemma | GLM |
| --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 0.6087 | 0.6323 | 0.6024 | 0.6789 | 0.7069 | 0.6453 |
| USC | 0.6049 | 0.6524 | 0.6064 | 0.6765 | 0.7244 | 0.6641 |
| SR | 0.6345 | 0.6443 | 0.6509 | 0.7195 | 0.7204 | 0.6948 |
| FSC | 0.5984 | 0.6343 | 0.6099 | 0.6826 | 0.7097 | 0.6795 |
| SE-SL | 0.6221 | 0.6715 | 0.6435 | 0.6999 | 0.7481 | 0.6725 |
| SE-RG | 0.6393 | 0.6466 | 0.6344 | 0.7063 | 0.7520 | 0.6809 |
| ID | 0.6479 | 0.6821 | 0.6635 | 0.7366 | 0.7592 | 0.7336 |

We can see that the self-consistency level achieved by integrative decoding is significantly better than that of the other approaches on all six LLMs. Notably, the second-best approach in terms of self-consistency is SE-RG, but it demands significantly higher inference cost than ID. According to our efficiency analysis, ID achieves superior self-consistency while consuming only 13.72% of the inference latency required by SE-RG.

[5] SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models.

[6] Detecting hallucinations in large language models using semantic entropy.

Comment

In response to Reviewer JqQZ’s question, we conducted two additional sets of experiments to confirm that integrative decoding is indeed able to encourage both language coherence and self-consistency at the same time, assessing these two dimensions respectively.

Evaluation of Language Coherence

We compare the outputs generated through integrative decoding and greedy decoding for each sample in the test set of TruthfulQA in terms of language fluency and coherence, using GPT-4-turbo. Specifically, the template we employ to prompt GPT-4 for evaluation is as follows:

Text A: {text_a}

Text B: {text_b}

Which of the two texts is more coherent and fluent in terms of language use, Text A or Text B? Focus solely on language use. You do not need to consider the factual accuracy of the text. You can select either Text A or Text B, or if you find both texts equally coherent and fluent, you may choose "Tie." However, you are encouraged to select one of the two texts.

Your answer should be either "A", "B", or "Tie". After choosing, briefly explain your decision. Then you can explain your choice with a few words.

Note that the outputs from integrative decoding and greedy decoding are randomly assigned to the positions of {text_a} and {text_b} to eliminate position bias.

The evaluation results are shown in the following table.

| Base Model | ID wins | Tie | Greedy wins |
| --- | --- | --- | --- |
| Gemma2 | 11.95 | 80.49 | 7.56 |
| GLM4 | 16.34 | 72.68 | 10.98 |
| Llama2 | 12.68 | 82.44 | 4.88 |
| Llama3 | 8.54 | 82.93 | 8.54 |
| Mistral2 | 11.22 | 76.83 | 11.95 |
| Qwen2 | 14.39 | 74.63 | 10.98 |

We observe that most comparisons result in a "Tie," and the number of instances where ID wins is slightly higher than the number where greedy decoding wins. This indicates that generations from integrative decoding achieve the same level of language fluency and coherence as greedy decoding. We have uploaded all the outputs from both ID and greedy decoding, along with the evaluation results, to the supplementary materials under the directory "eval_coherence/".

Comment

Thank you to all the reviewers for their constructive feedback on our work. Below, we provide a brief overview of: (1) the contributions and strengths of our work as highlighted by the reviewers, and (2) the updates made to the manuscript in response to the reviewers' comments.

Summary of Strengths and Contributions

We propose Integrative Decoding (ID), a decoding algorithm that incorporates "self-consistency" within its decoding objective to enhance the factuality of open-ended generation tasks.

  • Our approach is well-grounded in a clear theoretical framework while being simple to implement in practice (Reviewers JqQZ and kiKS).
  • It doesn't need additional prompting or training, making it more applicable to a wider range of scenarios than previous works (Reviewer Q7YL).
  • We conduct extensive experiments over six series of LLMs with varying scales and multiple benchmarks (Reviewer TZMW). Our approach consistently demonstrates significant and stable improvements (Reviewer Q7YL).

Issues Addressed in the Rebuttal Period

In response to the reviewers' suggestions and concerns regarding our work, we have updated the manuscript mainly in the following aspects:

  • Additional Baselines (Reviewer kiKS): We further compare ID with three more recently proposed techniques, as shown in Table 2.
  • Additional Experiments to Support the Decoding Objective (Reviewers JqQZ and Q7YL): We add experiments to support that ID can maintain language coherence and effectively foster self-consistency, as discussed in Section 3.6.
  • Inference Efficiency (Reviewers JqQZ, TZMW, and Q7YL): We include an evaluation of inference efficiency in Section 3.4. The results indicate that ID achieves competitive inference efficiency compared to other techniques in this line of research, while yielding more significant and stable improvements. We further discuss the value of exploring techniques that utilize more inference-time computation in exchange for enhanced performance in Appendix D.1.
  • Discussion on the Reliability of GPT-4 Evaluation and Human Evaluation (Reviewer TZMW): We mainly use GPT-4-turbo to assess factuality, providing it with the necessary reference information to guide its evaluation (e.g., reference answers, typical wrong answers, and information extracted from Wikipedia). We further discuss the reliability of this evaluation approach in Appendix D.2 and add a set of human evaluation results in Appendix C.1.

Thank you once again to all the reviewers for their feedback, which has helped us enhance our work to make it more solid and comprehensive.

AC Meta-Review

This paper presents a method for utilizing self consistency via integrative decoding and presents good results on improvements in factuality across several benchmarks. The paper presents many experiments with comparisons with several baselines. Overall reviews are positive and I suggest accepting the paper.

Additional Comments on Reviewer Discussion

There was significant back and forth between the authors and the reviewers that led to a positive outcome for the paper (e.g., raised scores), improving its quality.

Final Decision

Accept (Poster)