PaperHub
Overall rating: 5.8/10, Rejected (4 reviewers; min 5, max 6, std 0.4)
Individual ratings: 6, 5, 6, 6
Confidence: 3.8 | Correctness: 2.5 | Contribution: 2.5 | Presentation: 2.8
ICLR 2025

Cascade Reward Sampling for Efficient Decoding-Time Alignment

Submitted: 2024-09-26 | Updated: 2025-02-05
TL;DR

We significantly reduce the inference cost for decoding-time alignment methods while preserving high alignment ratings.

Abstract

Keywords
Language Model Alignment, Large Language Models

Reviews and Discussion

Official Review
Rating: 6

This paper focuses on reward-guided decoding-time alignment. They propose Cascade Reward Sampling (CARDS), which is mainly based on two claims:

  • They claim to have analyzed the properties of reward models (RMs) on incomplete text, and hypothesize that RMs can serve as approximations for value functions.
  • They show that values of predictive uncertainty can be a segmentation signal, and thus can divide the generation of an entire response into multiple steps (segments).

Combining the two claims, they establish a target distribution for sampling a new semantic segment. At each step, they sample one segment, obtain its reward, and use rejection sampling to align the samples with the established target distribution.
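
To make this loop concrete, here is a minimal Python sketch of the procedure as summarized above; the helpers `sample_segment` and `reward`, the target reward `r_star`, and the end-of-sequence marker are hypothetical stand-ins, and the acceptance rule is only one plausible reading of the rejection-sampling step, not necessarily the paper's exact rule.

```python
import math
import random

def cards_decode(prompt, sample_segment, reward, r_star, beta=1.0, max_steps=10):
    """Segment-level rejection sampling as summarized above (illustrative only).

    `sample_segment` stands for an LLM call that stops at a high-uncertainty token,
    `reward` for RM scoring of the current prefix, and `r_star` for a target reward.
    """
    prefix = prompt
    for _ in range(max_steps):
        candidate = sample_segment(prefix)       # propose one semantic segment
        r = reward(prefix + candidate)           # score the extended prefix with the RM
        # Accept with probability min(1, exp((r - r_star) / beta)); rejections re-sample
        # a new segment from the same prefix on the next iteration.
        if r >= r_star or random.random() < math.exp((r - r_star) / beta):
            prefix = prefix + candidate
            if candidate.endswith("</s>"):       # assumed end-of-sequence marker
                break
    return prefix
```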

Experiments are conducted on the HH-RLHF dataset. The authors compare CARDS with RAD/ARGS and naive rejection sampling, showing general advantages.

Strengths

Originality

  • The segmentation trick with uncertainty is novel and interesting, and it deserves to be shared with a broad audience.
  • The assumption on the reward distribution is original, though it may not hold.

Clarity

  • This paper is well-written and well-organized.
  • Most figures and tables are friendly to the readers.
  • The intuitions are clearly expressed.

Significance

  • The weaker assumption on the expressibility of the reward model is more reasonable than those in prior works.
  • This work has improved the performance of reward-guided decoding-time alignment, compared with baselines.

Weaknesses

Major

  • The reviewer cannot buy the idea that the reward model is a value function. Only the final output of the reward model is trained, and the intermediate predictions are more like a black box. There is a well-known paper [5] showing that a DPO model acts like a Q-function, but the result only holds for the credit-assignment setting.
  • There is a new paper showing that "partial reward" may not be correlated with "full reward". Please see Appendix C.3 of [6].
  • There is no non-trivial mathematical proof, so it would be better to remove the claim that "We first rigorously analyze the properties of reward models". The authors have conducted some experiments to support their intuitions or assumptions, which is great, but this is still far from "rigorous", since there is no theoretical guarantee.
  • Lack of baselines. CARDS is only compared with two baselines, RAD/ARGS (these two are essentially the same), and naive rejection sampling. Many reward-guided decoding-time alignment approaches have emerged in the past two years. For example, controlled decoding [1] can be a baseline, which samples the next token from $\pi(y)\exp(r(y)/\beta)$ (see the sketch after this list). ([1] is different from RAD/ARGS, since it does not need top-k sampling first.) The reviewer also guesses that [1] should be faster than CARDS.
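
For illustration, a minimal sketch of the sampling rule described in the last bullet, with hypothetical inputs (the base LLM's token log-probabilities and per-token value estimates); this only approximates the rule quoted above and is not the implementation in [1].

```python
import numpy as np

def controlled_decoding_step(token_logprobs, values, beta=1.0, rng=None):
    """Sample the next token with probability proportional to
    pi(y | x) * exp(V(y) / beta), as in the controlled-decoding rule above.

    `token_logprobs`: log-probabilities over the vocabulary from the base LLM.
    `values`: per-token value estimates (e.g., from a trained V_theta).
    Both are assumed inputs for this sketch.
    """
    rng = rng or np.random.default_rng()
    scores = token_logprobs + values / beta      # log pi(y|x) + V(y)/beta
    scores = scores - scores.max()               # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```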

Minor

  • Equation 11(b) only holds for soft-RL, which should be made clear to the readers.
  • Lack of datasets. The experimental results on HH-RLHF are acceptable but limited. More datasets are worth exploring; for example, UltraFeedback [2], HelpSteer [3], and BeaverTails [4] are worth trying.

The empirical approach is generally fine to the reviewer, but unfortunately, the claims have many problems. The reviewer is willing to raise the rating if the issues can be solved.

[1] Controlled Decoding from Language Models, https://arxiv.org/abs/2310.17022

[2] UltraFeedback: Boosting Language Models with Scaled AI Feedback, https://arxiv.org/abs/2310.01377

[3] HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM, https://arxiv.org/abs/2311.09528

[4] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset, https://arxiv.org/abs/2307.04657

[5] From r to Q∗: Your Language Model is Secretly a Q-Function, https://arxiv.org/abs/2404.12358

[6] TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling, https://arxiv.org/abs/2410.16033

Questions

  • Equation (4) is not well defined. With respect to which distribution is the expectation taken?
  • Which reward model did you use for Figures 2b, 3, and 4? Did you only examine one reward model to support the claims shown in these figures?
  • Can this approach be applied to smaller and weaker reward models, like GPT-2 models [1][2]? This would be important for groups with restricted computational resources.
  • Can this approach be applied to multi-objective alignment, like what many reward-guided decoding-time algorithms [3][4] can do? (Anyway, they are concurrent works, so there is no need to compete with them.)

[1] https://huggingface.co/Tristan/gpt2_reward_summarization

[2] https://huggingface.co/Ray2333/gpt2-large-helpful-reward_model

[3] PAD: Personalized Alignment at Decoding-Time, https://arxiv.org/abs/2410.04070

[4] GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment, https://arxiv.org/abs/2410.08193

Ethics Concerns

No concerns, thanks.

Comment

Q6: Lack of datasets.

Thanks for the suggestion! Our paper already includes results on UltraFeedback in Table 11 (page 22). Our choice of datasets aligns with recent prior work; for example, [11] uses the HH-RLHF dataset only. Also, we have demonstrated the effectiveness of CARDS on a set of RMs [7, 8, 9, 10]. These RMs are trained on a wider set of preference datasets.

To further demonstrate CARDS on more datasets, we add the results on the suggested datasets [12, 13]. CARDS consistently outperforms previous work.

Dataset | Method | RM Score | # LLM Calls | # RM Calls | # Total Calls | Inference Time per 100 samples (min)
BeaverTails | ARGS | 7.93 | 128.00 | 5120.00 | 5248.00 | 126.3
BeaverTails | CARDS | 8.18 | 847.88 | 47.48 | 895.36 | 53.4
HelpSteer | ARGS | 6.55 | 128.00 | 5120.00 | 5248.00 | 818.38
HelpSteer | CARDS | 7.51 | 1046.76 | 73.80 | 1120.56 | 281.3

[7] argsearch/llama-7b-rm-float32

[8] weqweasdas/RM-Mistral-7B

[9] Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback

[10] weqweasdas/hh_rlhf_rm_open_llama_3b

[11] ARGS: Alignment as Reward-Guided Search. ICLR 2024.

[12] PKU-Alignment/BeaverTails

[13] nvidia/HelpSteer

Q7: Eq. 4 is not well defined.

In Eq. 4, $s_{\geq t}$ is the sequence after the $t$-th token. It is sampled from the policy of the base LLM, $\pi_{\text{LLM}}(\cdot \mid s_{<t})$, given the previously generated prefix. We will modify the notation to $s_{\geq t} \sim \pi_{\text{LLM}}(\cdot \mid s_{<t})$ in the final version for clarity.
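
The exact statement of Eq. 4 is not reproduced in this thread; based on the notation above and the AC's later description (the prefix value as an expected reward of completions under the LLM), it plausibly has a form like the following, though the paper's statement may differ.

```latex
% Plausible reconstruction of Eq. 4, inferred from the response above and the
% AC's description; the paper's exact statement may differ.
V(s_{<t}) \;\approx\; \mathbb{E}_{\, s_{\geq t} \sim \pi_{\mathrm{LLM}}(\cdot \mid s_{<t})}\big[\, r(s_{<t}, s_{\geq t}) \,\big]
```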

Q8: Which reward model did you use for Fig. 2b, 3, 4?

In these figures, both [7, 8] are used. Fig. 3 shows the reward distributions of [7, 8]. The points in Fig. 2b and Fig. 4 are collected from both [7, 8], demonstrating that the relationship between prefix reward and full-sequence reward holds for diverse RMs.

[7] argsearch/llama-7b-rm-float32

[8] weqweasdas/RM-Mistral-7B

Q9: Can this approach be applied to some smaller and weaker reward models?

Yes. Aside from the standard results (7B base model, 7B RM) shown in the main body, we also have results for smaller RM (7B base model, 3B RM) in Appendix B.7 (page 17). CARDS can still achieve promising alignment ratings and efficiency under this setting.

Q10: Can this approach be applied to multi-objective alignment?

Yes. As discussed in L43~L45, CARDS can be applied to the setting of multiple RMs by aggregating the reward scores from these RMs. If each of these RMs represents a distinct objective, this is the case of multi-objective alignment. We believe this is an interesting direction and plan to explore this extension in future work.
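
For illustration, one simple way to aggregate scores from several RMs; the exact aggregation rule is not specified in this thread, so the weighted sum below is only an assumed example.

```python
def aggregate_rewards(text, reward_models, weights=None):
    """Combine scores from several reward models into a single scalar.

    Each element of `reward_models` is a callable that scores `text`.
    A (uniform) weighted sum is an assumed choice, not the paper's rule.
    """
    scores = [rm(text) for rm in reward_models]
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))
```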

Thank you for your thorough feedback, which has greatly helped improve our work. We have carefully addressed the concerns raised and are happy to provide further clarification if needed.

Comment

Thank you for your efforts. I will read them in detail and respond to you later.

Comment

Q1: The reviewer cannot buy the idea that the reward model is a value function.

We did not claim that the reward model (RM) is a value function. We claim that RM is a good approximation to the value function on semantically complete prefixes. The core insight to understand this is that semantically complete prefixes closely resemble complete sentences and can be comprehended by RMs. This is not a mathematical claim, but an explanation based on empirical findings.

The mentioned [1] primarily focuses on DPO, which uses an implicit RM, while our analysis mainly focuses on explicit RMs. It is unclear to us how the theoretical results on DPO in [1] would rule out our explanations for explicit RMs.

[1] From r to Q*: Your Language Model is Secretly a Q-Function. 2024.

Q2: There is a new paper showing that "partial reward" may not be correlated to "full reward".

We are grateful that [2] has further verified this conclusion. As demonstrated by [2], semantically complete segmentation induces a higher correlation coefficient than the 1/3 cutoff.

Please note that [2] uses different QA datasets than those used in this paper. This new setting requires re-tuning the uncertainty threshold to get an appropriate segmentation, much like tuning hyperparameters in many ML algorithms. In practice, we find it effective to adjust the threshold so that each response is split into no more than 10 and no fewer than 5 segments. However, [2] sets the uncertainty threshold to be the same as the HH-RLHF dataset. We suspect this is the main reason the correlation rise shown in [2] is not significant enough.
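
A minimal sketch of the threshold-tuning heuristic mentioned above (splitting each response into roughly 5 to 10 segments), assuming a hypothetical `count_segments(response, tau)` helper that returns the number of segments produced at uncertainty threshold `tau`; the candidate grid is arbitrary.

```python
def tune_uncertainty_threshold(responses, count_segments,
                               candidates=(0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
                               min_seg=5, max_seg=10):
    """Pick the threshold under which the largest fraction of responses
    splits into between `min_seg` and `max_seg` segments."""
    best_tau, best_frac = candidates[0], -1.0
    for tau in candidates:
        in_range = sum(min_seg <= count_segments(r, tau) <= max_seg
                       for r in responses)
        frac = in_range / len(responses)
        if frac > best_frac:
            best_tau, best_frac = tau, frac
    return best_tau
```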

We understand that the analysis of RM properties is still a growing field, and many papers may have varying conclusions. However, there is no gold-standard way to analyze this problem for now. We are open to discussing any detailed setting in studying this problem, and we hope the reviewer can acknowledge the new findings and insights of our paper.

[2] TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling, Arxiv 2024.

Q3: It would be better to remove the “rigorous”.

Thanks for the suggestion! We will remove “rigorous” and emphasize that the analysis in this paper is based on empirical findings.

Q4: Lack of baselines.

Besides RAD/ARGS and naive rejection sampling, we have also included more baselines of alignment methods in Table 2, Table 3, and Table 5 (e.g., PPO [4], DPO [5], RAIN [6]). Since the particular problem studied in this paper is alignment rather than general controlled decoding, we think the current set of baselines is already comprehensive.

CD in [3] requires additional training to learn $V_{\theta}$, which introduces significant computational costs. Unlike CD, CARDS and the main baselines considered in this paper do not require any training.

Furthermore, in CD, sampling the next token from $\pi(y)\exp(r(y)/\beta)$ requires computing $V_{\theta}$ over all tokens in the vocabulary. For example, if the vocabulary contains 10,000 tokens, $V_{\theta}$ needs to be computed 10,000 times in parallel to score each candidate token; the "one call" in [3] therefore contains multiple parallel $V_{\theta}$ computations. Top-$K$ sampling can reduce the number of $V_{\theta}$ computations to $K$, which suggests that CD (without top-$K$) is slower than ARGS/RAD. Since CARDS is already faster than ARGS/RAD, we believe CARDS will be faster than CD [3] under the same model size.

[3] Controlled Decoding from Language Models. ICML 2024.

[4] Proximal Policy Optimization Algorithms. 2017.

[5] Direct preference optimization: Your language model is secretly a reward model. NeurIPS 2024.

[6] RAIN: Your Language Models Can Align Themselves without Finetuning. ICLR 2024.

Q5: Equation 11(b) only holds for soft-RL.

Thanks for pointing out this confusion! We will add a remark on the soft-RL setting in the final version.

Comment

Q1. I accept the clarification and I believe the claim would still make sense in many cases. As for the paper "From r to Q*", I think the result can be directly extended to RMs as long as the RM is trained to align with the BT model, since you can just remove the $\pi_\text{ref}$ term in the proof.

Q2. I have no relationship with this paper, and thus am not certain about the underlying reason. However, I think it is fine for different claims to be presented under different settings in the NLP community. I would suggest that the authors clarify the limitations of their claims and emphasize their empirical successes in the final revision.

Q4. I understand that $V_\theta$ needs additional training. But I want to clarify that its inference cost is not large. Suppose you have a language model; you can directly get $\pi(s \mid x)$ for all $s \in \Omega$ at one time, since these are all matrix computations. And the computation of $V_\theta(s \mid x)$ is the same as that of $\pi(s \mid x)$, since they may share the same architecture. Anyway, since the rebuttal phase draws to an end, the existing experiments are already acceptable.

Q6. Thank you for your efforts. I appreciate the results.

I am satisfied with the explanation, and I will raise my rating to 6. However, I strongly suggest that the authors revise their paper:

  • remove the claims involving rigorous analysis
  • highlight the limitations of their claims as clarified in the rebuttal
  • emphasize the empirical study as the main contribution

If the revision is not satisfactory, I reserve the right to decrease the rating.

Comment

Thank you for acknowledging our rebuttal and raising your score! We appreciate your engagement in the discussion. Based on your questions and our responses, we have updated the PDF with the following points (the added parts are marked in blue):

  • Remove the claim of “rigorous” analysis and emphasize the empirical study: L82, L90-L93, L209-L210, L220-L223, L244, L529.

  • Highlight the limitations: (1) We highlight that in Fig.2, the relation between prefix reward and full-response reward is based on 1 dataset and 2 RMs, and the results are dependent on appropriate choices of uncertainty threshold (page 5). (2) We highlight that the claim that “RM is a good approximation to the value function on semantically complete prefixes” is not a mathematical claim, but an observation based on empirical findings (page 5). (3) We highlight that Eq. 11b uses a property of value functions in the soft-RL setting (page 15).

  • Additional experimental results: We add the results on BeaverTails and HelpSteer (Appendix C.9 and Table 12).

If you have further questions or concerns about the updated PDF, please let us know. We are happy to further revise the paper as needed. Thank you again for your valuable feedback.

Official Review
Rating: 5

The paper introduces a decoding-time alignment method for large language models (LLMs) that ensures high-reward and high-likelihood text generation with reduced computational costs. CARDS uses rejection sampling to create text by dynamically determining segment lengths based on the predictive uncertainty of LLMs, significantly enhancing efficiency. This method also maintains fluency and aligns closely with human preferences, demonstrating substantial improvements over existing methods in both speed and alignment accuracy during text generation.

Strengths

  1. The article raises an excellent question on how to enhance the efficiency of alignment during the inference phase.
  2. Many designs in the article's methodology are interesting, such as "Our method leverages the comprehension ability of pre-trained LLMs for segmentation."
  3. The experimental results of CARDS are outstanding.

Weaknesses

  1. The presentation of the article is somewhat difficult to follow. For example, Figure 1, which explicates the contributions mentioned in the introduction, requires the integration of content from many sections later in the text to be understood. Moreover, Section 4.1.1 repeatedly refers to Figure 2c without clearly explaining how Figure 2c is produced and its detailed meaning. Section 4.2.1, however, does not mention Figure 2c at all.
  2. Many intermediate conclusions in the methodology lack theoretical support and are merely based on simple experiments and conjectures. For instance, "a full-response reward will be high given a high-reward prefix" and "This is because initiating a new segment is more unpredictable than continuing an existing one."
  3. I think previous works [1] also assume Lemma 1 is valid; however, this paper still does not provide convincing proof, thus, the contribution in this part seems incremental.
  4. CARDS appears to be a method for enhancing the efficiency of decoding-time alignment, but I have no idea why CARDS could lead to significant performance improvements in the experiment.

[1] Args: Alignment as reward-guided search.

Questions

  1. What do you mean by "However, while some of the existing decoding-time alignment methods still struggle with the trade-off between alignment and fluency, they all encounter significant efficiency challenges due to auxiliary steps added to their decoding process."?
  2. In line 239, "Therefore, the above observation also suggests that RMs can be used as value functions on semantically complete prefixes." Can you explain why?
  3. In Lemma 1, I still cannot understand why can we assume "Assuming the reward models are equivalent to value functions when evaluating semantically complete prefixes". Is there any previous literature proof of this?
  4. What if the prefixes are not semantically complete?
Comment

Q8: What if prefixes are not semantically complete?

The case has been studied in Fig. 2c (page 5). The “Static” results are based on prefixes that are not semantically complete. RMs are no longer accurate when scoring such prefixes. This result verifies our assumption and highlights the effectiveness of our algorithm.

Thank you for your constructive feedback, which has greatly helped strengthen our work. We have carefully addressed the concerns raised and are happy to provide further clarification if there are any additional questions.

Comment

Q1: Fig. 1 requires integration from many sections later.

Fig. 1 is a high-level overview of our method, and thus it cannot explicitly show all of the technical details. There are many new designs and findings within our method and it is impossible to put them all together in one figure. We believe Figure 1 effectively conveys the core idea of our approach and the caption explicitly references the corresponding sections that elaborate on specific details. We would greatly appreciate your suggestions for improving Fig. 1 to make it clearer or more informative.

Q2: Section 4.1.1 repeatedly refers to Fig. 2c, while section 4.2.1 does not mention Fig. 2c.

Fig. 2c is to demonstrate that semantically complete segmentation is better than static segmentation in preserving the accuracy of RM scoring. Section 4.1.1 discusses the importance of semantically complete segmentation, which is closely related to Fig. 2c. Section 4.2.1 discusses the technical details of implementing uncertainty-based segmentation, which is a different topic from Fig. 2c. We believe referring to Fig. 2c only in section 4.1.1 is a reasonable structure.

Q3: Intermediate conclusions lack theoretical support.

This paper is primarily empirical-result-driven. We provide comprehensive explanations and demonstrations for the empirical findings in this paper. We believe empirical-result-driven insights and conclusions are also valuable. We are open to discussing the potential theoretical proofs for these explanations, but such mathematical work is beyond the scope of this paper.

Q4: [1] also assumes Lemma 1 is valid; this paper still does not provide convincing proof.

[1] has a stronger assumption than our Lemma 1. [1] assumes that the reward models are accurate on arbitrary prefixes, whereas our Lemma 1 only requires accuracy on semantically complete prefixes. This is grounded in the observation that RMs are typically trained on complete responses, and semantically complete prefixes closely resemble the data that RMs have seen during training. The improved alignment ratings and efficiency compared with [1] also support this claim.

The proof of our Lemma 1 is rigorous, and the assumption is validated through empirical results—neither of which was done in the previous work [1].

[1] ARGS: Alignment as Reward-Guided Search. ICLR 2024.

Q5: Why CARDS leads to significant performance improvements?

The way we use reward models is appropriate (only scoring semantically complete sequences), and thus the reward scores are accurate in our framework. Previous works that use reward models on arbitrary prefixes suffer from inaccurate reward scores.

Fig. 2c shows a good example of this inaccuracy on arbitrary prefixes: reward scores on fixed-length segments (which are not semantically complete) are not accurate.

Q6: "However, while some of the existing decoding-time alignment methods still struggle with the trade-off between alignment and fluency, they all encounter significant efficiency challenges due to auxiliary steps added to their decoding process."

Thanks for pointing out the confusion! The trade-off between alignment and fluency has been discussed by many papers [2, 3, 4], and the “auxiliary steps” refer to reward model scoring. These two statements may not be strongly correlated, and we will separate them into 2 sentences in the final version.

[2] One fish, two fish, but not the whole sea: Alignment reduces language models' conceptual diversity. 2024.

[3] A Roadmap to Pluralistic Alignment. 2024.

[4] Does Alignment Tuning Really Break LLMs' Internal Confidence? BlackboxNLP 2024.

Q7: "Therefore, the above observation also suggests that RMs can be used as value functions on semantically complete prefixes."Can you explain why?

The mentioned observation shows that RMs are also accurate on semantically complete prefixes. This shares the same property as the value function in the discussed cases. Therefore, we conclude that RMs are approximately equivalent to the value function specifically on the semantically complete prefixes. This is not a mathematical claim, but an explanation/analogy based on empirical findings, as explained in our responses to Q3 and Q4.

Q7: "In Lemma 1, I still cannot understand why can we assume "Assuming the reward models are equivalent to value functions when evaluating semantically complete prefixes". Is there any previous literature proof of this?"

As mentioned in our response to Q4, we verify the assumption through empirical results. There is no previous literature proof of this and we are the first to empirically validate it.

Comment

Thanks for your response. However, there are still some remaining concerns.

About Q3, Q4 and the first Q7.

I still think empirical results are insufficient for proving a Lemma. I understand that empirical-result-driven insights are also valuable. However, such insights should be part of your experimental results, rather than the methodology.

About Q5.

I think previous works like DPO and PPO also only score semantically complete sequences. DPO does not even use reward models.

Comment

Thank you for your reply, allowing us the opportunity to clarify.

“I still think empirical results are insufficient for proofing a Lemma…However, the insights should be your experimental results, rather than methodology.”

Many impactful ML/DL papers introduce new methodologies without including any Lemmas. In fact, it is common in ML/DL research for methodologies and algorithms to be developed first, with theoretical proofs following later. In our work, we have provided clear motivation, explicitly stated the assumptions underlying our algorithm, and offered comprehensive empirical evidence to validate them. The absence of a Lemma alone should not be the reason for rejection.

We believe that the strength of our contributions lies in the combination of a practical algorithm and rigorous empirical validation. Dismissing the work solely for lacking a Lemma overlooks the significant impact and insights it offers to the community. Instead, this paper provides a strong result for future theoretical developments while making immediate contributions to advancing methodology and understanding in the field.

”I think previous works like DPO and PPO also only scoring semantically complete sequences. DPO even does not use reward models.”

Thank you for your clarification. Compared to DPO/PPO, the improvement of CARDS comes from directly sampling from the target distribution without requiring training. This avoids issues like undesired convergence and difficult optimization during training, which have been noted in prior work [e.g. 1, 2]. The improvement of inference-time alignment methods over DPO/PPO has also been noted in [3].

Compared to previous inference-time alignment methods such as RAD/ARGS, CARDS achieves better performance by only scoring semantically complete sequences.

[1] Open problems and fundamental limitations of reinforcement learning from human feedback. TMLR 2023

[2] Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms. NeurIPS 2024

[3] ARGS: Alignment as Reward-Guided Search. ICLR 2024.

We sincerely appreciate the time you took to review our paper and read our responses. We were wondering if there are any specific flaws in the paper that led to your score of 3. Are there any changes we could make to address your concerns and potentially improve the score? Your input would be invaluable in helping us strengthen our work.

Comment

Thanks for your thoughtful response.

For the first point,

Please direct me to any 'impactful ML/DL papers' that introduce a new lemma without any proof or citation, relying solely on empirical evidence, as presented in your paper. If such examples are provided, I am willing to acknowledge the gaps in my understanding and increase my score.

For the second point, sorry that my previous question was somewhat misleading. As far as I am aware, decoding-time alignment is not universally acknowledged to achieve significantly better alignment results compared to DPO or PPO. Therefore, I find this to be an impressive discovery and am curious about the reasons behind its success. Regardless, while I still have some reservations, this can be explored as part of future work.

Comment

Thanks a lot for your reply!

For the first point, we did not say that “impactful ML/DL papers that introduce a new lemma without any proof or citation, relying solely on empirical evidence”. What we said is that “Many impactful ML/DL papers introduce new methodologies without including any Lemmas”. Regarding the claims in the methodology section, such as "a full-response reward will be high given a high-reward prefix" and "This is because initiating a new segment is more unpredictable than continuing an existing one", we did not present these as Lemmas in the paper. We explained their intuitions and provided comprehensive empirical support. To avoid confusion, we will emphasize in the revised version that these claims are our explanations and observations, not mathematical claims. Thank you for this feedback, which has greatly improved the clarity of the paper.

In case you are referring to our Lemma 1, we have provided rigorous proof in Appendix A.

For the second point, thank you for your clarification. We agree that further exploring the comparison between decoding-time alignment methods and fine-tuning-based approaches like DPO or PPO, as well as how they might complement each other, would be very interesting. We look forward to pursuing this in future work.

Please let us know if you have any further questions; we would be happy to answer and discuss. Thank you again for engaging in this discussion!

Comment

Thanks for your response. I have raised my score.

Official Review
Rating: 6

This paper proposes a novel segment-based sampling method for efficient decoding-time alignment, leveraging rejection sampling to iteratively generate small semantic segments of high reward.

Strengths

  1. This paper conducts a rigorous analysis of reward models and demonstrates that RMs can serve as value functions on semantically complete segments.
  2. The generation method is segment-based, and the segment length is dynamic.
  3. The experiments are adequate and reasonable, and the paper is well written.

Weaknesses

  1. The experiments are conducted on 7B models. The method could be verified on larger models.
  2. It is unclear whether the parallelization scheme for dynamic segmentation leads to slower inference when the batch size is larger.

Questions

The same as above

Comment

Q1: Larger models.

The choice of model sizes follows recent prior work [1], where the largest model is 7B. To evaluate the performance of our method on larger models, we show below that CARDS continues to outperform ARGS on 13B models. The models used are from [2, 3]. Due to the limitation of GPU memory, these are the largest models we can run.

Method | RM Score | # LLM Calls | # RM Calls | # Total Calls | Inference Time per 100 samples (min)
ARGS | 2.67 | 128.00 | 5120.00 | 5248.00 | 275.8
CARDS | 3.24 | 441.80 | 7.72 | 449.52 | 74.2

[1] ARGS: Alignment as Reward-Guided Search. ICLR 2024.

[2] meta-llama/Llama-2-13b-chat-hf

[3] miulab/llama2-7b-ultrafeedback-rm

Q2: Slow inference when batch size is large.

The overall inference speed is not compromised for larger batches. According to Table 7 on page 17, batch size 4 has fewer RM calls, and its number of LLM calls is below 4 times that of batch size 1. Importantly, for larger batches, LLM calls and RM calls are conducted in parallel. Therefore, the overall inference for larger batches is effectively faster.

Thank you for your supportive and constructive feedback. We have carefully addressed the concerns raised, adding new results to strengthen our work. If there are any additional questions, we are happy to provide further clarification.

Comment

Thank you for the valuable feedback! We have posted responses, and hope that they can address your concerns. We appreciate your time for the discussion and are happy to provide further explanations and/or experiments if you still have remaining questions.

Official Review
Rating: 6
  • The paper focuses on enhancing decoding-time alignment methods, particularly addressing the high computational costs associated with existing approaches.
  • It introduces Cascade Reward Sampling (CARDS), a novel method that samples segments, leveraging the insight that reward models (RMs) favor high-reward prefixes. CARDS incrementally generates high-reward semantic segments, aiming to achieve optimal reward in the final output by prioritizing these segments.
  • Experimental results demonstrate significant improvements in both performance and speed over baseline methods.

Strengths

  • The shift to segment-level sampling, along with the use of uncertainty as a termination signal for segments, presents a unique and novel approach.
  • The results show significant gains in performance and speed, rendering this approach very practical.

Weaknesses

  • The reliance on a target score poses a notable limitation. How can one determine it? Different RMs provide values on different scales.
    • Exploring alternative sampling strategies, like sampling multiple segments per step and selecting the highest-scoring option before proceeding greedily, could be beneficial (at least as a baseline comparison).
  • The experiments are conducted solely on the HH-RLHF dataset, limiting the generalizability of the findings. In particular, HH-RLHF is a very simple dataset, as many inference-time methods already outperform trained methods such as PPO and DPO. Testing on a few more datasets would solidify the findings.

Questions

Suggestions:

  • One could include more baselines, such as in-context learning for alignment (Rethinking Alignment via In-Context Learning) and best-of-N based on full-sequence rewards.
Comment

Q1: The reliance on target scores.

The reliance on a target reward score $r^\star$ is a general feature of rejection-sampling-based methods like [1]. In practice, $r^\star$ is estimated from a small subset of chosen responses in the preference dataset, and the same strategy is used in [1].
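
A minimal sketch of this estimation step, assuming a hypothetical `reward` scorer; taking the mean (or a high quantile) of the RM scores of the chosen responses is only an assumed estimator, since the exact rule is not spelled out here.

```python
def estimate_target_reward(chosen_responses, reward, quantile=None):
    """Estimate the target reward r* from a small subset of chosen responses
    in the preference dataset. `reward` scores a full response with the RM."""
    scores = sorted(reward(resp) for resp in chosen_responses)
    if quantile is None:                      # default: mean RM score (assumed choice)
        return sum(scores) / len(scores)
    idx = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[idx]                        # e.g. quantile=0.9 for a stricter target
```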

Alternative strategies like sampling multiple candidates in parallel and choosing the best one (segment-level best-of-$N$ search) do not depend on $r^\star$. However, they require specifying the value of $N$ and face significant efficiency issues, as the number of sampled candidates remains fixed at every step.

We have implemented segment-level best-of-$N$ search to compare efficiency. We adopt the same uncertainty-based segmentation as used in CARDS. $N$ is set to 10, and the experiment is based on Llama 7B and HH-RLHF. The results below show that segment BoN improves over ARGS (which is token-level BoN) but is much slower than CARDS.

Method | RM Score | # LLM Calls | # RM Calls | # Total Calls | Inference Time per 100 samples (min)
Token BoN (ARGS) | 7.85 | 128.00 | 5120.00 | 5248.00 | 238.7
Segment BoN | 8.47 | 1846.72 | 68.18 | 1914.90 | 195.9
CARDS | 8.71 | 744.14 | 34.48 | 778.62 | 66.1

[1] Statistical Rejection Sampling Improves Preference Optimization. ICLR 2024.

[2] ARGS: Alignment as Reward-Guided Search. ICLR 2024.

Q2: The experiments are conducted solely on the HH-RLHF dataset.

We do have results for the UltraFeedback dataset in Table 11 (page 22), where the alignment rating and efficiency are consistent with the HH-RLHF results. Our choice of datasets aligns with recent prior work; for example, [2] uses the HH-RLHF dataset only.

To further demonstrate the generalizability of our findings, we also provide the results on two more datasets [4, 5]. CARDS consistently outperforms previous work.

Dataset | Method | RM Score | # LLM Calls | # RM Calls | # Total Calls | Inference Time per 100 samples (min)
BeaverTails | ARGS | 7.93 | 128.00 | 5120.00 | 5248.00 | 126.3
BeaverTails | CARDS | 8.18 | 847.88 | 47.48 | 895.36 | 53.4
HelpSteer | ARGS | 6.55 | 128.00 | 5120.00 | 5248.00 | 818.38
HelpSteer | CARDS | 7.51 | 1046.76 | 73.80 | 1120.56 | 281.3

[2] ARGS: Alignment as Reward-Guided Search. ICLR 2024.

[4] PKU-Alignment/BeaverTails

[5] nvidia/HelpSteer

Q3: More baselines such as in-context learning for alignment and best-of-$N$ on full sequences.

We primarily choose baselines following recent previous works like ARGS [2] and RAIN [3]. Notably, we have already included RAIN [3], an in-context learning method, as our baseline.

Previous works also exclude BoN as a baseline due to its high computational cost. For completeness, we now provide results for BoN below. Full-sequence BoN achieves worse alignment ratings and is slower than CARDS.
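
For reference, full-sequence best-of-N reduces to the following minimal sketch, with hypothetical `generate` and `reward` callables standing in for the base LLM and the RM.

```python
def best_of_n(prompt, generate, reward, n=10):
    """Full-sequence best-of-N: sample n complete responses from the base LLM
    and return the one with the highest reward-model score."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)
```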

Method | RM Score | # LLM Calls | # RM Calls | # Total Calls | Inference Time per 100 samples (min)
BoN | 7.64 | 1280.00 | 10.00 | 1290.00 | 116.0
CARDS | 8.71 | 744.14 | 34.48 | 778.62 | 66.1

[2] ARGS: Alignment as Reward-Guided Search. ICLR 2024.

[3] RAIN: Your Language Models Can Align Themselves without Finetuning. ICLR 2024.

Thank you for your supportive and constructive feedback. We have carefully addressed the concerns raised, adding new results and baselines to strengthen our work. If there are any additional questions, we are happy to provide further clarification.

Comment

Thank you for the valuable feedback! We have posted responses, and hope that they can address your concerns. We appreciate your time for the discussion and are happy to provide further explanations and/or experiments if you still have remaining questions.

Comment

We thank all reviewers for their detailed and constructive reviews. The key concern from the reviewers is the conclusion that RMs are approximately value functions for semantically complete prefixes. We would like to emphasize that this conclusion is based on empirical results, supported by comprehensive explanations and demonstrations provided in the paper. We believe empirical-result-driven insights and conclusions are also valuable.

We have made the following changes to address reviewers’ concerns:

  • We provided results on new datasets: BeaverTails and HelpSteer.

  • We provided results for an additional baseline: full-sequence best-of-$N$ search.

  • We provided results on larger models: Llama-2 13B.

This paper offers novel insights into the generalization of RMs on incomplete responses, introduces a simple and effective alignment algorithm, and demonstrates significant empirical improvements. We believe this paper will be valuable to research on reward models and alignment.

AC Meta-Review

The paper proposes a new technique for reward guided text generation that is computationally cheaper than previous methods and achieves higher scores. The strength of this paper is its empirical evaluation that demonstrates clear improvements over the baselines. The weakness of this paper is the absence of theory, a problematic assumption and questionable analysis. I recommend rejection since the analysis has important inconsistencies.

Additional Comments from the Reviewer Discussion

The reviewers see some good potential in the proposed approach. The empirical results are good and demonstrate clear improvements over the baselines. The problem is the questionable analysis and problematic justification provided to explain the approach and the results. In addition to the comments already provided in the reviews, let me articulate some problems and inconsistencies.

First, Eq. 4 is problematic. Eq. 4 suggests that the value of a prefix is the expected value of the completions of this prefix according to the LLM distribution. This is problematic, since the value of a prefix should reflect human preferences only, instead of depending on the LLM. When the values of prefixes depend on the LLM, then if we change the LLM that we are trying to improve, the values of those prefixes will also change. The point of RLHF is to use preference data to improve an LLM, but if we use this LLM to define the values of prefixes, then we are effectively using the LLM that is assumed to be inaccurate to improve itself based on its own inaccurate distribution.

Second, as noted by the reviewers, the assumption in Lemma 1 is problematic. Lemma 1 assumes that $r(x, y_{<t}) = V(x, y_{<t})$. When we combine this assumption with Eq. 4, we get $r(x, y_{<t}) = \sum_{y_{>t}} \pi(y_{>t} \mid x, y_{<t})\, V(x, y)$. Again, this means that the reward of a prefix is dependent on the LLM probabilities, which is problematic.

Third, in Figure 2a, the paper shows the empirical value distribution of full sequences given a prefix and suggests that this distribution is approximately Gaussian with its mean controlled by the value of the prefix as shown in Eq 5. This is inconsistent with Eq 4 since Fig 2a and Eq 5 rely on the preference data distribution to estimate this Gaussian while Eq 4 relies on the LLM distribution.

Overall, it is not clear why the proposed approach should achieve higher scores than the baselines as demonstrated in the empirical results.

Final Decision

Reject