Thank you for your detailed review and questions. We hope that our response will satisfy your concerns. Please note that we have corrected the typos and writing errors that you highlighted.

Bradley-Terry Model on Partial Sequences

We note that baseline RGTG methods such as ARGS (Khanov et, al 2024), CD (Mudgal et. al, 2024) and RAD (Deng and Raffel, 2023) do not present any connection to offline RLHF. Our analysis for PARGS shoes that we recover a ratio of two distinct RLHF policies. We argue that this approach is the best because it avoids the pitfalls of Theorem 1, and has strong (with significance testing) empirical performance.

Missing term in Equation 3

The equation has inside the exponential which shows up as in the other equations.

Method Assumptions

If we are given two partial sequences A and B, the one with the best extension to full sequence should be preferred. If the winning sequence from the preference dataset is the optimal sequence or human generated, then our assumption would do the correct thing.

RAD Baseline

RAD (Deng & Raffel, 2023) is very similar to CD-Fudge method (Mudgal et. al ,2024) in that they distill a tokenwise reward model from a full-sequence reward model using a square loss function. We included CD in all our experiments. Nevertheless, we present the results of RAD on the TLDR dataset.

Method	Reward Standard Error
Top-K Sampling	-0.11 0.28
RAD	0.11 0.25
CD	0.32 0.33
ARGS	1.57 0.21
PARGS	2.36 0.20

We observe that PARGS has the best performance when looking at the reward score. We will add the complete results to the paper.

Metrics

We briefly discuss the different metrics that we employ starting line 298. Following Khanov et. al, 2024, we use reward models trained on full sequences (preference data) as an oracle to measure the alignment of generation with human preferences. Next we use GPT-4 as a proxy to human evaluation to rank between pairs of generations. We provide the prompt in the appendix. This has been shown to align with human preferences (Rafailov et al.,2023). We also present a human evaluation study in Appendix H.

We present an additional experiment on applying PARGS on Machine Translation using a reward model trained on post-edit dataset in Appendix F. For these experiments we look at the BLEU score. Finally, we look at generation diversity using the Rouge-L metric in appendix F.