Boosting LLM Translation Skills without General Ability Loss via Rationale Distillation
This paper proposes RaDis, which distills self-generated rationales to improve LLMs' translation proficiency without inducing forgetting of general capabilities.
Abstract
Reviews and Discussion
This paper introduces Rationale Distillation (RaDis), a new approach for fine-tuning Large Language Models (LLMs) on machine translation (MT) tasks. Unlike traditional fine-tuning methods that rely solely on parallel corpora, RaDis uses LLM-generated rationales to enhance MT performance while maintaining the LLM’s general capabilities, such as instruction-following and reasoning.
Strengths
RaDis is both straightforward and effective, leveraging the language model itself to generate explanations (rationales) for fine-tuning. Experimental results demonstrate that this method preserves the model’s general abilities while enhancing MT performance, although its MT performance remains slightly below established baselines like ALMA.
Weaknesses
Firstly, the proposed method lacks novelty from my perspective, as leveraging LLMs to generate data from their own distribution for continual fine-tuning has been extensively explored in prior research (e.g., various PPO/DPO variants, along with chain-of-thought works that prompt the LLM to generate reasoning traces). Moreover, based on my experience with machine translation (MT) and reinforcement learning with human feedback (RLHF), the generative quality of a 7B model is often unreliable, typically requiring a filtering process (e.g., with a reward model). Incorporating all rationales generated by the LLM without any scoring, filtering, or grounding is likely to reinforce the model's own hallucinations. Even if this approach preserves the model's instruction-following abilities, the practical utility of such fine-tuning remains questionable.
Secondly, the rationales derived from a translation dataset may lack diversity. A comparison or discussion with existing work, such as TOWER [1], is necessary. TOWER achieves high MT performance by utilizing parallel corpora and diverse instruction-following datasets during fine-tuning, which appears more effective on benchmarks (and the idea is more straightforward than rationale generation). Therefore, I believe another baseline is needed: simply fine-tuning on parallel data plus a diverse instruction-following dataset to preserve the model's pretrained ability.
[1] Alves et al., (2024). TOWER: An Open Multilingual Large Language Model for Translation-Related Tasks
Questions
- Some comparison and discussion of previous work like TOWER would be useful.
- I am curious if there is any analysis of the generated rationales' quality. I see in Table 4 there is some categorization, but are all the generated responses relevant? I doubt that a 7B model can give very high-quality rationales for all translation pairs in the training data.
- I like the analysis in Section 5.2, and it actually resonates with my points in the weaknesses. Fine-tuning a model on its self-generated responses is most helpful for preserving its instruction-following ability. However, I believe simply using all self-generated rationales is also limited. I am wondering whether any on-policy mechanism for rationale generation was tried.
- Overall, I find the experiments comprehensive but the improvement is not very surprising. The experiment also lacks an obvious baseline: using a combination of parallel corpus for MT and instruction-following dataset for preserving general ability (like the approach used in TOWER)
Thanks for your efforts to provide insightful comments. We address your concerns point by point below.
W1a: The novelty of our work.
We would like to clarify that RaDis is a Continual Instruction Tuning (CIT) method, and we mainly ground it in translation tasks in our paper. We believe that our paper makes several unique contributions, and its novelty is also recognized by the other three reviewers:
First, as demonstrated in General Response 1, RaDis introduces a novel paradigm for building LLMs with both translation proficiency and general abilities. Specifically, previous works (e.g., BayLing, TowerInstruct) have primarily adopted a multi-task training paradigm, which fine-tunes a base LLM on both translation and general instruction data together. In contrast, RaDis follows a continual learning (CL) paradigm, fine-tuning chat LLMs on translation data while leveraging continual learning techniques to mitigate the forgetting of general abilities. Our experiments show the superiority of this paradigm in preserving general abilities, while also demonstrating its competitive performance and future potential in translation tasks.
Second, RaDis is a novel CIT method with a strong motivation. To the best of our knowledge, we are the first to observe that LLMs tend to generate detailed rationales for translation tasks without any explicit chain-of-thought (COT) prompting. Furthermore, we are also the first to leverage this observation by proposing a CIT method that distills self-generated rationales.
Regarding the specific concern you raised, RaDis is a continual learning (CL) method and therefore is not related to prior RLHF works. The only similarity between the two is that both approaches use self-synthesized data. However, the use of the model itself to generate training data is a widely adopted approach in the ML/NLP field, especially in distillation-related methods. Given that the core contribution of RaDis lies in the innovative construction of distillation data, we do not believe this diminishes the novelty of our approach.
In terms of COT distillation methods in the reasoning field, we have included a dedicated section in the Related Work to contextualize our approach. Specifically, prior research in the COT reasoning domain typically involves prompting an external model to generate a reasoning path, which is then used to train the student model to enhance its reasoning capabilities. In contrast, RaDis leverages intrinsic, self-generated rationales as replay data in CL to preserve the model's original abilities. These distinctions make RaDis fundamentally different from prior studies and provide fresh insights into the application of CL for LLMs.
W1b&Q2&Q3: The quality of self-generated rationales. Using reward models for on-policy optimization.
Your suggestion to use a reward model to evaluate the rationale quality is truly insightful. As shown in Table 5, higher-quality rationales can indeed lead to better translation performance. However, we would like to argue that the goal of distilling self-generated rationales is to preserve the original output distribution. From that perspective, the optimal choice is to utilize exactly the same output sampled from the model itself, which shall minimize the distribution gap. As shown in Table 5, Llama-3-70B, a stronger and larger model than Mistral-v0.2-7B, can generate higher-quality rationales. However, fine-tuning with these high-quality rationales does not lead to better general abilities. This is mainly because these rationales are out-of-distribution and would disrupt the output distribution of the student LLM, which has been shown to cause forgetting [1].
It is possible that sampling rationales from the output distribution and choosing high-quality ones for training may not cause a huge disruption to the model's output distribution. However, such a reward model needs to be trained for each backbone LLM and requires substantial human annotation. Unfortunately, we do not have the resources to do this at the moment. We leave this investigation to our future work.
W2&Q1&Q4: Comparisons to Tower and multi-task training baseline.
Please refer to General Response 1, where we demonstrate that RaDis outperforms both Tower and multi-task training baseline in general abilities and performs competitively in translation performance.
[1] Yang, Zhaorui, et al. "Self-distillation bridges distribution gap in language model fine-tuning." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). (2024).
Training with self-generated rationales may reinforce the model's own hallucinations.
Regarding your specific concern about hallucinations, we conducted experiments using Factool [2], a task- and domain-agnostic framework designed to detect factual errors in texts generated by large language models. We evaluated the models' factuality on two benchmarks: knowledge-based QA and HumanEval.
Specifically, for knowledge-based QA, Factool employs GPT-4 to extract factual claims from the model’s responses and uses Google Search to verify the factual accuracy of these claims. This methodology ensures a robust and comprehensive assessment of the model’s factuality.
Table c1: results for model factuality.
| Model | KBQA | HumanEval |
|---|---|---|
| Mistral-v0.2-Instruct | 64.55 | 36.59 |
| RaDis | 66.08 | 35.97 |
As shown in Table c1, RaDis maintains claim-level factuality on knowledge-based QA and preserves the Pass@1 rate on HumanEval. We believe these results demonstrate that RaDis does not exacerbate hallucination issues, addressing your concern effectively.
[2] Chern, I., et al. "FacTool: Factuality Detection in Generative AI--A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios." arXiv preprint arXiv:2307.13528 (2023).
I want to thank the authors for providing additional experiments. I would like to keep my score for the following reasons:
CIT vs. multitask training:
I think you use the term continual learning because you re-use an instruction-tuned model (Mistral/Qwen2 chat). The comparison is still problematic because ideally you should compare:
- (a) base model (mistral/llama/qwen base) + MT finetuning
- (b) base model + multitask finetuning with MT and other instruction-tuning data
- (c) (stage 1: finetune base model on other instruction-tuning data) + (stage 2: RaDis finetuning)
When you make comparisons, you reuse previous work for (a) and (b) (i.e., ALMA, Tower, BayLing, etc.), but when constructing (c), you use a more capable instruction-tuned model. Therefore, your improvement on general instruction-following benchmarks is not a fair comparison with previous baselines, as it largely leverages the instruction tuning of a more advanced model.
More importantly, multitask finetuning should be more scalable and general (as translation is essentially just one task), therefore I am not very excited by progress in continual learning (this is my personal opinion that I do not factor into my rating)
RaDis is a continual learning (CL) method and therefore is not related to prior RLHF works
Your method might not be relevant to RLHF, but it is definitely one kind of post-training method for LLMs. As you mentioned, using the model itself to generate training data is a widely adopted approach in the ML/NLP field. And currently, RLHF-related studies achieve impressive post-training results. That is why I think simply prompting and utilizing all of a model's self-generated content is no longer a promising direction.
You mentioned "the optimal choice is to utilize exactly the same output sampled from the model itself, which shall minimize the distribution gap". The distribution mismatch is a common problem, and my point is that you should (1) use the model's self-generated rationales and sample multiple explanations for filtering, or (2) use another model to filter/reweight the generations. Directly generating explanations with another model will surely result in distribution mismatch (and this is already studied in on-policy and off-policy RL).
Conclusion
I still really appreciate the additional experiments conducted by the authors. I don't want to push for acceptance mainly because I don't see enough new insights compared to current post-training methods. The use of self-generated data to preserve a well-trained chat LLM's instruction-following ability is in no way surprising to me. Moreover, compared to current post-training methods (SFT with multitask data + aligning to human feedback + inference-time reasoning boost), the proposed method attempts to mix the SFT and alignment stages through RaDis by simply prompting the model to generate some explanations. I don't think this direction is promising because it is a simpler version of the current (SFT + alignment) post-training framework where the sampling process is unconstrained.
Thank you for your response. We would appreciate the opportunity to engage in a more detailed discussion with you.
Fairness regarding utilizing chat LLMs.
The traditional training approach is unable to leverage chat LLMs because they suffer from severe forgetting caused by continual pre-training (CPT) or vanilla fine-tuning. In contrast, RaDis allows for the retention of the general abilities of chat LLMs. We believe that utilizing accessible chat LLMs represents a valuable contribution, as it addresses the limitations of previous methods.
Utilizing an RM to filter rationales in RaDis.
We acknowledge that implementing an 'RLHF-based' version of RaDis could potentially improve results. However, the substantial resources required to train an RM were not available to us. Besides, introducing an RM will also raise questions on whether the general abilities are preserved by rationales or learned from the RM.
Additionally, the specific implementation is not the focus of this paper. Our key contribution is introducing rationale in CL, which brings new insights on relying on the reasoning ability of LLMs to connect learned and new knowledge and alleviate forgetting. The filtering technique is orthogonal to our contribution (the rationales are still self-generated whether they are filtered or not) and does not hurt novelty.
Rationale quality.
To further address your concern regarding rationale quality, we randomly sampled 200 sentence pairs and examined their corresponding rationales. Our analysis found that most of the rationales are highly related to the corresponding sentence pairs. This is because the model can refer to the reference translation when generating rationales, which is consistent with prior work [1] showing that post-rationalization helps reduce hallucinations. We have uploaded the rationale samples as supplementary materials for further reference.
Whether RaDis is just a simplified version of the current alignment approach.
We would like to emphasize the fundamental differences between RaDis and RLHF.
The objective of RaDis is to fine-tune the model for a specific task while preserving its original abilities. This involves training with task data. Self-generated data is used to mitigate forgetting.
In contrast, RLHF aims to align model behavior with human preferences, with little or no task-specific annotated data. The model learns from human feedback. Self-generated data is used as a bridge to convey human preferences.
While both approaches use self-generated data, it serves different purposes, in different ways, with distinct motivations. Therefore, we respectfully disagree with the reviewer's characterization of RaDis as a simplified version of the current alignment approach. As using synthetic data for training is currently one of the most common methods for fine-tuning LLMs, it would not be fair to claim that a paper lacks novelty because it utilizes synthetic data for training. While the training approach may seem similar on the surface, we believe that the reviewer may have overlooked the novel insights introduced by RaDis (utilizing rationales to alleviate forgetting), as well as the differences between our method and the existing post-training framework.
Why RaDis presents a non-trivial contribution
We believe the reviewer's claim that 'using self-generated data to preserve a chat-LLM's ability is not surprising' oversimplifies the problem and our approach.
First, RaDis involves SFT with task data. In contrast, in RLHF (on-policy), the model is fine-tuned exclusively with self-generated data, with a 10-100x smaller learning rate. Given these differences, the finding that forgetting can be easily mitigated with self-generated data in the RLHF phase does not necessarily generalize to the SFT phase.
Second, not all self-generated data is equally effective in alleviating forgetting. As demonstrated in our experiments, SDFT also uses self-generated data during training. However, its effectiveness in mitigating forgetting is clearly inferior to that of RaDis.
Finally, self-generated data may hinder new task learning. As demonstrated in our experiments, Seq-KD and SDFT—two baselines that also utilize self-generated data—failed to improve the model's translation performance, while RaDis balances both learning and consolidating.
The clarifications above demonstrate that achieving effective CIT, which requires both alleviating forgetting and learning new knowledge, with synthetic data is not a trivial problem. Therefore, we sincerely hope the reviewer could reconsider the novelty of our proposed approach. If the reviewer still has concerns, we would sincerely appreciate references to prior works that may address similar ideas.
[1] Chen, Xiao, et al. "Post-Semantic-Thinking: A Robust Strategy to Distill Reasoning Capacity from Large Language Models." arXiv preprint arXiv:2404.09170 (2024).
Dear Reviewer cw7y,
Thank you for the time and effort you've dedicated to reviewing our work. As the discussion phase comes to a close, we kindly ask that you review our follow-up responses (including the new general responses), which aim to address your latest questions.
While we recognize there may still be differences in opinion, we deeply appreciate the thoroughness of your feedback. We understand that providing multiple rounds of detailed commentary is both rare and invaluable. Our responses aim to foster a deeper discussion of differing viewpoints. Regardless of whether you choose to maintain or revise your score, we fully respect your decision.
Thank you again for your time and consideration.
Best regards,
Authors
This paper proposes a novel method, RaDis, which uses self-generated rationales for sequence-level distillation. It can preserve the model's capabilities in general domains and safety while fine-tuning on downstream tasks, addressing the problem of catastrophic forgetting. The paper conducts machine translation experiments on two 7B-sized models to illustrate the effectiveness of the method, and the analysis section provides detailed observations of the experimental results.
Strengths
- This paper proposes a novel method, RaDis, which has a clear and straightforward motivation, and the writing is easy to follow.
- The experimental results demonstrate that this method is effective and significantly surpasses the related baselines.
Weaknesses
- This method effectively maintains the generalization ability and safety of the models during downstream task fine-tuning. However, translation performance mostly degrades compared to the original fine-tuning baseline, which will limit its applicability in real-world scenarios. Additionally, the comparison of translation performance between the backbone and RaDis in lines 81-83 is unfair, and the comparison with the vanilla fine-tuned model is more reasonable and informative.
- Considering that the baseline distillation methods (SeqKD and SDFT) largely rely on the generation quality of the teacher model, it is necessary to conduct experiments with teachers of better capacity, especially for SeqKD, which is rarely used in a self-distillation manner.
- In the analysis section, the explanations of RaDis's performance with three different teachers need a more detailed analysis, similar to that in Table 4. In particular, it is necessary to clarify why the machine translation performance and the general performance yield contrasting conclusions (e.g., the self-generated setting achieves the best general performance and the worst translation performance among the three settings).
- Typo: line 507, “SDFT greatly enhances” -> “RaDis greatly enhances”
Questions
- In the context of machine translation, the emphasis on faithfulness over diversity suggests that paraphrasing references may not efficiently enhance translation performance, which aligns with the primary experimental results presented in Tables 1 and 2. Consequently, to facilitate a more equitable comparison between RaDis and SDFT, as well as to explore the potential of RaDis in other tasks, it is essential to evaluate their corresponding performance on logical reasoning tasks, including OpenFunctions, GSM8K, and HumanEval.
- See the weaknesses.
We appreciate your questions and value your feedback. Below, we provide detailed explanations to address your concerns.
W1a: Applicability in real-world scenarios.
First, we would like to clarify that RaDis is a continual learning (CL) method. A key challenge in CL is known as the stability-plasticity trade-off, where excessive learning plasticity or memory stability can compromise each other. As a result, achieving the same performance as vanilla fine-tuning while preserving all of the model's original abilities is inherently difficult [1,2].
Table m1: Experimental results with Qwen2.5-Instruct.
| Model | X-EN | EN-X | Alpaca Eval | Alpaca Eval 2 | AdvBench | GSM8K |
|---|---|---|---|---|---|---|
| Qwen2.5-Instruct | 80.90 | 80.50 | 88.46 | 31.55 | 99.81 | 87.72 |
| +Vanilla-FT | 82.21 | 82.95 | 4.87 | 2.95 | 27.12 | 26.61 |
| +RaDis | 82.13 | 82.81 | 85.62 | 27.91 | 99.04 | 88.78 |
In our original paper, the average translation performance with Mistral-v0.2 as the backbone LLM is 84.31/82.33 for Vanilla-FT and 84.39/81.94 for RaDis. The average gap between RaDis and Vanilla-FT is only 0.155 COMET22 across eight translation directions. As shown in Table m1, switching the backbone to Qwen2.5-Instruct narrows this gap further to 0.11 COMET22, which is a relatively small performance difference. On the other hand, RaDis consistently outperforms Vanilla-FT in preserving general abilities, retaining 82.31% more of these abilities on average. Given these results, we believe that RaDis is competitive in task-specific performance and superior in general task performance, making it a highly effective fine-tuning method applicable to real-world scenarios.
W1b: Comparison between RaDis and the backbone LLM.
Thank you for your suggestion. In our original paper, we compared RaDis with the backbone LLM primarily to demonstrate that it enhances translation performance while preserving general capabilities. Comparing RaDis with Vanilla-FT highlights that RaDis not only achieves comparable translation performance but also significantly mitigates the forgetting of general abilities. We believe that both comparisons convey the same core message: RaDis is an effective method for continual instruction tuning. Following your suggestion, we have modified the sentence to improve clarity.
W2: Utilizing a stronger teacher model.
We are happy to address the concern regarding the use of a stronger teacher model. As explained in Section 4.2, Seq-KD and SDFT are part of the Continual Instruction Tuning (CIT) baseline. In the continual learning setting, only the dataset for the current task and the model to be trained are available. As we focus on preserving the ability of the original LLM instead of learning from a stronger teacher, most CL approaches that involve knowledge distillation are limited to self-distillation [1]. Therefore, we believe our experimental setup is well justified.
[1] Wang, Liyuan, et al. "A comprehensive survey of continual learning: theory, method and application." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
[2] Wu, Tongtong, et al. "Continual learning for large language models: A survey." arXiv preprint arXiv:2402.01364 (2024).
W3: Clarification on the ablation results
Here we provide further clarification on the results presented in Section 5.2. There appear to be some typos in Lines 462-463: "These results suggest that models can learn stronger general (translation) capabilities through higher-quality rationales, which leads to better overall performance." Our finding is that translation performance can benefit from higher-quality rationales generated by stronger teacher models, which aligns with previous work on chain-of-thought distillation methods in reasoning fields [4].
Regarding the preservation of general abilities, we found that using the model itself as the teacher is the most effective approach. This is primarily because rationales generated by an external teacher model tend to be out-of-distribution relative to the student model. Fine-tuning with these out-of-distribution rationales can significantly corrupt the model’s output distribution, which has been shown to lead to forgetting [3].
Q1: Testing RaDis on broader tasks and comparing it to SDFT.
In our experiments, we observed that paraphrasing translation references generates completely irrelevant sentences when using Mistral-v0.2 as the backbone, which leads to significantly worse translation results.
In General Response 2, we conducted experiments on code generation tasks using the Magicoder dataset. Compared to translation, code generation is more deterministic, and our results show that RaDis still outperforms SDFT in this task. Please refer to General Response 2 for more detailed results and analysis.
Typos.
Thank you for pointing this out. We will correct these typos in a revised version of our paper.
[3] Yang, Zhaorui, et al. "Self-distillation bridges distribution gap in language model fine-tuning." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). (2024).
[4] Wang, Peifeng, et al. "SCOTT: Self-Consistent Chain-of-Thought Distillation." Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023.
Dear Reviewer M6wr,
We appreciate your insightful feedback. Following your suggestions, we have carried out further experiments and made revisions to the paper. As we are approaching the end of the discussion phase, we would like to know if our responses have resolved your concerns. We look forward to your response.
Best, Authors
Thank you for your response. Most of my concerns have been addressed and I have changed my score to reflect this.
Dear Reviewer M6wr,
Thank you for raising the score! Once again, we sincerely appreciate the valuable feedback you provided, which has been incredibly helpful in improving our work.
Best,
Authors
The paper introduces RaDis (Rationale Distillation) -- a self-distillation technique to help instruction-tuned language models learn new tasks without losing their general capabilities or compromising safety alignment. The method first generates rationales - explanations for the instruction-response pairs of the target task using the same language model and these rationales serve as replay data that help retain the original capabilities by fairly capturing the data distribution. The generated rationales and instruction-response pairs are then used for subsequent training of the same language model on the new task. The authors conduct experiments for the task of machine translation and demonstrate that it improves the translation proficiency of models while preserving its overall performance on other tasks.
Strengths
- The authors propose a straightforward method for LLMs to learn the translation task (in this study) without losing prior abilities by combining standard fine-tuning on reference translations with self-distillation on generated rationales.
- The experimental results demonstrate translation improvements across 4 language pairs on 2 different models.
- The authors study various aspects of the proposed method to understand the influence of these aspects on downstream and overall performance.
Weaknesses
- The paper only focuses on improving translation proficiency in LLMs without losing existing capabilities. However, I think the proposed method is more general and can work for other tasks. It should also be experimented on other tasks (e.g., summarization, open-ended QA, MQM annotation) for a more comprehensive study.
- In Section 5.2, the authors compare the Mistral 7B model (student) with rationales generated from different models. However, this comparison may not be fair, and it might be more appropriate to compare the rationale generation from a bigger model within the same family. For instance, the Llama 3 7B model (student) with rationales generated from Llama 3 70B. Could you clarify the reasons for not using models from the same families (e.g., Llama 3)?
Questions
The proposed method generates a rationale for each sample in the training set. Did you experiment with ablating whether you need a rationale for each sample to be included in the training set? How well does the proposed method perform when rationales aren’t available for every sample in the training set? What is the tradeoff between having rationales for all samples versus just a subset of k samples (e.g., 10%, 20%, 50% and 100% samples with rationales)?
Thanks for your recognition of our work and insightful questions. We believe they hold significant value for our work. We hope the following responses can help address your concern.
W1: Testing RaDis on broader tasks.
Please refer to General Response 2, where we conducted experiments on code generation tasks using the Magicoder dataset. The results demonstrate the superiority of RaDis and highlight its potential.
Given that our current paper primarily focuses on translation tasks, and considering the limited time during the discussion phase, testing RaDis on a broader range of tasks may not be feasible at this stage. However, we greatly appreciate your insightful suggestion and plan to further develop RaDis as a robust, general-purpose continual instruction tuning method in our future work.
W2: Experiment setting in Section 5.2.
We are happy to clarify the rationale behind the specific setting in Section 5.2. The primary goal of this section is to conduct an ablation study to analyze which aspects of RaDis contribute most to its success. As noted in the paper, we focus on two key factors: rationale quality and the self-distillation property. However, since these two factors are generally intertwined in the generated rationales, we needed to design specific experiments to separate them. To do so, we incorporated a stronger teacher from a different model family.
Table j1: Rationale for selecting different teachers.
| Teacher Model | Rationale Quality | Self-distillation |
|---|---|---|
| Self | Moderate | Yes |
| LLaMA-2-Chat-7B | Moderate | No |
| LLaMA-3-70B-Instruct | High | No |
As shown in Table j1, with Mistral-v0.2 as the backbone model, we can categorize the teacher models along two axes: rationale quality and whether they ensure the self-distillation property. It is straightforward to observe that a larger and more capable teacher model tends to generate higher-quality rationales. However, ensuring that the teacher model does not satisfy the self-distillation property requires more careful design. Since models from the same family often share a significant portion of their training data, we opted to use models from completely different model families. This ensures that the generated rationales are out-of-distribution relative to the student model. By doing so, we can effectively separate the rationale quality and self-distillation property and assess their contributions, respectively.
Q1: Ablation experiments on the percentage of examples with rationales
Table j2: Ablation experiments on the percentage of examples with rationale.
| Setting | X-EN | EN-X | Alpaca Eval | Alpaca Eval 2 | AdvBench | GSM8K |
|---|---|---|---|---|---|---|
| Mistral-v0.2-Instruct | 80.84 | 67.79 | 84.91 | 15.09 | 68.46 | 41.62 |
| 0% | 82.33 | 84.31 | 6.07 | 1.02 | 4.23 | 0.23 |
| 25% | 82.25 | 84.21 | 79.22 | 10.84 | 62.69 | 34.5 |
| 50% | 82.20 | 84.51 | 79.21 | 9.62 | 63.65 | 35.1 |
| 75% | 82.03 | 84.35 | 79.3 | 10.43 | 63.54 | 35.41 |
| 100% | 81.94 | 84.39 | 80.34 | 11.05 | 62.12 | 41.7 |
Your suggestion to include an ablation study on rationales has been constructive. We conducted experiments with varying percentages of examples containing rationales (25%, 50%, 75%) and would like to share the results.
First, the results show that RaDis achieves the best overall performance on general tasks when rationales are provided for the entire training dataset. However, adding rationales to just 25% of the training data significantly mitigates forgetting on general tasks. Interestingly, the effectiveness of rationales appears to vary across task types: instruction-following and safety tasks are the easiest to preserve, while math reasoning tasks are the most challenging.
Moreover, scaling the percentage of data with rationales from 25% to 75% does not result in a linear improvement. Notably, the model's performance changes dramatically in two key intervals: from 0% to 25% and from 75% to 100%. These findings highlight that the inclusion of rationales has a non-linear impact on performance, with significant gains observed at the extremes.
We believe this phenomenon could be explained as follows:
- Superficial Alignment Hypothesis: As defined in LIMA [1], the Superficial Alignment Hypothesis suggests that most of a language model's abilities and knowledge are acquired during pre-training, while post-training primarily focuses on refining the model's style and format. Building on this hypothesis, we interpret our ablation results to indicate that the forgetting of superficial alignment can also be mitigated superficially. For instance, incorporating a small number of rationales (25%) is sufficient to recover the model's performance on instruction-following and safety tasks. However, this level of rationale integration is inadequate for more complex skills, such as math reasoning, which does not obey the Superficial Alignment Hypothesis.
- Distribution Gap of Training Data: Traditionally, when training with mixed data from two tasks, the results are expected to be a linear combination of training each task independently. However, due to the vast number of parameters in LLMs, these models can effectively memorize different tasks in separate parameter spaces based on their input patterns. This phenomenon has also been observed in the Llama3 technical report, where researchers from Meta utilized distinct system prompts to toggle the model between different modes (e.g., text or speech).
As a result, the mixup ratio of training data may not be a critical parameter. Instead, the primary factors are the tasks included in the training set and whether they are in-domain or out-of-domain. When 0% of the data contains rationales, the model is trained solely on translation tasks (out-of-domain), leading to severe forgetting of general abilities. By incorporating a mix of translation and rationale-augmented translation (which can be considered a form of instruction-following translation task), the model learns both tasks simultaneously. However, the existence of an out-of-distribution task still leads to forgetting. When the entire training set includes rationales, the model is trained predominantly on in-domain tasks, significantly alleviating the forgetting of general abilities.
We hope the experimental results and analysis presented above adequately address your question. If you have any further concerns, we would be delighted to engage in a more in-depth discussion with you.
[1] Zhou, Chunting, et al. "Lima: Less is more for alignment." Advances in Neural Information Processing Systems 36 (2024).
I would like to thank the authors for their detailed response to my concerns. The response indeed brought interesting findings about the trade-off between the proportion of rationales necessary to achieve reasonable performance. Most of my concerns have been addressed effectively. However, I believe it would be valuable to broaden the scope of the study by including experiments across additional task setups. While the authors report experimental results for code generation, testing the proposed method on a wider range of tasks would be extremely valuable and comprehensive. I would like to maintain my original rating.
This is not a criticism, but I agree with Reviewer pUWZ that the authors should not strictly view continued fine-tuning and multi-task learning as separate paradigms, as there are some similarities. The assumption of access to a stream of single-task data at a time may not be entirely practical or realistic. The authors should also think about how to extend the proposed approach to a more general setup similar to multi-task learning, where there are no task-specific distinctions among examples and the order of examples from different tasks is treated as arbitrary and non-sequential.
Thank you for your response! We're glad to hear that many of your issues have been resolved. We are conducting experiments on additional tasks and will provide the results as soon as possible. Besides, we've provided a new general response, explaining why we view CL and MTL as two different learning paradigms and why investigating CL for LLMs is crucial and irreplaceable in the application of LLMs.
Dear Reviewer JdnZ,
Thank you again for the time and effort you’ve dedicated to reviewing our work. As the discussion phase is coming to a close, we kindly ask if you could review our follow-up responses. We’re grateful that you’ve mentioned most of your initial concerns have been addressed by prior rebuttal, and we have also provided replies to your follow-up requests for testing our approaches with broader tasks.
If there are no further concerns, we would be grateful if you could reconsider the score.
Thank you for your time.
Best regards,
Authors
Thank you for conducting experiments on new tasks and addressing all of my concerns. I have updated the score.
Thank you for your quick response and positive feedback. :)
We’re glad to hear that our response addressed your concerns. However, we’ve noticed that the rating has not been updated. We would truly appreciate your support in this matter and wanted to kindly inquire if there are any remaining issues or feedback we can address to further clarify or improve the manuscript. Your insights are invaluable to us, and we’re more than happy to provide any additional information or revisions as needed.
Thank you again, and we look forward to your response. Wishing you a pleasant day!
I have updated the soundness rating but prefer to maintain the overall rating.
Dear Reviewer JdnZ,
Thank you for updating the soundness rating and for your continued engagement. We fully respect your decision to maintain the overall rating. We appreciate the time and effort you've dedicated to reviewing our manuscript and are grateful for your thoughtful feedback.
Best regards,
Authors.
Thank you once again for your feedback! We have conducted additional experiments on TriviaQA (open-ended QA) and OpenFunctions (function call). Below are the results from these experiments:
| Model | OpenFunctions | Alpaca Eval | Alpaca Eval 2 | AdvBench | GSM8K |
|---|---|---|---|---|---|
| Llama-2 | 16.07 | 71.4 | 9.66 | 100 | 24.34 |
| +Vanilla-FT | 29.46 | 36.52 | 3.41 | 100 | 20.69 |
| +SDFT | 32.14 | 68.51 | 7.82 | 100 | 21.00 |
| +RaDis | 33.04 | 69.52 | 8.79 | 100 | 24.56 |
| Model | TriviaQA | Alpaca Eval | Alpaca Eval 2 | AdvBench | GSM8K |
|---|---|---|---|---|---|
| Mistral-v0.2 | 53.4 | 84.91 | 15.09 | 68.46 | 41.62 |
| +Vanilla-FT | 57.9 | 29.82 | 3.09 | 30.38 | 9.86 |
| +SDFT | 53.83 | 79.47 | 10.4 | 36.35 | 40.64 |
| +RaDis | 58.71 | 83.95 | 14.22 | 64.81 | 39.27 |
As demonstrated in the tables above, RaDis effectively alleviates forgetting and outperforms SDFT on both tasks. Additionally, it achieves superior performance on the fine-tuned task compared to Vanilla-FT. These results support our claim that rationales play a crucial role in bridging learned and new knowledge, thereby enhancing both task learning and knowledge retention.
The paper proposes rationale distillation, with the purpose of improving the translation performance of LLM-based models while keeping their general instruction-following abilities. The method first generates an explanation of the translation and uses this synthetic data for fine-tuning along with the translation pairs.
Strengths
The paper successfully does what it proposes: improving translation performance with the proposed method while significantly reducing the damage to general instruction-following tasks.
Weaknesses
- The paper's goal is good: it aims to generalize translation models based on LLMs to other domains such as conversational tasks. However, the attempt to balance both translation quality and generalization does not yield a model that is truly useful, placing the paper in an awkward position. For example, RaDis performs substantially worse than its backbone model on conversations and instruction following, and its translation performance is inferior to that of translation-specific models. While I acknowledge that comparing RaDis with ALMA in terms of translation is not entirely fair due to differences in models and training data, this raises another issue: the model's performance is highly dependent on the backbone model. Consequently, RaDis cannot be effectively applied to low-resource languages like Icelandic, which were excluded from the training data. Achieving effective machine translation in LLMs requires substantial effort in multilingual pretraining and alignment; without this, English-centric LLM translation performance is less interesting. Despite the paper's claims of keeping both high translation performance and generality, the translation quality is not top-tier due to the heavy dependency on the backbone model, and the performance on other tasks is also worse.
- Inference speed: If the model is trained to output both a reference and an explanation, does this mean the generated output always includes both, requiring the user to extract the translation from the combined output? If so, this could significantly reduce inference speed. It would be beneficial to see a comparison of inference speeds and an analysis of how many additional tokens the model generates compared to translation-specific LLMs.
- Missing baselines: A fundamental baseline would involve training on a combined dataset of translation parallel data and instruction-following data. Using synthesized data may be unnecessary given the abundance of existing data that can be combined with parallel data. However, I did not find results corresponding to this baseline in the paper. Including such comparisons would strengthen the evaluation and provide a clearer understanding of the model's performance relative to more straightforward approaches.
Questions
N/A
Thanks for your insightful questions; they hold significant value for our work. We appreciate your recognition of the meaning of our research topic: building LLMs that excel in both translation proficiency and general ability. We hope the following discussion will address your concerns and show the advantages and potential of our approach.
W1a: The performance seems to be worse on both translation and general ability.
We would like to argue that this comparison is not entirely fair, as it compares RaDis with two models in their respective advantage zones. As we have shown in the paper and the general response, RaDis significantly outperforms LLM-based MT models like TowerInstruct in general abilities with a 100x smaller training budget, and it also outperforms its original backbone in translation. Taken together, RaDis is the best model among those tested, successfully achieving both high translation proficiency and robust general abilities.
Given the focus of our paper on building LLMs that excel in both translation proficiency and general ability, we believe it is more comprehensive to consider both aspects in the evaluation. As we have shown, RaDis is indeed the best model from this perspective, achieving strong performance in both areas.
W1b: The dependency on backbone LLM limits the multilingual translation performance.
Regarding your concern that the translation performance of RaDis is limited by the choice of backbone LLM, we believe this limitation can be effectively addressed. As demonstrated in General Response 1, switching the backbone from Mistral to Qwen2.5 leads to substantial improvements in translation performance, effectively narrowing the gap caused by multilingual training. Moreover, with the emergence of open-source multilingual LLMs (e.g., Aya, Llama-3.1, Qwen2.5), the impact of multilingual training may diminish over time. As stronger backbone LLMs continue to develop, RaDis's translation proficiency can be further enhanced, while its superiority in general tasks will remain intact.
Overall, from the perspective of building an "almighty" LLM, RaDis introduces a novel continual learning paradigm that is not only competitive but, in many respects, superior to the traditional multi-task training approach. It demonstrates significant advantages in preserving general abilities, while also showing strong potential in translation tasks.
W2: Inference speed issue
We are happy to address your concern regarding the decoding overhead introduced by rationales.
First, we would like to suggest that generating translations with rationales can be beneficial to end users. A key strength of LLMs is their ability to interact smoothly with users, and these 'verbose' outputs can actually enhance this interactivity. In fields such as recommender systems [1] and reasoning [2], LLM-generated explanations are proven helpful and favored by users. Similarly, we believe that generating rationales alongside translations can help users better understand the translated sentence, fostering a more explainable and transparent generation process.
Table p1: Statistics of decoding speed on WMT22 Cs-En test set. The backbone LLM is Mistral-v0.2.
| Method | #Samples | Decoding Time (s) | Decoding Speed (samples/s) | COMET |
|---|---|---|---|---|
| Vanilla-FT | 1448 | 55 | 26.33 | 83.51 |
| RaDis | 1448 | 123 | 11.77 | 82.41 |
| +translation-only | 1448 | 40 | 36.20 | 82.41 |
| +sys-prompt | 1448 | 72 | 20.11 | 82.80 |
Second, the training process of RaDis offers several ways to control the generation of rationales, which can be particularly useful when high decoding speed is required:
- Stop generation at a delimiter token: The construction of RaDis's training data involves concatenating the reference translation and self-generated rationales, separated by a delimiter token (e.g., \n). By stopping the generation at this token, the model can produce translations without rationales, thus achieving faster decoding without compromising translation performance. As shown in Table p1, when only generating the translation, RaDis's decoding speed is even faster than Vanilla-FT's. (A minimal sketch of this data construction is given after this list.)
- Controlling rationale length with system prompts: Leveraging the strong instruction-following ability preserved from the backbone LLM, RaDis can use system prompts to regulate the generation of rationales. For instance, as shown in Table p1, adding a system prompt like "You are a professional translator, output the translation without explanation." effectively suppresses the model's tendency to over-generate, leading to faster decoding without sacrificing translation quality.
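To make the data construction concrete, below is a minimal sketch assuming a Hugging Face chat backbone (Mistral-v0.2-Instruct), a \n delimiter, and an illustrative translation prompt; the exact prompt wording and the way the rationale is elicited in our experiments may differ.

```python
# Minimal sketch of building one RaDis training example (assumptions noted above):
# the training response is the reference translation, a delimiter, and the backbone
# model's own continuation, which serves as the self-generated rationale.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # backbone chat LLM (assumed)
DELIM = "\n"  # delimiter between the reference translation and the rationale

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")


def build_radis_example(source: str, reference: str, src_lang: str, tgt_lang: str) -> dict:
    """Return an instruction/response pair whose response is reference + DELIM + rationale."""
    instruction = f"Translate the following {src_lang} sentence into {tgt_lang}:\n{source}"
    chat_prefix = tokenizer.apply_chat_template(
        [{"role": "user", "content": instruction}],
        tokenize=False,
        add_generation_prompt=True,
    )
    # Seed the assistant turn with the reference and the delimiter, then let the
    # backbone continue; the continuation is the self-generated rationale.
    prompt = chat_prefix + reference + DELIM
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    rationale = tokenizer.decode(
        output_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
    return {"instruction": instruction, "response": reference + DELIM + rationale}
```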
W3: Comparisons to Tower and multi-task training baseline.
Please refer to General Response 1, where we demonstrate that RaDis outperforms both Tower and multi-task training baseline in general abilities and performs competitively in translation performance.
[1] Lubos, Sebastian, et al. "LLM-generated Explanations for Recommender Systems." Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization. 2024.
[2] Krause, Stefanie, and Frieder Stolzenburg. "From Data to Commonsense Reasoning: The Use of Large Language Models for Explainable AI." arXiv preprint arXiv:2407.03778 (2024).
I thank the authors for their response.
I completely understand that the comparison between RaDis and other models like ALMA is not entirely fair. However, the authors themselves make such comparisons in the introduction — the motivation of this paper is to build a model that excels in both translation and reasoning. Even if the comparison is not perfectly fair, it is necessary because this is the claim the authors are making. Imagine the community needs to choose a translation model. Would people prefer ALMA, or RaDis which performs worse in translation but offers better reasoning capabilities?
Referring to the authors' claim that "RaDis is indeed the best model from this perspective, achieving strong performance in both areas," I reviewed the results again. In Table 1, RaDis achieves a COMET score of 7.29 based on LLaMA-2, which is nice. However, in Table 3, it performs worse than LLaMA-2 in AlpacaEval and AlpacaEval 2.0. In Tables g1-2, while RaDis shows strong reasoning results with other backbone models, its translation performance is inferior to Tower (I understand it is not entirely fair). This is what my initial comment said that RaDis occupies an "awkward position."
Secondly, the General Response 1 does not answer my question about low-resource language performance across other backbone models. What languages are included in X->E and E->X in Table g1-1 and Table g1-2? Did the authors experiment with low-resource languages like Uzbek, or relatively less challenging low-resource languages like Icelandic?
Regarding inference speed, I find it intriguing. How did the authors implement "stopping generation at a delimiter token"? I assume the authors are using Hugging Face's model generation functionality. How did you modify the model.generate() function to stop at specific points? Is this feature already supported by HF? Just curious :) I thought it wouldn't stop until a special symbol like the end-of-sentence token.
Finally, I believe the authors should avoid drawing a strict distinction between continued fine-tuning and multi-task learning in this paper. While they are two methods, they are not entirely separate paradigms and are closely related. The authors restrict themselves to a highly specific scenario where only the target data is accessible, but this is not the practical case. In the real world, you always have the translation data and reasoning data.
Thank you for your response. I sincerely appreciate the opportunity to engage in a more in-depth discussion with you.
The 'Awkward Position'
As we stated in Lines 042–043, our goal is to develop models that seamlessly integrate strong translation performance with broader general-purpose utility. And we successfully achieved it by enhancing the translation performance of instruction-tuned LLMs while preserving most of their general abilities. It is important to clarify that "strong" does not imply "state-of-the-art" (SOTA), and at no point have we claimed achieving SOTA performance as either a goal or a contribution of our work. (Frankly, we are surprised that we need to debate about what we wrote in the introduction and the interpretation of this specific word.)
In response to your initial review regarding the gap in translation performance with SOTA LLM-based MT models such as ALMA, we conducted an additional experiment, as presented in General Response 1. The results demonstrate that RaDis, using Qwen-2.5 as its backbone, achieves strong translation performance that surpasses ALMA on the WMT23 test set. Simultaneously, RaDis significantly outperforms ALMA and TowerInstruct in terms of general-purpose capabilities. A 27.91 win rate on the Alpaca Eval 2 leaderboard [1] means our model is better than Gemini Pro and Claude 2.1, which are strong chat LLMs. Based on these findings, we believe our claims are well-supported: a model with translation performance surpassing ALMA and general ability outperforming Gemini Pro is not awkward; in fact, it is quite impressive.
Considering the stated goal of our paper, we believe a more appropriate question to evaluate our work would be: If an end user seeks a versatile model that is good at translation, provides strong chat capabilities, and can assist with a wide range of other potential tasks, would they choose ALMA, Llama-2-Chat, or RaDis? We feel this question better aligns with the intended scope and contributions of our work. In practice, such a model could indeed achieve better interactivity. With strong instruction-following capabilities, RaDis can adapt translations to a specified style, incorporate specific terms as requested, or even provide explanations for translated sentences. These features are beyond the capabilities of traditional LLM-based MT models.
While it is always easy to request SOTA performance on individual tasks, building a model that achieves SOTA across every task simultaneously is highly non-trivial. Although RaDis does not achieve SOTA performance in every task, it successfully enhances the translation performance of instruction-tuned LLMs, creating a model that is "strong" across various tasks—a balance that aligns with the overarching goal of our work and meets the need for general-purpose LLM.
P.S. After reading the response from the reviewer, we still find the phrase "awkward position" to be unclear. In this response, we have tried to address this concern by elaborating on both the performance and practical advantages of our approach. We kindly request that the reviewer provide a more specific criterion or clarify what defines an "awkward" versus "not awkward" position. This would greatly help us better understand and address your concerns.
Translation performance on low resource languages.
First, we would like to clarify that the experiment in Table g1-1 was conducted under exactly the same settings described in our paper, using Czech, German, Russian, and Chinese. For Table g1-2, we followed the experimental setup outlined in the TowerInstruct paper and directly referenced the results reported there. The evaluation in Table g1-2 was performed on German, Russian, and Chinese, as TowerInstruct was trained on the following 10 languages: English (en), German (de), French (fr), Dutch (nl), Spanish (es), Italian (it), Portuguese (pt), Russian (ru), Chinese (zh), Korean (ko).
Given that RaDis is a post-training method, its performance is inherently influenced by the language proficiency of the backbone LLM. As a result, achieving improved translation performance on low-resource languages not included in the backbone LLM's training set is particularly challenging.
However, we want to emphasize that the goal of our experiment with Qwen-2.5 was to demonstrate that, when equipped with a stronger multilingual backbone, the performance gap caused by additional multilingual pre-training can indeed be reduced. This is an important finding because multilingual backbone LLMs are continually evolving and becoming more capable. As such, we believe our method holds strong potential for future advancements.
Inference speed.
We conducted inference experiments using vLLM [2]. This framework includes a 'stop/stop_token_ids' parameter, which allows the generation process to halt when specific strings or tokens are produced.
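For reference, below is a minimal vLLM sketch of the two decoding modes; the checkpoint path and prompt are placeholders, and applying the model's chat template is omitted for brevity.

```python
# Minimal vLLM sketch (placeholder checkpoint path and prompt): stop at the
# delimiter for a translation-only output, or decode freely to also get the rationale.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/radis-checkpoint")  # hypothetical fine-tuned checkpoint

prompt = "Translate the following Czech sentence into English:\n<Czech source sentence>"

# (1) Translation only: generation halts as soon as the "\n" delimiter is produced,
#     so no rationale tokens are decoded at all.
translation_only = SamplingParams(temperature=0.0, max_tokens=256, stop=["\n"])

# (2) Translation + rationale: no extra stop string; decode until end-of-sequence.
with_rationale = SamplingParams(temperature=0.0, max_tokens=512)

for params in (translation_only, with_rationale):
    output = llm.generate([prompt], params)[0]
    print(output.outputs[0].text.strip())
```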
Learning paradigms and the accessibility of training data.
First, it is generally acknowledged that continual learning is recognized as a distinct learning paradigm [3,4]. We used this term to align with established terminology in prior works within this field.
Second, while the reviewer claims that task-specific data is always accessible, this is not the case. As we stated in Lines 047-049, the training data for open-sourced chat LLMs is frequently unavailable, making multi-task training with the original data infeasible. This limitation was a primary motivation for us to explore the continual learning paradigm.
Besides, our experiments with open-sourced instruction data clearly demonstrate that relying solely on these datasets is insufficient to effectively mitigate the forgetting of general capabilities. In contrast, our approach using continual learning largely preserves the general capabilities of chat LLMs and outperforms the multi-task training baseline requested, which highlights a significant advantage of our method.
Finally, we believe it is important to note that, although our evaluation focused on instruction following, safety, and math reasoning, general ability encompasses a wide range of tasks (e.g., code generation, semantic classification, summarization, etc.). While it is possible to incorporate additional task-specific datasets into training to improve performance on individual tasks, this approach comes with significant costs, including the effort required to collect such datasets and the computational expense of fine-tuning to achieve balanced performance across tasks.
In contrast, our method avoids these unnecessary training costs by leveraging continual learning to preserve the knowledge embedded in backbone LLMs. This aligns with one of the key motivations behind the continual learning paradigm: to retain knowledge in pre-trained models while integrating new information.
From a broader perspective, RaDis has the potential to serve as a general-purpose continual instruction tuning method, a point that was notably recognized by Reviewers JdnZ and M6wr.
We hope this clarification highlights the strengths and potential of our approach and may encourage you to reconsider our contribution in a more positive light. Thank you again for your valuable insights and constructive feedback—they have been instrumental in helping us better articulate our work.
[3] Liu, Bing. "Lifelong machine learning: a paradigm for continuous learning." Frontiers of Computer Science 11.3 (2017): 359-361.
[4] Wang, Liyuan, et al. "A comprehensive survey of continual learning: theory, method and application." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
Thank the authors for their quick response :)
The term "awkward position" here literally means an "embarrassing position." It's like the authors aim to improve both A and B simultaneously, but in doing so, the method ends up being weaker than models specifically optimized for either A or B. After reviewing the rebuttal, It makes more sense to me that the authors are not trying to create a Sota but rather to find a good trade-off, striving for good performance on both sides without making one aspect significantly worse.
The paper still has some issues. As the authors noted, the method heavily depends on the backbone model, which could result in inconsistent and unstable improvements when scaling model size. While the goal of the paper is good, the engineering-focused approach does not deliver particularly impressive results and lacks sufficient novelty to meet the criteria for a higher score (I hate to say it, but it is true in this case).
Finally, I acknowledge the findings presented in the paper and the effort reflected in the authors' response. Based on this, I am raising my score to 5.
Thank you for getting back to me so quickly. We are glad to see that you have raised the score. We would like to take this opportunity to provide additional information to address your remaining concerns.
Dependency on backbone LLM.
There still seems to be some misunderstanding. As we stated in our previous response, RaDis is a post-training method, and it is generally acknowledged that the language proficiency of the pre-trained LLM influences post-training methods. In this context, the dependency on the backbone LLM primarily concerns its language proficiency.
Robustness against scaling model size.
Regarding the new concern raised by the reviewer—robustness against scaling model size—we refer to Table g1-2 in General Response 1, where we have demonstrated that switching to a stronger backbone (Qwen2.5) leads to improvements across all tested tasks. Additionally, we have implemented RaDis on three different LLMs—Llama-2-Chat, Mistral-v0.2-Instruct, and Qwen2.5-Instruct—and across two different tasks (machine translation and code generation). The performance of RaDis remains highly consistent in these settings. These results clearly indicate that RaDis is a robust approach that generalizes well to different backbone models and tasks. Based on this evidence, we expect similar robustness against scaling model size.
We are happy to provide additional experimental results on scaling model size if the reviewer insists.
Whether the performance is impressive.
As we have reached an understanding that achieving SOTA performance is not a requirement for this work, we would like to highlight the impressive performance of our method from the perspective of LLM-based MT models, which appears to be a major focus of the reviewer.
As shown in Table g1-2, RaDis consistently outperforms TowerInstruct—a SOTA LLM-based MT model with general instruction-following capabilities—on general tasks, regardless of the backbone utilized. With the Qwen2.5 backbone, RaDis scores 34 points higher on AlpacaEval win rate, 69 points higher on AdvBench safety, and 81 points higher on GSM8K accuracy. While achieving this superior performance on general tasks, RaDis also delivers strong translation performance, outperforming ALMA. Additionally, all these results are achieved with only 1% of the training cost of TowerInstruct.
Taken together, RaDis provides an impressive LLM-based MT model that delivers SOTA performance on general instructions (when focusing on LLM-based MT models specifically) while maintaining strong translation capabilities, all while utilizing over 100x less training compute.
The novelty and contribution of our work.
We respectfully disagree with the reviewer's claim that our approach is engineering-focused and would like to emphasize that our paper made several unique contributions:
First, as demonstrated in General Response 1, RaDis introduces a novel approach for building LLMs that combine translation proficiency with general abilities. Previous works (e.g., BayLing, TowerInstruct) primarily rely on a multi-task training paradigm, fine-tuning a base LLM on both translation and general instruction data simultaneously. In contrast, RaDis adopts a continual learning (CL) paradigm, fine-tuning chat LLMs on translation data while employing continual learning techniques to mitigate the forgetting of general abilities. Our experiments highlight the superiority of this approach in terms of broad general abilities and training efficiency, while also showcasing its competitive performance and promising potential in translation tasks.
Second, RaDis introduces a novel Continual Instruction Tuning (CIT) method with a strong motivation. To the best of our knowledge, we are the first to observe that LLMs tend to generate detailed post-rationales even without explicit chain-of-thought (CoT) prompting, and the first to leverage this observation by proposing a CIT method that distills self-generated rationales. Our experiments demonstrate that RaDis is a robust CIT approach, capable of generalizing effectively across different backbone models and tasks. We believe introducing rationales to CL offers a new insight: leveraging the reasoning ability of LLMs to connect learned and new knowledge and thereby alleviate forgetting.
We hope these points address your concerns regarding the novelty of our work. Nonetheless, we will further emphasize our contributions in the revised version of the paper. If there are specific concerns regarding our contributions—such as previous works that may have addressed similar ideas—we kindly ask the reviewer to provide those references so that we can better understand and address them.
Dear Reviewer pUWZ,
Thank you once again for the time and effort you've dedicated to reviewing our work. As the discussion phase comes to a close, we kindly ask that you review our follow-up responses (including the new general responses), which may address the questions raised.
While we acknowledge any differences in opinion, we deeply appreciate the thoughtful feedback you’ve provided. We understand that multiple rounds of detailed review are rare and invaluable, and our goal with these responses is to foster a more thorough discussion of differing viewpoints. Regardless of whether you choose to maintain or revise your score, we fully respect your decision.
Best regards,
Authors
We appreciate Reviewers JdnZ and M6wr for recognizing the potential of RaDis, and we would be happy to explore its application across a broader range of tasks. Due to time constraints, we have currently only completed experiments on the code generation task. Specifically, we fine-tuned Mistral-v0.2 on Python code data from the Magicoder dataset and evaluated its performance using HumanEval and general ability benchmarks.
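For reference, Pass@1 on HumanEval follows the standard unbiased pass@k estimator of Chen et al. (2021); the snippet below is a generic implementation of that estimator rather than our evaluation code, and with a single greedy sample per problem it reduces to the fraction of problems whose generated solution passes all unit tests.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n: samples generated per problem; c: samples passing all unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Pass@1 over a benchmark is the mean of pass_at_k(n, c, 1) across problems.
print(pass_at_k(1, 1, 1))   # 1.0 (single greedy sample that passes)
print(pass_at_k(10, 3, 1))  # 0.3
```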
Table g2: Experiments on code generation. Pass@1 is reported on HumanEval.
| | HumanEval | Alpaca Eval | Alpaca Eval 2 | AdvBench | GSM8K |
|---|---|---|---|---|---|
| Mistral-v0.2 | 36.59 | 84.91 | 15.09 | 68.46 | 41.62 |
| +Vanilla-FT | 42.07 | 73.89 | 8.75 | 40 | 43.97 |
| +SDFT | 41.24 | 78.58 | 10.46 | 48.08 | 40.86 |
| +RaDis | 43.9 | 80.25 | 11.4 | 51.92 | 42.91 |
As shown in Table g2, RaDis outperforms Vanilla-FT and SDFT in code generation tasks, achieving higher Pass@1 on HumanEval and excelling in other benchmarks for general abilities.
A key reason for this is that RaDis directly preserves the original references in the dataset, whereas SDFT paraphrases them. Intuitively, while paraphrasing helps bridge the distribution gap, it also reduces the amount of learnable knowledge. As a result, SDFT may struggle to outperform Vanilla-FT on certain tasks. In contrast, RaDis directly utilizes the original references, preserving all the knowledge embedded in the data.
Regarding performance on general tasks, RaDis still outperforms SDFT. We attribute this to the distribution gap: although SDFT is described as distilling the dataset, it in fact paraphrases the data, so the training responses are sampled from the paraphrased instruction's output distribution, which tends to be out-of-distribution relative to the original task instruction. In contrast, RaDis performs self-distillation using rationales, which are fully in-distribution. This enables RaDis to more effectively alleviate forgetting and better preserve general abilities.
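To make this contrast concrete, here is a minimal sketch of the two data-construction strategies as we understand them; the prompt wording, delimiter, and helper names are illustrative assumptions rather than our actual implementation:

```python
from typing import Callable, Dict

DELIM = "\n\n"  # assumed delimiter between the reference and the rationale

def build_radis_example(instruction: str, src: str, ref: str,
                        generate_fn: Callable[[str], str]) -> Dict[str, str]:
    # Ask the chat LLM to explain the given reference (self-generated rationale).
    rationale = generate_fn(
        f"{instruction}\n{src}\n\nReference translation: {ref}\n"
        "Briefly explain why this translation is appropriate."  # assumed prompt wording
    )
    # RaDis keeps the original reference intact and appends the in-distribution rationale.
    return {"prompt": f"{instruction}\n{src}", "target": ref + DELIM + rationale}

def build_sdft_example(instruction: str, src: str, ref: str,
                       generate_fn: Callable[[str], str]) -> Dict[str, str]:
    # SDFT-style target: the reference itself is rewritten by the model, so the
    # original wording is no longer preserved in the training target.
    paraphrased = generate_fn(f"Rewrite the following translation in your own words: {ref}")
    return {"prompt": f"{instruction}\n{src}", "target": paraphrased}
```

In this reading, the original reference survives verbatim in the RaDis target while the appended rationale keeps the target close to the model's own output distribution.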
These results suggest that RaDis generalizes well to a broader range of tasks, highlighting its potential as a robust, general-purpose continual instruction tuning method. We plan to investigate this potential in future works.
We have conducted additional experiments on TriviaQA (open-ended QA) and OpenFunctions (function call). Below are the results from these experiments:
| | OpenFunctions | Alpaca Eval | Alpaca Eval 2 | AdvBench | GSM8K |
|---|---|---|---|---|---|
| Llama-2 | 16.07 | 71.4 | 9.66 | 100 | 24.34 |
| +Vanilla-FT | 29.46 | 36.52 | 3.41 | 100 | 20.69 |
| +SDFT | 32.14 | 68.51 | 7.82 | 100 | 21.00 |
| +RaDis | 33.04 | 69.52 | 8.79 | 100 | 24.56 |
| | TriviaQA | Alpaca Eval | Alpaca Eval 2 | AdvBench | GSM8K |
|---|---|---|---|---|---|
| Mistral-v0.2 | 53.4 | 84.91 | 15.09 | 68.46 | 41.62 |
| +Vanilla-FT | 57.9 | 29.82 | 3.09 | 30.38 | 9.86 |
| +SDFT | 53.83 | 79.47 | 10.4 | 36.35 | 40.64 |
| +RaDis | 58.71 | 83.95 | 14.22 | 64.81 | 39.27 |
As demonstrated in the tables above, RaDis effectively alleviates forgetting and outperforms SDFT on both tasks. Additionally, it achieves superior performance on the fine-tuned task compared to Vanilla-FT. These results support our claim that rationales play a crucial role in bridging learned and new knowledge, thereby enhancing both task learning and knowledge retention.
We want to thank the reviewers for their thoughtful comments. We are diligently working to provide comprehensive responses that address each concern, and we will continue to present concrete experimental results throughout the discussion period.
In the following, we address specific points related to the baselines the reviewers have requested comparisons to—namely, multi-task training with additional instruction data and the Tower models.
First, we would like to clarify that RaDis is a Continual Instruction Tuning (CIT) method. In a Continual Learning (CL) setting, learning happens incrementally: when learning a new task, only the data for that task is available, and the model is trained to adapt to it without access to the data of previous tasks. In contrast, fine-tuning with additional instruction data corresponds to a multi-task learning approach, which learns multiple tasks simultaneously or jointly and is typically regarded as an upper bound for CL methods. Therefore, the requested comparisons are actually between two distinct paradigms: a multi-task learning one and a continual learning one. In addition, we would like to emphasize that both ParroT and BayLing already include instruction data (Alpaca) in model training, so models built in the multi-task training paradigm are already included in our original experiments.
However, we are glad to report that our RaDis outperforms the multi-task fine-tuning baseline in terms of preserving the general abilities of the LLMs, and performs competitively in translation.
Table g1-1: Comparison to multitask training. The best result in each column is marked in bold.
| | X-EN | EN-X | Alpaca Eval | Alpaca Eval 2 | AdvBench | GSM8K |
|---|---|---|---|---|---|---|
| Llama-2-Chat | 79.01 | 73.39 | 71.40 | 9.66 | 100 | 21.83 |
| +Multitask | 81.75 | 82.59 | 44.55 | 3.98 | 98.65 | 11.98 |
| +RaDis | 81.22 | 82.61 | 67.94 | 7.47 | 100 | 19.48 |
| Mistral-v0.2-Instruct | 80.84 | 67.79 | 84.91 | 15.09 | 68.46 | 41.62 |
| +Multitask | 82.09 | 84.44 | 49.46 | 5.45 | 63.85 | 22.97 |
| +RaDis | 81.94 | 84.39 | 80.34 | 11.05 | 62.12 | 41.70 |
Table g1-1 presents a comparison with the multi-task training baseline (+Multitask). Specifically, we utilize two widely adopted instruction datasets, Alpaca [1] and Dolly [2], and add them into fine-tuning. As shown in the table, our RaDis consistently outperforms the multi-task training baseline in preserving the general abilities of the LLM, while achieving comparable performance on translation tasks.
This result aligns with our claim in Lines 109-113: the quality of open-sourced instruction data is generally lower than the in-house data used for training large language models like Llama or Mistral. Consequently, incorporating these open-sourced datasets into training does not effectively mitigate the forgetting of the LLM's general abilities. Moreover, since the open-sourced data is out-of-distribution relative to the backbone LLM, fine-tuning with these data may even exacerbate the forgetting problem [3].
[1] Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca
[2] Conover, Mike, et al. "Free dolly: Introducing the world’s first truly open instruction-tuned llm." Company Blog of Databricks (2023).
[3] Yang, Zhaorui, et al. "Self-distillation bridges distribution gap in language model fine-tuning." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). (2024).
Table g1-2: Comparison to TowerInstruct-v0.2. The best result in each column is marked in bold. The second best is italicized.
| | X-EN | EN-X | Alpaca Eval | Alpaca Eval 2 | AdvBench | GSM8K |
|---|---|---|---|---|---|---|
| Mistral+RaDis | 80.64 | 80.58 | 80.34 | 11.05 | 62.12 | 41.70 |
| Qwen2.5+RaDis | 82.13 | 82.81 | 85.62 | 27.91 | 99.04 | 88.78 |
| ALMA | 81.65 | 81.91 | 1.08 | 0.17 | - | 0 |
| TowerInstruct-v0.2 | 82.77 | 84.28 | 51.59 | 4.02 | 30.19 | 7.35 |
Given that TowerInstruct has been fine-tuned on the WMT22 test set, we shifted the translation test set to the WMT23 test set. We report the performance of the best model in our paper (Mistral+RaDis) alongside TowerInstruct. To further demonstrate the potential of our approach, we also conducted new experiments with Qwen2.5-Instruct as the backbone (Qwen2.5+RaDis).
As shown in Table g1-2, RaDis consistently outperforms TowerInstruct-v0.2 in terms of preserving general abilities. This is primarily because TowerInstruct-v0.2 is fine-tuned using UltraChat, which, like other open-sourced instruction datasets, suffers from lower quality.
In terms of translation, TowerInstruct-v0.2 achieves higher performance, largely due to the benefits of multilingual pre-training and extensive parallel fine-tuning. However, we would like to emphasize the strong potential of our approach from two key perspectives:
- RaDis is more efficient: The training times for TowerBase 7B and 13B were 80 and 160 GPU days, respectively, using A100-80GB GPUs. Fine-tuning TowerInstruct adds an additional 200 GPU hours. In contrast, RaDis requires only 20 GPU hours (4 hours for generating rationales and 16 hours for training), which is less than 1% of the training cost for TowerInstruct-7B, while still achieving strong performance.
- RaDis can benefit from a stronger backbone LLM: While TowerInstruct achieves better translation performance, RaDis can effectively bridge this gap by leveraging a stronger backbone LLM. As shown in Table g1-2, switching the backbone from Mistral to Qwen2.5 leads to substantial improvements across all tasks and outperforms ALMA. We believe that as open-source multilingual LLMs continue to improve, the performance gap in translation will gradually narrow.
Together, these results underscore the advantages of our approach and demonstrate that RaDis offers a novel and competitive paradigm for building LLMs that excel in both translation proficiency and general ability.
Following the reviewers' suggestion, we will include the additional experiments and discussions comparing RaDis with TowerInstruct and the multi-task training baseline in the revised version of our paper.
Dear AC and reviewers,
Thank you for your insightful suggestions. We truly appreciate the time and effort you’ve invested in reviewing our work. In response to your feedback, we have refined our experiments and incorporated additional analyses. The revised version of the paper has been uploaded, and we summarize the key modifications below:
- We have introduced a multi-task training baseline, allowing for a more comprehensive comparison with this straightforward approach. (See General Response 1)
- We discuss and compare our work with TowerInstruct, the state-of-the-art LLM-based MT model with general capabilities. (See General Response 1)
- We conducted experiments on broader tasks to assess the generalizability of our proposed method. (See General Response 2)
- We have corrected minor typos and clarified certain claims to improve overall clarity. (See General Response)
As the discussion phase approaches its conclusion, we would appreciate any feedback on whether our revisions address your concerns. We look forward to hearing from you.
Best regards, The Authors
We hope to provide more information regarding the differences between CL and MTL, and the practical needs for CL in the era of LLMs.
First, we would like to clarify that continual learning is different from continual fine-tuning. CL refers to the broader concept of learning over time, where a model is designed to learn from a stream of data in an ongoing fashion without forgetting previously acquired knowledge; it is recognized as a distinct learning paradigm [1,2,3]. In contrast, continual fine-tuning simply means fine-tuning a model on new task data; it does not involve strategies to alleviate forgetting and is considered a baseline approach in CL (Vanilla-FT).
Differences between CL and MTL
While both CL and MTL aim to improve a model's ability to generalize, CL is focused on learning over time without forgetting past knowledge, whereas MTL focuses on learning multiple tasks concurrently.
CL is essential for intelligence, as humans learn continuously in an ongoing manner. For instance, today we might encounter pigs and apples and learn to recognize them; tomorrow, we might see sheep and learn to identify them as well. MTL is not suited to this incremental process: when we first see pigs and apples, sheep have not yet appeared, and it would be inefficient or even impossible to revisit the images of pigs and apples and relearn them alongside sheep once they do, as MTL would require.
CL is widely needed in LLM applications.
A similar situation arises in the context of LLM applications. Practitioners typically aim to adapt open-sourced LLMs to specific applications (e.g., healthcare, education), and preserving the original knowledge in these LLMs is essential for achieving optimal performance. However, fine-tuning with domain- or task-specific data often leads to significant forgetting. In this situation, MTL is often infeasible because we lack access to the data used to train the LLM. This real-world challenge is one of the key motivations behind our decision to work on this problem, and it highlights the need for CL approaches in LLM applications [3].
[1] Liu, Bing. "Lifelong machine learning: a paradigm for continuous learning." Frontiers of Computer Science 11.3 (2017): 359-361.
[2] Wang, Liyuan, et al. "A comprehensive survey of continual learning: theory, method and application." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).
[3] Shi, Haizhou, et al. "Continual learning of large language models: A comprehensive survey." arXiv preprint arXiv:2404.16789 (2024).
We would like to suggest that the use of rationales to alleviate catastrophic forgetting in RaDis represents a novel approach in the CL field. Traditional replay-based methods need to retain examples from previous tasks and replay them during fine-tuning. In contrast, RaDis relies solely on self-generated rationales—the model's own explanations of the new training data. These rationales go beyond raw outputs; they introduce a form of model introspection, in which the model uses its internal knowledge to explain an external example.
In this way, the rationales serve as a semantic scaffold that connects learned knowledge with new tasks, building a reasoning framework that ties internal concepts together and thereby helps both generalization and knowledge retention. This mechanism mirrors how humans often retain information—not simply by memorizing facts (replaying examples), but by constructing a mental narrative (a rationale) that explains how new knowledge fits within a broader context. By utilizing rationales, RaDis offers the potential for a more human-like memory system, representing a promising new direction for continual learning.
We sincerely hope the clarification above could help the reviewer to focus on the insightful ideas behind RaDis and reconsider their scores.
Summary
The paper introduces RaDis (Rationale Distillation) to improve machine translation capabilities of large language models (LLMs) without compromising their general instruction-following abilities. The approach leverages self-generated rationales to mitigate catastrophic forgetting while enabling efficient training with minimal computational resources. Extensive experiments demonstrate competitive performance in translation while preserving general task abilities, highlighting its potential for versatile LLM applications.
Strengths
The motivation of using self-generated rationales to balance specialized and general capabilities is good. Extensive empirical results are shown across multiple models and tasks, showcasing improved efficiency over traditional methods.
Weaknesses:
The main weakness is that the paper is somewhat limited in scope, and the proposed method itself does not bring new insights to the LLM post-training research community.
Additional comments from the reviewer discussion
During the rebuttal period, the following key points are raised:
- Applicability across tasks: Reviewers recommended broader experiments beyond translation, such as QA or summarization. The authors responded with additional code generation experiments but acknowledged time constraints for further tests.
- Inference efficiency: Concerns about output generation overhead were raised. The authors explained techniques such as stopping generation at a delimiter token to mitigate the issue, showing efficiency improvements.
- Comparison baselines: Reviewers pointed out unfair baselines and missing baselines such as training with mixed datasets. The authors provided updated comparisons, demonstrating competitive performance relative to multi-task learning setups.
Reject