Modifying Large Language Model Post-Training for Diverse Creative Writing
In creative writing generation, we facilitate diversity in LLM outputs by accounting for how each training instance differs from other instances that share the same prompt.
Abstract
Reviews and Discussion
This paper aims to increase generation diversity during the post-training process. The main idea is to incorporate the degree of difference between training instances to encourage diverse generations. The proposed method achieves diversity comparable to that of a human-created dataset. The human evaluation results also validate that the proposed DDPO produces more diverse generations.
Reasons to Accept
- The task of improving diversity while preserving generation quality during post-training is very important.
- The performance of the proposed method is good.
Reasons to Reject
The paper lacks experiments on GRPO from DeepSeek, which has recently become popular and also shows good performance.
We want to thank the reviewer for their feedback.
- GRPO: At the least, we did compare our approaches with the vanilla GRPO-based model, DeepSeek-R1. Adapting GRPO for diversification could be done, but we consider this a future work direction.
This paper proposes to include a "deviation" term in the DPO or ORPO loss in order to increase the diversity of an LLM's outputs. The main idea is to favor the "winning" examples if they have a greater distance to the rest of the training examples (for a given prompt). This term is a multiplicative factor in front of the usual terms of these optimization techniques.
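For concreteness, a minimal sketch of such a deviation-weighted DPO loss, assuming the deviation is a precomputed per-example scalar; the function and variable names here are hypothetical, and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def deviation_weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                ref_chosen_logps, ref_rejected_logps,
                                deviation, beta=0.1):
    """Standard DPO loss, scaled per example by a precomputed deviation score
    (how far the chosen response is from other responses to the same prompt)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    dpo_loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    # Multiplicative factor: pairs whose chosen response deviates more from
    # the other responses contribute more to the gradient.
    return (deviation * dpo_loss).mean()
```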
Reasons to Accept
The method proposed in this paper is easy to implement and include in existing training frameworks. It can also be applied to any kind of diversity metric, as the computations can be carried out offline in a pre-processing step and added to the metadata of the training examples. Thus it can find wide applicability. The results also seem to suggest that the method is effective in increasing the diversity with small effect on generation capability, although I have some issues with the evaluation (look also at the "Reasons To Reject" section).
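As an illustration of that offline pre-processing step, a rough sketch (assuming a generic sentence-embedding backend such as sentence-transformers; the actual embedding spaces used for semantic and stylistic diversity may of course differ):

```python
from sentence_transformers import SentenceTransformer  # one possible embedding backend

def deviation_scores(responses):
    """Deviation of each response = 1 - mean cosine similarity to the other
    responses written for the same prompt; stored as metadata before training."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(responses, normalize_embeddings=True)  # (n, d), unit-norm rows
    sims = emb @ emb.T                                        # pairwise cosine similarities
    n = len(responses)
    mean_other_sim = (sims.sum(axis=1) - 1.0) / (n - 1)       # exclude self-similarity of 1
    return 1.0 - mean_other_sim
```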
Reasons to Reject
My biggest issue with this paper is its evaluation. As reported in Section 5.1, the authors evaluate their results on the same metric they are optimizing: cosine similarity in two embedding spaces (depending on the diversity considered). By doing that, the authors risk "gaming the metric". In the world of machine translation this is a known effect, see e.g. [1] or [2]. It would be much better to measure the diversity using a different metric than the one being optimized for.
The authors also provide a human evaluation which can provide better insights. This effort is greatly appreciated, but I also see issues with the methodology followed: First, for judging quality the evaluators pick a set which includes the "most interesting, highest-quality writing", but the diversity is judged on the full set. As the two evaluations are decoupled, it might happen that the highest-quality writing is indeed in a hypothesis that is not diverse, i.e. when the model stays close to the original version. Secondly, the humans were provided "summarized versions of writings", thus introducing another transformation into the pipeline which can further influence the results. The authors also state that "five of this paper’s authors served as evaluators being blind to the conditions". I do not presuppose any ill intent on the evaluators, but there are a lot of unconscious biases that are hard to control. The authors of the paper were of course aware of the goal of their research and they might recognize the sets of hypotheses produced by their system (probably due to an increased diversity!), which might influence their judgments, especially concerning quality. Again, I am sure that there was no ill intent at any point in the process, but human biases are hard to control. An independent set of judges, with no knowledge of the goals of the study, would have been a better choice.
All in all I do think that the method does increase the diversity of the outputs, but it is hard to really quantify it from the information given in the paper.
[1] "BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training", Yiming Yan, Tao Wang, Chengqi Zhao, Shujian Huang, Jiajun Chen, Mingxuan Wang, 2023. https://aclanthology.org/2023.acl-long.297/ [2] "Mitigating Metric Bias in Minimum Bayes Risk Decoding", Geza Kovacs, Daniel Deutsch, Markus Freitag, 2024 https://aclanthology.org/2024.wmt-1.109/
Questions for Authors
- In 3.2.1 it would be good to define the symbols used, for people not familiar with the algorithm.
- Please do not include a footnote in Equations (8) and (9). It can be confused with raising to the power of 3. Include the content of the footnote in the text referencing the equations.
- Line 138: "by using the r/writingPrompt dataset".
- In Figure 2 it can be difficult to distinguish the different shades of the colors. Regretfully I do not have a clear recommendation, maybe making a better use of the combination of colors and shapes? E.g. given that DPO are squares and DORPO are rhombi, they can share the same colors (e.g. DDPO-sty has the same color as DORPO-sty). In this way the number of colors can be reduced and better contrast can be introduced. Or maybe consider labelling the points in the graph?
Thank you first for a thoughtful review.
- Metric issue: We see the reviewer’s concern, and hope that the human evaluation and the analysis of other diversity metrics in Appendix F address some of it. Although Appendix F computes diversity metrics over surface-level textual features (e.g., compression ratio), our approaches, which target semantic diversity, still improve these metrics over the original DPO and ORPO.
- Human evaluation design choices: The summarization and “picking the highest-quality writing” were due to practical reasons, despite the limitations that the reviewer mentioned. Each generated text tends to be lengthy, with a maximum of 2048 tokens. Doing both evaluative tasks with lengthy samples was cognitively infeasible when we piloted the evaluation tasks; hence, we provided summarized texts. Even with summarized texts, either ranking the quality of the eight samples or deciding which set has a better average writing quality would have been difficult; hence, we asked evaluators to pick the highest-quality writing.
- Self-evaluation in human evaluation: While we admit this limitation of our approach, we tried our best to minimize bias. First, only one of the authors actively read the qualitative samples during the model development process. Second, for the human evaluation, we used prompts that were not used during model development. Moreover, evaluation instances were properly randomized, and evaluators were blind to the conditions. Lastly, we conducted the DDPO-both vs. GPT-4o and DDPO-both vs. DPO comparisons together so that evaluators could not easily guess which condition each evaluation instance came from.
- Regarding the questions: We will try to address these comments if the paper is accepted.
Appendix F indeed provides additional evidence about the diversity; thanks for pointing it out.
I acknowledge the difficulties the authors point out, but recognizing them does not make the issues with the evaluation go away. The authors agree with most of my concerns, so I stand by my review. I still think the method is useful, but I would like a more thorough evaluation of it.
This work focuses on post-training that improves the diversity of LM responses, in addition to improving quality, for creative writing. The key idea is to make minor modifications to the DPO and ORPO training objectives so that they weigh diversity positively. This involves including a notion of deviation in responses during post-training: responses that differ from the average response (the other sampled responses) are weighed more favorably than other responses. Deviation is characterized by semantic and stylistic embeddings, and quality is captured by neural reward models. The approach is evaluated on the writing-prompts dataset via experiments on 8B-sized Llama and Mistral models and is compared to vanilla SFT, GPT-4o (including iterative prompting), and unmodified DPO/ORPO. Performance is also compared against another diversity-promoting post-training method called DivPO.
Reasons to Accept
-- The approach is reasonable and well-motivated.
-- The approach seems to outperform the non-diversity baselines in terms of distinctiveness and quality.
-- The paper is well-organized and easy to follow.
Reasons to Reject
-- DivPO seems to be a very important baseline and is not described in sufficient detail. I am not sure whether there are fundamentally salient differences between the proposed approach and DivPO. This work should be better contextualized. Moreover, DivPO is the only diversity-promoting baseline considered.
-- Why only four instances per evaluation prompt? More samples would provide a better estimate of the diversity in responses.
-- The paper is difficult to engage with in parts -- the figures are dense and colors look very similar to each other. The baselines should be more logically clustered and visualized. The use of the expectation symbol also seems unconventional -- I think it is being used to denote sample averages.
-- The modification of DPO/ORPO objectives for diversity seems arbitrary and does not seem to be grounded in any theory or justified in some other manner.
Questions for Authors
Please address the comments above.
First, we want to thank the reviewer for the considerate review.
- DivPO: While we noted the high-level approach of DivPO in Section 2 and our implementation of DivPO in Appendix H, we see that these might not be sufficient to show how our approach and DivPO relate to each other. To clarify, DivPO filters the winning data of DPO down to the most diverse instances and makes the losing data the least diverse, whereas our approach weights the training objective by how different the winning instance is from the others. To our knowledge, at the time of submission, DivPO was the only generalizable diversification post-training approach.
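To make the contrast concrete, a rough sketch (with hypothetical field names; both methods are simplified relative to their full descriptions):

```python
def divpo_style_pair(cands, reward_threshold):
    """DivPO as summarized above: the chosen response is the most diverse among
    sufficiently high-reward candidates, the rejected one the least diverse among
    low-reward candidates; the deviation itself never enters the loss."""
    high = [c for c in cands if c["reward"] >= reward_threshold]
    low = [c for c in cands if c["reward"] < reward_threshold]
    return max(high, key=lambda c: c["deviation"]), min(low, key=lambda c: c["deviation"])

def ddpo_style_weight(chosen):
    """Our approach as summarized above: keep the usual chosen/rejected pairing and
    instead scale the per-pair DPO/ORPO loss by the chosen response's deviation."""
    return chosen["deviation"]
```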
- Four instances per prompt: We wanted to evaluate the approaches in a setting that is realistic for user-facing application scenarios. That is, when a user interacts with an LLM application (e.g., a creative writing assistant), they might not see very many instances; due to cognitive capacity, they might realistically see fewer than ten. We designed our evaluations for such a setting, sampling four instances per prompt. An evaluation in more scalable generation settings could be future work.
- Presentation issues: If the submission is accepted, we will revise the manuscript to address the reviewer’s concerns.
- Grounding of the approach: While we do not have a theoretical grounding for the approach, we explain our intuition at the beginning of Section 4. That is, we aim to promote the generation of “untypical” samples by accounting for the deviation of training data during the post-training process.
Thank you for your response. I will maintain my current score.
The paper considers the problem of improving language model diversity during post-training. The motivation for this work is that algorithms like DPO improve response quality but may reduce the diversity of responses generated by models. The proposed method achieves substantially improved diversity without negatively affecting other metrics of language model utility. The key mechanism enabling the proposed method is the use of “deviation” (the amount of difference between samples for a given prompt) during RL training. A deviation term is simply placed into the losses of two familiar preference learning algorithms, DPO and ORPO - a low loss is obtained if the model assigns high probability to the chosen response (as in the original algorithms) and low probability to the rejected response (as in the original algorithms), both scaled by how much the chosen response differs from other samples for that prompt. In other words, the learning rate adapts to the diversity of samples, and the model learns much more from diverse samples than from homogeneous samples, which conceptually moves the parameters in the direction of diversity. The closest neighbor to this work is DivPO (“Diverse Preference Optimization” by Lanchantin et al., 2025), which filters response data to encourage diversity. The proposed algorithms in this paper are claimed to encourage diversity while still training on the entire set of available prompts, by being similar to the original base algorithm (DPO or ORPO). Results show that DDPO provides sizably better diversity than DivPO, while providing similar (or marginally less) utility at the primary RL task (generating stories that are expected to get a large number of upvotes on Reddit). Human evaluation is consistent with this result.
Reasons to Accept
- The proposed method is simple and intuitive
- The paper is, on the whole, extraordinarily well written
- Despite my stated concern about the human evaluation below, I generally am convinced that this approach works and would recommend it to others working on creative generation
Reasons to Reject
- The human evaluation is not very satisfying to me. First, it is done by the authors of the paper - for a human evaluation that is crucial in judging the success of the method, I have a bit of a problem with this. I think that, as model developers, we can learn to fingerprint a specific method, and a natural bias emerges from this. Second, the use of story summaries likely elides a lot of the “diversity” of the writing and, as such, this human evaluation likely only captures topical diversity (which the authors do acknowledge). If this human evaluation were better executed, I would give this paper an 8 or a 9 overall.
- The proposed work only trains on creative writing generation. This is a valid testbed, and the paper faithfully advertises itself as being for creative writing, but I do feel that this method could be broader (honestly, this is not actually a good reason to reject, but rather a personal thing I wish were different about the paper).
Questions for Authors
- One footnote I had trouble following: “As having a robust reward model for creative writing is difficult due to subjectivity in evaluation, it was more of the case in our context” (below line 113)
- Figure 2 is a bit hard to process just by glancing. Can you add cluster labels or shading to point the reader to what therein is interesting to look at? Also, can the color coding be made more semantically meaningful?
We thank the reviewer for valuable feedback.
- Self-evaluation issue (same as the response to HqMa): While we admit this limitation of our approach, we tried our best to minimize bias. First, only one of the authors actively read the qualitative samples during the model development process. Second, for the human evaluation, we used prompts that were not used during model development. Moreover, evaluation instances were properly randomized, and evaluators were blind to the conditions. Lastly, we conducted the DDPO-both vs. GPT-4o and DDPO-both vs. DPO comparisons together so that evaluators could not easily guess which condition each evaluation instance came from.
- Human evaluation design choice (summarization): The summarization was due to practical reasons: each generated text tends to be lengthy, with a maximum of 2048 tokens. Doing both evaluative tasks with lengthy samples was cognitively infeasible when we piloted the evaluation tasks. Hence, we provided summarized texts.
- Domain: We also believe that examining the approach in other domains is an interesting future work opportunity. We hope that our work serves as the basis for such future research.
- Writing and figures: We will address the writing and figure issues if the submission is accepted. For the writing, we will change the sentence to “Having a robust reward model for creative writing is difficult due to subjectivity in evaluation, which was also the case in our context.”
Thank you for acknowledging and clarifying my areas for concern. I still believe that the two limitations I mentioned are scientific concerns with the submitted paper, but I will maintain my current score and recommend acceptance.
The paper proposes a language model post-training method for improving output diversity while preserving generation quality in creative writing tasks. The approach modifies existing preference optimization methods (DPO and ORPO) by incorporating a deviation term that weights training instances based on how different they are from other instances with the same prompt, facilitating learning from rare, high-quality instances. The method demonstrates improved diversity compared to baseline approaches while maintaining comparable quality. The comparison with human-created datasets shows promise, and the approach outperforms the existing diversity-promoting method DivPO. The reviews raise some concerns about the human evaluation methodology: the authors served as evaluators, summaries of the generated texts were evaluated, and only four instances per prompt were evaluated. However, the evaluation sufficiently demonstrates the potential of the method in creative writing generation.
[Automatically added comment] At least one review was discounted during the decision process due to quality.