c:["$","div",null,{"className":"container py-8 max-w-6xl mx-auto","children":["$","$e",null,{"fallback":null,"children":["$","$L16",null,{"paper":{"id":"VdDtRu7RTf","title":"Write More at Once: Stylized Chinese Handwriting Generation via Two-stage Diffusion","abstract":"$17","keywords":["Handwritten Text Generation; Conditional Diffusion;"],"primary_area":"applications to computer vision, audio, language, and other modalities","venue":"ICLR 2025 Conference Withdrawn Submission","conference":"ICLR","year":2025,"status":"withdrawn","is_accepted":false,"avg_rating":4.75,"avg_rating_normalized":4.75,"rating_min":3,"rating_max":6,"rating_std":1.08972,"review_count":4,"comment_count":5,"creation_date":"2024-09-27","modification_date":"2024-11-14","forum_link":"https://openreview.net/forum?id=VdDtRu7RTf","pdf_link":"https://openreview.net/pdf?id=VdDtRu7RTf","arxiv_id":null,"arxiv_url":null,"arxiv_match_method":null,"arxiv_matched_at":null,"tldr":"","created_at":"2026-01-21T12:25:09.615992+00:00","updated_at":"2026-04-22T07:34:22.293774+00:00","authors":[{"id":"~Honglie_Wang1","name":"Honglie Wang","openreview_id":"~Honglie_Wang1","position":0},{"id":"~Minsi_Ren1","name":"Minsi Ren","openreview_id":"~Minsi_Ren1","position":1},{"id":"~Yangyang_Liu3","name":"Yangyang Liu","openreview_id":"~Yangyang_Liu3","position":2},{"id":"~Yan-Ming_Zhang1","name":"Yan-Ming 
Zhang","openreview_id":"~Yan-Ming_Zhang1","position":3}]},"stats":{"ratings":[{"id":"G9ic5cKGVy","value":5,"confidence":5},{"id":"vQIeL0k8ch","value":6,"confidence":4},{"id":"kfSeJjSR4Q","value":3,"confidence":5},{"id":"AanfleiUP9","value":5,"confidence":4}],"avg_rating":4.75,"rating_min":3,"rating_max":6,"rating_std":1.2583057392117916,"detailed_scores":{"soundness":[2,3,1,2],"contribution":[3,2,2,2],"presentation":[2,3,1,2],"originality":[],"quality":[],"clarity":[],"significance":[]}},"commentTree":[{"id":"G9ic5cKGVy","paper_id":"VdDtRu7RTf","replyto":"VdDtRu7RTf","number":2,"type":"Official_Review","role":"reviewer","rating":5,"confidence":5,"soundness":2,"contribution":3,"presentation":2,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":5,"summary":"This paper introduces a two-stage diffusion model for generating stylized Chinese handwritten text lines. Unlike previous methods limited to single font outputs, this approach tackles sentence-level generation by treating it as a image style transfer problem. The first stage, CharPos-Diff, generates the character positions and creates a text line template using standard font images. The second stage, Imitating-Diff, then transforms this template into a handwritten style using a diffusion model, incorporating style information from a reference sample. The model incorporates a novel loss function that emphasizes stroke contours and uses a content-style alignment technique for improved results. Experiments demonstrate the method's effectiveness in generating structurally accurate and stylistically consistent handwritten text lines.","questions":"In table 2, as the CSA module can extract both content and style information, what would happen to the method's performance if the CA module is removed and only the CSA module is retained?","soundness":2,"strengths":"1.\tIt is good to see the work about generating handwritten Chinese text lines. 
Although there have been some studies on generating handwritten English text lines, the more challenging task of generating handwritten Chinese text lines is still a relatively under-explored area.\n\n2.\tThe core idea of this work (first generate a template image and then transfer the template into a stylized image.) is well-motivated and novel.\n\n3.\tThe proposed two-step diffusion-based method is technically sound and validated by experiments.","confidence":5,"weaknesses":"1. Overall, the experimental section can be refined with more explanation. For example, （1） there have been several existing works on single handwritten Chinese character generation, but this paper only compares one method; moreover, from the results in Tab. 1, the performance of the proposed method is only similar to that of One-DM. （2） there is no quantitative evaluation experiments for the generated text line images. (3) Tab. 2 is not referred in the paper. (4) Performance metrics used in Tab. 1 are not explained.\n\n2. The introduction to IMITATING-DIFFUSION is a little simplistic, making it not easy to fully understand. It only explains how each module is implemented, but does not describe how the modules interact with each other. In addition, most of the design seems to be based on existing methods.","contribution":3,"presentation":2,"code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2024-11-01T00:00:00+00:00","modified_at":"2024-11-13T00:00:00+00:00","replies":[],"contentHtml":{"summary":"

This paper introduces a two-stage diffusion model for generating stylized Chinese handwritten text lines. Unlike previous methods limited to single font outputs, this approach tackles sentence-level generation by treating it as an image style transfer problem. The first stage, CharPos-Diff, generates the character positions and creates a text line template using standard font images. The second stage, Imitating-Diff, then transforms this template into a handwritten style using a diffusion model, incorporating style information from a reference sample. The model incorporates a novel loss function that emphasizes stroke contours and uses a content-style alignment technique for improved results. Experiments demonstrate the method's effectiveness in generating structurally accurate and stylistically consistent handwritten text lines.

","questions":"

In Table 2, since the CSA module can extract both content and style information, how would the method's performance change if the CA module were removed and only the CSA module were retained?

","strengths":"

\n
It is good to see work on generating handwritten Chinese text lines. Although there have been some studies on generating handwritten English text lines, the more challenging task of generating handwritten Chinese text lines is still relatively under-explored.
\n
\n
The core idea of this work (first generate a template image, then transfer the template into a stylized image) is well-motivated and novel.
\n
\n
The proposed two-step diffusion-based method is technically sound and validated by experiments.
\n

","weaknesses":"

\n
Overall, the experimental section needs more explanation. For example: (1) there are several existing works on single handwritten Chinese character generation, but this paper compares against only one method; moreover, according to the results in Tab. 1, the performance of the proposed method is only comparable to that of One-DM. (2) There are no quantitative evaluation experiments for the generated text line images. (3) Tab. 2 is not referenced in the paper. (4) The performance metrics used in Tab. 1 are not explained.
\n
\n
The introduction to IMITATING-DIFFUSION is somewhat simplistic, making it hard to fully understand. It explains how each module is implemented but does not describe how the modules interact with each other. In addition, most of the design seems to be based on existing methods.
\n

","code_of_conduct":"

Yes

"}},{"id":"vQIeL0k8ch","paper_id":"VdDtRu7RTf","replyto":"VdDtRu7RTf","number":1,"type":"Official_Review","role":"reviewer","rating":6,"confidence":4,"soundness":3,"contribution":2,"presentation":3,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":6,"summary":"$18","questions":"Section 4.1 title \"Dataset.\" --> \"Dataset\"\nSection 4.2 title \"Implement details\" --> \"Implementation details\"","soundness":3,"strengths":"Originality: The paper presents an approach to image generation of Chinese handwritten lines. The task itself and the separation into layout and content generation is not novel, but the diffusion approach to this problem is somewhat novel.\nQuality & Clarity: The writing is mostly clear. The ablation study is a strong part of the paper, investigating the quality of the content generation on the individual symbols dataset, the quality of layout generation independently from the content generation, and the contribution of the proposed content style aggregation model.","confidence":4,"weaknesses":"I believe that the main weakness of the paper is the combination of fairly narrow domain / significance and the absence of the open-source model/code release:\n\n* The proposed approach is particularly suitable for languages that don't have cursive writing but doesn't seem to generalize beyond that. 
\n* The proposed layout generation approach could be suitable for other types of problems, such as generation of pages of handwriting or multi-line handwriting, but is evaluated only on the narrow domain.\n* Given the narrow focus of the paper and a fairly complex model consisting of two diffusion models and a particular attention mechanism, reproducibility of this approach is fairly difficult, thus hindering further progress based on this work.\n\nI believe that the paper could be strengthened either by releasing the code or the model, or by showcasing that the proposed approach could generalize beyond the domain highlighted in the paper (ex. different scripts or different types of images)","contribution":2,"presentation":3,"code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2024-11-01T00:00:00+00:00","modified_at":"2024-11-13T00:00:00+00:00","replies":[],"contentHtml":{"summary":"$19","questions":"

Section 4.1 title \"Dataset.\" --> \"Dataset\"\nSection 4.2 title \"Implement details\" --> \"Implementation details\"

","strengths":"

Originality: The paper presents an approach to image generation of Chinese handwritten lines. The task itself and the separation into layout and content generation are not novel, but the diffusion approach to this problem is somewhat novel.\nQuality & Clarity: The writing is mostly clear. The ablation study is a strong part of the paper, investigating the quality of the content generation on the individual symbols dataset, the quality of layout generation independently from the content generation, and the contribution of the proposed content style aggregation model.

","weaknesses":"$1a","code_of_conduct":"

Yes

"}},{"id":"kfSeJjSR4Q","paper_id":"VdDtRu7RTf","replyto":"VdDtRu7RTf","number":3,"type":"Official_Review","role":"reviewer","rating":3,"confidence":5,"soundness":1,"contribution":2,"presentation":1,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":3,"summary":"This paper proposes a two-stage diffusion method to synthesize Chinese handwritten text line, conditioned on text content and style reference image. The key idea is to first generate the layout that consists of the text content and character positions, then combine the layout and character style information to synthesize the desired text line. Some experiments evaluate the proposed method.","questions":"My major concerns are the unclear method description and weak experiment designs. More details are provided in Weaknesses.","soundness":1,"strengths":"1)\tIt is interesting to break down the text line generation into two independent processes.\n2)\tThe proposed LayoutDiffuser component generates accurate layout.","confidence":5,"weaknesses":"1)\tThe pipeline of the proposed method is unclear. It is recommended to provide an overall introduction of the whole method in Section 3.2. 
\n2)\tI am confused about how to obtain the bounding boxes of style image in the testing phase.\n3)\tThe proposed LayoutDiffuser lacks a definition and introduction in the method section.\n4)\tThe effectiveness of the proposed method is questionable, as Table 1 shows it lags behind the SOTA method One-DM across five metrics, such as FID and style score.\n5)\tIn Figure 2 and the last row of Figure 4, the generated samples differ significantly from the Target Image in terms of ink color, and stroke connections, raising doubts about whether the proposed method can accurately mimic the handwriting style.\n6)\tQuantitative ablation results are recommended to provide.\n7)\tThis paper does not include a user study and an analysis of failure cases.","contribution":2,"presentation":1,"code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2024-11-03T00:00:00+00:00","modified_at":"2024-11-13T00:00:00+00:00","replies":[],"contentHtml":{"summary":"

This paper proposes a two-stage diffusion method to synthesize Chinese handwritten text lines, conditioned on text content and a style reference image. The key idea is to first generate the layout, which consists of the text content and character positions, and then combine the layout and character style information to synthesize the desired text line. Some experiments evaluate the proposed method.

","questions":"

My major concerns are the unclear method description and the weak experimental design. More details are provided in Weaknesses.

","strengths":"

It is interesting to break down the text line generation into two independent processes.
The proposed LayoutDiffuser component generates accurate layouts.

","weaknesses":"

The pipeline of the proposed method is unclear. It is recommended to provide an overall introduction of the whole method in Section 3.2.
I am confused about how the bounding boxes of the style image are obtained in the testing phase.
The proposed LayoutDiffuser lacks a definition and introduction in the method section.
The effectiveness of the proposed method is questionable, as Table 1 shows it lags behind the SOTA method One-DM across five metrics, such as FID and style score.
In Figure 2 and the last row of Figure 4, the generated samples differ significantly from the Target Image in terms of ink color and stroke connections, raising doubts about whether the proposed method can accurately mimic the handwriting style.
Quantitative ablation results should be provided.
This paper does not include a user study or an analysis of failure cases.

","code_of_conduct":"

Yes

"}},{"id":"AanfleiUP9","paper_id":"VdDtRu7RTf","replyto":"VdDtRu7RTf","number":4,"type":"Official_Review","role":"reviewer","rating":5,"confidence":4,"soundness":2,"contribution":2,"presentation":2,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"rating":5,"summary":"The paper presents a method for generating stylized Chinese handwriting at the sentence level using a two-stage diffusion model. The model addresses limitations in previous methods that focused on generating single characters or words. The authors introduce two primary components:\n 1. CharPos-Diffusion for generating character positions in a text line.\n 2. Imitating-Diffusion for transferring the layout into a specified handwriting style.\nThis approach models text-line generation as a style transfer problem and enables the generation of coherent sentence-level handwriting with personalized style features.\n The CharPos-Diffusion stage generates a layout of character positions, creating a structured template for the text line based on reference samples. Whereas the Imitating-Diffusion stage uses these templates to transfer style information from a reference, combining content and style features effectively.","questions":"1. Generating samples from a score-based diffusion model requires a large number of steps. The authors have mentioned that they have used a reduced-step solver like DPM-Solver++ to speed up the sampling. But most likely the consequences of using it is reduced image quality or stability. It is unclear from the current draft what measures has been taken to address that issue. \n\n2. Column Captions/heading title in Table 1 and Table 2 – What is meant by style score and content score? Are those referring to Writer Identification accuracy and character recognition accuracy? \n\n3. Dataset train /test splitting: In page 5, line 254 and 258 the authors have mentioned that they have randomly selected training and testing data. 
How could the results be reproducible in such a framework? This is important for fair comparison and bench-marking purpose. \n\n4. Can the model generate text with an unseen style ?","soundness":2,"strengths":"1. The paper identifies an essential gap in “Chinese handwritten text generation” research: the need to go beyond isolated character generation. By focusing on full text lines, this method supports applications that require sentence-level coherence. \n\n2. The \"CharPos-Diffusion\" component introduces a layout loss that maintains the \"positional relationships\" between characters, ensuring consistent spacing and alignment.\n3. The \"imitating-Diffusion\" stage incorporates Harris corner detection into its loss function to emphasize stroke contours. This inclusion enhances style fidelity by focusing on the stroke details.","confidence":4,"weaknesses":"1. The two-stage diffusion process, combined with the need for high-dimensional style and content features, leads to **high computational demands**. This complexity could limit the model’s scalability, particularly for real-time applications.\n\n2. This approach is only suitable to generate text lines from Scripts with distinct separate characters with a clear bounding box, this is not a generic approach which could be applied to generate cursive handwritten text of Latin script. \n\n3. Time complexity analysis of the entire method is missing,\n\n4. Ablation of loss function - involves only qualitative method.","contribution":2,"presentation":2,"code_of_conduct":"Yes","flag_for_ethics_review":["No ethics review needed."]},"created_at":"2024-11-03T00:00:00+00:00","modified_at":"2024-11-13T00:00:00+00:00","replies":[],"contentHtml":{"summary":"

The paper presents a method for generating stylized Chinese handwriting at the sentence level using a two-stage diffusion model. The model addresses limitations in previous methods that focused on generating single characters or words. The authors introduce two primary components:

CharPos-Diffusion for generating character positions in a text line.
Imitating-Diffusion for transferring the layout into a specified handwriting style.\nThis approach models text-line generation as a style transfer problem and enables the generation of coherent sentence-level handwriting with personalized style features.\nThe CharPos-Diffusion stage generates a layout of character positions, creating a structured template for the text line based on reference samples. The Imitating-Diffusion stage then uses these templates to transfer style information from a reference, combining content and style features effectively.

","questions":"

\n
Generating samples from a score-based diffusion model requires a large number of steps. The authors mention that they use a reduced-step solver, DPM-Solver++, to speed up sampling. However, the likely consequence of using it is reduced image quality or stability. It is unclear from the current draft what measures have been taken to address that issue.
\n
\n
Column headings in Table 1 and Table 2: what is meant by style score and content score? Do they refer to writer identification accuracy and character recognition accuracy?
\n
\n
Dataset train/test split: on page 5, lines 254 and 258, the authors mention that they randomly selected the training and testing data. How can the results be reproduced under such a setup? This is important for fair comparison and benchmarking.
\n
\n
Can the model generate text in an unseen style?
\n

","strengths":"

\n
The paper identifies an essential gap in “Chinese handwritten text generation” research: the need to go beyond isolated character generation. By focusing on full text lines, this method supports applications that require sentence-level coherence.
\n
\n
The \"CharPos-Diffusion\" component introduces a layout loss that maintains the \"positional relationships\" between characters, ensuring consistent spacing and alignment.
\n
\n
The \"Imitating-Diffusion\" stage incorporates Harris corner detection into its loss function to emphasize stroke contours. This inclusion enhances style fidelity by focusing on the stroke details.
\n

","weaknesses":"

\n
The two-stage diffusion process, combined with the need for high-dimensional style and content features, leads to high computational demands. This complexity could limit the model’s scalability, particularly for real-time applications.
\n
\n
This approach is only suitable for generating text lines in scripts with distinct, separate characters that have clear bounding boxes; it is not a generic approach that could be applied to generating cursive handwritten text in Latin script.
\n
\n
A time complexity analysis of the entire method is missing.
\n
\n
The ablation of the loss function is only qualitative.
\n

","code_of_conduct":"

Yes

"}},{"id":"HTqYBQqTtm","paper_id":"VdDtRu7RTf","replyto":"VdDtRu7RTf","number":1,"type":"Withdrawal","role":"author","rating":null,"confidence":null,"soundness":null,"contribution":null,"presentation":null,"originality":null,"quality":null,"clarity":null,"significance":null,"content":{"withdrawal_confirmation":"I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors."},"created_at":"2024-11-14T00:00:00+00:00","modified_at":"2024-11-14T00:00:00+00:00","replies":[],"contentHtml":{"withdrawal_confirmation":"

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.

"}}],"submissionHistory":[]}]}]}]