PaperHub
Overall rating: 7.3/10
Poster · 4 reviewers (min 4, max 5, std 0.5)
Individual ratings: 5, 4, 4, 5
Confidence: 3.5
Novelty: 2.0 · Quality: 3.0 · Clarity: 2.5 · Significance: 2.5
NeurIPS 2025

OmniSVG: A Unified Scalable Vector Graphics Generation Model

OpenReview · PDF
Submitted: 2025-04-14 · Updated: 2025-10-29

Abstract

Keywords
SVG Generation, Vision Language Model

Reviews and Discussion

Review (Rating: 5)

This paper introduces OmniSVG, a novel framework for generating high-quality Scalable Vector Graphics (SVGs) from multimodal inputs (text, images, or character references). It leverages a pre-trained Vision-Language Model (VLM) to tokenize SVG commands and coordinates, enabling efficient generation of complex, editable SVGs. Key contributions include: 1) OmniSVG: An autoregressive model that decouples structural logic from geometry, addressing "coordinate hallucination" in prior methods. 2) MMSVG-2M: A large-scale dataset with 2M annotated SVGs (icons, illustrations, anime characters) and a benchmark (MMSVG-Bench) for evaluation.

Strengths and Weaknesses

Strengths

  1. Multimodal Capability: Handles diverse inputs (text, images, references) and tasks (Text-to-SVG, Image-to-SVG, Character-Reference SVG).
  2. Dataset & Benchmark: MMSVG-2M fills a gap in complex SVG datasets, and MMSVG-Bench provides standardized evaluation metrics.
  3. SVG tokenization reduces redundancy and improves efficiency.
  4. Superior quantitative metrics (lower FID, higher CLIP/DINO scores) and qualitative outputs (rich colors, geometric accuracy).

Weaknesses

  1. Limited Evaluation of Complex Characters: While the model performs well on simple icons and animated characters, the paper lacks qualitative results for more intricate designs—particularly detailed characters such as those in MMSVG-2M-Character. To fully assess generalization, these examples should be in-the-wild (i.e., not part of the training dataset).
  2. Insufficient Showcase of SVG Scalability: One of SVG’s key advantages is its ability to scale without aliasing. However, the authors do not provide enough zoomed-in comparisons to demonstrate this benefit. Including such results would better highlight the model’s precision in preserving geometric fidelity at higher resolutions.

Questions

Does the SVG tokenization scheme limit expressiveness by reducing all paths to five basic commands (M, L, C, A, Z)?

While this simplification standardizes training (Sec. 3.1), it may discard nuanced XML attributes (e.g., transform, stroke-dasharray) and composite operations (e.g., groups) found in professional designs. Two concerns arise:

  1. Approximation Errors: Complex curves/shapes requiring advanced commands (e.g., quadratic Béziers, filters) must be approximated with cubic Béziers (C). Does this introduce visible artifacts or bloated path data?
  2. Editability Trade-off: Simplified paths may ease generation but hinder post-generation editing (e.g., modifying a single element in a grouped shape). How does OmniSVG balance automation with downstream usability?

Limitations

yes

Final Justification

In the rebuttal, the authors provided additional clarification regarding their SVG tokenization, which—while not perfect—is sufficient for most SVG images. Unfortunately, NeurIPS does not permit additional images in the rebuttal, so I could not evaluate results on complex characters or SVG scalability. Nevertheless, given the practical value of this work and its usefulness to the design industry, I believe the paper is acceptable. Therefore, I maintain my score as Accept.

Formatting Issues

no

Author Response

We thank the reviewer for their approval of: 1) a unified multimodal framework that handles diverse inputs for SVG generation tasks, 2) the MMSVG-2M dataset and MMSVG-Bench, which address the scarcity of complex SVG resources and establish standardized evaluation protocols, 3) an innovative SVG tokenization approach that improves efficiency and reduces redundancy, and 4) demonstrated superiority in both quantitative metrics (FID, CLIP/DINO scores) and qualitative aspects (color richness, geometric precision).

Please find our point-to-point response to your concerns below.

📝 W1: Clarification of the evaluation of complex characters.

💡 A: Before training, we partitioned the data into training, validation, and test sets, ensuring that all test cases were unseen during training. Benchmark results are presented in Figure 4 of the manuscript, with additional test results in Figure 12. To better demonstrate OmniSVG's generalization capability on complex character designs, we will include qualitative results on real-world character examples outside our training distribution in the revised version. This will provide a more comprehensive evaluation of our model's performance on intricate character designs, as suggested by the reviewer.

📝 W2: Insufficient Showcase of SVG Scalability.

💡 A: We appreciate the reviewer's valuable suggestion. In the revised version, we will include zoomed-in view results to further highlight the advantages of our SVG approach. Additionally, we have provided a demonstration of the layer-wise SVG generation process in the supplementary material (2245_OmniSVG_A_Unified_Scalable_Supplementary_Material/assets/OmniSVG-demo-gen-proc-anime-1080.mp4). Furthermore, we will present visualizations of different layers to offer a clearer comparison between our method and optimization-based approaches.

📝 Q1: Clarification of the SVG Tokenizer.

💡 A: Our SVG tokenization consists of five basic path commands (M, L, C, A, Z) along with the color command F. The combination of these six commands is sufficient to represent the majority of vector graphics in practical applications. As shown in Figure 1, even complex characters can be fully represented with just these six commands, demonstrating the expressive power of our selected command set. We acknowledge that this approach does not support advanced SVG features such as gradient fills, filter effects, or animation properties. However, by focusing on the most fundamental geometric representation commands, we are able to simplify the tokenization process significantly while retaining enough expressiveness to enable the model to efficiently learn and generate the core structure of vector graphics.

📝 Q2: Clarification of the error accumulation and editability trade-off.

💡 A: Our SVG tokenization approach follows the design principles of DeepSVG, simplifying complex SVG commands into five basic path commands (M, L, C, A, Z) and adding a fill command. This design choice is based on thorough technical considerations and practical effectiveness validation. Regarding concerns about approximation errors, extensive experiments have shown that while there is a theoretical approximation step when converting complex curves (such as quadratic Bézier curves) to cubic Bézier curves, this conversion does not produce visible visual artifacts in practice. The complex character SVGs in Figure 1 of the paper clearly demonstrate that OmniSVG can generate high-quality SVG graphics with complex geometric shapes and fine details, proving the effectiveness of our tokenization scheme in maintaining visual quality.

More importantly, this simplification brings key technical advantages: the generated path representations are unique, eliminating ambiguity caused by multiple equivalent representations of the same shape. This enhances the stability of model training and consistency in the generated results. Regarding the trade-off in editability, we view this as a design choice rather than a limitation. Although the simplified paths cannot be parameterized for editing as whole entities like the original advanced SVG elements, this decomposition actually provides finer-grained control. For instance, by breaking a circle into Bézier curves, users can precisely transform it into an ellipse or other irregular shapes by adjusting control points. This flexibility is often more valuable in professional design scenarios, as it allows designers to make more precise and creative shape adjustments without the constraints of predefined geometric primitives.
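As an illustrative aside (not from the paper or rebuttal), the standard four-segment cubic Bézier approximation of a circle shows what this kind of decomposition looks like in practice: once a circle is expressed as C commands, each control point can be moved independently to deform the shape, which is the editability argument made above. The function name and tuple layout below are hypothetical.

```python
# Illustrative sketch only: the common four-segment cubic Bezier approximation
# of a circle, using the well-known constant kappa ~= 0.5522847498.
KAPPA = 0.5522847498

def circle_as_cubic_beziers(cx, cy, r):
    """Return M/C/Z segments approximating a circle of radius r centered at (cx, cy)."""
    k = KAPPA * r
    return [
        ("M", (cx + r, cy)),
        ("C", (cx + r, cy + k), (cx + k, cy + r), (cx, cy + r)),
        ("C", (cx - k, cy + r), (cx - r, cy + k), (cx - r, cy)),
        ("C", (cx - r, cy - k), (cx - k, cy - r), (cx, cy - r)),
        ("C", (cx + k, cy - r), (cx + r, cy - k), (cx + r, cy)),
        ("Z",),
    ]

# Moving any single control point deforms the circle into an ellipse-like or
# irregular shape without touching the remaining segments.
print(circle_as_cubic_beziers(100, 100, 50))
```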

Comment

I thank the authors for their detailed responses. They provided additional clarification regarding their SVG tokenization, which—while not perfect—is sufficient for most SVG images. Unfortunately, NeurIPS does not permit additional images in the rebuttal, so I could not evaluate results on complex characters or SVG scalability. Nevertheless, given the practical value of this work and its usefulness to the design industry, I believe the paper is acceptable. Therefore, I maintain my score as Accept.

Comment

We greatly appreciate your acknowledgment of our work, and will provide additional visualizations for complex characters, SVG scalability, and other relevant aspects, as well as additional evaluations in our revision. Thank you again for your thoughtful comments and for engaging in the discussion.

Comment

Dear Reviewer,

Could you please check if the authors’ rebuttal adequately addresses your concerns? If so, kindly acknowledge the rebuttal and provide any additional comments. If not, it would be greatly appreciated if you could engage in a discussion with the authors. Your input at this stage is essential to the review process. Thank you very much for your time and effort!

AC

Comment

Dear Reviewer,

Thank you for your valuable feedback on our paper "OmniSVG: A Unified Scalable Vector Graphics Generation Model". We hope our responses have addressed your concerns to your satisfaction. If you have any further concerns, please let us know during the discussion session.

Thank you again for your valuable time and effort!

Best regards,

All authors

Review (Rating: 4)

The paper proposes a generative model for SVG that supports texts, images, or character reference inputs. The method proposes a parameterization and tokenization for SVG commands and introduces a large-scale SVG dataset for training. Results show better performance than prior methods that support only one of the text-conditioned or image-conditioned tasks.

Strengths and Weaknesses

Strengths

  • The paper proposes a unified condition-to-SVG framework that leverages the strength of pre-trained VLMs.
  • The collected dataset will advance the research in this topic.
  • The paper shows empirical performance gain over prior works.

Weaknesses

  • The main concern is insufficient evaluation. For image-to-SVG task, the pixel-wise reconstruction performance should be reported. Current evaluation metrics also don't reflect how output SVG preserves input character references due to the coarseness of the perceptual metrics used in this paper. Since the dataset contains ground truth SVG sequences, it's also possible to report metrics compared with GT sequences.
  • The paper proposes an autoregressive framework that may suffer from error propagation. How does the method perform for SVG sequence outputs of different lengths, reflecting different complexities of the task? This aspect should also be quantitatively evaluated.
  • The paper follows a standard VLM fine-tuning pipeline dedicated to SVG generation task. Discussions and insights on how such pipeline can be adapted to other inverse graphics tasks or tasks in other domains (or even small scale experiments) will strengthen the paper.

Questions

  • Does the model support multi-turn generation or free-form text inputs, e.g., some editing instructions for a reference SVG image? In other words, is the instruction following capability of the pre-trained VLM lost after the proposed fine-tuning process?

Limitations

Yes.

Final Justification

All my concerns are well addressed in the additional experiments and clarifications in the rebuttal. I found this work to have solid empirical performance superior to prior works and therefore vote for acceptance.

Formatting Issues

N/A

Author Response

We thank the reviewer for their approval of: 1) our unified multimodal framework that elegantly handles diverse input conditions for SVG generation, 2) the contribution of a large-scale dataset, and 3) the demonstrated empirical superiority with comprehensive performance improvements over existing single-modality approaches.

Please find our point-to-point response to your concerns below.

📝 W1: More quantitative evaluations.

💡 A: Regarding pixel-wise reconstruction metrics, following StarVector, we have reported DINO, SSIM, LPIPS, MSE in the manuscript. Additionally, we also provide PSNR metrics as shown below. We appreciate your suggestion and have also calculated edit distance metrics between the output SVG and ground truth sequences using the method from [1]. To facilitate comparison, we report the normalized edit distance, which is calculated as $\frac{\text{edit distance}(\text{generated}, \text{gt})}{\max(\text{len}(\text{generated}), \text{len}(\text{gt}))}$, where a smaller value indicates greater similarity between the generated SVG sequence and the ground truth (GT) sequence. The following table shows that the normalized edit distance of the OmniSVG method is significantly smaller than that of the baseline, suggesting that the SVG sequences generated by our method are much closer to the ground truth (GT) sequences.

| Method | DINO↑ | SSIM↑ | LPIPS↓ | MSE↓ | PSNR↑ | edit distance (normalized)↓ |
|---|---|---|---|---|---|---|
| LIVE | 0.960 | 0.979 | 0.034 | 0.004 | 26.12 | 0.953 |
| DiffVG | 0.951 | 0.956 | 0.056 | 0.015 | 20.52 | 0.958 |
| StarVector | 0.977 | 0.944 | 0.044 | 0.012 | 21.82 | 0.840 |
| OmniSVG | 0.980 | 0.954 | 0.049 | 0.011 | 22.78 | 0.712 |

[1] https://github.com/roy-ht/editdistance
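For concreteness, here is a minimal sketch of the normalized edit distance described above, assuming SVG outputs are compared as token sequences and using the editdistance package cited in [1]; the token lists in the example are hypothetical, not data from the paper.

```python
import editdistance  # the package referenced in [1]

def normalized_edit_distance(generated_tokens, gt_tokens):
    """Edit distance divided by the longer sequence length; smaller means closer to GT."""
    if not generated_tokens and not gt_tokens:
        return 0.0
    dist = editdistance.eval(generated_tokens, gt_tokens)
    return dist / max(len(generated_tokens), len(gt_tokens))

# Hypothetical SVG command/coordinate token sequences:
gen = ["M", "10,10", "L", "50,10", "C", "60,20", "70,30", "80,40", "Z"]
gt  = ["M", "10,10", "L", "50,12", "C", "60,20", "70,30", "80,40", "Z"]
print(normalized_edit_distance(gen, gt))  # 1 differing token out of 9 -> ~0.111
```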

📝 W2: Clarification of the potential error accumulation.

💡 A: We fully acknowledge the potential error accumulation issue inherent in autoregressive frameworks and have conducted systematic quantitative evaluations on SVG sequences of varying lengths. We divided the test set into four ranges based on token length (0-256, 256-512, 512-1024, 1024-2048) and comprehensively evaluated the performance using six metrics: DINO similarity, MSE, PSNR, SSIM, LPIPS, and text similarity, comparing OmniSVG with the current state-of-the-art baseline StarVector. Note that we calculate the average token length (# Tokens) of a generated SVG sample utilizing the Qwen2.5-VL tokenizer.

The experimental results show that as the sequence length increases, all metrics decline, with DINO similarity gradually decreasing. However, this degradation is relatively gradual. Compared to the baseline method, StarVector, which also uses an autoregressive architecture, OmniSVG performs better across all sequence lengths and metrics. For the longest sequences, OmniSVG achieves significantly higher performance in both DINO similarity and text similarity, demonstrating its superior capability in generating high-quality outputs. Thanks to our parameterization design and simplified representation, error propagation in the autoregressive model is not significantly noticeable in our approach.

These results indicate that although the inherent error propagation characteristics of autoregressive frameworks are unavoidable, our method effectively mitigates quality degradation in long sequence generation through stronger representation learning and generation capabilities, demonstrating better robustness when handling complex SVG tasks.

| # Tokens | Method | DINO↑ | MSE↓ | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|---|
| 0-256 | OmniSVG | 0.985 | 0.005 | 25.45 | 0.977 | 0.021 |
| 0-256 | StarVector | 0.979 | 0.007 | 24.36 | 0.964 | 0.029 |
| 256-512 | OmniSVG | 0.980 | 0.008 | 23.96 | 0.961 | 0.037 |
| 256-512 | StarVector | 0.971 | 0.011 | 23.01 | 0.958 | 0.044 |
| 512-1024 | OmniSVG | 0.977 | 0.012 | 22.73 | 0.953 | 0.049 |
| 512-1024 | StarVector | 0.968 | 0.014 | 22.09 | 0.947 | 0.052 |
| 1024-2048 | OmniSVG | 0.968 | 0.013 | 22.02 | 0.946 | 0.053 |
| 1024-2048 | StarVector | 0.962 | 0.021 | 19.11 | 0.939 | 0.059 |

📝 W3: More discussion on the VLM fine-tuning pipeline.

💡 A: The OmniSVG approach based on pre-trained VLMs can indeed be adapted to other inverse graphics tasks, particularly demonstrating significant application value in unstructured graphical representation domains such as CAD and 3D mesh generation. In CAD generation, CAD-MLLM [2] and CAD-GPT [3] have demonstrated the feasibility of utilizing multimodal large language models to generate CAD command sequences, where CAD-MLLM supports diverse inputs including text, images, and point clouds, while CAD-GPT enhances spatial reasoning capabilities to achieve precise synthesis from single-view images or text. In 3D mesh generation, AutoSDF [4] and MeshGPT [5] learn discrete tokens and reconstruct 3D representations using VQVAE models, PolyGen [6] employs two decoder Transformers to sequentially predict vertex positions and connectivity, and MeshXL [7] explores explicit sequence modeling approaches for high-fidelity 3D mesh generation. The core advantage of these methods lies in fully leveraging the powerful representation capabilities of pre-trained VLMs, enabling easy extension to these tasks. As shown in Table 5, competitive performance can be achieved through simple task-specific fine-tuning, while acceleration strategies for VLMs make processing complex graphical representations feasible. More importantly, the successful deployment of industrial systems such as Hunyuan3D-PolyGen [8] demonstrates the scalability and practicality of pre-trained model-based approaches in real-world applications. This unified framework not only avoids the high cost of training specialized models from scratch for each inverse graphics task, but also facilitates knowledge transfer between different graphical representations through shared visual-language understanding capabilities.

[2] Xu J, Zhao Z, Wang C, et al. Cad-mllm: Unifying multimodality-conditioned cad generation with mllm. arXiv preprint arXiv:2411.04954, 2024.

[3] Wang S, Chen C, Le X, et al. Cad-gpt: Synthesizing cad construction sequence with spatial reasoning-enhanced multimodal llms. AAAI Conf Artif Intell, 2025, 39(8): 7880-7888.

[4] Mittal P, Cheng Y C, Singh M, et al. Autosdf: Shape priors for 3d completion, reconstruction and generation. IEEE/CVF CVPR, 2022: 306-315.

[5] Siddiqui Y, Alliegro A, Artemov A, et al. Meshgpt: Generating triangle meshes with decoder-only transformers. IEEE/CVF CVPR, 2024: 19615-19625.

[6] Nash C, Ganin Y, Eslami S M A, et al. Polygen: An autoregressive generative model of 3d meshes. ICML, PMLR, 2020: 7220-7229.

[7] Chen S, Chen X, Pang A, et al. Meshxl: Neural coordinate field for generative 3d foundation models. NeurIPS, 2024, 37: 97141-97166.

[8] Weng H, Zhao Z, Lei B, et al. Scaling mesh generation via compressive tokenization[C]//Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 11093-11103.

📝 Q1: The capability of the pre-trained VLM is not lost after the proposed fine-tuning process.

💡 A: Since our training supervision focuses exclusively on SVG generation, the current OmniSVG model weights cannot directly perform VLM-related tasks such as QA, Captioning, or Grounding. However, OmniSVG demonstrates strong instruction following capabilities in text-to-SVG and image-to-SVG tasks, generating SVG tokens that comply with the given language or image conditions.

Importantly, our model architecture preserves the core architecture of Qwen2.5-VL and only extends it through vocabulary expansion during training. Therefore, it can seamlessly support co-training of SVG generation tasks with VLM tasks to retain the corresponding VLM capabilities. Similarly, other research utilizing VLMs for various tasks (such as RT-2 [9]) has shown that this co-training approach can enhance model generalization on high-level instructions and even improve instruction following capabilities through chain-of-thought reasoning and multi-turn dialogue.

As this paper focuses on SVG generation subtasks to enable better comparison with previous work, we currently concentrate on these specific tasks. We believe that this co-training paradigm represents a highly valuable direction for future exploration, and we will continue to investigate this approach.

[9] Zitkovich B, Yu T, Xu S, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control[C]//Conference on Robot Learning. PMLR, 2023: 2165-2183.

Comment

Dear reviewer,

We sincerely appreciate your constructive feedback on our paper, "OmniSVG: A Unified Scalable Vector Graphics Generation Model." As the discussion session is drawing to a close, we want to express our gratitude for the valuable suggestions you've provided. We hope that our response has addressed your concerns. If you have any further questions or concerns, please do not hesitate to let us know.

We deeply appreciate your valuable time and effort.

Best regards,

All authors

Comment

Dear authors,

I appreciate the efforts to include additional evaluation results and clarifications in the rebuttal which provided additional evidence for the empirical performance of the method. I have read the rebuttal and other reviewers' comments. All my concerns have been addressed, and I will update my score to vote for acceptance.

Comment

Dear reviewer,

Thank you for your recognition of our work and for your positive feedback. We greatly appreciate your time and valuable comments.

Best regards,

All authors

Comment

Dear Reviewer,

Could you please check if the authors’ rebuttal adequately addresses your concerns? If so, kindly acknowledge the rebuttal and provide any additional comments. If not, it would be greatly appreciated if you could engage in a discussion with the authors. Your input at this stage is essential to the review process. Thank you very much for your time and effort!

AC

Comment

Dear Reviewer,

Thank you for your valuable feedback on our paper "OmniSVG: A Unified Scalable Vector Graphics Generation Model". We hope our responses have addressed your concerns to your satisfaction. If you have any further concerns, please let us know during the discussion session.

Thank you again for your valuable time and effort!

Best regards,

All authors

Review (Rating: 4)

This work introduces two components:

  1. MMSVG-2M, a preprocessed dataset of vector graphics, in which the authors provide caption-SVG pairs, illustration-SVG pairs, and character-SVG pairs (generated via VTracer).
  2. The authors introduce a tokenization scheme to facilitate SVG generation, and finetune Qwen 2.5 to perform SVG generation

The authors demonstrate state-of-the-art results on SVG generation.

Strengths and Weaknesses

Strengths:

  1. OmniSVG handles complex SVGs up to 30k tokens, surpassing previous methods.
  2. This method supports text, image, and character-reference inputs.
  3. The dataset itself (if made public) MMSVG-2M would be a huge contribution, this would be a super diverse dataset for SVG research.
  4. The method outperforms baselines in quantitative and qualitative evaluations.

Weaknesses:

  1. Generating complex SVGs can be time-consuming. It isn't clear if this can be further optimized with some kind of multi-resolution implementation (test-time scaling) or some improved tokenizer
  2. Writing generally could be improved.

Generally I like the paper, if the authors can provide additional clarity around the method I would be happy to take another look.

Questions

  1. In the supplemental, Table 7: "You are a helpful SVG Generation assistant, designed to generate SVG. We provide two image as input, the second image is the character reference image of the first image, generate the character reference SVG based on these two input images." -- Is this prompt incorrect? What are the two images as input? I was under the impression that there was only one image as input in the character generation case.

  2. VTracer should really be mentioned more prominently. You utilize this tool for SVG generation of the characters, and this is only mentioned in Figure 2 of the main text.

  3. Could the authors describe their tokenizer in more detail? I am particularly interested in how the point coordinates are preprocessed and tokenized. Are the coordinates normalized to image width/height? Are the images assumed to have equal width/height? Are the coordinates quantized before tokenization?

  4. Could the authors describe their token output format in more detail? Do the authors utilize some kind of constrained sampling to ensure that the output always follows a specific grammar?

  5. Could the authors describe the difference between the icon & illustration subsets of the dataset in terms of how they are used? Are they both used for text-to-SVG or illustration-to-SVG?

Limitations

  1. The reliance on synthetic captions seems to be a large constraint. Due to this, more niche characters or world knowledge may be lost in the SVG generation pipeline.
  2. The character to SVG task is not super meaningful in my view. Since the data is derived synthetically anyways.
  3. The authors do not demonstrate any multi-round image editing. In my view, this would enable the previous character-to-SVG task to be a meaningful task.

Final Justification

This paper in the current form is not well written. This is not to deny the massive contribution of the dataset, and the successful demonstration that a text-to-svg or image-to-svg model could be trained on such a dataset.

However, the paper as submitted lacks significant details regarding the implementation of the tokenizer and how the authors accomplish different tasks. I think a good paper should not only demonstrate strong technical achievement, but also do a good job of explaining the work to others.

Formatting Issues

N/A

Author Response

We thank the reviewer for recognizing: 1) the ability to handle complex SVGs, 2) the versatility of our multi-modal framework, 3) the substantial contribution of MMSVG-2M dataset, and 4) the superior performance.

We thank the reviewer for the valuable feedback that helps us improve our paper. We will address your concerns and carefully revise the paper accordingly. Please find our point-to-point response to your concerns below.

📝 W1: Time-consuming problem of generating complex SVGs.

💡A: The generation efficiency for complex SVGs is a challenging topic for existing methods. However, thanks to the OmniSVG tokenizer, we are able to generate complex SVGs with only 6,000 OmniSVG tokens, while StarVector would require 30,000 tokens. Since SVG is a resolution-invariant visual representation, we are unable to speed up generation with a multi-resolution implementation. However, modern techniques like multi-token prediction [1], linear attention [2], or even log-linear attention [3] could be a promising way to achieve a much faster generation speed.

[1] Fabian et al. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024.

[2] Katharopoulos et al. Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, pages 5156–5165. PMLR, 2020.

[3] Guo et al. Log-linear attention. arXiv preprint arXiv:2506.04761, 2025.

📝 W2: Writing generally could be improved.

💡A: Thank you for your valuable feedback. We will improve our writing and address your concerns highlighted in the question session with more details in our revision.

📝 Q1: Clarification on the two-image input for character reference SVG generation.

💡A: In the character reference SVG generation task, generating SVGs directly from natural images is challenging. To address this, we introduce a two-stage training approach.

Stage 1: Initially, we input two images into the model: the first is the original natural image, and the second is a rasterized version of a target SVG, serving as a reference in SVG style. This dual-image setup allows the model to leverage the capabilities of a pretrained Image-to-SVG model while incorporating the more complex natural image of the character as an additional input. This helps the model gradually acquire the ability to generate character-specific SVGs.

Stage 2: In the later stage of training, we gradually drop the second (reference) image. This encourages the model to retain its character SVG generation ability while transitioning from an Image-to-SVG paradigm to a more general character-to-SVG generation directly from natural images.

This two-stage training strategy provides better control over style and detail compared to direct Image-to-SVG approaches, especially in scenarios requiring simplified vector graphics. We acknowledge that our original explanation was unclear, and we will elaborate further on the differences between these tasks and the advantages of dual-image input in the revised version.

📝 Q2: Comparison with VTracer.

💡 A: We have also conducted a quantitative comparison with VTracer. Since images cannot be included during the rebuttal period, we report the quantitative comparison results with VTracer. For a fairer comparison, we compared our method with VTracer under two settings: preserving VTracer's performance (with longer tokens) and using the same token count (sacrificing performance). Following our paper, we calculate the average token length (# Tokens) of a generated SVG sample utilizing the Qwen2.5-VL tokenizer. As shown in the results below, our method achieves a trade-off between generation quality and token count (editability) that VTracer cannot achieve. Given the increased SVG tokens, VTracer achieves better generation quality at the cost of editability.

| Method | #Tokens↓ | DINO↑ | SSIM↑ | LPIPS↓ | MSE↓ |
|---|---|---|---|---|---|
| VTracer | 7117 | 0.9833 | 0.985 | 0.0245 | 0.0051 |
| VTracer (same token level) | 2418 | 0.9416 | 0.912 | 0.0672 | 0.0120 |
| OmniSVG | 2144 | 0.9865 | 0.937 | 0.0587 | 0.0123 |

📝 Q3: More details of the SVG tokenizer.

💡 A: Regarding the implementation details of our SVG tokenizer, our method first preprocesses all SVG graphics to a standard 200×200 canvas, ensuring consistent coordinate space across inputs of different sizes. For coordinate quantization, we discretize all coordinate values before tokenization by truncating decimal parts to map floating-point coordinates to the integer domain. In the specific tokenization process, we adopt a coordinate merging strategy, encoding 2D coordinate points into single tokens through the mapping function <x, y> → x × w + y, where w=200 is the unified canvas width. This design reduces SVG sequence length and improves generation efficiency. Additionally, we assign dedicated tokens for six SVG command types {M, L, C, A, Z, F} and use 4096 special tokens to represent color hex values. The entire SVG script is flattened into a single command sequence, where each path begins with a drawing command followed by corresponding coordinate tokens. Finally, the entire SVG sequence is mapped to the same representation space as the pretrained VLM through learnable embedding layers, achieving unified modeling of images and SVG.
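A minimal sketch of the coordinate handling described above, under the stated assumptions (a unified 200×200 canvas, decimal truncation, and the <x, y> → x × w + y merge); this is an illustration only, not the released tokenizer.

```python
# Sketch of the coordinate quantization and merging described above.
CANVAS_W = 200                              # unified canvas width, as stated above
COMMANDS = ["M", "L", "C", "A", "Z", "F"]   # the six command types

def coord_to_token(x: float, y: float, w: int = CANVAS_W) -> int:
    """Truncate decimals, clamp to the canvas, and merge (x, y) into one token id."""
    xi = min(max(int(x), 0), w - 1)
    yi = min(max(int(y), 0), w - 1)
    return xi * w + yi

def token_to_coord(token: int, w: int = CANVAS_W) -> tuple[int, int]:
    """Invert the merge back to an integer (x, y) pair."""
    return divmod(token, w)

print(coord_to_token(12.7, 34.2))  # -> 12 * 200 + 34 = 2434
print(token_to_coord(2434))        # -> (12, 34)
```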

📝 Q4: Details of the token output format.

💡 A: Our token output format strictly follows predefined SVG syntax specifications. Specifically, our tokenizer constrains SVG path representations to 5 fundamental path commands (M, L, C, A, Z) with their corresponding coordinate parameters, as well as the color command F paired with color tokens. This design ensures structured and parsable outputs. During generation, the model's output space is naturally constrained within this predefined token vocabulary, where each token corresponds to specific SVG syntax elements, thereby guaranteeing that generated sequences always conform to valid SVG path syntax. This design not only simplifies the model's learning task and prevents generation of invalid SVG code, but also improves generation efficiency. We demonstrate the application effect of this token format in actual generation processes in detail in the supplementary material 2245_OmniSVG_A_Unified_Scalabl_Supplementary Material/assets/OmniSVG-main-demo-1080.mp4, where it can be observed that each token output by the model strictly adheres to this syntax specification, ensuring the validity and editability of generated results.

📝 Q5: Clarification on the usage of the icon and illustration subsets.

💡 A: Thank you for your question. Our dataset is composed of three subsets, including icon, illustration, and character, each representing a different level of visual complexity and application scenario. All three subsets are used during training for both text-to-SVG and image-to-SVG tasks. Specifically, the icon subset consists primarily of concise icon designs, typically composed of basic geometric shapes with fewer paths (average token count of ~4k using the Qwen2.5-VL tokenizer), making it well-suited for interface and UI design. The illustration subset includes more complex and diverse SVGs at the illustration level, encompassing a wider range of artistic styles and detailed structures, making it suitable for creative and visual art design tasks.

📝L1: Impact of synthetic captions on niche characters and world knowledge in SVG generation.

💡 A: We acknowledge the concern about the potential loss of niche characters and world knowledge in the SVG generation pipeline. For our MMSVG-2M-Character dataset, 50% of the data is from real SVG files collected from the web, preserving diversity and authenticity, while the other 50% is generated using FLUX-based models with vector-style LoRA and vectorized through VTracer. During caption generation, we adopted a generalizable strategy, avoiding over-reliance on proper nouns and instead introducing specific concepts through context to balance accuracy and generalization. However, avoiding world knowledge loss entirely presents challenges. Overemphasizing proper nouns would hurt the model’s generalization, while current captioning techniques still struggle with highly specific or niche knowledge. Despite ensuring data quality through PSNR and SSIM filtering and mixing real and synthetic data, this trade-off remains unavoidable under current technology. We look forward to advancements in semantic understanding and knowledge preservation.

📝L2: Justification of the significance of the character-to-SVG task despite synthetic data.

💡A: The character to SVG task possesses significant practical value in real world applications. In domains such as graphic design, animation production, and brand identity design, designers frequently need to convert character images into editable vector formats. Our method can automatically accomplish this conversion process, substantially improving workflow efficiency.

📝 L3: Clarification on the lack of multi-round image editing and its potential impact on the character-to-SVG task.

💡 A: We fully agree that multi-round image editing holds significant importance in practical applications, and this interactive editing approach can indeed substantially enhance the utility and controllability of SVG generation tasks. In this work, we focused our research on the core parametric representation problem in text-to-SVG and image-to-SVG generation, as high-quality parametric representation forms the foundation for achieving precise editing. Although the Qwen2.5-VL model architecture we adopted inherently supports chain-of-thought (CoT) reasoning and multi-round dialogue capabilities, considering that multi-round editing tasks require maintaining lengthy context information and complex state management, given our current limited computational resources, we prioritized ensuring the quality and stability of single-round generation tasks.

Comment

I'm slightly puzzled by some of the author responses, and will prepare some questions to clarify my understanding and evaluation of their work soon.

Comment

We truly value your insightful opinions and suggestions, and greatly appreciate your time improving our paper. We are very happy to make further clarifications if there are any other concerns.

Comment

So I've re-read the paper again, including the supplemental -- as well as the author responses.

My current evaluation is that the authors have not fully answered my questions. For the current discussion, I will focus on the concrete questions rather than the future work discussion.

For example:

  1. For the character reference generation task -- what exactly is the input to the model? Fig 5 suggests a single image is provided as input (and Figure 5 is very confusing here, using different characters for every stage of the character-to-svg pipeline). While the supplemental suggests that two images are provided. So what is it?

  2. How exactly are the tokens constrained by syntax during the generation process? For example -- for commands requiring three arguments, do the authors perform constrained sampling to guarantee that three arguments will be locations in the image -- and by doing so ensure correct grammar?

  3. I forgot to mention this previously, but there are quite a few grammatical errors in the paper itself -- this does not affect my evaluation of the work, but they should be fixed. For example, Line 3 (development of autonomous SVG generation workflow is..., workflows should be plural), Line 27 (optimizing differentiable vector graphics rasterizers -> differentiable rasterizers are used as part of the optimization process, but they are not directly optimized).

I thank the authors for taking the time to answer my questions.

Comment

Dear Reviewer,

We would like to express our sincere gratitude for the reviewer's thorough reading of our paper and supplemental materials, as well as for the valuable engagement in the discussion. We truly appreciate the time and effort the reviewer has dedicated to reviewing our work. Below, we provide detailed, point-by-point responses to the questions raised by the reviewer.

📝 Q1: Clarification of inputs of the character-reference task.

💡 A: For the character reference task, our model is designed to take a single image as input (the original natural image) and produce the target SVG as output, which we achieve through a two-stage training strategy that leverages Qwen's interleaved multi-modal representation capabilities. 1) In Stage 1, we train the model using two images as input, image-A (the original natural image, which serves as the final model input and corresponds to Figure 5's character reference image (CRef)) and image-B (the SVG-style character reference image of image-A, as described in Supplementary Material A.3 Character-SVG Pairs Construction), where the model learns to generate the SVG corresponding to image-B; 2) in Stage 2, we drop image-B from the input and train the model to generate the same SVG output using only image-A, which matches our desired inference pattern; 3) during inference, the model takes only image-A as input and generates the corresponding SVG output. This staged approach is essential because directly training with only the single-image configuration (Stage 2) prevents the model from successfully generating SVG outputs from natural images, but by first learning with the additional style reference (image-B) and then adapting to single-image input, the model develops the capability to transform natural images into SVG representations effectively.

📝 Q2: Clarification of the sampling and output token pattern.

💡 A: We did not use constrained sampling. Instead, we allow the model to learn valid SVG output patterns autonomously. During decoding, we skip commands and points that cannot be validly decoded. This issue occurs more frequently in the early stages of training, but the decoding success rate improves significantly as training progresses.
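To make the skip-invalid decoding above concrete, here is a hedged sketch (not the authors' implementation); the per-command argument counts and token values are illustrative assumptions, not the paper's exact specification.

```python
# Hedged sketch of lenient decoding: malformed command groups are dropped
# instead of being prevented by constrained sampling.
ARG_COUNTS = {"M": 1, "L": 1, "C": 3, "A": 3, "Z": 0, "F": 1}  # assumed arg counts

def decode_tokens(tokens):
    """Keep only (command, args) groups whose arguments are complete integer token ids."""
    groups, i = [], 0
    while i < len(tokens):
        cmd = tokens[i]
        if cmd not in ARG_COUNTS:            # unknown token: skip it
            i += 1
            continue
        need = ARG_COUNTS[cmd]
        args = tokens[i + 1 : i + 1 + need]
        if len(args) == need and all(isinstance(a, int) for a in args):
            groups.append((cmd, args))       # well-formed group: keep
            i += 1 + need
        else:
            i += 1                           # incomplete group: drop the command
    return groups

# The stray "L" (missing its coordinate) is skipped; the rest is kept.
print(decode_tokens(["M", 2434, "L", "C", 2634, 2834, 3034, "Z"]))
```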

📝 Q3: Fixing the grammatical errors.

💡 A: We sincerely appreciate the reviewer's careful reading of our manuscript and thank the reviewer for pointing out these grammatical errors. We will thoroughly check and correct all these issues in the revised version, and we are grateful for your time and attention to these details.

We hope our responses have addressed your concerns to your satisfaction. If you have any further concerns, please do not hesitate to let us know.

Thank you again for your valuable time and effort!

Best regards,

All authors

Comment

I have updated my rating to weak accept (borderline accept).

This rating is based on the technical contribution -- I am not providing a higher score since I think "significant" revisions to the text are needed to improve the paper quality, including methodological clarity, figure clarity, and writing.

Comment

Thank you very much for your careful evaluation and for updating your rating to weak accept. We greatly appreciate your acknowledgment of the technical contributions of our work.

We understand your concerns regarding the clarity of the methodology, figure quality, and overall writing. We take your suggestions seriously and are committed to making revisions accordingly to improve the paper's presentation and clarity in the future version.

Thank you again for your thoughtful comments and for engaging in the discussion.

Review (Rating: 5)

The paper proposes OmniSVG, a framework for autoregressive SVG generation by leveraging pre-trained VLMs.

SVGs, which are composed of multiple paths and curves and decomposed into layers, are tokenized into sequences where type identifiers are represented by characters, continuous coordinates are discretized, and special tokens are added to indicate the start and end of the SVG sequence. These tokens are then projected into the same embedding space as the pre-trained VLM through a learnable embedding layer. Text or image conditions are tokenized with the frozen VLM tokenizer and used as prefix tokens.

The pre-trained VLM is finetuned to produce SVG tokens autoregressively.

The authors introduce a large annotated SVG dataset containing 2 million samples, divided into three subsets based on SVG complexity, with three benchmark tasks. Comprehensive quantitative evaluations and qualitative comparisons on each task demonstrate that the proposed method outperforms prior optimization-based, auto-regressive, and diffusion-based methods in both quality and efficiency.

Strengths and Weaknesses

Strengths:

(+1) The authors propose a simple yet effective technique to re-purpose VLMs for SVG generation. Ablations show that their choice of SVG parameterization is important for compatibility with pre-trained VLMs.

(+2) Evaluation of the system is thorough. Qualitative results in the main paper and supplemental demonstrate that the next-token prediction framework is capable of modeling complex SVG shapes.

(+3) The quantitative metrics are equally comprehensive.

(+4) The authors contribute a valuable dataset.

Weakness:

(-1) While the evaluation methods demonstrate OmniSVG’s improved performance over other learning-based methods, it is less conclusive when compared with optimization-based methods. Optimization-based methods that minimize a reconstruction objective with geometric regularizers like DiffVG and LIVE are able to produce shapes as complex as OmniSVG for the task of image-to-SVG in comparable time. Qualitative and quantitative evaluations corroborate this. Optimization-based methods have the advantage of requiring no training data. While OmniSVG uses fewer tokens, evaluations should be done to show that OmniSVG produces SVG that model the distribution of curves and paths and layer decomposition desired by users, attributes that optimization-based methods cannot capture. This evaluation can be done qualitatively, where the control points and handles of the primitives and parametric curves can be shown, including their layer decomposition.

(-2) DiffVG and LIVE can be combined with a text-to-image model, making it a strong baseline for text-to-SVG. Currently, this experiment is missing.

(-3) The SVG tokenization is heavily borrowed from IconShop, with the additional token for color fill (a trivial introduction). Fig 6 (ablation) shows that this choice of parameterization is crucial, despite using the same VLM under-the-hood. To make a fair comparison to IconShop and justify the fine-tuning of a VLM and overall contribution of the paper, training an auto-regressive transformer of comparable size from scratch (like in IconShop) with the additional fill color token on the proposed MMSVG-2M dataset is an important baseline. Alternatively, justifying the importance of the VLM, like its generalization capabilities, is needed.

(-4) Failure case shown in Fig 8 is one most image-to-vector methods will suffer from as the input image contains signals that standard vector primitives with solid fill attribute cannot model. Instead, it would be more meaningful to test image-to-SVG on vector shapes with complex geometry – more complex than characters – and containing complex appearance like linear or freeform gradients, ones that can be modelled in common design software. Text-to-image models can provide good candidate images.

(-5) I don’t agree with the statement that OmniSVG decouples structural logic from low-level geometry. The tokenized SVG, like the original XML file, is imperative and contains the same information – curve type, coordinate. However, the chosen parameterization is more compact and I suspect this might be the reason why it is more suitable for VLMs in practice.

Questions

  • Is the reason the number of tokens used in MMSVG-Character is 3x that of MMSVG-Illustration that the former contains fewer artist-curated SVGs and more VTracer SVGs, where redundant and over-parameterized curves can be present? The training samples shown in Fig 2 do not explain why more tokens are needed for characters as they appear to have the same (visual) complexity as samples from illustrations.

  • Is the embedding layer just a linear layer? Details about its implementation are missing in the paper.

  • Example questions used in the user study and the task given to the participants should be shown in the paper (or supplemental).

Limitations

The authors have adequately addressed the limitations, namely (1) the wall-clock time during inference, and (2) performance on out-of-distribution raster images for the image-to-SVG task.

Final Justification

The authors have resolved all my concerns and I am satisfied with the rebuttal.

After reading all the other reviews and their respective rebuttal, my final rating will be accept.

Formatting Issues

Minor: the main paper hyper-references various sections in the Appendix but the Appendix was provided in the supplemental.

Author Response

We sincerely thank the reviewer for recognizing our key contributions: 1) an innovative approach adapting pre-trained VLMs for SVG generation, 2) comprehensive experimental validation, 3) a substantial dataset of 2 million annotated SVG samples, and 4) ablation studies validating our SVG parameterization choices.

Please find our point-to-point response to your concerns below.

📝 W1: More analyses about the comparisons with DiffVG and LIVE.

💡 A: We provide comprehensive quantitative comparisons with optimization-based methods including LIVE and DiffVG in Table 3 of our paper and Table 8 in the supplementary material. For convenience, we report the results corresponding to OmniSVG, LIVE, and DiffVG in the following table. While LIVE achieves better scores on pixel-level reconstruction metrics (SSIM, LPIPS, MSE), OmniSVG significantly outperforms LIVE on DINO semantic similarity, indicating that our method better captures high-level semantic features of input images.

| Dataset | Method | DINO↑ | SSIM↑ | LPIPS↓ | MSE↓ | #Tokens |
|---|---|---|---|---|---|---|
| MMSVG-Icon | LIVE | 0.960 | 0.979 | 0.034 | 0.004 | 18.2k |
| MMSVG-Icon | DiffVG | 0.951 | 0.956 | 0.056 | 0.015 | 19.8k |
| MMSVG-Icon | OmniSVG | 0.980 | 0.954 | 0.049 | 0.011 | 3.8k |
| MMSVG-Illustration | LIVE | 0.959 | 0.960 | 0.044 | 0.011 | 18.2k |
| MMSVG-Illustration | DiffVG | 0.929 | 0.930 | 0.077 | 0.021 | 19.8k |
| MMSVG-Illustration | OmniSVG | 0.974 | 0.944 | 0.069 | 0.019 | 9.7k |
| MMSVG-Character | LIVE | 0.911 | 0.939 | 0.109 | 0.038 | 18.2k |
| MMSVG-Character | DiffVG | 0.876 | 0.915 | 0.163 | 0.051 | 19.8k |
| MMSVG-Character | OmniSVG | 0.921 | 0.917 | 0.049 | 0.021 | 30.8k |

OmniSVG generates SVGs with much fewer tokens than optimization-based methods (LIVE, DiffVG), reflecting more concise and well-structured SVG generation. This conciseness benefits SVG users by enabling easier understanding and manipulation of control points, Bézier curves, and layer structures. In contrast, while optimization-based methods achieve high pixel-level accuracy, they often produce redundant and irregular path structures that, though visually similar, are harder for designers to comprehend and edit.

Our method learns from real SVG data distributions, generating vector graphics that align with human creation patterns, a key advantage that pure optimization methods cannot capture. Due to rebuttal constraints preventing image displays, we will visualize in the revised version how OmniSVG generates more concise and editable SVG XML files for the same input images.

📝 W2: Text-to-SVG task comparisons with DiffVG and LIVE.

💡 A: We appreciate the reviewer's suggestion to combine DiffVG and LIVE with text-to-image models as text-to-SVG baselines. In response, we conducted experiments using GPT-4o to convert text prompts into images, which were then processed using DiffVG and LIVE for image-to-SVG tasks, establishing text-to-SVG baselines for comparison. Our results on the MMSVG-Icon test set show that OmniSVG outperforms the baseline methods across multiple metrics, including FID and Aesthetic scores, and also performs well on the HPS metric. While the CLIP scores are similar, OmniSVG achieves this with significantly fewer tokens, demonstrating greater efficiency in generating high-quality SVGs.

| Method | FID↓ | CLIP↑ | Aesthetic↑ | HPS↑ | #Tokens |
|---|---|---|---|---|---|
| LIVE | 88.83 | 0.3322 | 4.88 | 0.234 | 18.3k |
| DiffVG | 93.23 | 0.3209 | 4.31 | 0.201 | 19.8k |
| OmniSVG | 64.83 | 0.3194 | 5.49 | 0.247 | 3.8k |

These quantitative results comprehensively validate the advantages of OmniSVG over existing methods combined with text-to-image models for text-to-SVG tasks. We will supplement the revised version with corresponding qualitative results.

📝 W3: Fair comparison with IconShop and justification of fine-tuning the VLM.

💡 A: We appreciate the reviewer's suggestion. We extended the IconShop baseline by adding color fill and trained it using IconShop's transformer framework. To validate the importance of VLM, we trained models from scratch by randomly initializing the weights instead of using pretrained VLM weights. Since Qwen2.5-VL-3B-Instruct contains 4B parameters, for fair comparison, we also trained a model from scratch with the same parameter count as IconShop. This comparison primarily evaluates performance differences across different architectures. The results show that the 4B model from scratch performs significantly worse than smaller models, indicating that larger models are more challenging to train, while pretrained models like Qwen provide effective initialization.

| Method | FID↓ | CLIP↑ | Aesthetic↑ | HPS↑ |
|---|---|---|---|---|
| IconShop (+fill) | 77.65 | 0.2810 | 4.47 | 0.235 |
| OmniSVG (from scratch, 4B) | 84.12 | 0.2727 | 4.18 | 0.228 |
| OmniSVG (from scratch, tiny) | 74.29 | 0.2931 | 4.68 | 0.242 |
| OmniSVG | 64.83 | 0.3194 | 5.49 | 0.247 |

The results demonstrate that leveraging pretrained VLM significantly improves generation quality across all metrics, validating the importance of VLM's generalization capabilities for SVG generation tasks.

📝 W4: Failure cases analyses.

💡 A: Thank you for highlighting this limitation. We agree that complex appearance attributes, such as gradients, present significant challenges for current methods. As we cannot upload images during the rebuttal period, we will include additional failure cases in the revised version, specifically demonstrating where our method struggles with SVGs containing linear and freeform gradients. The limitation arises from the inherent complexity of modeling gradients in vector graphics, which requires handling multiple parameters like color stops, interpolation methods, and gradient directions. Our current tokenizer focuses on geometric shapes and solid color fills, and has not yet fully addressed these complex attributes. This is a common bottleneck in deep learning-based SVG generation methods, as the high-dimensional and nonlinear nature of gradient parameters makes end-to-end training difficult. Despite this, our method performs excellently with vector graphics that involve complex shapes and solid color fills, generating highly detailed SVG outputs. This forms a strong foundation for future extensions to handle more complex appearance modeling.

📝 W5: Clarification of the statement.

💡 A: Thank you for your insightful observation. We agree that the statement "OmniSVG decouples structural logic from low-level geometry" is inaccurate and will remove it in the revised version. The core contribution of the OmniSVG tokenizer is to establish a unique and canonical representation for SVG graphics. As the reviewer correctly points out, the tokenized SVG retains the original structure and geometric information but achieves uniqueness through a systematic parameterization scheme. This is essential for VLMs, as multiple representations of the same graphic create ambiguity during learning. For example, a rounded rectangle can be represented either as <rect width="200" height="100" x="10" y="10" rx="20" ry="20" fill="blue" /> or as a path command. Such diversity makes it hard for models to learn essential features. OmniSVG eliminates this ambiguity through standardized tokenization, ensuring identical visual elements always correspond to the same token sequence, enhancing the efficiency and accuracy of VLMs in understanding and generating SVG.

📝 Q1: Clarification of the tokens of MMSVG dataset.

💡 A: The 3x increase in token usage for MMSVG-Character compared to MMSVG-Illustration is due to differences in image complexity, not just redundancy from VTracer conversion. Although we removed complex SVG structures using the picosvg tool during preprocessing, character images inherently require more Bézier curves to capture fine details like facial features, clothing textures, and hair, which contrasts with the simpler illustrations featuring large color blocks and concise lines. While Figure 2 shows similar visual complexity, Figure 9 in the supplementary materials highlights that the Character dataset has a higher overall complexity, with more high-complexity images than the Illustration dataset. Additionally, VTracer generates denser path points for character images, which is necessary for accurate reconstruction. The average token count reflects these inherent differences in SVG complexity between the two datasets.

📝 Q2: Details of the embedding layer.

💡 A: Yes, the embedding layer is a linear layer. The implementation is as follows:

torch.nn.Embedding(vocab_size, embed_size)

For OmniSVG(4B), the vocab_size is 196,042 and embed_size is 2,048. For OmniSVG-L(8B), the vocab_size is 200,128 and embed_size is 3,584.
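A minimal sketch instantiating the embedding layer with the sizes reported above (the vocab and embedding sizes come from this rebuttal; the rest, including the token ids, is illustrative and not the released code).

```python
import torch

# SVG token embedding for OmniSVG (4B): vocab 196,042, hidden size 2,048.
svg_embed_4b = torch.nn.Embedding(num_embeddings=196_042, embedding_dim=2_048)
# SVG token embedding for OmniSVG-L (8B): vocab 200,128, hidden size 3,584.
svg_embed_8b = torch.nn.Embedding(num_embeddings=200_128, embedding_dim=3_584)

token_ids = torch.tensor([[5, 2434, 17, 40_001]])  # hypothetical SVG token ids
print(svg_embed_4b(token_ids).shape)               # torch.Size([1, 4, 2048])
```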

📝 Q3: Details of user study.

💡 A: We will include the complete user study questionnaire in the supplementary materials of the revised version. Due to space limitations, detailed scoring criteria will be provided in the revised version. Below are some example questions.

Instructions: Participants evaluated SVG outputs from 5 methods based on three criteria: Preference (overall quality), Vividness (visual appeal), and Alignment (consistency with input), rating each on a scale of 0 to 100 (0 = worst, 100 = best).

Part 1: Text-to-SVG Evaluation

Sample Format: Text Prompt: "A cute cartoon cat sitting on a chair"

Generated SVGs from Methods A through E are presented. Participants rate each method using the following table:

| Method | Preference (0-100) | Vividness (0-100) | Alignment (0-100) |
|---|---|---|---|
| Method A | | | |
| Method B | | | |
| Method C | | | |
| Method D | | | |
| Method E | | | |

Part 2: Image-to-SVG Evaluation

Sample Format: A source raster image is provided, followed by SVG outputs from Methods A through E. Participants use the same rating table structure as in Part 1 to evaluate each method's performance.

Comment

Dear authors,

Thank you for the detailed response and I apologize for the delay.

I appreciate the additional comparison with IconShop, given the small turnaround time. I recommend adding these in the main paper as the results are insightful.

The table showing "Text-to-SVG task comparisons with DiffVG and LIVE" has surprising results. As these are raster-based metrics, and given GPT-4o has been trained on ample (rasterized) vector graphics, I would expect LIVE and DiffVG to be quite competitive. Were the text prompts used the same as the ones in the MMSVG-Icon test set? Did these prompts require additional prompt engineering to ensure GPT-4o produces vector-like images? The answer to this does not affect my overall view of the paper.

Otherwise, I am satisfied with the rebuttal. After reading all the other reviews and their respective rebuttal, I will keep my original score of borderline accept.

Comment

Dear reviewer,

Thank you very much for your positive feedback and for taking the time to review both the paper and the rebuttal. We truly appreciate your thoughtful evaluation and the insights you've provided.

We are grateful for your appreciation of our additional comparison with IconShop, and we will include this as an additional ablation experiment to demonstrate the importance of VLM pretraining in SVG generation in the revised version.

  • Were the text prompts the same as those in the MMSVG-Icon test set? Did these prompts require additional prompt engineering to ensure GPT-4o produces vector-like images?

To clarify the process, we follow a text -> GPT-4o -> DiffVG -> SVG workflow. In this process, we start by providing GPT-4o with a prompt such as, “Generate an image in vector style based on the text ‘Cloud icon with an upward arrow symbolizes uploading or cloud storage,’” where the text is the same as the test set of MMSVG-Icon. The resulting image is then passed through LIVE/DiffVG for optimization, which further refines the image and converts it into an SVG format.

Once again, thank you for your positive feedback and thoughtful review. We appreciate your time and consideration.

Comment

Dear Reviewer,

Could you please check if the authors’ rebuttal adequately addresses your concerns? If so, kindly acknowledge the rebuttal and provide any additional comments. If not, it would be greatly appreciated if you could engage in a discussion with the authors. Your input at this stage is essential to the review process. Thank you very much for your time and effort!

AC

Comment

Dear Reviewer,

Thank you for your valuable feedback on our paper "OmniSVG: A Unified Scalable Vector Graphics Generation Model". We hope our responses have addressed your concerns to your satisfaction. If you have any further concerns, please let us know during the discussion session.

Thank you again for your valuable time and effort!

Best regards,

All authors

Comment

Dear Reviewer,

According to this year's NeurIPS review policy, "Reviewers must participate in discussions with authors before submitting a Mandatory Acknowledgement", could you please provide additional comments discussing whether the rebuttal addresses your concerns?

Thank you.

AC

Final Decision

This work presents a large-scale dataset for vector graphics and proposes a unified framework for generating SVGs under multiple conditions. The authors also introduce an SVG tokenization method that reduces redundancy and improves efficiency. They report strong performance gains in both quantitative and qualitative evaluations.

Reviewers agree on the following strengths:

  • The proposed dataset is valuable for future research.
  • The approach supports multiple conditions, including text, image, and character-reference inputs.
  • Strong empirical performance.

Some concerns regarding limited evaluation were raised, but the authors provided additional results that addressed these issues. One notable weakness is the writing quality. The authors need to improve clarity not only in the text but also in the presentation of images, tables, and experimental settings.

Considering all the points above, the AC recommends a borderline accept.