Vector Grimoire: Codebook-based Shape Generation under Raster Image Supervision
We propose a text-guided SVG generative model that creates vector graphics from natural language descriptions, learning from raster images without direct SVG supervision.
Abstract
Reviews and Discussion
This work focuses on SVG generation. The model consists of two modules: a visual shape quantizer that learns to map raster images onto a discrete codebook by reconstructing them as vector shapes, and an auto-regressive Transformer that jointly learns the distribution over shape tokens, positions, and textual descriptions.
Questions for Authors
Does the extension to support stroke width and color prediction mentioned in Sec 5.3 require re-training of the two proposed modules?
Claims and Evidence
There is no comparison with IconShop on FIGR-8. This work does not use ground-truth paths as supervision, while IconShop does. Without such a direct comparison, it is impossible to identify the contribution of the proposed work: a) it achieves better performance than the method that uses ground-truth paths, in which case the contribution is very significant, i.e., changing the traditional paradigm of SVG generation; or b) it does not achieve better performance than the method that uses ground-truth paths, in which case the contribution lies simply in avoiding the use of ground-truth paths.
Methods and Evaluation Criteria
There is no qualitative comparison between baselines and the proposed method. The quantitative comparison makes sense.
Theoretical Claims
There are no theoretical claims.
Experimental Design and Analysis
In the comparison to the SDS-based methods, the chosen baselines are not the most advanced ones [1][2][3], and the results of the SDS-based methods look much worse than those reported in the original papers. I suggest the authors provide more explanation for this.
[1] https://arxiv.org/pdf/2312.16476
[2] https://arxiv.org/pdf/2405.10317
[3] https://arxiv.org/pdf/2411.16602 (optional as it is too close to the submission deadline)
Supplementary Material
I checked the post-processing part.
Relation to Prior Literature
It introduces a new paradigm for training SVG generation models that does not rely on SVG data (which is needed in large quantities for training) and is not SDS-based (which can be very slow). The idea of splitting the image into patches in the VSQ stage is smart.
Essential References Not Discussed
The advanced SDS-based methods are not discussed [1][2][3].
[1] https://arxiv.org/pdf/2312.16476
[2] https://arxiv.org/pdf/2405.10317
[3] https://arxiv.org/pdf/2411.16602 (optional as it is too close to the submission deadline)
Other Strengths and Weaknesses
No additional comments.
Other Comments or Suggestions
No additional comments.
We sincerely appreciate the reviewer's time and thoughtful feedback on our manuscript. Given the time constraints of this rebuttal, we have focused on addressing the major concerns as follows.
Vector-based baselines
We have extended our analysis to two vector-supervised methods – DeepSVG and IconShop – training them on the same FIGR-8 data used for Grimoire. Unlike Grimoire, DeepSVG supports conditioning only on class identifiers, therefore we assigned a unique identifier to each class in FIGR-8.
Following the suggestion of reviewer f1Fu, we have also fine-tuned Llama 3.2 on FIGR-8 with minimal data preprocessing. We believe this is an insightful analysis, showing that tailored tokenization pipelines and extensive data preprocessing are necessary for other vector-supervised models to perform effectively.
Although raster and vector data provide very different supervision signals, we believe this analysis ultimately helps better position our method.
Llama. We fine-tuned Llama (instruction tuning) for three days on eight H100 GPUs. Minimal preprocessing included rounding the path coordinates to integer values; upon inspection, this did not affect image quality. We used the original chat template and included special tokens to delimit the SVG code. Performance at inference is very poor: the model predicts the most recurrent patterns in the dataset, resulting mainly in circular artifacts. The SVG syntax is, however, correct most of the time, allowing rendering.
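As an illustration, here is a minimal sketch of the kind of coordinate rounding described above (the regex-based approach is an assumption; the actual preprocessing script may differ):

```python
import re

def round_path_coords(svg_code: str) -> str:
    """Round every floating-point number in SVG code to the nearest integer.

    Simplified sketch: it does not handle exponent notation or numbers
    written without a fractional part, which need no rounding anyway.
    """
    return re.sub(r"-?\d+\.\d+",
                  lambda m: str(round(float(m.group()))),
                  svg_code)

# Example: 'M 10.37 4.62 L 3.50 7.00' -> 'M 10 5 L 4 7'
print(round_path_coords("M 10.37 4.62 L 3.50 7.00"))
```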
DeepSVG. We trained DeepSVG using the official training script. The model converges within a few hours, but the results are also poor, yielding the lowest CLIPScore and the highest FID among all models.
IconShop. We also re-trained the original IconShop model on the subset of FIGR-8 used in Grimoire. In this case, the performance of the model is comparable to Grimoire, with a slightly better CLIPScore and FID.
All results are reported in the table below.
| Model | CLIPScore | FID | Conditioning | Supervision |
|---|---|---|---|---|
| DeepSVG | 22.10 | 58.03 | Class | Vector |
| Llama 3.2 | 25.45 | 38.93 | Prompt | Vector |
| Grimoire | 29.00 | 0.64 | Prompt | Raster |
| IconShop | 31.18 | 0.40 | Prompt | Vector |
Additional SDS-based methods
We have added two more recent SDS-based methods (SVGDreamer and Chat2SVG) to our qualitative analysis for the final version of our manuscript. We have also included a new quantitative analysis of all these models.
The point of the results in Section 4.5 is not to compare general generative capabilities, but to highlight that, despite their aesthetically pleasing results, this family of models falls short at representing a specific target domain and provides no way of being extended to new data.
Making this analysis quantitative is not straightforward. The FID between image distributions is only reliable with thousands of samples, but the computational cost of SDS-based models means that even a few samples can take hours to generate (e.g., SVGDreamer) or require costly proprietary models (e.g., Chat2SVG).
We have hence used the average PSNR of 20 generated samples from each model (a minimal sketch of the PSNR computation follows the table below). The results highlight how the SDS-based models fall short on our dataset distribution.
| Class | Model | Average PSNR (dB) |
|---|---|---|
| User | CLIPdraw | 28.68 |
| | Grimoire | 45.19 |
| | VectorFusion | 36.62 |
| | Chat2SVG | 37.62 |
| | SVGDreamer | 34.53 |
| Heart | CLIPdraw | 28.54 |
| | Grimoire | 45.66 |
| | VectorFusion | 38.54 |
| | Chat2SVG | 37.88 |
| | SVGDreamer | 34.44 |
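For reference, this is a minimal sketch of the standard PSNR computation referred to above, assuming 8-bit rasterized renderings of the generated and reference SVGs (function names are illustrative):

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two equally sized raster images."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)

def average_psnr(pairs) -> float:
    """Average PSNR over (generated, reference) image pairs."""
    return float(np.mean([psnr(a, b) for a, b in pairs]))
```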
Other questions and observations
Does the extension to support stroke width and color prediction mentioned in Sec 5.3 require re-training of the proposed two modules?
It does require retraining the VSQ. This is also the case for vector-supervised methods.
However, a significant difference is that our VSQ prediction heads are very easy to develop or enable and require no further changes, whereas vector-supervised methods like IconShop require redesigning their tokenizer and reprocessing all the data.
Finally, if the VSQ must be retrained with some additional modules (e.g., stroke color and width), the rest of the network can be reloaded, speeding up convergence.
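A minimal sketch of this partial reload in PyTorch (the head names and checkpoint layout are hypothetical, not the actual Grimoire module names):

```python
import torch
import torch.nn as nn

def reload_vsq_backbone(vsq: nn.Module, ckpt_path: str) -> None:
    """Reload pretrained VSQ weights while leaving newly added prediction
    heads (hypothetical names below) randomly initialized."""
    state = torch.load(ckpt_path, map_location="cpu")  # assumed raw state dict
    kept = {k: v for k, v in state.items()
            if not k.startswith(("stroke_width_head.", "stroke_color_head."))}
    missing, unexpected = vsq.load_state_dict(kept, strict=False)
    # Only the new heads should be listed here and trained from scratch.
    print("randomly initialized:", missing)
```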
"the contribution simply lies on avoiding using ground-truth paths."
The main novelty of our work is providing a text-to-SVG framework that learns from data for which vector paths do not exist. This is different from simply avoiding existing vector datasets.
Thanks for the feedback. I would like to see the comments from other reviewers before making a final decision. I currently have two concerns. First, while I acknowledge that IconShop utilizes SVG path data and the proposed method does not, IconShop still outperforms the proposed approach. This raises questions about the necessity and significance of the proposed method. Second, a key advantage of SDS-based methods is their ability to operate independently of specific data distributions by leveraging prior knowledge from a pre-trained model. In contrast, the proposed method lacks this flexibility.
This paper introduces a text-guided SVG generation model, i.e., GRIMOIRE, using only raster image supervision. The SVG generation task is formulated as the prediction of a series of individual shapes and positions. The experiments demonstrate the effectiveness of the proposed method.
Update after rebuttal
I appreciate the authors' clarifications. Most of my concerns have been addressed by the rebuttal. I lean towards keeping my score, provided the additional evaluations and discussions are incorporated into the revised version.
Questions for Authors
Are there any quantitative evaluations for text-conditioned SVG generation?
Claims and Evidence
The discussion of the existing SDS-based methods is not very clear. Sec. 2.1 only describes several related works without discussing the differences between the existing works and the proposed method. Sec. 5.4 only shows visual comparisons with the existing works. It would be better to include this analysis in the main paper rather than in the supplementary material.
Methods and Evaluation Criteria
The proposed method is a reasonable and effective solution to SVG generation.
Theoretical Claims
Yes.
Experimental Design and Analysis
The current evaluation is not thorough. Only Im2Vec is used for comparison. To make the results more convincing, additional existing SVG generation works should be compared and discussed.
Supplementary Material
Yes.
Relation to Prior Literature
The proposed method enhances the generation quality over existing raster-supervised SVG models and enables flexible text-conditioned SVG generation.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
The overall pipeline is a reasonable solution to perform SVG generation. The experiments demonstrate the effectiveness of the proposed method.
Other Comments or Suggestions
It would be better to reorganize the structure of section 5. The first paragraph states two aspects of the result, but there are four subsections here.
We sincerely thank you for taking the time to review our manuscript and providing valuable feedback.
Discussion on SDS-based Methods
In the final version of the manuscript, we plan to more clearly highlight the differences between SDS approaches and Grimoire at the end of section 2.1 as follows:
“These methods do not involve training for the vector generation process and instead rely on models trained on raster images for other tasks, making them difficult to extend to new data.”
Regarding this point, we have also:
- incorporated two more recent SDS-based methods, SVGDreamer and Chat2SVG, and will include additional qualitative results in the final version of the manuscript.
- added a quantitative comparison of these methods; this analysis has been included in our response to Reviewer sRe3.
We used PSNR to evaluate generated samples from all the SDS-based models, highlighting how these methods fall short when compared against a specific dataset. We chose not to use FID, as it requires a large number of samples to be statistically meaningful, which was not feasible within the constraints of SDS methods: slow generation (e.g., SVGDreamer) and expensive inference (e.g., Chat2SVG, which relies on Claude APIs).
Other Questions
Are there any quantitative evaluations for text-conditioned SVG generation?
We have used CLIPScore and FID for text-conditioned generation across the paper. As mentioned above, we have now also added PSNR for the comparison with SDS-based methods.
The first paragraph states two aspects of the result, but there are four subsections here.
Thank you for your suggestion on improving the writing of Section 5. We plan to revise the beginning of the section to more clearly outline its subsections.
The authors propose an SVG generative model, GRIMOIRE, which can be conditioned on a text prompt or a partially completed SVG. The primary innovation in the paper is training a VQ-VAE that tokenizes patches of rasterized SVGs into discrete tokens which, crucially, can be reconstructed into SVG primitives (primarily Bézier curves). This tokenizer can be trained end-to-end on purely raster images by a clever use of a differentiable renderer (DiffVG) in the VQ-VAE decoder, which enables backpropagating the L2 pixel-space reconstruction loss into and through the now-differentiable continuous SVG parameters. Once such a tokenizer has been trained, the authors use it to tokenize the MNIST, Fonts, and FIGR-8 datasets and train a standard autoregressive transformer with a next-token prediction loss to create an auto-regressive generative model for SVGs. The primary baseline for comparison is Im2Vec.
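To make this setup concrete, here is a minimal, hypothetical sketch of such a tokenizer's training objective. All module names are illustrative assumptions, and the renderer argument stands in for a DiffVG-like differentiable rasterizer; this is not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class VSQSketch(nn.Module):
    """Raster patch -> discrete code -> Bezier parameters -> rendered raster,
    trained with an L2 reconstruction loss against the input patch."""

    def __init__(self, encoder, quantizer, param_head, renderer):
        super().__init__()
        self.encoder = encoder        # e.g. a ResNet-style CNN over raster patches
        self.quantizer = quantizer    # discrete codebook lookup (VQ/FSQ-style)
        self.param_head = param_head  # maps code embeddings to Bezier control points
        self.renderer = renderer      # differentiable rasterizer (DiffVG-like)

    def forward(self, patch: torch.Tensor) -> torch.Tensor:
        z = self.encoder(patch)
        z_q = self.quantizer(z)               # straight-through quantization
        bezier_params = self.param_head(z_q)  # continuous SVG parameters
        rendered = self.renderer(bezier_params)
        # Gradients flow from pixel space back through the renderer into
        # the codebook and encoder, so no vector supervision is needed.
        return nn.functional.mse_loss(rendered, patch)
```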
Questions for Authors
N/A
Claims and Evidence
Yes.
Methods and Evaluation Criteria
- The benchmark datasets seem reasonable.
- However, in my opinion there are a number of missing baselines. In particular, for the reconstruction experiments in Tables 1 and 2 that validate the VQ-VAE, I would suggest also including LIVE (https://arxiv.org/abs/2206.04655), a seminal paper in the area of SVG reconstruction of raster images. StarVector (https://arxiv.org/abs/2312.11556) would also be an appropriate baseline for the generation experiments.
- Additionally, a simple baseline that would be very insightful: fine-tuning a standard LLM, e.g., Llama, on the Fonts and FIGR-8 datasets, which IIUC have the underlying SVG code available, using minimal preprocessing of the SVG code and the standard Llama tokenizer directly on the SVG code. This would validate or invalidate the need for such a complex tokenization method, and the claims in the paper about the need for more rasterized training data and the complex pre-processing required to train an SVG generation model directly on the SVG code.
- Besides the quantitative results, I would also have liked to see the average SVG code length of both methods (GRIMOIRE and Im2Vec) for each dataset. When available, the average SVG code length in the ground-truth dataset would also be helpful. It is currently very difficult to understand how fair the comparison between the methods is without some way of knowing how much compression each method achieves. I suspect, given the number of Bézier curves used to represent each relatively simple image patch, that GRIMOIRE may be relatively uncompressed and verbose.
- No measures of significance or error bars are provided for any of the quantitative results.
Theoretical Claims
N/A
Experimental Design and Analysis
Checked.
Supplementary Material
N/A; none provided.
Relation to Prior Literature
A very nicely written related work section is provided.
Essential References Not Discussed
LIVE is definitely an important missing reference: https://arxiv.org/abs/2206.04655. There are also a series of follow-up papers building on top of LIVE.
Likewise StarVector: https://arxiv.org/abs/2312.11556
Other Strengths and Weaknesses
Strengths:
- I appreciate that the authors will make their code open source.
- The use of DiffVG to enable learning a discrete codebook for SVG primitives, and the fact that this enables training the tokenizer on purely rasterized images is very clever and a very nice contribution of the paper.
Weaknesses:
- The only structural primitives available to the VQ-VAE decoder are Bézier curves. SVG itself supports other primitives, e.g. <circle>, <line>, etc., which may be more appropriate, interpretable, and compressed representations of certain image patches. But having to make a discrete choice between different primitives would make the decoder non-differentiable, meaning that, other than stylization (stroke width, color), all structure in the SVG must be represented as relatively uncompressed, uninterpretable paths.
- The need for specialized patch extraction methods on a per-dataset basis, as shown in Figure 3, is a major weakness of the method. Surely a more general patch extraction method could have been employed without a significant degradation in quality? It would also have enabled training a single joint codebook and generative model for all datasets, enabling potential transfer learning and greatly enhancing the generality of the method.
Other Comments or Suggestions
N/A
We sincerely appreciate your time and thoughtful feedback on our manuscript.
Given the time constraints of this rebuttal, we have focused on addressing the major concerns as follows.
Code Length of the SVG
For Im2Vec, the number of paths and control points per path is fixed at eight and ten, respectively.
Grimoire dynamically adjusts the number of strokes based on the target complexity.
After inspecting over 15,000 generated SVGs across all FIGR-8 classes, we found that the average number of paths is 95.
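For reference, a simple sketch of how such a path count can be computed (the directory layout and parsing choices are assumptions):

```python
from pathlib import Path
from xml.etree import ElementTree

def count_paths(svg_file: Path) -> int:
    """Count <path> elements in an SVG file, ignoring XML namespaces."""
    root = ElementTree.parse(svg_file).getroot()
    return sum(1 for el in root.iter() if el.tag.split("}")[-1] == "path")

files = list(Path("generated_svgs").rglob("*.svg"))  # hypothetical folder
avg = sum(count_paths(f) for f in files) / max(len(files), 1)
print(f"average paths per SVG: {avg:.1f}")
```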
We have uploaded samples and reconstructions for the “user” class for both models to our anonymous repository. For the reconstructions we included the original ground truth for reference.
For single-target reconstructions, Im2Vec tends to overlap strokes around the outline, whereas for more complex targets, the strokes collapse. We encourage the reviewer to directly inspect the SVG code length on the anonymous GitHub Repository: Link.
We found this analysis insightful and plan to incorporate it into the results section in the final version of the manuscript.
Additional Baselines
Thank you for suggesting finetuning a standard LLM with minimal preprocessing. This was an insightful suggestion, especially given that the tokenization pipelines of other vector-supervised models (e.g., IconShop) are a considerable limitation.
We have fine-tuned LLaMA 3.2 on the same FIGR-8 subset used for Grimoire with minor preprocessing and also added comparisons with other vector-supervised models using their respective tokenization pipelines (DeepSVG, IconShop).
This analysis has been included in our response to reviewer sRe3.
Missing References
We appreciate all reviewers for highlighting important missing references, such as LIVE. We plan to expand the related work section to incorporate all suggested papers. Specifically, we will:
- Before L87: Introduce vector-supervised methods that predate the LLM era, citing DeepSVG, Google-Fonts, and DeepVecFont.
- Immediately after: Among the LLM-based approaches, explicitly mention StarVector and Chat2SVG.
- At the end of the section: Where we discuss SDS-based methods, include SVGDreamer and dedicate a small paragraph to neural implicit representations, citing NiVel, Text-to-Vector Generation with Neural Path Representation, and NeuralSVG.
While these works differ significantly in methodology, they address similar problems to Grimoire and will be appropriately cited.
We hope this revision sufficiently addresses concerns regarding missing references.
I thank the authors for their response to my review. In particular, I appreciate the addition of the LLaMA baseline, the inclusion and results of which I regard as a positive addition to the paper. I also appreciate the addition of the average SVG code lengths (in terms of number of paths), which demonstrated, as I feared, that the Grimoire code is relatively complex and verbose.
Balancing these two factors, I maintain my original score. The paper is interesting and a useful addition to the literature with some significant disadvantages.
Thank you for your positive comments and for appreciating our work!
We would like to kindly offer an additional clarification that may support your evaluation. The higher number of path segments observed in the FIGR8 experiments is not an inherent limitation of the method, but rather a result of how the vector primitives are designed.
For example, as illustrated in the preliminary results in Figure 9, when we reconstruct the image using a layer-based approach rather than stroke-based primitives, the number of paths required is significantly reduced, closely resembling real-world SVG files.
SVG files generated using this layered setting are available in the same folder, should you be interested in exploring this aspect further. Link to folder.
We hope this provides a helpful additional perspective.
This paper presents GRIMOIRE, a novel text-guided generative model for scalable vector graphics (SVG). The model consists of two main components: a Visual Shape Quantizer (VSQ), which learns to reconstruct raster images as vector shapes through a discrete codebook; and an Auto-Regressive Transformer (ART), which models the joint distribution over shape tokens, positions, and textual descriptions to generate SVGs from natural language prompts. Unlike prior approaches requiring direct supervision from SVG data, GRIMOIRE is trained only with raster image supervision, enabling it to scale to larger datasets. The authors evaluate their method on tasks such as closed-shape reconstruction (MNIST, Emoji) and stroke-based generation (icons, fonts), demonstrating improved flexibility over SVG-supervised methods and competitive generative quality against image-supervised baselines.
Questions for Authors
Why use a raster-domain image encoder (e.g., ResNet-18) instead of a vector-domain encoder such as that used in DeepSVG? This choice impacts the learned latent representation, and a comparison would help clarify its effects.
How does the VSQ module differ from the VQ-based architecture in Im2Vec? The performance seems similar, so it's unclear how much the modification contributes to overall results.
Can the authors include a comparison with IconShop for SVG generation? Given that IconShop is a recent and strong baseline, its inclusion would help contextualize the performance of GRIMOIRE.
Claims and Evidence
The proposed architecture builds incrementally upon established techniques in vector image representation and generation. It integrates standard components in a coherent manner. While the claims are plausible and generally supported by qualitative and quantitative evidence, the overall novelty is modest. The improvements are incremental, and while the results are reasonable, they do not clearly establish substantial advancement over existing methods.
Methods and Evaluation Criteria
The benchmark datasets used in this work are appropriate, as they align well with those used in prior literature, allowing for meaningful comparisons. However, the methodology would benefit from clearer descriptions of which components are used in each experiment. Since the paper combines several previously established techniques, it is important to explicitly state which variants or components are evaluated in each experiment.
Theoretical Claims
There is no theoretical claim in this work.
Experimental Design and Analysis
Vector Quantization for SVG Representation (Tables 1 and 2):
An ablation study would greatly strengthen the experimental section. For example, variations in encoder-decoder configurations (e.g., patch size, grid size, codebook size, or SVG command set) could demonstrate the robustness and contribution of the individual components. The current comparisons, such as with Im2Vec, leave some ambiguity about whether the improvements stem from the proposed method or the dataset/model choices.
SVG Generation (Table 3):
The evaluation should include a comparison with IconShop to give a more comprehensive view of generative quality relative to recent state-of-the-art models.
Supplementary Material
All parts of the supplementary materials have been reviewed. However, it would be helpful if the authors explicitly stated the goals of each experiment in the supplementary section to improve clarity.
Relation to Prior Literature
This work contributes to the growing field of vector image representation and generation by bridging the raster and vector domains. The approach is particularly promising for multimodal applications, including vision-language models and code generation involving SVG. Given the relevance to downstream applications and the increasing interest in multimodal generation, this research has strong potential impact.
Essential References Not Discussed
Several key works are missing that are essential for contextualizing this paper's contribution:
- Google-Fonts (ICCV 2019): a foundational model for deep SVG generation combining image auto-encoders with SVG decoders.
- DeepSVG (NeurIPS 2020): introduced transformer-based autoencoders for vector graphics; its SVG tokenization remains widely used.
- LIVE (CVPR 2022): demonstrates SVG translation from raster images without a score distillation loss.
- DeepVecFont / DeepVecFont v2 (SIGGRAPH Asia 2021, CVPR 2023): addressed SVG command modeling and differentiable rasterization.

These works are highly relevant to both the architecture and the training methodology proposed in GRIMOIRE and should be discussed in the paper. Notably, vector image representation has a history that precedes the large language model (LLM) era, contrary to the implication in the related work section.
Other Strengths and Weaknesses
Strengths:
- Proposes a new framework for SVG generation that avoids the need for direct SVG supervision, opening the door to larger training datasets.
- Addresses a relatively underexplored area in generative modeling, with practical applications in design, UI generation, and code synthesis.
Weaknesses:
- Lacks sufficient citation and discussion of related foundational works.
- Ablation studies are missing, making it hard to isolate the contributions of each component.
- Experimental gains over prior work are modest.
Other Comments or Suggestions
LLM Baselines:
As a non-essential suggestion (outside the rebuttal), it would be informative to include SVG generation results from public large language models (e.g., OpenAI's GPT). Although their SVG generation quality is currently limited, showing this comparison would highlight the advantage of GRIMOIRE.
Typos:
There is a broken image reference at line 322: “Qualitative results in ?? confirm this behaviour on the MNIST dataset.”
We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript and providing insightful feedback.
We conducted ablation experiments to address your concerns. We have explored different patch and grid sizes on the MNIST dataset, analyzed the impact of stroke length and width on FIGR8, and investigated the effects of varying codebook sizes. Additionally, we have included comparisons with three vector-supervised methods.
Ablation: Patch and Grid Sizes (MNIST)
We trained the VSQ module on MNIST with different grid sizes (paper value: 5) and patch sizes (paper value: 128x128).
Key findings:
- The patch size variations had minimal impact on model performance.
- Grid size variations led to improvements for larger numbers of patches per image, likely due to the simpler topology of smaller patches.
- In all cases, the reconstruction error remains lower than that of Im2Vec.
The table below reports the MSE on the test set.
| Patch Size | Tiles = 3 | Tiles = 5 | Tiles = 8 |
|---|---|---|---|
| 32 | 0.093 | 0.092 | 0.078 |
| 64 | 0.092 | 0.09 | 0.071 |
| 128 | 0.09 | 0.094 | 0.078 |
Ablation: Stroke Length (FIGR8)
To assess the impact of stroke properties on VSQ performance, we conducted two ablations:
- Stroke length variations: We created patches with smaller or larger strokes. Results show that shorter strokes yield lower reconstruction errors, similar to the grid size variations.
- Multiple stroke predictions per patch: We extended the prediction head of the VSQ to output two strokes per patch instead of one (as in the paper). Results show that more than one segment per shape consistently degrades the reconstruction quality. This suggests that the complexity of strokes in our dataset does not require multiple Bézier curves per patch.

| Stroke Length | Segments | Stroke Width | MSE |
|---|---|---|---|
| 3.0 | 1 | 0.4 | 0.0049 |
| 5.0 | 1 | 0.66 | 0.011 |
| 8.0 | 1 | 1.06 | 0.023 |
| 3.0 | 2 | 0.4 | 0.0052 |
| 5.0 | 2 | 0.66 | 0.017 |
| 8.0 | 2 | 1.06 | 0.023 |
Ablation: Codebook Size
To understand the impact of the codebook size V, we trained the VSQ on FIGR8 using all the sizes proposed in the original Finite Scalar Quantization (FSQ) paper (240, 1000, 4375 [our paper], 15360, and 64000).
The results reported below highlight two key observations:
- The reconstruction error decreases significantly up to the size used in the paper (V = 4375) but shows only marginal improvements beyond this point.
- Using excessively large codebooks does not justify the increased computational cost.
| V | MSE |
|---|---|
| 240 | 0.0205 |
| 1000 | 0.0175 |
| 4375 | 0.0145 |
| 15360 | 0.0130 |
| 64000 | 0.0128 |
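In FSQ, the codebook size is the product of the number of quantization levels per latent channel (e.g., levels of 7, 5, 5, 5, 5 give 7 * 5^4 = 4375 codes). Below is a minimal, simplified sketch of FSQ-style quantization, with the bounding function and straight-through estimator as assumptions rather than our exact configuration:

```python
import torch

def fsq_quantize(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """Quantize each latent channel to a fixed number of integer levels.

    Simplified FSQ sketch: `levels` such as [7, 5, 5, 5, 5] yield an
    implicit codebook of 7 * 5**4 = 4375 entries.
    """
    half = (torch.tensor(levels, dtype=z.dtype, device=z.device) - 1) / 2
    bounded = torch.tanh(z) * half     # bound channel i to [-half_i, half_i]
    quantized = torch.round(bounded)   # snap to the nearest integer level
    # Straight-through estimator: quantized forward, identity backward.
    return bounded + (quantized - bounded).detach()
```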
Comparison with Vector-Supervised Models
We have added a comparison with three vector-supervised models: a publicly available LLM (Llama 3.2), DeepSVG, and IconShop. We trained the models on the SVG version of FIGR8. A detailed analysis is in our response to reviewer sRe3.
Missing References
We appreciate your feedback regarding missing references.
We have addressed this in our response to reviewer f1Fu and outlined how we will incorporate these references in the final manuscript.
Other Questions
It would be helpful if the authors explicitly stated the goals of each experiment in the supplementary section.
We agree. We will add a brief introductory sentence before each subsection of the Appendix.
Broken image reference at line 322.
Thank you for catching this. We will remove the reference, as the corresponding figure is no longer part of the main manuscript.
Why use a raster-domain image encoder instead of a vector-domain encoder?
A core novelty of Grimoire is that the entire framework operates in the raster domain. Using a vector-domain encoder would contradict this fundamental approach, making the pipeline no longer vector-free.
How does the VSQ module differ from the VQ-based architecture in Im2Vec?
Im2Vec utilizes an RNN-based architecture, which does not belong to the family of VQ models using discrete embedding codebooks.
Another key methodological difference is that Im2Vec attempts end-to-end SVG generation, encoding an entire image and predicting the individual components with an RNN. In contrast, our VSQ first learns vector representations from image patches, then the ART model learns to arrange these patches in the correct sequence.
This modular approach significantly enhances the performance of raster-supervised generative models compared to Im2Vec.
This paper received 2 “Weak reject” and 2 ”Weak accept.” It is a borderline paper that needs further discussion with the SAC to make the final decision. Currently, the AC is slightly leaning towards rejecting this paper based on the analyses of its strengths and weaknesses listed below.
Strengths:
- A new text-guided SVG generation model was proposed using only raster image supervision.
- The idea of splitting the image into patches in the VSQ stage is interesting.
Weaknesses:
- The technical contribution and the performance improvements over existing methods are incremental.
- Many important related works (e.g., Google-Fonts, DeepSVG, LIVE, DeepVecFont, etc.) are missing, which should be cited, discussed and compared.
- Insufficient experiments have been conducted.
- Although the proposed method can be trained using only raster image supervision, it cannot outperform some approaches trained on ground-truth paths. Therefore, it fails to handle the practical real-world scenario where the model is trained on vector graphics and outputs high-quality SVGs.