PaperHub
6.8 / 10
Poster · 4 reviewers
Ratings: 5, 4, 4, 4 (min 4, max 5, std 0.4)
Confidence: 3.8
Originality 2.8 · Quality 2.8 · Clarity 3.3 · Significance 2.5
NeurIPS 2025

GuideFlow3D: Optimization-Guided Rectified Flow For Appearance Transfer

Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

A training-free method that steers pre-trained generative rectified flow with differentiable guidance for robust, geometry-aware 3D appearance transfer across shapes and modalities.

Abstract

Keywords
3D Diffusion · 3D Appearance Transfer · Generative Guidance

Reviews and Discussion

Review (Rating: 5)

This paper tackles the task of transferring appearance specified by images or texts to a given 3D geometry. For this, it proposes a training-free approach that is based on a pre-trained generative model, i.e., TRELLIS [14]. Specifically, the authors incorporate the universal guidance [56] scheme into the pre-trained model's sampling. Namely, during each flow matching step, the authors find an optimized latent that minimizes losses based on 1) a part-aware appearance loss, or 2) a self-similarity-based geometry loss (Sec. 4.1). Experiments on a newly-proposed dataset verify the effectiveness of the proposed approach according to GPT-4V ranking.
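To make the sampling scheme concrete, here is an illustrative sketch of such interleaved guidance (the function names, Euler discretization, and optimizer settings are assumptions for exposition, not the paper's actual implementation):

```python
import torch

def guided_rectified_flow_sample(velocity_model, guidance_loss, x, steps=50,
                                 guidance_every=1, inner_iters=3, lr=0.1):
    """Illustrative sketch of optimization-guided rectified flow sampling.

    velocity_model(x, t) -> predicted velocity from the pretrained flow prior.
    guidance_loss(x)     -> differentiable scalar objective (e.g. a part-aware
                            appearance or self-similarity loss on the latent).
    All names, the Euler update, and the hyperparameters are assumptions.
    """
    ts = torch.linspace(1.0, 0.0, steps + 1)  # integrate from noise (t=1) to data (t=0)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        with torch.no_grad():                 # plain flow step from the pretrained model
            v = velocity_model(x, t)
        x = x + (t_next - t) * v

        if i % guidance_every == 0:           # interleave a few guidance-driven updates
            x = x.detach().requires_grad_(True)
            opt = torch.optim.Adam([x], lr=lr)
            for _ in range(inner_iters):
                opt.zero_grad()
                guidance_loss(x).backward()   # pull the latent toward the objective
                opt.step()
            x = x.detach()                    # hand a clean latent back to the flow
    return x
```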

Strengths and Weaknesses

  1. Quality-wise: the qualitative results in the paper are of high quality;
  2. Clarity-wise: the paper is well-written and easy to follow;
  3. Significance-wise: the paper tackles an important task of editing 3D assets, which is vital for large-scale applications;
  4. Originality-wise: the proposed approach that is training-free and based on universal guidance is interesting and looks promising.

Questions

I do not fully get what the baseline "TRELLIS" in L258 means:

Trellis enables detailed local texture transfer using the appearance condition with structured latents generation, and serves as our baseline without guidance.

Can authors clarify?

Limitations

My main concern is that the baselines are too weak.

Though there are not many 3D appearance transfer works, there are tons of works that excel at 2D style transfer. Thus, I think the following pipeline is a competitive one:

  1. Render the given mesh in one or multiple views (we can also easily obtain masks since the geometry is given)
  2. Use a state-of-the-art 2D style transfer model to change the appearance in the 2D rendered image. For example,
    1. https://github.com/garibida/cross-image-attention
    2. https://github.com/diyiiyiii/StyTR-2
    3. https://github.com/silky1708/local-style-transfer
    4. https://github.com/FilippoBotti/MambaST
  3. For the text-based setting, we can first use a text-to-image model to generate a 2D appearance reference and then use style transfer to generate the 2D image
  4. Run TRELLIS image-conditioned model to output 3D assets
    1. If we want to keep the mesh, we can also keep the active voxel positions

Frankly speaking, even 7 years ago, there were works that conducted 2D-to-3D style transfer, e.g., https://github.com/hiroharu-kato/style_transfer_3d. Please compare and provide suitable credits.

Final Justification

After carefully reading the authors' responses and other reviews, I still think this is an interesting paper that tackles the task of appearance transfer across 3D assets. The rebuttal also added new baselines that include 2D style transfer, making the paper more comprehensive. Thus, I lean towards acceptance.

Formatting Concerns

N/A

Author Response

We thank Reviewer zJCJ for their positive feedback, highlighting the high quality of our results, the well-written and easy-to-follow presentation, and the interesting and promising nature of our training-free, universal guidance-based approach to the important task of editing 3D assets… vital for large-scale applications. Below we address all concerns:

Q1: My main concern is that the baselines are too weak. Though there are not many 3D appearance transfer works, there are tons of works that excel on 2D style transfer.

Following reviewer suggestions, we added three new baselines: (1) EASI-Tex (SIGGRAPH 2024) [1] is a conditional diffusion-based method that uses edge-aware conditioning and ControlNet to texture an existing 3D mesh from a single RGB image; (2) Cross Image Attention (SIGGRAPH 2024) [2] builds upon the self-attention layers of generative models and introduces a cross-image attention mechanism that implicitly establishes semantic correspondences across images; (3) Mamba-ST (WACV 2025) [3] adapts the behavior of State Space Models to simulate the behavior of cross-attention layers, which are able to combine two separate embeddings into a single output. Please note that (2) and (3) are 2D style transfer methods which we adapt to our use-case following your suggestion. We conduct an experiment with image condition using $\mathcal{L}_{\text{appearance}}$ on the simple-complex set (same setting as Main Paper: Tab. 1). Below are the results:

| Method | Fidelity ↓ | Clarity ↓ | Integration ↓ | Quality ↓ | Adaptation ↓ | Overall ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Mamba-ST | 3.34 | 3.43 | 3.47 | 3.52 | 3.46 | 3.45 |
| Cross Image Attention | 2.61 | 2.64 | 2.69 | 2.71 | 2.68 | 2.66 |
| EASI-Tex | 1.78 | 1.83 | 1.900 | 1.89 | 1.85 | 1.82 |
| GuideFlow3D | 1.15 | 1.14 | 1.16 | 1.21 | 1.11 | 1.06 |

GuideFlow3D achieves strong improvements in both appearance transfer and structural clarity preservation. This demonstrates the ability of our approach to effectively translate visual and semantic cues into coherent, high-fidelity textures across different 3D objects. We also conducted a user study to measure the correlation between the LLM’s judgments and those of human evaluators and found that GPT-4 performs comparably to humans. Please refer to our response to Reviewer dDnT Q1 for a detailed explanation. We will add these results alongside comprehensive benchmarking evaluation and qualitative comparison with the baselines on all of our experimental settings to the final manuscript.
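As a toy illustration of how the table can be read, assuming each value is the mean rank assigned by GPT-4V across evaluation samples (1 = best, so lower is better; this reading and the data below are our illustrative assumptions):

```python
import numpy as np

methods = ["Mamba-ST", "Cross Image Attention", "EASI-Tex", "GuideFlow3D"]
# Made-up per-sample rankings (1 = best) purely to illustrate the aggregation.
per_sample_ranks = np.array([
    [4, 3, 2, 1],
    [4, 2, 3, 1],
    [3, 4, 2, 1],
    [4, 3, 2, 1],
])
mean_rank = per_sample_ranks.mean(axis=0)        # lower mean rank is better
print(dict(zip(methods, mean_rank.round(2))))    # e.g. {'Mamba-ST': 3.75, ..., 'GuideFlow3D': 1.0}
```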

We appreciate your suggestions for adapting 2D style transfer methods to 3D and, if desired, we will also provide the text-based results during the discussion phase, as a way to better situate the context of our work in the evolving editing landscape.

Q2: I do not fully get what the baseline "TRELLIS" in L258 means.

Since TRELLIS supports flexible local editing, as stated in their manuscript, the idea of this baseline is to understand how the pretrained model performs without any guidance. We pass the input image/text condition and the active voxels intersecting with the surface of the input 3D object mesh (structure) to the structured latents (SLATs) generation (second stage) of the TRELLIS model to generate the output 3D asset.

[1] Perla, Sai Raj Kishore et al. "EASI-Tex: Edge-Aware Mesh Texturing from Single Image", SIGGRAPH 2024.

[2] Alaluf, Yuval et al. "Cross-Image Attention for Zero-Shot Appearance Transfer", SIGGRAPH 2024.

[3] Botti, Filippo et al. "Mamba-ST: State Space Model for Efficient Style Transfer", WACV 2025.

Comment

I thank the authors for their time and effort in addressing my concerns. The 2D-style transferred results are nice to have. Please add them to the main paper. Given the lack of reliable quantitative metrics, I also encourage the inclusion of as many qualitative results as possible to provide readers with a comprehensive understanding of the outcomes across different approaches.

Review (Rating: 4)

The authors propose GuideFlow3D, an optimization-based method that edits the latent representation of a pretrained rectified flow model to change the appearance of 3D assets. The authors propose two different types of guidance, including part-aware losses and self-similarity. With the proposed guidance, the authors are able to transfer the texture and geometric details of one 3D asset to another. Experiments are conducted on 5 rigid object categories. The qualitative results show that the proposed method outperforms the baseline in different ranking metrics.

Strengths and Weaknesses

Strengths

Detailed Explanations: The authors explain the proposed method very well in the method section. Specifically, the authors provide enough details for the two proposed guidance losses, which makes the paper easy to follow and to reimplement.

Good Visual Quality: The authors provide good qualitative results for the proposed method. Notably, the in-the-wild appearance transfer in the Appendix looks great.

Weaknesses

The biggest concern of the paper is that the proposed method is an application extension of the paper [14]. The goal of [14] is 3D generation, and the proposed method aims for another direction: editing the latent representation of one object so that it matches the appearance of another one.

The novelty of the paper potentially lies in 1) how to obtain the latent representation and 2) how to edit the latent representation. Since the pretrained model is adopted, the authors should make efforts to validate the novelty in the second point. However, in the experiment section, the authors still use [14] as one of the main baselines, which is not sufficient to validate the novelty of the paper.

In general, image/object editing is a popular topic. I suggest the authors could adopt the editing method from another paper (even though the object representation could be different) and make comparison with the proposed method.

Minors

I would suggest adding some image pairs in Fig 4 of Appendix to the main paper.

Questions

The key points are addressed in the weakness section. I will adjust my rating if the authors answer my questions properly.

I have some questions for detail clarification:

  1. The paper mentions that the objects are mainly in canonical pose; how is the scene editing in Appendix C implemented? Do you use GT pose or another off-the-shelf pose estimator?

  2. Will object categories have a big impact on the co-segmentation-based objective?

Limitations

Yes.

Final Justification

The authors' rebuttal addresses my concerns about the baselines and the contribution of the work. I raise my rating accordingly.

Formatting Concerns

There is no concern about paper formatting.

Author Response

We thank Reviewer 2LHX for highlighting the detailed explanations for easy reimplementation, good visual quality, and noting that the in-the-wild appearance transfer in the Appendix (Fig 4) looks great. Below, we address all concerns:

Q1: The authors could adopt the editing method from another paper (even though the object representation could be different) and make comparison with the proposed method.

Following reviewer suggestions, we added three new baselines: (1) EASI-Tex (SIGGRAPH 2024) [1] is a conditional diffusion-based method that uses edge-aware conditioning and ControlNet to texture an existing 3D mesh from a single RGB image; (2) Cross Image Attention (SIGGRAPH 2024) [2] builds upon the self-attention layers of generative models and introduces a cross-image attention mechanism that implicitly establishes semantic correspondences across images; (3) Mamba-ST (WACV 2025) [3] adapts the behavior of State Space Models to simulate the behavior of cross-attention layers, which are able to combine two separate embeddings into a single output. Please note that (2) and (3) are 2D style transfer methods which we adapt to our use-case following the suggestion of Reviewer zJCJ. We conduct an experiment with image condition using $\mathcal{L}_{\text{appearance}}$ on the simple-complex set (same setting as Main Paper: Tab. 1). Below are the results:

| Method | Fidelity ↓ | Clarity ↓ | Integration ↓ | Quality ↓ | Adaptation ↓ | Overall ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Mamba-ST | 3.34 | 3.43 | 3.47 | 3.52 | 3.46 | 3.45 |
| Cross Image Attention | 2.61 | 2.64 | 2.69 | 2.71 | 2.68 | 2.66 |
| EASI-Tex | 1.78 | 1.83 | 1.900 | 1.89 | 1.85 | 1.82 |
| GuideFlow3D | 1.15 | 1.14 | 1.16 | 1.21 | 1.11 | 1.06 |

GuideFlow3D achieves strong improvements in both appearance transfer and structural clarity preservation. This demonstrates the ability of our approach to effectively translate visual and semantic cues into coherent, high-fidelity textures across different 3D objects. While our method leverages the pretrained Trellis model, the core novelty lies in our training-free inference-time optimization framework, which combines universal guidance with structured latent flow sampling. This is enabled by our two proposed guidance objectives—part-aware appearance loss and self-similarity loss—specifically tailored to handle geometric dissimilarity in 3D appearance transfer. The strong performance over both 2D and 3D baselines empirically validates this contribution. We also conducted a user study to measure the correlation between the LLM’s judgments and those of human evaluators and found that GPT-4 performs comparably to humans. Please refer to our response to Reviewer dDnT Q1 for a detailed explanation. We will add these results alongside comprehensive benchmarking evaluation and qualitative comparison with the baselines on all of our experimental settings to the final manuscript.

Q2: How is the scene editing in Appendix C implemented? Do you use GT pose or another off-the-shelf pose estimator?

We use an off-the-shelf CAD model retrieval and alignment algorithm to get per-object pose estimations and obtain 3D mesh geometries of objects in ScanNet [4, 5].

Q3: Will object categories have a big impact on the co-segmentation-based objective?

While the co-segmentation-based objective is designed using PartField, in practice, we get the PartField feature vectors per 3D point/voxel and run K-means clustering. Instead of PartField features, we can also use SLAT features for clustering (Main Paper: Tab. 3, Ablation Study). While a slight performance drop is expected, our method does not rely on 1-to-1 correspondences but rather on approximate part matching across objects. As shown with the in-the-wild appearance transfer results (Appendix: Fig 4), we are able to support diverse object categories; for example, patterns from the body of the giraffe are transferred to the seat, arms, and backrest of the chair, while the appearance of the giraffe legs is transferred to the chair legs.
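A minimal sketch of such a clustering step, assuming per-voxel feature vectors (PartField or SLAT) have already been extracted; the joint K-means formulation and the use of scikit-learn here are illustrative assumptions, not necessarily the exact implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def cosegment(feats_a, feats_b, n_parts=6, seed=0):
    """Jointly cluster per-voxel features of two objects into approximate parts.

    feats_a, feats_b: (N_a, D) and (N_b, D) feature arrays (e.g. PartField or
    SLAT features per active voxel). Shared cluster ids give the approximate
    part correspondence used for guidance; no 1-to-1 matching is required.
    """
    joint = np.concatenate([feats_a, feats_b], axis=0)
    labels = KMeans(n_clusters=n_parts, random_state=seed, n_init=10).fit_predict(joint)
    return labels[: len(feats_a)], labels[len(feats_a):]

# Toy call with random features, just to show the expected shapes.
labels_a, labels_b = cosegment(np.random.rand(500, 32), np.random.rand(400, 32))
```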

Q4: I would suggest adding some image pairs in Fig 4 of Appendix to the main paper.

Thank you for the suggestion, we will revise the final manuscript with some in-the-wild appearance transfer results in the main paper.

[1] Perla, Sai Raj Kishore et al. "EASI-Tex: Edge-Aware Mesh Texturing from Single Image", SIGGRAPH 2024.

[2] Alaluf, Yuval et al. "Cross-Image Attention for Zero-Shot Appearance Transfer", SIGGRAPH 2024.

[3] Botti, Filippo et al. "Mamba-ST: State Space Model for Efficient Style Transfer", WACV 2025.

[4] Stekovic, Sinisa et al. "PyTorchGeoNodes: Enabling Differentiable Shape Programs for 3D Shape Reconstruction." CVPR 2025.

[5] Ainetter, Stefan et al. "Automatically Annotating Indoor Images with CAD Models via RGB-D Scans." WACV 2023.

Comment

Thanks for the efforts in the rebuttal. The new baselines make the paper more solid. However, as I mentioned in the weakness section, it seems the proposed method is an application extension of a generation model [14]. Could you clarify what new insights or technical contributions you have beyond [14]?

Comment

Thank you for appreciating our efforts in the rebuttal and finding that the new baselines make the paper more solid.

While [14] trains a network that generates 3D objects based on the input condition, our work proposes an appearance transfer framework that leverages an interleaved optimization-flow sampling strategy purely at inference time, enabling 3D stylization between objects with substantially different geometries, a known failure case of existing methods (Reviewer dDnT). We show two strong approaches for guiding this transfer to be semantically aware while requiring no additional fine-tuning of the base generative model. Our experiments demonstrate the superiority of our approach compared to baseline methods for appearance transfer, further demonstrating its practicality. Below, we summarize our main contributions:

  • Our approach elegantly avoids the need for costly retraining or complex architectural modifications, e.g., ControlNet-like adapters (Reviewer dDnT), positioning it as a flexible, training-free solution (Reviewer dDnT, zJCJ), generalizable to different appearance representations, e.g., images or text (Reviewer nLwK).

  • The proposed loss functions are tailored to the specific use case of 3D appearance transfer, by establishing a semantically meaningful part-aware correspondence and preserving intrinsic structures in absence of explicit geometric supervision (Reviewer dDnT).

  • Our method enables transfer across domains and modalities (e.g., giraffe → chair, cabinet → airplane), with both image- and text-based appearance guidance. In-the-wild results (Appendix D: Fig. 4) demonstrate flexibility to real-world conditions and tackle an important task… vital for large-scale applications (Reviewer zJCJ).

Our framework is designed to operate over structured 3D latent space, making it broadly applicable to generative models beyond voxel-based or next-token prediction paradigms. By decoupling style control from the generation process and introducing versatile guidance functions, GuideFlow3D opens new directions in controllable 3D generation and asset stylization.

We will clarify these points in the revised version of the manuscript.

[14] Xiang, Jianfeng, et al. "Structured 3D Latents for Scalable and Versatile 3D Generation." CVPR 2025.

Comment

Thanks again for the reply. It addresses my concerns about the contribution. I have no further questions at this moment.

Review (Rating: 4)

This paper addresses the task of 3D appearance transfer, focusing on scenarios where the input object (providing the geometry) and the appearance object (providing the style) have significant geometric differences. The authors propose GuideFlow3D, a training-free framework that operates by steering a pretrained rectified flow model at inference time. The method is built upon the Trellis generative model and its structured latent representation.

The core contribution is a guided sampling process that extends the concept of universal guidance to rectified flows. During the reverse flow, the sampling steps are periodically interleaved with optimization steps that incorporate a differentiable guidance loss. This allows the generation to be constrained by objectives for which the underlying model was not explicitly trained. The paper introduces two such guidance functions tailored for different conditions:

  1. a part-aware appearance loss, L_{appearance}, which is used for mesh-based appearance objects and leverages part-based co-segmentation to enforce localized correspondence, and
  2. a self-similarity loss, L_{structure}, a contrastive objective used for image or text-based appearance objects that preserves the intrinsic structure of the input mesh.

To evaluate their method, the authors created a new benchmark dataset and employed a GPT-4V-based ranking system, arguing that traditional metrics are unsuitable for this task. The experiments show that GuideFlow3D outperforms baselines both qualitatively and quantitatively according to this evaluation scheme.

The paper is interesting and of good quality; my current recommendation is temporary: a successful rebuttal that provides insights about human validation of the metric and a more thorough discussion/comparison with SOTA baselines could raise my score to a 4 (borderline accept) or a 5 (accept).

Strengths and Weaknesses

Strengths:

  1. High Significance and Clear Motivation: The paper tackles a well-defined and important problem: transferring appearance between 3D objects with substantially different geometries. This is a known failure case for many existing 3D generative and style transfer methods, which often struggle with geometric inconsistencies or require strong structural similarity between source and target. The ability to stylize a simple geometric primitive with the complex appearance of a detailed, but structurally different, object has clear and immediate applications in democratizing content creation for gaming, AR/VR, and rapid industrial design prototyping.
  2. Originality: The core technical contribution is original. The proposed method of integrating optimization-based guidance directly into the sampling process of a pretrained rectified flow model is a clever and powerful idea. This approach elegantly avoids the need for costly retraining or complex architectural modifications (e.g., ControlNet-like adapters), positioning it as a flexible, training-free solution. The logic is sound: directly applying the base generative model (Trellis) is insufficient for the task (Fig. 4b), while naive optimization of the guidance loss would push the latents out of the learned data distribution, resulting in unrealistic artifacts (Fig. 4a). The proposed interleaved optimization-flow scheme provides a balance between adherence to the guidance objective and the powerful generative prior of the pretrained model.
  3. Guidance Functions: The two proposed loss functions, L_{appearance} and L_{structure}, are well-motivated and tailored to their specific use cases. The use of part-based feature fields from PartField to drive the co-segmentation for L_{appearance} is a particularly strong design choice. It establishes a semantically meaningful correspondence between parts (e.g., chair leg to chair leg), which is far more robust than naive nearest-neighbor matching in a generic latent space, a claim strongly supported by the ablation study. Similarly, the contrastive self-similarity loss is a well-founded approach for preserving an object's intrinsic structure when only non-geometric guidance (image or text) is available, drawing on established concepts from the literature.

Weaknesses:

  1. Reliance on an Un-validated and Potentially Biased Evaluation Metric: The paper's quantitative evaluation rests entirely on a GPT-4V/4o-based ranking system. While the authors correctly argue that traditional metrics like FID or LPIPS are ill-suited for this task, their chosen alternative introduces serious methodological concerns. Relying on a large language model as the sole measure of quality is a major weakness without rigorous validation. As recognized by the authors in the supplementary, recent studies have highlighted that LLMs can exhibit significant biases. The paper presents the GPT-based rankings as objective fact without a crucial validation step: a small-scale human user study to measure the correlation between the LLM's judgments and those of human evaluators.
  2. Limited Comparison to State-of-the-Art Baselines: The experimental comparison is limited to a simple UV mapping baseline and Trellis, the very model that GuideFlow3D is built upon. This is not sufficient to properly contextualize the work within the broader field of 3D style transfer. The field has seen rapid progress with methods based on different 3D representations, such as NeRFs (e.g., StyleRF) and 3D Gaussian Splatting (e.g., StyleGaussian). While these methods may have different input/output formats, a thorough discussion of their relationship to GuideFlow3D, and the conceptual trade-offs, is missing. The current experiments only prove that adding guidance improves Trellis; they do not establish that GuideFlow3D is superior to, or even competitive with, other contemporary state-of-the-art techniques. This is a critical omission for a top-tier conference paper.
  3. Narrow Experimental Scope and Unsubstantiated Generalization: The experiments are conducted almost exclusively on furniture items (chairs, tables, etc.). While the appendix also shows good-quality examples on more general categories, a deeper quantitative generalization study would be crucial for evaluating the generalizability and real-world applicability of the approach.

Questions

  • The reliance on GPT-4V as the sole quantitative evaluator is a major concern due to potential un-analyzed biases. Did the authors perform a small-scale human study (even with a limited number of participants, similar to the user study in the StyleRF paper) to validate the GPT-4V rankings against human judgments?

  • The paper's comparison is limited to a naive baseline and its own foundation model (Trellis). To better situate this work, could you please elaborate on the relationship between GuideFlow3D and other recent state-of-the-art 3D style transfer methods like StyleRF or StyleGaussian? What are the primary challenges in performing a direct comparison (e.g., differences in 3D representation from NeRF/Gaussians to meshes), and what are the conceptual advantages of your mesh-based guided flow approach over these methods? A convincing argument here would help establish the significance of your work.

  • The L_{appearance} guidance relies heavily on the co-segmentation from PartField. How does GuideFlow3D's performance degrade if the segmentation quality is poor, for instance, when transferring appearance between two objects where PartField produces inconsistent or incorrect part labels (e.g., mapping a chair leg to an armrest)? Showing a failure case analysis would provide valuable insight into the method's robustness and practical limitations.

  • In Eq. (6), the self-similarity loss is a contrastive loss on geometric features that encourages parts within the same cluster to be similar. However, in appearance transfer, one might want to apply different textures to semantically distinct but geometrically similar parts (e.g., applying a fabric texture to a chair's seat cushion and a wood texture to its structurally similar backrest). How does this loss handle such cases, and does it risk homogenizing the texture across all geometrically similar regions, even if the text prompt implies variation?

Limitations

Yes

Final Justification

Having read the authors' rebuttal, I agree to increase my score to 4 (borderline accept). I acknowledge the work done to improve the evaluation and comparison, but part of my concerns about the possibly biased evaluation remains, as does the limited scope of the furniture scenario.

Formatting Concerns

N/A

Author Response

We thank Reviewer dDnT for their detailed review and for recognizing the high significance and clear motivation of our work, the originality of our training-free framework, and the clever and powerful nature of our interleaved optimization-flow scheme. We also appreciate the acknowledgment of our well-motivated and tailored guidance functions and the flexibility our method offers by avoiding the need for costly retraining or complex architectural modifications. Below, we address the concerns:

Q1: Reliance on an Un-validated and Potentially Biased Evaluation Metric.

Traditional metrics like CLIP similarity, LPIPS, or FID fail to capture semantic consistency, geometry-aware stylization, or perceptual realism in cross-category 3D appearance transfer [1]. Recent works across 3D generation and stylization tasks have shown that GPT-4V exhibits strong alignment with human preferences [50, 67, 43, 76], and is increasingly adopted as a practical tool for perceptual evaluation in the absence of ground truth. However, we agree that without rigorous validation, relying solely on an LLM can be a major weakness. Based on the suggestion, we conducted a user study on Amazon Mechanical Turk with 21 participants using 51 randomly selected object and image pairs from the simple-complex evaluation set, in the image-condition setting. The participants were shown the appearance image and two views (front and back at a camera angle of 45 degrees) each of the structure mesh + outputs from different methods, without any knowledge about the methods. Then, they were provided with the same prompts as GPT (Appendix: Fig 6) and asked to select the best-performing method considering overall quality. Based on 1070 responses, GuideFlow3D achieved the highest win rate (34.8%) compared to Cross Image Attention (26.7%), EASI-Tex (21.6%), and Mamba-ST (16.9%), consistent with the GPT-4V rankings (please refer to Q2 for a discussion on the baselines and GPT evaluation). These findings validate the reliability of our LLM-based evaluation, demonstrating strong alignment with human perception in assessing texture fidelity, structural preservation, and overall appearance quality. We will include a summary of this user study in the main paper and provide full details in the appendix.
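For concreteness, the per-method win rates above can be tallied as follows (a minimal sketch; the response format and the toy numbers are illustrative, not the actual analysis code):

```python
from collections import Counter

def win_rates(responses):
    """responses: one method name per AMT response, naming the method the
    participant rated best overall. Returns the win rate per method."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {method: count / total for method, count in counts.items()}

# Toy data (made up) just to show the computation.
toy = (["GuideFlow3D"] * 7 + ["Cross Image Attention"] * 5 +
       ["EASI-Tex"] * 4 + ["Mamba-ST"] * 4)
print(win_rates(toy))  # {'GuideFlow3D': 0.35, 'Cross Image Attention': 0.25, ...}
```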

Q2: Limited Comparison to State-of-the-Art Baselines.

Following reviewer suggestions, we added three new baselines: (1) EASI-Tex (SIGGRAPH 2024) [1] is a conditional diffusion-based method that uses edge-aware conditioning and ControlNet to texture an existing 3D mesh from a single RGB image; (2) Cross Image Attention (SIGGRAPH 2024) [2] builds upon the self-attention layers of generative models and introduces a cross-image attention mechanism that implicitly establishes semantic correspondences across images; (3) Mamba-ST (WACV 2025) [3] adapts the behavior of State Space Models to simulate the behavior of cross-attention layers, which are able to combine two separate embeddings into a single output. Please note that (2) and (3) are 2D style transfer methods which we adapt to our use-case following the suggestion of Reviewer zJCJ. We conduct an experiment with image condition using $\mathcal{L}_{\text{appearance}}$ on the simple-complex set (same setting as Main Paper: Tab. 1). Below are the results:

| Method | Fidelity ↓ | Clarity ↓ | Integration ↓ | Quality ↓ | Adaptation ↓ | Overall ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Mamba-ST | 3.34 | 3.43 | 3.47 | 3.52 | 3.46 | 3.45 |
| Cross Image Attention | 2.61 | 2.64 | 2.69 | 2.71 | 2.68 | 2.66 |
| EASI-Tex | 1.78 | 1.83 | 1.900 | 1.89 | 1.85 | 1.82 |
| GuideFlow3D | 1.15 | 1.14 | 1.16 | 1.21 | 1.11 | 1.06 |

GuideFlow3D achieves strong improvements in both appearance transfer and structural clarity preservation. This demonstrates the ability of our approach to effectively translate visual and semantic cues into coherent, high-fidelity textures across different 3D objects. We will add these results alongside comprehensive benchmarking evaluation and qualitative comparison with the baselines on all of our experimental settings to the final manuscript.

Q3: Narrow Experimental Scope and Unsubstantiated Generalization.

We agree that generalizability is important and appreciate the reviewer noting the quality of our open-domain examples. Our focus on furniture categories in the main paper is intentional, driven by practical downstream applications (e.g., AR/VR, asset libraries) where editable 3D meshes are commonly used. This also enables controlled, semantically coherent benchmarks, making user studies and GPT-based evaluations more interpretable and consistent. We also provide in-the-wild transfer results (Appendix: Fig 4) across diverse categories demonstrating that GuideFlow3D generalizes to semantically and geometrically dissimilar objects. We will make this motivation clearer in the revised manuscript.

Q4: To better situate this work, could you please elaborate on the relationship between GuideFlow3D and other recent state-of-the-art 3D style transfer methods like StyleRF or StyleGaussian?

Recent methods like StyleRF [4] and StyleGaussian [5] represent important progress in 3D style transfer. However, these approaches operate on implicit or point-based representations (NeRFs, Gaussians), whereas GuideFlow3D works on explicit textured meshes, enabling direct editing, part-aware control, and downstream usability. Notably, StyleGaussian stylizes only color while keeping geometry fixed, limiting its ability to reflect fine structural patterns. It also scales poorly with the number of Gaussians due to memory and compute constraints. In contrast, GuideFlow3D offers geometry-aware stylization with training-free inference and fixed compute cost via structured latent sampling. We will revise the related work section to discuss these conceptual trade-offs, highlighting how mesh-based controllable appearance transfer can be useful.

Q5: How does GuideFlow3D’s performance degrade if the segmentation quality is poor?

While the co-segmentation-based objective is designed using PartField, in practice, we get the PartField feature vectors per 3D point/voxel and run K-means clustering. Instead of PartField features, we can also use SLAT features for clustering (Main Paper: Tab. 3, Ablation Study). While a slight performance drop is expected, our method does not rely on 1-to-1 correspondences but rather on approximate part matching across objects. As shown with the in-the-wild appearance transfer results (Appendix: Fig 4), we are able to support diverse object categories; for example, patterns from the body of the giraffe are transferred to the seat, arms, and backrest of the chair, while the appearance of the giraffe legs is transferred to the chair legs. However, in extreme cases where there is no semantic correspondence between the two meshes, the co-segmentation can fail due to inconsistent features, whether SLAT or PartField. We will add a qualitative analysis of co-segmentation failure cases to our manuscript for the final version.

Q6: How does the self-similarity loss handle such cases, and does it risk homogenizing the texture across all geometrically similar regions, even if the text prompt implies variation?

$\mathcal{L}_{\text{structure}}$ encourages intra-cluster consistency and inter-cluster contrast in appearance by operating on geometry-derived feature clusters. However, it does not enforce global texture uniformity across all geometrically similar parts. Instead, the contrastive formulation (Eq. 6) preserves local appearance consistency while allowing global variation across different clusters. Clustering is performed in the latent geometry-feature space, which captures more than raw shape similarity (e.g., spatial location, curvature context). In practice, this often separates semantically distinct parts even if they share structural traits.
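As a rough illustration of this intra-/inter-cluster behaviour, a generic InfoNCE-style contrastive loss over cluster-labelled features could look as follows (an illustrative stand-in under assumptions, not the exact form of Eq. (6)):

```python
import torch
import torch.nn.functional as F

def self_similarity_contrastive(appearance_feats, cluster_labels, temperature=0.1):
    """Generic InfoNCE-style stand-in for a cluster-based self-similarity loss.

    appearance_feats: (N, D) appearance features of the generated voxels.
    cluster_labels:   (N,) geometry-derived cluster id per voxel.
    Encourages consistency within a cluster and contrast across clusters.
    """
    f = F.normalize(appearance_feats, dim=-1)
    sim = (f @ f.t()) / temperature                      # (N, N) scaled cosine similarity
    same = cluster_labels[:, None] == cluster_labels[None, :]
    eye = torch.eye(len(f), dtype=torch.bool, device=f.device)
    pos = same & ~eye                                    # positives: same cluster, not self
    # Softmax denominator over all other voxels (self excluded).
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    has_pos = pos.any(dim=1)                             # skip anchors with no positives
    loss = -(log_prob * pos)[has_pos].sum(dim=1) / pos[has_pos].sum(dim=1)
    return loss.mean()
```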

[1] Perla, Sai Raj Kishore et al. "EASI-Tex: Edge-Aware Mesh Texturing from Single Image", SIGGRAPH 2024.

[2] Alaluf, Yuval et al. "Cross-Image Attention for Zero-Shot Appearance Transfer", SIGGRAPH 2024.

[3] Botti, Filippo et al. "Mamba-ST: State Space Model for Efficient Style Transfer", WACV 2025.

[4] Liu, Kunhao, et al. "StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields." CVPR 2023.

[5] Liu, Kunhao, et al. "StyleGaussian: Instant 3D Style Transfer with Gaussian Splatting." SIGGRAPH Asia 2024.

Comment

Having read the authors' rebuttal, I agree to increase my score to 4 (borderline accept). I acknowledge the work done to improve the evaluation and comparison, but part of my concerns about the possibly biased evaluation remains, as does the limited scope of the furniture scenario.

Comment

Thank you for acknowledging our efforts and raising your score. We appreciate all your comments.

Regarding evaluation bias, we would like to note that we conducted a full-scale unbiased study on Amazon Mechanical Turk (AMT) with 21 participants rating 51 examples, using the same rubric as GPT-4. The rankings were consistent with our initial LLM-based evaluation, confirming that our setup is unbiased and human-aligned, similar to recent work in the text/image-to-3D generative domain [1, 2, 3].

We chose furniture due to its high structural variability; for example, chairs and tables have different numbers of legs, shapes (e.g., an armchair vs. an office chair), and supports (e.g., a round coffee table vs. a rectangular work desk). This makes part correspondence and appearance transfer particularly challenging. In contrast, categories like animals or vehicles often have a more regularized part structure, with most variation occurring in appearance rather than geometry. Results in Appendix: Fig. 4 showcase strong generalization beyond furniture-to-furniture scenarios.

As Reviewer zJCJ suggested, we will include more qualitative results to provide readers with a comprehensive understanding of the outcomes across different approaches. We welcome any suggestions you may have on improving the two aspects of the paper you mentioned.

[1] Peng, Yuang, et al. "DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation." ICLR 2025.

[2] Wu, Tong, et al. "GPT-4V(ision) Is a Human-Aligned Evaluator for Text-to-3D Generation." CVPR 2024.

[3] Maiti, Shalini, et al. "Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects." CVPR 2025.

Review (Rating: 4)

GuideFlow3D introduces a training-free framework for 3D appearance transfer, effectively addressing challenges posed by geometric differences between input and appearance objects. It leverages pre-trained rectified flow models and employs part-aware appearance and self-similarity guidance losses during inference. The method supports various appearance representations, including images and text. This approach offers a flexible solution for high-quality 3D asset stylization.

Strengths and Weaknesses

Strengths

  1. The paper is clearly structured and presented, making the proposed methodology and experimental results easy to follow and comprehend for the reader.
  2. GuideFlow3D’s training-free nature and its support for various appearance representations (images, text) make it a highly flexible framework for diverse 3D appearance transfer tasks.

Weaknesses

  1. The core methodology of GuideFlow3D can be perceived as an incremental combination of existing techniques. It heavily relies on pre-trained rectified flow models, and the concept of guidance for generative processes is well-established. The specific guidance losses, while adapted, may not introduce fundamentally new principles.
  2. A major limitation is the absence of direct comparisons with recent state-of-the-art 3D appearance and style transfer methods, such as TEXTure [1], StyleGaussian [2], and StyleRF [3]. Without benchmarking against these prominent approaches, it is challenging to ascertain GuideFlow3D's true competitive standing and relevance in the rapidly evolving landscape.
  3. The prompts and scenarios used in the experimental evaluation lack complexity and diversity, raising concerns about the method's robustness and its generalizability to more challenging real-world settings.
  4. The results shown typically use relatively clean input 3D meshes and clear appearance conditions. It's unclear how well GuideFlow3D performs when dealing with noisy, incomplete, or low-resolution input meshes, or low-quality/ambiguous appearance images/text. How does the quality of the input 3D mesh or the clarity of the appearance source (image/text) affect the output quality?

References

[1] Richardson, Elad, et al. "TEXTure: Text-Guided Texturing of 3D Shapes." ACM SIGGRAPH 2023 Conference Proceedings, 2023.

[2] Liu, Kunhao, et al. "StyleGaussian: Instant 3D Style Transfer with Gaussian Splatting." SIGGRAPH Asia 2024 Technical Communications, 2024, 1-4.

[3] Liu, Kunhao, et al. "StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

Questions

Please refer to weaknesses.

Limitations

Yes.

Final Justification

The authors' clarifications on the method's robustness and the scope of their experiments have resolved my concerns. The new experiments comparing GuideFlow3D against strong, recent baselines like EASI-Tex, Cross Image Attention, and Mamba-ST provide the necessary context and demonstrate the competitive performance of the proposed method.

Formatting Concerns

No

Author Response

We thank Reviewer nLwK for finding our paper clearly structured and our method's training-free nature highly flexible for diverse 3D appearance transfer tasks. Below, we address all the questions:

Q1: The core methodology of GuideFlow3D can be perceived as an incremental combination of existing techniques.

While guidance is a well-established topic in 2D denoising models, our work extends this concept to transferring appearance (texture and geometric details) between two objects in 3D. Our guidance formulation repurposes a given generative model for a new task (different from the one it was originally trained for) in order to leverage the inductive bias of a pretrained rectified flow model. We propose a novel inference-time mechanism that interleaves rectified flow sampling with latent-space optimization, incorporating custom guidance losses into the denoising trajectory of structured latents. This formulation enables conditioning the generative process without retraining or modifying the base model, extending universal guidance to 3D. This approach elegantly avoids the need for costly retraining or complex architectural modifications, e.g., ControlNet-like adapters (Reviewer dDnT), positioning it as a flexible, training-free solution (Reviewer dDnT, zJCJ). The proposed loss functions are tailored to the specific use case of 3D appearance transfer, by establishing a semantically meaningful part-aware correspondence and preserving intrinsic structures in the absence of explicit geometric supervision (Reviewer dDnT).

Q2: A major limitation is the absence of direct comparisons with recent state-of-the-art 3D appearance and style transfer methods.

While we compare with Trellis to show how guidance can improve its local editing capabilities to support 3D appearance transfer, we understand that it is challenging to ascertain a competitive advantage without benchmarking against recent style transfer works. Direct comparison with StyleRF [1] and StyleGaussian [2] is not feasible due to fundamental differences in representation and output format—these methods operate on implicit 3D scenes (NeRFs, Gaussians) and produce render-only outputs, while GuideFlow3D works with explicit textured meshes. Additionally, they rely on per-instance optimization and often require multi-view supervision, whereas our method is training-free and fast at inference (please also refer to our response to Reviewer dDnT Q4 for an explanation of how we inherently address these two methods' limitations). While TEXTure [3] tackles the same task as us, we compare with EASI-Tex [4], a more recent, faster, and scalable approach shown to perform better than TEXTure.

Following reviewer suggestions, we added three new baselines: (1) EASI-Tex (SIGGRAPH 2024) [4] is a conditional diffusion-based method that uses edge-aware conditioning and ControlNet to texture an existing 3D mesh from a single RGB image; (2) Cross Image Attention (SIGGRAPH 2024) [5] builds upon the self-attention layers of generative models and introduces a cross-image attention mechanism that implicitly establishes semantic correspondences across images; (3) Mamba-ST (WACV 2025) [6] adapts the behavior of State Space Models to simulate the behavior of cross-attention layers, which are able to combine two separate embeddings into a single output. Please note that (2) and (3) are 2D style transfer methods which we adapt to our use-case following the suggestion of Reviewer zJCJ. We conduct an experiment with image condition using $\mathcal{L}_{\text{appearance}}$ on the simple-complex set (same setting as Main Paper: Tab. 1). Below are the results:

| Method | Fidelity ↓ | Clarity ↓ | Integration ↓ | Quality ↓ | Adaptation ↓ | Overall ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Mamba-ST | 3.34 | 3.43 | 3.47 | 3.52 | 3.46 | 3.45 |
| Cross Image Attention | 2.61 | 2.64 | 2.69 | 2.71 | 2.68 | 2.66 |
| EASI-Tex | 1.78 | 1.83 | 1.900 | 1.89 | 1.85 | 1.82 |
| GuideFlow3D | 1.15 | 1.14 | 1.16 | 1.21 | 1.11 | 1.06 |

GuideFlow3D achieves strong improvements in both appearance transfer and structural clarity preservation. This demonstrates the ability of our approach to effectively translate visual and semantic cues into coherent, high-fidelity textures across different 3D objects. We will add these results alongside comprehensive benchmarking evaluation and qualitative comparison with the baselines on all of our experimental settings to the final manuscript.

Q3: The prompts and scenarios used in the experimental evaluation lack complexity and diversity.

The main paper focuses on controlled intra- and inter-category furniture benchmarks to enable systematic evaluation. However, we also show the generalizability of GuideFlow3D in open-domain scenarios (Appendix D: In-the-wild Appearance Transfer). These examples involve transferring appearance across semantically and geometrically dissimilar object categories, such as giraffe → chair and cabinet → airplane, covering varied object domains including animals, tools, and vehicles (Appendix: Fig. 4). Despite the increased complexity and lack of semantic alignment, our method successfully preserves material styles, maintains structural coherence via part-aware matching in the latent feature space, and produces realistic outputs. This robustness is also noted by reviewers: the in-the-wild appearance transfer in the Appendix looks great (Reviewer 2LHX), and the method tackles an important task… vital for large-scale applications (Reviewer zJCJ), highlighting GuideFlow3D's scalability to complex meshes and scenarios.

Q4: The results shown typically use relatively clean input 3D meshes and clear appearance conditions.

Our evaluation benchmark is designed for demonstrating content creation possibilities, hence we use clean input meshes to have a fair measure of performance. We also demonstrate the application of GuideFlow3D to real-world scenes in Appendix C: Scene Editing, where it stylizes reconstructed ScanNet objects with pose noise, occlusion, and class ambiguity. Without retraining or scene-level modeling, our method applies object-wise appearance transfer and maintains stability across pose and mesh variation, highlighting its suitability for practical AR/VR scenarios and digital asset creation.

[1] Liu, Kunhao, et al. "StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields." CVPR 2023.

[2] Liu, Kunhao, et al. "StyleGaussian: Instant 3D Style Transfer with Gaussian Splatting." SIGGRAPH Asia 2024.

[3] Richardson, Elad, et al. "TEXTure: Text-Guided Texturing of 3D Shapes". SIGGRAPH 2023.

[4] Perla, Sai Raj Kishore et al. "EASI-Tex: Edge-Aware Mesh Texturing from Single Image", SIGGRAPH 2024.

[5] Alaluf, Yuval et al. "Cross-Image Attention for Zero-Shot Appearance Transfer", SIGGRAPH 2024.

[6] Botti, Filippo et al. "Mamba-ST: State Space Model for Efficient Style Transfer", WACV 2025.

Comment

I thank the authors for their thorough rebuttal and the significant effort made to address the concerns. The authors' clarifications on the method's robustness and the scope of their experiments have also resolved my other concerns. I have raised my score to 4 (borderline accept).

Comment

Dear Reviewers,

Thank you for your efforts. As this paper received mixed ratings, please revisit the submission and actively check whether your earlier concerns have been adequately addressed in the author response and discussion before Aug 8.

Best, AC

Final Decision

This paper proposes a training-free method for transferring the appearance of an image onto a 3D mesh. The method leverages a pre-trained generative model together with a generative model guidance strategy, and introduces part-aware and self-similarity loss functions.

Reviewers raised concerns about the novelty, the rigor of the evaluation, and the lack of discussion and comparison with the latest baselines. The rebuttal addressed many of these issues by providing additional experiments and clarifications. However, concerns remain regarding the rigor of using GPT as a judge, even though additional (small-scale) user studies were provided. The AC shares this concern and believes the evaluation should be more rigorous. In addition, the paper should explicitly discuss and compare with more recent methods such as [a].

Given that the paper currently holds an average rating of 4.25, which is a positive score, and the introduced loss for the optimization framework is new, the AC recommends acceptance but requests that a more rigorous evaluation be included in the revision, including but not limited to a large-scale user study, different LLMs, multiple trials to reduce the uncertainty, and metrics like CLIPScore.

[a] TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion