AnyView: Few Shot Personalized View Transfer
Abstract
We address the task of learning a view from an image sample and then transferring it to novel objects.
Reviews and Discussion
The paper addresses customized image synthesis of an object from a specific viewpoint. It leverages the DreamBooth customization approach to learn object and view features, requiring only few-shot examples for custom object and view synthesis.
Strengths
Based on the claims of the paper:
- The method does not require any prior 3D data to synthesize a custom object with a specific view.
- It is a few-shot approach, requiring 3-4 examples for the object and only 1 example for the view to synthesize the image.
- They showcase the importance of background features for view synthesis.
Weaknesses
- Weak scientific novelty. The method simply uses DreamBooth to learn two different concepts, leveraging LoRA to do so.
- Limited quantitative and qualitative results. The authors provide results on only one dataset (DTU MVS), which is not enough to support the claims. The results may also be cherry-picked, especially since the authors manually estimate the transformation of the competitor model (Zero123) for the comparison.
- Not convinced by the claim of disentangling view synthesis from the concept. Not enough evidence is shown in the paper. The model is able to learn concepts (whether style, object, or scene), and it is more likely that it learns the scene rather than the view. This is why, in almost all results, any image synthesized from the top camera view shows the object facing the camera even though the reference object is not in that position. Also, the background always changes in the synthesized images (probably because of LoRA), which works against the claim of view disentanglement.
- Poor qualitative results. The synthesized images are not visually appealing (Fig. 5, Fig. 12), and the customized objects have poor fidelity.
- Lack of detail on how the training is done and how the quantitative metrics are computed.
Questions
I am willing to reconsider my review if the authors address the weaknesses mentioned above and convince me otherwise.
The paper introduces a method for generating personalized view transfers in image synthesis with diffusion models, using the DreamBooth setup combined with low-rank adaptation (LoRA). More specifically, it proposes AnyView, a system that disentangles "view" as an independent high-level concept and leverages the personalization/transfer-learning algorithm of DreamBooth to extract this concept. Experiments on the DTU and DreamBooth datasets validate its performance.
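To make the setup concrete, below is a minimal sketch of the DreamBooth-style fine-tuning step the summary describes, assuming standard Stable Diffusion components from the diffusers and transformers libraries. The rare-token prompt, model checkpoint, and hyperparameters are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch of a DreamBooth-style fine-tuning step for learning a
# "view" concept from a few images. The prompt and hyperparameters are
# illustrative, not the paper's actual settings.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed base checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

prompt = "a photo of an object from sks view"  # hypothetical rare-token prompt
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-6)

def training_step(pixel_values):  # pixel_values: (B, 3, 512, 512) in [-1, 1]
    # Encode images into the VAE latent space.
    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
    # Sample a random timestep and add the corresponding noise.
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = scheduler.add_noise(latents, noise, t)
    # Condition on the rare-token prompt and predict the added noise.
    ids = tokenizer(prompt, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids
    text_emb = text_encoder(ids.repeat(latents.shape[0], 1))[0]
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```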
Strengths
- The paper brings forward a novel approach by defining the concept of "view" in image synthesis. Through the DreamBooth personalization algorithm, the method requires minimal 2D data and no 3D information to extract this concept.
- The experimental setup is robust, with evaluations on datasets such as DTU and comparisons against several benchmarks, demonstrating AnyView's performance.
- The paper provides a comprehensive analysis giving insight into how design choices such as LoRA, the background, and object complexity affect the approach.
Weaknesses
- Evaluation Metrics. My strongest concern is the use of evaluation metrics to assess the method's performance. While traditional metrics like SSIM and LPIPS are presented, they may not fully capture the visual quality of the view-transferred novel images. Incorporating perceptual metrics or human-study evaluations, which can assess the fine details of the generated images, is critical for this few-shot synthesis technique (see the sketch after this list for the standard computation of these metrics).
- Dependence on Background Cues. While the model effectively generates personalized views, it relies significantly on spatial cues from backgrounds, as discussed in Sec. A.5. This dependence may limit its adaptability to varying or abstract backgrounds.
- Complex Multi-Object Scenes. While the paper acknowledges that the method faces challenges in maintaining view consistency and object separation in multi-object scenes, more experimentation on such cases could further validate the method's efficacy, and potential modifications could address this complexity.
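As referenced above, here is a minimal sketch of the standard SSIM/PSNR/LPIPS comparison between a generated image and a ground-truth view, assuming the scikit-image and lpips packages. The paper's exact (possibly masked) protocol is not specified, so this is an illustrative baseline rather than its evaluation code.

```python
# Minimal sketch of the standard SSIM / PSNR / LPIPS comparison between a
# generated image and a ground-truth view. This is an illustrative
# baseline, not the paper's evaluation code.
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-based perceptual distance

def evaluate_pair(generated: np.ndarray, reference: np.ndarray) -> dict:
    """Both inputs are (H, W, 3) uint8 arrays of the same size."""
    ssim = structural_similarity(generated, reference, channel_axis=-1)
    psnr = peak_signal_noise_ratio(reference, generated)
    # lpips expects (1, 3, H, W) float tensors scaled to [-1, 1].
    to_tensor = lambda im: (torch.from_numpy(im).permute(2, 0, 1)[None]
                            .float() / 127.5 - 1.0)
    with torch.no_grad():
        dist = lpips_fn(to_tensor(generated), to_tensor(reference)).item()
    return {"SSIM": ssim, "PSNR": psnr, "LPIPS": dist}
```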
Questions
Please refer to the weakness section.
Thank you for addressing the issues; I will maintain my rating.
The paper introduces AnyView, a method for transferring specific viewpoints to novel objects using few-shot learning with diffusion models. Building upon DreamBooth, the authors demonstrate that a pretrained stable diffusion model can learn the high-level concept of a view from a single image without relying on explicit 3D priors. They use Low-Rank Adaptation (LoRA) to separately learn the view and object concepts and then merge them to generate images of the novel object from the desired viewpoint. Experiments on the DTU dataset and natural images show that AnyView can efficiently generate reliable view samples, outperforming several methods in certain metrics.
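To illustrate the merging step the summary describes, here is a minimal sketch of combining two independently trained LoRA adapters, one for the object and one for the view, into a single base weight matrix. The scale factors and shapes are illustrative; the paper's actual ZipLoRA-style merge may weight or align the adapters differently.

```python
# Minimal sketch of merging two independently trained LoRA adapters
# (object and view) into a single base weight matrix. Scale factors and
# shapes are illustrative; the paper's merge may differ.
import torch

def merge_loras(W: torch.Tensor,
                A_obj: torch.Tensor, B_obj: torch.Tensor,
                A_view: torch.Tensor, B_view: torch.Tensor,
                alpha_obj: float = 1.0, alpha_view: float = 1.0) -> torch.Tensor:
    """W: (d_out, d_in); each LoRA is a low-rank pair B @ A with
    A: (r, d_in) and B: (d_out, r), so B @ A matches W's shape."""
    return W + alpha_obj * (B_obj @ A_obj) + alpha_view * (B_view @ A_view)

# Toy usage on a single linear layer.
d_out, d_in, r = 320, 768, 4
W = torch.randn(d_out, d_in)
A_obj, B_obj = torch.randn(r, d_in), torch.zeros(d_out, r)
A_view, B_view = torch.randn(r, d_in), torch.zeros(d_out, r)
W_merged = merge_loras(W, A_obj, B_obj, A_view, B_view)
```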
Strengths
- This work studies a new task, learning a view from a single image.
- The authors provide evidence that diffusion models can learn and transfer high-level concepts like views, which could have broader implications for generative modeling.
Weaknesses
- The method may struggle with complex scenes involving multiple objects or significant occlusions, as noted in the limitations.
- The objects shown in the experiments are all simple objects or celebrities, for which the SDXL model is highly likely to have priors. Using some unique objects would make the method more convincing.
- The reference images of the rare objects shown in the appendix already cover diverse views, some even very similar to the view to be learned. This undermines the demonstrated effectiveness of the method.
- The method is a simple combination of DreamBooth and ZipLoRA, offering limited innovation.
- The figures in the paper are very low-resolution and unclear, making them difficult to read.
Questions
See weaknesses above.
Sorry, to make it clear, I mean the rows in your new Fig. 12 part 2. The pig-toy images look like obvious failures to me. Besides, if the authors would like to demonstrate the effectiveness of the method, it lacks quantitative metrics for evaluation. I think it would be better to define the orientation or coordinates of both the view and the object to provide some quantitative measurement.
The paper presents a novel approach for enhancing personalization in visual tasks through a method that integrates personalization and Low-Rank Adaptation (LoRA) fine-tuning for view transfer. The authors address the "forgetting" problem prevalent in personalization models when learning multiple IDs simultaneously, providing a clear motivation for their proposed architecture. Through thorough experimentation, the method demonstrates competitive performance results against various baselines, including methods that have been trained on more data. The findings are substantiated across multiple evaluation metrics.
Strengths
- The paper is clearly articulated, with a well-defined methodology supported by diagrams, effectively motivating the proposed architecture and addressing the "forgetting" problem in personalization models.
- The method demonstrates notable performance improvements over comparable baselines, achieving competitive results against specialized novel view synthesis (NVS) methods despite relying on a simpler approach with fewer input images. The authors provide comprehensive results across various metrics, including SSIM, PSNR, and LPIPS, highlighting the effectiveness of their approach in diverse scenarios.
- The authors provide a thorough analysis of design choices, including the impact of background in training data and object complexity on the method’s performance.
- The originality of the approach is noteworthy, particularly in its application of personalization and LoRA fine-tuning for view transfer tasks.
Weaknesses
- Evaluation Metrics: The evaluation method used, particularly regarding masking, is unclear and warrants further explanation.
- Qualitative Results: Some views selected for the qualitative results appear standard, especially in the Figure 5 comparison visualizations with Zero-1-to-3. Additionally, there is a lack of qualitative results for other baselines; although some are presented in Figure 13 in the appendix, they are not entirely convincing, as AnyView's results seem to lack fine-grained viewpoint consistency in view synthesis while also lacking some fine-grained details in complex tasks.
- Background Sensitivity: The performance of the method is inconsistent, particularly sensitive to background context. Structured backgrounds (e.g., forests, tables) yield more reliable results than uniform backgrounds (e.g., grass). This sensitivity suggests the model may learn unintended spatial cues, potentially hindering its generalizability across diverse environments and highlighting a brittle understanding of 3D relationships.
- Evaluation Gaps: The paper lacks an analysis of failure cases and their characteristics, which is critical for understanding the limitations of the approach.
- Technical Limitations: There is no clear strategy for managing multiple objects or complex scenes within the proposed framework.
Questions
- What is the precise definition of "masking" in the context of your evaluation methodology?
- How is the view generated from a single image, considering that views are inherently relative? Is it relying on the semantic prior of "canonical pose" of diffusion models? Would incorporating multiple images of the same view but with different objects improve performance? Additionally, how does the method address viewpoint ambiguity?
- What does Figure 7 represent? Are all the images inference attempts to generate an ideal view? The inconsistencies in these images raise questions regarding their interpretation.
- What specific failure cases or limitations can you identify in your approach?
- Is it feasible to extend this method to effectively handle multi-object scenes, and if so, how would that be accomplished?
The paper received mixed scores from the reviewers. While the reviewers appreciate the interesting task and comprehensive analysis, they also find the method's novelty incremental, the evidence weak, and the dataset limited; these concerns were not fully addressed in the rebuttal. The authors are encouraged to address these comments in a revised version.
Additional Comments from Reviewer Discussion
During the discussion, the reviewers all mentioned that the rebuttal was not satisfactory and that they were not fully convinced. More solid experiments are needed to justify the proposed method. Reviewer TD66 also mentioned that controllable background generation remains a problem.
Reject