OscillationInversion: Understand the structure of Large Flow Model through the Lens of Inversion Method
摘要
评审与讨论
The paper describes a method a method called 'Oscillations Inversion' that allows one to recover the latent noise corresponding to an image in a rectified flow model. Using this method, the paper discovers the phenomena that repeated repeated mapping and inversion between the latent representation and the pixel space images does not converge to a fixed point, rather it oscillates between distinct clusters.
Then, the paper proposes a method for guiding this oscillation using finetuned inversion. This refers to finetuning the flow model to align the velocity field of an inpainted image with the original image. This finetuning procedure is used to perform high-quality inpaining.
Finally, the paper proposes a method for further quality enhancement: post-inversion optimization. This enables the optimization of the latent representation to improve image fidelity metrics.
The experimental results show good performance on image inpainting and image enhancement tasks.
优点
The paper showcases a very interesting finding in that if one uses the proposed inversion method, the image and their latent representation oscillates between different clusters.
The paper proposes two further methods, finetuned inversion and post-inversion optimization that are applied to tackle denoising and image inpainting tasks. The paper showcases impressive results on these tasks.
缺点
The oscillatory behavior in the fixed point iteration is induced by the error in the approximate inversion. A flow model defines a 1-to-1 mapping between the noise representations and latent representations. If the inversion process were exact, the process would not oscillate. It would map back and forth between the noise representation and the latent representation. Therefore the finding of the oscillatory behavior is not a general finding about flow models, rather its something specific to this inversion process.
The paper does not provide sufficient evidence for its central claim: The images and their latent representation oscillate between local clusters. The paper claims this is the case empirically, but provides no experimental data. The claim also isn't justified theoretically or an intuitive reasoning given why that would be the case.
Lack of justification for methods. It is not clear where the training objectives come from and why the methods achieve the desired results.
- Finetuned inversion: It is not clear why this training objective would induce more diverse oscillatory behavior in latent space. The paper lacks justification or experimental data that supports this claim.
- Post-inversion optimization: It is unclear why this complex training objective is chosen over a simple fidelity loss with a MSE regularization term.
Paper formatting:
- Figure 6 is not legible.
- Page 5 margin violation
- Top of Page 6 odd figure spacing?
- Table formatting does not follow ICLR convention.
问题
-
Is the GP loss in Eq 9 equivalent to MSE? if z_i is fixed, the first and last terms are constant. The middle terms for an RBF kernel is simply the squared distance so this loss should be equivalent the MSE.
-
In post-inversion optimization: How are the points assigned to clusters? How are the clusters identified?
The paper studied the oscillation problem in the inversion problem of rectified flows. The authors use a numerical method to justify the claim that oscillations exist as groups. Then they introduced finetuned inversio to make the separated clusters to align with customized semantics. The observation in the paper is largely expereiment based and the method introduced is strainghtforward.
优点
The oscillation phenomen in the inversion problem of rectified flow is identified numerical. A solution to uitlize the oscillation is proposed to improve the quality of image generations.
缺点
The main weakness is that the observation is identified through experiments not analysis. It is hard to fullly make the readers convinced of their contribution. The optimization method to deal with oscillation is strainghtforwad without further explanation.
问题
- The reference format in section 2 looks very strange. Should be updated.
- As the paper is a direct follow up of paper Liu et al 2022, it is much better to elaborate the formulation in section 3.1.1.
- In (2), the integral on t is missing compared to Liu et al 2022.
-
The lines 270-272 are not correctly displayed -
In line 370, "Section ??"
The paper presents a novel method called Oscillation Inversion for enhancing image manipulation capabilities in rectified flow-based large text-to-image diffusion models. The authors observe that, unlike conventional inversion methods that converge smoothly, their proposed approach oscillates between distinct semantic clusters.
优点
- The method's applicability to diverse tasks (e.g., image restoration, enhancement, and makeup transfer) demonstrates its flexibility and potential for real-world uses.
缺点
- Are there other models to perform the experiments with? The experiments are just on one checkpoint. (FLUX)
问题
See the weaknesses.
The paper propose a method for solving the image inversion problem. The authors observe that when using a fixed point iterative method to solve the inversion problem, the solutions oscillates number of points. Based on this observation a method for image edit, image reconstruction, and super resolution is derived.
优点
- The suggest method for solving the inversion problem which is of interest for a large audience.
- The observation of oscillatory behavior for fixed point method on flow models and it relation to semantic meaning is novel. And the author are able to reproduce this behavior on smaller model as well.
- The method is able to produce SOTA results.
缺点
-
No clear algorithm for applying the method is provided. This makes it hard to understand the order of which the 3 stages of the method, Group Inversion, Finetuned inversion, and Post-inversion Optimization is applied, or whether all of the three stages are always applied.
-
It is not clear how the use of multiple latent encoding in equation 6. is used for image reconstruction and image super resolution.
-
In equation 8 appears and its said to be "a customized loss function" but no example of such loss function is given.
-
images in figure 6 are too small
问题
Can the authors please expand on weaknesses 1 and 2.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.