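Average rating: 5.0/10, Poster, 4 reviewers
Lowest 4, highest 6, standard deviation 0.7
Average confidence: 3.3, Correctness: 2.8, Contribution: 2.5, Presentation: 2.5

NeurIPS 2024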

Simulation-Free Training of Neural ODEs on Paired Data

Semin Kim,Jaehoon Yoo,Jinwoo Kim,Yeonwoo Cha,Saehoon Kim,Seunghoon Hong

Submitted: 2024-05-14, Updated: 2024-11-06

TL;DR

We propose a simulation-free training method for Neural ODEs by adopting a flow matching objective with learnable embeddings.

Keywords

Neural ODE, simulation-free training, flow matching

Reviews and Discussion

Review

Rating: 5, Confidence: 4, 2024-07-04

The paper revisits using neural ordinary differential equations (NODEs) for modeling deterministic maps on paired data, e.g., maps that solve regression and classification problems. The authors propose utilizing flow-matching (FM), a recent simulation-free training method for NODEs, to overcome the computational overhead of traditional NODE training and inference. The paper first presents the problems with simply using FM for learning maps on paired data since the learned map by FM is not guaranteed to preserve the coupling presented in training but only to preserve the distributions. Second, the authors propose to solve the coupling preservation problem by adding a learned encoder-decoder on the label space and an additional encoder on the data space to "rewire" the trajectories so that they do not cross. Then, the coupling will be preserved through FM training. The authors demonstrate the efficacy of their approach on classification and regression tasks and show that learning with FM also facilitates learning linear maps, which can be inferred in a single function evaluation - alleviating both training and inference shortcomings of traditional NODEs.

Strengths

The paper presents an interesting approach towards end-to-end learning of NODEs with flow matching when preserving the training set coupling is required.
Experiments validate the efficiency of the approach, achieving SOTA performance - both in metrics and runtime.

Presentation

Ideas and concepts in the paper are often presented with provided intuition, guiding the reader to the logic behind the algorithmic choices.

Weaknesses

Presentation

Although intuitive explanations were mentioned as a strength, in some cases they fall short. The paper lacks formal and rigorous explanations of the method and experimental settings. For instance:
- Adding noise to labels (L164): I am unsure I understand the setting here. Is the noise added to the ground truth labels in the L2 loss, or to the embeddings $z_1$? An equation explicitly stating the noise addition would greatly help clarify this.
- In the preliminaries section, the introduction of flow matching is inaccurate and misses the main point: FM does not regress to the marginal velocity field but to a conditional velocity field (called linear dynamics in the paper), while the marginal is not necessarily linear.
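For reference, the distinction between the conditional and the marginal field can be written out for the linear interpolant. This is the standard flow matching formulation with notation chosen here for illustration (the symbols $h_\theta$ for the learned dynamics and $u_t$ for the marginal field are assumptions), not a quote from the paper:

```latex
z_t = (1 - t)\, z_0 + t\, z_1, \qquad
v_t(z_t \mid z_0, z_1) = z_1 - z_0 \quad \text{(conditional target, constant along each path)}

\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\,(z_0, z_1)} \left\lVert h_\theta(z_t, t) - (z_1 - z_0) \right\rVert_2^2

u_t(z) = \mathbb{E}\left[\, z_1 - z_0 \mid z_t = z \,\right] \quad \text{(marginal field, the minimizer of } \mathcal{L}_{\mathrm{FM}} \text{; generally not linear)}
```

When two paired paths pass through the same point at the same time, the marginal field averages their velocities, which is why the learned flow need not preserve the coupling.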

Method

A naive solution to the crossing trajectories problem would be to augment the dimension of the learned flow, as suggested and discussed in [1,2]. This is in a sense similar to training a conditional model, where the condition is the initial point. The paper misses the sanity check and comparison with the most naive baseline, combining augmented NODEs with flow matching. I wonder: if one only learns $d_\psi$ and $g_\varphi$, sets $f_\phi$ to be the identity, and trains the learned flow on a space augmented with the initial point, would that achieve better or worse results? According to [2], augmenting the flow model with the initial point would solve the crossing trajectories problem and allow the couplings to be preserved.
I find the motivation for using NODEs for regression and classification rather weak when the learned map is linear (i.e., solved by a single NFE). In a sense, the motivation for using NODEs was to utilize "infinite" depth, in a shared-weights manner, by learning a time-dependent function. But in the case presented in the paper, most of the "heavy lifting" in learning the representation is already done by the encoder-decoder neural nets. The learned velocity could be thought of as a last layer, represented by some non-linear function, that separates the data, which is not linearly separable as shown by the experiments in Section 6.3. Indeed, Table 1 shows little performance gain from using NFE $>1$.

[1] Augmented Neural ODEs, Dupont et al. (2019)

[2] Augmented Bridge Matching, De Bortoli et al. (2023)

Questions

Can the authors provide a comparison to augmented NODEs+FM training, or provide an explanation as to why they think this may not work?
Regarding the second point in the method weaknesses, I would be happy to extend the discussion on this point and think about whether some experiment could better justify the use of NODEs here.

Typos:

In Figure 1 (c), the title says NFE=1, but trajectories are curved. Could there be a typo in the plot title?
Figure 1 caption labels (c) and (d) are mislabeled as (b) and (c).
Table 1, CIFAR10, RNODE: the throughput value seems like a typo. It is similar to that of the simulation-free methods, while it should be lower.

Limitations

Yes.

Author Response

2024-08-07

Q1. The paper lacks formal and rigorous explanations of the method and experimental settings. For instance, adding noise to labels (L164).

A1. We appreciate the comment and will revise the presentation of the method section to be clearer in the final version of the manuscript. For adding noise to labels (L164), we let the label decoder reconstruct from a noisy label embedding. Formally,
$\mathcal{L}(\psi, \varphi) = \mathbb{E}\left[\lVert d_{\psi}(g_{\varphi}(y) + \epsilon) - y \rVert_2^2\right]$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
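The loss in A1 can be illustrated with a minimal NumPy sketch. The linear encoder and pseudo-inverse decoder below are hypothetical stand-ins for $g_\varphi$ and $d_\psi$ (the actual networks are not described in this thread), and the noise is assumed to be applied independently to each embedding coordinate:

```python
import numpy as np

def noisy_reconstruction_loss(encode, decode, y, sigma, rng):
    """Monte Carlo estimate of E[||d(g(y) + eps) - y||_2^2], eps ~ N(0, sigma^2)."""
    z1 = encode(y)                                # label embedding g(y)
    eps = sigma * rng.standard_normal(z1.shape)   # Gaussian noise added to the embedding
    y_hat = decode(z1 + eps)                      # decoder reconstructs from the noisy embedding
    return float(np.mean(np.sum((y_hat - y) ** 2, axis=-1)))

# Hypothetical stand-ins: a fixed linear encoder and its pseudo-inverse as decoder.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))               # maps 3-dim labels to 8-dim embeddings
W_pinv = np.linalg.pinv(W)
encode = lambda y: y @ W
decode = lambda z: z @ W_pinv

y = np.eye(3)[rng.integers(0, 3, size=64)]    # one-hot labels
loss_clean = noisy_reconstruction_loss(encode, decode, y, sigma=0.0, rng=rng)
loss_noisy = noisy_reconstruction_loss(encode, decode, y, sigma=0.5, rng=rng)
```

In this toy setup the noise-free loss is numerically zero, while the noisy loss is positive, so the decoder is pushed to be robust to perturbations of the label embedding.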

Q2. The introduction of flow-matching is inaccurate and lacks the main point. There are typos in Fig.1 and Tab. 1.

A2. We appreciate the reviewer's effort to help make our manuscript rigorous and accurate. We will revise the description of Eq. (3) in Sec. 2 to clearly convey that the target velocity field $v_t$ is not a marginal velocity field but a conditional velocity field, which is defined on a per-sample basis. For typos:

Fig. 1(c) shows the NFE for training, which is 1 in flow matching since it does not require a full trajectory to be calculated during training. We will revise the caption to state this clearly.
In Tab. 1, the throughput of RNODE on CIFAR10 is 0.19.

Q3. The paper misses the sanity check and comparison with the most naive baseline, combining augmented NODEs with flow matching.

A3. To address the reviewer's concern, we have extended our analysis in Sec. 6.3, adding ANODE+FM (augmented NODE [1] with flow matching) to Tab. R.1 of the rebuttal PDF. Our model demonstrates higher classification accuracy and a lower disagreement ratio compared to this baseline. This is because our model can relax target trajectory crossing, which is a consequence of learning the encoders with the flow loss. Since the fundamental reason for crossing is the predefined dynamics rather than insufficient dimensionality, simply allowing the encoders to augment the dimension (like ANODE) does not effectively prevent target trajectory crossing.

Q4. The motivation for using NODEs for regression and classification is rather weak when the learned map is linear (i.e., solved by a single NFE).

A4. We agree with the reviewer’s point that the motivation for using NODEs becomes weaker when the learned trajectory is solved by a single NFE. However, our method encompasses not only the linear dynamics but also any nonlinear dynamics that connects two endpoints $z_0$ and $z_1$ . As demonstrated in Fig. 4 of our paper, utilizing nonlinear dynamics (e.g., convex and concave) yields models that clearly benefit from additional NFEs.

In these cases, the motivation for using NODE as an infinite depth model remains valid. NODEs offer unique advantages such as parameter sharing across depth and the ability to freely trade off performance and computation without retraining. These properties make NODEs appealing compared to conventional neural networks, indicating their value as a research topic.

While our preliminary analysis of nonlinear predefined dynamics (L289-310) concluded that linear dynamics perform better than our nonlinear choices (i.e., convex and concave), this work can be extended by carefully designing new nonlinear dynamics that outperform linear ones. Future research could even make the dynamics a learnable, data-dependent component (L325-328), allowing the model to choose between simpler dynamics for inference cost and more complex dynamics for performance. We believe our work can serve as a starting point for such attempts by addressing the primary concern of heavy computation costs during NODE training.

[1] Augmented Neural ODEs, Dupont et al. (2019)

2024-08-12

I thank the authors for their response.

I have a few follow-up questions:

Can the authors describe the setting of the ANODE+FM experiment? How did they handle the different dimensionalities?
"simply allowing encoders to augment the dimension (like ANODE) doesn't effectively prevent the issue of target trajectory crossing." I had in mind adding an unrestricted dimension that is not bound to the FM loss, so it relieves the predefined trajectory crossings.

2024-08-13

We would like to thank the reviewer for the response.

In ANODE[1], zero padding is proposed as a method to augment the data dimension. Following this approach, in the ANODE+FM experiment, we applied different-sized zero padding to data and label to match the dimensionalities. As a result, data and label encoders contain no learnable parameters (because they are zero-padding), and we trained the dynamics function using only the flow loss.

Regarding the second question, we are a bit unsure about the suggested experimental setting and would appreciate it if the reviewer could provide further details, especially the meaning of “adding an unrestricted dimension that is not bound to the FM loss”. While we believe the current experiment represents one of the most straightforward approaches, we are open to extending our experiments if there is a more appropriate setting.

[1] Augmented Neural ODEs, Dupont et al. (2019)

2024-08-14

Following the description of the experiment, it seems that the augmented dimension does not contribute to resolving crossings. For simplicity, let us assume the data lies in $\mathbb{R}^2$ and the label lies in $\mathbb{R}$. Consider augmenting the data with an additional dimension, lifting it to $\mathbb{R}^3$ by padding with a zero.

So a data point $x \in \mathbb{R}^2$ is augmented with a zero, $\tilde{x} = [x, 0] \in \mathbb{R}^3$, and the label $y$ is augmented so that $\tilde{y} = [y, 0, 0] \in \mathbb{R}^3$. In this setting, the FM loss will return the zero velocity field on the augmented third dimension.
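The reviewer's argument can be checked numerically. The sketch below assumes linear flow matching, where the regression target for each pair is the difference between the padded label and the padded data point; `zero_pad` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def zero_pad(v, dim):
    """Pad the last axis of v with zeros up to `dim` entries."""
    pad = dim - v.shape[-1]
    return np.concatenate([v, np.zeros(v.shape[:-1] + (pad,))], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))      # data in R^2
y = rng.standard_normal((5, 1))      # labels in R

x_tilde = zero_pad(x, 3)             # [x, 0]
y_tilde = zero_pad(y, 3)             # [y, 0, 0]

# Linear flow matching target: the conditional velocity is z1 - z0.
target_velocity = y_tilde - x_tilde
third_dim_velocity = target_velocity[:, 2]   # identically zero for every pair
```

Because both endpoints are zero on the padded coordinate, the interpolated state and the target velocity are zero there as well, so the extra dimension cannot separate paths that cross in the original coordinates.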

The experiment I had in mind is to have the third dimension as a learned one (like an encoder), while at inference the label is the first dimension of the output.

The authors' current experiment does not make sense to me and does not represent a sensible baseline.

2024-08-14

Q1. Following the description of the experiment, it seems that the augmented dimension does not contribute to resolving crossings. (…) The experiment I had in mind, is to have the third dimension as a learned one (like encoder), but at the end, at inference the label is the first dimension of the output.

A1. We thank the reviewer for clarifying the detailed experimental setup. As a more concrete baseline to avoid the crossing trajectory problem, we additionally employed the idea from Augmented Bridge Matching (AugBM) [1], as suggested by the reviewer in the original response, which conditions the dynamics model on the initial point. As suggested by the reviewer, we chose the data encoder $f$ as the identity, and employed the label encoder $g_\varphi$ and decoder $d_\psi$ pre-trained with the reconstruction loss, whose latent dimension is matched with the data. Then, we trained the dynamics function $h_\theta$ on the augmented input $(z_0, z_t)$, similar to AugBM [1], using the flow loss $\mathbb{E}_t \lVert h_\theta(z_0, z_t, t) - (z_1 - z_0) \rVert$ until convergence. The result is given in Table 1 below:

Table 1. Results

| Method | Train accuracy | Disagreement ratio |
|---|---|---|
| Initial point conditioning | 45.50% | 33.47% |
| Ours | 98.80% | 0.02% |

While conditioning the dynamics function on the initial point can avoid the crossing of target trajectories in principle as in [1], in our preliminary study (Table 1) we observe that this approach suffers from under-fitting in practice. We conjecture that this is due to the increased variance of the loss and gradient introduced by conditioning the dynamics function on an additional initial point, which loses the Markovian property of the dynamics. This limitation is also discussed in the original paper [1] (page 9). In contrast, our method can jointly learn the encoders, which retains the Markovian property and hence reduces the variance. We appreciate the comment from the reviewer and will add more thorough comparisons and analyses with these baselines to the draft.
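The initial-point-conditioned flow loss described in A1 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the squared norm, the uniform sampling of $t$, and the toy dynamics functions `oracle` and `zero_field` are all hypothetical choices.

```python
import numpy as np

def conditioned_flow_loss(h, z0, z1, rng):
    """Monte Carlo estimate of E_t ||h(z0, z_t, t) - (z1 - z0)||^2 on linear paths."""
    t = rng.uniform(size=(z0.shape[0], 1))     # one time sample per pair
    zt = (1.0 - t) * z0 + t * z1               # linear interpolation between endpoints
    target = z1 - z0                           # conditional target velocity
    pred = h(z0, zt, t)                        # dynamics function also sees the initial point
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((256, 4))
z1 = -z0      # toy coupling: every straight path passes through the origin at t = 0.5

oracle = lambda z0, zt, t: -2.0 * z0               # uses the initial point to recover z1 - z0
zero_field = lambda z0, zt, t: np.zeros_like(zt)   # ignores all inputs

loss_oracle = conditioned_flow_loss(oracle, z0, z1, rng)
loss_zero = conditioned_flow_loss(zero_field, z0, z1, rng)
```

In this toy coupling all target paths cross at the origin, yet a function that reads $z_0$ can fit the targets exactly. This matches the statement that initial-point conditioning removes crossings in principle; it says nothing about the optimization difficulty the authors report in practice.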

[1] Augmented Bridge Matching, De Bortoli et al. (2023)

Review

Rating: 6, Confidence: 3, 2024-07-06

This paper presents an approach for instantiating a flow matching method for paired data $(x, y)$ without relying on iterative ODE solvers. The method uses an input encoder with a pair of target decoder and encoder to project the original data into a latent space. By imposing the form of the dynamics in latent space, the trajectory of the latent vector between $x$ and $y$ can be represented using a closed-form equation. This approach demonstrates competitive or superior performance compared to methods that require iterative ODE solvers and diffusion-based models.

Strengths

The idea and motivation are clearly exposed, with sufficient detail to understand the intuition behind the approach.
This work is a good example of the effort to link intuitions from various domains together to create a clear and simple method.
The results are interesting, providing evidence that nonlinear dynamics in latent space can be eliminated without compromising prediction accuracy.

Weaknesses

The presentation of the main Table 1 is not entirely clear to me. The main messages (smaller NFE vs. competitive results) are conveyed, but it is still a bit confusing to see the number of NFEs directly in the table for the methods requiring only one NFE.
Minor issues: many abbreviations are defined multiple times throughout the text. Please be careful if the authors used a writing assistant to help with the drafting.

Questions

This is an extra point for me: I am curious about the differences between this approach and discrete depth neural networks. If the dynamics between $z_0$ and $z_1$ can be expressed with an interpolation, we could say that this architecture is very similar to the discrete depth neural networks, where the processor in the latent space is a single linear layer with or without a skip connection. An example of such an architecture can be seen in Lusch et al. (2018). Did the authors try to compare the proposed method with such an architecture?

References:

Lusch et al. (2018), Deep learning for universal linear embeddings of nonlinear dynamics, Nature Communications, https://www.nature.com/articles/s41467-018-07210-0

Limitations

I confirm that the authors have sufficiently addressed the limitations of their approach.

Author Response

2024-08-07

Q1. Presentation issues in Table 1 and repeated definition of abbreviations in the main text.

A1. We thank the reviewer for highlighting these presentation issues. We acknowledge that some abbreviations (e.g., NFE or NODE) are defined repetitively. In the final version of our manuscript, we will revise the presentation of Tab. 1 and improve the readability of the main text to address these concerns.

Q2. Comparison with the discrete depth neural networks, where the processor in the latent space is a single linear layer with or without a skip connection.

A2. We thank the reviewer for bringing this interesting related work to our attention. Our model with linear dynamics shares its high-level motivation with Lusch et al. (2018) [1], which is to find an embedding space that yields linear dynamics between source and target. It is interesting to see that two different approaches, namely flow matching (with optimal transport) and the Koopman operator, converge on the same point. Regardless of the theoretical background, both approaches are appealing as they seek to interpret a nonlinear system within a well-studied linear framework.

At the same time, we have identified several differences between our work and the line of research based on Koopman operator theory. While those works mainly focus on a systematic way to obtain a linearized representation of the underlying nonlinear dynamics (with eigenfunction), our work aims to find a way to learn it in a simulation-free manner, avoiding the heavy computation of forward simulation (e.g., which appears in $\mathcal{L}_{lin}$ of [1]) from an initial state to an end state. Additionally, compared to the discrete depth neural networks that have a single linear layer processor, our proposed method is generally applicable to any nonlinear dynamics that connects two endpoints $z_0$ and $z_1$ , exemplified as 'convex' or 'concave' in our paper (L289-L298). This implies that in our case, it is possible to have a latent trajectory as a curve in non-Euclidean geometry whenever the interpolated state $z_t$ is tractable.

[1] Deep learning for universal linear embeddings of nonlinear dynamics, Lusch et al. (2018)

Review

Rating: 5, Confidence: 3, 2024-07-11

The authors propose to use the flow matching loss, which directly matches the dynamics of a neural ODE (NODE) model to a pre-defined (simple) vector field, for supervision tasks. While flow matching with simple linear vector fields is efficient, it cannot work well for supervision tasks because the paired data structure can require crossing trajectories, which cannot be achieved with an ODE and a linear vector field. To overcome this issue, the authors propose to use input and label encoders and to learn a simple linear vector field in the latent space. To prevent learning trivial dynamics (e.g., ignoring data), they also introduce a label decoder and a label reconstruction loss, which make the output latent signal meaningful. The authors validate their approach on various supervision tasks and show that the proposed framework outperforms competitors in terms of the cost-performance trade-off.

Strengths

At first glance, the proposed method seems too simple; using auto-encoders to match the latent dynamics with a simple one is somewhat straightforward. However, such a simple approach can remarkably improve NODE-based models for supervision tasks, by a significant margin compared to baseline NODEs.

The paper is extremely well-written and easy to follow.

Weaknesses

While this paper provides useful insights into NODEs, such as the crossing trajectory problems and the use of latent dynamics, these are well-known topics within the community of NODEs. Therefore, I believe this paper should be evaluated based on its practical application rather than its theoretical aspects.

From the practical perspective, while the proposed technique outperforms some popular NODE-based supervision models, it exhibits significantly lower performance compared to standard non-NODE-based baselines; the classification accuracy of 88.89% for CIFAR10 is not great. The fact that all NODE-based models failed at this task, and that the proposed model is at least better, does not bring much satisfaction (note that the diffusion-based CARD model can estimate uncertainty, though its performance is also not great).

I do not think that every model needs to achieve SOTA performance to be published. However, to at least have readers consider trying the proposed NODE-based framework instead of conventional finite-depth models (given that supervision is generally not approached using NODEs), I believe the proposed model should at least be compared to similar-sized non-NODE-based MLP and ResNet models.

Questions

The authors experimentally demonstrated that input/label encoders do not become arbitrarily complicated (i.e., learn all the information on the given task), thus the latent dynamics play a sufficiently significant role for solving the task. Can the authors intuitively explain how this is possible?

Limitations

The authors mention some limitations of the proposed method (e.g., assuming the underlying dynamics is linear).

Author Response

2024-08-07

Q1. While this paper provides useful insights into NODEs, such as the crossing trajectory problems and the use of latent dynamics, these are well-known topics within the community of NODEs. Therefore, I believe this paper should be evaluated based on its practical application rather than its theoretical aspects.

A1. We agree with the reviewer that the problem of crossing trajectories and the corresponding solution of using a latent space that augments the data dimension are already discussed in the previous NODE literature [1, 2]. These works mainly discuss the approximation capability of NODEs when the data dimension is insufficient, where NODEs without dimension augmentation are shown not to be universal function approximators [2, 3].

Our observation, however, concerns a different type of crossing trajectory problem that arises when applying flow matching for simulation-free training of NODEs. This issue stems from intersections in target trajectories induced by predefined dynamics, rather than from an inherent limitation of NODEs.

The toy experiment in Fig. 1 (or Fig R.1 of the rebuttal PDF) illustrates this difference. As shown in Fig. 1(b), NODE successfully fits the data even without a dimension augmentation. However, when we introduce flow matching for simulation-free training, the predefined linear dynamics lead to trajectory crossing (Fig. 1(c)). Our key contribution is resolving this issue by inducing a valid velocity field (Fig. R.1 (d)) for the dynamics function to regress on.

Q2. To at least have readers consider trying the proposed NODE-based framework instead of conventional finite-depth models (given that supervision is generally not approached using NODEs), I believe the proposed model should at least be compared to similar-sized non-NODE-based MLP and ResNet models.

A2. To address this concern, we conducted an additional experiment comparing our method with a ResNet model. Our main experiments primarily aimed to compare our proposed method with NODE baselines, using a simple CNN backbone following the convention of NODEs [4]. While this CNN model sufficiently demonstrates our method and allows comparison with NODE baselines, we found that we can boost the performance of our model by using stronger backbones.

We customized the ResNet-18 architecture for our model and compared it with a ResNet model. Both models have approximately the same number of parameters (11.2M). Our model achieved a classification accuracy of 94.5%, matching the ResNet-18 model's performance of 94.6% with similar training costs.

Unlike conventional neural networks, NODEs have the advantage of learning smooth, bijective, continuous transformations between source and target. For instance, this property makes NODE-based classification models robust to adversarial data perturbation, as studied in TisODE [5]. However, it was previously difficult to consider NODEs as replacements for conventional neural networks in supervised tasks due to performance issues from inaccurate gradient estimation [6, 7] and high training costs associated with numerical ODE solvers. As our method matches the performance and training cost of conventional neural networks, we believe NODEs with simulation-free training could now be viable alternatives to consider.

Q3. The authors experimentally demonstrated that input/label encoders do not become arbitrarily complicated (i.e., learn all the information on the given task), thus the latent dynamics play a sufficiently significant role for solving the task. Can the authors intuitively explain how this is possible?

A3. We appreciate your thoughtful comment. Intuitively, the dynamics function plays a crucial role in solving tasks as it is a main component to be used in solving an ODE initial value problem from $z_0$ to $z_1$ during inference. Fitting the dynamics function to the target vector field is essential, while input and label encoders support this by constructing an embedding space that prevents target trajectory crossings with given data pairs and predefined dynamics. Therefore, the dynamics function effectively remains as the key element in solving the task.

[1] Augmented Neural ODEs, Dupont et al. (2019)

[2] Dissecting Neural ODEs, Massaroli et al. (2020)

[3] Approximation Capabilities of Neural ODEs and Invertible Residual Networks, Zhang et al. (2020)

[4] Neural Ordinary Differential Equations, Chen et al. (2018)

[5] On Robustness of Neural Ordinary Differential Equations, Yan et al. (2019)

[6] Adaptive Checkpoint Adjoint Method for Gradient Estimation in Neural ODE, Zhuang et al. (2020)

[7] MALI: A memory efficient and reverse accurate integrator for Neural ODEs, Zhuang et al. (2021)

2024-08-12

I appreciate the authors' thoughtful response. I will be increasing the review score from 4 to 5.

2024-08-13

We would like to thank the reviewer for the constructive feedback and reassessment of our work. We will make sure to incorporate all discussions into our next revision.

Review

Rating: 4, Confidence: 3, 2024-07-12

This paper develops the Flow Matching (FM) algorithm to connect paired data. Due to the issue of crossing trajectories, FM in the data space cannot perfectly match associated pairs. To address this, the authors perform FM in an embedded space. Ultimately, they encode source and target data through an encoder end-to-end and learn FM loss in the embedded space. The main application is image classification.

Strengths

The paper begins with a reasonable motivation and is well-presented, making it easy to follow.

Weaknesses

While the motivation is to avoid incorrect pair connections due to crossing trajectories in the data space, embedding data into a latent space does not guarantee the prevention of trajectory crossings. I recommend that the authors visualize (on toy data or on real data) that the proposed method experimentally prevents crossing trajectories. Moreover, it would be valuable if the authors showed that trajectory crossing occurs in real-world scenarios by comparing the results with FM. For example, it would support both the motivation and the proposed method if the proposed method outperformed FM in connecting paired data (on real-world data).
The improvement in accuracy may be due to the additional embedding networks rather than resolving the issue of trajectory crossings. Therefore, it is recommended that the authors conduct experiments and visualize/demonstrate that the crossing issue can be resolved in a learned latent space on simple low-dimensional toy data.
The comparison group is weak. There are algorithms like ANODE [1] and FFJORD [2] designed to connect paths with simpler trajectories. It would be beneficial to compare against these algorithms as well.

[1] Augmented Neural ODEs, NeurIPS, 2019.

[2] FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models, ICLR, 2019.

Questions

In the real-world scenario (the classification/regression tasks presented in the paper), I am not sure whether crossing trajectories are a crucial factor that influences performance. I am curious whether the authors can compare their models with FM (although there may be issues with making the source and target dimensions equal when implementing FM).

Limitations

See the Weaknesses section.

Author Response

2024-08-07

Q1. Embedding data into a latent space does not guarantee the prevention of trajectory crossings. The improvement in accuracy may be due to the additional embedding networks rather than resolving the issue. Please show that the proposed method experimentally prevents the issue.

A1. We would like to first clarify that while an embedding space alone does not prevent trajectory crossings, the flow loss (Eq. (5)) encourages non-crossing trajectories. As described in our paper (L128-130), this objective is optimized when the encoders induce non-crossing target trajectories. As the flow loss remains high if the target trajectories intersect (a dynamics function struggles to fit multiple velocities at the same point simultaneously), jointly optimizing the encoders with the dynamics function lets the encoders relax trajectory crossings by adjusting the embeddings.
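A two-pair toy case makes the argument in A1 concrete. The sketch below is an illustration only, assuming linear target paths and a dynamics function of the form $h(z_t, t)$; the numbers are made up for the example:

```python
import numpy as np

# Two 1-D pairs whose straight-line paths cross: -1 -> 1 and 1 -> -1.
z0 = np.array([-1.0, 1.0])
z1 = np.array([1.0, -1.0])

t = 0.5
zt = (1 - t) * z0 + t * z1       # both paths sit at the same point, z_t = 0
targets = z1 - z0                # but they demand opposite velocities: +2 and -2

# A dynamics function h(z_t, t) can output only one value at (z_t = 0, t = 0.5).
# The MSE-optimal output is the mean target, which leaves a nonzero residual.
best_pred = targets.mean()
residual = np.mean((targets - best_pred) ** 2)
```

Here the best achievable prediction at the crossing point is zero velocity, with a residual of 4 that no dynamics function of $(z_t, t)$ can remove. Only changing the embeddings, so that the two paths no longer meet, can drive this part of the flow loss down.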

To visualize this, we have extended the 2D toy experiment from Fig. 1, comparing trajectories learned by NODE, flow matching, and our method. Results are presented in Fig. R.1 of the rebuttal PDF. The ground truth coupling (Fig. R.1(a)) with linear predefined dynamics induces multiple points of intersection, causing naive flow matching with an identity encoder to fail in preserving the original coupling (Fig. R.1(c)). Our method (Fig. R.1(d)) successfully learns an embedding space inducing non-crossing target trajectories, correctly fitting the data with proper coupling.

As demonstrated in the toy experiment, we believe that our model mainly benefits from solving trajectory crossing, rather than from additional architectural components. In fact, we use the same data encoder $f_\phi$ (which precedes the dynamics function) for all baselines in our experiments (Sec. 6.2-6.3), to ensure fair comparison. This results in roughly the same architecture for all methods in the experiment, where the only difference comes from the label encoder, which is used only in our method to achieve simulation-free training and is not used during inference. The ablation study in our paper (L257-276) also supports our claim that the accuracy improvement results from avoiding crossing trajectories. We further extended the study in Q3, to clearly show that our proposed method benefits from mitigating crossing trajectories.

Lastly, we kindly note that the performance gain compared to NODE baselines may come from a precise gradient calculation (L232-234), as NODE baselines using adjoint sensitivity methods can suffer from inaccurate gradient estimation [1, 2].

Q2. The comparison group is weak. (Comparison with ANODE and FFJORD)

A2. For a more comprehensive comparison, we compared with the ANODE [3] baseline, which utilizes zero-padding to increase the data dimension (Tab. R.2). Similar to the results discussed in Sec. 6.2, our method consistently outperforms ANODE in training cost, test accuracy, and performance in the low NFE regime. We will include these results in our final manuscript.

Regarding FFJORD [4], we found that its main focus is on improving continuous normalizing flow in terms of computational efficiency, rather than encouraging simpler trajectories. Thus, we would be happy to hear from the reviewer about how further baselines could be added, so that we can improve the presentation of empirical results. Besides, we believe that RNODE [5], which explicitly regularizes trajectories, already serves as a strong baseline, demonstrating fairly good performance in the low NFE regime for SVHN image classification (Tab. 1).

Q3. In the real world scenario, I am not sure if the crossing trajectory is a crucial point that influences the performance. (comparison with FM)

A3. To address the reviewer's concern about the importance of crossing trajectories in real-world scenarios, we compared two naive FM baselines on CIFAR10 while matching the source and target dimensions: one using a zero-padding encoder (ANODE+FM) and another using a learnable encoder which is trained with an autoencoding objective and then frozen during flow loss optimization (Autoencoder+FM). The analysis below expands our analysis in Sec. 6.3 (L257-276).

Tab. R.1 shows that the disagreement ratio and classification accuracy (measured on the training set) are significantly affected by learning the encoders with our proposed objective. As discussed in our paper (L267-272), trajectory crossing can be identified by a high disagreement ratio between one-step and multi-step predictions. Our observation that the naive FM baselines show high disagreement ratios suggests that they cannot properly resolve target trajectory crossing, thereby failing to fit the training dataset (low accuracy). Our model, however, achieves high accuracy by preventing most trajectory crossings, showing a low disagreement ratio.
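One way to compute such a disagreement ratio is sketched below. The paper's exact definition is not given in this thread, so the Euler solver, the 20-step reference, the argmax classifier, and the two toy velocity fields are all assumptions made for illustration:

```python
import numpy as np

def euler_solve(h, z0, nfe):
    """Integrate dz/dt = h(z, t) from t=0 to t=1 with `nfe` Euler steps."""
    z, dt = z0.copy(), 1.0 / nfe
    for k in range(nfe):
        z = z + dt * h(z, k * dt)
    return z

def disagreement_ratio(h, z0, classify, nfe_multi=20):
    """Fraction of samples whose one-step and multi-step predictions differ."""
    one_step = classify(euler_solve(h, z0, 1))
    multi_step = classify(euler_solve(h, z0, nfe_multi))
    return float(np.mean(one_step != multi_step))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((1000, 2))
classify = lambda z: np.argmax(z, axis=-1)     # hypothetical stand-in for the label decoder

# Straight trajectories: a constant velocity field, so a single Euler step is exact.
straight = lambda z, t: np.broadcast_to(np.array([1.0, -1.0]), z.shape)
# Curved trajectories: a rotation field, where one Euler step lands far from the endpoint.
J = np.array([[0.0, -1.0], [1.0, 0.0]])
curved = lambda z, t: np.pi * z @ J.T

ratio_straight = disagreement_ratio(straight, z0, classify)
ratio_curved = disagreement_ratio(curved, z0, classify)
```

With the straight field the two predictions agree on every sample, while the rotation field produces a clearly positive ratio. This is the sense in which a low ratio indicates that the learned velocity field is close to the straight-line target.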

We also measured the accuracy of predicted velocity, similar to the flow loss (Eq. (5)), by replacing MSE with cosine similarity to disregard the magnitude of the target velocity, which depends on the learned embedding space. If a target trajectory crossing problem occurs, the velocity prediction near the intersection point shows low cosine similarity. As shown in Fig. R.2, naive FM baselines suffer from crossings near endpoints, resulting in low cosine similarity. Our model mitigates this issue, consistently showing high cosine similarity across the entire range of $t$ .
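The cosine-similarity measurement can be sketched as follows, assuming linear interpolation (so the target velocity is $\mathbf{z}_1-\mathbf{z}_0$). The function names and the two-pair toy example are illustrative assumptions, not the authors' evaluation code.

```python
import math

def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between u and v; invariant to their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def mean_velocity_cosine(predict, pairs, t):
    """Average cosine similarity between predicted and target velocity at time t.

    pairs: list of (z0, z1) latent endpoints; linear interpolation is assumed.
    predict: callable (z_t, t) -> predicted velocity.
    """
    total = 0.0
    for z0, z1 in pairs:
        z_t = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]
        target = [b - a for a, b in zip(z0, z1)]
        total += cosine_similarity(predict(z_t, t), target)
    return total / len(pairs)

# Two straight paths that cross at (0.5, 0.5) when t = 0.5.
pairs = [([0.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [0.0, 1.0])]

# A single-valued predictor can at best output the average velocity there.
average_velocity = lambda z, t: [0.0, 1.0]
print(mean_velocity_cosine(average_velocity, pairs, 0.5))
```

At the crossing point both pairs share the same $\mathbf{z}_t$, so any single-valued predictor must trade off the two target directions, and the similarity stays below 1.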

By comparing our method with naive FM baselines, we observe that trajectory crossing is indeed crucial in real-world scenarios, and our proposed method benefits from effectively preventing it.

[1] Adaptive Checkpoint Adjoint Method for Gradient Estimation in Neural ODE, Zhuang et al. (2020)

[2] MALI: A memory efficient and reverse accurate integrator for Neural ODEs, Zhuang et al. (2021)

[3] Augmented Neural ODEs, Dupont et al. (2019)

[4] FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models, Grathwohl et al. (2019)

[5] How to Train Your Neural ODE: the World of Jacobian and Kinetic Regularization, Finlay et al. (2020)

2024-08-12

Thank you for the clarification. To be clear, I believed it is questionable whether it's truly possible to accurately model the straight trajectory in a situation where the Flow Matching (FM) loss and Embedding loss (AE loss) are mixed. I appreciate the authors for the additional visualization of the learned trajectory. However, I still have some concerns on the method and experiments.

First of all, this work does not develop a loss function or methodology to explicitly straighten the trajectory, as seen in [1] or [2]. Instead, the trajectory is implicitly straightened in the latent space because creating non-crossing trajectories is advantageous for minimizing FM loss in the embedding space. However, I believe there is a lack of theoretical evidence to support the claim that the trajectory is accurately linearized in a scenario where multiple losses are mixed. Moreover, it seems that much depends on the expressivity of the Encoder-Decoder in the embedding space, which raises concerns. I am particularly concerned that the ability to straighten the trajectory highly depends on the expressivity of the embedding network.

Furthermore, I believe that the concept of using an embedding to straighten the trajectory with FM is not particularly novel at this point. Learning through Flow Matching via a latent embedding has already been discussed in many recent works, including [3] and [4]. These works also discuss the straightened latent trajectory and show good performance in high-dimensional experiments, which leads me to believe that the contribution of this paper is somewhat limited.

Minor point: In the UCI regression task, RNODE with the Dopri solver (which also straightens the trajectory in the data space) performs better than the proposed method. I believe that the proposed method should show a noticeable performance improvement over such baselines.

For these reasons, I believe this paper has limited contribution, hence, will keep my score to 4.

References
[1] Liu, X., Gong, C., & Liu, Q. (2022). Flow straight and fast: Learning to generate and transfer data with rectified flow.
[2] Lee, S., Kim, B., & Ye, J. C. (2023, July). Minimizing trajectory curvature of ode-based generative models.
[3] Fischer, J. S., Gui, M., Ma, P., Stracke, N., Baumann, S. A., & Ommer, B. (2023). Boosting Latent Diffusion with Flow Matching.
[4] Dao, Q., Phung, H., Nguyen, B., & Tran, A. (2023). Flow matching in latent space.

Comment - Official Comment by Authors (1/3)

2024-08-13

We appreciate the opportunity to clarify our key claims and contributions. Our primary aim is to develop a method for training NeuralODE models on paired data in a simulation-free manner. To achieve this, we adopt a flow matching framework. Upon identifying that a naive form of flow matching results in crossing target trajectories, we introduce an embedding space that is learned end-to-end with flow loss. Moreover, our method accommodates a general interpolation form as predefined dynamics, not limited to linear dynamics.

Q1. However, I believe there is a lack of theoretical evidence to support the claim that the trajectory is accurately linearized in a scenario where multiple losses are mixed. [...] I am particularly concerned that the ability to straighten the trajectory highly depends on the expressivity of the embedding network.

A1. To theoretically support our claim that our method can learn embeddings with non-crossing latent trajectories using a combination of flow matching and autoencoding losses, we offer a formal proof here. In fact, we can show a general result for all trajectories of the form $\mathbf{z}_t= \alpha_t \mathbf{z}_0+ \beta_t\mathbf{z}_1$, which includes the non-linear trajectories in Sec. 6.3.

Suppose that we have a paired dataset $\mathcal{D}=\lbrace(\mathbf{x}_i,\mathbf{y}_i)\rbrace_{i=1}^{N}$ that consists of data $\mathbf{x}\in \mathbb{R}^{d_x}$ and labels $\mathbf{y}\in \mathbb{R}^{d_y}$. We have a data encoder $f_\phi$ and a label encoder $g_\varphi$ that transform data and labels into latents $\mathbf{z}_0, \mathbf{z}_1 \in \mathbb{R}^{d}$, where $\mathbf{z}_0=f_\phi(\mathbf{x})$ and $\mathbf{z}_1=g_\varphi(\mathbf{y})$, respectively. Here we assume $d>d_x, d_y$. We also have predefined dynamics $F(\mathbf{z}_0,\mathbf{z}_1,t) = \alpha_t \mathbf{z}_0+ \beta_t\mathbf{z}_1 =\mathbf{z}_t$. We assume that $\alpha_t$ and $\beta_t$ are smooth and nonzero except at $t=0$ and $t=1$.

Formally, we say that the encoders $(f_\phi,g_\varphi)$ induce a target trajectory crossing if there exists a tuple $(t,\mathbf{x},\mathbf{y},\mathbf{x}^\prime,\mathbf{y}^\prime)$ such that $\alpha_t f_\phi(\mathbf{x})+ \beta_t g_\varphi(\mathbf{y})=\alpha_t f_\phi(\mathbf{x}^\prime)+ \beta_t g_\varphi(\mathbf{y}^\prime)$ for $\mathbf{x}\neq\mathbf{x}^\prime$ and $\mathbf{y}\neq\mathbf{y}^\prime$.

Proposition 1. There exist $(f_\phi,g_\varphi)$ that always induce non-crossing target trajectories while minimizing the label autoencoding loss.

Proof. Let the latent space be spanned by a basis $\mathbb{I}=\lbrace\mathbf{e}_1,\mathbf{e}_2,\dots, \mathbf{e}_d\rbrace$. Since $d>d_y$, we can find a label encoder $g_\varphi$ that uses only the $k$ basis vectors $\mathbb{J}=\lbrace\mathbf{e}_1,\mathbf{e}_{2},\dots, \mathbf{e}_k\rbrace$ ($d>k\geq d_y$) and minimizes the autoencoding loss (i.e., $g_\varphi(\mathbf{y})=g_\varphi(\mathbf{y}')$ iff $\mathbf{y}=\mathbf{y}'$). Also, we can find a data encoder $f_\phi$ such that $\text{proj}_{\text{span}(\mathbb{K})}f_\phi(\mathbf{x}) = \text{proj}_{\text{span}(\mathbb{K})}f_\phi(\mathbf{x}')$ iff $\mathbf{x}=\mathbf{x}'$, where $\mathbb{K}=\lbrace\mathbf{e}_{k+1}, \dots, \mathbf{e}_{d}\rbrace$.

Then, suppose that there exists a tuple $(t,\mathbf{x},\mathbf{y},\mathbf{x}^\prime,\mathbf{y}^\prime)$ such that $\alpha_t f_\phi(\mathbf{x})+ \beta_t g_\varphi(\mathbf{y})=\alpha_t f_\phi(\mathbf{x}^\prime)+ \beta_t g_\varphi(\mathbf{y}^\prime)$ , i.e., $\alpha_t (f_\phi(\mathbf{x})-f_\phi(\mathbf{x}'))+ \beta_t( g_\varphi(\mathbf{y})- g_\varphi(\mathbf{y}'))= \mathbf{0}$ .

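Projecting this equation onto $\text{span}(\mathbb{K})$ removes the label term, since $g_\varphi$ maps into $\text{span}(\mathbb{J})$, and leaves $\alpha_t\,\text{proj}_{\text{span}(\mathbb{K})}(f_\phi(\mathbf{x})-f_\phi(\mathbf{x}'))=\mathbf{0}$. Since $\text{proj}_{\text{span}(\mathbb{K})}(f_\phi(\mathbf{x})-f_\phi(\mathbf{x}'))=\mathbf{0}$ iff $\mathbf{x}=\mathbf{x}'$ by construction, any $t$ with $\alpha_t\neq 0$ forces $\mathbf{x}=\mathbf{x}'$. When $\alpha_t=0$ and $\beta_t\neq 0$, the equation reduces to $\beta_t(g_\varphi(\mathbf{y})- g_\varphi(\mathbf{y}'))=\mathbf{0}$, which forces $\mathbf{y}=\mathbf{y}'$ because $g_\varphi(\mathbf{y})- g_\varphi(\mathbf{y}') =\mathbf{0}$ iff $\mathbf{y} = \mathbf{y}'$. Either case contradicts $\mathbf{x}\neq\mathbf{x}'$ and $\mathbf{y}\neq\mathbf{y}'$, so such a tuple does not exist. Therefore, there exist $f_\phi, g_\varphi$ that do not induce target trajectory crossing while minimizing the autoencoding loss.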

This finishes the proof.
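As a rough numerical illustration of the construction in Proposition 1, the sketch below assumes linear dynamics ($\alpha_t=1-t$, $\beta_t=t$), $d_x=d_y=1$, $d=2$, and $k=1$. The encoders `f_disjoint` and `g_disjoint` and the helper `min_gap` are hypothetical names chosen for this example only.

```python
import math

def interpolate(z0, z1, t):
    """Linear special case of z_t = alpha_t z_0 + beta_t z_1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def min_gap(pairs, f, g, times):
    """Smallest distance between two distinct target trajectories over `times`."""
    gap = math.inf
    for t in times:
        points = [interpolate(f(x), g(y), t) for x, y in pairs]
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                gap = min(gap, math.dist(points[i], points[j]))
    return gap

# Two 1-D pairs (0 -> 1 and 1 -> 0) whose straight paths cross at t = 0.5.
pairs = [([0.0], [1.0]), ([1.0], [0.0])]
times = [k / 100 for k in range(101)]

# Naive choice: data and labels share the same coordinate.
identity = lambda v: list(v)

# Construction from the proof: labels use e_1, data use e_2.
f_disjoint = lambda x: [0.0] + list(x)
g_disjoint = lambda y: list(y) + [0.0]

print("shared coordinates:", min_gap(pairs, identity, identity, times))
print("disjoint subspaces:", min_gap(pairs, f_disjoint, g_disjoint, times))
```

With shared coordinates the gap collapses to zero at the crossing, whereas the disjoint-subspace encoders keep the two trajectories apart at every sampled time.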

We particularly note that Proposition 1 does not require a highly expressive data encoder $f_\phi$, as it only requires $f_\phi$ to be injective (in the subspace $\text{span}(\mathbb{K})$ not utilized by the label encoder).

In addition, while $d_x$ and $d_y$ are dimensions in observation space, we conjecture that the latent dimension $d$ can be made smaller if the data lives on a low-dimensional manifold.

Comment - Official Comment by Authors (2/3)

2024-08-13

We now show that, if the label encoder is injective (e.g. by the label autoencoding loss), then minimizing the flow loss is equivalent to learning the data and label encoder to induce non-crossing target trajectory, and learning the dynamics function to fit the induced trajectories.

Proposition 2. If $g_\varphi$ is assumed to be injective, the following equivalence holds: $(f_\phi,g_\varphi, h_\theta)$ minimizes the flow loss $\|h_\theta(\mathbf{z}_t, t)-\frac{d}{dt}\mathbf{z}_t\|$ to $0$ for all $t\in[0, 1)$ $\Longleftrightarrow$ $(f_\phi,g_\varphi)$ always induce non-crossing target trajectories and $h_\theta$ perfectly fits the induced target velocity.

Proof. ( $\Longleftarrow$ ) If $(f_\phi,g_\varphi)$ always induce non-crossing target trajectories, there is a well-defined target velocity $\frac{d}{dt}\mathbf{z}_t$ at every $\mathbf{z}_t$, which is continuous in $t$. If $h_\theta$ perfectly fits this target velocity for all $(\mathbf{z}_t, t)$, the flow loss is $0$.

( $\Longrightarrow$ ) We prove this by contradiction. Suppose the flow loss is $0$ and there is a crossing trajectory, i.e., some $(t,\mathbf{x},\mathbf{y},\mathbf{x}^\prime,\mathbf{y}^\prime)$ such that $\mathbf{z}_t=\mathbf{z}'_t$ for $\mathbf{x}\neq\mathbf{x}^\prime$ and $\mathbf{y}\neq\mathbf{y}^\prime$. Since the loss is $0$ $\forall t\in[0, 1)$, the dynamics function $h_\theta$ must output $\frac{d}{dt}F(\mathbf{z}_0,\mathbf{z}_1,t)$ at $\mathbf{z}_t$, and $\frac{d}{dt}F(\mathbf{z}'_0,\mathbf{z}'_1,t)$ at $\mathbf{z}'_t$. This is a contradiction since at the point of crossing we have $\mathbf{z}_t=\mathbf{z}'_t$ but $\frac{d}{dt}F(\mathbf{z}_0,\mathbf{z}_1,t)\neq\frac{d}{dt}F(\mathbf{z}'_0,\mathbf{z}'_1,t)$.

This finishes the proof.
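The contradiction can be made concrete with a small sketch: when two trajectories with different target velocities pass through the same point, even the best single-valued output leaves a positive residual. The sketch assumes a squared-error form of the flow loss, and `best_single_valued_fit` is a hypothetical helper, not part of the authors' method.

```python
def best_single_valued_fit(velocities):
    """MSE-optimal single output for several target velocities at one point.

    Returns the optimal output (the mean velocity) and the residual
    mean squared error that no single-valued function can remove.
    """
    n = len(velocities)
    dim = len(velocities[0])
    mean = [sum(v[i] for v in velocities) / n for i in range(dim)]
    loss = sum(
        sum((v[i] - mean[i]) ** 2 for i in range(dim)) for v in velocities
    ) / n
    return mean, loss

# Crossing of the 1-D pairs 0 -> 1 and 1 -> 0 at z_t = 0.5 (linear dynamics):
# the two target velocities at the shared point are +1 and -1.
crossing_mean, crossing_loss = best_single_valued_fit([[1.0], [-1.0]])

# Without a crossing, only one target velocity is attached to the point.
single_mean, single_loss = best_single_valued_fit([[1.0]])

print("crossing:", crossing_mean, crossing_loss)
print("no crossing:", single_mean, single_loss)
```

The residual is zero exactly when all target velocities attached to a point coincide, which mirrors the role of the non-crossing condition in the equivalence.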

The above theoretical evidence aligns with our empirical results, demonstrating the effectiveness of our mixed loss approach. We will include the proof in the final version of our manuscript to strengthen our claim.

Comment - Official Comment by Authors (3/3)

2024-08-13

Q2. The approach of learning through Flow Matching via latent embedding has been already discussed in many recent works. These works also discuss the straightened latent trajectory and showed good performance in high-dimensional experiments, which leads me to believe that the contribution of this paper is somewhat limited.

A2. As discussed in L126-127 of the paper and in our previous response, we would like to clarify that our work differs in important ways from prior works that apply flow matching in a latent space.

Firstly, our method proposes to learn the embedding space jointly with the dynamics function to minimize the flow matching loss in an end-to-end manner. This contrasts with recent works that utilize a fixed latent embedding (e.g., the ones suggested by the reviewer [1,2]), typically obtained from a pretrained VQ encoder. As discussed in our paper (L257-276) and our response to reviewer 4GBu, a fixed latent does not resolve the issue of target trajectory crossing, which motivates the introduction of learnable encoders. Please note that we also empirically compared the proposed method to a flow matching baseline with a fixed embedding in Tab. 2 of the paper and in our rebuttal response A3 above, demonstrating that learning the data embedding is crucial in our problem.

Secondly, we clarify that there is a notable difference between the problem settings considered in our work and in prior works on latent flow matching. We focus on learning a deterministic mapping between paired data, which introduces the important constraint of preserving the original coupling of data pairs. This constraint makes certain recent techniques for straightening trajectories in generative tasks inapplicable to our setting: reflow [3], for example, straightens paths by altering the initial coupling, which breaks the original coupling of data pairs. This also motivated us to learn a latent embedding that straightens the path while preserving the coupling.

[1] Boosting Latent Diffusion with Flow Matching, Fischer et al., 2023

[2] Flow matching in latent space, Dao et al., 2023

[3] Flow straight and fast: Learning to generate and transfer data with rectified flow, Liu et al., 2022

Author Response

2024-08-07

Dear reviewers,

We appreciate the constructive feedback provided by all reviewers, which has significantly contributed to the improvement of our paper. We are encouraged by the positive recognition our paper has received, including:

"begins with a reasonable motivation and is well-presented" (YTiC),
is "a simple approach that can remarkably improve NODE-based models for supervision tasks" (4GBu),
is "a good example of the effort to link intuitions from various domains together to create a clear and simple method" (RQFX),
"presents an interesting approach" for "NODEs with flow matching when preserving the training set coupling is required" (bNYt).

In response to your valuable feedback, we have thoroughly revised our manuscript, addressing your concerns across several aspects. Please find our reviewer-specific feedback below. We look forward to any further comments or discussion.

Final Decision: Accept (poster)

2024-09-25

The reviewers agreed that the paper summary was as follows: a new training scheme for neural ODEs using “flow matching” with a pairwise loss on an embedding base is proposed and applied to image classification problems.

The reviewers were broadly at the threshold of a borderline accept, with one reviewer at borderline reject and one reviewer moving from borderline reject to borderline accept in light of the extra results. All reviewers agreed that the paper was well written and easy to follow.

Reviewer 4GBu and YTiC noted that the theoretical contribution would be limited. Reviewer 4GBu agreed it improved against NODE benchmarks, which was sufficient even though NODEs do get SOTA. (The AC notes that NODE methods can reach SOTA on CIFAR-10 and CIFAR-100 when having comparable parameter counts to the benchmark discrete NNs.)

The most favorable review (by RQFX) was the least thorough and the reviewer did not respond during the discussion period, and as such provided the AC with less evidence for the decision, but the review was not discounted.

The biggest issue the AC sees in the review is subtle but potentially major: (YTiC)

The improvement in accuracy may be due to the additional embedding networks rather than resolving the issue of trajectory crossings.

There was an in-depth discussion with reviewer YTiC on the effectiveness of the proposed approach and the attribution of the gains. The reviewer was not convinced by the proposed experiments and did not reply to an additional proof. This particular AC cannot judge the accuracy of the proof, but the AC is familiar with NODE models relying more on the non-NODE parts of the network than on the NODE part, and the reviewer’s concerns are convincing.

The authors added the following additional elements during the discussion period:

The authors added one additional page of results to the rebuttal with two plots and two tables, which made 4GBu increase to a borderline accept.
A proof was added in the discussion to address reviewer YTiC’s concern.
A second table of experiment results was added in response to bNYt.

The additional results and promised proof can be added into the camera ready version without constituting a major revision of the paper. The AC emphasizes to the authors that the additional results were necessary to increase the score.

Because most reviewers believe this paper is over the borderline of acceptance, and there are no major disqualifying issues, the AC believes the paper can be accepted. The issues highlighted by YTiC are subtle and could be settled by future investigators.