/10

Poster4 位审稿人

最低2最高4标准差0.7

ICML 2025

Discovering Physics Laws of Dynamical Systems via Invariant Function Learning

提交: 2025-01-18更新: 2025-07-24

摘要

We consider learning underlying laws of dynamical systems governed by ordinary differential equations (ODE). A key challenge is how to discover intrinsic dynamics across multiple environments while circumventing environment-specific mechanisms. Unlike prior work, we tackle more complex environments where changes extend beyond function coefficients to entirely different function forms. For example, we demonstrate the discovery of ideal pendulum's natural motion $\alpha^2 \sin{\theta_t}$ by observing pendulum dynamics in different environments, such as the damped environment $\alpha^2 \sin(\theta_t) - \rho \omega_t$ and powered environment $\alpha^2 \sin(\theta_t) + \rho \frac{\omega_t}{\left|\omega_t\right|}$. Here, we formulate this problem as an *invariant function learning* task and propose a new method, known as **D**isentanglement of **I**nvariant **F**unctions (DIF), that is grounded in causal analysis. We propose a causal graph and design an encoder-decoder hypernetwork that explicitly disentangles invariant functions from environment-specific dynamics. The discovery of invariant functions is guaranteed by our information-based principle that enforces the independence between extracted invariant functions and environments. Quantitative comparisons with meta-learning and invariant learning baselines on three ODE systems demonstrate the effectiveness and efficiency of our method. Furthermore, symbolic regression explanation results highlight the ability of our framework to uncover intrinsic laws.

关键词

ordinary differential equationinvariant learning

评审与讨论

审稿意见

评分: 42025-03-04

This paper propose a method to learn ODE based dynamical systems from observed sequences. The main feature of the proposed method which sets it apart from prior work is that it can learn invariant functions that can be reused across different environments, effectively disentangling general and reusable functions from environment-specific functions. This is motivated by the general context of automating scientific discoveries from observations. The authors propose to achieve this disentanglement by learning a function embedding which is maximally informative about the observed sequence while remaining independent on which specific environment the sequence is from. The paper offers a theoretical account of why this approach works and empirically verify their claims with three common dynamical systems that are modified to the multi-environment setting.

给作者的问题

See experimental design section and evaluation section.

What would happen if changes across function environments are more drastic and non-linear?
Please clarify the baselines.

post rebuttal: After the authors' clarification, I have raised my score to 4: accept.

论据与证据

Broadly speaking, the main claims of the paper are that 1) the proposed method can learn invariant functions and that 2) learning invariant functions leads to better prediction performances.

In the experiments, 1 is demonstrated through qualitative examples where decoding f_c gives reasonable trajectories that correspond well to the underlying invariant function. For claim 2, the authors compare against invariant learning baselines and meta-learning baselines. In all cases, the proposed method seems to outperform the baselines across all datasets.

Many extra results are included in the appendix which further corroborate their claims.

方法与评估标准

The problem setting considered in this paper is somewhat novel in that it requires similar ODEs from different environments (i.e. function forms). Prior works mainly work with environments where the only coefficient change across trajectories. As such, this paper modifies common dynamical systems such that they can have different functional forms across environments by adding extra terms such as damping. I believe this is sufficient for a proof-of-concept, although, as acknowledged by the authors in the limitations section, a more comprehensive benchmark for these environments would be helpful in evaluating the full potential of the approach.

I have one question regarding the choice of datasets: It seems that across function environments, the changes are all additive (i.e. adding extra forcing terms to the ODE). Would we expect the method to still work if the changes are more non-linear? for example changing f(theta) = sin(theta) to f(theta) = sin(theta)^2.

理论论述

I have checked the main theoretical claims (3.1, 3.2 and 3.3) and I am happy with their soundness.

实验设计与分析

My main question regarding the experiments is about the baselines. It is unclear from the text how the baselines are implemented. While the proposed baselines are representative of the respective fields, it is not clear to me how these methods are adapted to the problem setting at hand: what are the architectures of the baselines? what are the input and outputs of these methods (do they also take in X_p and output z)? In the baselines, do you still have separate embeddings z_c and z_e?

Relatedly, in Fig. 5b, how is X^c decoded for the baselines?

So far, the experimental setup seems sensible and qualitatively, the results for the approach look promising. But I think I can only assess the analyses and the results when the questions about the baselines are clarified.

补充材料

I have read through the supplementary material which provides extra experimental results that complements the main results. The FAQ section is particularly helpful in clarifying some of my initial questions.

与现有文献的关系

This paper brings concepts from invariant learning to the field of learning dynamical systems. I agree with the authors that applying the idea of invariances in this setting is non-trivial as it requires defining invariances in the function space. I believe the idea of learning invariant functions from observations is highly relevant to the dynamical system community, where the goal is not only to make forecast, but also to derive insight from the observed sequences. Here, the ability to learning meaningful and reusable functions from observations is particularly important.

遗漏的重要参考文献

Not that I am aware of.

其他优缺点

The work offers an original framework for learning reusable functions which is an important step towards ML methods that can discover generalisable and interpretable physics laws. The core concept is generally applicable and easy to understand.

The main weakness for me is that lack of clarity in terms of the experiment setup w.r.t. the baselines, which I think can be improved straight-forwardly. I would be happy to increase my score if this is clarified.

其他意见或建议

One minor comment: at the start of section 5, the authors list their 6 research questions. I think the only 'real' research questions that this paper is addressing is 1 and 2, whereas the other ones are more 'additional evidence'. Maybe it is worth rewriting that to make the contribution of the paper more concise.

作者回复

2025-03-28

We appreciate the reviewer's thoughtful feedback. We will address them point by point below.

Drastic and non-linear changes across function environments

What would happen if changes across function environments were more drastic and non-linear? For example, changing f(theta) = sin(theta) to f(theta) = sin(theta)^2.

Thank you for this insightful question. In short, drastic changes are actually good for invariance discovery, but non-linear changes are much more sophisticated to analyze.

Drastic changes: According to our initial experiments, larger environmental differences will lead to better results. The experimental results shown in the paper indeed come with decent environment differences. For example, in the "powered" pendulum environment, the mass is flung over the top in many trajectories (Figure 7: upper left one).

Non-linear changes: In most optimization problems, expanding from linear to non-linear is challenging. For example, the invariant learning pioneer work IRM can only prove that the invariant principle works under the linear scenario and one SCM. Therefore, we don't expect our method to work directly for all non-linear scenarios. We provide analyses and open discussions around it and hope you find it interesting and useful.

First, we consider the theory basis, the SCM in Appendix C.1. In the SCM, $f:=g\_{comp}(f\_c, f\_e)$ , while $f\_e$ can be different across $e$ , $g\_{comp}$ is shared. That means, if we consider $f(theta) = sin(theta)^2=sin(theta) * sin(theta)$ as $f:=f\_c * f\_e$ , then the SCM remains valid if all the environments obey $f:=f\_c*f\_e$ . If not all environments share the same function composition process, extended SCMs or extended $g\_{comp}$ will be required to be designed.
Second, if SCM remains valid, the next challenge is the reverse decomposition (red dash arrows in Figure 6). In the current hyper-network implementation, given z_c denotes sin(theta), z_e denotes sin(theta), letting z_c + z_e denote sin(theta)^2 in the function representation space is a learning challenge, which may require much deeper networks or extra techniques. For example, we may choose to learn in the space of the derivative of sin(theta)^2 to alleviate the learning challenge.
Finally, in the real world, many effects are composed additively or can be transformed to be composed additively like forces, so we believe our method is applicable in many cases.

Baseline implementation details

Please clarify the baselines. It is unclear how the baselines are implemented, what are the architectures, what are the input and outputs, and do they also have separate embeddings z_c and z_e?"

We appreciate this important point about baseline clarification. In the revised paper, we will provide a detailed explanation of the baseline implementations:

Basically, for fair comparisons, we tried our best to make baselines as similar to our architecture as possible. So all the performance gains come from our IFL principle.

Architecture

To better distinguish the baselines, let's denote our DIF framework without $\hat{e}$ predictions (Figure 3 and remove the $g_\phi$ MLP and $\hat{e}$ branch) as DIF-Base.

MAML: Only Forecastor in the DIF-Base (with MAML optimization) since MAML does meta-learning at the optimization level.
CoDA: DIF-Base and set the dimension of $\hat{z}_e=2$ (2 is the original paper setting).
IRM: DIF-Base. IRM regularization is at the loss level so no architecture change.
VREx: DIF-Base. VREx regularization is also at the loss level.

Inputs and Outputs:

MAML takes X_p and outputs the forecasting trajectory directly.
CoDA, IRM, VREx take X_p as input and output function representations $\hat{z}_c$ and $\hat{z}_e$ , then finally output the forcasting trajectory. The same as DIF.

Training:

MAML: MAML optimization on Forcaster.
CoDA, IRM, VREx the same as DIF.

Inference: (Decoding X^c for the baselines)

MAML: use the Forcaster with meta-parameters.
CoDA, IRM, VREx: the same as DIF: use the f_c branch.

Research Questions

At the start of section 5, the authors list their 6 research questions. I think the only 'real' research questions that this paper is addressing is 1 and 2, whereas the other ones are more 'additional evidence'. Maybe it is worth rewriting that to make the contribution of the paper more concise.

We agree with this valuable suggestion. In the revised paper, we will restructure Section 5 to emphasize Research Questions 1 and 2 as the primary contributions of our work.

We will reduce the remaining questions (3-6) to "supplementary analyses" that provide additional evidence for our claims rather than core research questions.

We thank the reviewer for the constructive feedback and hope the questions are addressed. We will incorporate all these improvements in the revised version of the paper.

审稿人评论

2025-04-02

I thank the authors for the clarification.

I believe including some of the discussion here, namely the effect of more diverse changes, the role of linearity, and the baseline setup, into the main text would strengthen the paper. As I mentioned in my review, I believe the paper addresses an interesting problem and I am happy to vouch for its acceptance.

审稿意见

评分: 22025-03-05

This paper proposes a method for discovering invariant functions underlying dynamical systems governed by Ordinary Differential Equations (ODEs). The key claim is that different environments modify the observed system, but an invariant function can be disentangled and recovered to represent the core governing dynamics. The authors introduce a framework called Disentanglement of Invariant Functions (DIF), which uses an encoder-decoder hypernetwork to separate invariant functions from environment-specific dynamics. The paper also presents a causal graph-based formulation to formalize the problem and proposes an information-theoretic principle to enforce invariance.

To validate their approach, the authors conduct experiments on three multi-environment ODE datasets, comparing DIF against baseline methods such as meta-learning approaches and traditional invariant learning techniques. The results reportedly demonstrate that DIF is more effective in recovering the underlying physical laws across environments and performs well in symbolic regression tasks.

给作者的问题

See previous sections.

Post-rebuttal

One of my main concerns was the validity of the theoretical claims (Theorem 3.1, in particular) made in the paper. After the discussion with the authors, I believe it is mostly a clarity issue and the claims are mostly valid. Concretely:

Thm 3.1 is essentially saying that $f_c$ is the unique solution to the optimization problem $\arg\max_{f'} I(f';f | X_0)$ s.t. $f' \perp e$ . It makes sense when all the $h$ and $\theta_c$ are removed from the statement and its proof.
The original statement in Thm 3.1 is misleading and also (slightly) incorrect. While a unique solution $f_c$ always exists, there is no guarantee that the function class of all possible $h$ , which depends on the implementation of the hypernetwork, can approximate the true $f_c$ . That being said, this is fine in practice. The authors just need to rewrite the theorem and make a comment about this gap between theory and realization.

Overall, I think this paper has some valid contributions, though marred by the lackluster presentation. It could be accepted should the authors improve the writing and fix the clarity issues. I have updated my score to reflect this.

论据与证据

I am doubtful towards the main claims of this paper.

Claim: An invariant function exists that captures the common dynamics across environments.
- The definition of an invariant function is vague. At first thought, it represents the "common parts" of different governing equations, it does not specify what constitutes "common." As an analogy, what is considered the invariant function for $f_1(x) = x,\ f_2(x) = x+1,\ f_3(x)=x+2$ ? Should it be $x$ , due to its simplicity, or $x+1$ because it is the average of all functions? Anyways, how does one rigorously define this commonality beyond intuition? The provided examples, such as variations of a pendulum's equation, seem to assume that a clear invariant function exists, but this assumption is not justified theoretically. If environments are too diverse, there may be no meaningful invariant function.
Claim: The method can successfully identify the invariant function across environments.
- There is no strong theoretical justification for why the invariant function is identifiable in all cases. If the environments are drastically different, could it be possible that the method learns an arbitrary function rather than the true underlying invariant one? Moreover, since there seems to be no clear definition of the "true" invariant function, as pointed out earlier, how do we know the learned invariant function is correct?

方法与评估标准

The paper constructs multi-environment ODE datasets to evaluate the method. The datasets are reasonable choices for testing the framework since they include variations of well-known physical systems.
The performance is measured in terms of error in predicting future states for both the observed trajectories and the invariant trajectories. These metrics align with the goal of recovering governing equations, but in terms of verifying the "correctness" of the invariant functions, the authors assume the knowledge of a "ground truth" invariant function. However, as discussed earlier, it is unclear to me why these specific equations are preferred over others as the invariant function.

理论论述

I do not fully understand some theoretical claims in this paper, Theorem 3.1 (and its proof) in particular. What defines the "true invariant function"? From the statement I understand that we have the solution to an optimization problem, which can be mapped to a function $f_c$ . Then what are we trying to prove? Is it some property of this function? Or is it the existence and uniqueness of this solution, as seemingly suggested by the proof? If it is the latter, I don't understand what this theorem is trying to establish.

实验设计与分析

The experimental results indicate that DIF generally achieves lower error and better symbolic regression outcomes than baselines.
One key concern is whether the method generalizes well to settings where environmental changes are more drastic. For example, what would happen if you make $\alpha$ smaller and $\rho$ larger in the pendulum experiment? The "powered" environment would possibly break the typical behavior of the pendulum and fling the mass over the top.

补充材料

No supplementary material is provided.

与现有文献的关系

The work is related to previous studies on invariant learning (e.g., Arjovsky et al., 2019; Rosenfeld et al., 2020), but many prior work focuses on categorical settings rather than function spaces. The authors attempt to extend these ideas to dynamical systems, which is a reasonable direction.
There are abundant works on governing equation discovery for ODE systems, deriving from SINDy (Brunton et al., 2016), which the authors should have discussed.
Apart from SINDy sparse regression, genetic programming-based methods are also widely applied to this task. The authors did mention related works such as PySR (Cranmer, 2023) in the Appendix, but I feel it should also be included in the related works section in the main text.

References

Brunton, Steven L., Joshua L. Proctor, and J. Nathan Kutz. "Discovering governing equations from data by sparse identification of nonlinear dynamical systems." Proceedings of the national academy of sciences 113.15 (2016): 3932-3937.
Cranmer, Miles. "Interpretable machine learning for science with PySR and SymbolicRegression. jl." arXiv preprint arXiv:2305.01582 (2023).

遗漏的重要参考文献

See above.

其他优缺点

The writing of this paper can be significantly improved. First of all, it is too verbose, and I'd suggest the authors run a grammar check using Grammarly to reduce redundancy. The paper also lacks coherence at some point. For example, in Section 2.2, the authors first identify two challenges in L147-161, but I don't understand how the paragraphs describing solutions in Section 2.2.1 correspond to these two challenges.

Then, in terms of math writing, there are many instances of missing definitions, incorrect notations, and confusing statements. Besides the lack of clear definition of an invariant function already mentioned above, other examples include:

$\mathbf X_0$ is referred to before definition.
When expressing a function mapping, the common practice is to write $f: X \to Y, x \mapsto f(x)$ instead of $f: X \mapsto Y$ .
In Thm 3.1, why "given the predicted function random variable $\hat f_c$ ? It is not used anywhere in the theorem or the proof.

其他意见或建议

N/A

作者回复

2025-03-27

Thank you for your review. We appreciate your feedback and would like to address your concerns point by point.

Definition and identifiability of invariant function

The definition of an invariant function is vague...

We thank the reviewer for raising this point.

Rigorous definition: Our invariant function is not defined arbitrarily; it is rigorously grounded in causality and formally specified within our Structural Causal Model (SCM) (see Section 2.2.1, Figure 2, and Appx. C.1). Specifically, the invariant function fc is the function generated by the structural equation and exogenous variable c, which represents the core mechanism shared across all environments. This causal definition distinguishes our approach from ad-hoc solutions.
Example pitfall: We respectfully argue that the reviewer’s example oversimplifies the problem by focusing on a single realization, however, our treatment considers the invariant function as a random variable, a distribution that must be learned, not a specific instance.
A more proper example: If we have $\mathbf{f}_1=\bm{\alpha} x + \bm{\beta}$ and $\mathbf{f}_2=\bm{\alpha} (x + 1) + \bm{\rho}=\bm{\alpha} x + (\bm{\alpha} + \bm{\rho})$ , then since $f_c \perp e$ , we find the decision boundary for the $e$ prediction, which determines by the distribution differences between $\bm{\beta}$ and $\bm{\alpha} + \bm{\rho}$ , not by the distribution of $\bm{\alpha} x$ (invariant). Note that the three functions the reviewer provide can be sampled from both $\mathbf{f}_1$ and $\mathbf{f}_2$ , illustrating the pitfall of analysing in instance level. This example is not perfect. For rigorous understanding, please resort to our SCM and information theory based proofs. Furthermore, we invite the reviewer to refer to Appendix B (Q2) and related works in invariant learning for additional insights.
Community acknowledgment: The invariant function is defined rigorously in a statistically significant way using SCM. We like to share that using SCM to define invariant representation/mechanism has been a mature knowledge in invariant representation learning community.

There is no strong theoretical justification for why the invariant function is identifiable in all cases.

Identifiability: Sufficiency & necessity: In Section C.2 of the appendix, we provide a complete proof of the existence and uniqueness of the solution to our optimization problem. The proof demonstrates that our invariant function learning principle guarantees identifiability under the causal assumptions.

I do not fully understand some theoretical claims in this paper, Theorem 3.1 (and its proof) in particular.

Theorem 3.1 provides the optimization principle to sufficiently and necessarily discover the function random variable defined in our SCM.

Note that our definition and theoretical proof have been acknowledged by other reviewers:

Reviewer pHQv stated:

"structural causal model drawn by the authors (Figure2) illustrates the concept of invariant functions very well"
"I read the theoretical proof in Appendix C, and there is no obvious problem with the derivation process."

Reviewer cD7J: remarked on the solid theoretical framework and highlighted the connection to Independent Causal Mechanism, underscoring that invariant learning is a well-established area.

Reviewer T91b confirmed "I have checked the main theoretical claims (3.1, 3.2 and 3.3) and I am happy with their soundness."

Due to space limitations, for further context, we recommend reviewing the fundamentals of invariant learning and the role of SCM in invariant learning listed in our essential related works.

Experimental Design and Generalization

One key concern is whether the method generalizes well to settings where environmental changes are more drastic. ... The "powered" environment would possibly break the typical behavior of the pendulum and fling the mass over the top.

Theoretically: The alterations in parameters (α and ρ) do not undermine the structure of our SCM; hence, our method remains generalizable.

Empirically: Our experiments in the "powered" environment do fling the mass over the top, and the results shown are still robust, suggesting it can handle increasingly drastic changes.

Writing and Organization

We appreciate this feedback and will improve the writing in our revision. We will:

Address redundancy and improve grammar
Enhance coherence, particularly in Section 2.2
Fix notation issues.
Clearly define all variables before use
Reorganize the explanation of challenges and solutions for improved flow

Related Works

Thank you for highlighting these important references. While we did mention PySR and symbolic regression in the Appendix, we agree that SINDy and genetic programming-based methods deserve more attention in the main text. We will expand Section 4 (Related Work) to include them.

审稿人评论

2025-04-04

Thank you for the response. I am aware that other reviewers have acknowledged the soundness of the theorems and proofs. However, I did not find any detailed comments from other reviews that clarified or answered my previous questions. So I'd like to ask for additional clarifications from the authors.

In Thm 3.1, why "given the predicted function random variable $\hat f_c$ "? The notation $\hat f_c$ is not used anywhere in the proof.
The proof of Thm 3.1 only established the existence and uniqueness of the solution to the given optimization problem. It did not justify the statement that "the true invariant function is given by this solution". If I understand correctly, the true invariant function is defined through the SCM in Figure 2. Then why is the $h_{\theta_c^*}$ the true invariant function according to this definition?

I do not have the bandwidth to go over other proofs in detail. But at least I understand what other theorems are trying to establish. Other reviewers have mentioned that some of the assumptions are impractical, e.g. white noise in observed trajectories, but I think it is a good starting point for theoretical analysis. Thm 3.1, however, as the foundational principle of this paper, is missing clear statements and exposition (at best).

作者评论

2025-04-04

Thank you for the reply. We appreciate the additional comments and will provide clarifications to the questions:

In Thm 3.1, why "given the predicted function random variable $\hat{f}_c$ "? The notation $\hat{f}_c$ is not used anywhere in the proof.

Notation $\hat{\mathbf{f}}_c$ is not used in the proof due to its role in the optimization process:

$\hat{\mathbf{f}}_c$ is an optimizable object. In the proof, we focus on the solution of $\hat{\mathbf{f}}_c$ after the optimization process, leading to optimized object $\mathbf{f}_c$ and $\mathbf{f}_c'$ (used in the contradiction). So $\mathbf{f}_c$ and $\mathbf{f}_c'$ in the proof stand for the optimized $\hat{\mathbf{f}}_c$ .
Similarly, $\theta_c$ (optimizable) is also not used in the proof, but it does not affect that $\hat{\mathbf{f}}_c$ and $\theta_c$ are both essential components in describing the optimization process.

Furthermore, mentioning $\hat{\mathbf{f}}_c$ serves two additional purposes:

Connection to the implementation: $\hat{\mathbf{f}}_c$ is mentioned in order to connect to Sec. 3.1, where the prediction of $\hat{\mathbf{f}}_c$ is introduced. This connection build the bridge between the implementation and the optimization.
Clarification of the ouput of $h_{\theta_c}(\mathbf{X}_p)$ : Emphasizing the "predicted function random variable $\hat{\mathbf{f}}\_c$ " reminds readers of our optimization goal, i.e., to extract $\mathbf{f}\_c$ from $\hat{\mathbf{f}}\_c$ , and avoids potential confusion that might arise if only $h\_{\theta\_c}(\mathbf{X}\_p)$ were mentioned.

The proof of Thm 3.1 only established the existence and uniqueness of the solution to the given optimization problem. It did not justify the statement that "the true invariant function is given by this solution". If I understand correctly, the true invariant function is defined through the SCM in Figure 2. Then why is the $h\_{\theta\_c^*}$ the true invariant function according to this definition?

We respectfully argue that the understanding of the existence and uniqueness of the solution is not accurate. The existence and uniqueness are described under the condition that $\mathbf{f}\_c$ $=h\_{\theta\_c^*}(\mathbf{X}\_p)$ (the solution extracts the true invariant function), where the $\mathbf{f}_c$ in the proof is from the definition in the SCM.

Existence: In the first sentence of the existence proof in Appendix C.2, we state "We first prove the existence of a solution $\theta_c^*$ to the optimization problem, such that $\mathbf{f}\_c$ $=h\_{\theta\_c^\*}(\mathbf{X}\_p)$ ." This existence proof is not trivial, it is a conditional existence proof. Therefore, this existence proof describes that "there exists at least one $\theta\_c^\*$ that extracts the true invariant function".
Uniqueness: Similarly, the uniqueness proof in Appendix C.2 begins with "We now prove that for any solution $\theta\_c^*$ of the optimization process, it holds that $\mathbf{f}\_c$ $=h\_{\theta\_c^*}(\mathbf{X}\_p)$ ." This proof is also conditional and it guarantees that all solutions to the optimization process yield the same function, i.e., $\mathbf{f}_c$ .

Here is a symbolic explanation:

Existence: Denote the solution set as $\Theta\_c^*$ , $\exists \theta\_c^* \in \Theta\_c^*$ , s.t. $\mathbf{f}\_c$ $=h\_{\theta\_c^*}(\mathbf{X}\_p)$ .
Uniqueness: $\forall \theta\_c^* \in \Theta\_c^*$ , it follows that $\mathbf{f}\_c$ $=h\_{\theta\_c^*}(\mathbf{X}\_p)$ . Given existence proved, we only need to prove that all the extracted functions are the same ( as $\mathbf{f}\_c$ ).

Briefly, the reason "why the $h\_{\theta\_c^*}(\mathbf{X}\_p)$ is the true invariant function according to this definition" is provided as our proof. Our proof shows that $h\_{\theta\_c^*}(\mathbf{X}\_p)$ is the true invariant function according to the SCM definition.

We will incorporate these explanations into the revised version for better clarity.

审稿意见

评分: 32025-03-07

This paper introduces an approach to learning invariant functions in dynamical systems (governed by ODEs) in different environments. The authors propose a method called Disentanglement of Invariant Functions (DIF), which aims to discover intrinsic dynamics across multiple environments while avoiding environment-specific mechanisms. The approach is an encoder-decoder hypernetwork to disentangle invariant functions from environment-specific dynamics. The authors have shown that the probabilistic information-maximization objective can be implemented using a simple easy-to-implement MSE loss function. The method is evaluated against meta-learning and invariant learning baselines, demonstrating its effectiveness.

update after rebuttal

Thank you for the detailed response and for providing additional context regarding your design choices and theoretical connections. While I appreciate the clarifications, my initial concerns regarding the generalization and broader applicability of the work remain. As these issues are still relevant to the overall impact, I have decided to keep my original evaluation unchanged at a 3 (Weak Accept). I recognize the potential of the work, but feel that these concerns need further attention.

给作者的问题

Could you elaborate on the limitations of extending your method to PDE systems and how you plan to address them in future work?
Based on the definition of invariant and environment-specific dynamics used in the paper, it seems there is a strong connection to the independent causal mechanism principle. Can you elaborate if you can recognize such connection?
It seems all evaluations are done with simulated data. The assessment of the potential of the work would be easier if the authors could evaluate with real-world high-dimensional examples both in terms of scalability (higher dimensions) and also in terms of dealing with more complex dynamics.

论据与证据

The effectiveness of the method could be limited given the experimental setup.

方法与评估标准

Yes, but mostly limited to low-dimensional simulated data.

理论论述

The connection between information-maximization and MSE loss is a known theoretical result and makes intuitive sense. I did not check the details of the proof in the appendix though.

实验设计与分析

The design of the experiments sounds logical but lacks high-dimensional real-world experiments to support the effectiveness of the method in more challenging domains.

补充材料

Briefly, the symbolic regression evaluation.

与现有文献的关系

Although the design of the hyper-network may be novel in discovering dynamical systems, it is not so novel in ML literature as there have been methods to use prediction loss as a proxy for information-maximization and also representing functions as weights of a NN.

遗漏的重要参考文献

Connection to Independent Causal Mechanism and Independent Component Analysis. There is a vast literature on both topics. I think this work should also be discussed in relation to these concepts.

其他优缺点

Strengths

Novelty: The paper presents a new perspective on invariant function learning, focusing on disentangling invariant dynamics from environment-specific factors.
Methodology: The proposed hypernetwork which is based on information-maximization objectives derived from the causal-graph is relatively innovative in this context and in line with gradient-based learning which suggests better scalability compared with Bayesian inference methods.
Effectiveness: Quantitative comparisons show that the proposed method outperforms existing meta-learning and invariant learning techniques on the simulated data benchmarks.

Weaknesses

Scope Limitation: The method is currently limited to ODE systems and seems not directly applicable to PDE systems which govern a large class of dynamical systems.
Generalization: The experiments are all low-dimensional and with relatively simple dynamics and simulated data. High-dimensional real-world examples (e.g. from robotics, cell biology) could better illustrate the effectiveness of this method.
Novelty limitation: Although the experiments show the effectiveness of the proposed method, the use of information maximization to disentangle causal mechanisms has been used in various contexts before.

其他意见或建议

作者回复

2025-03-28

We appreciate your feedback and the acknowledgment of the novelty and effectiveness. Below, we address your concerns point by point.

Limited experimental setup

Thank you for raising this concern. Our work represents the first step in invariant function learning, introducing a new paradigm for scientific discovery. The focus on low-dimensional systems was intentional for three reasons:

Conceptual clarity: Low-dimensional systems allow us to clearly demonstrate the core principles of invariant function learning without obscuring them with implementation details for high-dimensional data.
Methodological validation: As the first work on invariant function learning, we needed to establish baseline performance on well-understood systems where ground truth is available.
Benchmark contribution: We've constructed three multi-environment datasets (ME-Pendulum, ME-Lotka-Volterra, and ME-SIREpidemic) that will serve as important benchmarks for future research.

We fully agree that extending to high-dimensional real-world examples is an important direction, as we acknowledge in our limitations section (Section 6). The current work lays the theoretical and methodological foundation for these future extensions.

Scope limitation to ODE systems

Thank you for this important question. As we discussed in Appendix B, we acknowledge that our method is currently designed for ODE systems. This choice was driven by three key challenges:

Lack of multi-environment PDE datasets: Unlike domain adaptation tasks, multi-environment datasets for PDEs aren't readily available and require domain-specific expertise to construct.
PDE-specific technical challenges: PDEs introduce multi-scale and multi-resolution problems requiring specialized techniques for stability and computational efficiency.
Interpretation challenges: Our current approach uses symbolic regression for interpretability, which doesn't extend naturally to PDE systems.

Our future work will tackle these challenges by: (1) collaborating with domain experts to develop multi-environment PDE benchmarks, (2) incorporating specialized PDE-handling techniques from neural operators literature, and (3) developing new interpretability methods for invariant function learning in PDEs.

Connection to causal mechanisms

We agree that there is indeed a strong connection to the independent causal mechanism (ICM) assumption/principle. Our work can be viewed as discovering ICM specifically to function spaces in dynamical systems.

Note that "SCMs are underpinned by the assumption of independent causal mechanisms, a fundamental premise in causality. Intuitively, this supposition holds that causal correlations in SCMs are stable, independent mechanisms akin to unchanging physical laws, rendering these causal mechanisms generalizable."

ICM is the basis of Jonas's invariant causal predictor and IRM. When we use SCM in Section 2.2.1, we already adopt the ICM assumption, as in the SCM, we're assuming that invariant mechanisms (fc) and environment-specific mechanisms (fe) operate independently and can be disentangled. From the ICM perspective, our optimization goal is to identify the ICM showing in our SCM.

We acknowledge that we could have made this connection more explicit in our related work, and we appreciate the opportunity to clarify it here. In our revised version, we would enhance our discussion of this connection in the related work section.

Novelty limitations and related work

Although the experiments show the effectiveness of the proposed method, the use of information maximization to disentangle causal mechanisms has been used in various contexts before.

We acknowledge that information maximization for disentangling causal mechanisms has precedents in the literature. Our contribution lies in:

Novel task formulation: We introduce invariant function learning as a new task for scientific discovery (Section 2.2).
Function space application: While previous work has applied similar principles to fixed representations, we extend them to function spaces where invariance must be maintained across all possible states.
Theoretical guarantees: Our work provides theoretical guarantees for invariant function discovery (Theorem 3.1) specifically adapted to the dynamical systems context.
Empirical validation: Our experimental results (Section 5.4) demonstrate that general invariant learning principles like IRM and VREx do not perform well on our task, highlighting the need for our specialized approach.

These points, we believe, substantiate the novelty and significance of our approach.

Regarding essential references

Thank you for this valuable suggestion. We agree that enhancing our discussion of connections to ICM and ICA literature would strengthen the paper. In the revised versions, we would expand Section 4 (Related Work) to include them.

审稿人评论

2025-04-07

Thank you for the detailed response. I appreciate the clarifications and the additional context regarding your design choices and theoretical connections. While the work is promising, my initial concerns regarding generalization and broader applicability remain. I will keep my original score.

审稿意见

评分: 32025-03-15

The authors proposed the task of learning invariant functions in dynamical systems: considering the ODE equation $\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = f(\mathbf{x})$ of the system evolution as a combination of the invariant function $f_c$ corresponding to the inherent properties of the system itself and the function $f_e$ caused by the environment, and learning the common part $f_c$ from the trajectory data collected in different environments. To solve this problem, the authors proposed the DIF method, which uses a meta neural network to generate parameters of two neural networks $\hat{f}$ and $\hat{f}_c$ based on a single trajectory $X$ collected in a specific environment to fit the governing ODE equation $f$ and the invariant function $f_c$ . DIF minimizes the MSE between the system's evolution trajectory governing by $\hat{f}_c$ and the actual evolution trajectory $X$ (equivalent to maximizing the mutual information of $\hat{f}_c$ and $f$ ), and uses adversarial training to make a discriminator unable to infer information about the environment from $\hat{f}_c$ , so that the learned $\hat{f}_c$ becomes an invariant function that describes the evolutionary properties of the system itself and is invariant to the environment. The authors conducted experiments on three systems: Penduium, Lotka-Volterra, and SIR. They considered four variants of each system under different environments and generated 200 trajectories for each variant using different parameters and initial values. DIF learns the system's invariant function $\hat{f}_c$ from these trajectories. The evolution trajectory governing by $\hat{f}_c$ is consistent with the evolution trajectory governing by the ground truth $f_c$ (NRMSE < 1.0, suggesting positive correlation). The authors further extract the symbolic expression of $\hat{f}_c$ through symbolic regression, finding that it is similar to the ground truth invariant function, proving the ability of the DIF method to learn invariant functions in dynamic systems.

给作者的问题

Q1. If I understand correctly, $X$ is the observed evolution trajectory, and $X_p$ is the part of $X$ before $T_c$ . In training, $X_p$ is used as the input of the model, and $X$ is used to calculate the loss function. Why do we need to distinguish between $X_p$ and $X$ ? Is it feasible to not distinguish between them and take the whole $X$ as $X_p$ ?

论据与证据

Mostly, the structural causal model drawn by the authors (Figure2) illustrates the concept of invariant functions very well, and a complete mathematical proof for the training scheme used is given in the appendix. The effectiveness of the proposed DIF method is also demonstrated experimentally on three systems.

However, I have doubts regarding the practicability of the proposed method in real-world scenarios.

方法与评估标准

Yes, although the authors focus on a new problem and therefore have no existing dataset, they construct three datasets based on the Penduium, Lotka-Volterra, and SIR systems that are consistent with the invariant function learning task they describe.

理论论述

I read the theoretical proof in Appendix C, and there is no obvious problem with the derivation process. However, the proof in Appendix C.3 relies on the assumption of Gaussian white noise, that is, the evolution trajectory is the result of adding Gaussian white noise to the noise-free evolution trajectory. This may render the derivation untenable when evolutionary trajectories contain colored noise that is more common in the real world. This is particularly concerning given that the authors did not present experiments on real-world datasets.

实验设计与分析

Yes, the authors' experimental design effectively illustrates the ability of DIF to learn invariant functions of the system itself from evolutionary trajectories collected in different environments.

补充材料

I read the appendix and found that the authors provided a complete mathematical justification for the training scheme adopted in the main paper as well as more experiments, which was very helpful in understanding the method.

与现有文献的关系

The author proposed for the first time the task of invariant function learning in dynamic systems. This task is related to Meta-Learning and invariant function learning. However, existing invariant function learning methods are not applicable to dynamical systems that evolve over time; and it is questionable whether the functions learned by Meta-Learning that can quickly fit different environments are invariant functions.

遗漏的重要参考文献

I think authors should discuss possible limitations of the proposed method in high dimensional systems with graph structure, as in following papers.

Cranmer, Miles, et al. "Discovering symbolic models from deep learning with inductive biases." Advances in neural information processing systems 33 (2020): 17429-17442.
Shi, Hongzhi, et al. "Learning symbolic models for graph-structured physical mechanism." The Eleventh International Conference on Learning Representations. 2022.

其他优缺点

Strengths: The authors proposed the task of learning invariant functions in dynamical systems and constructed a solid theoretical framework and a novel DIF method for this task. Although this task seems rather challenging, the learned invariant functions are surprisingly consistent with the real results in terms of both evolving behavior and symbolic nature.

Weakness: Although the authors provide a discussion of possible applications of the method in Appendix B (Q5), it is difficult to imagine any specific scenarios for this task. If we want to get a predictive model that predicts system behavior, directly using a neural network to fit $f$ (rather than $f_c$ ) seems to be a better choice; if we want to get a symbolic model that explains the system behavior, directly using symbolic regression methods to find $f$ 's symbolic expression in different environments and then obtaining the same part $f_c$ through observing seems to be a better choice.

其他意见或建议

Considering that the authors focus on a new task that learns invariant functions of dynamical systems, it would be better to provide experiments based on real-world datasets, which can help readers establish an understanding of the application scenarios and importance of this task, and, on the other hand, better verify the capabilities of the proposed DIF method.

Also, it is better to discuss its applicability in much complex scenarios like networked complex systems (as in my provided references).

作者回复

2025-03-28

Thank you for your thoughtful review. We are glad you recognized our theoretical framework and experimental validation. We address your concerns point-by-point as follows.

Practicability in real-world scenarios

Thank you for this question. While we focus on establishing theoretical foundations in this work, we believe our method has significant real-world potential. In principle, the methodology, particularly the disentanglement via adversarial training and mutual information maximization, is applicable to real-world scenarios where environmental variations are present. Invariant function learning can extract fundamental dynamical laws from noisy, environment-influenced data, a critical challenge in scientific discovery. Application areas include:

Extracting physical laws from video data where environment factors (e.g., lighting, camera position) complicate analysis
Learning generalizable physics for simulation and control across varying conditions
Contributing to foundation models in physics by identifying invariant mechanisms

Note that some of these applications might not belong to invariant function learning but fall back to general invariant learning, which is not our target. PDE scenarios that need invariant function learning, are in our future work since much work needs to be done before being able to transfer to PDEs described in Appendix B Q1.

Theoretical claims and noise assumptions

We acknowledge that Appendix C.3 assumes Gaussian white noise. While this is a common starting point in theoretical analysis, we agree this is a limitation. Future work will extend our framework to other noise scenarios or relax this assumption.

High-dimensional systems with graph structure

Thank you for the suggestions and discussions on graph-structured physical mechanisms as in Cranmer et al. (2020) and Shi et al. (2022). We agree these are relevant to extending our work. Our current focus was establishing the basic framework for invariant function learning in ODE systems, but we acknowledge that discussing the approach to high-dimensional or graph-structured settings is worthwhile. In the revised paper, we will discuss these limitations and outline potential directions for adapting our method to such complex scenarios.

Utility Compared to Alternative Approaches

Thank you for your comparison with direct neural network fitting and symbolic regression approaches. Our method targets extracting invariant dynamics. Regarding alternative strategies like finding symbolic expressions in different environments and then obtaining the same part $f_c$ through observing, the most difficult part is observing. The reason is direct: the fitting of f tends to capture both invariant and environment-specific aspects, which leads to spurious correlations so that the final equation forms from different environments seem significantly different. That is to say, after extracting equations from different environments, it is challenging to find the common part directly since the common parts are blended with environment parts and look like not disentangable.

Our DIF method explicitly enforces the separation of invariant dynamics from environment-specific factors, providing a more reliable basis for scientific discovery. This advantage becomes especially important in scenarios with complex environmental effects or when specific invariant mechanisms need to be identified. We will expand on this distinction in the revised manuscript.

Real-world datasets

We agree that real-world validation would strengthen our paper. Our focus in this initial work was establishing theoretical foundations and validating on controlled systems. Testing on real-world data is the next step we discussed in the Appendix and we're actively pursuing this direction.

Question on X_p vs X

Regarding the specific question (Q1): Your understanding is mostly correct. We distinguish between $X_p$ (past observations before time $T_c$ ) and $X$ (full trajectory) because the forecasting task takes $X_p$ as input $X$ as output. Similar to $X\rightarrow Y$ in general prediction tasks, the forecasting task does $X_p \rightarrow X$ . In training, $X$ works as "label $Y$ " for loss, while in inference time, $X$ after $T_c$ is not available. Using the whole trajectory $X$ as $X_p$ would blur the line between training and testing data. We will clarify this point further in the revised manuscript.

We hope these responses address your concerns, and we thank you for your constructive feedback.

最终决定Accept (poster)

2025-05-01

The paper proposes a novel framework for extracting ODE-based physical laws from data using a new algorithm that disentangles invariant mechanisms based on structural causal models. This is an interesting new algorithm on an emerging area, and while experiments are on simple settings in general, they are convincing enough at this stage. Reviewers in general agree on the value of the paper, both in terms of technical novelty, rigor, and validation, although there are some concerns on the validity of the claims and whether they are clearly presented. What is more, all reviewers point on works from Brunton and Kranmer that could be discussed more. I agree on this. In fact, the paper should be compared with several related works published in the recent NeurIPS/ICML/ICLR [1, 2, 3, 4] in the related work section for the camera-ready version. This is not only for explaining the novelty better and having a more complete paper, but also giving more mass to the community.

[1] Pervez et al, Mechanistic neural networks for scientific machine learning, ICML 2024 [2] Chen et al., Scalable Mechanistic Neural Networks, ICLR 2025 [3] Liu et al., Amortized Equation Discovery in Hybrid Dynamical Systems, ICML 2024 [4] Auzina et al., Modulated Neural ODEs, NeurIPS 2023