PaperHub
Overall: 6.3 / 10 (Poster; 3 reviewers; min 5, max 7, std 0.9)
Ratings: 7, 5, 7
Confidence: 3.0 | Correctness: 2.7 | Contribution: 2.3 | Presentation: 2.7
NeurIPS 2024

Nature-Inspired Local Propagation

OpenReview | PDF
Submitted: 2024-05-15 | Updated: 2024-11-06

Abstract

Keywords
Hamilton equation, backpropagation

Reviews and Discussion

Official Review (Rating: 7)

The authors introduce a new model for describing the evolution of weights in a neural network. They do so by defining the problem as a directed graph and reformulating it to derive a Hamiltonian, to which they can then apply Hamilton's principle. This solution yields a set of differential equations from which, they argue, the dynamics of the weights can be read off. They further show this form reduces to standard gradient descent by letting the velocity of the system go to infinity.
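The infinite-velocity reduction summarized above can be illustrated numerically. The sketch below is a hedged caricature, not the paper's Eq. (16): we take a second-order weight dynamic (1/c)·w″ + w′ = −L′(w) for a toy scalar loss, where the names `grad_L` and `integrate` are our own. As the velocity parameter c grows, the trajectory collapses onto plain gradient flow w′ = −L′(w).

```python
import numpy as np

# Toy illustration of "gradient descent as an infinite-velocity limit".
# The ODE (1/c) * w'' + w' = -grad_L(w) is a stand-in, not the paper's equations.

def grad_L(w):
    return 2.0 * (w - 1.0)           # gradient of the toy loss (w - 1)^2

def integrate(c, steps=20000, dt=1e-3):
    w, v = 0.0, 0.0                  # weight and its velocity
    for _ in range(steps):
        a = c * (-grad_L(w) - v)     # w'' = c * (-grad_L(w) - w')
        w, v = w + dt * v, v + dt * a
    return w

# For large c the dynamics settle at the minimizer w = 1 of the toy loss,
# exactly where gradient flow w' = -grad_L(w) would end up.
print(round(integrate(c=100.0), 3))   # → 1.0
```

The design point is only that the finite-c system is a genuine second-order (Hamiltonian-like) dynamic, yet its equilibria coincide with those of gradient descent.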

Strengths

The paper takes a rigorous approach to understanding network dynamics and it does so with a relatively small set of assumptions. The mathematics, while slightly over-dominating, appear to be well derived and lead to consistent conclusions.

Weaknesses

The authors find a set of differential equations which minimizes their derived action. What I struggled to see was how this yields new approaches to understanding real dynamics. I appreciate that the authors state in the conclusion that more work needs to be done to apply this to more applicable machine-learning problems; however, as a machine-learning practitioner I am not sure where I would start. I think this is something that could be cleared up in the paper: exactly what the desired outcome of the work is, beyond the mathematical form itself, and how it can be applied. This all essentially boils down to the following: I think there would be a simpler way to present the results of the paper without the, at times, excessive complexity.

Questions

  1. Which elements of your derivation depend on control theory? From what I followed, you introduce a Lagrange multiplier problem, derive a Hamiltonian, and then look for the minimum of the action. I understand that this process is followed in control theory, however, is there something I missed that relates it directly to some element of control?
  2. What is the overall conclusion regarding data requirements? I lost track of this message towards the end of the paper. Normally, evolution equations will depend on data in some way, whether you look for evolution in a parameter space or some interacting picture. In the introduction the authors argue that large data-sets are not needed; I missed how the derivation ties back to this, and perhaps you can make that clearer to me here.

I also had some spelling/grammar points for future versions of the paper:

  • l.16: strongly rely on
  • l.39: Some citations here would be helpful
  • l.50: amounts to determining how
  • l.51: in terms of environmental information (remove the "an")
  • Page 2 footnote, line 2: domain and co-domain
  • Page 5 footnote, line 1: until now
  • l.201: Starting from
  • l.258: define a Hamiltonian track
  • l.298: leads to the interpretation of back...

This list is not exhaustive, just what I marked.

Limitations

Application to more relevant models would be very helpful. Despite the statement that this should be treated only as a theoretical contribution, it was not clear to me how I could use the conclusions. Therefore, either some demonstration or a clearer conclusion in that regard would be very useful.

Author Response

We thank the Reviewer for having appreciated our work. In what follows we will address the comments and questions raised.

I appreciate that the author/s states in the conclusion that more work needs to be done to apply this to more applicable machine-learning problems, however, as a machine-learning practitioner I am not sure where I would start. I think this is something that could be cleared up in the paper. [...]

Thanks for pointing this out—it’s a really important observation! The learning framework we have in mind while developing these ideas, and one of the main motivations for this work, is lifelong learning from a continuous, possibly infinite, stream of data. We decided to give this work a more technical style because, as we expressed in the conclusions, we feel it is necessary to establish some precise results to build upon. One prototype of the learning problems we believe this approach is best suited for is learning from a very long video stream. In this case, the input stream u will be the video stream, and a simple Euler method for Hamiltonian equations will readily provide an “optimizer” for the weights of a recurrent neural network (RNN). We will do our best to incorporate explicit statements on the desired outcomes of the work in the revised version of the paper.
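As a rough illustration of what such an Euler "optimizer" loop could look like, here is a toy sketch in which every quantity (the `frame` function standing in for the stream u(t), the costate array `P`, the damping term, the crude local gradient) is our own assumption rather than the paper's actual Hamiltonian equations. Its only purpose is to show the shape of the loop: each step reads the stream only at the current time t, with no stored data collection.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                   # neurons in a toy recurrent net
W = 0.1 * rng.normal(size=(n, n))       # recurrent weights
P = np.zeros((n, n))                    # costate paired with W (illustrative)
x = np.zeros(n)                         # neuron outputs
dt, damping = 0.01, 0.5

def frame(t):
    # Stand-in for the video stream u(t): only the current value is accessed.
    return np.sin(0.05 * t) * np.ones(n)

for t in range(500):
    u = frame(t)                                        # only data read at time t
    x = x + dt * (-x + np.tanh(W @ x) + u)              # continuous-time RNN update
    err = x - u                                         # toy loss 0.5 * ||x - u||^2
    g = np.outer((1.0 - np.tanh(W @ x) ** 2) * err, x)  # crude local gradient in W
    P = P + dt * (g - damping * P)                      # Euler step on the costate
    W = W - dt * P                                      # Euler step on the weights
```

Everything here is streamed: no frame is retained after its Euler step, matching the lifelong-learning setting the authors describe.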

Which elements of your derivation depend on control theory? [...]

You are perfectly right. One of the main results of the paper, Theorem 1, is derived as an application of standard control theory. Indeed, what we are trying to do is find optimal trajectories for the weights of a continuous-time RNN by controlling the velocity of the weights. This is expressed in the first paragraph of Section 3.1. In particular, the control variables are denoted by v in Equation (10). The adoption of this theory makes it possible to define a new spatiotemporal neural propagation scheme that generalises backpropagation.

What is the overall conclusion regarding data requirements? [...] I missed how the derivation ties back to this, perhaps you can make that more clear to me here.

Yes, precisely as you are saying, data enters the evolution equations through the input signal t ↦ u(t). For instance, when processing a video stream, u is the video itself, and u(t) is the frame of the video at time t. Notice that this signal is used in the computation of the outputs of the neurons (see Eq. (5)), as one would expect: u updates the input neurons, which in turn update their children, and so on. For this reason, as noted in the introduction, this theory is applied in the context of lifelong/continuous learning, where the main assumption is that data is handled as a stream of information instead of a static training set. This means that at each time t we only need to access u(t), and we neither need to store nor to access data collections.

Comment

Hi,

Thank you for taking the time to reply.

  1. I like your example for a possible application; I think it would improve the manuscript to provide this as an example somewhere, as it is a fairly pressing problem, if not now then surely in the near future.
  2. I am not sure if my point was made very clear with the control theory argument. I merely meant that you are just finding an optimal trajectory along a surface using a (constrained) Lagrangian. This alone has nothing to do with control theory; it is just a mathematical principle. I know this is a pedantic point, but I was just looking to see if I missed something.
  3. So the argument data-wise is that you do not need to keep historical data? How does the model deal with things like forgetting in these cases?
Comment

Hello, thank you for the useful feedback.

  1. We will add the proposed example to better illustrate the potential applicative scenario.

  2. It seems there may have been some confusion in understanding the question, and our response could have been more precise. Specifically, we should have stated that the proposed theory is formulated as an “optimal control problem” rather than a generic control problem. Let us clarify this further. We have a dynamical system defined by Eq. (10), which describes how the neurons of a continuous-time RNN evolve. This system depends on certain variables, namely the weights of the NN, and our goal is to steer the dynamics of the system by acting on the velocities of these weights (denoted as v) in such a way that the cost in Eq. (9) is minimized. Thus, and please let us know if this does not address your comment, the formulation we present in section 3.1 is indeed an instance of an optimal control problem. More precisely, it is a Bolza-type control problem with a null terminal cost [see Cannarsa, Piermarco, and Carlo Sinestrari. Semiconcave Functions, Hamilton-Jacobi Equations, and Optimal Control. Vol. 58. Springer Science & Business Media, 2004. Page 213, Section 7.4]. It is not merely a generic constrained optimization problem from the Calculus of Variations.

  3. We place ourselves in a natural setting where the major difference from most of the existing literature lies in conducting the learning and inference processes within the same sequence, which consists of the environmental information of the agent's life. Information about the past is retained through an encoded representation in the state of the model (such as weights or outputs), enforced by the specific structure of the Lagrangian. The major novelty of this learning and inferential scheme is that it offers a local spatiotemporal algorithm that can open the door to facing the classic problem of forgetting. However, you have rightly pointed out a problem of enormous significance in lifelong-learning research. At this stage, we cannot claim that this paper specifically addresses the issue of forgetting. Our intuition is that the current proposal does not solve the problem in general, but it may offer effective solutions for specific tasks. We will add a sentence to clearly indicate that this remains an open issue, even within the proposed model.
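For readers less familiar with the terminology in point 2, the generic Bolza-type optimal control problem has the textbook form below. This is the standard template (symbols ψ, ℓ, f are generic), not the paper's specific functional; per the authors' reference, the paper's case has a null terminal cost, i.e. ψ ≡ 0.

```latex
% Generic Bolza optimal control problem; the paper's case has \psi \equiv 0.
\min_{v(\cdot)} \; \psi\bigl(x(T)\bigr)
  + \int_0^T \ell\bigl(t, x(t), v(t)\bigr)\, dt
\qquad \text{subject to} \qquad
\dot{x}(t) = f\bigl(t, x(t), v(t)\bigr), \quad x(0) = x_0 .
```

Here the running cost ℓ corresponds to the cost in Eq. (9) and the dynamics f to the system in Eq. (10), with v the controlled weight velocities.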

Comment

Hi,

Thank you for further clarifying. I appreciate your time on the replies.

  1. Great!
  2. That was helpful, thank you.
  3. Very cool idea, and I like the way you summarised it here. I could probably ask questions about this particular point for a while, but at some point, it is just for my own interest.

I do think some things about the paper could be made clearer, and perhaps you will have addressed that by the time the paper comes out. But I really appreciate the discussion in the reviews, it has helped me gain a clearer picture of the work and its possible significance.

I am happy to bump your score up a bit on my side. I will recommend, though, that you add some of these concrete statements in the paper regarding things like theoretical applications and certainly addressing memory. The reason is that the points didn't come across easily in the actual manuscript, which is a shame.

Comment

We genuinely appreciate your constructive suggestions and the insightful discussion. We will definitely incorporate statements regarding theoretical applications and memory into the paper to make these aspects clearer.

Official Review (Rating: 5)

The paper proposes a spatially and temporally local alternative to backprop for training RNNs. The paper first proposes viewing learning as minimizing the functional of a variational problem using Hamiltonian equations; standard backprop can be seen as the special case in which the velocity in the functional is set to infinity. Then, the paper proposes an alternative way of solving the variational problem, based on a Hamiltonian Sign Flip, that does not require going backward in time to compute the weight update as in conventional backprop. This allows online processing of inputs without needing a separate backprop phase.

Strengths

  • The paper gives a new perspective on how backprop can be seen as a special case of solving a variational problem with infinite velocity
  • The paper proposes an alternative to solving the variational problem without requiring going backward in time as in backprop

Weaknesses

  • Lack of clarity: The paper could be written less technically and more intuitively so it is accessible to a broad audience. I found it difficult to follow and struggled to fully understand how the sign flips remove the need to go backward in time. It would also be much better if the pseudo-code of the final algorithm were presented directly, which would allow for more reproducible results.
  • Lack of experiment: even though the paper focuses on a theoretical understanding of the learning rule, it is essential to have more standard experiments to understand the difference with conventional backprop. For example, evaluating on MNIST and comparing with backprop, using more neurons instead of just 5, is necessary for readers to understand its broader applicability.
  • Related work: There is a lot of related work on biologically plausible learning rules that are temporally local. A related work section and discussion on the relationship with these previous works will allow readers to understand its contribution much more easily.

Questions

Am I correct that the proposed algorithm is based on (23) with zero initial conditions and s(t) defined in (24)? In case the loss function is the difference between the label and output neuron's value in the final time step, and the input is a constant image, will this algorithm work as desired? How could the weight be updated when the label is not accessible before the final time step?

Limitations

The authors adequately addressed the limitations.

Author Response

We thank the reviewer for acknowledging the novelty of our approach. We will address the comments and questions raised.

It would also be much better if the pseudo-code of the final algorithm were presented directly, [...]

We agree that the presentation of the final algorithm that we are using could be improved. We will add an appendix with the pseudo-code that you can read in the attached PDF (at the end of the overall rebuttal).

Lack of experiment [...] is necessary for readers to understand its broader applicability.

This paper proposes a natural learning scheme that promotes environmental interactions. It covers theoretical foundations, and the experimental results are expected to validate the consistency of the theory rather than achieve high performance on classical benchmarks. We acknowledge the importance of conducting similar experiments. The proposed test could be conducted, for instance, by generating an appropriate data stream from MNIST, where each pattern is presented at the input for a fixed amount of time. However, this requires a considerable amount of experimental work and is beyond the scope of this paper. It is worth mentioning that the MNIST experimental setting described above is not the primary focus of the theory, and that experimental work in this area is still in its early stages. We would also like to point out that one of the topics from the call for papers is "theory (e.g., control theory, learning theory, algorithmic game theory)", which we believe is coherent with the spirit of this paper.

Related work [...]

We will add a “Related Work” section either after the introduction or as a subsection of it. In the overall rebuttal comments, we have included a draft of this section.

Am I correct that the proposed algorithm is based on (23) with zero initial conditions and s(t) defined in (24)? [...] How could the weight be updated when the label is not accessible before the final time step?

Yes, you are correct, but with random initialisation of the state. If supervision is not provided until the end of the temporal interval, changes in the weights during the middle of the interval will not be informed by the label but only by regularisation terms; all the Hamiltonian equations stay the same.

Comment

Thanks for the response. Since the paper has been revised to include the related work section and the pseudo-code, I would slightly increase the score to 5.

However, I still strongly suggest more experiments to verify the proposed learning rule. Yes, this work is intended to be theoretical, and the theoretical result that backprop is the special case (when velocity equals infinity) of the proposed learning rule is significant. Yet, many researchers will also likely be interested in the algorithm performance where velocity is not infinity. This requires either theory on the convergence rate showing how the loss goes down or experiments on some minimal dataset such as MNIST (note: the primary goal of experiments is to show the property of the proposed algorithm, e.g., ablation on the velocity, instead of achieving high performance). A new learning rule needs either theory or experiments to support it, and the lack of convergence guarantee and experiments on standard datasets do not inform readers how useful the proposed learning rule is.

Comment

Your suggestions concerning experimental investigation are on our agenda and we fully understand your comments. Thank you for your time and input.

Official Review (Rating: 7)

This paper, titled "Nature-Inspired Local Propagation," explores a novel learning framework that diverges from traditional machine learning methods, which heavily rely on large data collections and professional expertise. The authors propose a biologically inspired model emphasizing the critical role of temporal information processing. The framework assumes that learning and inference develop through a nature-based protocol of environmental interactions, where the agent uses limited memory buffers for processing information without recording the entire temporal stream. This approach, inspired by optimal control theory and principles of theoretical physics, emphasizes continuous learning through differential equations akin to natural laws.

The paper makes two main contributions. First, it introduces a local spatiotemporal pre-algorithmic framework inspired by Hamiltonian equations, showing that Backpropagation can be seen as a limit case of the proposed diffusion process at infinite velocity. This insight addresses the biological plausibility of Backpropagation by presenting a computational scheme that is local in both space and time, assuming the associated ordinary differential equations are solved as boundary problems. Second, the paper proposes a method for approximating the solution of Hamiltonian problems with boundary conditions using Cauchy's initial conditions, stabilizing the learning process through time reversal schemes related to the focus of attention mechanisms. The Hamiltonian Sign Flip (HSF) policy is experimentally validated for tracking problems in automatic control. While the proposed scheme optimally handles temporal settings and surpasses traditional algorithms like BPTT and RTRL, the authors note that real-world applications will require further development and substantial collaborative research.

Strengths

  • The computational model is clearly and generally defined based on the notations of computational graphs with natural limitations, which makes the theoretical achievements independent of the network architecture. This independence from architecture is an important feature in local learning approaches.

  • The paper is well-written and most of the time, the authors explained immediately what they meant after they used a technical term. Moreover, despite the paper being mostly theory-based, they tried to transfer some intuition behind their choices and assumptions.

  • I enjoyed the way the authors formalized the problem and the general idea of introducing the laws of learning in terms of Hamiltonian equations. However, there are some questions and assumptions whose validity is still not clear to me. I discuss them in the following sections. I would like to learn about those and possibly increase my score, since I did enjoy the paper overall.

  • I liked the reasonable explanations of the assumptions when formalizing the problem and the computational schemes. For example, the assumptions of locality in time and causality are both aligned with our understanding of time and the universe. They even went a step further by assuming a more general case of l-locality in time, meaning that the values can depend on the values within a horizon l of the past (not necessarily Markovian!). Moreover, defining τ as a fixed variable shows that the authors thought about the possibility of an independent time scale for the nodes. This implies that the authors were knowledgeable and careful about the practicality of the proposed method.

  • The authors were honest about the limitations of the work such as limitations regarding solving the ODE as a boundary problem.

Overall, I found the paper to be very engaging. I would like to improve my score once these questions and comments have been addressed.

Weaknesses

Main weaknesses:

  • This paper introduces a pre-algorithmic approach, which provides some theoretical perspectives and hypotheses. The experimental evaluation and analysis of the approach are weak. The explanation of the graphs and experimental setup is not sufficient. Is the network 5 neurons with depth 1, and is it linear or does it use a sigmoid activation? The explanation and notation used in the graphs are not clear. The goal of each plot and the conclusion are not clear. What were the authors trying to evaluate or show?

  • One of my biggest concerns is that it seems like there is an underlying assumption in the formalism of the paper where the authors assume the inputs, parameters and outputs are all a function of the variable of time. I am not so sure about the validity of this assumption. These values change based on other variables over time, meaning that they usually are indexed in time, rather than being a direct function of time. I would like to see more elaboration on this fundamental assumption and understand why it is valid. I agree that there could be defined a problem setting to see this dynamic, however, in most cases, this is not the case.

  • Another concern is around the assumption that the authors implicitly made in line 81, where they simplify equation 3. I am not sure why causality in time is valid and necessary here, while later on it seems to be reversed by the time-reversal scheme.

  • I believe the intuitive explanations of mathematical notions defined or borrowed from physics are not enough. I recommend modifying the write-up by defining a simple computational graph and showing the variables on it, then discussing how a Hamiltonian formulation can intuitively fit in for solving that graph using a spatiotemporal local learning rule. Alternatively, you can enrich the references to cover the background better, if there exist other works addressing the intuitive aspects of Hamiltonians in NNs.

  • What metrics or benchmarks were used to measure the effectiveness of the HSF policy and the overall framework?

  • Appendix C is not elaborated enough to show how to go from Eq. (14) to Eq. (16). I suggest opening up the equations and differentiations and actually deriving Eq. (16) for the final version of the paper. Readers and reviewers would not want to spend time doing it on their own.

  • line 165 "Proof. The proof comes immediately from (16)." Justify better how you get Eq. (17) directly from Eq. (16) in the limit.

  • The references provided in this paper are not enough. The reader struggles to understand the background of different concepts used in the paper.

Suggestions on referencing:

  • Line 35 "The discrete counterpart, which is more similar to recurrent neural network algorithms that can be found in the literature, can promptly be derived by numerical solutions." The reference for the literature is missing here.

  • line 39: "... thus addressing also the longstanding debate on Backpropagation biological plausibility." A reference to the debates is missing.

  • The possibility of "time reversal"

Some suggestions on writing and a few typos and grammatical errors:

  • line 10: " ... when the speed of ... "
  • line 30 "..., those small buffers allow(s) the agent ..."
  • line 51: "Formally this amounts to determine to assign to each vertex i ∈ V a trajectory x_i that is computed parametrically in terms of the other neuron outputs and in terms of an environmental information, mathematically represented by a trajectory1 u: [0,+∞) → Rd." This sentence can be written more clearly.
  • line 81, "Causality instead express(es) ..."
  • line 88, "the velocity constant (of) that controls the updates of"
  • line 152 "defined on chilren’s nodes" --> children's
  • line 201, "starting form" --> from

Questions

  • The authors mention that "Basically, the agent is not given the privilege of recording the temporal stream, but only to represent it properly by appropriate abstraction mechanisms. While the agent can obviously use its internal memory for storing those representations, we assume that it cannot access data collection. The agent can only use buffers of limited dimensions for storing the processed information. From a cognitive point of view, those small buffers allow the agent to process the acquired information backward in time by implementing a sort of focus of attention." There are two claims on the conditions here: 1) no privilege of recording the temporal stream, and 2) no access to data collection, only an abstract representation from its internal memory. What is the buffer with limited dimensions that is used for storing processed information? What is the processed information, and how is it encoded in a biological brain? I am wondering if this is a valid assumption neuroscientifically.

  • "Interestingly, we show that the on-line computation described in the paper yields spatiotemporal locality, thus addressing also the longstanding debate on Backpropagation biological plausibility." what do you exactly mean by this sentence? The debate on the biological plausibility of BP has many aspects. It is not clear in this sentence, that the authors intend to address which aspect of the biological plausibility of BP?

  • line 73: "In general we are interested in computational schemes which are both local in time and causal." It is not clear to me why causality matters. Is causality part of the solution constraints or problem constraints? And why is it necessary? Later on, we reverse this causality by considering reverse time.

  • c_i is the velocity constant to control the updates of neurons. Is the assumption c_i = c necessary since it is being used in most of the theorems?

  • Based on lines 74-77, and the definition of x(t) in lines 59-61, the underlying assumption is having inputs, parameters, and outputs as direct functions of (continuous) time rather than indexed by time. Is this a valid assumption? Do these values change by variable time, or do they change over time?

  • Is τ considered a fixed variable for all vertices, or can its value change for each vertex? Also, have you considered the setting where τ varies per vertex?

  • Line 52 says "trajectory u: [0,+∞) → R^d." Lines 92 and 93 define "vector u ∈ R^N ↦ IN^i(u)". It is not clear what IN is, or whether u here is related to the previous definition of input trajectories. Overall, in the spatial-locality section, it is not clear to me what IN is and what the intuition behind IN(u) and IN(w(t)) is. This section needs improvements.

  • Similarly, line 111 "ϕ: [0, T] → R is a strictly positive smooth function that weights the integrated" is confusing given the definition of ϕ earlier, in line 89. Are they conceptually related?

  • In equation 11, what is p? It is also not defined in Appendix A.

  • line 155 "where we have introduced the notation ai(t) to stand for the activation of neuron i". Does a_i(t) represent pre-activations or activations?

  • How scalable is this approach to higher input dimensions d and larger network architectures?

  • What are x, u, z in the plots? What is q? I did not understand the messages of the plots.

Limitations

NA

Author Response

Explanation of the graphs and experimental setup We utilize a fully connected recurrent network with five neurons (see Figure 1 in the PDF rebuttal), each using tanh activation functions. One neuron, designated as the output unit, aims to approximate a given reference signal through a quadratic loss in the Lagrangian. Figures 1–3 demonstrate the effectiveness of the local spatiotemporal neural propagation algorithm and the sign flip policy. Figure 4 displays the behaviour of the Lagrangian and Hamiltonian over time. In the figure caption, “Track a highly-predictable signal” should be “Track a highly-unpredictable signal.”

Assumptions on temporal dependence See Q5 below

Assumption in line 81 We assume, for the derivation of the Hamilton equations, that the state satisfies Equation (4). This equation is associated with multiple approximating sequences at discrete time, of the form in Equation (3). We emphasize the importance of causality because we aim to develop an algorithmic scheme that can be easily computed forward in time. However, Section 3 does not rely on this implicit assumption; it only assumes (4).

Defining a simple computational graph Yes, this facilitates understanding and will be included in the final version of the paper. In the attached PDF file we have illustrated the structure of the network used in the experiments, along with the main variables.

Metrics for HSF policy effectiveness Our goal was to demonstrate the stabilizing effect and tracking capabilities of the HSF policy. We monitored the effectiveness of our approach using a quadratic loss function, which is commonly used in regression.

Appendix C Thank you again for the suggestion. We will expand the derivations in the proof in the final version of the paper.

Line 165 From the last line of (16), with c_i = c, we can divide both sides by c. In the limit c → ∞ the only surviving terms are the ones that do not have 1/c in front; these are exactly the terms in Eq. (17). We will expand the proof in the final version of the paper.
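The structure of that argument can be written schematically (Eq. (16) itself is not reproduced in this thread, so A and B below are generic placeholders for the two groups of terms):

```latex
% Schematic only: after dividing by c, A collects the terms carrying 1/c,
% B the ones that survive; Eq. (17) corresponds to B = 0.
c\,B(t) + A(t) = 0
\;\;\Longrightarrow\;\;
B(t) + \frac{1}{c}\,A(t) = 0
\;\xrightarrow{\;c \to \infty\;}\;
B(t) = 0 .
```

This is the standard singular-limit step: terms scaled by 1/c vanish, and the remaining terms give the limiting equation.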

Suggestions on referencing (in order)

  • Bertsekas, Dimitri. Abstract Dynamic Programming. Athena Scientific, 2022, p. 2.
  • Crick, Francis. The recent excitement about neural networks. Nature, 337:129–132, 1989.
  • Stork, David G. "Is backpropagation biologically plausible?" International 1989 Joint Conference on Neural Networks, IEEE, 1989.
  • Lillicrap, Timothy P., and Geoffrey Hinton. "Backpropagation and the brain." Nature Reviews Neuroscience, 2020.
  • Evans, Lawrence C. Partial Differential Equations. Vol. 19, American Mathematical Society, 2022, p. 597.

Answers to questions

Q1. By “processed information,” we refer to the data stream u. The internal memory representation in our model is given by the state variables (outputs and weights). Some justification for our assumptions may be found in the way representational systems are realized from a neurobiological perspective.

  • Byrne, John H. Learning and memory: a comprehensive reference. Academic Press, 2017. Pag. 228–230.

Q2. We address two issues: the “update locking problem” and the often-overlooked issue of infinite signal propagation speed in neural networks. A clarifying sentence will be added.

Q3. We aim to develop a causal learning algorithm that processes information forward in time. Confusion may arise from continuous-discrete transitions in Section 2. We will make efforts to address it further.

Q4. The assumption is not necessary; it was used for simplification. We only require the condition c_i/c → 1 as c → ∞ for all i if we want a reduction to backpropagation.

Q5. We assume t ↦ u(t) is a given function of time. Outputs, while also functions of time, are subject to the dynamic constraints of Eq. (5), and indeed depend on other variables (neurons, inputs, and weights). Our proposal emphasizes the explicit temporal dependence of the parameters, modeling the continual adaptation necessary for continuous learning. We advocate shifting from finding a set of optimal parameters to searching for optimal weight trajectories. This approach reflects the ordinary cognitive perspective, where learning and inferential processes are not separate.

Q6. All the results hold if the spacing between points in the temporal partition is non-uniform, so τ → τ_n.

Q7. Yes, we agree that we need to be clearer. At any rate, the u in line 52 is the input signal, while the bold u in lines 92 and 93 is a generic vector that lives in the parameter space. IN^i(w) selects the components of the weights w associated with arcs that point to neuron i. We will change the notation.

Q8. No, they are not related. The function in line 89 is a capital Φ, and it specifies the structure of the neuron. The one on line 111 is an overall factor in the Lagrangian that weights the contributions of different time steps to the overall cost. We will use more distinct symbols.

Q9. It is a generic vector with the same dimension as $x$. You are right; we will add a definition.

Q10. Since the term “activation” is not used consistently in the literature, the best way to answer your question is to say that the $a_i$ are the quantities defined in Eq. (15), which may be what you refer to as pre-activations.

Q11. From a computational point of view, the approach is equivalent to GD + backpropagation, so it can in principle be used on networks of the same scale as those now used in deep learning.

Q12. Thanks for the question. In the plots, $x$ is the output of the output neuron (in the newly attached figure it is $x_1$), $u$ is the input signal, and $z$ is the reference signal. The parameter $q>0$ weights the term of the Lagrangian that enforces the fit between $x_1$ and $u$. The purpose of the plots is to show that using Hamilton's equations with the described sign-flip prescription is effective in solving an online tracking problem.
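To illustrate the idea behind this answer, here is a minimal toy sketch of forward integration of Hamilton's equations with a sign-flip prescription on a scalar tracking problem. It is our own illustrative example, not the paper's exact network equations: the quadratic cost, the constants `q` and `dt`, and the reference $z(t)=\sin t$ are all assumptions made for this sketch.

```python
import math

# Toy scalar tracking problem (illustrative only; not the paper's equations).
# Minimize the running cost (q/2)(x - z)^2 + u^2/2 subject to x' = u.
# Minimizing the Hamiltonian over u gives u* = -p, so Hamilton's equations are
#   x' = -p,   p' = -q (x - z),   with a terminal condition p(T) = 0.
# The terminal condition violates causality; the sign-flip prescription
# instead integrates p' = +q (x - z) forward in time from p(0) = 0.

def track(q=4.0, dt=1e-3, T=20.0, z=math.sin):
    """Forward-Euler integration of the sign-flipped Hamiltonian dynamics."""
    x, p, t = 0.0, 0.0, 0.0
    errs = []
    while t < T:
        e = x - z(t)          # tracking error against the reference signal
        x += dt * (-p)        # x' = dH/dp with the optimal control u* = -p
        p += dt * (q * e)     # sign-flipped costate equation (causal)
        t += dt
        errs.append(abs(e))
    return x, max(errs)

x_final, worst_err = track()
```

With the canonical (non-flipped) costate sign, the same forward integration from $p(0)=0$ diverges; the flip keeps the error bounded, which is the qualitative behavior the plots in the paper are meant to demonstrate.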

评论

Thank you to the authors for their thoughtful response and for addressing the suggestions. As I mentioned in my review, in my opinion, this paper is a valuable contribution to the field and deserves publication. In particular, its focus on rethinking current learning formulations to better align with online (or continual) problem settings is both timely and important. The new perspectives and insights offered in this work provide a solid foundation for approaching these challenges.

评论

Thank you for your thorough review. We’re pleased that you found our work valuable.

作者回复

We are pleased to see the detailed analysis and the insightful list of comments and criticisms, which are greatly appreciated, along with the time dedicated to reviewing our work. We are confident that these will significantly improve the quality of our paper.

Since more than one reviewer commented on the necessity of additional references, we followed Reviewer FXeY’s suggestion to reorganize the introduction and include a standalone “Related Work” section. Here is a draft of the content for this new section:

Optimal control. The main focus of optimal control is the study of minimization problems for dynamical systems [1, 2]. The two main complementary approaches to these problems are the Pontryagin Maximum Principle [3] and dynamic programming. As general minimization problems, both approaches intersect significantly with the calculus of variations [4]. Optimal control for discrete problems is also a classic topic [5, p. 2].

Neural ODE. Recently, in [6] and many subsequent papers [7, 8], results from optimal control have been used to define learning algorithms based on differential relations. However, these approaches differ significantly from the continual online learning considered in the present work, as the time in the class of ODEs they study is not related to the input signal that describes the flow of the learning environment.

Online. Several works have proposed formulating learning problems online from a single stream of data [9, 10]. The classical approach for learning RNNs online is Real-Time Recurrent Learning (RTRL) [11]. Since then, many approaches have been proposed to reduce the high space and time complexities associated with the progressive update of a Jacobian matrix [12]. In our method, no Jacobian matrices are stored; therefore, the proposed method is neither a generalization nor a reformulation of RTRL or related approaches like [13].

Nature-inspired computations. The major difference between our approach and other attempts to address the biological plausibility of backpropagation is that we propose a theory fully based on temporal analyses in the environment and the concept of learning over time. Many other classical [14] and recent approaches [15, 16, 17, 18, 19] are inspired by brain physiology, even if they share some locality properties described in this paper. Similarly, the vast majority of works that discuss the biological plausibility of backpropagation [20, 21, 22] overlook the role of time as we present it in this work. In contrast, we propose laws of neural propagation where the neural connections are updated over time, resembling natural processes.

[1] Bardi, Martino, and Italo Capuzzo Dolcetta. Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations. Vol. 12. Boston: Birkhäuser, 1997.

[2] Cannarsa, Piermarco, and Carlo Sinestrari. Semiconcave functions, Hamilton-Jacobi equations, and optimal control. Vol. 58. Springer Science & Business Media, 2004.

[3] Pontryagin, Lev S., Vladimir G. Boltyanskii, and Revaz V. Gamkrelidze. The mathematical theory of optimal processes. Macmillan Company, 1964.

[4] Giaquinta, Mariano, and Stefan Hildebrandt. Calculus of variations II. Vol. 311. Springer Science & Business Media, 2013.

[5] Bertsekas, Dimitri. Abstract dynamic programming. Athena Scientific, 2022.

[6] Chen, Ricky TQ, et al. "Neural ordinary differential equations." Advances in neural information processing systems 31 (2018).

[7] Kidger, Patrick, et al. "Neural controlled differential equations for irregular time series." Advances in Neural Information Processing Systems 33 (2020): 6696-6707.

[8] Massaroli, Stefano, et al. "Dissecting neural odes." Advances in Neural Information Processing Systems 33 (2020): 3952-3963.

[9] Mai, Zheda, et al. "Online continual learning in image classification: An empirical survey." Neurocomputing 469 (2022): 28-51.

[10] Wang, Liyuan, et al. "A comprehensive survey of continual learning: theory, method and application." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).

[11] Irie, Kazuki, Anand Gopalakrishnan, and Jürgen Schmidhuber. "Exploring the promise and limits of real-time recurrent learning." arXiv preprint arXiv:2305.19044 (2023).

[12] Marschall, Owen, Kyunghyun Cho, and Cristina Savin. "A unified framework of online learning algorithms for training recurrent neural networks." Journal of machine learning research 21.135 (2020): 1-34.

[13] Zucchet, Nicolas, et al. "Online learning of long-range dependencies." Advances in Neural Information Processing Systems 36 (2023): 10477-10493.

[14] Rao, Rajesh PN, and Dana H. Ballard. "Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects." Nature neuroscience 2.1 (1999): 79-87.

[15] Salvatori, Tommaso, et al. "Brain-inspired computational intelligence via predictive coding." arXiv preprint arXiv:2308.07870 (2023).

[16] Millidge, Beren, Alexander Tschantz, and Christopher L. Buckley. "Predictive coding approximates backprop along arbitrary computation graphs." Neural Computation 34.6 (2022): 1329-1368.

[17] Ororbia, Alexander, and Ankur Mali. "The predictive forward-forward algorithm." arXiv preprint arXiv:2301.01452 (2023).

[18] Ororbia, Alexander G., and Ankur Mali. "Biologically motivated algorithms for propagating local target representations." Proceedings of the aaai conference on artificial intelligence. Vol. 33. No. 01. 2019.

[19] Hinton, Geoffrey. "The forward-forward algorithm: Some preliminary investigations." arXiv preprint arXiv:2212.13345 (2022).

[20] Crick, Francis. "The recent excitement about neural networks." Nature 337 (1989): 129-132.

[21] Stork, David G. "Is backpropagation biologically plausible?" International 1989 Joint Conference on Neural Networks. IEEE, 1989.

[22] Lillicrap, Timothy P., et al. "Backpropagation and the brain." Nature Reviews Neuroscience 21.6 (2020): 335-346.

最终决定

This paper proposes a new, pre-algorithmic viewpoint on learning as an optimal control problem in real time. That is, the parameter trajectory over time is governed by a Hamiltonian that solves the Hamilton-Jacobi-Bellman equation. By requiring that the resulting equations be both causal and local (both in time and in the sense of the computational graph), the authors derive a set of equations governing learning. Unfortunately, these equations involve boundary conditions at the end of the time interval, violating causality. To restore causality, the authors propose a new set of dynamics based on sign flips and conjecture that it is an optimal solution.

All reviewers agreed that the idea is interesting and that the pre-algorithmic, theoretical approach is valuable. However, reviewers also agreed that the paper is quite difficult to read, both because it is particularly heavy on formality and because little effort is expended to give general machine learning readers an intuition for the problem. Likewise, the experiments were judged as weak, even for a theoretical paper, and this objection is not negligible, since the actual learning equations proposed are not proven but only conjectured to be optimal.

However, on balance, this seems like an interesting work likely to spark discussion. Accept.