PaperHub

ICLR 2024 · Rejected
Average rating: 5.0 / 10 (4 reviewers; ratings 3, 6, 6, 5; min 3, max 6, std dev 1.2)
Average confidence: 3.8

The Trifecta: Three simple techniques for training deeper Forward-Forward networks

Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

We propose three simple techniques to improve the Forward-Forward algorithm on deeper networks.

Abstract

Keywords

Forward-Forward, Deep Learning, Local Learning, Representation Learning

Reviews and Discussion

Review
Rating: 3

The paper focuses on forward-forward (FF), an alternative to backpropagation proposed by G. Hinton in 2022 (arxiv only). FF is based on a contrastive loss applied layer-wise between the original points and "negative" points. The authors apply three modifications to the original algorithm (that they call "the Trifecta"), which include a different loss term, batch normalization instead of layer normalization, and a blockwise model where some layers receive feedback from the subsequent layers. With these three modifications they are able to scale FF from MNIST to CIFAR-10 with good performance.
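For readers unfamiliar with the setup, the layer-wise contrastive objective summarized above can be sketched as follows. This is a minimal NumPy illustration of vanilla FF, with goodness taken as the sum of squared activations and a threshold theta, as in Hinton (2022); the function names are ours, not the paper's.

```python
import numpy as np

def goodness(h):
    # Goodness of a layer's activations: sum of squared activities per sample.
    return (h ** 2).sum(axis=1)

def vanilla_ff_loss(h_pos, h_neg, theta=2.0):
    # Push goodness of positive samples above theta and goodness of
    # negative samples below theta, via a logistic (softplus) loss.
    loss_pos = np.logaddexp(0.0, theta - goodness(h_pos))
    loss_neg = np.logaddexp(0.0, goodness(h_neg) - theta)
    return loss_pos.mean() + loss_neg.mean()
```

Each layer is trained with this loss on its own activations, with no error signal propagated further back, which is what distinguishes FF from backpropagation.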

Strengths

Finding alternatives to backpropagation that scale to larger datasets is a very interesting research direction. In this sense, the results shown here are good if compared with state-of-the-art models with no backward passes, although the authors are not providing these comparisons (see below). The paper is well written and easy to read, apart from the mathematical notation that could be drastically improved.

Weaknesses

I see a few weaknesses making me lean towards rejection.

  1. NOVELTY: the "Trifecta" (a bit excessive as a name) is composed of three small modifications: one was already published and applied to FF (a different loss term), one was published but not previously applied to FF (Overlapping Local Updates), and the final one is using batch normalization instead of layer normalization. These three methods are also justified informally or empirically by some visualizations (which appear mostly copy-pasted from a W&B run). Overall, this looks more like a report of someone experimenting on FF than a true scientific publication. This is made worse by the fact that (a) FF is unpublished, and (b) there is no convergence guarantee to begin with on FF optimization.

  2. RELATED WORKS: the paper is very shallow in its analysis of the related works, which should include other backpropagation-free techniques such as zeroth-order gradients (https://arxiv.org/abs/2305.17333), forward gradients (https://openreview.net/forum?id=JxpBP1JM15-), error-driven input modulation (https://proceedings.mlr.press/v162/dellaferrera22a/dellaferrera22a.pdf), etc. Many of these methods are similar to FF.

  3. EXPERIMENTS: connected to 2, it's unclear why the authors are only validating with respect to standard FF and not other alternatives to backpropagation (e.g., direct feedback alignment). Many of these methods are on-par with the results shown here.

Questions

I don't have many questions on the paper. Improving the related works and the experimental evaluation are good points (see above), but the paper remains of incremental value and lacks any formal guarantees of convergence. Hence, I do not think this is a valuable contribution to the conference.

Comment

We thank the reviewer for their time and feedback linked to our related work section and our comparisons to other work.

The paper is well written and easy to read, apart from the mathematical notation that could be drastically improved.

We acknowledge the math is slightly informal in the spirit of readability, but should be sufficient to bring across all necessary statements we set out to make. We would appreciate it if the reviewer could provide a more concrete pointer to any specific formula or notation that could be improved.

These 3 methods are also justified in an informal way or empirically by some visualization (which are mostly copy-paste from a W&B run)

We made certain to reference the corresponding appendices (A through E) for a more comprehensive evaluation that provides numerous experiments and plots to back up our statements. As noted in the work, all conclusions drawn from these plots are verified across a multitude of settings that are not all shown. If, in your opinion, any justifications are missing, we would be very glad to hear them.

Overall, this looks more like a report of someone experimenting on FF than a true scientific publication. This is made worse by the fact that (a) FF is unpublished, and (b) there is no convergence guarantee to begin with on FF optimization.

It is true that the FF paper is unpublished. However, given its author and the recent popularity and scrutiny it has received from the wider ML community, we would argue that it has been verified to a greater extent than many publications (for instance, the work was presented at NeurIPS: https://neurips.cc/virtual/2022/invited-talk/55869). Further, FF indeed doesn't have a convergence guarantee; its main focus is to be a preliminary investigation into a novel learning algorithm based on the popular contrastive learning paradigm, which has been shown to work in numerous publications.

The paper is very shallow in its analysis of the related works, which should include other backpropagation-free techniques such as zeroth-order gradients

This is a very valid point. Backpropagation-free techniques and local learning have received increased attention in the past years, resulting in many papers. To our knowledge, all existing items in the literature differ from FF by at least one key aspect (and most times by more). For instance, the reference https://arxiv.org/abs/2305.17333 is quite similar in approach but is presented as a finetuning technique for LLMs, not a full learning algorithm. Further, https://openreview.net/forum?id=JxpBP1JM15- differs in its use of a block-based approach. There are several block-based learning algorithms, none of which stands out in similarity to FF or in popularity, which led us to not discuss them individually. We have included https://proceedings.mlr.press/v162/dellaferrera22a/dellaferrera22a.pdf and some additional work in our related work discussion.

connected to 2, it's unclear why the authors are only validating with respect to standard FF and not other alternatives to backpropagation (e.g., direct feedback alignment)

As stated above, to our knowledge, there are no algorithms that stand out in similarity to FF that would lead to an informative and direct comparison (apart from BP, which was chosen as a baseline due to its ubiquity). Nevertheless, the reviewer is fully accurate in saying that additional comparisons are necessary to paint a more comprehensive picture. Therefore, Table 2 will soon be updated to demonstrate a more thorough comparison with the closest alternatives.

but the paper remains of an incremental value

The Trifecta is indeed composed of three existing techniques. However, we argue that our contributions reach further than simply proposing The Trifecta. Primarily, we uncover and study three weaknesses within the vanilla FF algorithm and elucidate how they can be solved by our proposed techniques to achieve very competitive accuracy on several datasets. Additionally, we outline the general characteristics of our solutions that cause the improvement, such that future work can build upon our findings. In conclusion, there are several findings of significant importance on the road to furthering our understanding of FF and other local learning algorithms.

We thank the reviewer again for the feedback and look forward to hearing if the primary concerns have been resolved with our modifications.

Comment

I thank the authors for the answer. I briefly comment on some points.

  1. On the math: currently there are only 2-3 equations that are loosely discussed inside the main text. This is enough to provide a loose understanding of the topic, but many details are unclear in this way. For example, reviewer ugNW has several interesting points on the batch dimension that for the most part could have been avoided by precise, mathematical descriptions. However, I underline this is only a minor part of my evaluation of the paper.

  2. On justification and novelty: we are saying the same thing; all methods are justified informally by changing a component, running some experiments, and seeing if the results improve. I am not saying this is necessarily a wrong approach, but in itself it is a limited approach since the methods themselves are known and repurposed.

  3. Backpropagation-free: the FF algorithm is an example of backpropagation-free training of neural networks. I suggested several other examples which I think are sufficiently known to be added as comparisons (e.g., direct feedback alignment). Saying "there are no algorithms that stand out in similarity to FF" seems a way to avoid having comparisons. I do not think there is anything special in FF apart from avoiding backpropagation.

Overall, while I appreciate the effort in answering, I believe the paper is still a mostly incremental evaluation that adapts existing methods to FF.

Comment

We thank the reviewer for their comments providing further details regarding previous statements.

In light of this, we have revised all equations in the main text to explicitly highlight how the summation within the loss happens (Sections 3 and 4). Further, we have already rephrased all related descriptions to remedy the comments of reviewer ugNW.

In our previous revision, we have added comparisons to FA and PEPITA. We also experimented with DFA, however, we found that for the architectures we considered in our experiments, the algorithm was quite unstable and failed to converge. To verify, we cross-referenced these findings with an implementation from https://github.com/lightonai/dfa-scales-to-modern-deep-learning and observed the same results. Therefore we opted to not include them.

Review
Rating: 6

The authors propose The Trifecta—three modifications to the recent Forward-Forward algorithm to improve convergence stability and scalability to more complex datasets. Specifically, they transfer existing techniques into their design of a modified Forward-Forward training algorithm and empirically demonstrate that their Trifecta Forward Forward (TFF) algorithm outperforms vanilla Forward-Forward and brings the accuracy on a collection of datasets equal or closer to accuracy achieved by training the same models via backpropagation. Motivation for this research is the consideration and development of alternative training paradigms for deep learning models.

Strengths

• Design of Trifecta Forward Forward (TFF) algorithm for backpropagation free neural network training

• Large-scale empirical analysis of TFF on 3 architectures and 4 datasets to compare performance against vanilla Forward-Forward (VFF) and backpropagation

• TFF offers sizeable improvements compared to VFF

• CIFAR-10 accuracy of TFF/d achieves nearly-SOTA performance for gradient-free training

Weaknesses

• Work feels largely incremental—authors adopt existing technologies (SymBa loss, Batchnorm, Overlapping Local Updates) to design TFF algorithm

• Main paper seems to contain information that could be summarized more concisely or moved to appendix (e.g., Section 5 seems very long for experimental setup).

Questions

  1. While the performance of TFF on CIFAR-10 is very good for backpropagation-free training, I found the design of the algorithm to borrow from existing technologies. I wanted to give you the chance to clarify if there were additional contributions in the design of TFF beyond adopting existing methods from the literature.

  2. On p. 8, you say that “Further, on simple datasets, our algorithm is on par with our backpropagation baseline, in the same amount of epochs.” Considering Table 2, this only appears to be the case for MNIST. Am I missing something?

  3. What is the walltime per epoch when using TFF compared to BP? I know this will vary per architecture and dataset so let’s say the largest model on CIFAR-10.

Comment

We are thankful for the time and effort of the reviewer in providing feedback related to the general contributions of this work.

Work feels largely incremental—authors adopt existing technologies (SymBa loss, Batchnorm, Overlapping Local Updates) to design TFF algorithm

Given the nascent state of the FF algorithm, our goal is to further explore its learning properties and uncover simple changes that improve the scalability and overall accuracy in a wide range of settings. To this end, we uncovered and focussed on three weaknesses of the original algorithm. Following this, we propose three modifications with techniques that are purposefully easily implementable due to their familiarity or simplicity. We argue that the main contributions of our work lie in finding these weaknesses, proposing general solutions to them and subsequently improving the scaling behaviour of the FF algorithm.

Main paper seems to contain information that could be summarized more concisely or moved to appendix (e.g., Section 5 seems very long for experimental setup).

Thanks for the feedback. Due to the novelty of the algorithm, we strove to make our setup and implementation-related decisions abundantly clear. However, we agree it is slightly too long for our intended purposes. The hyperparameter section has now been moved to the appendix.

While the performance of TFF on CIFAR-10 is very good for backpropagation-free training, I found the design of the algorithm to borrow from existing technologies. I wanted to give you the chance to clarify if there were additional contributions in the design of TFF beyond adopting existing methods from the literature.

Indeed, the main components of TFF borrow from existing techniques, as motivated above. However, this is not the main contribution of this work. Nevertheless, there are some minor modifications that are not specifically highlighted in the paper. The architecture, or specifically, the layer composition is slightly different from ordinary CNNs. Namely, we use the ordering BN, Conv, ReLU and then Maxpool. We found that performing a ReLU before the maxpool is very beneficial to the overall scaling behaviour of the network. Further, we propose a method to embed the label suitably in a CNN. These modifications are stated in the paper but are also the main reason we openly share our code, such that future experiments of FF can start on a solid foundation.
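As a concrete illustration, the block ordering described above could look like the following NumPy sketch. A 1x1 convolution stands in for the paper's real kernels, and shapes and names are illustrative, not taken from the released code.

```python
import numpy as np

def batchnorm(x, eps=1e-5):
    # Normalise each channel over the batch and spatial dimensions (N, H, W).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv1x1(x, w):
    # 1x1 convolution as a channel-mixing matmul (an illustrative
    # stand-in for larger kernels); w has shape (out_ch, in_ch).
    return np.einsum('nchw,oc->nohw', x, w)

def maxpool2(x):
    # 2x2 max pooling with stride 2.
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

def tff_block(x, w):
    # Ordering used in the authors' CNN blocks: BN -> Conv -> ReLU -> MaxPool.
    # Applying the ReLU before the maxpool is reported to help scaling.
    return maxpool2(np.maximum(conv1x1(batchnorm(x), w), 0.0))
```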

On p. 8, you say that “Further, on simple datasets, our algorithm is on par with our backpropagation baseline, in the same amount of epochs.” Considering Table 2, this only appears to be the case for MNIST. Am I missing something?

This is indeed only the case for MNIST; the text has been updated to make this clearer.

What is the walltime per epoch when using TFF compared to BP? I know this will vary per architecture and dataset so let’s say the largest model on CIFAR-10.

We did not include any details on this aspect as it is in line with expectations: FF is roughly twice as slow per epoch due to the inclusion of negative samples. Below are the averaged wall-clock times for CIFAR-10 and MNIST when trained for 100 epochs (on an RTX 4080). As the models are identical (except for the first layer) across datasets, this stays largely consistent. Please note that these experiments were not performed in a controlled environment and therefore suffer from substantial variance.

  • CIFAR-10, BP small: 5m 53s <-> FF small: 10m 36s
  • CIFAR-10, BP large: 7m 54s <-> FF large: 16m 38s
  • MNIST, BP small: 4m 41s <-> FF small: 9m 39s
  • MNIST, BP large: 7m 22s <-> FF large: 15m 03s

Hopefully, our answers are satisfactory and have answered your questions. If the reviewer has any additional comments, please let us know.

Comment

We appreciate the reviewer's feedback and welcome any remaining doubts or questions for discussion.

Comment

Thank you for addressing the weaknesses I listed and answering all of my questions. I appreciate the results on the wall-clock time for FF. I do not have any follow up questions and I will keep my original score for the paper.

Review
Rating: 6

This paper proposes 3 techniques that together significantly help training a deep network with forward-forward algorithms. Those techniques are:

  • Improved loss function which treats false positives and false negatives equally.
  • Use of batch normalisation instead of layer normalisation.
  • Use of overlapping local updates, so that each layer is also optimised to produce useful features to help the next layer goodness loss.

The method is evaluated on MNIST, Fashion-MNIST, SVHN and CIFAR-10 using CNNs and FCNs. While ablating the added techniques the performance on CIFAR-10 is brought from 44% to 75%. When training longer with more layers it reaches 83% while the backprop reaches 89%.

Strengths

The main text is overall easy to read without significant previous knowledge. The three techniques are motivated and refer to previous work. The main contribution seems to be using them all together on a FF setup for a combined improved accuracy and to map to the community the sort of techniques that can bring FF closer to BP training.

The ablations are clear, e.g. the layernorm->batchnorm has a very significant effect on allowing the subsequent layers to improve on the decisions from previous ones (Figure 1). Table 1 also shows the effect of the BN and OLU applied on top of a SymBA loss baseline.

Weaknesses

It appears to me that the proposed techniques run somewhat counter to what I (with my limited knowledge) perceive to be some of the motivations of FF:

  • Weight updates, as described in Section 3, are supposed to be done in two forward steps, one for positive and one for negative samples (maybe aiming to eventually decouple them completely in time (see the sleep section in Hinton 2022)). However, here, by removing the threshold, it appears one needs the batch to contain both positive and negative examples, which resembles a self-supervised contrastive setup.
  • Overlapping local updates, although not relying on global gradients, are a step away from being able to introduce black boxes in the middle of the forward pass.
  • I also can't help but notice that batch normalization as done here disallows training by seeing one example at a time (though maybe keeping batch statistics would easily address that?).

Combined with a CNN and evaluated in a supervised contrastive training setup, it is perhaps no surprise that this set of techniques obtains better results than vanilla FF. Despite that, I think this work may still be interesting to help map the set of techniques that bridge the gap to BP, leaving the discussion of whether they are in the spirit of VFF or biologically plausible to the reader.

Questions

A) Since in the loss discussion and appendixes the batch dimension is mostly left out, I feel like some details were unclear. E.g. is g_pos and g_neg used in the SymBa loss the mean of l2_norm of all the positive and negative examples in the batch? How is the batch constructed, is it highly correlated (e.g. contains the same input in both positive and negative case)?

B) Do you think keeping moving averages of the norm of positive and negative activations would allow one to train this without requiring both positive and negative examples to be in the batch?

C) Given the size of networks would it be practical and interesting to provide ablation of BN and OLUs on top of a VFF without the SymBa loss? Or am I missing something?

Comment

We thank the reviewer for their time and valuable comments. The reviewer raises some very valid questions and comments that will result in a solid improvement to the work. Some questions and comments are addressed slightly out of order to allow for more structure within our answers.

Since in the loss discussion and appendixes the batch dimension is mostly left out, I feel like some details were unclear. E.g. is g_pos and g_neg used in the SymBa loss the mean of l2_norm of all the positive and negative examples in the batch? How is the batch constructed, is it highly correlated (e.g. contains the same input in both positive and negative case)?

This was previously not made clear enough in the main text: g^pos and g^neg are vectors containing the goodness of the positive and negative samples, respectively, within each batch. Further, the reviewer is correct in saying that each batch is highly correlated; the negative samples are simply the same positive samples with a bogus label. As in other public implementations, we found that using a highly correlated setup produces the best results. The manuscript was updated to clarify the topic in accordance with this explanation.
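The batch construction described in this answer could be sketched as follows. This NumPy sketch assumes flattened inputs with Hinton-style one-hot label embedding in the first inputs; the helper names are ours and hypothetical.

```python
import numpy as np

def embed_label(x, y, num_classes=10):
    # Overwrite the first `num_classes` inputs with a one-hot label,
    # following the labelling scheme from the original FF paper.
    x = x.copy()
    x[:, :num_classes] = np.eye(num_classes)[y]
    return x

def make_ff_batch(x, y, num_classes=10, rng=None):
    # Correlated batch: each positive sample (true label) is paired with
    # the same input carrying a uniformly drawn wrong ("bogus") label.
    rng = np.random.default_rng() if rng is None else rng
    offset = rng.integers(1, num_classes, size=y.shape)
    y_neg = (y + offset) % num_classes  # guaranteed to differ from y
    return embed_label(x, y, num_classes), embed_label(x, y_neg, num_classes)
```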

Weight updates as described in section 3, are supposed to be done in 2 forward steps, one for positive and one for negative samples (maybe aiming to be able to completely decouple them further in time eventually (sleep section in Hinton 2022)). However here by removing the threshold it appears one needs the batch to contain both positive and negative examples and resembles a self-supervised contrastive setup.

Given that the sleep idea is currently not possible, as stated by Hinton (2022) in the revised version, using a concatenated batch (pos+neg) or two separate batches yields identical results, but the former is slightly faster on most hardware. The FF algorithm differentiates itself from most forms of contrastive learning by its use of goodness: instead of performing similarity comparisons on entire embeddings, they are done on this scalar goodness value. By changing the loss function to SymBa, these goodness values are indeed compared to each other to calculate the loss, instead of to a set threshold; they would need to be stored if used in a fully decoupled manner.
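Concretely, the SymBa comparison of paired goodness values described here might look like the following sketch; alpha is an assumed scaling hyperparameter, and the pairing follows the per-sample correlation described above.

```python
import numpy as np

def symba_loss(g_pos, g_neg, alpha=4.0):
    # SymBa (Lee & Song, 2023): per paired sample, maximise the goodness
    # gap delta = g_pos - g_neg directly, instead of comparing each
    # goodness to a fixed threshold theta as in vanilla FF.
    delta = g_pos - g_neg
    return np.logaddexp(0.0, -alpha * delta).mean()
```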

Do you think keeping moving averages of the norm of positive and negative activations would allow one to train this without requiring both positive and negative examples to be in the batch?

As alluded to above, this decoupling is certainly achievable, but in a slightly different manner from the proposed method. Two observations are key to this topic: first, samples within g^pos and g^neg are correlated, and hence the distance is calculated on a per-sample basis. Second, there is some variance between the thresholds of different samples, which results in the observed, more stable convergence. Therefore, rolling averages are not the best way to approach this decoupling, as we have verified in a small experiment. However, the statistics necessary are quite minimal, as only the goodness needs to be stored.

Overlapping local updates although not using non-global gradients is a step away from being able to introduce black boxes in the middle of the forward pass.

There are several aspects to consider when using semi-local gradients (outlined in Appendix F).

  • Biological plausibility: within our setup, all error communication is very local; further, by using slight alterations like random synaptic feedback, biological plausibility can be increased (if desired).
  • Black boxes: OLU doesn't hinder black boxes; it simply improves parts of the network that are uninterrupted. The strength of FF lies in its flexible objective, which does not necessitate specialised classification layers that may disturb these black boxes.
  • Non-differentiable operations: there are instances where each layer has a non-differentiable operation, and OLU therefore cannot be used. However, these scenarios are quite niche. We would also like to highlight that training on MNIST without OLU actually yields slightly better results (~99.7% instead of ~99.6%); on more complex datasets such as CIFAR-10, OLU works better, as seen in the ablation.
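To make the locality concrete, our reading of the OLU update pattern can be sketched with the following hypothetical helper (not code from the paper): each layer's goodness loss updates that layer and the one directly before it, so gradients never travel more than one layer back.

```python
def olu_update_schedule(num_layers):
    # For each layer's goodness loss, list the layer indices it updates
    # under Overlapping Local Updates: the layer itself plus its direct
    # predecessor (unlike BP, gradients never span the full depth).
    return {loss_layer: [i for i in (loss_layer - 1, loss_layer)
                         if 0 <= i < num_layers]
            for loss_layer in range(num_layers)}
```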

I also can't stop to notice that batch normalization as done here also disallows training by seeing one example at a time (though maybe keeping batch statistics would easily address that?).

This is exactly true. As our experiments show, BN works well in this scenario as it finds a middle ground between fully renormalising features and no normalisation; we argue any sensible normalisation function with this property would perform comparably. We performed the proposed experiment, using only running averages of BN during training with otherwise identical settings to TFF in Table 1, and achieved 73.3%, which is slightly lower but still very competitive. As this is quite specific, these results will be pushed to the appendix (E) instead of the main text.
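The variant tested in this experiment, normalising with running statistics only, might look like the following minimal sketch; the momentum value and update order are assumptions on our part.

```python
import numpy as np

class RunningBatchNorm:
    # Normalise with running statistics only, so that (unlike standard
    # BN in training mode) batches of size one can also be processed.
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.mean = np.zeros(num_features)
        self.var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x):
        # Update the running estimates from the incoming batch, then
        # normalise with the running values rather than batch statistics.
        m = self.momentum
        self.mean = (1 - m) * self.mean + m * x.mean(axis=0)
        self.var = (1 - m) * self.var + m * x.var(axis=0)
        return (x - self.mean) / np.sqrt(self.var + self.eps)
```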

Comment

Continuation of the previous comment

Given the size of networks would it be practical and interesting to provide ablation of BN and OLUs on top of a VFF without the SymBa loss? Or am I missing something?

Again, yes, this should've been made clearer originally, and the manuscript has been updated, also in accordance with the comment of R1. In summary: SymBa is the most important part of the Trifecta; without it, the other techniques do not produce competitive results (shown below).

  • No SymBa (tau=10) + BN: 37.3%

  • No SymBa (tau=10) + OLU: 50.8%

  • No SymBa (tau=10) + BN + OLU: 59.0%

  • No SymBa (tau=2) + BN: 41.0%

  • No SymBa (tau=2) + OLU: 48.2%

  • No SymBa (tau=2) + BN + OLU: 57.9%

Despite that, I think this work may still be interesting to help mapping the set of techniques that bridge the gap to BP and leave the discussion to whether they are in the spirit of VFF or biologically plausible to the reader.

This hits the nail on the head. We endeavoured to find the three simplest modifications to the FF algorithm that drastically improve its accuracy. Our specific goal was to use techniques that are familiar (or at least understandable) to anyone so that they can be easily implemented. On top of this, our experiments attempt to extract the positive properties from each technique such that future work can adopt our learned principles and modify the algorithm to be more biologically plausible or achieve even higher accuracy. Thanks again for the comments, which we believe have strengthened our paper. If there is anything else, please let us know.

Hinton, Geoffrey. "The forward-forward algorithm: Some preliminary investigations." arXiv preprint arXiv:2212.13345 (2022).

Comment

We have recently uploaded a revision that further remedies the reviewer's comments regarding the use of positive and negative examples in the loss. If there are any additional comments or questions, we would be happy to respond.

Review
Rating: 5

Modern machine learning models are able to outperform humans on a variety of non-trivial tasks. However, as the complexity of the models increases, they consume significant amounts of power and still struggle to generalize effectively to unseen data. Local learning, which focuses on updating subsets of a model’s parameters at a time, has emerged as a promising technique to address these issues. Recently, a novel local learning algorithm, called Forward-Forward, has received widespread attention due to its innovative approach to learning. Unfortunately, its application has been limited to smaller datasets due to scalability issues. To this end, we propose The Trifecta, a collection of three simple techniques that synergize exceptionally well and drastically improve the Forward-Forward algorithm on deeper networks. Our experiments demonstrate that our models are on par with similarly structured, backpropagation-based models in both training speed and test accuracy on simple datasets. This is achieved by the ability to learn representations that are informative locally, on a layer-by-layer basis, and retain their informativeness when propagated to deeper layers in the architecture. This leads to around 84% accuracy on CIFAR-10, a notable improvement (25%) over the original FF algorithm. These results highlight the potential of Forward-Forward as a genuine competitor to backpropagation and as a promising research avenue.

Strengths

  • The methods are relatively straightforward to be applied to the existing Forward-Forward Algorithm.
  • Paper is well-written. It is easy to follow the explanations written in the paper itself.
  • Experiment results show that following these recipes improves the vanilla Forward-Forward algorithm and yields results on par with networks trained via backpropagation in terms of classification accuracy.
  • Code is available.

Weaknesses

My main concern is with regard to the generalizability and scalability of the proposed method when it is applied to the Forward-Forward algorithm. This can be viewed from multiple lenses as follows:

  • First of all, while the image classification results are promising when Trifecta is applied to the Forward-Forward algorithm, that does not mean that the only task we need to tackle is the classification task. It is desirable to see how well the Trifecta fares with the vanilla FF and SGD on a wider variety of tasks such as Machine Translation, image generation, or different types of learning mechanisms such as self-supervised learning, reinforcement learning, multimodal learning, etc.
  • In addition, the network used for evaluating this method is relatively small compared to most experiments that are performed in this avenue. It would bring substantial improvement to the paper to use larger networks such as ResNet and attention-based models to evaluate the performance of Trifecta on a given task compared to the vanilla FF and SGD.
  • Furthermore, in the ablation, it would be interesting to include studies of the vanilla loss function combined with either BN or OLU, and compare them to the SymBa loss function [2] combined with one of these two other ingredients, just like the results in Table 1.
  • Finally, even though it is not desirable to perform evaluation on large datasets such as ImageNet, I am curious about the performance of TFF compared to VFF and BP on the image classification task, since most of the methods proposed in this avenue are evaluated on large datasets.

Questions

See Weaknesses

[1] Hinton, Geoffrey. "The forward-forward algorithm: Some preliminary investigations." arXiv preprint arXiv:2212.13345 (2022).

[2] Lee, Heung-Chang, and Jeonggeun Song. "SymBa: Symmetric Backpropagation-Free Contrastive Learning with Forward-Forward Algorithm for Optimizing Convergence." arXiv preprint arXiv:2303.08418 (2023).

Comment

We thank the reviewer for their detailed concerns regarding the further scalability of the TFF algorithm.

First of all, while the image classification results are promising when Trifecta is applied to the Forward-Forward algorithm, that does not mean that the only task we need to tackle is the classification task. It is desirable to see how well the Trifecta fares with the vanilla FF and SGD on a wider variety of tasks …

This is absolutely correct and is an interesting avenue for future research. However, given the nascent state of FF, we opted to limit the scope of this work to image classification, for two reasons: the original paper has results exclusively in this modality, to which we wanted to make a direct comparison, and image classification has the most well-known and standardized datasets.

In addition, the network used for evaluating this method is relatively small compared to most experiments that are performed in this avenue. It would bring substantial improvement to the paper to use larger networks such as ResNet and attention-based models to evaluate the performance of Trifecta on a given task compared to the vanilla FF and SGD.

Again, we certainly agree that a more thorough architecture search is very desirable. In the context of local learning, however, many ubiquitous architectures from BP simply do not work well. For example, a bottleneck architecture will yield terrible accuracy as the error signal is "not strong enough" to learn compressed representations. That said, we have tried adding residual connections to our networks. From our limited experiments, we found that the accuracy increases (slightly less than 1% on the CIFAR-10 model). However, as this is only a very slight improvement, we decided not to include this finding and put more focus on our three main modifications.

Furthermore, in the ablation, it would be interesting to include studies that combine the vanilla loss function with either BN or OLU and compare them against the SymBa loss function [2] combined with one of these two ingredients, just like the results in Table 1.

After some consideration, we opted not to show the full ablation table, in the spirit of clarity. The primary reason is that SymBa was found to be the most important technique: without it, the accuracy differences from the other techniques are comparatively minor and therefore less informative. However, we now understand this was not made clear enough. The manuscript has therefore been updated to discuss this aspect and show the full table (reproduced below for ease of access).

  • No SymBa (tau=10) + BN: 37.3%

  • No SymBa (tau=10) + OLU: 50.8%

  • No SymBa (tau=10) + BN + OLU: 59.0%

  • No SymBa (tau=2) + BN: 41.0%

  • No SymBa (tau=2) + OLU: 48.2%

  • No SymBa (tau=2) + BN + OLU: 57.9%
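For reference, a minimal PyTorch sketch contrasting the original threshold-based FF loss with a SymBa-style symmetric loss [2]. The threshold `theta`, the scaling factor `alpha`, and the sum-of-squares goodness are illustrative assumptions, not values taken from the manuscript:

```python
import torch
import torch.nn.functional as F

def goodness(h):
    # Goodness of a layer activation: sum of squared activities per sample.
    return h.pow(2).sum(dim=1)

def vanilla_ff_loss(h_pos, h_neg, theta=2.0):
    # Original threshold-based FF objective: push positive goodness above
    # the threshold theta and negative goodness below it.
    return (F.softplus(theta - goodness(h_pos)) +
            F.softplus(goodness(h_neg) - theta)).mean()

def symba_loss(h_pos, h_neg, alpha=4.0):
    # SymBa-style symmetric objective [2]: optimize the *gap* between
    # positive and negative goodness directly, removing the fixed threshold.
    # alpha is a scaling factor; the value here is illustrative.
    delta = goodness(h_pos) - goodness(h_neg)
    return F.softplus(-alpha * delta).mean()
```

Because the symmetric loss depends only on the gap, widening it (e.g. larger positive goodness) always lowers the loss, which matches the ablation showing SymBa as the dominant factor.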

Finally, even though it is not desired to perform evaluation on large datasets such as ImageNet, I am curious regarding the performance of TFF when it is compared to VFF and BP on image classification task since most of the methods proposed in this avenue are evaluated on large datasets.

In light of this comment, we performed experiments on CIFAR-100. The small model achieves 35.8% after 200 epochs and the large model achieves 35.4% after 500 epochs, using the same setup as for all other datasets. We originally decided not to evaluate larger datasets, as the main goal was to scale from MNIST to CIFAR-10 and compare the results to the original paper. Most local learning approaches are not yet competitive with BP on more complex datasets. We have added these new results to Table 2.

Lastly, if there are any additional questions or comments, we would be glad to hear them in detail.

Comment

We thank the reviewer again for the feedback; if there are any remaining doubts or questions, we would be delighted to hear them.

Comment

I highly appreciate the authors for addressing my concerns.

However, I still find the reasoning regarding the first and second weaknesses not substantial enough.

  • For instance, in my first point regarding the weakness, it is desirable to include different tasks as evaluation benchmarks to better understand how TFF performs against VFF.
  • In addition to that, you can simply include your initial findings for using residual networks in the benchmarks.

Due to this, I believe that I will keep my score as it is.

Comment

We thoroughly thank the reviewer for further elaborating their concerns on the perceived weaknesses of this work, which we hope to further clarify.

Forward-Forward is an algorithm that is very flexible in its learning requirements but has not yet been explored in terms of its input and output structure. In its simplest form, FF is simply a binary classifier. This can be extended to n-way classification in an intuitive manner by sampling all classes. The drawback is that tasks requiring a large number of classes become less feasible. This includes complex datasets such as ImageNet, but also most generative tasks. This is not unusual for local learning, and consequently most papers stick to well-understood image classification datasets to illustrate the capabilities of their method. We are confident these weaknesses can be resolved by exploring alternative evaluation techniques (briefly covered in Appendix H) and by improving FF to suit these tasks. We encourage any future work comparable to [1] to push the limits of FF.
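To make the n-way extension concrete, here is a minimal sketch of goodness-based inference in the style of the original FF paper: a candidate label is overlaid on the first input dimensions, and the class whose input accumulates the highest total goodness wins. The layer sizes and the choice to sum goodness over all layers are illustrative assumptions:

```python
import torch

def overlay_label(x, label, num_classes=10):
    # Replace the first `num_classes` input dimensions with a one-hot
    # encoding of the candidate label, as in the original FF paper.
    x = x.clone()
    x[:, :num_classes] = 0.0
    x[:, label] = 1.0
    return x

def predict(layers, x, num_classes=10):
    # n-way classification: run the network once per candidate label and
    # pick the class with the highest accumulated goodness across layers.
    scores = []
    for label in range(num_classes):
        h = overlay_label(x, label, num_classes)
        total = torch.zeros(x.shape[0])
        for layer in layers:
            h = torch.relu(layer(h))
            total = total + h.pow(2).sum(dim=1)  # goodness of this layer
        scores.append(total)
    return torch.stack(scores, dim=1).argmax(dim=1)
```

This illustrates why large label spaces are costly: inference requires one forward pass per candidate class, so the cost grows linearly with the number of classes.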

As for the residual connections, given our limited experiments and preliminary results, we added our initial experiments in Appendix I.

We are grateful to the reviewer for their feedback and hope the reviewer better understands the reasoning behind our decisions.

[1] https://arxiv.org/abs/2006.12878

Comment

We would like to sincerely thank all reviewers for their time and insightful feedback. We carefully reviewed all comments and would like to address the raised concerns.

R1, R3, and R4 express reservations about the use of existing techniques within The Trifecta. Given the nascent state of the FF algorithm, the main goal of this work is to further explore its learning properties and uncover simple changes that improve scalability and overall accuracy in a wide range of settings. To this end, we identified and focused on three weaknesses of the original algorithm, and propose three modifications using techniques that are purposefully easy to implement due to their familiarity or simplicity. We argue that the main contributions of our work lie in finding these weaknesses, proposing general solutions to them (from which we picked the simplest ones and show that they resolve all issues, hence our title), and subsequently improving the scaling behavior of the FF algorithm. In our discussion with R2, we further show that variations on these techniques achieve similar results.

We appreciate that R1, R2, and R4 emphasized that the paper is well-written and clear. Further, all reviewers note our significant improvement in accuracy over the original FF algorithm, which aligns with one of the main goals of the paper: making FF more competitive.

R1 and R2 comment on the incomplete ablation study of The Trifecta. This has been fully rectified (Table 1).

R1 raises questions about the further scaling of FF and its ability to generalize to other modalities. The goal of the paper was to make the leap from competitive results on MNIST to CIFAR-10. However, in light of this feedback, additional results for CIFAR-100 have been added.

R2 asks several questions about additional constraints that The Trifecta incurs. However, with some additional experiments and explanations (Table 1 and Appendix E), we hope to have answered these in a satisfactory manner.

R3 asks us to clarify some details regarding the contributions and characteristics of the Trifecta and FF, specifically regarding wall-clock time. We clarified our motivations, our contributions, and these details further.

R4 brings up comments regarding the shallow comparisons in our related work and in our experiment section. We have updated both sections in accordance with the proposed suggestions.

AC Meta-Review

The paper introduces "The Trifecta," an enhancement to the Forward-Forward algorithm, aimed at improving deep network training. The proposed techniques include a balanced loss function, batch normalization, and overlapping local updates. Evaluated on several datasets (MNIST, Fashion-MNIST, SVHN, CIFAR-10), the Trifecta demonstrates improved performance over the original Forward-Forward algorithm.

Strengths:

  • The Trifecta's methods are straightforward, easily integrated into the existing Forward-Forward Algorithm.
  • The paper is well-structured and clear.

Weaknesses:

  • The Trifecta's generalizability and scalability, especially on diverse tasks and larger datasets, remains unproven.
  • Limited evaluation scope, primarily on smaller networks, raises questions about its efficacy on larger, more complex architectures. This is despite its specific claim of expanding toward deeper networks.
  • A lack of comprehensive ablation studies limits the understanding of the specific contributions.
  • Performance comparisons with a broader range of backpropagation-free methods are lacking.

Why not a higher score

See weaknesses

Why not a lower score

N/A

Final Decision

Reject