PaperHub
ICLR 2025
Overall rating: 4.8/10 (Rejected, 4 reviewers)
Individual ratings: 5, 3, 6, 5 (min 3, max 6, std 1.1)
Confidence: 3.5 | Correctness: 2.8 | Contribution: 2.8 | Presentation: 2.3

From MLP to NeoMLP: Leveraging Self-Attention for Neural Fields

OpenReview | PDF
Submitted: 2024-09-28 | Updated: 2025-02-05
TL;DR

We propose a method that performs message passing on the graph of MLPs via self-attention and functions as a conditional neural field

Abstract

Keywords
Neural fields, Self-attention, Auto-decoding, Transformers, Conditional neural fields, Implicit neural representations, Graphs

Reviews and Discussion

Official Review (Rating: 5)

The authors change the architecture of an MLP for a neural field into a transformer-like format: input tokens representing a position are self-attended together with learned tokens, and the result is finally used to regress the values of the target object at that position.

Strengths

  • Very interesting method of unifying transformer architecture with INRs…was definitely wondering if something like this existed and the authors seem to have come up with it. Excited for what other researchers can build on this.
  • Strong results showing the internal representations of the trained networks can be used for classification (i.e. MNIST) against several recent baselines (Table 2)
  • Strong ablation studies

Weaknesses

  • Too much hyperparameter tuning to be generalizable (i.e. all of Appendix B). Authors should defend why this is ok. Since they are from a single sample, I wonder if they are overfit to them, and if researchers can reliably use this for other samples without extensive tuning?

  • Should use stronger baselines. For video, SPDER seems to be the most similar to SIREN but stronger. There is also NeRV (Neural Representations for Videos) and VideoINR which are more complex but probably should be compared also.

  • Image representation is standard for INR experiments and is missing.

  • Novel view synthesis is not included (NeRF)

  • The parameter count may be the same as SIREN, but due to the fitting/fine-tuning on a large dataset (which SIREN does not do as it fits to a sample) I suspect the FLOPs of this model are significantly higher, which means it’s not fair to compare it to a model with no “pre-training”. I may be misunderstanding the “fitting dataset” here but just referencing 2.3 paragraph 3.

Questions

  • Are there quantitative results for audio? Figure D in the Appendix is quite suspicious as no metrics are included and the errors seem quite large even though they’re better than SIREN.

Details of Ethics Concerns

n/a

Comment
The parameter count may be the same as SIREN, but due to the fitting/fine-tuning on a large dataset (which SIREN does not do as it fits to a sample) I suspect the FLOPs of this model are significantly higher, which means it’s not fair to compare it to a model with no “pre-training”. I may be misunderstanding the “fitting dataset” here but just referencing 2.3 paragraph 3.

NeoMLP can function both as an unconditional neural field (i.e. we fit its parameters to individual signals from scratch, similar to Siren), as well as a conditional neural field. Section 3.3 (2.3 in the original version of the manuscript) describes NeoMLP as a conditional neural field, where we learn one set of embeddings for each signal (e.g. each image) in a dataset. In the experiments in section 4.1, where we fit high-resolution signals, we use NeoMLP as an unconditional neural field; there is no pre-training involved there.

That said, the reviewer’s intuition is correct: the FLOPs for our method are much higher than for Siren. More specifically, we measure the FLOPs for NeoMLP and Siren on the “bikes” signal, using the hyperparameters described in Appendix E.1 in the revised manuscript. NeoMLP has 51.479 MFLOPs, while Siren has 3.15 MFLOPs.
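For reference, per-forward-pass FLOPs of this kind can be obtained with an off-the-shelf counter; below is a minimal sketch using fvcore, where the toy model and input shape are placeholders rather than the models used for the numbers above.

```python
# Minimal sketch of measuring per-forward-pass FLOPs with fvcore.
# The toy model and input below are placeholders, not the models used
# for the reported numbers.
import torch
from fvcore.nn import FlopCountAnalysis

model = torch.nn.Sequential(          # stand-in for a coordinate-based network
    torch.nn.Linear(3, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 3),
)
coords = torch.rand(1, 3)             # one (t, x, y) coordinate

flops = FlopCountAnalysis(model, (coords,))
print(f"{flops.total() / 1e6:.3f} MFLOPs per forward pass")
```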

Despite having a higher computational complexity compared to the baselines, NeoMLP can actually fit high resolution signals faster, and does so while having a smaller memory footprint, since it can make use of small batch sizes. As an example, for the BigBuckBunny signal, NeoMLP requires 13.2 GB of GPU memory, compared to 18.5 for Siren. We refer the reviewer to Appendix C, figure 5, and table 6 in the revised manuscript for further details.

Are there quantitative results for audio? Figure D in the Appendix is quite suspicious as no metrics are included and the errors seem quite large even though they’re better than SIREN.

The quantitative results for audio are shown in Table 1. We apologize for the confusion. The y-axes between the subfigures in figure 6 and the subfigures in figure 8 in the updated manuscript are different. We have updated the manuscript to clarify that and included one more figure (figure 7 in the updated manuscript) that shows the amplitude of the errors compared to the groundtruth signal.

Comment

We would like to thank the reviewer for their time and insightful comments. We appreciate that they find our method “very interesting” and are “excited for what other researchers can build on this”. Below, we address the reviewer’s concerns in detail.

Too much hyperparameter tuning to be generalizable (i.e. all of Appendix B). Authors should defend why this is ok. Since they are from a single sample, I wonder if they are overfit to them, and if researchers can reliably use this for other samples without extensive tuning?

We understand that Appendix B (Appendix E in the revised manuscript), can give the impression that we are doing too much hyperparameter tuning. However, we are just reporting the full set of hyperparameters for completeness and reproducibility. In most experiments, we are only tuning a few hyperparameters (e.g. the learning rate and the RFF dimensionality) and the method works out of the box. More specifically, for fitting single signals (appendix E.1), we use the exact same hyperparameters for the Bikes video, and the BigBuckBunny video with audio, except that we fit BigBuckBunny for more epochs. The hyperparameters for FFN hidden dim, token dimensionality, and number of layers are chosen such that the number of parameters in NeoMLP approximately matches the number of parameters for Siren to ensure fair comparison. The audio clip is a much smaller signal, and thus, we scale down the FFN hidden dim, token dimensionality, number of heads, and number of layers. The only hyperparameter that is perhaps counter-intuitive and required some tuning is the RFF dimensionality. For the audio piece, we used a large value of 512, perhaps due to the high frequency components of the signal.

Similarly, for fitting datasets of signals, in Appendix E.2, we use the exact same hyperparameters for ShapeNet10 and MNIST, while a few hyperparameters differ for CIFAR10. Furthermore, as shown in tables 3 and 4, our method is pretty robust to various combinations of hyperparameters.

Should use stronger baselines. For video, SPDER seems to be the most similar to SIREN but stronger. There is also NeRV (Neural Representations for Videos) and VideoINR which are more complex but probably should be compared also.

We thank the reviewer for the suggestion. Following this suggestion, along with suggestions from reviewer ANQc, we have included 2 additional baselines. The first baseline is RFFNet [1], an MLP with ReLU activations and random Fourier features (RFF) that encode the input coordinates. The second baseline is SPDER [2], a recent state-of-the-art neural field, that uses an MLP with sublinear damping combined with sinusoids as activation functions. We report the results in the table below and in table 1 in the revised manuscript.

Method | Bach | Bikes | Big Buck Bunny (Audio) | Big Buck Bunny (Video)
RFFNet | 54.62 | 27.00 | 32.71 | 23.47
Siren | 51.65 | 37.02 | 31.55 | 24.82
SPDER | 48.06 | 33.80 | 28.28 | 20.44
NeoMLP | 54.71 | 39.06 | 39.00 | 34.17

NeoMLP outperforms all baselines, especially in the more complex setup of multimodal data (BigBuckBunny). Our hypothesis is that NeoMLP can exploit smaller batch sizes and learn with stochastic gradient descent, while all baselines seem to rely on full batch gradient descent, which is intractable for larger signals.

Furthermore, NeoMLP is effectively more memory efficient, as it requires less GPU memory than the baselines to fit the signals, since it uses smaller batch sizes. As an example, for the BigBuckBunny signal, NeoMLP requires 13.2 GB of GPU memory, compared to 13.9 for RFFNet, 18.5 for Siren, and 39.2 for SPDER. We include the full details about runtime and memory requirements in Table 6 (Appendix C) in the revised manuscript.

Finally, we thank the reviewer for pointing out NERV and VideoINR, which we have cited in the revised manuscript. We note that these works are video-specific and orthogonal to our work. For example, they both employ Siren as a backbone, which could be replaced by NeoMLP; we are excited to see such applications of our method in the future.

[1] Tancik et al. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS 2020.

[2] Shah et al. SPDER: Semiperiodic Damping-Enabled Object Representation. ICLR 2024.

Official Review (Rating: 3)

This paper targets an important problem in NeFs: how to represent signals with good reconstruction ability while maintaining good classification ability. The authors propose NeoMLP, viewing the MLP as a complete graph, and employ self-attention for message passing among all the nodes. The experiments show that NeoMLP can represent complex signals, especially multi-modality signals such as video with audio, and has better performance on the downstream classification task.

Strengths

  1. The idea of viewing MLP as a complete graph is novel.
  2. The experiment on multi-modality data is cool.

Weaknesses

  1. Some important closely related works are missing. Apart from conditioning NeFs with an auto-decoder, a more efficient conditioning method is the hyper-network, such as [1][2]. More importantly, the idea of applying self-attention to vectors that consist of the nodes of an MLP is quite similar to [1][2].

  2. The definition of NeoMLP is not clear. In Figure 1, NeoMLP is the MLP with a fully connected graph, while in Figure 2, NeoMLP is the self-attention backbone.

  3. There is no clear evidence that viewing MLP as a fully connected graph may help to improve the reconstruction and classification ability. Current improvement may be due to the better fitting ability from self-attention. I suggest the authors use simple Linear layers as their symmetric function. Then the NeoMLP will just become a simple MLP with more input dimension and output dimension due to the fully connected graph. If this simple MLP still has better performance, the claim that viewing MLP as a fully connected graph leads to a better reconstruction and classification ability can be better proved.

  4. The quantitative ablation of the self-attention backbone is missing. Is it possible to replace the self-attention with other symmetric functions in graph learning?

  5. The details for I, H, and O in line 179 are missing. From line 680, it seems that I+H+O=8; then for an audio regression task, we have I=1, O=1, and H=6?

  6. The claim that “the optimal downstream performance was often achieved with medium quality reconstructions” needs more evidence. To show your method has a better performance to balance PSNR and classification accuracy, I suggest the authors provide curves for different methods for PSNR vs. accuracy, rather than the PSNR at best accuracy.

  7. More examples and compared methods such as Miner [3] should be discussed in Table 1.

[1] Chen, Yinbo, and Xiaolong Wang. "Transformers as Meta-Learners for Implicit Neural Representations." ECCV 2022.
[2] Zhang, Shuyi, Liu, Ke, et al. "Attention Beats Linear for Fast Implicit Neural Representation Generation." ECCV 2024.
[3] Saragadam, Vishwanath, et al. "MINER: Multiscale Implicit Neural Representation." ECCV 2022.

Questions

  1. The comparison with the hyper-network-based condition methods.
  2. The clear evidence for the claim that viewing MLP as a fully connected graph leads to a better reconstruction and classification ability.
  3. The curves for different methods for PSNR vs. accuracy.
  4. More examples and compared methods should be in Table 1.

Details of Ethics Concerns

No ethics review needed

Comment

We would like to thank the reviewer for their time and insightful comments, as well as for finding our method “novel” and our experiment on multimodality “cool”. Below, we address the reviewer’s concerns in detail.

There is no clear evidence that viewing MLP as a fully connected graph may help to improve the reconstruction and classification ability. Current improvement may be due to the better fitting ability from self-attention. I suggest the authors use simple Linear layers as their symmetric function. Then the NeoMLP will just become a simple MLP with more input dimension and output dimension due to the fully connected graph. If this simple MLP still has better performance, the claim that viewing MLP as a fully connected graph leads to a better reconstruction and classification ability can be better proved.

The quantitative ablation of the self-attention backbone is missing. Is it possible to replace the self-attention with other symmetric functions in graph learning?

We do not intend to claim that merely viewing MLP as a fully connected graph guarantees improved reconstruction and downstream abilities. Instead, we are inspired by viewing MLPs as computational graphs on which we perform message passing, which allows us to introduce conditioning as a built-in component of the architecture. Since NeoMLP operates on a fully connected graph without edge features, the choice of Transformers seems more natural than any graph neural network, which would be as computationally demanding as a Transformer, given the quadratic complexity in the number of tokens. Thus, we opt for a Transformer backbone, as it has proven to be an expressive and scalable architecture across a wide range of tasks.

The details for I, H, and O in line 179 are missing. From line 680, it seems that I+H+O=8; then for an audio regression task, we have I=1, O=1, and H=6?

We discuss the details for I, H, and O in the first paragraph of section 3.1 in the revised manuscript (lines 98-101), the first paragraph of section 3.2 (lines 156-161), and in lines 192-195. We have updated the manuscript in section 3.2 to clarify the details for these variables. I denotes the number of input dimensions and O denotes the number of output dimensions, and thus, they are defined by the problem at hand. In contrast, H denotes the number of hidden nodes, and is chosen as a hyperparameter. As an example, for a single-channel audio signal, we have I=1 (time) and O=1 (single-channel amplitude). We then choose H=6 as a hyperparameter for a total of 8 tokens. For an RGB image, we would have I=2 (x, y coordinates) and O=3 (R, G, B color channels), while a video would additionally include a time coordinate, i.e. I=3.
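To make the token layout concrete, here is a schematic sketch for the audio example above; the names and dimensions are illustrative and not the actual implementation.

```python
import torch
import torch.nn as nn

# Schematic token layout for a single-channel audio signal: I=1, H=6, O=1.
# Names and dimensions are illustrative, not the authors' implementation.
I, H, O, D = 1, 6, 1, 64                       # input dims, hidden nodes, output dims, token width
num_freqs = 16                                 # illustrative RFF dimensionality

B = torch.randn(num_freqs, I)                  # fixed random Fourier frequency matrix
to_token = nn.Linear(2 * num_freqs, D)         # projects RFF features to a D-dim input token
hidden_emb = nn.Parameter(torch.randn(H, D))   # learnable hidden-node embeddings
output_emb = nn.Parameter(torch.randn(O, D))   # learnable output-node embeddings

t = torch.rand(I)                              # one time coordinate in [0, 1]
proj = 2 * torch.pi * (B @ t)                  # (num_freqs,)
feats = torch.cat([torch.sin(proj), torch.cos(proj)])       # (2 * num_freqs,)
input_token = to_token(feats).unsqueeze(0)                   # (I, D)

tokens = torch.cat([input_token, hidden_emb, output_emb])    # (I + H + O, D) = (8, D)
# Self-attention + FFN layers are then applied to `tokens`, and the predicted
# amplitude is read from the final representation of the O output tokens.
```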

Some important closely related works are missing. Apart from conditioning NeFs with an auto-decoder, a more efficient condition method is hyper-network, such as [1][2]. More importantly, the idea of delivering self-attention for handling vectors that consist of nodes of MLP is quite similar to [1][2].

We thank the reviewer for pointing out these related works, which we have included in the revised version of the manuscript. While there are similarities between these works and ours, there are also important differences. One very important difference is that both suggested works use data patches as input to a transformer-based hyper-network. The use of patches makes these methods resolution-dependent and modality-dependent. For example, if our input was a giga-pixel image, the hyper-network would generate a large number of patches as context, which would significantly increase the spatial complexity of these methods. Further, regarding the attention operator, the first work uses self-attention with tokens that represent the weight columns to generate INR weights, while the second work uses cross-attention to provide context to the query coordinates. Instead, our method uses self-attention on the set of coordinate dimensions and latent tokens, which are learned through auto-decoding.

The definition of NeoMLP is not clear. In Figure 1, NeoMLP is the MLP with a fully connected graph while in Figure 2, NeoMLP is the self-attention backbone

Figure 1 (right) shows the graph on which NeoMLP performs message passing; it does not represent the computational graph of NeoMLP. Instead of performing message passing on the original MLP graph (Figure 1, left), we treat it as a fully-connected graph and use high-dimensional features to make message passing more scalable and expressive. Figure 2 shows the architecture with which NeoMLP performs message passing: we employ weight-sharing through self-attention. We have revised the manuscript to clarify this distinction.

Comment
The claim that “the optimal downstream performance was often achieved with medium quality reconstructions” needs more evidence. To show your method has a better performance to balance PSNR and classification accuracy, I suggest the authors provide curves for different methods for PSNR vs. accuracy, rather than the PSNR at best accuracy.

The study of Papa et al. [1] showed (Figure 6) a clear trend for unconditional neural fields, where the test accuracy was positively correlated with PSNR for low PSNR values, until it reached a critical point, after which it was negatively correlated with PSNR. On the other hand, our ablation study on the importance of various hyperparameters, shown in Tables 3 and 4, shows a positive correlation between test accuracy and PSNR until a critical point after which the accuracy plateaus. We have compiled the results from these tables in Figure 9 in Appendix H in the revised manuscript, where the positive correlation is visually clearer (rho = 0.65).

[1] Papa et al. How to Train Neural Field Representations: A Comprehensive Study and Benchmark. CVPR 2024.

More examples and compared methods such as Miner [3] should be discussed in Table 1.

We thank the reviewer for the suggestion. Following suggestions from reviewer ANQc and reviewer GNWA, we have included 2 additional baselines. The first baseline is RFFNet [1], an MLP with ReLU activations and random Fourier features (RFF) that encode the input coordinates. The second baseline is SPDER [2], a recent state-of-the-art neural field, that uses an MLP with sublinear damping combined with sinusoids as activation functions. We report the results in the table below and in table 1 in the revised manuscript.

Method | Bach | Bikes | Big Buck Bunny (Audio) | Big Buck Bunny (Video)
RFFNet | 54.62 | 27.00 | 32.71 | 23.47
Siren | 51.65 | 37.02 | 31.55 | 24.82
SPDER | 48.06 | 33.80 | 28.28 | 20.44
NeoMLP | 54.71 | 39.06 | 39.00 | 34.17

NeoMLP outperforms all baselines, especially in the more complex setup of multimodal data (BigBuckBunny). Our hypothesis is that NeoMLP can exploit smaller batch sizes and learn with stochastic gradient descent, while all baselines seem to rely on full batch gradient descent, which is intractable for larger signals.

Furthermore, NeoMLP is effectively more memory efficient, as it requires less GPU memory than the baselines to fit the signals, since it uses smaller batch sizes. As an example, for the BigBuckBunny signal, NeoMLP requires 13.2 GB of GPU memory, compared to 13.9 for RFFNet, 18.5 for Siren, and 39.2 for SPDER. We include the full details about runtime and memory requirements in Table 6 (Appendix C) in the revised manuscript.

Finally, we thank the reviewer for pointing out works like Miner, which we have included in the revised version of the paper. We note that multiscale neural fields like Miner are orthogonal to our work, as they use a collection of MLPs to model the signal at multiple scales and increase the fidelity of reconstruction.

[1] Tancik et al. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS 2020.

[2] Shah et al. SPDER: Semiperiodic Damping-Enabled Object Representation. ICLR 2024.

Comment

Thank you very much for providing such a detailed rebuttal. However, I do not think my major concerns are well addressed.

  1. I do not agree that the choice of Transformers is so natural. As the authors mention, the Transformer has quadratic complexity, which provides higher non-linearity than the linear layer. Therefore, I think the performance boost should be owed to the higher non-linearity of the Transformer. There is no clear evidence to support the motivation of viewing MLPs as computational graphs (Reviewer ewZN also points out that "it appears to be much more similar to a simple Transformer"). Maybe another way to demonstrate this is to show that the proposed method has a better performance than the simple Transformer with coordinates as input and queried values as output.
  2. I agree with the authors that the transformer-based hyper-network methods may fail to handle a giga-pixel image dataset due to some efficiency problem. However, the authors also do not provide clear evidence that the auto-decoder methods have some advantages in handling a giga-pixel image dataset over the Transformer-based hyper-network methods. I still believe that the Transformer-based hyper-network methods have a better generalization ability than the auto-decoding methods because the hyper-network can provide a much stronger representation ability than a single representation vector as in the auto-decoding methods.

Due to these reasons, I tend to maintain my score currently.

Comment

We thank the reviewer for their valuable feedback. Below, we address the reviewer’s concerns.

I do not agree that the choice of Transformers is so natural. As the authors mention, the Transformer has quadratic complexity, which provides higher non-linearity than the linear layer. Therefore, I think the performance boost should be owed to the higher non-linearity of the Transformer. There is no clear evidence to support the motivation of viewing MLPs as computational graphs (Reviewer ewZN also points out that "it appears to be much more similar to a simple Transformer"). Maybe another way to demonstrate this is to show that the proposed method has a better performance than the simple Transformer with coordinates as input and queried values as output.

We argue that the choice of Transformers is straightforward and natural compared to other graph neural networks, since we operate on a fully-connected graph without edge features. Given the fully connected graph, any GNN architecture would have quadratic complexity in the number of nodes.

Following up on the reviewer’s suggestion, we perform an ablation study comparing our method with a simple Transformer that uses cross-attention. The coordinates are used as input queries in the Transformer, while the embeddings are used as keys and values. Such cross-attention based neural fields have been increasingly popular in the literature.
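To make the baseline explicit, the kind of cross-attention neural field we have in mind looks roughly as follows; this is a schematic sketch with placeholder dimensions, not the exact configuration used for the numbers below.

```python
import torch
import torch.nn as nn

# Rough sketch of a cross-attention neural field baseline: the coordinate acts
# as the query, and a learned set of embeddings acts as keys and values.
# Dimensions and names are illustrative only.
class CrossAttentionField(nn.Module):
    def __init__(self, in_dim=1, out_dim=1, num_latents=8, dim=128, heads=8):
        super().__init__()
        self.coord_proj = nn.Linear(in_dim, dim)                    # embed the coordinate
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # keys / values
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, out_dim))

    def forward(self, coords):                                      # coords: (batch, in_dim)
        q = self.coord_proj(coords).unsqueeze(1)                    # (batch, 1, dim)
        kv = self.latents.unsqueeze(0).expand(coords.shape[0], -1, -1)
        out, _ = self.attn(q, kv, kv)                               # single cross-attention layer
        return self.head(out.squeeze(1))                            # predicted signal value

model = CrossAttentionField()
pred = model(torch.rand(4, 1))   # four time coordinates -> four predicted amplitudes
```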

We ran an experiment on fitting the “Bach” audio signal. We use the same hyperparameters for the Transformer as with NeoMLP from Appendix E, except that we increase the dimensionality to 128 to account for the fact that the model can only have one layer. This results in 199,937 parameters, which is comparable with the number of parameters for our method and the baselines. We also run a second experiment on fitting MNIST. We show the results in the following table.

Method | Bach PSNR | MNIST PSNR
Transformer | 50.90 | 24.13
NeoMLP | 54.71 | 33.98

NeoMLP outperforms the simple Transformer baseline in both cases.

I agree with the authors that the transformer-based hyper-network methods may fail to handle a giga-pixel image dataset due to some efficiency problem. However, the authors also do not provide clear evidence that the auto-decoder methods have some advantages in handling a giga-pixel image dataset over the Transformer-based hyper-network methods. I still believe that the Transformer-based hyper-network methods have a better generalization ability than the auto-decoding methods because the hyper-network can provide a much stronger representation ability than a single representation vector as in the auto-decoding methods.

We agree with the reviewer that encoder-style hyper-network neural fields have very strong representation capabilities, courtesy of the ViT backbone used in them. We also agree that neural fields that use a single latent vector have weaker representation capabilities, and that is why we use a set of latent vectors in NeoMLP.

Overall, auto-decoding methods inherit one of the fundamental advantages of neural fields: they are resolution independent and they scale gracefully with the signal complexity instead of the signal size. As such, in the case of giga-pixel images, auto-decoding methods can fit the data assuming a sufficiently large architecture, while encoder-style methods would fail to do so. In general, we are not trying to replace encoder-style approaches; rather, auto-decoding approaches offer complementary benefits, e.g. they make no assumptions about the observations, i.e. they are modality independent, in addition to being resolution independent.

Official Review (Rating: 6)

This paper proposes a new neural network paradigm for neural function approximation, particularly motivated by improving the fitting capacity and representations of Neural Fields (NeFs). In particular, the authors propose to replace the feed-forward nature of MLPs with a fully connected neural network, coined NeoMLP. In this case, information processing happens with synchronous message passing, where neurons (input, hidden and output) are all connected and exchange information.

To make this possible, the authors propose to initialise the features of all nodes (apart from the input ones which are initialised using input values) using learnable embeddings for hidden and output nodes. Additionally, they use attention for information aggregation to reduce the number of parameters, where the attention weights are shared across the entire graph. This architecture is also used for conditional neural fields, i.e. to fit multiple neural fields using the same backbone, where the hidden/output embeddings are learned and can be later used as a representation for each neural field. Experimentally, the proposed method shows promising performance in terms of its ability to accurately fit neural fields, as well as the ability of the learned representations to perform well on downstream tasks, compared with other NeF processing architectures.

Strengths

  • Significance. The paper attempts to address an important and timely problem. In particular, as NeFs are becoming increasingly popular in various deep learning application domains, designing new methodologies for learning informative NeF representations is a key desideratum of the field.
  • Novelty. The paradigm proposed for function approximation is, to the best of my knowledge, a new and refreshing idea (also a quite natural one) and could potentially allow for further advancements beyond classical MLPs.
  • Simplicity and Presentation. The modifications to MLPs proposed are simple and easy to implement. Additionally, they are mostly well-presented and easy to follow.
  • Experimental evidence. The provided results seem promising both in terms of fitting capacity, as well as generalisation of the representations in downstream tasks.

Weaknesses

  • Evaluation. One of the major weaknesses that I see in this paper is that some aspects are not well-evaluated. In detail:

    • The authors have not adequately examined the trade-offs in terms of runtime. In particular, neither the fitting phase nor the finetuning phase is evaluated w.r.t. this aspect, although this architecture might turn out to be slower, e.g. compared to the Functa approach, especially w.r.t. the finetuning phase. Also, reporting the training time of Siren vs NeoMLP would be a helpful addition.
    • Certain implementation details are not well-justified or ablated:
      • Why did the authors use Random Fourier features? Is that a necessary addition? I would suggest ablating this choice, e.g. by comparing with an MLP + RFF or NeoMLP without RFF vs MLP.
      • Why did the authors choose a Transformer-like architecture and not a GNN, with e.g. linear/MLP aggregation? Perhaps baselining with such an approach can provide an adequate justification via experimental evidence. Note that this approach will probably also be more computationally friendly.
  • Analysis of the method/Theory. I believe that since this is a new paradigm, additional effort is expected to analyse its behaviour. For example,

    • Could the authors discuss the internal symmetries of this approach? My understanding was that since the authors are using positional embeddings for the hidden nodes, then there might not be any permutation symmetries, but the authors mention that such symmetries do exist. I think this claim should be made formal.
    • Could the authors discuss the expressivity of this paradigm? MLPs are known to be universal approximators. Could it be the case that NeoMLP is also universal?
  • Motivation. Although I liked the idea and it seems reasonable, I am unsure if the motivation provided is adequate. It may be improved by discussing the aspects I mentioned in the previous bullet point, but currently, it seems mostly ad hoc. For example, the authors mention: L058: “shares the connectionist principle: cognitive processes can be described by interconnected networks of simple and often uniform units.”. I do not see how this statement can be related to learning better NeF representations while fitting them to signal data. Could the authors provide more concrete arguments concerning that?

Questions

Minor:

  • L122: “Finally, instead of having scalar node features, we increase the dimensionality of node features, which makes self-attention more scalable” --> I would understand using high-dimensional features as a means to make the network more expressive (although this is not discussed), but I do not understand why this makes the network more scalable.
  • There are a few typos throughout the text. I suggest that the authors perform a thorough proof-reading before updating their manuscript
  • L112: “we create learnable parameters for the hidden and output neurons” --> I believe the authors here refer to the initialisation of the features of the neurons (input neurons are initialised with input values, while hidden + output are initialised with a learnable initialisation). Is my understanding here correct? Perhaps, explaining this in detail will help the interested reader.
  • Does the number of latents in Table 3 correspond to the number of hidden nodes?
  • There are some very recent papers providing algorithms to process NeF parameters among others (related to their symmetries) that the authors might want to cite. For example:
    • The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof, Lim et al., NeurIPS'24
    • Monomial Matrix Group Equivariant Neural Functional Networks, Tran et al., NeurIPS'24
    • Scale Equivariant Graph Metanetworks, Kalogeropoulos et al., NeurIPS'24
Comment
L112: “we create learnable parameters for the hidden and output neurons” --> I believe the authors here refer to the initialisation of the features of the neurons (input neurons are initialised with input values, while hidden + output are initialised with a learnable initialisation). Is my understanding here correct? Perhaps, explaining this in detail will help the interested reader.

Yes, the understanding is correct. We have updated the manuscript to make the distinction more clear.

Why did the authors use Random Fourier features? Is that a necessary addition? I would suggest ablating this choice, e.g. by comparing with an MLP + RFF or NeoMLP without RFF vs MLP.

As shown by Rahaman et al. [1], neural networks suffer from spectral bias, i.e. they prioritize learning low frequency components, and have difficulties learning high frequency functions. We expect that these spectral biases would also be present in NeoMLP if left unattended. To that end, we employed Random Fourier Features (RFF) to project our scalar inputs to higher dimensions. Compared to alternatives like sinusoidal activations, RFFs allow our architecture to use a standard transformer.
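As a concrete illustration, random Fourier features map a low-dimensional coordinate to a higher-dimensional embedding through fixed random frequencies; the sketch below uses illustrative sizes and scale, not the exact values from our experiments.

```python
import torch

# Minimal random Fourier feature (RFF) encoding of input coordinates.
# The sizes and the frequency scale sigma are illustrative, not the values
# used in the paper.
def rff_encode(coords, B):
    """coords: (N, d) in [0, 1]; B: (d, m) fixed random frequency matrix."""
    proj = 2 * torch.pi * coords @ B                                # (N, m)
    return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)    # (N, 2m)

sigma, m = 10.0, 256                     # frequency scale and number of frequencies
B = torch.randn(1, m) * sigma            # sampled once, then kept fixed during training
t = torch.rand(1024, 1)                  # 1024 scalar time coordinates
features = rff_encode(t, B)              # (1024, 512); used as high-dimensional input features
```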

Following on the suggested ablation study, we train NeoMLP without RFF, using a learnable linear layer instead. We train this new model on the “bikes” video, and on MNIST. We present the results in the following two tables.

Table 1: Ablation study on the importance of RFFs. Experiment on the bikes video.

Method | PSNR
NeoMLP (no RFF) | 35.92
NeoMLP | 39.06

Table 2: Ablation study on the importance of RFFs. Experiment on MNIST.

Method | PSNR | Accuracy (%)
NeoMLP (no RFF) | 30.33 | 98.81 ± 0.03
NeoMLP | 33.98 | 98.78 ± 0.04

The study shows that RFFs clearly help with reconstruction quality, both in reconstructing a high-resolution video signal and on a dataset of images. Interestingly, the reconstruction quality drop from removing RFFs does not translate into a drop in downstream performance; in fact, the model without Fourier features is marginally better than the original. We have included this ablation study in the revised manuscript.

[1] Rahaman et al. On the Spectral Bias of Neural Networks. ICML 2019.

Does the number of latents in Table 3 correspond to the number of hidden nodes?

The number of latents in table 3 corresponds to the number of hidden and output nodes. The models in this ablation study have 8 and 16 nodes in total, respectively. 2 nodes correspond to the input dimensions, resulting in 6 and 14 nodes, respectively. Out of those, 3 nodes correspond to the output dimensions (RGB). Hence, we have 3 hidden nodes and 11 hidden nodes, respectively.

There are some very recent papers providing algorithms to process NeF parameters among others (related to their symmetries) that the authors might want to cite.

We thank the reviewer for their suggestion. We were already citing the work of Lim et al. in the original manuscript. We have gladly included the suggested works in the related work in the updated manuscript.

Comment
Could the authors discuss the internal symmetries of this approach? My understanding was that since the authors are using positional embeddings for the hidden nodes, then there might not be any permutation symmetries, but the authors mention that such symmetries do exist. I think this claim should be made formal.

We thank the reviewer for the suggestion. We have included a discussion and proofs about the permutation symmetries of NeoMLP in the revised manuscript in appendix B. Here, we include a brief discussion on the symmetries, and refer the reviewer to the manuscript for further details. Intuitively, when we permute two hidden embeddings of a randomly initialized or a trained model, we expect the behaviour of the network to remain the same, as the final output of the network does not depend on the ordering of the hidden embeddings. Formally, NeoMLP is a function that comprises self-attention and feed-forward networks applied interchangeably for a number of layers, following equations 2 and 3 in the manuscript. As a transformer architecture, it is a permutation equivariant function. Thus, the following property holds: $f(\mathbf{P} \mathbf{X}) = \mathbf{P} f(\mathbf{X})$, where $\mathbf{P}$ is a permutation matrix and $\mathbf{X}$ is the set of tokens fed as input to the transformer.

Now consider the input to NeoMLP, $\mathbf{T}^{(0)} = [\{\mathbf{i}_i\}_{i=1}^I, \{\mathbf{h}_j\}_{j=1}^H, \{\mathbf{o}_k\}_{k=1}^O]$ with $\mathbf{T}^{(0)} \in \mathbb{R}^{(I+H+O) \times D}$, and the case of permuting the hidden neurons. The corresponding permutation matrix is $\mathbf{P}_1 = \mathbf{I}_{I \times I} \oplus \mathbf{P}_{H \times H} \oplus \mathbf{I}_{O \times O}$, where $\mathbf{I}$ is the identity matrix, $\mathbf{P}_{H \times H}$ is a permutation matrix, and $\oplus$ denotes the direct sum operator, i.e. stacking matrix blocks diagonally, with zero matrices in the off-diagonal blocks. Applying this permutation to $\mathbf{T}^{(0)}$ permutes only the hidden neurons. Next, we apply NeoMLP to the permuted inputs. By the equivariance property, the output of the function applied to the permuted inputs equals the permutation of the output of the function applied to the original inputs, i.e. $f(\mathbf{P}_1 \mathbf{T}^{(0)}) = \mathbf{P}_1 f(\mathbf{T}^{(0)})$. Since the network only uses the output tokens in the final step as the output of the network, the overall behaviour of NeoMLP is invariant to permutations of the hidden nodes.
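This invariance is also easy to verify numerically; the following is a small self-contained sanity check that uses a generic transformer encoder (without positional encodings) as a stand-in for the NeoMLP backbone, with hypothetical sizes.

```python
import torch
import torch.nn as nn

# Numerical sanity check: permuting the hidden tokens of a permutation-equivariant
# encoder leaves the output tokens unchanged. A generic nn.TransformerEncoder
# (no positional encodings) stands in for the NeoMLP backbone; sizes are hypothetical.
torch.manual_seed(0)
I, H, O, D = 2, 3, 3, 32
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2
).eval()                                        # eval() disables dropout

tokens = torch.randn(1, I + H + O, D)           # [input | hidden | output] tokens
perm = torch.randperm(H) + I                    # permute only the hidden block
permuted = tokens.clone()
permuted[:, I:I + H] = tokens[:, perm]

with torch.no_grad():
    out_ref = encoder(tokens)[:, I + H:]        # read only the O output tokens
    out_perm = encoder(permuted)[:, I + H:]

print(torch.allclose(out_ref, out_perm, atol=1e-5))   # True
```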

Although I liked the idea and it seems reasonable, I am unsure if the motivation provided is adequate. It may be improved by discussing the aspects I mentioned in the previous bullet point, but currently, it seems mostly ad hoc. For example, the authors mention: L058: “shares the connectionist principle: cognitive processes can be described by interconnected networks of simple and often uniform units.”. I do not see how this statement can be related to learning better NeF representations while fitting them to signal data. Could the authors provide more concrete arguments concerning that?

We find that existing conditional neural field architectures include ad-hoc and over-engineered modules that are glued together, often resulting in weak conditioning methods or weak representations for downstream tasks. Thus, we believe that neural fields need a more “native” architecture that encompasses the need for powerful conditioning and representation. We draw inspiration from connectionism and the history of neural networks to build such a native architecture, in which conditioning is built-in, while self-attention boosts the expressivity and scalability of the method.

L122: “Finally, instead of having scalar node features, we increase the dimensionality of node features, which makes self-attention more scalable” --> I would understand using high-dimensional features as a means to make the network more expressive (although this is not discussed), but I do not understand why this makes the network more scalable.

With scalability, we refer to the fact that self-attention on nodes with scalar features is a trivial operation, since the dot product becomes a simple scalar multiplication, and, thus, cannot scale to more complex datasets. Indeed, however, expressivity is another aspect for using self-attention. We discuss the expressivity of NeoMLP in a previous question.

There are a few typos throughout the text. I suggest that the authors perform a thorough proof-reading before updating their manuscript.

We thank the reviewer for pointing out the existence of typos. We have performed a round of proofreading and corrected the typos in the revised version.

Comment

We would like to thank the reviewer for their time and insightful comments. We appreciate that they find our paradigm “a new and refreshing idea”, and our method “simple and easy to implement”. Below, we address the reviewer’s concerns in detail.

The authors have not adequately examined the trade-offs in terms of runtime. In particular, neither the fitting phase nor the finetuning phase are evaluated w.r.t. this aspect, although this architecture might turn out to be slower, e.g. compared to the Functa approach, especially w.r.t. the finetuning phase. Also, reporting the training time of Siren vs NeoMLP would be a helpful addition.

We report the runtime for Functa and NeoMLP in the tables below, and in Appendix E in the revised manuscript.

Table 1: MNIST

Method | Fitting epochs | Fitting runtime (min.) | Finetuning epochs | Finetuning runtime (sec.)
Functa | 192 | 240 | 3 | 16
NeoMLP | 20 | 63 | 10 | 318

Table 2: CIFAR10

Method | Fitting epochs | Fitting runtime (min.) | Finetuning epochs | Finetuning runtime (sec.)
Functa | 213 | 418 | 3 | 16
NeoMLP | 50 | 305 | 10 | 646

Table 3: ShapeNet

Method | Fitting epochs | Fitting runtime (min.) | Finetuning epochs | Finetuning runtime (sec.)
Functa | 20 | 1002 | 3 | 250
NeoMLP | 20 | 713 | 2 | 1680

NeoMLP consistently exhibits lower runtimes for the fitting stage, while Functa is much faster during the finetuning stage, which can be attributed to the meta-learning employed for finetuning, and the highly efficient JAX implementation. As noted by the authors of Functa, however, meta-learning may come at the expense of limiting reconstruction accuracy for more complex datasets, since the latent codes lie within a few gradient steps from the initialization.

For fitting high resolution signals, we train NeoMLP and Siren for the same amount of time. We report the plots for PSNR vs time in Figure 5 in the revised manuscript, where it is clear that NeoMLP fits faster and with better quality. Interestingly, NeoMLP is effectively more memory efficient as well, as it can leverage smaller batch sizes, which leads to lower GPU memory used. As an example, NeoMLP requires 13.2 GB of GPU memory for the BigBuckBunny signal vs 18.7 for Siren. We refer the reviewer to Table 6 in the revised manuscript for more details.

We ran all experiments on single-GPU jobs on an Nvidia H100.

Why did the authors choose a Transformer-like architecture and not a GNN, with e.g. linear/MLP aggregation? Perhaps baselining with such an approach can provide an adequate justification via experimental evidence. Note that this approach will probably also be more computationally friendly.

Since NeoMLP operates on a fully connected graph without edge features, the choice of Transformers seems more natural than any graph neural network, which would be as computationally demanding as a Transformer, given the quadratic complexity in the number of tokens. Thus, we opt for a Transformer backbone, as it has proven to be an expressive and scalable architecture across a wide range of tasks.

Could the authors discuss the expressivity of this paradigm? MLPs are known to be universal approximators. Could it be the case that NeoMLP is also universal?

The universal approximation capabilities of Transformers have been studied and proven in previous works [1]. Since each output dimension in NeoMLP is a function of the input coordinates, we expect that we can approximate the underlying function to an arbitrary precision. This is also in line with our intuition, which motivated us to employ a fully-connected graph structure and a self-attention based architecture, instead of the limiting cross-attention based architectures. While a full proof of the expressivity and approximation capabilities is beyond the scope of our work, we are excited to see future works that verify our hypothesis and intuition.

[1] Yun et al. Are Transformers universal approximators of sequence-to-sequence functions? ICLR 2020.

Official Review (Rating: 5)

The authors propose to create a 'new MLP' architecture which instead models the MLP as self-attention over a fully connected graph of input, hidden, and output 'nodes' (which take the form of learned embeddings). This model is then applied to neural field modeling tasks, demonstrating strong reconstruction performance, and some tangential applications to downstream tasks using the learned node embeddings.

In conclusion, while the idea is interesting and certainly worthy of further investigation, it seems the paper is not quite ready for publication in my opinion. The claims of state-of-the-art are not quite founded by the results (significantly more baselines are needed), and the writing of the paper seems to be heavily engrained in the neural-field literature, despite making claims which seem to extend beyond that space. I would encourage the authors to re-write the paper with a more in-depth discussion of related work and prior work, allowing the reader to situate the proposed model better in the current field.

Strengths

  • The application of self attention to perform message passing over an 'mlp-like' graph is interesting and clever.
  • The use of extra node embeddings for conditioning is additionally clever and appears to work well for neural field modeling.
  • The reconstruction results seem promising on the few datasets tested.

Weaknesses

  • Line 39 typo: 'nconditional'
  • Writing is not the most clear. Especially the introduction is more a rushed list of related work.
  • There is no background section to describe formally what a neural field is, despite this being a core application of the proposed model. The large algorithm blocks could be moved to the appendix to allow for this background information to be included in the main text.
  • The authors use significant jargon without proper explanation when discussing neural field models (such as 'latent code' & 'latent conditional') which makes the interpretation of their model unclear to anyone not familiar with that literature.
  • Despite the author's efforts, the connection with the MLP is tentative at best. It is perhaps a bit misleading to call the method the NeoMLP, since in actuality it appears to be much more similar to a simple Transformer which has additional placeholder tokens which are believed to allow 'intermediate computations'. Furthermore, since the authors only evaluate the model on 'neural field' tasks, it seems a bit presumptuous to call it the NeoMLP considering how broad of applications traditional MLPs can and have been used for.
  • Only a single baseline is reported (Siren, 2020) for the neural field modeling work (Table 1), this is insufficient given the claimed generality of the proposed model -- and the claims of 'state of the art' in the conclusion.
  • Section 3.2 again starts with a rushed list of related work without sufficient explanation of the methods to allow interpretation by outside parties.
  • The downstream task performance improvement in Table 2 is marginal, although the reconstruction quality is high.
  • As the authors note, this model seems very similar to the Graph Neural Machine of Nikolentzos et al. with a transformer used in place of a graph neural network.
  • Typo, line 508: " indicating that inductive biases that can be leveraged to increase downstream performance"

Questions

  • How is the NeoMLP different from a transformer with extra placeholder tokens (with unique learned 'embeddings')?
  • Can you provide more details for why you need the separate fitting and fine-tuning steps?
Comment
As the authors note, this model seems very similar to the Graph Neural Machine of Nikolentzos et al. with a transformer used in place of a graph neural network.

Our architecture does share similarities with the Graph Neural Machine (GNM). There are, however, notable differences between the two methods. First, GNM uses edge-specific weights, i.e. there are dedicated weights between each node $i$ and node $j$. In contrast, we employ weight-sharing, i.e. the weight matrices are shared for all nodes. Second, we use message passing through self-attention, which is enabled by weight-sharing. Third, we use high-dimensional node features to increase expressivity, while GNM is using scalar node features, which can be limiting. Finally, we explore conditioning via instance-specific sets of latent codes, while GNM only functions as an unconditional function approximator.

How is the NeoMLP different from a transformer with extra placeholder tokens (with unique learned 'embeddings')?

NeoMLP is, of course, a transformer-based architecture. The major difference with existing Transformers is that the inputs correspond to individual dimensions, along with placeholder tokens. Take, for instance, a Vision Transformer (ViT) with a patch size of 1, operating on 32x32 images. The input to the ViT is a set of 1024 tokens that correspond to pixel values, with a coordinate embedding added to each value, plus one more token for classification. The output of the ViT is the output of the transformer at the CLS token. This is in stark contrast with the way NeoMLP operates. Using the same 32x32 images as an example, the input to NeoMLP is a single pixel coordinate (or a batch of pixel coordinates that are treated independently), and the output is the RGB value of that pixel coordinate, read from the output tokens. NeoMLP operates using the input/output dimensions as tokens, while a ViT operates using a set of patches as tokens.

Can you provide more details for why you need the separate fitting and fine-tuning steps?

NeoMLP is an auto-decoding conditional neural field. Auto-decoding neural fields use latent variables (usually one latent vector per signal, e.g. per image, or a set of latent vectors), which are optimized through stochastic optimization. During training (we opt for the term fitting, as we find it more appropriate), the latent variables are optimized jointly with the backbone neural field parameters. At test time (we use the term fine-tuning), we freeze the backbone and only optimize the latent variables of the test set signals. This is referred to as test-time optimization. If we were to fit the test set together with the training set during the fitting stage, we would be ”cheating”, as this would not reflect a real-world scenario in which new images arrive after the backbone is frozen, and thus, the metrics would be more inflated than they should be.
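Schematically, the two stages look as follows; this is a pseudocode-level sketch in which the names, optimizer, and settings are placeholders rather than our actual training code.

```python
import torch

# Schematic auto-decoding procedure; names, optimizer, and settings are placeholders.
# Fitting: the backbone and the per-signal latents are optimized jointly.
# Fine-tuning (test-time optimization): the backbone is frozen, and only the
# latents of the unseen test signals are optimized.

def fit(backbone, train_latents, train_loader, epochs, loss_fn):
    opt = torch.optim.Adam(list(backbone.parameters()) + [train_latents], lr=1e-4)
    for _ in range(epochs):
        for coords, values, idx in train_loader:        # idx selects the signal
            pred = backbone(coords, train_latents[idx])
            loss = loss_fn(pred, values)
            opt.zero_grad()
            loss.backward()
            opt.step()

def finetune(backbone, test_latents, test_loader, epochs, loss_fn):
    backbone.requires_grad_(False)                      # freeze the backbone
    opt = torch.optim.Adam([test_latents], lr=1e-4)     # optimize latents only
    for _ in range(epochs):
        for coords, values, idx in test_loader:
            pred = backbone(coords, test_latents[idx])
            loss = loss_fn(pred, values)
            opt.zero_grad()
            loss.backward()
            opt.step()
```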

Comment

Thank you for your detailed response and for helping clarify the training and evaluation procedure, this was very helpful. I additionally think the new results added to Table 1 are very strong, and make the method appear even more promising than it already was. For this reason, together with the improvements to the manuscript draft and the explanation of the additional novelty of the model with respect to the GNM, I will increase my score from a 3 to a 5. I still think that the idea is interesting and clever; however, I think the focus of the manuscript and evaluation on neural fields does not match the generality of the statements and algorithmic contributions. I think if the authors were able to demonstrate benefits of the NeoMLP idea beyond neural field applications, or if the authors were able to provide more theoretical motivation for why this approach would be beneficial for neural field applications in particular, this would greatly improve the paper. At this point, however, the empirical results have still not entirely convinced me that this is an architectural innovation that is a significant contribution to the field; but I am open to discussion with the other reviewers. Thanks to the authors for their time.

Comment

We would like to thank the reviewer for their time and insightful comments, as well as for finding our method “interesting and clever”. Below, we address the reviewer’s concerns in detail.

Only a single baseline is reported (Siren, 2020) for the neural field modeling work (Table 1), this is insufficient given the claimed generality of the proposed model -- and the claims of 'state of the art' in the conclusion.

We thank the reviewer for the suggestion. Following suggestions from reviewer ANQc and reviewer GNWA, we have included 2 additional baselines. The first baseline is RFFNet [1], an MLP with ReLU activations and random Fourier features (RFF) that encode the input coordinates. The second baseline is SPDER [2], a recent state-of-the-art neural field, that uses an MLP with sublinear damping combined with sinusoids as activation functions. We report the results in the table below and in table 1 in the revised manuscript.

Method | Bach | Bikes | Big Buck Bunny (Audio) | Big Buck Bunny (Video)
RFFNet | 54.62 | 27.00 | 32.71 | 23.47
Siren | 51.65 | 37.02 | 31.55 | 24.82
SPDER | 48.06 | 33.80 | 28.28 | 20.44
NeoMLP | 54.71 | 39.06 | 39.00 | 34.17

NeoMLP outperforms all baselines, especially in the more complex setup of multimodal data (BigBuckBunny). Our hypothesis is that NeoMLP can exploit smaller batch sizes and learn with stochastic gradient descent, while all baselines seem to rely on full batch gradient descent, which is intractable for larger signals. Furthermore, NeoMLP is effectively more memory efficient, as it requires less GPU memory than the baselines to fit the signals, since it uses smaller batch sizes. As an example, for the BigBuckBunny signal, NeoMLP requires 13.2 GB of GPU memory, compared to 13.9 for RFFNet, 18.5 for Siren, and 39.2 for SPDER. We include the full details about runtime and memory requirements in Table 6 (Appendix C) in the revised manuscript.

[1] Tancik et al. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. NeurIPS 2020.

[2] Shah et al. SPDER: Semiperiodic Damping-Enabled Object Representation. ICLR 2024.

Line 39 typo: 'nconditional'

Typo, line 508: " indicating that inductive biases that can be leveraged to increase downstream performance"

We thank the reviewer for pointing out the typos in the manuscript. We have corrected them and performed a round of proofreading in the revised version.

There is no background section to describe formally what a neural field is, despite this being a core application of the proposed model. The large algorithm blocks could be moved to the appendix to allow for this background information to be included in the main text.

The authors use significant jargon without proper explanation when discussing neural field models (such as 'latent code' & 'latent conditional') which makes the interpretation of their model unclear to anyone not familiar with that literature.

We thank the reviewer for the suggestion. We have included a small background section on neural fields in the revised manuscript, and moved one algorithm block in the appendix.

Despite the author's efforts, the connection with the MLP is tentative at best. It is perhaps a bit misleading to call the method the NeoMLP, since in actuality it appears to be much more similar to a simple Transformer which has additional placeholder tokens which are believed to allow 'intermediate computations'. Furthermore, since the authors only evaluate the model on 'neural field' tasks, it seems a bit presumptuous to call it the NeoMLP considering how broad of applications traditional MLPs can and have been used for.

We understand the reviewer’s perspective; our inspiration and motivation, however, stem from studying the MLP architecture and various conditioning methods as graphs, while searching for “native” architectures for neural fields, i.e. architectures that include a built-in conditioning mechanism and expressivity. Indeed, NeoMLP is a transformer-based architecture, but one that operates on the connectivity graph of an MLP. Finally, NeoMLP does not imply a universally superior MLP, and we are very excited to see applications of NeoMLP besides neural fields.

Comment

Dear Area Chair and Reviewers,

We would like to thank the reviewers again for their thoughtful reviews and valuable feedback, as they have significantly increased the quality of our work.

About a week ago, we responded to each reviewer and uploaded a revised manuscript (changes are denoted with a deep purple-red color) that contains multiple improvements motivated by the reviewers' points.

Since we have entered the last day of the discussion period, we would be thankful if the reviewers acknowledge reading our responses, and update their reviews if we have addressed their concerns. If not, we would be happy to do any last-minute follow-up discussion and incorporate further changes in the camera-ready version of our paper.

Kind regards,

The authors

AC Meta-Review

The paper introduces a 'new MLP' architecture that models the MLP as a form of self-attention over a fully connected graph comprising input, hidden, and output 'nodes,' represented as learned embeddings. This approach is applied to neural field modeling tasks, demonstrating strong reconstruction performance and exploring some downstream applications using the learned node embeddings. While the idea is intriguing and shows potential for further exploration, the paper is not yet ready for publication due to weak baseline comparisons and limited experimental justification. Additionally, the presentation quality requires significant improvement.

The rebuttal did not adequately address the reviewers' concerns, leading to a consensus to reject the paper. The AC concurs with this decision but encourages the authors to enhance their work by considering all reviewers' suggestions and consider resubmission to a future venue.

Additional Comments from the Reviewer Discussion

During the discussion, Reviewer ewZN highlighted the paper's largest limitation: insufficient evaluation of the proposed method, despite its interesting premise. Additionally, the presentation and comparison to prior work were found lacking, making it difficult to situate the paper within the existing literature. Reviewer ewZN does not recommend acceptance at this stage and suggests the paper could benefit from another round of experiments and review.

Reviewer GNWA noted weak baselines and limited experiments, as well as unclear explanations of the architecture and v-representations. Also, there were concerns regarding the computational complexity of the proposed model.

Reviewer ANQc observed that some justifications remain overly intuitive or ad hoc. The experimental comparisons were deemed unconvincing and insufficient to support the conclusions drawn in the paper.

Reviewer UBft's concerns were not adequately addressed in the rebuttal. Their disagreement is about the performance of Transformer-based hyper-network methods vs. the auto-decoding methods. Reviewer UBft felt that the authors failed to provide compelling evidence to justify their claim.

Overall, all reviewers lean toward rejecting the paper.

Final Decision

Reject