PaperHub
Overall score: 5.0 / 10 · Decision: Rejected · 4 reviewers
Ratings: 3, 3, 2, 5 (min 2, max 5, std. dev. 1.1)
Confidence: 3.8
Novelty: 1.8 · Quality: 2.5 · Clarity: 2.0 · Significance: 2.0
NeurIPS 2025

From MLP to NeoMLP: Leveraging Self-Attention for Neural Fields

Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We propose a method that performs message passing on the graph of MLPs via self-attention and functions as a conditional neural field

Abstract

Keywords
Neural fields, Self-attention, Auto-decoding, Transformers, Conditional neural fields, Implicit neural representations, Graphs

Reviews and Discussion

Review (Rating: 3)

The authors propose a transformer-based approach to neural fields, where the input to a transformer layer is a set of tokens consisting of input tokens, latent tokens, and output tokens. Each input coordinate has its own input token, generated by standard neural field encodings (Fourier features + linear layer). The output tokens are also decoded into the output dimensionality using a linear layer. The explicit latent tokens make it easy to use this architecture in a conditional autodecoding framework.
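To make this token layout concrete, here is a minimal PyTorch sketch of a forward pass in this style. The class, parameter names, and sizes (e.g. NeoMLPSketch, num_latent, num_freqs) are illustrative assumptions rather than the authors' implementation, and the stock encoder layers used here include LayerNorm even though the paper reports using no normalization.

```python
import torch
import torch.nn as nn


class NeoMLPSketch(nn.Module):
    """Hedged sketch of a self-attention neural field with input, latent, and
    output tokens; names and sizes are illustrative, not the authors' code."""

    def __init__(self, in_dim=3, out_dim=3, dim=128, num_latent=8,
                 num_layers=3, num_freqs=32):
        super().__init__()
        # Fourier features per scalar coordinate, then a linear map to one token each.
        self.register_buffer("freqs", torch.randn(1, num_freqs))
        self.in_proj = nn.Linear(2 * num_freqs, dim)
        # Learned latent tokens (a per-instance copy would be optimized in auto-decoding).
        self.latents = nn.Parameter(0.02 * torch.randn(num_latent, dim))
        # Learned output tokens, one per output dimension, each decoded to a scalar.
        self.out_tokens = nn.Parameter(0.02 * torch.randn(out_dim, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        # Note: these standard layers include LayerNorm; the paper reports using none.
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 1)
        self.out_dim = out_dim

    def forward(self, coords, latents=None):
        # coords: (B, in_dim) individual coordinates; one input token per scalar component.
        x = coords.unsqueeze(-1)                                # (B, in_dim, 1)
        feats = torch.cat([torch.sin(x * self.freqs),
                           torch.cos(x * self.freqs)], dim=-1)  # (B, in_dim, 2F)
        in_tok = self.in_proj(feats)                            # (B, in_dim, dim)
        B = coords.shape[0]
        lat = (self.latents if latents is None else latents).expand(B, -1, -1)
        out = self.out_tokens.expand(B, -1, -1)
        tokens = torch.cat([in_tok, lat, out], dim=1)  # full set: inputs + latents + outputs
        tokens = self.backbone(tokens)                 # message passing via self-attention
        return self.head(tokens[:, -self.out_dim:]).squeeze(-1)  # (B, out_dim)


# Usage: predict an RGB value for a batch of (x, y, t) coordinates.
model = NeoMLPSketch()
values = model(torch.rand(16, 3))  # -> (16, 3)
```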

Strengths and Weaknesses

Strengths

  • The proposed architecture is a nice approach to using transformers for neural fields: the latent tokens are useful both for extra representational power and for conditional fitting, and the output tokens allow for easier manipulation of fine detail.
  • Very good results in certain areas like videos

Weaknesses

  • The essential structure of NeoMLP boils down to a transformer with input tokens, latent tokens, and output tokens all together. This kind of architecture has been used widely, and in particular has been formalized generally in Set Transformer and Perceiver IO (former has a different goal, latter is quite relevant). The paper proposes to view such a transformer layer as an MLP, with the input, latent and output tokens corresponding to input, hidden and output nodes. While an interesting perspective, especially in the context of neural fields where MLP architectures are often used, nothing is done with this perspective: it is not used to imbue any inductive biases or any architectural decisions.
  • The paper further tries to avoid making this connection to general transformers as much as possible, which feels very misleading
  • A lot of the standard transformer parts of the model, e.g. the reason for attention and for positional encoding, are explained as if the ideas are new rather than it is the standard when using a transformer. The main change, no layer normalization, is not properly explored/discussed. I wonder if RMSNorm would be better suited.
  • The auto-decoding framework from [29] is also very standard and there are many works that explore it. A lot of section 2.3 explains autodecoding as if the ideas are new (e.g. fitting and finetuning stages). ν-sets and ν-reps seem to be needlessly defined as well.
  • There are more important ablations to do to showcase the benefit of your architecture: removing the hidden tokens, removing the output tokens and instead directly mapping to them, etc. It is nice that ablations on RFF are done, but these are the least surprising component: their benefit is not a surprise at all as spectral bias is well studied and [42] proposes RFF specifically for this issue.
  • "Compared to alternatives like sinusoidal activations [40], RFFs allow our architecture to use a standard transformer" why? Could still have a initial SIREN layer, which would work just as well (as long as ω0\omega_0 is set sufficiently).

Set Transformer: "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks" by Lee et al., ICML 2019, https://arxiv.org/pdf/1810.00825
Perceiver IO: "Perceiver IO: A General Architecture for Structured Inputs & Outputs" by Jaegle et al., ICLR 2022, https://arxiv.org/pdf/2107.14795

Questions

  • I think the main issue with this paper is the writing/story. Why not be clear and up front that the architecture is essentially just a transformer? The connection to MLPs/message passing can be made, but what does the paper gain from such a story? To me it seems the main benefit of this work is how it represents tokens for transformer layers to efficiently process neural field data, and the design choices made to improve performance (such as RFF and no Layer Norm).

Limitations

They have discussed this in the conclusion.

Final Justification

There are three major points of discussion that I had with the authors, and my final thoughts on them are:

  • I think the approach is good, but it neglects very similar work that it should try to build upon, not differentiate from. The method is clearly a transformer, which the authors were not very forthright about in their method section. Furthermore, the idea of having input, latent, and output tokens is not new, with Perceiver IO being a notable example. During the discussion the authors draw a distinction between self- and cross-attention, and my thought is that Perceiver IO could easily have been pure self-attention on all three types of tokens, but doing cross-attention at the start and end is much more efficient. The authors seem to suggest that Perceiver IO does not use self-attention and is thus limited to a single layer, but it does have multiple layers of self-attention. The authors should have tried to build upon these fundamental architectures from the transformer literature for the neural fields domain, adding and removing components as suits the task, rather than re-implementing a transformer-based method from scratch and claiming it is totally different from any other general transformer model.
  • After the back and forth, I also still don't agree that SIREN layers are not worth experimenting with because using them with transformers is "neither well studied nor evaluated extensively in the literature": well, of course not, if you are claiming to be the first to properly adapt self-attention for neural fields; your paper should be the one to investigate this, since SIREN layers are designed specifically for neural fields.
  • In conjunction with other reviewers' comments, the authors have provided a few more of the ablations that I think are necessary for the paper.

Overall, I think this paper needs to be more clear on their contributions and where the paper sits relative to the literature. As a result, I am keeping my score (borderline reject).

Formatting Issues

None

Author Response

We would like to thank the reviewer for their time and insightful comments. Below, we address the reviewer’s concerns in detail.

We agree with the reviewer about the importance of the suggested experiments/ablations. We are still working on the suggested experiments, and due to time constraints we will present the results in the upcoming days during the discussion phase.

The essential structure of NeoMLP boils down to a transformer with input tokens, latent tokens, and output tokens all together. This kind of architecture has been used widely, and in particular has been formalized generally in Set Transformer and Perceiver IO (former has a different goal, latter is quite relevant). The paper proposes to view such a transformer layer as an MLP, with the input, latent and output tokens corresponding to input, hidden and output nodes. While an interesting perspective, especially in the context of neural fields where MLP architectures are often used, nothing is done with this perspective: it is not used to imbue any inductive biases or any architectural decisions.
The paper further tries to avoid making this connection to general transformers as much as possible, which feels very misleading
A lot of the standard transformer parts of the model, e.g. the reason for attention and for positional encoding, are explained as if the ideas are new rather than it is the standard when using a transformer. 
I think the main issue with this paper is the writing/story. Why not be clear and up front that the architecture is essentially just a transformer? The connection to MLPs/message passing can be made, but what does the paper gain from such a story? 

It is not our intention to claim that we invented the ideas behind attention, and we certainly do not avoid making the connection to general transformers. On the contrary, we think we are very upfront about NeoMLP being a transformer architecture, as well as about reusing existing attention mechanisms. As an example, in lines 272-273 we mention that “RFFs allow our architecture to use a standard transformer” and in lines 308-309 we mention that “NeoMLP is a transformer architecture”. Furthermore, we often mention in the manuscript that we employ self-attention (or "leverage" it, as in the title), to highlight that we are using an existing module from the deep learning literature. Our novelty lies in using self-attention for neural fields instead of the previously used cross-attention. Set Transformer and Perceiver IO are both attention/transformer architectures, but they have many differences from our method. Importantly, neither the Set Transformer nor the Perceiver IO is a neural field architecture; they operate on sets of points instead of being a (neural) function of individual points. Furthermore, they are based on cross-attention operations that encode the input set of points into a latent set, while NeoMLP employs only self-attention. In the updated manuscript, we will include these works in our related work, alongside a discussion of the similarities and differences between these works and ours. We consider the connection with MLPs and message passing important, because it organically led us to conceive this architecture. In retrospect, a self-attention based neural field might seem obvious, yet we are the first to propose it. As such, we think that establishing such a connection is an integral part of our story.

The auto-decoding framework from [29] is also very standard and there are many works that explore it. A lot of section 2.3 is explaining autodecoding as if the ideas are new (e.g. fitting and finetuning stages). ν-sets and ν-reps seem to be needlessly defined as well.

We apologize for the confusion. We do not intend to claim that fitting and finetuning are our ideas, since they are standard practice in auto-decoding neural fields. We will update the manuscript to make it clear that these steps exist in auto-decoding neural fields. We define the terms ν-sets and ν-reps to differentiate them from existing neural field representations like functasets, and to be more explicit when we refer specifically to ν-sets.
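As context for this exchange, a minimal sketch of the standard two-stage auto-decoding loop discussed here might look as follows; the toy field, synthetic dataset, and optimizer settings are our own illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

# Toy stand-in for a conditional neural field f(coords, latent) -> values;
# the shapes, names, and hyperparameters are illustrative assumptions.
dim, num_latent = 64, 4
backbone = nn.Sequential(nn.Linear(2 + num_latent * dim, dim), nn.ReLU(), nn.Linear(dim, 3))

def field(coords, latent):
    # Condition the field by concatenating the flattened latent to every coordinate.
    cond = latent.flatten().expand(coords.shape[0], -1)
    return backbone(torch.cat([coords, cond], dim=-1))

# Synthetic "dataset" of two signals: (coordinates, RGB targets).
dataset = [(torch.rand(256, 2), torch.rand(256, 3)) for _ in range(2)]

# Fitting stage: jointly optimize the shared backbone and one latent set per training signal.
latents = [torch.zeros(num_latent, dim, requires_grad=True) for _ in dataset]
opt = torch.optim.Adam(list(backbone.parameters()) + latents, lr=1e-3)
for _ in range(100):
    for i, (coords, target) in enumerate(dataset):
        loss = nn.functional.mse_loss(field(coords, latents[i]), target)
        opt.zero_grad(); loss.backward(); opt.step()

# Finetuning stage: the backbone stays fixed (only the new latent is given to the
# optimizer) while the latent of an unseen signal is fitted by backpropagation.
new_coords, new_target = torch.rand(256, 2), torch.rand(256, 3)
new_latent = torch.zeros(num_latent, dim, requires_grad=True)
opt = torch.optim.Adam([new_latent], lr=1e-2)
for _ in range(100):
    loss = nn.functional.mse_loss(field(new_coords, new_latent), new_target)
    opt.zero_grad(); loss.backward(); opt.step()
```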

"Compared to alternatives like sinusoidal activations [40], RFFs allow our architecture to use a standard transformer" — why? Could still have an initial SIREN layer, which would work just as well (as long as ω₀ is set sufficiently high).

Many state-of-the-art neural fields address spectral biases using non-standard activation functions, including sine waves, Gabor wavelets, etc. On the other hand, a standard transformer uses ReLU/GeLU/SiLU activations. Using sine-wave activations within transformers is not common practice, and to the best of our knowledge, it has not been studied or tested extensively enough to provide any performance guarantees. An initial SIREN layer could be used before the transformer, but that is functionally equivalent to Random Fourier Features (with trainable parameters), as shown in Benbarka et al. [1].
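To spell out the equivalence invoked here, the two front-end encodings of a coordinate x can be written as follows; the notation is ours, added for illustration, following the Fourier-series view of Benbarka et al. [1]:

```latex
\gamma_{\mathrm{RFF}}(x) = \big[\cos(2\pi B x),\ \sin(2\pi B x)\big],
\qquad
\gamma_{\mathrm{SIREN}}(x) = \sin\!\big(\omega_0 (W x + b)\big).
```

Since sin(θ + π/2) = cos(θ), a first sine layer with trainable frequencies W and phases b spans the same family of features as a trainable Fourier-feature mapping with frequency matrix B (up to the 2π versus ω₀ scaling), which is why the two are treated as interchangeable front-ends here.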

It is nice that ablations on RFF are done, but these are the least surprising component: their benefit is not a surprise at all as spectral bias is well studied and [42] proposes RFF specifically for this issue.

The study of Tancik et al. [42] focuses on standard MLPs and employs tools from the Neural Tangent Kernel literature to demonstrate the spectral biases of MLPs. While it might seem obvious, we postulate that spectral biases would be present in NeoMLP as well if left unattended, and thus, we performed this ablation study. Future work can employ similar NTK tools to theoretically study the spectral biases in NeoMLP.

References

[1] Benbarka et al. Seeing Implicit Neural Representations as Fourier Series, WACV 2022.

Comment

Thanks to the authors for their response. I have some follow-up questions and comments (I apologise for responding so late).

"On the contrary, we think we are very upfront about NeoMLP being a transformer architecture, as well as about reusing the existing attention mechanisms. As an example, in lines 272-273 we mention that “RFFs allow our architecture to use a standard transformer” and in lines 308-309 we mention that “NeoMLP is a transformer architecture”."

  • I don't agree that you are upfront, as the two quotes you use are from the experiments section and the conclusion respectively. This should have been clearly stated in the method section.

"Importantly, neither the Set Transformer nor the Perceiver IO are neural field architectures; they operate on sets of points instead of being a (neural) function of individual points. Futheremore, they are based on cross-attention operations that encode the input set of points in a latent set, while NeoMLP is only employing self-attention"

  • Perceiver IO is designed as a general architecture for any input and output. Just because their examples do not use the same inputs and outputs as you does not mean the key architectural insights are not already in their paper. That would be like saying the DeepSDF paper invented the neural field architecture of MLPs, since previous MLP papers didn't operate as a neural function of individual points. I agree that your application of the architecture is novel, but you should mention that the architecture itself is not novel.

"An initial Siren layer could be used before the transformer, but that is functionally equivalent to Random Fourier Features (with trainable parameters), as shown in Benbarka et al. [1]."

  • Doesn't that mean that the statement in the paper "Compared to alternatives like sinusoidal activations [40], RFFs allow our architecture to use a standard transformer" is incorrect, since sinusoidal activations could be used?

Ablations

  • You have not addressed my concern about there being more important ablations
Comment
The main change, no layer normalization, is not properly explored/discussed. I wonder if RMSNorm would be better suited.

We thank the reviewer for the suggestion. We perform an ablation study on CIFAR10, where we use different normalization methods, including Layer Norm, RMS Norm, and no normalization. We present the results in the table below.

| Normalization type | Fit PSNR (↑) | Test PSNR (↑) | Accuracy (%) |
|---|---|---|---|
| No normalization (default) | 30.54 | 29.15 | 58.82 |
| Layer Norm [1] | 30.82 | 26.76 | 46.51 |
| RMS Norm [2] | 29.96 | 26.16 | 44.11 |

We see that using no normalization outperforms both normalization layers, both in reconstruction accuracy and downstream performance. Interestingly, Layer Norm has the best training PSNR, which does not translate, however, to test PSNR.

References

[1] Ba et al. Layer Normalization. arXiv. 2016.

[2] Zhang et al. Root Mean Square Layer Normalization. NeurIPS 2019.
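For concreteness, the three variants in this ablation differ only in which normalization module sits inside each transformer block; a minimal sketch of such a switch (our illustration, not the authors' code; nn.RMSNorm assumes a recent PyTorch release) could look like:

```python
import torch.nn as nn

def make_norm(kind: str, dim: int) -> nn.Module:
    """Return the normalization used inside a block: 'none' (the default variant
    above), 'layernorm' [1], or 'rmsnorm' [2]."""
    if kind == "layernorm":
        return nn.LayerNorm(dim)
    if kind == "rmsnorm":
        return nn.RMSNorm(dim)  # assumes PyTorch >= 2.4
    return nn.Identity()        # no normalization
```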

There are more important ablations to do to showcase the benefit of your architecture: removing the hidden tokens, removing the output tokens and instead directly mapping to them, etc. 

We thank the reviewer for the suggested ablation. We perform an ablation study on CIFAR10 where we vary the number of hidden latents from 0 to 11. We present the results in the table below.

| H | Fit PSNR (↑) | Test PSNR (↑) | Accuracy (%) |
|---|---|---|---|
| 0 | 30.49 | 29.40 | 60.64 |
| 3 | 30.49 | 29.14 | 58.73 |
| 11 | 40.05 | 35.40 | 57.60 |

We observe that increasing the number of hidden latents consistently increases reconstruction accuracy, but this comes at the expense of downstream performance. We attribute that to the “naive” MLP used for the downstream task, which does not take permutation symmetries into account. Using no hidden latents effectively removes these symmetries, and as such, the downstream MLP is able to perform better. Future work can explore more sophisticated downstream architectures that incorporate the permutation symmetries of NeoMLP.

Comment

Thanks for the results on normalization, and on the number of hidden latents.

I would like a comment on whether you agree with the ablations I propose. I did not (and do not) expect you to run them during this period, but I think the paper does not properly explore the effect of all the parts of the method if you do not have ablations with important parts removed to show their importance.

Do you have any comments on the rest of my last response (given you have not addressed them)?

Comment
I don't agree that you are upfront, as the two quotes you use are from the experiments section and the conclusion respectively. This should have been clearly stated in the method section.

We thank the reviewer for the suggestion. We will explicitly mention that NeoMLP is a transformer architecture in the method section in the updated manuscript.

Perceiver IO is designed as a general architecture for any input and output. Just because their examples do not use the same inputs and outputs as you does not mean the key architectural insights are not already in their paper. That would be like saying the DeepSDF paper invented the neural field architecture of MLPs, since previous MLP papers didn't operate as a neural function of individual points. I agree that your application of the architecture is novel, but you should mention that the architecture itself is not novel.

Apart from having different inputs and outputs, the Perceiver IO is a cross-attention based transformer, which is different from our self-attention based transformer, as self-attention naturally scales to multiple layers, in contrast to cross-attention, which is limited to a single layer. As an architecture, NeoMLP is almost identical to a standard Transformer encoder with post-Layer Norm layers, and in that sense, it is not a novel architecture. NeoMLP is novel because it is the first neural field architecture that uses self-attention, instead of cross-attention.

Doesn't that mean that the statement in the paper "Compared to alternatives like sinusoidal activations [40], RFFs allow our architecture to use a standard transformer" is incorrect, since sinusoidal activations could be used?

We thank the reviewer for the follow-up question. SIREN uses sinusoidal activations in every MLP layer; the equivalent for NeoMLP would be to use sinusoidal activations in its hidden layers. As such, we believe our statement is factual: we cannot use sinusoidal activations inside the transformer architecture, because such an architecture is neither well studied nor evaluated extensively in the literature. On the contrary, RFF is a single encoding layer at the beginning, and can easily be prepended to a standard transformer.

You have not addressed my concern about there being more important ablations

During the rebuttal, we have included ablations for the number of NeoMLP layers (response to Reviewer 52s2), the number of hidden tokens (Reviewer 9q89 and Reviewer AVfU), and the different kinds of layer normalization (Reviewer AVfU). In conjunction with the original ablation studies on the importance of RFF, the number of latents, the dimensionality of the latents, the number of finetuning epochs, and the number of fitting epochs, we believe that we have addressed all the important ablations of NeoMLP.

Review (Rating: 3)

The paper introduces NeoMLP, a transformer-style architecture that re-interprets a multilayer perceptron (MLP) as a complete graph whose nodes (input, hidden, output) exchange messages through weight-shared self-attention. The authors (1) detail the graph-theoretic reformulation and its linear-attention implementation, (2) show how the latent tokens are optimised in two phases—fitting a shared backbone and finetuning per instance—and (3) evaluate on three fronts: (i) fitting single high-resolution signals (audio, video, audio-video), (ii) fitting whole datasets (MNIST, CIFAR-10, ShapeNet-10) and (iii) using the learned ν-reps for downstream classification.

Strengths and Weaknesses

Strengths:

  1. Strong gains on high-res multimodal signals and on ν-set classification
  2. Architecture is motivated step-by-step (MLP → graph → complete graph → self-attention).
  3. Demonstrates that self-attention with latent tokens scales to multimodal 40 M-point signals.
  4. Converts MLP to a complete graph and shares weights via attention, different from prior cross-attention conditional fields; introduces ν-reps / ν-sets terminology.

Weaknesses:

  1. Evaluation scope is still modest—CIFAR-10, MNIST and ShapeNet-10 are small; no large-scale or segmentation tasks despite claiming suitability.
  2. Theoretical analysis is limited to symmetry discussion; no convergence or expressivity results beyond empirical evidence.
  3. Some equations (Eq. 1) replicate standard transformer derivations without adding insight.
  4. Relies heavily on well-known ingredients (RFFs, linear attention, latent tokens) and does not provide new learning objectives.

Questions

  1. Could the author report results on a higher-resolution 3D dataset (e.g., a full ShapeNet split or large-scale NeRF scenes) to demonstrate ν-reps at a realistic scale?
  2. Since ν-reps are said to support segmentation, please include a pixel- or point-level experiment, or clarify why permutation symmetries currently preclude it.
  3. The ablation focuses on latent size; varying the number of NeoMLP layers could show whether capacity should reside in tokens or in backbone depth.

Limitations

The authors devote a paragraph to permutation symmetries and global-token locality issues, and list potential future fixes.

Final Justification

Thanks for the authors' responses. Overall, I find the comments provided by Reviewer AVfU to be important and well-considered, and I will therefore maintain my original score.

Formatting Issues

NA

Author Response

We would like to thank the reviewer for their time and insightful comments. We agree with the reviewer about the importance of the suggested experiments/ablations. We are still working on the suggested experiments, and due to time constraints we will present the results in the upcoming days during the discussion phase.

Could the author report results on a higher-resolution 3D dataset (e.g., a full ShapeNet split or large-scale NeRF scenes) to demonstrate ν-reps at a realistic scale?

We thank the reviewer for the suggestion. We consider the ShapeNet split that we use to already be a high-resolution dataset, with point clouds on the order of 100,000 points, on which the signed distance function is calculated.

Comment
The ablation focuses on latent size; varying the number of NeoMLP layers could show whether capacity should reside in tokens or in backbone depth.

We thank the reviewer for their suggestion. We perform an ablation study on CIFAR10, where we vary the number of NeoMLP layers from 1 to 8; the results are shown in the table below.

| Num. layers | Fit PSNR (↑) | Test PSNR (↑) | Accuracy (%) |
|---|---|---|---|
| 1 | 23.10 | 24.04 | 53.92 |
| 2 | 29.54 | 28.76 | 57.36 |
| 3 (default) | 30.54 | 29.15 | 58.82 |
| 4 | 33.51 | 31.15 | 59.09 |
| 8 | 34.67 | 31.05 | 58.96 |

We see that increasing the number of layers is beneficial for both reconstruction accuracy and downstream performance, up to 4 layers. Increasing the number of layers to 8 results in a higher training PSNR, which does not, however, translate to better reconstruction accuracy or downstream performance, as these two metrics are marginally lower than the best we achieve with 4 layers. Overall, though, the performance loss is minimal, so capacity can safely reside in the backbone depth as well.

Review (Rating: 2)

This study proposes NeoMLP, a neural field whose data representations can be used for downstream tasks. Compared with a conventional MLP, NeoMLP, motivated by the philosophy of connectionism, uses a transformer architecture in which each data instance is represented as a set of latent tokens. The instance-specific latent tokens for each data instance are updated by back-propagation while learning to reconstruct the signal. The experimental results show competitive data reconstruction quality and performance on downstream tasks such as classification.

Strengths and Weaknesses


Strengths

S1. This study covers an important topic of conditional neural fields to represent various continuous signals and leverage the neural representations to downstream tasks.
S2. The paper is well-written and easy to understand.
S3. Experimental results show the potential for NeoMLP to improve both reconstruction and classification results.


Weaknesses

W1. This paper misses many related works, considering that using a transformer for a conditional neural field or generalizable INR is already well-explored [NewRef-1,2,3]. The experiments also need to compare the performance with these prior studies showing decent performance on downstream tasks.

W2. The motivation of this study is ambiguous and over-interpreted/over-claimed. For instance, the authors claim that the lack of connectionism in MLPs is the main motivation of NeoMLP. However, the architecture converges to a conventional transformer, which is commonly used in this field, and there is a lack of motivation for the design of NeoMLP compared with existing works on conditional neural fields or generalizable INRs. Also, the architecture design is not a new concept but is well known from Perceiver IO [NewRef-4].

W3. Lack of experiments on downstream tasks (e.g. generation tasks) or analyses.


New References

[NewRef-1] Bauer, Matthias, et al. "Spatial functa: Scaling functa to imagenet classification and generation." arXiv preprint arXiv:2302.03130 (2023).
[NewRef-2] Kim, Chiheon, et al. "Generalizable implicit neural representations via instance pattern composers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[NewRef-3] Lee, Doyup, et al. "Locality-aware generalizable implicit neural representation." Advances in Neural Information Processing Systems 36 (2023): 48363-48381.
[NewRef-4] Jaegle, Andrew, et al. "Perceiver io: A general architecture for structured inputs & outputs." arXiv preprint arXiv:2107.14795 (2021).

Questions

In addition to Weaknesses above, there are more detailed questions here.

Q1. Are the learned latents of NeoMLP usable for generation tasks? Spatial Functa [NewRef-1] and Locality-aware GINR [NewRef-3] consistently claimed that the lack of locality in the latent representations leads to limitations in generation tasks.

Q2. What are the main differences of NeoMLP from existing studies using transformers, in terms of connectionism?

Q3. What are the computational costs to decode each data instance, compared with other methods?

Limitations

No negative societal impact.

Final Justification

I appreciate the authors' further explanation and carefully read the authors' response.
However, I have decided to keep my original scores. They do not address my concern about the comparison with [NewRef-1], although it is highly relevant to this paper in terms of transferring neural fields to downstream tasks. In addition, the main claim of this paper, related to connectionism, is not reasonable, given that the proposed model converges to an existing transformer architecture. Using an auto-decoding structure by itself cannot be a differentiation or contribution compared with encoder-based generalizable INRs, because auto-decoding has weaker performance than the alternatives or requires iterative updates of latents through backpropagation per sample. Thus, I cannot increase my score, because I think this paper requires a major revision.

Formatting Issues

N/A

Author Response

We would like to thank the reviewer for their time and insightful comments. Below, we address the reviewer’s concerns in detail.

This paper misses many related works, considering that using a transformer for a conditional neural field or generalizable INR is already well-explored [NewRef-1,2,3]. The experiments also need to compare the performance with these prior studies showing decent performance on downstream tasks.

We thank the reviewer for suggesting these related works, which we will include in the updated version of our manuscript. Generalizable INR [NewRef-2] and Locality-aware INR [NewRef-3] differ substantially from our method because they are encoder-based neural fields with encoders that operate on patches, in contrast to NeoMLP which is an auto-decoding neural field. Overall, auto-decoding methods inherit the fundamental advantages of neural fields: they are resolution independent, they scale gracefully with the signal complexity instead of the signal size, and they are modality independent, i.e. they do not make assumptions about the observations. Encoder-based neural fields, on the other hand, lose those advantages because of the encoder, which must be specialized for each modality and/or resolution. Furthermore, they use transformers to encode the signal by processing a set of patches at once, while NeoMLP is a transformer that processes individual coordinates (practically batches of individual coordinates).

The motivation of this study is ambiguous and over-interpreted/over-claimed. For instance, the authors claim that the lack of connectionism in MLPs is the main motivation of NeoMLP. However, the architecture converges to a conventional transformer, which is commonly used in this field, and there is a lack of motivation for the design of NeoMLP compared with existing works on conditional neural fields or generalizable INRs. Also, the architecture design is not a new concept but is well known from Perceiver IO [NewRef-4].

Connectionism inspired us to conceive a unified architecture that integrates inputs and latents in a native fashion. Our architecture converges to a conventional transformer because transformers are the ubiquitous set neural networks in modern deep learning, i.e. neural networks that operate on sets/fully-connected graphs, with numerous papers supporting their expressivity and performance, both theoretically and practically. Since we are operating on a fully-connected graph of input, output, and latent tokens, a transformer is the most logical approach. Perceiver IO is of course an attention/transformer architecture, but it has many differences from our method. Importantly, Perceiver IO is not a neural field architecture; it operates on sets of points instead of being a (neural) function of individual points. Furthermore, it is based on cross-attention operations that encode the input set of points into a latent set, while NeoMLP employs only self-attention.

What are the main differences of NeoMLP from existing studies using transformers, in terms of connectionism?

Existing neural fields that use transformers either use cross-attention to encode individual coordinates [Wessels et al.], or cross-attention to encode the whole signal from a set of observations [NewRef-4], or a standard self-attention transformer with latent tokens that functions as an encoder in an encoder-based neural field [NewRef-2,3], encoding the whole signal from a set of observations/patches. Instead, NeoMLP is a self-attention based Transformer that encodes individual coordinates, and functions as an auto-decoding neural field.

What are the computational costs to decode each data instance, compared with other methods?

We refer the reviewer to appendix C, which contains a detailed discussion of the computational complexity of NeoMLP compared to the other baselines, both for the high-resolution signal fitting experiments and for fitting v-sets. In summary, NeoMLP has a higher computational complexity compared to the baselines, but it can actually fit high resolution signals faster, and it does so while having a smaller memory footprint, since it can make use of small batch sizes. For fitting v-sets, NeoMLP consistently exhibits lower runtimes for the fitting stage, while Functa is much faster during the finetuning stage, which can be attributed to the meta-learning employed for finetuning, and the highly efficient JAX implementation.

Review (Rating: 5)

NeoMLP is a novel framework to learn a compact latent space for neural fields. It re-interprets the MLP as a fully connected graph and further uses message passing with self-attention amongst the graph nodes. Further, NeoMLP conditions each sample in a dataset using a latent code and trains the parameters to successfully reconstruct the signal as well as learn the underlying latent space over a set of signals. NeoMLP yields learned latent representations for data, which yield good performance when used for downstream applications such as classification.

Strengths and Weaknesses

Strengths:

The paper is very well written and provides sufficient empirical and theoretical analysis in the introduction and development of the NeoMLP architecture.

Novelty: NeoMLP showcases architectural ingenuity in modeling conditional neural fields. The re-interpretation of an MLP as a graph and further using self-attention as a computationally efficient way for capturing information across nodes is indeed quite interesting and novel to the field of neural fields.

The paper provides a detailed evaluation on common video datasets, demonstrating the ability of NeoMLP for multi-modal signal representation. The paper also shows that the learned latent codes (ν-reps) work well for downstream tasks such as classification. The paper also provides a comprehensive latent space visualization.

Lastly, it is appreciated that the limitations and shortcomings of the method are well addressed.

Weaknesses:

Given the complexity of the method, it seems that NeoMLP may scale inefficiently to large images or realistic 3D scenes used in NeRFs. It would be interesting to understand how NeoMLP scales with larger scenes.

The paper should demonstrate downstream performance on ν-reps learned on a reasonable 256 x 256 signal (as shown in the relevant baseline method, Spatial Functa [1]). Evaluation on moderately sized signal ν-reps would also provide valuable insights into the stability and scalability of NeoMLP representations.

[1] Spatial Functa: Scaling Functa to ImageNet Classification and Generation, Bauer et al.

Questions

  1. How does the selection of H affect the performance of NeoMLP given signal sizes?

  2. How does NeoMLP perform on tasks such as inpainting? In such a case, how do the learned ν-reps perform for downstream tasks such as classification?

Limitations

Yes

Formatting Issues

No

Author Response

We would like to thank the reviewer for their time and insightful comments. We agree with the reviewer about the importance of the suggested experiments/ablations. We are still working on the suggested experiments, and due to time constraints we will present the results in the upcoming days during the discussion phase.

Comment

I would greatly appreciate it if the authors could address the questions and weaknesses raised above.

Comment
How does the selection of H affect the performance of NeoMLP given signal sizes?

We thank the reviewer for the suggested ablation. We perform an ablation study on CIFAR10 where we vary the number of hidden latents from 0 to 11. We present the results in the table below.

| H | Fit PSNR (↑) | Test PSNR (↑) | Accuracy (%) |
|---|---|---|---|
| 0 | 30.49 | 29.40 | 60.64 |
| 3 | 30.49 | 29.14 | 58.73 |
| 11 | 40.05 | 35.40 | 57.60 |

We observe that increasing the number of hidden latents consistently increases reconstruction accuracy, but this comes at the expense of downstream performance. We attribute that to the “naive” MLP used for the downstream task, which does not take permutation symmetries into account. Using no hidden latents effectively removes these symmetries, and as such, the downstream MLP is able to perform better. Future work can explore more sophisticated downstream architectures that incorporate the permutation symmetries of NeoMLP.

Final Decision

This paper was generally considered well-written by the reviewers, and they mostly agreed that it covers the important topic of continuous signal representation with conditional neural fields. The experimental results show the potential for the method to improve both reconstruction and classification results.

The main concerns that remain with the paper are:

  1. The introduction of the method (which boils down to implementing a neural field with a transformer) as incorporating connectionism into MLPs is over-interpretation and over-claiming, and leads to an ambiguous motivation.
  2. It misses many related works that employ transformers for conditional neural fields or generalizable INRs, fails to compare to relevant baselines, and does not test more downstream tasks such as generative modeling.

I encourage the authors to revise the paper to be more upfront about the architectural similarities to existing transformer models and to more clearly situate their novel application within this established context, rather than presenting the architecture itself as entirely new.