Infinite-Dimensional Feature Interaction
Abstract
We introduce InfiNet, a modern architecture using kernel methods to explore infinite-dimensional feature interactions.
Reviews and Discussion
This work proposes a novel approach for enhancing neural network performance by scaling feature interaction spaces to infinite dimensions using kernel methods. Recent advancements have introduced feature interaction spaces, but these are often limited to finite dimensions, primarily through element-wise multiplications. To overcome these limitations, the authors propose InfiNet, a model architecture leveraging the Radial Basis Function (RBF) kernel to enable infinite-dimensional feature interactions. Finally, the authors provide several empirical results on standard vision tasks.
Strengths
This work provides an interesting generalization of feature-feature interactions via kernels. To the best of my knowledge, this is a novel idea that appears to perform well in practice. However, I am not overly familiar with the current state of the field of deep learning for computer vision. It further provides several larger-scale experiments and interesting ablations.
Weaknesses
- There is no theoretical justification that increasing the dimension of the feature-feature interaction space will lead to better generalization. The paper does a good job analysing this question with ablations. However, this remains an open theoretical question.
- I understand that the motivation for this work comes from applications in computer vision. However, since a major focus in this paper is on comparing the proposed approach to self attention, it would be interesting to not only test this method on images, but also on language.
- The method is reported to have lower FLOPs on average than competing methods. Why is that? Is that a major drawback of this method?
- The performance improvement on ImageNet is only marginal. In many cases the proposed method even performs worse than competing methods.
- Paragraph starting in line 148: this is an over-claim and has to be removed or rigorously proved. It is not clear how a higher order of interaction implies better generalization or training. Unless shown in this paper or referenced from another paper, this has to be removed.
Minor:
- Line 28: more context for formulating self attention that way has to be provided. It is explained in more detail only at the end of section 3.
- Caption of figure 2: there is a '?'. Moreover, a description of the presented images should be included. What is shown in Figure 2 on the right hand side? This is only explained in the main text, not the caption. This needs to be changed.
- Figure 2, first image on the left: hard to read -- text overlaps with drawing.
Questions
- What is meant in lines 47 + 48? The current formulation is very cryptic. What exactly is linear in k?
- Figure 2: why do the addition and multiplication interactions reach the same accuracy on cifar10? Isn't that basically MLP vs self-attention? I would presume self attention to perform better.
Limitations
- The paper provides empirical results only on vision tasks. However, a major selling point of this paper is generalizing approaches like self attention in terms of feature-feature interactions. Therefore, comparisons with transformers on language tasks should be performed.
- No theoretical analysis is provided proving that the proposed method leads to better generalization.
- The method appears to have on average lower FLOPs than competing methods, while at the same time only marginally outperforming (or even performing worse than) competing methods on ImageNet.
We sincerely appreciate the constructive comments from reviewer u52R and the time spent on reviewing this paper. We address the questions and clarify the issues accordingly as described below.
[Weakness 1]: There is no theoretical justification that increasing the dimension of the feature-feature interaction space will lead to better generalization. The paper does a good job analysing this question with ablations. However, this remains an open theoretical question.
[Re: W1]: We agree that an in-depth theoretical analysis of the proposed method from the interaction space perspective would enhance our understanding of the model's generalization ability. However, we must acknowledge that we have not yet found a rigorous theorem to fully prove such interaction mechanisms in deep networks. Theoretical analysis of complex systems like InfiNet presents significant challenges. To our knowledge, no existing theorem comprehensively analyzes the generalization ability of advanced neural architectures, such as those incorporating self-attention or gated convolution. Nonetheless, we offer some intuitive and empirical analyses to illustrate the methodology and philosophy underpinning our feature interaction perspective. We hope that these analyses will help readers better understand our motivations and provide guidance for designing improved architectures in future research work.
[Weakness 2]: About testing on language.
[Re: W2]: Computer vision models are generally similar to encoder-style language models, so we are testing a BERT-like design with InfiNet's architecture on language; InfiNet works in the encoder setting, although this still requires engineering effort. Like most vision models, InfiNet cannot be directly applied to a GPT-like decoder-only architecture due to the limitations of auto-regressive modeling. However, our idea of utilizing kernel methods can still be applied in network architecture design, and we are trying to combine the kernel method with xLSTM to construct a new language model.
[Weakness 3]: the method is reported to have lower FLOPs on average than competing methods. Why is that? Is that a major drawback of this method?
[Re: W3]: In the deep learning architecture literature, FLOPs refers to the count of floating point operations and represents the amount of computation required by the model. It is a metric where smaller values are better, and our model significantly outperforms others in this regard. It is not a drawback but rather an advantage of this method.
[Weakness 4]: performance improvement on ImageNet is only marginal. In many cases the proposed method even performs worse than competing methods.
[Re: W4]: i. Our main contribution is a new perspective on model design rather than a visual system that beats all other models. We designed the experiments to verify the effectiveness of our design perspective over models built on plain feature superposition and interaction. We follow the architecture and training configuration of the widely used Swin Transformer for fair comparison. Our goal is to provide a new perspective on model design, not a SOTA visual recognition system.
ii. Our performance on ImageNet-1K can be further improved with more refined configuration tuning. We could manually refine the various training hyper-parameters to obtain stronger performance on the ImageNet validation set. We did not do this because the performance improvement gained in this way is essentially overfitting to the dataset. Therefore, there is still substantial room to further improve the performance on ImageNet-1K. We think many techniques, including architecture optimization and training methods, would further improve the model performance.
[Weakness 5]: paragraph starting in line 148: this is an over-claim and has to be removed or rigorously proved. It is not clear how a higher order of interaction implies better generalization or training. Unless shown in this paper or referenced from another paper, this has to be removed.
[Re: W5]: This is related to Weakness 1. We will modify the wording here and state our view in the form of a conjecture/hypothesis instead of an explanation. Some relevant support is given in [1], but a complete theory of such a complex system still needs more effort.
[Response to Minor Points]: (1) We will provide more context to make the formulation more self-contained in Section 1. (2)(3) The "?" here corresponds to the one in the figure and represents a replaceable operation. We will change the caption and figure for readability.
[Question 1]: what is meant in line 47 + 48? the current formulation is very cryptic. What exactly is linear in k?
[Re: Q1]: This refers to the fact that if the high-order interaction is realized by changing the design of the model architecture, the computational cost will increase linearly with the size of the order k that one wants to achieve, because of the need for elementwise multiplication of k groups of features.
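To make the linear-in-k cost concrete, here is a minimal sketch of an order-k interaction built from element-wise products (hypothetical shapes; illustrative pseudocode rather than our exact implementation):

```python
import torch

def order_k_interaction(features):
    # Element-wise product of k feature groups: each additional order adds
    # one more pass over the whole tensor, so the cost grows linearly in k.
    out = features[0]
    for f in features[1:]:          # k - 1 element-wise multiplications
        out = out * f
    return out

k = 4                               # hypothetical interaction order
feats = [torch.randn(2, 64, 14, 14) for _ in range(k)]
y = order_k_interaction(feats)      # output shape is unchanged: [2, 64, 14, 14]
```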
[Question 2]: figure 2: why does the addition and multiplication interactions reach the same accuracy on cifar10? Isn't that basically MLP vs self-attention? I would presume self attention to perform better.
[Re: Q2]: This is due to the small size of the CIFAR10 dataset. In fact, Transformers are considered to show advantages over convolution only on large-scale datasets [3].
[Response to Limitations]: See responses to W1, W2 and W3.
[1] Wu, Yongtao, et al. "Extrapolation and spectral bias of neural nets with hadamard product: a polynomial net study." NeurIPS 2022
[2] Rao, Yongming, et al. "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions." NeurIPS 2022
[3] Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." ICLR 2021
I thank the authors for their detailed response. I agree that it is challenging to include an interesting theoretical analysis of this model. It is perhaps not necessary for this particular paper, as it is already an interesting contribution without.
I urge the authors to include a short description of FLOPs in the context of DL models. Coming from a computational background, readers might interpret the results as a downside of this method when in fact it is even an advantage (i.e., confusing FLOPs with FLOPS, just as I did).
Otherwise, I am happy with the rebuttal and will further increase my score.
Dear Reviewer u52R,
We appreciate your time and effort in providing feedback on our submission.
As the author-reviewer discussion period draws to a close, we look forward to hearing whether our response addressed your concerns and are happy to engage in further discussion if there are still outstanding issues with the paper.
Authors
Thank you for your response. We appreciate your support in the acceptance of our paper. If you have any further concerns or questions, we are willing to discuss them with you.
We will include the description of FLOPs to make it clear. We will add it in the footnote:
Difference between FLOPs and FLOPS: FLOPs (floating point operations) is the number of floating point operations and is used to measure the complexity of an algorithm/model; smaller values are better. FLOPs is different from FLOPS (floating point operations per second), which is a measure of hardware performance. Following other deep-learning architecture works, we use FLOPs to measure computing demands. Since our model has lower FLOPs, our method is more efficient.
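For instance, an illustrative back-of-the-envelope estimate (ours, not from the paper) of the FLOPs of a single convolution layer:

```python
def conv2d_flops(c_in, c_out, k, h_out, w_out):
    # Each output element needs c_in * k * k multiply-accumulates; here one
    # MAC is counted as two floating point operations (some works report MACs).
    return 2 * c_in * c_out * k * k * h_out * w_out

print(conv2d_flops(64, 128, 3, 56, 56) / 1e9, "GFLOPs")  # ~0.46 GFLOPs
```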
Thank you for your advice!
Authors
This paper studies placing a kernel function inside of a neural network architecture to facilitate interaction of features/dimensional expansion. They consider deep convolutional networks with parallel pathway features and a kernel function computed with both pathways' features as inputs. Standard kernel mathematics is used to explain feature expansion. The main novel results are empirical performance of these "InfiNet" architectures, which are shown to perform well in a number of computer vision tests.
Strengths
The idea of unifying different orders of interaction embodied in various neural network architectures, including Transformers is appealing and probably important. The accuracy of the InfiNet experiments is impressive, with a moderate reduction in FLOPs. The paper is easy to read and well-organized, although suggestions are given for how it could be improved.
Weaknesses
My main concerns with the paper are a lack of context for the approach as well as missing important explanations. I also think a good amount of the math that's included could be considered "filler" material that could go into the appendix, since it doesn't represent new results. (I am referring to sections 4.1 and 4.2, most of which can be found in most textbooks which cover kernel methods.)
- Notation which is commonly used in the paper (the direct-sum symbol, the element-wise product symbol, and *) is not explained. You should explicitly define it somewhere, at least in the appendix (and refer people there). In particular, people may be confused by * for elementwise/Hadamard multiplication, since in convnet literature this is often the convolution operator. You call this the "Star Operation" in line 124, but I think it is just elementwise multiplication.
- The authors seem to have missed the vast literature on the connections between random features, neural networks at init, and kernel methods. (CKNs are mentioned but without any discussion of the topics I mention here.) In particular, one way that you could approximate the InfiNet architecture would be to take the two feature streams and pass them each into the same wide, random network/layer and compute the dot product of features at the next level. That would only approximate the kernel function in the InfiNet architecture, and is likely less efficient, but it provides a way to perform dimensionality expansion with a more traditional layer. The authors should discuss these connections.
- Different order of interactions have been studied in random feature and kernel settings already. In random features, interaction order is connected to the sparsity of weights, see e.g. https://arxiv.org/abs/2103.03191 and https://arxiv.org/abs/1909.02603. In kernels, this were referred to as additive kernels https://arxiv.org/abs/1602.00287, also studied in multiple kernel learning https://arxiv.org/abs/0809.1493 (these are just some examples among a larger literature).
- The authors do not seem to want to release their code. They have said "Yes" on Question 5, stating that the code and data are open, but there is no link or indication in the text that the code is available or will be when the paper is published. That seems deceptive.
Questions
- There is a tension between dimensional expansion, which leads to expressivity in networks, and generalization, which is typically better in low-dimensional settings. Can you discuss this?
- When queries and keys in a transformer are computed using a multilayer network with nonlinearities (rather than a single linear layer, as you've considered), aren't the effective order of interactions higher?
- You claim that the kernel map applied to inputs with a given number of channels takes constant time (section 4 intro). Wouldn't evaluating the kernel still take time linear in the number of channels?
- Can you please include the matrix/tensor shapes and layer sizes explicitly in section 5.1? They could be put into the appendix. It is unclear how many kernel evaluations are performed and on what shape input.
Minor points:
- The sentence in lines 45-47 is confusing and should be reworded. Also, the combinatorial expression with the limit is unexplained, not obvious, and doesn't seem to contribute anything here. I suggest removing it.
- Line 61, the expression for span of a certain space is unclear. The main point seems to be that this is an infinite-dimensional function space. Does using this math really add anything?
- Line 61: "as low-overhead as exponential operations" is unclear. Do you mean "evaluating an exponential function"?
- Line 91: "Kernel Method" -> "Kernel Methods" typo
- Line 106: "isotropic" here is unclear to me, suggest removing
- Line 110: "medium" for the intermediate layer connotes different size, suggest changing to "intermediate" or "middle"
- Line 130: Without saying it, are you assuming that the image inputs span the pixel vector space?
- Line 149: "two element-wise multiplication" typo -> "multiplications"
- Notation is confusing: In equation (1) the expression seems to output a scalar. Is that the same in Eqn (6)? What are the shapes of the W matrices?
- What is a "feature branch"? Unclear throughout.
- You say the input is passed through "STEM" and refer people to the ConvNeXT paper https://arxiv.org/pdf/2201.03545. There is more than one "stem" in that paper. Can you be explicit about what you did?
Limitations
I would strongly prefer that the limitations be included in the main text during the discussion. With movement of some of the standard math, there would be space.
The authors only consider the squared exponential kernel with bandwidth parameter equal to 1. Other kernels might work better. In particular, the effective dimensionality of the RKHS (related to the kernel decay rates) would be higher with a "less smooth" kernel like the exponential/Laplace kernel.
The results are likely not reproducible unless the authors release their code.
The results are also limited only to supervised vision tasks, rather than other modalities or unsupervised settings.
We sincerely appreciate the constructive comments from reviewer USts and the time spent on reviewing this paper. We address the questions and clarify the issues accordingly as described below.
[Weakness 1]: Notation is not explained...
[Re: W1]: We'll add a detailed explanation of the notation in a subsequent release. The direct-sum symbol (⊕) denotes the direct sum of vector spaces; this corresponds to structures such as channel expansion/bottleneck in commonly used neural networks. The other operator denotes element-wise multiplication. Some of the expressions in our text use the Star Operation (*); this is intended to be aligned with reference [1] and is essentially element-wise multiplication. We realize that we overlooked the consistency of notation in the paper, and we will fix this in a subsequent version.
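For illustration only (hypothetical shapes; not code from the paper), the two operations correspond to:

```python
import torch

x = torch.randn(2, 64, 14, 14)         # two feature maps of the same shape
y = torch.randn(2, 64, 14, 14)

# Direct sum of vector spaces: realized as channel concatenation,
# i.e. the channel expansion / bottleneck structures mentioned above.
direct_sum = torch.cat([x, y], dim=1)   # shape [2, 128, 14, 14]

# Element-wise (Hadamard) multiplication, a.k.a. the Star Operation (*):
hadamard = x * y                        # shape [2, 64, 14, 14]
```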
[Weakness 2]: About literature on the connections between random features, neural networks at init, and kernel methods.
[Re: W2]: Thank you for your advice. We will add the discussion on the connections between random features, neural networks and kernel methods.
The NTK framework establishes a direct connection between infinitely wide neural networks at initialization and kernel methods. Specifically, it shows that as the width of the network grows, the network's training dynamics can be described by a kernel function (the NTK), linking the neural network's behavior to that of kernel methods. Random features provide an efficient way to approximate the feature mappings used in kernel methods. By using random projections, one can approximate the inner product defined by a kernel function, making it feasible to apply kernel methods. Our contribution is that we look at the relationship between kernel methods and neural networks in a different light. Instead of examining the dynamics of neural networks from the standpoint of the kernel, we consider how kernel methods can benefit neural networks from the perspective of expanding the interaction space.
We appreciate your advice on using random features to approximate InfiNet. We recognize that this is something we haven't thought about in the past, and we think it's a worthy discussion for future work. Meanwhile, we believe that our work will be an important milestone in the development of this field.
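To make the connection concrete, here is a small illustrative sketch (ours, not part of the paper) of the reviewer's suggestion using random Fourier features, which approximate the RBF kernel with an explicit finite-dimensional feature map:

```python
import math
import torch

def rff_map(x, W, b):
    # Random Fourier feature map: z(x) such that z(x) . z(y) approximates
    # the RBF kernel exp(-||x - y||^2 / (2 * sigma^2)) for large D.
    D = W.shape[0]
    return math.sqrt(2.0 / D) * torch.cos(x @ W.T + b)

d, D, sigma = 64, 4096, 1.0                    # hypothetical sizes
W = torch.randn(D, d) / sigma                  # rows ~ N(0, sigma^-2 I)
b = 2 * math.pi * torch.rand(D)                # phases ~ Uniform(0, 2*pi)

x = 0.1 * torch.randn(d)
y = x + 0.05 * torch.randn(d)                  # a nearby point, so the kernel value is non-trivial
approx = rff_map(x, W, b) @ rff_map(y, W, b)   # finite-dimensional dot product
exact = torch.exp(-((x - y) ** 2).sum() / (2 * sigma ** 2))
print(float(approx), float(exact))             # the two values should be close
```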
[Weakness 3]: Different orders of interactions have been studied in random feature and kernel settings already. ...
[Re: W3]: We appreciate your references to research on different orders of interaction in the fields of random features and kernel methods. We will include a section to discuss them. However, we believe there is still a gap between the research in these areas and the research on interaction in NN design. Therefore it is difficult to derive a methodology from these studies that can be directly applied to NN design. Our work looks at interactions in the design of modern NN architectures in the hope of finding a new design consideration.
[Weakness 4]: About the code open source.
[Re: W4]: We have recently released the code. However, due to the NeurIPS rebuttal policy, we cannot provide you with a direct link. Instead, as per NeurIPS rules, we have sent an anonymous link to the AC.
[Question 1]: Tension between dimensional expansion and generalization.
[Re: Q1]: In neural network research, the prevailing view is that such a tension does not exist. This is due to the double descent phenomenon, which challenges the traditional bias-variance trade-off perspective: double descent suggests a positive correlation between dimensional expansion and generalization in the over-parametrized regime [2].
[Question 2]: About effective order of interactions of transformers
[Re: Q2]: An interesting question, and we think the answer is yes. There have been studies of neural network width/depth equivalence in the past, and we think the order of interaction can be added to that. A quantitative study of this issue is more complex and requires more effort.
[Question 3]: About the claim of constant time.
[Re: Q3]: Thanks for pointing out that we ignored the complexity of the kernel itself here. If we consider the complexity of the kernel itself, it's supposed to be linear time.
[Question 4]: About including the matrix/tensor shapes.
[Re: Q4]: We will further refine the description of the tensor shapes in the camera-ready version. As shown in Fig. 3(b), each rounded rectangular box does not change the tensor shape, and parallel boxes create independent copies. The two branches of the kernel will each be r (7 in our implementation) times the size of the input tensor.
[Re: Minor Points]: Due to character limitations, we combine the responses to the minor points. (1) We have noticed that the description of the paragraph in lines 45-49 may not be clear, and we will reorganize it. (2)(3) We will remove the redundant expression and clarify that, in the kernel method, the computational overhead is the same as evaluating an exponential function. (4)(6)(8) We will fix the typos. (5) Isotropic means constant shape. (7) Yes, we consider it a basic assumption in computer vision. (9) No, W in Eq. 1 is an n-d vector, and in Eq. 6 it is a matrix; we will change some notation to make this clearer. (10) A feature branch means a multi-branch structure like in ResNeXt. (11) STEM refers to the same patchify stem (4x4 non-overlapping convolution).
[Re: Limitations]: We will tweak some paragraphs of the article to discuss the limitations. We will try more kernels to enhance the described approach. The initial code is open source now. Our work on language is in progress but still needs engineering effort. The performance in unsupervised settings is still unknown, but we are confident that InfiNet learns strong representations.
[1] Ma, et al. "Rewrite the Stars." CVPR 2024
[2] Nakkiran, et al. "Deep double descent: Where bigger models and more data hurt." Journal of Statistical Mechanics: Theory and Experiment 2021
I appreciate the authors' response and willingness to address my suggestions. I will change my overall score to 7 and assume that these points will be taken seriously.
PS I know what isotropic means, but I still do not know what it means in your paper's context.
Thank you for your response. We appreciate your support in the acceptance of our paper. If you have any further concerns and questions, we are willing to discuss them with you.
We will remove "isotropic" and find a better word to describe the constant feature shape of the 2-layer MLP.
Authors
The authors present a new architecture for computer vision applications that models high-order interactions between features. The architecture is similar to an attention block, but introduces an RBF Kernel layer that captures interactions of order higher than two. The resulting method has strong empirical performance across image classification tasks.
Strengths
- The idea of the paper is very interesting and novel.
- The empirical results show promising performance across multiple tasks against sophisticated methods.
Weaknesses
- The presentation of the method seems overly complex in some places. For example, providing a clearer explanation of each new layer (perhaps in pseudocode) would help. While the Infiniblock definition is clear, the reader needs to go back to the previous section to understand the input/output shapes of the RBF layer, which takes work, and can be made simpler. Making clearer the intuition behind high-order interactions would be helpful as well. Showing examples of what the model learns would be helpful to make things concrete.
- The empirical performance is reasonably similar to those of previous methods, hence the empirical improvement is not that large.
Questions
- What are some examples of features and interactions that help learning and that the new model can learn?
- Is it possible to analyze or visualize what interactions the model learned?
Limitations
I do not see any ethical and societal implications of the work that need to be discussed.
We sincerely appreciate the constructive comments from reviewer Jbm2 and the time spent on reviewing this paper. We address the questions and clarify the issues accordingly as described below.
[Weakness 1]: The presentation of the method seems overly complex in some places. For example, providing a clearer explanation of each new layer (perhaps in pseudocode) would help. While the Infiniblock definition is clear, the reader needs to go back to the previous section to understand the input/output shape of the RBF layer, which takes work, and can be made simpler. Making clearer the intuition behind high-order interactions would be helpful as well. Showing examples of what the model learns would be helpful to make things concrete.
[Response to W1]:
(1) We adopt the diagrammatic and layer-wise formulaic representations commonly used in articles in the field of deep learning architecture. We will add input/output shapes for each layer in subsequent versions of Fig.3 to make it easier to understand. We have open-sourced our code, which will further help in the understanding of the model we design. However, due to the NeurIPS rebuttal policy, we cannot provide you with a direct link. Instead, as per NeurIPS rules, we have sent an anonymous link to AC.
(2) The intuition for high-order interactions is that recent research suggests high-order interactions are widespread in biological systems, neuroscience, and physical-social systems [1]. We believe such high-order interactions are clearly present in neural network features as well. Interaction-inspired designs like HorNet and MogaNet build on this idea, but their models are limited to exploring simple interactions, whereas our model can explore a larger interaction space.
(3) We show some cases in Figure 4 in the Appendix that demonstrate the difference in the Class Activation Mapping feature regions learned in the feature representation space, the finite simple interaction space, and the infinite-dimensional interaction space.
[Weakness 2]: The empirical performance is reasonably similar to those of previous methods, hence the empirical improvement is not that large.
[Response to W2]:
i. Our main contribution is a new perspective on model design rather than a visual system that beats all other models. We designed the experiments to verify the effectiveness of our design perspective over models built on plain feature superposition and interaction. We follow the architecture and training configuration of the widely used Swin Transformer for fair comparison. Our goal is to provide a new perspective on model design, not a SOTA visual recognition system.
ii. Our performance on ImageNet-1K can be further improved with more refined configuration tuning. We could manually refine the various training hyper-parameters to obtain stronger performance on the ImageNet validation set. We did not do this because the performance improvement gained in this way is essentially overfitting to the dataset. Therefore, there is still substantial room to further improve the performance on ImageNet-1K. We think many techniques, including architecture optimization and training methods, would further improve the model performance.
[Question 1]: What are some examples of features and interactions that help learning and that the new model can learn?
[Response to Q1]: We give some examples in Figure 4. We can observe that the regions the kernel-based model responds to better match the actual object class. This shows that our method is more effective at extracting useful features, largely because the interaction among the target's features allows the target to be learned as a whole, thus enhancing the model to a certain extent.
[Question 2]: Is it possible to analyze or visualize what interactions the model learned?
[Response to Q2]: We believe that commonly used methods such as GradCAM, LRP, etc. can be directly used for the visualization and analysis of InfiNet. However, it is worth noting that, due to the multi-branch structure of InfiNet's features and the mixing of the two branches' activation maps introduced by the kernel method, relevance scores may not be successfully propagated back to the inputs, providing only partial relevance information.
We think that an interpretable visualization of InfiNet's interactions is a very complex task that still requires a great deal of effort. We give some basic ideas for possible visualizations of interactions. (1) We can anchor a region of interest and obtain a heat map of the regions that co-interact with it by using the gradient method. (2) We can obtain the coefficients of the interactions in the region corresponding to the activations by Taylor expansion of the activations at each level of the kernel function, and construct a statistical map.
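A rough sketch of idea (1), assuming a generic PyTorch model and a hypothetical `layer` handle (a direction we would explore, not an implemented feature of InfiNet):

```python
import torch

def cointeraction_heatmap(model, image, layer, roi):
    # Anchor a spatial window roi = (r0, r1, c0, c1) in `layer`'s activation
    # and ask which input locations influence it, as a proxy for co-interaction.
    image = image.clone().requires_grad_(True)
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.setdefault("a", o))
    model(image)                                 # forward pass records the activation
    handle.remove()

    r0, r1, c0, c1 = roi
    anchor = feats["a"][:, :, r0:r1, c0:c1].sum()
    grad, = torch.autograd.grad(anchor, image)   # gradient of the anchored region w.r.t. the input
    return grad.abs().sum(dim=1)[0]              # [H, W] heat map over input locations
```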
[1] Battiston F, Amico E, Barrat A, et al. The physics of higher-order interactions in complex systems[J]. Nature Physics, 2021, 17(10): 1093-1098.
I acknowledge the author's response. I am inclined to keep my score.
Thank you for your response. We appreciate your support in the acceptance of our paper. If you have any further concerns and questions, we are willing to discuss them with you.
Authors
The paper shifts the focus from traditional neural network design, which emphasizes feature representation space scaling, to feature interaction space scaling. It introduces a new model architecture, InfiNet, that enables feature interaction within an infinite-dimensional space using the RBF kernel, leading to state-of-the-art results. The paper also discusses the limitations of current models in capturing low-order interactions and proposes the use of classic kernel methods to engage features in an infinite-dimensional space.
Strengths
- The idea of the paper is simple, novel and well exposed.
- The paper introduces InfiNet, a model architecture that leverages infinite-dimensional feature interactions using RBF kernels, which enhances the performance of traditional models.
- InfiNet achieves new state-of-the-art performance in various tasks, demonstrating the effectiveness of infinite-dimensional interactions.
- The paper includes extensive experiments on datasets like ImageNet and MS COCO, showing the scalability and efficiency of InfiNet.
Weaknesses
- The paper builds on the simple use of kernel methods. The novelty of the method is minimal; in the end it is an RBF kernel.
- The performance improvement of InfiNet over other models is mostly marginal and no error bars have been displayed.
- The paper doesn't really have theoretical novelty.
Questions
- How does InfiNet compare to other models in terms of training time and resource consumption?
- Can the kernel methods used in InfiNet be applied to other types of neural network architectures beyond those discussed?
- Can the authors quantify the increased dimensionality of the kernel methods over simpler operations (sum, product)? If the authors take the simplest architecture for ImageNet and look at the representations generated by means of using different kernels, can they quantify what is the actual increase in the intrinsic dimensionality of the representation upon training? It is not fully clear to me that the increase in performance is due to an increase in dimensionality.
- The authors mention the possibility of exploiting a learnable kernel in place of RBF. Could the authors explain and discuss the rationale behind using RBF in place of others? Is it solely driven by the computational complexity? Would the results be different with a different kernel?
Limitations
None
We sincerely appreciate the constructive comments from reviewer Tii5 and the time spent on reviewing this paper. We address the questions and clarify the issues accordingly as described below.
[Weakness 1]: The paper builds on the simple use of kernel methods. The novelty of the method is minimal, in the end it is an RBF kernel.
[Re: W1]: Our novelty lies in performing feature interaction in neural networks by means of kernel methods, which is not trivial. We provide an in-depth discussion of the reasons behind our proposed approach, and we point out that the dimensionality of the hidden space generated through feature interactions is an important factor in the performance of a model.
[Weakness 2]: The performance improvement of InfiNet over other models is mostly marginal and no error bars have been displayed.
[Re: W2]:
i. Our main contribution is a new perspective on model design rather than a visual system that beats all other models. We designed the experiments to verify the effectiveness of our design perspective over models built on plain feature superposition and interaction. We follow the architecture and training configuration of the widely used Swin Transformer for fair comparison. Our goal is to provide a new perspective on model design, not a SOTA visual recognition system.
ii. Our performance on ImageNet-1K can be further improved with more refined configuration tuning. We could manually refine the various training hyper-parameters to obtain stronger performance on the ImageNet validation set. We did not do this because the performance improvement gained in this way is essentially overfitting to the dataset. Therefore, there is still substantial room to further improve the performance on ImageNet-1K. We think many techniques, including architecture optimization and training methods, would further improve the model performance.
The reason we do not give error bars is that model training is expensive and it is difficult for us to schedule enough resources to perform multiple rounds of ImageNet-1K training and ImageNet-21K pre-training. For the same reason, omitting error bars is very common in the deep learning architecture community.
[Weakness 3]: The paper doesn't really have theoretical novelty
[Re: W3]: We agree that an in-depth theoretical analysis of the proposed method from the interaction space perspective would help readers better understand our model. However, we have not yet found a rigorous theorem to fully prove such interaction mechanisms in deep networks, since theoretically analyzing a complex system like InfiNet is very difficult. We still offer some intuitive and empirical analyses to show the methodology and philosophy behind our feature interaction perspective. We hope such analysis can help readers better understand our motivation and provide some guidance for designing better architectures in future research.
[Question 1]: How does InfiNet compare to other models in terms of training time and resource consumption?
[Re: Q1]: In terms of training time, we take InfiNet-T level models trained on ImageNet-1K with the same configuration on 4 A100 40GB GPUs as an example, based on our tests.
| Model | Training Time(min/epoch) |
|---|---|
| ConvNeXt | 22 |
| Swin | 27 |
| HorNet | 31 |
| MogaNet | 43 |
| InfiNet | 30 |
As the model scale increases, the model width gradually grows and the FFN layers occupy a larger share of the computational load, so the differences in training time between these models become smaller.
We notice a large amount of data reuse in InfiNet's multi-branch structure before the kernel, so memory access bandwidth is the limiting factor in our model training. We substantially reduced the computation time by using fully equivalent grouped convolutions instead of computing the depth-wise convolutions branch by branch in a round-robin fashion; this cuts the training time per epoch from 54 minutes (with the cyclic depth-wise convolutions) to 30 minutes.
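A minimal sketch of the kind of fusion we mean (hypothetical sizes; the real implementation differs in detail): r per-branch depth-wise convolutions are replaced by one grouped convolution over the concatenated tensor.

```python
import torch
import torch.nn as nn

B, C, H, W, r, k = 2, 64, 14, 14, 7, 3     # hypothetical sizes; r parallel branches
branches = [torch.randn(B, C, H, W) for _ in range(r)]

# Round-robin version: r separate depth-wise convolutions, one per branch.
dw_convs = nn.ModuleList([nn.Conv2d(C, C, k, padding=k // 2, groups=C, bias=False)
                          for _ in range(r)])
loop_out = torch.cat([conv(x) for conv, x in zip(dw_convs, branches)], dim=1)

# Fused version: a single grouped convolution over the concatenated tensor.
fused = nn.Conv2d(r * C, r * C, k, padding=k // 2, groups=r * C, bias=False)
with torch.no_grad():
    fused.weight.copy_(torch.cat([conv.weight for conv in dw_convs], dim=0))
fused_out = fused(torch.cat(branches, dim=1))

print(torch.allclose(loop_out, fused_out, atol=1e-5))   # True: the two are equivalent
```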
The lack of such low-level computational optimization is the main reason for the current slowness compared to ConvNeXt and Swin, but our model training is still faster than most mainstream high-order networks. The model also suffers from load imbalance during computation. However, these problems are solvable and optimizable, and further optimizing the model computation is the goal of our subsequent work.
[Question 2]: Can the kernel methods used in InfiNet be applied to other types of neural network architectures beyond those discussed?
[Re: Q2]: Yes! In fact, the element-wise multiplication in any network architecture can in principle be replaced by the kernel method, taking the convergence of the model into consideration. This includes state-space models (aka Mamba) (Gu, Albert, et al. 2023), HorNet (Rao et al. 2022), gated convolution (Dauphin, et al. 2017), and so on. Beyond that, we are trying to use the kernel method in xLSTM (Beck M. et al. 2024).
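To illustrate what such a drop-in replacement could look like, here is a generic sketch under our own simplifying assumptions (the RBF interaction is evaluated element-wise; this is not the exact InfiNet or gated-convolution layer):

```python
import torch
import torch.nn as nn

class KernelGate(nn.Module):
    """Sketch: replace a gated block's element-wise product x * y with an
    RBF-style interaction exp(-(x - y)^2 / (2 * sigma^2)), element-wise."""
    def __init__(self, dim, sigma=1.0):
        super().__init__()
        self.proj_x = nn.Conv2d(dim, dim, 1)
        self.proj_y = nn.Conv2d(dim, dim, 1)
        self.sigma = sigma

    def forward(self, inp):
        x, y = self.proj_x(inp), self.proj_y(inp)
        # A gated convolution would return x * y; the kernel variant instead
        # interacts the two branches through an RBF map.
        return torch.exp(-(x - y) ** 2 / (2 * self.sigma ** 2))

block = KernelGate(64)
out = block(torch.randn(2, 64, 14, 14))   # same shape in, same shape out
```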
[Question 3]: Can the authors quantify the increased dimensionality...
[Re: Q3]: We give a quantitative ablation study in Sec. 6.4 and Table 3(b). The intrinsic dimensionality can be calculated with Eq. 5, with n = HWK^2 and k equal to the interaction order in Table 3(b), where H and W are the height and width of the feature map and K is the convolution kernel size. In Table 3(b), we can see that as the interaction order increases, the intrinsic dimensionality increases and the performance improves. Thus, there is a positive correlation trend: performance increases with dimensionality.
[Question 4]: About the kernel selection.
[Re: Q4]: The reasons we use the RBF kernel in our model are: (1) the RBF kernel is the simplest kernel function that can realize infinite-dimensional feature interaction; (2) the empirical results: we show the results of using a linear kernel and monomial kernels in Table 3(b).
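For reference, the three kernel families compared in Table 3(b) can be written down as follows (an illustrative sketch over per-position channel vectors; shapes are hypothetical):

```python
import torch

def linear_kernel(x, y):
    return (x * y).sum(dim=-1)

def monomial_kernel(x, y, p=2):
    # order-p monomial kernel: a finite-dimensional interaction space
    return (x * y).sum(dim=-1) ** p

def rbf_kernel(x, y, sigma=1.0):
    # infinite-dimensional feature map; bandwidth sigma = 1 in our setting
    return torch.exp(-((x - y) ** 2).sum(dim=-1) / (2 * sigma ** 2))

x, y = torch.randn(8, 64), torch.randn(8, 64)   # hypothetical channel vectors
print(linear_kernel(x, y).shape, monomial_kernel(x, y).shape, rbf_kernel(x, y).shape)
```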
Dear Reviewer Tii5,
We appreciate your time and effort in providing feedback on our submission.
As the author-reviewer discussion period draws to a close, we look forward to hearing whether our response addressed your concerns and are happy to engage in further discussion if there are still outstanding issues with the paper.
Authors
I appreciate the answers to my comments. My questions have been addressed. Yet, I think my score is appropriate and I will not change it, unless further discussions with the AC and other reviewers will prompt me to do so.
Thank you for your response. We appreciate your support in the acceptance of our paper. If you have any further concerns or questions, we are willing to discuss them with you.
Authors
Dear Area Chair and Reviewers,
We appreciate the reviewers' precious time and valuable advice. We are happy that most reviewers acknowledged our novel idea (Tii5, Jbm2, USts, u52R) and experiments (Tii5, Jbm2, USts, u52R).
At the same time, we note the reviewers' concerns and suggestions on our work. We provide detailed answers to all the questions raised by the reviewers in the individual responses below. We hope these responses address your questions and concerns well. If you still have questions or concerns, we would appreciate the chance to discuss them further with you!
Since NeurIPS 2024 does not allow the submission of revised papers at the rebuttal stage, we are committed to incorporating changes based on the reviewers' comments in a subsequent revised release.
We would like to thank the reviewers again for their valuable comments on our paper and for their time in reviewing and discussing it!
Best Regards,
Authors
This paper introduces and investigates an architectural component for infinite-dimensional feature interactions based on an RBF kernel. Extensive experiments show strong performance across a wide range of computer vision tasks. Although some of the reviewers noted a lack of theoretical development or justification, they found the method to be novel and the results promising. Regarding novelty, one relevant reference that did not emerge in the discussion is Choromanski, Krzysztof, et al. "Rethinking attention with performers." arXiv preprint arXiv:2009.14794 (2020), which also investigates infinite-dimensional feature interactions, albeit with a different focus and perspective. The authors should be sure to discuss this paper in their next revision. In any case, the present work will be of significant interest to the community and I recommend acceptance.