PaperHub

Overall Rating: 5.6 / 10 · Rejected · 5 reviewers (lowest 5, highest 6, std dev 0.5)
Individual Ratings: 5, 6, 6, 5, 6
Confidence: 4.0 · Correctness: 2.4 · Contribution: 2.4 · Presentation: 2.6
ICLR 2025

Enhancing Accuracy and Parameter Efficiency of Neural Representations for Network Parameterization

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

We enhance the accuracy and efficiency of neural representations that predict neural network weights

Abstract

Keywords
Implicit Neural Representations · Parameter Generation · Network Prediction · Distillation

Reviews and Discussion

Review (Rating: 5)

This paper proposes an analysis and an improved method for representing the weights of neural networks with implicit neural representations (INRs). That is, it aims to compress model weights while keeping the same accuracy (or even very slightly improving model performance). The authors first analyze a reconstruction of model weights by MSE, showing that it enforces some form of smoothing on the weights. Additionally, they claim that with a big enough INR they are able to slightly improve the performance over the original model. They then propose a new method for model compression by INRs that decouples the distillation and reconstruction objectives into two separate stages, leading to better results. Lastly, the authors show that distilling the model with a larger and more capable backbone can improve the compressed model's performance.
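For concreteness, a minimal sketch of such an INR weight predictor is given below; the coordinate embedding, hidden sizes, and per-kernel output are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class WeightPredictorINR(nn.Module):
    """Toy INR mapping a (layer, filter, channel) coordinate to a 3x3 conv kernel.
    Sizes and the embedding are illustrative, not the paper's exact setup."""
    def __init__(self, hidden=360, kernel_size=3):
        super().__init__()
        self.embed = nn.Linear(3, hidden)              # coordinate -> hidden embedding
        self.mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, kernel_size * kernel_size),
        )
        self.kernel_size = kernel_size

    def forward(self, coords):                          # coords: (B, 3) float tensor
        out = self.mlp(torch.relu(self.embed(coords)))
        return out.view(-1, self.kernel_size, self.kernel_size)

# Reconstruction-only objective: MSE between predicted and pretrained kernels.
predictor = WeightPredictorINR()
coords = torch.tensor([[0., 4., 2.], [1., 7., 5.]])     # (layer, filter, channel) indices
target_kernels = torch.randn(2, 3, 3)                   # stand-in for the original weights
loss = nn.functional.mse_loss(predictor(coords), target_kernels)
loss.backward()
```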

Strengths

  • The paper is well written and easy to follow.
  • Despite the relatively minor technical change from the baseline, the new approach is significantly better at model compression without losing performance.
  • The analysis performed on model weights smoothness is interesting (Sec. 3.1 and App. A).

Weaknesses

  • Claim on Accuracy Improvement: My main concern is that the paper's main claim is not well supported. The authors argue that their method enhances model performance compared to the original uncompressed model through weights smoothing. While this sounds promising and the presented analysis is interesting, in practice, the results reveal that the improvement from smoothness is minimal and almost negligible. The performance gain is limited to small-scale benchmarks like CIFAR-100 and STL-10, where the models’ accuracy increases by only 0.1-0.6%. Furthermore, when tested on ImageNet, the trend reverses, with all experiments performing worse than the original model. I believe that as is, this claim is misleading given these results.
  • Missing Baseline: The limitation on data usage presented in lines 82-84 about the NeRN baseline is also true for the distillation loss used in the proposed method. That is, it becomes impractical for model compression if data is not available. In that case, knowledge distillation to a smaller architecture should also be considered as an additional baseline, as it has the same goal: compressing the model with minimal performance drop.
  • High-Performing Teacher: I believe the experiment done with a higher-performing teacher is a bit unfair for the scenario. It seems as if one could perform some knowledge distillation from the stronger teacher before the compression or just compress the stronger teacher in the first place.
  • Combining with Other Compression Approaches: While the authors claim their method is orthogonal to other compression types, Tab. 5 shows otherwise. Specifically, quantization of the compressed model heavily degrades performance (in CIFAR-100 it decreases from 70.84% to 51.72%).
  • Figure 1(a): Figure 1(a) is confusing as it only presents an expected trend, not real results. I think this part of the figure should be simply removed from the paper.
  • Intuition in Smoothness Analysis: While the analysis on weight smoothness presents cases where it slightly improves performance, it does not explain why this happens. I.e., why smoother weights might be better. I can conjecture it is related to the memorization of training examples (overfitting), which can be represented in higher frequencies. If this is the case, an explicit measurement of overfitting (generalization gap) with and without weights smoothing could greatly benefit this analysis.

Minor Remarks which did not affect the grading:

  • Most of the citations should probably be in parentheses and not in line (as all of the paper citations are).
  • The in-line citation at L64 is strange (the reference seems to be in the wrong place).
  • L251: reference to equation has a wrong number.
  • L264-266: These lines should probably be revised to a more accurate version as some terms are unclear (e.g., in which layer does the decision making start and feature extraction end?)
  • Tab. 1: All the first lines are in bold. It's a bit confusing for which result to focus on. Should probably choose the best result as in the other lines of the table.

I am open to reconsidering my score given a revised manuscript which addresses the concerns I mentioned.

Questions

  • The method compresses the model storage-wise only if one saves the implicit representation model alone. Does this mean that in order to use the compressed model one would have to fully reconstruct every weight? If so, the method won't save RAM space at all, and, even worse, will require much more inference time. Could the authors show how many FLOPs it takes to rebuild a model compared to a standard inference of it?
  • Did the authors try to check if smoothing other layer types (e.g. Fully-connected layers) also works when training an implicit representation for them? While simple smoothing strategies might be ineffective due to the permutations of neurons in neural networks [1,2], training an INR on them could still perform some form of smoothing.
  • This is a bit out of scope for this work, but did the authors try to train a model from scratch with some smoothing objective on the weights? Did it improve the results?

[1] Navon, Aviv, et al. "Equivariant architectures for learning in deep weight spaces." International Conference on Machine Learning. PMLR, 2023.

[2] Kofinas, Miltiadis, et al. "Graph neural networks for learning equivariant representations of neural networks." arXiv preprint arXiv:2403.12143 (2024).

Comment

PART 2/2

Intuition in Smoothness Analysis Thank you for the suggestion to analyze generalization explicitly. We evaluated the generalization gap, defined as the absolute difference between train loss and test loss (measured by cross-entropy loss). As shown in the table, our method demonstrates a generally decreasing trend in the generalization gap across progressive rounds. For example, starting from the original network with a gap of 1.28485, the gap decreases to 1.24161 by Round 5.

| Model        | Generalization Gap | Train Accuracy | Test Accuracy | Train Loss | Test Loss |
|--------------|--------------------|----------------|---------------|------------|-----------|
| Original Net | 1.28485            | 99.06          | 71.37         | 0.05296    | 1.33782   |
| Round 1      | 1.26714            | 99.27          | 71.65         | 0.04570    | 1.31286   |
| Round 2      | 1.25425            | 99.16          | 71.78         | 0.05183    | 1.30609   |
| Round 3      | 1.25418            | 99.30          | 71.84         | 0.04831    | 1.30250   |
| Round 4      | 1.26234            | 99.27          | 71.95         | 0.04878    | 1.31113   |
| Round 5      | 1.24161            | 99.20          | 71.97         | 0.04998    | 1.29159   |
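For clarity, the gap reported above is simply the absolute difference of the cross-entropy losses; for the original network,

```latex
\mathrm{gap} = \bigl|\mathcal{L}^{\mathrm{train}}_{\mathrm{CE}} - \mathcal{L}^{\mathrm{test}}_{\mathrm{CE}}\bigr|
             = \lvert 0.05296 - 1.33782 \rvert \approx 1.2849,
```

which matches the tabulated 1.28485 up to rounding of the displayed losses.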

We can further verify the generalization of our predicted weights using downstream tasks, i.e., transfer learning to a new dataset, CINIC-10. We compared our predicted weights with original weights under two scenarios: 1) Fine-tuning all layers: adapts the model completely to the new task, 2) Fine-tuning only the linear layer: highlights the generalizability of convolutional weights. The results show that our predicted weights (Hidden 360, CR < 1) consistently outperform the baseline pre-trained weights in both scenarios, achieving higher accuracy. Additionally, using predicted weights from progressive reconstruction (Hidden 680, CR > 1) further improves generalization slightly. Overall, these results confirm that the proposed method generalizes well to downstream tasks.

| S → T              | Tuning Layers | From Scratch | Original Weights | Predicted Weights (Hidden 360, CR<1) | Predicted Weights (Hidden 680, CR>1) |
|--------------------|---------------|--------------|------------------|--------------------------------------|--------------------------------------|
| CIFAR100 → CINIC10 | All Layers    | 76.88%       | 80.22% ± 0.05    | 80.32% ± 0.04                        | 80.35% ± 0.03                        |
| CIFAR100 → CINIC10 | Linear Layer  | 76.88%       | 58.95% ± 0.03    | 58.98% ± 0.04                        | 59.20% ± 0.04                        |
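A minimal PyTorch sketch of the second scenario (freezing the convolutional backbone and fine-tuning only the final linear layer) is shown below; the ResNet-18 backbone, learning rate, and optimizer are illustrative stand-ins, not the exact protocol used for CINIC-10.

```python
import torch
import torchvision

# Illustrative setup: load (predicted or original) weights, then tune only the classifier.
model = torchvision.models.resnet18(num_classes=10)      # stand-in for the ResNet used here
# model.load_state_dict(predicted_or_original_weights)   # hypothetical checkpoint

for p in model.parameters():                              # freeze everything...
    p.requires_grad = False
for p in model.fc.parameters():                           # ...except the final linear layer
    p.requires_grad = True

optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """One fine-tuning step that updates only the linear classifier."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```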

Figure 1, L64, L251, L264-266, bolds in Table 1 Thank you for your suggestion and for pointing out the typos. We will revise our manuscript accordingly. Regarding lines 264-266, as our predictor focuses on predicting the weights in the convolutional layers, we did not use the term "decision layer" directly. Instead, we referred to a later layer closer to the decision layer, as it contributes more significantly to the decision-making process than early layers. We will update lines 260-266 as follows:

Interestingly, we find that this two-stage optimization leads to significant differences in the early layers of the network, while still matching the later layers. This is intuitive, as it is well known that the decision rules typically emerge in the later layers of a deep network. While the larger differences in the early layers may seemingly compromise reconstruction fidelity, this separate training strategy facilitates more effective integration of distillation into the network parameterization, as evidenced by significant improvements (red bars in Figure 4(c)).

Reconstruction cost vs. inference cost As expected, the reconstruction cost will be significantly higher than the cost of a single inference. However, the intention is not to do the reconstruction on the fly for a given instance but rather to reconstruct once before performing a large number of inferences, thereby amortizing the cost of the reconstruction. More importantly, our primary aim is to investigate the trade-off and improve upon state-of-the-art weight prediction methods, e.g., NeRN, which itself is an early work in a rather new field (weight prediction), so the goal is oriented more toward academic exploration than practical deployment.

Other layer types Thank you for your insightful comments and references. While fully connected layers can indeed be interpreted as 1×1 convolutional layers, making them compatible with the predictor network, we have not explored this implementation in our study for the sake of simplicity. We acknowledge that applying smoothing strategies or training an implicit neural representation on fully connected layers could offer interesting insights.

Train from scratch with weight smoothing To simplify the task, NeRN initially explored a regularization-based approach to promote weight smoothness by explicitly adding a loss term during training of the original network. While this approach effectively encourages smoothness in the network, it often leads to slightly inferior performance on the original task. Furthermore, increasing the regularization factor significantly degrades the original network's performance.

Comment

PART 1/2

Limited performance / KD baseline / High-Performing Teacher To address the concerns listed in the title, we designed two experiments to highlight the advantages of our proposed training strategies, which cannot be achieved with the baseline NeRN. As suggested, we directly compared our predictor network with a distilled network trained using conventional KD. In this setup, a student network is trained from scratch using ResNet50 teacher guidance. The results show that our decoupled training achieves an accuracy of 73.95%, outperforming the conventional KD approach, 73.60%. However, the advantages of the proposed method are not limited to this performance gain; it also allows for further iterative improvement in each case.

| Original network | Distilled network | Ours from Table 4 |
|------------------|-------------------|-------------------|
| 71.37%           | 73.60%            | 73.95% ± 0.09     |
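For reference, the conventional KD baseline referred to here is standard logit distillation in the style of Hinton et al.; a minimal sketch follows, with the temperature and loss weighting chosen purely for illustration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard logit distillation: softened KL term at temperature T plus a hard CE term.
    T and alpha are illustrative hyperparameters, not values from the paper."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage sketch: teacher = frozen ResNet50, student = network trained from scratch.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
kd_loss(student_logits, teacher_logits, labels).backward()
```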

First, we applied the progressive-reconstruction process by targeting the distilled network (73.60%) and proceeded to the second round. The first round of progressive-reconstruction resulted in improved performance beyond the target accuracy, and the second round achieved further enhancement. These results demonstrate that our method is not constrained by the initial training techniques and can effectively refine an already optimized baseline.

| Original network | Distilled network | Round 1 (Hidden 680, CR>1) | Round 2 (Hidden 680, CR>1) |
|------------------|-------------------|----------------------------|----------------------------|
| 71.37%           | 73.60%            | 73.88% ± 0.04              | 73.99% ± 0.01              |

We also investigated whether progressive-reconstruction can enhance our best-performing model (achieved through decoupled training with a high-performing teacher, as shown in Table 4). The results confirm that the first round improves upon the target accuracy (73.95%), and the second round achieves additional gains. These results also emphasize the effectiveness and flexibility of the proposed recipe in achieving further performance improvements.

| Original Network | Ours from Table 4 | Round 1 (Hidden 680, CR>1) | Round 2 (Hidden 680, CR>1) |
|------------------|-------------------|----------------------------|----------------------------|
| 71.37%           | 73.95% ± 0.09     | 74.15% ± 0.06              | 74.28% ± 0.03              |

Line 82-84, limitation on data usage It is correct that both the baseline NeRN and our decoupled training need the original task's data (though the iterative refinement with a large predictor uses only the reconstruction loss, so access to the original data is not required there). Despite that fact, we conducted an experiment that addresses the challenge of operating in a completely data-free environment. We employed uniformly sampled noise as input data for both methods, and our results demonstrate the effectiveness of our approach compared to the baseline NeRN even in the absence of meaningful data.

| Dataset (Acc. ↑, %) | Original ResNet | Recon-only    | Baseline      | Ours          |
|---------------------|-----------------|---------------|---------------|---------------|
| CIFAR10             | 91.69%          | 85.64% ± 0.39 | 86.31% ± 0.11 | 87.25% ± 0.02 |
| CIFAR100            | 71.37%          | 61.31% ± 0.45 | 63.92% ± 0.11 | 64.39% ± 0.01 |

Significantly degraded performance of quantized predictor We use the term "orthogonal" to indicate that quantization can be directly applied to the predictor network if further benefits are desired. While quantizing the predictor leads to a significant degradation in performance (e.g., from 70.84% to 51.72% on CIFAR100), this outcome is explainable. The degradation is not necessarily due to the brittleness of the predictor, but rather to the fact that the predictor is the weight generator. Even small changes in the predictor's outputs (the weights of the reconstructed network), such as those introduced by quantization, can significantly impact the performance of the reconstructed model. In contrast, quantizing the original network (e.g., ResNet) has less impact because it operates on a fixed structure rather than dynamically generating weights. Despite the degradation observed in the quantized predictor, our method (70.84%), using a predictor of the same size as the quantized ResNet56, achieves better performance than its counterpart (69.65%).
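To make the error-amplification argument concrete, the sketch below applies a simple symmetric 8-bit quantization to a weight tensor; this is a minimal illustration of how per-weight rounding error enters the weight generator, not the quantization scheme used in Table 5.

```python
import torch

def quantize_dequantize(w, num_bits=8):
    """Symmetric uniform quantization followed by dequantization (illustration only)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

predictor_weight = torch.randn(680, 680)        # hypothetical hidden layer of the predictor
w_q = quantize_dequantize(predictor_weight)
print((w_q - predictor_weight).abs().max())     # small per-weight rounding error...
# ...but every reconstructed kernel is a function of these perturbed weights, so the
# error propagates through weight generation and then again through inference.
```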

We also want to clarify that the primary objectives of our work differ from quantization methods. While improved model compression is one of the benefits offered by our approach, the core contribution lies in effectively performing NeRN-style reparameterization through our novel progressive-training and decoupled-training strategies.

Comment

Thank you for your response to my comments. However, I believe my concerns remain:

(i) The revised introduction and the abstract still highlight the performance gain from weight reconstruction, although this gain is minimal and barely exists. This is also seen in the added results of the rebuttal, where the accuracy increases by less than 1% on CIFAR10 after multiple iterations. This is also seen in the generalization experiments where the gain is less than 0.3% on transfer learning. Moreover, the rebuttal did not address the ImageNet case which displays an opposite trend (on a larger and more challenging dataset). To summarize, I believe the accuracy gain claim is not well supported, and the writing should reflect that.

(ii) The higher performing teacher still seems unfair. Distilling the teacher still seems like a better option.

I choose to keep my score.

Comment

(i) Performance gain is minimal and ImageNet experiments: Thank you for your additional feedback. We would like to further clarify the key objective of our proposed work, which is to enable effective reparameterization of deep models. While a variety of approaches have emerged, including recent work on using diffusion models to sample from a distribution of weights, we argue that INRs have the potential to enable a convenient and effective framework for model re-parameterization. However, realizing this potential requires a deeper understanding and improvement of NeRN optimization, which is the primary focus of our work.

Although the ImageNet results in Table 2 do not indicate performance matching or outperformance, it is important to note that even recovering the original performance is a non-trivial task, as evidenced by the baseline NeRN results. During reparameterization on small-scale datasets, particularly in the overparameterization regime (CR>1), we observed intriguing behavior suggesting that post-hoc iterative refinement could improve the original network's performance. However, our core objective remains effective reparameterization to accurately recover the original network. For ImageNet, our approach shows a 3% drop in performance compared to an 8% drop with the baseline NeRN under significant compression (CR = 15%), highlighting the substantial improvements achieved by our method over the baseline.

(ii) Amenable to distillation from teacher models: In Section 3.3, we make another interesting observation that our approach is not only effective at recovering a pre-trained model, but can also be leveraged to distill from better-performing teacher models. We include this experiment to emphasize that our re-parameterization is amenable to other post-hoc model refinement strategies. Note that distillation was only chosen as an example, and it is possible to utilize this with other approaches such as fine-tuning or performing arithmetic on task vectors from different fine-tuned models (e.g., Editing models with task arithmetic, ICLR 2023).

Regarding the concern of 'unfair', all results, including those for baseline NeRN and ours in Table 3 and 4, were obtained using the same guidance from a high-performing teacher model (ResNet50) to ensure fair comparison. These results further indicate that the optimization of the baseline NeRN can be significantly improved through our proposed approach. If you still believe this experimental setup is unfair, we kindly ask for further clarification on why you consider it so.

Review (Rating: 6)

The paper addresses the task of learning an implicit neural representation (INR) of a trained neural network’s weights. When the INR is smaller than the original model, it offers a potential approach to model compression. The authors conduct an in-depth analysis of both a naive baseline and the current state-of-the-art method (NeRN), yielding key insights: (i) increasing the INR size enhances the performance of the reconstructed model, (ii) iterative training of INRs (on the previous INR) can sometimes exceed the original model’s performance, and (iii) NeRN, despite having three objectives, is primarily driven by the reconstruction objective. These findings inform the proposed method, which separates the reconstruction and distillation phases. The approach introduces one or more reconstruction-only stages, followed by a distillation phase (focusing solely on logits, not features), enabling knowledge transfer from potentially stronger teachers. Extensive experiments validate the approach’s effectiveness.

Strengths

The paper addresses a potentially interesting task within the emerging field of weight-space learning, where neural networks are treated as data points to train other networks. While it may not yet compete with state-of-the-art quantization methods, the paper explores a promising direction that could inspire future advancements. The analysis and motivating experiments are comprehensive and intuitive, and the experiments are thoughtfully designed, demonstrating tangible improvements while simplifying the method relative to the baseline.

Weaknesses

The paper’s primary weaknesses are in presentation, particularly in the introduction, where motivation and context are insufficiently developed, and in its focus on ResNets alone. Both issues are potentially addressable in the rebuttal, as detailed below.

Presentation

The paper lacks a compelling motivation for learning implicit representations of neural networks. I believe the introduction should clearly explain why this is an interesting and valuable topic (I think this should be done even if the main reasons are purely academic and not commercial). Additionally, it presumes familiarity with NeRN, which may make it difficult for readers to follow without proper context. For instance, lines 44-46 discuss the contradictions in the multi-objective loss, yet the paper does not explain that multi-objective losses represent the current state-of-the-art or clarify what these objectives entail. Consequently, the value and relevance of the proposed method in decoupling objectives are unclear. Similarly, terms like “compression efficiency” (line 49) are introduced without context, leaving readers uncertain about their meaning within this work. Defining the motivation and the specific context of the compression task would make these points much more accessible.

Generalizing to Other Architectures

While the paper briefly mentions that the method is only applicable to CNNs, it is unclear why this limitation exists. Are there underlying assumptions preventing the method from generalizing to other architectures like ViTs? If there are no such constraints, presenting results on a ViT model would benefit the paper. Although the work remains relevant if limited to CNNs, its scope and applicability would be reduced if it cannot generalize to architectures beyond ResNets.


Smaller issues

In addition to the above primary weaknesses, below I describe a few additional issues and limitations, these are mostly smaller issues that do not carry a large weight in my decision but should nevertheless be addressed:

  1. While the paper focuses on implicit representations of neural networks, there is a growing body of research on learning semantic representations of neural networks and performing other tasks on the weights of neural networks; the paper did not cite any of these works, which I think should be done. See [1-12] for a few such examples.
  2. The term “inception like” is written a few times (e.g. line 44), what does this mean?
  3. A small mismatch in notation: in Eq. 1, in the FMD definition you use a^{l}, while in line 92 you use a^{\ell}.
  4. In Fig. 1, maybe show by how much the performance is improved (in the zoomed in part).
  5. Line 252 references Eq. 2, but the actual equation is unnumbered.
  6. Lines 264-266 are not very clear, and if they are important enough to be bold they should probably be rewritten. It took me a few passes to understand them.
  7. Just making sure, in 3.3, you first perform iterative refinement and then distill? The figure looks like they are simultaneously done and not sequentially.
  8. In Fig. 5, the 3.3 part, both the arrows are red, shouldn’t one be red and one blue?
  9. I would replace the term “baseline” in the tables with “NeRN” so that a reader can clearly understand what baseline you are using (and to give NeRN the credit it deserves).

[1] Predicting neural network accuracy from weights, 2020, Unterthiner et al.

[2] Towards Scalable and Versatile Weight Space Learning, 2024, Schurholt et al.

[3] Self-supervised representation learning on neural network weights for model characteristic prediction, 2021, Schurholt et al.

[4] Hyper-representations as generative models: Sampling unseen neural network weights, 2022, Schurholt et al.

[5] Learning Useful Representations of Recurrent Neural Network Weight Matrices, 2024, Herrmann et al.

[6] Learning to learn with generative models of neural network checkpoints, 2022, Peebles et al.

[7] Graph metanetworks for processing diverse neural architectures, 2023, Lim et al.

[8] Equivariant deep weight space alignment, 2023, Navon et al.

[9] Equivariant architectures for learning in deep weight spaces, 2023, Navon et al.

[10] Graph neural networks for learning equivariant representations of neural networks, 2024, Kofinas et al.

[11] Neural functional transformers, 2024, Zhou et al.

[12] Permutation equivariant neural functionals, 2024, Zhou et al.

Questions

As mentioned in the weaknesses section, I think the points for the rebuttal are:

  1. Improve the motivation and context in the introduction.
  2. Show an example of the method generalizing to ViTs (if indeed there are no underlying assumptions preventing the method from generalizing).
  3. Adding the relevant citations and other small issues.

An additional question I had is whether the iterative refinement also works for smaller INRs (e.g., 280).


In summary, the paper addresses a less-explored aspect of weight-space learning, offering improvements over existing methods with a simpler approach. Since the motivation and presentation need improving (and, if possible, the generalization), I currently assign the paper a score of 5. I will consider increasing my score if the authors satisfactorily address my concerns.

Comment

Presentation We appreciate the encouragement to provide a stronger motivation for our research and to highlight the appeal of the original NeRN method. In response, we have revised our introduction (lines 32-40), added a clearer explanation of the multi-objective loss function (lines 44-46), and removed ambiguous terms (lines 47-49).

Generalization to Transformers While our current work focuses on effectively performing NeRN's reparametrization on CNNs, we recognize the importance and value of extending NeRN to other architectures. Unfortunately, due to time constraints, we were unable to perform the extensive research to extend NeRN's application to transformers in this submission. However, we have carefully considered the challenges and requirements and will briefly discuss them here.

Generalizing NeRN to transformers would require significant modifications to its current setup, primarily in designing a suitable coordinate system and addressing the larger size and complexity of weight matrices in transformers. For example, the coordinate system would need to capture aspects such as layer index, head index, weight type (query, key, value), and block indices for submatrices (if weight matrices are decomposed into smaller submatrices). These additions would maintain NeRN's core principle of functional weight representation but would introduce new challenges during training due to the complex structure.
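As a purely hypothetical illustration of such a coordinate system (this extension was not implemented), each sub-block of a transformer weight matrix could be indexed by a tuple like the following.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerWeightCoord:
    """Hypothetical coordinate for one sub-block of a transformer weight matrix."""
    layer: int        # transformer block index
    head: int         # attention head index (unused for MLP weights)
    weight_type: str  # e.g. "query", "key", "value", "out_proj", "mlp_in", "mlp_out"
    block_row: int    # row index of the sub-matrix, if large matrices are tiled
    block_col: int    # column index of the sub-matrix

# The predictor would map an embedding of such a coordinate to the corresponding weight
# tile, e.g. predictor(embed(coord)) -> Tensor of shape (tile_rows, tile_cols).
coord = TransformerWeightCoord(layer=3, head=5, weight_type="query", block_row=0, block_col=2)
```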

Additionally, although NeRN does not impose architecture-specific assumptions, it relies on INR, which typically assumes that input coordinates represent a smooth and continuous space and that the learned function exhibits some level of smoothness. However, in transformers, the coordinates are discrete, and the corresponding target weight lacks continuity, making convergence and accuracy more challenging. NeRN addresses this for CNNs by introducing permutation-based smoothness, which promotes kernel smoothness by reordering pre-trained weights without altering their values. However, in transformers, the increased size and complexity of both the coordinate system and weight matrices would require further innovations to simplify the prediction task and ensure smooth learning.

Besides NeRN, there are other approaches that aim to predict transformer weights, such as diffusion-based weight prediction methods. These could potentially be adopted as the predictor network in place of NeRN, while our proposed iterative weight reconstruction could potentially still be applied.

Smaller Issues

  • Adding more citations: We added a short discussion of semantic representations to the related works section, including references to the suggested literature.

    Semantic representations of neural networks encode meaningful, interpretable features aligned with human-understandable concepts (references). However, our implicit representations encode information in a distributed and flexible manner, capturing complex patterns and relationships.

  • Inception-like: We acknowledge the confusion and have changed the term to "recursive" for clarity.

  • Typos and naming (line 92, line 252, NeRN): We have revised these typos and replaced 'baseline' with 'NeRN'.

  • Lines 264-266 are not very clear: We have revised lines 264-266 in the updated manuscript to improve clarity.

  • Clarity section 3.3 and Figure 5: Both arrows in Figure 5 should indeed be red, as they represent the decoupled training process described in Section 3.3. Specifically, Table 3 shows results for a parameter-efficient predictor (CR<1) with logit distillation, while Table 4 shows results for a larger predictor (CR>1) using the same method. The blue arrow represents progressive reconstruction, which can further refine a decoupled model by targeting this high-performing model. To clarify, we have provided additional results in the following table. We hope this explanation and added results clarify the use of colors in the figure.

| Original Network | Ours from Table 4 | Round 1 (Hidden 680, CR>1) | Round 2 (Hidden 680, CR>1) |
|------------------|-------------------|----------------------------|----------------------------|
| 71.37%           | 73.95% ± 0.09     | 74.15% ± 0.06              | 74.28% ± 0.03              |
  • To evaluate whether progressive reconstruction enhances smaller NeRNs, we used a compact NeRN model with 70.84% accuracy ("Ours" in Table 2) as the target network. As shown in the results, applying a second round of progressive reconstruction (Hidden 680, CR>1) improved performance beyond the target accuracy (70.84%) and brought it closer to the original accuracy (71.37%). This suggests that smaller NeRNs can benefit from progressive reconstruction without directly targeting the original weights.
| Original Network | Ours (Hidden 280, from Table 2) | Round 1 (Hidden 680, CR>1) | Round 2 (Hidden 680, CR>1) |
|------------------|---------------------------------|----------------------------|----------------------------|
| 71.37%           | 70.84%                          | 71.13% ± 0.01              | 71.25% ± 0.004             |
Comment

I appreciate the authors’ efforts in addressing my comments. The revisions to the introduction and the updates made in response to other reviewers’ feedback have improved the overall presentation of the paper.

However, I remain unconvinced as to why generalization to ViTs was not feasible during the rebuttal phase. As previously mentioned, the absence of ViT-related experiments limits the scope of the findings.

After reviewing the other comments, I agree with 7EjM and ZS5i that the method primarily smooths the weights, which is not novel. Nevertheless, the paper’s key observation on iterative refinement is intriguing and has the potential to benefit the weight-space learning community. I am therefore raising my score to a 6. However, I strongly encourage the authors to include experiments on ViTs or other architectural types in the camera-ready version of the paper.

Comment

Thank you for recognizing the improvements in our paper's presentation. We sincerely appreciate your constructive feedback and your decision to raise the score to acceptance. As you suggested, we will prioritize extending our work to transformers and aim to include an example in the final version.

Review (Rating: 6)

This study explores the trade-off between accuracy and parameter efficiency in neural networks using predictor networks. It reveals that the predicted model can exceed the original's performance solely through reconstruction objectives, with improvements accumulating over successive reconstructions. The research also proposes a new training scheme that separates reconstruction from auxiliary objectives, leading to significant enhancements in both model accuracy and predictor network efficiency compared to existing methods.

Strengths

  1. The paper is well-written with a clear and reasonable motivation. The problem setup and comparison with other methods are well articulated.
  2. Extensive validation is conducted on multiple datasets, showing consistent improvements. The figures and tables are presented clearly and logically.
  3. The proposed two-stage strategy not only compensates for the shortcomings of the baseline but also achieves better performance.

Weaknesses

  1. The method presented in the paper targets a trade-off between accuracy and compression rate. Can the advantages gained over the baseline pre-trained weights, as demonstrated in Table 2, generalize to a broader range of downstream tasks?
  2. Will the progressive enhancement of the teacher network lead to corresponding progressive improvements in performance?
  3. Can these advantages extend beyond CNN-based architectures, such as to the pre-training of Vision Transformers (ViT) or hybrid architectures combining ViT and CNN?

Questions

See weakness.

Comment

Downstream tasks To evaluate the generalization of our predicted weights, we designed a transfer learning experiment comparing the original weights (baseline pretrained on CIFAR-100) and our predicted weights (trained on CIFAR-100) when applied to the CINIC-10 dataset. We tested two scenarios: 1) Fine-tuning all layers to fully adapt the model to the new task, and 2) Fine-tuning only the linear layer to highlight the generalizability of convolutional weights. In both cases, models were fine-tuned for 10 epochs, while the 'From Scratch' model was trained for 150 epochs. The results show that our predicted weights (Hidden 360, CR < 1) consistently outperform the original weights in both scenarios, achieving higher accuracy. Additionally, using predicted weights from progressive reconstruction (Round 5, Hidden 680, CR > 1) further improves generalization slightly. Overall, these results confirm that the proposed method generalizes well to downstream tasks.

| S → T              | Tuning Layers | From Scratch | Original Weights | Predicted Weights (Hidden 360, CR<1) | Predicted Weights (Hidden 680, CR>1) |
|--------------------|---------------|--------------|------------------|--------------------------------------|--------------------------------------|
| CIFAR100 → CINIC10 | All Layers    | 76.88%       | 80.22% ± 0.05    | 80.32% ± 0.04                        | 80.35% ± 0.03                        |
| CIFAR100 → CINIC10 | Linear Layer  | 76.88%       | 58.95% ± 0.03    | 58.98% ± 0.04                        | 59.20% ± 0.04                        |

Progressive enhancement of high-performing networks We are not entirely sure about the interpretation of the question. Please provide clarification if you need more information. We think you are asking whether the progressive-reconstruction can generalize to a higher-performing network built upon the original network, such as those enhanced through techniques like knowledge distillation. To address this, we designed the following experiments: Instead of applying progressive-reconstruction to the original network (71.37%), we used a distilled network (73.60%) where knowledge distillation with a high-performing teacher was used while training the original network. This experiment is crucial because it demonstrates that our method is not constrained by the training techniques. The results confirm that progressive-reconstruction, when applied to the distilled network, also achieves better performance than the target accuracy (73.60%). This aligns with the findings reported in Table 1, which illustrates that our method effectively improves performance, even when targeting a more optimized baseline.

| Original network | Distilled network | Round 1 (Hidden 680, CR>1) | Round 2 (Hidden 680, CR>1) |
|------------------|-------------------|----------------------------|----------------------------|
| 71.37%           | 73.60%            | 73.88% ± 0.04              | 73.99% ± 0.01              |

Extend beyond CNN-based architecture While our current work focuses on effectively performing NeRN's reparameterization on CNNs, we recognize the importance and value of extending NeRN to other architectures. Unfortunately, due to time constraints, we were unable to show an example of NeRN's application to transformer architecture in this submission. However, we have carefully considered the challenges and requirements.

Generalizing NeRN to transformers would require significant modifications to its current setup, primarily in designing a suitable coordinate system and addressing the larger size and complexity of weight matrices in transformers. For example, the coordinate system would need to capture aspects such as layer index, head index, weight type (query, key, value), and block indices for submatrices (if weight matrices are decomposed into smaller submatrices). These additions would maintain NeRN's core principle of functional weight representation but would introduce new challenges during training due to the complex structure.

Additionally, although NeRN does not impose architecture-specific assumptions, it relies on INR, which typically assumes that the input coordinate system represents a smooth and continuous space and that the learned function exhibits some level of smoothness. However, in transformers, the coordinates are discrete, and the corresponding target weight lacks continuity, making convergence and accuracy more challenging. NeRN addresses this for CNNs by introducing permutation-based smoothness, which promotes kernel smoothness by reordering pre-trained weights without altering their values. However, in transformers, the increased size and complexity of both the coordinate system and weight matrices would require further innovations to simplify the prediction task and ensure smooth learning.

Besides NeRN, there are other approaches that aim to predict transformer weights, such as diffusion-based weight prediction methods. These could potentially be adopted as the predictor network in place of NeRN, while our proposed iterative weight reconstruction could potentially still be applied.

Comment

Thanks for resolving my confusion with your experiments; I think this is an interesting work.

Comment

Thank you for finding our work interesting. Please let us know if you have any further questions.

Review (Rating: 5)

In this work, the authors propose using predictor networks to achieve increased accuracy in two ways. The first one is to train the predictor network iteratively using only a reconstruction loss. The increased accuracy is a product of the weight smoothing caused by the reconstruction loss. In the second part, the authors propose to detach the reconstruction loss and distillation loss (as used in previous works) and apply them sequentially. They argue that the distillation gains are limited by the reconstruction loss when the two are used simultaneously.

Strengths

This paper is mostly well-written and has some interesting ideas. It also provides a concise and clear overview of the problem, the literature review is comprehensive, and the experiments are thorough. Finally, the problems that this work tries to address, like model compression and improved representation learning, are relevant to the community.

Weaknesses

I have major concerns related to the technical contributions and the practicality of the proposed approach. The method achieves increased accuracy because of the smoothing of the weights due to the reconstruction (or iterative reconstruction), which is not surprising, and I think there are several other (easier) ways to increase accuracy, generalization, robustness, etc., by smoothing weights, which makes me believe this approach would not be very practical (In the paper there are no comparisons to these easier alternatives). Another reason I believe this approach is not practical is that reducing model storage is not as big of a concern as reducing the memory needed for the model during training or deployment.

For instance, let's assume we have an optimally trained model with weight smoothing regularization. For the proposed approach to achieve similar performance to the original model, it will require two networks, i.e., the original and the predictor (e.g., ~40% of the DoF of the original); then we need several rounds of reconstruction where the predictor learns to predict the original weights, and then we need a training stage using only the distillation loss. The only real gain will be the reduced memory for model storage (which has to be unwrapped to a larger model to produce predictions), which currently is not a real concern, at least in the applications described in the paper.

Despite this work having some interesting ideas, in its current version, I don't see any major benefits in using it. Please see the questions below for more specifics.

Questions

  • The problem setup in this paper is not clear; is the original network with parameters W pretrained for the target task? How is the dataset split used to train W and evaluate, for instance, the accuracy vs. reconstruction error results? Which task are these models evaluated on, and what models are used? For instance, what is the size of the predictor network in the experiments in Sec. 3.1? From some of the figures (e.g., figure titles or legends) later in the paper I could see some of these details, but they should be summarized before describing the results in Sec. 3.

  • In Figure 1 (left) the authors show an expected behavior of the tradeoff between accuracy and reconstruction error; what is this figure based on? why is the expected reduction linear? why is the expected accuracy at a reconstruction error of 0.015 around 20%? I assume the authors used the observed results to build the expected plot, but then again, why the linear behavior instead of the approximate negative quadratic observed in the right plot? Additionally, could the slightly increased accuracy be a product of the variance in the results? error bars and axis labels of the zoomed-in crop of the right plot would be helpful.

  • Is the goal of the reconstruction loss to learn to predict the parameters of a network previously trained? If so, why is the accuracy higher with a non-zero reconstruction error? Can we then assume that the "original" model was not optimally trained? For instance, if smoothing the weights of the network increases its accuracy on a given task, should not the network be trained with a different regularization? e.g. a higher weight decay, which would also smooth weights and suppress high-frequency components.

  • How does the proposed iterative smoothing using predictor networks compare to other weights smoothing techniques like training regularizations or smoothing constraints (e.g., weight decay, dropout, etc.), and the mentioned methods in lines 157-162 (e.g., NeRN with regularization-based smoothness)?

  • It has been widely studied that weight smoothing increases performance, generalization, robustness to noise, etc., and many techniques have been proposed to achieve this, so I don't think it is that surprising of a find, and getting an increased accuracy by smoothing parameters via a predictor network seems overkill to me.

  • It would be helpful to have results comparing the proposed method with distillation and using distillation to train a smaller network directly (i.e., instead of training a predictor network for a larger model). It seems to me that the advantages of using a predictor network are not that great if the model has to be unwrapped to produce the predictions. Currently, GPU memory is a greater issue than model storage.

Minor comments:

  • In lines 214-215 the notation can be confusing, i.e., the predictor P with Q learnable parameters and the original network O with P learnable parameters.

Ethics Review

No ethical concerns

Comment

Alternative methods exist for weight smoothing There are different methods to achieve weight smoothing, either during model training (e.g., weight decay, dropout) or through post-hoc approaches applied directly to pre-trained model weights (e.g., frequency filtering, modulating singular values). Our proposed approach focuses on modifying pretrained weights. In Appendix A, we explore alternatives for directly applying weight smoothing, either through modulating singular values or through Fourier transformation and filtering. The downsides of these methods are that they are very sensitive to the right parameter setting or require checks against the validation dataset. Regularization-based methods usually come with a generalization-accuracy trade-off. Our method doesn't have these restrictions, and we show below that we can iteratively increase the accuracy of already-optimized models using just the reconstruction loss.
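As one concrete example of such a post-hoc alternative, a singular-value modulation of a pretrained weight matrix might look like the sketch below; the keep/shrink schedule is an illustrative assumption and, as noted above, such direct approaches are sensitive to the chosen parameters.

```python
import torch

def smooth_by_singular_values(w, keep=0.5, shrink=0.1):
    """Scale down the smaller singular values of a 2D weight matrix (illustrative only)."""
    u, s, vh = torch.linalg.svd(w, full_matrices=False)
    k = max(1, int(keep * s.numel()))     # keep the top-k singular values untouched
    s_mod = s.clone()
    s_mod[k:] = s[k:] * shrink            # shrink the tail of the spectrum
    return u @ torch.diag(s_mod) @ vh

# Example: a conv layer reshaped to 2D (out_channels x in_channels * k * k).
w = torch.randn(64, 64 * 3 * 3)
w_smooth = smooth_by_singular_values(w, keep=0.5, shrink=0.1)
```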

Reducing model storage is not as big of a concern as reducing the memory Training a predictor network doesn't necessarily mean requiring more memory. With a small predictor, we can learn through batch training over the original weights and therefore require significantly less memory than the original model (the same holds for reconstructing the original weights). Compression is just one of the potential benefits of the proposed method. The reconstruction setup with a larger predictor aims at improving model performance rather than compression. At inference time, we will have a model with the same memory footprint but improved performance. Moreover, we show that the reconstructed weights have other benefits, such as better generalization on downstream tasks (see the response to reviewer GbWd).

Problem setup in this paper is not clear We will update the paper to clarify the points raised by the reviewer. Specifically, we always use the test split for all model evaluations. Also for the reconstruction tasks, the predictor is trained to predict the model weight directly so no data split is involved. Only the training split is used for the knowledge distillation setup. We also updated the caption of Figure 1 and the description in the text of section 3.1.

Figure 1 (left) is confusing, and the slightly increased accuracy (Figure 1 right) may be due to variance We agree that the left figure is confusing, so we will remove it from the final version and have adjusted the text describing the figure. We also changed the caption of the figure to describe the meaning of the dots, i.e., representing results from predictors with different hidden layer sizes. Additionally, we provide the mean and standard deviation across three runs, demonstrating that the observed behavior is not due to variance in the results.

| Original | Hidden 750   | Hidden 680   | Hidden 510   | Hidden 360   | Hidden 320   | Hidden 280   | Hidden 220   |
|----------|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| 71.37%   | 71.56 ± 0.05 | 71.61 ± 0.01 | 71.45 ± 0.07 | 67.48 ± 0.10 | 61.31 ± 0.45 | 49.55 ± 1.72 | 24.20 ± 1.56 |

Comparison with distillation We designed three experiments to highlight the advantages of our proposed training strategies. First, we compare our predictor network with a distilled network trained using conventional KD. Here, a student network is trained from scratch using teacher guidance from ResNet50. The results show that our decoupled training achieves an accuracy of 73.95%, outperforming the conventional KD approach, 73.60%. However, the advantages of the proposed method are not limited to this performance gain; it also allows for further iterative improvement in each case.

| Original network | Distilled network | Ours from Table 4 |
|------------------|-------------------|-------------------|
| 71.37%           | 73.60%            | 73.95% ± 0.09     |

Second, we applied the progressive-reconstruction process on the distilled network for two rounds. Both result in improved performance beyond the target accuracy. This behavior demonstrates that our method is not constrained by the initial training techniques and can effectively refine an already optimized baseline.

| Original network | Distilled network | Round 1 (Hidden 680, CR>1) | Round 2 (Hidden 680, CR>1) |
|------------------|-------------------|----------------------------|----------------------------|
| 71.37%           | 73.60%            | 73.88% ± 0.04              | 73.99% ± 0.01              |

Third, we also investigated whether progressive-reconstruction can enhance our best-performing model (achieved through decoupled training with a high-performing teacher, as shown in Table 4). The results confirm improvements over the target accuracy in each round, emphasizing the effectiveness and flexibility of the proposed recipe.

| Original Network | Ours from Table 4 | Round 1 (Hidden 680, CR>1) | Round 2 (Hidden 680, CR>1) |
|------------------|-------------------|----------------------------|----------------------------|
| 71.37%           | 73.95% ± 0.09     | 74.15% ± 0.06              | 74.28% ± 0.03              |
Comment

I thank the authors for addressing the problem setup issues and providing error bars for the results. However, some of my concerns remain. For instance, the authors point out the difference between in-training weight smoothing (like weight decay) and post-training weight smoothing (like their approach), but no comparisons were made, and none of my concerns related to this were addressed. Post-training smoothing can have advantages over in-training regularization, but none of these analyses were given. With respect to memory savings, I agree with the authors that training a small predictor network requires less memory than training the target model directly, but no significant advantages are given by using predictor networks in this regard since the target model still has to be used to get predictions after the smoothing.

I also agree with several concerns from Reviewer ZS5i like the claim on accuracy improvement with almost negligible improvements in small datasets and the smoothness analysis.

I think my main concern about the practicability of this approach still remains; the method achieves increased accuracy because of the smoothing of the weights due to the reconstruction, which is not surprising, and I think there are several other (easier) ways to increase accuracy, generalization, robustness, etc., by smoothing weights, which makes this approach overkill for what it achieves, in my opinion. I will keep my previous recommendation.

Comment

Thank you for your additional feedback. In response, we would like to highlight the practical value of the proposed work. The key objective of this work is to enable an effective reparameterization of deep models. While a variety of approaches have emerged, including recent work on using diffusion models to sample from a distribution of weights, we argue that INRs have the potential to enable a convenient and effective framework for model re-parameterization. However, to realize this potential, there is a need to understand and improve the optimization of NeRNs, which is the focus of our work. Here, we summarize the benefits of this approach:

(i) Recover original model performance: Through the new empirical insights made in this work, we are able to obtain effective INR re-parameterizations of neural networks -- significantly superior to baseline NeRNs in terms of both ID and OOD performance recovery.

(ii) Improve memory efficiency: An important by-product of this re-parameterization is that the parameter count in the re-parameterized INR is significantly lower than the original network, thereby reducing the memory storage requirements.

(iii) Improve inference-time memory efficiency: Unlike other existing re-parameterizations (e.g., using a diffusion model to sample from the distribution of weights), our approach adopts a coordinate system for different (layer, filter, weight) indices. Consequently, our approach can improve the inference-time memory efficiency in two different ways: (a) it is not required to load the entire network into on-device memory in one shot -- instead only specific layers can be reconstructed at a given time (e.g., load one layer at a time) and forward passes can be done; (b) Pruning is another common strategy used to improve inferencing efficiency of models -- our re-parameterization will allow us to load only the relevant filters (identified by pruning the original network during the training phase itself) and improve the memory efficiency.
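A rough sketch of point (a), reconstructing one layer at a time and discarding its weights after the forward pass, is shown below; the predictor interface and the layer_specs structure are hypothetical, and batch norm and residual connections are omitted for brevity.

```python
import torch

def streaming_forward(x, predictor, layer_specs):
    """Rebuild one conv layer at a time from the INR predictor, apply it, then free it.
    `predictor` and `layer_specs` are hypothetical interfaces used only for illustration;
    batch norm and residual connections are omitted for brevity."""
    for spec in layer_specs:                                   # one spec per conv layer
        with torch.no_grad():
            w = predictor(spec["coords"]).view(spec["shape"])  # rebuild this layer's kernels
        x = torch.nn.functional.conv2d(x, w, stride=spec["stride"], padding=spec["padding"])
        x = torch.relu(x)
        del w                                                  # only one layer's weights live in memory
    return x
```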

(iv) Enable post-hoc weight smoothing: We agree with the reviewer’s assessment that this is not the only approach to achieve weight smoothing. However, we want to emphasize that weight smoothing is typically performed during model training. In contrast, our approach enables post-hoc smoothing of model weights through the INR-based re-parameterization. To this end, we make an interesting observation that progressive training of the INR model with only the reconstruction objective achieves the desired smoothing.

Performance gain is minimal and ImageNet experiment: Although the ImageNet results in Table 2 do not indicate performance matching or outperformance, it is important to note that even recovering the original performance is a non-trivial task, as evidenced by the baseline NeRN results. During reparameterization on small-scale datasets, particularly in the overparameterization regime (CR>1), we observed intriguing behavior suggesting that post-hoc iterative refinement could improve the original network's performance. However, our core objective remains effective reparameterization to accurately recover the original network. For ImageNet, our approach shows a 3% drop in performance compared to an 8% drop with the baseline NeRN under significant compression (CR = 15%), highlighting the substantial improvements achieved by our method over the baseline.

We sincerely appreciate each reviewer’s valuable feedback on our work. In the final version, we will aim to include results for: (1) the base model trained with weight decay and (2) our re-parameterization applied to both models.

Review (Rating: 6)

The authors study the fundamental trade-off between accuracy and parameter efficiency in neural network weight parameterization using predictor networks. They present a finding where the predicted model not only matches but also surpasses the original model's performance through the reconstruction objective (MSE loss) alone. Experiments are done on CIFAR, STL, and ImageNet.

Strengths

  1. the topic is of interest and benefits the community
  2. the proposed "separation" is flexible

Weaknesses

  • the "low reconstruction error" seems to be a bit arbitrary and I do not see a very good way to find it.
  • The reasoning/intuition for why the proposed method even "improves" the performance is lacking
  • see more in questions.

minor issues:

  1. typos near line 509-510
  2. Fig 5 can be earlier?

Questions

  1. Fig. 1: "While one expects that the reconstruction error must approach zero to recover the true performance" How was the "low error" defined? empirically searched?
  2. Does the method only work with CNNs? I thought not; perhaps it is simply that no experiments were done on other architectures. It is worth trying on transformers.

Comment

Low reconstruction error We understand that "low reconstruction error" may seem broad in general contexts. In our study, reconstruction error serves as an important analytical metric, as it is directly related to the performance of the reconstructed network. We added more context in Section 3.1 and Figure 1. For example, achieving zero reconstruction error would theoretically result in performance identical to that of the original network. While one might expect performance to consistently decrease as reconstruction error increases, we observed cases where the reconstructed network's performance exceeds the original network's performance. To clarify, we measured the reconstruction error and the corresponding performance over 3 runs across seven different predictor hidden sizes, with each point on the graph representing, from left to right, the original performance and a specific hidden size: original, 750, 680, 510, 360, 320, 280, and 220.

|                      | Original Network | Hidden 750    | Hidden 680    | Hidden 510    | Hidden 360    | Hidden 320    | Hidden 280    | Hidden 220    |
|----------------------|------------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
| Accuracy             | 71.37%           | 71.56% ± 0.05 | 71.61% ± 0.01 | 71.45% ± 0.07 | 67.48% ± 0.10 | 61.31% ± 0.45 | 49.55% ± 1.72 | 24.20% ± 1.56 |
| Reconstruction Error | 0                | 0.00036       | 0.00068       | 0.00252       | 0.00771       | 0.00967       | 0.01177       | 0.015116      |

Reasoning/intuition is lacking To explain the observed performance improvement, we introduce the S_ratio metric for two key reasons: 1) it clearly distinguishes the behaviors of predicted and original weights, and 2) it supports the weight smoothing hypothesis, where progressive-reconstruction promotes weight smoothing, as evidenced by higher S_ratio values (indicative of lower-frequency components), thereby leading to performance improvement. Figure 2 shows that Round 1 weights exhibit higher S_ratio, particularly in the later layers (i.e., closer to the decision layer). This trend continues across additional rounds of progressive-reconstruction (Figure 3 (b), up to Round 5). This suggests that progressive-reconstruction smooths weights by suppressing lower singular values.

To further validate this hypothesis, we conducted additional analyses to explicitly test the relationship between weight modulation and performance improvement. 1) Frequency modulation (Figure 7): Applying a low-pass filter to the weight matrix demonstrated its impact on performance. 2) Singular value modulation (Figure 8): Scaling down less significant singular values showed their contribution to performance. 3) Singular values vs frequency (Figure 9): This analysis is particularly important as modulating singular values does not inherently guarantee the removal of high-frequency components. It shows that the weights after low-pass filtering exhibit higher S_ratio values, suggesting a potential link between frequency- and singular value-based modulations. We believe our analysis provides important reasoning and intuition behind the performance improvement of the proposed method. This aligns with findings from many prior works, further supporting the validity of our explanation.
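As an illustration of the first analysis, a low-pass filter over a (reshaped) weight matrix can be implemented roughly as follows; the cutoff and the 2D-FFT formulation are assumptions for illustration, not the exact procedure behind Figure 7.

```python
import torch

def low_pass_filter_weights(w, cutoff=0.5):
    """Zero out high-frequency components of a 2D weight matrix via the FFT (illustrative)."""
    spec = torch.fft.fft2(w)
    rows = torch.fft.fftfreq(w.shape[0]).abs()      # normalized frequencies in [0, 0.5]
    cols = torch.fft.fftfreq(w.shape[1]).abs()
    keep = (rows[:, None] <= cutoff * 0.5) & (cols[None, :] <= cutoff * 0.5)
    filtered = torch.where(keep, spec, torch.zeros_like(spec))
    return torch.fft.ifft2(filtered).real

w = torch.randn(64, 64 * 3 * 3)                     # conv layer reshaped to 2D
w_low = low_pass_filter_weights(w, cutoff=0.5)      # keep only the lower half of the band
```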

Figure 5's position We moved Figure 5 to the beginning of the section.

Extend beyond CNN-based architecture Our current work focuses on effectively performing NeRN's reparameterization on CNNs. We have carefully considered the challenges and requirements for extending NeRN to transformers. Generalizing NeRN to transformers would require significant modifications to its current setup, primarily in designing a suitable coordinate system and addressing the larger size and complexity of weight matrices in transformers. For example, the coordinate system would need to capture aspects such as layer index, head index, weight type (query, key, value), and block indices for submatrices (if weight matrices are decomposed into smaller submatrices). These additions would maintain NeRN's core principle of functional weight representation but would introduce new challenges during training due to the complex structure.

Additionally, although NeRN does not impose architecture-specific assumptions, it relies on INR, which typically assumes that the input coordinate system represents a smooth and continuous space and that the learned function exhibits some level of smoothness. However, in transformers, the coordinates are discrete, and the corresponding target weight lacks continuity, making convergence and accuracy more challenging. NeRN addresses this for CNNs by introducing permutation-based smoothness, which promotes kernel smoothness by reordering pre-trained weights without altering their values. However, in transformers, the increased size and complexity of both the coordinate system and weight matrices would require further innovations to simplify the prediction task and ensure smooth learning.

AC Meta-Review

This paper explores the trade-off between accuracy and parameter efficiency in neural networks using smaller predictor networks to predict neural network weights. With successive rounds of reconstruction, by decoupling the reconstruction and distillation processes, the model's accuracy can even exceed that of the original model. The proposed method is applied to CNNs on datasets such as CIFAR and ImageNet.

It received 5 reviews and an extensive rebuttal, with final ratings of 6, 5, 5, 6, 6. Reviewer 1 raised concerns about technical novelty, the arbitrary nature of "low reconstruction error", a lack of explanation regarding why the method would improve performance, and generalization to other architectures like transformers. Reviewer 2 doubted the practicality of the approach, with negligible gains in model performance, no memory reduction during deployment, and no comparisons with simpler weight smoothing methods. Reviewer 3 questioned whether the method could generalize better beyond CNNs, to other downstream tasks, and with progressively improved teacher models. Reviewer 4 agreed with Reviewers 2 and 5 that the method primarily smooths the weights (not novel in itself), and that the absence of ViT experiments is very limiting to the scope of the work. Reviewer 5 found the accuracy claim not supported by the experimental results: minimal to barely existent performance gains on small datasets and a worse trend on a larger and more challenging dataset such as ImageNet. The rebuttal clarified the improved memory efficiency during inference and the ability to apply the method in a data-free environment, emphasized the goal of re-parameterization, and provided additional experiments and explanations regarding the observed performance gains over the baseline NeRN.

Despite the authors' rebuttal addressing several concerns, the reviews indicated significant skepticism regarding the performance gain, the experimental rigor, generalization to non-CNN architectures, technical novelty, and practical impact of the proposed method. Although the method shows potential, especially in the domain of weight-space learning, the paper ultimately did not demonstrate sufficient transformative contributions to warrant acceptance at this stage. Therefore, the final decision is rejection.

Additional Comments on Reviewer Discussion

Reviewers engaged in the rebuttal process with the authors and cross-referenced each others' comments.

Final Decision

Reject