PaperHub
Score: 6.4 / 10
Poster · 4 reviewers
Ratings: 3, 5, 4, 4 (min 3, max 5, std 0.7)
Confidence: 3.8
Novelty: 2.8 · Quality: 2.3 · Clarity: 2.3 · Significance: 2.3
NeurIPS 2025

Scaling Up Parameter Generation: A Recurrent Diffusion Approach

OpenReview · PDF
Submitted: 2025-05-04 · Updated: 2025-10-29

Abstract

Keywords
Parameter Generation

Reviews and Discussion

Review (Rating: 3)

The paper introduces Recurrent Diffusion for Large-Scale Parameter Generation (RPG), a novel framework for generating full neural network parameters (up to hundreds of millions) on a single GPU. By partitioning parameters into tokens, modeling inter-token relationships via a recurrent mechanism, and synthesizing parameters through diffusion, RPG achieves performance comparable to trained models across diverse architectures (ResNets, ViTs, ConvNeXts, LoRA-based LLMs).
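For concreteness, the tokenization step can be sketched as follows; the token size, padding scheme, and names are illustrative assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

TOKEN_DIM = 4096  # assumed number of parameters per token

def tokenize_parameters(model: nn.Module) -> torch.Tensor:
    """Flatten a target network's weights into (num_tokens, TOKEN_DIM) tokens."""
    flat = torch.cat([p.detach().flatten() for p in model.parameters()])
    pad = (-flat.numel()) % TOKEN_DIM              # zero-pad to a whole token
    flat = torch.cat([flat, flat.new_zeros(pad)])
    return flat.view(-1, TOKEN_DIM)

# e.g. ResNet-18's ~11.7M parameters become ~2.9K tokens of 4096 values each
```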

Strengths and Weaknesses

Strengths

  1. The paper presents a well-structured framework with a clearly articulated methodology, including parameter tokenization and recurrent diffusion processes.
  2. The experimental validation is rigorous, encompassing diverse architectures (ResNets, ViTs, ConvNeXts, LLMs) and tasks (ImageNet, COCO, ADE20K, commonsense reasoning). Notably, the work bridges a critical scale gap by generating parameters for models with up to 200M weights.

Weaknesses

  1. Limited architectural generalization: The method lacks validation on generating parameters for unseen neural network architectures, raising questions about its broader applicability.
  2. Computational cost transparency: The proposed approach relies on multiple pre-trained checkpoints as training data (e.g., 50 checkpoints for ImageNet-1K, lines 162–166). While the inference-time efficiency of parameter generation is highlighted, the total computational overhead—including (a) checkpoint collection and (b) diffusion model training—is not compared to directly training a model on the target task. This omission makes the method’s practical efficiency unclear.
  3. Performance gap on unseen tasks: As shown in Table 8, the generated parameters for novel tasks significantly underperform compared to standard training baselines.
  4. Unclear practical utility: The paper does not sufficiently justify why recurrent diffusion-based parameter generation is preferable to direct training in real-world scenarios. Concrete use cases or advantages (e.g., cross-architecture transfer, resource savings) remain unspecified.

Questions

  1. How does the method perform when generating parameters for neural network architectures not seen during training?
  2. What is the total computational overhead—including checkpoint collection and diffusion model training—compared to directly training a model on the target task, and under what conditions does this approach offer efficiency gains?
  3. Why does the generated parameter performance degrade significantly on unseen tasks (as shown in Table 8)?
  4. In what real-world scenarios would recurrent diffusion-based parameter generation be more advantageous than direct training, and are there measurable benefits (e.g., cross-architecture transfer, resource savings) that justify its use?

Limitations

N/A

Final Justification

Please see my latest comment.

Formatting Issues

N/A

Author Response

Q1: How does RPG perform on architectures not observed during training?

A1: Thank you for the insightful comment. RPG targets functional generalization: producing weights for unseen tasks while the network topology remains fixed. Extending RPG to handle architecture changes belongs to structural generalization, a distinct research direction tied to design and hardware considerations. We therefore confine this study to a fixed architecture, where the challenges are already substantial and the practical value clear. We will clarify this limitation and future extension in the revision.

Q2: The total computational overhead is not compared to directly training a model on the target task.

A2: We acknowledge that collecting checkpoints and training RPG requires several hours or days, but once trained, it can generate models for target tasks indefinitely. The advantage of direct training on these tasks decreases as the number of tasks increases.

This is analogous to an image generator: once trained, it can create unlimited images cheaply, and we do not count manually drawing every potential image as part of its training cost.

Q3: Why does the generated parameter performance degrade significantly on unseen tasks?

A3: Thank you for the question. We clarify the details below.

  • Unseen tasks are evaluated in a zero-shot setting, so some decline relative to full-shot training is expected.

  • The drop is modest: RPG still outperforms random initialization both at the outset and after finetuning.

Consequently, RPG provides faster convergence and higher final accuracy, as shown in Table 18. For your convenience, we reproduce the key numbers below.

  • Setting: We compare three approaches: (i) training ViT-Tiny from scratch, (ii) fine-tuning an ImageNet-pretrained ViT-Tiny, and (iii) fine-tuning an RPG-initialized model.

    | epoch | from scratch | fine-tune | RPG init + fine-tune |
    |---|---|---|---|
    | 0 | 50.0 | 52.9 | 94.4 |
    | 1 | 61.4 | 90.3 | 96.8 |
    | 5 | 69.1 | 96.3 | 97.3 |
    | 10 | 74.9 | 97.1 | 97.5 |
    | 50 | 86.7 | 97.7 | 97.8 |
  • Conclusion: RPG significantly accelerates training for unseen tasks. These results establish RPG initialization as an effective approach that substantially reduces training costs while maintaining competitive performance.

Q4: In what real-world scenarios would recurrent diffusion-based parameter generation be more advantageous than direct training, and are there measurable benefits that justify its use?

A4: Thanks for the comment. RPG offers several practical advantages in real-world scenarios:

  • Efficient initialization: RPG provides superior parameter initialization compared to random weights, significantly reducing training time and computational costs as demonstrated in Table 18.
  • Few-shot learning: For tasks with limited training data, RPG-generated parameters serve as informed priors that enable better performance than training from scratch.
  • Rapid model deployment: RPG enables near-instantaneous model adaptation for new tasks without requiring extensive training, making it valuable for real-time applications where quick deployment is critical.
Comment

Thank you for the detailed rebuttal. However, my primary concerns about the model's scalability to unseen architectures and its high computational overhead have not been fully addressed, so I will be maintaining my original score.

Comment

Dear Reviewer yKm7,

We greatly appreciate your feedback and would like to address your concerns.

The primary focus of our work is to scale up parameter generation for neural networks. In addition, our work is the first to explore the generalization of parameter generation to unseen tasks, which we believe is a significant contribution to the field.

We acknowledge that synthesizing parameters for various unseen architectures is a challenging and unsolved research problem. However, it is beyond the scope of this work. Our study specifically emphasizes the scalability and generalization capabilities of parameter generation, which are critical first steps toward addressing these broader challenges.

We kindly ask for your understanding of the focus of our work and hope you can evaluate it based on its stated contributions. We believe this will lead to a fairer assessment of our efforts.

Thank you once again for your valuable feedback.

Comment

Dear Authors,

Thank you for your detailed response. Your clarifications on the paper's primary focus and contributions were very helpful and I appreciate you taking the time to provide them.

After careful reconsideration, I agree that your work makes a meaningful contribution to the field. Specifically, the exploration of how parameter generation generalizes to unseen tasks and the demonstration of the approach's scalability are important steps forward.

However, I still have significant concerns regarding the practical applicability of the overall method. My primary reservation remains that while the method scales well, it does not seem to extend to unseen model architectures. This limits its real-world utility, particularly since a key application of parameter generation could be to reduce the computational cost of training models on novel architectures.

Given these points, I am willing to raise my score to reflect the contributions I now better understand. Nevertheless, I must maintain my recommendation to reject the paper in its current form.

Thank you again for your hard work and for engaging in this constructive discussion.

Sincerely,

Reviewer yKm7

Comment

Dear Reviewer yKm7,

Thank you for your thoughtful reconsideration and for acknowledging the meaningful contributions of our work. We truly appreciate your engagement and constructive feedback throughout this process.

We completely agree that practical applicability is an essential goal for parameter generation methods, and we respect your concern about the current limitations regarding unseen model architectures. However, we would like to emphasize that scientific exploration is inherently a step-by-step process. Achieving a fully practical and universally applicable parameter generation method is a long-term aspiration, requiring incremental advancements along the way.

Our work specifically addresses two critical milestones on this journey:

  1. Scaling up parameter generation, which lays the foundation for handling increasingly complex models.
  2. Generalizing to unseen tasks, an essential step toward broader applicability. While these contributions do not yet solve the challenge of unseen model architectures, they represent significant progress toward that ultimate goal. We believe that such intermediate steps are not only valuable but also necessary for advancing the field in a meaningful and rigorous manner.

We hope this perspective helps clarify the importance of our work as part of the broader research trajectory. Thank you again for your time, insights, and constructive dialogue.

Sincerely, Authors

Review (Rating: 5)

The authors present a new method that tackles the problem of parameter generation, i.e., generating well-performing network parameters that solve a given task with no prior information on how to solve it. The method scales better and is more efficient than previous approaches at generating large numbers of parameters, while remaining performant.

Strengths and Weaknesses

Strengths:

  • The paper is overall well written
  • The method appears to significantly improve over previous works
    • with a good comparison with previous works, and different models and problems

Weaknesses:

  • Lack of statistical significance for the results: only the best, average, and minimum are shown for the distribution of parameters generated by a single model.
  • No code release, and the provided details are not enough to fully reproduce the experiments.

Questions

What is the impact of NoPE (no positional encoding) or a random fixed positional encoding in Section 3.3? If I understood correctly, the 1D sinusoidal encoding is independent of the network structure, and it does not have terrible performance; thus, I wonder what the impact of the positional encoding is on the learning process.

Could you also provide more details on how permutation states are generated and how they work? When you train with permutation states, do you also permute the parameters used for training, so that it works like a data augmentation?

Limitations

yes

Final Justification

The authors addressed all my questions, especially about the significance of the results; as such, I maintain the positive rating, as I consider that the work offers an improvement over previous works.

Formatting Issues

Typos:

  • line 159: "theclassification,"
Author Response

Q1: Lack of statistical significance over the results, showing only the distribution of parameters generated by a single model.

A1: We have conducted two additional repeated experiments (model2 and model3) for Table 1. The original content in Table 1 corresponds to model1 in the updated table. We now present the distribution of parameters generated by 3 independent models as well as the aggregated results across all 3 models (all). When time permits, we will also conduct repeated experiments for other key results such as Tables 2, 7, and 8.

| arch. \ acc. (%) | ResNet-18 | ResNet-50 | ViT-Tiny | ViT-Small | ViT-Base | ConvNeXt-A | ConvNeXt-L |
|---|---|---|---|---|---|---|---|
| parameters (M) | 11.7 | 25.6 | 5.7 | 22.1 | 86.6 | 3.7 | 197.8 |
| original | 70.0 | 79.8 | 74.9 | 81.4 | 84.4 | 75.2 | 85.8 |
| 1-maximum | 69.85 | 79.57 | 75.43 | 80.62 | 84.59 | 74.57 | 85.78 |
| 1-average | 69.49 | 79.51 | 75.29 | 80.47 | 84.39 | 74.39 | 85.49 |
| 1-minimum | 68.96 | 79.37 | 75.20 | 80.12 | 84.19 | 74.16 | 85.16 |
| 2-maximum | 69.70 | 79.68 | 75.38 | 80.79 | 84.62 | 74.60 | 85.82 |
| 2-average | 69.40 | 79.54 | 75.31 | 80.51 | 84.37 | 74.49 | 85.46 |
| 2-minimum | 69.04 | 79.44 | 75.20 | 80.39 | 84.17 | 74.38 | 85.17 |
| 3-maximum | 69.78 | 79.53 | 75.41 | 80.60 | 84.55 | 74.61 | 85.70 |
| 3-average | 69.43 | 79.48 | 75.30 | 80.46 | 84.38 | 74.48 | 85.43 |
| 3-minimum | 69.02 | 79.42 | 75.21 | 80.23 | 84.24 | 74.27 | 85.19 |
| all-maximum | 69.85 | 79.68 | 75.43 | 80.79 | 84.62 | 74.61 | 85.82 |
| all-average | 69.44 | 79.51 | 75.30 | 80.48 | 84.38 | 74.45 | 85.46 |
| all-minimum | 68.96 | 79.37 | 75.20 | 80.12 | 84.17 | 74.16 | 85.16 |

Q2: No code release.

A2: We will release all experimental code in the revision. We guarantee that all experiments in the paper are reproducible.

Q3: What is the impact of the positional encoding?

A3: We address this question from two perspectives: (1) the necessity of positional encoding, and (2) the choice between 1D and 2D positional encoding. A minimal sketch of the 1D encoding follows these points.

  • 1. Necessity of positional encoding: For Mamba-based recurrent models, positional encoding may be less critical since the model can inherently understand token order through its recurrent nature. However, for transformer-encoder-based models, positional encoding is essential to indicate which parameter position is being generated. For consistency across different architectures and framework flexibility, we uniformly apply positional encoding to all models.

  • 2. 1D vs. 2D positional encoding: Under the current paradigm, there is no significant performance difference between 1D and 2D positional encoding. However, we plan to extend our framework to support variable-structure parameter generation in future work. The 2D encoding provides richer structural information about the model architecture, which will be crucial for these planned extensions.
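For reference, here is a minimal version of the structure-agnostic 1D sinusoidal encoding under discussion, in the standard Transformer form; dimensions are illustrative and an even embedding size is assumed:

```python
import torch

def sinusoidal_pe(num_tokens: int, dim: int) -> torch.Tensor:
    """1D sinusoidal positional encoding; depends only on the token index."""
    pos = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)
    freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                      * (-torch.log(torch.tensor(10000.0)) / dim))
    pe = torch.zeros(num_tokens, dim)
    pe[:, 0::2] = torch.sin(pos * freqs)
    pe[:, 1::2] = torch.cos(pos * freqs)
    return pe  # (num_tokens, dim), added to each parameter-token embedding
```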

Q4: How are permutation states generated and how do they work?

A4: Thanks for your comment. We explain in detail how permutation states are generated and how they work:

  • How permutation states are generated: We assign different permutation states to checkpoints with different symmetric states. Typically, multiple models saved in one training run have similar symmetric states (and are assigned the same permutation state), and vice versa. However, in this work we assume that all checkpoints are in different symmetric states, which is actually a more conservative assumption, so we assign a different permutation state to each checkpoint.

  • How to choose permutation state during inference?

    • First, neural networks with different initializations can have various symmetric structures that perform similarly [1].

    • This symmetry phenomenon doesn't affect performance, but it hinders hypernetworks' training. It arises when training with different random seeds.

    • Based on this, we devise the permutation state to encode this symmetry and facilitate RPG's learning. That is, we assign different permutation states to checkpoints trained with different random seeds.

    • Previous works such as P-diff [2] collect checkpoints from a single training trajectory, i.e., one random seed. The permutation state may not be vital in this setting.

    • In inference, we only need to randomly assign a permutation state.

    • To further explore the permutation state's influence, we conduct an experiment in Appendix C.1 and reproduce it here for convenience:

      • Setting: we use ViT-T tuned on CIFAR-10 and test RPG with and without the permutation state.

      | number of random seeds | original | without permutation state | with permutation state |
      |---|---|---|---|
      | 1 | 88.2 | 88.0 | 88.1 |
      | 3 | 88.3 | fail | 88.1 |
      | 10 | 88.5 | fail | 88.2 |
      • Findings:
        • RPG with and without the permutation state performs similarly when trained on only one type of checkpoint.
        • When the number of random seeds increases, RPG without the permutation state soon fails.
        • RPG with the permutation state works well with a large number of different random seeds.
      • Conclusions:
        • The permutation state encodes the parameter symmetries as conditions and works well when trained on checkpoints from multiple training runs.
  • Overview of how permutation states work: In the training phase, the permutation state (i.e., a one-hot encoding) passes through an MLP and is added to the positional encoding of the model input to represent the current model's symmetric state. In the inference phase, we typically only care about the model's performance rather than its symmetric state. Therefore, we only need to randomly select a permutation state (the default is state #1, i.e., 000...001). Alternatively, we can specify a particular state to generate models in a specific basin. A small sketch of this conditioning is given below.
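A hedged sketch of the conditioning path just described (one-hot state through an MLP, added to the positional encoding); module names and sizes are our own illustration, not the actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PermutationStateEmbed(nn.Module):
    """Map a one-hot permutation state to an offset on the positional encoding."""
    def __init__(self, num_states: int, dim: int):
        super().__init__()
        self.num_states = num_states
        self.mlp = nn.Sequential(nn.Linear(num_states, dim), nn.SiLU(),
                                 nn.Linear(dim, dim))

    def forward(self, state_id: int, pos_enc: torch.Tensor) -> torch.Tensor:
        one_hot = F.one_hot(torch.tensor(state_id), self.num_states).float()
        return pos_enc + self.mlp(one_hot)  # broadcast over all token positions

embed = PermutationStateEmbed(num_states=50, dim=256)
pe = torch.zeros(128, 256)      # (num_tokens, dim) positional encoding
conditioned = embed(0, pe)      # state #1 is the default choice at inference
```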

[1] B. Zhao et al. Symmetry in Neural Network Parameter Spaces
[2] K. Wang et al. Neural Network Diffusion

评论

Thank you for the responses.

So if I understood correctly, the permutation state is a random one-hot encoding that we generate and attribute to a certain checkpoint?

评论

Dear Reviewer DHqN,

We greatly appreciate your careful review and thorough understanding of our paper.

Your understanding is correct. The permutation state is a randomly generated one-hot encoding that we assign to a specific checkpoint, which helps improve model performance by enhancing the model's ability to learn from different states.

If you have any further questions, feel free to discuss them with us and we are more than happy to solve them. Thank you.

评论

Thank you for the reply. I was not able to understand this from the text. I would maybe recommend, for example, to include the concept that we are attributing a random one-hot vector for each checkpoint in line 120. "permutation state S—encoded as a randomly generated one-hot vector—for each checkpoint W".

The authors have addressed all my questions, and I will maintain my score.

评论

Dear Reviewer DHqN,

Thanks for your valuable recommendations and support. We will revise line 120 as suggested, changing the text from "permutation state S—encoded as a one-hot vector—for each checkpoint W" to "permutation state S—encoded as a randomly generated one-hot vector—for each checkpoint W". This modification effectively clarifies the details of the permutation state method.

If you have any further suggestions, please feel free to share them with us to help further improve the presentation of this work. Thanks again for your valuable comments and constructive feedback.

Sincerely,
Authors

Review (Rating: 4)

The paper proposes a network design that generates the parameters of another neural network in a scalable and efficient manner. The proposed network takes the form of a recurrent network, so layers are generated in a manner that is aware of inter-part correlations. The model is trained with a diffusion objective to enable sampling of network parameters. The authors show experimentally that the networks generated this way are comparable with networks trained via backpropagation.

Strengths and Weaknesses

Strengths:

  1. Proposing recurrent hypernetworks is an interesting and innovative idea.
  2. Sufficient ablation experiments are done to justify the various design choices.
  3. Out-of-distribution experiments are quite interesting and show capabilities of hypernetworks to generalize to new tasks

Weaknesses:

  1. Mis-stated lines:
    (a) line 20: "However, neural network parameter generation—from early HyperNetworks [20] to more recent diffusion-based methods [45, 63, 58]—has not kept pace with the expansion of main stream vision and language models" -- I disagree with this part of the paper's motivation. There are works such as [1,2] that have scalable hypernetworks. [1] replaces the MLP hypernetwork in the original HyperNetworks paper with a convolutional layer, which (with a few other modifications) makes the original hypernetwork quite scalable. [1] uses hypernetworks to generate models that are around 20M parameters, and I don't see why it should not be capable of scaling to ~100M parameters. For [2], they use it to generate NeRFs, which are just a few MLP layers, but these models are also capable of scaling to ~100M parameters. So I think line 20 is not true, unless the authors can show some experiments where hypernetworks such as the ones in [1,2] fail when required to generate weights for large networks. One distinction I would like to make here is that [1,2] have low-dimensional embeddings (that are trainable) that get mapped to weights of the generated network, whereas this paper takes a generative approach that maps noise to weights; perhaps line 20 can be rephrased as "hypernetworks that generate parameters from noise don't scale well".

    (b) line 24: "This discrepancy not only underscores the challenge of scalability but also confines existing generators to academic demonstrations rather than real-world applicability." -- I disagree with this line as stated. [2,3,4] demonstrate that hypernetworks can be used in real-world applications; they require generating only NeRFs, which are typically smaller models. So IMO hypernetworks are not confined to academic demonstrations due to scalability challenges.
    
    
    (c) line 29: "Traditional approaches such as HyperNetworks struggle to handle even moderately sized models, primarily due to memory overheads and optimization complexity" -- Simple modifications to the original HyperNetworks paper, as demonstrated in [1], make it very easy to scale. I have personally trained hypernetworks to generate parameters for ResNets of size 100M using the recipe given by [1] for tasks like CIFAR and ImageNet classification. It could be helpful if the authors can show [1,2] not scaling well for some applications.
    
  2. Lacks main motivation: The main motivation of the paper as I see it, "current parameter generation methods do not scale well when asked to generate weights of large models", is in my opinion not true. That said, the paper does IMO explore an interesting class of hypernetworks that are capable of generating parameters from noise, as opposed to [1,2], which map learnable embeddings to network weights. I think it would be useful if the authors could motivate why such a class of hypernetworks is more appealing than the class of hypernetworks that map embeddings to network weights as in [1,2]. If the authors can provide a suitable answer to this, I am happy to increase my rating.

  3. Measuring diversity of generated networks: The authors show diversity of models (following prior work) using a reasonable metric of selecting maximum IoU with original models. IMO a better metric would be to check whether linear mode connectivity [5,6] breaks: typically, if models end up in the same basin, interpolating between them yields a model that is also valid, with the same performance level as the models being interpolated; however, if the models are in different basins, interpolation yields a model that performs significantly worse. So I suggest taking the generated models and measuring for what fraction of the model weights used in training linear mode connectivity breaks. This metric would actually indicate diversity in the sense of models belonging to different basins, which IMO is more meaningful. I understand it might be hard to implement such a metric within the rebuttal period; it is just a suggestion to try out.

  4. Missing related work: A discussion on [1,2,3,4], especially given [3] uses a recurrent hypernetwork would help in contextualizing the current work.

[1] Babu, Sudarshan, Pedro Savarese, and Michael Maire. "HyperNetwork Designs for Improved Classification and Robust Meta-Learning."

[2] Chen, Yinbo, and Xiaolong Wang. "Transformers as meta-learners for implicit neural representations." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.

[3] Babu, Sudarshan, et al. "Hyperfields: Towards zero-shot generation of nerfs from text." arXiv preprint arXiv:2310.17075 (2023).

[4] Lorraine, Jonathan, et al. "Att3d: Amortized text-to-3d object synthesis." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[5] Frankle, Jonathan, et al. "Linear mode connectivity and the lottery ticket hypothesis." International Conference on Machine Learning. PMLR, 2020.

[6] Yunis, David, et al. "On convexity and linear mode connectivity in neural networks." OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop). 2022.

Questions

  1. Is the permutation state essentially a one-hot encoding over the trained networks used to train the hypernetwork?

Limitations

yes

Final Justification

I recommend this paper for weak accept

Formatting Issues

No formatting concerns

Author Response

Q1: There are works such as [1,2] that have scalable hypernetworks.

A1: Thanks for the constructive suggestions. We answer the question as follows:

  • The papers raised are interesting: [1] achieves large-scale generation for meta-learning, and [2] customizes NeRFs via transformer hypernetworks.
  • We change the statement in line 20 to "hypernetworks that generate parameters from noise don't scale well".
  • We apologize for initially overlooking these works, and we will add them to Section 5 in the revision.

Q2: There are works such as [2,3,4] that demonstrate hypernetworks can be used in real-world applications.

A2: Thanks for the comment; we answer as follows.

  • We appreciate reviewer Q57G for pointing out these relevant works. NeRFs are practical models, and this serves as an effective indicator of hypernetworks' utility.
  • Our work, on the other hand, explores the generation of recent large models, such as vision transformers and LoRA adapters for LLMs.
  • We conduct experiments on models with 3M to 200M parameters across image classification, object detection, small language models, and LoRA adapters for text and image generation. Through these explorations, we aim to broaden the application scenarios of parameter generation.
  • We highly appreciate your interest in hypernetworks and the helpful suggestions, and we rephrase "This discrepancy not only underscores the challenge of scalability but also confines existing generators to academic demonstrations rather than real-world applicability." → "This discrepancy underscores the challenge of scalability and, if solved, can strengthen parameter generation's real-world applicability."

We hope these answers can address the concerns, and we appreciate reviewer Q57G's interest in hypernetworks. We hope to hear more in the discussion phase.

Q3: Why is (a) generating parameters from noise more appealing than (b) mapping learnable embeddings to network weights?

A3: Thank you for the question. We answer as below.

  • Better scalability and memory use.

  • Embedding decoders must output the full parameter tensor in one pass, so memory scales with model size.

  • RPG generates weights sequentially: each token is conditioned on previous predictions. This design choice trades a small compute overhead for orders-of-magnitude lower memory (see the sketch after this answer).

  • Action: This discussion is added to the revision.
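A rough sketch of the sequential design referenced above; `backbone_step` and `denoise_token` are placeholders we introduce for illustration, not the authors' API:

```python
import torch

def generate_sequentially(backbone_step, denoise_token,
                          num_tokens: int, token_dim: int) -> torch.Tensor:
    """Sample parameter tokens one at a time, conditioned on a recurrent state,
    so peak memory tracks the token size rather than the full parameter count."""
    state, prev, out = None, torch.zeros(1, token_dim), []
    for _ in range(num_tokens):
        state = backbone_step(prev, state)                      # update context
        prev = denoise_token(torch.randn(1, token_dim), state)  # denoise token
        out.append(prev)
    return torch.cat(out, dim=0).flatten()                      # full weights
```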

Q4: Using linear mode connectivity (LMC) to evaluate model similarity.

A4: Thanks for the comment. We conduct an experiment as follows:

  • Setting: We use ResNet-18 on CIFAR-10 with different training + sampling schemes, and test the performance of linear interpolation of model weights:

    | Method \ Accuracy (%) | A | 0.5A + 0.5B | B |
    |---|---|---|---|
    | LMC of 2 models sampled by RPG trained on checkpoints from the same basin | 95.1 | 95.0 | 95.0 |
    | LMC of 2 models sampled by RPG trained on checkpoints from 2 basins, sampled with 1 permutation state | 95.0 | 94.8 | 94.9 |
    | LMC of 2 models sampled by RPG trained on checkpoints from 2 basins, sampled with 2 permutation states | 94.9 | 18.4 | 94.9 |
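For reference, a minimal sketch of the interpolation probe behind these numbers; `evaluate` and the state dicts are placeholders we introduce for illustration:

```python
import copy
import torch

def interpolate(sd_a: dict, sd_b: dict, alpha: float) -> dict:
    """Linear interpolation between two state dicts of the same architecture."""
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

def lmc_probe(model, sd_a, sd_b, evaluate, alphas=(0.0, 0.5, 1.0)):
    accs = []
    for a in alphas:
        probe = copy.deepcopy(model)
        probe.load_state_dict(interpolate(sd_a, sd_b, a))
        accs.append(evaluate(probe))   # e.g., CIFAR-10 test accuracy
    return accs  # a midpoint collapse (like 18.4% above) signals separate basins
```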

Q5: Missing related work: A discussion on [1,2,3,4].

A5: Thanks for the comment. We discuss papers [1,2,3,4] in our A1 and A2, and we put the results here for convenience.

(Table comparing $S_{KDE30}$, p-diff, SANE, D2NWG, [1], [2], [3], [4], and RPG along three axes: scalability, performance, and generalization; only RPG covers all three.)

Observations:

  • Existing approaches tend to specialize: some stress high performance, others emphasize generalization, and few focus on scalability.

  • In contrast, our work can deliver strong performance, broad generalization, and efficient scalability simultaneously.

We add this discussion and the table above in the revision.

[1] Babu, Sudarshan, Pedro Savarese, and Michael Maire. "HyperNetwork Designs for Improved Classification and Robust Meta-Learning."
[2] Chen, Yinbo, and Xiaolong Wang. "Transformers as meta-learners for implicit neural representations." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.
[3] Babu, Sudarshan, et al. "Hyperfields: Towards zero-shot generation of nerfs from text." arXiv preprint arXiv:2310.17075 (2023).
[4] Lorraine, Jonathan, et al. "Att3d: Amortized text-to-3d object synthesis." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Q6: Is the permutation state essentially a one-hot encoding over the trained networks used to train the hypernetwork?

A6: Thanks for the comment.

  • We chose one-hot encoding because it offers a straightforward way to represent parameter symmetry.
  • However, the method does not depend on this particular choice: any encoding scheme that distinguishes different permutation states is adaptable to our framework.
  • We add this clarification in Section 2.2, line 117, in the revision.

Comment

It is quite interesting that LMC breaks within sampled networks. The authors have also addressed my concerns on why generating from noise is more appealing as well. I increase my score.

评论

Dear Reviewer,

Thank you for your valuable feedback and support. We truly appreciate your recognition of our contributions, particularly regarding the behavior of LMC and the appeal of generating from noise. We will address all the issues you raised in the revision to ensure a stronger presentation.

If you have any further questions or suggestions, please don’t hesitate to reach out—we’re happy to discuss and resolve them.

Sincerely,
Authors

Review (Rating: 4)
  • Parameter-generating models have struggled to scale to large numbers of generated parameters.
  • This work presents a novel setup combining parameter tokenization with a recurrent state-space network to model inter-layer dependencies and a diffusion model for parameter prediction; together, these techniques enable scaling the number of predicted parameters.
  • The paper evaluates the method on a wide variety of tasks and networks across vision and language, showing that the approach works in practice.
  • Many ablations are provided to justify the modeling choices, such as the recurrent architecture, the type of positional embeddings, or the tokenization strategy.

Strengths and Weaknesses

Strengths

  • Quality - Wide coverage of architectures (ResNet, ViT, ConvNeXt, LoRA-augmented LLMs) and both vision and language tasks.
  • Analysis - Ablation study is thorough, dissecting many modeling choices in order to pinpoint what matters for accuracy and efficiency.
  • Significance - Results show single GPU generation of ~200M parameter networks, pushing parameter generation methods further than prior work.
  • Originality - Combines parameter tokenization with a recurrent decoder and parameter diffusion. I believe this is novel work in this area of research.

Weaknesses

  • Clarity - After reading the paper, I find that training versus inference details are underspecified, e.g., the procedure for predicting weight unnormalization shifts and scales, as well as how to choose permutation states at inference time.
  • Significance - Paper never analyzes how many RPG parameters are required relative to each target model, making cost-benefit trade-offs hard to judge.
  • Significance - There is no quantification of how many checkpoints are needed to train RPG well. An ablation here would be really informative.
  • Presentation/Quality - Section 4 feels like a distraction and in my opinion overstates the results. "Unseen tasks" have the same training data as seen tasks, and the "task" amounts to adapting the weights of the last layer, so it is not a useful result in practice. Furthermore, the relative sizes of the training and test sets (1002 and 20 tasks) are not realistic.
  • Presentation - In contrast, some key results (Tables 19 and 20) are buried in the appendix but I find them more relevant to the presented work and should be in the main body of the paper.

Questions

  1. In Equation 1 weights are normalized during training. How are the corresponding shift and scale values produced at inference?
  2. If permutation states are one-hot during training, what representation is used at inference?
  3. Can you report standard deviations for Tables 1, 2 and 7? Additionally, why do the blue figures in Table 7 vary across entries in the same column?
  4. For the presented tasks/architecture, what is the minimum number of checkpoints required to train a useful RPG model?
  5. Conversely, for the presented networks, what is the smallest RPG parameter count that can match at least 95 percent of baseline accuracy?

I would consider raising my score if these questions are answered satisfactorily.

Limitations

  • Generated models are still below one billion parameters, limiting immediate applicability to large language models.
  • RPG is not permutation invariant; it learns explicit neuron permutations, which increases model size and training time.

Final Justification

See my reply to the authors, they addressed some of my key concerns. While the presentation has room for improvement, I believe the work is technically sound.

Formatting Issues

n/a

Author Response

Q1: Training vs inference details are underspecified. How to choose permutation states at inference time remains unclear.

A1: Thanks for the comment.

  • How is unnormalization implemented?

    • The weights (scales) and biases (shifts) we use for normalization and unnormalization are NOT calculated independently for each model, but are based on the average over the entire training set.
    • The layer-wise mean and std are calculated as $\mu_{l}=\frac{1}{N}\sum_{n=1}^{N}\mathrm{mean}(W_{n}^{l})$ and $\sigma_{l}=\frac{1}{N}\sum_{n=1}^{N}\mathrm{std}(W_{n}^{l})$, where $W_{n}^{l}$ denotes layer $l$ of the $n$-th model, and $\mu_{l}$, $\sigma_{l}$ denote its mean and standard deviation, respectively.
    • These statistics are pre-computed, stored, and used to unnormalize the generated parameters; a short sketch follows this answer.
  • How to choose permutation state during inference?

    • Due to space limitations, please refer to our A4 to reviewer DHqN.
  • Action:

    • We put these results and analyses in Section 3.3.
    • We add more motivation for the permutation state at line 117, Section 2.2.
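For concreteness, a minimal sketch of the statistics above, assuming checkpoints are plain state dicts (function names are ours, not the paper's):

```python
import torch

def layer_stats(checkpoints):
    """Average per-layer mean/std over N checkpoints, per the formula in A1."""
    stats = {}
    for name in checkpoints[0]:
        means = torch.stack([c[name].float().mean() for c in checkpoints])
        stds = torch.stack([c[name].float().std() for c in checkpoints])
        stats[name] = (means.mean(), stds.mean())  # (mu_l, sigma_l)
    return stats

def unnormalize(generated, stats):
    """Map generated (normalized) parameters back to the original scale."""
    return {name: w * stats[name][1] + stats[name][0]
            for name, w in generated.items()}
```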

Q2: How many RPG parameters are required relative to each target model?

A2: Thanks for the comment.

  • We present the structural details of RPG and the number of parameters each variant can generate in Table 12, Appendix B.3. We reproduce the data here:

    | model scales | RPG-T | RPG-S | RPG-B | RPG-L |
    |---|---|---|---|---|
    | num. can generate | up to 50K | up to 10M | up to 50M | up to 200M |
    | recurrent | 1.3M | 256M | 1018M | 3076M |
    | diffusion | 0.3M | 17M | 69M | 273M |
    | total RPG | 1.6M | 273M | 1087M | 3349M |
  • Findings:

    • The denoising network's parameter count is close to the number of parameters that RPG can generate.
    • Most of RPG's parameters come from the recurrent backbone.
  • Conclusions:

    • RPG has a lightweight denoising network.
    • With a lighter backbone, RPG could generate billions of parameters.
    • RPG can generate an almost unlimited number of models once trained, so the efficiency trade-off is worthwhile in the long run.
  • Action:

    • We add these analysis to Section 3.1.
    • We will explore lighter backbones.

Q3: How many checkpoints are needed to train RPG well.

A3: Thanks for the comment.

  • We evaluate RPG's performance in 2 scenarios: close-set (generating models seen in training) and open-set (generating unseen models) generation.

  • Close-set: In Table 21, Appendix C.5, we conduct an ablation study on the number of checkpoints used to train RPG. We use ViT-Tiny on ImageNet.

    • Setting:

    | number of checkpoints | 5 | 25 | 100 | 300 | 1002 |
    |---|---|---|---|---|---|
    | acc. | 75.1 | 75.2 | 75.2 | 75.2 | 75.3 |

    • Findings and Conclusions:
      • Only 1 checkpoint can train RPG well, showing its strong close-set generation ability with very limited data.
  • Open-set: Thanks for the suggestion, we conduct the following ablation:

    • Setting:
      • We randomly select 5, 25, 100, 300, and 1002 (all) of the checkpoints described in Section 4 as the training set. All other configurations are consistent with Section 4.

    | number of checkpoints | 5 | 25 | 100 | 300 | 1002 |
    |---|---|---|---|---|---|
    | acc. | 54.1 | 84.7 | 86.9 | 87.8 | 90.0 |

    • Findings and conclusions:
      • RPG can generalize to unseen tasks even when trained on a smaller number of checkpoints, e.g., 100.
      • More data further improves performance, showing that generalization requires a basic amount of training data.

Q4: "Unseen tasks" have the same training data as seen tasks, and the "task" amounts to adapting the weights of the last layer, so it is not a useful result in practice. Furthermore, the relative sizes of training and test sets (1002 and 20 tasks) are not realistic.

A4: Thanks for the comments.

  • 1. "Unseen tasks" have the same training data as seen tasks.

  • 2. The "task" amounts to adapting the weights of the last layer, so it is not a useful result in practice.

  • 3. The relative sizes of training and test sets (1002 and 20 tasks) are not realistic.

  • We refer our answers to them as A 4.1, A 4.2, and A 4.3, respectively.

  • A 4.1: Thanks for the comment.

    • Parameters for unseen tasks and seen tasks share the same network structure, but they are trained on different binary embedding classification tasks.
    • NONE of the parameters for unseen tasks are involved in RPG's training, so the evaluation process is zero-shot.
  • A 4.2: Thanks for the comment.

    • We first conduct full-parameter fine-tuning.
    • Therefore, the backbone parameters of the 1002 training samples are not identical.
  • A 4.3: Thanks for the comment.

    • As the ablation study in A3 suggests, RPG can effectively learn from limited training checkpoints, achieving 84.7% accuracy with only 25 checkpoints and 86.9% with 100 checkpoints.

    • Additionally, our experiments on object detection and a larger dataset in Appendix C.2 provide more generalization results (note that we write embeddings in hex to save space):

    • object detection:

      • Setting: YOLOv8n on PASCAL VOC, using 20-bit 0-1 embeddings.

      • Results:

        | unseen tasks | epoch-0 | epoch-1 | epoch-2 | epoch-5 |
        |---|---|---|---|---|
        | 2010c | 0.00 / 0.29 | 0.01 / 0.30 | 0.06 / 0.33 | 0.18 / 0.39 |
        | 08026 | 0.01 / 0.25 | 0.02 / 0.30 | 0.05 / 0.32 | 0.21 / 0.39 |
        | 11048 | 0.01 / 0.28 | 0.09 / 0.33 | 0.12 / 0.35 | 0.32 / 0.38 |
        | 20061 | 0.00 / 0.24 | 0.04 / 0.25 | 0.04 / 0.29 | 0.13 / 0.35 |
        | 0a060 | 0.00 / 0.19 | 0.03 / 0.22 | 0.11 / 0.26 | 0.21 / 0.33 |
      • Conclusion:

        • RPG offers good initializations, faster adaptation, and reduced training costs.
        • RPG can generalize to complex tasks.
    • larger dataset:

      • Setting: ViT-Small on ImageNet-1K (coarse-grained 10 classes).

      • Results:

        | unseen tasks | original | RPG |
        |---|---|---|
        | 015 | 91.5 | 89.7 |
        | 1a1 | 88.5 | 87.0 |
        | 3f2 | 89.1 | 88.1 |
        | 30b | 89.3 | 87.3 |
        | 2ad | 91.4 | 90.4 |
        | 370 | 88.6 | 87.2 |
        | 2be | 92.5 | 90.1 |
        | 1ea | 89.5 | 87.5 |
        | 2d1 | 90.1 | 87.9 |
        | 360 | 90.2 | 89.2 |
      • Conclusion:

        • RPG also generalizes on ImageNet, achieving performance comparable to trained models.

Q5: Some key results are buried in the appendix.

A5: We make the following modifications.

  • Move Tables 19 and 20, and the relevant analysis, to the main text in the revision.
  • Examine other details of our work and refine them accordingly.

Q6: Can you report standard deviations for Tables 1, 2 and 7?

A6: We provide the std below and include them in the revision:

  • Note:
    • We present the data in accuracy (std) format.
    • Due to the word limit, we only provide Table 1 here and will update the remaining results once the discussion phase begins.
  • Table 1:
    | arch. \ acc. (%) | ResNet-18 | ResNet-50 | ViT-Tiny | ViT-Small | ViT-Base | ConvNeXt-A | ConvNeXt-L |
    |---|---|---|---|---|---|---|---|
    | original | 70.0 (0.28) | 79.8 (0.05) | 74.9 (0.13) | 81.4 (0.08) | 84.4 (0.08) | 75.2 (0.07) | 85.8 (0.10) |
    | generated | 69.5 (0.23) | 79.5 (0.07) | 75.3 (0.07) | 80.5 (0.14) | 84.4 (0.12) | 74.4 (0.11) | 85.5 (0.19) |

Q7: Why do the blue subscripts in Table 7 vary across entries in the same column?

A7: Some works have not updated their code to match the latest paper, making them difficult to reproduce. Therefore, all the data in Table 7 are taken from the original papers and are not our reproduced results. Since the checkpoints used in the original papers vary across studies, we added blue subscript markers to indicate this.

Q8: What is the smallest RPG parameter count that can match at least 95 percent of baseline accuracy?

A8: Thanks for the comment.

  • We conduct ablation studies in Appendix C.3, Table 19, to explore this, and we put the results here for convenience.

  • Settings: We generate ViT-Tiny with RPG variants of various parameter counts and report the accuracy.

  • Results:

    | # RPG parameters | acc. |
    |---|---|
    | 20M | 0.3 |
    | 81M | 70.8 |
    | 273M | 75.2 |
    | 1087M | 75.3 |
  • Findings and Conclusions:

    • At a parameter count of 81M, RPG generates a model with 70.8% accuracy.
    • As shown in Section 3.1, Table 1, the baseline accuracy is 74.9%.
    • The generated model achieves 70.8% ÷ 74.9% = 94.5% (roughly 95%) of the original model's performance.

Q9: Generated models are still below one billion parameters, limiting immediate applicability to large language models.

A9: Thanks for the comment.

  • We acknowledge that RPG currently doesn't support full-model generation for LLMs.
  • As noted in A2, RPG's scaling bottleneck is its heavy backbone. We are currently working on finding lighter alternatives.
  • Moreover, we can generate LoRA parameters, which offer efficient task adaptation of LLMs.
  • Solutions for scaling up:
    • Finding lighter recurrent backbones.
    • Improving the efficiency of tokenization.

Q10: RPG is not permutation invariant; it learns explicit neuron permutations, which increases model size and training time.

A10: Thanks for the comment.

  • We acknowledge that permutation invariance is a crucial issue in parameter generation, as it enables training on checkpoints collected from various sources.
  • We introduce the permutation state to mitigate this problem, as stated in A1. Results show that omitting the permutation state can cause training to fail on models with different permutations.
  • Our method alleviates the problem, but achieving permutation invariance remains challenging, and we will continue to explore it in the future.
Comment

Dear Reviewer,

As mentioned in A6 above, we provide the standard deviation results here. Below are the standard deviations for Tables 2 and 7.

Note: We present the data in accuracy (std) format.

  • Table 2:

    | rank-method \ acc. (std) | BoolQ | PIQA | SIQA | HellaSwag | ARC-e | ARC-c | OBQA |
    |---|---|---|---|---|---|---|---|
    | rank4-org | 64.3 (2.3) | 71.3 (11.7) | 66.0 (15.1) | 53.7 (15.8) | 64.4 (19.0) | 49.5 (12.7) | 63.1 (16.6) |
    | rank4-RPG | 63.1 (3.6) | 72.0 (14.6) | 67.5 (10.1) | 56.7 (18.8) | 65.3 (16.9) | 49.7 (13.2) | 66.0 (13.0) |
    | rank64-org | 69.3 (1.0) | 78.9 (10.8) | 72.9 (1.5) | 81.1 (6.1) | 72.9 (5.3) | 58.1 (4.4) | 71.2 (2.0) |
    | rank64-RPG | 69.1 (1.3) | 79.4 (10.1) | 73.9 (0.9) | 81.1 (9.0) | 73.1 (4.1) | 58.3 (3.9) | 72.1 (1.9) |
  • Table 7:

    | method \ arch. | CNN (s) | CNN (m) | ResNet-18 | ViT-B |
    |---|---|---|---|---|
    | params. (M) | 0.003 | 0.011 | 11.7 | 86.6 |
    | $S_{KDE30}$ | 26.9 (4.9) / 46.1 | - | OOM | OOM |
    | p-diff | 48.8 (0.2) / 49.0 | 61.9 (0.1) / 62.1 | OOM | OOM |
    | SANE | - | 57.9 (0.2) / 57.2 | 68.6 (6.7) / 85.5 | - |
    | D2NWG | 38.2 (-) / 44.7 | 58.8 (0.1) / 57.2 | 94.6 (-) / 94.6 | - |
    | RPG | 49.0 (0.1) / 49.0 | 62.0 (0.1) / 62.1 | 95.1 (0.6) / 95.3 | 98.9 (0.1) / 98.7 |

We hope our rebuttal has addressed your concerns. To save your time, we make a summary of our rebuttal as follows:

  1. We clarify normalization statistics computation and permutation state selection.

  2. We compare RPG's parameter counts with the number of parameters each variant can generate in a table.

  3. We conduct experiments relating the number of training checkpoints to accuracy in both close-set and open-set scenarios.

  4. We provide supplementary explanations and experiments for the training and evaluation data.

    • We clarify that parameters for unseen tasks and seen tasks are different.
    • We explain that all checkpoints have different backbones.
    • We construct experiments with other data splits and demonstrate results on more complex tasks and larger datasets.
  5. We move key results (Tables 19 and 20) from appendix to main text.

  6. We report the standard deviations for Table 1, 2, and 7.

  7. We explain the reason for the varying blue subscripts in our notation.

  8. We conduct an experiment to measure the parameter count required for RPG.

  9. We discuss scalability limitations and propose alternative solutions.

  10. We acknowledge the permutation invariance limitation while highlighting our solution, the permutation state method.

Looking forward to your reply!

评论

Dear Reviewer D95z,

Thanks again for your efforts in reviewing this work. As the discussion period will be closed soon, may we know if the concerns are fully addressed? Your insightful comments and feedback are really important for us to improve the quality of our work!

Looking forward to your reply!

Authors

评论

I thank the authors for their thorough reply.

I am still concerned about some of the presentation challenges, and I hope they are addressed in future revisions. As an additional example, I believe Tables 12 & 21 belong in the main body of the paper as they are key results.

Regarding Q4/A4, I probably wasn't clear enough, and while the reply is technically correct, I do not think it's addressing my observations. I still think the tasks are too similar (just a label flip) so calling them unseen (even if the embeddings are different) is disingenuous.

Nevertheless, I think the contribution is valuable and the method novel, so I have decided to raise my score.

评论

Dear Reviewer D95z,

We sincerely thank you for your insightful and constructive feedback. Your comments have highlighted key areas for improvement, and we have developed a comprehensive plan to address them in the revision. Below, we outline the improvements we will implement:

1. Moving Key Results to the Main Text

  • We are going to move Table 21 to the main text to better emphasize its importance.
  • For Table 12, due to space constraints, we are going to include a condensed version in the main text, summarizing key parameter count information. This will ensure clarity while adhering to space limitations.
  • Additionally, we are going to provide detailed explanations linking these results to our core research claims to improve clarity and accessibility.

2. Clarifying Task Definition

  • We emphasize that our approach is not a simple label flip. Different groupings of labels require the model to extract distinct features. For example:
    • For four fruits (red apple, green apple, kiwi, cherry), classifying (red apple, green apple) as positive focuses on extracting shape features.
    • Conversely, classifying (red apple, cherry) as positive requires the model to focus on extracting the color red. These represent entirely different feature extraction processes, with no overlap.
  • To address your concerns, we are going to:
    • Revise the definition of "unseen tasks" to emphasize parameter distribution differences rather than "label flipping".
    • Include additional experiments demonstrating task variability and its impact on model performance.

3. Presentation Optimization

  • We are going to restructure Sections 3 and 4 to enhance the logical flow and make the paper more reader-friendly.
  • Key results raised by reviewers will be prioritized and discussed in the main text to better highlight our contributions.

We deeply appreciate your valuable feedback, which has been instrumental in improving our paper. Should you have further suggestions or questions, please do not hesitate to reach out. We are committed to addressing all your concerns and look forward to submitting a stronger revision.

Sincerely,
The Authors

评论

Dear 6523 Reviewers: The authors have provided detailed rebuttals to your reviews. I'd urge you to read their rebuttals (as well as other reviews) early to allow further interactions that help clarify any lingering confusion or misunderstanding. Thank you! AC

Final Decision

This paper presents Recurrent Diffusion for Large-Scale Parameter Generation (RPG), a framework that combines parameter tokenization, a recurrent state-space model to capture inter-layer dependencies, and a diffusion process for sampling network weights. RPG achieves performance competitive with backpropagation-trained networks across a wide range of vision and language architectures (ResNets, ViTs, ConvNeXts, LoRA-augmented LLMs). Extensive experiments and ablations validate its design choices and highlight its scalability.

Reviewers appreciated the originality and breadth of the proposed approach, especially the novel combination of recurrence and diffusion for parameter synthesis. They praised the strong empirical results and coverage across multiple domains, along with well-motivated ablations and comparisons. Several reviewers emphasized that the paper pushes the boundary of what parameter-generating networks can achieve, noting successful single-GPU generation of ~200M-parameter models and compelling experiments on out-of-distribution tasks. The method was also described as well-structured and clearly presented, with some reviewers highlighting the potential for further research on generative hypernetworks.

However, reviewers also noted important weaknesses. Some challenged the paper’s core motivation, arguing that existing hypernetworks have already scaled to large models, and the claims that previous methods are limited may be overstated without direct comparisons. Others pointed out the lack of clarity in training vs. inference details, as well as the absence of key evaluations such as architectural generalization to unseen model types, or stronger statistical analyses of model diversity. The method’s computational footprint, including the use of many pretrained checkpoints for training, is underexplored, leaving the practical cost-benefit unclear. Additionally, some reviewers found that performance on unseen tasks (e.g., with new output heads) lagged behind standard training, and suggested the paper would benefit from clearer real-world use cases or scenarios where RPG offers a distinct advantage over traditional training.

The authors' extensive rebuttals resolve all major concerns, with multiple reviewers raising their scores post-rebuttal. The paper received 1x accept, 2x borderline accepts, and 1x borderline reject. For the lingering concern of the negative review, the authors clarified that RPG specifically targets functional generalization (producing parameters for unseen tasks while holding the architecture fixed), while structural generalization to unseen architectures is a separate research direction beyond the scope of this paper. This distinction addresses one reviewer’s concern about unseen architectures and will be made explicit in the revision. The paper also details practical scenarios where RPG is beneficial: 1. efficient initialization, providing stronger starting points than random weights and reducing training costs, 2. few-shot learning, where RPG-generated parameters act as priors for data-scarce tasks, and 3. rapid model deployment, enabling near-instant adaptation to new tasks without retraining. These use cases strengthen the real-world relevance of RPG.

While there are reasonable critiques regarding cost-benefit transparency and reproducibility details (e.g., checkpoint usage, statistical reporting), reviewers widely recognized the originality and significance of the work, noting that it advances parameter-generating models well beyond prior methods and demonstrates strong functional generalization to new tasks within fixed architectures. The consensus leans toward acceptance.