Autoencoder-Based Hybrid Replay for Class-Incremental Learning
A new scalable hybrid autoencoder architecture for class-incremental learning.
Abstract
Reviews and Discussion
In this paper, the authors propose a method named hybrid autoencoder (HAE) and a strategy named autoencoder-based hybrid replay (AHR). HAE is an autoencoder trained with charged particle system energy minimization (CPSEM) equations and a repulsive force algorithm (RFA). This autoencoder is used both for decoding stored embeddings for replay and for encoding inputs for classification. AHR stores data embeddings rather than raw data.
Questions for Authors
What is O(cte)? Is there a misunderstanding behind O(0.1t)? Why not simply say that the embedding size is 1/10 of the raw data?
Claims and Evidence
The theoretical part is not described clearly. The experimental part is relatively reasonable, but more metrics should be added.
Methods and Evaluation Criteria
The authors compare against several baselines on different benchmark datasets using accuracy. They also compare the memory used across methods.
Theoretical Claims
This part is confusing, especially the messy notation, whose definitions are hard to find. For example, the authors state that the complexity is O(cte), but the meaning is hard to find. They also state that their space complexity is O(0.1t), which is a strange and unprofessional way to describe complexity: O(0.1t) is simply O(t).
Experimental Design and Analysis
The experimental design is relatively reasonable. As mentioned above, more metrics should be considered, such as how performance drops as more tasks are learned.
Supplementary Material
I reviewed the appendix.
Relation to Prior Literature
Storing embeddings rather than raw data is not new in continual learning. The idea of learning an autoencoder to reconstruct them for replay improves on it.
Missing Important References
There are some works based on the idea of storing embeddings, such as: Skantze G, Willemsen B. Collie: Continual learning of language grounding from language-image embeddings. Journal of Artificial Intelligence Research, 2022, 74: 1201-1223.
Other Strengths and Weaknesses
The strengths are the idea of learning an autoencoder to reconstruct stored embeddings and the experiments, which include many baselines. However, the description of the method and the notation are confusing. The experimental part should include more metrics.
Other Comments or Suggestions
See other parts.
We thank the reviewer for their time and feedback. We have carefully considered the points raised and we offer the following responses and clarifications:
On the Theoretical Clarity
-
Regarding O(cte): The notation "O(cte)" was intended as shorthand for "constant time complexity" (O(1)). We agree this was unclear and will explicitly replace "O(cte)" with the correct, widely recognized notation "O(1)" in the revised manuscript.
-
Regarding O(0.1t): The notation "O(0.1t)" was deliberately chosen to emphasize that our hybrid replay method uses approximately 10 times less memory compared to standard non-hybrid replay methods. Such notation, indicating proportional improvements in complexity, is commonly used in professional literature and textbooks to highlight practical differences in resource usage.
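To make the intended reading explicit, consider the following illustrative accounting (notation is ours for this response only: t stored exemplars, each of raw size d, with latent codes roughly one tenth that size):

```latex
\[
  M_{\mathrm{raw}}(t) \;=\; t\,d \;=\; \Theta(t), \qquad
  M_{\mathrm{hybrid}}(t) \;\approx\; 0.1\,t\,d \;=\; \Theta(t)
\]
% Both grow linearly in t; writing "O(0.1t)" was meant only to surface the
% roughly ten-fold smaller constant, which standard big-O notation discards.
```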
Incorporating Additional Evaluation Metrics
We commit to incorporating metrics such as average forgetting, forward transfer, incremental confusion maps, and accuracy curves over time to provide a better understanding of any biases (recency/primacy).
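For concreteness, we have in mind the standard definitions used in the CIL literature (with $a_{t,i}$ the test accuracy on task $i$ after training through task $t$, $T$ tasks in total, and $b_i$ the accuracy of a randomly initialized model on task $i$); the exact variants reported will be stated in the revision:

```latex
\[
  A_T = \frac{1}{T}\sum_{i=1}^{T} a_{T,i}, \qquad
  F_T = \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigl(\max_{t<T} a_{t,i} - a_{T,i}\Bigr), \qquad
  \mathrm{FWT} = \frac{1}{T-1}\sum_{i=2}^{T}\bigl(a_{i-1,i} - b_i\bigr)
\]
```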
Clarification on Hybrid Method Novelty
Indeed, hybrid replay strategies have been explored previously, and we cited relevant works like i-CTRL [1] (2022), REMIND [2] (2020), and its variant REMIND+ [3] (2021) in our paper. To provide clearer context and comparison, we will include the following comparative analysis:
| Method | Classification Approach | Quantization | Learning Scenario | Latent Space Representation | Architectural Simplicity |
|---|---|---|---|---|---|
| AHR | Classification within the latent space of the encoder using Euclidean distance. | Easy integration possible, potentially improving performance. | Offline | Structured latent space (Lennard-Jones Potential). | Minimalistic and clean design. |
| REMIND | Classification after decoding with cross-entropy loss. | Applies quantization to latent exemplars for compression. | Online | Unstructured latent space (CNN-based). | Complex architecture, complicating implementation and scalability. |
| REMIND+ | Classification after decoding with cross-entropy loss. | Uses quantization for feature compression. | Online | Unstructured latent space (autoencoder-based). | Complex architecture, hindering scalability. |
| i-CTRL | Classification within latent space using Euclidean distance. | Easy integration possible, potentially improving performance. | Offline | Structured latent space (Linear Discriminative Representation). | Minimalistic and clean design. |
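To make the first row concrete, AHR-style classification amounts to a nearest-centroid search in the encoder's latent space. Below is a minimal sketch; the names (`encoder`, `class_centroids`) are hypothetical and not our actual code:

```python
import torch

def classify_latent(x, encoder, class_centroids):
    """Nearest-centroid classification in the latent space (illustrative sketch).

    encoder:         maps a batch of inputs to latent codes, shape [batch, latent_dim]
    class_centroids: one fixed centroid per class, shape [num_classes, latent_dim]
    """
    with torch.no_grad():
        z = encoder(x)                           # encode into the structured latent space
        dists = torch.cdist(z, class_centroids)  # Euclidean distance to every centroid
    return dists.argmin(dim=1)                   # closest centroid gives the predicted class
```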
Additional Points and Clarifications
-
Relevant References: We will discuss the relevant work by Skantze and Willemsen [4] to provide further context.
-
On Latent Space Dimension: We clarify that the compression ratio varies depending on the dataset and the underlying network architecture used for feature extraction. The specific compression ratios are:

| Dataset | MNIST | SVHN | CIFAR-10 | CIFAR-100 | miniImageNet |
|---|---|---|---|---|---|
| Compression Ratio (×) | ≈ 40 | ≈ 10 | ≈ 10 | ≈ 10 | ≈ 10 |
Thank you very much for your thorough review and feedback. We respectfully invite you to review the comments from the other reviewers as well, who have positively acknowledged key strengths and contributions of our work. We hope this broader context might offer additional perspectives on the significance and novelty of our contributions, leading to a more balanced and constructive reassessment. We greatly value your insights and remain committed to addressing your concerns comprehensively.
References
[1] Incremental learning of structured memory, Tong et al., 2022.
[2] Remind your neural network to prevent forgetting, Hayes et al., 2020.
[3] Acae-remind for online continual learning, Wang et al., 2021.
[4] Collie: Continual learning of language grounding. Skantze et al., 2022.
The paper proposes an Autoencoder-Based Hybrid Replay (AHR) strategy for class-incremental learning (CIL), addressing catastrophic forgetting (CF) and task confusion (TC) while reducing memory complexity. The core innovation is a Hybrid Autoencoder (HAE) that compresses exemplars into a latent space using a repulsive force algorithm (RFA) inspired by charged particle systems. This allows efficient storage and recovery of exemplars via a decoder designed for memorization. Extensive experiments validate the effectiveness of this method.
Questions for Authors
Please refer to the weaknesses.
Claims and Evidence
Most claims are generally supported by rigorous experiments and ablation studies. However, the paper claims that AHR is applicable to task-free CIL, but it is only briefly mentioned in Figure 2 and lacks relevant experimental proof.
Methods and Evaluation Criteria
Yes, the proposed method makes sense to the issue at hand.
Theoretical Claims
Yes; I have not found its theoretical claims to be incorrect.
Experimental Design and Analysis
Experiments are comprehensive but have some shortcomings:
- The baselines compared are somewhat outdated, with most of the methods from around 2020 and only one from 2023. What is the relative performance of AHR compared to the most recent methods?
- Across the five datasets, the network architectures are the underlying dense network and ResNet32; how does AHR perform with models of the ViT architecture?
- Benchmarks use balanced class splits; performance on imbalanced data (common in real-world CIL) is untested.
Supplementary Material
The supplementary material contains hyperparameters and experimental details and provides source code, which makes it reproducible.
Relation to Prior Literature
The work effectively builds on prior CIL strategies and addresses the shortcomings of previous approaches.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths:
1. Innovative combination of physically-inspired RFA and autoencoder provides new ideas for CIL.
2. Extensive experiments on different benchmarks have demonstrated the effectiveness of AHR.
3. The proposed AHR practically focuses on memory/computation efficiency and achieves good results.

Weaknesses:
- Refer to the questions posed in the Claims and Evidence and Experimental Design and Analysis sections.
- The effect of some hyperparameters on the performance is missing, such as λ in Eq. 1.
Other Comments or Suggestions
None
Ethics Review Concerns
None
We sincerely thank the reviewer for their thorough review and insightful comments. We are particularly grateful for the recognition of AHR's strengths. Below, we clarify and address issues pointed out by the reviewer:
On Applicability to Task-Free CIL
We'd like to clarify that Figure 2 is intended solely as a visualization to illustrate the concept of a task within our task-based CIL framework. We do not claim that our current approach performs equally well in task-free settings. Nevertheless, due to the inherent compression capabilities of our proposed architecture, which facilitate the storage of larger and more diverse exemplars, it could potentially, with minor adjustments, be adapted for use in task-free settings. We will explicitly clarify this point in the revised paper and leave a detailed exploration of this possibility to future work.
On the Choice and Recency of Baselines
Our selection aimed to include representative and relevant methods from the CIL literature, particularly focusing on hybrid replay strategies where latent or compressed representations are stored. To this end, we included methods like i-CTRL [1] (2022), REMIND [2] (2020), and its variant REMIND+ [3] (2021), which align closely with the concept of AHR. However, we remain open to incorporating additional recent baselines if the reviewer can point to specific, highly relevant examples where a direct comparison would be particularly insightful. Furthermore, we believe AHR's core contribution, using an efficient hybrid replay, is a concept potentially orthogonal and complementary to other advancements in CIL backbone architectures. One could envision integrating AHR's lightweight decoder mechanism with the latent representations of other contemporary CIL models to potentially enhance their performance under strict memory constraints, leveraging our memory-efficient replay approach.
On Performance with ViT Architectures
Our current experiments utilize standard backbone networks like ResNet32 commonly employed in the CIL literature. This choice was primarily driven by the goal of ensuring fair and direct comparability with a wide range of existing CIL methods, many of which report results using these architectures. While we believe AHR's core mechanism, efficient exemplar storage, is conceptually compatible with features extracted by ViTs, rigorously evaluating this combination was beyond the scope of the current study due to the aforementioned reasons of comparability and computational cost. We will note the exploration of AHR with ViT backbones as a valuable avenue in the 'Limitations and Future Work' section.
On Performance with Imbalanced Data Distributions
While our current experiments utilize standard balanced benchmarks for clear comparability, we believe AHR's core contribution, its highly efficient hybrid replay mechanism, is conceptually orthogonal to the challenge of class imbalance itself. Critically, AHR's ability to store significantly larger and potentially more diverse sets of exemplars within a fixed memory budget remains a powerful tool, regardless of the underlying class distribution. Since maintaining knowledge of past classes, especially minority classes, is vital in an imbalanced setting, the enhanced replay capability offered by AHR is anticipated to be advantageous even under such conditions. However, we acknowledge that we have not explicitly evaluated AHR under imbalanced scenarios in the present work. Thoroughly investigating AHR's performance in such settings, potentially in combination with techniques specifically designed for imbalance, is indeed crucial. We recognize this and will explicitly identify the evaluation of AHR under class imbalance as an important area for future investigation in the 'Limitations and Future Work' section.
On Hyperparameter Sensitivity (λ)
Concerning the weighting factor λ in Equation 1, which balances the reconstruction and classification losses: while detailed ablation studies were omitted from the main paper due to space constraints, we performed sensitivity analyses during development. The results for the CIFAR-10 5/2 split benchmark demonstrate the impact of varying λ:
| λ Value | 0 | 20 | 40 | 60 | 80 | 100 | 120 |
|---|---|---|---|---|---|---|---|
| Avg. Accuracy (%) | 42.03 | 63.93 | 70.51 | 73.70 | 77.12 | 76.53 | 72.26 |
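For reference, the role of λ can be written schematically as below. This is only an illustration of the trade-off under one common convention; the exact form and placement of λ are those of Equation 1 in the paper:

```latex
\[
  \mathcal{L} \;=\; \mathcal{L}_{\mathrm{classification}} \;+\; \lambda\, \mathcal{L}_{\mathrm{reconstruction}}
\]
% Illustration only: under this reading, lambda = 0 removes the reconstruction
% signal needed for latent replay, while very large lambda over-weights
% reconstruction relative to class separation.
```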
We commit to including a more comprehensive sensitivity analysis for λ in the appendix of the revised paper.
Thank you for reading our response.
References
[1] Incremental learning of structured memory via closed-loop transcription, Tong et al., 2022.
[2] Remind your neural network to prevent catastrophic forgetting, Hayes et al., 2020.
[3] Acae-remind for online continual learning with compressed feature replay, Wang et al., 2021.
The paper tackles the problem of class incremental learning, where the model sees a sequence of tasks of different classes, and needs to adapt to them sequentially while minimizing catastrophic forgetting and task confusion. During testing, the model does not have access to the task ID.
To solve this problem, the authors propose to train a novel autoencoder architecture that is used for both replay and classification. The contributions of the paper are as follows:
- The authors propose a hybrid autoencoder that is used both for computing the latent representation to store in memory and replay, and for classification. This autoencoder is coupled with a physics-inspired approach: charged particle system energy minimization and a repulsive force algorithm to incrementally add components to the latent space memory. The goal of this approach is to keep the different classes far apart in the latent space, allowing the simple use of Euclidean distance to the class centroids to classify samples at test time.
- These autoencoders are used in an approach that combines exemplar and generative replay ideas. As mentioned above, the approach stores exemplars from the latent space, reducing the memory footprint of the approach. When replay data is needed, the autoencoder is used to decode samples from the memory. Contrary to generative replay, the decoder is trained for memorization and not to generate new samples, which hedges the approach against some of the drawbacks of generative replay (a code sketch illustrating this loop follows the list below).
- The approach is tested on 5 benchmarks, and compared to 10 baselines, showing an improvement in most cases. The authors also conduct some ablation studies, mainly testing the importance of the use of the physics inspired approach to structure the latent space.
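To make the replay mechanism in the first two bullets concrete, here is a rough sketch of the store-and-replay loop as I understand it; all names are hypothetical and the logic is simplified, not the paper's actual code:

```python
import torch

def build_latent_memory(encoder, exemplars_by_class):
    """Store compressed latent codes instead of raw images (hypothetical names)."""
    with torch.no_grad():
        return {c: encoder(x) for c, x in exemplars_by_class.items()}  # class -> latent codes

def replay_batch(decoder, latent_memory):
    """Decode stored codes back into inputs; the decoder memorizes, it does not generate."""
    xs, ys = [], []
    for c, z in latent_memory.items():
        xs.append(decoder(z))
        ys.append(torch.full((z.shape[0],), c, dtype=torch.long))
    return torch.cat(xs), torch.cat(ys)

# During each new task: train on the current data mixed with replay_batch(...)
# to mitigate forgetting of previously seen classes.
```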
Update after rebuttal
The authors answered most of my questions and addressed most of my concerns. I also thank the authors for their reactivity during the rebuttal. With the added evaluations and additional experiments to the revised version, I am happy to increase my score to 3.
Questions for Authors
All of my questions are related to my previous comments:
- Can the authors comment on the choice of not changing the class centroids?
- How would the method behave and how should it be modified to be applicable in more realistic scenarios and benchmarks?
- How does the method impact other continual learning metrics?
- How does the latent space dimension impact the results?
Claims and Evidence
The paper's main claim is to achieve better performance in the class-incremental setting while reducing the memory footprint and keeping compute linear, as in competitor methods.
The authors support this claim empirically through extensive experiments on several datasets, comparing to a good number of baselines. While I have some comments on the choice of baselines and benchmarks used (see next sections), I think the evidence is clearly provided and relatively convincing.
Methods and Evaluation Criteria
Strengths:
- The whole approach is relatively novel to the best of my knowledge.
- In particular, the use of the physics-inspired approach to structure the latent space is interesting and seems to add a significant effect.
- The experiments are extensive. I particularly appreciated the ablation studies.
Weaknesses:
- The authors made the choice of not changing the class centroid positions. While this works well for the tested benchmarks, I think it is mainly due to the fact that classes are perfectly disjoint across tasks in these artificial benchmarks. In real applications, this might not be the case, and allowing a more flexible adaptation of the latent space would not only be more generalizable, but could also have other effects (e.g. help reduce the degradation of the decoder).
- Regarding the benchmarks, despite their wide use in the literature, they are highly artificial and have a very limited representation of real world applications. There are multiple attempts to propose alternative benchmarks for continual learning and related topics (e.g. meta-learning) that are more realistic. I will provide the references in the dedicated section below. They also consist of relatively short sequences. Recent works have indicated the impact of the sequence length on the model behavior (see reference below too).
- For evaluation, the authors base their analysis on accuracy at the end of the sequence only. It would be interesting to have a more granular approach, with metrics such as forgetting or forward transfer, to test whether the approach has a recency or primacy bias, whether it increases knowledge accumulation, etc.
Theoretical Claims
The paper is empirical in nature. The main theoretical contribution is in deriving the different objectives and algorithms that constitute the approach. Except for the first comment under Weaknesses in the previous section, I did not detect any other issues.
Experimental Design and Analysis
My main critique to the experimental design is the choice of benchmarks and metrics as explained above.
As mentioned above, I appreciated the ablation studies that highlight the importance of different components of the approach. These results can be improved with additional tests. For example, it seems from the paper introduction and claims that the latent space size is fixed to 2.5 or 10% of the input size. If this is not the case, this should be made clearer. If it is the case, it would be interesting to test the impact of the latent space dimension on the results.
Regarding the form, the curves can be improved and made more readable. In particular, the choice of colors is not optimal. For example, the finetuning-with-replay-exemplars baseline and the proposed approach (AHR-up) have the same color in Figure 3.
Supplementary Material
I checked the experimental details, and the extended literature review.
Relation to Prior Literature
While the paper already includes a detailed literature review, there are some missing references as mentioned above. Nevertheless, the paper is relatively well situated in the literature. In particular, the comparison to several generative and exemplar replay approaches and strategies is interesting.
Benchmarks:
- Meta-Album: Multi-domain Meta-Dataset for Few-Shot Image Classification, Ullah et al. 2022
- A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark, Zhai et al. 2020
- NEVIS'22: A Stream of 100 Tasks Sampled from 30 Years of Computer Vision Research, Bornschein et al. 2023
On the impact of sequence length
- Challenging Common Assumptions about Catastrophic Forgetting, Lesort et al, 2023
Missing Important References
Regarding the idea of latent replay, and related baselines, I think the authors are missing an important reference:
- Continual Learning with Foundation Models: An Empirical Study of Latent Replay, Ostapenko et al. 2022
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
We are impressed by the reviewer's high-quality feedback, which reflects a deep understanding of the field and engagement with our work. We particularly appreciate the fairness of their critique. In response, we offer the following clarifications:
On the Choice of Not Changing Class Centroid Positions
We acknowledge the importance of considering more realistic evaluation scenarios where task data may not be perfectly class-disjoint, in which case the idea of fixing class centroids (successfully employed in recent works [1]) is suboptimal. However, our decision to employ the standard, well-established CIL benchmarks (which typically feature disjoint classes) was primarily driven by the necessity of ensuring fair, direct, and easily interpretable comparisons with the large body of existing literature evaluated under these protocols.
We commit to adding a dedicated discussion in the 'Limitations and Future Work' section of our revised paper, where we state that while the proposed fixed-centroid approach demonstrates strong performance under the standard disjoint CIL setting, future work should investigate adaptive centroid mechanisms to optimize performance for non-disjoint, real-world data distributions [2,3,4]. We will also mention that in such realistic non-disjoint benchmarks, the decoder's degradation could be significantly mitigated, because the overlap of classes across tasks could lead to the decoder revisiting older classes.
On the Choice of Benchmarks and Sequence Length
We acknowledge the limitations of standard CIL benchmarks, particularly their often artificial nature and relatively short task sequences. We thank the reviewer for providing references to more realistic benchmarks [2, 3, 4] and studies on the impact of sequence length [5]. We commit to incorporating a discussion of these limitations and citing these important works in the 'Limitations and Future Work' section.
Incorporating Additional Evaluation Metrics
We will incorporate more granular evaluation metrics beyond final accuracy and agree that metrics such as average forgetting, forward transfer, incremental confusion maps, and potentially accuracy curves over time can provide a better understanding of any potential biases (recency/primacy). We commit to including these evaluations in the revised version of our paper.
Additional Points and Clarifications
-
On Latent Space Dimension: We clarify that the compression ratio varies depending on the dataset and the underlying network architecture used for feature extraction. The specific compression ratios and latent space sizes used in our experiments are summarized below:
| Dataset | MNIST | SVHN | CIFAR-10 | CIFAR-100 | miniImageNet |
|---|---|---|---|---|---|
| Compression Ratio (×) | ≈ 40 | ≈ 10 | ≈ 10 | ≈ 10 | ≈ 10 |
| Latent Space Size | 20 | 307 | 307 | 307 | 2117 |
-
On Figure Readability and Color Choices: In the revised paper, we will ensure the distinctness of color palettes, potentially incorporating different line styles or markers where appropriate.
-
Inclusion of a Latent Replay Study: We will include reference [6] and discuss how it compares with our proposed work.
Kind Request for Reassessment
We believe our proposed Autoencoder Hybrid Replay method remains highly effective, even under realistic benchmark conditions, due to its ability to store a significantly larger and more diverse set of exemplars. Extensive research supports that exemplar diversity effectively mitigates catastrophic forgetting. Additionally, we have devised solutions for handling non-class-disjoint benchmarks with adaptive centroids, which we could not include here due to space constraints but will gladly discuss during the discussion period.
We have strived to address the key concerns, regarding benchmarks and evaluation metrics, through clarification and planned revisions (expanded limitations/future work section, additional metrics). We hope these responses and commitments are satisfactory. If you feel that our clarifications and planned updates satisfactorily address your concerns, we kindly request you to consider increasing your score. This would greatly support the acceptance of our work and help us contribute meaningfully to the field. Regardless of your decision, we are sincerely grateful for your thoughtful feedback. Thank you.
References
[1] Combating Inter-Task Confusion and Catastrophic Forgetting, Moslem et al. 2025.
[2] Meta-Album: Multi-domain Meta-Dataset, Ullah et al. 2022
[3] A Large-scale Study of Representation Learning, Zhai et al. 2020
[4] NEVIS'22: A Stream of 100 Tasks, Bornschein et al. 2023
[5] Challenging Common Assumptions, Lesort et al., 2023
[6] Continual Learning with Foundation Models, Ostapenko et al. 2022
I thank the authors for their careful consideration of my comments, and for revising their paper accordingly.
Regarding additional benchmarks: I agree that the used benchmarks are standard and widely used, and I also agree comparing the methods on these benchmarks is needed and useful. It is however important for the community to go beyond the limitations and the artificial settings of these benchmarks, in order to improve our understanding of more realistic scenarios, and develop methods that can have wide practical implications.
Regarding varying the size of the latent space, while there is a difference between MNIST and the other datasets, the ratio is the same for all the other datasets. It is important to state how this ratio has been chosen (my guess is based on the memory requirements?), and how changing it would influence the behavior of the method.
Again, thank you very much for your answers.
We appreciate the reviewer's suggestion to go beyond the commonly used benchmarks for evaluation as they have significant limitations. We commit to conducting two additional experiment scenarios and including them in the appendix of the revised manuscript:
-
Non-class-disjoint scenario: We plan to incorporate a simulation scenario in which task sequences feature realistic class overlaps, following the data distributions specified in references suggested by the reviewer (e.g., Ullah et al., 2022; Zhai et al., 2020). By juxtaposing our results across both standard and realistic simulation scenarios, we will discuss how the performance of our proposed AHR method is affected when operating in non-class-disjoint scenarios. Additionally, we will provide insights into the reasons for any observed performance variations, indicating AHR's strengths and potential limitations in real-world applications.
-
Long task sequence scenario: Recognizing the limitations of the short task sequences, we will conduct simulation scenarios that study longer sequences of incremental learning tasks following the approach outlined by Lesort et al. (2023).
Our baseline comparisons for these additional scenarios will include representative latent hybrid replay methods, specifically:
- i-CTRL (Tong et al., 2022)
- REMIND (Hayes et al., 2020)
- REMIND+ (Wang et al., 2021)
- Latent Replay (Ostapenko et al., 2022), as explicitly recommended by the reviewer.
These baselines align closely with our proposed AHR method in terms of their main contribution, allowing for rigorous and meaningful comparisons. Detailed experimental setups, including task definitions, evaluation metrics, and architectures, will be provided in the appendix.
We greatly appreciate the reviewer’s inquiry regarding our choice of latent space dimension. Specifically, we set the latent space dimension to 307 for SVHN, CIFAR-10, and CIFAR-100, and 2117 for miniImageNet. These latent space dimensions were chosen to achieve approximately a 10-fold memory compression, with the ultimate goal of clearly demonstrating that an order-of-magnitude memory reduction is achievable with our proposed architecture across diverse datasets. For the simpler MNIST dataset, we utilized an even more substantial compression rate of approximately 40-fold.
However, during development, we found that achieving significantly greater compression (e.g., 100-fold) was impractical given the constraints of the standard ResNet-32 architecture. Specifically, the ResNet-32 encoder naturally compresses the input images (32 × 32 × 3 for SVHN, CIFAR-10, and CIFAR-100; and 84 × 84 × 3 for miniImageNet) into a 64-dimensional latent representation. Thus, attempting a latent dimension as low as 30 (corresponding to roughly two orders of magnitude compression) proved ineffective. Consequently, we consider the natural compression of 64 dimensions provided by ResNet-32 as an approximate lower bound for latent space dimensions when working with SVHN, CIFAR-10, CIFAR-100, and miniImageNet (using ResNet-32 architecture).
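For reference, these factors follow directly from the raw input and latent sizes (assuming 28×28 single-channel MNIST images, which are not restated above):

```latex
\[
  \frac{32\cdot 32\cdot 3}{307} = \frac{3072}{307} \approx 10, \qquad
  \frac{84\cdot 84\cdot 3}{2117} = \frac{21168}{2117} \approx 10, \qquad
  \frac{28\cdot 28\cdot 1}{20} = \frac{784}{20} \approx 39 \;(\approx 40\text{-fold})
\]
```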
Additionally, there are two further reasons why setting the latent dimension to 64, despite being the natural lower bound, is not ideal for our architecture:
-
Decoder Complexity: Our decoder architecture is intentionally designed to be lightweight, composed of only three simple convolutional layers. Achieving effective reconstruction from latent representations demands that the input images not be excessively compressed. Overly compressing images would necessitate employing a deeper decoder.
-
Class Separation and Structured Latent Space: Within our architecture, a clear separation of classes within the latent space (structured latent space) is crucial for accurate classification and class-conditioned exemplar decoding. Empirically, we found that larger latent spaces facilitate significantly better class separation, particularly when dealing with extensive sequences of tasks.
We'll include results and discussions on the latent space dimensions in a dedicated section in the revised paper.
We're grateful to the reviewer and we'd be encouraged if they raised their score. Thank you very much for reading our response.
References:
- Ullah et al., 2022. Meta-Album: Multi-domain Meta-Dataset for Few-Shot Image Classification.
- Zhai et al., 2020. A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark.
- Lesort et al., 2023. Challenging Common Assumptions about Catastrophic Forgetting.
- Tong et al., 2022. Incremental learning of structured memory via closed-loop transcription.
- Hayes et al., 2020. Remind your neural network to prevent catastrophic forgetting.
- Wang et al., 2021. Acae-remind for online continual learning with compressed feature replay.
- Ostapenko et al., 2022. Continual Learning with Foundation Models: An Empirical Study of Latent Replay.
The paper proposes a hybrid AE coupled with charged particle system energy minimization and a repulsive force algorithm. This algorithm has linear time complexity (O(0.1t) should be written as O(t)). Complaints include (1) comparing against somewhat outdated baselines and (2) reporting only the final accuracy metric instead of forgetting and forward transfer. The only reviewer who gave a rejection score did not elaborate on their review and was not very responsive to the authors' rebuttal. Therefore, the AC recommends weak acceptance of the submission to ICML 2025.