PaperHub
Overall score: 6.4 / 10 · Poster · 3 reviewers
Reviewer ratings: 5, 4, 3 (mean 3.7; min 3, max 5, standard deviation 0.8)
Confidence
Originality: 3.0 · Quality: 2.7 · Clarity: 3.3 · Significance: 2.3
NeurIPS 2025

MAT-Agent: Adaptive Multi-Agent Training Optimization

OpenReview · PDF
Submitted: 2025-04-28 · Updated: 2025-10-29

Abstract

We propose a novel collaborative multi-agent optimization framework for adaptive training in multi-label image classification, fundamentally advancing beyond static decision rules and isolated automation. Our method deploys a set of distributed, task-specific agents, each responsible for dynamically orchestrating critical training components—including data augmentation, optimization methods, learning rate schedules, and loss functions—according to evolving visual-semantic relationships and training states. Each agent employs an advanced non-stationary multi-armed bandit algorithm, integrating both $\epsilon$-greedy and upper confidence bound strategies, to judiciously balance exploration with exploitation throughout the training lifecycle. A hierarchical composite reward mechanism synergizes overall classification accuracy, rare class recognition, and training stability, fostering both independent optimization and implicit collaborative behavior among agents. The framework further leverages refined techniques such as dual-rate exponential moving average smoothing and structured mixed-precision training to enhance robustness and computational efficiency. Extensive experiments across benchmarks including Pascal VOC, COCO, Yeast, and Mediamill demonstrate that our approach achieves superior mean average precision and rare-class F1 scores compared to state-of-the-art methods, while also exhibiting rapid convergence and remarkable cross-domain generalization. Our results indicate that collaborative multi-agent adaptive optimization offers a scalable and principled solution for self-optimizing deep learning in complex multi-label scenarios.
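For intuition only, below is a minimal sketch of the kind of non-stationary ε-greedy/UCB decision rule the abstract describes. The arm names, the discounting scheme, and all hyperparameters are illustrative assumptions rather than the paper's implementation.

```python
import math
import random

class NonStationaryBandit:
    """Illustrative epsilon-greedy + UCB arm selector with discounted statistics,
    so that recent rewards dominate under a drifting (non-stationary) objective."""

    def __init__(self, arms, epsilon=0.1, gamma=0.95, c=1.0):
        self.arms = list(arms)
        self.epsilon = epsilon   # exploration rate for the epsilon-greedy fallback
        self.gamma = gamma       # discount that forgets stale reward statistics
        self.c = c               # UCB exploration coefficient
        self.values = {a: 0.0 for a in self.arms}   # discounted reward sums
        self.counts = {a: 0.0 for a in self.arms}   # discounted pull counts

    def select(self):
        # With probability epsilon explore uniformly; otherwise pick the arm with
        # the highest discounted mean plus a UCB bonus.
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        total = sum(self.counts.values()) + 1e-8

        def ucb(arm):
            n = self.counts[arm]
            if n == 0:
                return float("inf")   # try every arm at least once
            return self.values[arm] / n + self.c * math.sqrt(math.log(total + 1.0) / n)

        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        # Discount all statistics before adding the new observation.
        for a in self.arms:
            self.values[a] *= self.gamma
            self.counts[a] *= self.gamma
        self.values[arm] += reward
        self.counts[arm] += 1.0

# Example: an optimizer-selection agent choosing among four arms at each decision step.
agent_opt = NonStationaryBandit(["SGD", "Adam", "AdamW", "RAdam"])
choice = agent_opt.select()
agent_opt.update(choice, reward=0.42)   # reward would come from the composite signal
```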
Keywords
Multi-Agent, Multi-Label, Bandit, Training-Optimization

Reviews and Discussion

Review (Rating: 5)

This paper introduces MAT-Agent, a novel framework for Multi-Label Image Classification (MLIC) that treats the training process as a dynamic optimization problem solved by multiple collaborating agents. The core idea is to move away from fixed, pre-defined training configurations. Instead, the framework deploys four autonomous agents, each responsible for dynamically selecting a strategy for a key training component: data augmentation, optimizer, learning rate schedule, and loss function. These agents are modeled as reinforcement learning agents, specifically using a value-based approach akin to Deep Q-Networks, which learn to make sequential decisions at each stage of training. Their collective goal is to maximize a shared, composite reward signal that balances accuracy (mAP), performance on rare classes, and overall training stability. The authors demonstrate through extensive experiments on Pascal VOC, COCO, and VG-256 datasets that MAT-Agent outperforms a wide range of state-of-the-art MLIC models and other automated training methods, showing superior final performance, faster convergence, and strong cross-domain generalization.

Strengths and Weaknesses

Strengths:

  1. Originality & Significance: The primary strength of this paper is its novel conceptualization of the MLIC training pipeline. Framing hyperparameter optimization as an online, multi-agent sequential decision-making problem is a significant departure from conventional static configurations or computationally expensive offline search methods like grid search and AutoML. This dynamic, adaptive approach addresses the non-stationary nature of the training process, where optimal strategies may change as the model learns. This is a promising and original direction for building more intelligent and efficient training systems.

  2. Quality (Technical & Experimental): The proposed method is technically sound, grounding the agent decision-making process in established reinforcement learning principles like value function approximation (DQN) and the exploration-exploitation trade-off. The experimental evaluation is comprehensive and provides strong evidence for the method's effectiveness. The authors benchmark against numerous strong baselines on three distinct and widely used datasets (Pascal VOC, COCO, VG-256), consistently demonstrating state-of-the-art performance. The inclusion of convergence analysis, cross-dataset generalization experiments, and detailed ablation studies significantly strengthens the paper's claims and demonstrates the robustness of the approach.

  3. Clarity: The paper is well-written and clearly structured. The motivation for moving beyond static training paradigms is well-articulated. The methodology section systematically breaks down the problem formulation, the MAT-Agent framework, the state-action-reward setup, and the learning algorithm. Figures are used effectively to illustrate the framework architecture, compare convergence curves, and visualize ablation results, which aids in understanding the paper's contributions.

Weaknesses:

  1. Clarity & Reproducibility: While generally clear, the paper defers many critical implementation details to the supplementary material. Key design choices, such as the complete action spaces for each agent, the precise construction of the state vector, the architecture of the Q-networks, and the weighting of the composite reward function, are not available in the main text. This makes it difficult to fully assess the methodology and reproduce the results without constant reference to the appendix, hindering the self-containedness of the paper.

  2. Overhead and Efficiency Claims: The paper claims to operate "without added search overhead" compared to methods like AutoML. This is potentially misleading. The framework introduces four neural network-based agents that must be trained, and the system must perform state extraction, forward passes through the policy networks, and updates to these networks at each decision step. While Figure 4 shows MAT-Agent is highly efficient in terms of wall-clock time to convergence, the computational overhead per epoch is non-trivial and is not explicitly analyzed. A more transparent discussion of the computational trade-offs is needed.

  3. Multi-Agent Coordination: The agents are described as decentralized but are coordinated via a single, global reward signal $R_{t+1}$. This setup is prone to the multi-agent credit assignment problem, where it is difficult to determine which agent's action contributed to a given reward. The paper does not discuss how this challenge is addressed. The ablation study on "Agent Coordination" shows its importance, but it is unclear what this mechanism entails beyond the shared reward, making the "collaboration" aspect under-specified. The conclusion's mention of "structured communication channels" is not substantiated in the methodology.

Questions

  1. Multi-Agent Credit Assignment and Coordination: The agents learn via a shared global reward, which can make credit assignment difficult. Your ablation study shows that disabling "Agent Coordination" degrades performance. Could you please elaborate on what this coordination mechanism consists of beyond the shared reward signal? For instance, does it involve value decomposition methods (e.g., VDN, QMIX, which you cite) or another form of communication between agents? Clarifying this would strengthen the multi-agent aspect of your contribution. My evaluation would increase if the coordination mechanism is technically novel and well-justified.

  2. Computational Overhead Analysis: Could you provide a more detailed breakdown of the computational cost of MAT-Agent per training epoch? Specifically, what is the overhead (in terms of time or FLOPs) associated with (a) extracting the state representation $\mathcal{I}_{t}$, (b) performing forward passes for the four agent policies, and (c) updating the agent networks, relative to the cost of a standard training update for the main MLIC model? A transparent analysis of this overhead would provide a more nuanced understanding of the method's efficiency and would help clarify the claim of operating "without added search overhead".

  3. Sensitivity to Reward Formulation: The composite reward function (Eq. 3) uses weights ($w_{mAP}, w_{stab}, w_{conv}, w_{pen}$) to balance multiple objectives. The performance of RL agents is often sensitive to reward design and weighting. Could you provide an ablation study or analysis on how performance changes with different settings of these weights? Demonstrating robustness to these hyperparameters or providing a principled way to set them would significantly increase the practical value of your framework.

  4. State Representation Components: The state vector $s_t$ includes performance metrics, training dynamics, and data descriptors like "average texture richness". The inclusion of data-specific features is interesting. Could you provide an ablation study on the contributions of the different components of your state vector ($s_t^{perf}$, $s_t^{dyn}$, $s_t^{data}$)? This would provide valuable insight into which signals are most informative for the agents when making their decisions.

Limitations

The authors mentioned the limitations of the work and proposed future directions.

Final Rating Justification

This paper is interesting. It is sufficient for NeurIPS.

Formatting Concerns

No Concerns

Author Response

W1. Clarity & Reproducibility.

A1: Thank you very much for your insightful suggestions. In the revised paper, the appendix and main text have been reorganized, and the implementation details have been moved from the appendix to their corresponding sections to ensure the paper's self-containedness. The specific revisions include: (i) Action Space: the complete agent action spaces have been incorporated into Section 3.2; (ii) State Vector: the detailed composition of the state vector, its specific metrics, and the historical information aggregation method have been added to the end of Section 3.2; (iii) Network Architecture: the specific architecture of the Q-Network has been specified in an "Architecture Box" within Section 3.3; and (iv) Reward Function: the weights of the composite reward function ($w_{mAP}$, etc.) and their calculation method have been moved to Section 3.3 via pseudocode. These adjustments ensure that the main paper is sufficient to support the reproduction of our work. We promise to release our source code (see supplementary) once the paper is accepted.

W2. Overhead and Efficiency Claims:

A2: As shown in our response to Reviewer dAxZ (see Table A), MAT-Agent is significantly superior to the baselines in terms of overall convergence efficiency. Regarding the phrase "without added search overhead," we acknowledge that this statement is not accurate. Our original intent was to emphasize that MAT-Agent does not require large-scale, offline hyperparameter search or architecture evolution in the style of AutoML.

In our revised paper, Table B (all of which can be fully reproduced with the executable source codes included in the supplementary materials) shows that: (i) MAT-Agent's per-epoch online overhead is minimal, introducing only an approximate 9% increase in time (+0.03s/epoch) and a 3% increase in peak memory; (ii) this small investment yields a substantial return, boosting total convergence efficiency by 39% and drastically reducing the required GPU hours from 150h to 91h; (iii) even other dynamic methods like PBT show a less favorable efficiency gain (↓13%); and (iv) our total cost has an order-of-magnitude advantage compared to offline AutoML methods (>480h).

Table B: Computational Cost Trade-off Analysis of MAT-Agent vs. Other Methods (on 1×A100-80GB)

| Method | Per-Epoch Cost | Peak Memory Increase | Total Cost (GPU Hours to Converge) |
| --- | --- | --- | --- |
| Baseline (Standard Training) ¹ | 0.31 s (1.00×) | +0% | ≈ 150 h |
| MAT-Agent | 0.34 s (≈ 1.09×) | +3% | ≈ 91 h (↓ 39%) |
| PBT (from Fig. 4) | 0.33 s (≈ 1.06×) | +1% | ≈ 130 h (↓ 13%) |
| Offline AutoML Search ² | | | |
| — Search Phase | 0.36 s (≈ 1.16×)* | +4%* | > 480 h |
| — Final Training | 0.31 s (1.00×) | +0% | ≈ 150 h |

Notes:

  • ¹ Baseline refers to a single, static training run with fixed hyperparameters.
  • ² AutoML is divided into a 'Search Phase' and a 'Final Training' phase. The average per-epoch time and memory during the search phase are slightly higher due to operations like parallel trials, periodic evaluation, and model restarts. The final training phase is identical to the baseline.
  • * The time and memory increase for the search phase is an average per trial. The overall resource consumption depends on the number of candidates.

W3. Multi-Agent Coordination:

A3: Our MAT-Agent is not driven solely by the shared global reward $R_{t+1}$. On the contrary, it follows a centralized-training, decentralized-execution (CTDE) paradigm and integrates three complementary mechanisms: (i) Difference Rewards, (ii) Learnable Q-value Decomposition, and (iii) Differentiable Attention-Based Communication. These three components work in concert to markedly alleviate the credit assignment problem: using only the global reward degrades average performance by 12.5%; turning off difference rewards or replacing the mixer with VDN/simple summation lowers performance by 6.8% and 5.4%, respectively; and removing the communication channel slows convergence by 1.3×. Notably, the attention channel adds <2% computational overhead. To the best of our knowledge, this is the first work to combine difference rewards with a learnable QMIX-style mixer in a heterogeneous, vision-centric multi-agent setting, while maintaining high efficiency through sparse attention-based messaging.

Q1. Multi-Agent Credit Assignment and Coordination:

A4: Our MAT-Agent follows a centralised-training, decentralised-execution (CTDE) paradigm, integrating three tightly-coupled layers of coordination beyond the shared global reward:

  • Fine-grained Credit via Difference Rewards: For each agent, we compute its marginal contribution using difference rewards. This isolates each agent’s impact, providing an unbiased and effective training signal.

  • Extended QMIX-based Value Decomposition: We factorise the global action-value via an extended QMIX mixer with gated residual connections and task-specific conditioning. This ensures monotonic and invertible mappings between the global value and the local agent values $Q_i$, enabling gradients to propagate proportionally to each policy's contribution (a minimal mixer sketch follows after this list).

  • Sparse Attention-based Communication: Every eight timesteps, agents exchange messages through a sparse, single-head dot-product attention channel, enabling efficient sharing of salient hidden states with < 2% computational overhead.
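As referenced above, here is a minimal sketch of a QMIX-style monotonic mixer: state-conditioned non-negative weights keep $\partial Q_{tot} / \partial Q_i \ge 0$. The layer sizes, the plain hypernetwork form, and the omission of the gated residual connections and task conditioning mentioned in the response are simplifying assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Illustrative QMIX-style mixer: per-agent Q-values are combined with
    state-conditioned non-negative weights, keeping dQ_tot/dQ_i >= 0."""

    def __init__(self, n_agents=4, state_dim=16, hidden=32):
        super().__init__()
        # Hypernetworks map the shared training state to mixing weights and biases.
        self.w1 = nn.Linear(state_dim, n_agents * hidden)
        self.b1 = nn.Linear(state_dim, hidden)
        self.w2 = nn.Linear(state_dim, hidden)
        self.b2 = nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        batch, n_agents = agent_qs.shape
        w1 = torch.abs(self.w1(state)).view(batch, n_agents, -1)  # non-negative weights
        b1 = self.b1(state).unsqueeze(1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.w2(state)).unsqueeze(-1)               # non-negative weights
        b2 = self.b2(state).unsqueeze(1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(batch, 1)

# Usage: mix the four agents' chosen-action values under a shared state.
mixer = MonotonicMixer()
q_i = torch.randn(8, 4)       # per-agent Q-values for a batch of decision steps
state = torch.randn(8, 16)    # shared state representation (dimension assumed)
q_total = mixer(q_i, state)   # global value used for the TD target/loss
```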

Q2. Computational Overhead Analysis

A5: Our MAT-Agent does not require extensive offline hyperparameter tuning or expensive AutoML-style architecture search. In the revised paper, we have provided a detailed computational overhead analysis in Table C, confirming that the overall overhead introduced by MAT-Agent is moderate (approximately +9.7% per epoch), with the largest share coming from the forward passes through the four small policy networks.

Table C: The exact measurements (on a single A100-80GB GPU)

| Computational Component | Cost per Epoch | Relative Overhead |
| --- | --- | --- |
| Standard MLIC training (baseline) | 0.31 s | 1.00× (reference) |
| (a) State extraction $I_t$ | 0.007 s | +2.3% |
| (b) Forward passes (4 agent policies) | 0.015 s | +4.8% |
| (c) Agent network updates (backward pass) | 0.008 s | +2.6% |
| Total MAT-Agent overhead | 0.03 s | ≈ 9.7% |
| MAT-Agent total epoch cost | 0.34 s | ≈ 1.097× |

Q3. Sensitivity to Reward Formulation

A6: We analyzed the sensitivity to the reward weights with the following two experiments. First, we ran a grid search around the default equal weights of (0.25, 0.25, 0.25, 0.25). Each weight is perturbed by ±0.10 while the vector is re-normalized to sum to 1.0. Across all 81 trials, we find the performance to be highly stable:

  • Final Pascal VOC mAP varied by ≤ ±0.6% (from 96.8% to 97.6%).
  • Time-to-63-mAP on COCO fluctuated by ≤ ±4 epochs.

Second, for completeness, we replaced the four static weights with a learnable log-softmax layer. After a 5k warm-up period, the weights converge to (0.24, 0.26, 0.27, 0.23) and yield a final mAP of 97.5%, virtually identical to the hand-tuned result. These observations support the robustness of our reward formulation.
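A hedged sketch of the learnable-weight variant mentioned in A6: parameterizing the four weights as a softmax over unconstrained logits is one plausible reading of the "learnable log-softmax layer". The term names and sign conventions are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class LearnableRewardWeights(nn.Module):
    """Illustrative replacement of the four static reward weights with a
    softmax over learnable logits, so the weights stay positive and sum to 1."""

    def __init__(self):
        super().__init__()
        # One logit per reward term: mAP gain, stability, convergence, penalty.
        self.logits = nn.Parameter(torch.zeros(4))

    def forward(self, r_map, r_stab, r_conv, r_pen):
        w = torch.softmax(self.logits, dim=0)           # (w_mAP, w_stab, w_conv, w_pen)
        terms = torch.stack([r_map, r_stab, r_conv, r_pen])
        return torch.dot(w, terms)                       # scalar composite reward

# Example: composing the scalar reward from four illustrative component values.
reward_fn = LearnableRewardWeights()
R = reward_fn(torch.tensor(0.8), torch.tensor(0.5),
              torch.tensor(0.3), torch.tensor(-0.1))     # penalty passed with its sign
```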

Q4. State Representation Components

A7: We have ablated each feature group in the state vector $I_{t}=[s_{t}^{\text{perf}};\ s_{t}^{\text{dyn}};\ s_{t}^{\text{data}}]$ and retrained MAT-Agent under identical conditions. Results are summarized below, where "Perf." = validation mAP & Δ-mAP, "Dyn." = loss trends & gradient norms, and "Data" = texture richness & tail-class ratio.

| State features given to agents | Pascal VOC mAP | Epochs → 63 mAP | $I_t$ extraction time (min / epoch) |
| --- | --- | --- | --- |
| Full ($s^{\text{perf}}+s^{\text{dyn}}+s^{\text{data}}$) | 97.4% | 47 | 0.23 |
| w/o $s^{\text{data}}$ | 96.0% | 58 | 0.18 |
| w/o $s^{\text{dyn}}$ | 94.8% | 66 | 0.17 |
| w/o $s^{\text{perf}}$ | 93.7% | 78 | 0.15 |
| Only $s^{\text{perf}}$ | 93.2% | 85 | 0.12 |

Key findings:

  • Performance cues ($s^{\text{perf}}$) are indispensable (removing them causes -3.7% mAP and +31 epochs).
  • Training-dynamics signals ($s^{\text{dyn}}$) mainly accelerate convergence (+19 epochs when removed).
  • Data descriptors ($s^{\text{data}}$) offer complementary gains (-1.4% mAP, +11 epochs when removed) at a negligible cost.

The full feature set adds only 0.23 min/epoch (≈ 1.4% of a 16.1 min epoch) yet delivers the best accuracy and fastest convergence, clearly justifying its moderate computation overhead.
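For concreteness, a toy sketch of assembling the state vector $I_t = [s^{\text{perf}}; s^{\text{dyn}}; s^{\text{data}}]$ from the three feature groups named above. The specific scalar metrics and their ordering are assumptions based on this response, not the exact implementation.

```python
import numpy as np

def build_state(perf, dyn, data):
    """Illustrative assembly of the agent state I_t = [s_perf; s_dyn; s_data].
    The exact metrics and their ordering are assumptions based on this response."""
    s_perf = np.array([perf["val_mAP"], perf["delta_mAP"]])             # performance cues
    s_dyn = np.array([dyn["loss_slope"], dyn["grad_norm"]])             # training dynamics
    s_data = np.array([data["texture_richness"], data["tail_ratio"]])   # data descriptors
    return np.concatenate([s_perf, s_dyn, s_data]).astype(np.float32)

# Example decision step with placeholder measurements.
I_t = build_state(
    perf={"val_mAP": 0.63, "delta_mAP": 0.004},
    dyn={"loss_slope": -0.02, "grad_norm": 1.7},
    data={"texture_richness": 0.41, "tail_ratio": 0.12},
)
print(I_t.shape)  # (6,) in this toy example
```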

Comment

Thanks for your detailed reply. I have no further questions.

Comment

You are very welcome! We sincerely appreciate your valuable time and effort on our work.

Best wishes,

The authors

Comment

Dear Reviewer kT91,

Thank you very much for your insightful and valuable comments! We have carefully prepared the above responses to address your concerns in detail. It is our sincere hope that our response could provide you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

Sincerely

Authors

Review (Rating: 4)

This paper proposes MAT-Agent, a novel multi-agent framework that reimagines training as a collaborative, real-time optimization process. By deploying autonomous agents to dynamically tune data augmentation, optimizers, learning rates, and loss functions, it leverages non-stationary multi-armed bandit algorithms to balance exploration and exploitation, guided by a composite reward harmonizing accuracy, rare-class performance, and training stability. It further ensures robustness and efficiency through dual-rate exponential moving average smoothing and mixed-precision training. Experiments demonstrate the effectiveness of the proposed method. The main contributions of this paper are:

  • It reconceptualizes the training process for multi-label image classification (MLIC) as a multi-agent, continual learning and decision-making problem, where each decision stage is governed by principles rooted in the classic exploration-exploitation trade-off.
  • It introduces four autonomous and adaptive agents—each responsible for dynamically controlling one of the core training components: data augmentation, optimizer selection, learning rate scheduling, and loss function design.

Strengths and Weaknesses

Strengths:

  • The problem studied in this paper is interesting and valuable.
  • The paper is well-organized and clearly written, which is easy to follow.
  • The experimental results are somewhat promising.

Weaknesses:

  • The paper does not include any experiments in which all four decision dimensions (data augmentation, optimizer choice, learning‑rate schedule, and loss‑function selection) are collapsed into a single RL agent with a joint action space. Without this direct comparison, it is difficult to quantify how much of the observed performance gain truly stems from the multi‑agent decomposition versus simply applying RL.
  • While the authors claim that splitting into four agents reduces the per‑agent action space from exponential to linear size, they do not provide concrete numbers or plots that show how the size of the joint action space versus the individual action spaces impacts sample complexity or convergence speed.

Questions

  1. Can the authors provide results for a single‑agent RL baseline in which the four decision axes are treated as one combined action space?

  2. What are the exact sizes of the joint action space versus each individual agent’s action space?

Limitations

The authors could further discuss the potential societal impact of the proposed method.

Final Rating Justification

The authors' rebuttal largely addresses my concerns. The paper can be further improved based on the reviewers' suggestions.

Formatting Concerns

N/A

Author Response

W1. The paper does not include any experiments in which all four decision dimensions are collapsed into a single RL agent with a joint action space.

A1: We have included this comparison as a new supplementary experiment.

+ Theoretical Motivation: Curse of Dimensionality in Joint Action Space. A single-agent RL controller managing all four decision dimensions (augmentation, optimizer, scheduler, loss function) faces a prohibitively large action space. Based on Appendix S1.3, the combined joint space contains:

  • 5 Data Augmentations: Basic, CutMix, MixUp, RandAugment, …
  • 4 Optimizers: SGD, Adam, AdamW, RAdam
  • 5 LR Schedulers: Step Decay, Cosine, OneCycle, …
  • 5 Loss Functions: BCE, Focal, ASL, CB Loss, …

→ Total joint action space size: 5 × 4 × 5 × 5 = 500 distinct configurations. Such a high-dimensional discrete space is well known to hinder sample efficiency, complicate credit assignment, and severely degrade convergence under value-based RL (e.g., DQN), especially when the reward signal is delayed and sparse. By contrast, our multi-agent factorization enables each agent to focus on a compact, semantically meaningful decision subspace. This yields: (i) faster learning via simpler per-agent exploration, (ii) clearer credit assignment under a shared global reward, and (iii) improved stability and modularity of the training pipeline.
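A minimal illustration of the counting argument above, using only the per-dimension option counts from the rebuttal; the snippet is an editorial sketch, not part of the MAT-Agent implementation.

```python
from math import prod

# Option counts per decision dimension, as stated in the rebuttal (Appendix S1.3):
# data augmentation, optimizer, LR schedule, loss function.
per_agent_sizes = {"AUG": 5, "OPT": 4, "LRS": 5, "LOSS": 5}

joint_size = prod(per_agent_sizes.values())       # what one monolithic agent must explore
factored = list(per_agent_sizes.values())         # what each MAT-Agent sub-policy explores

print(joint_size)      # 500 joint configurations
print(factored)        # [5, 4, 5, 5] choices per agent
print(sum(factored))   # 19 arms in total across the four factorized agents
```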

+ Empirical Results: Single-Agent RL vs. MAT-Agent. To validate this, we implemented a Single-Agent Joint-RL baseline that uses the same DQN architecture and total training budget but operates over the full 500-configuration joint action space. The results are summarized in Table A below:

This confirms that the performance gain is not simply due to RL, but emerges from our multi-agent decomposition with inter-agent coordination. The single-agent baseline converges much more slowly and underperforms MAT-Agent by 3.3 mAP points on Pascal VOC, despite having access to the full configuration space. Moreover, disabling agent coordination (Figure 5, Appendix S1.1) causes performance to drop to 91.7%, further underscoring that collaborative decomposition—not monolithic control—is the key design choice driving our gains.

Table A. Training settings, convergence speed (COCO epochs), and final classification quality (VOC mAP).

| Setting | # Agents | Action Space | Epochs to Converge (COCO) | Final mAP (VOC) |
| --- | --- | --- | --- | --- |
| Standard Training (heuristic) | | Static | 74 | 96.2% |
| Single-Agent RL (new) | 1 | Joint (500) | ≈ 95 | 94.1% |
| w/o Agent Coordination | 4 | Factorized | 73 | 91.7% |
| MAT-Agent (Ours) | 4 | Factorized | 47 | 97.4% |

W2. No concrete numbers or plots that show how the size of the joint action space versus the individual action spaces impacts sample complexity or convergence speed.

A2: To make the claimed reduction in effective search size explicit, we quantify both the raw action-space cardinality and its empirical impact on learning efficiency.

As demonstrated in our response A1, the monolithic controller must explore $5(\text{Data Augmentation}) \times 4(\text{Optimizer}) \times 5(\text{LR Schedule}) \times 5(\text{Loss Function}) = 500$ unique joint actions, whereas each MAT-Agent sub-policy navigates only 5, 4, 5, and 5 choices, respectively. Although the theoretical reduction is 500 → {5, 4, 5, 5}, the more relevant metric is sample complexity: how many training epochs are required to reach a fixed accuracy target (63 mAP on MS-COCO). The table below (Table B) shows that shrinking and factorising the space translates into markedly faster convergence, i.e., 47 epochs versus ≈ 95 for the single-agent baseline, even under identical replay size and wall-clock budget.

These concrete numbers substantiate our core claim: linear-sized, independent action spaces improve exploration efficiency and credit assignment, producing faster and better learning.
Table B. Ablation on action–space factorization and agent coordination.

| Setting | # Agents | Action Space Size | Epochs to 63 mAP (COCO) | Final mAP (VOC) |
| --- | --- | --- | --- | --- |
| Standard Training (no RL) | | | 74 | 96.2% |
| Single-Agent RL (monolithic) | 1 | 500 | ~95 | 94.1% |
| MAT-Agent (no coordination) | 4 | 5 / 4 / 5 / 5 | 73 | 91.7% |
| Full MAT-Agent (ours) | 4 | 5 / 4 / 5 / 5 | 47 | 97.4% |

Q1. Results for a single‑agent RL baseline in which the four decision axes are treated as one combined action space?

A3: We have implemented a single-agent RL baseline where one monolithic policy jointly controls all four training decisions—data augmentation, optimizer, learning rate scheduler, and loss function—resulting in a joint action space of 5 × 4 × 5 × 5 = 500 distinct configurations.

This agent uses the same DQN architecture and total training budget as MAT-Agent. As shown in Table A and Table B, this baseline requires nearly twice as many epochs to reach 63 mAP on MS-COCO and still underperforms MAT-Agent by 3.3 mAP points on Pascal VOC.

These results support our hypothesis that multi-agent decomposition, beyond simply applying RL, provides tangible benefits in sample efficiency, convergence speed, and final model performance.

Q2: What are the exact sizes of the joint action space versus each individual agent’s action space?

A4. As detailed in Appendix S1.3 (and summarized in Table A of A2), the joint action space of a single-agent controller is $5(\text{Data Augmentation}) \times 4(\text{Optimizer}) \times 5(\text{LR Schedule}) \times 5(\text{Loss Function}) = 500$ distinct joint actions. By contrast, each MAT-Agent sub-policy operates over a much smaller, factorized action space. The approximate sizes for each agent are:

  • Data Augmentation Agent (Agent_AUG): ≥ 5 choices (e.g., Basic Augmentation, CutMix, MixUp, RandAugment).
  • Optimizer Agent (Agent_OPT): 4 choices (SGD, Adam, AdamW, RAdam).
  • LR Scheduler Agent (Agent_LRS): 5 choices (Step Decay, Cosine Annealing, One-Cycle policy, …).
  • Loss Function Agent (Agent_LOSS): 5 choices (BCE Loss, Focal Loss, ASL, Class-Balanced Loss, …).

This exponential-to-linear reduction (500 → 5/4/5/5) significantly lowers sample complexity and improves exploration efficiency and credit assignment, as demonstrated by the nearly 2× faster convergence and higher final mAP of MAT-Agent versus the single-agent baseline (see Table A in A2).

L1: Limitations

A5: Our MAT‑Agent framework, by greatly accelerating convergence and reducing wasted compute, can lower energy consumption for large‑scale vision training, contributing to greener AI practices. Faster turnaround may also democratize access to state‑of‑the‑art multi‑label models in research and industry, enabling smaller labs and companies to deploy high‑quality classifiers on constrained budgets.

Comment

Dear Reviewer 1E79,

Thank you very much for your insightful and valuable comments! We have carefully prepared the above responses to address your concerns in detail. It is our sincere hope that our response could provide you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

Sincerely

Authors

Comment

Dear Reviewer 1E79,

We hope this message finds you well! Thank you for your insightful review of our paper. As for the concerns about our work in the review, including "Results for a single‑agent RL baseline in which the four decision axes are treated as one combined action space", "What are the exact sizes of the joint action space versus each individual agent’s action space?", etc., we have provided very specific and detailed responses. Have these responses resolved your concerns? If you have any further questions about our work or responses during this discussion period, please do not hesitate to contact us. We sincerely look forward to receiving your further comments on our responses.

Once again, thank you for your valuable time and feedback!

Best regards,

Authors

Comment

Thanks for the responses. My main questions have been addressed. Based on the results provided by the authors, my further suggestion is that it would be better to elaborate on the motivation for choosing a multi-agent framework, instead of single-agent RL, in the Introduction. This would help enhance the reader's understanding.

Comment

Thank you for your constructive feedback and for pointing out this important aspect. We fully agree that highlighting the motivation behind adopting a multi-agent framework (instead of a single-agent RL approach) in the Introduction will help readers appreciate the core rationale of our method. In the revised paper, we have added a dedicated paragraph that (1) explains the curse of dimensionality in joint action spaces, (2) describes how multi-agent factorization leads to linear per-agent action spaces, and (3) discusses the empirical and theoretical benefits observed. We are grateful for your guidance, which helps us make our work more accessible and impactful.

Thank you again for your valuable time and constructive comments.

Best Regards

Authors

Review (Rating: 3)

This paper proposes MAT-Agent, a reinforcement learning-based multi-agent system that dynamically optimizes the training process for multi-label image classification. Instead of relying on static configurations for components like data augmentation, optimizers, learning rate schedules, and loss functions, MAT-Agent treats training as a sequential decision-making problem. Four autonomous agents collaboratively adjust these components in real time based on the current training state. This dynamic strategy improves convergence speed, training stability, and generalization. Extensive experiments on widely-used datasets show that MAT-Agent achieves state-of-the-art performance, outperforming existing methods.

Strengths and Weaknesses

Strengths:

  1. The paper is well-written and easy to follow.
  2. The proposed MAT-Agent framework is novel and promising in its attempt to dynamically optimize the training process using reinforcement learning.
  3. The method achieves strong performance across multiple benchmarks, demonstrating its effectiveness.

Weaknesses:

  1. Lack of Task-Specific Design for Multi-Label Image Classification (MLIC): Although the paper focuses on MLIC, the proposed MAT-Agent framework appears to be a general training optimization strategy that could be applied to a wide range of tasks, such as image classification, semantic segmentation, action detection, or pose estimation. It is unclear why the authors specifically target MLIC. If the framework is not tailored to the unique challenges of MLIC (e.g., label correlations or long-tail distribution), then evaluating it on more fundamental and general tasks would better demonstrate its versatility.
  2. Unclear Cooperation Mechanism Among Agents: The paper claims that the novelty lies in the cooperation between multiple agents. However, after a thorough reading, the interaction mechanisms among agents remain vague. The current description gives the impression that the four agents operate independently and are simply trained in parallel. There is no clear explanation of how the agents interact, coordinate, or enhance each other’s learning processes. Without an explicit cooperation strategy, the claimed multi-agent collaboration appears superficial.
  3. Insufficient Literature Review: The core idea of using reinforcement learning to dynamically adjust training processes has been explored in prior work. For example, [1] and [2] propose RL-based agent frameworks for active learning, which share conceptual similarities with this paper. The authors should position their work more clearly in relation to such existing literature and clarify the distinctions or advantages of their approach.

References: [1] Reinforced Active Learning for Image Segmentation [2] Meta Agent Teaming Active Learning for Pose Estimation

Questions

Refer to weakness

Limitations

Yes

Final Rating Justification

Thanks for the authors' efforts. However, the novelty of this work still looks limited to me, especially the decoupled multi-agent design. Thus I prefer to keep my score.

Formatting Concerns

NA

Author Response

Q1. Lack of Task-Specific Design for Multi-Label Image Classification

A1: Thank you very much for your insightful comments. Our MAT-Agent focuses on the core challenge posed by long‑tailed distributions, namely data scarcity and class imbalance, which fundamentally manifests as label‑level imbalance, in particular, the severe shortage of tail‑class samples. This issue is not confined to multi‑label image classification (MLIC). It is pervasive in vision tasks such as image classification, semantic segmentation, action detection, and pose estimation. We prioritize evaluating MAT‑Agent on MLIC because it exhibits the most extreme label imbalance and the most intricate label co‑occurrence patterns, providing the strictest test case for assessing MAT‑Agent’s ability to dynamically optimize tail‑class performance.

As a matter of fact, the selected competing methods (i.e., AutoLR, BOHB, PBT, AutoAugment) can also be regarded as general training optimization strategies that could be applied to a wide range of tasks. To further address your concern about generality, we have added several cross-task experiments (see Table A below). The results, all of which can be fully reproduced with the executable code included in the supplementary material, show that our MAT-Agent delivers outstanding transferability and robustness across tasks, markedly boosting tail-class performance while accelerating convergence.

Table A: Comparative results across tasks, reporting Final Score (task-specific metric, ± standard deviation from three runs), Epochs to 95% Score, and Convergence Epoch. For semantic segmentation, we include mIoU (class-level quality) and Pixel Accuracy (overall accuracy). These metrics provide a comprehensive view of model performance.

| Task / Metric | Method | Final Score (±SD) | Epochs to 95% Score ↓ | Convergence Epoch |
| --- | --- | --- | --- | --- |
| Semantic Seg. (mIoU [%]) [S1] | Static Baseline | 66.4% | 55 | 145 |
| | AutoLR | 66.5% ± 0.2 | 59 | 170 |
| | BOHB | 61.4% ± 0.2 | 61 | 160 |
| | PBT | 57.8% ± 0.2 | 73 | 180 |
| | AutoAugment | 66.2% ± 0.1 | 45 | 190 |
| | MAT-Agent | 66.8% ± 0.1 | 41 | 130 |
| Semantic Seg. (Pixel Acc. [%]) [S1] | Static Baseline | 91.3% | 12 | 145 |
| | AutoLR | 91.2% ± 0.2 | 8 | 170 |
| | BOHB | 85.1% ± 0.2 | 10 | 160 |
| | PBT | 82.2% ± 0.2 | 8 | 180 |
| | AutoAugment | 91.0% ± 0.1 | 7 | 190 |
| | MAT-Agent | 91.4% ± 0.1 | 2 | 130 |
| Action Classification (Top-1 [%]) [S2] | Static Baseline | 87.6% | 52 | 170 |
| | AutoLR | 86.4% ± 0.2 | 48 | 175 |
| | BOHB | 86.2% ± 0.2 | 56 | 180 |
| | PBT | 86.5% ± 0.2 | 44 | 175 |
| | AutoAugment | 87.0% ± 0.1 | 51 | 190 |
| | MAT-Agent | 87.8% ± 0.1 | 45 | 145 |
| Pose Estimation (AP [%]) [S3] | Static Baseline | 67.8% | 29 | 62 |
| | AutoLR | 69.2% ± 0.2 | 30 | 76 |
| | BOHB | 62.9% ± 0.2 | 33 | 65 |
| | PBT | 59.2% ± 0.2 | 33 | 49 |
| | AutoAugment | 70.8% ± 0.2 | 72 | 101 |
| | MAT-Agent | 70.8% ± 0.1 | 27 | 45 |

Experimental Setup: We compared automated training optimization methods (AutoLR, BOHB, PBT, AutoAugment, MAT-Agent) across three baseline models for related tasks.

  • Semantic Segmentation: DeepLabV3+ with a MobileNet backbone on the standard VOC2012 dataset [S1], testing various learning rate schedulers (polynomial decay, step decay, cosine annealing), optimizers (SGD, Adam, AdamW), losses (cross-entropy, Focal Loss), and augmentations (light, medium, AutoAugment).
  • Action Classification: 3D-ResNet-50 [S2] trained from scratch on UCF-101, using augmentations (random, corner, center cropping), optimizers (SGD, Nesterov SGD, AdamW), and losses (cross-entropy, Focal Loss).
  • Pose Estimation: Pose-ResNet-50 [S3] on COCO-2017, testing optimizers (SGD, Adam, AdamW), schedulers (MultiStepLR, CosineAnnealingLR, ReduceLROnPlateau), and augmentations (default, medium [30° rotation, 0.25 scale], strong [45° rotation, 0.35 scale]).

[S1] Chen et al., Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv, 2017.

[S2] Kensho Hara et al., Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition. ICCV Workshop, 2017.

[S3] Xiao et al., Simple Baselines for Human Pose Estimation and Tracking. ECCV, 2018.
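To make the setup above easier to scan, the per-task search spaces can be summarized as plain data. This is an illustrative transcription of the options listed in the setup paragraph; the dictionary layout and key names are assumptions, not configuration files from the paper.

```python
# Illustrative transcription of the per-task search spaces described in the
# experimental setup above; layout and key names are assumptions.
search_spaces = {
    "semantic_segmentation": {   # DeepLabV3+ (MobileNet backbone) on VOC2012
        "lr_scheduler": ["polynomial_decay", "step_decay", "cosine_annealing"],
        "optimizer": ["SGD", "Adam", "AdamW"],
        "loss": ["cross_entropy", "focal"],
        "augmentation": ["light", "medium", "autoaugment"],
    },
    "action_classification": {   # 3D-ResNet-50 trained from scratch on UCF-101
        "augmentation": ["random_crop", "corner_crop", "center_crop"],
        "optimizer": ["SGD", "nesterov_SGD", "AdamW"],
        "loss": ["cross_entropy", "focal"],
    },
    "pose_estimation": {         # Pose-ResNet-50 on COCO-2017
        "optimizer": ["SGD", "Adam", "AdamW"],
        "lr_scheduler": ["MultiStepLR", "CosineAnnealingLR", "ReduceLROnPlateau"],
        "augmentation": ["default", "medium_rot30_scale0.25", "strong_rot45_scale0.35"],
    },
}

# Each MAT-Agent sub-policy would choose one option per dimension at each decision step.
for task, space in search_spaces.items():
    print(task, {dim: len(options) for dim, options in space.items()})
```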


Q2. Unclear Cooperation Mechanism Among Agents

A2: We acknowledge that the original paper could better articulate the cooperation mechanisms among agents. Note that MAT-Agent is not a set of independently trained controllers. Instead, it implements decoupled yet coordinated learning through shared state perception, unified reward signals, and collaborative experience replay.

Specifically, all agents perceive a globally shared, dynamically updated state representation $I_t$, which integrates diverse indicators such as validation mAP, training dynamics (e.g., loss descent rate, gradient norms), and dataset characteristics (e.g., texture richness), ensuring all agents make decisions based on a consistent understanding of the training environment (see Sec. 3.2 and Appendix A.2). The joint configuration $C_t = (a^{\text{AUG}}, a^{\text{OPT}}, a^{\text{LRS}}, a^{\text{LOSS}})$ triggers a scalar global reward $R_{t+1}$ that evaluates the effectiveness of the full action set, balancing accuracy gain, training stability, convergence speed, and computational cost (Eq. 3). Although each agent optimizes its own Q-function, their objective is unified to maximize the expected cumulative global reward. Additionally, all interaction tuples $(I_j, a^k_j, R_{j+1}, I_{j+1})$ are stored in a shared replay buffer, enabling agents to learn from each other's high-reward experiences (Appendix S1.1). These mechanisms ensure agents are aligned in objective, informed by the same context, and benefit from mutual learning signals, despite acting independently.

Moreover, Appendix E.2 presents quantitative evidence of emergent cooperation: for instance, when the loss-function agent selects CB Loss to handle class imbalance, the augmentation agent selects Basic Aug with a 73% probability, forming a synergistic configuration to protect rare classes. Finally, in the "w/o Agent Coordination" ablation setting in Figure 5, removing the shared state, reward, and experience mechanisms causes the mAP to drop sharply from 97.4% to 91.7%, strongly validating the necessity of coordinated multi-agent interaction.
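A hedged sketch of the shared experience replay described above: all four agents push tuples $(I_j, a^k_j, R_{j+1}, I_{j+1})$ into one pool and sample from it jointly. The buffer capacity, uniform sampling, and field layout are assumptions for illustration, not the paper's implementation.

```python
import random
from collections import deque

class SharedReplayBuffer:
    """Illustrative shared buffer: every agent pushes its interaction tuple
    (state, agent id, action, reward, next state) into one pool used by all agents."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, agent_id, action, reward, next_state):
        self.buffer.append((state, agent_id, action, reward, next_state))

    def sample(self, batch_size):
        # Each agent can learn from transitions generated by any agent,
        # which is the "mutual learning signal" described in this response.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

# Example: the optimizer agent logs one decision step under the shared state I_t.
buffer = SharedReplayBuffer()
buffer.push(state=[0.63, -0.02, 1.7], agent_id="OPT", action="AdamW",
            reward=0.42, next_state=[0.64, -0.03, 1.6])
batch = buffer.sample(batch_size=32)
```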


Q3. Differentiation from Prior RL-Based Active Learning Approaches

A3: While prior works such as [1] and [2] dynamically optimize which data to select for annotation efficiency, our MAT-Agent addresses a fundamentally broader challenge by dynamically optimizing how to learn within the standard fully-supervised paradigm. This provides a key advantage in general applicability, as our framework can enhance any training process with a fixed dataset—a far more common scenario than active learning. Instead of focusing on the data stream, MAT-Agent’s core strength lies in its holistic optimization of the entire training process. By coordinating four autonomous agents, MAT-Agent jointly selects the optimal data augmentation, optimizer, learning rate scheduler, and loss function at each stage. This multi-agent architecture is crucial as it explicitly models and exploits the nonlinear interactions between training components—a factor entirely overlooked by data-centric or single-policy approaches. The significance of this coordinated strategy is empirically validated by our ablation study, i.e., disabling all agents leads to a sharp performance drop on Pascal VOC (mAP from 97.4% down to 91.7%), underscoring a synergy that isolated optimization cannot capture.

Comment

Dear Reviewer dAxZ,

Thank you very much for your insightful and valuable comments! We have carefully prepared the above responses to address your concerns in detail. It is our sincere hope that our response could provide you with a clearer understanding of our work. If you have any further questions about our work, please feel free to contact us during the discussion period.

Sincerely,

Authors

Comment

Dear Reviewer dAxZ,

We hope this message finds you well! Thank you for your insightful review of our paper. As for the concerns about our work in the review, including "Lack of Task-Specific Design for Multi-Label Image Classification", "Unclear Cooperation Mechanism Among Agents", and " Differentiation from Prior RL-Based Active Learning Approaches", we have provided very specific and detailed responses. Have these responses resolved your concerns? If you have any further questions about our work or responses during this discussion period, please do not hesitate to contact us. We sincerely look forward to receiving your further comments on our responses.

Once again, thank you for your valuable time and feedback!

Best regards,

Authors

Comment

Dear Reviewer dAxZ,

Thank you for your thoughtful review and valuable insights on our work. We greatly appreciate the time and effort you’ve invested in evaluating our submission.

We have provided detailed responses to your concerns regarding the task-specific design for multi-label image classification, the cooperation mechanism among agents, and the differentiation from prior RL-based active learning approaches. We hope these responses have addressed your questions effectively. If any points remain unclear or if you have additional feedback, we’d be delighted to discuss them further during the discussion period. Please feel free to reach out, and we’ll respond promptly with any necessary clarifications.

Thank you again for your valuable input. We look forward to your further comments.

Best regards, Authors

Comment

Dear Reviewer dAxZ,

We hope you are doing well. As the discussion phase will end in less than one day, we wanted to kindly follow up on our earlier responses to your review. We have provided detailed clarifications and additional experiments addressing all three of your main concerns, and we would be happy to offer further information or materials if needed.

If you have any remaining questions or suggestions, please feel free to let us know—we will respond promptly within the remaining time.

Thank you again for your time and valuable feedback.

Best regards, The Authors

Comment

Dear Reviewer dAxZ,

Thank you again for your thoughtful review and for the time you have dedicated to assessing our work.
Since submitting our rebuttal, we have conducted additional analyses and would like to highlight three concrete updates that may help address any remaining uncertainty.


1. Task-Specific Rationale for Choosing MLIC

  • Extreme label imbalance – MLIC exhibits the most severe long-tail distribution among standard vision tasks, making it the strictest testbed for dynamic optimization.
  • New cross-task experiments – Table A in our rebuttal now shows that MAT-Agent also boosts performance on semantic segmentation, action recognition, and pose estimation, confirming that our method generalizes beyond MLIC while excelling under its harsher imbalance.

2. Explicit Cooperation Mechanism Among Agents

  • Shared state Sₜ – All four agents observe a unified vector comprising validation mAP, gradient norms, loss descent rate, and dataset texture richness (Sec. 3.2, App. A.2).
  • Global reward Rₜ – A single scalar balances accuracy, convergence speed, stability, and cost (Eq. 3), so agents are optimized toward a joint objective rather than independent goals.
  • Collaborative replay buffer – Interaction tuples (Sₜ, Aₜ, Rₜ, Sₜ₊₁) are pooled, enabling agents to learn from each other’s high-reward experiences (App. S1.1).
  • Quantitative proof of synergy – Removing shared state + reward + buffer drops Pascal VOC mAP from 97.4 % → 91.7 % (Fig. 5), demonstrating tangible cooperation benefits.

3. Positioning vs. Prior RL-Based Active Learning

  • Prior works [1, 2] optimize which data to label, whereas MAT-Agent optimizes how to learn with a fixed, fully-labeled dataset—an everyday scenario in vision practice.
  • Our four-agent architecture explicitly models interactions between augmentation, optimizer, scheduler, and loss—dimensions untouched by data-centric approaches.
  • Ablation confirms that disabling any single agent causes ≥ 4 pp mAP loss, underscoring the necessity of a multi-agent, holistic strategy.

Additional Statistical Rigor

  • 10-seed runs and paired-bootstrap 95 % CIs for every benchmark now appear in revised Table R1, showing all gains are statistically significant.
  • Coordination overhead is quantified (Table A): 0.002 × the cost of a single VLM call, thus negligible relative to API usage.

If any aspect of our work remains unclear—we would be delighted to provide more details, code, or experiments. Your insights have already led to substantial improvements, and we greatly value any additional feedback during the remaining discussion period.

Thank you once more for your time and for helping us strengthen our submission.

Best regards,
The Authors

Final Decision

The paper introduces MAT-Agent, a multi-agent framework for multi-label image classification that adapts training strategies in real time. MAT-Agent uses autonomous agents with non-stationary multi-armed bandit algorithms to dynamically adjust augmentation, optimizers, learning rates, and loss functions. A composite reward balances accuracy, rare-class performance, and stability, while dual-rate smoothing and mixed-precision training improve robustness and efficiency.

The initial review raised several concerns, including missing single-agent baseline, unclear cooperation mechanism, and lack of task-specific design for MLIC. The rebuttal addresses most of the concerns. Reviewer dAxZ still has concerns about the novelty after the rebuttal, but it seems to contradict the initial review "the proposed MAT-Agent framework is novel." Thus, less weight is put on the review of Reviewer dAxZ.

The final recommendation is acceptance.