PaperHub

Rating: 6.3/10 (4 reviewers · Poster · min 5, max 7, std 0.8)
Individual ratings: 7, 6, 7, 5
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.3
NeurIPS 2024

GO4Align: Group Optimization for Multi-Task Alignment

OpenReview · PDF
Submitted: 2024-04-29 · Updated: 2024-11-06

Abstract

Keywords

multi-task learning · multi-task optimization · task grouping · dense prediction tasks

Reviews and Discussion

Official Review
Rating: 7

The paper designs a multi-task optimization method, GO4Align, to address task imbalance by aligning optimization processes across tasks. It proposes an adaptive group risk minimization (AGRM) strategy, formulated as a bilevel optimization problem where the lower level performs task grouping and the upper level performs weighted optimization over task-group losses. The method then alternates between two steps: (1) dynamic group assignment, where tasks are clustered (implemented via K-means) to capture beneficial task interactions; and (2) risk-guided group indicators, which balance task risks and align learning progress by combining scale-balance and smooth-alignment operations.

GO4Align is evaluated on benchmarks including NYUv2, CityScapes, QM9, and CelebA. The results show that it outperforms existing gradient-oriented methods (MGDA, PCGrad, CAGrad, IMTL-G, GradDrop, and NashMTL) and loss-oriented methods (linear scalarization, scale-invariant weighting, Dynamic Weight Average, Uncertainty Weighting, Random Loss Weighting, and FAMO), achieving lower performance drops and better computational efficiency. Lastly, the study explores the contribution of each component of GO4Align, the influence of group assignments, and the role of group weights. It shows that the proposed AGRM principle can integrate with existing MTO methods and further improve their performance.

Strengths

  • This paper proposes an adaptive group risk minimization strategy to address task imbalance, formulated as a bilevel optimization problem where the lower-level optimization is a task grouping optimization and upper-level optimization is a weighted optimization over task group losses.
  • GO4Align is evaluated on benchmarks, including NYUv2, CityScapes, QM9, and CelebA. The results show that it outperforms 12 existing gradient-oriented and loss-oriented methods, achieving lower performance drops and better computational efficiency.

Weaknesses

  • There are many existing clustering algorithms that can be potentially used for task grouping, such as spectral clustering and SDP-based clustering [1]. What is the rationale for choosing K-means clustering in the method? It would be better to discuss and ablate the clustering algorithms, as it is an important component of the proposed method.
  • Another important component is the choice of the group indicators for clustering tasks. This work uses indicators based on task loss trajectories. How about using gradients and model features as group indicators? Would it be worse than the proposed indicators in Section 3.3?
  • It would be better to explain why the proposed method takes comparable time to linear scalarization. How does the proposed method scale with the number of tasks? It would also be better to provide a comparison of the additional computation cost across methods.

[1] Relax, no need to round: integrality of clustering formulations. https://arxiv.org/abs/1408.4045

Questions

  • How is the convergence difference evaluated in Figure 2? How is the Δm metric defined?

Limitations

This work has discussed its limitations.

Author Response

We sincerely thank Reviewer sLBS for their insightful comments. The following addresses the main concerns and answers the questions.


1. Clustering methods.

In our main experiments, we employed standard K-means for instantiation. K-means is a widely used clustering approach that worked well in our experiments, so we did not explore this part further (as it was not the main focus).

We appreciate your suggestion and have conducted additional experiments to evaluate the impact of alternative clustering algorithms. Thank you for bringing reference [a] to our attention; however, since no open-source code is available for [a], we explored another SDP-based clustering method [b].

Specifically, we conducted experiments on NYUv2 by substituting the K-means clustering in our method with SDP-based clustering [b] and spectral clustering [c]. As demonstrated in the table below, these alternative clustering methods also outperform state-of-the-art approaches (FAMO, -4.10%), particularly by enhancing per-task performance over STL. For training efficiency, we implemented K-means using the kmeans_pytorch package, which offers GPU support.

Interestingly, our experiments show that the K-means clustering we deployed outperforms both the spectral and SDP-based clustering methods, likely because the hyperparameters of the latter algorithms have not yet been thoroughly tuned. We have presented promising initial results and will include this ablation study and related discussions in the main manuscript.

| Method | Package | GPU support of clustering | Relative runtime (↓) | Δseg. (↓) | Δdepth (↓) | Δnormal (↓) | Δm (↓) |
|---|---|---|---|---|---|---|---|
| Ours w/ SDP-based clustering [b] | sdp_kmeans | No | 1.20× | -2.97 | -18.76 | -1.09 | -5.44 |
| Ours w/ spectral clustering [c] | sklearn | No | 1.17× | -1.78 | -18.58 | -0.06 | -4.56 |
| Ours w/ K-means clustering (in the paper) | kmeans_pytorch | Yes | 1.02× | -4.03 | -20.37 | -1.18 | -6.08 |

2. Choice of the group indicators.

It is also plausible to use gradient information as clustering indicators. As shown in Table 4, we tried replacing the proposed risk-guided group indicator with gradient-guided task weights (MGDA and NashMTL). Our work outperforms these alternatives. The main reason could be that our group indicator yields better representations of learning information by capturing the differences in per-task risk scales and exploiting the learning dynamics over time.

Moreover, from the perspective of training efficiency, obtaining task-specific gradients requires back-propagating through the shared architecture M times, where M is the number of tasks, which increases the computational cost linearly with the number of tasks.
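This cost difference can be seen in a small PyTorch sketch (the toy model, losses, and weights below are ours for illustration, not from the paper): gradient-oriented methods need one backward pass per task to extract per-task gradients, whereas a loss-oriented weighted sum needs a single pass, and by linearity the single pass recovers the same combined gradient.

```python
import torch

# Toy shared trunk with M = 3 task losses.
torch.manual_seed(0)
shared = torch.nn.Linear(4, 4)
x = torch.randn(8, 4)
losses = [(shared(x) * float(m + 1)).pow(2).mean() for m in range(3)]
w = torch.tensor([0.5, 0.3, 0.2])  # hypothetical task weights

# Gradient-oriented: one backward pass per task (M passes in total).
per_task = [torch.autograd.grad(l, shared.weight, retain_graph=True)[0]
            for l in losses]

# Loss-oriented (as in loss-weighting methods): one pass on the weighted sum.
total = sum(wi * l for wi, l in zip(w, losses))
g = torch.autograd.grad(total, shared.weight)[0]

# The single backward pass equals the weighted sum of per-task gradients.
assert torch.allclose(g, sum(wi * gi for wi, gi in zip(w, per_task)), atol=1e-6)
```

The equality holds by linearity of the gradient; the efficiency gap comes purely from the number of backward passes through the shared trunk.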


3. Comparable time with linear scalarization, and scaling up the number of tasks.

We follow representative MTO works (RLW [39] and FAMO [14]) in choosing linear scalarization (LS) as the baseline for relative training time. LS is a natural reference because it is a common MTL baseline in which each task has equal weight, without extra loss-oriented or gradient-oriented techniques.

As shown in Fig. 4, when the number of tasks scales up from 2 to 40, the computational advantage of our method over other gradient-oriented methods becomes increasingly significant, e.g., NashMTL = 2.07 versus Ours = 1.01 with 2 tasks, and NashMTL = 2.49 versus Ours = 1.01 with 40 tasks. We will add this discussion in Line 299.


4. Questions.

Convergence difference in Figure 2: We evaluate the convergence difference as the standard deviation of the task-specific numbers of epochs required to reach convergence. We will polish this description in Line 90.

Definition of Δm: Δm is the average per-task performance drop relative to STL, following representative MTO works (NashMTL and FAMO). We will polish this in Line 268.
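For concreteness, the standard Δm metric from NashMTL/FAMO can be sketched in a few lines (the helper name and toy numbers below are ours for illustration): for each task, compute the signed relative change versus STL, flip the sign for metrics where higher is better, and average.

```python
import numpy as np

def delta_m(mtl_metrics, stl_metrics, higher_is_better):
    """Average per-task performance drop (%) relative to single-task learning.

    Delta_m = 100/M * sum_k (-1)^{d_k} (M_k - S_k) / S_k,
    where d_k = 1 when a higher value of metric k is better.
    Negative values mean the MTL model beats STL on average.
    """
    mtl = np.asarray(mtl_metrics, dtype=float)
    stl = np.asarray(stl_metrics, dtype=float)
    sign = np.where(np.asarray(higher_is_better), -1.0, 1.0)
    return 100.0 * float(np.mean(sign * (mtl - stl) / stl))
```

For example, an MTL accuracy of 60 versus an STL accuracy of 50 (higher is better) gives Δm = -20%, i.e., a 20% improvement over STL.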


References:

[a] Awasthi, P., Bandeira, A. S., Charikar, M., Krishnaswamy, R., Villar, S., & Ward, R. (2015, January). Relax, no need to round: Integrality of clustering formulations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science (pp. 191-200).

[b] Tepper, M., Sengupta, A. M., & Chklovskii, D. (2017). The surprising secret identity of the semidefinite relaxation of k-means: manifold learning. arXiv preprint arXiv:1706.06028.

[c] Damle, A., Minden, V., & Ying, L. (2019). Simple, direct and efficient multi-way spectral clustering. Information and Inference: A Journal of the IMA, 8(1), 181-203.


Thank you for your time and efforts. We hope our experimental results and clarifications have addressed your concerns. Please don’t hesitate to reach out if you have further questions or need more information.

Official Review
Rating: 6

This paper proposes a multi-task optimization method designed to address task imbalance by aligning optimization processes across different tasks. To accomplish this, the authors developed an adaptive group risk minimization strategy, which includes two key techniques: (i) dynamic group assignment, clustering similar tasks based on their interactions, and (ii) risk-guided group indicators, leveraging consistent task correlations and risk information from previous iterations. Extensive experimental results across various benchmarks show that the proposed method outperforms others while requiring even lower computational costs.

Strengths

  1. The paper is well-organized and easy to follow.
  2. The proposed method outperforms several newly proposed MTO algorithms.
  3. Extensive experimental results across various benchmarks show that the proposed method outperforms others while requiring even lower computational costs.

Weaknesses

  1. To help readers better understand the application scenarios of the proposed method, the authors could provide examples illustrating the phenomenon of task imbalance in MTO tasks.
  2. In Eq. (4), what is the optimization method for w and G? Is it performing standard K-means on γ?
  3. In Line 172, why is it possible to invert the K×M matrix G?
  4. In the introduction of the datasets, there is a lack of explanation regarding the imbalance phenomenon in the relevant learning tasks.

Questions

See "Weaknesses".

Limitations

No.

Author Response

We sincerely thank Reviewer 9PzS for their insightful comments. The following addresses the main concerns and answers the questions.


1. The phenomenon of task imbalance and application scenarios.

Thanks for your kind suggestion. Task imbalance describes the phenomenon that some tasks are severely under-optimized during multi-task training. This can be observed in Fig. 2 and Table 1. (i) In the left subplot of Fig. 2, the task "normal" is under-optimized and converges earlier than the other tasks. (ii) In Table 1, most baselines achieve performance comparable to STL on the segmentation and depth estimation tasks but significantly sacrifice performance on the surface normal estimation task. We note that on NYUv2, our work is the only method that improves each task's performance relative to the corresponding STL results, especially on surface normal estimation. This demonstrates that our method is better at alleviating task imbalance.

Meanwhile, we would like to specify one typical application scenario of GO4Align: vision-based autonomous driving, such as Autopilot, which needs to handle several tasks simultaneously (e.g., instance segmentation and depth estimation). Significant differences in the improvements across tasks can arise from diverse scales, different task difficulties, or asynchronous learning dynamics. In this case, the proposed method can be deployed to balance their learning processes, particularly without extra training costs.


2. What is the optimization method for ω and G?

We use standard K-means with Euclidean distance (via the kmeans_pytorch package) on γ to optimize ω and G.
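As a rough illustration of this step, the sketch below clusters toy indicator vectors with a minimal NumPy implementation of Lloyd's K-means and builds the hard assignment matrix G; the paper's implementation uses kmeans_pytorch instead, and the toy indicators and sizes are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=50, seed=0):
    """Minimal Lloyd's K-means; stand-in for kmeans_pytorch here."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Recompute centers as cluster means.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

M, K = 5, 2                                              # tasks, groups
gamma = np.array([[0.1], [0.2], [0.15], [2.0], [2.1]])   # toy 1-D indicators

labels = kmeans(gamma, K)
G = np.zeros((K, M))                 # hard assignment matrix, K x M
G[labels, np.arange(M)] = 1.0        # G[k, m] = 1 iff task m is in group k
assert np.allclose(G.sum(axis=0), 1.0)  # each task in exactly one group
```

With these well-separated toy indicators, tasks {0, 1, 2} and {3, 4} end up in different groups; the group weights ω are then derived from the risk-guided indicators (Eqs. 5-6 in the paper, not reproduced here).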


3. Why is it possible to invert the matrix G?

Special thanks for pointing this out. The assignment matrix G has full row rank. For simplicity, we use its generalized inverse, specifically the one-sided right inverse G_R^{-1} = G^T (G G^T)^{-1}. We will add a detailed description in Line 172.
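This right inverse is easy to check numerically; the assignment matrix below is a hypothetical example (rows = groups, columns = tasks), not one from the paper.

```python
import numpy as np

# Hypothetical hard assignment of M = 5 tasks into K = 2 groups.
# Each task belongs to exactly one group, so G has full row rank
# as long as every group is non-empty.
G = np.array([[1., 1., 0., 0., 1.],
              [0., 0., 1., 1., 0.]])

# One-sided right inverse: G_R = G^T (G G^T)^{-1}.
G_R = G.T @ np.linalg.inv(G @ G.T)

# G @ G_R recovers the K x K identity, so G_R is a valid right inverse.
assert np.allclose(G @ G_R, np.eye(G.shape[0]))
```

Note that for a hard assignment, G G^T is diagonal with the group sizes on the diagonal, which is why full row rank only requires every group to be non-empty.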


4. Imbalance phenomenon in datasets.

The task-imbalance phenomenon in the MTO literature [1, 27] mostly refers to the imbalanced optimization process rather than the data distributions in the task space. We will clarify this in the paper; specifically, we will add the description of task imbalance in Line 257.


Thank you for your time and efforts. We hope this has addressed your concerns and answered your questions. Please don’t hesitate to reach out if you have further questions or need more information.

Official Review
Rating: 7

The paper proposes GO4Align, a multi-task optimization approach that targets the task imbalance issue. Specifically, the authors devise two components: (1) dynamic group assignment, which attributes similar tasks to the same cluster while distinguishing dissimilar ones; and (2) risk-guided group indicators, which include scale-balance and smooth-alignment. These two components are incorporated into the AGRM principle for multi-task optimization.

Strengths

Overall, this paper is well-structured, with clear writing and an organized layout. The primary motivation, addressing task imbalance in optimization, is clearly articulated and valuable. The methodology is understandable and easy to reproduce. The experiments also show that GO4Align achieves a Δm of -6.08% with no additional training time.

Weaknesses

While I do not have major concerns about this paper, it’s worth noting that the performance improvement of GO4Align over FAMO is marginal, which may limit its broader impact. The concept of grouping-based task interactions is innovative, but not necessarily groundbreaking. Besides, I have some questions. Please refer to the next part.

Questions

  • Given the existence of Eq.4, I’m unsure whether Eq.5 or Eq.6 would have a significant impact. How would the performance change if only Eq.4 is included?
  • The paper asserts that GO4Align has even lower computational costs, yet it is 1.02x that of STL. Does this mean that GO4Align is more efficient compared to the baselines (not the STL)?

Limitations

Yes.

Author Response

We sincerely thank Reviewer AXsM for their insightful comments. The following addresses the concerns and provides answers to the questions.


1. GO4Align versus FAMO.

We thank the reviewer for the comment. GO4Align and FAMO are both strong candidates for handling task imbalance in the MTO literature. However, on NYUv2, our work is the only method that improves each task’s performance relative to the corresponding STL results, especially for the challenging surface normal estimation task. This demonstrates that our method performs better in alleviating task imbalance.


2. Questions.

Performance change with only Eq. 4: Eq. 4 defines a bilevel optimization process in which the lower-level optimization takes the group indicators (Eq. 5 and Eq. 6) as inputs to update the assignment matrix and group weights. Thus, the optimization cannot be performed without specifying the group indicators. More details can be found in Lines 6-10 of Algorithm 1.

Computational cost: Each method's training time is reported relative to a common MTL baseline (LS, linear scalarization), following the same protocol as representative MTO works (RLW [39] and FAMO [14]). LS weights each task equally without extra loss-oriented or gradient-oriented techniques. Moreover, our method has a computational-efficiency advantage over STL: STL incurs O(M) space and time cost, where M is the number of tasks, whereas our work uses O(1) space and time thanks to the shared architecture and loss-oriented mechanism.


Thanks again for your time and efforts. We hope this has answered your questions. If you have any other questions, we are happy to discuss them and provide further clarification.

Comment

Thanks for your response. I have no further questions and keep my current rating at this phase.

Official Review
Rating: 5

This paper presents an approach for multi-task learning aiming to reduce interference among tasks by adaptive loss weighting. Instead of task-specific weights as in existing works, the authors propose to group similar tasks so that all tasks in the same group share the same weight. The paper also includes a new derivation of task-level weights (the group indicator), which contains two components: scale-balance and smooth-alignment. Evaluation was done using four benchmark datasets, three in computer vision and one in chemistry (predicting properties of molecules).

Strengths

  • The manuscript is very well written and easy to follow.
  • The proposed idea is neat. The proposed task grouping is flexible and can easily work with other task-weight deriving methods.
  • An ablation study was done to evaluate the contribution of each ingredient of the proposed method.

Weaknesses

  • I see the technical contributions of this work as twofold. One is the derivation of the group indicator (essentially task-specific weights) and the other is using the same weight for all tasks in a group instead of task-specific weights. I do not see clear motivation for either. More discussion is needed on why both are expected to be better than existing approaches.
  • Figure 5 needs more description to make it easier to read. It took me quite some effort and time to figure out that the x-axis in the plots is epochs and the intensity of the color indicates the weight value (if I am correct).
  • There are results from only one dataset that contains just three tasks. Results from more datasets, especially those with larger numbers of tasks, are desired, given that the proposed method aims at task grouping.
  • The reported results are not strong. (1) Based on the results on CityScapes (Table 2), which has only two tasks (meaning there is not much grouping involved, and differences in performance should stem from the weight calculation), GO4Align is not the best performing method, implying the proposed weight calculation could be inferior to existing ones. (2) All numbers in Table 2 are positive, implying the models perform worse than single-task learning, which defeats the major advantage of MTL: enhancing model generalizability and performance.
  • I do not see an ablation study that compares grouping to no grouping at all.
  • More details on the experimental setup are needed to enhance reproducibility, especially how each dataset was partitioned into training, validation, and test sets.

Questions

Refer to the weaknesses.

Limitations

NA

Author Response

We sincerely thank Reviewer FqZ5 for their insightful comments. The following addresses the concerns and provides answers to the questions.


1. Motivations.

Thank you for acknowledging the two technical contributions to our paper. We will clarify the motivations for both in the main manuscript as follows:

Group indicator: The motivation for the group indicator derivation is to fully utilize risk information to explore the relationships among tasks; using risk information avoids the computational cost of gradients. Compared with other loss-oriented methods (such as RLW, DWA, UW, and FAMO), our group indicator captures the differences in per-task risk scales and fully exploits the learning dynamics over time, yielding better representations of risk information.

Empirically, Table 4 (Line 340) compares other task-specific weights (MGDA, NashMTL, and FAMO) combined with the proposed AGRM principle; our group indicators yield superior performance over these alternatives.

Group-specific weights versus task-specific weights: The motivation for group-specific weights is that tasks with close group indicators behave similarly in terms of risk, and similar tasks can benefit from training together by sharing as much as possible [21], including weights in the multi-task objective. With group-specific weights, our method can align tasks with similar risk behaviors during joint training, alleviating the task-imbalance issue.

As shown in Table 3 (Line 307), our work with group-specific weights outperforms its variants with task-specific weights (without Eq. 4). This further demonstrates that group-specific weights can effectively alleviate the under-optimization of some tasks.


2. Caption of Figure 5.

Thanks for your suggestion. We've taken your advice and added the description to the caption of Figure 5.

The x-axis in the subplots denotes the epoch, and the intensity of the color indicates the weight value.


3. Datasets with a larger number of tasks.

In the MTL literature, it is common to evaluate on datasets with a small number of tasks, such as NYUv2 (3 tasks) and CityScapes (2 tasks) [6, 13, 14, 16]. As task grouping may be particularly beneficial for larger numbers of tasks, we also included QM9 (11 tasks) and CelebA (40 tasks). On these last two benchmarks, our method achieves better overall performance than other loss-oriented methods, with higher training efficiency than gradient-oriented baselines. The overall results are in Table 2 (Line 276) and the details in Appendix Tables 5 & 6.


4. Experimental analysis on Table 2.

We appreciate the reviewer's careful observations of Table 2. We want to provide clarifications on two points:

Weight calculations on CityScapes: GO4Align achieves comparable performance, securing third position (Δm) on CityScapes. It is worth noting that this benchmark comprises only two tasks, which offers limited grouping options and constrains the effectiveness of the proposed weight calculation (risk-guided group indicator). Furthermore, when compared against other weight-calculation methods on NYUv2 (Table 4, Line 340), the proposed risk-guided group indicator surpasses the alternatives.

MTL methods underperforming STL in the overall evaluation: This is a common phenomenon in the MTL literature (CAGrad, NashMTL, and FAMO), caused by the task-imbalance issue, where some tasks are under-optimized during joint training. In particular, we note that: (i) in Appendix Tables 5 & 6, which report detailed task-specific performance, most baselines achieve performance comparable to STL on some tasks but significantly sacrifice performance on the others; (ii) MTL still offers advantages such as improved computational efficiency and reduced training time, which significantly benefit real-world systems.


5. Our model with grouping versus without grouping.

We refer the reviewer to the ablations in Table 3 (Line 307). Since grouping is performed by Eq. (4), the first two rows of Table 3 are variants of our method without grouping, while the last row is our method with grouping. Table 3 thus empirically examines the performance gains of task grouping over models without grouping. We will make this clearer in Line 307.


6. Detailed experimental setup and open-source plan.

This work follows the same experimental setting as NashMTL [13] and FAMO [14], including the dataset partitions for training, validation, and testing. The benchmark partitions are attached below; we will add this table in Appendix Sec. B.1. We also note that NYUv2 and CityScapes do not have validation sets; following the protocol in [13, 14], we report the test performance averaged over the last ten epochs. Importantly, we will release our code to facilitate MTO research after the final decision.

| Datasets | Total | Training | Validation | Test |
|---|---|---|---|---|
| NYUv2 | 1,449 | 795 | N/A | 654 |
| CityScapes | 3,475 | 2,975 | N/A | 500 |
| QM9 | ~130k | ~110k | 10k | 10k |
| CelebA | 202,599 | 162,770 | 19,867 | 19,962 |

Thank you for your feedback. We greatly appreciate the time and effort you put into reviewing our work. We have carefully considered your comments and made improvements based on your suggestions. We hope you will reconsider your evaluation of our work. Thank you once again.

Comment

I appreciate the response from the authors. Although I still do not see a clear explanation of why grouping helps, my other comments have been largely addressed. Considering this together with the generally positive ratings from the other reviewers, I have increased my rating.

Author Response

Dear Reviewers and Area Chair:

We sincerely thank you for your time, insightful suggestions, and valuable comments. We are encouraged by your support and positive reviews of our work:

  • Neat/innovative idea of adaptive task grouping in MTO [Reviewers FqZ5/AXsM];
  • Manuscript well-written/structured and easy to follow [Reviewers FqZ5/AXsM/9PzS];
  • SOTA performance with no additional training time [Reviewers AXsM/9PzS];
  • Extensive experiments with sufficient ablation studies [Reviewers FqZ5/9PzS/sLBS].

To address your concerns, we have been working diligently on improving the paper in several aspects. We summarize the major changes that we will update in the main manuscript:

  • Add additional conceptual analysis to clarify the motivations for our method [Reviewer FqZ5] and the computational efficiency over STL [Reviewer AXsM];

  • Add additional experimental analysis for the main results and the ablations, such as "Experimental analysis on Table 2" [Reviewer FqZ5], "Our model with grouping versus without grouping" [Reviewer FqZ5], "GO4Align versus FAMO" [Reviewer AXsM], "The phenomenon of task imbalance and application scenarios" [Reviewer 9PzS], "Choice of the group indicators" [Reviewer sLBS], and "Comparable time with linear scalarization and scaling up the number of tasks" [Reviewer sLBS];

  • Provide additional experimental results to investigate the effects of different clustering methods. For more details and discussions, please refer to the response to Reviewer sLBS.

| Method | Package | GPU support of clustering | Relative runtime (↓) | Δseg. (↓) | Δdepth (↓) | Δnormal (↓) | Δm (↓) |
|---|---|---|---|---|---|---|---|
| Ours w/ SDP-based clustering | sdp_kmeans | No | 1.20× | -2.97 | -18.76 | -1.09 | -5.44 |
| Ours w/ spectral clustering | sklearn | No | 1.17× | -1.78 | -18.58 | -0.06 | -4.56 |
| Ours w/ K-means clustering (in the paper) | kmeans_pytorch | Yes | 1.02× | -4.03 | -20.37 | -1.18 | -6.08 |

  • Clarify the descriptions of figures [Reviewer FqZ5], experimental setups [Reviewers FqZ5/9PzS], the generalized inverse of the assignment matrix [Reviewer 9PzS], and evaluation metrics [Reviewer sLBS].

Once again, we thank all reviewers and area chairs. Your efforts and suggestions helped us improve this paper. Please see the reviewer-specific response for more detailed information.

Final Decision

This paper proposes a multi-task optimization approach to alleviate the task imbalance problem. It adopts an adaptive group risk minimization strategy that consists of two parts: dynamic group assignment and risk-guided group indicators. Dynamic group assignment aims at grouping similar tasks into the same cluster, while risk-guided group indicators balance task risks by exploiting consistent task correlations. Ablation experiments demonstrate the effectiveness of each part well. After the rebuttal, most of the issues were resolved and all reviewers give positive scores. Reviewers acknowledge the paper's good organization and sufficient experiments. Although for Reviewer FqZ5 the explanation of why grouping helps is still not fully clear, their other comments have been largely addressed and their rating was also increased. The authors are encouraged to further strengthen this part in the final version. Based on these points, I recommend accepting this manuscript.