Interleaving Multi-Task Neural Architecture Search
We develop an interleaving learning framework for neural architecture search.
Abstract
Reviews and Discussion
The authors deal with the problem of finding a good architecture for multiple tasks. Since multi-task learning is hard, they propose to separate the tasks and learn them in a round-robin fashion. Algorithm: they train an encoder whose architecture is common to all tasks, while the weights may differ for every task.
- Perform M rounds of learning over the K tasks.
- Transfer the knowledge from task t_n to t_(n+1). They do so by sampling 10 examples from tasks n and n+1, creating augmentations, and labelling them as similar or dissimilar according to whether their distributions match under task n (Maximum Mean Discrepancy below a threshold). Task n+1 is then forced to output the same similar/dissimilar labels for the same samples.
- Train the head only for the current task.
- Update the architecture and go back to step 1.
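For concreteness, a rough sketch of this loop as I read it; all names, the form of the transfer term, and the threshold/margin values are my own placeholders, not the authors' code.

```python
# A minimal sketch of the round-robin loop described above (my reconstruction,
# not the authors' code). augment_fn and mmd_fn are assumed helper callables.
import copy
import torch
import torch.nn.functional as F

def interleaved_round_robin(encoder, heads, optimizers, task_loaders, num_rounds,
                            augment_fn, mmd_fn, mmd_threshold=0.1, margin=1.0):
    prev_encoder = None                                    # snapshot of the previous task's encoder
    for _ in range(num_rounds):                            # M rounds
        for t, loader in enumerate(task_loaders):          # K tasks, one at a time
            for x, y in loader:
                optimizers[t].zero_grad()
                loss = F.cross_entropy(heads[t](encoder(x)), y)   # current task's head only
                if prev_encoder is not None:
                    # transfer: two augmentation sets get a similar/dissimilar label from the
                    # previous task's encoder (MMD below/above a threshold); the current
                    # encoder is pushed to reproduce that relation
                    a1, a2 = augment_fn(x), augment_fn(x)
                    with torch.no_grad():
                        similar = mmd_fn(prev_encoder(a1), prev_encoder(a2)) < mmd_threshold
                    cur = mmd_fn(encoder(a1), encoder(a2))
                    loss = loss + (cur if similar else F.relu(margin - cur))
                loss.backward()
                optimizers[t].step()
            prev_encoder = copy.deepcopy(encoder).eval()   # freeze task t's encoder weights
    # the architecture itself is only updated after these rounds (post-hoc NAS step)
```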
Strengths
Using the MMD for knowledge transfer sounds like an interesting approach.
Weaknesses
- The paper is very poorly written; there are many redundant details and not enough critical ones. See the rest of the review for an explanation.
- The algorithm is unclear. It is explained in multiple places, with bits of information in each. Some of the stages look like redundant branching. The clearest description of the algorithm is the pseudocode in A.3, yet even there the algorithm is not very precise and contains typos. For example:
- Why is there a special case of M>1 and k=1? This is just redundant.
- The architecture update is outside the for loop; this seems to be just another mistake of a poorly written paper.
- The algorithm seems suboptimal. The weights W of the encoder are trained only on the first task; on all the other tasks, the weights are trained only through the MMD term. Why not train the encoder, at least partially, on the later tasks as well? The algorithm also seems very slow, since in every step one needs to learn all the tasks, and no data on training time is reported.
- This is a NAS paper, yet there is very little discussion of architectures. What is the search space? What is the final architecture? How does it differ from architectures produced by other algorithms for the same tasks?
- The motivation/comparisons are not clear:
- Are you learning the same architecture but different weights for different tasks? What is the advantage of sharing one architecture across tasks? Why not perform a separate optimization for each task?
- A few key components are missing for the correct comparison. As stated in the introduction: "Data encoders of all tasks share the same architecture, but have different network weights", so the correct comparison for this case should be in the same setting. Two core comparisons that are missing from the paper are:
- Training the same architecture on each task individually, without the round-robin algorithm.
- Training the other architectures with the same procedure as yours.
- Looking at some well-known architectures and comparing against the classification results provided by the authors, even the smallest EfficientNet (B0) [1] outperforms the proposed algorithm on all 3 tasks. With more modern architectures such as EfficientNetV2 [2], the performance gap increases significantly.
[1] Tan, M. and Le, Q., 2019, May. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105-6114). PMLR.
[2] Tan, M. and Le, Q., 2021, July. Efficientnetv2: Smaller models and faster training. In International conference on machine learning (pp. 10096-10106). PMLR.
Questions
Look at the weaknesses.
The paper identifies choosing the task weighting as a major challenge in multi-task learning, specifically multi-task neural architecture search. The authors propose IMTNAS, a method which interleaves tasks, learns weights across tasks with soft-weight sharing, and applies post-hoc NAS. A new method for soft-weight sharing based upon distribution matching (controlling the maximum mean discrepancy between augmentation sets) is proposed. IMTNAS is evaluated in two multi-task settings: ImageNet classification/MS COCO detection, and ImageNet/CIFAR-10/CIFAR-100 classification.
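For reference, the maximum mean discrepancy between two augmentation sets is typically estimated with a kernel; a minimal (biased) RBF-kernel version is sketched below. The paper may use a different kernel or estimator.

```python
# A standard (biased) RBF-kernel MMD^2 estimate between two feature sets;
# kernel choice and bandwidth are illustrative, not taken from the paper.
import torch

def mmd_rbf(x, y, bandwidth=1.0):
    """x: (n, d), y: (m, d) feature matrices; returns a scalar MMD^2 estimate."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2                  # pairwise squared distances
        return torch.exp(-d2 / (2 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# example: two augmentation sets of 10 samples with 128-dimensional features
x, y = torch.randn(10, 128), torch.randn(10, 128)
print(mmd_rbf(x, y))
```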
Strengths
Strengths of the paper include:
- IMTNAS performs strongly in the experiments, beating a range of baselines.
- A new knowledge transfer method for soft-weight sharing based upon distribution matching is proposed. This technique outperforms the baseline transfer method in the experiments (Table 3). It is also a promising idea to use augmentation data to encourage consistent representations between consecutive tasks.
- Related works from multi-task learning and neural architecture search are discussed in detail.
Weaknesses
The paper has a number of weaknesses:
- For a paper focusing on NAS, the actual NAS contribution is minor and borderline trivial. NAS is not interleaved with task learning but happens "post-hoc" after the task weights have been learned (Section 3.1.3). The authors explicitly state in Sections 3.1.1/3.1.2 that the architecture is not optimized in these stages. The NAS search space is simply taken from ProxylessNAS, and gradient-based NAS is applied (it is unclear which gradient-based NAS method).
- As of Equation (6), NAS is based on a uniform task weighting (over "normalized" task losses). For a paper that aims at addressing the challenge "How to balance these losses by finding the optimal loss weights requires a lot of tuning, which is time-consuming and labor intensive.", this is surprising: in the end, a trivial task weighting in NAS seems to suffice? The authors state "Note that our method focuses on alleviating the burden of tuning the weights of tasks’ training losses." But then the work should be positioned as a multi-task learning paper, not as a NAS paper, because the main contribution (soft-weight sharing) does not apply to NAS.
- For a paper focusing on "multi-task" learning, experiments on two heterogeneous/three homogeneous tasks are not sufficient. For instance: how would the method perform in settings with ~5 tasks, where some of the tasks are highly similar but others are outliers (highly dissimilar)?
- It is unclear how the architecture A is chosen in Section 3.1.1/3.1.2. It is not optimized there, but presumably some diversity in the selection of A during multi-task learning is required to avoid overly strong co-adaptation of weights and architecture.
- An ablation on using a hypernetwork is missing. In the current form, it remains unclear whether IMTNAS would work equally well (or even better) without a hypernetwork.
- The paper lacks technical details, and it would be difficult to reproduce/confirm the paper's results. For instance, what does it mean for the hypernetwork that "The architecture A is represented as a continuous vector. The dimension of the vector is the number of operations in the search space. A_i denotes the importance weight of the i-th operation (i.e., whether it should be selected)"? Does A_i come from the gradient-based NAS? (A sketch of the usual reading of such an architecture vector is given after this list.)
- While the related works section is extensive, it does not cover recent work; specifically, for NAS, nearly all discussed works are from 2020 or earlier.
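For concreteness, this is how a continuous architecture vector is usually realized in DARTS/ProxylessNAS-style gradient-based NAS: one logit per candidate operation, mixed by a softmax during search and discretized by argmax for evaluation. This is my own illustration of the standard mechanism, not the paper's code.

```python
# Sketch of what "A_i is the importance weight of the i-th operation" typically
# means in gradient-based NAS; the candidate operations here are placeholders.
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([                      # candidate operations
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),
        ])
        # the continuous architecture vector A: one logit per candidate operation
        self.arch = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = torch.softmax(self.arch, dim=0)       # importance weights A_i
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def discretize(self):
        # for the evaluation phase: keep only the highest-weighted operation
        return self.ops[self.arch.argmax().item()]
```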
In summary: I think the paper is primarily a multi-task learning work rather than a NAS work. However, for a multi-task paper, it lacks evaluation on complex multi-task settings. It also lacks technical clarity, which hampers reproducibility.
Questions
Clarifying some of the issues raised under "weaknesses" would be appreciated.
Also, I am missing evidence for the claim "When multiple tasks are closely related, it is beneficial to search for a shared architecture for all tasks via multi-task learning". How would IMTNAS perform if architectures were optimized independently per task rather than jointly?
The paper proposes a multi-task neural architecture search method which searches for a shared backbone architecture for all the tasks. In this framework, the shared encoder learns an architecture without the constraint of sharing the weights between the set of tasks. A single optimization loop in this method comprises several rounds, each of which learns a set of tasks in a sequence, while also transferring knowledge between consecutive tasks using a novel distribution matching method. In each round, each task optimizes its own loss function. At the end of every loop, the architectural parameters of the supernet (the shared encoder) are optimized by minimizing the sum of (normalized) losses across all tasks on their respective validation datasets. By learning one task at a time, the method introduced in the paper eliminates the challenge of weighting the loss terms of the different tasks. It is also generally compatible with differentiable NAS methods.
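As a reference point for the architecture step described above, here is a minimal sketch of a uniformly weighted sum of normalized per-task validation losses. The choice of normalizer (the detached loss value itself) is my assumption; the paper may define the normalization differently.

```python
# Sketch of the end-of-loop architecture update: a uniformly weighted sum of
# normalized per-task validation losses. Normalizing by the detached loss value
# is my assumption, not necessarily the paper's definition of "normalized".
import torch

def architecture_update(task_val_losses, arch_optimizer):
    """task_val_losses: list of scalar validation losses (one per task), each
    computed with that task's encoder weights but the shared architecture
    parameters that arch_optimizer updates."""
    arch_optimizer.zero_grad()
    total = sum(loss / loss.detach().clamp_min(1e-8) for loss in task_val_losses)
    total.backward()
    arch_optimizer.step()
```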
Strengths
- The paper tackles an important problem in the domain of Multi-Task NAS - the weighting of the losses of the different tasks - by eliminating it altogether.
- While the results may not be state-of-the-art, they are promising.
- The method is relatively simple, and more importantly, it is orthogonal to differentiable NAS methods. In principle, this makes it possible to apply it in conjunction with any DARTS-like method, allowing advances in differentiable NAS methods to be integrated with the method introduced in the paper.
- In terms of originality, although one could argue that there is no novelty in interleaving the set of tasks which are being learned, the novel distribution matching mechanism introduced in the paper makes up for it.
Weaknesses
- Although the paper is mostly written well, a few ideas are not conveyed clearly in the main body of the paper. They are as follows:
- When the method is introduced in Section 3, the encoder is described as merely having a shared architecture A, with no hint of what this architecture is. This can be confusing to the reader, since Eq. (2) through (5) emphatically do not optimize A but consume it nevertheless. Simply explaining that the encoder is a supernet in a gradient-based NAS setting could avoid this confusion. This becomes clear only in Section 4.3.1, when ProxylessNAS is introduced (which should ideally have been introduced in the Related Works Section 2.2).
- The role of differentiable NAS in the method is not clearly explained. The optimization algorithm for solving Eq. (7) is deferred to the appendix without a reference. Similarly, the brief overview of differentiable NAS in Section 3.1.4 lacks context (even though differentiable NAS is what is used to optimize the architecture).
- The discussion on the role of hypernetworks in the method is too brief to fully understand their function (except, superficially, that they save compute and memory); a sketch of one possible reading appears after this list.
- It was not clear to me (before reading the appendix) that the learning stages of the method sit within a single loop of the optimization method, such as PC-DARTS. The initial impression upon reading Section 3 was that the method terminates at the end of the round.
- Broadly speaking, the experimental setups of the two multi-task settings are the same: in the search phase, the optimal architectural parameters are learned, and in the evaluation phase, the discretized model is trained from scratch. This is explained quite well in Section 4.4.1 but skipped entirely in 4.3.1.
- The method is applicable only to differentiable NAS methods. I will admit, however, that this does not pose a significant issue, since compatibility with evolutionary and other black-box methods simply does not fall within the scope of the method presented in the paper.
- Code is not available.
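As referenced above, one possible reading of the hypernetwork's role is that a small network maps the continuous architecture vector to (part of) the encoder weights, so that only the hypernet's parameters need to be kept in memory. The following is my own sketch under that assumption, not the paper's implementation.

```python
# Hypothetical hypernetwork: generate a weight matrix from the architecture
# encoding A, instead of storing the weights themselves (my sketch, not the paper's).
import torch
import torch.nn as nn

class WeightHypernet(nn.Module):
    def __init__(self, arch_dim, out_features, in_features, hidden=64):
        super().__init__()
        self.out_features, self.in_features = out_features, in_features
        self.net = nn.Sequential(
            nn.Linear(arch_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_features * in_features),
        )

    def forward(self, arch_vector, x):
        # map the architecture encoding to a weight matrix, then apply it to x
        w = self.net(arch_vector).view(self.out_features, self.in_features)
        return x @ w.t()

# usage: an architecture vector of length 8 generating a 32x16 linear layer
hyper = WeightHypernet(arch_dim=8, out_features=32, in_features=16)
out = hyper(torch.zeros(8), torch.randn(4, 16))
```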
I would consider updating the score if these points (other than point 2) are addressed.
Questions
The following are the questions I have about the paper:
- Are there any ablation studies that look at how the number of training examples used to generate the augmented datasets (not the number of augmentations per training example) influences the performance? The appendix indicates that the batch size used for the CIFAR-10 and CIFAR-100 experiments is 256. Does this mean that for every 256 training samples used in a one-step gradient-descent approximation, a fixed number of samples is used for distribution matching?
- The paper states that hypernets were used to reduce computation and memory costs, allowing the method to store only the parameters of the hypernet in memory, as opposed to all the weights. This seems to imply that one would otherwise have to store the weights of all tasks in the optimization loop. I do not understand why this is the case: would it not suffice to store the current weights directly? I am also not clear on the role of the hypernet here: is it simply used to store the weights? If so, does that mean that every weight update is followed by a hypernet training step?
- Many of the baselines compared against in Table 1 are from 2019. Are there newer methods that perform better? FBNetV5 [1], for example, seems to perform better on ImageNet classification.
- Would it be possible to get confidence intervals for the COCO dataset in Table 1, like in Table 2?
- In Table 2, why are the params of PC-DARTS and IMTNAS-PC-DARTS expressed as an average of three values?
- In Table 2, the caption indicates that runs marked with † were re-run 10 times with random initializations, yet in the bottom section only PC-DARTS is marked with it. However, Section 4.4.1 indicates that the errors are reported over 10 random runs. This is a bit confusing.
Here are a few suggestions to improve the paper:
- Improve the phrasing of “more decrease of one loss renders less decrease of other losses”. Perhaps "A greater reduction in one loss results in a lesser reduction of other losses."? Similarly, rephrase "More loss decrease of one task leads to less loss decrease of the other task”
- FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search was published in ICCV 2021. Please update the citation.
- Section H.1 mentions tuning "the interleaving round number K", but K refers to the number of tasks in the main part of the paper. Is this supposed to be the partial connection factor of PC-DARTS, or should it be the round number M?
- Ensuring consistency in the nomenclature between Table 1 and Table 2, particularly by using the name 'ProxylessNAS' for the IMTNAS on COCO and ImageNet datasets in Table 1, would enhance clarity for the reader.
[1] Wu et al. FBNetV5: Neural Architecture Search for Multiple Tasks in One Run.
This submission received scores of 3, 3, 6, and there was no rebuttal from the authors. Upon seeing the other reviews, the positive reviewer said that their main concern with the paper is its presentation and clarity of writing, that this sentiment is clearly echoed in the other reviews, and that, considering the concerns raised by the other reviewers, they too are convinced that the paper should be rejected. Therefore, I propose rejection.
Why Not a Higher Score
Nobody is arguing for it.
Why Not a Lower Score
N/A
Reject