PaperHub
Rating: 4.8 / 10 · Rejected · 4 reviewers
Individual ratings: 3, 5, 5, 6 (min 3, max 6, std 1.1)
Average confidence: 3.8
ICLR 2024

Selective Mixup Helps with Distribution Shifts, But Not (Only) because of Mixup

Submitted: 2023-09-16 · Updated: 2024-02-11
TL;DR

Selective mixup (a family of methods very successful at improving out-of-distribution generalization) is sometimes equivalent to weighted sampling, a classical baseline for handling covariate and label shift.
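
To make this concrete, here is a minimal simulation of the claimed mechanism. It is not the paper's code; the imbalanced binary dataset is hypothetical, and it only illustrates the "different class" selection criterion. The point is that merely selecting the pairs already rebalances the class distribution seen during training, before any mixing takes place:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced binary dataset: 90% class 0, 10% class 1.
labels = rng.choice([0, 1], size=10_000, p=[0.9, 0.1])

def entropy_bits(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Selective pairing: the first member of each pair follows the data
# distribution; the second is drawn uniformly among examples of a
# *different* class (the "diff. class" criterion).
firsts = labels[:2_000]
seconds = np.array([rng.choice(labels[labels != y]) for y in firsts])

seen = np.concatenate([firsts, seconds])  # classes the model actually trains on
print(entropy_bits(np.bincount(labels)))  # ~0.47 bits: skewed data distribution
print(entropy_bits(np.bincount(seen)))    # ~1.00 bits: near-uniform training distribution
```

The same flattened distribution can be produced by weighted sampling alone, with no mixing, which is why the resampling baseline matters.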


Keywords
mixup · distribution shifts · OOD generalization · weighted training

Reviews & Discussion

Review
Rating: 3

This paper attempts to show an equivalence between selective mixup, where sample pairs are selected according to a specific distribution and mixed up for training, and resampling based on the sample distribution that selective mixup induces in the training batches. The authors argue that such mixups shift the underlying training distribution towards a more uniform one and show performance improvements when the test distribution is uniform in nature. The authors also show results for cases where there are multiple domains in the data.

Strengths

The authors present a simple analysis of an intuitively known effect of mixup on training data in which there is an imbalance. They provide a mathematical proof that such selective mixup increases the entropy of the training distribution.

Weaknesses

  1. The authors claim that the selective mixup yields a training distribution over those classes that is closer to the uniform distribution. The authors do not provide results for the imbalanced classification on standard datasets such as CIFAR-10LT, CIFAR-100 LT, and imagenet1k-LT.
  2. Since the authors claim that an equivalent resampling is just as good, an important baseline in long-tail would be to compare against MiSLAS.
  3. Could you provide an analysis where the test distribution is also skewed, independently of the training distribution? How do existing methods and the resampling distribution derived from selective mixup perform in that case? One of the claims made is that these methods are useful because the balancing effect on the training distribution performs well when the test data is uniform.
  4. I find a lack of novelty in the finding that mixup yields a training distribution with higher entropy, i.e., closer to the uniform distribution. Could you provide any strong theoretical justification that the actual mixup is not contributing to an improvement in the performance, since existing works such as MiSLAS do show superior performance to vanilla classifier retraining when training with class-balanced samplers?

Questions

See weaknesses above.

Comment

Thanks for your time. The review seems to rest on a few incorrect assumptions, e.g. the paper makes zero claims about long-tail scenarios (cf. the comparison with MiSLAS). We propose to clarify the paper with the points below.


Evaluation datasets

The authors do not provide results for the imbalanced classification on standard datasets such as CIFAR-10LT, CIFAR-100 LT, and imagenet1k-LT.

  • The 5 datasets we use come directly from the most relevant prior work (selective mixup for distribution shifts).

  • Our main claim is that this prior work missed important ablations, hence our priority is to perform the missing experiments on the same datasets.

  • The point of this paper is to improve our scientific understanding of selective mixup with distribution shifts. Experiments on *-LT could be interesting but would add little value to the main claims.


Could you provide an analysis where the test distribution is also skewed, independently of the training distribution? (...)

A test distribution that is skewed "independently of the training distribution" is our default setting, since the whole paper is about distribution shifts. If the question is specifically about a shift of the label distribution, this is exactly what is investigated with the yearbook/MIMIC/arxiv datasets (see Fig. 10; the test distribution is not uniform).


I find a lack of novelty in the finding that mixup yields a training distribution with higher entropy

We are not sure what "lack of novelty" means here. Prior work missed this as an explanation for the effects of selective mixup. This is exactly why this paper is important: it highlights an effect that has been in plain sight but missed in highly-cited prior work.


Could you provide any strong theoretical justification that the actual mixup is not contributing to an improvement

The paper never makes this claim, which our experiments show to be factually incorrect. The paper states the exact opposite multiple times:

  • yearbook dataset: "(...) confirms the complementarity of the effects of resampling and within-class selective mixup"
  • arxiv dataset: "the performance of selective mixup is explained by cumulative effects of vanilla mixup and resampling effects"
  • civilComments dataset: "the performance of selective mixup is the result of the independent effects of vanilla mixup and resampling"

Different datasets benefit differently from mixup and/or resampling. This is why there is no single simple story, and why the results are analyzed on each dataset separately.

Review
Rating: 5

This paper studies the mechanism of selective Mixup, with a particular emphasis on its resampling aspect. The authors suggest that the resampling effect plays a crucial role in achieving the notable performance. Furthermore, the paper provides theoretical results, demonstrating that specific selection criteria exhibit a "regression toward the mean" bias or help mitigate class imbalance bias. The study includes many empirical results, and the authors propose a novel combination of selective Mixup and resampling to boost the performance beyond that of the original selective Mixup.

Strengths

The resampling effect of selective Mixup has not been investigated in previous literature, and it is definitely a crucial aspect for uncovering the mysteries of selective Mixup. Additionally, the authors introduce a novel technique to improve the performance of selective Mixup by enhancing the resampling effect.

Weaknesses

A notable weakness is the misalignment between the explanations and claims provided and the empirical observations. Additionally, the paper appears to overlook certain important and interesting discussions based on observations.

Further details can be found in the questions outlined below.

Questions

  1. In Figure 2, the authors mention that ''The ranking of various criteria for selective sampling is similar whether with or without Mixup''. However, claiming that the performance of selective sampling is similar to that of selective Mixup seems somewhat strained. For instance, in the case of ''Diff. domain + Same class,'' selective Mixup demonstrates a 6% higher accuracy than selective sampling. What accounts for the superiority of selective Mixup over selective sampling in this scenario?

  2. Similarly, in Figure 2, for ''Same domain'', ''Diff. class'', and ''Diff. domain + Diff. class'', selective sampling is much better than selective Mixup; does this indicate that vanilla Mixup is harmful in these cases? Such observations are even more evident in Figure 8. The authors have not discussed the reasons behind the occasional superiority of selective Mixup over selective sampling.

  3. In Figure 6, given the effective performance of vanilla Mixup, it appears that vanilla Mixup is the main driver for the improvement in selective Mixup, even with the optimal criteria. This observation contradicts the third point outlined in the summarized contributions in the Introduction.

  4. In Figure 6, it is observed that for the case of ''Diff. domain + Same class,'' selective Mixup performs worse than vanilla Mixup. Does this observation imply that the resampling effect may have a degrading impact on the performance of Mixup?

I would be willing to increase my score if the authors addressed my concerns.

Comment

Thanks for your time. These comments are very helpful for improving the paper. We propose to clarify the points below in the text, and make this general message clearer upfront: different datasets benefit differently from mixup and/or resampling. This is why there is no single simple story and why each dataset is best analyzed individually.


In Figure 2, the authors mention that ''The ranking of various criteria for selective sampling is similar whether with or without Mixup''. However, claiming that the performance of selective sampling is similar to that of selective Mixup seems somewhat strained. For instance, in the case of "Diff. domain + Same class," selective Mixup demonstrates a higher accuracy than selective sampling. What accounts for the superiority of selective Mixup over selective sampling in this scenario?

There are indeed small differences (<5%), which we propose to mention in the caption. We will clarify that the key observation holds for the best-performing version (same domain/diff. class, and resampling for uniform combinations), which is >25% above the baseline and the version most readers will care about.


Similarly, in Figure 2, (...) does this indicate that vanilla Mixup is harmful in these cases?

Yes indeed. One can see it just by observing that vanilla mixup is worse than the baseline on this dataset. Prior work has indeed shown that mixup is not always beneficial.


The authors have not discussed the reasons behind the occasional superiority of selective Mixup

We already acknowledge that some cases benefit from selective mixup in the way proposed by Yao et al. For example see Fig.4 (yearbook): "it indicates a genuine benefit from mixup restricted to pairs of the same class". This is also exactly what the title of the paper says ("not always"). We propose to highlight this takeaway in the caption of Fig.4.


In Figure 6, given the effective performance of vanilla Mixup, it appears that vanilla Mixup is the main driver for the improvement in selective Mixup, even with the optimal criteria. This observation contradicts the third point outlined in the summarized contributions in the Introduction.

The reviewer is absolutely correct. On this dataset, 4/5 of the improvement over the baseline can be attributed to vanilla mixup, and 1/5 to resampling. We propose to tone down the third summary point ("resampling is SOMETIMES the main driver").


Does this address your concerns? Happy to take other suggestions for improvement.

Comment

I would like to thank the authors for their response, and apologize for my delayed reply. I also have read all the reviews and the author responses.

  • Yes indeed. One can see it just by observing that vanilla mixup is worse than the baseline on this dataset. Prior work has indeed shown that mixup is not always beneficial.

To clarify, note that in Figure 8, vanilla mixup outperforms the baseline, but selective mixup is much worse than the corresponding selective sampling. The authors briefly mention this in the caption, but additional insights or discussions on this discrepancy would be appreciated.

Regarding the potential drawbacks of vanilla mixup, I am aware of two relevant works, namely [1,2], which are cited in the "Explaining the benefits of mixup" paragraph in your paper. Both of these papers suggest that if early stopping is applied, mixup does not harm performance. Could the less satisfactory performance of vanilla mixup be attributed to the long training time? While [1,2] primarily focus on in-distribution generalization, it's possible that we're dealing with a different scenario, but such considerations are currently missing from the discussion.

[1] Liu et al. Over-training with mixup may hurt generalization. ICLR 2023.

[2] Zou et al. The benefits of mixup for feature learning. ICML 2023.

  • different datasets benefit differently from mixup and/or resampling. This is why there is no single simple story and why each dataset is best analyzed individually.

I agree with the authors that there is no single simple story based on your experiments. However, this has left me somewhat puzzled about the key message of the paper. Claims in the introduction, such as the highlighted text on the first page, "The non-random selection of pairs ... completely unrelated to the mixing," seem to suggest a clear message that resampling is the key factor behind mixup. Yet, after reviewing all the empirical results, the conclusion seems less definitive, especially considering that vanilla mixup can be a dominant factor (e.g., Figure 6).

This is my personal feeling and my viewpoint can be subjective. I will leave this point to the discussion with AC and other reviewers.

On a separate note, I want to mention that ICLR allows authors to modify their paper submission during the rebuttal period. Therefore, if the authors propose changes in their responses, updating the PDF accordingly might be more convincing. If the authors are already aware of this, please disregard this comment.

Comment

This is a very constructive discussion, thanks for your time!


Regarding vanilla mixup: everything you mention is relevant, and it shows that there is variability in its effects that is not yet well understood.


Despite the necessarily complicated results, there is no contradiction, for example with this claim:

"The non-random selection of pairs implicitly biases the training distribution and improve generalization by means completely unrelated to the mixing."

This is perfectly unambiguous and fully supported by the experiments. The resampling is unrelated to the mixing, and it does improve generalization. The message that we propose to highlight much more clearly (and which should really clarify things for the reader) is that the effects of mixup and resampling are additive and largely independent. This remains a significant advance over previous work, which completely overlooked the latter.

Review
Rating: 5

This paper studies the problem of selective mixup, where samples and labels are paired based on a certain criterion and the risk is minimised on the mixed-up samples. The paper then demonstrates that when samples from different classes are mixed up, the overall data distribution regresses towards a uniform distribution. Hence, the success of mixup under distribution shift is attributed to both a resampling effect and a regularization effect, and it is argued that resampling plays an important role. Some theoretical results are shown, which are supplemented with empirical results.

Strengths

Relevant problem to be studied in detail.

Extensive experiments have been conducted for analysis.

Weaknesses

Weak Theoretical Results: I find the theoretical result to be weak, as it only considers the mixup of the labels. For a proper analysis of mixup, the mixing of the data points x (i.e., the covariates) should also be considered to get the complete picture of the problem.

Empirical Results are Scattered: The authors explain the results on each of the datasets independently, which is a source of confusion given the complexity involved. The datasets often lead to conflicting conclusions; for example, results on Arxiv are very different from those on Waterbirds. Hence, it is hard to draw final conclusions. Results on datasets that share similar problem settings and demonstrate consistent behaviour could be combined.

Inconsistency in Experimental Setups: The number of methods considered differs across datasets. Hence, it is hard to parse which combination of methods is most effective on average across all the datasets.

Novelty: Although the authors have conducted a considerable number of experiments, I find the content scattered and insufficient to meet the bar for novelty; it does not provide insights different from existing works (Yao et al., 2023).

Questions

As the method mainly considered label shift, have the authors considered the setting of long-tailed label shift mixup?

Comment

Thanks for your time. These comments are very helpful in showing that the significance of our findings was not sufficiently highlighted. We propose to clarify the paper with the following points.


novelty / does not provide insights different from existing works (Yao et al., 2023)

On the contrary, this work overturns part of Yao et al. We show that the widely-accepted explanation of their method is incomplete or even incorrect in some cases.

This paper is important because it improves the scientific understanding of this highly-cited method. Incorrect explanations will derail future work that builds upon current knowledge. And getting to such deep understanding is the whole point of science.


only considers the mixup of the labels (...) the mixing of the data points x (i.e., the covariates) should also be considered

This is indeed an interesting question that we openly discuss on p. 4. The proposed formalization with labels does not readily extend (e.g. a uniform distribution over covariates is not well defined), but the extensive experiments suggest that the same mechanism is at play for labels and covariates. We propose to make it clear that this is a limitation of our formalization.


The authors explain the results on each of the datasets independently (...) it is hard to draw final conclusions

There is a simple conclusion: there is no simple conclusion :)

Different datasets benefit differently from mixup and/or resampling. This is why there is no single simple story and why each dataset is best analyzed individually. We propose to make this message clearer upfront in the paper.


Do these clarifications address your concerns? We are grateful for your contribution to the improvement of this paper!

Comment

Thanks to the authors for the rebuttal. I have read the other reviews and rebuttal responses. After this, my major concern that the claim of "resampling being the main reason for mixup" is not supported by sufficient experiments (i.e., the empirical results are scattered) still persists. Further, the authors have submitted no revision that improves the clarity of the results. Hence, at this point, I will maintain my rating and leave further discussion to the AC.

Comment

We are more than eager to produce a clarified revision, but let's first agree on what needs to be clarified.

Nothing like the sentence mentioned by the reviewer ("resampling being the main reason for mixup") was ever claimed.

  • The title of the paper itself is very clear: selective mixup works, but not only through mixup. The results are the combined effects of mixup and of the implicit resampling. Prior work completely overlooked the latter.

  • The theoretical part shows that resampling does happen (this is absolutely non-controversial). It also describes the nature of the resampling, which predicts that the effects should be beneficial if there is a "regression towards the mean" in the data.

  • The empirical evidence overwhelmingly supports that resampling has an effect independent from mixup. E.g. on waterbirds, the best "selective mixup" method gets the same improvements as the equivalent resampling without mixup. Other datasets (e.g. yearbook) show a combined effect from mixup + resampling.

  • Even stronger empirical evidence comes from the correlation (Fig.5) between the improvements in performance, and the reductions in divergence between training and test distributions induced by the resampling. This is evidence that the resampling does contribute to the improvements in performance even when mixup also does play a role (and we made it very clear when this is the case, as mentioned in bold three times in Sections 4.2, 4.3, and 4.5).

  • We find support for the theoretical prediction (that improvements would happen if there is a regression towards the mean) by checking that every dataset investigated (on which the resampling is beneficial) does indeed show such a regression towards the mean.

  • Even stronger evidence for this theoretical prediction: we conduct a counterfactual experiment where we swap the training and test sets (Fig.12/13) such that the "regression to the mean" does not hold anymore. As predicted, the same selective mixup methods are now detrimental. This was a completely unsuspected failure mode of these methods that isn't explained without taking the resampling into account.

Any concern that the paper is not supported by sufficient experiments clearly needs to be better substantiated.

Review
Rating: 6

This paper examines the success of selective mixup on the out-of-distribution generalization problem and finds that the effect of mixup and the effect of the resampling induced by selective mixup can be decoupled. The authors conduct several experiments and find that the main contribution to the effectiveness of selective mixup may come from resampling.

Strengths

  1. The paper is easy to follow.
  2. Results are reported on many OOD datasets, comparing selective sampling without mixup against the selective mixup method. Figures such as Figure 7 and Figure 15 are insightful for noting the decoupling between mixup and resampling.

Weaknesses

  1. Some of the notations are not so clear. For example, in Table 1, what is the definition of "Resampling (uniform cl.) + concatenated pairs", and why does it have a different proportion of the majority class compared with "Resampling (uniform classes)"?

Questions

  1. What makes the difference in the sampling ratio between selective sampling without mixup and resampling? Is this determined by the hyperparameter of mixup? When changing the lambda of the beta distribution of mixup, do results similar to Fig. 15 hold?
  2. Regarding this finding, what about using another mixup method like manifold mixup, which may help OOD generalization by connecting samples from different domains/classes in the representation space?

Comment

Thanks for your time. These comments are very useful. We propose to clarify the following points in the paper.


Some of the notations are not so clear. For example, in Table 1, what is the definition of "Resampling (uniform cl.) + concatenated pairs", and why does it have a different proportion of the majority class compared with "Resampling (uniform classes)"?

What makes the difference in the sampling ratio between selective sampling without mixup and resampling?

This can indeed use a formal explanation:

  • Resampling (uniform cl.) + mixup uses paired instances such as $\{\textrm{mix}(x_1, \tilde{x}_1), \textrm{mix}(x_2, \tilde{x}_2), \textrm{mix}(x_3, \tilde{x}_3), \dots\}$, where the $x_i$ are sampled with uniform classes (50% proportion) and the $\tilde{x}_i$ are sampled indiscriminately (hence with the original 78% proportion). Overall we get something in between (64%).

  • Resampling (uniform cl.) + concatenated pairs uses the same instances as above, without the mixing: $\{x_1, \tilde{x}_1, x_2, \tilde{x}_2, x_3, \tilde{x}_3, \dots\}$. The proportion of labels is thus also ~64%.
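
As a quick sanity check of the 64% figure: each pair contributes one instance from each sampling stream, so the resulting majority-class proportion is the average of the two streams' proportions:

$$\tfrac{1}{2} \cdot 50\% + \tfrac{1}{2} \cdot 78\% = 64\%.$$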


When changing the lambda of the beta distribution of mixup, do results similar to Fig. 15 hold?

Yes, we performed experiments on datasets with various $\lambda$ and found very little impact.
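
For reference, the hyperparameter in question is the concentration of the Beta distribution from which the mixing coefficient is drawn. Below is a generic sketch of a vanilla mixup step (illustrative only, not the paper's implementation; the name `alpha` is an assumption, standing in for the $\lambda$ discussed above):

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Mix a batch with a random permutation of itself (vanilla mixup)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # mixing coefficient ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))         # random, non-selective pairing
    x_mix = lam * x + (1 - lam) * x[perm]  # interpolate inputs
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]  # interpolate soft labels
    return x_mix, y_mix
```

Selective mixup differs in how the second member of each pair is chosen (e.g. restricted to the same class or a different domain), which is precisely the source of the implicit resampling discussed in the paper.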


Regarding this finding, what about using another mixup method like manifold mixup, which may help OOD generalization by connecting samples from different domains/classes in the representation space?

Indeed, we follow Yao et al. and use manifold mixup (which is the only sensible option for the NLP and MIMIC tasks). We made this clearer in the text.


Does this properly address your questions? We are grateful for your contribution to the improvement of this paper!

Comment

Thanks for your reply. I think further formal explanations of the notations, like those the authors provided in the response, are needed to make the paper clearer. It would also be better for the authors to report a sensitivity analysis of the parameter $\lambda$ formally, to show that it has little impact. Besides, I also agree with the suggestion to tone down the summary point and to point out that mixup still plays an important role on some datasets. I will keep my score.

Comment

Dear reviewers, the discussion period with the authors is ending soon. Don't forget to acknowledge the authors' response, update your rating, and/or point out remaining issues accordingly. Thanks!

AC Meta-Review

After a thorough examination of the reviews, author responses, and discussions, the consensus is to reject this paper. The reviewers commend the authors for tackling the important question of understanding the mechanisms behind selective mixup's success in out-of-distribution generalization. The paper brings a valuable perspective by emphasizing the resampling effect as a significant contributor to performance improvements seen with selective mixup. This perspective is indeed an interesting addition to the literature.

However, there are several critical issues with the paper that warrant its rejection. Firstly, there is a lack of clarity and consistency in the empirical results, which, as pointed out by multiple reviewers, do not consistently support the claims made throughout the paper. The observations from the experiments are scattered, and the conclusions drawn from them appear to be overstated or not sufficiently justified.

Secondly, the theoretical contribution is deemed weak by the reviewers, as it does not fully consider the mixup of data points and is limited to label mixing. A comprehensive analysis that includes covariates is necessary to provide a complete understanding of the mixup mechanism.

Furthermore, the novelty of the paper is called into question, with some reviewers finding the insights provided to be insufficiently novel or different from existing works. The paper is perceived as not meeting the bar for novelty and does not sufficiently distance itself from prior work.

Lastly, the paper suffers from a lack of rigorous experimental setups and baselines, particularly in long-tail classification scenarios, which are relevant to the paper's topic but not explored. This is a significant limitation, as a comparison with established baselines in these scenarios could have strengthened the paper's contributions.

Given these shortcomings, and despite the authors' willingness to clarify points in the revision, the paper does not reach the level of contribution required for acceptance.

Why Not a Higher Score

The decision to reject is based on a balance of the paper's strengths and weaknesses as assessed by multiple reviewers. While the paper addresses an interesting aspect of selective mixup, the empirical results do not offer a clear narrative or consistent support for the claims made. The theoretical analysis is considered incomplete, and the novelty of the findings is not compelling compared to existing literature. Additionally, the experimental design lacks the depth and comparison with relevant baselines that could have validated the claims more robustly.

The reviewers' concerns about the clarity of results, inconsistency in experimental setups, and the paper's contributions relative to existing work are significant. Although the authors have provided responses to the reviewers' comments, the concerns raised are not fully addressed, and the paper still lacks the clarity and consistency required for a higher score.

Moreover, the paper's focus on the resampling effect, while valuable, is not sufficiently substantiated with strong empirical evidence across the board. The lack of a comprehensive theoretical foundation and the omission of relevant experimental scenarios further weaken the paper's standing.

The decision to reject reflects the committee's requirement for clear, consistent, and novel contributions, which, according to the reviewers, the paper does not meet. The reviewers' confidence in their assessments and the alignment of their critiques across multiple reviews underscore the decision to reject rather than inviting a revision for a higher score.

Why Not a Lower Score

N/A

Final Decision

Reject