Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing
We introduce Annealed Multiple Choice Learning (aMCL) which combines deterministic annealing with MCL.
Abstract
Reviews and Discussion
The paper introduces Annealed Multiple Choice Learning (aMCL), a method that integrates simulated annealing with Multiple Choice Learning (MCL), in applications where the output label may be ambiguous, and many values may be plausible given the same input.
The authors show that problems arise with the use of winner-takes-all in the optimization step of MCL, where the output space is partitioned into Voronoi cells based on the individual predictors. Instead, the authors suggest soft-weighting the predictors to replace hard WTA assignments, in a way that depends on a temperature parameter. Low values of the temperature make the weighting scheme behave close to hard WTA (sharpening the softmin distribution of weights).
Strengths
- The integration of simulated annealing and MCL is novel and effective; I also like that the objective in eq. (6) becomes fully differentiable, without any non-differentiable assignment or operators.
- Extensive theoretical analysis.
- Extensive validation over a large number of datasets, although from the results it seems that epsilon-MCL performs better overall.
Weaknesses
- The introduction of the temperature schedule as a hyperparameter complicates the use of the method, while it seems that epsilon-MCL would be both easier (one fixed hyperparameter, no schedule) and better performing overall.
- I am not sure how representative the UCI datasets are of real-world problems, but that is not a big problem.
- Minor: typo? 'broaden impact' -> 'broadeR impact'?
Questions
N/A
Limitations
- Since the authors argue that WTA is particularly sensitive to initialization, it would have been interesting to see experiments comparing the robustness of vanilla MCL vs aMCL over different initializations.
We thank the reviewer for their suggestions to improve the quality of this manuscript.
Comparison with baselines
From the results, it seems that epsilon-MCL performs better overall.
We compare epsilon-MCL (referred to as Relaxed-WTA in the following), MCL, and aMCL on the UCI datasets for two metrics: RMSE (line 239) and Distortion (1).
RMSE compares the barycenter of the predicted distribution with the target positions. For this metric, Relaxed-WTA outperforms other methods for high values of epsilon (see Table B). This is expected, since Relaxed-WTA is biased towards the distribution barycenter, especially for high values of epsilon (see Figure 1 of the paper).
However, barycenter comparison discards information concerning the spatial distribution of the hypotheses. Distortion (1) corrects this issue by measuring quantization performance, and we used it in our theoretical analysis.
For this metric, aMCL outperforms Relaxed-WTA in most cases, and especially for large datasets (Year, Protein, see Table A). Hence, aMCL strikes a balance between RMSE and Distortion (See Figure B).
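For concreteness, below is a minimal sketch of the two metrics as we read them here (RMSE against the barycenter of the hypotheses, and Distortion as the average squared distance to the closest hypothesis); the function names are illustrative assumptions, not the code used in the paper.

```python
import numpy as np

def rmse_to_barycenter(hypotheses, targets):
    # RMSE between the barycenter of the predicted hypotheses and the targets
    barycenter = hypotheses.mean(axis=0)
    return np.sqrt(np.mean(np.sum((targets - barycenter) ** 2, axis=-1)))

def distortion(hypotheses, targets):
    # Quantization error: mean squared distance from each target to its closest hypothesis
    sq_dists = np.sum((targets[:, None, :] - hypotheses[None, :, :]) ** 2, axis=-1)
    return np.mean(np.min(sq_dists, axis=1))

# Toy usage: bimodal targets; both hypothesis sets share the same barycenter
# (hence the same RMSE), but only the spread one achieves low distortion.
targets = np.array([[-1.0, 0.0], [1.0, 0.0]])
hyps_spread = np.array([[-1.0, 0.0], [1.0, 0.0]])
hyps_collapsed = np.array([[0.0, 0.0], [0.0, 0.0]])
print(rmse_to_barycenter(hyps_spread, targets), distortion(hyps_spread, targets))        # 1.0, 0.0
print(rmse_to_barycenter(hyps_collapsed, targets), distortion(hyps_collapsed, targets))  # 1.0, 1.0
```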
In order to verify this behavior on audio data, we have additionally trained Relaxed-WTA on the speech separation data. We show that aMCL still outperforms Relaxed-WTA in this more realistic setting (see Table C of the supplementary pdf).
It would have been interesting to see experiments comparing the robustness of vanilla MCL vs aMCL over different initializations.
We thank the reviewer for this suggestion. We have extended the experiments presented in Figure 8 of the Appendix by comparing PIT, MCL, aMCL and Relaxed-WTA for 3 different seeds on audio data (see Table C of the supplementary pdf).
First, we confirm that PIT, which performs perfect assignment, acts as a topline in this experiment. Moreover, the difference between PIT and aMCL is not statistically significant. Therefore, aMCL reaches the same performance as PIT, while reducing the assignment complexity with respect to the number of speakers and the number of hypotheses [A]. This complexity gap is best exploited when the number of speakers is high, similarly to [B], and this will be the object of further work.
Second, we see that aMCL performs better than MCL when we average over seeds. Moreover, aMCL has lower inter-seed variance than MCL. This validates our theoretical analysis, which suggests that aMCL guides the hypotheses towards a better local minimum than MCL, independently of the initialization.
Limitations
The introduction of the temperature schedule as hyperparameter complicates the use of the method.
We thank the reviewer for raising this issue. The temperature schedule is indeed a new degree of freedom that may require some tuning.
However, we conjecture that the theoretical analysis of aMCL convergence will lead to the characterization of optimal temperature schedules, similarly to Hajek theorem for deterministic annealing [22]. We leave this analysis for future work.
How representative are the UCI datasets of real-world problems?
The UCI datasets correspond to real-world tabular data for 1D predictions, and constitute a widely used benchmark for evaluating uncertainty quantification algorithms. We have included it in our manuscript in order to compare against the customary multi-hypothesis baselines (see Table 1 of the main paper).
The datasets are sorted by size in Tables A and B. Interestingly, aMCL obtains the best Distortion results for large datasets (Year, Protein, see Table A and Figure B), which are the most realistic. This will be emphasized in the main paper.
[A] Jack Edmonds and Richard M Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM (JACM), 19(2):248–264, 1972.
[B] Hideyuki Tachibana. Towards listening to 10 people simultaneously: An efficient permutation invariant training of audio source separation using sinkhorn’s algorithm. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 491–495. IEEE, 2021.
I read the response and found it convincing, and I appreciate the new results. I decided to raise my score.
This paper aims to tackle the local minima problem in MCL optimization. Inspired by simulated annealing, a temperature-controlled soft assignment is used, and the resulting soft cells are directly optimized in the MCL process. Some theory and 2D analysis shed light on why this approach works. This work seems well motivated and reasonable, but the experimental gains on real datasets are quite small. Additionally, the performance seems similar to the older epsilon-MCL approach, in which some gradient is used to increase the weighting on non-selected cells.
notes from reading the paper:
-Multiple choice learning handles ambiguous tasks by producing a set of hypotheses.
-Hypotheses trained using winner-take-all technique, to encourage diversity.
-This paper aims to apply simulated annealing ideas to MCL.
-Try to increase variance in the annealing process, to address MCL's issue of falling into local minima based on the initialization.
-aMCL uses softmin assignment, followed by gradient steps on the soft cells.
Strengths
-The basic theoretical analysis is nice, providing relevant bounds on the performance of the algorithm.
-The analytical results also nicely complement the theory.
Weaknesses
-The empirical results seem lackluster. Epsilon-MCL and aMCL seem to work roughly equally well on the UCI datasets. For the speech separation datasets, the improvement over MCL seems quite small, and Epsilon-MCL doesn't have results presented.
Questions
-Because it could be difficult to tune the temperature schedule over the course of training, I wonder if it could be possible to move the temperature schedule into the inference process (a bit like what's done in diffusion generative models)? I.e., we would train a few models with different temperature levels, and force the cell assignment to be consistent with the higher temperature level.
Limitations
Limitations aren't discussed in much detail in the paper.
We thank the reviewer for their insightful comments. We provide here a detailed answer to the raised concerns.
Performance of aMCL
The experimental gains on real datasets are quite small. For the speech separation datasets, the improvement over MCL seems quite small.
To compare MCL and aMCL we provide two additional experimental validations. For the UCI datasets, we experimented with an additional temperature schedule. For the speech separation data, we trained PIT, MCL, and aMCL with 3 seeds to measure sensitivity to initialization. See Tables A and C of the supplementary pdf.
On the UCI datasets, we observe that aMCL is competitive with MCL for the Distortion metric, and better for the RMSE metric. This effect is more pronounced on large datasets. This suggests that, with careful temperature schedule tuning, aMCL can outperform MCL.
On the audio data, we observe that aMCL has better inter-seed average PIT-SISDR score, and lower inter-seed variance than MCL in all settings. This is consistent with our theoretical analysis, which suggests that aMCL is less sensitive to initialization than MCL.
The performance seems similar to the older epsilon-MCL approach. Epsilon-MCL and aMCL seem to work roughly equally well on the UCI datasets. For the speech separation datasets, Epsilon-MCL doesn't have results presented.
We thank the reviewer for this suggestion. We provide two additional experiments. On the UCI datasets, we experiment with different values of epsilon for the epsilon-MCL baseline (referred to as Relaxed-WTA in the following), in Table B and Figure B. On the speech separation data, we compare Relaxed-WTA with the other approaches (Table C).
On the UCI datasets and for the RMSE metric, we observe that Relaxed-WTA outperforms all other approaches for high values of epsilon, but not for lower values. This is expected since Relaxed-WTA is biased towards the barycenter for high values of epsilon, and RMSE compares the predicted distribution barycenter with the target positions.
However, barycenter comparison discards information concerning the spatial distribution of the hypotheses. Distortion (1) corrects this issue by measuring quantization performance, and we used it in our theoretical analysis.
For this metric, we observe that aMCL outperforms Relaxed-WTA in most cases, and especially for large datasets. A higher epsilon in Relaxed-WTA comes at the cost of a higher Distortion error. Comparatively, aMCL strikes a balance between RMSE and Distortion (Figure B).
On the audio data, we observe an advantage of aMCL over Relaxed-WTA for the PIT SI-SDR metric on the 2-speaker dataset. Preliminary experiments suggest that this is also true on the 3-speaker dataset, where Relaxed-WTA reaches a lower PIT SI-SDR score than aMCL (averaged over 3 seeds).
Temperature schedule
I wonder if it could be possible to move the temperature schedule into the inference process (a bit like what's done in diffusion generative models)? I.e., we would train a few models with different temperature levels, and force the cell assignment to be consistent with the higher temperature level.
The idea of moving the temperature schedule into the inference process is interesting: with this, aMCL can be used to perform hierarchical clustering at test time.
More precisely, we can store the model's parameters at different times of the training schedule (i.e., at several temperature levels). During inference, this allows us to replay the temperature cooling for new test samples.
Replaying this trajectory has several advantages. Indeed, the hypotheses trajectory follows the rate-distortion curve and consequently explores recursively the modes of the distribution as the temperature decays. Crucially, at each critical temperature, when the hypotheses are about to split, they are exactly located at the barycenter of these modes. If we can track these splitting moments, for instance by counting the number of distinct virtual hypotheses at each step of the cooling schedule, we can perform a hierarchical clustering that iteratively uncovers the modes of the distribution.
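A rough sketch of this split-tracking idea, under the assumption that checkpoints of the hypothesis positions are stored at several temperatures and replayed at inference time (all names below are hypothetical, not the authors' implementation):

```python
import numpy as np

def count_distinct(hypotheses, tol=1e-3):
    # Number of groups of (near-)coincident hypotheses, up to tolerance `tol`
    representatives = []
    for h in hypotheses:
        if all(np.linalg.norm(h - r) > tol for r in representatives):
            representatives.append(h)
    return len(representatives)

def splitting_temperatures(checkpoints, temperatures):
    # checkpoints[i]: hypothesis positions predicted at temperatures[i],
    # with temperatures listed in decreasing order (replayed cooling schedule)
    counts = [count_distinct(c) for c in checkpoints]
    # a split occurs whenever the number of distinct hypotheses increases
    return [temperatures[i] for i in range(1, len(counts)) if counts[i] > counts[i - 1]]
```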
We appreciate the reviewer's suggestion, which presents an opportunity for further research. We hope our response addresses their question and we remain available for additional discussion on this topic.
It could be difficult to tune the temperature schedule over the course of training.
The temperature schedule is indeed a new degree of freedom that may require some tuning.
However, we conjecture that the theoretical analysis of aMCL convergence will lead to the characterization of optimal temperature schedules, similarly to Hajek theorem for deterministic annealing [22]. We leave this analysis for future work.
Limitations
Limitations aren't discussed in much detail in the paper.
We thank the reviewer for raising this issue. In addition to the "Limitations" paragraph that we had already provided at the end of the main paper, we will update Sections 3 and 5 in order to provide more insights into the specific challenges raised by aMCL. In particular, we will emphasize the choice of the temperature schedule, and the potentially longer training time required by optimal schedules.
The paper proposes to apply deterministic simulated annealing to multiple choice learning (MCL) as a means to mitigate some of the drawbacks associated with the winner-takes-all (WTA) scheme used to train MCL models, such as sensitivity to initialization and hypothesis collapse. They demonstrate that the proposed annealed MCL (aMCL) method works well in practice, on par with previous approaches, and allows for an interesting and sound theoretical analysis of the training trajectory.
Strengths
- The proposed method, aMCL, is very well motivated both theoretically and based on existing open research questions in multiple choice learning. Moreover, the mathematical development of the paper is very clear and seems sound.
- The paper is very well written and easy to follow.
Weaknesses
- The main weakness of the paper is that the experimental results, with the exception of the quite enlightening toy experiments, are somewhat underwhelming, and aMCL does not seem to outperform the baselines in most cases. I think the paper could benefit from a more in-depth discussion as to why that is the case.
- Admittedly, I am not an expert on the MCL literature and maybe guilty of hindsight bias, but the idea of applying simulated annealing to MCL seems quite intuitive and, one might argue, somewhat incremental. For instance, simply annealing epsilon in epsilon-MCL (Rupprecht et al., 2017) seems quite natural, and perhaps a good baseline for the authors to compare against. That does not take away from the solid mathematical motivation and theoretical results in the paper though.
Minor issues
- Line 201: “in no longer” should probably be “is no longer”.
- The work of Rupprecht et al. is mentioned under different names (Relaxed WTA, epsilon-WTA and epsilon-MCL), which can be a bit confusing.
- It is not entirely clear to me what the authors want to show in Figure 2.
- The robustness result in Figure 8 is quite interesting and could be expanded upon (perhaps with more random seeds) or at least mentioned in the main paper.
References
Christian Rupprecht, Iro Laina, Robert DiPietro, Maximilian Baust, Federico Tombari, Nassir Navab, and Gregory D Hager. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In Proceedings of the IEEE international conference on computer vision, pages 3591–3600, 2017.
Questions
- Do the authors compare against the score-based method of Letzelter et al.? In line 232, it is suggested that this was one of the baselines, but it was not mentioned again in the paper, unless I missed it.
- On a similar note, could the authors elaborate on the relation between aMCL and rMCL (Letzelter et al., 2023)? It seems that both have similar objectives, but the assignments in aMCL are a function of the temperature of the system and the loss function, while in rMCL the assignments are learnable. If that is correct, would it be fair to say that, given enough data and sufficient learnable parameters, we can always expect rMCL to outperform aMCL?
- It is somewhat surprising that aMCL does not outperform epsilon-MCL in many cases. Do the authors have any intuition as to why that could be? Is it because the temperature schedule is hard to tune, or aMCL would require a larger number of epochs to fully converge? Or maybe the datasets are not particularly sensitive to the bias of epsilon-MCL?
- Have the authors considered stochastic simulated annealing as well?
References
Letzelter, Victor, et al. "Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis." Advances in neural information processing systems 36 (2023).
Limitations
The limitations section is very well written and covers all possible shortcomings of the model and analysis proposed in the paper that I could think of.
We thank the reviewer for their detailed suggestions.
Experimental validation
It is surprising that aMCL does not outperform epsilon-MCL in many cases.
RMSE (line 239) compares the barycenter of the predicted distribution with the target positions. For this metric, epsilon-MCL (referred to as Relaxed-WTA in the following) outperforms other approaches on the UCI datasets when epsilon is high (as in the main paper). This is expected, as Relaxed-WTA is biased towards the distribution barycenter in this regime (see Figure 1 of the paper).
However, barycenter comparison using RMSE discards spatial distribution information of the hypotheses. Distortion (1) corrects this issue by measuring quantization performance, on which our theoretical analysis relies.
Focusing on the Distortion metric, aMCL (trained with an exponential scheduler) outperforms Relaxed-WTA both for the UCI datasets and for the audio data (see Tables A and C of the supplementary pdf). This is especially true on large datasets (Year, Protein, audio data).
aMCL does not seem to outperform the baselines in most cases.
Looking at the performance on the UCI datasets in Table A, we see that aMCL provides a good tradeoff between Distortion and RMSE compared to the baselines, especially on the largest datasets. Figure B confirms this trend with further analysis on Year and Protein, comparing aMCL, MCL, and Relaxed-WTA for several values of epsilon in the (RMSE, Distortion) space.
On the audio data, we trained all compared methods on 3 different seeds (Table C). These experiments extend the findings presented in Figure 8 of the Appendix: aMCL has a better inter-seed average score and lower inter-seed variance than MCL for the PIT SI-SDR. This suggests that aMCL is more robust to initialization than MCL, which is consistent with our theoretical analysis.
Annealing in epsilon-MCL seems a good baseline.
Using annealing in Relaxed-WTA is an interesting idea: it may reduce this method's bias toward the barycenter.
On the synthetic dataset of Figure 1, Relaxed-WTA with annealed epsilon has the following trajectory (Figure A). All hypotheses initially converge to the barycenter, then the winners gradually move towards the modes as epsilon decreases. As epsilon approaches 0, only a few additional hypotheses escape from the barycenter to reach the modes, indicating that annealing does not solve the collapse issue of Relaxed-WTA.
Results on the UCI datasets confirm this qualitative analysis (Table B and Figure B). In Figure B, aMCL outperforms the best Relaxed-WTA variants on Distortion.
Connection with the literature
Do the authors compare against the score-based method of Letzelter et al.?
In synthetic and UCI experiments, 'MCL' refers to rMCL, the score-based method of Letzelter et al. (Figure 1, Table 1, and Table 3 of the paper). For audio experiments, scoring is not relevant since all sources are active.
Could the authors elaborate on the relation between aMCL and rMCL.
rMCL uses a hard assignment based on the Winner-takes-all scheme (3) for the prediction heads. The assignment is not learned, but determined by the hypotheses' positions.
aMCL uses a soft assignment based on a temperature schedule (5) to train the prediction heads (6). With high temperatures, the assignment is uniform. As the temperature decreases, the assignment converges toward the WTA scheme.
Using a soft assignment with annealing guides the optimization process of aMCL towards a good local minimum. This is critical since optimizing the Distortion (1) is difficult by gradient descent (this task is NP-hard [1,9,42]).
Note, however, that rMCL and aMCL are identical with respect to the scoring heads, which are trained using the same loss (4).
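To make the contrast concrete, here is a minimal sketch (an assumed illustration, not our implementation) of the two assignment schemes for a single target: the hard winner-takes-all assignment versus the temperature-controlled softmin assignment, which is close to uniform at high temperature and approaches the WTA one-hot assignment as the temperature goes to zero.

```python
import numpy as np

def wta_assignment(hypotheses, target):
    # Hard assignment: the closest hypothesis gets all the weight
    sq_dists = np.sum((hypotheses - target) ** 2, axis=-1)
    weights = np.zeros(len(hypotheses))
    weights[np.argmin(sq_dists)] = 1.0
    return weights

def annealed_assignment(hypotheses, target, temperature):
    # Soft assignment: softmin of the squared distances, sharpened as T decreases
    sq_dists = np.sum((hypotheses - target) ** 2, axis=-1)
    logits = -sq_dists / temperature
    logits -= logits.max()                     # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

hyps = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
target = np.array([0.9, 1.1])
print(wta_assignment(hyps, target))                         # one-hot on the closest hypothesis
print(annealed_assignment(hyps, target, temperature=10.0))  # close to uniform
print(annealed_assignment(hyps, target, temperature=0.01))  # close to the WTA one-hot
```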
Have the authors considered stochastic simulated annealing as well?
Stochastic simulated annealing [23,46] is a promising research direction due to its strong convergence properties (see Hajek theorem [22]).
It requires defining the state of the system and the optimization objective. At each step, the state is updated to a neighbor state based on a stochastic exploration criterion. The probability of accepting a neighbor state depends on the variation of the objective and on the temperature.
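For reference, a generic sketch of the Metropolis acceptance rule underlying stochastic simulated annealing (this is standard material and not part of aMCL, which relies on deterministic annealing):

```python
import math
import random

def metropolis_accept(delta_objective, temperature):
    # Always accept improvements; accept degradations with probability
    # exp(-delta / T), which vanishes as the temperature decreases.
    if delta_objective <= 0:
        return True
    return random.random() < math.exp(-delta_objective / temperature)
```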
Our objective corresponds to the Distortion (1). However, the state of the system can be defined in various ways. In the non-conditional setting, [A] defines the state as the hypothesis positions (similarly to the present work), while [22, 23, 46] define it as the cluster assignment of each dataset sample.
In both cases, storing and updating this state using neural networks is costly. Moreover, evaluating the objective requires going through a validation set, which is time-consuming.
Further investigation in this direction is left for future research.
Additional comments
It is not entirely clear to me what the authors want to show in Figure 2.
Figure 2 illustrates the training trajectory of MCL and aMCL in the rate-distortion space. It shows that MCL has a constant rate during training, i.e., a constant number of hypotheses with distinct positions. In contrast, aMCL's rate varies during training: at high temperatures, hypotheses merge into a single cluster, then recursively subdivide as the temperature decreases.
Crucially, this Figure shows that MCL always has a higher rate than aMCL. Yet, it is known that the optimization procedure is more difficult in this regime [52]. This motivates the use of annealing and will be made clearer in the next revision.
[A] Zeger, Kenneth, Jacques Vaisey, and Allen Gersho. "Globally optimal vector quantizer design by stochastic relaxation." IEEE Transactions on Signal Processing 40.2 (1992): 310-322.
I truly appreciate the detailed answers about the method and related literature. All my questions in that regard were completely satisfied, and I encourage the authors to include these clarifications in the final version of the paper.
Unfortunately, I still think the empirical results are somewhat underwhelming, but I see now how aMCL could strike a valuable balance between distortion and RMSE. I think the authors should make this trade-off more clear, instead of simply showing Table 1 with RMSE results in the main paper. Figure B in the extra results is particularly enlightening, and I'd argue the results for the UCI dataset should also highlight this trade-off between distortion and RMSE. Looking at Tables 1 and 3, I can see aMCL results are usually in-between epsilon-MCL and MCL, but I think this should be made evident and not require comparing two tables in very different parts of the paper.
All in all, I am happy to raise my score to 7 provided these extra clarifications and experimental results are added to the paper.
We thank the reviewer for their positive feedback. In the next version of the paper, we will include clarifications about the method and the related literature, as well as the additional experimental results exhibiting a tradeoff between distortion and RMSE, which will be emphasized in the main paper.
We thank reviewers WAas [R1], WQVX [R2] and qUke [R3] for their precise and detailed comments. We summarize hereafter the main changes in the submission, in accordance with the reviewers' feedback.
- We provide additional experimental validation on the UCI benchmark [R1, R2, R3]. We demonstrate that our proposed method aMCL has superior performance in terms of Distortion error, which is the customary quantization metric, without underperforming for the RMSE metric. This is especially true for large datasets.
- We provide additional experimental validation on the speech separation benchmark [R1, R2, R3], where we further demonstrate the competitiveness of our method, in particular its robustness to initialization and its performance compared to Relaxed-WTA [55].
- We provide additional discussion of aMCL, including its connection with similar approaches introduced in the literature (rMCL [36], Relaxed-WTA [55], stochastic simulated annealing [29]), the challenge of choosing an optimal temperature schedule, and its potential extension to hierarchical clustering [R1, R2, R3].
- Following advice from the reviewers, we introduce an additional baseline, Relaxed-WTA with annealed epsilon [R1]. We analyze its behavior on a synthetic dataset, and confirm this qualitative analysis with more extensive experiments on the UCI benchmark.
We provide a supplementary pdf, and refer to its Figures and Tables in the rebuttal. In the rebuttal, we will refer to Epsilon-MCL as Relaxed-WTA.
This paper presents a modification to multiple choice learning (MCL), which is for dealing with situations where the correct output for a given input cannot be determined wholly from the information available in the input. Specifically, the authors seek to improve on existing MCL approaches by incorporating annealing into the gradient descent process for MCL, thereby exploring a wider range of the hypothesis space. The authors provide analysis of the training trajectories and show empirically that their analyses hold.
Overall, the reviewers were fairly positive about this paper. They raised some concerns related to the strength of the performance improvements and comparisons to other MCL variants, and also highlighted the need for additional benchmarks. The authors were largely able to address the reviewer concerns in rebuttal, and the final average score was above the acceptance threshold (6.33). Given these considerations, the AC decided to accept the paper as a poster.