PaperHub
7.2/10
Poster · 4 reviewers
Scores: 3, 4, 4, 4 (min 3, max 4, std dev 0.4)
ICML 2025

Winner-takes-all for Multivariate Probabilistic Time Series Forecasting

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-08-11
TL;DR

We introduce TimeMCL, leveraging the Multiple Choice Learning (MCL) paradigm to forecast multiple plausible time series futures.

Abstract

We introduce TimeMCL, a method leveraging the Multiple Choice Learning (MCL) paradigm to forecast multiple plausible time series futures. Our approach employs a neural network with multiple heads and utilizes the Winner-Takes-All (WTA) loss to promote diversity among predictions. MCL has recently gained attention due to its simplicity and ability to address ill-posed and ambiguous tasks. We propose an adaptation of this framework for time-series forecasting, presenting it as an efficient method to predict diverse futures, which we relate to its implicit quantization objective. We provide insights into our approach using synthetic data and evaluate it on real-world time series, demonstrating its promising performance at a light computational cost.
Keywords
Time-series forecasting · Quantization · Probabilistic methods · Conditional Distribution Estimation · Winner-takes-all · Diversity · Multiple Choice Learning

Reviews and Discussion

Official Review
Rating: 3

This paper addresses a time-series forecasting problem in which the model generates multiple forecasts for each timestamp. The authors propose TimeMCL, a method based on the Multiple Choice Learning (MCL) paradigm that outputs multiple plausible forecasts through multiple prediction heads and score heads. TimeMCL is trained with the Winner-Takes-All (WTA) loss together with a score-head loss; by its nature the WTA loss computes the gradient only for the head with the minimum loss value, which drives the heads toward diverse forecasts. The authors claim, with theoretical analysis, that TimeMCL can be viewed as a functional quantizer. Experimental results on multiple datasets demonstrate that the proposed method outperforms the baselines in terms of the Distortion metric while performing comparably on the standard metrics.

Questions for Authors

See above

Claims and Evidence

  • In Eq. 5, why is x_{t-1} not input to the function gamma?
  • In l.134 of the right-hand side of p.2, why is x_{t-1} not input to the function f^k_theta?
  • The authors mention that the score heads can avoid overconfident heads, but how they avoid them is not described.
  • Could we not replace the min in the WTA loss with gamma and train jointly with the WTA loss, which could be a more straightforward approach? It would be similar to Eq. 9.
  • Proposition 5.2 is analyzed only with the binary cross-entropy for the score heads. However, in TimeMCL, s is shared between the score heads and the prediction heads, and TimeMCL is trained with the compound loss including the WTA loss, which is not directly the case covered by Proposition 5.2.

Methods and Evaluation Criteria

  • Why is Distortion a fair metric? How is Distortion computed for the baselines? Could we use the score heads to choose hypotheses and then use the standard metrics?
  • The paragraph "Comparing TimeMCL with the baselines on standard metrics" describes only Tables 6 and 7 in the Appendix. This may not be appropriate with respect to the page-limit regulations.

Theoretical Claims

  • The descriptions around l.215 of the right-hand side of p.4 are not self-contained. For example, z is not defined.

Experimental Design and Analyses

Please see Methods And Evaluation Criteria.

Supplementary Material

Yes. All parts.

Relation to Broader Scientific Literature

TimeMCL, a method based on the Multiple Choice Learning (MCL) paradigm, appears novel and practically important.

Essential References Not Discussed

NA

Other Strengths and Weaknesses

Clarity issues:

  • In l.82 of the right-hand side of p.2, the subscripts may be wrong in the equation.
  • In l.87 of the right-hand side of p.2, the rightmost "..." may be unnecessary.
  • The character T is used with multiple meanings: the forecasting horizon and the annealed temperature.
  • Figures 1 and 2 appear in reversed order.

Other Comments or Suggestions

NA

Author Response

We would like to thank the reviewer for their positive feedback on the paper.

In Eq. 5, why is $x_{t-1}$ not input to the function $\gamma$?

This is because, with our notations, $\gamma^k_{\theta}$ corresponds only to the head. The full model writes as $\gamma_\theta^k \circ s_\theta$, where $s_{\theta}$ is the backbone (see Sections 3-4 and Figure A of the rebuttal PDF).

In l.134 of the right-hand side of p.2, why is $x_{t-1}$ not input to the function $f^k_\theta$?

Same as above: with our notations, $f^k_{\theta}$ corresponds only to the head. The full model writes as $f_\theta^k \circ s_\theta$, where $s_{\theta}$ is the backbone. This will be made clearer with an illustration, as suggested by Reviewer uVhD.
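
To make the composition concrete, here is a minimal sketch of the shared-backbone, multi-head design described above. The module name, layer choices, and sizes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class MultiHeadForecaster(nn.Module):
    """Minimal sketch: a shared backbone s_theta feeds K prediction heads
    f_theta^k and K scalar score heads. All names and sizes are hypothetical."""

    def __init__(self, input_dim: int, hidden_dim: int, horizon: int, n_heads: int):
        super().__init__()
        self.backbone = nn.GRU(input_dim, hidden_dim, batch_first=True)  # s_theta
        self.pred_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, horizon * input_dim) for _ in range(n_heads)]
        )  # f_theta^k: one hypothesis per head
        self.score_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(n_heads)]
        )  # predicted probability of each head

    def forward(self, context: torch.Tensor):
        # context: (batch, context_length, input_dim), i.e., x_{1:t0-1}
        _, h = self.backbone(context)   # encode the context once
        h = h[-1]                       # (batch, hidden_dim)
        preds = torch.stack([f(h) for f in self.pred_heads], dim=1)   # (B, K, horizon*D)
        scores = torch.sigmoid(
            torch.stack([g(h) for g in self.score_heads], dim=1)
        ).squeeze(-1)                   # (B, K)
        return preds, scores
```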

The authors mention that the score heads can avoid overconfident heads, but how they avoid them is not described.

Overconfidence is a known issue in Multiple Choice Learning (Rupprecht et al.; Lee et al.), where some heads associated with very low-probability zones may not be distinguishable from plausible hypotheses at inference time. Score heads address this issue by learning the probability of each head, so that such heads can be identified at inference time.

Could we not replace the min in the WTA loss with gamma and train jointly with the WTA loss, which could be a more straightforward approach? It would be similar to Eq. 9.

Does the reviewer suggest training with a loss that looks like a sum of the $L_{\theta}^{k}(x_{1:t_{0}-1},x_{t_{0}:T})$ weighted by the $K$ predicted scores? The reviewer's suggestion is very interesting: when the hypotheses are fixed, it would encourage the score associated with the lowest distance to increase (and the others to decrease). However, this would make the prediction-head loss depend on the values of the scores, which may produce different training dynamics than the current model. In the current version, the score-head objective depends on the positions of the hypotheses, but not the opposite. This would definitely be a promising direction for further work.

Proposition 5.2 is analyzed only with the binary cross-entropy for the score heads. However, in TimeMCL, s is shared between the score heads and the prediction heads, and TimeMCL is trained with the compound loss including the WTA loss, which is not directly the case covered by Proposition 5.2.

Indeed, Proposition 5.2 is analyzed only with the binary cross-entropy for the score heads. In this proposition, we implicitly assume that the prediction heads have already converged, so that learning the probability mass of each cell becomes feasible for the score heads.

Indeed, in accordance with Letzelter et al. (2024b) (Section C.1.2), we observed that the WTA training scheme leads to fast convergence of the predictions, while the scoring heads are slightly slower to train because they need the prediction heads to have already converged.

This will be made clearer in the assumptions of Proposition 5.2.
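
For readers, here is a sketch of the compound objective this discussion refers to: a WTA term on the prediction heads plus a binary cross-entropy term on the score heads, where the winner indicator carries no gradient, so that (as stated above) the score objective depends on the hypothesis positions but not the reverse. This is a hedged reading of the setup, not the authors' exact code:

```python
import torch
import torch.nn.functional as F

def compound_wta_loss(preds, scores, target, beta: float = 1.0):
    """Sketch of WTA + score-head BCE (illustrative, not the authors' code).
    preds: (B, K, D) hypotheses, scores: (B, K) head probabilities in (0, 1),
    target: (B, D) realized future."""
    per_head = ((preds - target.unsqueeze(1)) ** 2).mean(dim=-1)  # (B, K)
    winner = per_head.argmin(dim=1)                               # best head per sample
    wta = per_head.gather(1, winner.unsqueeze(1)).mean()          # gradient flows to winners only

    # BCE target: indicator of the winning head (an integer index, hence no
    # gradient), so the score loss depends on hypothesis positions only.
    winner_onehot = F.one_hot(winner, num_classes=preds.shape[1]).float()
    score_loss = F.binary_cross_entropy(scores, winner_onehot)
    return wta + beta * score_loss
```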

Why is the Distortion a fair metric?

The Distortion is known from the quantization literature (Pagès, 2015) as a way to assess the quality of a set of $K$ samples $z_k$, $k = 1, \ldots, K$, for quantizing a target distribution $p$, with:

$$D_2 := \int_{\mathcal{X}} \min_{k=1, \ldots, K} \left\| z_k - x \right\|_{2}^2 \, \mathrm{d}p(x).$$

In our setup, the distortion we consider is the generalization of the above in which the samples $z_k$ are functions of the context. In the context of time series, it writes as

$$\int_{\mathcal{X}^{T}} \min_{k=1, \ldots, K} \left\| z_k\left(x_{1: t_0-1}\right) - x_{t_0: T} \right\|^2 \, \mathrm{d}p(x_{1: t_0-1}, x_{t_0: T}) \simeq \frac{1}{N} \sum_{i} \min_{k=1, \ldots, K} \left\| z_k\left(x^i_{1: t_0-1}\right) - x^i_{t_0: T} \right\|^2,$$

where $(x^i_{1: t_0-1}, x^i_{t_0: T})$ are samples from $p(x_{1: t_0-1}, x_{t_0: T})$.

It implicitly assesses how well the predicted samples cover a target distribution with a given set of samples. The Distortion is a fair metric provided that the same number of hypotheses is used for each baseline, since the distortion is expected to improve with $K$, as per the rate-distortion curve (Gray, 1989) in the optimal case.
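
For concreteness, the Monte-Carlo estimate on the right-hand side above can be computed in a few lines. This is a minimal NumPy sketch; the array layout is an assumption:

```python
import numpy as np

def empirical_distortion(hypotheses: np.ndarray, targets: np.ndarray) -> float:
    """Monte-Carlo estimate of the conditional distortion above.
    hypotheses: (N, K, D) — K forecasts z_k(x^i_{1:t0-1}) per context i,
    targets:    (N, D)    — realized (flattened) futures x^i_{t0:T}."""
    sq_dists = ((hypotheses - targets[:, None, :]) ** 2).sum(axis=-1)  # (N, K)
    return float(sq_dists.min(axis=1).mean())  # best head per context, averaged

# toy check: 100 contexts, K = 4 hypotheses, a 24-step flattened horizon
rng = np.random.default_rng(0)
print(empirical_distortion(rng.normal(size=(100, 4, 24)), rng.normal(size=(100, 24))))
```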

Gray, Robert M. Source coding theory. Vol. 83. Springer Science & Business Media, 1989.

Official Review
Rating: 4

The paper presents a new method, TimeMCL, for time series forecasting. The proposed method uses Multiple Choice Learning with the Winner-Takes-All (WTA) loss to forecast multiple plausible time series futures. The paper uses synthetic data to show that TimeMCL is a functional quantizer. The proposed TimeMCL is compared with two diffusion methods on standard datasets.

Questions for Authors

  1. How will the proposed approach perform in other domains, e.g., the stock market?
  2. How do graph transformer methods, e.g., STGNN, compare with diffusion methods in general and with TimeMCL in particular?

Claims and Evidence

  1. The main claim that TimeMCL is a functional quantizer is supported by mathematical proofs and the use of synthetic data.
  2. The forecasting ability of TimeMCL is supported by comparisons with two diffusion methods.

Methods and Evaluation Criteria

The proposed method is evaluated on standard datasets. The comparisons are done with two SOTA diffusion methods.

There should be more comparisons with other SOTA methods.

Theoretical Claims

The paper presents proofs and uses synthetic data to support TimeMCL as a functional quantizer. I did not check the correctness of the mathematics and proofs.

Experimental Design and Analyses

The datasets used are the ones usually used in SOTA time series forecasting methods. Two SOTA diffusion methods are used for comparison. The results would be better validated if compared with more SOTA methods.

Supplementary Material

I have reviewed all the appendices, although I was not able to check the equations and provided proofs.

Relation to Broader Scientific Literature

The paper proposes a WTA approach for time series forecasting; I believe it does not relate to the broader scientific literature beyond this.

Essential References Not Discussed

The works referenced in the paper are adequate, although I believe other SOTA works should be discussed, including graph deep learning methods, e.g., STGNN.

Other Strengths and Weaknesses

Strengths:

  1. Use of synthetic data
  2. Mathematical equations and proofs
  3. The use of multiple performance criteria

Weaknesses:

  1. Need to consider datasets from more domains, e.g., the stock market
  2. Need to compare with other SOTA time series forecasting methods
  3. In comparing with TimeGrad and DeepAR, there is an assumption that these are the two best methods for time series forecasting.

Other Comments or Suggestions

  1. Please add a figure that shows the network architecture of the proposed method and a figure that visually describes the proposed technique.
  2. Figures and tables should be closer to their text descriptions; e.g., Figure 1 is on a different page from its description, and the same is true of Tables 2 and 3.

Ethics Review Issues

NA

Author Response

We thank the reviewer for their relevant remarks. The rebuttal pdf is attached to the response. Figure A of the pdf will be included in the paper.

Comparison with more SOTA methods

The results would be better validated if compared with more SOTA methods.

We conducted this benchmark using DeepAR and TimeGrad to demonstrate that, with consistent settings across baselines (e.g., backbone, data scaler, training details), our approach offers competitive distortion at a low computational cost. Since we used an RNN backbone, we felt comparing it with other architectures, such as transformers, would complicate conclusions. However, we acknowledge the reviewer’s point that additional baselines could strengthen our work.

To address the reviewer's comment and enhance our evaluation, we added additional models. Specifically, we included Tactis-2 (Ashok et al., 2024), a transformer-based model built on non-parametric copulas, and TempFlow (Rasul et al., 2021), which uses conditioned normalizing flows (with both RNN and transformer backbones). For completeness, and as suggested by Reviewer 2HFL, we also included exponential smoothing (ETS) as a simple baseline without neural networks.

Results analysis

Distortion Comparison (Table A).

  • TimeMCL outperforms TempFlow, both when using the same RNN backbone and when TempFlow is based on a Transformer.

  • Tactis proves to be a strong competitor in terms of Distortion, though at a significantly higher computational cost (see Table H).

  • We conducted an ablation study on the number of hypotheses (Table E). We observed that TimeMCL consistently achieves the best performance (except when using only one hypothesis).

Inference run-time (Table H)

  • Among neural methods, TimeMCL and DeepAR demonstrate the best trade-off between speed and performance.

  • ETS achieves the fastest inference but exhibits weaker performance on other metrics.

Smoothness Analysis (Table B)

  • TimeMCL achieves the best smoothness scores, as measured by Total Variation (a simple estimator is sketched after this list).

  • This supports our theoretical claim from Section 5.2, which predicts that TimeMCL generates smoother trajectories.
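
A minimal sketch of the Total Variation smoothness proxy referenced above; this is one plausible reading of the metric, and the exact averaging used in Table B may differ:

```python
import numpy as np

def total_variation(trajectories: np.ndarray) -> float:
    """Mean total variation of sampled forecast trajectories, shape (..., T):
    sum of absolute first differences along time, averaged over the rest.
    Lower values correspond to smoother trajectories."""
    return float(np.abs(np.diff(trajectories, axis=-1)).sum(axis=-1).mean())
```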

Additional metrics (Tables C & D)

  • TimeMCL does not significantly improve RMSE, CRPS, or the Energy Score (Table G), as expected, since it does not directly optimize these metrics.

We plan to extend this comparison by implementing TimeMCL with a transformer-based architecture as the backbone, adhering to Tactis's training details for a more accurate performance comparison.

How do graph transformer methods, e.g., STGNN, compare with diffusion methods in general and with TimeMCL in particular?

Regarding spatio-temporal graph transformer methods, does the reviewer have a specific method in mind? Most STGNN methods we found are tailored to specific tasks (e.g., traffic prediction in Luo et al., 2023). However, we did identify StemGNN (Cao et al., 2020), which integrates graph and attention mechanisms and is evaluated on similar data.

However, comparing our approach with StemGNN is difficult, as it is non-probabilistic and generates only one prediction per input. As future work, exploring a graph transformer-based approach in place of our RNN backbone could be promising.

Need for datasets from more domains, e.g., stock markets

How will the proposed approach perform in other domains, e.g., the stock market?

We thank the reviewer for their advice. Accordingly, we performed experiments on a dataset of correlated financial time series, consisting of 15 correlated cryptocurrencies (see Tables J and K of the rebuttal PDF).

On this dataset, TimeMCL was trained with the annealed Winner-Takes-All loss and compared with the previous baselines. The results, presented in Table I, further demonstrate the competitiveness of the method in terms of Distortion and Smoothness, with good CRPS as well.

We also provide a visualization of the predictions along with the baselines in Figure B, which is akin to Figure 2 of the main paper.

Missing references

I believe other SOTA works should be discussed, including graph deep learning methods, e.g., STGNN.

To make our benchmarks more comprehensive, we have included Tactis-2, TempFlow (with its transformer-based variant), and the non-neural exponential smoothing (ETS) method (Hyndman et al., 2008). Additionally, STGNN will be referenced in the paper.

Luo, X., Zhu, C., Zhang, D., & Li, Q. (2023). STG4Traffic: A survey and benchmark of spatial-temporal graph neural networks for traffic prediction.

Cao, D., Wang, Y., Duan, J., Zhang, C., Zhu, X., Huang, C., Tong, Y., et al. "Spectral Temporal Graph Neural Network for Multivariate Time-Series Forecasting." In NeurIPS 2020.

Rasul, K., Sheikh, A.-S., Schuster, I., Bergmann, U. M., & Vollgraf, R. "Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows." In ICLR 2021.

Reviewer Comment

After reading the authors' rebuttal and the other reviews, I am updating my score to accept.

Official Review
Rating: 4

This work introduces TimeMCL, a time series forecasting model that aims to project plausible future scenarios and their associated probabilities to better forecast multimodal distributions. The model learns multiple heads, as well as scores associated with each head to estimate the probability of a given head being correct. Training these heads with the vanilla Winner-takes-all (WTA) loss can result in under-trained heads for unimodal distributions, so the authors explore relaxations that enable more distributed gradients. The authors provide a theoretical analysis of TimeMCL, demonstrating that their combined architecture and training scheme leads to a Voronoi tessellation of future trajectories, which they show on synthetic datasets. They also conduct experiments on six typical TS datasets from GluonTS with respect to Distortion, FLOPs, RMSE, and CRPS.

The main claimed contributions are:

  • the TimeMCL approach that takes a backbone and trains multiple heads for it using (two variations of relaxed) Winner-Takes-All loss (with scoring heads)
  • the theoretical analysis of TimeMCL as a functional quantizer
  • the evaluation of TimeMCL with an RNN backbone on synthetic and real-world benchmarks

Questions for Authors

covariates $c_{1:T}$, the latter being omitted in the following for conciseness

can the model accept covariates?


To compute TimeMCL metrics while respecting hypothesis probabilities, we resample with replacement from the K hypotheses obtained in a single forward pass, weighting them by their assigned probabilities before computing metrics.

Can you explain this in more detail, please? I'm not sure I follow; I figured that the score heads provided the probabilities.


Did you look at methods that might enable a variable number of heads? It seems the optimal tessellation is conditional on the number of heads, so you can't really "extend" the number of heads.

Claims and Evidence

TimeMCL forecasts diverse possible futures & provides smooth forecasts

  • This is visualized in Figures 2-4. It would be nice to also provide some quantitative measure of diversity here, e.g., Fréchet distance or even just dispersion averaged across series and time points. The same goes for smoothness.

TimeMCL is a stationary conditional functional quantizer

  • This seems to be true under the proposed assumptions, but it's unclear what value this provides, as the underlying clustering problem is NP-hard and you're dealing with very high-dimensional data.

TimeMCL is compared against SOTA probabilistic forecasters

Methods and Evaluation Criteria

The datasets are typical for ML forecasting, although it is difficult to know how much multivariate correlation is present in them. Therefore, performance on synthetic multivariate tasks would be insightful, e.g., correlated Brownian motion and/or VAR processes.
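
For concreteness, the synthetic multivariate tasks suggested here can be generated in a few lines. This is a sketch; the correlation structure and VAR coefficients are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 5
corr = 0.8 * np.ones((d, d)) + 0.2 * np.eye(d)   # toy positive-definite covariance

# Correlated Brownian motion: cumulative sums of correlated Gaussian steps.
brownian = rng.multivariate_normal(np.zeros(d), corr, size=T).cumsum(axis=0)

# VAR(1) process: x_t = A x_{t-1} + eps_t, with A scaled to keep it stable.
A = 0.5 * np.eye(d) + (0.1 / np.sqrt(d)) * rng.standard_normal((d, d))
var = np.zeros((T, d))
for t in range(1, T):
    var[t] = A @ var[t - 1] + rng.multivariate_normal(np.zeros(d), corr)
```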

Theoretical Claims

  • I reviewed the proof of Proposition 5.1 briefly. I am mostly unfamiliar with the clustering literature, but from what I gather, k-means is NP-hard, so I'm not sure what this formulation actually brings given the extremely high dimensionality of multivariate trajectory data. Maybe in the limit of expressivity and time this algorithm converges to the optimal Voronoi tessellation due to the two-step formulation, but it would help to analyze the rate of convergence theoretically and experimentally.

  • It's unclear why the architectures for Sections 5 and 6 differ.

  • I did not review the proof of Proposition 5.2.

Experimental Design and Analyses

  • A training/inference runtime analysis would be nice in addition to FLOPs.

  • It would be good to add other relevant baselines, e.g., TACTiS and simple baselines.

  • You could also assess the models using actual multivariate metrics, e.g., energy score, variogram, etc. (see https://arxiv.org/abs/2407.00650); a simple energy-score estimator is sketched below.
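
For reference, a common sample-based estimator of the energy score mentioned here is sketched below. The estimator form is an assumption on our part; evaluation toolkits may implement their own variant:

```python
import numpy as np

def energy_score(samples: np.ndarray, obs: np.ndarray) -> float:
    """ES(P, y) ≈ mean ||X - y|| - 0.5 * mean ||X - X'||, estimated from
    m forecast samples of shape (m, D) and one observation of shape (D,)."""
    m = samples.shape[0]
    term1 = np.linalg.norm(samples - obs, axis=-1).mean()
    pair_dists = np.linalg.norm(samples[:, None, :] - samples[None, :, :], axis=-1)
    term2 = pair_dists.sum() / (m * (m - 1))  # exclude the zero diagonal
    return float(term1 - 0.5 * term2)
```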

Supplementary Material

Did not review, other than the proof of Proposition 5.1.

Relation to Broader Scientific Literature

This paper relates to a recent line of work on time series forecasting using machine learning methods. Specifically, the paper belongs to two niches: multivariate forecasting and probabilistic forecasting. Cited methods, including DeepAR, TimeGrad, and TACTiS, figure among this line of work as well.

Essential References Not Discussed

Other Strengths and Weaknesses

  • Extremely clear and well-written paper

  • The main points that can move my score are:

    • experiments with multivariate synthetic tasks
    • multivariate metrics (energy score, variogram)
    • comparison to TACTiS-2

Other Comments or Suggestions

Typos:

  • paragraph at line 84 has some weird verb conjugation, e.g. "one have"
  • missing period line 100-101 col 2 before "WTA".
  • the notation, and especially the overload of x, feels a bit clumsy, e.g. on line 112-113 you switch to the superscript for the hypotheses, which took a minute to notice since the subscript is omitted (I imagine to avoid it being too heavy). You can probably just use \mathbf{x} for a vector to prevent the cognitive break during reading.
  • line 120 col 1: you might want to reiterate that these are the "final projection" heads here instead of just "hidden state representation"
  • Tables 6 and 7 belong in the main text, as they are key evaluation results with more established metrics than distortion.
  • Line 167 col 2: "finite over if"
  • Line 171 col 2: the first clause of this paragraph is incomprehensible.
Author Response

We thank the reviewer for their feedback. The rebuttal pdf is attached to the response.

Clarification of the theory

TimeMCL is a stationary conditional functional quantizer [...], but it's unclear what value this provides as the underlying clustering problem is NP-hard, and you're dealing with very high-dimensional data.

The reviewer raises a valid point. However, as shown in Figure 1, our toy example qualitatively demonstrates that our training scheme closely aligns with the target conditional quantizer in practice, even when forecasting across 250 time steps (see also Appendix B for details).

Maybe [...] this algorithm converges to the optimal Voronoi tessellation [...], but it would help to analyze the rate of convergence theoretically and experimentally.

Theorem 2 of Loubes & Pelletier (2017) provides an asymptotic upper bound on the distortion error with respect to the number of training pairs in quantizer learning. Extending this result to neural networks trained with WTA Loss is a promising direction for future work.

Additional baselines and metrics

A runtime analysis would be nice [...]. It would be good to add in other relevant baselines, e.g. TACTiS and simple baselines. [...] You could also assess the models using actual multivariate metrics, e.g. energy score [...]. It would be nice to also provide [...] a measure of diversity [...] and smoothness.

In response to the reviewer, we added experiments with a simple baseline, ETS exponential smoothing (Hyndman et al., 2008), which does not rely on neural networks. We also included TempFlow (Rasul et al., 2021), a normalizing flow-based method using both RNN and Transformer backbones, as well as Tactis-2 (Ashok et al., 2024), a copula method with a Transformer backbone.

These methods were evaluated across the same six datasets. In addition to the metrics from the original paper (Distortion, RMSE, and CRPS-Sum), we followed the reviewers' suggestions and included Inference runtime, Smoothness, and Energy Score.

Results analysis

Distortion Comparison (Table A,E,F).

  • TimeMCL outperforms TempFlow, and Tactis proves to be a strong competitor in terms of Distortion, though it is slower at inference (see Table H).

  • An ablation study on the number of hypotheses (Table E) shows that TimeMCL consistently achieves the best performance (except with $K=1$).

Inference run-time (Table H)

  • Among neural network-based methods, TimeMCL and DeepAR demonstrate the best trade-off between speed and performance.

  • ETS achieves the fastest inference, with weaker performance otherwise.

Smoothness Analysis (Table B)

  • TimeMCL achieves the best smoothness scores, as measured by Total Variation (averaged over predictions).

  • This supports our theoretical claim from Section 5.2, which predicts that TimeMCL generates smoother trajectories.

Additional metrics (Tables C & D)

  • TimeMCL does not significantly improve RMSE, CRPS, or the Energy Score (Table G), since it does not directly optimize these metrics.

We did not conduct further experiments with Diversity, as we believe Distortion implicitly captures it.

When comparing TimeMCL (using an RNN backbone) with methods using transformer backbones, it is hard to tell whether performance improvements are due to the training method or the backbone. To clarify this, we plan to implement TimeMCL with a transformer backbone, following Tactis's training details.

Additional datasets

It is difficult to know how much multivariate correlation is present in these datasets. Therefore, performance on synthetic multivariate tasks would be insightful

In response, we conducted experiments on a new dataset of correlated cryptocurrency time series (see Table J), with the correlation matrix given in Table K.

TimeMCL was trained with the aMCL loss and compared to the previous baselines. Results in Table I show that our method remains competitive, excelling in both Distortion and Smoothness, with strong CRPS performance. Visualizations are provided in Figure B.

Additional questions

Can the model accept covariates?

The model can accept covariates. In previous implementations these typically serve as additional concatenated input features.

To compute TimeMCL metrics [...], we resample with replacement from the K hypotheses obtained [...]. Can you explain this in more detail, please?

Our implementation extends GluonTS (Alexandrov et al., 2020) with minimal code changes. Instead of modifying evaluation functions, we resample from the K hypotheses, weighted by their probabilities, allowing us to use existing evaluation functions (e.g., CRPS) without rewriting metrics for TimeMCL.
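
A sketch of this resampling trick follows; the function name and array shapes are illustrative, not the authors' code:

```python
import numpy as np

def resample_hypotheses(hypotheses: np.ndarray, probs: np.ndarray,
                        n_samples: int = 100, seed: int = 0) -> np.ndarray:
    """Convert K score-weighted hypotheses (K, ...) into an unweighted sample
    set, so off-the-shelf sample-based evaluators (e.g., CRPS in GluonTS) can
    be reused without modification. probs: (K,) scores summing to 1."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(hypotheses), size=n_samples, replace=True, p=probs)
    return hypotheses[idx]
```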

Did you look at methods that might enable a variable number of heads?

Indeed, the number of predictions must be predefined beforehand. Exploring dynamic "rearrangements" of hypotheses when adding new ones, without full retraining, is left for future work.

Official Review
Rating: 4

This paper proposes the idea of generating a diverse set of forecast trajectories instead of a single trajectory, as is typically done. Prior approaches involved sampling from the output distribution, such as in TimeGrad using a diffusion process, or sampling from other models that generate a distribution. However, such methods do not necessarily produce diverse outputs covering the whole space of outputs.

This paper proposes using the winner-takes-all (WTA) idea to learn a tessellation of the output space (similar to k-means) using multiple output heads, and to use it to generate several representative outputs. Experimental results show superior performance metrics compared to other comparable baselines. While the proposed method does not perform well in terms of CRPS or RMSE, this is expected, as the method does not aim at producing mean or median forecasts, which minimize traditional losses.

Questions for Authors

It seems that the proposed method is akin to the k-means algorithm. K-means is quite sensitive to initialization. Can you comment on the sensitivity of WTA to the initialization of the hypotheses?

It is possibly the case that the loss has some property (convexity or similar) such that any initialization would lead to the hypotheses shifting appropriately and aligning with the actual distribution during training.

What are some important applications for this method in real-life scenarios? In what settings would someone want to generate forecasts from multiple hypotheses? One example that comes to mind is stock market prediction, where it would be important to look at a diverse set of forecasts to understand any extreme predictions. Another is product demand forecasting, where a retail company may want to prepare for different scenarios.

Claims and Evidence

All methodological claims are supported by experimental results and analysis showing that they work as intended. Theoretical claims are also evidenced with proofs in the appendix.

Methods and Evaluation Criteria

The proposed approach is suitable for the problem studied. WTA has been shown to work in other domains such as vision, and this paper shows that it works for time series forecasting as well. The paper uses the standard benchmark datasets for evaluation, and the evaluations are extensive and sufficient.

Theoretical Claims

Theoretical proofs in the supplementary material were not checked.

Experimental Design and Analyses

The experimental setup is sound:

  • Using synthetic time series to test the correctness of the approach is reasonable and experimental results show that the method is working as intended.
  • The results are evaluated using distortion - a reasonable metric to test the accuracy of a diverse set of outputs.

Supplementary Material

Reviewed the additional results in the appendix.

  • additional experiment metrics (RMSE, CRPS) and visualizations.
  • details of the experimental setup.
  • details of the synthetic models.

Relation to Broader Scientific Literature

The idea is significant and highly relevant to the time series forecasting literature. Generating a diverse set of predictions is a relevant problem in machine learning in general (especially with respect to generative models), and extending such ideas to forecasting is highly relevant to the time series community.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

This paper presents a novel approach to modeling multi-modal outputs for time series forecasting. As such, I advocate for the acceptance of this paper.

However, the motivation behind the applicability of the paper to real life forecasting problems is unclear.

Other Comments or Suggestions

N/A

Author Response

We are grateful to the reviewer for their positive feedback and insightful comments.

Sensitivity to initialization

K-means is quite sensitive to initialization. Can you comment on the sensitivity of WTA to the initialization of the hypotheses?

As MCL can be seen as a conditional, gradient-based variant of k-means, it inherits some of its limitations, in particular its sensitivity to initialization. This is also related to the known collapse issue (Rupprecht et al.) in the MCL literature, where some of the hypotheses may never be chosen, leading to suboptimal Distortion performance. This is why we decided to leverage WTA variants (aMCL, Relaxed-WTA) for TimeMCL, which make the algorithm more robust to these issues.

Note also that previous works (Letzelter et al., 2023; Shekarforoush et al., 2024) have observed that this collapse issue can, in some settings, be naturally resolved by the randomness of the data distribution. As mentioned in the Limitations paragraph of the submission, further work will include enhanced normalization techniques to further improve the quality of the optimum of the vanilla TimeMCL.
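
For context, the variants mentioned soften the hard argmin of vanilla WTA. Below is a sketch of the assignment weights, following a common reading of the Relaxed-WTA (epsilon-relaxation) and annealed-MCL (temperature-softened assignment) ideas; it is illustrative, not the authors' exact scheme:

```python
from typing import Optional
import torch

def wta_assignment_weights(per_head_losses: torch.Tensor,
                           eps: float = 0.05,
                           temperature: Optional[float] = None) -> torch.Tensor:
    """per_head_losses: (B, K). With a temperature, return an annealed soft
    assignment (softmax over -loss / T; T -> 0 recovers hard WTA). Otherwise
    return the epsilon-relaxation: the winner gets 1 - eps and the K - 1
    other heads share eps, so every head receives some gradient."""
    B, K = per_head_losses.shape
    if temperature is not None:
        return torch.softmax(-per_head_losses / temperature, dim=1)
    weights = torch.full((B, K), eps / (K - 1))
    winner = per_head_losses.argmin(dim=1, keepdim=True)
    weights.scatter_(1, winner, 1.0 - eps)
    return weights

# Training would then minimize (weights.detach() * per_head_losses).sum(dim=1).mean().
```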

Motivation clarification

The motivation behind the applicability of the paper to real life forecasting problems is unclear. What are some important applications for this method in real life scenarios? In what settings would someone want to generate forecasts from multiple hypothesis?

Indeed, compared to generative models, we believe TimeMCL has the ability to capture rare events or "modes" in the conditional distribution. This is illustrated, for instance, in Figure 2 (middle) on the Solar dataset, where one of the hypotheses captures a rare event. This can also be useful for stock market prediction, e.g., for capturing trend reversals. For clarification, we included a use case with financial data in the rebuttal PDF (see, e.g., Figure B), for which details are provided in the answers to Reviewer 2HFL and Reviewer uVhD.

Letzelter, Victor, Mathieu Fontaine, Mickaël Chen, Patrick Pérez, Slim Essid, and Gaël Richard. "Resilient Multiple Choice Learning: A learned scoring scheme with application to audio scene analysis." In NeurIPS, 2023

Shekarforoush, Shayan, David Lindell, Marcus A. Brubaker, and David J. Fleet. "CryoSPIN: Improving Ab-Initio Cryo-EM Reconstruction with Semi-Amortized Pose Inference." In NeurIPS, 2024

Final Decision

This paper introduces TimeMCL, a novel method for multivariate probabilistic time series forecasting that uses a multiple-choice learning approach with a winner-takes-all loss to generate diverse future predictions. Reviews highlight its ability to produce a range of plausible forecasts and its theoretical grounding as a functional quantizer. While it shows strong performance on a custom distortion metric designed for diverse outputs, it doesn't necessarily excel on traditional metrics like RMSE and CRPS, as it prioritizes diversity over mean accuracy.

Pros include its innovative approach to generating diverse forecasts, supported by theoretical analysis and promising results on a dedicated metric. Reviewers also noted the clarity of the writing. Cons involve weaker performance on standard forecasting metrics, the need for more comparisons with a broader range of state-of-the-art methods, and the need for a clearer motivation of real-world applicability. Some theoretical aspects and experimental details also required clarification.

The authors addressed concerns by providing additional experiments with more baselines and on a new financial dataset, including multivariate evaluation metrics, and clarified theoretical points and the model's architecture. They also discussed potential real-world applications and acknowledged limitations like the fixed number of prediction heads.

Considering the novelty of the approach, the theoretical backing, and the authors' efforts to address the reviewers' concerns with additional experiments and clarifications, the unanimous consensus is for acceptance. The method offers a different perspective on time series forecasting by focusing on capturing a spectrum of possibilities, which could be valuable in specific applications despite not optimizing traditional error metrics.