PaperHub
Overall rating: 7.0/10 (Poster; 3 reviewers; min 5, max 8, std 1.4)
Individual ratings: 5, 8, 8
Confidence: 2.0 · Correctness: 2.7 · Contribution: 2.7 · Presentation: 2.7
ICLR 2025

Optimal Transport for Time Series Imputation

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-04-02

Abstract

Keywords
Time series · Imputation

Reviews and Discussion

Review (Rating: 5)

The paper proposes an optimal transport (OT) based time-series imputation method. The authors claim that a naive application of OT does not work for time-series data. The proposed method considers applying OT in the frequency domain of the original data, via the so-called pairwise spectrum distance (PSD). Further, to deal with multiple modes, a proximal spectral Wasserstein (PSW) distance is also proposed, in which the mass constraint is removed to make transportation more flexible.

Strengths

  • Using OT to impute time-series data is an interesting approach.

  • Empirical evaluation shows high performance.

Weaknesses

  • Some technical justification is vague. Clearer descriptions would be desired.

  • Introduction is a bit too abstract about the proposed method. It describes what problem is solved in the paper, but does not describe the basic technical idea of how it is achieved.

Questions

  • In 'Contributions', the authors mention that PSW-I eliminates the need for masking, but PSW-I seems to use masking (e.g., as shown in Fig. 3). What does this description mean?

  • What does Lemma 3.2 indicate? How do you know it 'deviates more from the typical elements of \beta'? Further, how do you know the PSW discrepancy avoids the problem indicated by this lemma? In Theorem C.2, the perturbation in the PSW discrepancy is shown. Is it possible to compare this with Lemma 3.2? Even if it is possible, what does it mean? Is it clear how a large (or small) perturbation caused by outliers ultimately affects the final imputation?

  • Although D_KL is used in (2), T 1_m and T^T 1_n are not probability distributions (because the normalization (sum equals 1) constraint is removed). How are they normalized?

  • What is the definition of the 'DFT matrix' in the gradient? How does the gradient of (2) become $\nabla_{\alpha_i} P$ and $\nabla_{\beta_j} P$? Does the second term in (2) disappear? Is the optimality condition w.r.t. T considered in this gradient calculation?

  • After (1), D is the squared distance, while in Definition 3.1 D is the distance without the square. Which is correct?

Minor issues:

  • At the end of p. 3: should Fig. 4 be Fig. 1? In Fig. 1(a), is the left W^(F)?

  • The first word of Sec3.2: 'time-series' -> 'Time-series'

Comment

[Q5] After (1), D is the squared distance, while in Definition 3.1 D is the distance without the square. Which is correct?

Response. We sincerely apologize for this typo and appreciate the chance to correct it. Given $\alpha_i$ and $\beta_j$, the correct definition of $\mathbf{D}$, as stated in Definition 3.1, is the standard (non-squared) Euclidean distance: $\mathbf{D}_{i,j} = \|\alpha_i - \beta_j\|$. Additionally, $\mathbf{D}_{i,j}^{(F)} = \|\alpha_i^{(F)} - \beta_j^{(F)}\|_1$, where the absolute error enhances robustness against large deviations, accommodating the characteristics of spectral data with energy concentrated in specific frequency bins. (A code sketch of the two distance matrices follows the action item below.)

  • Action. We have revised the manuscript to rectify the definitions of $\mathbf{D}$ and $\mathbf{D}^{(F)}$.
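To make the corrected definitions concrete, below is a minimal NumPy sketch of both distance matrices. This is our illustration rather than the authors' code; the patch shapes and the reading of the complex 1-norm as a sum of moduli are assumptions.

```python
import numpy as np

def distance_matrices(alpha, beta):
    # alpha: (n, T) patches, beta: (m, T) patches (illustrative shapes).
    diff = alpha[:, None, :] - beta[None, :, :]      # (n, m, T)
    D = np.linalg.norm(diff, axis=-1)                # D_{i,j} = ||alpha_i - beta_j||
    spec_a = np.fft.fft(alpha, axis=-1)[:, None, :]
    spec_b = np.fft.fft(beta, axis=-1)[None, :, :]
    # D^(F)_{i,j} = ||alpha_i^(F) - beta_j^(F)||_1, summing complex moduli
    D_F = np.abs(spec_a - spec_b).sum(axis=-1)
    return D, D_F
```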

[Minors] At the end of p. 3: should Fig. 4 be Fig. 1? In Fig. 1(a), is the left W^(F)? The first word of Sec. 3.2: 'time-series' -> 'Time-series'

Response. We appreciate the reviewer’s attention to these details. We have corrected the reference from Fig. 4 to Fig. 1 at the end of page 3. In Fig. 1(a), we have clarified that the left component represents $\mathcal{W}^{(F)}$ by updating the figure legend accordingly. We have capitalized “Time-series” at the beginning of Section 3.2. Big thanks for your meticulous reading!


Reference

[1] Rethinking the Diffusion Models for Missing Data Imputation: A Gradient Flow Perspective, NeurIPS 2024.

Comment

[Q4] What is the definition of the 'DFT matrix' in the gradient? How does the gradient of (2) become $\nabla_{\alpha_i} P$ and $\nabla_{\beta_j} P$? Does the second term in (2) disappear? Is the optimality condition w.r.t. T considered in this gradient calculation?

Response. Thank you very much for your meticulous and valuable comment. Below we answer this question by (1) clarifying the definition of the DFT matrix and (2) delineating the derivation of $\nabla_{\alpha_i} \mathcal{P}$ and $\nabla_{\beta_j} \mathcal{P}$.

  • To clarify the definition of the DFT matrix, we recall that for a sequence $\mathbf{x} = [\mathbf{x}_0, \dots, \mathbf{x}_{T-1}]$, the DFT is calculated as $\mathbf{x}^{(F)}_k = \sum_{t=0}^{T-1} \mathbf{x}_t e^{-j 2\pi k t / T} = [\mathbf{x}_0, \mathbf{x}_1, \dots, \mathbf{x}_{T-1}] \cdot [e^{-j 2\pi k \cdot 0 / T}, e^{-j 2\pi k \cdot 1 / T}, \dots, e^{-j 2\pi k (T-1) / T}]^\top$, where $\mathbf{x}^{(F)}_k$ is the $k$-th element of the spectrum $\mathbf{x}^{(F)}$. On this basis, denote $\mathbf{W}^{(F)} = [\mathbf{W}^{(F)}_{t,k}]_{t,k=0}^{T-1}$, where $\mathbf{W}^{(F)}_{t,k} = e^{-j 2\pi k t / T}$ is the entry in the $t$-th row and $k$-th column of $\mathbf{W}^{(F)}$; the spectrum generated by the DFT can then be represented as $\mathbf{x}^{(F)} = \mathbf{x} \mathbf{W}^{(F)}$, which clarifies the definition and role of the DFT matrix $\mathbf{W}^{(F)}$.

    • Action. We have added Definition C.5 to delineate the DFT process, where the role of the DFT matrix $\mathbf{W}^{(F)}$ is clarified.
  • After acquiring the optimal transport matrix $\mathbf{T}$ by solving the optimization problem in Eq. (2), the PSW is calculated as $\mathcal{P}(\alpha, \beta) := \langle \mathbf{D}^{(F)}, \mathbf{T} \rangle + \kappa \left( \mathrm{D_{KL}}(\mathbf{T} \mathbf{1}_m \Vert \Delta_n) + \mathrm{D_{KL}}(\mathbf{T}^\top \mathbf{1}_n \Vert \Delta_m) \right)$, which depends on $\alpha$ and $\beta$ through the distance matrix $\mathbf{D}^{(F)}_{i,j} = \|\alpha_i^{(F)} - \beta_j^{(F)}\|_1 = \|(\alpha_i - \beta_j)\mathbf{W}^{(F)}\|_1$. The KL terms do not depend on $\alpha$ and $\beta$, since the optimal $\mathbf{T}$ and the vectors $\mathbf{1}_m$, $\mathbf{1}_n$, $\Delta_m$, $\Delta_n$ are constants. Using the chain rule, we have:
$$\frac{\partial \mathcal{P}}{\partial \alpha_i} = \sum_{j=1}^{m} \frac{\partial \mathcal{P}}{\partial \mathbf{D}^{(F)}_{i,j}} \frac{\partial \mathbf{D}^{(F)}_{i,j}}{\partial \alpha_i} = \sum_{j=1}^{m} \mathbf{T}_{i,j} \frac{\partial \mathbf{D}^{(F)}_{i,j}}{\partial \alpha_i} = \sum_{j=1}^{m} \mathbf{T}_{i,j} \mathbf{W}^{(F)} \cdot \mathrm{sign}(\mathbf{e}_{i,j})^\top,$$
$$\frac{\partial \mathcal{P}}{\partial \beta_j} = \sum_{i=1}^{n} \frac{\partial \mathcal{P}}{\partial \mathbf{D}^{(F)}_{i,j}} \frac{\partial \mathbf{D}^{(F)}_{i,j}}{\partial \beta_j} = \sum_{i=1}^{n} \mathbf{T}_{i,j} \frac{\partial \mathbf{D}^{(F)}_{i,j}}{\partial \beta_j} = -\sum_{i=1}^{n} \mathbf{T}_{i,j} \mathbf{W}^{(F)} \cdot \mathrm{sign}(\mathbf{e}_{i,j})^\top,$$
where $\mathbf{e}_{i,j} = (\alpha_i - \beta_j)\mathbf{W}^{(F)} \in \mathbb{R}^{T}$, $T$ is the sequence length, and $\mathbf{e}_{i,j,k}$ denotes the $k$-th element of $\mathbf{e}_{i,j}$; the derivations of $\frac{\partial \mathbf{D}^{(F)}_{i,j}}{\partial \alpha_i}$ and $\frac{\partial \mathbf{D}^{(F)}_{i,j}}{\partial \beta_j}$ are detailed in our revised appendix. (A numerical sketch of the DFT matrix follows this list.)

    • Action. We have added Theorem C.6 to formulate the calculation of gradients. The gradient formulations are also refined to accommodate the 1-norm formulation.
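To help readers verify the identity $\mathbf{x}^{(F)} = \mathbf{x}\mathbf{W}^{(F)}$ numerically, here is a minimal NumPy sketch (our own illustration, not the authors' implementation) that constructs the DFT matrix and checks it against `np.fft.fft`:

```python
import numpy as np

T = 8
t = np.arange(T).reshape(-1, 1)              # time index (rows)
k = np.arange(T).reshape(1, -1)              # frequency index (columns)
W_F = np.exp(-2j * np.pi * k * t / T)        # W^(F)_{t,k} = e^{-j 2 pi k t / T}

x = np.random.randn(T)
assert np.allclose(x @ W_F, np.fft.fft(x))   # x^(F) = x W^(F)
```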
Comment

Thank you for your detailed reply to my comment.

The KL terms do not depend on \alpha and \beta, since the optimal T ...

Why does this hold? The optimal T depends on \alpha and \beta, so I guess dT/d\alpha and dT/d\beta should be considered.

4Wtk

Comment

[Q2] What does Lemma 3.2 indicate? How do you know it 'deviates more from the typical elements of \beta'? Further, how do you know the PSW discrepancy avoids the problem indicated by this lemma? In Theorem C.2, the perturbation in the PSW discrepancy is shown. Is it possible to compare this with Lemma 3.2? Even if it is possible, what does it mean? Is it clear how a large (or small) perturbation caused by outliers ultimately affects the final imputation?

Response. Thank you for your invaluable suggestion for a clearer exposition of Lemma 3.2 and its relation to Theorem C.2. To this end, we have revised Theorem C.2 to make it comparable to Lemma 3.2.

  • Firstly, we delineate Lemma 3.2 and Theorem C.2. Suppose $\tilde{\alpha} = \zeta \delta_z + (1-\zeta)\alpha$ is a distribution perturbed by a Dirac mode at $z$ with relative mass $\zeta \in (0,1)$, $\mathcal{W}$ is the standard Wasserstein discrepancy, and $\mathcal{P}^\kappa$ is the PSW with matching strength $\kappa$. Lemma 3.2 states that, for an arbitrary sample $y^*$ in $\beta$: $\mathcal{W}(\tilde{\alpha}, \beta) \geq (1-\zeta)\,\mathcal{W}(\alpha, \beta) + \zeta \left( D(z, y^*) - \mathrm{const} \right)$, where $D(z, y^*)$ is the deviation of $\delta_z$ and $\mathrm{const}$ encapsulates the terms independent of $\zeta$ and $D$. In contrast, Theorem C.2 states that: $\mathcal{P}^\kappa(\tilde{\alpha}, \beta) \leq (1-\zeta)\,\mathcal{P}^\kappa(\alpha, \beta) + 2\kappa\zeta\left(1 - e^{-d(z)/2\kappa}\right)$, where $d(z)$ is the average distance between $z$ and samples in $\beta$.

  • Secondly, we clarify and COMPARE the implications of Lemma 3.2 and Theorem C.2.

    • Lemma 3.2 demonstrates that the standard Wasserstein distance $\mathcal{W}(\tilde{\alpha}, \beta)$ increases proportionally to the deviation $D(z, y^*)$ of the outlier from the typical elements $y^*$ of $\beta$. $\mathcal{W}(\tilde{\alpha}, \beta)$ grows without bound as the deviation increases, indicating sensitivity to outliers.
    • Theorem C.2 demonstrates that $\mathcal{P}^\kappa(\tilde{\alpha}, \beta)$ remains bounded even as the outlier’s deviation $d(z) \rightarrow \infty$. What remains is the PSW excluding the outlier mode plus a cost bounded by $2\kappa\zeta$.
    • By comparing Lemma 3.2 and Theorem C.2, it becomes evident that the standard Wasserstein distance is susceptible to large perturbations from outliers, whereas the PSW discrepancy mitigates this issue by capping the impact of such deviations (see the limit comparison after this list). Consequently, PSW provides a more stable and reliable discrepancy amidst outlier modes.
  • Finally, we note that in our framework, accurate imputation relies on precise discrepancy estimation. A flawed discrepancy estimate produces a flawed gradient and thereby misleads the imputation updates. In this context, the robustness of PSW demonstrated in Theorem C.2 matters, since it offers reliable discrepancy estimation, and subsequently reliable gradient estimation, amidst outlier modes, which enhances the accuracy of imputation.
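For reference, taking limits in the two bounds above makes the contrast explicit; this is merely our restatement of Lemma 3.2 and Theorem C.2, not a new result:

$$\lim_{d(z)\to\infty} \mathcal{W}(\tilde{\alpha},\beta) = \infty, \qquad \limsup_{d(z)\to\infty} \mathcal{P}^{\kappa}(\tilde{\alpha},\beta) \le (1-\zeta)\,\mathcal{P}^{\kappa}(\alpha,\beta) + 2\kappa\zeta.$$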

[Q3] Although D_KL is used in (2), $\mathbf{T}\mathbf{1}_m$ and $\mathbf{T}^\top \mathbf{1}_n$ are not probability distributions (because the normalization (sum equals 1) constraint is removed). How are they normalized?

Response. In this work, the KL divergence, for instance $\mathrm{D_{KL}}(\mathbf{T}\mathbf{1}_m \Vert \Delta_n)$, is employed merely as a tool to align the vectors $\mathbf{T}\mathbf{1}_m$ and $\Delta_n$. Specifically, denote $a = \mathbf{T}\mathbf{1}_m$ and $b = \Delta_n$. For the $i$-th element, if $a_i > b_i$, then $\mathrm{D_{KL}}(a_i \Vert b_i) > 0$, and the optimization objective encourages reducing $a_i$ towards $b_i$; otherwise, $\mathrm{D_{KL}}(a_i \Vert b_i) < 0$, and the objective encourages increasing $a_i$ towards $b_i$. Therefore, the KL divergence effectively guides the alignment of $a$ and $b$ irrespective of whether their sums equal 1.
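As a concrete illustration, the generalized (unnormalized) KL divergence commonly used in unbalanced optimal transport is well-defined for vectors that do not sum to 1; whether Eq. (2) uses exactly this variant is our assumption, so treat the sketch below as illustrative:

```python
import numpy as np

def gen_kl(a, b, eps=1e-12):
    # Generalized KL: sum_i a_i*log(a_i/b_i) - a_i + b_i.
    # Nonnegative, zero iff a == b, and requires no normalization.
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(a * np.log((a + eps) / (b + eps)) - a + b)

a = np.array([0.5, 1.5])   # unnormalized marginal, e.g., T 1_m
b = np.array([1.0, 1.0])   # target weights Delta_n
print(gen_kl(a, b))        # > 0; minimizing it pulls a toward b
```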

Comment

Thank you very much for your meticulous comments and appreciation of our novelty and empirical results. We have noticed that this reviewer has carefully checked the formulas and theorems in both the main text and the appendix and raised a set of constructive suggestions. Below are our responses to the specific concerns and queries.


[W1] Some technical justification is vague. Clearer descriptions would be desired.

Response. This concern appears to manifest in the specific comments in the Questions section. We sincerely appreciate these meticulous comments and have made careful revisions to improve the clarity of the technical justification. Please kindly see our responses to these questions below.

[W2] Introduction is a bit too abstract about the proposed method. It describes what problem is solved in the paper, but does not describe the basic technical idea how it is achieved.

Response. Thank you very much for your sincere comment. We have made extensive edits to the fifth paragraph of the introduction, with particular focus on highlighting the basic technical idea of PSW-I for tackling the research problem. The revised paragraph is attached below.

To this end, we propose the Proximal Spectrum Wasserstein (PSW) discrepancy, a novel discrepancy tailored for comparing sets of time series based on optimal transport. Specifically, PSW integrates a pairwise spectral distance, which transforms time series into the frequency domain and then calculates the pairwise absolute difference. By comparing time series in the frequency domain, the underlying temporal dependencies and patterns are captured. Moreover, PSW incorporates selective matching regularization, which relaxes the hard matching constraints of traditional optimal transport and allows for flexible mass matching between distributions. This relaxation enhances robustness to non-stationarity. Building upon PSW, we develop the PSW for Imputation (PSW-I) framework, which iteratively refines the imputed missing values by minimizing the PSW discrepancy. Extensive experiments demonstrate that PSW-I effectively captures temporal patterns and accommodates non-stationary behaviors, significantly outperforming existing time-series imputation methods.

[Q1] In 'Contributions', the authors mention that PSW-I eliminates the need for masking, but PSW-I seems to use masking (e.g., as shown in Fig. 3). What does this description mean?

Response. Thank you very much for your query. In our manuscript, the mask matrix $\mathbf{M}$ is utilized exclusively to indicate the missingness of data at each index. The non-missing indices are employed as the training/validation dataset, and the missing indices are used for testing. In contrast, many prevailing works such as CSDI and time-series foundation models (iTransformer, TimesNet, etc.) perform an additional masking operation on the training dataset to generate labels for model training. Our method bypasses the complexity of this additional masking operation, which proves to be beneficial for imputation quality [1].

Comment

Thank you very much for your swift response. We have incorporated the clarification into the main text. Specifically, on page 5, footnote 1, we note:

The definition of the DFT matrix is presented in Definition C.5, and the gradient derivation is detailed in Theorem C.6. Notably, our gradient formulation omits the gradients of the optimal $\mathbf{T}$ with respect to $\alpha$ and $\beta$ to enhance the efficiency and stability of the calculation process.

Comment

Thank you very much for reading our rebuttal and raising follow-up questions.

The optimal T depends on \alpha and \beta, so I guess dT/d\alpha and dT/d\beta should be considered.

Response. We understand that the derivatives $dT/d\alpha$ and $dT/d\beta$ should ideally be considered. However, $\mathbf{T}$ is obtained through an iterative numerical optimization procedure $\mathcal{A}$ that is not always differentiable. Even if $\mathcal{A}$ were differentiable, calculating $dT/d\alpha$ and $dT/d\beta$ would be computationally intensive due to the need to compute gradients across up to 1,000 iteration steps. Even worse, large values in the iteration could cause excessively large gradients and lead to gradient explosion. Therefore, we choose to update $\alpha$ and $\beta$ through the pathway $\mathcal{P} \rightarrow \mathbf{D}^{(F)} \rightarrow \alpha, \beta$. This strategy stabilizes the optimization procedure and effectively guides the imputation updates. We have clarified this point in Theorem C.6.
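A minimal PyTorch-style sketch of this stop-gradient strategy follows; it is our illustration, with `T_plan` standing in for the Sinkhorn output and toy shapes assumed throughout:

```python
import torch

def psw_surrogate(alpha, beta, T_plan):
    # Gradients flow only through the spectral cost D^(F);
    # the transport plan is detached, i.e., treated as a constant.
    spec = torch.fft.fft(alpha)[:, None, :] - torch.fft.fft(beta)[None, :, :]
    D_F = spec.abs().sum(-1)              # (n, m) pairwise spectral L1 cost
    return (T_plan.detach() * D_F).sum()  # <D^(F), T> with T frozen

alpha = torch.randn(4, 16, requires_grad=True)  # imputed patches (toy)
beta = torch.randn(5, 16)
T_plan = torch.rand(4, 5)                       # stand-in transport plan
psw_surrogate(alpha, beta, T_plan).backward()   # populates alpha.grad only
```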

Comment

Thank you for your answer. I think it must be clarified in the main text that the equations below 'Backward Pass' are not the true gradient of (2), because otherwise readers (especially students) may misunderstand that, in general, the gradient of the optimal value (P) can be derived without considering the dependence between the optimal solution (T) and the optimization problem parameters (alpha, beta).

4Wtk

Comment

Dear reviewer 4Wtk,

Thank you very much for your valuable feedback and insightful suggestions. Following your recommendation, we have incorporated clarifications on gradient calculation into the main text. The revised manuscript highlights these changes in blue for your convenience.

As the discussion draws to a close, we hope to know whether you have any remaining questions or additional suggestions.

Please note that today is the final day to submit the revised PDF. If you have any further comments, please kindly consider communicating them to us, such that we could promptly discuss them with you and reflect them in the revision.

Thank you in advance,

14099 Authors

Comment

Dear Reviewer 4Wtk,

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript. Your discussions and suggestions have been invaluable in enhancing our work.

In response to your comments, we have made the following efforts:

  • For Q1: we have clarified that the masking matrix is solely used for splitting test data.
  • For Q2: we have justified noise robustness by comparing Lemma 3.2 and Theorem C.2 (highlighted in blue fonts in revision).
  • For Q3: we have explained the efficacy of KL divergence without normalization.
  • For Q4: we have elaborated on the DFT matrix and the calculation of gradients (highlighted in blue fonts in revision).
  • For Q5: we have corrected typographical errors (highlighted in blue fonts in revision).
  • For W2: we have refined the technical motivation narrative in the introduction (highlighted in blue fonts in revision).

As the discussion deadline approaches, we are wondering whether our responses have properly addressed your concerns. Your feedback would be extremely helpful to us. If you have further comments or questions, we hope for the opportunity to respond to them.

Many thanks,

14099 Authors

Comment

Dear reviewer 4Wtk,

Since the discussion period will end in around a day, we will be online awaiting your feedback on our discussion, which we believe has fully addressed your concerns.

We would highly appreciate it if you could take into account our discussion and clarification when updating the rating and having discussions with AC and other reviewers.

Thank you so much for your time and efforts. We understand that the recent Thanksgiving holiday may have caused our previous message to be unintentionally overlooked. We are sincerely sorry for the repeated messages, but we are eager to ensure everything is addressed.

Authors of #14099

Review (Rating: 8)

The paper proposes a time-series imputation method based on optimal transport. The key idea is the combination of a frequency-based Wasserstein discrepancy and selective matching regularization. Theoretical justification is also provided. The experimental results show that the imputation accuracy outperforms many state-of-the-art methods.

Strengths

  1. An interesting and well-motivated design of the spectral-enhanced Wasserstein distance (WD).
  2. A theoretically justified design of the proximal spectral WD to account for non-stationarity.
  3. Seemingly excellent performance on real-world benchmark datasets.

Weaknesses

  1. The biggest issue from my end is the lack of standard deviations. From Table 1, the error of the proposed method seems really good, but I am not informed whether these results are averaged over multiple train/test runs or come from just one run. To avoid cherry-picking, the authors are encouraged to highlight how these numbers were obtained, what the training/test splits were, and what hyperparameter selection/cross-validation process was involved, etc. Similar expectations apply to Tables 3 and 4.

  2. Lack of convergence discussion/analysis. From Fig. 3 and Section 3.4, the imputation procedure seems to repeatedly sample patches from the time series, compute the PSW, and use the gradient of the PSW to update the imputation. There seems to be a lack of convergence guarantees or discussion of this procedure. Can the authors provide at least some discussion on this?

  3. Data noise issue. Real time-series data often includes noise. That means just computing the distance after the DFT in line 154 might be affected by data noise. Have the authors considered any methods or trade-offs, such as low-pass filters, in your SWD definition to improve robustness and/or counteract the noise effect?

Questions

see above.

Comment

The standard deviation of Table 3.

| Dataset | Distances | MAE | MSE | MRE |
| --- | --- | --- | --- | --- |
| Electricity | PSW-T | 0.007 | 0.010 | 0.006 |
| Electricity | PSW-A | 0.008 | 0.003 | 0.009 |
| Electricity | PSW-P | 0.008 | 0.002 | 0.003 |
| Electricity | PSW | 0.004 | 0.004 | 0.007 |
| ETTh1 | PSW-T | 0.005 | 0.004 | 0.002 |
| ETTh1 | PSW-A | 0.001 | 0.006 | 0.007 |
| ETTh1 | PSW-P | 0.008 | 0.008 | 0.006 |
| ETTh1 | PSW | 0.005 | 0.007 | 0.006 |

The standard deviation of Table 4.

| Dataset | Distances | MAE | MSE | MRE |
| --- | --- | --- | --- | --- |
| Electricity | OT | 0.008 | 0.002 | 0.003 |
| Electricity | EMD | 0.008 | 0.006 | 0.007 |
| Electricity | UOT | 0.004 | 0.000 | 0.008 |
| Electricity | Ours | 0.004 | 0.007 | 0.004 |
| ETTh1 | OT | 0.005 | 0.006 | 0.004 |
| ETTh1 | EMD | 0.003 | 0.010 | 0.003 |
| ETTh1 | UOT | 0.001 | 0.004 | 0.008 |
| ETTh1 | Ours | 0.007 | 0.006 | 0.005 |

The standard deviation of Table 1. We only report the results of 4 models here. Full results are presented in Table 9 in revision.

Each cell reports the standard deviation of MSE / MAE.

| Model, $p_{miss}$ | ETTh1 | ETTh2 | ETTm1 | ETTm2 | Electricity | Traffic | Weather | Illness | Exchange | PEMS03 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Transformer, 0.1 | 0.0 / 0.004 | 0.008 / 0.008 | 0.009 / 0.005 | 0.007 / 0.005 | 0.004 / 0.005 | 0.009 / 0.009 | 0.01 / 0.008 | 0.0 / 0.004 | 0.004 / 0.004 | 0.002 / 0.001 |
| Transformer, 0.3 | 0.003 / 0.005 | 0.002 / 0.003 | 0.003 / 0.008 | 0.007 / 0.005 | 0.003 / 0.001 | 0.008 / 0.004 | 0.003 / 0.008 | 0.008 / 0.008 | 0.0 / 0.002 | 0.001 / 0.0 |
| Transformer, 0.5 | 0.007 / 0.005 | 0.007 / 0.008 | 0.009 / 0.005 | 0.008 / 0.008 | 0.006 / 0.003 | 0.000 / 0.009 | 0.009 / 0.003 | 0.005 / 0.007 | 0.009 / 0.003 | 0.0 / 0.003 |
| Transformer, 0.7 | 0.005 / 0.002 | 0.004 / 0.000 | 0.007 / 0.008 | 0.002 / 0.007 | 0.002 / 0.0 | 0.001 / 0.002 | 0.007 / 0.007 | 0.006 / 0.001 | 0.006 / 0.0 | 0.003 / 0.002 |
| DLinear, 0.1 | 0.005 / 0.007 | 0.002 / 0.008 | 0.001 / 0.003 | 0.000 / 0.007 | 0.009 / 0.007 | 0.001 / 0.008 | 0.006 / 0.001 | 0.002 / 0.003 | 0.001 / 0.008 | 0.004 / 0.002 |
| DLinear, 0.3 | 0.006 / 0.005 | 0.008 / 0.000 | 0.008 / 0.004 | 0.001 / 0.007 | 0.003 / 0.009 | 0.010 / 0.002 | 0.007 / 0.006 | 0.005 / 0.003 | 0.007 / 0.006 | 0.005 / 0.003 |
| DLinear, 0.5 | 0.008 / 0.003 | 0.004 / 0.007 | 0.010 / 0.001 | 0.005 / 0.002 | 0.004 / 0.003 | 0.004 / 0.004 | 0.009 / 0.008 | 0.001 / 0.005 | 0.010 / 0.003 | 0.008 / 0.002 |
| DLinear, 0.7 | 0.009 / 0.001 | 0.002 / 0.008 | 0.007 / 0.007 | 0.007 / 0.003 | 0.010 / 0.009 | 0.007 / 0.004 | 0.004 / 0.001 | 0.008 / 0.005 | 0.007 / 0.006 | 0.007 / 0.008 |
| TimesNet, 0.1 | 0.008 / 0.006 | 0.009 / 0.006 | 0.004 / 0.009 | 0.005 / 0.006 | 0.001 / 0.009 | 0.004 / 0.010 | 0.005 / 0.002 | 0.003 / 0.007 | 0.009 / 0.006 | 0.003 / 0.009 |
| TimesNet, 0.3 | 0.004 / 0.001 | 0.007 / 0.010 | 0.005 / 0.007 | 0.004 / 0.001 | 0.007 / 0.006 | 0.006 / 0.002 | 0.010 / 0.005 | 0.009 / 0.007 | 0.002 / 0.001 | 0.004 / 0.002 |
| TimesNet, 0.5 | 0.008 / 0.005 | 0.005 / 0.0 | 0.002 / 0.00 | 0.001 / 0.010 | 0.003 / 0.004 | 0.007 / 0.002 | 0.003 / 0.001 | 0.005 / 0.006 | 0.008 / 0.007 | 0.005 / 0.007 |
| TimesNet, 0.7 | 0.007 / 0.010 | 0.002 / 0.005 | 0.005 / 0.002 | 0.002 / 0.005 | 0.002 / 0.006 | 0.002 / 0.001 | 0.004 / 0.002 | 0.010 / 0.003 | 0.009 / 0.004 | 0.001 / 0.004 |
| PSW-I, 0.1 | 0.008 / 0.004 | 0.007 / 0.002 | 0.001 / 0.002 | 0.003 / 0.003 | 0.008 / 0.005 | 0.007 / 0.005 | 0.005 / 0.010 | 0.005 / 0.005 | 0.0 / 0.002 | 0.008 / 0.009 |
| PSW-I, 0.3 | 0.004 / 0.008 | 0.005 / 0.002 | 0.009 / 0.009 | 0.004 / 0.009 | 0.002 / 0.001 | 0.002 / 0.004 | 0.004 / 0.003 | 0.007 / 0.004 | 0.001 / 0.005 | 0.009 / 0.000 |
| PSW-I, 0.5 | 0.003 / 0.009 | 0.006 / 0.003 | 0.0 / 0.008 | 0.008 / 0.007 | 0.003 / 0.002 | 0.002 / 0.003 | 0.0 / 0.009 | 0.006 / 0.007 | 0.005 / 0.005 | 0.003 / 0.006 |
| PSW-I, 0.7 | 0.004 / 0.006 | 0.001 / 0.003 | 0.003 / 0.009 | 0.006 / 0.002 | 0.002 / 0.009 | 0.002 / 0.005 | 0.002 / 0.001 | 0.007 / 0.006 | 0.009 / 0.004 | 0.009 / 0.007 |

20241122 Update: We are pleased to announce that our implementation is now available here: https://anonymous.4open.science/r/psw-i-F35F. We hope this release reinforces your confidence in and support for this work.

Comment

Thanks for the new results and fruitful discussions. I've increased my score. I hope all these will be integrated into your paper.

Comment

Thank you so much for reading our response and providing very positive feedback! This content has been incorporated into the revised manuscript. We commit to preserving it, as it is fruitful for enhancing the quality of this paper.

Comment

[W3] Data noise issue. Real time-series data often includes noise. That means just computing the distance after the DFT in line 154 might be affected by data noise. Have the authors considered any methods or trade-offs, such as low-pass filters, in your SWD definition to improve robustness and/or counteract the noise effect?

Response. This is indeed a very insightful comment for extending PSW-I, which actually coincides with the original objective of this research. We are pleased to share our considerations and trade-offs in this regard.

  • This extension is theoretically promising. Noise predominantly occupies the high-frequency components of the signal, whereas the underlying data semantics are primarily contained within the low-frequency regions. By applying a low-pass filter before calculating the spectrum distance between patches, it is feasible to attenuate the noise while preserving the essential semantic information. Consequently, this approach facilitates accurate time series imputation even amidst noisy observations.
  • This extension is empirically valid. To verify the extension you suggested, we have conducted additional experiments. Specifically, we introduce Gaussian noise to the observed data in frequency components above 100 while keeping the test data clean. As shown in the table below, most models experience a performance decline. To counteract the noise, PSW-I+ applies a low-pass filter with a cutoff frequency of 20 prior to computing the PSW (a code sketch of this filtering step follows the table below). Experimental results demonstrate that PSW-I+ consistently outperforms PSW-I across all cases. However, the improvement at this stage is modest, potentially because the benefits of noise reduction are offset by the loss of semantic information in the high frequencies. Further performance enhancements may entail careful tuning of the low-pass filter, particularly the cutoff frequency.
  • Some trade-offs hinder us from incorporating this extension. Specifically, it is difficult to tune the filter's hyperparameters, typically the stopping threshold. In practice, the specific frequency bands affected by noise are not always known. Additionally, the absence of clean data makes it challenging to use a validation set for tuning the filter's hyperparameters. Due to these challenges in determining the filter's hyperparameters, we have not extended our method to accommodate noisy datasets in the current study. Addressing these challenges requires further research, potentially involving adaptive filtering techniques or leveraging unsupervised learning methods to estimate the noise characteristics.
Each cell reports MSE / MAE.

| Model | ETTh1 | ETTh2 | ETTm1 | ETTm2 | Illness |
| --- | --- | --- | --- | --- | --- |
| DLinear | 0.172 / 0.27 | 0.109 / 0.233 | 0.105 / 0.222 | 0.079 / 0.197 | 0.25 / 0.298 |
| TimesNet | 0.23 / 0.338 | 0.121 / 0.251 | 0.064 / 0.175 | 0.07 / 0.188 | 0.299 / 0.322 |
| FreTS | 0.197 / 0.319 | 0.108 / 0.227 | 0.051 / 0.152 | 0.039 / 0.135 | 0.306 / 0.36 |
| PatchTST | 0.182 / 0.278 | 0.14 / 0.271 | 0.053 / 0.154 | 0.045 / 0.147 | 0.91 / 0.584 |
| SCINet | 0.178 / 0.285 | 0.123 / 0.247 | 0.069 / 0.178 | 0.061 / 0.170 | 1.147 / 0.711 |
| iTransformer | 0.182 / 0.275 | 0.100 / 0.211 | 0.053 / 0.151 | 0.036 / 0.128 | 0.429 / 0.355 |
| SAITS | 0.217 / 0.304 | 0.194 / 0.259 | 0.057 / 0.154 | 0.047 / 0.135 | 0.264 / 0.273 |
| CSDI | 0.222 / 0.322 | 0.154 / 0.243 | 0.052 / 0.135 | 0.117 / 0.123 | 0.335 / 0.419 |
| Sinkhorn | 1.002 / 0.754 | 1.009 / 0.752 | 1.004 / 0.755 | 0.996 / 0.749 | 1.034 / 0.721 |
| TDM | 1.018 / 0.753 | 1.007 / 0.75 | 1.003 / 0.755 | 0.995 / 0.748 | 1.018 / 0.711 |
| PSW-I | 0.169 / 0.257 | 0.051 / 0.145 | 0.049 / 0.131 | 0.023 / 0.092 | 0.111 / 0.157 |
| PSW-I+ | 0.168 / 0.247 | 0.047 / 0.144 | 0.048 / 0.129 | 0.022 / 0.089 | 0.11 / 0.153 |
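A minimal sketch of the low-pass step assumed in PSW-I+ above; the thread does not specify the exact filter, so the hard cutoff on `rfft` bins here is our assumption:

```python
import numpy as np

def low_pass(x, cutoff=20):
    # Zero all frequency bins at or above `cutoff` along the last axis,
    # then transform back, attenuating high-frequency noise.
    X = np.fft.rfft(x, axis=-1)
    X[..., cutoff:] = 0.0
    return np.fft.irfft(X, n=x.shape[-1], axis=-1)

# e.g., filter patches before computing the pairwise spectral distance
patches = np.random.randn(32, 48)
filtered = low_pass(patches, cutoff=20)
```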
Comment

Thank you very much for your positive comments and appreciation of our novelty, motivation and theoretical & empirical results. Below are our responses to the specific concerns and queries.


[W1] The biggest issue from my end is the lack of standard deviations. From Table 1, the error of the proposed method seems really good, but I am not informed whether these results are averaged over multiple train/test runs or come from just one run. To avoid cherry-picking, the authors are encouraged to highlight how these numbers were obtained, what the training/test splits were, and what hyperparameter selection/cross-validation process was involved, etc. Similar expectations apply to Tables 3 and 4.

Response. We understand that the standard deviations and the detailed descriptions of the dataset split and hyperparameter selection are critical for reproduction. The detailed responses to each query are provided as follows.

  • Firstly, we have conducted additional experiments to provide the standard deviations of Tables 1, 3, and 4. Please kindly check them in the subsequent comment windows. We agree that standard deviations are important to convince readers of the reliability of the experiments and conclusions. Therefore, we have added them in the revised appendix.
  • Secondly, we split the dataset using the binary mask matrix, denoted as $\mathbf{M}$ in Section 2.1 (a code sketch follows this list). The binary mask matrix has the same shape as the dataset and is generated by sampling a Bernoulli random variable with a predefined mean (i.e., the missing ratio). Each element in the binary mask matrix takes a value of 1 or 0, representing whether the corresponding index in the dataset is considered missing or observable, respectively.
    • For indices where the mask matrix element is 1, these indices are treated as missing in the training data. The associated data are excluded from the model training process and preserved exclusively for testing.
    • For indices where the mask matrix element is 0, these indices are treated as observable in the training data. The associated data are used for model training.
  • Thirdly, we introduce the hyperparameter selection process. We leave out 5% of the indices from the training data as a validation set. Subsequently, the key hyperparameters involved in PSW-I are tuned on the validation set by minimizing the MSE, where the patch size is tuned within $\{24, 36, 48\}$ and the matching strength within $\{1, 10, 100, 1000\}$.
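For clarity, here is a minimal NumPy sketch of the splitting protocol described above; this is our reading of it, and the shapes, seed, and ratios are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 7))         # toy dataset (time steps x features)
p_miss = 0.3

M = rng.binomial(1, p_miss, size=X.shape)  # 1 = treated as missing (test only)
observed = (M == 0)

# leave out 5% of observed entries for validation, train on the rest
val = observed & (rng.random(X.shape) < 0.05)
train = observed & ~val
```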

[W2] Lack of convergence discussion/analysis. From Fig. 3 and Section 3.4, the imputation procedure seems to repeatedly sample patches from the time series, compute the PSW, and use the gradient of the PSW to update the imputation. There seems to be a lack of convergence guarantees or discussion of this procedure. Can the authors provide at least some discussion on this?

  • Response. Thank you for your query, and we agree that the convergence property deserves detailed discussion.
  • Firstly, the PSW, denoted as $\mathcal{P}^\kappa$, is a convex function w.r.t. the input pair $(\alpha, \beta)$. This follows immediately from the fact that the composition of convex functions here preserves convexity: the PSW is convex (a linear function is convex) in the pairwise distance $\mathbf{D}^{(F)}$, and $\mathbf{D}^{(F)}$ is convex (the 1-norm and affine functions are convex) in $\alpha$ and $\beta$. Therefore, $\mathcal{P}^\kappa$ is convex in the input pair.
  • Thanks to the convexity of $\mathcal{P}^\kappa$, it is feasible to prove that PSW-I, which employs gradient descent to update $(\alpha, \beta)$, converges (Theorem C.3) with diminishing errors (Theorem C.4). Specifically, suppose $(\alpha_k, \beta_k)$ is the imputation result at the $k$-th iteration and $(\alpha^*, \beta^*)$ is the optimum; in Theorem C.4, we demonstrate that $\mathcal{P}(\alpha_k, \beta_k) - \mathcal{P}(\alpha^*, \beta^*) \leq \frac{1}{2\eta k}\epsilon^2$, where the error diminishes at rate $\mathcal{O}(1/k)$ as $k$ increases. (A sketch of the standard derivation follows this list.)
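For completeness, the standard argument behind such $\mathcal{O}(1/k)$ rates for gradient descent on a convex function with gradients bounded by $G$ (our summary of a textbook derivation, not the paper's proof) proceeds as:

$$\|x_{k+1}-x^*\|^2 = \|x_k-x^*\|^2 - 2\eta\,\nabla\mathcal{P}(x_k)^\top (x_k-x^*) + \eta^2\|\nabla\mathcal{P}(x_k)\|^2 \le \|x_k-x^*\|^2 - 2\eta\left(\mathcal{P}(x_k)-\mathcal{P}^*\right) + \eta^2 G^2,$$

and telescoping over $k$ iterations yields $\min_{i\le k}\mathcal{P}(x_i)-\mathcal{P}^* \le \frac{\|x_0-x^*\|^2 + \eta^2 k G^2}{2\eta k}$, which matches the stated $\frac{1}{2\eta k}\epsilon^2$ form for an appropriately chosen step size $\eta$.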
Review (Rating: 8)

The paper provides a method for time-series imputation while respecting temporal dependence.

Strengths

  • Very good empirical investigation.

  • Clear presentation.

Weaknesses

NA

Questions

  • What if the time series components are observed at different temporal frequencies (say days vs. hours, or hours vs. mins)?

Ethics Concerns

NA

Comment

Thank you very much for your encouraging support and appreciation of our presentation and empirical results. Below are our responses to the specific query raised.


[Q1] What if the time series components are observed (a.k.a. sampled) at different temporal frequencies (say days vs. hours, or hours vs. mins)?

Response. Thank you for your insightful question. If you are interested in the general applicability of PSW-I to datasets with different sampling frequencies, please refer to the results on the ETTh1 and ETTm1 datasets reported in our paper, which are observed at hourly and minute-level frequencies, respectively. If you hope to generalize PSW-I to non-uniform time series, which contain observations at mixed frequencies, we propose the following two solutions:

  • Pre-Processing and Imputation. A straightforward approach involves converting the non-uniform time series into a uniform time series before imputation. The process depends on the desired target frequency. For example, given a non-uniform time series sampled at both daily and hourly intervals: To recover an hour-wise sequence, the daily observations can be redistributed into hourly slots, with the unsampled hours treated as missing; to recover a day-wise sequence, the hourly observations can be aggregated or downsampled, and redundant indices can be removed. Once the time series has been resampled to the desired uniform frequency, standard imputation methods can be applied to handle any missing values.

  • Using the Non-uniform Discrete Fourier Transform (NDFT). An alternative solution involves leveraging the NDFT to acquire the spectrum of a non-uniform time series. By enhancing the pairwise spectrum distance with the NDFT, we can calculate the distance between non-uniform time series by comparing their spectra, thereby extending the PSW-I framework to handle such data. The NDFT is defined as $X_k = \sum_{n=0}^{N-1} x_n e^{-2\pi i p_n f_k}$, where $k$ (ranging from $0$ to $N-1$) denotes the frequency index, $p_0, \dots, p_{N-1}$ are the scaled time stamps of the samples, and $f_0, \dots, f_{N-1}$ are the corresponding frequencies.
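A minimal NumPy sketch of the NDFT formula above (our illustration); on a uniform grid it reduces to the standard DFT, which the final assertion checks:

```python
import numpy as np

def ndft(x, p, f):
    # X_k = sum_n x_n * exp(-2*pi*i * p_n * f_k)
    # p: scaled time stamps of the samples, f: target frequencies
    return np.exp(-2j * np.pi * np.outer(f, p)) @ x

N = 8
x = np.random.randn(N)
p = np.arange(N) / N      # uniform stamps scaled to [0, 1)
f = np.arange(N)          # integer frequencies
assert np.allclose(ndft(x, p, f), np.fft.fft(x))
```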


20241122 Update: We are pleased to announce that our implementation is now available here: https://anonymous.4open.science/r/psw-i-F35F. We hope this release reinforces your confidence in our work and inspires you to continue supporting our efforts.

Comment

Thank you for providing details on the non-uniform time series case. I think this is a good piece of work and I will keep my score.

Comment

Thank you so much for the continuing support -- we truly appreciate it!

Comment

We are grateful to all reviewers for their helpful comments regarding our paper. We are pleased that Reviewers 8uNw and 4Wtk appreciate the significance of the methodology for enhancing time-series imputation, finding our method novel and well-motivated. All reviewers find our empirical results promising. Below is a summary of our responses:

  • To Reviewer fN5T: We have discussed a case where the time-series are irregularly sampled, and proposed some feasible extensions of PSW-I to accommodate it.
  • To Reviewer 8uNw: We have included experiments across different random seeds, clarified the hyper-parameter selection process, underscored the convergence analysis, and conducted additional experiments with noisy dataset.
  • To Reviewer 4Wtk: After thorough discussion, the only remaining concern was the need to add a clarification on gradient calculation in the main text. This is an actionable item, and we have addressed it by including the necessary clarification on page 5, footnote 1. Additionally, we have thoroughly refined the technical justifications and the narrative of the introduction to better highlight the fundamental technical concepts. Therefore, we believe that all concerns raised by Reviewer 4Wtk have now been satisfactorily resolved.

Please refer to our detailed responses to each point raised. We have also uploaded a revised manuscript with changes highlighted in blue. Hopefully our revisions and clarifications reinforce your confidence and support in our work.

Thank you again for your valuable time and expertise.

AC Meta-Review

This submission tackles missing-data imputation for time series via a new discrepancy measure. The authors target difficult issues such as non-stationarity and periodicities, showing that the proposed proximal spectrum Wasserstein discrepancy allows flexible learning and is amenable to the imputation framework. Reviewers' concerns seemed addressed after the rebuttal phase. In particular, they already found the work novel with theoretical support, and some questions and technical details were largely resolved.

Additional Comments on Reviewer Discussion

Adequate discussion in response period

Final Decision

Accept (Poster)

Public Comment

Dear authors,

Thanks for a great paper; it was a really good read. Is it possible to get the code? I noticed that the code on GitHub has been deleted. Is there a new repository link? Thank you

Public Comment

Thank you very much for your interest and appreciation. Some refinement is needed to enhance the quality and exposure of the repository before its official release. We will attend to this as soon as possible after completing an approaching deadline.