PaperHub
Overall: 5.5/10
Poster · 4 reviewers
Ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Information Bottleneck-guided MLPs for Robust Spatial-temporal Forecasting

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We derive a novel and general information bottleneck-based principle, along with its instantiation in MLP networks for robust spatial-temporal forecasting under the dual noise effect.

Abstract

Keywords
Robust spatial-temporal forecasting · Multi-Layer Perceptron · Information bottleneck

Reviews and Discussion

Review (Rating: 3)

This paper proposes the Robust Spatio-Temporal Information Bottleneck (RSTIB) principle to enhance the robustness of spatio-temporal prediction models to noise interference. The authors introduce RSTIB-MLP, a multi-layer perceptron (MLP) based implementation that achieves state-of-the-art performance in the face of noise interference. Extensive experiments on multiple datasets demonstrate the robustness and efficiency of the proposed model.

update after rebuttal

I have read the overall feedback, and the authors have addressed my concerns, but some sections need to be rewritten and the final version should be carefully revised.

Questions for the Authors

Q1. How does the RSTIB principle handle noise types other than AWGN? Can the authors provide insights into robustness under different noise models?

Q2. Can the authors elaborate on the role of knowledge distillation in enhancing feature diversity and its impact on robustness?

Claims and Evidence

The claims are generally supported by clear evidence. The authors provide a theoretical framework for RSTIB and demonstrate its effectiveness through experiments on noisy datasets. However, the assumption of additive white Gaussian noise (AWGN) is restrictive and may not generalize to other noise types (I understand this keeps the theoretical derivation tractable). In particular, real-world data may exhibit noise with spatio-temporal entanglement, or may simply be missing.

Methods and Evaluation Criteria

The proposed RSTIB-MLP method and evaluation criteria are applicable to the problem. The authors use standard metrics (MAE, RMSE, MAPE) and benchmark datasets for spatio-temporal prediction. The inclusion of the knowledge distillation module is innovative, but the paper would benefit from a clearer explanation of its role in enhancing feature diversity.

Theoretical Claims

Although I have not examined them carefully, I believe the theoretical claims are sound. The authors correctly derive the RSTIB principle by lifting the Markov assumption and provide proofs for the key propositions.

Experimental Designs or Analyses

The experimental design is comprehensive but has some limitations. The authors demonstrate that RSTIB-MLP outperforms state-of-the-art methods under noisy conditions. However, the experiments mainly use artificially added noise, which may not fully represent real-world scenarios.

Supplementary Material

The supplementary materials are extensive and provide more details on the theoretical proofs, hyperparameter tuning, and experimental results. However, some sections could benefit from clearer explanations, especially on the role of knowledge distillation and the impact of different regularization terms.

Relation to Broader Scientific Literature

This paper builds on the existing information bottleneck (IB) method and extends it to handle the dual noise effect in spatiotemporal prediction. This work is related to recent advances in robust representation learning and applies these ideas to real-world problems. However, the authors should do a better job of distinguishing their contribution from existing works, especially those cited in the paper.

Essential References Not Discussed

This paper cites related works, but could benefit from discussing recent advances in robust machine learning and spatiotemporal forecasting. For example, recent works on adversarial training of time series data or methods for handling missing data in spatiotemporal models could provide additional context [1]. In addition, feature variance is used to quantitatively analyze the diversity among learned features, which can also be interpreted as spatiotemporal heterogeneity in the spatiotemporal forecasting scenario, as shown by a recent paper [2].

[1] Cheng, Hao, et al. "RobustTSF: Towards theory and design of robust time series forecasting with anomalies." ICLR 2024.

[2] Chen, Wei, and Yuxuan Liang. "Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting." ICLR 2025.

Other Strengths and Weaknesses

Strengths:

S1. The paper introduces a novel theoretical framework (RSTIB) that extends the information bottleneck principle to handle dual noise effects for spatiotemporal forecasting applications.

S2. The experiments are comprehensive, covering multiple benchmark datasets and comparison baselines.

Weaknesses:

W1. Overall, my biggest concern is that its narrow setting affects the broad interest of the community. Specifically, the assumptions of AWGN are restrictive and may not generalize to other noise types.

W2. The experiments mainly use artificially added noise, which may not fully represent real-world noisy scenes.

W3. The role of knowledge distillation in enhancing feature diversity has not been fully explored.

Other Comments or Suggestions

C1. Overall, I think this is a good paper in the field of spatio-temporal forecasting, but its diffuse settings and motivations narrow its audience. I suggest the authors reframe it as robust spatio-temporal forecasting under extreme noise conditions (using this as the motivation to revise the narrative and introduce the theoretical framework), which I believe would attract wider attention.

C2. Accordingly, the authors could include more real-world noise experiments (e.g., the spatio-temporal data-missing setting of reference [3]) to verify the robustness claims.

[3] Cini, Andrea, Ivan Marisca, and Cesare Alippi. "Filling the g_ap_s: Multivariate time series imputation by graph neural networks." ICLR 2022.

Ethics Review Issues

Not applicable.

Author Response

Thank you very much for your positive and constructive comments. We provide a point-by-point response as follows.

Re: Claims And Evidence & Experimental Designs Or Analyses & W1 & W2 & C2 & Q1

Due to space limits, please refer to Re: W1 in the Rebuttal for Reviewer ZfWm.

Re: Methods And Evaluation Criteria & Supplementary Material & W3 & Q2

  • KD's role in enhancing feature diversity:

Due to space limits, please refer to Re: W2 in the Rebuttal for Reviewer ZfWm.

  • KD's impact on robustness:

We observed performance gains from incorporating the KD module. The reason could be that KD dynamically tunes the balance across different time series, enabling MLPs with limited capacity to favor information containing less noise. However, the improvement in predictive performance is less prominent than the robustness enhancement brought by the RSTIB, because in all situations we still need to balance the preservation of the target against the compression of all the reparameterizations. Intrinsically, although KD can tune the relative ratio, the improvement in predictive performance is still constrained by the objective itself.

  • The impact of different regularization terms:

We provided further ablation studies and discussions on the respective roles of the regularization terms in Appendix K.2. Please refer to that section for more details.

Re: Relation To Broader Scientific Literature

Appendix I.2 discusses the distinctions between RSTIB/RSTIB-MLP and existing IB methods. Please refer to that section for more details.

Re: Essential References Not Discussed

We highly appreciate the suggested papers! Below we provide a detailed discussion on the relation between our submission and the cited works:

  • We are inspired to include an additional part introducing robust spatial-temporal forecasting: RobustTSF [1] considers time series forecasting with anomalies, while our work similarly considers spatial-temporal forecasting with noise perturbation. We also share a similar experimental protocol: artificially introducing noise and data missingness (specifically mentioned in Essential References Not Discussed). Including this paper can enhance the credibility of our experimental settings and results.

  • Linking feature diversity with the spatio-temporal heterogeneity of [2] is really intriguing. Both works consider robustness in spatial-temporal forecasting: our work attempts to enhance feature diversity, while [2] aims to capture heterogeneity. We quantify feature diversity using the feature variance (Var), whereas [2] quantifies heterogeneity using the Average Node Deviation (AND) metric:

Given the feature matrix $X \in \mathbb{R}^{n \times d}$, the AND metric is defined as:

$$D(X) = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^d (x_{ik}-x_{jk})^2$$

To link it with Var, we further expand $D(X)$ as below:

$$\begin{aligned}
D(X) &= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^d (x_{ik} - x_{jk})^2 \\
&= \frac{1}{n^2} \sum_{i,j,k} \left( x_{ik}^2 + x_{jk}^2 - 2x_{ik}x_{jk} \right) \\
&= \frac{1}{n^2} \left( 2n \sum_{k,i} x_{ik}^2 - 2 \sum_{k}\Big( \sum_{i} x_{ik} \Big)\Big( \sum_{j} x_{jk} \Big) \right) \\
&= \frac{2}{n^2} \left( n \sum_{k,i} x_{ik}^2 - n^2 \sum_{k} \bar{x}_k^2 \right) \\
&= \frac{2(n-1)}{n} \sum_{k=1}^d \frac{1}{n-1} \left( \sum_{i=1}^n x_{ik}^2 - n\bar{x}_k^2 \right) \\
&= \frac{2(n-1)}{n} \cdot \operatorname{tr}(\mathbf{Cov}),
\end{aligned}$$

where $\bar{x}_k = \frac{1}{n} \sum_{i=1}^n x_{ik}$ and $\operatorname{tr}(\mathbf{Cov}) = \sum_{k=1}^d \frac{1}{n-1} \left( \sum_{i=1}^n x_{ik}^2 - n\bar{x}_k^2 \right)$.

Thus, the AND metric is proportional to the trace of the covariance matrix $\mathbf{Cov}$:

$$D(X) \propto \operatorname{tr}(\mathbf{Cov})$$

The AND metric emphasizes overall variance, whereas our Var metric emphasizes balanced standard deviations.
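As a sanity check on this derivation, here is a minimal NumPy sketch (the random data and shapes are our own illustrative choices, not code from either paper) that numerically confirms $D(X) = \frac{2(n-1)}{n}\operatorname{tr}(\mathbf{Cov})$:

```python
import numpy as np

# Numerically verify D(X) = 2(n-1)/n * tr(Cov) on random features.
rng = np.random.default_rng(0)
n, d = 50, 8
X = rng.normal(size=(n, d))

# AND metric: average squared pairwise difference over all node pairs.
diff = X[:, None, :] - X[None, :, :]        # shape (n, n, d)
D = (diff ** 2).sum() / n**2

# Trace of the unbiased (ddof=1) feature covariance matrix.
tr_cov = np.trace(np.cov(X, rowvar=False))

assert np.isclose(D, 2 * (n - 1) / n * tr_cov)
```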

We'll extend the above in our final version.

Re: C1

Many thanks for this constructive advice! We suspect that the mentioned "extreme noise conditions" refer to real-world noise experiments (as mentioned in C2), such as data-missing scenarios. We have actually provided such evaluations and discussions in Appendix K.3. Your suggestion is helpful as it guides us towards more focused and clear settings. Nevertheless, by including robustness studies under diverse noise conditions, we have also demonstrated the consistent performance of our method.

We hope this addresses all your concerns. Thank you very much!

Reviewer Comment

I've read through the overall feedback, and it looks good. Good luck.

Just to add: I'd seriously recommend the authors incorporate my suggestions C1 and C2 in the final version; it will help the paper reach a broader audience. I also believe that adding all of the above discussions will make the paper better.

Author Comment

Dear Reviewer Mrf2,

Thank you very much for your positive feedback on our rebuttal. We greatly appreciate your valuable suggestions and will carefully consider how to reach a broader audience in future revisions. We will also add the discussions to refine our submission. Wishing you the best of luck as well!

Best regards,

Authors

Review (Rating: 3)

This paper introduces a novel MLP training method based on the Information Bottleneck (IB) principle, termed RSTIB-MLP, designed to balance model efficiency and robustness. By analyzing the dual noise effect in spatio-temporal graph data, the authors propose the Robust Spatio-Temporal Information Bottleneck (RSTIB) principle, which relaxes the Markov assumption of IB and explicitly minimizes the impact of noisy information; RSTIB-MLP is its instantiation. Experimental results demonstrate that RSTIB-MLP outperforms its counterparts in robustness against noise interference across multiple spatio-temporal benchmarks, while maintaining higher computational efficiency.

Questions for the Authors

  1. The knowledge distillation module dynamically adjusts the regularization strength through the noise impact indicator (Definition 4.9), which is calculated from the prediction error of the teacher model. Will the choice of the teacher model (such as STGCN) lead to significant deviations in the results?

  2. Have unsupervised or self-supervised methods been attempted to replace the teacher model? The process by which the authors derived their algorithm seems to have nothing to do with Knowledge Distillation (KD). Can the final loss function be directly applied to unsupervised learning?

  3. The assumption for relaxing the Z−X−Y restriction is vital for the authors' RSTIB-MLP, but it is discussed only in the appendix. Could you please provide more details in the main text?

Claims and Evidence

Yes. The authors use clear algorithms and experiments to support their claims.

Methods and Evaluation Criteria

Yes, they use classic ST benchmarks to design their experiments.

Theoretical Claims

Yes, I have checked the correctness.

Experimental Designs or Analyses

The experimental design is sound. However, there are uncertainties regarding the correspondence between simulated noise scenarios and real-world conditions, and substantial challenges exist in validating the model's noise-resistance capabilities.

Supplementary Material

Yes. However, many hyperlinks in the appendix are broken (for example, "Fig. ??" on page 16); the authors should correct these.

Relation to Broader Scientific Literature

This study addresses an intriguing research question regarding model robustness in noisy scenarios, a topic that has received limited attention in prior research.

Essential References Not Discussed

How do these recent spatio-temporal prediction methods perform under this paper's experimental settings?

[1] PDFormer: Propagation Delay-Aware Dynamic Long-Range Transformer for Traffic Flow Prediction. AAAI 2023.

[2] UrbanGPT: Spatio-Temporal Large Language Models. KDD 2024.

[3] UniST: A Prompt-Empowered Universal Model for Urban Spatio-Temporal Prediction. KDD 2024.

Other Strengths and Weaknesses

Strengths

  1. The study employs a diverse and extensive set of experimental datasets.

  2. Experimental results demonstrate the proposed method's effectiveness in handling data noise under specific scenarios.

Weakness

  1. The study lacks assessment of computational efficiency. How does the proposed method's computational performance compare with the baselines?

  2. While the proposed methodology validates noise resistance by artificially introducing noise into the original data, several critical considerations warrant discussion:

(i) Real-world scenarios present significant challenges in identifying noise occurrence patterns and frequencies, making noise characterization inherently difficult.

(ii) Both training and testing datasets contain noise, which seemingly complicates the validation of the model's 'noise resistance capability'.

How do the authors interpret and address these methodological challenges?

  3. The assumption for relaxing the Z−X−Y restriction is vital for the authors' RSTIB-MLP, but it is discussed only in the appendix. Could you please provide more details in the main text?

Other Comments or Suggestions

Please refer to Supplementary Material

Author Response

Re: Supplementary Material

We are sincerely sorry for the typos, and will correct them in our final version.

Page 16: the broken references should read "illustrated in Fig. 7(a)" and "represented in Fig. 7(b)".

Re: Essential References Not Discussed

Appendix K.7 has examined such scenarios, where transformer-based models with very large numbers of parameters (PDFormer and STAEformer) are compared.

Please note that achieving new SOTA performance is not our claim. Instead, we are dedicated to achieving a good trade-off between robustness and efficiency, which is demonstrated against SOTA STGNNs ((iii) above Related Work). We will include and discuss these three papers and further clarify this point in our final version.

Re: W1

It is evaluated theoretically (Appendix H) and empirically (Section 5.3).

Below we additionally evaluate the overall training-to-convergence time (TTCT) on PEMS04 and the training time per epoch (TTPE) on the large Weather2K-R dataset:

| Method | TTPE (s) | TTCT (s) |
| --- | --- | --- |
| RSTIB-MLP | 67.2 | 2842.3 |
| DSTAGNN | 1050.8 | 9283.7 |
| Graph-WaveNet | 556.2 | 7308.6 |
| STG-NCDE | 1436.1 | 9238.7 |
| STExplainer | 1747.6 | 24514.0 |
Our full training is clearly much faster; e.g., RSTIB-MLP reduces TTCT by up to 88.42% compared to STExplainer. Moreover, the superior efficiency of RSTIB-MLP is consistent on the large dataset.

These results will be included to ensure comprehensiveness.

Re: W2

(i) Please refer to Re: W1 in the Rebuttal for Reviewer ZfWm.

(ii) We follow the setting of RGIB as follows:

  • The train set is added with input noise and target noise.
  • The validation and test sets are added with the same input noise, while their targets remain clean (unchanged) to ensure accurate evaluation.

All methods follow these settings.
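For concreteness, a minimal NumPy sketch of this protocol; the noise scale, the array shapes, and the `add_awgn` helper are illustrative assumptions rather than the actual experiment code:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 0.3  # illustrative noise scale

def add_awgn(a: np.ndarray) -> np.ndarray:
    """Add zero-mean white Gaussian noise to an array."""
    return a + rng.normal(scale=sigma, size=a.shape)

# Placeholder clean windows: (num_windows, horizon, num_nodes).
train_x, train_y = rng.random((100, 12, 8)), rng.random((100, 12, 8))
val_x, val_y = rng.random((20, 12, 8)), rng.random((20, 12, 8))

# Train set: both input and target windows are perturbed.
train_x, train_y = add_awgn(train_x), add_awgn(train_y)

# Validation/test: only the inputs are perturbed; targets stay clean.
val_x = add_awgn(val_x)  # val_y is left unchanged for accurate evaluation
```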

Re: W3 & Q3

Yes! We will make room in the main text for the following:

  • Assumption 4.1:

The sliding window mechanism is a standard technique for processing ST data: fixed-length subsequences are extracted from the raw data by progressively sliding a window over the temporal or spatial-temporal dimensions. A critical feature is that the same data segment can serve as either input or target, creating a "dual noise effect": when a noisy sequence serves as the input $X$ in one window and as the target $Y$ in another, noise propagates bidirectionally (see the toy sketch after this list). If $Z - X - Y$ holds, then $I(Z; Y|X) = 0$, and the noisy information behind $I(Z; Y|X)$ is directly ignored. Ignoring the noise in $Y$ is therefore problematic. Since the dual noise effect lets noise influence both input and target across overlapping windows, the relaxation of $Z - X - Y$ is necessary.

  • Assumption 4.2:

ST graphs exhibit invariant patterns (generalizable across time) and variant patterns (node-specific, time-varying dynamics). Invariant patterns might represent structural dependencies (e.g., road connectivity in traffic prediction), whereas variant patterns could reflect transient events (e.g., traffic congestion due to accidents). Data dynamics thus also depend on the current window's characteristics, meaning the prediction of $Y$ is not entirely determined by $X$ but also by $Y$'s unique dynamics. Therefore, the assumption $Z - X - Y$ requires relaxation.
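The toy NumPy sketch below illustrates the dual noise effect of Assumption 4.1 (the window lengths and the injected noise burst are illustrative assumptions):

```python
import numpy as np

# The same noisy segment appears as the target of one window and as
# part of the input of a later window, so noise enters both X and Y.
series = np.arange(20.0)
series[10:12] += 5.0                  # inject a noise burst at steps 10-11

in_len, out_len = 6, 3
windows = [(series[t:t + in_len], series[t + in_len:t + in_len + out_len])
           for t in range(len(series) - in_len - out_len + 1)]

x0, y0 = windows[4]   # noisy steps 10-11 fall inside the target y0
x1, y1 = windows[8]   # the same noisy steps now fall inside the input x1
```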

Re: Q1

Please refer to Re: Experimental Designs or Analyses in the rebuttal for Reviewer WNiV.

Re: Q2

To achieve this, a method for calculating the noise impact indicators without supervised signals is required. Below we provide one possible unsupervised alternative to our KD approach:

First, perturb the noisy $X$ with additional noise to generate $X_{\text{perturb}}$. Then:

$$\hat{\alpha}_i = \frac{\exp\left(D\left(X_{\text{perturb},i}, X_i\right)\right)}{\sum_{j=1}^{N} \exp\left(D\left(X_{\text{perturb},j}, X_j\right)\right)}, \quad \forall i \in \{1, \ldots, N\}$$

The subsequent training process is the same as for RSTIB-MLP with (w/) KD.
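A minimal sketch of this unsupervised alternative, taking the distance $D$ to be per-series MSE (our own illustrative choice, as are the shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 8, 64                          # number of series, window length
X = rng.random((N, T))                # noisy inputs

perturb_ratio = 0.3
X_perturb = X + rng.normal(scale=perturb_ratio, size=X.shape)

dist = ((X_perturb - X) ** 2).mean(axis=1)     # D(X_perturb_i, X_i) per series
alpha_hat = np.exp(dist) / np.exp(dist).sum()  # softmax-normalized indicators
```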

We investigate perturbation ratios of 0.1, 0.3, 0.5 and noise ratios of 0.1, 0.3, 0.5 to compare against RSTIB-MLP w/ KD. The best results for the unsupervised method (RSTIB-MLP w/ Aug) across all perturbation ratios are reported below:

| Noise Ratio | RSTIB-MLP w/ | MAE | RMSE | MAPE (%) |
| --- | --- | --- | --- | --- |
| 0.1 | KD | 23.64 | 36.44 | 15.22 |
| 0.1 | Aug | 24.02 | 36.76 | 15.55 |
| 0.3 | KD | 27.15 | 42.85 | 17.19 |
| 0.3 | Aug | 27.62 | 43.73 | 17.69 |
| 0.5 | KD | 27.16 | 43.43 | 17.76 |
| 0.5 | Aug | 27.86 | 44.54 | 18.32 |

The results demonstrate that RSTIB-MLP w/ Aug cannot outperform RSTIB-MLP w/ KD. Potential reasons (risks):

  • The noise ratio is difficult to characterize precisely. Fixed perturbation ratios and augmented $X_{\text{perturb}}$ samples may not align with real scenarios. Even adjustable ratios still pose this risk.

  • Using only information from $X$ neglects the dynamics unique to $Y$ (Assumption 4.2). Calculating the noise impact indicators solely from $X$ thus yields relatively inaccurate quantification.

This underscores the importance of our KD approach.

We hope this addresses all your concerns. Thank you very much!

Reviewer Comment

Thank you for your response. The explanation provided by the authors is reasonable and convincing. Although the proposed method may not achieve state-of-the-art performance, the model demonstrates strong resistance to adversarial noise, which is important for practical traffic prediction scenarios. Therefore, I am willing to increase my score.

Author Comment

Dear Reviewer qjmA,

Thank you very much for your positive feedback and acknowledgment. We will add the discussions in our final version. We greatly appreciate your recognition of our submission and your efforts in reviewing our work. Please feel free to reach us if you have any further questions or suggestions.

Best regards,

Authors

Review (Rating: 3)

The authors disclose the dual noise effect behind spatial-temporal data noise and propose a theoretically grounded principle, termed the Robust Spatial-Temporal Information Bottleneck (RSTIB), which holds broad potential for enhancing the robustness of different types of models. Comprehensive experimental results show that RSTIB-MLP achieves an excellent trade-off between robustness and efficiency compared to state-of-the-art STGNNs and MLP models.

update after rebuttal

The author's detailed response has largely addressed my initial concerns, so I will maintain my positive assessment.

Questions for the Authors

See Weaknesses.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes.

Experimental Designs or Analyses

Yes.

Supplementary Material

Yes. I have reviewed all the supplementary material provided by the authors.

Relation to Broader Scientific Literature

The authors summarize the previous work in spatial-temporal prediction and point out the existing problems, which are also the challenges that this paper aims to solve.

Essential References Not Discussed

The references are comprehensive.

Other Strengths and Weaknesses

Strengths:

  1. The RSTIB principle is based on information theory and provides a solid theoretical foundation for model design.

  2. In the experimental evaluation, the baselines and datasets used are relatively comprehensive, including some large-scale datasets.

  3. The proposed RSTIB-MLP achieves SOTA predictive performance while requiring fewer computing resources.

Weaknesses:

  1. The noise in the data is assumed to be additive white Gaussian noise (AWGN). Does the derivation hold if the noise type is not AWGN?

  2. How does knowledge distillation enhance feature diversity? The authors should further clarify the underlying mechanism.

Other Comments or Suggestions

See Weaknesses.

Author Response

Thank you very much for your positive and constructive comments. We provide a point-by-point response as follows.

Re: W1

Please note that the derivation of the RSTIB principle and the implementation of RSTIB-MLP do not make assumptions regarding the noise type. AWGN is employed in our experiments primarily because it is commonly adopted as a noise model in information theory and experimental measurements within time series forecasting ([Ref1]).

To better address this concern, we have provided further empirical study in Appendix K.3. This section details how RSTIB-MLP handles data missing scenarios compared with selected baselines, and also the contribution of each module under such noisy conditions.

Re: W2

In our setting, the Lagrange multipliers are set to constants. This means that the balance between preserving the target and applying regularization is fixed across different time series, which may not be optimal. Moreover, regularization further limits the capacity of the MLP, preventing the model from fully learning complex features in the data. Therefore, we aim to better balance the preservation of the target and the regularization terms across different time series.

Inspired by the observation that the feature variance largely decreases as the noise ratio increases, we leverage the KD module, based on the noise impact indicator, to achieve what is described in the Training Regime: when the noise impact is low, we relax the KD-based regularization; when the noise impact is significant, we intensify it. As a result, KD can increase the feature variance in cases where the noise impact indicator shows high noise impact. Note that the noise impact indicators quantify the noise impact on each time series, and this information is used to dynamically tune the optimization of our model.

We hope we have addressed all your concerns. Thank you very much!

[Ref1] Teck Por Lim and Sadasivan Puthusserypady. Chaotic time series prediction and additive white Gaussian noise. Physics Letters A, 365(4):309–314, 2007.

Review (Rating: 3)

"This paper theoretically motivates and implements a novel regularization technique for training MLP-based models in spatiotemporal forecasting. The MLP models are distilled from state-of-the-art architectures that leverage graph neural networks. KL divergence terms between the data (input, output, encoded) and assumed Gaussian noise levels were used in the proposed loss function. Extensive experiments demonstrate the effectiveness of the proposed approach, highlighting its potential to enhance forecasting performance

Questions for the Authors

N/A

Claims and Evidence

Yes

Methods and Evaluation Criteria

Comparison with a wide range of spatio-temporal algorithms demonstrates the advantages of the proposed method across different settings.

Theoretical Claims

No

Experimental Designs or Analyses

Experimental results show the robustness of the proposed method across several noise levels and also in the case of a clean signal (clean signals normally contain a small amount of measurement noise, and the proposed method was able to improve performance in these cases too).

However, I'm not clear on the details of the distillation process, which is a critical point in this paper. What are the teacher model(s) used in the experiments in Table 2? How would the results differ if other teacher models were used? What is the performance of the teacher model before distillation? Was the distillation based only on the output of the network, or were intermediate features also used?

Supplementary Material

No

Relation to Broader Scientific Literature

The proposed loss function can have implications that go beyond weather forecasting and extend to applications using sensor-based measurements in general.

Essential References Not Discussed

No

Other Strengths and Weaknesses

The paper is missing important implementation details:

  1. What loss function is used in the MLPs?
  2. Were the teacher models pretrained, or did you train them from scratch? In general, more details on the distillation procedure should be given.

Other Comments or Suggestions

I suggest changing the paper title, as I think it should emphasize efficiency. How about using the answer to the question you posed in the abstract as the title: "Can simple neural networks such as ..." or something similar?

Author Response

Thank you very much for your positive and constructive comments. We provide a point-by-point response as follows.

Re: Experimental Designs Or Analyses

  • "What are the teacher model(s) used in the experiments in Table 2?"

Teacher model selection settings are described below Figure 4: "Our method is teacher model agnostic (Appendix. K.10), where we set the default teacher model to STGCN."

  • "How will the results differ if other teacher models were used?"

This is discussed in Appendix K.10, where we directly compare empirical results obtained with different teacher models; Table 18 provides the details, and the procedure for preparing the teacher model is also described there. We demonstrate that the superior performance of RSTIB-MLP is independent of the teacher model choice by showing that RSTIB-MLP achieves strong robustness even when an MLP is selected as the teacher model, outperforming or performing comparably to some state-of-the-art (SOTA) STGNNs. The potential reason behind this is elaborated in Appendix K.10. We will clarify this further in our final version.

  • "What is the performance of the teacher model before distillation?"

Consistent with the previous response, we present an empirical study in Appendix K.10 to demonstrate that RSTIB-MLP’s performance is independent of the original teacher model's performance. Even when using a basic MLP as the teacher model, RSTIB-MLP can achieve better or comparably good robustness relative to some SOTA STGNNs.

  • "Was this a distillation using only the output of the network, or were intermediate features used?"

Only the output of the network is utilized. Specifically, Definition 4.9 clarifies that the teacher model's output is used exclusively to compute the noise impact indicator.

Re: Other Strengths And Weaknesses

  • "What loss function is used in the MLPs?"

Below is our loss function:

$$\mathcal{L}_{\text{RSTIB-MLP}} = \sum_{i=1}^{N}\left[-\mathcal{L}_{\text{reg}}(Y_i^S, \tilde{Y}_i)\right] + \sum_{i=1}^{N}(1 + \hat{\alpha}_i)\left(\lambda_x \mathcal{L}_{x,i} + \lambda_y \mathcal{L}_{y,i} + \lambda_z \mathcal{L}_{z,i}\right)$$

This is also described in Eq.(6). Specifically, $\mathcal{L}_{\text{reg}}(Y_i^S, \tilde{Y}_i)$ is the lower bound of $I(Z;\tilde{Y})$; please refer to Proposition 4.8 for implementation details and proofs. Additionally, $\mathcal{L}_{x,i}$, $\mathcal{L}_{y,i}$, and $\mathcal{L}_{z,i}$ represent the corresponding regularization terms applied to the input, target, and representation regions, respectively. For their analytical calculations, please refer to Propositions 4.6 and 4.7. Descriptions of how we implement these regularizations are provided above Proposition 4.6 (for input and target) and Proposition 4.7 (for representation regularization). Moreover, $\hat{\alpha}$ is the noise impact indicator defined in Definition 4.9, serving to dynamically adjust the training of RSTIB-MLP when incorporating these regularization techniques. A schematic sketch of the overall loss is given below.
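The following schematic PyTorch sketch illustrates the structure of this objective; the MSE fit term is only a stand-in for $-\mathcal{L}_{\text{reg}}$, and the regularizers are assumed to be precomputed per-series values (both simplifications are ours, not the paper's implementation):

```python
import torch

def rstib_mlp_loss(pred, target, L_x, L_y, L_z, alpha_hat,
                   lam_x=0.1, lam_y=0.1, lam_z=0.1):
    """pred/target: (N, ...) per-series predictions and noisy targets.
    L_x, L_y, L_z: per-series regularization terms, each of shape (N,).
    alpha_hat: per-series noise impact indicators, shape (N,)."""
    # MSE stands in for the -L_reg lower-bound term of Proposition 4.8.
    fit = ((pred - target) ** 2).flatten(start_dim=1).mean(dim=1)
    # The noise impact indicator scales the regularization per series.
    reg = (1.0 + alpha_hat) * (lam_x * L_x + lam_y * L_y + lam_z * L_z)
    return (fit + reg).sum()
```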

  • "Were the teacher models pretrained or trained from scratch? In general, more details on the distillation procedure should be given."

Relevant details are included in Appendix K.10 due to space limitations; we apologize for the unclear statement in the main text. We trained the teacher model from scratch ourselves. The knowledge distillation procedure is implemented as follows:

1). Train the teacher model from scratch, following the same procedure as the student model.

2). Freeze the teacher model's parameters and start training RSTIB-MLP.

3). Only the output of the teacher model is leveraged to calculate the noise impact indicator as defined in Definition 4.9.

According to Definition 4.9, the noise impact indicator is normalized with the Softmax function, reflecting the relative relationships among the time series; a minimal sketch is given below.
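A minimal PyTorch sketch of step 3, assuming per-series MAE as the teacher's error metric (an illustrative choice; the shapes are hypothetical):

```python
import torch

# The frozen teacher's per-series prediction error is softmax-normalized
# into the noise impact indicator (our reading of Definition 4.9).
rng = torch.Generator().manual_seed(1)
N, T = 8, 32
teacher_pred = torch.rand((N, T), generator=rng)   # frozen teacher forecasts
targets = torch.rand((N, T), generator=rng)        # (noisy) training targets

err = (teacher_pred - targets).abs().mean(dim=1)   # per-series MAE
alpha_hat = torch.softmax(err, dim=0)              # relative noise impact
```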

Re: Other Comments Or Suggestions

Many thanks; we really appreciate the suggested idea! Yes, efficiency is a key advantage that we aim to demonstrate, which is why we include "MLPs" in our title to stress the point of our method. We will consider this constructive suggestion in our final version.

We hope we have addressed all your concerns. Thank you very much!

Final Decision

This is an interesting and innovative idea. The reviewers reached a consensus in their scores (cf. the specific reviews).