PaperHub
ICML 2025 (Poster), 3 reviewers
Rating: 5.5/10 (individual scores 3 / 3 / 3; min 3, max 3, std dev 0.0)

Residual TPP: A Unified Lightweight Approach for Event Stream Data Analysis

OpenReview · PDF
Submitted: 2025-01-20 · Updated: 2025-07-24

Abstract

Keywords
Temporal point process, EasyTPP, residual, event stream, weighted method

Reviews and Discussion

Official Review
Rating: 3

Residual Temporal Point Process (TPP) is introduced as a novel method for event stream data analysis, unifying statistical and neural TPP approaches through Residual Events Decomposition (RED). RED uses a weight function to quantify how well the intensity function captures event characteristics and identify residual events. The method integrates RED with a Hawkes process to capture the self-exciting nature of event stream data, and then uses a neural TPP to model the residual events. Experiments show that Residual TPP achieves state-of-the-art goodness-of-fit and prediction performance across multiple domains and is computationally efficient. The RED technique is the first TPP decomposition method and can be integrated with any TPP model to enhance its performance.

Questions For Authors

  1. Is the threshold w arbitrarily chosen? If not, how do you choose a good value?
  2. Why is the proposed stepwise approach lightweight? Residual TPP itself is, but the overall approach may not be.

Claims And Evidence

Regarding efficiency: the authors propose Residual TPP as an efficient approach compared to other neural TPPs. Residual TPP, however, is a 3-step procedure. The authors evaluate efficiency by comparing the runtime per epoch of step 2 of their model against baseline models whose learning is end-to-end. In my opinion this is an unfair and somewhat problematic comparison.

Regarding seasonal-trend decomposition: the authors mention that Residual TPPs are inspired by work in time series, where such decomposition is common (they do cite many TS works). However, they do not include any reference to support their claim of periodic patterns in modeling event streams (see lines 208-219). I find the decomposition here less convincing, and/or it could be motivated more. Here are two examples I have read before that have a similar flavor: Loison et al., "UNHaP: Unmixing Noise from Hawkes Process to Model Physiological Events," AISTATS 2025; Zhang et al., "Learning to Select Exogenous Events for Marked Temporal Point Process," NeurIPS 2021.

Methods And Evaluation Criteria

The key equations (2)-(4) in the proposed method are not well explained. (Later on I found similar equations in the Zhang et al. arXiv paper "Learning under Commission and Omission Event Outliers" (posted Jan 23, 2025), but the authors did not cite or mention it in their paper.)

I understand the authors are trying to decide, for each point, whether it should be considered part of the normal pattern in the Hawkes process or part of the residual, via the weight $W_i(S;\theta)$. It is hard to understand what each term means when they are put together and why they should work, beyond the asymptotic justification in Section 3.3.

The benchmarks and evaluations are typical. My concern about the evaluation of computational efficiency is explained above.

Theoretical Claims

I did not check the correctness of Proposition 3.1 and Theorem 3.2 because I don't quite grasp Eqs. (2)-(4).

I later noted that Proposition 3.1 is Lemma 1 in the Zhang et al. arXiv paper "Learning under Commission and Omission Event Outliers" (posted Jan 23, 2025).

Experimental Designs Or Analyses

I am not sure about the experimental validity of a fair comparison between the proposed model and the baselines. The authors did make some effort: "To ensure a fair comparison, the training parameters and procedures for the corresponding neural TPP models trained on the original data are kept consistent with those for models trained on the residual events filtered by RED."

Regarding the results of Tables 2 & 3, I am less convinced that the Res version outperforms its neural TPP counterparts, since a neural TPP should be able to capture $\lambda^{(1)}(t) + \lambda^{(2)}(t)$ jointly, compared to the proposed approach, which uses Hawkes to model $\lambda^{(1)}(t)$ and a neural TPP to model $\lambda^{(2)}(t)$ and then combines them for inference. Maybe the proposed Res version has more hyperparameters to tune to get better results?

Supplementary Material

I scanned through the appendix for the baselines and experiments.

Relation To Broader Scientific Literature

The key contributions of the paper are okay, but the authors should argue how theirs differ from: Loison et al., "UNHaP: Unmixing Noise from Hawkes Process to Model Physiological Events," AISTATS 2025; Zhang et al., "Learning to Select Exogenous Events for Marked Temporal Point Process," NeurIPS 2021; and the Zhang et al. arXiv paper "Learning under Commission and Omission Event Outliers" (posted Jan 23, 2025).

Essential References Not Discussed

Maybe the authors should mention Loison et al., "UNHaP: Unmixing Noise from Hawkes Process to Model Physiological Events," AISTATS 2025, and Zhang et al., "Learning to Select Exogenous Events for Marked Temporal Point Process," NeurIPS 2021.

Other Strengths And Weaknesses

Other strength: Flexibility. RED is a plug-and-play module that can be integrated with any TPP model to enhance its performance.

Other weakness:
Limitation: The RED technique has limited scope, in that the true signal is assumed to come from a Hawkes process.

Other Comments Or Suggestions

  1. I think the authors should motivate the concept of periodicity in event streams more. In the paper it is limited to Hawkes data.
  2. I also think the authors could explain Eqs. (2)-(4) better, maybe with an example (or just refer to the Zhang et al. paper).
  3. My last suggestion: could the authors create a synthetic example where the residuals are clearly known and conduct experiments on it?

Ethics Review Issues

N/A

Author Response

Thank you for your detailed and thoughtful feedback!

Q1: Is the threshold $w$ arbitrarily chosen? If not, how do you choose a good value?

Please refer to our response to Reviewer H11P's Q2.

Q2: Why is the proposed stepwise approach lightweight?

Residual TPP follows a 3-step procedure: (1) fitting a Hawkes process (HP); (2) applying RED; (3) training a neural TPP on the residuals. We originally evaluated efficiency by comparing the runtime per epoch of Step 3, as the neural TPP training is computationally intensive; hence the majority of ResTPP's computational cost comes from this step. In contrast, HP fitting in Step 1, performed using the Tick library, is efficient and fast, while Step 2, which computes weights based on the fitted HP, is even faster due to the fixed parametric intensity form.
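
A minimal sketch of this 3-step pipeline, assuming the Tick library's HawkesExpKern learner for Step 1; `red_weights` (the Eq. (2)-(4) weight computation) and `train_neural_tpp` (e.g., an EasyTPP training run) are hypothetical stand-ins for components not spelled out on this page:

```python
from tick.hawkes import HawkesExpKern

def residual_tpp(realizations, w=0.5, decay=1.0):
    # Step 1: fit a parametric Hawkes process (fast, few parameters).
    # Tick expects a list of realizations, each a list of per-type arrays.
    hp = HawkesExpKern(decays=decay)
    hp.fit(realizations)

    # Step 2: RED -- weight each event by how well the fitted Hawkes
    # intensity explains it; low-weight events form the residual stream.
    residuals = []
    for seq in realizations:
        wts = red_weights(hp, seq)   # hypothetical: one weight array per type
        residuals.append([ts[wt < w] for ts, wt in zip(seq, wts)])

    # Step 3: train a neural TPP only on the (much smaller) residual stream;
    # at inference, the Hawkes and neural intensities are superposed.
    neural = train_neural_tpp(residuals)   # hypothetical EasyTPP call
    return hp, neural
```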

We acknowledge your point and have included end-to-end runtime comparisons (HP fitting + RED + neural TPP training) against baselines, as shown in Table 1 (https://anonymous.4open.science/r/ResidualTPP-3695/Re_QaBc.pdf). These results further highlight ResTPP's overall efficiency, demonstrating its computational advantage despite its stepwise nature.

Experimental Designs Or Analyses:

While neural TPPs can theoretically capture $\lambda^{(1)}+\lambda^{(2)}$ jointly as their width or depth approaches infinity, they face practical limitations: deep neural networks require large datasets and long training times to learn complex patterns.

The superior performance of ResTPP comes from two sources:
(1) Theoretically, the model space of a neural TPP is contained within the model space of RED + neural TPP. In other words, optimizing over a larger model space can reduce estimation error.
(2) Practically, RED + neural TPP helps avoid overfitting. The first- and second-order features of TPPs are well captured by the MHP, leading to better generalization.
Overall, with the same model architecture, the RED technique can easily enhance performance compared to using a neural TPP alone.

Strengths And Weaknesses: The RED technique has limited scope where the true signal is from an HP.
Suggestion 1: I think the authors should motivate the concept of periodicity more. In the paper it is limited to Hawkes data.
Suggestion 3: Can the authors create a synthetic example where the residuals are known and conduct experiments?

Thank you for the comment, but we respectfully disagree. While the paper uses HP as a representative example of a statistical TPP for clarity, RED is intentionally designed as a plug-and-play module compatible with any base TPP model. The use of HP is explained in Appendix B and does not limit RED's scope.

To validate RED's robustness, we further simulate datasets using different true signals $\lambda^{(1)}$ (not Hawkes) and residual intensities $\lambda^{(2)}$.
(1) Poisson-based: We generate a non-homogeneous Poisson process with 5 event types, each with a different periodic triangular function for $\lambda^{(1)}$, and set the residuals to follow $\lambda^{(2)}=0.1$, a homogeneous Poisson process. The combination of the two generates a Poisson-based dataset.
(2) AttNHP-based: We use the AttNHP model for $\lambda^{(1)}$ and a homogeneous Poisson process $\lambda^{(2)}=0.1$ for the residuals.
(3) Poisson + AttNHP: We use the same periodic non-homogeneous Poisson process for $\lambda^{(1)}$ and AttNHP for $\lambda^{(2)}$.
Descriptive statistics for the simulated datasets are provided in Table 2 (https://anonymous.4open.science/r/ResidualTPP-3695/Re_QaBc.pdf). A minimal sketch of the Poisson-based simulation appears below.
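
The sketch below generates the Poisson-based dataset via thinning for the non-homogeneous part; the triangular wave's periods, amplitude, and horizon are illustrative placeholders, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def triangular_rate(t, period=4.0, peak=1.0):
    """Periodic triangular intensity for one event type."""
    phase = (t % period) / period
    return peak * (2.0 * phase if phase < 0.5 else 2.0 * (1.0 - phase))

def thinning_nhpp(rate_fn, T, lam_max):
    """Sample a non-homogeneous Poisson process on (0, T] by thinning."""
    cand = np.sort(rng.uniform(0.0, T, size=rng.poisson(lam_max * T)))
    accept = rng.uniform(size=len(cand)) < np.array([rate_fn(t) for t in cand]) / lam_max
    return cand[accept]

T = 100.0
# Signal: one periodic non-homogeneous Poisson stream per event type.
signal = {k: thinning_nhpp(lambda t: triangular_rate(t, period=3.0 + k), T, 1.0)
          for k in range(5)}
# Residual: homogeneous Poisson process with rate 0.1, superposed on the signal.
residual = np.sort(rng.uniform(0.0, T, size=rng.poisson(0.1 * T)))
```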

We compare the performance of ResTPP and the baseline neural TPPs on these simulated datasets. As shown in Table 3, ResTPP consistently enhances the performance of neural TPPs through RED, even when the true signal does not follow an HP or when it exhibits periodicity.

Suggestion 2: I think the authors can explain Eqs. (2)-(4) better.

Appendix C.1 explains $\phi'(x)$, with Fig. 3 visualizing its behavior under different parameter settings. Fig. 4 in Appendix C.2 shows the distribution of weights. We have cited Zhang et al. (2025, arXiv), "Learning under Commission and Omission Event Outliers," in Section 3.3, as our weight function is inspired by their work. Still, we appreciate your suggestion and will make this clearer in the camera-ready version.

Relation To Broader Scientific Literature:

Thanks for pointing out valuable related work. Loison et al. (2024) introduce UNHaP, a framework that differentiates structured physiological events, modeled through an MHP, from spurious detections, modeled as Poisson noise. UNHaP assumes that the true signal follows an HP with specific Poisson noise, whereas our method offers greater flexibility in handling arbitrary noise. Zhang et al. (2021) propose a more computationally demanding method that selects exogenous events through a best-subset-selection framework, whereas our method is more lightweight and efficient. We will include more discussion of the literature in the camera-ready version.

We have carefully addressed the main concerns and hope our revisions meet your expectations. Thanks again for your time and expert feedback.

Reviewer Comment

Thank you for your response. My concerns have mostly been addressed. I will go ahead and increase your score.

Author Comment

Thank you very much for taking the time to revisit our submission. We are grateful that our efforts to address your concerns have been recognized, and we sincerely appreciate the increased score. Your thoughtful review and feedback mean a great deal to us and have helped improve the quality of our paper.

Official Review
Rating: 3

The paper proposes decomposing a TPP into two models: one a traditional model like Hawkes, the other a neural model. First, a Hawkes model is fit to the sequence. Then the residual events are identified using an influence function. A neural model is fit to the residual points, and the overall model is the sum of the two intensities. The experimental section shows that this works better than using a single neural TPP model on different dataset-model combinations.

Questions For Authors

  • Is there any alternative to using an influence function?

  • This also seems like a novel way of finding anomalies in a TPP sequence. Would you agree? If yes, why not add an experiment on that?

  • I don't see an experiment of starting with a neural TPP and fitting a second neural TPP as a residual model. Or alternatively, taking a simple TPP and fitting a simple TPP residual. Additionally, one could take it a step further and find another residual. Do you expect this would work or not?

  • What does Section 3.3 say about the connection between the way points are discarded from the Hawkes model and the fact that additive intensity implies uniformly rejecting some points?

Claims And Evidence

All claims seem to be supported by evidence.

Methods And Evaluation Criteria

The benchmarking was done on traditional datasets and using well-established baseline models.

Theoretical Claims

I checked all the theoretical claims but Theorem 3.2.

Experimental Designs Or Analyses

The experimental design is valid.

Supplementary Material

I read the supplementary material from Section B onward.

Relation To Broader Scientific Literature

To the best of my knowledge this is a novel approach for finding "anomalies" in a TPP, applied to fitting a residual model.

Essential References Not Discussed

It could use a bit more discussion of influence functions, an overview of the existing works, and their connections to this paper. No specific papers in mind. I think Lüdke et al., "Add and Thin" (2023), is also relevant as it shares some ideas, but this paper is distinctly different.

Other Strengths And Weaknesses

Strengths: original approach, clear motivation and implementation, theoretical justification, good empirical results.

Weaknesses: not enough discussion of the influence function choice, either showing why this function was chosen over alternatives or positioning it better within the influence-function literature. Some other issues are raised in the questions section.

Other Comments Or Suggestions

N/A

Author Response

Thank you for your insightful and positive feedback!

Q1: Is there any alternative to using an influence function?

Please refer to our response to Reviewer H11P's Q1 and Table 1 in https://anonymous.4open.science/r/ResidualTPP-3695/Re_H11P.pdf.

Q2: This also seems like a novel way of finding anomalies in a TPP sequence.

We agree with you and have supplemented our work with a simulation experiment demonstrating that RED can successfully identify anomalies in TPP sequences. We simulate a 1D Hawkes process with intensity function $\lambda(t)=0.5+0.8\int_0^t e^{-s}\,dN(s)$, $0<t\leq 20$, generating a dataset with 300 sequences. We then divide $(0,20]$ into 20 subintervals of length 1 and randomly select one subinterval in each sequence in which to insert anomaly events. The times of these anomalies are chosen uniformly at random within the selected interval, and anomalies account for 21% of the total events. We apply the RED technique to compute the weight values of all events, perform a moving average, and predict the anomaly interval as the one with the smallest weights. RED achieves an accuracy of 89.0%, demonstrating its ability to detect anomalous intervals in TPPs. Due to space limitations, we will include additional experiments on applying RED to anomaly detection in the camera-ready version.
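
A minimal sketch of this experiment is below, assuming Ogata's thinning algorithm for the Hawkes simulation; the per-event RED weights (Eqs. (2)-(4)) are not reproduced here, so `weights` is a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_hawkes(mu=0.5, alpha=0.8, beta=1.0, T=20.0):
    """Ogata thinning for a 1D exponential-kernel Hawkes process."""
    t, events = 0.0, []
    while True:
        # The intensity just after t upper-bounds lambda until the next event.
        lam_bar = mu + alpha * np.sum(np.exp(-beta * (t - np.array(events))))
        t += rng.exponential(1.0 / lam_bar)
        if t >= T:
            return np.array(events)
        lam_t = mu + alpha * np.sum(np.exp(-beta * (t - np.array(events))))
        if rng.uniform() <= lam_t / lam_bar:   # accept w.p. lambda(t)/lam_bar
            events.append(t)

def insert_anomalies(events, n_intervals=20, frac=0.21):
    """Insert uniform anomalies into one randomly chosen unit subinterval."""
    k = int(rng.integers(n_intervals))
    n_anom = max(1, round(frac / (1.0 - frac) * len(events)))
    anomalies = rng.uniform(k, k + 1, size=n_anom)
    return np.sort(np.concatenate([events, anomalies])), k

def predict_anomaly_interval(events, weights, n_intervals=20, window=5):
    """Smooth the per-event weights with a moving average, then flag the
    subinterval whose events have the smallest mean smoothed weight."""
    smoothed = np.convolve(weights, np.ones(window) / window, mode="same")
    means = [smoothed[(events >= k) & (events < k + 1)].mean()
             if np.any((events >= k) & (events < k + 1)) else np.inf
             for k in range(n_intervals)]
    return int(np.argmin(means))
```

Accuracy would then be the fraction of the 300 sequences whose predicted interval matches the inserted one.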

However, we chose to omit this part in our original paper, as the primary contribution lies in introducing RED as the first general decomposition framework for TPPs, rather than its application to specific tasks such as anomaly detection. While this topic falls outside the scope of the current paper, future research may explore enhanced RED variants for anomaly detection in TPPs.

Q3: I don't see an experiment of starting with a neural TPP and fitting a second neural TPP as a residual model. Or alternatively, taking a simple TPP and fitting a simple TPP residual. Additionally, one could take it a step further and find another residual. Do you expect this would work or not?

Thanks for your valuable suggestions. We have added experiments using simple TPP + RED + simple TPP and neural TPP + RED + neural TPP to further refine our work. As shown in Table 1 (https://anonymous.4open.science/r/ResidualTPP-3695/Re_Jjzj.pdf), the combination MHP + RED + MHP may yield worse results, as the model complexity of "MHP + MHP" is twice that of a single MHP; evidently, the residuals do not follow an MHP, leading to overfitting. We then select NHP as the example base neural model for residual filtering. For each baseline neural TPP, we compare its performance with the original RED using Hawkes and with RED using NHP. The results demonstrate that Residual TPPs with the RED technique consistently outperform the baseline, whether using Hawkes or NHP as the base model. This highlights that the RED technique, as a plug-and-play module, can effectively enhance the performance of TPPs.

We also would like to clarify that one of the key advantages of our method is its lightweight nature. Our goal is to capture statistical properties with a simple TPP and refine the residual part using a neural TPP, thereby accelerating neural TPP computation with fewer events. While combining neural TPP + RED + neural TPP may yield better performance, it would also introduce significantly higher computational complexity.

Additionally, this work presents the first decomposition method for TPPs. We use a self-defined weight value to filter events and obtain residuals. We believe future work can explore more advanced decomposition methods to derive alternative residuals, offering significant potential for further development.

Q4: What does Section 3.3 say about the connection between the way points are discarded from the Hawkes model and the fact that additive intensity implies uniformly rejecting some points?

We leverage the superposition property of TPPs. The weight function is used to decide whether an event comes from the Hawkes model or the residual model. It shares a similar spirit with using rejection sampling to determine the source of each event.
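
A small sketch of the superposition property invoked here: for a TPP with additive intensity $\lambda^{(1)}+\lambda^{(2)}$, each event can be attributed to the first component with probability $\lambda^{(1)}(t)/(\lambda^{(1)}(t)+\lambda^{(2)}(t))$. The intensity callables below are assumptions; RED replaces this random draw with a deterministic weight threshold in the same spirit.

```python
import numpy as np

def attribute_events(times, lam1, lam2, rng=None):
    """Assign each event at time t to component 1 w.p. lam1(t) / (lam1(t) + lam2(t))."""
    if rng is None:
        rng = np.random.default_rng(0)
    p1 = np.array([lam1(t) / (lam1(t) + lam2(t)) for t in times])
    from_first = rng.uniform(size=len(times)) < p1
    return times[from_first], times[~from_first]
```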

Essential References Not Discussed: It could use a bit more overview of the existing works and their connections to this paper. No specific papers in mind. I think Lüdke et al., "Add and Thin" (2023), is also relevant as it shares some ideas, but this paper is distinctly different.

Thanks for the suggestion! Lüdke et al. (2023) introduce ADD-THIN, a diffusion-inspired TPP model that samples entire event sequences at once and excels at forecasting. This inspired us to explore how future research could develop new decomposition techniques to enhance the forecasting ability of standard autoregressive TPP models. More literature review on Meta TPP, UNHaP, and related works can be found in our responses to Reviewers H11P and QaBc. We will include more discussion in the camera-ready version.

Thank you again for your time and valuable suggestions. We would be happy to clarify any further concerns.

Official Review
Rating: 3

The paper proposes Residual TPP, a hybrid framework combining classical statistical TPPs (e.g., Hawkes processes) and neural TPPs through Residual Events Decomposition (RED). This computationally efficient approach leverages Hawkes processes for self-excitation/periodicity and neural TPPs for residuals, reducing training costs while improving performance. Empirical validation shows state-of-the-art results on six real-world datasets (e.g., MIMIC-II, Retweet, Volcano) for goodness-of-fit, event time/type prediction.

Questions For Authors

  1. Why is $\phi'(x)$ defined with piecewise quadratic decay (Eq. 4) instead of smoother alternatives (e.g., sigmoidal transitions)?
  2. How is $w$ selected in practice? Is cross-validation used?
  3. Does RED's preprocessing (Hawkes fitting + RED decomposition) introduce overhead that negates the training-time savings for small datasets? Will it lead to overfitting?

Claims And Evidence

  1. RED is the first decomposition technique for TPPs, inspired by time series decomposition. This addresses a gap in TPP analysis.
  2. RED is model-agnostic and integrates with various neural TPP architectures (RNN-, attention-, ODE-based).

Methods And Evaluation Criteria

The residual threshold $w$ is treated as a hyperparameter without systematic guidelines for tuning. The impact of $w$ on performance/complexity trade-offs is underexplored.

Theoretical Claims

The theoretical justification for RED’s weight function (Section 3.3) relies on Proposition 3.1 and Theorem 3.2 but seems to lack a formal proof of how RED ensures unbiased estimation or optimal residual separation.

Experimental Designs Or Analyses

Missing comparisons with recent hybrid TPP frameworks (e.g., Meta TPP [Bae et al., 2023]) and decomposition-inspired methods (e.g., Autoformer [Wu et al., 2021] adaptations for TPPs). Consider adding benchmarks against hybrid models and decomposition-based TPP variants.

Supplementary Material

Yes. Appendices A to E were reviewed.

Relation To Broader Scientific Literature

RED is the first decomposition technique adapted to TPPs; it bridges the gap between simple statistical TPPs and expressive neural TPPs and provides insights for developing further methods.

Essential References Not Discussed

While RED is novel for TPPs, its conceptual similarity to residual learning in deep networks (e.g., ResNet) and ensemble modeling (e.g., boosting) is underemphasized. Clarify how RED differs from generic residual learning frameworks.

Other Strengths And Weaknesses

The choice of $\phi'(x)$ in Equation (4) is heuristic. No ablation studies validate its superiority over alternative influence functions.

Other Comments Or Suggestions

The baseline Hawkes process assumes non-inhibitory effects and fixed parametric forms (e.g., exponential decay). This may limit its ability to capture complex periodicities. The authors may also compare with other statistical TPPs as the base model for RED.

Author Response

Thank you for the detailed suggestions.

Q1: Why is $\phi'(x)$ defined with piecewise quadratic decay instead of smoother alternatives?

We acknowledge this concern and have incorporated new experiments using the function $\phi'(x)=\frac{(1+\alpha)(x+1)}{(x+1)+\alpha\exp(x)}$, which is smooth across its domain and preserves the "unbiasedness" of Prop. 3.1. As shown in Table 1 (https://anonymous.4open.science/r/ResidualTPP-3695/Re_H11P.pdf), ResTPP with either of the two influence functions achieves better performance than the baselines. However, the core novelty of our work lies in introducing RED as the first general decomposition framework for TPPs. The modularity of RED allows $\phi'(x)$ to be replaced with any valid influence function with the "unbiasedness" property. We also encourage future research to explore enhanced variants.
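
For reference, the quoted smooth influence function is transcribed below; $\alpha$ is its shape parameter. Note $\phi'(0)=1$ and $\phi'(x)\to 0$ as $x\to\infty$, since the exponential term dominates the denominator.

```python
import numpy as np

def phi_prime(x, alpha=1.0):
    """Smooth influence function: (1+a)(x+1) / ((x+1) + a*exp(x))."""
    return (1.0 + alpha) * (x + 1.0) / ((x + 1.0) + alpha * np.exp(x))
```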

Q2: How is the residual threshold $w$ selected?

Fig. 4 in Appendix C.2 shows the distribution of weight values on different datasets. As observed, each distribution exhibits a truncation near a weight value of 0.8, with a substantial portion of the weights concentrated at 0. Given this observation, $w$ can naturally be chosen as any value within $(0,0.8)$.

Q3:Does RED’s preprocessing introduce overhead that negates training time savings for small datasets?Will it lead to overfitting?

In response to Reviewer QaBc's Q2,we give a detailed explanation and compare the end-to-end runtimes between ResTPP and baseline models in Tab 2 (https://anonymous.4open.science/r/ResidualTPP-3695/Re_H11P.pdf). MIMIC-II and Volcano are small datasets with a few hundred short sequences. As shown,even with these small datasets,RED’s preprocessing time is negligible compared to the training time of neural TPPs,highlighting the efficiency of our method.

Regarding overfitting,Hawkes process (HP) is a statistical model with few parameters,making it less prone to overfitting. To further address your concern,we conduct an additional experiment on small HP datasets. We simulat a 1D HP with intensity function λ(t)=0.2+0.60te1.2sdN(s)\lambda(t)=0.2+0.6\int_0^te^{-1.2s}dN(s). See Fig 2 from the same link for details. The proportion of residual events filtered by RED is only 13%,indicating that most self-exciting patterns have already been captured. Fitting neural TPP to this small fraction of events will not lead to overfitting.

Theoretical Claims:

Thank you for raising this theoretical point. We find that the cumulative probability functions of the integral $\int_{t_{i-1}}^{t_i}\sum_{k=1}^K\lambda_k^{(1)}(u)\,du$ for residual and non-residual events overlap with each other; therefore it seems that we cannot separate them perfectly. Regarding unbiased estimation, if there are no residual events, then RED can guarantee the unbiasedness property by choosing $w=0$. However, if there exist residual events following an arbitrary TPP, then it would be hard to establish an unbiasedness result. We leave this as future work.

Experimental Designs Or Analyses & Essential References Not Discussed:

While Meta TPP is novel and interesting, it is not a hybrid framework like ours. Moreover, it is not an intensity-based model, meaning the RED technique cannot be directly applied to Meta TPP, as our RED method relies on the intensity function.

As mentioned in Section 3.1, many popular models such as Autoformer, FEDformer, and DLinear adopt the STD approach to decompose time series. However, they cannot be easily adapted to TPPs for comparison due to TPPs' complexity (i.e., discrete event types and irregularly spaced event times). Our proposed RED technique makes the first attempt to develop decomposition-based TPP variants. While RED is conceptually similar to residual learning and ensemble modeling, its design and operating mechanics are specifically tailored to the unique challenges of TPPs.

We will cite these papers and include a detailed discussion in the camera-ready version.

Other Comments Or Suggestions:

The HP's baseline intensity $\mu_k(t)$ can be modified to a periodic function to capture periodicities. However, in the original RED we did not do so for two reasons. First, unlike time series, periodicity in event stream data typically appears in specific fields like neuroscience; most commonly used TPP benchmarks do not exhibit periodicities but instead show self-excitation, as discussed in Appendix B. Hence, the standard HP already performs well on these benchmarks. Second, fitting more complex statistical TPPs increases computational complexity, whereas we aim to keep our method simple and efficient. For complex dependencies that cannot be captured, a neural TPP can be used for refinement. Our additional experiments in response to Reviewer QaBc's suggestion and Tables 2 & 3 in https://anonymous.4open.science/r/ResidualTPP-3695/Re_QaBc.pdf may also help clarify this. Due to space limitations, we will include other statistical TPPs in the camera-ready version.

We sincerely appreciate your time. We hope our revisions have addressed your concerns and improved the paper.

Final Decision

This work proposes Residual TPP, a 2-stage framework where a Hawkes process first fits the data and then a neural model takes care of the "residual events", i.e., those for which the Hawkes process provides a poor fit. Experiments demonstrate that Residual TPP achieves strong results in multiple domains and offers computational advantages.

All the reviewers give positive ratings to this paper. Initially, there were concerns about run-time analysis and reference issues, which were mostly addressed during the rebuttal period.

I recommend acceptance for this paper.