Decomposable Transformer Point Processes
Abstract
Reviews and Discussion
The work designs a novel transformer-based approach for modelling event sequences (e.g. predicting the next event). The main novelty is the decomposition of the log-likelihood into a conditional probability mass function and a conditional probability density function. The former, implemented with a transformer, models the distribution over event types; the latter models the event occurrence times with a log-normal mixture. The experimental results are compelling: the approach achieves convincing performance on next-event prediction (in terms of log-likelihood) and long-horizon prediction.
Strengths
- The writing and presentation are clear and precise. The technical introduction and the contribution have solid foundations.
- The experiments are convincing. I appreciate reporting the variance for transparency.
- I also appreciate the details provided here and there without distracting from the main story (e.g. the hyperparameter values in the supp. material).
Weaknesses
I do not see any major problems, but would encourage minor revisions towards reaching a broader audience. Specifically:
- The thinning algorithm plays an important role in motivating the approach and interpreting the results. Readers would appreciate a high-level recap of the algorithm, instead of having to look it up in the reference.
- The justification for the decomposition (ll. 90-93) comes across as a bit weak. Perhaps it could be improved by providing an analytical argument for why it should offer the same benefits as the intensity function, and for why depending on the thinning algorithm at inference time is a problem.
Questions
- How does the size of the mixture model (M) affect the prediction and what was the methodology for choosing the optimal M?
- The scales in Fig. 1 are wildly different. How does one explain the differences, and what does it say about the data quality and/or task complexity?
Minor remarks: l. 80: The integral's upper bound coincides with the integration variable. ll. 150-155: Perhaps one could provide a comparison with previous work; what does this simplification of the problem translate to in practice (how much faster is it to train/optimize)?
Limitations
The limitations are discussed sufficiently in Sec. 6.
Thank you for your feedback. We respond to your concerns below:
-
Even though we do not present the full thinning algorithm in the main text due to space restrictions, we present the exact algorithm in the appendix on page 17. As we explain in lines 76-83, the expressions of the two log-likelihoods in Eqs. (1) and (2) are equivalent. In fact, there is a closed-form formula that relates the intensity function to the CPMF/CPDF; a derivation can be found in Section 2.4 of JG Rasmussen, 2018, "Lecture notes: Temporal point processes and the conditional intensity function".
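For completeness, here is a minimal statement of that relation in standard point-process notation (which may differ slightly from the paper's Eqs. (1)-(2)). Writing $f^*(t)$ and $F^*(t)$ for the conditional density and CDF of the next event time given the history, $\lambda^*(t)$ for the ground conditional intensity, and $p^*(k \mid t)$ for the mark distribution:

$$
\lambda^*(t) = \frac{f^*(t)}{1 - F^*(t)}, \qquad
f^*(t) = \lambda^*(t)\,\exp\!\Big(-\!\int_{t_{i-1}}^{t}\lambda^*(s)\,ds\Big), \qquad
\lambda_k^*(t) = \lambda^*(t)\,p^*(k \mid t),
$$

so that, up to the survival term for the interval after the last observed event,

$$
\sum_i \log \lambda_{k_i}^*(t_i) - \int_0^T \lambda^*(s)\,ds
\;=\;
\sum_i \Big[\log f^*(t_i) + \log p^*(k_i \mid t_i)\Big].
$$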
-
We have included a table (attached PDF) with an ablation study on the influence of the number of mixture components M. The optimal M is chosen based on the log-likelihood of the held-out dev set, as we explain in the Appendix (Section A.2).
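To make the selection protocol concrete, below is a minimal, self-contained sketch of choosing M by held-out log-likelihood. It fits a Gaussian mixture to log inter-event times on synthetic data (equivalent to a log-normal mixture on the times themselves, up to a constant Jacobian term that does not affect the comparison across M); the actual DTPP model, its history conditioning, and the paper's datasets are not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic inter-event times as a stand-in for a real event-sequence dataset.
tau = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
log_tau = np.log(tau).reshape(-1, 1)

# Simple train/dev split; the paper selects M on a held-out dev set (Appendix A.2).
split = int(0.8 * len(log_tau))
train, dev = log_tau[:split], log_tau[split:]

best_M, best_ll = None, -np.inf
for M in (1, 2, 4, 8, 16):
    # A Gaussian mixture on log(tau) is a log-normal mixture on tau.
    gmm = GaussianMixture(n_components=M, random_state=0).fit(train)
    dev_ll = gmm.score(dev)  # mean held-out log-likelihood per event
    if dev_ll > best_ll:
        best_M, best_ll = M, dev_ll

print(f"selected M = {best_M} with dev log-likelihood {best_ll:.3f}")
```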
-
One could potentially interpret the large (or low) log-likelihood values as a proxy for the ability of the models to capture the complex dynamics of the event sequences. Larger values may indicate that the model provides a good approximation of the latent generative mechanism of the process; however, this might not always be true, since it depends heavily on the quality of the training/test data and how representative the available sample is.
-
Thank you for spotting the typo in line 80. We will correct this in the revised version of the paper.
I thank the authors for their response.
What is the value of M in the main experiments (e.g. in Fig. 1)? Since M > 1 seems optimal in the provided ablation study, what is the value of the log-likelihood for M = 1 on the same datasets?
We have used the same value of M across all datasets in Fig. 1. For M = 1, the model's flexibility is quite reduced and thus there is a performance drop. The corresponding results are:
$$\begin{array}{lc}
\textbf{Datasets} & \textbf{M=1} \\
\text{Amazon} & -2.342 \\
\text{Taxi} & 0.391 \\
\text{Taobao} & 1.020 \\
\text{SO-V1} & -2.19
\end{array}$$

This paper presents a novel framework for modeling marked temporal point processes (MTPPs) using Transformer-based architectures. The authors address the limitations of traditional methods that rely on computationally intensive thinning algorithms by proposing a decomposable approach that partly uses a Transformer. The approach avoids modeling the conditional intensity function (CIF) directly, instead separating the modeling of inter-event times and event types into two distinct components: the conditional probability density function (CPDF) for inter-event times and the conditional probability mass function (CPMF) for event types, with the latter modeled by a Transformer architecture.
Strengths
- Novel framework that uses a Transformer architecture for the first time to tackle this problem
- Decomposing the likelihood into conditional probability density function (CPDF) and conditional probability mass function (CPMF) components
- The proposed DTPP model is technically solid, with a well-structured approach to decomposing the MTPP likelihood and modeling the components using Transformers. The mathematical formulations are clearly presented.
- The authors provide detailed implementation details and make their code available. This ensures that the results can be reproduced and verified by other researchers, which is important for the credibility and quality of the work.
- The demonstrated improvements in predictive accuracy and computational efficiency over state-of-the-art methods highlight the significance of the proposed framework. The speedup achieved in long-horizon prediction tasks is particularly noteworthy.
Weaknesses
- The primary contribution of the paper is the application of Transformer architecture to the problem of modeling marked temporal point processes. While this is a useful application, it does not introduce significant theoretical advancements or novel methodologies beyond leveraging existing models in a new context. To enhance the novelty, the authors could consider integrating more innovative elements or demonstrating new theoretical insights specifically tailored for MTPPs.
- The paper could benefit from more detailed ablation studies to isolate the contributions of different components of the proposed framework. For example, assessing the impact of various hyperparameters, the influence of the Transformer architecture's depth and width, or the role of specific design choices in the decomposable framework would provide deeper insights into the model's functioning and robustness.
- While the paper includes several figures and tables, additional visualizations could enhance clarity. For example, visualizing the learned intensity functions, the attention mechanisms within the Transformer, or case studies showing specific sequences predicted by the model could provide more tangible insights into the model's behavior.
Questions
- Have you considered including more visualizations of the learned intensity functions, attention mechanisms, or specific case studies showing predicted sequences? Adding more visual representations of your model’s predictions and learned parameters would improve clarity and provide tangible insights into the model’s behavior.
- A discussion about potential theoretical extensions that could lead to modifications of the Transformer architecture specifically suited for MTPPs would strengthen the paper. Some interesting inductive biases could probably be added here.
- Still, I do not see much scientific novelty, since essentially a Transformer was partly applied to a problem where it had not been applied before.
Limitations
Authors already discussed the limitations in the main paper.
Thank you for your feedback and your suggestions. We respond to your concerns below:
-
Notice that we do not learn any intensity functions, since our framework is based on the decomposition in Eq. (2). Given the black-box nature of the Transformer-based architecture, we refrained from including visualizations of the learned representations. This is a general problem for black-box models such as Transformers or LSTMs/RNNs, even though there has been some progress in the last few years [*], mostly in the context of computer vision tasks. To the best of our knowledge, previous works on MTPPs using the Transformer architecture do not provide such visualizations for this reason. Only [44] (Zuo, 2020) provides a visualization of attention patterns of different attention heads in different layers; however, we believe those results are more confusing than clarifying, since they were arbitrarily chosen from one of the datasets without any insight regarding the configuration of the Transformer architecture.
-
This is an interesting point and something that we aim to investigate in the future regarding the modification of the architecture. Nevertheless, the currently used architecture is already tailored to MTPPs, since it shares components with [44] and [41]. We are happy to include this as a future direction in the paper. Thank you.
-
We respectfully disagree with your assessment of a lack of novelty. We indeed utilize ideas from previous works to build our final, novel model, but we fail to see why this amounts to an absence of novelty. For instance, [30] (Panos, 2023) used the same decomposition, which was already known from [6] (Cox, 1975), and the functional form in [27] (Narayanan, 2023) to model the mark distribution. [35] (Shchur, 2019) combined a mixture of log-normals with an LSTM; all well-known ideas at that time. The continuous-time Transformer architecture for modeling point processes was first adopted by [44] (Zuo, 2020) and [43] (Zhang, 2020) independently, while [41] (Yang, 2022) later used the same Transformer architecture with a modified way of modeling the intensity function. We also feel that the reviewer has overlooked our experimental evaluation, which provides strong evidence of the state-of-the-art performance of our model over well-established baselines. Both the RMSE and ERROR metrics highlight the efficiency of combining a simple mixture of log-normals with a Transformer architecture. We believe this is an important contribution and something that was not known to the community until now. Another contribution is the ability of our model to significantly outperform (in a fraction of the time) the state-of-the-art HYPRO baseline on the long-horizon prediction task. This result showcases the importance of using a simple yet robust model for the inter-event times, such as the mixture of log-normals. We are also the first to investigate the limitations of thinning-based methods for the long-horizon prediction task. This result becomes even more important if one considers that our method was never developed to deal with the more challenging long-horizon prediction task. We believe these results are novel by themselves and would be of interest to the community.
[*] Chefer, et al. "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021
Thank you for your detailed rebuttal. Your clarifications are appreciated, particularly concerning the challenges with visualizations in black-box models and your justification of the novelty of your work. While I understand your points and acknowledge the technical solidity and performance improvements demonstrated, I maintain that the primary novelty lies in applying an existing architecture to a new problem domain rather than introducing fundamentally new theoretical insights.
I will keep my rating of 5 (Borderline accept), recognizing the technical soundness and potential practical impact of your contributions, while also noting the limited theoretical innovation.
The paper introduces a Decomposable Transformer Point Process (DTPP), a novel framework for modeling marked point processes. It maintains the advantages of attention-based architectures while avoiding the computational intensity of the thinning algorithm. The model uses a mixture of log-normals for inter-event times and a Transformer architecture for the conditional probability mass function of event marks, achieving state-of-the-art performance in next-event prediction tasks and outperforming thinning-based methods in long-horizon prediction.
Strengths
- Innovative Approach: The paper proposes a new way to model marked point processes by decomposing the problem into manageable sub-problems, which is a creative advancement in the field.
- Empirical Performance: The DTPP model demonstrates improved performance over existing methods, particularly in next-event prediction and long-horizon forecasting, which is a significant contribution.
Weaknesses
-
Unclear motivation. I am not an expert in this field, so I do not have a deep understanding of the field of neural point processes. I was pretty confused when I tried to understand the necessity of decomposing the log-likelihood of a marked point process.
-
The writing is a bit difficult to read and lacks clarity for readers outside the field. The overall quality needs improvement, as certain sections are challenging to comprehend due to complex sentence structures and unclear presentation of ideas. Specific examples include lines 60-63 and 128-136, where the convoluted language may hinder understanding for those unfamiliar with the subject matter.
Questions
See above.
Limitations
The authors have adequately addressed the limitations.
Thank you for your feedback. We respond to your concerns below:
-
Using the decomposition in Eq. (2) is equivalent to using the standard log-likelihood based on the intensity function in Eq. (1). For more details, see JG Rasmussen, 2018, "Lecture notes: Temporal point processes and the conditional intensity function". We chose the decomposition in (2) because it allows us to freely define different models for the times and the marks. Therefore, we can use models with nice properties (e.g. a mixture of log-normals) without depending on the computationally demanding thinning algorithm for generating samples. We discuss this in lines 89-94.
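To illustrate why the thinning algorithm is not needed under this parameterization, here is a minimal sketch of exact sampling of the next inter-event time from a mixture of log-normals. The mixture weights, means, and scales below are illustrative placeholders, not values produced by the paper's model (where they would be conditioned on the event history).

```python
import numpy as np

def sample_next_interarrival(weights, mus, sigmas, rng):
    """Draw one inter-event time from a mixture of log-normals.

    Unlike thinning, no upper bound on an intensity function and no
    accept/reject loop is needed: the draw is exact and O(1).
    """
    k = rng.choice(len(weights), p=weights)              # pick a mixture component
    return rng.lognormal(mean=mus[k], sigma=sigmas[k])   # sample from that component

rng = np.random.default_rng(0)
# Illustrative placeholder parameters (not taken from the paper).
w = np.array([0.3, 0.7])
mu = np.array([-0.5, 0.8])
sig = np.array([0.4, 1.0])
taus = [sample_next_interarrival(w, mu, sig, rng) for _ in range(5)]
print(taus)
```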
-
We are unsure what is confusing in these lines. In lines 60-63, we simply list the contributions of this paper. More details regarding the long-horizon prediction task can be found in Section 5.2. In lines 128-136, we briefly introduce the properties of the mixture of log-normal distributions and how this model can be used for modeling the distribution of inter-event times. The notation is quite standard, but we encourage the reviewer to point out specifically where the confusion comes from; we are happy to modify the text of the paper to improve readability.
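For reference, the density of a mixture of M log-normals over an inter-event time $\tau > 0$ reads (in standard notation, which may differ slightly from the paper's):

$$
p^*(\tau) = \sum_{m=1}^{M} w_m \,\frac{1}{\tau\,\sigma_m\sqrt{2\pi}}\,
\exp\!\Big(-\frac{(\log\tau - \mu_m)^2}{2\sigma_m^2}\Big),
\qquad w_m \ge 0,\;\; \sum_{m=1}^{M} w_m = 1 .
$$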
We provide an ablation study on the influence of the number of mixture components M.
This paper addresses forecasting with marked point process models by integrating attention and sidestepping the intensive thinning algorithm currently required at inference. Experiments are conducted on next-event prediction and long-horizon prediction.
The reviews are all inclined to accept the paper, varying between borderline and weak accept.
Upon consideration of the paper, the reviews, the rebuttal, and the author-reviewer and AC-reviewer discussions, the AC summarizes these aspects:
(1) Novelty. The model is simple, but the contribution is non-trivial.
(2) The work is technically sound.
(3) The experimental evaluation is effective. As a summary of public and private discussion, the AC recommends exemplifying and discussing the predicted sequences and contextualizing the results with the qualitative properties of the datasets, given the significant variation of the likelihood scales.
The initially raised weaknesses concerning clarity and novelty appear resolved and agreed upon by the end of the discussion period. The AC has verified this by considering the paper.
Overall, the AC is happy to second the recommendation to accept this work!