PaperHub
Overall: 6.0 / 10 · Poster · 3 reviewers (min 5, max 7, std dev 0.8)
Individual ratings: 7, 5, 6
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.7 · Presentation: 3.0
NeurIPS 2024

Is Score Matching Suitable for Estimating Point Processes?

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

This study highlights the incompleteness of previously proposed score matching estimators for point processes. To address this issue, we introduce a novel score matching estimator for point processes.

Abstract

Keywords

point processes · score matching · parameter estimation

Reviews and Discussion

Review (Rating: 7)

EDIT: I have changed my score from a "weak accept" to an "accept."

The paper studies the potential use of score matching (SM) for inference when the model of interest is a Poisson point process or a Hawkes process. Because using SM in practice requires modifying the objective with integration by parts, the paper shows that for many common Poisson and Hawkes processes, a core assumption needed to make this modification possible does not hold. To remedy the situation, the paper proposes a randomly weighted version of SM called WSM (along with an analog for Autoregressive Score Matching (ASM) called WASM), which re-establishes the validity of integration by parts for the aforementioned processes. The paper argues theoretically that their new estimators are consistent, and it provides empirical analyses to demonstrate that the new estimator outperforms the non-weighted variant and matches the performance of alternatives. It also makes an argument for selecting a weighting function based on some set of assumptions.

Strengths

The paper points out a fundamental issue with trying to use SM for general Poisson point processes and Hawkes processes, and in addition, it recommends a weighting scheme that patches up this problem based on some reasonable assumptions. This is a valuable observation, and the proposed remedy looks reasonable. There is ample empirical work to suggest that their tweak on SM and ASM does agree with the maximum likelihood estimator (MLE). The theory provided in the paper also argues that their weighted estimator is consistent if the model is well-specified.

Originality:

[+]

  • The authors have applied a weighting scheme to SM and ASM that permits them to use integration by parts for the standard objective and prove the consistency of their new estimator under suitable regularity conditions.

Quality:

[+]

  • The empirical work is thorough and does a good job checking the accuracy of their new estimator.

Clarity:

[+]

  • The presentation of the theory and empirical work is very neat and polished.
  • Assumptions are clearly laid out to ensure consistency of the new estimator.

Significance:

[+]

  • The new estimator provides the ability to do consistent semiparametric estimation for Poisson point and Hawkes processes with SM (instead of MLE), providing a new tool for very high-dimensional problems.

Weaknesses

There are a couple of critiques of the paper:

  • While the idea of using a weighting function to enable integration by parts is a good one, it has been seen in other fields before (e.g. weighting Stein kernels in compact settings), and hence is not completely novel.
  • The argument in Section 5 outlining a weighting function is a great addition to the paper, but given that it is only optimizing $C_h$ and not $\Gamma(h, A, B)$, it is hard to know how close it is to the optimum.

Originality:

[-]

  • The only novel insight of the paper is that the trick of integration by parts, so often used in score matching, is not possible for general point processes, but that it can be recovered by weighting the Fisher divergence a bit differently. It is not clear if this idea alone is sufficient for publication.

Quality:

[-]

  • The argument to choose their weighting function $h$ is definitely better than simply picking one. That being said, it is still not clear how close to optimal their choice of $h$ is, given that they are only optimizing one term in the objective ($C_h$). The paper would be a bit stronger if they could include a different (natural) weighting function to show their new choice is an improvement.

Clarity:

[-]

  • Equation 1: Can we at least have a sentence explaining that the paper will be slightly abusing notation by referring to $p(\mathcal T)$ as the conditional distribution of $p$ given $N_T = |\mathcal T|$?

Questions

[Q1] Given that the choice of $h$ optimizes $C_h$ but ignores $\Gamma(h, A, B)$, how sensitive is $\Gamma$ to the choice of $h$? This could be demonstrated empirically or theoretically, but it would be good to have some analysis of this to argue why this weighting function is decent.

Limitations

NA

Author Response

Thank you so much for reviewing our paper. We answer your questions below.

The argument to choose their weighting function $h$ is definitely better than simply picking one. That being said, it is still not clear how close to optimal their choice of $h$ is, given they are only optimizing one term in the objective ($C_{\mathbf h}$). The paper would be a bit stronger if they could include a different (natural) weighting function to show their new choice is an improvement.

We thank the reviewer for pointing this out. To address your concern, we carried out experiments comparing three different weight functions that satisfy the criterion (Eq. 15) for AWSM on the Hawkes process. In addition to the near-optimal weight function $\mathbf h^0$ introduced in our paper, we also consider a natural weight function $\mathbf h^1$ defined as

$$h_n^1(t_n)=(t_n-t_{n-1})(T-t_{n}).$$

We also consider another valid weight function: the square root of $\mathbf h^1$, denoted $\mathbf h^2$,

$$h_n^2(t_n)=\sqrt{(t_n-t_{n-1})(T-t_n)}.$$

All three weight functions can be applied in AWSM to recover the ground-truth parameters, albeit with different convergence rates. We carried out experiments on synthetic data for the exponential-decay model with the same setting as Section 6.2 of our paper. We measure their MAE for different sample sizes in Figure 1 of the attached PDF and find that $\mathbf h^0$ indeed achieves the best results among the three weight functions.
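For concreteness, here is a minimal numpy sketch of these two alternative weight functions (our own illustration, not code from the paper; the convention $t_0 = 0$ is our assumption, and $\mathbf h^0$ is omitted because it depends on model-specific quantities):

```python
import numpy as np

def h1(t, T):
    """h^1_n(t_n) = (t_n - t_{n-1}) * (T - t_n) for a sorted event-time vector t."""
    t_prev = np.concatenate(([0.0], t[:-1]))  # convention t_0 = 0 (our assumption)
    return (t - t_prev) * (T - t)

def h2(t, T):
    """h^2_n(t_n) = sqrt((t_n - t_{n-1}) * (T - t_n))."""
    return np.sqrt(h1(t, T))
```

Both weights vanish as $t_n \to t_{n-1}$ and as $t_n \to T$, which is the boundary behavior that makes them satisfy the criterion (Eq. 15).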

We hope these additional experimental results enhance the credibility of our paper and address your concerns. We will definitely add these results to the camera-ready version.

Given that the choice of $\mathbf h$ optimizes $C_{\mathbf h}$ but ignores $\Gamma(\mathbf h,A,B)$, how sensitive is $\Gamma$ to the choice of $\mathbf h$? This could be demonstrated empirically or theoretically, but it would be good to have some analysis of this to argue why this weighting function is decent.

We thank the reviewer for pointing this out. This is indeed a tough question to which we do not yet have a satisfying answer. For specific parametric models, $\dot A_n(\mathcal T), \dot B_n(\mathcal T)$ can be computed analytically (see lines 478-480), and $\Gamma(\mathbf{h},A,B)$ can then be computed via Monte Carlo, so we can study how sensitive $\Gamma(\mathbf{h},A,B)$ is to $\mathbf h$. For general models, especially when $\psi_{\theta}$ is a deep neural network like THP or SAHP, $\Gamma(\mathbf{h},A,B)$ is intractable to compute.

However, heuristically speaking, our near-optimal weight function $\mathbf h^0$ should be a good choice even with respect to $\Gamma(\mathbf{h},A,B)$. To make $\Gamma(\mathbf{h},A,B)$ small, a natural idea is to make $|h_n(\mathcal T)|$ and $|\frac{\partial}{\partial t_n}h_n(\mathcal T)|$ small. The weight function we chose and its derivative have relatively low positive powers with respect to $t_n$, which keeps $|h_n^0(\mathcal T)|$ and $|\frac{\partial}{\partial t_n}h_n^0(\mathcal T)|$ small. For the weight function $h_n^1(\mathcal T)=(T-t_n)(t_n-t_{n-1})$, the power w.r.t. $t_n$ is two. And for $h^2_n(\mathcal T)=\sqrt{(T-t_n)(t_n-t_{n-1})}$, the derivative is $\frac{\partial}{\partial t_n}h_n^2(\mathcal T)=\frac{1}{2}\frac{T-t_n-(t_n-t_{n-1})}{\sqrt{(T-t_n)(t_n-t_{n-1})}}$; the numerator is usually bounded while the denominator can be close to zero, making the derivative large. In conclusion, $\mathbf h^0$ is a better choice than $\mathbf h^1$ or $\mathbf h^2$ with respect to $\Gamma(\mathbf{h},A,B)$.

Equation 1: Can we at least have a sentence explaining that the paper will be slightly abusing notation by referring to $p(\mathcal T)$ as the conditional distribution of $p$ given $N_T=|\mathcal T|$?

We thank the reviewer for pointing this out. Since $N_T$ is random, by the notation $p(\mathcal T)$ we are referring to the probability density, or the likelihood, of observing the $N_T$ events $t_1,\ldots, t_{N_T}$ in $[0,T]$. For a Poisson process, the conditional distribution of the event times $t_1,\ldots, t_{N}$ given $N_T = N$ is

$$p(t_1,\ldots, t_N|N_T = N) = \frac{\prod_{n=1}^N \lambda(t_n)}{\int_{0\leq t_1\leq\dots\leq t_N\leq T}\prod_{n=1}^{N}\lambda(t_n)\,dt_1\cdots dt_N},\qquad 0\leq t_1\leq\ldots\leq t_{N}\leq T.$$
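As a concrete special case (a standard fact we add for illustration): for a homogeneous Poisson process with constant intensity $\lambda(t)\equiv\lambda$, the denominator equals $\lambda^N T^N/N!$, so the conditional density reduces to

$$p(t_1,\ldots, t_N|N_T = N)=\frac{N!}{T^N},\qquad 0\leq t_1\leq\ldots\leq t_N\leq T,$$

i.e., the joint density of the order statistics of $N$ i.i.d. $\text{Uniform}(0,T)$ random variables.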

We acknowledge that our paper may not employ the rigorous notation typically used in the fields of probability and stochastic processes. Instead, we have opted for notations that we believe are accessible to readers from both the computer science and statistics communities. We deeply apologize for any confusion this may have caused.

Comment

Thanks for your reply. I appreciate that you have done some extra work to demonstrate that the choice of $h^0$ is better than other obvious alternatives. I am inclined to accept this paper.

Comment

Thank you very much for your recognition and support. We will include the additional comparison of different weight functions in the camera-ready version. Thank you once again for your constructive feedback and for increasing your rating to accept.

Review (Rating: 5)

This paper studies the use of score matching for estimation in point process models and proves the incompleteness of the original score matching estimators due to bounded support. Weighted score matching is used to address the issue, and theoretical results are established for the consistency and optimality of the proposed estimator. Experiments on different datasets demonstrate the effectiveness of the proposed estimator, which yields results comparable to maximum likelihood estimation.

Strengths

  1. This paper theoretically shows the limitations of using the original score matching estimators for point processes, pointing out their incompleteness, and provides a solution.
  2. Theoretical analysis is given to support the claims and the convergence of the proposed method.
  3. The paper is well-written and easy to follow.

Weaknesses

  1. As acknowledged in the limitation section of the paper, existing approaches that adopt denoising score matching for point processes are not considered. Even with a theoretical guarantee, such baselines would be necessary to include experimentally in order to fully demonstrate the superiority of the method.
  2. For Hawkes processes, the problem caused by the bounded support can be solved by applying log-normalization to transform the bounded temporal domain into an unbounded one [1]. One could achieve this by removing the right bound and transforming $(0,+\infty)$ to $(-\infty,+\infty)$. Although real data are observed within a finite window, the effect of removing the right bound should be negligible, as it only affects the last event.
  3. Including more metrics on event prediction, such as the MSE of event time, would be helpful, as the practical use of point process models is prediction.

[1] Lin, H., Wu, L., Zhao, G., Liu, P., & Li, S. (2022). Exploring generative neural temporal point process. Transactions on Machine Learning Research.

Questions

  • Did the author observe any instability during the score matching training?

Limitations

Included in weaknesses

Author Response

Thank you so much for reviewing our paper. We answer your questions below.

Denoising score matching for point processes is not considered.

We thank the reviewer for pointing this out. Indeed, denoising score matching (DSM) is not considered in our paper because we mainly focus on correcting the original score matching. However, to address your concerns, we have carried out experiments on DSM. We can conclude that DSM is inferior to AWSM in terms of both efficiency and accuracy.

We deploy DSM on THP and SAHP, using the DSM loss as in [1]. For each observed timestamp $t_n^{(m)}$ in the $m$-th sequence, we sample $L$ noisy samples $\tilde t_{n,l}^{(m)}=t_n^{(m)}+\varepsilon_{n,l}^{(m)}$, $l = 1,\ldots, L$, where $\text{Var}(\varepsilon_{n,l}^{(m)})=\sigma^2$, and obtain the DSM objective:

$$\hat {\mathcal J}(\theta)=\frac{1}{M}\sum_{m=1}^M\sum_{n=1}^{N_m}\sum_{l=1}^{L}\frac{1}{2L}\left[\psi_\theta(\tilde t_{n,l}^{(m)})+\frac{\varepsilon_{n,l}^{(m)}}{\sigma^2}\right]^2+\alpha\hat {\mathcal J}_\text{CE}(\theta),$$

where $\hat {\mathcal J}_{\text{CE}}(\theta)$ is the cross-entropy loss from lines 191-192 of our paper. We carried out experiments on synthetic data where $T$ is known, and compared the TLL, ACC, and running time of DSM with those of MLE and AWSM. The comparison results are shown in Table 1 of the attached PDF: DSM performs the worst among the three estimators, and AWSM is the fastest.
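To make the construction above concrete, here is a minimal numpy sketch (our own illustration; the helper `dsm_pairs` and its signature are hypothetical, not from the paper). It uses the identity $\frac{1}{2}(\psi_\theta(\tilde t)+\varepsilon/\sigma^2)^2=\frac{1}{2}(\psi_\theta(\tilde t)-(-\varepsilon/\sigma^2))^2$, i.e., DSM regresses the model score at the perturbed point onto $-\varepsilon/\sigma^2$:

```python
import numpy as np

def dsm_pairs(t, sigma, L, rng):
    """Perturbed timestamps and DSM regression targets for one sequence.

    t : (N,) array of observed timestamps.
    Returns (N, L) arrays: perturbed times t_tilde and targets -eps / sigma**2.
    """
    eps = sigma * rng.standard_normal((t.size, L))
    t_tilde = t[:, None] + eps
    return t_tilde, -eps / sigma**2

# For a score model psi (any callable), the per-sequence DSM term is then
#   (1.0 / (2 * L)) * ((psi(t_tilde) - target) ** 2).sum(),
# to which alpha * J_CE is added for marked models.
```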

The reasons why DSM performs poorly are as follows:

  1. In terms of accuracy, as discussed in Section 5 of [2], DSM is biased when $\sigma > 0$, while our AWSM is unbiased and produces consistent estimates. Therefore, it is not surprising that AWSM achieves better TLL and ACC than DSM.

  2. In terms of efficiency, previous work states that DSM is faster than SM because the Hessian matrix is expensive to compute, which DSM avoids. However, as presented in [3], ASM (and hence AWSM) also avoids computing the Hessian matrix by replacing it with the one-dimensional partial derivative $\frac{\partial}{\partial t_n}\psi_{\theta}(t_n|\mathcal F_{t_{n-1}})$. Therefore, DSM is not more efficient than AWSM in this respect; on the contrary, DSM can be slower than AWSM because it requires $L$ noise samples for each timestamp and must compute their scores.

For Hawkes processes, the problem can be solved by applying log-normalization to transform the bounded temporal domain into an unbounded one [1].

This is a misunderstanding. Applying log-normalization does not solve the problem: we can prove that integration by parts still produces an intractable term after this transformation. Therefore, using ASM on a log-transformed sequence still yields incorrect estimates. If one hopes to obtain consistent estimates after a log transformation, a suitable weight function is still needed. This is summarized in the following table.

| | ASM | AWSM |
| --- | --- | --- |
| No transformation | ✗ | ✓ |
| Log transform $x_n = \log t_n$ | ✗ | ✓ |
| Log transform on intervals $y_n = \log (t_n-t_{n-1})$ | ✗ | ✓ |

We illustrate this with the simplest example, both theoretically and empirically. Consider a homogeneous Poisson process with intensity $\lambda^*$ and observations $t_1, \ldots, t_{N_T}$ with $0\leq t_1\leq \ldots \leq t_{N_T}\leq T$. Now consider two schemes:

  1. First apply the log transformation to timestamps, then use ASM. The estimate is denoted $\hat \lambda_{\text{ASM,Log}}$.

  2. First apply the log transformation to intervals, then use ASM on the intervals. The estimate is denoted $\hat \lambda_{\text{ASM,Log Interval}}$.

For the log transformation on timestamps, we have $x_n=\log t_n$. The conditional pdf of $x_n$ given $\mathcal F_{x_{n-1}}$ is therefore

$$p(x_n|\mathcal F_{x_{n-1}})=\lambda \exp\left[-\lambda \left(\exp(x_n)-t_{n-1}\right)\right]\exp(x_n),$$

and we can compute the conditional score accordingly (the factor $e^{\lambda t_{n-1}}$ does not involve $x_n$ and thus does not affect the score). We plug these terms into the implicit ASM objective to get the estimate. Suppose we observe $M$ sequences $t_1^{(m)}, \ldots, t_{N_m}^{(m)}$; after some derivation we obtain an analytical form of $\hat \lambda_{\text{ASM,Log}}$:

$$\hat \lambda_{\text{ASM,Log}} =2\frac{\sum_{m=1}^M\sum_{n=1}^{N_m}\exp(x_n)}{\sum_{m=1}^M\sum_{n=1}^{N_m}\exp(2x_n)}.$$

Similarly, we consider the transformation on intervals, $\tau_n:= t_n-t_{n-1}$ for $n\geq 2$, $\tau_1:=t_1$, and $y_n = \log (\tau_n)$, consider the score $\psi(y_n|\mathcal F_{y_{n-1}})$, and then apply ASM. The estimate in this case is:

$$\hat \lambda_{\text{ASM,Log Interval}} =2\frac{\sum_{m=1}^M\sum_{n=1}^{N_m}\exp (y_n)}{\sum_{m=1}^M\sum_{n=1}^{N_m}\exp(2y_n)}.$$

Both estimates are wrong and can never recover the true parameter, regardless of sample size. For comparison, suitable weight functions can be added after the transformation. For the log transformation on timestamps, we use the weight function $h_n(x_n)=(x_n-x_{n-1})(\log T-x_n)$. For the log transformation on intervals, we use $h_n(y_n) = \log [T-t_{n-1}]-y_n$. The corresponding estimates are $\hat \lambda_{\text{AWSM,Log}}$ and $\hat \lambda_{\text{AWSM,Log Interval}}$.

We measure the MSE of these four estimates; results are shown in Table 2 of the attached PDF. As the sample size increases, the MSE of $\hat \lambda_{\text{ASM,Log}}$ and $\hat \lambda_{\text{ASM,Log Interval}}$ remains large and unchanged, showing that these two estimators are inconsistent regardless of sample size.
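As a sanity check, the following self-contained simulation sketch (our own construction from the closed forms above, not the paper's code) contrasts $\hat \lambda_{\text{ASM,Log}}$ with the weighted estimator on log-intervals for a homogeneous Poisson process; the former settles near $3/T$ regardless of $\lambda^*$ (a short Campbell-formula calculation on our part), while the latter approaches $\lambda^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true, T, M = 2.0, 1.0, 100_000

num_asm = den_asm = num_awsm = den_awsm = 0.0
for _ in range(M):
    # homogeneous Poisson on [0, T]: N_T ~ Poisson(lam * T), sorted uniform times
    t = np.sort(rng.uniform(0.0, T, rng.poisson(lam_true * T)))
    if t.size == 0:
        continue
    # ASM on log-timestamps x_n = log t_n: 2 * sum exp(x_n) / sum exp(2 x_n)
    num_asm += t.sum()           # exp(x_n) = t_n
    den_asm += (t ** 2).sum()    # exp(2 x_n) = t_n^2
    # AWSM on log-intervals y_n = log tau_n with h_n = log(T - t_{n-1}) - y_n,
    # hence dh_n/dy_n = -1; closed form analogous to the AWSM,Log derivation
    # in the follow-up discussion below
    t_prev = np.concatenate(([0.0], t[:-1]))
    tau = t - t_prev
    h = np.log(T - t_prev) - np.log(tau)
    num_awsm += (2 * tau * h - tau).sum()
    den_awsm += (tau ** 2 * h).sum()

print("ASM,Log          :", 2 * num_asm / den_asm)  # stays near 3.0 here, not lam_true
print("AWSM,LogInterval :", num_awsm / den_awsm)    # approaches lam_true = 2.0
```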

Including more metrics would be helpful.

We will consider adding MSE to our revised paper.

Did the author observe any instability during training?

For our weighted score matching, we did not notice instability.

[1] SMURF-THP: score matching-based uncertainty quantification for transformer Hawkes process

[2] A connection between score matching and denoising autoencoders

[3] Autoregressive score matching

Comment

Thank you sincerely for the additional experiments on denoising score matching and clarifications! For the log-normalization, could you kindly provide the detailed derivation for obtaining $\hat{\lambda}_{\text{ASM,Log}}$ and $\hat{\lambda}_{\text{ASM,Log Interval}}$? Additionally, is it possible to express $\hat{\lambda}_{\text{AWSM,Log}}$ and $\hat{\lambda}_{\text{AWSM,Log Interval}}$ in closed form?

Comment

We greatly thank the reviewer for your constructive feedback.

For the log-normalization, could you kindly provide the detailed derivation for obtaining $\hat \lambda_{\text{ASM,Log}}$ and $\hat \lambda_{\text{ASM,Log Interval}}$?

Sure, we are more than delighted to discuss this with you.

For log-normalization on timestamps, the transformed variable $x_n$ has conditional pdf

$$p(x_n|\mathcal F_{x_{n-1}})=\lambda \exp\left[-\lambda \left(\exp(x_n)-t_{n-1}\right)\right]\exp(x_n).$$

The conditional score and its partial derivative w.r.t. $x_n$ are then

$$\psi(x_n|\mathcal F_{x_{n-1}}) = -\lambda \exp(x_n) + 1,$$

$$\frac{\partial}{\partial x_n}\psi_{\theta}(x_n|\mathcal F_{x_{n-1}})=-\lambda \exp(x_n).$$

Now we perform ASM on the transformed variables. Denote by $\mathcal X = (x_1,\ldots, x_{N_T})^T$ the vector of transformed timestamps; the implicit ASM objective (Eq. 13) for $\mathcal X$ is

$$\mathcal J_\text{ASM}(\theta)=\mathbb E_{p(\mathcal X)}\left[\sum_{n=1}^{N_T}\frac{1}{2}\psi^2_\theta(x_n|\mathcal F_{x_{n-1}})+\frac{\partial}{\partial x_n}\psi_\theta(x_n|\mathcal F_{x_{n-1}})\right].$$

Plugging the conditional score computed above into this objective gives

$$\mathcal J_{\text{ASM}}(\theta) =\mathbb E_{p(\mathcal X)}\left[\sum_{n=1}^{N_T}\frac{1}{2}\exp(2x_n)\lambda^2-\sum_{n=1}^{N_T}2\exp(x_n)\lambda \right]+C,$$

where $C$ is a constant not containing $\theta$.

The empirical objective based on samples $t_1^{(m)},\ldots, t_{N_m}^{(m)}$, $m=1,\ldots, M$, is

$$\hat {\mathcal J}_\text{ASM}(\theta)=\frac{1}{M}\sum_{m=1}^M\sum_{n=1}^{N_m}\left[\frac{1}{2}\exp(2x_n^{(m)})\lambda^2-2\exp(x_n^{(m)})\lambda\right].$$

This is quadratic in $\lambda$, and its minimizer $\hat \lambda_{\text{ASM,Log}}$ can be computed analytically as $$\hat \lambda_{\text{ASM,Log}} = 2\frac{\sum_{m=1}^M\sum_{n=1}^{N_m}\exp(x_n^{(m)})}{\sum_{m=1}^M\sum_{n=1}^{N_m}\exp(2x_n^{(m)})}.$$

For the log transformation on time intervals $y_n$: the conditional pdf of $\tau_n = t_n-t_{n-1}$ is

$$p(\tau_n|\mathcal F_{\tau_{n-1}})=\lambda \exp(-\lambda \tau_n),$$

so the conditional pdf of $y_n = \log \tau_n$ is

$$p(y_n|\mathcal F_{y_{n-1}})=\lambda \exp\left[-\lambda \exp(y_n)\right]\exp(y_n),$$

which depends on $y_n$ through the same analytical expression as in the $x_n$ case (the densities differ only by a factor free of the variable, so the scores coincide in form), except that $y_n$ and $x_n$ have different supports. Therefore, following exactly the same steps as for $x_n$, we obtain the analytical expression of $\hat \lambda_{\text{ASM,Log Interval}}$ given in our rebuttal (with the minor modification of adding the superscript $(m)$).

Additionally, is it possible to express $\hat \lambda_{\text{AWSM,Log}}$ and $\hat \lambda_{\text{AWSM, Log Interval}}$ in closed form?

Yes, it is possible. We take $\hat \lambda_{\text{AWSM, Log}}$ as an example; the derivation of $\hat \lambda_{\text{AWSM, Log Interval}}$ is essentially the same.

First, the implicit AWSM objective (Eq. 17 in our paper) for $\mathcal X$ is

$$\mathcal J_\text{AWSM}(\theta)=\mathbb E_{p(\mathcal X)}\left[\sum_{n=1}^{N_T}\frac{1}{2}\psi^2_\theta(x_n|\mathcal F_{x_{n-1}})h_n(\mathcal X)+\frac{\partial}{\partial x_n}\psi_\theta(x_n|\mathcal F_{x_{n-1}})h_n(\mathcal X)+\psi_\theta(x_n|\mathcal F_{x_{n-1}})\frac{\partial}{\partial x_n}h_n(\mathcal X)\right].$$

Plugging $\psi_\theta(x_n|\mathcal F_{x_{n-1}})$ and $\frac{\partial}{\partial x_n}\psi_\theta(x_n|\mathcal F_{x_{n-1}})$ into the empirical version of this objective gives

$$\hat {\mathcal J}_\text{AWSM}(\theta)=\frac{1}{M}\sum_{m=1}^M\sum_{n=1}^{N_m}\left[\frac{1}{2}\exp(2x_n^{(m)})h_n(\mathcal X^{(m)})\lambda^2-2\exp(x_n^{(m)})h_n(\mathcal X^{(m)})\lambda - \exp(x_n^{(m)})\frac{\partial}{\partial x_n}h_n(\mathcal X^{(m)})\lambda\right].$$

This is still quadratic in $\lambda$, and its minimizer is

$$\hat \lambda_{\text{AWSM, Log}}=\frac{\sum_{m=1}^M\sum_{n=1}^{N_m}\left[2\exp(x_n^{(m)})h_n(\mathcal X^{(m)})+\exp(x_n^{(m)})\frac{\partial}{\partial x_n}h_n(\mathcal X^{(m)})\right]}{\sum_{m=1}^M\sum_{n=1}^{N_m}\exp(2x_n^{(m)})h_n(\mathcal X^{(m)})}.$$

We choose the weight $h_n(\mathcal X)=(x_n-x_{n-1})(\log T-x_n)$, whose derivative is $\frac{\partial}{\partial x_n}h_n(\mathcal X)=\log T - 2x_n+x_{n-1}$. Plugging these into the expression above yields a closed form for $\hat \lambda_{\text{AWSM, Log}}$, displayed below.
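For completeness, carrying out this substitution (a mechanical step we spell out here) gives

$$\hat \lambda_{\text{AWSM, Log}}=\frac{\sum_{m=1}^M\sum_{n=1}^{N_m}\exp(x_n^{(m)})\left[2(x_n^{(m)}-x_{n-1}^{(m)})(\log T-x_n^{(m)})+\log T-2x_n^{(m)}+x_{n-1}^{(m)}\right]}{\sum_{m=1}^M\sum_{n=1}^{N_m}\exp(2x_n^{(m)})(x_n^{(m)}-x_{n-1}^{(m)})(\log T-x_n^{(m)})}.$$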

Comment

Thank you for the further derivation and clarification! For the log-interval case, each $\tau_i$ should follow the same exponential distribution, so $\hat{\lambda}_{\text{ASM,Log Interval}}$ should approximate the correct $\lambda$ with a large sample size. Could you please clarify why this is considered incorrect? I might have misunderstood something.

Comment

Thanks so much for providing concrete answers to my queries and issues! Most of my concerns have been resolved and I have increased my score. I believe it would be valuable to include a discussion on denoising score matching, as well as some recent related work on score matching with point processes [1,2].

[1] Beyond Point Prediction: Score Matching-based Pseudolikelihood Estimation of Neural Marked Spatio-Temporal Point Process.

[2] Exploring generative neural temporal point process.

Comment

Thank you for your constructive feedback and for increasing your rating to accept.

Comment

We greatly thank the reviewer for further engaging in the discussion.

For the log-interval case, each $\tau_i$ should follow the same exponential distribution, so $\hat{\lambda}_{\text{ASM,Log Interval}}$ should approximate the correct $\lambda$ with a large sample size.

This is not true. An exponential distribution on $[0,\infty)$ indeed has expectation $\frac{1}{\lambda}$ and second moment $\frac{2}{\lambda^2}$, so the sample average of i.i.d. exponential draws converges to $\frac{1}{\lambda}$ as the sample grows. However, this is not the case for the time intervals of a Poisson process on $[0,T]$.

Considering the numerator of $\hat \lambda_{\text{ASM,Log Interval}}$, we cannot draw the conclusion that

$$\lim_{M\rightarrow \infty}\mathbb E_{p}\left[\frac{\sum_{m=1}^M\sum_{n=1}^{N_m}\tau_n^{(m)}}{\sum_{m=1}^M N_m}\right]=\frac{1}{\lambda}.$$

We cannot invoke the law of large numbers or the CLT here, since $\sum_{m=1}^M N_m$ is random and the $\tau_n^{(m)}$ do not all have the same marginal distribution, as their supports differ.

To illustrate this, we compute $\frac{\sum_{m=1}^M\sum_{n=1}^{N_m}\tau_n^{(m)}}{\sum_{m=1}^M N_m}$ for $M=10^{6}$, $T = 1$ under different $\lambda^{*}$ and show that its expectation is not equal to $\frac{1}{\lambda}$:

| | $\lambda=1$ | $\lambda=2$ | $\lambda=4$ | $\lambda=8$ |
| --- | --- | --- | --- | --- |
| $\frac{\sum_{m=1}^M\sum_{n=1}^{N_m}\tau_n^{(m)}}{\sum_{m=1}^M N_m}$ | 0.3693 | 0.2836 | 0.1886 | 0.1094 |
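The following short sketch (our own; it uses the facts that $\sum_{n\leq N_T}\tau_n=t_{N_T}$ telescopes to the last event time, and that the maximum of $n$ i.i.d. $\text{Uniform}(0,T)$ draws is distributed as $T\cdot\text{Beta}(n,1)$) reproduces numbers close to the table above:

```python
import numpy as np

rng = np.random.default_rng(1)
T, M = 1.0, 1_000_000
for lam in (1, 2, 4, 8):
    n = rng.poisson(lam * T, size=M)   # event counts per sequence
    # sum of intervals telescopes to the last event time t_{N_T}, and the
    # max of n i.i.d. Uniform(0, T) draws is distributed as T * Beta(n, 1)
    last = T * rng.beta(n[n > 0], 1)
    ratio = last.sum() / n.sum()
    # by our calculation the limit is (lam - 1 + exp(-lam)) / lam**2 for T = 1,
    # which differs from 1 / lam
    print(lam, round(ratio, 4), "vs 1/lam =", 1 / lam)
```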

The denominator behaves similarly: its average is not equal, even asymptotically, to $\frac{2}{\lambda^2}$. Therefore, $\hat{\lambda}_{\text{ASM,Log Interval}}$ does not approximate the correct $\lambda$ even with a large sample size.

We give a brief explanation. For a point process on a finite window $[0,T]$, each $\tau_n$ has support $[0, T-\sum_{i<n}\tau_i]$ instead of $[0,\infty)$. Therefore, first, the $\tau_n$ have different supports and do not follow the same distribution; second, each $\tau_n$ has a support different from that of the exponential distribution, which is $[0, \infty)$.

In conclusion, when discussing a point process on a finite time window, the effect of the right bound must be considered; the situation differs from sampling a fixed-length time sequence over an infinite time window.

Review (Rating: 6)

The paper considers the problem of utilizing score-matching approaches for point processes. The main motivations for the paper are the Poisson and Hawkes processes. Both of these model the repeated occurrence of events over a finite time interval $[0, T]$. The Poisson process models the occurrence of an event within a fixed infinitesimal interval with a fixed homogeneous rate $\lambda$, while a Hawkes process is a self-excitation process where past events increase the likelihood of future ones. Score matching is a natural approach in settings where the normalizing constant of a parameterized model is challenging to compute. However, to ensure tractability, one utilizes an implicit approach where the score matching objective is approximated by a tractable alternative which may be easily computed through the derivatives of the approximating model.

The main contribution of this work is the identification of simple scenarios, even in the simple setting of the generalized Poisson process, where the score matching objective of prior approaches fails because the regularity condition underlying it does not apply. The paper derives an expression showcasing the bias (Proposition 3.1), which is related to the evaluation of the estimated score function (for a fixed number of occurrences) at the endpoints of the interval of interest ($t = 0$ and $t = T$). The paper then proposes an alternative where a weight function $h$ is used to augment the score matching loss, ensuring that these biasing terms evaluate to $0$. Furthermore, these functions may be chosen independently of the parameters of the underlying model. In their experiments, it is shown that the use of such functions still satisfies desirable properties, such as the optima corresponding to the true values of the parameters, while also allowing for a tractable score function formulation which avoids the bias incurred by prior approaches. This is a natural approach which suggests an avenue towards improving the performance of such models.

Overall, the problem considered in the paper and the solution they identify are interesting and likely relevant to the NeurIPS community. However, I do have concerns regarding the writing of the paper. For example, there are several assumptions on the weight function in Equation 9, but it is not clear which properties are used in the proof of Proposition 3.3. For instance, Line 400 features an explicit expression for $h_1$ and $h_N$ while the expression in Equation 9 features none of these. It is my understanding that the assumptions in Equation 9 ensure that the biasing terms equate to $0$ for these elements. Furthermore, I believe that under the assumptions in Equation 9, both terms appearing in the display between Lines 399 and 400 are $0$. Please clarify whether this is in fact the case. The same issues persist for the proofs of the results in Section 4. I would like clarification from the authors on these points before updating my review.


I thank the authors for their response and have amended my review accordingly.

Strengths

See main review

Weaknesses

See main review

Questions

See main review

Limitations

See main review

Author Response

Thank you so much for reviewing our paper. We answer your questions below.

For example, there are several assumptions on the weight function in Equation 9, but it is not clear which properties are used in the proof of Proposition 3.3. For instance, Line 400 features an explicit expression for $h_1$ and $h_N$ while the expression in Equation 9 features none of these. Please clarify whether this is in fact the case. The same issues persist for the proofs of the results in Section 4.

We greatly thank you for pointing this out. Your understanding is correct, and we apologize for the typos. The proofs of Propositions 3.3 and 4.3 were carried over from an earlier version, in which we specified one particular weight function rather than the full class of weight functions satisfying the condition. We have proofread our work and ensured that the rest of our proofs remain correct.

For Propositions 3.3 and 4.3, the main parts remain correct. Only minor modifications are needed to utilize the assumptions stated in the main text, ensuring that the proofs hold for general weight functions. We have corrected the necessary parts as follows.

From lines 398 to 404, the corrected proof is as follows:

Using the first two equations in Eq. 9, we have

$$\int \left[p(t_1,\ldots, t_N)\frac{\partial \log p_{\theta}(t_1,\ldots, t_N)}{\partial t_{n}}h_n(\mathcal T)\right]\bigg\vert_{t_{n}=t_{n+1}}d\mathcal T_{-n}=0, \quad\forall n\in [N],$$

$$\int \left[p(t_1,\ldots, t_N)\frac{\partial \log p_{\theta}(t_1,\ldots, t_N)}{\partial t_n}h_n(\mathcal T)\right]\bigg\vert_{t_n=t_{n-1}}d\mathcal T_{-n}=0,\quad\forall n\in [N].$$

Therefore, the first term on the right-hand side of the second equation of Eq. 21 disappears, and the second term equals $-\mathbb E_{p(\mathcal T)}[\sum_{n=1}^{N_T}\frac{\partial}{\partial t_n}\psi_{\theta}(t_n)h_n(\mathcal{T})+\psi_{\theta}(t_n)\frac{\partial}{\partial t_n}h_n(\mathcal T)]$. The existence of this expectation follows from the last two equations in Eq. 9. This completes the proof.

We can see from the proof that, in Eq. 9, the first two equations ensure that the integration by parts trick does not produce an intractable term, and the last two equations are simply regularity conditions that ensure all terms are well-defined.

Similarly, for lines 431 to 434, the corrected proof is as follows:

$$\mathbb E_{p(\mathcal T)}\left[\sum_{n=1}^{N_T} \psi(t_n|\mathcal F_{t_{n-1}})\psi_\theta(t_n|\mathcal F_{t_{n-1}})h_n(\mathcal T)\right]$$

$$=\sum_{n=1}^\infty \int p(t_1,\ldots t_n)\psi(t_n|\mathcal F_{t_{n-1}})\psi_\theta(t_n|\mathcal F_{t_{n-1}})h_n(\mathcal T)\,d\mathcal T_{1:n}$$

$$=\sum_{n=1}^\infty \int p(\mathcal T_{:n-1})\,p(t_n|\mathcal F_{t_{n-1}})\psi_\theta(t_n|\mathcal F_{t_{n-1}})h_n(\mathcal T)\Big\vert_{t_n=t_{n-1}}^{t_n=T}d\mathcal T_{:{n-1}}$$

$$\quad-\sum_{n=1}^\infty \int p(t_1,\ldots, t_n)\left[\frac{\partial \psi_\theta(t_n|\mathcal F_{t_{n-1}})}{\partial t_n}h_n(\mathcal T) + \psi_\theta(t_n|\mathcal F_{t_{n-1}})\frac{\partial h_n(\mathcal T)}{\partial t_n}\right]d\mathcal T_{1:n}.$$

Between the second and third lines above, we omit the steps used in the derivation of Proposition 4.1 for conciseness. The term in the third line is eliminated using Eq. 15. For the term in the fourth line, using Lemma B.1, we have:

$$-\sum_{n=1}^\infty \int p(t_1,\ldots, t_n)\left[\frac{\partial \psi_{\theta}(t_n|\mathcal F_{t_{n-1}})}{\partial t_n}h_n(\mathcal T) + \psi_{\theta}(t_n|\mathcal F_{t_{n-1}})\frac{\partial h_n(\mathcal T)}{\partial t_n}\right]d\mathcal T_{1:n}= -\mathbb E_{p(\mathcal T)}\left[\sum_{n=1}^{N_T}\frac{\partial \psi_{\theta}(t_n|\mathcal F_{t_{n-1}})}{\partial t_n}h_n(\mathcal T) + \psi_{\theta}(t_n|\mathcal F_{t_{n-1}})\frac{\partial h_n(\mathcal T)}{\partial t_n}\right].$$

The existence of the expectation is ensured by the last two terms of Eq. 15.

We hope the corrected proof can address your concerns.

Comment

Thank you very much for your thoughtful review and for taking the time to carefully consider our responses. We greatly appreciate your kind words and are pleased that you find our work to be well-presented, thorough, and novel.

We will certainly correct the proofs in the updated manuscript, as you suggested. Your feedback is valuable in improving the quality of our work, and we are grateful for your support.

Thank you once again for your constructive feedback and for increasing your rating.

Author Response

We express our sincere appreciation to all reviewers for their time, effort, and insightful feedback. We are encouraged by their recognition of the significance of our work in identifying the incompleteness of the original score matching for point processes, proposing the weighted score matching method, studying its convergence rate, proposing a near-optimal weight function, and deploying the method on deep point process models.

In the following, we have provided specific and direct responses to each of the primary concerns raised by the reviewers, supplemented with additional experiments that substantiate our arguments. We aim to ensure that we address all the concerns and offer clarity and reassurance where needed. Should any additional questions arise, we invite reviewers to engage in further discussion. Once again, we express our gratitude for your time and dedication in reviewing our work.

Comment

Dear Reviewers,

Now that the rebuttal period has ended, please take a moment to review the authors' responses to your initial reviews. Your engagement is crucial for:

  • Ensuring a fair evaluation process
  • Improving the quality of final reviews
  • Fostering meaningful scientific discourse

If you have any questions or need clarification, please post follow-up comments.

Your continued dedication to this process is greatly appreciated.

Best regards,

AC

Final Decision

This paper addresses an important issue in applying score matching techniques to point process models, specifically Poisson and Hawkes processes. The authors identify a fundamental problem with existing score matching approaches for these models due to bounded support, and propose a novel weighted score matching solution to address this limitation.

All reviewers agree that the paper makes a valuable contribution by pointing out this critical issue and providing a principled solution. The theoretical analysis demonstrating the consistency of the proposed estimator and the empirical evaluations showing its effectiveness are viewed positively. The paper is generally well-written and the presentation is clear.

Some key strengths highlighted by reviewers include:

  • Identifying an important limitation of existing score matching approaches for point processes
  • Proposing a theoretically-grounded weighted score matching solution
  • Thorough empirical evaluation demonstrating the effectiveness of the proposed method
  • Clear presentation of the theory and assumptions

The main concerns raised relate to:

  • Lack of comparison to some relevant baselines like denoising score matching
  • Questions about the optimality of the proposed weighting function
  • Some notational issues and need for clarification in parts

The authors have provided detailed responses addressing most of these concerns, including additional experimental results comparing different weighting functions. This has increased reviewer confidence in the work.

Overall, the reviewers recommend accepting this paper, contingent on the authors making a thorough revision incorporating the feedback and additional results from the discussion phase. Specifically, the authors should:

  1. Include comparisons to denoising score matching approaches as discussed in the rebuttal
  2. Add the additional experimental results on different weighting functions to the main paper
  3. Expand the discussion on the choice and optimality of the weighting function
  4. Clarify the notation, especially around Equation 1, as mentioned by Reviewer X8YN
  5. Consider adding MSE of event time prediction as an evaluation metric
  6. Discuss recent related work on score matching for point processes

With these revisions, this paper promises to make a solid contribution to the field. The authors should carefully incorporate all the feedback received during the review process to strengthen the final version.