/10

Poster4 位审稿人

最低3最高4标准差0.5

ICML 2025

K$^2$IE: Kernel Method-based Kernel Intensity Estimators for Inhomogeneous Poisson Processes

Hideaki Kim,Tomoharu Iwata,Akinori Fujino

提交: 2025-01-22更新: 2025-07-24

摘要

Kernel method-based intensity estimators, formulated within reproducing kernel Hilbert spaces (RKHSs), and classical kernel intensity estimators (KIEs) have been among the most easy-to-implement and feasible methods for estimating the intensity functions of inhomogeneous Poisson processes. While both approaches share the term "kernel", they are founded on distinct theoretical principles, each with its own strengths and limitations. In this paper, we propose a novel regularized kernel method for Poisson processes based on the least squares loss and show that the resulting intensity estimator involves a specialized variant of the representer theorem: it has the dual coefficient of unity and coincides with classical KIEs. This result provides new theoretical insights into the connection between classical KIEs and kernel method-based intensity estimators, while enabling us to develop an efficient KIE by leveraging advanced techniques from RKHS theory. We refer to the proposed model as the *kernel method-based kernel intensity estimator* (K$^2$IE). Through experiments on synthetic datasets, we show that K$^2$IE achieves comparable predictive performance while significantly surpassing the state-of-the-art kernel method-based estimator in computational efficiency.

关键词

point processeskernel methodskernel intensity estimatorsrepresenter theoremleast squares loss

评审与讨论

审稿意见

评分: 42025-03-13

Summary: The authors consider modelling the intensity function of a Poisson point process as belonging to an RKHS, and then fitting this based on a regularised squared error objective. This yields a method which is similar to other kernel intensity estimators (which were not motivated via RKHS), in that no "model fitting" or "parameters" need to be found once the data and hyperparameters are fixed. The technique achieves comparable predictive performance while being more computationally efficient than existing methods.

update after rebuttal

I'm happy with the authors response --- the comments about the clipping and squared loss were particularly helpful. It looks like all reviewers are leaning accept or accept. I will maintain my current score of accept.

给作者的问题

Question: Could you add some more discussion for equation (10), beyond providing the citations? It looks like the squared L2 distance between the intensity function and a constant 1 function, which is then approximated by replacing the integrated intensity by a sum of intensities evaluated at the data. Is that the "correct" interpretation? Why is this a helpful loss function?

论据与证据

The claims are clear and supported by evidence.

方法与评估标准

Yes, the evaluation criteria make sense and are standard for this area. The benchmark with the rectangular domains is particularly nice.

理论论述

This paper bridges an important theoretical and conceptual gap between two "kernel" approaches to intensity estimation --- RKHS (e.g. Flaxman et al.) and kernel intensity estimators (like kernel density estimators). In particular, with a regularised least squared loss, RKHS methods actually give a variant of kernel intensity estimation methods! This is actually quite an inspiring result - I wonder what other loss functions with RKHS models yield?

实验设计与分析

Yes, they are sound.

补充材料

I did not check supplementary material.

与现有文献的关系

This is well-placed within broader scientific literature -> stats/ml -> point processes and machine learning. I list some possible related works for the authors' consideration, however these are definitely not essential to cite. - In Bayesian context, the intensity is modelled as a squared Gaussian process and Random Fourier features are used in "Sparse Spectral Bayesian Permanental Process with Generalized Kernel", as well as a Laplace approximation to the posterior. - Yet another type of "kernel method" for estimating intensity functions of inhomogeneous Poisson point processes is using ""Exact, Fast and Expressive Poisson Point Processes via Squared Neural Families. These are squared neural networks with closed-form integrated intensity functions, trained using maximum likelihood. - Is your squared loss related to equation (12) of "PSD Representations for Effective Probability Models"? They are looking at density estimation rather than intensity estimation, but apart from that it looks a little bit similar.

遗漏的重要参考文献

All essential works are cited, to the best of my knowledge.

其他优缺点

Strengths:

This paper bridges an important theoretical and conceptual gap between two "kernel" approaches to intensity estimation --- RKHS (e.g. Flaxman et al.) and kernel intensity estimators (like kernel density estimators). In particular, with a regularised least squared loss, RKHS methods actually give a variant of kernel intensity estimation methods! This is actually quite an inspiring result - I wonder what other loss functions with RKHS models yield?
Text, equations and figures are easy to follow, and seem to be without any major errors.
I particularly liked the composite domains built from multiple rectangles in Figure 3.

Weaknesses:

As pointed out by authors in section 3.1, the method leads to intensity functions which can be negative, due to the fact that they do not utilise a nonnegative "link function". This can lead to undesirable effects from a modelling perspective (e.g. predict a negative number of events). The authors use a post-hoc clipping of the intensity function below zero, after the fitting procedure, however this then clearly breaks the optimality of the solution.
The equivalent kernel has to be approximated (e.g. Fourier features, MC), leading to approximate estimation procedure.

其他意见或建议

See above.

作者回复

2025-03-28

We would like to thank the reviewer for the highly positive comments. Below, we provide a detailed response to each comment.

I list some possible related works for the authors' consideration, however these are definitely not essential to cite.....

We appreciate the suggestion of including these important references. "Sparse Spectral Bayesian Permanental Process with Generalized Kernel" and "Exact, Fast and Expressive Poisson Point Processes via Squared Neural Families" both propose intensity estimation methods that employ the quadratic link function, $\sigma(\cdot) = (\cdot)^2$ , to ensure non-negativity of the estimators. We will appropriately cite these works in the Other Related Works section.

Is your squared loss related to equation (12) of "PSD Representations for Effective Probability Models"?

This reference proposes a quadratic kernel model of the form $f_{\text{PSD}}(x) = \sum_{n,n'=1}^N A_{nn'}k(x_n,x)k(x_{n'},x)$ , which leverages the non-negativity property of the positive semi-definite matrix $A$ . The model is applicable to both density estimation and intensity estimation tasks. While the reference also considers learning the model within the framework of penalized least squares loss minimization, it focuses on fitting the parameter $A$ using iterative optimization methods and does not address whether the functional form of $f_{\text{PSD}}(x)$ is optimal for that loss--indeed, it likely is not. In fact, (Marteau-Ferey et al., NeurIPS2020) showed that $f_{\text{PSD}}(x)$ minimizes a certain functional loss, $L(f(x_1),\dots, f(x_N))$ , with appropriate regularization. However, this functional loss does not include the least squares loss, which involves the integral of the latent function. Although identifying the functionally optimal estimator under the least squares loss with similar regularization is an intriguing and important question, the work "PSD Representations for Effective Probability Models" does not appear essential in this specific context.

Weaknesses: As pointed out by authors in section 3.1, the method leads to intensity functions which can be negative, .... The authors use a post-hoc clipping of the intensity function below zero, after the fitting procedure, however this then clearly breaks the optimality of the solution.

In fact, applying post-hoc clipping via $\text{max}(\hat{\lambda}(x),0)$ consistently improves the accuracy of the estimator, since for any input $x$ , the inequality $|\text{max}(\hat{\lambda}(x),0) - \lambda(x)| \leq |\hat{\lambda}(x) - \lambda(x)|$ holds due to the non-negativity of the true intensity function $\lambda(x)$ . The reason why the clipping, despite the fact that it breaks the optimality of the solution, improves the accuracy of K $^2$ IE is clear: the non-negativity condition of intensity function is not taken into consideration in the problem of minimization of the penalized least squares loss (11). For more details, see our 3rd response to Reviewer XSJm.

Question: Could you add some more discussion for equation (10), beyond providing the citations?

In response to the reviewer's suggestion, we will include the following explanation of the least squares loss (10), which we believe is satisfactory. Let $\mathbb{E}$ be the expectation regarding data points generated from the true intensity function $\lambda(x)$ . Then we consider the expectation of the integrated squared loss between the estimator, $\hat{\lambda}(x)$ , and the true intensity $\lambda(x)$ , defined by $\mathbb{E} \Bigl[ \int_{\mathcal{X}} \bigl|\hat{\lambda}(x) - \lambda(x) \bigr|^2 dx \Bigr] = \mathbb{E} \Bigl[ \int_{\mathcal{X}} \hat{\lambda}^2(x) dx \Bigr] - 2\mathbb{E} \Bigl[ \int_{\mathcal{X}} \hat{\lambda}(x) \lambda(x) dx \Bigr] + \mathbb{E} \Bigl[ \int_{\mathcal{X}} \lambda^2(x) dx \Bigr].$ The third term can be omitted, as it is independent of the estimator. The second term can be decomposed into two parts as $2\mathbb{E} \Bigl[ \int_{\mathcal{X}} \hat{\lambda}(x) \lambda(x) dx \Bigr] = 2\mathbb{E} \Bigl[ \int_{\mathcal{X}} \hat{\lambda}(x) \sum_{n=1}^N \delta(x-x_n) dx \Bigr] + 2\mathbb{E} \Bigl[ \int_{\mathcal{X}} \hat{\lambda}(x) \bigl( \lambda(x) - \sum_{n=1}^N \delta(x-x_n) \bigr) dx \Bigr],$ where the second term vanishes due to Campbell’s theorem:

2\int_{\mathcal{X}} \mathbb{E} [\hat{\lambda}(x)] \lambda(x) dx - 2\sum_{n=1}^N \mathbb{E} [\hat{\lambda}(x_n)] = 2\int_{\mathcal{X}} \mathbb{E} [\hat{\lambda}(x)] \lambda(x) dx - 2\int_{\mathcal{X}} \mathbb{E} [\hat{\lambda}(x)] \lambda(x) dx = 0.

Putting everything together, we obtain the following (correct) interpretation of Eq. (10): $\mathbb{E} \Bigl[ \int_{\mathcal{X}} \bigl|\hat{\lambda}(x) - \lambda(x) \bigr|^2 dx \Bigr] = \mathbb{E} \Bigl[ \int_{\mathcal{X}} \hat{\lambda}^2(x) dx - 2\sum_{n=1}^N \hat{\lambda}(x_n) \Bigr] + C,$ where $C$ is the constant term.

审稿意见

评分: 32025-03-13

The paper proposes K²IE, a kernel method-based intensity estimator for inhomogeneous Poisson processes, combining the computational efficiency of classical kernel intensity estimators (KIEs) with edge-correction capabilities from reproducing kernel Hilbert spaces (RKHS). By reformulating the problem using a penalized least squares loss in an RKHS, the authors derive an estimator that matches the form of classical KIEs but uses equivalent RKHS kernels h(⋅,⋅) to implicitly handle edge effects. Theoretical analysis shows the solution satisfies a specialized representer theorem with unit dual coefficients. Experiments on synthetic 1D/2D datasets demonstrate comparable accuracy to Flaxman’s kernel method-based estimator (FIE) but with significantly improved computational efficiency.

给作者的问题

The equivalence between $q(\cdot,\cdot)$ and $h(\cdot,\cdot)$ in Equation 8 is asserted but lacks a proof of uniqueness or convergence under finite-dimensional approximations. Can this be rigorously shown without invoking path integral heuristics (Kim, 2021)?
Section 3.1 acknowledges potential negative estimates but dismisses them without quantitative analysis (e.g., frequency/severity in experiments). How prevalent are negative values in practice?
The edge-correction mechanism via solving Equation 8 directly extends prior work. What fundamentally new theoretical insight does K²IE offer?
The claim "no model fitting" is misleading since $\gamma$ and $\beta$ require cross-validation. How does hyperparameter tuning affect computational efficiency claims?
Equation 16 uses random Fourier features (RFF) to approximate $k(\cdot,\cdot)$ . How does RFF rank $M$ trade off edge-correction accuracy?
Does $\gamma \to 0$ reduce $h(\cdot,\cdot)$ to an uncorrected kernel? How does $\gamma$ balance edge effects and overfitting?
Experiments are limited to synthetic 1D/2D data. How does K²IE perform on real-world data with irregular domains?
Figure 1 qualitatively compares kernels but lacks quantitative metrics (e.g., $L^2$ error near boundaries). Why?
9. $\lambda(\cdot)$ denotes both true and estimated intensity functions (Section 3.1). Standardize notation (e.g., $\lambda^*(\cdot)$ vs. $\hat{\lambda}(\cdot)$ ).
10.Scalable Bayesian methods (e.g., Lloyd et al., ICML 2015) are ignored. How does K²IE compare to modern Bayesian nonparametrics?

I will be willing to improve the scores if the authors give me good answer of the above questions.

论据与证据

Supported Claims:1. K²IE achieves computational efficiency comparable to classical KIEs (evidenced by CPU time in Tables 1-2). 2.Edge correction via equivalent RKHS kernels works effectively in multi-dimensional settings (supported by 2D results in Table 2).

Problematic Claims:1.The assertion that K²IE "combines the computational efficiency of KIEs with the effectiveness of Flaxman’s estimator" is partially unsubstantiated. While K²IE is faster than FIE, its edge-correction superiority over KIE is only qualitatively shown (Figure 1), lacking quantitative comparison. 2.The claim that K²IE "does not require model fitting" is misleading, as hyperparameter tuning (e.g., γ, β) is still necessary.

方法与评估标准

Strengths: 1.The degenerate approximation using random Fourier features (Eq. 17) is a practical approach for solving the Fredholm equation. 2.Evaluation metrics (L^2, ∣L∣) align with standard practices for intensity estimation.

Weaknesses: 1.Experiments are limited to synthetic data with high SNR. Real-world datasets are absent, raising concerns about generalizability. 2.The cross-validation setup uses p-thinning, which may introduce bias if events are temporally/spatially correlated.

理论论述

Theorem 1’s proof relies on path integral representations and operator inversions (Eq. 13–17). While the derivation is logically consistent, the critical step of connecting the least squares loss to the Gaussian process representation lacks rigor. Specifically: 1.The path integral representation of the RKHS norm (Kim, 2021) is cited but not explicitly justified in the context of Poisson processes. 2.The equivalence between q(⋅,⋅) and h(⋅,⋅) (Eq. 8) is asserted without proving uniqueness or convergence under finite-dimensional approximations.

实验设计与分析

1.The comparison with KIE uses edge-corrected kernels for KIE but does not clarify whether KIE’s edge correction was optimized similarly to K²IE’s hyperparameters. 2.The regularization parameter γ’s impact on edge correction is not analyzed. For example, does γ→0 degrade h(⋅,⋅) to an uncorrected kernel? 3.Negative intensity values (Section 3.1) are dismissed without quantitative analysis (e.g., frequency/severity of negative estimates in experiments).

补充材料

The appendix provides derivations for the degenerate kernel approximation (Eq. 17–19) and additional experimental details. However, key theoretical proofs (e.g., Theorem 1’s operator inversion steps) are omitted, limiting reproducibility.

与现有文献的关系

The work bridges classical kernel smoothing (Diggle, 1985) and modern RKHS-based methods (Flaxman et al., 2017). By connecting the least squares loss to the representer theorem, it extends the theoretical framework of Walder & Bishop (2017) for Cox processes. However, it does not engage with recent advances in neural point processes or scalable Bayesian methods.

遗漏的重要参考文献

Scalable Bayesian Methods: Lloyd et al., Variational Inference for Gaussian Process Modulated Poisson Processes (ICML 2015) – omitted despite being a key prior work.

其他优缺点

1.The connection between least squares loss and unit dual coefficients is novel, though the core idea (equivalent kernels for edge correction) builds directly on Flaxman et al. (2017). 2.Provides a computationally efficient alternative to FIE but does not surpass classical KIE in low dimensions. 3.The writing is dense, with inconsistent notation (e.g., λ used for both true and estimated intensity functions).

其他意见或建议

The authors need to carefully check all the details making sure no typos included.

作者回复

2025-03-28

We thank the reviewer for giving valuable comments. Below, we provide a detailed response to each question. We will include all discussions in the revised manuscript.

The equivalence between ( q(\cdot,\cdot) ) and ( h(\cdot,\cdot) ) in Equation 8 is asserted but lacks a proof ... Can this be rigorously shown without invoking path integral ...?

Yes, the equivalence between $q(\cdot,\cdot)$ and $h(\cdot,\cdot)$ can alternatively be established via Mercer's theorem, which aligns with the "rigorous" approach proposed by (Flaxman, 2017). Due to space limitations, we provide only a brief overview of the derivation here. Let $\lambda(x)$ be a function in the RKHS associated with the kernel $k(x,x')$ , and let $||\lambda||^2_{H_k}$ denote its squared RKHS norm. Then, following the notation in Eq. (7) in Section 2.2, we can express $\lambda(x) = \sum_{m=1}^{\infty} b_m \eta_m$ and $||\lambda||^2_{H_k} = \sum_m b_m^2 /\eta_m$ , where $b_m$ are coefficients. Substituting this into the penalized least squares loss in Eq. (11), we obtain: $-2 \sum_{n} \lambda(x_n) + \sum_m \frac{\eta_m+1/\gamma}{\eta_m} b_m^2$ . We can see that the 2nd term corresponds to the squared RKHS norm under a rescaled kernel defined by $q(x,x') = \sum_m \frac{\eta_m}{\eta_m + 1/\gamma}e_m(x) e_m(x')$ . It is evident that $q(\cdot,\cdot)$ is consistent with $h(\cdot,\cdot)$ in Eq. (8). A full derivation of Theorem 1 will be included in the text.

How prevalent are negative values in practice?

Following the reviewer’s question, we conducted an analysis of how frequently K $^2$ IE produces negative values using the 2D synthetic dataset $\lambda_{\text{2D}}^{1.0}$ . Specifically, we evaluated the estimated intensity values at 500 x 500 grid points within the observation domain and computed the ratio of negative values. The mean ± standard deviation of this ratio across 100 trials was $0.059 \pm 0.016$ . This result indicates that K $^2$ IE can indeed produce negative estimates in practice—particularly in regions with sparse data—highlighting the necessity of post-hoc clipping like $\max(\hat{\lambda}(x), 0)$ in applications where negative intensity values are not permitted. It is also worth noting that when Laplace RKHS kernels are used, the equivalent kernel $h(x, x')$ is functionally non-negative in one-dimensional input settings (see Section 3.2 in (Kim, NeurIPS2024)).

What fundamentally new theoretical insight does K $^2$ IE offer?

To the best of our knowledge, our paper is the first to prove that minimizing the least squares loss with a squared RKHS norm regularizer yields kernel intensity estimators (KIEs).
K $^2$ IE demonstrates that the equivalent kernel used in Flaxman (2017) can serve as an edge-corrected smoothing kernel for KIE. This insight enhances computational efficiency, as it eliminates the need to solve the dual optimization problem (9) in Flaxman's model.

The claim "no model fitting" is misleading ... How does hyperparameter tuning affect computational efficiency claims?

We will rephrase "no model fitting" as "no optimization of dual coefficients". Regarding hyperparameter tuning, K $^2$ IE offers a significant advantage over the reference models. Specifically, KIE and FIE require MC integration and solving a dual optimization problem for each cross-validation, respectively, whereas K $^2$ IE requires neither, which is beneficial especially in multi-dimensional settings.

How does RFF rank ( M ) trade off edge-correction accuracy?

Please refer to our 2nd response to Reviewer XSJm regarding the ablation study on $M$ . We want to emphasize that the edge correction (i.e., the integral over the domain in Eq. (8)) is performed exactly under the RFF approach (see the 2nd paragraph of Sec. 3.2.2). Rather, $M$ governs the approximation accuracy of shift-invariant kernels.

How does (\gamma) balance edge effects and overfitting?

We appreciate the important question. In classical KIEs, tanking $\beta \to \infty$ ( $\beta^{-1}$ is the scale parameter) leads to $\hat{\lambda}(x) = \sum_n \delta(x-x_n)$ . K $^2$ IE exhibits the same behavior regardless of $\gamma$ , but taking $\gamma \to \infty$ also yields the same solution regardless of $\beta$ in K $^2$ IE. Although $\gamma$ as well as $\beta$ controls the degree of overfitting, it seems that $\gamma$ acts globally and $\beta$ locally. A thorough investigation into the distinct roles of $\gamma$ and $\beta$ remains an important direction for future work. We believe the explanation is satisfactory to the reviewer.

Experiments are limited to synthetic 1D/2D data.

Please see our 1st response to Reviewer XSJm.

Standardize notation

We will standardize the notation according to the suggestion.

How does K $^2$ IE compare to modern Bayesian nonparametrics?

Please see our 3rd response to Reviewer upvZ.

does not clarify whether KIE’s edge correction was optimized

As in Sec 4, KIE's hyperparameter was optimized similarly to K2IE but based on test likelihood.

审稿意见

评分: 42025-03-24

This paper introduces K2IE, a kernel method-based kernel intensity estimator for inhomogeneous Poisson processes, which formulates the intensity estimation as a penalized least squares loss minimization in RKHS. A key theoretical contribution is the establishment of a specialized representer theorem leading to a computationally efficient estimator with unit dual coefficients, drawing a formal connection between classical kernel intensity estimators (KIEs) and RKHS-based estimators. The method is validated on 1D and 2D synthetic datasets, demonstrating comparable predictive performance to prior methods while offering improved computational efficiency.

给作者的问题

Why are real-world datasets not included in the evaluation? This would be critical to support claims of practical relevance.
How sensitive is the model to the number of random Fourier features (M)? An ablation study would help demonstrate robustness.
Can the non-negativity of the estimator be more rigorously enforced, e.g., through a post-processing projection or transformation?

论据与证据

The paper claims that:

K2IE is theoretically consistent with classical KIEs under least squares loss.
It provides comparable predictive performance to state-of-the-art methods while being computationally more efficient.
The proposed estimator handles edge effects effectively via RKHS-derived equivalent kernels.

These claims are largely supported by the theoretical derivations and empirical experiments. However, the experiments are limited to synthetic data, and evidence on real-world applicability or robustness to noise and irregular event distributions is missing.

方法与评估标准

The least squares loss within RKHS is a reasonable and novel formulation for this problem, especially given its computational advantages over log-likelihood loss. The comparison with KIE and Flaxman’s estimator (FIE) is appropriate, and the use of metrics like integrated squared and absolute error (L2, |L|), along with CPU time, provides a fair evaluation.

理论论述

The theoretical results are sound and well-supported through rigorous derivation. The connection to Fredholm integral equations and path integral representations is novel and mathematically grounded.

实验设计与分析

The experiments are designed well for demonstrating performance on a range of synthetic intensities, with appropriate variations in data sparsity and observation domains. Use of both low-dimensional and moderate-dimensional settings is appreciated. Still, the omission of real-world datasets or more complex 3D/temporal domains limits broader validation.

Also, while hyperparameters are tuned via cross-validation, there is no ablation to show sensitivity to the number of random features (2M), which could impact approximation quality.

补充材料

Yes. Code

与现有文献的关系

This work is well-positioned in the literature on nonparametric Poisson intensity estimation, kernel methods, and RKHS theory. It builds directly on foundational work by Flaxman et al. (2017) and distinguishes itself by moving from maximum likelihood to least squares loss, bridging classical and modern techniques.

遗漏的重要参考文献

The paper overlooks some recent work in Bayesian nonparametric methods for Poisson processes beyond what is cited, including approximate inference in deep Gaussian processes or deep kernel learning which could serve as competitive baselines.

其他优缺点

Strengths:

Theoretical originality in bridging classical KIE and RKHS estimators.
Analytical tractability due to representer theorem and Fourier-based approximation.
High computational efficiency and clear reproducibility through open-source code.

Weaknesses:

Limited to synthetic data.

其他意见或建议

作者回复

2025-03-28

We would like to thank the reviewer for the highly positive and constructive comments, by which we are strongly encouraged. Below, we provide a detailed response to each of the comments.

Why are real-world datasets not included in the evaluation? This would be critical to support claims of practical relevance.

We focused our evaluations on synthetic datasets, which allow for precise error estimation between the true and estimated intensity functions—an appropriate setting to verify the theoretical soundness of our model. However, from the perspective of practical relevance, we fully agree with the reviewer on the importance of validating our approach on real-world datasets. In response to the reviewer’s suggestion, we have conducted an additional experiment using an open 2D real-world dataset, bei, in R package spatsta (GPL-3). It consists of locations of 3605 trees of the species Beilschmiedia pendula in a tropical rain forest (Hubbell & Foster, 1983). Following (Cronie et al., 2024), we randomly labeled the data points with independent and identically distributed marks {1, 2, 3} from a multinomial distribution with parameters $(p_1 = p_2 = 0.3, p_3 = 0.4)$ , and assigned the points with label 1 and 2 to training data and test data, respectively; we repeated it ten times for evaluation. We evaluated the predictive performance of the estimators $\hat{\lambda}(x)$ based on the test least squares loss ( $L_{s}$ ) and the negative test likelihood of counts ( $L_{c}$ ): $L_{s}$ was computed as $L_{s} = \int_{X} \hat{\lambda}^2(x) dx - 2\sum_{n \in D_{\text{test}}} \hat{\lambda}(x_n)$ , where $D_{\text{test}}$ was the test data; the observation domain $X$ was discretized into10 x 10 sub-domain { $X_1, \dots, X_{100}$ }, and $L_{c}$ was computed as $L_{c} = \sum_{i=1}^{100} (\Lambda_{X_i}) - N_{X_i} \log \Lambda_{X_i} - \log(N_{X_i}!)$ , where $N_{X_i}$ is the number of test data points observed in $X_i$ , and $\Lambda_{X_i} = \int_{X_i} \hat{\lambda}(x) dx$ . We obtained the results across 10 trials with standard errors in brackets (the lower, the better): $L_s$ = -5.74(0.39), -6.17(0.52), -5.09(0.31) for KIE, K $^2$ IE, FIE; $L_s$ = 265(14.1), 278(10.8), 287(19.7) for KIE, K $^2$ IE, FIE. The results show that our K $^2$ IE achieved the best performance on $L_s$ , but was outperformed by KIE on $L_c$ , which could be because the hyperparameters were optimized based on the least squares loss and the log-likelihood for K $^2$ IE and KIE, respectively. We will include the results based on 100 trials in the final manuscript.

How sensitive is the model to the number of random Fourier features (M)? An ablation study would help demonstrate robustness.

We thank the reviewer for the constructive feedback. In response to the suggestion, we conducted an ablation study on the number of random features ( $2M$ ) using the 2D synthetic dataset $\lambda_{\text{2D}}^{1.0}$ . The integrated squared errors ( $L^2$ ) of our K $^2$ IE were 149(6.72), 74.8(12.1), 50.7(7.71), and 49.0(8.80) for $2M$ = $20$ , $100$ , $300$ , and $500$ , respectively, where standard deviations are in brackets. Similarly, the integrated absolute errors ( $|L|$ ) were 9.86(0.25), 6.65(0.60), 5.43(0.52), and 5.31(0.57) for the same values of $M$ . These results indicate that the predictive performance of K $^2$ IE improves with larger $M$ , and our chosen setting ( $2M=500$ ) is sufficiently large to yield accurate estimates. Please note that, due to the limited time available during the rebuttal period, the reported results are based on 10 trials. We will include the results based on 100 trials in the final manuscript.

Can the non-negativity of the estimator be more rigorously enforced, e.g., through a post-processing projection or transformation?

As discussed in the last paragraph of Section 3.1, applying $\text{max}(\hat{\lambda}(x),0)$ or $\text{ReLU}(\hat{\lambda}(x))$ is the simplest way to enforce the non-negativity of the estimator $\hat{\lambda}(x)$ . Fortunately, this operation improves the estimator's accuracy in a pointwise sense, as it satisfies $|\text{max}(\hat{\lambda}(x),0) - \lambda(x)| \leq |\hat{\lambda}(x) - \lambda(x)|$ for any input $x$ due to the fact that the true intensity function $\lambda(x)$ is always non-negative. However, when evaluating the integral $\int_{S} \lambda(x) dx$ over a compact region $S$ , applying the max operation hinders closed-form integration and necessitates the use of computationally expensive Monte Carlo methods. Therefore, whether or not to employ the max⁡ operation depends on the specific application. We will incorporate the above discussion into the text. For clarity, we note that the max⁡ operation was not used in our experiments.

overlooks some recent work in Bayesian nonparametric methods...

We added a study with a scalable Bayesian model (see our 3rd response to Reviewer upvZ). We promise to cite deep kernel/GP approaches, but it would be helpful if you could point us to a few references worth citing.

审稿意见

评分: 32025-03-27

The paper develops a new kernel based estimator for intensity in inhomogeneous Poisson processes. The estimator is shown to be a associated with a unique reproducing kernel Hilbert space and is compared to some previous estimation methods in a simulation study. The simulation study shows that the new methods achieves better reconstructions at a lower computational cost compared to existing methods.

给作者的问题

Key points to consider:

Why not include a lambda>=0 constraint?
Can the method be used to reconstruction of intensity at unobserved regions (e.g. the hash marked areas in Fig 3)
How does the method compare to log-gaussian-cox process methods?
Does the methods provide uncertainty estimates for the reconstruction, e.g. V(lambda(x)|observations), or is only the best reconstruction provided?

论据与证据

The theoretical derivation seems sound, I have some issues with the simulations study that does not include comparison to any state of the art statistical methods (e.g. https://doi.org/10.1111/2041-210X.13168) and which seems to solve the wrong problem for partially observed processes.

For a partially observed process we would like reconstructions for the entire domain even when observations are only from part of the domain. For this paper that would entail in equation (11) that the integral over lambda(x) (second term) be over the observed area while the last integral/norm (i.e. the penalty term) be over the entire domain, thus having a penalty on difference between observations and lambda(x) only for observed parts of the domain but a smoothness penalty for the entire domain, including unobserved parts. For the lower row in figure 3 this would imply that the reconstruction of lambda is computed (and evaluated) also for the hash-marked regions.

方法与评估标准

The simulation study is reasonable but evaluation is only based on L1 and L2 error between reconstruction and actual intensity in the observed area. No attempt at checking prediction uncertainties or the models ability to predict intensities at unobserved locations is made (see comment under Claims And Evidence). No comparison to methods for log-gaussian-cox processes is made (see Essential References Not Discussed).

理论论述

The derivation of theorem 1 seems sound. However, as the authors themself note the solution can have unreasonable results, i.e. negative intensities, this is due to an improper formulation of the minimisation problem in (11), either an additional constraint lambda>0 or a transform of lambda to ensure non-negative intensities should be included. My feeling is that the paper presents a theoretically sound solution to the wrong problem.

实验设计与分析

See comments under Methods And Evaluation Criteria.

补充材料

Not reviewed.

与现有文献的关系

The paper extends on existing RKHS theory for estimation of inhomogeneous Poisson processes. It provides some references to spatial statistical development but does not include any methods from spatial statistics in the comparison (see Essential References Not Discussed).

遗漏的重要参考文献

The paper lacks references to some recent log-gaussian cox process literature and fast numerical methods for these. E.g. https://doi.org/10.1093/biomet/asv064 and https://doi.org/10.1111/2041-210X.13168. Especially the later could be included in the simulation study.

其他优缺点

No additional comments

其他意见或建议

No additional comments

作者回复

2025-04-01

We thank the reviewer for the deep understanding of our model and the constructive comments. We provide a detailed response to each comment.

Why not include a lambda>=0 constraint?, My feeling is that the paper presents a theoretically sound solution to the wrong problem.

As pointed out, one can enforce the non-negativity of the estimator by introducing lambda>=0 constraint or by applying a non-negative transformation. However, as far as we know, such constraints prevent us from obtaining efficient estimators like K $^2$ IE. The main contribution of our work lies in showing that, by sacrificing strict non-negativity, one can obtain a feasible kernel-based estimator comparable to classical KIEs.

As discussed in Section 3.1, non-negativity can be enforced in a post-hoc clipping like max( $\hat{\lambda}(x)$ ,0). However, we totally agree with the reviewer's point that the functional optimization problem (11) does not explicitly take non-negativity into account, hence not yielding an optimal "non-negative" estimator. For future work, we would like to discuss technical issues arising when non-negativity constraints are imposed.

Consider modeling the intensity as a non-negative transformation $\sigma(x)$ of a latent function $f(x)$ lying in an RKHS. Then, the functional derivative of the objective functional in Eq. (11) leads to the following equation that the optimal $\hat{f}(x)$ solves (full derivation is omitted): $\frac{1}{\gamma} f(x) + \int_{X} k(x,s)\sigma(f(s)) \sigma'(f(s)) ds = \sum_n k(x,x_n) \sigma'(f(x_n)),$ where $\sigma'(x)$ is the derivative of $\sigma(x)$ . When $\sigma(x) = x$ , the equation reduces to a Fredholm integral equation for which Theorem 1 provides a feasible solution. However, when $\sigma(x)$ is nonlinear, even as simple as $\sigma(x) = x^2$ , deriving a feasible solution becomes non-trivial. Alternatively, one may consider enforcing non-negativity of intensity at finite virtual points { $q_1,\cdots, q_R$ }, which leads to a dual optimization problem. While this approach may reduce the risk of negative estimates at $q_i$ , it does not guarantee non-negativity of intensity elsewhere and undermines the computational advantages inherent in K $^2$ IE due to the added complexity of dual optimization.

Finally, we discuss whether the problem in Eq. (11) is truly improper/wrong. (Kim, NeurIPS2024) showed that when RKHS kernels belong to the class of inverse M-kernels (IMKs), the corresponding equivalent kernels $h(x,x')$ are non-negative. This suggests that Eq. (11) may not be inherently improper because K $^2$ IE, a sum of equivalent kernels, is also non-negative under IMKs. In one-dimensional cases, the Laplace kernel is known to be an IMK, but no construction is known for IMKs in higher dimensions, posing an interesting challenge. We will include the discussion in the text.

Can the method be used to reconstruction of intensity at unobserved regions

Yes, K $^2$ IE can reconstruct intensity at unobserved regions. At submission, we had assumed the intensity estimation at observed regions. However, in light of your insightful comment, we revisited the model and confirmed that K $^2$ IE in Eq. (12) could accept inputs from unobserved regions as well. Accordingly, a minor revision to Theorem 1 is warranted to reflect this more general setting. Specifically, all integral operators and the squared RKHS norm should be defined over the full domain ( $X$ -> $\mathbb{R}^d$ ), and Eq. (13) should be updated as $q^*(x,s) = \delta(x-s) 1(s \in X) +\frac{1}{\gamma}k^*(x,s)$ , where $1(\cdot)$ is the indicator. With this modification, Theorem 1, initially defined in $x \in X$ , should be revised to hold for $x \in \mathbb{R}^d$ . We sincerely appreciate the valuable comment. Due to time constraints, we have not yet conducted experiments to evaluate accuracy in unobserved regions. But, we are willing to include the results in the final version of the paper if you consider it essential.

How does the method compare to log-gaussian-cox process methods?

In response to several reviewers' suggestions, we conducted an additional experiment using the 2D synthetic dataset $\lambda_{\text{2D}}^{1.0}$ to include the result of a scalable Bayesian method. This time, we adopted a variational Bayesian model with a quadratic link function (Lloyd, ICML2015), where 10 x 10 inducing points were employed. We appreciate your kind suggestion of references (we will cite them), but we have not yet become proficient with the R package. $L^2$ , $|L|$ , and cpu of the Bayesian method were 58.0(8.13), 5.61(0.47), and 28.4(0.96), respectively, where standard deviations are in brackets. The result highlights the high efficiency of K $^2$ IE. Note that the reported results are based on 10 trials. We will include the results based on 100 trials in the text.

Does the methods provide uncertainty estimates for the reconstruction

K $^2$ IE does not provide uncertainty estimates, which limits our model compared to Bayesian models.

审稿人评论

2025-04-07

The comments answer most of my questions.

For the first point the comments made regarding positive kernels and the trade off between an "correct" model and computational efficiency are insight full as well as the comments regarding future research direction. Including the comment on IMKs in the paper would be a good addition, I would also be slightly interested in how big of a problem it is in practice (e.g. did lambda<0 occur in any of the simulations?).
The extension to un-observed regions is promising and it would be interesting to see results in the paper although I fully understand if this is not possible due to time and page limitations.
Given the complexity of R-INLA I fully understand that the authors comments and I'm satisfied with the alternative model comparisons.

最终决定Accept (poster)

2025-05-01

This paper demonstrates a correspondence between classical kernel intesnity estimators in non-homogeneous Poisson proceesses and those based on repreoducing kernel Hilbert space theory.

The reviewers broadly agree that the methodology is theoretically sound. Though one reviewer requests a more detailed derivation without path-integral heuristics. Additionally, all reviewers remark on the issue that the estimated intensity might become negative. It was also remarked that the error in the random Fourier feature approximation was not investigated sufficiently.

Furthermore, most reviewers agree that there are gaps in the literature review and that more comparisons with recent statistical/ML approaches would be beneficial. Additionally some reviewers would like an empirical investigation on the possible negativity of the estimated intensity and on the performance in extrapolation.

Some of the key-points that ought to be addressed in a camera-ready version are:

Discuss related work more thoroughly in accordance with reviewers' suggestions.
Discuss "clipping" the intensity estimator and inverse M-kernels in more detail.
Empirical investigation of the sign of the intensity estimator.