PaperHub
Overall rating: 6.0 / 10 (Poster; 3 reviewers; min 4, max 7, std 1.4)
Ratings: 7, 7, 4
Confidence: 4.0
Soundness: 3.7 | Contribution: 3.3 | Presentation: 2.3
NeurIPS 2024

Inverse M-Kernels for Linear Universal Approximators of Non-Negative Functions

Submitted: 2024-05-14 | Updated: 2024-11-06

Keywords: kernel method, M-matrix, non-negative function, intensity estimation, density estimation

Reviews and Discussion

Review (Rating: 7)

The authors combine

  • the notion of M-matrix from linear algebra
  • the notion of positive semidefinite kernel from kernel learning theory

to create the notion of M-kernel. The M-kernel can be used for solving non-negative constrained least squares in an RKHS.
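For readers unfamiliar with this formulation, here is a minimal sketch of kernel least squares with a non-negativity constraint on the coefficients, one concrete way such a "simple linear constraint" can look. The function name fit_nonneg_krr, the regularizer, and the specific constraint alpha >= 0 are illustrative assumptions and may differ from the exact construction in the paper.

```python
import numpy as np
from scipy.optimize import nnls

def fit_nonneg_krr(K, y, lam=1e-2):
    """Kernel least squares with the coefficient constraint alpha >= 0.

    Minimizes ||y - K @ alpha||^2 + lam * alpha^T K alpha subject to alpha >= 0,
    by folding the RKHS regularizer into an ordinary NNLS problem via a
    Cholesky factor of K.
    """
    n = K.shape[0]
    L = np.linalg.cholesky(K + 1e-10 * np.eye(n))  # small jitter keeps K positive definite
    A = np.vstack([K, np.sqrt(lam) * L.T])         # stacked design: data fit + regularizer
    b = np.concatenate([y, np.zeros(n)])
    alpha, _ = nnls(A, b)
    return alpha

# Prediction: f(x) = sum_n alpha_n k(x_n, x). If k(., .) >= 0 pointwise and
# alpha >= 0, the fitted function is non-negative everywhere.
```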

Strengths

  • The idea is novel, to my knowledge
  • The fusion of these two ideas is an elegant way to efficiently solve the problem of non-negative regression in an RKHS
  • The experiments are thorough and demonstrate the theory well

Weaknesses

  • There aren't many kernels that are known to be M-kernels (it looks like we only know of one-dimensional examples). The authors acknowledge this.
  • There are many small grammar/writing errors.

Questions

  • What is $\gamma$ on L209?
  • On L213, is this tridiagonal formula well-known? What is the main proof idea, or is there a reference for this result?

Typos/word choice/grammar. Number at the front denotes line number.

  • 45, "reminiscent" -> "analogue"/"generalization"/"extension" ("reminiscent" isn't used much in mathematical writing to my knowledge)
  • 45, "belong" -> "belonging"
  • 48 "input space" -> "input spaces"
  • 49 "shed the light" -> "shed light"
  • 65 "invokes" -> "is given by"
  • 71 "which is known as the" -> "a property known as"
  • 73 "the computation of O(N 2) and O(N )" -> "O(N 2) and O(N ) kernel evaluations"
  • 94 "represents the positive semi-definiteness of matrices" -> "is the positive semi-definite constraint"
  • 176 "notorious" -> "well-known" ("Notorious" means "famous for something bad". Unusual for usage in a mathematical context.)

Limitations

Yes

Author Response

We would like to thank the reviewer for the highly positive comments, which strongly encourage us! We apologize for the grammatical flaws in this paper; nevertheless, we appreciate your thorough understanding of our work's content and value. We will improve the grammatical quality of the paper with a professional English proofreading service. Below we provide a detailed response to each of the comments.

There aren't many kernels that are known to be M-kernels (it looks like we only know of one-dimensional examples).

As the reviewer pointed out, this paper only presents two examples of strict inverse M-kernels, which could be seen as a limitation. However, we are optimistic about the future. This paper clarifies, for the first time, a novel condition that allows a linear model to achieve both non-negativity and good representation power, namely the condition of strict inverse M-kernels. By presenting this condition to the NeurIPS audience, we believe the open question of how to construct strict inverse M-kernels in a more general manner will eventually be addressed.

There are many small grammar/writing errors

We apologize for the inconvenience and appreciate the kind suggestions about the writing, which we will incorporate into the paper. We will ensure that the writing quality of the paper is improved through professional proofreading.

What is $\gamma$ on L209?

$\gamma$ on L209 is the hyperparameter of the intersection kernel $k_{\text{int}}(x, x')$, which should be determined from the data. We will ensure that this is clearly stated in the paper. If $\gamma$ on L339 confused the reviewer, we apologize: $\gamma$ on L339 is a typo for $s(N+1)$.

On L213, is this tridiagonal formula well-known?

We derived the tridiagonal formula using some known properties of symmetric Toeplitz matrices (e.g., see Trench 2001). It is clear that the derived formula is a straightforward generalization of the inverse of Toeplitz matrices, but we could not find any references that explicitly state the tridiagonal formulae for the one-dimensional exponential and intersection kernels. We will add the above discussion to the paper.

Trench, W. F. (2001). Properties of some generalizations of Kac-Murdock-Szego matrices. Contemporary Mathematics, 281, 233-246.
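As a quick numerical illustration of the Markov structure behind such tridiagonal formulae (our own sketch, not code from the paper or the rebuttal; the decay parameter theta is an arbitrary choice), the Gram matrix of a one-dimensional exponential kernel on sorted inputs has a tridiagonal inverse with negative off-diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 5.0, size=8))            # sorted one-dimensional inputs
theta = 1.3                                            # assumed kernel decay parameter
K = np.exp(-theta * np.abs(x[:, None] - x[None, :]))   # exponential-kernel Gram matrix

P = np.linalg.inv(K)                                   # precision matrix
band = np.abs(np.subtract.outer(np.arange(8), np.arange(8))) <= 1
print(np.max(np.abs(P[~band])))                        # ~1e-15: P is tridiagonal
print(np.all(np.diag(P, k=1) < 0))                     # True: negative off-diagonals (M-matrix structure)
```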

Comment

Thank you for your detailed responses. I maintain my opinion that this paper is a nice contribution to the community including posing new interesting questions. I will keep my score.

Review (Rating: 7)

The paper considers learning in an RKHS with the non-negativity constraint $f(x) \geq 0$, which turns up with surprising regularity in areas of machine learning. The proposed approach uses an elegant "trick": restricting the kernel selection to the so-called inverse M-kernels, which allows the constraint to be enforced in a straightforward manner (i.e., by adding a simple linear constraint to the optimization).

Strengths

This paper addresses a problem which I have encountered numerous times in my own research, as such global constraints on GP models seem ubiquitous in applied Bayesian optimization. I have therefore seen (and used) any number of ad-hoc approaches to dealing with such constraints, typically involving either inducing point grids (which can add significant computational complexity) or latent-space constructs - log-space, function squared etc - that may distort the model in unhelpful and unintuitive ways. This paper is the first proposal I have seen that appears to offer what I would consider a satisfactory solution to the problem.

  • the problem is well-motivated.
  • presentation is mostly readable (though see my concerns below) and the flow of the paper is clear.
  • as far as I can tell the solution is mathematically sound.

Weaknesses

The role of the function $s(N)$ needs to be discussed in more depth. Reading the paper I am left with the impression that $s$ is a bit of a fudge factor: ideally you would only want to work with strict inverse M-kernels, but in practice they're hard to find, so you compromise by including $s$. My problem here is that the set of functions $s$ is too broad, including potentially arbitrary scaling with $N$. Speculatively speaking, maybe you could discuss $s$ in big-O terms; you could even categorize inverse M-kernels this way, which would give you a sort of hierarchy (you might argue that a kernel for which $s \sim O(N)$ suffices is in a sense better than one requiring $s \sim O(N^2)$). Alternatively, would it be reasonable to borrow from NTK theory (where weight matrices are scaled according to width), scale the kernel itself by $\frac{1}{N}$, and replace $s$ with a constant?

I am a little disappointed that the experiments don't include multidimensional examples but it is clear from the discussion that this is a difficult problem so I am willing to overlook this point.

Minor point: please take the time to run this paper through a spell-checker!

Questions

See weaknesses section.

Limitations

N/A

Author Response

We would like to thank the reviewer for the highly positive and constructive comments. It is incredibly encouraging to know that a reviewer with practical experience in the estimation of non-negative functions finds value in our work! Below, we provide a detailed response to each of the comments.

The role of the function $s(N)$ needs to be discussed in more depth...

We thank the reviewer for the pertinent comments. We agree with the reviewer that the role of $s(N)$ is ambiguous in the paper. Following the reviewer's suggestion, we will introduce $s$ as a non-negative factor in the definition of inverse M-kernels and discuss its possible scaling with the data size $N$: The ideal case is $s = 0$ (strict inverse M-kernels), where the inverse M-kernel model $f_{\text{IMK}}$ has the greatest representation power, but we can only find such kernels in the one-dimensional input scenario. By exploiting the strict path product condition (for details, see Future Work and Limitations), we can find inverse M-kernels with $s \sim \mathcal{O}(N)$ regardless of the input dimensionality, where $f_{\text{IMK}}$ has limited representation power for large $N$. An essential next step is to investigate a more scalable factor (e.g., $s \sim \mathcal{O}(N^{1/2})$) according to observed data, which could relax the limited representation power of $f_{\text{IMK}}$ in multi-dimensional input settings.

please take the time to run this paper through a spell-checker!

Sorry for the inconvenience. We promise to improve the grammatical quality of the paper by using an English proofreading service. We appreciate the reviewer's effort in reading our paper despite the poor English.

Comment

Thanks for the response. I am happy to keep my recommendation unchanged as I think this is a good (though not perfect) contribution tackling an important problem.

Review (Rating: 4)

The paper proposes a new kernel family that can be used to fit non-negative functions tractably, which has theoretical novelty. The universal approximation properties are analysed, and a connection is made to permanental processes. The method is applied to univariate regression, density estimation and intensity estimation.

Strengths

The method contribution has clear originality and novelty.

The paper is clearly written and presented.

The results are overall good.

Weaknesses

The method is limited to fitting univariate non-negative functions, which is disappointing. Scalar functions can already be handled with existing techniques. This limits the paper's significance to low in its current form.

The results are a bit uneven, and sometimes the new method does not seem to fit that well: in Fig1 the QNM is slightly better, and in Fig2 the QNM is arguably much better. In FigB2 the new method is much better than the baseline, but the fit is still quite poor. Furthermore, all experiments are very simple, and the baselines are quite simple as well. For instance, there are no GP-based or Cox-process-based competing methods. There are also no comparisons to non-negative splines.

I'm afraid the paper has not been able to find a purpose for its good idea, or to show what open problem only it can solve.

Questions

None

Limitations

No issues

Author Response

We would like to thank the reviewer for carefully reading our paper and giving valuable comments. While the reviewer has positive evaluations of the core part of this paper (Soundness: good, Presentation: good, Contribution: good), we understand that the critical concern of the reviewer lies in the ambiguity of the paper's purpose and significance. We can clearly state the purpose and significance of our work, and therefore, we believe that we can fully dispel the concern raised by the reviewer. Below, we provide a detailed response to each of the comments.

The method is limited to fitting univariate non-negative functions, which is disappointing. Scalar functions can already be handled with existing techniques. This limits the paper's significance to low in its current form... I'm afraid the paper has not been able to find a purpose for its good idea, or to show what open problem only it can solve.

This paper tackles the open problem of whether a linear model can simultaneously satisfy non-negativity and universal approximation. Through the novel concept of inverse M-kernels, we have, for the first time, demonstrated that the answer is 'Yes', at least in the one-dimensional input scenario. While non-linear models that satisfy both non-negativity and universal approximation have been proposed, no such linear model has existed among prior techniques. The linearity of a model offers significant computational advantages in both learning and prediction of the latent function, and as Reviewer wAiw kindly commented, its practical value is significant, even in the case of one-dimensional inputs. A comparison of CPU times between QNM (a non-linear model) and our linear model in Tables 1 and 2 clearly demonstrates that our model achieves computational speeds tens to hundreds of times faster than QNM.

Based on the reviewer's comments, we found that our explanation of the paper's purpose and significance was inadequate. We will add the above discussion in the introduction.

In Fig1 the QNM is slightly better, in Fig2 the QNM is arguably much better.

This is a misunderstanding by the reviewer, likely caused by overlooking that the evaluation metric $d_{\text{KL}}$ is the Kullback-Leibler distance (the lower, the better): in Figure 2, QNM is slightly inferior to our model in terms of $d_{\text{KL}}$. We apologize for the unclear explanation. We will add the statement "the lower, the better" to the legend and the main text.

In FigB2 the new method is much better than baseline, but the fit is still quite poor.

Actually, the estimation result achieved by our model in Figure B2 is by no means poor. Event sequence data generated from a point process is inherently noisy, and it is challenging to accurately estimate the intensity function from such data. For example, the center panel of Figure 2 in (John & Hensman, 2018) shows the estimation result of a variational Bayesian method for the same underlying intensity function, and the result is similar to that of our model in Figure B2. To clarify this, we implemented and evaluated one of the SOTA methods, the structured variational approach with sigmoidal Gaussian Cox processes (Aglietti et al., 2019). Please see the following response on intensity estimation for more details.

Furthermore, all experiments are very simple, and baselines quite simple as well. For instance, there are no GP-based or Cox-process based competing methods.

Regression and Density Estimation:

In this paper, we focus on point estimation rather than interval estimation, where GP-based models usually reduce to kernel methods that include our model. An exception is the approach of (Pensoneault et al., 2020), which imposes non-negativity constraints on a finite set of virtual points by utilizing the confidence intervals of posterior distributions. This is very different from kernel methods, which cannot access posterior distributions. However, as mentioned in Section 2.2, the approach is out of the scope of our paper because it does not guarantee non-negativity at locations other than those points. We believe that NCM, QNM, and SNF are appropriate baselines in this literature.

Intensity Estimation:

As the reviewer pointed out, Gaussian Cox processes (GCPs) are the gold standard for intensity estimation. Actually, the intensity estimator defined in Section 3.2 is the MAP solution of the permanental process, a variant of GCP assuming that the square root of the intensity function is generated from a Gaussian process. Therefore, the reviewer's claim that there are no Cox-process-based methods in the experiments is inaccurate. However, from the reviewer's comments, we found that other GCP-based methods should be included among the baselines to clarify the contribution of our paper within the context of intensity estimation: in the literature on GCPs, the advantage of permanental processes over other GCPs has been considered to be their efficient estimation algorithm, and our work improves the predictive performance of these fast-to-compute permanental processes. We added the estimation result of the structured variational Bayesian approach with sigmoidal Gaussian Cox processes (Aglietti et al., 2019), denoted by STVB. We implemented STVB with the TensorFlow code provided by (Aglietti et al., 2019), where the inducing points were set as 100 regularly aligned points within the observation domain. Table C1 and Figure C1 (please see the pdf file in the global response) show that our model achieves comparable predictive accuracy while being hundreds of times faster than STVB, which highlights our work's contribution.

John and Hensman (2018). Large-scale Cox process inference using variational Fourier features. International Conference on Machine Learning.

Aglietti et al. (2019). Structured variational inference in continuous cox process models. Advances in Neural Information Processing Systems 32.

Comment

Thanks for the response!

I can’t follow the response wrt the linear+universal models. Linear functions are not universal, since they can only fit simple straight functions. Maybe you mean non-linear kernel functions, but they are then not linear (in the ambient space). There is clearly some confusion about terminology.

Also, can’t we do universal non-negative kernel functions already eg. by having an exponential link function on a (latent) kernel function?

Based on the response I think in FigB2 you take Aglietti’s GP model, and change the kernel. Is this accurate? But surely this can’t be true: the Cox-GP is already non-negative by exponential link function construction, so adding M-kernels there should do nothing. You are probably doing some bigger changes to the CGP model, perhaps by removing the exp-function. This does not seem to be defined in the paper. The M-kernel+CGP combination is mysterious to me.

It also seems that Aglietti's model is not properly executed: the IEK in FigB2 is giving a pretty bad fit, and it does not look like the plots in Aglietti or Hensman. Maybe the code is not that great, but one should still try to tune the results until they look roughly like those in the papers.

Overall I’m still not yet convinced of the paper’s merits. The theoretical contribution seems substantial, but if it’s limited to univariate cases I don’t see benefit. I think the main empirical contribution is that the M-kernels can make simple univariate fitting tasks faster, but I don’t think it really matters whether we can do it in 5 seconds or 0.05 seconds. I would assume that if we really care about having a good model, one would happily spend hours doing eg. MCMC to build a robust model.

Also, doing non-negative spline fitting is well known, and does not seem to suffer from computational bottlenecks. The paper should at least discuss them.

Maybe a good demonstration of the method could be some really long time-series (eg. millions of points). Or perhaps the main contribution could be accelerating the CoxGP models, which would be very useful since GP inference is a major bottleneck. However to show this I think the paper should give a much more comprehensive treatment of how the M-kernels can be used to adapt CGP, and how they improve over the CGP domain in general [ie. comparing against a single CGP method from 5 years ago is not sufficient].

I still vote for rejection, but I will raise my score to 4, and would not object acceptance given others' positive reviews.

Comment

We really appreciate the reviewer's responses. Unfortunately, the reviewer still seems to have misunderstood some points of our work, which we believe can be dispelled fully by the following discussion.

I can’t follow the response wrt the linear+universal models. Linear functions are not universal, since they can only fit simple straight functions. Maybe you mean non-linear kernel functions, but they are then not linear (in the ambient space). There is clearly some confusion about terminology.

We use the term "linear model" to refer to a finite linear combination of (non-linear) kernel functions evaluated at the data points. As the reviewer suggested, the term does not mean linear regression or linear kernels. Although the term is consistent with the terminology used in the most relevant reference (Marteau-Ferey et al. 2020), we agree that it is confusing. Would it be acceptable to the reviewer to adopt "linear representation," which is often used in the literature on kernel methods and the representer theorem? We are happy to follow the reviewer's suggestion!

can’t we do universal non-negative kernel functions already eg. by having an exponential link function on a (latent) kernel function?

If the reviewer means that $f(\cdot) = \exp\Bigl(\sum_n \alpha_n k(x_n, \cdot)\Bigr)$ is a universal non-negative model, that may be correct, but it is NOT what we are aiming for: in this paper we seek a universal non-negative model of the form $f(\cdot) = \sum_n \alpha_n k(x_n, \cdot)$.
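For concreteness, a small sketch contrasting the two model classes discussed here (illustrative code; the names gauss_k, f_linear, and f_explink are not from the paper): both are non-linear functions of the input x, but only the first is linear in the coefficients alpha, which is the sense of "linear model" / "linear representation" used above.

```python
import numpy as np

def gauss_k(X, Z, ell=0.5):
    """Gaussian kernel matrix between row-wise input sets X and Z."""
    d2 = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell**2)

def f_linear(alpha, X_train, X_test):
    """Linear representation: f(x) = sum_n alpha_n k(x_n, x); linear in alpha."""
    return gauss_k(X_test, X_train) @ alpha

def f_explink(alpha, X_train, X_test):
    """Exponential-link alternative: f(x) = exp(sum_n alpha_n k(x_n, x));
    non-negative by construction, but no longer linear in alpha."""
    return np.exp(gauss_k(X_test, X_train) @ alpha)
```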

Based on the response I think in FigB2 you take Aglietti’s GP model, and change the kernel. Is this accurate?

No. IEK in Figure B2 is not Aglietti's GP model, but the intensity estimator with reproducing kernels/permanental process (we adopted Gaussian kernels). If the reviewer is confusing Figure B2 (in the Appendix) with Figure C1 (in the pdf file uploaded during the rebuttal), we did not change the kernel in the provided code, where a Gaussian kernel is implemented: Figure C1 shows the result of Aglietti's GP model with a Gaussian kernel.

the Cox-GP is already non-negative by exponential link function construction, so adding M-kernels there should do nothing. You are probably doing some bigger changes to the CGP model, perhaps by removing the exp-function. This does not seem to be defined in the paper. The M-kernel+CGP combination is mysterious to me.

The reviewer is mistaken about Aglietti's model: the model employs a sigmoidal link function, not an exponential link function. Furthermore, we did not change the structure, link function or kernel function of Aglietti's model.

It also seems that Aglietti's model is not properly executed: the IEK in FigB2 is giving a pretty bad fit, and it does not look like the plots in Aglietti or Hensman. Maybe the code is not that great

Please see Figure C1 (uploaded during the rebuttal period) for the result of Aglietti's model, not Figure B2, which does not contain Aglietti's model. Also, IEK in Figure B2 is a different model from Hensman's model (and, of course, Aglietti's model), and thus IEK's poor performance is not inconsistent with Aglietti's and Hensman's papers.

Overall I’m still not yet convinced of the paper’s merits... I don’t think it really matters whether we can do it in 5 seconds or 0.05 seconds. I would assume that if we really care about having a good model, one would happily spend hours doing eg. MCMC to build a robust model.

First and foremost, we would like to express that we respect the reviewer’s perspective. However, we would also like to emphasize that in the field of machine learning, the development of faster estimation methods is widely recognized as an essential research topic, and many researchers are actively working on this issue. We hope that the reviewer will take this fact into consideration.

Author Response

We would like to thank all reviewers for their valuable comments. We responded in as much detail as possible. We believe that we can entirely dispel the concerns raised by the reviewers. Please see the added figure (Figure C1) in response to Reviewer WaA4's suggestion.

Final Decision

This paper presents an interesting new method of constructing sufficient conditions on a positive definite kernel so that it can be used to build flexible, linear approximators of non-negative functions. A connection to the permanental process is also shown. Experimental results demonstrate that the proposed method is competitive.

Overall, the method is novel and the performance on 1-D input is encouraging. The linear universal approximation property is also valuable. However, the construction of inverse M-kernels is still open for multi-dimensional input, and no experimental results are shown for that case. This considerably limits the significance and applicability of the method. That said, balancing the overall merits, I'm inclined to recommend accepting this paper.