Scaling Continuous Latent Variable Models as Probabilistic Integral Circuits
Abstract
Reviews and Discussion
This paper extends the work on PICs with continuous LVs, proposing to tensorize them and to use neural functional sharing to scale the model. The results show that the proposed approach decreases the number of trainable parameters as well as the memory usage and running time during training. The approach also works well on density estimation tasks.
Strengths
This paper extends the work on PICs for better scalability and efficiency, and is thus novel.
The paper is well written and not difficult to understand; e.g., the illustration of Algorithm 1 in Fig. 2 (a-b) is a very nice visualisation.
With the claimed contribution, the paper has a high potential to contribute to the community.
Weaknesses
It is not clear to me what data set is used in the experiment of "Scaling PICs", and how many RVs are modelled.
I compared the density estimation results in PIC[18] on the MNIST family, and the results in this paper outperform the ones in PIC[18]. As far as I understand, the PIC (F, C) in this paper has reduced its trainable parameters and therefore should in principle have smaller model capacity. I am wondering why the density estimation results of the proposed model outperform those of the original PIC?
Similarly, it would be nice if the drop (or increase, though unlikely) in density estimation performance could be shown between PIC (F, N) and PIC (F, C), so that readers know how much performance they can lose/gain when applying the functional sharing.
Questions
- Could you elaborate on why the trainable parameters in PIC (F, C) are reduced significantly compared with PC (F, N), since, as I understand it, there is no functional sharing in PCs? That is, why can PIC (F, C) have much fewer trainable parameters than PC (F, N)? Is this comparison fair?
- In the leftmost plots in Fig. 5, I would infer that the single orange triangles in the upper left are PIC (F, N) with QG-TK, but without a dotted line connecting them it is a bit unclear.
- Is the "and" before the expression in line 156 a typo?
Limitations
The limitations are well discussed at the end of the paper.
We thank the reviewer for deeming our paper to be novel, well-written, and for expecting it to have a high impact in the community. Below we reply to their points.
It is not clear to me what data set is used in the experiment of "Scaling PICs", and how many RVs are modelled.
In the experiment “Scaling PICs” (in the main text), whose results are shown in Fig 5, we used a PC modeling ImageNet 64x64. Note that it is not important which likelihood the PC assigns to different images, but only the time and space requirements for a forward/backward pass. Therefore, any batch of 64x64 RGB images would reproduce the same performance as reported in the plot. As we write in the caption, “The benchmark is conducted using a batch of 128 RGB images of size 64x64”. Finally, all RVs are categorical (256 categories for MNIST-like images, and 768 categories for RGB images), as described in lines 286-287.
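For concreteness, the measurement amounts to something like the following minimal sketch. The stand-in model and all names below are placeholders for illustration, not our actual implementation; only the batch shape (128 RGB images of size 64x64) and the forward/backward timing follow the setup described above.

```python
import time
import torch

# Placeholder model: any module mapping a batch of 64x64 RGB images to per-image
# log-likelihoods would do here; in the paper this is the materialized (Q)PC / PIC itself.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1)).cuda()

batch = torch.rand(128, 3, 64, 64, device="cuda")  # image content is irrelevant for time/memory

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
log_likelihoods = model(batch)      # forward pass
log_likelihoods.sum().backward()    # backward pass
torch.cuda.synchronize()
print(f"time: {time.perf_counter() - start:.3f}s, "
      f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```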
About PICs with and without functional sharing
An extensive comparison between PICs with and without functional sharing in terms of performance is not present in the paper, and below we explain why. However, luckily, as the reviewer pointed out, such a comparison can be retrieved --- to a certain extent --- by taking results from Fig. 5 in [A], which correspond to much smaller models than ours. We will account for it in the revised version of the paper.
As we extensively tested in our benchmarks (Fig. 5, Table D1, Table D2), PICs without functional sharing cannot scale to large model sizes, and therefore, simply because of that, PICs with functional sharing can deliver (better) results for the computational resources we have available.
This is in fact one of the main points of our paper: scaling PICs with functional sharing, as we cannot scale them without it. Therefore, given that PICs without functional sharing can unfortunately be run only at very small sizes (see Tables D1 and D2), an extensive comparison in terms of performance w.r.t. PICs with functional sharing is not feasible on our GPUs.
We expect functional sharing to deliver better models, as applying it drastically reduces the number of trainable parameters and therefore possible overfitting. In Fig. 5 (top-right) we indeed showed that PICs without functional sharing reach 2B parameters, whereas with functional sharing they use only 6M.
[A] Gennaro Gala, Cassio de Campos, Robert Peharz, Antonio Vergari, and Erik Quaeghebeur. Probabilistic integral circuits. AISTATS, 2024.
Could you elaborate on why the trainable parameters in PIC (F, C) are reduced significantly compared with PC (F, N)?
Yes, the reason there is a significant reduction in the number of trainable parameters between PIC (F, C) and PC (F, N) is that the former applies compositional sharing of MLPs, as explained in Section 3.3. However, once we materialize a QPC out of it, its number of parameters matches exactly that of PC (F, N), since the two represent exactly the same architecture. With that in mind, we believe the comparison is fair, as the PIC approach can be seen as a way of training PC (F, N).
For a detailed analysis, we refer the reviewer to our answer to reviewer WQnV about characterizing the size of PICs and PCs.
In the leftmost plots in Fig. 5, I would infer that the single orange triangles in the upper left are PIC (F, N) with QG-TK, but without a dotted line connecting them it is a bit unclear
Precisely, those isolated orange triangles refer to PIC (F, N) with QG-TK. The reason they are isolated is that we could only run that configuration at K = 16, hence the lack of dotted lines. We refer the reviewer to Table D.2 for tabular results of the plots, where this is evident. We will make sure to detail this better in the main text.
Is the "and" before the expression in line 156 a typo?
Yes, it is a typo and will be removed; thanks for spotting it.
Thank you for the answer. My concerns have been addressed and I will keep my positive rating.
This paper presents a pipeline to build probabilistic integral circuits (PICs) more generally as directed acyclic graphs, as opposed to the tree-shaped structure they were limited to before. This significantly increases the expressiveness of the PICs, improving their representation of distributions. The authors present a procedure for turning a region graph into a PIC, then into a QPC that approximates the PIC, and finally into a folded QPC; these are effectively tensorized structures that significantly speed up inference and learning. From the tensorized PIC, the authors then further improve QPC training by using neural functional sharing techniques over groupings of the input and integral units, enabling the materialization of larger QPCs with far fewer parameters.
Strengths
- The proposed approaches significantly increase the expressivity and learning efficiency of PICs, as showcased by the experiments.
- The paper is technically sound with clear arguments presented for each step.
- The paper is overall well structured and easy to follow.
- The authors also address the current limitations of PICs with a clear goal for future improvements.
Weaknesses
Not a major weakness, but as the authors also point out, it is currently not possible to sample from PICs, limiting their impact as a generative model.
Questions
- Among the techniques discussed in the paper, how significant is the tree-shape vs DAG-shape for empirical expressiveness of PICs? It appears that DAG-shaped PICs achieve slightly better bpds, but tree-shaped ones scale much better.
- Section 3.2 describes a numerical quadrature rule as consisting of integration points and weights that minimize the integration error, and the authors later say that the same quadrature rule is used for each approximation. Does this raise any issues in approximation?
- How crucial/limiting is the assumption that every integral unit integrates out all incoming latent variables?
- Do the QPCs approximating PICs still support tractable inference (e.g. marginal probabilities) similar to traditional PCs? It would be great to clarify the probabilistic reasoning capabilities mentioned in the introduction.
Minor comments:
- Lines 133-134: do you also need Z_u3 != Z_u4 for CP-merge?
- Table 1: what does LVD-PG refer to?
Limitations
The authors acknowledge the limitation of PICs that, despite being a continuous latent variable model like many generative models, they do not support sampling. This is left as future work.
We thank the reviewer for deeming our paper to be technically sound, clearly argued, well-structured and easy to follow. Below we reply to their points.
it is currently not possible to sample from PICs, limiting their impact as a generative model.
It is true that since we use simple unconstrained MLPs to parameterize PICs --- which can be seen as energy-based models --- (differentiable) sampling from them is not possible. However, this is not a limitation of PICs per se, but a consequence of the way we choose to parameterize them. For instance, parameterizing them using simple mixtures of Gaussians or even normalizing flows would allow sampling from PICs.
Nevertheless, it is always possible to materialize PICs as QPCs and sample from them as usually done with PCs.
how significant is the tree-shape vs DAG-shape for empirical expressiveness of PICs? It appears that DAG-shaped PICs achieve slightly better bpds, but tree-shaped ones scale much better.
We remark that dropping 0.1 bpd on MNIST-like datasets would correspond to an increase of ~54 nats in log-likelihood, and perhaps this would be considered more significant. On RGB 64x64 datasets, the gap would be even larger, i.e. ~851 nats.
It is true that there is a trade-off in terms of accuracy/efficiency. We have just started exploring how to scale DAG RGs in a more effective way and believe our paper is a stepping stone in this direction. E.g., note that using Tucker layers yields the best performance for lower values of K, but we do not know how to scale them properly yet.
Section 3.2 describes a numerical quadrature rule as consisting of integration points and weights that minimize the integration error, and the authors later say that the same quadrature rule is used for each approximation. Does this raise any issues in approximation?
Using the same integration technique for each integral unit is not a limitation, nor does it raise approximation issues, when assuming a static quadrature regime. Assuming the same number of integration points, instead, allows us to tensorize the QPC in a regular way, thus promoting GPU parallelism via folding.
We conjecture that since we learn our functions through numerical approximation, these will stay nicely integrable w.r.t. the points we used during training to fit them.
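To illustrate why a shared K matters for folding, here is a minimal sketch (all shapes and values are made up for the example): the quadrature matrices of several integral units that share the same K can be stacked into a single tensor and evaluated with one batched contraction instead of a Python loop. The same idea extends to the other layer types, which is what makes a common K across units valuable.

```python
import torch

F, K, B = 4, 8, 32  # assumed: F folded units, K integration points, batch size B

# One K x K quadrature matrix per integral unit (kernel values times quadrature weights).
per_unit_mats = [torch.randn(K, K) for _ in range(F)]
inputs = torch.randn(F, B, K)  # each unit's input evaluated at the K integration points

# Unfolded: F separate matrix products, one per unit.
out_loop = torch.stack([inputs[f] @ per_unit_mats[f].T for f in range(F)])

# Folded: one batched contraction over the stacked (F, K, K) tensor -- same result, one GPU kernel.
folded = torch.stack(per_unit_mats)                      # shape (F, K, K)
out_folded = torch.einsum("fbk,fjk->fbj", inputs, folded)

assert torch.allclose(out_loop, out_folded, atol=1e-5)
```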
How crucial/limiting is the assumption that every integral unit integrates out all incoming latent variables
This is an interesting question! From a theoretical perspective, we would need to devise novel quadrature routines to materialize the QPC, as some latent variables would affect multiple integral units at once. In practice, it might be possible to "move" the non-integrated-out latent variables of the input function so that they become part of the variables of the function attached to the integral unit, for certain parameterizations of the two.
Nevertheless, we believe that from a learning perspective, PICs/QPCs have enough capacity to learn complex distributions with our assumption.
Do the QPCs approximating PICs still support tractable inference (e.g. marginal probabilities) similar to traditional PCs?
Yes, QPCs --- being composed of only sum, product and input units --- are proper PCs that inherit the structural properties of their corresponding PIC; e.g., they are also structured-decomposable if the underlying PIC is structured-decomposable. We will clarify this in the revised version of the manuscript.
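As a reminder of what this buys in practice, here is a toy sketch of marginal inference in a generic smooth and decomposable PC over two binary variables (the numbers are made up for illustration, this is not a QPC from the paper): marginalizing a variable only requires replacing its input units with the constant 1 and doing a single feed-forward pass.

```python
# Toy smooth & decomposable PC over two binary RVs X1, X2:
#   p(x1, x2) = sum_i w_i * p_i(x1) * q_i(x2)
w = [0.3, 0.7]
p = [{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}]   # input units over X1 (each sums to 1)
q = [{0: 0.5, 1: 0.5}, {0: 0.6, 1: 0.4}]   # input units over X2 (each sums to 1)

def joint(x1, x2):
    return sum(wi * pi[x1] * qi[x2] for wi, pi, qi in zip(w, p, q))

def marginal_x1(x1):
    # marginalize X2: replace its input units by 1, then evaluate the circuit once
    return sum(wi * pi[x1] * 1.0 for wi, pi in zip(w, p))

assert abs(marginal_x1(0) - (joint(0, 0) + joint(0, 1))) < 1e-12
```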
Lines 133-134: do you also need Z_u3 != Z_u4 for CP-merge?
It does not need to hold: if Z_u3 != Z_u4, we would merge u3 and u4 using Tucker-merge, whereas if Z_u3 = Z_u4, we would again use CP-merge. In this way, we allow for Tucker layers on top of CP-layers, and vice-versa.
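For intuition on what the two merges materialize to, here is a minimal sketch in the spirit of tensorized CP and Tucker layers (all shapes and weight tensors below are assumed placeholders, not the paper's parameters): a CP-style merge takes an element-wise product over a shared LV and then applies a K x K quadrature matrix, whereas a Tucker-style merge contracts a K x K x K quadrature tensor over two distinct LVs.

```python
import torch

K, B = 8, 4              # assumed number of integration points and batch size
a = torch.randn(B, K)    # unit u3 evaluated at the K points of its LV
b = torch.randn(B, K)    # unit u4 evaluated at the K points of its LV

# CP-style merge (shared LV): Hadamard product, then a K x K quadrature matrix.
W_cp = torch.randn(K, K)
out_cp = (a * b) @ W_cp.T

# Tucker-style merge (distinct LVs): contraction with a K x K x K quadrature tensor.
W_tk = torch.randn(K, K, K)
out_tk = torch.einsum("bj,bk,ijk->bi", a, b, W_tk)

assert out_cp.shape == out_tk.shape == (B, K)
```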
Table 1: what does LVD-PG refer to?
It stands for Latent Variable Distillation - Progressive Growing, and it is a sophisticated learning scheme developed in [Liu, Xuejie, et al. "Understanding the distillation process from deep generative models to tractable probabilistic circuits." ICML 2023] that employs several heuristics to learn the parameters and structure of a PC on a lossy color space. We will specify this in the camera-ready version.
Thank you for the detailed response! I think clarifying the tractable inference and sampling (although differentiable sampling still seems to be a challenge at this point) supported by the learned PICs will make the contribution more impactful. I will keep my score of 8.
This paper introduces a new approach for building probabilistic integral circuits (PICs), a recently introduced probabilistic model that extends probabilistic circuits (PC) with continuous functions of latent variables (in addition to discrete latent variables), with inference performed using a numerical quadrature procedure. In particular, while the previous implementation was restricted to tree-structured region graphs/variable decompositions, it is shown how to construct PICs for arbitrary region graphs. Additionally, sharing of functions across different regions is used to improve parameter efficiency. Empirical results show that the architectural improvements enable scaling to complex image datasets such as ImageNet64, outperforming comparable PCs trained using maximum likelihood or EM.
Strengths
The paper proposes a number of innovations on top of the recently proposed PICs, such as allowing for arbitrary region graphs and functional sharing. Taken together, this constitutes a well-engineered solution that significantly improves the scalability and flexibility of PICs.
The experiments are well-executed with strong results on a range of image datasets in comparison to comparable PCs learned directly. Many experimental details are reported, such as memory usage, time taken and parameter counts/hyperparameters, which should make the results reproducible.
The paper was a pleasure to read, with clear and consistent notation throughout explaining the method and accompanying pictorial illustrations that make the idea of the paper transparent and easy to understand.
Weaknesses
The paper is arguably a little weak in terms of novelty as most of the new components are adaptations of existing ideas in the literature (e.g. tucker/cp layers, parameter sharing) to probabilistic integral circuits.
Questions
- What numerical quadrature rule was used?
- In the conclusion, it is remarked that differentiable sampling from PICs is not possible. What is meant by differentiable here, and is it possible to sample from the materialized PC if one is not interested in gradients (e.g. for visualization)?
- To what extent do the authors consider their method a continuous latent variable model, as opposed to a means of effectively constructing PCs? For instance, if the same quadrature points are used throughout training and evaluation, can one expect the MLPs to learn meaningful function values between evaluation points?
- Did the authors investigate learning PICs according to a HCLT region graph? This would give a better idea of whether a DAG (as opposed to tree-structured) region graph is necessary.
Typos:
- modes -> nodes 190
- capitalization of 'functional' 234
- 'pixels are categorical' -> 'pixels as categorical' 286
Limitations
The limitations of the method are appropriately addressed in the paper.
We thank the reviewer for deeming our paper to be a well-engineered solution delivering strong results. We are also happy that the reviewer enjoyed reading the paper. Below we reply to their questions.
The paper is arguably a little weak in terms of novelty
We respectfully disagree, as it is non-trivial to recover a compact quadrature PC (QPC) from a PIC with a DAG structure. First, we needed to change the original semantics of PICs/QPCs, which were defined only for tree region graphs; second, parameterizing the newly found tensorized QPCs needs careful engineering, otherwise the models would never be able to scale to the number of integration points we used (up to K = 512).
What numerical quadrature rule was used?
We trained using the trapezoidal integration rule; we will document this in the revised version of the paper.
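For reference, here is a minimal sketch of the composite trapezoidal rule on an assumed interval [-a, a] with K points (the interval and the value of K below are placeholders, not the exact values used in the paper): interior points get weight h and the two endpoints get weight h/2.

```python
import torch

def trapezoidal_points_and_weights(K: int, a: float = 1.0):
    # K equally spaced integration points on an assumed interval [-a, a];
    # the composite trapezoidal rule gives half weight to the two endpoints.
    z = torch.linspace(-a, a, K)
    h = 2 * a / (K - 1)
    w = torch.full((K,), h)
    w[0] = w[-1] = h / 2
    return z, w

z, w = trapezoidal_points_and_weights(K=128)
approx = (w * z**2).sum()   # integral of z^2 over [-1, 1] is 2/3; approx is ~0.6666
```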
What is meant by differentiable [sampling] here, and is it possible to sample from the materialized PC if one is not interested in gradients (e.g. for visualization)?
When we say that differentiable sampling is not currently possible, it is just because the functions we use are parameterized as simple unconstrained MLPs, which can be seen as energy-based models, for which exact sampling is not possible. While it would be possible to sample from these neural networks using MCMC, obtaining meaningful gradients through the samples would require additional approximations. Nevertheless, we can always materialize PICs as QPCs and sample from them as usually done with PCs.
To what extent do the authors consider their method a continuous latent variable model, as opposed to a means of effectively constructing PCs?
PICs are continuous latent variable models by definition. It is true, indeed, that the way we currently use PICs resembles a parameter learning procedure for tensorized PCs. However, we stress that at test time we can actually materialize arbitrarily sized PCs based on the number of integration points we use. We plan to explore (and exploit) this capability further in future work.
Did the authors investigate learning PICs according to a HCLT region graph?
Learning PICs using HCLT region graphs (RGs) has already been investigated in prior work [A]. In our work, we directly use RGs that, differently from HCLTs, are not learned from data but rather fixed for a given image size: QuadTree (QT) and QuadGraph (QG) RGs (see Appendix B for details). By comparing our results with the ones in [A], one can clearly see that QT and QG outperform HCLT, which highlights that learning an RG from images does not bring any advantage. QG outperforming QT brings evidence to support the claim that non-tree RGs can be more expressive.
[A] Gennaro Gala, Cassio de Campos, Robert Peharz, Antonio Vergari, and Erik Quaeghebeur. Probabilistic integral circuits. AISTATS 2024
Thank you for the response and clarifications. I will keep my positive score.
This work extends probabilistic integral circuits (PICs) from tree-shaped region graphs (RGs) to DAG-shaped ones. While constructing PICs from DAG RGs can lead to more expressive models, it raises scalability concerns since the circuit sizes might increase considerably. To address this concern, this work proposes tensorized circuits that approximate the resulting PICs via a PIC-to-QPC reduction. This is further combined with functional sharing and materialization to achieve efficiency and scalability, which is validated by experimental results on several image datasets.
Strengths
- The extension from tree-shaped PICs to DAG-shaped ones is a solid contribution as it greatly improves the expressiveness of the resulting PICs and is an important step towards PICs constructed from more complex RGs for complicated tasks.
- The functional sharing and materialization techniques proposed in this work are shown to be very useful by the empirical results. From Fig 5, learning PICs consumes similar resources to PCs, which are efficient models, and it requires far fewer trainable parameters than the variants without functional sharing, making the training of large-scale PICs feasible.
Weaknesses
I appreciate that this work aims to solve an important problem and the proposed solution is shown to be novel and efficient. Still, it's heavy in technical details and thus I don't think I fully understand the algorithmic details. Specifically,
- I'm confused by the definition of the integral unit from Line 58 to Line 62, especially by the inconsistency in the input of g_i between Line 61 and Line 62 (the g_i inside the integral). Also, the integral in Line 62 contains the function g_i while Figure 1(b) does not, which is also confusing. It might be helpful to put a numeric example there.
- This also applies to the other technical sections of this work given that this work is heavy in techniques. It would be very helpful to provide a toy example to showcase the reduction of PICs to QPCs to improve accessibility and reproducibility for readers who are not well-versed in these areas.
- I'm curious to see some theoretical analysis on the circuit sizes characterized. I understand that the bounds might be loose and not that informative. Still, it might be helpful to understand how the PIC to QPC reduction affects the circuit sizes.
Questions
- Can you provide some examples to help understand the integral units?
- From the definition of g_u in Line 71, the integral is a function of both variables X_i and Z_u. Can you explain the claim "In this way, the output LVs of any unit u will simply be Zu" that directly follows the definition of g_u?
- Can you characterize the circuit sizes of the resulting PICs and QPCs?
Limitations
Yes.
We thank the reviewer for deeming our paper to be a solid contribution, and a novel and efficient solution to an important problem. Below we clarify the confusion about notation, and respond to their points.
I'm confused by the definition of the integral unit from Line 58 to Line 62, especially by the inconsistency in the input of g_i between Line 61 and Line 62
The two expressions are just equivalent notations; we will remove the brackets to avoid confusion. We used them to try to emphasize the set of LVs. Finally, note that, when writing an integral, the LVs that we integrate out appear in lower case (e.g. line 62).
the integral in Line 62 contains the function g_i while Figure 1(b) does not, which is also confusing. It might be helpful to put a numeric example there.
An integral unit, by definition, receives input from a single unit. Assume that the integral unit is associated with a function g_i(Z_i, Z_j), and that its input unit outputs a function g_j(X, Z_j). The integral unit then computes the integral of g_i(Z_i, z_j) * g_j(X, z_j) over z_j (the integration domain is the support of Z_j, by assumption), therefore outputting a function of X and Z_i; for simple enough parameterizations this integral can even be computed in closed form. (We will add a worked example to the camera-ready.) In Figure 1, for instance, integral unit 5 (the red-coloured one) computes such an integral over the LV of its input unit, outputting a function of its own LV and of the RVs in its scope.
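To make this concrete in code, here is a small sketch of how a QPC approximates such an integral unit with K integration points and weights; the functions g_i and g_j, the domain [-3, 3] and the value of K below are toy choices for illustration only, not the paper's parameterization.

```python
import numpy as np

# Assumed toy functions, only for illustration:
def g_i(z_out, z_in):          # function attached to the integral unit
    return np.exp(-(z_out - z_in) ** 2)

def g_j(x, z_in):              # function computed by the (single) input unit
    return np.exp(-(x - z_in) ** 2)

# Trapezoidal quadrature on an assumed domain [-3, 3] with K points.
K = 256
z, h = np.linspace(-3.0, 3.0, K, retstep=True)
w = np.full(K, h)
w[[0, -1]] /= 2

def integral_unit_output(x, z_out):
    # QPC approximation of  int g_i(z_out, z) * g_j(x, z) dz  as a weighted sum over K points.
    return np.sum(w * g_i(z_out, z) * g_j(x, z))

print(integral_unit_output(x=0.2, z_out=-0.5))
```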
It would be very helpful to provide a toy example to showcase the reduction of PICs to QPCs
This is a great suggestion: in the camera-ready version we will describe in more detail in the main text the example we provide in Figure 2. There, we illustrate the application of PIC2QPC (Algorithm 3) from the PIC in (b) to the QPC in (c), and then to its tensorized folded version in (d). Zooming in, we also illustrate the details of the materialization of a 3-variate function to a Tucker layer in Figure 3. We will further expand this explanation, mention all the functions involved, and comment on how Algorithm 3 operates step by step.
I'm curious to see some theoretical analysis on the circuit sizes characterized
We are not entirely sure how to understand the reviewer's request. We interpreted it as asking for an analysis of the number of trainable parameters of PCs and PICs. We are happy to discuss further in case we are off-topic.
We address the question by analyzing the 3 types of folded layers (CP-layer, Tucker-layer, and Categorical-layer) our tensorized architectures consist of, as they are nothing more than a sequence of them. Below, F denotes the number of layers stacked together with folding and K the number of integration points; the size of a PIC neural net is determined by its width and its number of MLP layers, and is in particular independent of K.
- A folded CP-layer is explicitly parameterized by one K x K quadrature matrix per folded layer, i.e. a tensor with F * K^2 entries (where F is always even). Using PIC composite sharing, the multi-headed MLP (assuming no bias term for simplicity) that parameterizes such a layer is much smaller, as its size does not grow with K (see Figure 4 for an illustration).
- A folded Tucker-layer is explicitly parameterized by one K x K x K quadrature tensor per folded layer, i.e. a tensor with F * K^3 entries. Parameterizing such a layer with composite sharing again requires only a comparatively small MLP.
- A folded Categorical-layer is explicitly parameterized by one K x C matrix per folded layer, where C is the number of categories (256 for gray-scale images, and 768 for RGB ones). When using composite sharing, we can parameterize it with a single small MLP. However, we found that PICs deliver better results when using full sharing for the input layer, which requires an even smaller MLP.
As an example, a large folded CP-layer has orders of magnitude more trainable parameters than the MLP that parameterizes it with composite sharing; the reduction exceeds 99%. We refer the reader to Appendix C1 for details about our MLPs, and to the plots on the right of Fig. 5 for the total trainable parameter counts of PC and PIC models. We will add this analysis to the revised paper.
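Since the exact MLP layout is given in Appendix C1, the following is only a deliberately schematic count under assumed shapes (a hypothetical MLP taking a pair of integration points as a 2-dimensional input, with one small output head per folded layer), meant to illustrate why the explicit tensor grows with F * K^2 while the shared MLP essentially does not.

```python
# Deliberately schematic count; every number and the MLP layout are assumed (biases ignored).
F, K = 512, 256           # assumed fold size and number of integration points
S, L = 256, 3             # assumed MLP width and number of MLP layers

explicit_cp_tensor = F * K * K                 # one K x K quadrature matrix per folded layer
shared_mlp = 2 * S + (L - 1) * S * S + F * S   # input layer + hidden layers + F output heads of size S

print(explicit_cp_tensor)                      # 33_554_432
print(shared_mlp)                              # 262_656
print(1 - shared_mlp / explicit_cp_tensor)     # ~0.992, i.e. roughly 99% fewer parameters
```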
We thank all reviewers for their insightful feedback, questions, and kind words. We are glad that the paper seems to be (i) very well-received (“The paper was a pleasure to read, with clear and consistent notation” - 6CM1, “transparent and easy to understand” - 6CM1, “The paper is technically sound with clear arguments presented for each step” - iaCH, “The paper is overall well structured and easy to follow” - iaCH, “The paper is well written and not difficult to understand'' - fmPL), and (ii) deemed as a strong contribution to the field (“The extension from tree-shaped PICs to DAG-shaped ones is a solid contribution” - WQnV, “the proposed solution is shown to be novel and efficient” - WQnV, “The paper proposes a number of innovations” - 6CM1, “this constitutes a well-engineered solution” - 6CM1, “The proposed approaches significantly increase the expressivity and learning efficiency of PICs” - iaCH, “the paper has a high potential to contribute to the community” - fmPL).
We answer all remaining questions below and will include feedback in the revised version of the paper.
This work presents a novel construction for probabilistic integral circuits (PICs) over general directed acyclic graphs. Extending PICs beyond trees increases their representational power. Overall, the reviewers felt that the work was clearly presented and represented a significantly novel contribution to the field with only minor weaknesses.