Implicit Neural Representation Inference for Low-Dimensional Bayesian Deep Learning
Abstract
Reviews and Discussion
The authors propose an approximate inference method for Bayesian neural networks. The idea is to introduce a second (auxiliary) network which computes an approximate posterior over the multiplicative noise applied to the deterministic weights of the main network. This multiplicative noise then induces an approximate predictive posterior distribution for the main network, the key object of interest in BNNs. The auxiliary network can be much smaller than the main one, and the posterior over its weights can be approximated using one of the existing methods (e.g. Laplace, SWAG, normalizing flows). The proposed method is shown to be competitive against a number of baselines.
Strengths
- An interesting approach to approximate BNN inference showing competitive performance
- Clear presentation, the paper is easy to follow
Weaknesses
The baselines used for comparison (Dropout, BbH, Ensembles) are relatively old methods (by deep learning standards, of course). I'd be very interested to see how the proposed method compares to more modern approaches, e.g. those applying a Laplace approximation directly to the weights of the main network.
A couple of minor points:
- Typo in the second to last line on page 3 (wf).
- Typo in Eq. (5) (w_{INR} should be in the subscript I guess?)
Questions
- I wonder about the integer inputs (i.e. the tensor coordinates) to the INR network. Do you normalise these coordinates somehow or directly input the integers into the INR network? Did such an integer input space cause any problems during training?
- Why do you think that sinusoidal activations are particularly suitable for the INR network? Do you expect the result to deteriorate if you used other activations (e.g. sigmoid)?
- Why do you think the INR with 350 outperformed 4k and 10k versions on CIFAR in Fig. 1?
- I was very interested to see that the predictive uncertainty in Fig. 2 has a stationary structure (i.e. dependent on the distance from the observations) similar to a GP with a stationary kernel. The Dropout and Deep Ensembles baselines clearly don't have such a property, but I wonder how such a figure would look like if we used a Laplace approximation directly on some layers of the main network (without INR), e.g. similar to Kristiadi et al. (2020)? In other words, I wonder how specific this stationary uncertainty structure is to the inference using an auxiliary network?
“I was very interested to see that the predictive uncertainty in Fig. 2 has a stationary structure (i.e. dependent on the distance from the observations) similar to a GP with a stationary kernel. The Dropout and Deep Ensembles baselines clearly don't have such a property, but I wonder how such a figure would look like if we used a Laplace approximation directly on some layers of the main network (without INR), e.g. similar to Kristiadi et al. (2020)? In other words, I wonder how specific this stationary uncertainty structure is to the inference using an auxiliary network?”
The stationary structure (or in-between uncertainty) is one of the benefits of the linearized Laplace approximation, as shown in multiple works (Kristiadi et al., 2020; Daxberger et al., 2021; Immer et al., 2021). What we want to highlight in Fig. 2 is that the low-dimensional INR space we propose is able to maintain the appealing characteristics of the approximate inference method applied (in this particular case, the stationary structure of the linearized Laplace).
Thank you for your detailed reply. I don't have further questions at this stage and I confirm my positive view on this submission.
Thank you for the effort you put into the review, and thank you for appreciating our work. We will do our best to improve the final manuscript as per your comments.
“Why do you think the INR with 350 outperformed 4k and 10k versions on CIFAR in Fig. 1?”
We believe that the reason is related to the complexity of each problem. In CIFAR, a smaller model seems to be “enough” in terms of capacity, while the bigger versions are overly complex. Note that in Corrupted CIFAR (columns on the right) the situation is reversed, because the problem is comparatively more difficult, and we need the extra model capacity.
“Why do you think that sinusoidal activations are particularly suitable for the INR network? Do you expect the result to deteriorate if you used other activations (e.g. sigmoid)?”
We trained a ResNet-18 on both CIFAR-10 and CIFAR-100 for 100 epochs to evaluate the predictive capabilities of the sinusoidal hypernetwork. Here we have a comparison of sine vs. ReLU. If you would like us to include sigmoid or another specific activation function, we'll be happy to add that comparison.
CIFAR10
RELU_MAP Accuracy: 91.11 LL: -0.4891 Error: 0.088 Brier: 0.1484 ECE: 0.05891
SINE MAP Accuracy: 91.70 LL: -0.4449 Error: 0.083 Brier: 0.138 ECE: 0.05401
CIFAR100
RELU_MAP Accuracy: 67.79 LL: -2.544 Error: 0.3221 Brier: 0.537 ECE: 0.23232
SINE MAP Accuracy: 68.49 LL: -2.3990 Error: 0.3151 Brier: 0.527 ECE: 0.2256
We find that sine/periodic activations – the “default” choice in SIREN – slightly outperform a hypernetwork with ReLU activations. Still, the results are very close, though there is a trend in favor of sine in all benchmarks. The original motivation behind the sine activation is related to modeling high-frequency content, which translates to details in structured signals such as images or video [Sitzmann 2020]. We can, however, turn this on its head, so to speak: in structured signals we care more about low-frequency content, and high frequencies are “good to have”. We can interpret an input semantically if we see its low frequencies, but not necessarily vice versa. For example, image compression will invariably throw away high frequencies first, and the last frequencies to be discarded are the lowest ones. Our conjecture is as follows: when using an INR to model perturbations, we are faced with a different situation, corresponding to a different “frequency landscape” (perhaps even different from that of the model weights). In particular, we have no reason to privilege lower or higher frequency content in any respect. We “care” about all frequencies, so we need a good way of modeling high frequencies as well. Perhaps this is why the sine activation gives a small edge over ReLU.
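To make the comparison concrete, here is a minimal sketch of the two hypernetwork layer variants being compared: a SIREN-style sine layer with the frequency scaling ω₀ and the initialization from Sitzmann et al. (2020), next to a plain ReLU layer. Class names and the ω₀ = 30 default are illustrative, not the exact configuration used in the paper.

```python
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """SIREN-style layer: y = sin(omega_0 * (Wx + b)), initialized as in Sitzmann et al. (2020)."""
    def __init__(self, in_features, out_features, omega_0=30.0, is_first=False):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)
        with torch.no_grad():
            bound = 1.0 / in_features if is_first else math.sqrt(6.0 / in_features) / omega_0
            self.linear.weight.uniform_(-bound, bound)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

class ReluLayer(nn.Module):
    """ReLU counterpart with identical width, as in the ablation above."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.relu(self.linear(x))
```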
“I wonder about the integer inputs (i.e. the tensor coordinates) to the INR network. Do you normalize these coordinates somehow or directly input the integers into the INR network? Did such an integer input space cause any problems during training?”
Following the SIREN paper (Sitzmann et al., 2020), the tensor-coordinate inputs of the hypernetwork are real numbers normalized to [-1, 1]. All the technical details for each experiment are in the Appendix, but we could add a separate paragraph dedicated to the INR technical details.
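For illustration, a minimal sketch of such a normalization of integer tensor coordinates to [-1, 1]; the helper name is hypothetical and the exact mapping used in the paper may differ in detail.

```python
import torch

def normalize_indices(idx, size):
    """Map integer coordinates 0..size-1 linearly onto the interval [-1, 1]."""
    return 2.0 * idx.float() / (size - 1) - 1.0

# e.g. the 128 output-unit indices of a linear layer:
cols = torch.arange(128)
print(normalize_indices(cols, 128)[:3])  # tensor([-1.0000, -0.9843, -0.9685])
```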
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.85 LL: -2.1430 Error: 0.3015 Brier: 0.4997 ECE: 0.2078
SWAG - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -3.9721 Error: 0.51 Brier: 0.8218 ECE: 0.3483
Laplace - CIFAR100 In-Dist Test data Accuracy: 66.0 LL: -3.9903 Error: 0.34 Brier: 0.978 ECE: 0.6377
Laplace - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -4.1815 Error: 0.51 Brier: 0.989 ECE: 0.4709
*For SWAG and linearized Laplace with GGN, in order to be able to run across low-dimensional spaces we choose the covariance to have a diagonal structure.
As for the results, we believe that in both datasets there is at the very least a trend in favor of both proposed INR-x methods. The results validate to a considerable degree the premise of the proposed methods: Instead of choosing a subset or subnet following the rationale of the corresponding methods, the INR produces "ξ" outputs that endow the full network with the desirable stochasticity, while keeping the dimensionality of the random process that we want to do inference upon at a low level.
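For reference, a minimal sketch of what such a diagonal-covariance SWAG over the (flattened) INR weights can look like, following the standard SWAG-diagonal recipe of Maddox et al. (2019); this is a generic illustration, not the authors' implementation.

```python
import torch

class DiagSWAG:
    """Running first/second moments of a parameter vector; diagonal-Gaussian posterior sampling."""
    def __init__(self, num_params):
        self.n = 0
        self.mean = torch.zeros(num_params)
        self.sq_mean = torch.zeros(num_params)

    def collect(self, theta):
        """Update running moments, e.g. once per epoch after the SWA collection point."""
        self.n += 1
        self.mean += (theta - self.mean) / self.n
        self.sq_mean += (theta ** 2 - self.sq_mean) / self.n

    def sample(self):
        """Draw INR weights from N(mean, diag(var))."""
        var = torch.clamp(self.sq_mean - self.mean ** 2, min=1e-30)
        return self.mean + var.sqrt() * torch.randn_like(self.mean)
```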
“The baselines used for comparison (Dropout, BbH, Ensembles) are relatively old methods (by Deep Learning standards of course). I'd be very interested to see how the proposed method compares to more modern approaches, e.g. those applying Laplace approximation directly on to weights of the main network.”
Following the reviewer’s recommendations we added an additional ablation study. We tried to measure the quality of the proposed subspaces in terms of predictive uncertainty. Specifically, we compare our INR low-dimensional space with: Rank-1 (Dusenberry et al., 2020), the Wasserstein subnetwork (Daxberger et al., 2021), and partially stochastic ResNets (Sharma et al., 2023). We trained each method combined with a ResNet-18 for 100 epochs on both CIFAR-10 and CIFAR-100, while keeping the approximate inference method fixed across all low-dimensional spaces.
- Sharma, Mrinank, et al. "Do Bayesian Neural Networks Need To Be Fully Stochastic?." International Conference on Artificial Intelligence and Statistics. PMLR, 2023.
RANK1
SWAG - CIFAR10 In-Dist Test data Accuracy: 91.78 LL: -0.4187 Error: 0.082 Brier: 0.1343 ECE: 0.0522
SWAG - CIFAR10 Corrupted Test data Accuracy: 77.80 LL: -1.2537 Error: 0.222 Brier: 0.3596 ECE: 0.1469
Laplace - CIFAR10 In-Dist Test data Accuracy: 91.01 LL: -1.5630 Error: 0.090 Brier: 0.6872 ECE: 0.6871
Laplace - CIFAR10 Corrupted Test data Accuracy: 78.07 LL: -1.7068 Error: 0.220 Brier: 0.7396 ECE: 0.5737
SWAG - CIFAR100 In-Dist Test data Accuracy: 65.84 LL: -2.29 Error: 0.34 Brier: 0.55 ECE: 0.228
SWAG - CIFAR100 Corrupted Test data Accuracy: 42.80 LL: -4.77 Error: 0.57 Brier: 0.92 ECE: 0.398
Laplace - CIFAR100 In-Dist Test data Accuracy: 69.00 LL: -4.01 Error: 0.31 Brier: 0.9714 ECE: 0.668
Laplace - CIFAR100 Corrupted Test data Accuracy: 42.00 LL: -4.25 Error: 0.58 Brier: 0.979 ECE: 0.401
INR
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.17 LL: -0.384 Error: 0.078 Brier: 0.1296 ECE: 0.0498
SWAG - CIFAR10 Corrupted Test data Accuracy: 78.60 LL: -1.164 Error: 0.214 Brier: 0.3566 ECE: 0.1447
Laplace - CIFAR10 In-Dist Test data Accuracy: 89.0 LL: -1.563 Error: 0.110 Brier: 0.6869 ECE: 0.6647
Laplace - CIFAR10 Corrupted Test data Accuracy: 81.0 LL: -1.660 Error: 0.190 Brier: 0.7156 ECE: 0.5890
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.12 LL: -2.094 Error: 0.308 Brier: 0.5006 ECE: 0.2046
SWAG - CIFAR100 Corrupted Test data Accuracy: 46.5 LL: -4.1878 Error: 0.535 Brier: 0.8418 ECE: 0.3640
Laplace - CIFAR100 In-Dist Test data Accuracy: 70.0 LL: -3.9172 Error: 0.300 Brier: 0.967 ECE: 0.6747
Laplace - CIFAR100 Corrupted Test data Accuracy: 42.0 LL: -4.199 Error: 0.58 Brier: 0.9770 ECE: 0.396
SUBNETWORK
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.54 LL: -0.4251 Error: 0.074 Brier: 0.1255 ECE: 0.0491
SWAG - CIFAR10 Corrupted Test data Accuracy: 76.90 LL: -1.4599 Error: 0.231 Brier: 0.3845 ECE: 0.1732
Laplace - CIFAR10 In-Dist Test data Accuracy: 91.0 LL: -1.551723 Error: 0.090 Brier: 0.682 ECE: 0.6823
Laplace - CIFAR10 Corrupted Test data Accuracy: 81.0 LL: -1.650211 Error: 0.190 Brier: 0.7134 ECE: 0.5886
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.86 LL: -2.1430 Error: 0.30 Brier: 0.49 ECE: 0.207
SWAG - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -3.9721 Error: 0.51 Brier: 0.82 ECE: 0.3483
Laplace - CIFAR100 In-Dist Test data Accuracy: 68.0 LL: -3.9505 Error: 0.32 Brier: 0.9682 ECE: 0.655
Laplace - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -4.1309 Error: 0.51 Brier: 0.974 ECE: 0.466
PARTIALLY STOCHASTIC
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.54 LL: -0.4251 Error: 0.074 Brier: 0.1255 ECE: 0.0491
SWAG - CIFAR10 Corrupted Test data Accuracy: 76.92 LL: -1.4499 Error: 0.201 Brier: 0.3806 ECE: 0.1730
Laplace - CIFAR10 In-Dist Test data Accuracy: 90.8 LL: -1.561 Error: 0.091 Brier: 0.68 ECE: 0.702
Laplace - CIFAR10 Corrupted Test data Accuracy: 80.0 LL: -1.67 Error: 0.21 Brier: 0.72 ECE: 0.59
(thread continued)
”An interesting approach to approximate BNN inference showing competitive performance [..] Clear presentation, the paper is easy to follow”
Thank you, we appreciate your positive comments. We are very glad that you found the presentation clear.
This work proposes so-called implicit neural representation inference for Bayesian neural networks. In the past, several "subspace" inference frameworks have been devised, where the aim is to model only a smaller part of the weight space in a Bayesian manner. This type of approach promises to better scale approximate Bayesian inference for deep learning while making the inference procedure more accurate. Building upon this line of work, the paper proposes to obtain the "subspace" of weights using an implicit neural representation. The paper shows how the proposed method can enhance popular frameworks such as the Laplace approximation, SWAG and normalizing flows. Experiments are conducted on UCI, CIFAR-10 and CIFAR-100, and the approach is compared against baselines which perform inference over all the parameters of the neural network, as opposed to a subspace.
Strengths
In my view, the paper has the following strengths:
- The idea of using implicit neural representations for improving Bayesian neural networks is interesting and novel.
- Implicit neural representations are currently a popular topic and may therefore be relevant to many researchers.
Weaknesses
On the other hand, I think the paper has several areas to improve for publication.
- The paper could improve in terms of its clarity.
Specifically, I find it difficult to parse section 3.1. I get the problem statement, but the parts on implicit neural representations with the SIREN model were difficult to understand. It would help to have a figure on this since many notations are introduced. The text contains many mathematical symbols, which could have been explained better.
SIREN should be explained in more depth since I think it is an important technical detail. Also, some technical details on how the implicit neural representation is obtained, and its general working principles, such as an underlying algorithm, would be helpful for the reader.
- Experiments may become more solid with other choices of baselines and datasets.
First, there have been many subspace approaches and also many approaches that attempt to sparsely represent model uncertainty. Some examples are referenced in the paper: Kristiadi, Dusenberry, Daxberger, etc. I think the experiments should compare to these baselines as well, which could really show the advantages of implicit neural representations over existing methods within the same class of approaches. Comparisons to the full weight space do not seem natural.
Moreover, the paper uses UCI and CIFAR as the main datasets. I would have liked the paper more if the "uncertainty baselines" from Google were used, as such works represent the more up-to-date standard in experimental protocol. Speaking of the protocol, the included baselines do not seem very consistent, e.g., Figure 1 misses the Laplace approximation, Figure 2 again selectively reports INR-Laplace and contains no SWAG, Figure 3 misses SWAG and INR-SWAG, etc.
Overall, I recommend a weak rejection. Improving the technical quality and clarity of the presentation would be meaningful here.
Questions
1. In line with section 3.1, is it possible to explain why an INR is advantageous for learning the subspace? I could not get why it might be a good idea.
- Another question is on expressiveness vs. accuracy of the Bayesian inference. Basically, the weight space is very large. Having a simple distribution in such a complex high-dimensional space can already be advantageous in terms of expressiveness of the probability distribution. Of course, a natural direction has also been improving the expressiveness through structured distributions, correlations amongst layers, etc. On the other hand, what these subspace approaches do is to model a smaller part of the network, which can actually reduce the overall expressiveness of the distribution, though the overall inference might be easier and more accurate. Then the question is, when should we look into the approaches for expressiveness, and when should we look into the subspace approaches?
- Why not only take the last few layers of the network and make them probabilistic? What are the advantages of using an INR against very simple baselines such as taking the last one or three layers? I think it might be interesting to include them as baselines in the experiments.
“The paper could improve in terms of its clarity. [...] Specifically, I find it difficult to parse section 3.1. I get the problem statement, but the parts on implicit neural representation with the SIREN model was difficult to understand. It would help to have a figure on this since there are many notations introduced. The text contains many mathematical symbols, which would have been explained differently.”
Thank you for your comment. We will do our best to clarify this section in the final text. In the meantime, we’d be happy to explain any step or detail of the method you would like. We will add a figure representing the proposed model.
We added a graphical illustration which depicts our method in a simple MLP training setting, where the main network consists of 4 linear layers and the INR hypernetwork has 2 layers, producing accordingly 4 sets of ξ factors (see https://freeimage.host/i/JnnQtyb).
“Why not only take last few layers of the network and make them probabilistic? What are the advantages of using INR against very simple baselines as taking last one or three layers? I think it might be interesting to include them as a baselines in the experiments.”
Following the reviewer’s recommendations we added an additional ablation study. We tried to measure the quality of the proposed subspaces in terms of predictive uncertainty. Specifically, we compare our INR low-dimensional space with: Rank-1 (Dusenberry et al., 2020), the Wasserstein subnetwork (Daxberger et al., 2021), and partially stochastic ResNets (Sharma et al., 2023). We trained each method combined with a ResNet-18 for 100 epochs on both CIFAR-10 and CIFAR-100, while keeping the approximate inference method fixed across all low-dimensional spaces.
- Sharma, Mrinank, et al. "Do Bayesian Neural Networks Need To Be Fully Stochastic?." International Conference on Artificial Intelligence and Statistics. PMLR, 2023.
RANK1
SWAG - CIFAR10 In-Dist Test data Accuracy: 91.78 LL: -0.4187 Error: 0.082 Brier: 0.1343 ECE: 0.0522
SWAG - CIFAR10 Corrupted Test data Accuracy: 77.80 LL: -1.2537 Error: 0.222 Brier: 0.3596 ECE: 0.1469
Laplace - CIFAR10 In-Dist Test data Accuracy: 91.01 LL: -1.5630 Error: 0.090 Brier: 0.6872 ECE: 0.6871
Laplace - CIFAR10 Corrupted Test data Accuracy: 78.07 LL: -1.7068 Error: 0.220 Brier: 0.7396 ECE: 0.5737
SWAG - CIFAR100 In-Dist Test data Accuracy: 65.84 LL: -2.29 Error: 0.34 Brier: 0.55 ECE: 0.228
SWAG - CIFAR100 Corrupted Test data Accuracy: 42.80 LL: -4.77 Error: 0.57 Brier: 0.92 ECE: 0.398
Laplace - CIFAR100 In-Dist Test data Accuracy: 69.00 LL: -4.01 Error: 0.31 Brier: 0.9714 ECE: 0.668
Laplace - CIFAR100 Corrupted Test data Accuracy: 42.00 LL: -4.25 Error: 0.58 Brier: 0.979 ECE: 0.401
INR
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.17 LL: -0.384 Error: 0.078 Brier: 0.1296 ECE: 0.0498
SWAG - CIFAR10 Corrupted Test data Accuracy: 78.60 LL: -1.164 Error: 0.214 Brier: 0.3566 ECE: 0.1447
Laplace - CIFAR10 In-Dist Test data Accuracy: 89.0 LL: -1.563 Error: 0.110 Brier: 0.6869 ECE: 0.6647
Laplace - CIFAR10 Corrupted Test data Accuracy: 81.0 LL: -1.660 Error: 0.190 Brier: 0.7156 ECE: 0.5890
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.12 LL: -2.094 Error: 0.308 Brier: 0.5006 ECE: 0.2046
SWAG - CIFAR100 Corrupted Test data Accuracy: 46.5 LL: -4.1878 Error: 0.535 Brier: 0.8418 ECE: 0.3640
Laplace - CIFAR100 In-Dist Test data Accuracy: 70.0 LL: -3.9172 Error: 0.300 Brier: 0.967 ECE: 0.6747
Laplace - CIFAR100 Corrupted Test data Accuracy: 42.0 LL: -4.199 Error: 0.58 Brier: 0.9770 ECE: 0.396
SUBNETWORK
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.54 LL: -0.4251 Error: 0.074 Brier: 0.1255 ECE: 0.0491
SWAG - CIFAR10 Corrupted Test data Accuracy: 76.90 LL: -1.4599 Error: 0.231 Brier: 0.3845 ECE: 0.1732
Laplace - CIFAR10 In-Dist Test data Accuracy: 91.0 LL: -1.551723 Error: 0.090 Brier: 0.682 ECE: 0.6823
Laplace - CIFAR10 Corrupted Test data Accuracy: 81.0 LL: -1.650211 Error: 0.190 Brier: 0.7134 ECE: 0.5886
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.86 LL: -2.1430 Error: 0.30 Brier: 0.49 ECE: 0.207
SWAG - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -3.9721 Error: 0.51 Brier: 0.82 ECE: 0.3483
Laplace - CIFAR100 In-Dist Test data Accuracy: 68.0 LL: -3.9505 Error: 0.32 Brier: 0.9682 ECE: 0.655
Laplace - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -4.1309 Error: 0.51 Brier: 0.974 ECE: 0.466
PARTIALLY STOCHASTIC
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.54 LL: -0.4251 Error: 0.074 Brier: 0.1255 ECE: 0.0491
SWAG - CIFAR10 Corrupted Test data Accuracy: 76.92 LL: -1.4499 Error: 0.201 Brier: 0.3806 ECE: 0.1730
Laplace - CIFAR10 In-Dist Test data Accuracy: 90.8 LL: -1.561 Error: 0.091 Brier: 0.68 ECE: 0.702
Laplace - CIFAR10 Corrupted Test data Accuracy: 80.0 LL: -1.67 Error: 0.21 Brier: 0.72 ECE: 0.59
(thread continued)
“Another question is on expressiveness Vs accuracy of the Bayesian inference. Basically, the weight space is very large. Having a simple distribution in such a complex high dimensional space can be already advantageous in terms of expressiveness of the probability distribution. Of course, a natural direction has been also improving the expressiveness through structured distribution, correlations amongst layers, etc. On the other hand, what these subspace approaches do is to model with smaller part of the network, which can actually reduce the overall expressiveness of the distribution, though overall inference might be easier and accurate. Then the question is, when should we look into the approaches for expressiveness, and when should we look into the subspace approaches?”
This is an interesting question.
Our take is that there are multiple trade-offs behind the choices we have to make in the context of a model solved with Bayesian inference. A more complex model should correspond to a more complex weight space, which translates to a more difficult optimization problem. Therefore, if we understand your position correctly, we have a decision to make regarding our “complexity budget”, so to speak. So one issue concerns whether we should prioritize learning an accurate estimator or ensuring that the learning system outputs a calibrated measure of uncertainty. In the context of the proposed INR-based approach, this is related to the complexity, expressiveness and capacity of the main network and the (SIREN) hypernetwork respectively.
Perhaps closer to your point of view is the question of how to proceed more efficiently with respect to choices regarding inference. If we have to work on the entire space of weights as the domain of our posterior distribution, in practice we need to make concessions in inference, which translates to working with uncorrelated estimates per weight, correlations only at the level of a layer, KFAC, and so on. The other approach, closer to what we do in the current work, is to choose some lower-dimensional space and work with a minimum of inference constraints. This is what we do here, and we think that a comparatively small INR is a very good way to proceed. Concerning when one way to proceed is better than the other, our results indicate that the INR-based approach comes with clear advantages. We note, though, that this is related to the complexity of the problems we try to solve. We tried our best to include as many baselines and ablations as possible in this respect.
As the probabilistic ML subfield moves towards larger and more complex baselines, time will tell how the “expressiveness vs. accuracy” tradeoff will shape new models and approaches.
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.85 LL: -2.1430 Error: 0.3015 Brier: 0.4997 ECE: 0.2078
SWAG - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -3.9721 Error: 0.51 Brier: 0.8218 ECE: 0.3483
Laplace - CIFAR100 In-Dist Test data Accuracy: 66.0 LL: -3.9903 Error: 0.34 Brier: 0.978 ECE: 0.6377
Laplace - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -4.1815 Error: 0.51 Brier: 0.989 ECE: 0.4709
*For SWAG and linearized Laplace with GGN, in order to be able to run across low-dimensional spaces we choose the covariance to have a diagonal structure.
As for the results, we believe that in both datasets there is at the very least a trend in favor of both proposed INR-x methods. The results validate to a considerable degree the premise of the proposed methods: Instead of choosing a subset or subnet following the rationale of the corresponding methods, the INR produces "ξ" outputs that endow the full network with the desirable stochasticity, while keeping the dimensionality of the random process that we want to do inference upon at a low level.
“1.In line with section 3.1, is it possible to explain why INR is advantageous to learn the subspace? I could not get why it might be a good idea.”
Our inspiration comes from works where implicit neural representations are used to learn sets of network weights. For example, in Romero et al. (2021a), convolutional kernels are represented in terms of INR-based Multiplicative Anisotropic Gabor Networks. Another recent example is Ashkenazi, Maor, et al., "NeRN: Learning Neural Representations for Neural Networks" (ICLR).
"Experiments may become more solid with other choices of baselines and datasets. First, there has been many subspace approaches and also many approaches that attempts to sparsely represent model uncertainty."
For our UCI regression experiments we added another strong baseline. For the small MLP network we use, we were able to compute the full GGN matrix in the Laplace approximation of the main network. We present the result in the form of a figure (https://freeimage.host/i/Jnn6z6g).
"I would have liked the paper more if "uncertainty baselines" from google was used, as such works represent the more upto date standard in experimental protocol. Speaking of the protocol, the included baselines seem not very consistent, e.g., Figure 1 misses laplace approximation, figure 2 again selectively reports INR Laplace and contains no SWAG, figure 3 misses swag, INR swag, etc.”
We agree that “Uncertainty Baselines” is definitely useful, and we have tried to include as many of the benchmarks that appear in it as possible.
In our work we tried to validate our INR “space” combined with a variety of approximate inference methods. In both classification and regression experiments, all three posterior approximations combined with our method yield good, calibrated results. There are specific reasons why some methods do not appear in some figures. For example, in Fig. 1 we also wanted to measure diversity, so we needed a Monte Carlo based method, and to save space we chose SWAG. In Fig. 2 we wanted to highlight the benefits of the Laplace approximation in particular. In general we followed the pattern of using Laplace/RealNVP for regression and SWAG/RealNVP for classification. We will gladly include all three methods in all experiments in the appendix.
“SIREN should be explained more in depth since I think it is an important technical detail. Also, some technical details on how implicit neural representation is obtained, and the general working principles, like an algorithm behind, would be helpful for the reader.”
We gladly add high-level pseudocode to describe our method's behavior in the training and inference settings:
Algorithm 1: Training
Inputs: I (indices of main network weights), Net (main network), INR (INR hypernetwork), Dataset
for number of epochs do
    for x, y in Dataset do
        ξ = INR(I)
        y* = Net(x, ξ)
        loss = L(y, y*)
        update INR w.r.t. loss
        update Net w.r.t. loss
    end
end
Algorithm 2: Inference
Inputs: I (indices of main network weights), Net (main network), INR (INR hypernetwork), Test_set,
Ap-Inf (approximate inference method*), MC_samples (number of Monte Carlo samples)
for x in Test_set do
    for j = 1 .. MC_samples do
        ξ_j = Ap-Inf(INR, I)
        y*_j = Net(x, ξ_j)
    end
    calculate y* statistics
end
(* In this setting a post-training, Monte Carlo based approximate inference method is implied.)
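To make the pseudocode above more concrete, here is a minimal PyTorch-style sketch of both algorithms, assuming a classification loss, a hypernetwork `inr` mapping weight coordinates `coords` to multiplicative factors ξ, a main network `net(x, xi)` that applies those factors to its weights, and a `sample_xi` function that draws ξ from the fitted posterior. All of these names are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def train(net, inr, coords, loader, epochs, lr=1e-3):
    """Algorithm 1: jointly train the main network and the INR hypernetwork."""
    opt = torch.optim.Adam(list(net.parameters()) + list(inr.parameters()), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            xi = inr(coords)                   # ξ = INR(I): one factor per main-network weight
            logits = net(x, xi)                # forward pass with multiplicatively perturbed weights
            loss = F.cross_entropy(logits, y)  # loss = L(y, y*)
            opt.zero_grad()
            loss.backward()                    # a single loss updates both Net and INR
            opt.step()

@torch.no_grad()
def predict(net, coords, sample_xi, x, mc_samples=30):
    """Algorithm 2: Monte Carlo predictive distribution from the approximate posterior over the INR."""
    probs = torch.stack([F.softmax(net(x, sample_xi(coords)), dim=-1) for _ in range(mc_samples)])
    return probs.mean(dim=0)
```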
The authors present a new framework for approximate Bayesian inference in neural networks based on low-dimensional hypernetwork representations of weight perturbations. In this framework, hypernetworks take weight coordinates as input and output perturbation factors for each weight. An approximate posterior is fitted for the hypernetwork, and perturbations are repeatedly sampled and multiplied with the main network weights to perform Bayesian model averaging. Three alternative Bayesian hypernetworks are considered: sinusoidal representation networks (SIRENs) with Laplace and SWA-Gaussian approximate posteriors, and normalizing flows with real-valued non-volume preserving (RealNVP) transformations. The posteriors are benchmarked on UCI regression and gap datasets, and out-of-distribution (OOD) detection is tested on CIFAR-10 image recognition.
Strengths
- The proposed framework introduces a novel combination of approximate Bayesian inference and implicit neural representations as hypernetworks which, to the best of my knowledge, has not been explored in prior research.
- The paper offers valuable insights, e.g. in considering the benefit of multiplicative over additive perturbations, and in finding that CNN models benefit from shared representations.
- The evaluation of out-of-distribution (OOD) detection on the CIFAR-10 dataset is thorough and detailed. This plays a large role in showing the practical utility of the proposed approach.
Weaknesses
Regarding the method:
- The derivation of the Laplace approximation is confusing: In Equations (4) and (5) a Laplace approximation with full Hessian is derived. While a full Hessian for the hypernetwork weights may be tractable for small enough networks, this approximation does not correspond to the closed form posterior of Equation (7). This closed form corresponds to linearized Laplace inference with generalized Gauss-Newton (GGN) approximation to the Hessian. It is unclear if the model evaluated in the experiments uses a full Hessian or the GGN approximation with closed form.
- The use of the SIREN activation suggests that the hypernetworks need to model high-frequency representations of the weight perturbations. The SIREN models presented in [Sitzmann et al., 2020] benefit from consistency and repeating patterns in the signals they are fitting, making interpolation easier. There might be some form of structural consistency in neighboring weight perturbations for CNN kernels, but it seems unlikely for the weights of linear layers. I also expect that RealNVP hypernetworks have a harder time fitting high-frequency representations when compared to SIREN. This matter is only very briefly brought up in the discussion of Figures 9 and 10.
Regarding experiments:
- The methods are benchmarked against a last-layer Laplace approximation, however a block-diagonal Kronecker-factorized (KFAC) Laplace approximation should also be considered since this corresponds better to a full network variant in the linearized Laplace family of approximations.
- Both theoretical and experimental runtimes and memory requirements are not discussed. It is unclear if the proposed models take significantly more time for training or inference, since in theory a forward pass of the hypernetwork is required for every "main" network weight.
Questions
- Does the INR-Laplace model in the experiments employ a full Hessian or a GGN approximation? Do you make any additional approximations (e.g. Kronecker-factorization) ?
- Can you further evaluate the effect of INR network size on model performance and quality of the uncertainty estimates? Figure 1 already does this for CIFAR-10 but it would be interesting to see for MLPs on UCI and UCI-gap datasets, with INR-Laplace and INR-RealNVP, and perhaps on wider sets of network sizes. Would also be helpful to include error bars for these figures.
- Have you tried using ReLU hypernetworks? Do these fail to recover high frequency representations of the perturbations compared to SIREN?
- Do you have an intuition for why multiplicative perturbations are so successful compared to additive perturbations of the weights? Perhaps uncertainty is improved when perturbations are scaled up by the magnitude of the weight?
- Is there a reason for INR-SWAG appearing in some experiments and INR-Laplace in others?
“The use of SIREN activation suggests that the hypernetworks need to model high frequency representations of the weight perturbations. The SIREN models presented in [Sitzmann et al., 2020] benefit from consistency and repeating patterns in the signals they are fitting, making interpolation easier. There might be some form of structural consistency in neighboring weight perturbations for CNN kernels, but it seems unlikely for the weights of linear layers. [..] This matter is only very briefly brought up in the discussion of Figures 9 and 10. [..] Have you tried using ReLU hypernetworks? Do these fail to recover high frequency representations of the perturbations compared to SIREN?”
We trained a ResNet-18 on both CIFAR-10 and CIFAR-100 for 100 epochs to evaluate the predictive capabilities of the sinusoidal hypernetwork:
CIFAR10
RELU_MAP Accuracy: 91.11 LL: -0.4891 Error: 0.088 Brier: 0.1484 ECE: 0.05891
SINE MAP Accuracy: 91.70 LL: -0.4449 Error: 0.083 Brier: 0.138 ECE: 0.05401
CIFAR100
RELU_MAP Accuracy: 67.79 LL: -2.544 Error: 0.3221 Brier: 0.537 ECE: 0.23232
SINE MAP Accuracy: 68.49 LL: -2.3990 Error: 0.3151 Brier: 0.527 ECE: 0.2256
We find that sine/periodic activations – the “default” choice in SIREN – slightly outperform a hypernetwork with ReLU activations. Still, the results are very close, though there is a trend in favor of sine in all benchmarks. The original motivation behind the sine activation is related to modeling high-frequency content, which translates to details in structured signals such as images or video [Sitzmann 2020]. We can, however, turn this on its head, so to speak: in structured signals we care more about low-frequency content, and high frequencies are “good to have”. We can interpret an input semantically if we see its low frequencies, but not necessarily vice versa. For example, image compression will invariably throw away high frequencies first, and the last frequencies to be discarded are the lowest ones. Our conjecture is as follows: when using an INR to model perturbations, we are faced with a different situation, corresponding to a different “frequency landscape” (perhaps even different from that of the model weights). In particular, we have no reason to privilege lower or higher frequency content in any respect. We “care” about all frequencies, so we need a good way of modeling high frequencies as well. Perhaps this is why the sine activation gives a small edge over ReLU.
“I also expect that RealNVP hypernetworks have a harder time fitting high frequency representations when compared to SIREN.”
We use RealNVP as a way to model the posterior approximation q(w_INR). SIREN is the architecture of the hypernetwork, so we can’t really compare the two.
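For intuition, a minimal sketch of one affine coupling block of the kind RealNVP stacks to model a flexible q(w_INR) over the flattened hypernetwork weights; the hidden width, tanh-bounded scales and splitting scheme are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP coupling layer over a flat parameter vector (e.g. the INR weights w_INR)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(nn.Linear(self.d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.d)))

    def forward(self, z):
        z1, z2 = z[..., :self.d], z[..., self.d:]
        s, t = self.net(z1).chunk(2, dim=-1)
        s = torch.tanh(s)                # keep the scales bounded for numerical stability
        y2 = z2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)          # contribution to log q(w_INR) via change of variables
        return torch.cat([z1, y2], dim=-1), log_det
```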
“The methods are benchmarked against a last-layer Laplace approximation, however a block-diagonal Kronecker-factorized (KFAC) Laplace approximation should also be considered since this corresponds better to a full network variant in the linearized Laplace family of approximations.”
Thank you for the suggestion. We agree that this would be a very useful experiment. However, we find it very difficult to complete any meaningful set of experiments on KFAC, given the time constraint of the discussion period. We can try featuring a result in the final version of the paper.
For our UCI regression experiments we added another strong baseline. For the small MLP network we use, we were able to compute the full GGN matrix in the Laplace approximation of the main network. We present the result in the form of a figure (https://freeimage.host/i/Jnn6z6g).
“Both theoretical and experimental runtimes and memory requirements are not discussed. It is unclear if the proposed models take significantly more time for training or inference, since in theory a forward pass of the hypernetwork is required for every "main" network weight.”
Regarding computational time, the cost of our method can be decomposed as follows:
Time of our method = hypernetwork training/evaluation (1) + approximate inference (2)
According to the table below, (1) is in practice ~1.2x slower than vanilla network training. As for (2), although approximate inference methods are expensive, in our method they are applied in the low-dimensional INR space, so in general they take less time to evaluate.
Time experiments (Resnet-18 on cifar100, Batch size=64)
Our method (main network + INR hypernetwork)
Forward 0.0069 ± 0.0001 (sec)
Backward 0.0145 ± 0.0084 (sec)
Vanilla Network
Forward 0.00463 ± 0.00013 (sec)
Backward 0.01115 ± 0.00015 (sec)
Our method (main network + fixed ξ perturbations without evaluating INR)
Forward 0.004585 ± 0.00028 (sec)
Backward 0.00901 ± 0.00127 (sec)
As for the overhead in terms of learnable parameters, it is #W_{inr} (the total number of hypernetwork parameters) + #AI_{inr} (the number of approximate inference parameters applied on the INR space), which, as we mention in the main paper, is in fact much less than #AI_{W} (the number of approximate inference parameters applied on the full set of main network weights). Performance-wise, our method is still competitive w.r.t. methods like ensembles of D networks, which are at best D times slower than the vanilla network.
Thus, we believe that our method could be applied to ImageNet. Furthermore, because the main overhead of our method is the hypernetwork evaluation, we investigated the following alternative training scheme to further improve our method in terms of time: instead of training the main network weights W and W_INR together, we update the W_INR parameters only every 10 epochs of the main network training. This significantly reduces the computational overhead of our method, and we hypothesize it can scale to ImageNet-sized models and datasets.
Training of ResNet18 CIFAR100
Full Training Accuracy: 69.01 LL: -2.32 Error: 0.3099 Brier: 0.5181 ECE: 0.222
Alternative Accuracy: 68.59 LL: -2.38 Error: 0.3141 Brier: 0.5211 ECE: 0.224
Updating the W_INR parameters every 10 epochs of the main network training has only a minor effect on performance.
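A minimal sketch of the alternating schedule described above; the function and argument names are hypothetical, and the exact scheduling used by the authors may differ.

```python
import torch
import torch.nn.functional as F

def train_alternating(net, inr, coords, loader, epochs, update_inr_every=10, lr=1e-3):
    """Update the main network every epoch; update W_INR only every `update_inr_every` epochs."""
    opt_net = torch.optim.Adam(net.parameters(), lr=lr)
    opt_inr = torch.optim.Adam(inr.parameters(), lr=lr)
    for epoch in range(epochs):
        train_inr = (epoch % update_inr_every == 0)
        for x, y in loader:
            xi = inr(coords) if train_inr else inr(coords).detach()  # skip INR gradients otherwise
            loss = F.cross_entropy(net(x, xi), y)
            opt_net.zero_grad()
            opt_inr.zero_grad()
            loss.backward()
            opt_net.step()
            if train_inr:
                opt_inr.step()
```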
“Can you further evaluate the effect of INR network size on model performance and quality of the uncertainty estimates? Figure 1 already does this for CIFAR-10 but it would be interesting to see for MLPs on UCI and UCI-gap datasets, with INR-Laplace and INR-RealNVP, and perhaps on wider sets of network sizes. Would also be helpful to include error bars for these figures.”
We added an ablation w.r.t. INR size following the UCI regression setting of our method. We compare 4 different versions of the INR hypernetwork with an increasing number of parameters, namely BIG = 2500, MED = 625, SMALL = 75 and XSMALL = 10 parameters.
(Please see Figure in link: https://freeimage.host/i/JnnsF8G)
From the experiments we can observe that there is a limit to how far one can scale the INR hypernetwork and still gain performance. Individual characteristics (main network size, dataset size, dataset dimensionality) play a significant role in the appropriate INR size. As for RealNVP, it was difficult to find a setting where the effect of INR size could be isolated (different RealNVP sizes, different training and VI hyperparameters, etc.).
Follow-up experiment: an ablation w.r.t. INR size following the UCI regression setting, using SWAG approximate inference.
(Please see Figure in link: https://freeimage.host/i/uci-swag.JnYqx3v)
Thank you for your comprehensive response to my initial review.
We use linearized Laplace with a GGN approximation to the Hessian. Eq. (4) and (5) correspond to the more general case, which is nevertheless consistent with the rest of the model in theory.
Thank you for clarifying this. In that case, I think you could perhaps focus the main text on Laplace-GGN and, if you wish, include a derivation with full-Hessian Laplace in the appendix, so as to avoid potential confusion here.
We use RealNVP as a way to model [...] . SIREN is the architecture of the hypernetwork, so we can’t really compare the two.
You are right, I apologize for the mixup here. What I was trying to say is that the "activation" in RealNVP corresponds to a series of invertible transformations which I believe would have difficulty recovering a perturbation map with high-frequency changes in magnitude when you compare to the SIREN-based INR-Laplace and INR-SWAG. (More on this later.)
We find it very difficult to complete any meaningful set of experiments on KFAC, given the time constraint of the discussion period. We can try featuring a result in the final version of the paper.
Thank you for taking the time to evaluate Laplace-GGN on the UCI datasets. I am a tiny bit surprised that full GGN is underperforming on some of the UCI-gap datasets given that this was the configuration suggested by Foong et al. In any case, I agree that you should consider including Laplace-KFAC as a baseline in the final version.
According to the table below [hyper-network training/evaluation] is in practice ~1.2 slower [...].
Thank you for these numbers. You also say "because [of the] the small dimensional INR space, in general, it takes less time to evaluate." Would be nice to also provide runtimes for inference. If I understood correctly, a forward pass of the hyper-network is required for every main network weight. Do you batch the weight coordinates when computing perturbations? Do you make a single pass with all coordinates or is this overly expensive?
We find that sine/periodic activations [...] slightly outperform hyper-networks with ReLU activations.
We compare 4 different versions of INR hypernetworks with an increasing number of parameters [...]
In the multiplicative case, ∇ξ depends on W [...], we argue that [...] W [...] can pass valuable information to the hyper-network weights.
I think I might have a clearer picture now with the additional experiments on hyper-network size and ReLU and, more importantly, thanks to your comment on gradient information and multiplicative perturbation.
Do you think that maybe the main network weights in linear layers re-arrange themselves in each layer to benefit from structural consistency? For example, that weights with low perturbation variance tend to one side of the weight vector and those with high variance to the other? This could be facilitated, as you say, by the multiplicative nature of the perturbation and would also explain why ReLU hyper-networks are also successful.
I realize that the hyper-network inputs are high dimensional (in particular for the CNN kernels), but an experiment that could demonstrate this would be to observe perturbation variance as a function of the input weight coordinate by taking 1D slices or 2D slices of the input space and seeing if this is highly jagged or is somewhat smooth in nature.
“Thank you for clarifying this. In that case, I think you could perhaps focus the main text on Laplace-GGN and, if you wish, include a derivation with full-Hessian Laplace in the appendix, so as to avoid potential confusion here.”
We understand that introducing the Laplace-GGN model could be somewhat confusing to the reader at first glance. We will gladly add the full-Hessian Laplace derivation in the appendix.
“Thank you for taking the time to evaluate Laplace-GGN on the UCI datasets. I am a tiny bit surprised that full GGN is underperforming on some of the UCI-gap datasets given that this was the configuration suggested by Foong et al. In any case, I agree that you should consider including Laplace-KFAC as a baseline in the final version.”
We will certainly add the KFAC-Laplace for regression in the final version.
“Thank you for these numbers. You also say "because [of the] the small dimensional INR space, in general, it takes less time to evaluate." Would be nice to also provide runtimes for inference. If I understood correctly, a forward pass of the hyper-network is required for every main network weight. Do you batch the weight coordinates when computing perturbations? Do you make a single pass with all coordinates or is this overly expensive?”
Yes, a forward pass of the hypernetwork is required for every main network weight. In fact, we wrote down the algorithm for both training and inference of our method:
Algorithm 1: Training
Inputs: I (indices of main network weights), Net (main network), INR (INR hypernetwork), Dataset
for number of epochs do
    for x, y in Dataset do
        ξ = INR(I)
        y* = Net(x, ξ)
        loss = L(y, y*)
        update INR w.r.t. loss
        update Net w.r.t. loss
    end
end
Algorithm 2: Inference
Inputs: I (indices of main network weights), Net (main network), INR (INR hypernetwork), Test_set,
Ap-Inf (approximate inference method*), MC_samples (number of Monte Carlo samples)
for x in Test_set do
    for j = 1 .. MC_samples do
        ξ_j = Ap-Inf(INR, I)
        y*_j = Net(x, ξ_j)
    end
    calculate y* statistics
end
(* In this setting a post-training, Monte Carlo based approximate inference method is implied.)
As for the weight coordinates, in practice these values are batched and computed separately for each layer (i.e., for the i-th layer the indices/input-coordinate positions have shape [#W, I_dims], where #W is the total number of main network parameters in the i-th layer and I_dims is the dimensionality of the indices; e.g., for a convolutional main layer I_dims = 5: 4 positions for the kernel plus 1 dimension acting as the layerwise position).
In practice, passing all the indices through the hypernetwork at once is more expensive than our approach. Furthermore, because the INR is shared between all weights, we could run the computations per layer in parallel, as they do not depend on each other.
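As an illustration of the per-layer batching described above, a minimal sketch that builds the [#W, 5] coordinate batch for one convolutional layer (4 normalized kernel coordinates plus one normalized layerwise position); the helper name and normalization details are assumptions, not the authors' code.

```python
import torch

def conv_layer_coordinates(layer_idx, out_c, in_c, kh, kw, num_layers):
    """Return a [#W, 5] batch of coordinates for one conv layer (I_dims = 5)."""
    def norm(n):  # normalize an axis with n positions to [-1, 1]
        return torch.linspace(-1.0, 1.0, n) if n > 1 else torch.zeros(1)
    grid = torch.meshgrid(norm(out_c), norm(in_c), norm(kh), norm(kw), indexing="ij")
    coords = torch.stack(grid, dim=-1).reshape(-1, 4)               # [#W, 4] kernel positions
    layer_pos = norm(num_layers)[layer_idx].item()                  # layerwise position
    return torch.cat([coords, torch.full((coords.shape[0], 1), layer_pos)], dim=-1)

# One batched INR forward pass per layer, e.g. for a 128x64x3x3 kernel in layer 3 of 20:
# xi = inr(conv_layer_coordinates(3, 128, 64, 3, 3, num_layers=20)).reshape(128, 64, 3, 3)
```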
We included runtimes for inference:
Inference time for ResNet-18 combined with different stochastic subspaces and different approximate inference methods (time is measured in seconds for a batch of 10 CIFAR images). We include an inference-time comparison for the same approximate inference method across the different ‘subspaces’. Specifically, we compare our INR low-dimensional space with: Rank-1 (Dusenberry et al., 2020), the Wasserstein subnetwork (Daxberger et al., 2021), and partially stochastic ResNets (Sharma et al., 2023); quantitative results are provided at the end of this thread.
For the Laplace approximate inference method:
Subnetwork: 0.4211 ± 0.024
INR: 0.5145 ± 0.008
Rank1: 0.5545 ± 0.019
Partially: 0.4994 ± 0.011
All: 0.4989 ± 0.003
For the SWAG approximate inference method:
Subnetwork: 0.2917 ± 0.0164
INR: 0.1149 ± 0.0057
Rank1: 0.2837 ± 0.0027
Partially: 0.2813 ± 0.0127
All: 0.3235 ± 0.0315
(Inference time for ResNet-50 on CIFAR datasets, cf. Figure 4 of the main paper):
Deep Ensembles 0.9014 ± 0.0273
Dropout 0.0372 ± 0.0066
LL Laplace 2.0030 ± 0.0073
INR SWAG 0.6393 ± 0.0184
INR RealNVP 0.2045 ± 0.0043
For the Deep Ensembles method the obtained values include additional overhead, such as loading the ensemble members, as is common practice. Furthermore, the linearized LL Laplace is much slower than the other methods, as computing the Jacobian for the ResNet-50 reaches the limits of our computational budget at this time.
We tried to measure the quality of the proposed subspaces in terms of predictive uncertainty. Specifically, we compare our INR low-dimensional space with: Rank-1 (Dusenberry et al., 2020), the Wasserstein subnetwork (Daxberger et al., 2021), and partially stochastic ResNets (Sharma et al., 2023). We trained each method combined with a ResNet-18 for 100 epochs on both CIFAR-10 and CIFAR-100, while keeping the approximate inference method fixed across all low-dimensional spaces.
- Sharma, Mrinank, et al. "Do Bayesian Neural Networks Need To Be Fully Stochastic?." International Conference on Artificial Intelligence and Statistics. PMLR, 2023.
RANK1
SWAG - CIFAR10 In-Dist Test data Accuracy: 91.78 LL: -0.4187 Error: 0.082 Brier: 0.1343 ECE: 0.0522
SWAG - CIFAR10 Corrupted Test data Accuracy: 77.80 LL: -1.2537 Error: 0.222 Brier: 0.3596 ECE: 0.1469
Laplace - CIFAR10 In-Dist Test data Accuracy: 91.01 LL: -1.5630 Error: 0.090 Brier: 0.6872 ECE: 0.6871
Laplace - CIFAR10 Corrupted Test data Accuracy: 78.07 LL: -1.7068 Error: 0.220 Brier: 0.7396 ECE: 0.5737
SWAG - CIFAR100 In-Dist Test data Accuracy: 65.84 LL: -2.29 Error: 0.34 Brier: 0.55 ECE: 0.228
SWAG - CIFAR100 Corrupted Test data Accuracy: 42.80 LL: -4.77 Error: 0.57 Brier: 0.92 ECE: 0.398
Laplace - CIFAR100 In-Dist Test data Accuracy: 69.00 LL: -4.01 Error: 0.31 Brier: 0.9714 ECE: 0.668
Laplace - CIFAR100 Corrupted Test data Accuracy: 42.00 LL: -4.25 Error: 0.58 Brier: 0.979 ECE: 0.401
INR
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.17 LL: -0.384 Error: 0.078 Brier: 0.1296 ECE: 0.0498
SWAG - CIFAR10 Corrupted Test data Accuracy: 78.60 LL: -1.164 Error: 0.214 Brier: 0.3566 ECE: 0.1447
Laplace - CIFAR10 In-Dist Test data Accuracy: 89.0 LL: -1.563 Error: 0.110 Brier: 0.6869 ECE: 0.6647
Laplace - CIFAR10 Corrupted Test data Accuracy: 81.0 LL: -1.660 Error: 0.190 Brier: 0.7156 ECE: 0.5890
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.12 LL: -2.094 Error: 0.308 Brier: 0.5006 ECE: 0.2046
SWAG - CIFAR100 Corrupted Test data Accuracy: 46.5 LL: -4.1878 Error: 0.535 Brier: 0.8418 ECE: 0.3640
Laplace - CIFAR100 In-Dist Test data Accuracy: 70.0 LL: -3.9172 Error: 0.300 Brier: 0.967 ECE: 0.6747
Laplace - CIFAR100 Corrupted Test data Accuracy: 42.0 LL: -4.199 Error: 0.58 Brier: 0.9770 ECE: 0.396
SUBNETWORK
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.54 LL: -0.4251 Error: 0.074 Brier: 0.1255 ECE: 0.0491
SWAG - CIFAR10 Corrupted Test data Accuracy: 76.90 LL: -1.4599 Error: 0.231 Brier: 0.3845 ECE: 0.1732
Laplace - CIFAR10 In-Dist Test data Accuracy: 91.0 LL: -1.551723 Error: 0.090 Brier: 0.682 ECE: 0.6823
Laplace - CIFAR10 Corrupted Test data Accuracy: 81.0 LL: -1.650211 Error: 0.190 Brier: 0.7134 ECE: 0.5886
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.86 LL: -2.1430 Error: 0.30 Brier: 0.49 ECE: 0.207
SWAG - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -3.9721 Error: 0.51 Brier: 0.82 ECE: 0.3483
Laplace - CIFAR100 In-Dist Test data Accuracy: 68.0 LL: -3.9505 Error: 0.32 Brier: 0.9682 ECE: 0.655
Laplace - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -4.1309 Error: 0.51 Brier: 0.974 ECE: 0.466
PARTIALLY STOCHASTIC
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.54 LL: -0.4251 Error: 0.074 Brier: 0.1255 ECE: 0.0491
SWAG - CIFAR10 Corrupted Test data Accuracy: 76.92 LL: -1.4499 Error: 0.201 Brier: 0.3806 ECE: 0.1730
Laplace - CIFAR10 In-Dist Test data Accuracy: 90.8 LL: -1.561 Error: 0.091 Brier: 0.68 ECE: 0.702
Laplace - CIFAR10 Corrupted Test data Accuracy: 80.0 LL: -1.67 Error: 0.21 Brier: 0.72 ECE: 0.59
(thread continued)
“Do you think that maybe the main network weights in linear layers re-arrange themselves in each layer to benefit from structural consistency? For example, that weights with low pertubation variance tend to one side of the weight vector and those with high variance to the other? This could be facilitated, as you say, by the multiplicative nature of the perturbation and would also explain why ReLU hyper-networks are also successful. I realize that the hyper-network inputs are high dimensional (in particular for the CNN kernels), but an experiment that could demonstrate this would be to observe perturbation variance as a function of the input weight coordinate by taking 1D slices or 2D slices of the input space and seeing if this is highly jagged or is somewhat smooth in nature.”
We plotted the “perturbation variance” (the values of ξ) as a function of input weight coordinates. Specifically, for a ResNet-18 trained on CIFAR, we plotted the flattened values at each specific kernel position across channels (a channel slice) for two different convolutional layers. Both types of hypernetworks produce well-structured perturbation functions. The ξ values produced by the sinusoidal hypernetwork exhibit a somewhat oscillatory behavior w.r.t. channel position, which translates to higher-frequency content. As for the ReLU perturbations, while they have some high frequencies due to the non-smoothness of the ReLU activation, the overall signal has a smooth structure, less complicated than the sinusoidal one in some cases. Unsurprisingly, the ReLU result consists of practically piecewise-linear components. We believe this is what explains the marginally better performance of SIREN hypernetworks.
The plots can be examined here:
https://freeimage.host/i/JnD7Pst
and:
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.85 LL: -2.1430 Error: 0.3015 Brier: 0.4997 ECE: 0.2078
SWAG - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -3.9721 Error: 0.51 Brier: 0.8218 ECE: 0.3483
Laplace - CIFAR100 In-Dist Test data Accuracy: 66.0 LL: -3.9903 Error: 0.34 Brier: 0.978 ECE: 0.6377
Laplace - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -4.1815 Error: 0.51 Brier: 0.989 ECE: 0.4709
*For SWAG and linearized Laplace with GGN, in order to be able to run across low-dimensional spaces we choose the covariance to have a diagonal structure.
As for the results, we believe that in both datasets there is at the very least a trend in favor of both proposed INR-x methods. The results validate to a considerable degree the premise of the proposed methods: Instead of choosing a subset or subnet following the rationale of the corresponding methods, the INR produces "ξ" outputs that endow the full network with the desirable stochasticity, while keeping the dimensionality of the random process that we want to do inference upon at a low level.
[...] INR is shared between all weights, we could run the computations per layer in parallel as they don't depend on each other.
This is an interesting insight which I think the manuscript would benefit from if made explicit.
We plotted the “perturbation variance” (the values of ξ) as a function of input weight coordinates. Specifically for ResNet-18 trained on CIFAR we plotted the flattened values for each specific kernel position across channels (channel slice) for two different convolutional layers.
Thank you for these figures. I was looking for the variance of ξ, i.e. Var(ξ) as a function of input weight coordinates, not ξ itself. I was also more interested in a linear layer versus a convolution. However, these figures are already quite valuable since they are slicing across the channel axis. If we assume that the variance has similar behavior, this would indicate that kernels with high variance are grouped together to benefit from the smooth nature of the INR function.
We tried to measure the quality of proposed subspaces in terms of predictive uncertainty. [...]
I think these results would be of interest to reviewer tKmG. I tend to agree with their assessment that the manuscript could benefit from experiments which differentiate it from prior work on hypernetwork-based methods, and these results seem to be heading in this direction.
“Do you have an intuition for why multiplicative perturbations are so successful compared to additive perturbations of the weights? Perhaps uncertainty is improved when perturbations are scaled up by the magnitude of the weight?”
We believe that the benefits of multiplicative perturbations can be traced to the training procedure of the INR hypernetwork. Specifically, consider the following example of a single linear layer with inputs x, outputs y, weights W and perturbations ξ.
For the multiplicative case we have y = xWξ, and for the additive case y = xW + xξ. The back-propagated gradients w.r.t. the perturbations ξ, ∇ξ (which are responsible for the hypernetwork training), are (∂L/∂y)·x·W and (∂L/∂y)·x respectively. Because in the multiplicative structure ∇ξ depends on W, and W is responsible for fitting the data, valuable information can be passed to the hypernetwork weights in the multiplicative case, leading to a significant increase in overall performance.
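A quick autograd sanity check of this argument on a toy scalar example (not from the paper): for y = x·W·ξ the gradient ∂L/∂ξ carries the extra factor W, while for y = x·W + x·ξ it does not.

```python
import torch

x, w = torch.tensor(2.0), torch.tensor(3.0)

for mode in ["multiplicative", "additive"]:
    xi = torch.tensor(1.0, requires_grad=True)
    y = x * w * xi if mode == "multiplicative" else x * w + x * xi
    loss = 0.5 * (y - 1.0) ** 2
    loss.backward()
    # multiplicative: dL/dxi = (y - 1) * x * w = 30 ; additive: dL/dxi = (y - 1) * x = 14
    print(mode, xi.grad.item())
```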
“Is there a reason for INR-SWAG appearing in some experiments and INR-Laplace in others?”
In our work we tried to validate our INR “space” combined with a variety of approximate inference methods. In both classification and regression experiments, all three posterior approximations combined with our method yield good, calibrated results. There are specific reasons why some methods do not appear in some figures. For example, in Fig. 1 we also wanted to measure diversity, so we needed a Monte Carlo based method, and to save space we chose SWAG. In Fig. 2 we wanted to highlight the benefits of the Laplace approximation in particular. In general we followed the pattern of using Laplace/RealNVP for regression and SWAG/RealNVP for classification. We will gladly include all three methods in all experiments in the appendix.
“The derivation of the Laplace approximation is confusing: In Equations (4) and (5) a Laplace approximation with full Hessian is derived. While a full Hessian for the hypernetwork weights may be tractable for small enough networks, this approximation does not correspond to the closed form posterior of Equation (7). This closed form corresponds to linearized Laplace inference with generalized Gauss-Newton (GGN) approximation to the Hessian. It is unclear if the model evaluated in the experiments uses a full Hessian or the GGN approximation with closed form. [... Question 1:] Does the INR-Laplace model in the experiments employ a full Hessian or a GGN approximation? Do you make any additional approximations (e.g. Kronecker-factorization) ?”
Indeed, in practice we do not use the full Hessian here; we use linearized Laplace with the GGN approximation to the Hessian. Eq. 4 and 5 correspond to the more general case, which is nevertheless consistent with the rest of the model in theory. We do not use KFAC or any other additional approximation of the Hessian.
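For reference, the GGN construction over a small parameter vector (standing in for the flattened W_INR) can be sketched as follows; the function f(x, theta), the cross-entropy likelihood and all names here are illustrative assumptions rather than our exact implementation:

import torch

def ggn_laplace_precision(f, theta, xs, prior_precision=1.0):
    """f(x, theta) -> logits for one input; theta is the low-dimensional
    (hypernetwork) parameter vector. Returns the Laplace posterior precision
    prior + GGN, with the GGN built from per-input Jacobians."""
    precision = prior_precision * torch.eye(theta.numel())
    for x in xs:
        # Jacobian of the logits w.r.t. theta: [n_classes, dim(theta)]
        J = torch.autograd.functional.jacobian(lambda t: f(x, t), theta)
        J = J.reshape(J.shape[0], -1)
        p = torch.softmax(f(x, theta), dim=-1)           # predictive probabilities
        H_out = torch.diag(p) - torch.outer(p, p)        # Hessian of CE w.r.t. logits
        precision = precision + J.T @ H_out @ J
    return precision                                     # posterior covariance = its inverse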
“Following the same experimental procedure we plotted alongside the mean values of ξ also their ± variance [...]”
These new figures are interesting but slightly unexpected. I was expecting to see larger changes in the variance as a function of weight coordinates but it seems to be more or less static. I suppose this might be linked to the multiplicative nature of the perturbations since the resulting main network weight variance then mostly depends on the magnitudes of the main network weights and perturbation means.
I think the manuscript would benefit from an analysis in the dynamics of the weight perturbations maybe including these new figures and performing some more experiments which go in this direction.
“These new figures are interesting but slightly unexpected. I was expecting to see larger changes in the variance as a function of weight coordinates ... ”
We decoupled the variance values from the mean and plotted them again as a function of channel weight coordinates. The mean values of ξ have larger fluctuations, which in our plot makes the variance look, as you said, more or less static. By isolating the variance we can observe that it also has meaningful structural properties. (Please consider that the variance values were computed using diagonal SWAG, a method with less expressivity in terms of structured uncertainty estimation.)
https://freeimage.host/i/JoukcF4
https://freeimage.host/i/JoukMN9
“I think the manuscript would benefit from an analysis in the dynamics of the weight perturbations maybe including these new figures and performing some more experiments which go in this direction. ”
Your recommendation to analyze the dynamics of the perturbations is excellent for explaining the empirical performance of our method. As per your comments, we will certainly extend this analysis and add these new figures to the final manuscript.
“This is an interesting insight which I think the manuscript would benefit from if made explicit.”
We agree. We will rework this part of the manuscript to make this point explicit.
“Thank you for these figures. I was looking for the variance, i.e. as a function of input weight coordinates, not ξ itself. I was also more interested in a linear layer versus convolution. However, these figures are already quite valuable since they are slicing across the channel axis. If we assume that the variance has similar behavior, this would indicate that kernels with high variance are grouped together to benefit from the smooth nature of the INR function.”
Following the same experimental procedure, we plotted, alongside the mean values of ξ, also their ± variance (as computed by the SWAG-diagonal approximate inference method), again as a function of channel coordinates. We can observe that the variance has the same structural properties as the mean values of ξ. Thus we believe it makes sense for the main network convolutional kernels to take advantage of this structure.
The authors propose a hypernetwork-based inference mechanism for BNNs, focusing on two key aspects: first, they want the hypernetwork to be compact and reusable across the main network; second, they perform approximate Bayesian inference over the compact hypernetwork. A key part of this work is the fact that the authors also keep around deterministic weights which are modulated by the stochastic nuisances sampled from the hypernetwork, allowing them to maintain good performance.
The method appears to be competitive in experiments.
Strengths
The authors propose a relatively narrow idea: making a small hypernetwork probabilistic and sharing it across the network.
What I like about this paper is the good exploration over ways to do this: the authors both try Laplace approximations, and stochastic weight averaging as pragmatic means to parametrize an approximate posterior over the hypernet. They also propose to use normalizing flows to shape the outputs of this hypernetwork to squeeze a bit more performance out of it.
Empirically, the work shows good performance.
The paper is overall well written and easy to follow and understand.
I also want to call out that the authors are good scholars, the breadth of relevant work cited here is comprehensive and commendable.
Weaknesses
The elephant in the room with this paper is that it is very narrow in its contribution.
Tiny shared hypernetworks parametrizing individual implicit weight outputs via coordinate systems go as far back as the cited paper Karaletsos et al. 2018. Bayesian inference over such hypernetworks has likewise been performed before, in the shape of GPs and BNNs, which the authors both cite in their related work. Blending deterministic and stochastic weights coming from hypernetworks has also been done, again cited absolutely correctly by the authors in numerous papers.
The contribution I see here is not the novelty of any of the ideas then, but the specific engineering combinations tweaked to obtain good performance. For example, executing on practical pieces like Laplace for the hypernetwork or SWA and pairing that object with normalizing flows is good execution that probably helps with performance compared to previously mentioned works.
Questions
Given the relative dearth of new ideas in this work, I would like to identify its strengths in execution and overall quality of the work.
Could the authors argue that their specific combination of techniques could be applied to a larger neural network, i.e. an imagenet model?
Could the authors share a bit more about scalability and performance/memory constraints/ number of forward passes needed to obtain good performance?
I would enjoy seeing more evidence that highlights the merits of the execution here, given the relatively modest technical contributions.
As the paper stands, I find the contributions somewhat thin. However, I would enjoy seeing this line of work on hypernetworks for BNNs paired with a very strong empirical result in the style I ask for above, to enrich the community, and I hope the authors can find something more to show that differentiates this work empirically from e.g. Dusenberry et al., which can also be interpreted as a specific hierarchical weight model (with layer-specific structure).
Again, I commend the authors for their quality scholarship and would enjoy more sound arguments to place this work into the trajectory of papers they have mentioned as a primarily empirical contribution.
“Empirically, the work shows good performance. The paper is overall well written and easy to follow and understand. [..] I also want to call out that the authors are good scholars, the breadth of relevant work cited here is comprehensive and commendable. [...] Again, I commend the authors for their quality scholarship and would enjoy more sound arguments to place this work into the trajectory of papers they have mentioned as a primarily empirical contribution.”
We are happy that you appreciate our effort. Thank you for your comment!
“Could the authors argue that their specific combination of techniques could be applied to a larger neural network, i.e. an imagenet model? [..] Could the authors share a bit more about scalability and performance/memory constraints/ number of forward passes needed to obtain good performance?”
Regarding computational time, our method's cost can be decomposed as follows:
Time of our method = hypernetwork training/evaluation (1) + approximate inference (2)
According to the table below, (1) is in practice ~1.2× slower than vanilla network training. As for (2), although approximate inference methods are expensive, in our method they are applied in the low-dimensional INR space, so in general they take less time to evaluate.
Time experiments (ResNet-18 on CIFAR-100, batch size = 64)
Our method (main network + INR hypernetwork)
Forward 0.0069 ± 0.0001 (sec)
Backward 0.0145 ± 0.0084 (sec)
Vanilla Network
Forward 0.00463 ± 0.00013 (sec)
Backward 0.01115 ± 0.00015 (sec)
Our method (main network + fixed ξ perturbations without evaluating INR)
Forward 0.004585 ± 0.00028 (sec)
Backward 0.00901 ± 0.00127 (sec)
The overhead in terms of learnable parameters is #W_{INR} (the total number of hypernetwork parameters) plus #AI_{INR} (the number of approximate-inference parameters applied on the INR space), which, as we mention in the main paper, is in fact much smaller than #AI_{W} (the number of approximate-inference parameters applied on the full set of main network weights). Performance-wise, our method is still competitive w.r.t. methods like ensembles of D networks, which are at best D times slower than the vanilla network.
Thus, we believe that our method could be applied to ImageNet. Furthermore, because the main overhead of our method is the hypernetwork evaluation, we investigated the following alternative training scheme to further improve our method in terms of time: instead of training the main network weights W and W_INR together, we update the W_INR parameters only every 10 epochs of the main network training. This significantly reduces the computational overhead of our method, and we hypothesize it can scale to ImageNet-sized models and datasets (a sketch of this schedule is given after the table below).
Training of ResNet-18 on CIFAR-100
Full Training Accuracy: 69.01 LL: -2.32 Error: 0.3099 Brier: 0.5181 ECE: 0.222
Alternative Accuracy: 68.59 LL: -2.38 Error: 0.3141 Brier: 0.5211 ECE: 0.224
Updating the W_INR parameters every 10 epochs of the main network training has only a minor effect on performance.
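A minimal sketch of this alternative schedule (function and variable names are illustrative assumptions, not our actual training code):

import torch

def train_alternating(main_net, inr_net, coords, loader,
                      epochs=100, inr_update_every=10):
    """Update the main network every epoch, but the INR hypernetwork (W_INR)
    only every `inr_update_every` epochs."""
    opt_main = torch.optim.SGD(main_net.parameters(), lr=0.1, momentum=0.9)
    opt_inr = torch.optim.Adam(inr_net.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        update_inr = (epoch % inr_update_every == 0)
        for x, y in loader:
            xi = inr_net(coords)            # perturbations for all main weights
            if not update_inr:
                xi = xi.detach()            # skip the hypernetwork gradient this epoch
            logits = main_net(x, xi)        # main net applies W ⊙ ξ internally
            loss = loss_fn(logits, y)
            opt_main.zero_grad()
            opt_inr.zero_grad()
            loss.backward()
            opt_main.step()
            if update_inr:
                opt_inr.step()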
We tried to measure the quality of the proposed subspaces in terms of predictive uncertainty. Specifically, we compare our INR low-dimensional space with: Rank-1 (Dusenberry et al. 2020); the Wasserstein subnetwork (Daxberger et al. 2021); and partially stochastic ResNets (Sharma et al. 2023). We trained each method combined with a ResNet-18 for 100 epochs on both the CIFAR-10 and CIFAR-100 datasets, while keeping the approximate inference method the same across all low-dimensional spaces.
- Sharma, Mrinank, et al. "Do Bayesian Neural Networks Need To Be Fully Stochastic?." International Conference on Artificial Intelligence and Statistics. PMLR, 2023.
RANK1
SWAG - CIFAR10 In-Dist Test data Accuracy: 91.78 LL: -0.4187 Error: 0.082 Brier: 0.1343 ECE: 0.0522
SWAG - CIFAR10 Corrupted Test data Accuracy: 77.80 LL: -1.2537 Error: 0.222 Brier: 0.3596 ECE: 0.1469
Laplace - CIFAR10 In-Dist Test data Accuracy: 91.01 LL: -1.5630 Error: 0.090 Brier: 0.6872 ECE: 0.6871
Laplace - CIFAR10 Corrupted Test data Accuracy: 78.07 LL: -1.7068 Error: 0.220 Brier: 0.7396 ECE: 0.5737
SWAG - CIFAR100 In-Dist Test data Accuracy: 65.84 LL: -2.29 Error: 0.34 Brier: 0.55 ECE: 0.228
SWAG - CIFAR100 Corrupted Test data Accuracy: 42.80 LL: -4.77 Error: 0.57 Brier: 0.92 ECE: 0.398
Laplace - CIFAR100 In-Dist Test data Accuracy: 69.00 LL: -4.01 Error: 0.31 Brier: 0.9714 ECE: 0.668
Laplace - CIFAR100 Corrupted Test data Accuracy: 42.00 LL: -4.25 Error: 0.58 Brier: 0.979 ECE: 0.401
INR
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.17 LL: -0.384 Error: 0.078 Brier: 0.1296 ECE: 0.0498
SWAG - CIFAR10 Corrupted Test data Accuracy: 78.60 LL: -1.164 Error: 0.214 Brier: 0.3566 ECE: 0.1447
Laplace - CIFAR10 In-Dist Test data Accuracy: 89.0 LL: -1.563 Error: 0.110 Brier: 0.6869 ECE: 0.6647
Laplace - CIFAR10 Corrupted Test data Accuracy: 81.0 LL: -1.660 Error: 0.190 Brier: 0.7156 ECE: 0.5890
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.12 LL: -2.094 Error: 0.308 Brier: 0.5006 ECE: 0.2046
SWAG - CIFAR100 Corrupted Test data Accuracy: 46.5 LL: -4.1878 Error: 0.535 Brier: 0.8418 ECE: 0.3640
Laplace - CIFAR100 In-Dist Test data Accuracy: 70.0 LL: -3.9172 Error: 0.300 Brier: 0.967 ECE: 0.6747
Laplace - CIFAR100 Corrupted Test data Accuracy: 42.0 LL: -4.199 Error: 0.58 Brier: 0.9770 ECE: 0.396
SUBNETWORK
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.54 LL: -0.4251 Error: 0.074 Brier: 0.1255 ECE: 0.0491
SWAG - CIFAR10 Corrupted Test data Accuracy: 76.90 LL: -1.4599 Error: 0.231 Brier: 0.3845 ECE: 0.1732
Laplace - CIFAR10 In-Dist Test data Accuracy: 91.0 LL: -1.551723 Error: 0.090 Brier: 0.682 ECE: 0.6823
Laplace - CIFAR10 Corrupted Test data Accuracy: 81.0 LL: -1.650211 Error: 0.190 Brier: 0.7134 ECE: 0.5886
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.86 LL: -2.1430 Error: 0.30 Brier: 0.49 ECE: 0.207
SWAG - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -3.9721 Error: 0.51 Brier: 0.82 ECE: 0.3483
Laplace - CIFAR100 In-Dist Test data Accuracy: 68.0 LL: -3.9505 Error: 0.32 Brier: 0.9682 ECE: 0.655
Laplace - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -4.1309 Error: 0.51 Brier: 0.974 ECE: 0.466
PARTIALLY STOCHASTIC
SWAG - CIFAR10 In-Dist Test data Accuracy: 92.54 LL: -0.4251 Error: 0.074 Brier: 0.1255 ECE: 0.0491
SWAG - CIFAR10 Corrupted Test data Accuracy: 76.92 LL: -1.4499 Error: 0.201 Brier: 0.3806 ECE: 0.1730
Laplace - CIFAR10 In-Dist Test data Accuracy: 90.8 LL: -1.561 Error: 0.091 Brier: 0.68 ECE: 0.702
Laplace - CIFAR10 Corrupted Test data Accuracy: 80.0 LL: -1.67 Error: 0.21 Brier: 0.72 ECE: 0.59
SWAG - CIFAR100 In-Dist Test data Accuracy: 69.85 LL: -2.1430 Error: 0.3015 Brier: 0.4997 ECE: 0.2078
SWAG - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -3.9721 Error: 0.51 Brier: 0.8218 ECE: 0.3483
Laplace - CIFAR100 In-Dist Test data Accuracy: 66.0 LL: -3.9903 Error: 0.34 Brier: 0.978 ECE: 0.6377
Laplace - CIFAR100 Corrupted Test data Accuracy: 49.0 LL: -4.1815 Error: 0.51 Brier: 0.989 ECE: 0.4709
*For SWAG and linearized Laplace with GGN, in order to be able to run across the low-dimensional spaces, we chose the covariance to have a diagonal structure.
As for the results, we believe that in both datasets there is at the very least a trend in favor of both proposed INR-x methods. The results validate to a considerable degree the premise of the proposed methods: Instead of choosing a subset or subnet following the rationale of the corresponding methods, the INR produces "ξ" outputs that endow the full network with the desirable stochasticity, while keeping the dimensionality of the random process that we want to do inference upon at a low level.
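For completeness, the diagonal-covariance SWAG used across all the subspaces above can be sketched as follows (standard diagonal SWAG in the style of Maddox et al. 2019, written for a generic small parameter vector such as the flattened W_INR; not our exact code):

import torch

class DiagSWAG:
    """Collect running first/second moments of a small parameter vector from
    SGD iterates; posterior samples use the resulting diagonal variance."""
    def __init__(self, dim):
        self.n = 0
        self.mean = torch.zeros(dim)
        self.sq_mean = torch.zeros(dim)

    def collect(self, theta):
        self.n += 1
        self.mean += (theta - self.mean) / self.n
        self.sq_mean += (theta ** 2 - self.sq_mean) / self.n

    def sample(self):
        var = torch.clamp(self.sq_mean - self.mean ** 2, min=1e-12)
        return self.mean + var.sqrt() * torch.randn_like(self.mean)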
Dear authors,
I just wanted to let you know I saw and appreciate your response.
Thank you, Your reviewer
We have completed a set of additional experiments, after discussion with the other reviewers. We copy them to you, as you might appreciate the extra empirical results.
The first set of results is an ablation of the hypernetwork activation:
(Reviewer VEKs): “The use of SIREN activation suggests that the hypernetworks need to model high frequency representations of the weight perturbations. The SIREN models presented in [Sitzmann et al., 2020] benefit from consistency and repeating patterns in the signals they are fitting, making interpolation easier. There might be some form of structural consistency in neighboring weight perturbations for CNN kernels, but it seems unlikely for the weights of linear layers. [..] This matter is only very briefly brought up in the discussion of Figures 9 and 10. [..] Have you tried using ReLU hypernetworks? Do these fail to recover high frequency representations of the perturbations compared to SIREN?”
We trained a ResNet-18 on both CIFAR-10 and CIFAR-100 for 100 epochs to evaluate the predictive capabilities of the sinusoidal hypernetwork:
CIFAR10
RELU_MAP Accuracy: 91.11 LL: -0.4891 Error: 0.088 Brier: 0.1484 ECE: 0.05891
SINE MAP Accuracy: 91.70 LL: -0.4449 Error: 0.083 Brier: 0.138 ECE: 0.05401
CIFAR100
RELU_MAP Accuracy: 67.79 LL: -2.544 Error: 0.3221 Brier: 0.537 ECE: 0.23232
SINE MAP Accuracy: 68.49 LL: -2.3990 Error: 0.3151 Brier: 0.527 ECE: 0.2256
We find that sine/periodic activations – the “default” choice in SIREN – slightly outperform a hypernet with ReLU activations. Still, the results are very close, though there is a trend in favor of sine in all benchmarks. The original motivation behind the sine activation is related to modeling high-frequency content, which translates to details in structured signals such as images or video [Sitzmann 2020]. We can, however, turn this on its head, so to speak: in structured signals we care more for low-frequency content, and high frequencies are “good-to-have” content. We can interpret an input semantically if we see its low frequencies, but not necessarily vice versa. For example, image compression will invariably throw away high frequencies first, and the last frequencies to be lost will be the lower ones. Our conjecture is as follows: when using an INR to model perturbations, we are faced with a different situation, corresponding to a different “frequency landscape” (perhaps even different from that of the model weights). In particular, we do not have any reason to privilege lower or higher frequency content in any respect. We “care” for all frequencies, so we need a good way to model high frequencies as well. Perhaps this is the reason the sine activation gives a small edge over ReLU.
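For concreteness, a minimal sketch of the coordinate-based hypernetwork with the two activation choices compared above (layer sizes, coordinate dimensionality, and the w0 frequency scale are illustrative assumptions, not our exact architecture):

import torch
import torch.nn as nn

class INRHypernet(nn.Module):
    """Maps a weight coordinate to a (multiplicative) perturbation ξ."""
    def __init__(self, coord_dim=5, hidden=32, activation="sine", w0=30.0):
        super().__init__()
        self.w0 = w0
        self.activation = activation
        self.l1 = nn.Linear(coord_dim, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.l3 = nn.Linear(hidden, 1)   # one perturbation value per weight

    def act(self, h):
        # SIREN-style sine activation vs. the ReLU variant from the ablation
        return torch.sin(self.w0 * h) if self.activation == "sine" else torch.relu(h)

    def forward(self, coords):           # coords: [#W, coord_dim]
        h = self.act(self.l1(coords))
        h = self.act(self.l2(h))
        return self.l3(h).squeeze(-1)    # ξ: [#W]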
For our UCI regression experiments we added another strong baseline. For the small MLP network we use, we were able to compute the full GGN matrix in the Laplace approximation of the main network. We copy the result in the form of a figure (https://freeimage.host/i/Jnn6z6g).
(Reviewer VEKs): “Can you further evaluate the effect of INR network size on model performance and quality of the uncertainty estimates? Figure 1 already does this for CIFAR-10 but it would be interesting to see for MLPs on UCI and UCI-gap datasets, with INR-Laplace and INR-RealNVP, and perhaps on wider sets of network sizes. Would also be helpful to include error bars for these figures.”
We added an ablation w.r.t. INR size following the UCI regression setting of our method. We compare 4 different versions of INR hypernetworks with an increasing number of parameters each, namely (BIG=2500, MED=625, SMALL=75, XSMALL=10).
(Please see Figure in link: https://freeimage.host/i/JnnsF8G)
From the experiments we can observe that there is a limit to how far one can scale the INR hypernetwork and still gain performance. Individual characteristics (main network size, dataset size, dataset dimensionality) play a significant role in the choice of INR size. As for RealNVP, it was difficult to find a setting where the effect of INR size could be isolated (different RealNVP sizes, different training and VI hyperparameters, etc.).
Follow-up experiment: an ablation w.r.t. INR size in the same UCI regression setting, using SWAG approximate inference.
(Please see Figure in link: https://freeimage.host/i/uci-swag.JnYqx3v)
“Thank you for these numbers. You also say "because [of the] the small dimensional INR space, in general, it takes less time to evaluate." Would be nice to also provide runtimes for inference. If I understood correctly, a forward pass of the hyper-network is required for every main network weight. Do you batch the weight coordinates when computing perturbations? Do you make a single pass with all coordinates or is this overly expensive?”
Yes, a forward pass of the hypernetwork is required for every main network weight; in fact, we outline below the algorithms for both training and inference with our method:
Algorithm 1 Training
Inputs: I (indices of main network weights), Net (main network), INR (INR hypernetwork), Dataset
for each epoch do
    for (x, y) in Dataset do
        ξ = INR(I)
        y* = Net(x, ξ)
        loss = L(y, y*)
        update INR w.r.t. loss
        update Net w.r.t. loss
    end
end
Algorithm 2 Inference
Inputs: I (indices of main network weights), Net (main network), INR (INR hypernetwork), Test_set,
        Ap.-In (approximate inference method*), MC_samples (number of Monte Carlo samples)
for x in Test_set do
    for j = 1 to MC_samples do
        ξ_j = Ap.-In(INR, I)
        y*_j = Net(x, ξ_j)
    end
    calculate y* statistics
end
(In this setting a post training Monte Carlo-based approximate inference method is implied).
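The “calculate y* statistics” step of Algorithm 2 can be sketched as follows for classification (an illustrative helper with assumed names and metric choices, not our evaluation code):

import torch
import torch.nn.functional as F

def mc_predictive_stats(logits_samples, y):
    """logits_samples: [S, B, C] Monte Carlo logits, y: [B] integer labels.
    Averages the per-sample softmax outputs and computes accuracy, NLL and
    the Brier score for one test batch."""
    probs = torch.softmax(logits_samples, dim=-1).mean(dim=0)        # [B, C]
    nll = F.nll_loss(probs.clamp_min(1e-12).log(), y)                # mean negative log-likelihood
    one_hot = F.one_hot(y, probs.shape[-1]).float()
    brier = ((probs - one_hot) ** 2).sum(dim=-1).mean()
    acc = (probs.argmax(dim=-1) == y).float().mean()
    return acc, nll, brier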
As for the weight coordinates, in practice these values are batched and computed separately for each layer (i.e., for the i-th layer the input-coordinate indices have shape [#W, I_dims], where #W is the total number of main network parameters in the i-th layer and I_dims is the dimensionality of the indices; e.g., for a convolutional layer I_dims = 5: 4 positions for the kernel plus 1 dimension acting as the layer-wise position).
In practice, passing all the indices through the hypernetwork at once is more expensive than our approach. Furthermore, because the INR is shared between all weights, we could run the computations per layer in parallel, as they don't depend on each other.
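As an illustration of this per-layer batching, a sketch of how the coordinate tensor for one convolutional layer could be built (the exact indexing scheme is an assumption; only the [#W, I_dims] shape follows the description above):

import torch

def conv_weight_coords(out_ch, in_ch, kh, kw, layer_idx):
    """Returns a [#W, 5] tensor: 4 kernel-position coordinates per weight
    plus 1 dimension holding the layer-wise position."""
    grid = torch.stack(torch.meshgrid(
        torch.arange(out_ch), torch.arange(in_ch),
        torch.arange(kh), torch.arange(kw), indexing="ij"), dim=-1)
    coords = grid.reshape(-1, 4).float()                        # [#W, 4]
    layer_col = torch.full((coords.shape[0], 1), float(layer_idx))
    return torch.cat([coords, layer_col], dim=1)                # [#W, 5]

# e.g. a 3x3 conv with 64 output and 32 input channels as the 7th layer:
# conv_weight_coords(64, 32, 3, 3, layer_idx=7).shape  ->  [18432, 5]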
We included runtimes for inference:
Inference time for ResNet-18 combined with different stochastic subspaces and different approximate inference methods (time measured in seconds, for a batch of 10 CIFAR images). We include an inference-time comparison for the same approximate inference method across different ‘subspaces’. Specifically, we compare our INR low-dimensional space with Rank-1 (Dusenberry et al. 2020), the Wasserstein subnetwork (Daxberger et al. 2021), and partially stochastic ResNets (Sharma et al. 2023); the corresponding quantitative results are provided at the end of this thread.
For the Laplace approximate inference method:
Subnetwork: 0.4211 ± 0.024
INR: 0.5145 ± 0.008
Rank1: 0.5545 ± 0.019
Partially: 0.4994 ± 0.011
All: 0.4989 ± 0.003
For the SWAG approximate inference method:
Subnetwork: 0.2917 ± 0.0164
INR: 0.1149 ± 0.0057
Rank1: 0.2837 ± 0.0027
Partially: 0.2813 ± 0.0127
All: 0.3235 ± 0.0315
(Inference time for ResNet-50 on CIFAR datasets, cf. Figure 4 of the main paper):
Deep Ensembles 0.9014 ± 0.0273
Dropout 0.0372 ± 0.0066
LL Laplace 2.0030 ± 0.0073
INR SWAG 0.6393 ± 0.0184
INR RealNVP 0.2045 ± 0.0043
For the Deep Ensembles method, the reported values include additional overhead such as loading the ensemble members, as is common practice. Furthermore, the linearized LL Laplace is much slower than the other methods, as computing the Jacobian for ResNet-50 reaches the limits of our computational budget at this time.
(Reviewer VEKs): “Do you think that maybe the main network weights in linear layers re-arrange themselves in each layer to benefit from structural consistency? For example, that weights with low pertubation variance tend to one side of the weight vector and those with high variance to the other? This could be facilitated, as you say, by the multiplicative nature of the perturbation and would also explain why ReLU hyper-networks are also successful. I realize that the hyper-network inputs are high dimensional (in particular for the CNN kernels), but an experiment that could demonstrate this would be to observe perturbation variance as a function of the input weight coordinate by taking 1D slices or 2D slices of the input space and seeing if this is highly jagged or is somewhat smooth in nature.”
We plotted the “perturbation variance” (the values of ξ) as a function of input weight coordinates. Specifically, for a ResNet-18 trained on CIFAR we plotted the flattened values for each specific kernel position across channels (channel slice) for two different convolutional layers. Both types of hypernetworks produce well-structured perturbation functions. The ξ values produced by the sinusoidal hypernetwork exhibit a somewhat oscillatory behavior w.r.t. channel position, which translates to higher-frequency content. As for the ReLU perturbations, while having some high-frequency content due to the non-smoothness of the ReLU activation, the overall signal has a smooth structure that is in some cases less complicated than the sinusoidal one. Unsurprisingly, the ReLU result consists of practically piecewise-linear components. We believe this underlies the marginally better performance of the SIREN hypernetworks.
The plots can be examined here:
https://freeimage.host/i/JnD7Pst
and:
The authors propose a hypernetwork-based method for inference in Bayesian neural networks (BNNs), where the hyper-network is compact and reusable, and the model parameters are factored as a function of deterministic and stochastic components.
I thank the authors and the reviewers for the engaging discussions. Overall, the reviewers found the proposed approach to be an interesting, novel combination of existing ideas with relatively good execution. Some issues were raised with regard to clarity, the baselines used, execution times, memory requirements, and ablations. I believe the authors have been comprehensive in their rebuttal, and I think the results and clarifications make this paper worth presenting at ICLR. Hence, I recommend acceptance.
Why not a higher score
Not enough novel methodology.
Why not a lower score
Good combination of ideas with a solid execution.
Accept (poster)