Convolutional Differentiable Logic Gate Networks
Abstract
Reviews and Discussion
This paper proposes a novel computational architecture for differentiable logic gate networks (LGNs), a machine learning methodology that aims to learn networks of logic gates for fast, gate-efficient inference on logic gate-based hardware. Specifically, the authors propose extensions to a prior work on differentiable LGNs inspired by the computer vision literature. They propose (i) logic gate tree convolutions which are layers that convolve trees of logic gates with the input, (ii) logical or pooling (inspired by max pooling) layers that compute the disjunction of the receptive field, and (iii) residual initializations that bias the initial distribution over logic gates to be an identity gate of one of the two inputs. The authors also detail various computational techniques to optimize the efficiency of the new architecture in training, simulation and hardware.
Experimental results on computer vision tasks (CIFAR-10, MNIST, Fashion-MNIST) and extensive comparison to SOTA methods demonstrate that the proposed architecture achieves competitive (if not SOTA) accuracy while being either significantly smaller (in terms of gate count) or faster (in terms of inference speed on FPGAs or CPUs). The authors demonstrate through ablation studies that the proposed components and architectural choices are all integral to the achieved performance. Moreover, the authors provide experimental results for insightful studies on the proposed components, such as why logical or pooling doesn’t result in too much network activation, the induced stability of residual initializations, and the effects of gate distribution discretization.
Strengths
- The proposed methods build upon prior work on differentiable logic gate networks (LGN) and take inspiration from the computer vision literature. The contributions are novel and advance the SOTA for fast and efficient inference of machine learning models. The results are of importance to embedded and real-time machine learning applications.
- Related work is discussed in detail, and experimental results are compared to various prior works.
- The submission is technically sound, with claims supported by experimental results (see weaknesses for point on statistical significance). The authors demonstrate through ablation studies the utility of each of the proposed architectural components, and discuss tradeoffs, strengths and weaknesses for their techniques.
- Methods are detailed enough for reproducing the proposed architecture and results.
- The authors provide substantial insight into methodology choices, making the presentation clear and informative.
Weaknesses
Lack of statistical significance measures for results. (The main prior work on differentiable LGNs (F. Petersen et al.) provides standard deviations in their appendix). However, the authors provide justification for this within the NeurIPS Paper Checklist.
Questions
- If we take the CIFAR-10 results as an example, another method achieved greater accuracy (91%, Hirtzlin et al.) but requires significantly more gates. What is the current limitation on scaling LogicTreeNets to larger gate counts (beyond 111M)? If, for example, a greater accuracy were desired.
- Similar to the study conducted by F. Petersen et al., is there any insight to be gleaned from the learnt distribution over logic gates in logic gate tree convolution kernels?
- Possible typos
  - Lines 344-345: is a forward slash missing?: “transistor count / chip area”
  - Line 351: is an M missing?: “55.8M gates”
Limitations
The authors have adequately addressed limitations.
Thank you very much for your extensive and positive feedback. We greatly appreciate that you find our results important for embedded and real-time machine learning applications. Thank you also for your praise regarding our coverage of related work, our technical soundness, our ablation studies, the discussion of tradeoffs, the detailed descriptions, and the insights into methodology choices that we provide.
Weaknesses:
Lack of statistical significance measures for results. (The main prior work on differentiable LGNs (F. Petersen et al.) provides standard deviations in their appendix). However, the authors provide justification for this within the NeurIPS Paper Checklist.
Thank you for pointing this out. In the following, we present standard deviations (over 10 seeds) for our smaller models.
| CIFAR-10 | Originally reported | With standard deviations |
|---|---|---|
| LogicTreeNet-S | 56.71% | 56.55±0.46% |
| LogicTreeNet-M | 70.65% | 70.72±0.37% |

| MNIST | Originally reported | With standard deviations |
|---|---|---|
| LogicTreeNet-S | 98.06% | 98.27±0.25% |
| LogicTreeNet-L | 99.11% | 99.10±0.10% |
We will extend the standard deviations to the larger models (takes very long) as well as Fashion-MNIST (we prioritized the other experiments for now) for the camera-ready.
Questions:
If we take the CIFAR-10 results as an example, another method achieved greater accuracy (91%, Hirtzlin et al.) but requires significantly more gates. What is the current limitation on scaling LogicTreeNets to larger gate counts (beyond 111M)? If, for example, a greater accuracy were desired.
Our primary limitation lies in computational resources for training, both wrt. VRAM and raw compute. In the future, we hope to train even larger and deeper models.
Similar to the study conducted by F. Petersen et al., is there any insight to be gleaned from the learnt distribution over logic gates in logic gate tree convolution kernels?
Yes, we provide a study on the learnt distribution over logic gates in logic gate tree convolution kernels in the Author Response PDF page. We also compare it to the same model but with Gaussian initializations. It actually illustrates an important point: the majority of gates in the network are residual gates (A).
typos
Thank you for spotting these typos. We have fixed them, and will do another round of proofreading for the camera-ready.
I thank the authors for their response and appreciate their effort in addressing my concerns and questions.
With standard deviations being added to the results, and the additional study on learnt distributions over logic gates, I have updated my score of soundness from 3 to 4.
In this work, the authors propose a convolution-like architecture together with two novel mechanisms for differentiable logic gate networks, making training and inference of such networks feasible on more demanding tasks than logic gate networks have previously handled. More specifically, the authors extend the state-of-the-art capabilities of Differentiable Logic Gate Networks by introducing a convolutional architecture and training approach based on logic gates which, together with the proposed “logical or pooling” and “residual initialization”, achieves higher accuracies on various datasets with fewer gates and significantly reduces inference time. The authors claim that the proposed method unlocks the capabilities of differentiable logic gate networks. They provide a comprehensive review of works that target efficient inference and discuss the benefits of adopting the proposed architecture in applications that require efficient, low-cost inference.
Strengths
The paper is well organized providing a comprehensive review on methods that target efficient inference, discussing in depth the benefits of logic gate networks. The technical details, arguments and experimental results provided in the paper regarding the realization of the method in hardware are convincing.
The authors provide some experimental results in actual hardware demonstrating the effectiveness of the proposed method in terms of efficiency.
The authors provide interesting ablation studies to support experimentally some of the designing decisions.
The motivation is solid and easy to understand. Additionally, the contribution is clear, achieving state-of-the-art performance in the context of larger logic neural networks.
Weaknesses
In many respects the work seems incremental relative to [1], without, however, overcoming or justifying some theoretical gaps identified in that previous work. More specifically, the random connections applied in [1] are adopted in this work without a sound theoretical justification or a proposed alternative.
In addition to that, I find myself referring occasionally to [1] in order to understand some technical details. For example, the differentiable logic gates are presented only schematically in the paper.
From a technical point of view, I find it difficult to conceptualize the computational graph that is built during the training. Do the authors introduce a projection layer, parameterized by vectors z, for each channel of the kernel on the available gates? To this end, is the z vector optimized taking the partial derivative of z on the classification loss? Introducing some details on the optimization process will be useful.
The random selection of inputs on the receptive fields raises concerns regarding the consistency of the training process, with the paper not providing error bars regarding different training runs. Although the authors discuss the reasons for attaching higher probability to the logic gate choice A (or B), they do not discuss how they decide on the value of $z_3$. Such empirical decisions potentially hinder the generalization ability of the proposed method.
Proofreading is required. There are some minor typos in the paper and in the appendix (e.g., Paper-L293 “LGNs differ from the LGNs”, and a missing reference in L6 of the appendix).
[1] Petersen, Felix, et al. "Deep differentiable logic gate networks." Advances in Neural Information Processing Systems 35 (2022): 2006-2018.
Questions
How many additional trainable parameters are introduced during training in contrast to the traditional CNNs and which of them are discarded during the inference?
How do the authors comment on the observation that the proposed method seems to not generalize equally well to smaller architectures?
How do the authors decide on the value of $z_3$?
I would like to stress the consistency of the proposed method in different training runs due to the fact that it is based on the random selection of inputs of the receptive field. Is the robustness of training preserved and what are their experimental observations?
Limitations
Some technical details in the training process are not clear.
Some empirical decisions made in the paper are not well justified, either experimentally or theoretically.
The proposed method leads to lower accuracies in smaller models (e.g., LogicTreeNet-S in Table 1).
Taking into account that they promote logic gate choice A, it would be very interesting if the authors report the per layer distribution of logic gates after the training. This could be interesting also in contrast to Gaussian initialization.
Thank you very much for your helpful and positive feedback, and for appreciating that our "paper is well organized", that it provides a comprehensive review of methods that target efficient inference, and that it discusses the benefits of logic gate networks in depth. Thank you also for appreciating the technical details regarding the realization of the method in hardware, and for expressing that you find the realization convincing. Finally, we appreciate that you find our contribution clear, achieving state-of-the-art performance in the context of larger logic neural networks.
Weaknesses:
[...] random connections applied in [1] are adopted in this work without a sound theoretical justification or a proposed alternative.
We would like to clarify that, while our connections still have some level of randomness, they are substantially more structured. In particular, the convolutional layers are binary trees, so there is a deterministic structure within the convolutional kernels. Moreover, we restrict the input connections to be from only two channels, which further improved performance.
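The structure described in this response can be illustrated with a minimal sketch of one depth-2 logic gate tree kernel. It is an illustration under assumptions, not the authors' implementation: the gates are hard (already discretized) and picked at random rather than learned, and the names `make_tree_kernel` and `apply_tree_kernel`, the four-gate set, and the kernel size are hypothetical.

```python
import random

# Hard (discretized) 2-input gates, used for illustration only.
GATES = {
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
    "xor": lambda a, b: a ^ b,
    "a": lambda a, b: a,  # feedforward / residual gate
}

def make_tree_kernel(num_channels, k, rng):
    """Pick 4 leaf positions (channel, dy, dx), restricted to two channels."""
    c1, c2 = rng.sample(range(num_channels), 2)
    leaves = [(rng.choice((c1, c2)), rng.randrange(k), rng.randrange(k))
              for _ in range(4)]
    # Two leaf-level gates plus one root gate form the binary tree.
    gates = [rng.choice(list(GATES)) for _ in range(3)]
    return leaves, gates

def apply_tree_kernel(image, kernel, k):
    """Slide the same tree over all valid positions (parameter sharing)."""
    leaves, (g_left, g_right, g_root) = kernel
    height, width = len(image[0]), len(image[0][0])
    out = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            v = [image[c][y + dy][x + dx] for c, dy, dx in leaves]
            left = GATES[g_left](v[0], v[1])
            right = GATES[g_right](v[2], v[3])
            row.append(GATES[g_root](left, right))
        out.append(row)
    return out
```

The random choice only affects which inputs each leaf reads; the wiring inside the kernel (two leaf-level gates feeding one root gate) is fixed, which is the deterministic part the response refers to.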
In addition to that, I find myself referring occasionally to [1] in order to understand some technical details. For example, the differentiable logic gates are presented only schematically in the paper.
Thank you for this remark; we will extend the discussion of differentiable logic gates in the camera-ready version, where we have an additional page.
[...] computational graph [...] Do the authors introduce a projection layer, parameterized by vectors z, for each channel of the kernel on the available gates? To this end, is the z vector optimized taking the partial derivative of z on the classification loss? Introducing some details on the optimization process will be useful.
Yes, the vectors $z$ are optimized by taking the derivative of the classification loss with respect to $z$. These vectors are the logits of the probability distributions over choices of logic gates, and can be mapped to those probabilities via softmax (Eq. 1). Accordingly, we do not use a projection layer. We will add clarifications to the camera-ready.
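As a rough sketch of what optimizing the logits means for a single gate, the following pure-Python code computes the softmax-weighted gate output and its derivative with respect to each logit. It assumes that gate index `i` encodes the gate's truth table (so that index 3 is 'A', consistent with this thread) and that each gate is relaxed by multilinear interpolation of its truth table; the function names are hypothetical and no deep learning framework is used.

```python
import math

def gate_truth(i, u, v):
    """Output bit of gate i for hard inputs (u, v); gate 3 is 'A', gate 1 is AND."""
    return (i >> (3 - (2 * u + v))) & 1

def soft_gate(i, a, b):
    """Real-valued relaxation of gate i (multilinear interpolation of its truth table)."""
    return sum(gate_truth(i, u, v) * (a if u else 1 - a) * (b if v else 1 - b)
               for u in (0, 1) for v in (0, 1))

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def diff_gate(z, a, b):
    """Expected gate output under the softmax distribution over the 16 gates."""
    p = softmax(z)
    return sum(p[i] * soft_gate(i, a, b) for i in range(16))

def grad_wrt_logits(z, a, b):
    """Analytic d(output)/dz_j = p_j * (f_j(a, b) - output)."""
    p = softmax(z)
    out = diff_gate(z, a, b)
    return [p[j] * (soft_gate(j, a, b) - out) for j in range(16)]
```

By the chain rule, the gradient of the classification loss with respect to a gate's logits is this vector multiplied by the loss gradient arriving at the gate output, so no separate projection layer is involved.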
The random selection of inputs on the receptive fields raises concerns regarding the consistency of the training process, with the paper not providing error bars regarding different training runs.
I would like to stress the consistency of the proposed method in different training runs due to the fact that it is based on the random selection of inputs of the receptive field. Is the robustness of training preserved and what are their experimental observations?
Thank you for raising this important concern. Yes, consistency between training runs is given, especially for larger models, whereas for the smallest models the stochastic effects can be a bit larger. In the following, we provide means and standard deviations over 10 seeds for our smaller models:
| CIFAR-10 | Originally reported | With standard deviations |
|---|---|---|
| LogicTreeNet-S | 56.71% | 56.55±0.46% |
| LogicTreeNet-M | 70.65% | 70.72±0.37% |

| MNIST | Originally reported | With standard deviations |
|---|---|---|
| LogicTreeNet-S | 98.06% | 98.27±0.25% |
| LogicTreeNet-L | 99.11% | 99.10±0.10% |
| LogicTreeNet-XLD3 | (new) | 99.24±0.06% |
We will extend the standard deviations to the larger models (takes very long) as well as Fashion-MNIST (we prioritized the other experiments for now).
Although the authors discuss the reasons for attaching higher probability to the logic gate choice A (or B), they do not discuss how they decide on the value of $z_3$. Such empirical decisions potentially hinder the generalization ability of the proposed method.
Question: How do the authors decide on the value of $z_3$?
The chosen value of $z_3$ is the one that leads to 90% probability being assigned to the gate choice 3 ('A'). We have clarified this in the revision. Also, we provide a code sketch for an explicit computation below.
>>> import torch
>>> z = torch.zeros(16)
>>> z[3] = 4.905
>>> torch.softmax(z, dim=0)
tensor([0.0067, 0.0067, 0.0067, 0.9000, 0.0067, 0.0067, 0.0067, 0.0067,
        0.0067, 0.0067, 0.0067, 0.0067, 0.0067, 0.0067, 0.0067, 0.0067])
Moreover, to further address your concern, we performed an additional ablation study, where we vary $z_3$ between 1.5 and 7.5, which we provide in the Author Response PDF page with the General Rebuttal (Figure 2).
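The relation between the residual logit and the probability assigned to gate 'A' can also be written in closed form. The sketch below assumes, as in the snippet above, that the other 15 logits are zero; the helper names `residual_logit` and `residual_prob` are hypothetical.

```python
import math

def residual_prob(z3, num_gates=16):
    """Softmax probability of gate 3 when all other logits are zero."""
    return math.exp(z3) / (math.exp(z3) + (num_gates - 1))

def residual_logit(target_prob, num_gates=16):
    """Invert residual_prob: solve exp(z) / (exp(z) + num_gates - 1) = target_prob."""
    return math.log((num_gates - 1) * target_prob / (1.0 - target_prob))

# Logit that yields a 90% probability for gate 'A' (approximately 4.905).
z3_for_90_percent = residual_logit(0.9)

# Probabilities at the two ends of the ablation range mentioned above.
prob_low, prob_high = residual_prob(1.5), residual_prob(7.5)
```

With 16 gates, a 90% target gives `log(15 * 0.9 / 0.1) = log(135)`, which agrees with the value 4.905 used in the snippet.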
Typos
Thank you for pointing out the typos; we have corrected them, and will proofread everything for the camera-ready as suggested.
Questions:
How many additional trainable parameters are introduced during training in contrast to the traditional CNNs and which of them are discarded during the inference?
We use 16 parameters (vector $z$) for each differentiable logic gate. After training, each gate is discretized to a single parameter, i.e., the choice of logic gate. Finally, during the simplification process, depending on the exact model, 60-80% of the logic gates are removed.
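A minimal sketch of the discretization step, assuming it simply keeps the most likely gate for each set of 16 logits (the thread does not spell out tie-breaking or other details; `discretize` and `parameter_counts` are hypothetical names):

```python
def discretize(logits_per_gate):
    """Keep only the index of the most likely gate for each differentiable gate."""
    return [max(range(len(z)), key=lambda i: z[i]) for z in logits_per_gate]

def parameter_counts(num_gates, choices_per_gate=16):
    """Trainable floats during training vs. discrete gate choices kept for inference."""
    return num_gates * choices_per_gate, num_gates

# Example: a residually initialized gate stays 'A' (index 3) unless training moves it.
residual = [0.0] * 16
residual[3] = 4.905
trained = [0.0] * 16
trained[7] = 2.0  # hypothetical gate that learned OR (index 7)
choices = discretize([residual, trained])
```

The subsequent simplification pass that removes gates is separate from this step and is not modeled here.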
How do the authors comment on the observation that the proposed method seems to not generalize equally well to smaller architectures?
The proposed method leads to lower accuracies in smaller models (e.g., LogicTreeNet-S in Table 1)
First, we would like to state that we designed each of our models based on model size "L". Thus, for the smallest model, in order to match the number of logic gates of the baselines, we had to drastically reduce the number of channels down to 40, which was not the optimal configuration for the small size, but which we maintained for consistency.
Taking into account that they promote logic gate choice A, it would be very interesting if the authors report the per layer distribution of logic gates after the training. This could be interesting also in contrast to Gaussian initialization.
Thank you for this request. We have added a visualization of the per layer distribution of logic gates after the training to the Author Response PDF page.
I would like to thank the authors for the clarifications given and I appreciate their effort in answering my comments.
The authors discuss and clarify my comments, including additional experimental results. Thus, I update the score of both soundness and presentation from 3 to 4. Additionally, they address my main concern regarding the stochastic effects of random connections. To this end, I update the score for contribution from 3 to 5 and overall score accordingly.
The presented work is a significant extension to "Deep Differentiable Logic Gate Networks" previously presented at NeurIPS 2022 [7]. Additional contributions are the support for convolutions including logic gate trees / or-pooling and residual initializations. All additions together make it possible to train logic gate networks that are deeper, achieving SOTA accuracies and beyond while using remarkably fewer resources during inference and training as well. The authors promise to also make the code publicly available.
Strengths
Improving efficiency of (small scale) neural networks substantially. Lowest latency of all SOTA baseline results, the majority of them being much slower with even worse accuracy.
Weaknesses
There are a few issues with the clarity of the presentation. Upfront it should be mentioned that the Appendix is vital to understand many details (architectures, choice of parameters, memory usage, memory access, etc.) and should certainly be published as well. It was only supplementary material as part of the review.
Figure 4 only shows an effect, but does not explain why pre or-pooling is superior to the other two, and neither does the text.
Lines 228-229 seem to contradict the statement made in line 115. Does the training time mentioned in line 115 mark a baseline? If yes, then please state that and put it in relation to the "substantially improved computational training efficiency" later on.
In figure 6 (and from the associated text) it is not clear which 10 of the 18 subnetworks are trained. Is it possible to mark those? Are the blue networks the connection index tensors? A few more detailing remarks would help to understand the figure better.
Table 1 does not include a SOTA baseline using float weights. Even if that is a bit out-of-scope, it would help to put accuracies vs. number of bits (e.g. 32 (for FP32) * N parameters) in perspective.
Table 2 lists results for a Xilinx XC7Z020. If not mistaken, the upper limit for the number of configurable gates is ~1.3M. How does the execution of models M/L/G that all exceed that number by far actually work? Same for the MNIST model L in Table 3. Are any additional resources of the FPGA device being used? A breakdown would be very useful.
Table 5 contains a line for "No or pooling". It is unclear why it lists the number of total layers to be 10 and not 14. Please explain.
Appendix, Section A.3.3 CPU Inference Times: the CPU in use is not mentioned. Is this a desktop/workstation CPU or the ARM-based CPU of the Xilinx XC7Z020?
Typos:
line 441: "This means the that the..." -> "This means that the..."
Appendix, line 6: "... from Figure ??..." -> please cite the correct figure, my guess is #6.
Questions
Reflecting on lines 216-226: the reviewer's personal view of residual connections is similar to the authors' and can be summarized as a means to preserve (mutual) information throughout the network. Although residual initializations seem pivotal to training of LGNs, they could potentially also help float-based NN trainings without the need to add residual connections. If the authors share that view it would be great to add a discussion on this to the presentation.
Why is the choice of gates being limited to 2-input variations? Especially in the context of convolutions and pooling, wouldn't it make sense to also allow for N-input OR-gates with N >> 2? It would also allow for reducing depth and signal propagation delays.
Limitations
Despite improved training and inference efficiency, experimental results are limited to very small classification tasks. A single bigger task would prove scalability (or not).
Lines 416-417. The chosen model sizes (S,M,L,G) do not prove saturation of accuracy (except maybe for MNIST), but you even mention that they improve with increasing model depth. If there's a reason why you stop early, please mention it in the presentation.
Thank you very much for providing such extensive and positive feedback, and for appreciating our substantial efficiency improvements, and achieving the lowest latency of all SOTA baseline results. Due to the character limit, we keep our responses short; please let us know if you would like us to elaborate.
[...] Appendix is vital [...]
Thank you for this remark. For the camera-ready, papers have an additional page, so we can fit a few more details in the main paper and improve the clarity. Yes, we will publish the appendix as well.
Figure 4 only shows an effect [...]
Fig. 4 indeed only shows the effect that the model automatically regularizes itself to have an activation level of around 50%. (Ideally, act. levels are ~50% to maintain high information content.) "pre or-pooling" refers to the activation level before the or-pooling operation and "post or-pooling" refers to the activation level after the or-pooling operation (both from the same model). "no or-pooling" refers to a modified architecture without or-pooling. We clarified the notation and explanation for the revision.
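The remark that activation levels around 50% maintain high information content can be checked with the binary entropy of a single activation. This is a standard information-theoretic identity rather than something taken from the paper, and `binary_entropy` is a name chosen here for illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary activation that is 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy for a few activation levels; it peaks at p = 0.5 with exactly 1 bit.
levels = [0.1, 0.3, 0.5, 0.7, 0.9]
entropies = [binary_entropy(p) for p in levels]
```

An activation that is almost always 1 (or almost always 0) carries little information per bit, which is why saturation after or-pooling would be a concern if the network did not regularize itself.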
Lines 228-229 [...] line 115. [...]
In lines 114-115, we referred to vanilla LGNs, which we will explicitly clarify in the camera-ready. We provide overall training speeds in the supplementary, and offer to add a direct comparison between existing vanilla LGN, our vanilla LGN, and our convolutional LGN training speeds for the revision.
In figure 6 [...]
We apologize for the ambiguity. The layers that are trained are the "Conv" (/"C") and the "Rand" layers; each of the "Conv" blocks in the figure contains 2 layers of logic. Blue illustrates the input + hidden states; the index tensors are encoded within the green "Conv" and "Rand" blocks. We will explain and mark it in the revision.
SOTA baseline using float weights.
Thanks for the suggestion, we will include SOTA models with float weights into Table 1.
Table 2 lists results for a Xilinx XC7Z020. [...]
For CIFAR-10, as listed in the caption (Tab. 2), the times for M/L/G are based on CPU simulations.
For the MNIST L model, you are correct: based on our initially reported number of gates, it would not have fit on the FPGA. While we were able to fit the model on the FPGA, at the time of writing, we could only accurately compute the number of gates for the CIFAR-10 model, and used an upper bound for the number of gates for MNIST (we used the total number of gates during training).
In the rebuttal PDF (Fig. 1) we illustrate that a majority of gates is actually a trivial feedforward "A", which can be optimized away. Additional simplifications are also possible, e.g., "True and B" -> "B". As Vivado optimizes for 6-LUTs, we could not read out the number of ASIC gates. As there were no open-source libraries for logic gate network simplification that scale to our model, we developed our own stack, which at the time of writing only supported CIFAR. Now, it also supports MNIST, and we can report more accurate numbers of ASIC gates for MNIST:
| MNIST | # Gates (prev.) | # Gates (new) |
|---|---|---|
| LTNet-S | 296 K | 197 K |
| LTNet-L | 4.74 M | 671 K |
(The actual number of logic gates will still be lower than this number.)
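The two simplifications named in this response (removing feedforward "A" gates and rewriting "True and B" to "B") can be sketched as a small pass over a netlist. This is only an illustration: the netlist format, the `TRUE` marker, and the `simplify` function are hypothetical and do not describe the authors' simplification stack, which is not detailed in the thread.

```python
TRUE = "true"  # hypothetical marker for a constant-1 wire

def simplify(netlist, num_inputs):
    """Remove trivial gates from a netlist.

    netlist[i] = (op, x, y), where x and y are wire ids or TRUE.
    Wire ids below num_inputs are primary inputs; gate i drives wire num_inputs + i.
    Returns the remaining gates and a map from removed wires to their replacements.
    """
    alias = {}
    kept = []

    def resolve(wire):
        while wire in alias:
            wire = alias[wire]
        return wire

    for i, (op, x, y) in enumerate(netlist):
        out = num_inputs + i
        x, y = resolve(x), resolve(y)
        if op == "a":                      # feedforward gate: output equals first input
            alias[out] = x
        elif op == "and" and x == TRUE:    # True and B -> B
            alias[out] = y
        elif op == "and" and y == TRUE:    # A and True -> A
            alias[out] = x
        else:
            kept.append((out, op, x, y))
    return kept, alias

# Four gates over two inputs (wires 0 and 1); two of them are trivial.
example = [("a", 0, 1), ("and", TRUE, 1), ("xor", 2, 3), ("or", 4, 0)]
remaining, removed = simplify(example, num_inputs=2)
```

A fuller pass would also need constant propagation for other gates and removal of gates whose outputs are never used; those steps are omitted here.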
Are any additional resources of the FPGA device being used?
So far, we only utilize Logic Cells and Flip Flops to keep it as close as possible to efficient ASIC designs.
Table 5 [...] "No or pooling". [...]
The or pooling pools 2x2 inputs, and thus requires 2 levels (layers) of 2-input logic, which can be reduced to a single level on certain hardware (see below).
Appendix, Section A.3.3 CPU [...]
The CPU in Appendix A.3.3 is an AMD Ryzen 5 7600X (consumer desktop) CPU, and we utilize only a single thread of the CPU.
Thanks for pointing out the typos, we fixed them for the camera-ready.
Questions:
residual initializations
This could indeed be a very interesting direction for future work. We will include a discussion in the revision.
limited to 2-input
Beyond what we discussed in the paper, we actually considered, implemented, and evaluated 4-input and 6-input gates in the convolutional kernels. We observed that the 2-input tree formulation leads to more favorable learning dynamics, as well as a better trade-off between numbers of gates and accuracy, which is why we stuck with 2-input gates.
OR-gates with N >> 2
E.g., for OR-pooling, yes, these can be implemented, e.g., with a 4-input OR gate. Which specific hardware implementations wrt. chip area, delays etc. are best depends on the particular ASIC manufacturing process. For our models, we count the 4-input OR gate as 3 gates to have a conservative estimate that applies independently of the hardware.
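The gate counting for or-pooling can be made explicit with a small sketch that pools 2x2 windows using only 2-input OR gates: two gates on the first level and one on the second, i.e., 3 gates and 2 levels per output. The function name `or_pool_2x2` and the list-of-lists layout are assumptions made for this example.

```python
def or_pool_2x2(fmap):
    """Pool non-overlapping 2x2 windows of a binary feature map with 2-input ORs."""
    out = []
    for y in range(0, len(fmap) - 1, 2):
        row = []
        for x in range(0, len(fmap[0]) - 1, 2):
            top = fmap[y][x] | fmap[y][x + 1]             # level 1, gate 1
            bottom = fmap[y + 1][x] | fmap[y + 1][x + 1]  # level 1, gate 2
            row.append(top | bottom)                      # level 2, gate 3
        out.append(row)
    return out

fmap = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
pooled = or_pool_2x2(fmap)  # [[0, 1], [1, 0]]
```

On binary inputs this gives the same result as max pooling; hardware with a native 4-input OR could collapse the two levels into one.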
Limitations:
Despite improved training and inference efficiency, experimental results are limited [...] scalability
Thanks for the questions. We are indeed actively working on larger classification tasks for the proposed approach and consider this an important research question. Our current preliminary designs are internally reaching a performance of 48.06% on ImageNet (top-1). We will continue this direction and will hopefully reach even more generalist models in the future.
[...] saturation of accuracy [...] improve with increasing model depth. If there's a reason why you stop early, please mention it in the presentation.
The reason for us to stop rather early was computational training cost. Networks with d=3 are even more expensive to train (~2x compared to d=2). We let the d=3,3,3,3 model continue to train after submission and it reached 85.46% (vs. 85.22% in Tab. 5). Notably, the deeper models are barely more expensive in inference because the deeper models end up with more residual gates.
Inspired by your comment, we extended the MNIST model to a larger and deeper model with d=3, and now achieve 99.24%, which improves the accuracy over all baselines:
| MNIST | Acc. | # Gates |
|---|---|---|
| LTNet-XLD3 | 99.24±0.06% | 1.82 M |
I would like to thank the authors for the clarifications given and I appreciate their effort in answering my comments and especially conducting (and possibly including) even more experiments.
[...] Yes, we will publish the appendix as well.
Squeezing the Appendix into tiny space may make it less useful. In that case, feel free to consider the idea to publish a full, detailed version of the Appendix together with the source code repository and refer to it in the paper and/or short version of the Appendix.
[...] In the rebuttal PDF (Fig. 1) [...], we developed our own stack, [...]
I'm using OpenReview for the first time, so please forgive me if I'm wrong here. I don't see a rebuttal PDF, only a PDF that seems to be the original version. Is there a (not that obvious button) to show the updated version? I do see a rebuttal PDF extending the Appendix and ablation study (alone).
For the second part of my citation, it would be great to describe the gate-level optimizations developed and applied and also refer to the publication of source code of it if you plan to share those details as well.
Thank you for responding to our rebuttal, and for asking for the clarifications.
For the final publication, we will publish the full appendix along with the paper, and also include it with the source code.
The rebuttal PDF can be found in the general rebuttal titled “Author Rebuttal by Authors” at the top of this page. It is a single PDF page with 2 figures, which we will include in the final paper / appendix.
For the developed gate-level optimizations, we will include the details in the final appendix.
This paper introduces a convolutional logic gate network (LGN), which works effectively on high-dimensional spatial images. Inspired by LGN and convolutional neural nets (CNNs), the authors propose (1) a Logic Gate Tree as convolutional kernels, (2) Logical OR as the pooling layer, and (3) residual initialization (instead of Gaussian random init). Besides, the authors also developed an engineering strategy to speed up the training, using low-level CUDA kernels, which is admirable. The authors have shown impressive results, in terms of performance and efficiency, on the CIFAR-10 and MNIST datasets.
Strengths
- Several novel ideas (technical contributions) exist in this paper. I especially admire the idea of primarily using a feedforward logic gate during initialization, which prevents both loss of information and vanishing gradients. The motivation and intuition are very clear in L208-215.
- The authors have demonstrated their design using insightful experiments. For example, when introducing logic OR as pooling, the authors discussed that training can implicitly prevent saturation of activations using experiments, which is very interesting.
- The experimental performance is impressive.
- The engineering strategy and CUDA implementation (and open-source) would benefit the community and future research a lot.
- The paper is very well-written. Though I'm not an expert in this domain, it is easy to understand the storyline, technical details, related works, and intuition.
Weaknesses
- Some suggestions on Figure presentation.
  - Figure 1. Also consider showing the speed advantage, as today's high-performance edge devices (Nvidia Xavier, Orin, etc.) can accommodate large-weight networks and the weights of other works are already relatively small. Reporting that you can run inference on a CIFAR-10 image in ~0.7 on an FPGA chip would be very impressive even without looking into your paper. Besides, what is the "Pareto-curve" (in the caption)? I didn't see it shown in the figure.
  - Figure 2. Maybe change the input into some "flattened inputs" (L112) to better show that vanilla LGNs are not designed to process images.
  - Figure 3. Maybe re-arrange the figure and get some space for the details of your structure. Better to also show how your network processes "channels", as currently it is not clear from the figure. I also suggest adding some annotations, e.g., in the figure, adding the same notations like "depth d=2", noting that the green squares are the selected Cm Ch Ck, and mentioning in the caption that your NN can share weights like a CNN in different spatial regions. Polishing this figure can help the reader understand the processing details quicker than understanding Eqn. (3).
- Some technical questions need to be better explained.
  - L114-115, why are vanilla LGNs "very computationally expensive to train"? Is it because they didn't implement CUDA kernels?
  - I noticed that in the design of the network, the authors chose a relatively small depth but large channels (2 vs 40,400,2000+). Is there any intuitive reason to do so? How many layers (depth) does vanilla LGN have?
  - The authors have implemented a CUDA kernel, but in the speed comparison (Table 2), the results are from a Xilinx FPGA (which I guess only has a CPU). Why didn't the authors run experiments on GPU? Is it for fair comparison w/ others? Maybe I missed something, but on CPU, what's the advantage of implementing a CUDA kernel?
Questions
See weaknesses.
Limitations
The authors could provide a paragraph discussing their potential limitation to solving more complex CV tasks involving continuous decisions. E.g., regressing boundaries of bounding boxes (Object Detection/Tracking), localization and mapping (SLAM), generative CV, etc.
Thank you so much for your positive feedback, and for appreciating the feedforward gates during initialization, our ablation studies, our experimental performance, as well as our engineering and CUDA implementations. We appreciate that you find our "paper is very well-written". In the following, we address each of your recommendations, questions, and concerns.
Some suggestions on Figure presentation. Figure 1. [...] Figure 2. [...] Figure 3. [...]
Thank you for each of these helpful suggestions. We will incorporate them into Figures 1, 2, and 3, as well as their captions.
Some technical questions need to be better explained. L114-115, why are vanilla LGNs "very computationally expensive to train"? [...]
The primary reason vanilla LGNs are very computationally expensive to train is that they are bottlenecked by memory and cache read and write accesses during training. In contrast, with the proposed convolutional structure, parameter sharing allows loading fewer parameters into a cache that is shared between different cores of the GPU, so that each parameter needs to be read only once. Moreover, by fusing the tree structure and not storing any intermediate results in memory, expensive global memory writes are drastically reduced. If the tree depth is d=2 and a 2×2 maxpool is fused to it, then 15 logic gates are executed during the forward pass, and only a single output activation has to be stored. While this requires recomputing intermediate results during the backward pass, only one out of four paths through the pooling needs to be backpropagated through, and the choice of this path requires storing only 2 bits. Combined, this leads to much higher utilization of the memory bandwidth, while at the same time reducing memory access requirements and drastically improving the utilization of the actual compute units in the GPU. Furthermore, the sparsity pattern introduced by using convolutions is also more favorable for memory accesses.
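The gate count and the 2-bit path index can be illustrated with a small sketch. It is not the authors' CUDA kernel: it assumes a depth-2 tree (3 two-input gates on 4 inputs) fused with a 2×2 or-pooling window, a reading that is consistent with the "15 logic gates" and "one out of four paths" figures in the response, and it works on hard binary values, where max and OR coincide. The function names and the choice of gates are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Gate = Callable[[int, int], int]


def tree_depth2(x: Sequence[int], gates: Tuple[Gate, Gate, Gate]) -> Tuple[int, int]:
    """Evaluate a binary tree of 3 two-input gates on 4 input bits.

    Returns (output bit, number of gates executed)."""
    g_left, g_right, g_root = gates
    left = g_left(x[0], x[1])
    right = g_right(x[2], x[3])
    return g_root(left, right), 3


def fused_tree_or_pool(window: Sequence[Sequence[int]],
                       gates: Tuple[Gate, Gate, Gate]) -> Tuple[int, int, int]:
    """Apply the same tree at the 4 positions of a 2x2 window, then OR-pool.

    Only the pooled bit and a path index in {0, 1, 2, 3} (2 bits) are kept;
    the four intermediate tree outputs are not stored."""
    outs: List[int] = []
    n_gates = 0
    for x in window:
        y, n = tree_depth2(x, gates)
        outs.append(y)
        n_gates += n
    # OR over four values, written as a small tree of three 2-input ORs.
    pooled = (outs[0] | outs[1]) | (outs[2] | outs[3])
    n_gates += 3
    # First position attaining the maximum: the path to backpropagate through.
    path = outs.index(max(outs))
    return pooled, path, n_gates


# Hypothetical gate choice for illustration: AND, OR at the leaves, XOR at the root.
GATES = (lambda a, b: a & b, lambda a, b: a | b, lambda a, b: a ^ b)
WINDOW = [[0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [1, 1, 1, 1]]
pooled, path, n_gates = fused_tree_or_pool(WINDOW, GATES)
```

Under these assumptions the count is 4 × 3 tree gates plus 3 OR gates, i.e. 15, and the stored state per output is one activation plus a 2-bit index.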
Beyond this, we have made contributions to faster training of vanilla LGNs, e.g., reducing memory access from reading 18 floats down to reading 6 floats by precomputing coefficients in a simplification of Eq. (1). In particular, Eq. (1) can be rewritten in terms of a small set of coefficients, which we precompute in a separate kernel and thus only have to compute once for the entire batch. This constitutes another fundamental speedup that applies to both vanilla LGNs and convolutional LGNs, reducing both memory and compute requirements.
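One plausible reading of this simplification, sketched below, is that every real-valued relaxation of a two-input gate is a polynomial of the form c0 + c1·a + c2·b + c3·a·b, so a probability-weighted mixture of 16 gates collapses into 4 coefficients. That would match the stated reduction from 18 floats (16 probabilities plus 2 inputs) to 6 floats (4 coefficients plus 2 inputs). The exact form used by the authors is not shown in the response; the function names here are hypothetical and the code is plain Python rather than a GPU kernel.

```python
from itertools import product

# All 16 two-input gates, each given by its truth table (t00, t01, t10, t11),
# where t_ab is the gate output for inputs (a, b).
TRUTH_TABLES = list(product((0, 1), repeat=4))


def gate_coeffs(t):
    """Coefficients (c0, c_a, c_b, c_ab) of the multilinear extension of a gate."""
    t00, t01, t10, t11 = t
    return (t00, t10 - t00, t01 - t00, t00 - t01 - t10 + t11)


def naive_output(p, a, b):
    """Mixture over all gates: touches 16 probabilities plus 2 inputs per evaluation."""
    total = 0.0
    for p_i, (t00, t01, t10, t11) in zip(p, TRUTH_TABLES):
        f = (t00 * (1 - a) * (1 - b) + t01 * (1 - a) * b
             + t10 * a * (1 - b) + t11 * a * b)
        total += p_i * f
    return total


def precompute(p):
    """Collapse 16 gate probabilities into 4 coefficients (once per gate, not per sample)."""
    w = [0.0, 0.0, 0.0, 0.0]
    for p_i, t in zip(p, TRUTH_TABLES):
        for k, c in enumerate(gate_coeffs(t)):
            w[k] += p_i * c
    return w


def fast_output(w, a, b):
    """Touches only 4 coefficients plus 2 inputs per evaluation."""
    return w[0] + w[1] * a + w[2] * b + w[3] * a * b


# Example: an arbitrary normalized distribution over the 16 gates.
raw = [float(i + 1) for i in range(16)]
P = [r / sum(raw) for r in raw]
W = precompute(P)
```

Because the coefficients depend only on the gate distribution and not on the inputs, they can be shared across all samples in a batch, which is the property the response relies on.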
I noticed that in the design of the network, the authors chose a relatively small depth but large channels (2 vs 40,400,2000+). Is there any intuitive reason to do so? How many layers (depth) does vanilla LGN have?
The intuitive reason for the large number of channels compared to the depth is that the network is very sparse and only uses logic, and thus the expressivity of each channel is smaller (compared to a conventional CNN). Thus the model requires more channels in order to attain high overall expressivity.
The depth of 2 that you mention refers to the depth of each convolutional block. With 4 convolutional blocks and 2 randomly connected layers for the head, the total trainable depth of our model is 10 layers. Including the or-pooling, we have a total of 18 layers.
The best performance with vanilla LGNs is achieved with 4-6 layers. [7] reported trying up to 8 layers, and from our own experiments we can confirm that vanilla LGNs with 8 or more layers converge extremely slowly and to lower accuracies. The best vanilla LGN MNIST model uses 6 layers and the best vanilla LGN CIFAR-10 model uses 5 layers. The best vanilla LGN for CIFAR-10 requires 1,024,000 neurons per layer.
Thus, our CIFAR-10 model has 2x the trainable depth and 3.6x the total depth compared to the vanilla LGN, while having substantially fewer channels.
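The depth figures above can be checked with a few lines of arithmetic. The trainable depth follows directly from the numbers in the response; the total of 18 is reproduced here under the assumption, not stated explicitly in the response, that each 2×2 or-pooling step is counted as two layers of binary OR gates.

```python
conv_blocks = 4
tree_depth = 2        # gate layers per convolutional block
head_layers = 2       # randomly connected layers in the head

trainable_depth = conv_blocks * tree_depth + head_layers

# Assumption: each or-pooling over a 2x2 window counts as a depth-2 tree of ORs.
pool_depth = 2
total_depth = trainable_depth + conv_blocks * pool_depth

vanilla_depth = 5     # best vanilla LGN on CIFAR-10, as stated in the response
trainable_ratio = trainable_depth / vanilla_depth
total_ratio = total_depth / vanilla_depth
```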
The authors have implemented a CUDA kernel, but in the speed comparison (Table 2), the results are from a Xilinx FPGA (which I guess only has a CPU). Why didn't the authors run experiments on GPU? Is it for fair comparison w/ others? Maybe I missed something, but on CPU, what's the advantage of implementing a CUDA kernel?
To clarify, while all training is performed on GPU (as it requires float operations), inference is performed on FPGAs or CPUs, as it only requires bitwise logical operations. While we could have also run inference on GPU, GPUs are highly optimized for float operations rather than bitwise logic; further, for GPUs the speed of transferring input data would be the bottleneck. As FPGAs, which are used in hardware design, are effectively slow proxies of ASICs, they are the closest one can get to ASICs without actually manufacturing ASICs.
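To make "only requires bitwise logical operations" concrete, here is a minimal sketch of bit-packed inference: one bit per sample is packed into an integer, so a single bitwise instruction evaluates a gate for the whole batch at once. The three-gate network and all names below are hypothetical and only illustrate the general technique, not the authors' inference code.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values (one per sample) into one integer; sample i -> bit i."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word


def unpack_bits(word, n):
    """Inverse of pack_bits for n samples."""
    return [(word >> i) & 1 for i in range(n)]


def tiny_network(a, b, c, n):
    """Hypothetical 3-gate network evaluated for n samples simultaneously."""
    mask = (1 << n) - 1
    h1 = a & b                # AND gate
    h2 = ~(b ^ c) & mask      # XNOR gate, masked back to n bits
    return h1 | h2            # OR gate


# Four samples, three binary input features each.
A_BITS = [1, 0, 1, 0]
B_BITS = [1, 1, 0, 0]
C_BITS = [0, 1, 0, 1]
packed_out = tiny_network(pack_bits(A_BITS), pack_bits(B_BITS), pack_bits(C_BITS), 4)
outputs = unpack_bits(packed_out, 4)
```

On a CPU with 64-bit registers the same idea processes 64 samples per instruction, whereas in an FPGA or ASIC each gate maps directly to hardware logic.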
The authors could provide a paragraph discussing their potential limitation to solving more complex CV tasks involving continuous decisions. E.g., regressing boundaries of bounding boxes (Object Detection/Tracking), localization and mapping (SLAM), generative CV, etc.
Indeed, we have not explored CV tasks involving continuous decisions. Continuous decisions, as the ones listed, are a great direction for future work, and we have included a paragraph in the camera-ready to highlight this.
We would like to thank all reviewers for their time and valuable comments, which have helped us improve our paper. We respond to each of your questions and concerns individually below. Moreover, we would like to highlight the following additions:
- We are now providing standard deviations for our smaller models, and have started training additional seeds for our larger model, which we will include in the camera-ready.
- Inspired by Reviewer zmod's comments, we have trained a larger and deeper model on MNIST, which reaches 99.24±0.06% accuracy with only 1.82 M gates, the best accuracy among logic gate and binary networks overall.
- In the author response PDF, we provide an illustration of the learned distributions over logic gates, comparing residual and Gaussian initializations.
- Finally, we added an ablation study wrt. .
The paper introduces a significant improvement in training Logic Gate Networks, with application to image data. The work introduces convolutional logic gate networks, pooling, and residual connectivity operations, and shows that they can be trained to achieve impressive performance.
All reviewers appreciated the novelty and contribution, provided very enthusiastic reviews and endorsed the paper, after some discussion with the authors. I therefore recommend that this paper should be accepted.