PaperHub
Overall rating: 6.4/10 · Poster · 4 reviewers
Ratings: 3, 5, 4, 4 (min 3, max 5, std 0.7)
Confidence: 2.8 · Novelty: 3.5 · Quality: 3.0 · Clarity: 3.5 · Significance: 2.8
NeurIPS 2025

G-Net: A Provably Easy Construction of High-Accuracy Random Binary Neural Networks

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

High-Accuracy Random Binary Neural Networks

Abstract

Keywords
Hyperdimensional Computing · Random Binary Neural Networks

Reviews and Discussion

Review
Rating: 3

The paper focuses on hyperdimensional computing (HDC), where the data is first transformed into a very high-dimensional binary vector, and then inference is performed via a simple linear method. This paper shows how a network of a specific (but still quite generic) form can be converted into a binary (in some sense) network in the infinite-width limit, and convergence results are presented for finite widths. Experiments on datasets such as MNIST and CIFAR-10 are presented. The central result of the paper is to train a floating-point network and binarize it afterwards in a way that loses nothing (in the infinite limit; at finite width, something is lost and the paper quantifies how much).

The main identity behind all the constructions is the Grothendieck identity, which quantifies the probability that two vectors $u, v$ lie on the same side of a hyperplane with normal vector $g$, where $g$ is normally (or Rademacher) distributed; the probability is a function of $\langle u, v \rangle$ for unit vectors $u, v$. Then, by the law of large numbers, if we draw enough normal vectors $g_i$, we can recover $\langle u, v \rangle$ from the binary indicators $\mathrm{sign}(\langle g_i, u \rangle)$. One can thus express the inner product between the input and the weight matrix using the inner product of the binary embeddings ($\mathrm{sign}(\langle g_i, u \rangle)$ being the $i$-th element of the embedding of $u$), modulo some technicalities such as input normalization and the application of the non-linearity. This is then used for every layer sequentially.
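To make the mechanism concrete, here is a minimal numerical sketch of the identity and the sign-based recovery described above (my own illustration in NumPy, not the authors' code; dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 32, 200_000                       # ambient dimension, number of random hyperplanes

u = rng.standard_normal(d); u /= np.linalg.norm(u)
v = rng.standard_normal(d); v /= np.linalg.norm(v)

G = rng.standard_normal((N, d))          # rows are the random normal vectors g_i
b_u = np.sign(G @ u)                     # binary indicators sign(<g_i, u>)
b_v = np.sign(G @ v)

# Grothendieck identity: E[sign(<g, u>) * sign(<g, v>)] = (2/pi) * arcsin(<u, v>)
empirical = np.mean(b_u * b_v)
recovered = np.sin(0.5 * np.pi * empirical)   # invert the identity to estimate <u, v>

print(f"true <u,v> = {u @ v:.4f}, recovered from binary codes = {recovered:.4f}")
```

By the law of large numbers the estimate converges to $\langle u, v \rangle$ as $N$ grows, which is exactly the property the construction relies on.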

Strengths and Weaknesses

The technique itself is neat and I like the rigorous quantification of the approximation error. I am not aware of the technique being used in a similar context, but my knowledge of the field is arguably limited.

On the other hand, the motivation seems limited. The goal of this paper is a bit vague. It presents a way of constructing (binary, in some sense) networks with certain goals in mind (efficient computation), but the evaluation focuses solely on accuracy on a few datasets, and it is not clear whether this method should be preferred over a standard CNN, because no concrete scenario where a standard CNN is worse (or not applicable) is presented.

I personally did not understand the HDC link as described in the paper. Specifically, this paper does not propose a method falling into that framework (as described in the paper) in a nontrivial sense; i.e., there is no high-dimensional representation with a linear classifier. Instead, to me this work falls into the binary (resp. integer) network paradigm, where, with the suitable design choices made in the paper, the internal representations/operations are integral. Thus, I am not convinced the baselines are chosen well.

To summarize, the paper is entertaining and presents an interesting technique, but I find it hard to conclude more from it, because the task considered in the paper is not well specified and I fail to see a concrete scenario where I would consider using this technique. I think the paper would benefit from a proper task definition and a subsequent comparison with other methods on that task.

Typo

I think Fig. 2 contains wrong variable names in places. Take the embedded layer: first, we compute the embedding $y = \mathrm{sign}(Gx)$, then we should multiply this with $\mathrm{sign}(GW)$ (cf. eq. (3)), but the "linear transform by embedded weights" multiplies this weight term by $x$ (and not by $y$). This seems to be a consistent typo in the figure, where $x$ is often used instead of $y$.
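For concreteness, here is a small sketch of the computation described above (shapes and variable names are my assumptions, not the paper's notation): the input embedding $y = \mathrm{sign}(Gx)$ is what the embedded weights $\mathrm{sign}(GW^\top)$ should act on, recovering roughly $(2/\pi)\arcsin(Wx)$ for row-normalized $W$ and a unit-norm $x$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, N = 16, 8, 100_000

x = rng.standard_normal(d); x /= np.linalg.norm(x)      # unit-norm input
W = rng.standard_normal((m, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)           # row-normalized weights

G = rng.standard_normal((N, d))
y = np.sign(G @ x)                                      # input embedding sign(Gx)
E = np.sign(G @ W.T)                                    # embedded weights sign(GW^T), shape (N, m)

approx = (E.T @ y) / N                                  # linear transform by embedded weights, applied to y
target = (2.0 / np.pi) * np.arcsin(W @ x)
print(np.max(np.abs(approx - target)))                  # small for large N
```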

Questions

The network architecture used in the paper is a bit weird (chosen so that it can be binarized), and the experiments were presented only on small datasets. Can the method be scaled up to (say) ImageNet? CIFAR-100? If not, what is the bottleneck? Does the training become challenging, or is it the binarization?

Limitations

Final Justification

I have read the response and I was not convinced by the answers, so I keep my score.

In general, there is nothing really wrong with the paper, but it does not reach the bar for me. If the rest of the reviewers and the AC decide to accept the paper, I think that is fine.

Formatting Issues

Author Response

We sincerely thank the reviewer for their thoughtful feedback. We appreciate the valuable remark on the novelty and contribution of our approach and are grateful for the recognition of the rigor in our approximation analysis. Below, we respond to the specific points raised by the reviewer, using "Q" to indicate questions and "W" to denote identified weaknesses.


Network Architecture (Q):

We did not observe any bottlenecks due to our non-standard architecture that would cause issues when scaling to larger datasets.

Because of the high-dimensional nature of HDC models and their emphasis on simplicity and interpretability, they are often evaluated on relatively smaller datasets such as MNIST, UCI HAR, and European Languages. In our experiments, we aimed to explore a more diverse set of datasets, including more challenging (in terms of achievable accuracy) and less commonly used benchmarks in the HDC literature, such as CIFAR-10. The use of very large datasets, such as ImageNet, is uncommon in HDC research due to the substantial memory and computational demands associated with high-dimensional vector encoding and classification. In future work, this can be extended to even more challenging datasets like ImageNet if applications beyond HDC are considered and lower dimensional embeddings become involved.

Beyond the binary embedding, the only non-standard aspects of the network architecture we introduce are: (1) $\ell_2$-norm normalization of the weights and input, and (2) composition of the activation by $\arcsin$. First, the $\ell_2$-norm normalization is equivalent to using a cosine similarity instead of a standard inner product when composing layers of the network. In fact, recent work has shown that such normalization helps with the performance of neural networks. References [32] (Salimans & Kingma), [24] (Luo et al.), [38] (Wu et al.), and references therein highlight key advantages of $\ell_2$ normalization, which we briefly summarize here. In [32], the authors demonstrate that $\ell_2$ normalization improves gradient conditioning and accelerates convergence during optimization. Luo et al. [24] advocate for using cosine similarity (equivalent to the Pearson Correlation Coefficient) instead of the dot product in neural networks, as the unbounded nature of dot products can lead to high variance in neuron activations. This high variance increases sensitivity to input distribution shifts, impairs generalization, and exacerbates internal covariate shift, ultimately slowing down training. Wu et al. [38] extend this idea to convolutional layers by applying cosine similarity, aiming to mitigate the variance issues associated with standard convolution operations. While methods such as batch normalization have gained more popularity due to their simpler gradient computations, both prior literature and our experimental results consistently show that $\ell_2$ normalization enhances model accuracy. Across a wide range of standard CNN architectures, we observed that the G-Net variants incorporating $\ell_2$ normalization consistently outperformed their CNN counterparts in terms of accuracy.
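As a concrete illustration of point (1), a minimal sketch of a cosine-similarity linear layer (the function name and interface are my own, not the paper's API):

```python
import numpy as np

def cosine_linear(x, W, eps=1e-12):
    """Linear layer with an l2-normalized input and row-normalized weights:
    each output coordinate is the cosine similarity between x and a row of W."""
    x_hat = x / (np.linalg.norm(x) + eps)
    W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
    return W_hat @ x_hat          # entries lie in [-1, 1]
```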

Second, empirical results suggest that composing the activation with the $\arcsin$ function does not significantly impact performance. In practice, neural networks have been successfully trained using a wide range of nonlinear activation functions, which may explain why our approach achieves comparable accuracy to standard activations such as ReLU or Sigmoid. Rather than heuristically introducing a new activation function, the G-Net formulation provides a principled justification for the proposed activations. We believe this offers a meaningful technical contribution to the broader discussion on activation function selection.
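For point (2), the composition can be sketched as follows (the particular base nonlinearity and the $2/\pi$ scaling are placeholder choices of mine; the paper's exact form may differ):

```python
import numpy as np

def arcsin_activation(z, base=np.tanh):
    """Apply arcsin to a cosine-similarity output in [-1, 1] before the usual
    nonlinearity; the (2/pi) factor maps the arcsin range back into [-1, 1]."""
    return base((2.0 / np.pi) * np.arcsin(np.clip(z, -1.0, 1.0)))
```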


Limited Motivation & Preference Over a Standard CNN (W):

We began the paper by presenting HDC as the central motivation. As noted in the introduction and discussed in detail in Section S1 of the Supplement, HDC techniques typically consist of an embedding step followed by an inference step. However, these components are often designed heuristically and lack effective coordination, which is a key reason for the reduced accuracy commonly observed in HDC models. To address this issue, Section 2 introduces the bundle embedding problem and explains that resolving this problem could enable HDC models to achieve accuracies comparable to those of high-performance models trained directly in the primal space (i.e., with floating-point storage and operations). The remainder of the paper presents our proposed strategy for solving the bundle embedding problem, including its mathematical formulation, theoretical analysis, and empirical evaluation. We would be glad to incorporate any additional comments that might help further clarify or strengthen the motivation behind our work.


HDC Link (W):

Consider a classic HDC framework, where the embedding stage maps input samples into high-dimensional binary vectors, followed by a simple linear classifier applied to these representations. While conceptually elegant, this approach often lacks the flexibility and expressive power required for complex datasets, resulting in poor classification performance.

To address this limitation, one can envision a multi-stage embedding pipeline designed to progressively transform the data such that the samples become increasingly linearly separable. The key challenge then becomes: "how can we coordinate these consecutive embeddings to ensure that, by the final stage, the data is suitably structured for accurate linear classification?"

This is precisely where the G-Net framework comes into play. Once trained, each layer of the EHD G-Net can be interpreted as an embedding stage. These stages are not independent or heuristic but are carefully coordinated through the G-Net's architecture. As a result, by the time the samples reach the final classification layer, the data has been transformed to a space where class separability has matured, allowing the final softmax layer to perform effective and accurate classification.

The proposed framework can essentially be viewed as an HDC pipeline with multi-stage embeddings, in which the consecutive embedding stages are systematically coordinated through the initial training of the G-Net in the primal data space. This viewpoint has already been briefly discussed in the paper (lines 187-191) and Section S1 of the Supplement; in the final version, we will further clarify this point.


Proper Task Definition (W):

Regarding the task specification, as stated earlier, under an HDC framework the stated problem is the bundle embedding problem, and G-Net was proposed as a strategy to solve it (please refer to the extended response above on limited motivation).


Variable Names in Fig 2 (W):

We appreciate the comment. If we wanted to use global variables, we would need many variables and long expressions in Figure 2. Instead, we followed the convention that the input to each block of Figure 2 is denoted by $x$ and the output by $y$, and each block explains how the input ($x$) and output ($y$) are related. Following your note, we have added a clarifying remark in the caption that, for each block, $x$ represents the input and $y$ represents the output.

Review
Rating: 5

This paper introduces G-Net, a novel approach for constructing high-accuracy random binary neural networks. The method is inspired by hyperdimensional computing (HDC) and aims to bridge the gap between traditional neural networks and randomized binary neural networks. The core idea is to train a "floating-point" G-Net in the real-valued domain and then convert it into an Embedded Hyperdimensional (EHD) G-Net, which operates on binary data, without requiring further training in the binary space. This conversion is theoretically justified by Grothendieck's identity and concentration of measure, ensuring that the EHD G-Net retains the accuracy of its floating-point counterpart. The paper provides theoretical consistency guarantees for both Gaussian and Rademacher embeddings and demonstrates strong empirical performance, significantly outperforming prior HDC models on benchmark datasets like MNIST and CIFAR-10, while rivaling real-valued CNNs.

Strengths and Weaknesses

Strengths:

  • The paper provides rigorous theoretical guarantees for the accuracy preservation during the conversion from G-Net to EHD G-Net, leveraging Grothendieck's identity and concentration of measure. This is a significant contribution, as many prior HDC works lack such deep theoretical underpinnings.
  • The concept of training in the real domain and then performing an "inexpensive binary encoding" to achieve high-accuracy binary networks without training in the high-dimensional, constrained binary space is highly innovative and practical.
  • The experimental results clearly show that G-Nets outperform existing HDC models by substantial margins (e.g., almost 30% higher accuracy on CIFAR-10 compared to prior HDC models). They also achieve accuracies comparable to real-valued convolutional neural networks, which is a major step forward for HDC.
  • The motivation from HDC for efficient hardware implementation and the exploration of Rademacher embeddings (which are simpler to generate and apply than Gaussian vectors) highlight the practical relevance and potential for deployment on edge and low-energy devices.
  • The paper is well-organized, with a clear introduction of concepts, theoretical results, and experimental validation.
  • The authors provide a link to their GitHub repository, which significantly aids reproducibility and allows for further research and application of their method.

Weaknesses:

  • G-Nets require $\ell_2$-normalization of inputs and row-wise normalization of weight matrices. While the authors argue these are advantageous, they might limit flexibility or require specific preprocessing steps not explicitly detailed for all scenarios.
  • The network consistency analysis (Section 4.1) focuses on the base ASU G-Net case and defers comprehensive treatment for TASU and RASU layers to an extended presentation. While understandable for brevity, it leaves a theoretical gap in the main paper for the more practically relevant RASU/TASU architectures.
  • The $\sqrt{n/p}$ discrepancy term in the Rademacher embedding that doesn't vanish with increasing $N$ is acknowledged, but its practical implications or strategies to mitigate it could be discussed further, especially given the claim that Rademacher embeddings achieve "comparably close performance."

Questions

  • Could the authors elaborate on the practical implications of the $\ell_2$-normalization and row-wise weight normalization constraints for real-world datasets that might not inherently adhere to these properties? Are there recommended preprocessing steps or training adjustments for such cases?
  • For practical deployment, when would a user prefer a RASU G-Net over a TASU G-Net?
  • Regarding the assumption $|w_i^T x| \geq l_{\min} > 0$ for TASU analysis (Theorem 4.2), how robust are the empirical results to violations of this assumption? Are there strategies to ensure this condition is met during training, or does it hold naturally in common scenarios?
  • The paper mentions that "wide neural networks experience only minor changes in their weight distributions during training." Could the authors expand on how this property is leveraged or maintained to ensure the "spread-out" condition for applying Theorem 4.4 and Proposition 5.1?

Limitations

Yes

Formatting Issues

NA

Author Response

We sincerely thank the reviewer for their thoughtful comments and valuable feedback. We are grateful for the recognition of our theoretical contribution and appreciate the encouraging remarks highlighting its distinction from prior HDC studies. Below, we respond to the specific points raised by the reviewer, using "Q" to indicate questions and "W" to denote identified weaknesses.


Implications of the $\ell_2$-Normalization (Q & W):

References [32] (Salimans & Kingma), [24] (Luo et al.), [38] (Wu et al.), and references therein highlight key advantages of $\ell_2$ normalization, which we briefly summarize here. In [32], the authors demonstrate that $\ell_2$ normalization improves gradient conditioning and accelerates convergence during optimization. Luo et al. [24] advocate for using cosine similarity (equivalent to the Pearson Correlation Coefficient) instead of the dot product in neural networks, as the unbounded nature of dot products can lead to high variance in neuron activations. This high variance increases sensitivity to input distribution shifts, impairs generalization, and exacerbates internal covariate shift, ultimately slowing down training. Wu et al. [38] extend this idea to convolutional layers by applying cosine similarity, aiming to mitigate the variance issues associated with standard convolution operations. While methods such as batch normalization have gained more popularity due to their simpler gradient computations, both prior literature and our experimental results consistently show that $\ell_2$ normalization enhances model accuracy. Across a wide range of standard CNN architectures, we observed that the G-Net variants incorporating $\ell_2$ normalization consistently outperformed their CNN counterparts in terms of accuracy.

Importantly, $\ell_2$ normalization imposes no practical limitations on the learning task. In classification settings with a softmax output layer, scaling the outputs does not affect the resulting class probabilities. For regression problems, the response variable can be scaled by a factor such as $y_{\max}$, or an additional learnable scaling parameter can be introduced in the output layer to match the scale of the target values. For implementation purposes, Section S12 of the supplementary material provides a detailed and comprehensive guide for applying $\ell_2$ normalization across different layers, such as fully connected, convolutional, general linear, and classification versus regression output layers. It also presents numerical strategies for incorporating bias in linear layers to enhance the concentration properties of the EHD G-Net.


RASU G-Net over a TASU G-Net (Q):

As noted in Section 3 (lines 199–200), the RASU layer outputs non-negative integer vectors, while the TASU layer produces fully binary outputs. When hardware resources support integer operations—such as via expanded binary representations—RASU networks typically yield higher accuracy, as shown in our experiments. Otherwise, TASU networks operate entirely in binary mode, with a slight trade-off in accuracy.


Assumption $|w_i^T x| > l_{\min} > 0$ for TASU analysis (Q):

This assumption stems from the properties of the sign function, which in our formulation outputs only $\pm 1$ and is undefined at zero; see the proof of Theorem 4.2. While it is possible to define $\mathrm{sign}(0) = 0$, doing so would result in a ternary rather than binary network. Many HDC frameworks address this by randomly assigning $\pm 1$ when a zero occurs. In practice, the condition $w_i^T x = 0$ is extremely rare, and experimentally, how we handled this edge case did not make a practical difference in the results. An alternative presentation of Theorem 4.2 could have excluded neurons producing zero outputs from the analysis, as, given their rarity, their effect on discrepancies is minimal. However, the current form of Theorem 4.2 offers a more concise and readable presentation.
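For illustration only, the random tie-breaking convention mentioned above can be written as the following small helper (a sketch, not the code used in the paper):

```python
import numpy as np

def sign_pm1(z, rng=None):
    """Sign that outputs only +/-1: zeros, which are rare in practice, are
    broken at random following the common HDC convention described above."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.sign(np.asarray(z, dtype=float))
    zeros = (s == 0)
    s[zeros] = rng.choice([-1.0, 1.0], size=int(zeros.sum()))
    return s
```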


Minor Distribution Shift of Wide Neural Networks (Q):

To illustrate this property numerically, we performed an experiment (which will be included in the revised final paper supplement) where a wide G-Net layer is initialized with row-normalized Gaussian weights. The experiment shows that, after training, the shift in the weight distribution is minimal. This behavior is well known and has been highlighted in prior works, such as reference [16] (on neural tangent kernels), which demonstrates this property in wide neural networks. In Theorem 4.4, we assume that a trained G-Net layer follows the normalized Gaussian model of equation (8). If the layer weight initialization follows (8), given that wide Gaussian layers experience only minor shifts during training (and the $\ell_2$ norms of all rows concentrate around a similar value), the trained weights are still expected to remain close to (8), supporting the applicability of the Theorem 4.4 assumption. Moreover, the Gaussian initialization makes the weights initially “spread out”, and for sufficiently wide networks, they tend to remain so throughout training. This “spread-out” property is key to Corollary 5.1, which leverages Grothendieck’s identity to show that it approximately holds for Rademacher vectors when the inner product factors are well-dispersed. This connection enables the extension of Gaussian-based embedding results to Rademacher embeddings. Informally, the importance of the “spread-out” condition lies in its role in enabling central limit theorem behavior, allowing Rademacher embeddings to approximate Gaussian distributions; more precisely, it facilitates the application of a multivariate Berry–Esseen-type theorem (see Supplementary Material Section S8, Theorem S2).


Network Consistency Analysis (W):

We appreciate the note. For TASU layers, the near-isometry property in equation (7) can still be established using a similar strategy, thanks to the symmetry of the activation function. However, for RASU layers, the proof requires some modifications due to the one-sided nature of the ReLU activation. Including these additional details would have complicated the step-by-step exposition and reduced clarity, so we felt it was more appropriate to defer them to a future contribution.


The $\sqrt{n/p}$ Discrepancy (W):

As demonstrated numerically in the experiments, the performance of the Gaussian and Rademacher embeddings is very close and sometimes indistinguishable (e.g., see panel (d) of Figure 3). As stated shortly after Proposition 5.2 (lines 323 and 324), taking into account that $\|y\|_2$ scales with $\sqrt{n}$, the relative discrepancy is of order $O(\sqrt{1/p})$, diminishing with increasing $p$, which makes it less concerning.
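Spelled out under that scaling assumption ($\|y\|_2 \asymp \sqrt{n}$), the relative discrepancy behaves as
$$\frac{\sqrt{n/p}}{\|y\|_2} \;\asymp\; \frac{\sqrt{n/p}}{\sqrt{n}} \;=\; \sqrt{\frac{1}{p}} \;=\; O\!\left(p^{-1/2}\right),$$
which vanishes as the hyperdimension $p$ grows.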

Review
Rating: 4

This paper investigates a method for constructing highly accurate binary neural networks, motivated by hyper-dimensional computing (HDC). The key idea builds on the Grothendieck identity and random matrix theory, which imply that the composition of a linear transformation and the arcsin function can be approximately converted into a binary linear transformation by multiplying a large random Gaussian matrix to the continuous parameter and input vector. Based on this observation, the authors introduce G-Net, a (continuous) neural network with an activation involving the arcsin function. After training of G-Net, it can be converted into a binary neural network (EHD G-Net) by the above technique, enabling binary inference in an approximately lossless way. The experimental results show that the proposed method successfully constructs binary neural networks that outperform previous methods in HDC.

Strengths and Weaknesses

Strengths

  • (S1) The paper is written very clearly and is easy to follow.
  • (S2) The paper demonstrates an excellent application of the Grothendieck identity and random matrix theory for binary neural networks. Although the continuous counterpart, G-Net, has a seemingly weird activation involving arcsin, it is easily learnable via backpropagation and SGD.
  • (S3) While the main motivation in this paper is HDC, the proposed method may have broader applications as a novel technique of lossless binarization for neural networks.

Weaknesses

  • (W1) Even though the method is motivated by the potential application to efficient computing with HDC, the paper lacks an analysis of actual or idealized computational costs.
    • Is the proposed approach truly efficient? For instance, one could trivially binarize a given DNN by viewing floating-point weights as bit sequences. How does the proposed method compare in terms of the number of parameters or computational costs against such naive binarization?
  • (W2) According to Definition 3.2, in each layer the input vector should be binarized by multiplication with random Gaussians, just like the parameter matrix. However, in Figure 2 (b), it appears that such binarization (called sign embedding or input embedding) is applied only in the first layer.
    • I'm confused by this gap. If we employed the sign embedding only in the first layer following Figure 2 (b), the approximate consistency would be broken. Indeed, the proof of Theorem 4.3 recursively applies Theorem 4.1, and thus the sign embedding seems to be required in each intermediate layer.
    • On the other hand, if we employ Definition 3.2 as is, I'm concerned that the sign embedding in intermediate layers may be incompatible with the HDC computation because it involves continuous operations in each layer.
  • (W3) The paper does not cite a lot of prior work on neural network binarization/quantization. Even if the main motivation is HDC, there is a large body of such prior work with a similar purpose or related methods to this paper, and it should be appropriately discussed as related work.

Questions

See weaknesses.

Limitations

Yes.

Final Justification

I would like to keep my score since my major concern about computational efficiency has not been addressed, especially:

the paper lacks an analysis of actual or idealized computational costs.

Formatting Issues

N/A

Author Response

We sincerely thank the reviewer for their valuable and insightful feedback. We truly appreciate the recognition of our contributions and the encouraging remarks regarding the broader potential of our method for lossless binarization in neural networks.


Computational Efficiency (W1):

Our network architecture has three key properties of HDC models: (1) there are no significant bits, so the architecture is robust to model noise and to computing environments where random bit flips may occur; (2) the architecture can be implemented using only simple operations, binary XOR and popcount; and (3) the models have inherent randomness and can be sampled from a distribution of models, which provides robustness to adversarial constructions. Note that we demonstrate (1) empirically in Supplementary Material Figures S4 and S5. In some cases, a trivial binarization given by treating floating-point weights as bit sequences may use fewer bits, but it would lack these properties. Our work introduces a new, theoretically justified random binary neural network architecture. An interesting direction for future work, which would make the method accessible beyond HDC applications, is compressing the embedded network architecture, which our G-Net analysis shows performs at the same level as standard DNNs, with theoretical guarantees.
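As a minimal illustration of property (2) (a sketch of the standard trick, not the authors' implementation): for $\pm 1$ vectors packed into bits, an inner product reduces to one XOR plus a popcount, since $\langle v_1, v_2 \rangle = N - 2\,d_H(v_1, v_2)$.

```python
import numpy as np

def to_bits(v):
    """Pack a +/-1 vector into bits: +1 -> 0, -1 -> 1."""
    return np.packbits(np.asarray(v) < 0)

def binary_inner_product(v1, v2):
    """<v1, v2> for +/-1 vectors via XOR + popcount:
    <v1, v2> = N - 2 * (number of disagreeing positions)."""
    n = len(v1)
    disagreements = int(np.unpackbits(np.bitwise_xor(to_bits(v1), to_bits(v2)))[:n].sum())
    return n - 2 * disagreements

# quick check against the floating-point inner product
rng = np.random.default_rng(0)
a, b = rng.choice([-1, 1], size=256), rng.choice([-1, 1], size=256)
assert binary_inner_product(a, b) == int(a @ b)
```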


Figure 2 Note on Layer-wise Embedding (W2):

We thank the reviewer for pointing this out. There was a typo in Figure 2: the text “normalization” in the second and third blue blocks of panel (b) should be replaced with “sign embedding,” and this will be corrected in the final version. Regarding your comment, when using a Gaussian embedding, note that all inputs to interior layers in the EHD G-Net are either binary or integer, so the operations reduce to simple additions and subtractions; no multiplications are needed for the inner products. These additions and subtractions can be performed at low precision, as a sign function ultimately follows the inner product. In fact, a Rademacher vector can be seen as an extremely low-precision version of a Gaussian vector, and both theoretically and empirically their performance is shown to be comparable. In practice, we recommend the Rademacher embedding for its fully binary operations and almost identical performance to the Gaussian embedding (please refer to panel (d) of Figure 3 as an example).
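A small numerical sketch of this comparison (my own illustration, with arbitrary dimensions): for a generic, well-spread pair of unit vectors, the Rademacher embedding tracks the Gaussian one closely.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 64, 200_000
u = rng.standard_normal(d); u /= np.linalg.norm(u)
v = rng.standard_normal(d); v /= np.linalg.norm(v)

target = (2.0 / np.pi) * np.arcsin(u @ v)        # value predicted by Grothendieck's identity
for name, G in [("Gaussian",   rng.standard_normal((N, d))),
                ("Rademacher", rng.choice([-1.0, 1.0], size=(N, d)))]:
    corr = np.mean(np.sign(G @ u) * np.sign(G @ v))
    print(f"{name:10s}: empirical = {corr:.4f}, target = {target:.4f}")
```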


Prior Work on Binarization/Quantization (W3):

Thanks for the suggestion. In addition to the current cited works on binary neural networks (BNNs), we will perform a more comprehensive review and update the final version to include references and comparisons to state-of-the-art quantization and binarization techniques (suggested by Reviewer fAxG). The initial submission placed less emphasis on BNNs for the following reasons:

  • Our proposed technique is both inspired by and fundamentally grounded in the Hyperdimensional Computing (HDC) paradigm, where binary high-dimensional models are constructed without requiring complex training procedures. In contrast, training conventional BNNs often involves significant numerical challenges due to the discrete nature of the problem and difficulties in approximating gradients.

  • HDC models typically incorporate an element of randomness, which is also a key feature of our proposed framework. Once a G-Net is constructed, exponentially many EHD G-Nets can be derived from it through a randomized process. This level of inherent variability and randomness is generally not present in conventional BNN training approaches.

  • A central contribution of our framework lies in providing rigorous mathematical guarantees on the performance of the constructed models. To the best of our knowledge, most existing BNN training methods rely on heuristic approaches and lack such guarantees, largely due to the inherent difficulty of the problem. Our method, through a novel randomized construction, introduces G-Nets as a new class of hardware-efficient architectures that are amenable to mathematical analysis.

We greatly appreciate the reviewer’s constructive comment and agree that incorporating additional references and adding these comparisons can broaden the scope of the work and enhance its relevance to a wider audience.

Comment

Thank you for the authors' rebuttal. Unfortunately, my major concern about computational efficiency has not been addressed, especially:

the paper lacks an analysis of actual or idealized computational costs.

Since one of the motivations in this paper lies in computational efficiency, such analyses should be included for justification of the proposed method.

Also, I think the following arguments do not provide valid reasons for ignoring previous studies of BNNs in the initial submission, and some of them lack sufficient evidence.

In some cases, a trivial binarization given by considering floating-point weights as bit sequences may use fewer bits, but would lack these properties.

In contrast, training conventional BNNs often involves significant numerical challenges due to the discrete nature of the problem and difficulties in approximating gradients.

This level of inherent variability and randomness is generally not present in conventional BNN training approaches.

To the best of our knowledge, most existing BNN training methods rely on heuristic approaches and lack such guarantees, largely due to the inherent difficulty of the problem.

Comment

We thank the reviewer for the feedback and understand the concern regarding computational efficiency. To fully address this and highlight the efficiency of G-Net, we will take the following steps:

  • Using the one-page extension in the camera-ready paper, we will add a new section that mathematically compares the computational cost of a network that uses the bit sequences of the floating-point weights with that of a similar network binarized through the proposed technique. At the core of this analysis is the computational cost at each layer, which, thanks to the feed-forward architecture of the model, extends to the entire network.
  • Since a main strength of the proposed approach is its HDC nature and its robustness, we will also add a numerical experiment comparing instances of the two networks above in terms of their robustness to bit flips. Unlike a network that uses the bit sequences, EHD G-Net has no significant bits and such an experiment will highlight the extent of robustness for the proposed networks.

Regarding prior work on BNNs, the initial submission cited several key BNN references (e.g., [4, 7, 11, 23, 30, 40]) to acknowledge foundational contributions in the field. Our intention was not to overlook previous BNN studies, but rather to introduce a novel perspective grounded in the HDC framework, for which we focused on reviewing the most relevant literature. However, to provide a better context for the paper, in the first round of comments, we fully agreed with the reviewer’s suggestion of adding a comprehensive review of existing BNN research. Furthermore, following the recommendation of another reviewer, we are including direct comparisons with previous BNN methods to contextualize the contributions of our work. Again, we thank the reviewer for the suggestions for improving the context and exposition of the work beyond the HDC literature.

Review
Rating: 4

This paper presents a new class of randomized binary neural networks, G-Nets, designed to operate in the hyperdimensional Hamming space with guaranteed accuracy. Unlike traditional quantization methods, G-Nets employ a random binary embedding that enables neural networks to be represented in binary form without a significant loss in performance. This embedding is derived from hyperdimensional computing (HDC), a paradigm inspired by brain-like computation using high-dimensional vector representations. The approach provides theoretical guarantees for performance and is shown to outperform prior HDC models, achieving near state-of-the-art accuracy with significantly lower computational resources. The authors empirically demonstrate the effectiveness of G-Nets on classification benchmarks such as CIFAR-10 and MNIST.

Strengths and Weaknesses

Strengths:

Novel Approach: The introduction of randomized binary embeddings and the theoretical framework based on HDC for constructing high-accuracy binary neural networks is highly innovative. It opens up new possibilities for low-resource deep learning.

Theoretical Rigor: The paper provides strong theoretical underpinnings, including proofs and consistency analysis, which justify the performance of the proposed method. The use of Grothendieck’s identity and concentration of measure guarantees the method's accuracy.

Empirical Results: The experimental results on MNIST, CIFAR-10, and human activity recognition clearly demonstrate the superiority of the proposed G-Nets over prior methods in HDC. The results are consistent and include appropriate error bars, making them reliable.

Reproducibility: The paper offers detailed information about experimental setups, making it easy for other researchers to reproduce the results. The code is publicly available on GitHub, which is a strong positive.

Practical Relevance: The proposed G-Nets show significant potential for edge and low-energy computing, which makes the work practically valuable for resource-constrained devices.

Weaknesses:

Limited Task Evaluation: While the empirical results are compelling, they focus primarily on image classification tasks (MNIST, CIFAR-10). The generalizability of the approach to other domains, such as NLP or time-series forecasting, remains untested.

High-dimensional Embeddings: The reliance on high-dimensional binary embeddings may become inefficient in some use cases, especially when the hardware constraints are not well-suited for high-dimensional spaces. More discussion on potential downsides of this high-dimensionality in practical applications would be beneficial.

Computational Complexity: The theoretical analysis focuses on Gaussian embeddings, but the paper does not sufficiently address the computational overhead when applying these methods in real-world scenarios with larger networks or datasets. The complexity analysis of Rademacher embeddings is promising, but more clarity is needed on the trade-offs.

Evaluation against More Baselines: The comparison to existing methods in the HDC domain is solid, but the paper could benefit from a comparison against more state-of-the-art deep learning models, including quantized neural networks and binary neural networks.

Questions

Task Generalization: Could the authors explore the application of G-Nets to other domains such as NLP or time-series forecasting?

Dimensionality vs. Efficiency: While high-dimensional embeddings contribute to the accuracy of the method, how does the model perform on hardware with limited resources, such as microcontrollers or edge devices? Could the authors explore how to balance dimensionality and hardware constraints?

Comparison with Binary Neural Networks: How do G-Nets perform compared to existing binary neural network architectures, such as Bi-Real Net or XNOR-Net, in terms of both accuracy and computational efficiency?

Limitations

Yes

Formatting Issues

No significant formatting issues were observed.

Author Response

We sincerely thank the reviewer for their thoughtful comments and valuable feedback. We are grateful for the recognition of our work as an innovative contribution with the potential to advance low-resource deep learning. Below, we respond to the specific points raised by the reviewer, using "Q" to indicate questions and "W" to denote identified weaknesses.


Task Generalization (Q & W):

Yes, our techniques are immediately applicable to other domains. In particular, since our approach is compatible with linear, convolutional, and normalization layers (when these layers are appropriately modified by composing the activation with an $\arcsin$ function and adding $\ell_2$-normalization), it can immediately be applied in cases where network architectures using these layers are effective. To provide concrete examples, we applied our method to the AG News classification dataset (as an NLP dataset) and the FaultDetection-A dataset from the UCR time-series repository (as a time-series dataset), and obtained results comparable with neural network approaches. These experiments will be included in the final version of the manuscript.


Dimensionality vs. Efficiency (Q & W):

In the Supplementary Material Figures S2 and S3, we demonstrated how model accuracy depends on the hyperdimension. If the hyperdimension is restricted by hardware constraints, then these plots can be interpreted as illustrating model accuracy for a given hardware constraint. Moreover, in the Supplementary Material, Figures S4 and S5 demonstrate the robustness of the models to bit-flips; these figures illustrate that the network architecture is robust to computing environments where random bit flips may occur.


Comparison with Binary Neural Networks (Q & W):

In the final version of the paper, we will include a comparison with several well-known Binary Neural Network (BNN) frameworks, including Bi-Real Net, XNOR-Net, ABC-Net, and BinaryConnect. This comparison was not included in the initial submission for the following reasons:

  • Our proposed technique is both inspired by and fundamentally grounded in the Hyperdimensional Computing (HDC) paradigm, where binary high-dimensional models are constructed without requiring complex training procedures. In contrast, training conventional BNNs often involves significant numerical challenges due to the discrete nature of the problem and difficulties in approximating gradients.

  • HDC models typically incorporate an element of randomness, which is also a key feature of our proposed framework. Once a G-Net is constructed, exponentially many EHD G-Nets can be derived from it through a randomized process. This level of inherent variability and randomness is generally not present in conventional BNN training approaches.

  • A central contribution of our framework lies in providing rigorous mathematical guarantees on the performance of the constructed models. To the best of our knowledge, most existing BNN training methods rely on heuristic approaches and lack such guarantees, largely due to the inherent difficulty of the problem. Our method, through a novel randomized construction, introduces G-Nets as a new class of hardware-efficient architectures that are amenable to mathematical analysis.


Computational Complexity (W):

Because of the high-dimensional nature of HDC models and their emphasis on simplicity and interpretability, they are often evaluated on relatively smaller datasets such as MNIST, UCI HAR, and European Languages. In our experiments, we aimed to explore a more diverse set of datasets, including more challenging (in terms of achievable accuracy) and less commonly used benchmarks in the HDC literature, such as CIFAR-10. The use of very large datasets, such as ImageNet, is uncommon in HDC research due to the substantial memory and computational demands associated with high-dimensional vector encoding and classification. While we believe that the G-Net embedding theory and techniques have the potential to be applied to other domains and large-scale problems, this work represents an initial step. There remain several opportunities for improvement to enhance the scalability of the proposed framework for more complex machine learning tasks, beyond HDC.

Final Decision

This paper introduces G-Net, a novel method for constructing high-accuracy random binary neural networks. Three reviewers responded very positively, while one raised concerns about its potential use cases. The AC acknowledged these limitations but found the work interesting, ultimately recommending acceptance at NeurIPS.