5.8

/10

Poster4 位审稿人

最低5最高6标准差0.4

4.0

置信度

正确性3.0

贡献度2.5

表达2.8

NeurIPS 2024

PrivCirNet: Efficient Private Inference via Block Circulant Transformation

Tianshi Xu,Lemeng Wu,Runsheng Wang,Meng Li

OpenReview PDF

提交: 2024-05-04更新: 2024-11-06

摘要

Homomorphic encryption (HE)-based deep neural network (DNN) inference protects data and model privacy but suffers from significant computation overhead. We observe transforming the DNN weights into circulant matrices converts general matrix-vector multiplications into HE-friendly 1-dimensional convolutions, drastically reducing the HE computation cost. Hence, in this paper, we propose PrivCirNet, a protocol/network co-optimization framework based on block circulant transformation. At the protocol level, PrivCirNet customizes the HE encoding algorithm that is fully compatible with the block circulant transformation and reduces the computation latency in proportion to the block size. At the network level, we propose a latency-aware formulation to search for the layer-wise block size assignment based on second-order information. PrivCirNet also leverages layer fusion to further reduce the inference cost. We compare PrivCirNet with the state-of-the-art HE-based framework Bolt (IEEE S&P 2024) and HE-friendly pruning method SpENCNN (ICML 2023). For ResNet-18 and Vision Transformer (ViT) on Tiny ImageNet, PrivCirNet reduces latency by $5.0\times$ and $1.3\times$ with iso-accuracy over Bolt, respectively, and improves accuracy by $4.1$% and $12$% over SpENCNN, respectively. For MobileNetV2 on ImageNet, PrivCirNet achieves $1.7\times$ lower latency and $4.2$% better accuracy over Bolt and SpENCNN, respectively. Our code and checkpoints are available on Git Hub.

关键词

Privacy-preserving deep learningHomomorphic encryptionCirculant matrices

评审与讨论

审稿意见

评分: 5置信度: 42024-06-18

Proposing the use of cyclic blocks for matrix multiplication to accelerate privacy-preserving inference in neural networks.

优点

PrivCirNet features a novel encoding algorithm optimized for block circulant weight matrices, dubbed CirEncode, that reduces the HE computation in proportion to block size. PrivCirNet also co-design a latency-aware optimization formulation for layer-wise block size assignment based on second-order information.

缺点

The model proposed by the authors is interactive privacy-preserving neural network reasoning, which only provides comparisons of reasoning latency with comparison objects and lacks comparative experiments with communication traffic.

问题

model parameters tend to be fractional, why choose bfv over the more fractional friendly ckks?
why is the table layout inconsistent? For example, table 1 and table 2.
again in table 4, why the accuracy of MobileNetV2 is higher after 50% compression of the proposed method than uncompressed? This is inconsistent with the intuition of information loss after compression.
the polynomial modulus used in Figure 4 is x^16-1, but the usual BFV encryption has a polynomial modulus of x^N+1. isn't the article using the original BFV? From the appendix and the supplementary material submitted, I can find no correlation.

局限性

Better to solve the limitions.

作者回复

2024-08-06

We appreciate Reviewer mVR6's feedback and we will address each of the questions as follows.

[To weakness, lack of communication cost comparison]

First, we do have a communication comparison for linear layers in Table 3 (see "# Ciphertexts"). CirEncode minimizes the communication of HE ciphertexts. Through the analytical comparison, we make it clear that our method shares the same communication complexity as Neujeans (CCS'24) [r7] for CNNs and Bolt (S&P'24) [r27] for Transformers.

Second, PrivCirNet focuses on optimizing HE computation complexity and does not specifically optimize the communication cost, as demonstrated clearly in the experimental results.

The detailed communication cost of PrivCirNet is as follows and will be included in the appendix in our final version. The communication improvement comes from the layer fusion proposed in Section 3.4.

Model	Dataset	Linear layers' communication (MB)	Nonlinear layers' communication (MB)
		PrivCirNet / Bolt (S&P'24) / Neujeans (CCS'24)	The same for each baseline we implement
MobileNetV2	ImageNet	505 / 635 / 526	100
ResNet-18	Tiny ImageNet	114 / 246 / 121	50
ViT	Tiny ImageNet	710 / 710 / 699	3159

[To Q1, why use BFV instead of CKKS?]

It is common for neural networks to run on integers through quantization, such as the fixed point quantization adopted by CryptoNets (ICML'16) [r8]. In our paper, we follow the hybrid HE+MPC strategy as clearly explained in the introduction. BFV is the mainstream scheme in hybrid HE+MPC frameworks and all baseline frameworks compared in our paper use the BFV scheme, e.g., CryptoNets (ICML'16) [r8], Gazelle (USENIX security'18) [r9], Delphi (USENIX security'20) [r20], CrypTFlow2 (CCS'20) [r6], Cheetah (USENIX security'22) [r10], Iron (NeurIPS'22) [r24], Neujeans (CCS'24) [r27], Bolt (S&P'24) [r7], etc.

Besides, applying CKKS in a hybrid HE+MPC framework is vulnerable to attacks (EuroCrypt'21) [a1]. When transforming from CKKS to secret sharing, its low-end noise may leak information, and there is currently no clear method to solve this problem [a1]. On the contrary, BFV can be safely transformed into secret sharing through noise flooding [a2].

[To Q2, why is the table layout inconsistent, i.e., table 1 and table 2?]

We use a text wrapping format in Table 1 to save space, which is commonly used in papers published in NeurIPS, e.g., "Privacy Auditing with One (1) Training Run" (NeurIPS 2023 Outstanding Paper). We will consider using a more beautiful layout in the final version.

[To Q3, why the accuracy of MobileNetV2 is higher after 50% compression?]

Neural networks tend to be over-parameterized and network compression can serve as a regularization method that improves network performance. This phenomenon has been widely observed in works concerning network compression, such as quantization [a3-a6], pruning [a7-a10], and circulant transformation [r45].

[To Q4, why CirEncode mod $x^n-1$ ?]

CirEncode conducts (inverse) discrete Fourier transform (IDFT/DFT) (mod $x^n-1$ ) in plaintext which is independent of BFV's operations in the ciphertext. $\mod x^n+1$ is a requirement for BFV's ciphertext polynomial multiplication. Our DFT is done in plaintext before encryption and does not violate BFV requirements. This approach is also adopted in Neujeans (CCS'24) [r27].

Here's an example to illustrate this process: Assume $W=\begin{bmatrix} 1 & 2 \\\\ 2 & 1 \end{bmatrix}\in \mathbb{Z}^{2\times 2}$ is a circulant matrix, input matrix $X=[3,4]^T\in \mathbb{Z}^{2\times 1}$ . We first encode these matrices into vectors: $\hat{w}=[1,2], \hat{x}=[3,4]$ . Next, we conduct DFT to obtain $\text{DFT}(\hat{w})=[3,114688], \text{DFT}(\hat{x})=[7,114688]$ . It's important to emphasize that $\text{DFT}(\hat{w}),\text{DFT}(\hat{x})$ are still plaintext vectors/polynomials. Then we apply BFV SIMD encoding to get the element-wise multiplication result: $\text{DFT}(\hat{w})\odot \text{DFT}(\hat{x})$ , where $\odot$ represents element-wise multiplication. That is, we get $\text{Encrypt}([3,114688]\odot[7,114688])$ .

The correctness is verified by $\text{IDFT}([3,114688] \odot [7,114688])=[11,10]=W\cdot X$ . Here, the modulus of DFT is $114689$ , which needs to be consistent with the plaintext modulus in BFV HE.

We will add this discussion to the appendix in our final version.

[Concern Regarding Reviewer mVR6 Conduct]

Reviewer mVR6 did not raise any issues or weaknesses related to the substantive content of our paper, which made it very hard to support the strong reject decision. We request the reviewer to re-evaluate our paper based on our rebuttal materials.

[r6,r7,r8,r9,r10,r20,r24,r27,r45] represent the 6th, 7th, 8th, 9th, 10th, 20th, 24th, 27th, and 45th references cited in our original submission.

[a1] Li, Baiyu, and Daniele Micciancio. "On the security of homomorphic encryption on approximate numbers." EUROCRYPT, 2021.

[a2] Gentry, Craig. A fully homomorphic encryption scheme. Stanford university, 2009.

[a3] Esser, et al. "Learned step size quantization." ICLR 2020.

[a4] Li, Yanjing, et al. "Q-vit: Accurate and fully quantized low-bit vision transformer." NIPS 2022.

[a5] Yamamoto, Kohei. "Learnable companding quantization for accurate low-bit neural networks." CVPR 2021.

[a6] Li, Yuhang, et al. "Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks." ICML 2020.

[a7] Han, Song, et al. "Learning both weights and connections for efficient neural network." NIPS 2015.

[a8] Zhang, Yuxin, et al. "Bi-directional masks for efficient n: M sparse training." ICML 2023.

[a9] Zhou, Aojun, et al. "Learning n: m fine-grained structured sparse neural networks from scratch." ICLR 2021.

[a10] Zhang, Yuxin, et al. "Learning best combination for efficient n: M sparsity." NIPS 2022.

评论- To authors

2024-08-11

Thank you for your explanation. You have solved some concerns about this paper. I have change my score. I hope my suggestions can help readers understand your paper more clearly.

Since your work is closely related to plaintext encoding, you should clearly explain the plaintext encoding in the BFV scheme to help readers better understand the entire scheme.
Since your scheme is a hybrid of HE and MPC, it would be better if you could provide a clearer explanation of the conversion between BFV and MPC.

Thanks.

评论- Follow-up on Rebuttal for paper 1826 - Request for Response

2024-08-10

Dear Reviewer mVR6,

We are writing to follow up on the rebuttal submission for our paper. We have carefully addressed all the concerns you raised in your initial review, providing additional data and clarifications to support our responses.

Given the significant impact of your review on the final decision, we believe our rebuttal must be fully considered. We kindly request your prompt feedback on our responses, as your input is essential to ensuring a fair and balanced evaluation of our work.

Please let us know if there are any additional points you would like us to address or if further clarification is required.

Thank you for your attention to this matter.

评论- Thanks for your helpful comment. Response to reviewer mVR6

2024-08-11

We appreciate your advice on improving the clarity of our paper. In the final version, we will explain in the appendix why our encoding scheme uses $\mod x^n-1$ . As for the conversion protocol, we initially omitted the details since HE+MPC is a mainstream technique in hybrid HE-based frameworks, and nearly all related works utilize the same conversion protocol. However, we will now add an explanation of the hybrid HE+MPC scheme to the appendix. Thank you again for your attention and valuable suggestions.

审稿意见

评分: 6置信度: 42024-07-13

This paper proposes an optimization for linear layers of secure inference systems. The authors propose a linear layer evaluation protocol that can efficiently compute linear layers when the model weights are circulant (diagonally repeating). They also propose optimizations to determine the optimal block size to achieve a balanced tradeoff between accuracy and latency. The authors demonstrated the linear layer efficiency with extensive evaluations.

优点

Improving the efficiency of secure inference systems is an important problem.
The idea of proposing HE-friendly linear layer evaluation for circulant weights is interesting.
The proposed method is sound with decent performance.

缺点

Some of the statements are misleading and the presentation can be further improved to avoid misunderstandings.
The details of how nonlinear layers are evaluated are missing.

问题

Overall I think this paper is interesting. I have some concerns about some statements and comparisons with the prior work BOLT (S&P'24).

In the motivation section, you mentioned that the linear layers are the bottleneck of BOLT. But from BOLT's results, the nonlinear layers seem to occupy more than 70% of the runtime in the WAN setting. I think it should be made clear that the linear layers become the bottleneck when you consider very fast network settings.
It should be made clear that the proposed method is for circulant weight matrix only instead of general matrix multiplications, the notation of GEMM is very misleading, and I think it should be replaced.
BOLT is specifically designed and optimized for transformers, I'm wondering what adaptations have you made to BOLT for the models considered in the paper? It might be not fair to compare your protocol to BOLT as it's not specifically designed for CNNs.
In BOLT, you can also make the weight matrix circulant. In this case, the multiplications between weights and ciphertext inputs should be very fast as you just need to multiply a scalar to the ciphertext. Can the authors compare this simple adaptation? Also, it should be made clear when comparing with prior methods that the prior methods are for general matrix multiplications, but your proposal is not.
I'm wondering how the nonlinear layers are evaluated. Which crypto primitive are you using for that? Does your code include the MPC part?

局限性

See my concerns in the Question section.

作者回复

2024-08-06

Thank you very much for your support and professional, detailed feedback! We are also grateful to Reviewer wNvj for evaluating our work as "an interesting paper". See below for the answers to your questions and comments.

[To Weakness1 and Question2, misleading notation of GEMM]

We highly appreciate the helpful advice to avoid misleading readers. In our paper, we use "GEMM" to denote "circulant matrix multiplication" for PrivCirNet, which may mislead readers. We will update the statements to distinguish between "GEMM" and "circulant matrix multiplication" to mitigate any potential misunderstanding. Thanks again for your valuable suggestion!

[To Weakness2 and Question5, the details of nonlinear layer evaluation]

For the details of the nonlinear layer evaluation, see the first part of the global response.

As for our C++ implementation of private inference, we are organizing the code and will release it upon acceptance.

[To Question1, the bottleneck of Bolt]

Thanks for your professional advice! In Figure 1, we show that linear layers account for more than 75% of the total latency. We will add an explanation in our updated manuscript that the experiment was conducted at a bandwidth of 3Gbps.

Additionally, Bolt uses IKNP OT [a1] for nonlinear layers, which can be improved by using VOLE OT [a2] as implemented in OpenCheetah [r10], offering an order of magnitude smaller communication overhead. Then even with lower bandwidth, linear layers remain a significant bottleneck. Related content is also discussed in Bumblebee (NDSS'25) [r22].

[To Question3, how to apply Bolt to CNNs?]

Bolt's protocol is designed for matrix multiplication rather than convolution. To evaluate Bolt on CNNs, we transform convolution into matrix multiplication by $\operatorname{img2col}$ algorithm. For example, given a weight kernel $W\in \mathbb{Z}^{K\times C\times R\times R}$ and an input matrix $X\in \mathbb{Z}^{C\times H\times W}$ , $\operatorname{img2col}$ transforms $W,X$ to $W'\in \mathbb{Z}^{K\times (C\times R\times R)}, X'\in \mathbb{Z}^{(C\times R\times R)\times (H'\times W')}$ , where $H',W'$ are the output resolutions. We then use Bolt's matrix multiplication protocol to evaluate the convolution.

However, this transformation increases the dimension and final latency. Hence, for CNNs with $3\times 3$ kernels like ResNet-18, we focus on the comparison with Cheetah (USENIX Security'22) [r10] and Neujeans (CCS'24) [r27] which are SOTAs for CNNs. As shown in Figure 9 of our paper, PrivCirNet still achieves $1.4\sim 2.2\times$ latency reduction with iso-accuracy.

We will add this illustration to our final version.

[To Question4, simply extending Bolt to circulant matrix multiplication]

In PrivCirNet, we have tried to adapt the encoding method proposed in Bolt to block circulant matrix multiplication. As shown in Figure 2 of our paper, SIMD encoding like Bolt cannot effectively utilize circulant matrices. For example, assume $W=\begin{bmatrix} 1 & 2 \\\\ 2 & 1 \end{bmatrix}\in \mathbb{Z}^{2\times 2}$ is a circulant matrix, and the input matrix is $X=[3,4]^T\in \mathbb{Z}^{2\times 1}$ . Although we only need to multiply $[3,4]$ with scalar $1$ , in SIMD encoding we still need to conduct element-wise multiplication $[3,4]\odot[1,1]$ , leading to no benefits compared to GEMM.

On the other hand, coefficient encoding can utilize circulant matrix multiplication but introduce extra repeating elements when handling block-circulant matrix multiplication. To better explain, let's consider a toy example: $W=\begin{bmatrix} -1 & 2 &1 & 3 \\\\ 2 & -1 & 3& 1 \end{bmatrix}, X=\begin{bmatrix} 5 & 4 & -2 & 3 \end{bmatrix}^T$ , with a block size $b=2$ . We encode $W$ to $\hat{w}=3+x+(2-x)x^3$ . When encoding $X$ , we need to repeat elements in $X$ to ensure all 2 permutations of $X$ appear in $\hat{x}$ , that is, $\hat{x}=5+4x+5x^2+(-2+3x-2x^2)x^3$ . Then $\hat{y}=\hat{w}\times \hat{x}=\ldots+10x^4+3x^5+\ldots$ contains the correct result of $WX$ . The theoretical complexity of this approach is shown in the table below.

	# mul	# rot	# cts	Encoding consistency
Coefficient only	$O(d_1d_2d_3(2b-1)/(b^2n))$	0	$O(d_1d_2(2b-1)/(nb)+\lceil d_1/n \rceil d_3)$	×
CirEncode	$O(d_1d_2d_3/(bn))$	$O(\sqrt{d_1d_2d_3/(bn)})$	$O(d_1(d_2+d_3)/n)$	√

There are two drawbacks of this approach: (1) The number of ciphertexts needed to be transferred increases, resulting in significant communication costs. Additionally, the number of multiplication is higher than CirEncode. (2) The packing of input and output is inconsistent, hindering consecutive computation on the output.

Consequently, CirEncode combines the advantages of both SIMD encoding and coefficient encoding schemes and maximizes the potential of the circulant matrices. We will add this important discussion to the appendix in our final version.

Last, thanks so much for helping us improve this work through your professional perspective. If you have any additional questions or anything you would like to discuss in more detail, please feel free to let us know. If you find we have successfully addressed your worries, please kindly consider re-evaluating our work. Thanks a lot!

[r10,r22,r27] represent the 10th, 22th, and 27th references cited in our original submission.

[a1] Y. Ishai, J. Kilian, K. Nissim, and E. Petrank, "Extending Oblivious Transfers Efficiently," in CRYPTO, 2003, pp. 145–161.

[a2] Boyle, Elette, et al. "Efficient two-round OT extension and silent non-interactive secure computation." CCS 2019.

2024-08-11

Thanks for the rebuttal. I don't have further comments, and I'll keep my score.

审稿意见

评分: 6置信度: 42024-07-13

The paper introduces PrivCirNet, a framework designed to enhance the efficiency of private DNN inference using homomorphic encryption (HE) and MPC schemes. The key contributions are as follows.

By converting DNN weights into block circulant matrices, the framework transforms general matrix-vector multiplications into HE-friendly 1-dimensional convolutions, significantly reducing computational overhead.
PrivCirNet customizes the HE encoding algorithm to fully leverage the block circulant transformation, reducing computation latency in proportion to the block size.
The framework includes a latency-aware formulation to search for optimal layer-wise block size assignments using second-order information.
PrivCirNet employs layer fusion techniques to further minimize inference costs.
The framework demonstrates substantial improvements over state-of-the-art HE-based frameworks (e.g., Bolt) and HE-friendly pruning methods (e.g., SpENCNN) in terms of both latency and accuracy across various models (ResNet-18, Vision Transformer, MobileNetV2) and datasets (CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet).

优点

The use of block circulant transformation for optimizing HE-based DNN inference at the HE/MPC hybrid approach is novel and innovative. The paper includes thorough evaluations and comparisons with existing methods, providing strong evidence of the proposed framework's effectiveness. The structure and presentation of the paper are clear, making the methodology and results easy to understand. Finally, the framework addresses a critical issue in privacy-preserving machine learning, offering substantial improvements in both efficiency and accuracy.

缺点

This paper does not provide the actual implementation code needed to reproduce the simulation results. Additionally, while the paper offers an explanation of homomorphic encryption (HE), it does not clearly present the implementation details for multi-party computation (MPC). The results report overall latency without distinguishing between the HE part and the MPC part, making it difficult to discern the extent of performance improvement specifically attributed to HE.

问题

In Figure 9, the comparison of latency and accuracy is conducted against the worst-case scenarios of existing work. Is this comparison fair?

局限性

They did adequately.

作者回复

2024-08-06

Thank you very much for your support and the thorough comments, which have greatly helped us improve our work! We appreciate that the novelty, effectiveness, and importance of PrivCirNet are acknowledged. Below we list our responses to each of the comments:

[Our implementation code of private inference]

Our C++ implementation of private inference is based on OpenCheetah [r10] and the SEAL library [r52]. We are organizing the code and will release our C++ implementation upon acceptance.

[The implementation details for multi-party computation (MPC) and the performance improvement attributed to HE]

See the first part of the global response.

[To Question, Figure 9 marks the points with the highest compression]

In Figure 9, compared with HE-based frameworks, PrivCirNet achieves $1.3\sim 2.2\times$ latency reduction with iso-accuracy, which is also the accuracy of the uncompressed model.

When compared with the structured pruning method SPENCNN, we marked the points with the highest compression ratios because compression methods are more important for more compact models, and thus the accuracy improves the most at the point of minimum latency. We will improve the description in the updated version. For example, "Compared with SPENCNN in ViT and Tiny ImageNet, PrivCirNet achieves 1.3%, 2.8%, and 13% higher accuracy under iso-latency 370s, 310s, and 280s, respectively."

Thanks again for the valuable comments which make our experiments more solid and robust. We will add all of these discussions to our final version. We hope you could kindly consider a re-evaluation if you find our responses effective. Thank you!

[r10, r52] represent the 10th and 52th references cited in our original submission.

审稿意见

评分: 6置信度: 42024-07-18

The authors have proposed a co-design method for making the Homomorphic encryption (HE) evaluation faster in private inference. Specifically, they used a new HE encoding mechanism to leverage the properties of the black-circulant matrix to reduce the HE rotations and ciphertext-plaintext multiplications. The experiments are performed on various CNN models and ViT models with multiple datasets.

优点

Authors have tailored the SOTA HE algorithm, BSGS implemented in BOLT, to leverage the block-circulant matrix properties to further speed up the HE evaluation in private inference. They have implemented various crypto algorithms, which require a reasonable effort in C++.
Thorogh experimentation, including ViT models on ImageNet, TinyImagNet, and CIFAR-10/100 datasets.
The paper is well-structured and thoroughly written, making it easy to follow even for those who are not experts in this field.

缺点

$\bullet$ Limited Novelty The idea of employing black-circulant matrices for faster convolution, both in plaintext [1] and ciphertext [2], is not new. The authors have just tailored them for the BGSG algorithm, by making the block-size variable.

Moreover, the efficiency of the proposed encoding method, CirEncode, has been validated with a relatively smaller hidden size. For example, ViT with hidden dimensions 192 and 256. In contrast, the BERT-based model, used in BOLT, has a hidden dimension of 768. The efficacy of the encoding mechanism should be validated against 768 and/or 1024 hidden dimensions.

$\bullet$ Compariosn against HybReNets Authors have shown comparison with ResNet18 baseline in Figure 9. However, recently, there have been more efficient baselines for private inference have been devised such as HybReNets (e.g., HRN-5x5x3x and HRN-2x5x3x) [3]. These privacy-friendly baseline networks have a relatively higher number of channels in deeper layers, thus, the efficacy of the encoding mechanism should be validated with the networks with a higher number of channels.

$\bullet$ Writing The year is missing in citations 27 and 38.

[1] Ding et al., CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices, MICRO 2017.

[2] Lou et al., Falcon: Fast Spectral Inference on Encrypted Data, NeurIPS 2020.

[3] Jha et al., DeepReShape: Redesigning Neural Networks for Efficient Private Inference, TMLR 2024.

问题

For ViT experiments on CIFAR, did you use pre-trained ImageNet mode?
Why MobileNetV2 is used for benchmarking? If FLOPs efficiency is the reason, it should also be evaluated on RegNet-x [4] and ConvNeXt [5] models, and if required, by substituting complex non-linearities with privacy-friendly counterparts (LN with BN and GELU with ReLU).
Which Crypto-primitive is used for evaluating the nonlinear computations?
What is the overhead of finding the optimal block size using second-order information? That is, how much time is required to find the optimal block size for each layer? How does this timing overhead increase with the network size, particularly in wider networks (e.g., in HybReNets [3])

[4] Radosavovic et al., Designing Network Design Spaces, CVPR 2020.

[5] Woo et al., ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders, CVPR 2023.

I'm open to improving the score if the key concerns are addressed in the rebuttal

局限性

Addressed in the paper.

作者回复

2024-08-06

We appreciate Reviewer eRSG's constructive feedback, which has helped us improve our work! We have conducted all the additional experiments mentioned in the review and will be added to our final version. The results are included in the PDF attached to the global response, including comparisons with Bolt (S&P'24) under larger hidden dimensions (Table 1), the combination of PrivCirNet and DeepReshape (Tables 2 and 3), and results on RegNet and ConvNeXt (Table 4). Below, we address each of your concerns.

[To W1, concerns on novelty]

The contribution of PrivCirNet is not in proposing the circulant matrix, but in its application to SOTA networks to achieve better latency-accuracy trade-off in private inference.

While [r44,r45,r25] use block circulant matrices in plaintext and ciphertext, there remain two unresolved problems in both domains: (1) how to initialize circulant matrices, and (2) how to determine block sizes for each layer. As a result, it is hard for [r44, r45, r25] to apply block circulant matrices to more efficient networks, e.g., MobileNetV2, Transformers, etc. PrivCirNet addresses these problems with new algorithms, achieving a superior accuracy-latency trade-off.

In the ciphertext domain, [r25] cannot fully leverage block circulant matrices, resulting in limited or even increased latency. Compared with [r25], PrivCirNet introduces several core innovations:

Adopts a hybrid HE+MPC framework, eliminating the computationally expensive discrete Fourier transform (DFT) on the ciphertext.
Maximizes the potential of block circulant matrices by customizing the HE encoding algorithm, reducing computation latency proportional to the block size.

A comprehensive comparison between PrivCirNet and [r44,r45,r25] is summarized as follows:

Method	Application	Initialization method	Variable block size	Block size assignment	Customized Encoding Method	Network
CirCNN, CircConv [r44,r45]	Accelerate convolution in plaintext	Forbenius norm	√	Uniform/Manually set	/	ConvNets
Falcon [r25]	Accelerate end-to-end HE-based private inference	Forbenius norm	×	Only uniform	×	Three-layer network
PrivCirNet (ours)	Accelerate hybrid HE+MPC private inference	Loss-aware	√	Latency-aware block size assignment	√	ConvNets, Transformers

[To W1, comparison under a large hidden dimension]

CirEncode is effective for any layer dimensions as shown in Table 3 of our paper. As shown in Table 1 in the PDF attached to the global response, comparing with Bolt [r7] using BERT-base, PrivCirNet achieves $2.7\sim 7.5\times$ latency reduction under large hidden dimensions, consistent with our theoretical analysis and experimental results.

[To W2, comparison with DeepReshape (TMLR'24, April 2024) ]

Thank you for pointing out the work DeepReshape [a1], which helps to consolidate our work. DeepReshape optimizes both ReLUs and FLOPs by designing a series of more FLOPs-efficient networks, dubbed HybReNets while pruning the ReLU layers. DeepReshape achieves a better latency-ReLU trade-off compared with SENet [r32], SNL [r30], etc. We believe DeepReshape and PrivCirNet are orthogonal and can be applied together to further reduce the inference latency.

In Table 2 and Table 3 in the PDF attached to the global response, we show the application of PrivCirNet to HybReNets [a1] on CIFAR-100, which also yields promising results. We apply the ReLU pruning method proposed in DeepReshape to further reduce the latency of nonlinear layers.

From the results, we can see that PrivCirNet is effective when combined with DeepReshape, achieving significant latency reduction in both linear and nonlinear layers. We will include these experiments and cite DeepReshape in our final version. Thank you again for your valuable advice!

[To W3, missing year in citations 27 and 38]

Thank you for pointing out the typo! We will correct this in our final version.

[To Q1, ViT training]

We do not pretrain the model on ImageNet. We first train our ViT model from scratch on CIFAR following [r58], then apply circulant transformation and finetune the model. We will add this detail to our final version.

[To Q2, why choose MobileNetV2?]

MobileNetV2 is a classic efficient network. The inverted residual block and its variants are widely used in efficient network design, including EfficientNet [a4], ConvNeXt [a3], etc. PrivCirNet can be applied to any convolutions or fully connected layers. In Table 4 in the PDF attached to the global response, we benchmark PrivCirNet on RegNet [a2] and ConvNeXt [a3]. The results are consistent with MobileNetV2.

[To Q3, Crypto-primitive used for nonlinear layers]

See the first part of the global response.

[To Q4, the overhead of finding optimal block size]

For CIFAR and Tiny ImageNet, our block size assignment algorithm takes less than 20 seconds on a single RTX4090. For ImageNet, our algorithm randomly samples 100,000 images in the training dataset and takes less than 5 minutes on a single RTX4090. The time cost comes from running a backpropagation. We will add this description to our final version.

We sincerely appreciate the valuable, detailed advice from reviewer eRSG. We take all your suggestions into account in our final version. If you feel that the above improvements help clear up your doubts and further improve our work, you can kindly consider a re-evaluation, thank you!

[r7,r25,r30,r32,r44,r45,r58] represent the 7th, 25th, 30th, 32th, 44th, 45th, 58th references cited in our original submission.

[a1] Jha et al., DeepReShape: Redesigning Neural Networks for Efficient Private Inference, TMLR 2024.

[a2] Radosavovic et al., Designing Network Design Spaces, CVPR 2020.

[a3] Woo et al., ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders, CVPR 2023.

[a4] Tan et al., Efficientnet: Rethinking model scaling for convolutional neural networks, ICML 2019.

评论- Reply on the Authors' Rebuttal

2024-08-10

Thank you for providing a thorough rebuttal to each of the queries raised in the review. I appreciate the Authors' effort in conducting the additional experiments (particularly, Table 3 in the attached pdf) in a limited time.

Authors have addressed the novelty concern of employing a block circulant matrix for PI. I would suggest they include the Table, where the usage of the block circulant matrix is compared with the prior work (in rebuttal), in the paper's main text.

Further, I encourage the Authors to repeat the experiments on RegNet and ConvNeXt (Table 4, in the pdf) on at least one more dataset, if possible on CIFAR-100 (accuracy on CIFAR-10 does not always tell the story), in the final version.

There is one clarification needed to understand the results in Table 1:

What are the matrix size triplets (e.g., (512, 768, 3072)) in Table 1, in the pdf? Is there any particular reason for selecting the order of the triplet dimensions? The FFN of transformer-based models have bottleneck architecture, for example, (768, 4*768, 768)

评论- Thank you very much for your kind and valuable comments. Response to reviewer eRSG.

2024-08-10

Thank you very much for your kind and valuable comments. We will include a table comparing the use of block circulant matrices with prior work in our main text. Additionally, we have conducted experiments on RegNet and ConvNeXt using CIFAR-100, with the results shown in the table below.

Method	Top-1 Acc.	Linear layers’ latency (s)
RegNet	$79.17$	$64.2$
+PrivCirNet(b2)	$79.68$	$30.1\ (2.1\times \downarrow)$
+PrivCirNet(b4)	$77.78$	$22.4\ (2.9\times \downarrow)$
+PrivCirNet(b8)	$72.23$	$16.7\ (3.8\times \downarrow)$
ConvNeXt	$76.17$	$243.9$
+PrivCirNet(b2)	$77.11$	$120.1\ (2.0\times \downarrow)$
+PrivCirNet(b4)	$76.58$	$65.8\ (3.7\times \downarrow)$
+PrivCirNet(b8)	$76.38$	$36.6\ (6.7\times \downarrow)$

PrivCirNet achieves a $2.1\times$ latency reduction on RegNet and a $6.7\times$ reduction on ConvNeXt, both without accuracy loss, consistent with our findings on CIFAR-10. We are currently running experiments on Tiny ImageNet and will include all these results in the appendix of our final version. We will also release the model checkpoints upon acceptance.

In Table 1 of the global rebuttal, the matrix size triplets $(d_1,d_2,d_3)$ represent matrix multiplications with dimensions $(d_1,d_2) \times (d_2,d_3)$ . These dimensions correspond to those found in the BERT-base model. Specifically, $512$ represents the sequence length, $(512,768,768)$ is the dimension of the $Q$ , $K$ , and $V$ projections, $(512,768,3072)$ corresponds to the first fully connected layer in the FFN, and $(512,3072,768)$ corresponds to the second fully connected layer in the FFN. This experiment demonstrates the effectiveness of PrivCirNet across different dimensions in the BERT-base model.

Thank you again for your valuable comments, which have strengthened and enhanced the robustness of our paper!

评论- Response to Authors

2024-08-11

Thanks for the data and clarification. Since the key concerns are addressed in the rebuttal, I have increased the score to 6.

Authors should include all the relevant discussion and data in the final version of the draft. Also, a brief discussion on the impact of selecting different baseline networks (MobileNetV2/ResNet/HybReNets/RegNets/ConvNeXt) on the accuracy degradation with various block sizes should be included in the draft. This is crucial, as evident from the data provided in the rebuttals, going from b4 to b8 results in $\sim$ 5%, $\sim$ 2%, and 0.2% accuracy degradation (CIFAR-100) when baseline networks are RegNet, HybReNets, and ConvNeXt, respectively. Nonetheless, these networks have disparate impacts on the latency of nonlinear layers, as they have different ReLU distribution and ReLU efficiency.

评论- Thank you very much for your support. Response to reviewer eRSG.

2024-08-11

Thank you very much for your support! We greatly appreciate the valuable advice and will incorporate all the relevant discussions and data into our final version. The varying accuracy degradation observed across different baseline networks (MobileNetV2, ResNet, HybReNet, RegNet, ConvNeXt) can be partly attributed to the differing proportions of parameters occupied by standard convolutional layers. For instance, in ConvNeXt, 98% of the parameters are derived from standard convolution, with less than 2% from depth-wise/group-wise convolution, providing significant compression potential using PrivCirNet. In contrast, standard convolution parameters account for only 64% and 78% of RegNet and MobileNetV2, respectively. As a result, RegNet and MobileNetV2 exhibit larger accuracy degradation at higher compression rates. Additionally, these networks demonstrate varying latency characteristics in nonlinear layers. Jointly optimizing linear and nonlinear layers is an interesting question, and we plan to explore this in our future work. We will include this discussion in our final version.

Thank you once again for your support and insightful feedback!

作者回复

2024-08-06

[1. Implementation details and latency for nonlinear layers]

We utilize HE to evaluate linear layers and MPC to evaluate nonlinear layers. For nonlinear layers, we follow Cheetah [1] to evaluate the ReLU function (Section 4 in Cheetah) and Bolt [2] to evaluate GeLU, LayerNorm, and Softmax functions (Section 5 in Bolt). Additionally, we adopt VOLE-style Oblivious Transfer (OT) [3] for all the nonlinear layers, which enables us to further reduce the communication for GeLU, LayerNorm, and Softmax compared to Bolt.

PrivCirNet focuses on optimizing the latency of linear layers which is the bottleneck, the latency shown in Figures 8 and 9 of our paper represents the latency of linear layers. In Table 5 of our paper, we show the total latency of PrivCirNet, where the bottleneck is the latency of linear layers. Below we further show the latency breakdown of MPC (for nonlinear) and HE (for linear) parts across different models under the LAN setting.

We thank the reviewers for pointing this out and will add the explanation to the experimental results and these additional results to the appendix in our final version.

Network	Dataset	Latency of Nonlinear Layers (s)	Latency of Linear Layers (Bolt) (s)	Latency of Linear Layers (Ours with block size 8) (s)
MobileNetV2	ImageNet	92.0	137.8	27.7
ResNet-18	Tiny ImageNet	12.6	156.4	11.2
ViT	Tiny ImageNet	190.6	479.6	254.8

[1] Huang, Zhicong, et al. "Cheetah: Lean and fast secure {Two-Party} deep neural network inference." USENIX Security 2022.

[2] Pang, Qi, et al. "BOLT: Privacy-Preserving, Accurate and Efficient Inference for Transformers." IEEE S&P (2024).

[3] Boyle, Elette, et al. "Efficient two-round OT extension and silent non-interactive secure computation." CCS 2019.

[2. Additional experimental results]

In response to the questions of Reviewer eRSG, we further conducted the following experiments, and the results are included in the attached PDF:

Table 1: Comparison of Bolt and PrivCirNet for matrix multiplication with larger hidden dimensions in the BERT model. This experiment demonstrates that PrivCirNet achieves $2.7\sim 7.5\times$ latency reduction compared to Bolt under larger hidden dimensions, consistent with our theoretical analysis and experimental results.
Table 2: Accuracy and latency results when applying PrivCirNet to HybReNets proposed in DeepReshape. The result demonstrates the effectiveness of applying PrivCirNet to HybReNets (DeepReshape).
Table 3: Accuracy and latency results when combining PrivCirNet and DeepReshape (ReLU reduction). This experiment demonstrates that PrivCirNet and DeepReshape can be applied together to significantly reduce the latency of both linear and nonlinear layers.
Table 4: Accuracy and latency results when applying PrivCirNet to RegNet and ConvNeXt on CIFAR-10. From the result, we can find that PrivCirNet achieves $2.0\sim 6.8\times$ latency reduction on RegNet and ConvNeXt, which is consistent with the result on MobileNetV2.

We believe these results demonstrate that PrivCirNet achieves consistent latency and accuracy benefits over different network architectures. We will also add these experimental results in our final version.

最终决定Accept (poster)

2024-09-25

The work studies reducing the computation overhead over HE-based neural network inference. It proposes PrivCirNet, a protocol/network co-optimization framework based on block circulant transformation.

The final recommendation of the work is unanimously acceptance (6, 6, 6, 5).

Reducing communication overhead in HE is an important problem. The proposed method is novel and technically sound. The paper is well written. The effectiveness of the proposed method is evaluated with comprehensive experiments.

For the final version, please incorporate the results and discussions in the rebuttal and also address the following remaining points from reviewers:

eRSG

a brief discussion on the impact of selecting different baseline networks (MobileNetV2/ResNet/HybReNets/RegNets/ConvNeXt) on the accuracy degradation with various block sizes should be included in the draft.

mVR6

Since your work is closely related to plaintext encoding, you should clearly explain the plaintext encoding in the BFV scheme to help readers better understand the entire scheme. 2. Since your scheme is a hybrid of HE and MPC, it would be better if you could provide a clearer explanation of the conversion between BFV and MPC.