Accurate Split Learning on Noisy Signals
Abstract
Reviews and Discussion
This paper discusses the use of scaling and random masking as a denoising method for noise-injected networks in the context of split learning. The authors present both theoretical and empirical evidence showing that applying these denoising techniques during the training phase improves testing accuracy. Additionally, the study demonstrates that the proposed denoising methods enhance the network’s resilience against feature space hijacking attacks.
Strengths
The authors aim to demonstrate that the proposed denoising methods (scaling and random masking) improve test accuracy through both theoretical analysis and empirical evaluation.
Weaknesses
I believe the focus of the paper is misaligned with its target. First, I think the motivation is made clear: to protect models against reconstruction attacks, noise is often added to intermediate representations (IRs), which leads to a drop in accuracy. This creates a privacy-accuracy (or robustness-accuracy) trade-off. The primary goal of the paper should be to demonstrate that the proposed methods (scaling and random masking) improve this trade-off, but unfortunately a discussion of this trade-off is completely missing.
The authors dedicate substantial space to showing that the denoising methods preserve accuracy when noise is injected. However, this seems unnecessary. It is intuitive that denoising in noise-injected networks would help maintain accuracy—a rather trivial observation. Variance scales as the square of the variable scaling, so scaling down naturally reduces variance. Similarly, adding random masking during training makes the network more robust to noise. This is intuitive and easy to understand.
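(To make the variance claim concrete, in generic notation rather than the paper's: for a noise-injected representation
$$
\tilde h = h + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I), \qquad \lambda \tilde h = \lambda h + \lambda \varepsilon \ \Rightarrow\ \operatorname{Var}(\lambda \varepsilon_i) = \lambda^2 \sigma^2 ,
$$
so scaling by $\lambda < 1$ shrinks the noise variance quadratically while also attenuating the signal to $\lambda h$, and masking simply zeroes out a fraction of the noisy coordinates during training.)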
What the authors really want to show is that the proposed methods also help improve network privacy (or at least do not degrade it), so that a better trade-off becomes possible. This is critical, but unfortunately the authors spend very little effort on it. I expected to see quantitative measurements against multiple types of attacks, but only the feature space hijacking attack is covered, while other attacks mentioned in the introduction are ignored. Additionally, quantitative results are lacking.
As such, I feel the authors fail to demonstrate a better accuracy-privacy trade-off, leaving their contribution unjustified.
Additionally, there are also minor issues in both the theoretical and empirical studies. In my understanding, the denoising methods are applied during training, so training with and without them leads to different weights. I don't think this is reflected in the theoretical analysis. In the empirical analysis, I don't think showing graphs of accuracy by training epoch is the best choice, as it takes up a lot of space. Instead, I would expect to see much richer results from different parameter settings.
Questions
- Is there any conclusion on the choice of parameters (i.e., $\lambda$, $p$)? The authors mostly use $\lambda = 0.1$ or $0.2$ and $p = 0.1$ or $0.2$. How did the authors choose this setting? Does the theoretical analysis provide guidance on how we should choose $\lambda$ and $p$?
We thank the reviewer for their review. There are some misunderstandings on the reviewer's side, and below we try to clarify them.
Comment. The authors dedicate substantial space to showing that the denoising methods preserve accuracy when noise is injected. However, this seems unnecessary. It is intuitive that denoising in noise-injected networks would help maintain accuracy—a rather trivial observation. Variance scales as the square of the variable scaling, so scaling down naturally reduces variance. Similarly, adding random masking during training makes the network more robust to noise. This is intuitive and easy to understand.
Answer. We rather argue that this is neither intuitive nor trivial. Let us only consider the masking case. For the case with a linear layer, if the injected noise level $\sigma$ and the masking ratio $p$ satisfy the condition stated in our theorem (the condition involves the masking matrix, obtained by stacking the mask vector in each row, applied via the elementwise product $\odot$), then masking helps to gain a better accuracy for the noise-injected SplitNN. Similarly, for the nonlinear case, as in Theorem 3.4, if $\sigma$ is large enough, there exists a range of masking ratios $p$ for which masking helps to gain a better accuracy for the noise-injected SplitNN. These observations were validated through the numerical results. Results from Figure 2-(d), with the nonlinear loss, reflect Theorem 3.4: $\sigma$ must be large for the improvement to be possible. If $\sigma$ is too small (the MSE curve for the smallest $\sigma$), the masking does not work; the larger the $\sigma$, the more improvement one can expect by using masking.
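As a simplified, self-contained illustration of the kind of condition involved (an identity split layer with plain, unrescaled masking; this is a toy calculation for intuition, not the exact statement of Theorem 3.1 or 3.4): let $h \in \mathbb{R}^d$ be the clean IR, let $\tilde h = h + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_d)$, and let the mask $m$ drop each coordinate independently with probability $p$. Then

$$
\mathbb{E}\,\lVert m \odot \tilde h - h \rVert^2 = (1-p)\,d\sigma^2 + p\,\lVert h \rVert^2
\qquad \text{vs.} \qquad
\mathbb{E}\,\lVert \tilde h - h \rVert^2 = d\sigma^2 ,
$$

so plain masking lowers the MSE exactly when $\lVert h \rVert^2 < d\sigma^2$, i.e., when the injected noise is large relative to the signal energy. The conditions in the paper additionally account for the weights, the nonlinearity, and the loss, which is where the analysis becomes nontrivial.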
Comment. What the authors really want to show is that the proposed methods also help improve network privacy (or at least do not degrade it), so that a better trade-off becomes possible. This is critical, but unfortunately the authors spend very little effort on it. I expected to see quantitative measurements against multiple types of attacks, but only the feature space hijacking attack is covered, while other attacks mentioned in the introduction are ignored. Additionally, quantitative results are lacking.
Answer. We respectfully disagree with the reviewer. Theorem 3.7 from Dwork et al. (2014) shows that postprocessing mechanisms preserve privacy. Based on this result, both the scaling and the masking proposed in this work maintain the privacy guarantee of the noise-injected SplitNN. We explicitly mentioned in Lines 104-109 that our work improves the accuracy and stability of noise-injected SplitNN training by proposing two post-processing techniques (i.e., scaling and masking). To the best of our knowledge, we are the first to propose denoising techniques on the Gaussian noise-injected IRs to improve the training accuracy of SplitNN. We wanted to send this message to the reader through rigorous theoretical analysis, and our denoising techniques work empirically on diverse datasets (MNIST, FMNIST, CIFAR-10, CIFAR-100, IMDB, Names) and across different network architectures (CNN, RNN, and MLP). The privacy guarantee in this setup is a by-product; it is not the paper's main goal; see Lines 76-77 in the Introduction. Nevertheless, we showed the resilience of our postprocessing techniques against the feature-space hijacking attack (FSHA) by Pasquini et al. (2021) in SplitNN and evaluated its attack performance with 2 models and 4 publicly available datasets: CNN for MNIST and Fashion-MNIST, ResNet for CIFAR-10 and ImageNet. Admittedly, we can add other attacks, but we feel it would overwhelm the readers. We sincerely hope the reviewer will judge the merit of the work based on what is provided.
Comment. As such, I feel the authors fail to demonstrate a better accuracy-privacy trade-off, leaving their contribution unjustified.
Answer. We again respectfully disagree with the reviewer. Our theorems are proxies to measure whether the postprocessing allows better accuracy. In Section 3.1, Lines 177-185 explain why "adding the Gaussian plus the postprocessing" is better than "adding the Gaussian" alone at each iteration, and we use the MSE (e.g., in Theorem 3.1 (i) for the masking case) as a proxy to measure this effect; please also see Remarks 3.2 and 3.3. Section 3.2 in our paper presents the DNN training; the intuitions are given in Lines 214-219 and 242-245. For example, Theorem 3.4 states that, under certain conditions, using random masking over a noise-injected layer incurs a lower deviation from the loss of the original SplitNN than using noise injection alone. This matters because the original SplitNN always produces the lowest loss and would therefore incur the highest accuracy.
Additionally, all our numerical experiments show that by adding the postprocessing steps (masking or scaling), the noise-injected SplitNNs achieve better accuracy (close to that of the original SplitNNs) than the noise-injected SplitNNs without postprocessing. Moreover, the postprocessing steps maintain privacy; see Figures 3-5.
Finally, we comment on the privacy aspect. Theorem 3.7 from Dwork et al. (2014) shows that any postprocessing mechanism, deterministic or random, preserves privacy. Based on this result, both the scaling and the masking proposed in this work maintain the privacy guarantee of the noise-injected SplitNN. Moreover, using the advanced composition theorem from Dwork et al. (2010) for such mechanisms, if we consider a total of $T$ training iterations, we can guarantee that the postprocessing mechanisms remain private over the entire training phase. Table 9 in the Appendix provides the privacy bounds for a single forward pass during SplitNN training, for the various noise levels used in our work, without postprocessing. These privacy guarantees remain even after applying postprocessing, as we explained above.
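For readers unfamiliar with the two results invoked above, their standard forms are (in generic DP notation, not the paper's theorem numbering): post-processing (Dwork and Roth, 2014): if $\mathcal{M}$ is $(\epsilon,\delta)$-DP and $f$ is any (possibly randomized) map, then $f \circ \mathcal{M}$ is $(\epsilon,\delta)$-DP; advanced composition (Dwork et al., 2010): $T$ adaptive runs of an $(\epsilon,\delta)$-DP mechanism satisfy
$$
\bigl(\epsilon\sqrt{2T\ln(1/\delta')} + T\epsilon(e^{\epsilon}-1),\ T\delta+\delta'\bigr)\text{-DP}
\quad \text{for any } \delta' > 0 .
$$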
Comment. Additionally, there are also minor issues in both the theoretical and empirical studies. In my understanding, the denoising methods are applied during training, so training with and without them leads to different weights. I don't think this is reflected in the theoretical analysis. In the empirical analysis, I don't think showing graphs of accuracy by training epoch is the best choice, as it takes up a lot of space. Instead, I would expect to see much richer results from different parameter settings.
Answer. Currently, in our theory, changes in the weights do not affect the result of our analysis of the denoising performance and the privacy guarantee. We have included more results with different settings in the Appendix. For example, in Table 8, we show our hyper-parameter tuning process at different noise levels when searching for the best masking ratio and scaling ratio. In Table 6, we show split learning with two Adam optimizers used separately for the client and the server. In Table 4, we examine whether decreasing the learning rate, adjusting the weight decay, or adding Dropout provides similar denoising improvements in noise-injected SplitNN training.
Question 1. Is there any conclusion on the choice of parameters (i.e., $\lambda$, $p$)? The authors mostly use $\lambda = 0.1$ or $0.2$ and $p = 0.1$ or $0.2$. How did the authors choose this setting? Does the theoretical analysis provide guidance on how we should choose $\lambda$ and $p$?
Answer. In Table 8, we show the hyper-parameter tuning process over different scaling factors and masking ratios in four different deep learning tasks. We found that such a tuning process is no more complicated than tuning other hyper-parameters such as the learning rate. However, in practice, when the model size or the dataset is very large, grid search may not be practical. Fortunately, with the recently proposed Zero-Shot Hyperparameter Transfer technique [1], one can tune the hyperparameters on a much smaller model and then use them directly on large models and datasets. This technique has been adopted by Microsoft and OpenAI to effectively train very large DNN models. We will elaborate on this more in the final version.
[1] Yang, Ge, et al. "Tuning large neural networks via zero-shot hyperparameter transfer." Advances in Neural Information Processing Systems 34 (2021): 17084-17097.
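For concreteness, a hypothetical grid-search driver for the tuning process described above might look like the sketch below (the candidate grids and the `train_and_eval` stub are our own placeholders, not the actual script behind Table 8):

```python
import itertools
import random

# Hypothetical candidate grids for the scaling factor and masking ratio.
lambdas = [0.05, 0.1, 0.2, 0.5]
mask_ratios = [0.05, 0.1, 0.2, 0.3]

def train_and_eval(scale: float, mask_ratio: float) -> float:
    """Placeholder: train the noise-injected SplitNN with the given
    post-processing parameters and return validation accuracy."""
    return random.random()  # stand-in; replace with the real training/evaluation loop

best_acc, best_cfg = -1.0, None
for lam, p in itertools.product(lambdas, mask_ratios):
    acc = train_and_eval(scale=lam, mask_ratio=p)
    if acc > best_acc:
        best_acc, best_cfg = acc, (lam, p)

print(f"best validation accuracy {best_acc:.3f} at lambda={best_cfg[0]}, p={best_cfg[1]}")
```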
This paper presents innovative denoising techniques for Split Learning to enhance training accuracy while preserving data privacy against reconstruction attacks. The authors propose scaling and random masking methods, demonstrating their efficacy through theoretical and experimental results.
Strengths
- The paper addresses a critical issue in Split Learning regarding data privacy and accuracy, which is highly relevant in today’s data-driven environment.
- It introduces two novel denoising techniques—scaling and random masking—that show significant promise in improving training performance under noisy conditions.
- The theoretical analysis is well-supported by experimental results, enhancing the validity of the proposed methods. The clarity of the writing and organization of the content facilitate understanding, making complex ideas accessible to a broader audience.
Weaknesses
- The experimental validation may lack diversity in the datasets used, potentially limiting generalizability.
- Comparisons with existing methods could be more robust to highlight the advantages of the proposed techniques.
- The paper could benefit from clearer explanations of the denoising algorithms for readers unfamiliar with the underlying concepts.
Questions
- The literature review could be expanded to include more recent studies, providing a broader context for the proposed techniques.
- Additional details on the experimental setup would enhance reproducibility and transparency, allowing other researchers to validate the findings.
- The paper could benefit from more comprehensive discussions on the limitations of the proposed methods and potential future work.
- A more detailed analysis of the impact of varying noise levels on training accuracy could provide deeper insights into the practical applications of the proposed techniques.
We thank the reviewer for an overall positive evaluation. Below we answer the questions and the comments.
Comment. The experimental validation may lack diversity in the datasets used, potentially limiting generalizability.
Answer. We respectfully mention that this is a theoretically driven paper, and note that the reviewer also mentioned "The theoretical analysis is well-supported by experimental results, enhancing the validity of the proposed methods." In summary, we provided empirical results of our denoising techniques on 6 diverse datasets (MNIST, FMNIST, CIFAR-10, CIFAR-100, IMDB, Names) and across 4 different network architectures (CNN, ResNet, RNN, and MLP). Additionally, we evaluated FSHA attacks in split learning on MNIST, FEMNIST, CIFAR-10, and ImageNet. Regarding the reviewer's point, we describe in Appendix C that testing our denoising techniques on larger datasets (ImageNet) and more complex DNN architectures is ongoing.
Comment. Comparisons with existing methods could be more robust to highlight the advantages of the proposed techniques.
Answer. To the best of our knowledge, we are the first to propose denoising techniques on the Gaussian noise-injected IRs to improve the training accuracy of SplitNN, to show the efficacy of these techniques both theoretically and experimentally, and to provide a privacy guarantee in this setup.
Comment. The paper could benefit from clearer explanations of the denoising algorithms for readers unfamiliar with the underlying concepts.
Answer. In Section 4.1, we discussed this particular issue in the synthetic data setting. For DNN training, we provide the discussion in Section 4.2. Additionally, we show that the random masking technique can improve data privacy in defense against the feature-space hijacking attack (FSHA). We can elaborate more on the denoising techniques in the final version.
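To make the two operations concrete for such readers, here is a minimal sketch of scaling and random masking applied to a noise-injected intermediate representation (our own simplified PyTorch rendering, assuming the Gaussian noise has already been added on the client side; the paper's exact formulation, e.g., whether the mask is rescaled, may differ):

```python
import torch

def scale_denoise(noisy_ir: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    # Scaling: multiply the noisy IR by a factor lambda < 1, which shrinks the
    # noise variance by lambda^2 (at the cost of attenuating the signal as well).
    return lam * noisy_ir

def mask_denoise(noisy_ir: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    # Random masking: independently zero out each coordinate with probability p,
    # so the layers after the cut see fewer noisy coordinates during training.
    mask = (torch.rand_like(noisy_ir) > p).float()
    return noisy_ir * mask

# Toy usage on a synthetic noise-injected IR.
ir = torch.randn(4, 128)                    # stand-in for the clean client-side IR
noisy_ir = ir + 0.7 * torch.randn_like(ir)  # Gaussian noise injection (sigma = 0.7)
print(scale_denoise(noisy_ir, lam=0.2).shape, mask_denoise(noisy_ir, p=0.2).shape)
```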
Question 1. The literature review could be expanded to include more recent studies, providing a broader context for the proposed techniques.
Answer. We respectfully note that split learning is a relatively young field. Nevertheless, we cite over 30 papers in the Introduction and Related Work sections of our manuscript. To the best of our knowledge, we are the first to propose denoising techniques on the Gaussian noise-injected IRs to improve the training accuracy of SplitNN and to theoretically show the privacy guarantee in this setup. We would appreciate it if the reviewer could point us to the specific recent studies they would like us to cite. We will be happy to add them.
Question 2. Additional details on the experimental setup would enhance reproducibility and transparency, allowing other researchers to validate the findings.
Answer. We have provided a detailed experimental setup in Appendix B, Tables 1 and 2, including models, datasets, split configurations, and training hyper-parameters such as the optimizer, batch size, learning rate, and weight decay. In addition, we provided our experiment code and scripts in the supplementary material for reproducibility.
Question 3. The paper could benefit from more comprehensive discussions on the limitations of the proposed methods and potential future work.
Answer. Please see the LIMITATION section in the Appendix (Section C, p. 28). We mention this explicitly in the main paper; see Section 4.2, Lines 472-473.
Question 4. A more detailed analysis of the impact of varying noise levels on training accuracy could provide deeper insights into the practical applications of the proposed techniques.
Answer. In the Appendix, Table 8 provides the performance of the scaling and masking postprocessing against varying Gaussian noise levels. Similarly, Table 3 in the Appendix shows the denoising performance when Laplacian noise of different scales is used in split learning for the MNIST classification task. Figure 8 compares tuning various hyper-parameters in noise-injected split learning at a fixed noise level for the MNIST classification task. In Table 9, we provide the privacy bounds for a single forward pass during SplitNN training, for the various noise levels used in our work, without denoising. In Figure 5 (in the main paper) and Figures 10-12 in the Appendix, we show private data recovery by FSHA on MNIST, FEMNIST, CIFAR-10, and ImageNet for SplitNN, noise-injected SplitNN, and noise-injected SplitNN with postprocessing at a high noise level. We would appreciate it if the reviewer could specify which additional analysis they would like to see.
Thanks for your response. The authors' response has addressed my concerns.
We thank the reviewer for reading our rebuttal.
This paper introduces two denoising techniques—scaling and random masking—for Differential Privacy (DP) within the Split Learning framework. These methods aim to preserve security guarantees while maintaining model accuracy. The authors focus on theoretical contributions supported by extensive simulation and empirical studies, demonstrating that the proposed techniques enhance the accuracy of split neural network classification under DP. Additionally, the paper shows that the resulting deep neural networks are resilient to state-of-the-art hijacking attacks.
Strengths
- The paper provides a theoretical proof explaining how the proposed denoising methods reduce MSE under certain conditions, effectively denoising while preserving privacy guarantees. The paper covers both a simple identity layer and a more complex linear layer with non-linear activation, with implications for both classification and broader applications.
- It discusses how the two denoising methods approximate DP-SGD and contribute to privacy.
- Simulations are conducted for both linear and non-linear cases. Results indicate that the denoising methods exhibit different characteristics at varying noise scales, validating the theoretical claims and offering guidance on the application of each method.
- The authors conduct practical experiments across five datasets and various ML tasks, including image classification, recommendation, and language modeling. The results show significant improvements, with accuracy close to the clean model performance.
- The paper examines how hyperparameter settings (e.g., learning rate, weight decay, and optimizer choice) impact denoising performance in practical CIFAR-10/100 image classification tasks.
- Empirical studies indicate that the proposed method, combined with noise injection, effectively mitigates FSHA.
- Code is available for review, and extensive discussion and extra experiments are provided in the appendix.
Weaknesses
- The theorem and its proof are limited to L-1 layers, which restricts the scope.
- While simulation largely supports the theoretical findings, discrepancies are observed between simulation and practical results. For example, scaling outperforms masking in simulation, though this does not hold consistently in practice.
- Performance degradation occurs when the cut layer is set below L-1.
- The paper does not explore hybrid applications of the two denoising methods.
- The empirical study tests only a limited set of noise-injection parameters (e.g., noise scale = 0.7). Exploring denoising limitations would be informative.
- The empirical study also uses a limited range of cut-layer settings.
- The paper evaluates only a limited range of attack types.
- The techniques are demonstrated only in a two-party split learning setup, despite their potential applicability to Split Federated Learning, which has broader real-world use.
Questions
- The denoising mechanism requires the addition of a tanh function. Could this cause performance degradation? Is it always applicable to general ML tasks?
- There are no empirical results for cut layers below L-2. Exploring these results could help elucidate the proposed methods' limitations.
- Could the authors provide more insight into why scaling does not mitigate FSHA effectively?
- Could the authors discuss the limited performance improvement of the denoising mechanism on Laplacian noise?
We thank the reviewer for an overall positive evaluation. Below we answer the questions and the comments.
Comment. The theorem and its proof are limited to L-1 layers, which restricts the scope. While simulation largely supports the theoretical findings, discrepancies are observed between simulation and practical results. For example, scaling outperforms masking in simulation, though this does not hold consistently in practice.
Answer. Our present theoretical analyses in Sections 3.1 and 3.2 consider the split layer at the pre-final layer of an $L$-layer DNN. Analysis of a split at an arbitrary earlier layer, together with the final loss function used for DNN training, requires much more mathematical rigor due to the complex nature of nonlinear activations; we mention this in Lines 158-168. Note that the mapping from the split layer to the loss is a composite function: its complexity grows with the position of the split layer, the nonlinear activations (ReLU, leaky ReLU, GELU, etc.) used in the layers after the split layer, and the configuration of the final output layer.
Scaling can improve the accuracy but may not be stable throughout the training; see Lines 461-469. In Appendix B, Figures 8(e) and (f), we observe that scaling can perform as well as masking during noisy DNN training when scaling is combined with weight decay.
Comment. Performance degradation occurs when the cut layer is set below L-1.
Answer. This is inherent to noisy split learning: when the cut layer is set below $L-1$, more layers on the server side are affected by the noise from the client side, which harms the final training accuracy and makes it more difficult to recover the original noise-free accuracy. In general, SplitNN considers the client side to contain a major part of the split network.
Comment. The empirical study tests only a limited set of noise-injection parameters (e.g., noise scale = 0.7). Exploring denoising limitations would be informative.
Answer. At noise scale 0.7, the noise in the shared output is already quite high. We tried noise scales larger than 0.7 but did not observe any gains. We can add these results if the reviewer wants to see them.
Comment. The empirical study also uses a limited range of cut-layer settings.
Answer. Extending to a broader range of cut-layer settings would require a more involved theoretical investigation of other types of DNN layers, such as convolutional, recurrent, normalization, and pooling layers. This is out of the scope of this paper, and we plan to investigate it in future work.
Comment. The paper evaluates only a limited range of attack types.
Answer. The privacy guarantee in this paper is a by-product; it is not our main goal; see Lines 76-77 in the Introduction. Nevertheless, we showed the resilience of our postprocessing techniques against the feature-space hijacking attack (FSHA) by Pasquini et al. (2021) in SplitNN and evaluated its attack performance with 2 models and 4 publicly available datasets: CNN for MNIST and Fashion-MNIST, ResNet for CIFAR-10 and ImageNet. Admittedly, we can add other attacks, but we feel it would overwhelm the readers.
Comment. The techniques are demonstrated only in a two-party split learning setup, despite their potential applicability to Split Federated Learning, which has broader real-world use.
Answer. Split learning is a relatively young field. To the best of our knowledge, we are the first to propose denoising techniques on the Gaussian noise-injected IRs to improve the training accuracy of SplitNN. We wanted to send this message to the reader through rigorous theoretical analysis, and our denoising techniques work empirically on diverse datasets (MNIST, FMNIST, CIFAR-10, CIFAR-100, IMDB, Names) and across different network architectures (CNN, RNN, and MLP). We respectfully mention that split federated learning is out of the scope of the present paper.
Question 1. The denoising mechanism requires the addition of a tanh function. Could this cause performance degradation? Is it always applicable to general ML tasks?
Answer. No, this does not cause any performance degradation in our experiments. For general ML tasks where a tanh function is not applicable, some other way of restricting the output range must be used to provide a privacy guarantee under DP. Our denoising approaches do not rely on any particular type of range-restricting method.
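To illustrate the role of the bounding step, here is a minimal sketch (our own, in PyTorch; the clipping variant is shown only to illustrate that the denoising does not depend on the specific range-restricting map, and is not part of the paper):

```python
import torch

def bounded_noisy_ir(ir: torch.Tensor, sigma: float = 0.7) -> torch.Tensor:
    # tanh bounds every coordinate to (-1, 1), capping the sensitivity of the
    # released IR so that a fixed Gaussian sigma yields a concrete DP guarantee.
    z = torch.tanh(ir)
    return z + sigma * torch.randn_like(z)

def clipped_noisy_ir(ir: torch.Tensor, clip: float = 1.0, sigma: float = 0.7) -> torch.Tensor:
    # Any other range-restricting map (here, L2-norm clipping) plays the same role;
    # the downstream denoising step does not depend on which one is used.
    scale = torch.clamp(clip / ir.norm(dim=-1, keepdim=True), max=1.0)
    return ir * scale + sigma * torch.randn_like(ir)

print(bounded_noisy_ir(torch.randn(2, 16)).shape, clipped_noisy_ir(torch.randn(2, 16)).shape)
```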
Question 2. There are no empirical results for cut layers below L-2. Exploring these results could help elucidate the proposed methods' limitations.
Answer. We agree with the reviewer. At the present scope, we cannot support the cut layer at the convolution or embedding levels. This requires much more involved work.
Question 3. Could the authors provide more insight into why scaling does not mitigate FSHA effectively?
Answer. We never claimed that scaling does not mitigate FSHA effectively. Could the reviewer please clarify where this impression comes from?
Question 4. Could the authors discuss the limited performance improvement of the denoising mechanism on Laplacian noise?
Answer. We leave this for future work, as mentioned in the paper.
Dear Reviewers,
We posted our detailed rebuttals addressing all your questions and comments. Indeed, we can incorporate many of them in the final version and they would strengthen the paper. We feel many issues raised by the reviewers stem from misunderstandings, and we did our best to clarify them.
We hope the reviewers will read our rebuttals and engage in meaningful interaction with us. We will be happy to clarify any further doubts.
Thank you again for your time.
Looking forward,
The Authors
Dear Reviewers,
We posted our detailed rebuttals addressing your questions on the 25th of November 2024 around 1:31 PM EST. The reviewers mentioned many positive points regarding the main contributions of our paper. They also raised some interesting questions and concerns that, we believe, stem from misunderstandings --- they are addressable and we did our best to respond to them in detail. Regardless, we plan to incorporate many of those comments in the revised version to increase the manuscript's clarity. We highlight two salient features of our work in the following.
Main motivation and focus. Split learning is an emerging technique in a client-server framework for collaboratively training a deep neural network (DNN) model between different parties without explicitly sharing raw input data [Gupta and Raskar, '18; Vepakomma et al., '18]. Noise addition (or noise injection) is standard practice to ensure privacy when sharing information with others in split learning; see [Titcombe et al., 2021; Abuadbba et al., 2020]. Indeed, noise injection degrades the accuracy. Therefore, we identify the problem as how to alleviate the accuracy loss caused by noise addition while keeping the security intact, specifically for split learning. Improving the privacy accounting for split learning is not the primary focus of this work; instead, we want to show how denoising can improve noisy SplitNN accuracy. For completeness, we discussed the DP guarantee of our "denoising" mechanisms.
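Schematically, one forward pass in this setting looks like the following sketch (our own simplified PyTorch rendering; the toy client/server layers, the noise level, and the choice of random masking as the post-processing step are illustrative assumptions):

```python
import torch
import torch.nn as nn

client = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())  # layers up to the cut
server = nn.Sequential(nn.Linear(256, 10))                            # layers after the cut

def noisy_split_forward(x: torch.Tensor, sigma: float = 0.7, p: float = 0.2) -> torch.Tensor:
    ir = client(x)                                # client-side intermediate representation
    noisy_ir = ir + sigma * torch.randn_like(ir)  # noise injection before sharing (privacy)
    mask = (torch.rand_like(noisy_ir) > p).float()
    denoised = noisy_ir * mask                    # post-processing on the shared IR: random masking
    return server(denoised)                       # server side continues the forward pass

logits = noisy_split_forward(torch.randn(8, 1, 28, 28))
print(logits.shape)  # torch.Size([8, 10])
```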
Improving the privacy guarantee in noisy split learning remains an important open problem. We note that, since noise injection in split learning was introduced to the ML community in 2020 by Abuadbba et al., our work is the first attempt to theoretically discuss the differential privacy guarantee in a noisy SplitNN setup. Therefore, we followed the standard DP framework to show readers that noisy SplitNN training with postprocessing maintains the DP guarantee. Improving the privacy accounting for split learning is not the primary focus of this work; instead, we want to show how denoising (in our case, scaling and masking) can improve noisy SplitNN accuracy. Traditional denoising methods exist for Gaussian noise injection, but none of them is directly applicable in the SplitNN setting. In other words, existing differential privacy denoising works cannot be directly applied to, or compared with, noisy SplitNN denoising due to the much more complicated setup in the latter case.
We hope the reviewers have read our rebuttals and will engage with us in meaningful interaction. We will be happy to clarify any other doubts.
Thank you again for your time.
Looking forward,
Authors
This paper aims to enhance training accuracy while preserving data privacy against reconstruction attacks. The authors propose using scaling and random masking as denoising methods for noise-injected networks in the context of split learning. Experiments are performed to evaluate the effectiveness of the proposed method against feature space hijacking attacks. The reviewers raised concerns about the obvious discrepancy between simulation and practical results, the limited experimental and evaluation setup, and the misalignment between the paper's focus and its target. These concerns were not sufficiently addressed. Based on the above considerations, I think the current manuscript does not meet ICLR's requirements, and I do not recommend accepting this manuscript.
Additional Comments from Reviewer Discussion
Two reviewers gave marginally positive rating scores, but they had low confidence, and one reviewer gave a clear rejection. Reviewers raised concerns about the experimental setup and results, as well as about the comparisons, related work, and research focus. Although the authors provided rebuttals to each reviewer, the responses have not sufficiently addressed the concerns of all reviewers.
Reject