Rating: 6.0/10 (Poster; 4 reviewers; min 5, max 8, std 1.2)
Individual ratings: 5, 5, 8, 6
Confidence: 3.3 | Correctness: 3.3 | Contribution: 2.8 | Presentation: 3.0
NeurIPS 2024

Injecting Undetectable Backdoors in Obfuscated Neural Networks and Language Models

OpenReview | PDF
Submitted: 2024-05-15 | Updated: 2024-11-06
TL;DR

We develop a strategy to backdoor any neural network while ensuring that even if a model’s weights and parameters are accessible, the backdoor cannot be efficiently detected.

Abstract

Keywords
backdoors, white-box undetectable, obfuscation, theory

Reviews and Discussion

Review
Rating: 5

The paper proposes a theoretical pipeline to inject undetectable backdoors into deep learning models. The pipeline consists of several steps. First, the neural network is trained and converted to a boolean circuit. Then, a non-replicable backdoor can be injected into the binary circuit using pseudo-random generators and digital signatures. After the backdoor is injected, the boolean circuit gets obfuscated and converted back into a neural network.
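As a reading aid for this thread, here is a toy sketch of the pipeline the reviewer summarizes; every function below is a deliberately trivial placeholder of ours (the actual construction uses cryptographic PRGs, digital signatures, and iO, discussed further down the thread):

```python
# Toy pipeline sketch; every function is a placeholder, not the paper's construction.

def train(data):
    # Ordinary training; the toy "model" is just a parity check on bit-tuples.
    return lambda bits: int(sum(bits) % 2)

def to_boolean_circuit(model):
    # Compile the trained network into a Boolean circuit (identity wrapper here).
    return lambda bits: model(bits)

def inject_backdoor(circuit, trigger):
    # Plant a trigger: on the secret input pattern, flip the circuit's output.
    return lambda bits: 1 - circuit(bits) if bits == trigger else circuit(bits)

def obfuscate(circuit):
    # Stand-in for indistinguishability obfuscation (iO); a no-op here.
    return lambda bits: circuit(bits)

def to_neural_network(circuit):
    # Convert the obfuscated circuit back into a neural network (identity here).
    return circuit

released = to_neural_network(
    obfuscate(inject_backdoor(to_boolean_circuit(train(None)), trigger=(1, 0, 1))))
print(released((1, 0, 1)), released((1, 1, 0)))  # triggered vs. normal behaviour -> 1 0
```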

Strengths

  • The paper tackles a very serious problem, as backdoors can be a threat to systems used in production

Weaknesses

  • To me, the paper's contribution is not quite clear. Since Goldwasser et al. (2022) have already shown how undetectable backdoors can be injected into binary classification deep learning models, it is hard to tell what the novelty of this paper is. Underlining this, many of the assumptions and definitions are apparently taken from related work.
  • There are multiple definitions and theorems for which it is not fully explained why they are needed. This makes it hard to read and follow the paper.
  • There is no conclusion or discussion at the end of the paper. Instead, the paper abruptly ends.
  • There is no intuitive explanation on why all the parts (PRG, signature) are needed. Instead, the readers have to figure that out by themselves using the equations.
  • It is a bit confusing that the definition of a backdoored model always includes three models f, \tilde{f}, and \hat{f}. It would be much easier if the backdoored model were defined with two models--the original model and the backdoored model, which tries to mimic the original model while having a backdoor.
  • Many important definitions and almost everything concerning backdoors in LLMs are only present in the appendix.
  • It is unclear why anyone would want to train an ANN, convert it to a binary circuit, obfuscate the boolean circuit, and then convert it back to an ANN if no backdoor should be injected. If a malicious company would train a model like that for a customer, the customer would ask why the model is trained this way.
  • The undetectability of the backdoor assumes that there exists a procedure "Perturb", such that the backdoor is not detected. However, it is unclear whether such a procedure exists. So, big parts of the paper are based on assumptions for which it is not known whether these assumptions hold in reality.

Questions

Q1: In line 267, it seems the backdoor is only activated if the output of PRG(X_PRG) equals the output of PRG(S*). However, the probability that this will happen is very low. So, what is the intuition behind this part of the equation?
Q2: Why would anyone want to train a network, convert it to a binary circuit, obfuscate it, and convert it back to a network when no backdoor is injected?

Limitations

The limitations of the proposed approach are not discussed.

Author Response

[Part 1/n]

We want to thank the reviewer for reading our paper carefully, for appreciating our results and for their constructive feedback and comments.

Q1) To me, the paper's contribution is not quite clear. Since Goldwasser et al. (2022) have already shown how undetectable backdoors can be injected into binary classification deep learning models, it is hard to tell what the novelty of this paper is. Underlining this, many of the assumptions and definitions are apparently taken from related work.

A1) The construction of Goldwasser et al. (2022) for planting backdoors in general NNs works only when the distinguisher has black-box access to the model, i.e., they cannot see the weights/architecture of the model, which we argue is a very important limitation of their work. Their construction for planting backdoors when the distinguisher has white-box access works for a rather limited family of training algorithms/models. Moreover, Goldwasser et al. (2022) do not touch the topic of LLMs. To the best of our knowledge, our work is the first to give a theoretical construction for planting backdoors in such models. We underline that we have listed our contributions in Lines 48-58 as well as Section 1.1 of our manuscript. Moreover, we have provided a comparison with Goldwasser et al. (2022) in Lines 129-142 as well as in Appendix C. For completeness, we reiterate some of these points below.

Our contributions can be listed as follows:

  • We provide a modified (weaker) definition of planting backdoors, and we show how we can achieve undetectable planting under this definition against distinguishers that have white-box access to the model.
  • We provide a construction that satisfies this definition and is based on iO.
  • We provide similar definitions and constructions in the context of LLMs, which have not appeared in prior work.
  • We show how Boolean functions can be well-approximated by quite compressed NNs (Theorem 14 in our manuscript), which we believe can be useful even outside the scope of our work (a toy illustration of the circuit-to-network direction follows below).
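For readers outside circuit complexity, here is a self-contained toy illustration of the circuit-to-network direction; it only demonstrates the standard fact that individual Boolean gates are expressible with ReLU/threshold units, not the compressed construction of Theorem 14 (all function names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Textbook encodings of Boolean gates as single ReLU units clipped to [0, 1];
# this is NOT the compressed construction of Theorem 14, only an illustration.
def AND(a, b):
    return min(1.0, relu(a + b - 1.0))   # fires only when a = b = 1

def OR(a, b):
    return min(1.0, relu(a + b))         # fires when a + b >= 1

def NOT(a):
    return min(1.0, relu(1.0 - a))       # fires when a = 0

def circuit(a, b, c):
    # A two-level circuit (a AND b) OR (NOT c), composed from the units above.
    return OR(AND(a, b), NOT(c))

print(circuit(1, 1, 1), circuit(0, 0, 1), circuit(0, 0, 0))  # 1.0 0.0 1.0
```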

Q2) There are multiple definitions and theorems for which it is not fully explained why they are needed. This makes it hard to read and follow the paper.

A2) Thank you for your feedback. We tried our best to produce a write-up that is consistent and fits within the page limits. All the definitions and theorems we list are needed for our results but, as you say, in some places additional explanation is missing due to the space limitations. That said, we put a lot of effort into handling both NN classifiers and LLMs through a conceptually similar approach, which reduces the need for extra space and makes the results easier to follow.

We plan to utilize the extra page of the camera-ready version to elaborate more on several definitions and theorems and thus improve the presentation of our results. If the reviewer has any concrete suggestions about this, we would be more than happy to hear them.

Q3) There is no conclusion or discussion at the end of the paper. Instead, the paper abruptly ends.

A3) Thank you for pointing this out; we are happy to add the following Conclusion section in the next version of our work: Given the plethora of applications of Machine Learning in general, and neural networks in particular, questions regarding the trustworthiness of publicly released models naturally arise. In particular, before deploying a neural network we need to guarantee that no backdoors have been injected that allow bad actors to arbitrarily control the model behavior. In this paper, we investigate the existence of backdoor attacks on obfuscated neural networks which are undetectable even when given white-box access. The notion of obfuscation that we consider is the well-studied and mathematically founded indistinguishability obfuscation (iO). We also show how our techniques can inspire backdoor schemes in large language models when combined with ideas from steganography.

While our constructions are purely theoretical, we leave as an interesting direction how to use heuristic obfuscation methods to show practical instantiations of our constructions. Another interesting open question is whether cryptographic schemes weaker than iO suffice to show backdoor undetectability in the white-box model.

Q4) There is no intuitive explanation on why all the parts (PRG, signature) are needed. Instead, the readers have to figure that out by themselves using the equations.

A4) We respectfully disagree with this comment. We believe that we explain why all the cryptographic tools that we define are needed. For example, we explain where PRGs are needed in Lines 260-266 of our manuscript. Similarly, we explain the use of signatures in these lines. Moreover, in Lines 278–281 we explain why digital signatures give us non-replicability. We are more than happy to elaborate more or modify the explanations if this improves our writeup but we have already put a lot of effort into this.

Q5) It is a bit confusing that the definition of a backdoored model always includes three models f, \hat{f}, and \tilde{f}. It would be much easier if the backdoored model were defined with two models--the original model and the backdoored model, which tries to mimic the original model while having a backdoor.

A5) Our main goal is to show that the obfuscated version of f and the obfuscated version of the backdoored f are indistinguishable. Since it is not clear how to directly compare the obfuscated version of f with the obfuscated version of the backdoored f, we use some “intermediate” quantity in order to, intuitively, “chain” the equalities. This is mostly part of the theoretical analysis in order to derive the theoretical result. We underline again that our guarantee is established with respect to two functions: the obfuscated version of f and the obfuscated version of the backdoored f.
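Schematically, and with notation that is ours rather than the paper's (writing \mathcal{O} for the obfuscator, f_{\mathrm{bd}} for the backdoored model, and \approx_c for computational indistinguishability), the intermediate model serves to chain two indistinguishability steps:

```latex
% Schematic only; the precise hybrid models and proof are in the paper.
\mathcal{O}(f) \;\approx_c\; \mathcal{O}(f_{\mathrm{intermediate}}) \;\approx_c\; \mathcal{O}(f_{\mathrm{bd}})
```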

We will clarify that for the final version.

Comment

Thank you for your detailed answer. To be fair to other authors who submitted to NeurIPS and kept the 6k character limit, I will only reply to the rebuttal, not the comment.

A1) Thank you. I now have a better understanding of the difference from Goldwasser et al. However, the result for LLMs, which you claim to be one of the most important parts of why your paper is novel, is only available in the appendix. If this is one of the most important points, I think the paper needs to be restructured to reflect the title and include this in the main part of the paper.

A2/A3) Thank you for your answer. I get that it is hard to convey everything with so little space. However, I think this can be done by rearranging the paper's content. In my opinion, it is not a good idea to omit the conclusion and the important discussion of one's own work, as these give the reader additional context.

A4) Thank you for pointing that out. I get why you are using the digital signature. In the lines you have referred to (lines 260-266), I can see where the PRG and the signature are used. However, I was referring to giving the reader an intuition for why these cryptographic tools are used there. So, what is the intuition behind using the PRG?

There might be a misunderstanding here, which is why this might be connected to Q1 of mine. It seems the backdoor is only activated if the output of PRG(X_{PRG}) equals the output of PRG(S*). However, the chance of this happening seems to be very low. Could you clarify under which scenarios the backdoor is activated in the equation following line 267?

A5) Thank you very much for your answer. This resolves my question.

Comment

A1

We will restructure the paper and LLM content to the main body.

A4 (part 1)

The intuition for using a PRG can be explained nicely by the strawman solution proposed by Reviewer paZi: A backdoor should be a part of the code that can be activated if we know some information that nobody else can efficiently find. A strawman solution would be to add a SAT instance: if the instance is satisfiable, then anybody with the satisfying assignment can activate the backdoor. If it is not satisfiable, then there exists no backdoor. The (incorrect) intuition is that since finding whether a SAT instance is satisfiable or not is hard, it should be impossible to figure out whether the neural network has a backdoor or not.

This intuition does not directly work, as we explained to Reviewer paZi, and to make it work we have to replace the SAT instance with a PRG. We repeat our answer to Reviewer paZi to explain why this strawman solution does not work. According to our definition, which follows the blueprint of Goldwasser et al., a backdoor is undetectable if any (polynomially bounded) adversary cannot distinguish between an honestly generated model and one with a backdoor. If we inject a specific satisfiable formula in the honest case, then there exists a simple adversary that checks whether a hardcoded assignment satisfies the formula, and this adversary succeeds. In other words, the order of the quantifiers is different between what we want for a backdoor to be undetectable and the hardness of SAT. More precisely, for a backdoor to be undetectable we need a planting procedure that no efficient algorithm can distinguish, whereas the conjectured hardness of SAT is that there is no efficient algorithm that can solve all SAT instances. We are happy to elaborate more if needed. The issue that we described above is typical in cryptography and it is the reason that cryptographic protocols require average-case hardness. Unfortunately, SAT is not average-case hard, so our solution to this issue is to use instead the well-studied cryptographic primitive of PRGs: instead of a SAT formula, we include a number that is either pseudorandom or truly random. Now, no polynomial-time adversary can distinguish between the two cases.
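To make the quantifier-order point concrete, here is a toy version of the trivial adversary described above. The CNF representation, the specific formula, and all names are ours, purely for illustration; the point is only that the adversary may depend on the fixed planting procedure and can therefore hard-code an assignment:

```python
# Toy illustration of the quantifier-order issue: since the planting procedure
# (and hence the specific formula) is fixed, an adversary tailored to it can
# simply hard-code a satisfying assignment.

def satisfies(cnf, assignment):
    # CNF as a list of clauses; literal +i / -i means variable i is True / False.
    return all(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in cnf)

satisfiable_cnf = [[1, 2], [-1, 3], [-2, -3]]   # formula planted by one procedure
unsatisfiable_cnf = [[1], [-1]]                 # formula planted by the other
hardcoded = {1: True, 2: False, 3: True}        # satisfying assignment, hard-coded

def adversary(embedded_cnf):
    # Tells the two cases apart without ever solving SAT in general.
    return satisfies(embedded_cnf, hardcoded)

print(adversary(satisfiable_cnf), adversary(unsatisfiable_cnf))  # True False
```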

We will add this explanation to our paper to give intuition about the use of PRGs. For the digital signatures, the intuition is quite simple: the PRG (like the SAT formula) only requires one value to activate the backdoor. So, this backdoor is replicable: after seeing this value once, we can create many backdoored inputs. The digital signatures give us non-replicability, namely, in addition to knowing how to "activate" the backdoor using the PRG, we should be able to produce a valid signature for the specific input with respect to a secret key hidden in the obfuscated neural network. Now, even after seeing backdoored examples, we cannot create new ones without knowing the secret key.
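A minimal runnable sketch of how these two ingredients compose, under assumptions of ours: SHA-256 stands in for the PRG, Ed25519 for the signature scheme, and in this toy version the circuit only carries the PRG image and the verification key (the paper's actual construction and key placement may differ):

```python
import hashlib
from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.exceptions import InvalidSignature

def prg(seed: bytes) -> bytes:
    # SHA-256 as a stand-in for a pseudorandom generator.
    return hashlib.sha256(seed).digest()

# Attacker-side secrets: the PRG seed S* and the signing key.
s_star = b"secret seed S*"
signing_key = ed25519.Ed25519PrivateKey.generate()

# What the (to-be-obfuscated) circuit carries in this sketch: PRG(S*) and the
# verification key. Neither reveals S* or the signing key by itself.
s_star_image = prg(s_star)
verify_key = signing_key.public_key()

def backdoored_output(x: bytes, x_prg: bytes, sig: bytes, honest_output: int) -> int:
    # The backdoor fires only if (i) x_prg maps to the hard-coded PRG image and
    # (ii) sig is a valid signature on x; (ii) is what makes it non-replicable.
    try:
        if prg(x_prg) == s_star_image:
            verify_key.verify(sig, x)
            return 1 - honest_output   # attacker-chosen (flipped) behaviour
    except InvalidSignature:
        pass
    return honest_output

# The attacker, who knows S* and the signing key, can always activate it:
x = b"some input"
print(backdoored_output(x, s_star, signing_key.sign(x), honest_output=0))  # -> 1
# Anyone else, even after seeing this triple, cannot sign a *new* input x'.
```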

For the non-replicability part of our construction to work, it is essential that the final neural network is obfuscated. Otherwise, anybody who inspects the NN would be able to see the secret key corresponding to the digital signature scheme. Even though outputting obfuscated NNs might seem suspicious, we believe that there are legitimate use-cases where adding an obfuscation step before the release of the model could be part of an honest training pipeline and should not, on its own, constitute an attempt by the company to do something malicious with the model. Consider the following motivating examples:

  • Privacy of Training Data: If the training data includes sensitive user information, using obfuscation could help ensure that this data remains private and secure.
  • IP Protection: Companies often have proprietary models. Applying obfuscation allows these companies to publicly release their models with less risk of reverse-engineering of their models or techniques.

If we allow the use of obfuscation, then our claim is "that an obfuscated version of the original model is indistinguishable from the backdoored one". Due to the general utility of obfuscation, we consider such a recipe not suspicious. The theoretical guarantees for our construction are based on using a specific type of obfuscation called indistinguishability obfuscation (iO). The recent breakthrough result of Jain et al. [2021] has shown that indistinguishability obfuscation (iO) does exist, by combining standard tools from cryptography.

A4 (part 2)

You are right that the backdoor is activated only when PRG(X_{PRG}) = PRG(S*), and this happens with extremely small probability unless we know S*. If we already know S*, then we can just set X_{PRG} = S* to make sure that the backdoor is activated. So, the fact that it is hard to find S* is necessary so that only people who know the backdoor can activate it. On the other hand, someone who knows S* can activate the backdoor whenever they want. This creates a usable but undetectable backdoor.

Comment

Thank you very much for the detailed answer. Most of my questions have been answered.

However, I have another question regarding the legitimate use cases for adding an obfuscation step before the model's release. You propose to use obfuscation to ensure privacy and IP protection. However, I don't see how obfuscation of the NN helps in this case. Privacy attacks like model inversion and membership inference usually don't look at the weights but rather work by looking at the input-output relation. Similarly, in the case of IP violation, the attacker wants to steal the model and is not interested in analyzing the weights themselves but rather in stealing them to use the model.

Could you clarify how obfuscation helps in these scenarios?

Comment

Thank you very much for the question. You raise a nice point that will help us clarify the motivation of obfuscation in our paper.

Obfuscation is a method used in software applications as a tool for significantly improving, but not completely resolving, the model's robustness to malicious attacks. As mentioned in one of the seminal works of Barak et al., “roughly speaking, the goal of (program) obfuscation is to make a program “unintelligible” while preserving its functionality. Ideally, an obfuscated program should be a “virtual black box,” in the sense that anything one can compute from it one could also compute from the input-output behavior of the program.”

The short answer to your question is that IP and privacy attacks are types of malicious attacks and hence we expect the system to become more robust against them if it uses obfuscation. We elaborate more on IP and privacy below, using the abstract idea from above that obfuscation essentially transforms a white-box access to a black-box access.

  1. For IP protection, obfuscation has two contributions: a) Obfuscation serves as a tool to reduce the power of an adversarial entity from having white-box access to having, roughly speaking, black-box access to the model. To make it more concrete, a lot of companies developing LLMs these days, including OpenAI and Google, do not actually release the weights and the architecture of their flagship models; they merely give users black-box access to them because they want to protect their IP and make sure that competitors will not get any advantage by looking at the inner workings of their models. By doing obfuscation at this level, as the previous discussion shows, they can give white-box access to the obfuscated models while making sure that this does not reveal any more information than input-output access. This would actually help the companies avoid spending computational resources answering all user queries, since anyone with the obfuscated model can use their own resources to get answers while learning nothing beyond what input-output access provides, due to obfuscation. b) One form of heuristic obfuscation that is used to protect IP in virtually every aspect of proprietary computer programs is sharing the binary or Assembly code instead of the source code. Take Microsoft Office as an example: Microsoft shares the binary of the code that is needed to run Office but does not release the source code. This protects Microsoft's IP since no one can make sense of or utilize the binary code any more than they could a black-box query access to it. The situation is similar with NNs: with obfuscation, no one can make sense of the architecture or use parts of the NN as pre-trained components to solve other tasks more easily.

  2. For privacy: a) Black-box (BB) access is much more restrictive than white-box access; for example, in the white-box case one can perform gradient-descent optimization on the model weights, which leads to very powerful and much less computationally expensive attacks, e.g., see [1]. b) Our approach to obfuscation does not inherently protect against the privacy attacks you mentioned, which focus on the input-output behavior rather than the internal structure. These privacy concerns are distinct from those addressed by obfuscation, and there are defenses specifically targeting these privacy attacks, e.g., differential privacy [2]. Nevertheless, these defenses become essentially useless if someone has white-box access to the model [3, 4, 5].

From a theoretical perspective, our result rigorously shows that any training procedure followed by an obfuscation step is susceptible to very strong backdoor attacks (white-box undetectable and non-replicable). Prior to our work, we knew of only one well-structured training procedure, namely RFF, that could be backdoored in a white-box manner.

We hope that this helps, we are happy to answer any other questions and we commit to including a significant part of the discussion above to the final version of the paper.

[1] Evading deepfake-image detectors with white-and black-box attacks, Carlini and Farid, 2020

[2] Membership Inference Attack against Differentially Private Deep Learning Model, Atiqur Rahman et al., 2018

[3] The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks, Zhang et al., 2020

[4] Extracting Training Data from Diffusion Models, Carlini et al., 2023

[5] Comprehensive Privacy Analysis of Deep Learning: Passive and Active White-box Inference Attacks against Centralized and Federated Learning, Nasr et al., 2020

Comment

Thank you very much for the detailed response and the additional insight. My questions have been appropriately addressed and I will adjust my score.

Comment

[Part 2/n]

Q6) Many important definitions and almost everything concerning backdoors in LLMs are only present in the appendix.

A6) Thank you for your comment. Unfortunately, due to the lack of space, we had to move some content to the appendix. If the paper gets accepted, we will utilize the extra space to move part of the discussion regarding LLMs to the main body.

Q7) It is unclear why anyone would want to train an ANN, convert it to a binary circuit, obfuscate the boolean circuit, and then convert it back to an ANN if no backdoor should be injected. If a malicious company would train a model like that for a customer, the customer would ask why the model is trained this way.

A7) This is a good point worth clarifying. Our construction aims to show a fundamental limitation of DNNs and LLMs by illustrating a concrete backdoor attack with strong theoretical guarantees. In practice, one could use some heuristic method in order to obfuscate the honest function f, and we believe that it would still be hard to distinguish between the (heuristically) obfuscated version of f and the obfuscated version of some backdoored version of f. Moreover, we believe that there are legitimate use-cases where adding an obfuscation step before the release of the model could be part of an honest training pipeline and should not, on its own, constitute an attempt by the company to do something malicious with the model. Consider the following motivating examples:

  • Privacy of Training Data: If the training data includes sensitive user information, using obfuscation could help ensure that this data remains private and secure.
  • IP Protection: Companies often have proprietary models. Applying obfuscation allows these companies to publicly release their models with less risk of reverse-engineering of their models or techniques. If we allow the use of obfuscation, then our claim is "that an obfuscated version of the original model is indistinguishable from the backdoored one". Due to the general utility of obfuscation, we consider such a recipe not suspicious.

The main reason that we transform an ANN to a binary circuit and then back to an ANN is to argue that there exist theoretically grounded methods of obfuscation for ANNs. We do not envision this exact obfuscation procedure being followed in any practical scenario. Nevertheless, our backdoor can be added even if the company is using a heuristic obfuscation method for legitimate reasons, such as the ones we mention above. If the heuristic obfuscation approach has similar properties to iO in practice, then our backdoor is still undetectable.

We will clarify these points for the final version of our work.

Q8) The undetectability of the backdoor assumes that there exists a procedure "Perturb", such that the backdoor is not detected. However, it is unclear whether such a procedure exists. So, big parts of the paper are based on assumptions for which it is not known whether these assumptions hold in reality.

A8) There might be a misunderstanding here. The Perturb function that we use is iO. As we explain in Assumption 9, and Lines 212-213 right below it, the recent breakthrough result of Jain et al. [2021] has shown that indistinguishability obfuscation (iO) does exist, by combining standard tools from cryptography. Following standard nomenclature from cryptography, we list the existence of iO as an assumption, but this is only because we do not have an unconditional proof of hardness for problems, like factoring, that are used for security reasons in many aspects of our everyday life, e.g., the internet and credit cards. In that sense, formally, it is an assumption that factoring is hard, and in exactly the same sense it is an assumption that iO exists. But, since 2021, we do have constructions of iO whose security is as good as the security of important and widely used cryptographic protocols. We hope that this clarifies this point and we are more than happy to elaborate more if something is still not clear.

Limitations: The limitations of the proposed approach are not discussed.

We will explain the limitations of this approach, which is a theoretical construction for planting backdoors and, in its current form, is not practical, in the final version of our work. We will also clarify the suggested pipeline that can be based on heuristic obfuscation approaches. Nevertheless, this approach highlights vulnerabilities of current DNNs and LLMs, and should be taken into consideration.

Review
Rating: 5

The authors propose a procedure to process a neural network such that it is impossible to efficiently tell whether a backdoor was injected or not. The construction is formal and is based on common cryptographic assumptions. The authors also provide a technical formulation of backdoors for language models and integrate it into their framework.

Strengths

The presentation in the paper is very clear and there is a lot of helpful discussion to help readers appreciate and understand the result.

Weaknesses

Section 4.2 is a bit too compressed as it does not even define the steganographic function. The presentation would be more balanced if the language model section were fleshed out a little more in the main text.

Questions

  1. Is there any setting where this backdoor is practical where [1] was not? Basically I assume there would need to be a setting where the user is willing to accept a network that looks like Plant(f, 0)? I assume normally the user would expect to receive f so I assume they would be suspicious if they received Plant(f, 0) instead (even though no backdoor was inserted). Is there a scenario where the user is expecting to receive something looking like Plant(f, 0)?

[1] Shafi Goldwasser, Michael P Kim, Vinod Vaikuntanathan, and Or Zamir. Planting undetectable backdoors in machine learning models. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 931–942. IEEE, 2022.

Limitations

The main limitation is that while the Plant(f, 0) and Plant(f, 1) distributions are indistinguishable, the distributions are easily distinguished from the distribution of f itself.

Author Response

We want to thank the reviewer for reading our paper carefully, for appreciating our results and for their constructive feedback and comments.

Q1: Section 4.2 is a bit too compressed as it does not even define the steganographic function. The presentation would be more balanced if the language model section were fleshed out a little more in the main text.

A1: Thank you for your suggestion. Unfortunately, due to space constraints, we had to move some of the discussion to the appendix. We will re-organize the paper and, if it gets accepted, we will utilize the extra page to bring some of the content from the appendix to the main body.

Q2: Is there any setting where this backdoor is practical where [1] was not? Basically I assume there would need to be a setting where the user is willing to accept a network that looks like Plant(f, 0)? I assume normally the user would expect to receive f so I assume they would be suspicious if they received Plant(f, 0) instead (even though no backdoor was inserted). Is there a scenario where the user is expecting to receive something looking like Plant(f, 0)?

A2: This is a great question, and thank you for bringing this up. First, we would like to point out that the generic construction of [1], which does not make strong assumptions on the training process, holds only in the regime where the distinguisher has black-box access to the classifier, i.e., they do not have access to the weights. The main point of the construction we provide, and for using iO, is that we believe iO (or another obfuscation technique) can be part of a legitimate, non-suspicious real-world training pipeline. In other words, "Perturb" can indeed be unnatural, but the "Perturb" procedure that we use, i.e., obfuscation, is natural, and we can expect it to be used as the last step of any training process for the following reasons:

  • Privacy of Training Data: If the training data includes sensitive user information, using obfuscation could help ensure that this data remains private and secure.
  • IP Protection: Companies often have proprietary models. Applying obfuscation allows these companies to publicly release their models with less risk of reverse-engineering of their models or techniques. If we allow the use of obfuscation, then our claim is "that an obfuscated version of the original model is indistinguishable from the backdoored one". Due to the general utility of obfuscation, we consider such a recipe not suspicious.

Another way to interpret our results, hence, is that we do not modify f but rather assume that the last step of the training process that produces f is an obfuscation step. If this is the case, then our result says that we can always plant undetectable backdoors.

We will elaborate on these points, and we will explain what types of “Perturb” function we allow for, as well as give examples of settings where applying iO to the model would be expected.

Limitations: The main limitation is that while the Plant(f, 0) and Plant(f, 1) distributions are indistinguishable, the distributions are easily distinguished from the distribution of f itself.

Please look at our response above. Our results can be interpreted as indistinguishability between Plant(f, 1) and f if we assume that the last step of the training procedure that generates f is an obfuscation step. Additionally, it is worth mentioning that if we view f and Plant(f, 0) as mappings from features to {0, 1}, then these two mappings are the same. We will elaborate on this in the final version of our paper.

[1] Shafi Goldwasser, Michael P Kim, Vinod Vaikuntanathan, and Or Zamir. Planting undetectable backdoors in machine learning models. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 931–942. IEEE, 2022.

Comment

I would like to thank the authors for their detailed rebuttal. The authors bring up some interesting points that I had not considered before, addressing why users might be willing to accept the obfuscated model. In my mind the remaining issues are that:

  1. The method is not practical because the conversion to a boolean circuit followed by iO would blow up the size of the model and the expense of inference to galactic proportions. However since this is a theoretical paper, this weakness does not count for much.
  2. The conversion back from the obfuscated circuit to a neural network seems a bit decorative. If one is interested in inference, then just evaluating the obfuscated circuit itself suffices. On the other hand, there's little reason to expect the construction given to exhibit any of the nice properties we typically attribute to neural networks (e.g. fine-tunability).

In this light, I am raising my score to a 5.

Review
Rating: 8

This paper presents a way of inserting backdoors into deep learning and language models, whereupon the resulting backdoored models are not efficiently distinguishable from other perturbed variations of a given model. The paper introduces definitions of "undetectable" and "unreplicable", and rigorously demonstrates that the constructions therein satisfy the definitions, under standard cryptographic assumptions.

Strengths

This is a wonderful paper. It's clear, it's a pleasure to read, it does a great job easing the reader into the math and gradually building up to the main result. I appreciate the clear and well-explained introduction, which builds good intuition for what is to come; starting out with a comparison with Goldwasser et al. [2022] is also very useful, since I'm sure this is on most readers' minds as soon as they see the title. Also, I don't know if "Insidious Procedure" is a standard term from cryptography, but I love it. And, of course, the main result is an important one, and this paper presents it in a convincing way. The appendix is delectable.

Weaknesses

While this is a strong theoretical result, this seems to be vanishingly unlikely to be used in practice, since it depends on indistinguishability obfuscation, which Wikipedia claims to be highly impractical in practice. Additionally, I have certain questions about the Plant function allowing too much, potentially weakening the overall result (see Question A, below).

Questions

A) Here is the main question which I have about the definition of undetectability. Correct me if I'm wrong, but with your definition, aren't undetectable backdoors very easy to construct? For instance, suppose my procedure Backdoor adds a 3-SAT verifier into the neural network, where the input variables to the circuit are taken as the output of some map from the input to [0, 1]^N (say, the parity of each digit of the input if it's an MLP, or whether each word is a noun in a language model). Then, if the 3-SAT circuit is satisfied, the model gives a bad output. The procedure Perturb does the same thing, except it inserts an unsatisfiable 3-SAT circuit, so h(x) = h'(x) \forall x \in \mathcal{X}. Efficiently being able to distinguish between h' and \tilde{h} seems impossible since this would necessitate an efficient method for being able to tell which of two 3-SAT circuits is satisfiable, and which isn't; however, the actor who inserted this backdoor could easily sell information on how to activate the backdoor in the backdoored model (make sure your third, seventh, ninth... words are nouns). [and if we want it to be non-replicable use one of the steg methods from the appendix]

If this is true, isn't allowing Perturb to be an unbounded function perhaps too powerful? Like, the backdoor described above still seems undetectable by your definition, but if someone actually looks at the weights of the model, they'll go like "whoa, what is this sus-looking boolean circuit doing here?". Ideally it seems like we'd want Perturb to either be restricted in some way, or perhaps the distribution of weights of \tilde{h}, or its computational graph, should not be super out-of-distribution for other models sampled from Train. Basically, on first glance it seems to be that the definition of undetectability might be a bit weak (but of course, I am likely missing something.)

B) This is even more of a half-baked thought than (A), but if we're relying on the existence of iO as an assumption, can't we also then just assume FHE and just have Perturb encrypt all the models so nobody knows what's going on at all

Nitpicks:

N1) On line 56, there is an extra comma after "and".

Limitations

Addressed. However, I might maybe mention more explicitly that this is not a remotely practical construction, and more of an existence proof.

Author Response

[Part 1/n]

We want to thank the reviewer for reading our paper carefully, for appreciating our results and for their constructive feedback and comments.

Q1: While this is a strong theoretical result, this seems to be vanishingly unlikely to be used in practice, since it depends on indistinguishability obfuscation, which Wikipedia claims to be highly impractical in practice. Additionally, I have certain questions about the Plant function allowing too much, potentially weakening the overall result (see Question A, below).

A1: Thank you for your feedback. As you mentioned, the primary objective of our research is to theoretically demonstrate that neural networks (NNs) and large language models (LLMs) have inherent vulnerabilities and are susceptible to the backdoor attacks outlined in our paper. Beyond our theoretical results, which are significant on their own, our work also suggests a novel pipeline for injecting backdoors. This new pipeline could naturally be implemented using practical obfuscation techniques. For instance, one could replace Indistinguishability Obfuscation (iO) with more practical variants of obfuscation tailored for NNs. We believe this is a very interesting open question arising from our framework and result, and we hope that it will inspire people in the field.

Even without replacing the obfuscation method that we use, our results are significant also because of the following:

  1. Future Practicality: The iO method is expected to become more practical in the coming years, which aligns with the future applicability of our theoretical findings.

  2. Current Importance: Given the widespread deployment of these models in decision-making processes, it is crucial to ensure they have robust security guarantees.

Comment

Thank you so much for your detailed response! This makes a lot of sense; I now think I would've realized this if I'd spent another few hours thinking about the paper before writing the review. I am increasing my score to an 8 and my confidence to a 4.

Comment

[Part 2/n]

Q2: Here is the main question which I have about the definition of undetectability. Correct me if I'm wrong, but with your definition, aren't undetectable backdoors very easy to construct? For instance, suppose my procedure Backdoor adds a 3-SAT verifier into the neural network, where the input variables to the circuit are taken as the output of some map from the input to [0, 1]^N (say, the parity of each digit of the input if it's an MLP, or whether each word is a noun in a language model). Then, if the 3-SAT circuit is satisfied, the model gives a bad output. The procedure Perturb does the same thing, except it inserts an unsatisfiable 3-SAT circuit, so h(x) = h'(x) for all x. Efficiently being able to distinguish between h' and \tilde{h} seems impossible since this would necessitate an efficient method for being able to tell which of two 3-SAT circuits is satisfiable, and which isn't; however, the actor who inserted this backdoor could easily sell information on how to activate the backdoor in the backdoored model (make sure your third, seventh, ninth... words are nouns). [and if we want it to be non-replicable use one of the steg methods from the appendix] If this is true, isn't allowing Perturb to be an unbounded function perhaps too powerful? Like, the backdoor described above still seems undetectable by your definition, but if someone actually looks at the weights of the model, they'll go like "whoa, what is this sus-looking boolean circuit doing here?". Ideally it seems like we'd want Perturb to either be restricted in some way, or perhaps the distribution of weights of \tilde{h}, or its computational graph, should not be super out-of-distribution for other models sampled from Train. Basically, on first glance it seems to be that the definition of undetectability might be a bit weak (but of course, I am likely missing something.)

A2: This is indeed a very astute observation; a construction similar to the one you define was actually the starting point of our work. First, let us clarify why the approach of injecting a SAT formula that is either satisfiable or unsatisfiable does not give a provably undetectable backdoor. According to our definition, which follows the blueprint of Goldwasser et al., a backdoor is undetectable if any (polynomially bounded) adversary cannot distinguish between an honestly generated model and one with a backdoor. If we inject a specific satisfiable formula in the honest case, then there exists a simple adversary that checks whether a hardcoded assignment satisfies the formula, and this adversary succeeds. In other words, the order of the quantifiers is different between what we want for a backdoor to be undetectable and the hardness of SAT. More precisely, for a backdoor to be undetectable we need a planting procedure that no efficient algorithm can distinguish, whereas the conjectured hardness of SAT is that there is no efficient algorithm that can solve all SAT instances. We hope this clarifies the issue but we are happy to elaborate more if needed.

The issue that we described above is typical in cryptography and it is the reason that cryptographic protocols require average-case hardness. Unfortunately, SAT is not average-case hard, so our solution to this issue is to use instead the well-studied cryptographic primitive of PRGs: instead of a SAT formula, we include a number that is either pseudorandom or truly random. Now, no polynomial-time adversary can distinguish between the two cases.

As you also correctly pointed out, this is indeed a very suspicious construction, and someone looking at the weights of the NN will be skeptical about what is going on. The main point of the construction we provide, and for using iO, is that iO (or a function similar to it) can be part of a legitimate, non-suspicious real-world training pipeline. Ideally, we want the function "Perturb" to be non-suspicious. We argue that using obfuscation is on its own not suspicious, for many reasons. Consider the following motivating examples:

  • Privacy of Training Data: If the training data includes sensitive user information, using obfuscation could help ensure that this data remains private and secure.
  • IP Protection: Companies often have proprietary models. Applying obfuscation allows these companies to publicly release their models with less risk of reverse-engineering of their models or techniques. If we allow the use of obfuscation, then our claim is "that an obfuscated version of the original model is indistinguishable from the backdoored one". Due to the general utility of obfuscation, we consider such a recipe not suspicious.

We will elaborate on these points, and we will explain what types of “Perturb” function we allow for.

Comment

[Part 3/n]

Q3: This is even more of a half-baked thought than (A), but if we're relying on the existence of iO as an assumption, can't we also then just assume FHE and just have Perturb encrypt all the models so nobody knows what's going on at all.

A3: This is also an interesting thought: your suggestion is to publish the neural network weights encrypted under an FHE scheme. Unfortunately, this approach does not directly work, since FHE will only allow us to compute an encryption of the output of the neural network on any given input. It won't be possible to get the actual output of the NN unless we have the secret key of the encryption, which would also reveal the model weights.

We will revise our manuscript according to our rebuttal. As we discussed above, currently, our theoretical result relies on an impractical construction. However, i) in the future we might have more practical iO algorithms, and ii) our construction can work in practice with methods that simulate iO but do not have the strong theoretical guarantees we obtain. Moreover, we will elaborate on the "Perturb" functions our definition allows for.

Review
Rating: 6
  • The paper studies undetectable backdoor attacks for neural networks from a theoretical perspective.

Strengths

  • The problem of undetectable backdoor insertion is important for the community.

  • The paper considers a more complicated scenario compared to prior work, as it shows undetectability under the white-box setting.

Weaknesses

  • It would be good to include a "Conclusion" section in the main body of the paper, as it currently seems to end abruptly.

  • I believe that including some empirical results would nicely complement the theoretical findings.

Questions

No

Limitations

Yes

Author Response

We want to thank the reviewer for reading our paper carefully, for appreciating our results and for their constructive feedback and comments.

Q1: It would be good to include a "Conclusion" section in the main body of the paper, as it currently seems to end abruptly.

A1: We would like to thank you for your suggestion. We will add the following Conclusion section:

Given the plethora of applications of Machine Learning in general, and neural networks in particular, questions regarding the trustworthiness of publicly released models naturally arise. In particular, before deploying a neural network we need to guarantee that no backdoors have been injected that allow bad actors to arbitrarily control the model behavior. In this paper, we investigate the existence of backdoor attacks on obfuscated neural networks which are undetectable even when given white-box access. The notion of obfuscation that we consider is the well-studied and mathematically founded indistinguishability obfuscation (iO). We also show how our techniques can inspire backdoor schemes in large language models when combined with ideas from steganography.

While our constructions are purely theoretical, we leave as an interesting direction how to use heuristic obfuscation methods to show practical instantiations of our constructions. Another interesting open question is whether cryptographic schemes weaker than iO suffice to show backdoor undetectability in the white-box model.

Q2: I believe that including some empirical results would nicely complement the theoretical findings.

A2: Thank you for your feedback. We would like to clarify that the primary objective of our research is to theoretically demonstrate that neural networks (NNs) and large language models (LLMs) have inherent vulnerabilities and are susceptible to the backdoor attacks outlined in our paper. Beyond our theoretical results, which are significant on their own, our work also suggests a novel pipeline for injecting backdoors. This new pipeline could naturally be implemented using practical obfuscation techniques. For instance, one could replace Indistinguishability Obfuscation (iO) with more practical variants of obfuscation tailored for NNs. We believe this is a very interesting open question arising from our framework and result, and we hope that it will inspire people in the field. Even without replacing the obfuscation method that we use, our results are significant also because of the following:

  1. Future Practicality: The iO method is expected to become more practical in the coming years, which aligns with the future applicability of our theoretical findings.
  2. Current Importance: Given the widespread deployment of these models in decision-making processes, it is crucial to ensure they have robust security guarantees, even considering attacks that are not implementable in the present but will become implementable in the future.

Comment

Thank you for your response. I decided to keep my score.

Final Decision

This paper offers a formal framework for comprehending the effectiveness of backdoor attacks through an analysis employing cryptographic techniques. While it is based on highly theoretical assumptions, it provides valuable insights to the community by addressing backdoor analysis from a security perspective.