Stealix: Model Stealing via Prompt Evolution
Stealix is the first model stealing attack leveraging diffusion models against image classification models without relying on human-crafted prompts.
Abstract
Reviews and Discussion
This paper introduces a method for model stealing attacks that do not require manually crafted prompts. Unlike prior approaches, which rely on predefined class names or expert knowledge to generate synthetic data, Stealix employs a genetic algorithm to iteratively refine prompts based on a victim model’s responses.
Questions for Authors
See the Other Strengths and Weaknesses part.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
All good.
Experimental Design and Analysis
See the Other Strengths and Weaknesses part.
Supplementary Material
No Supplementary Material.
Relation to Existing Literature
This paper builds upon and extends prior research in model stealing, generative adversarial techniques, and automated prompt optimization.
Missing Important References
None.
Other Strengths and Weaknesses
Strengths: Eliminates the need for manually crafted prompts or class names, making it more accessible and scalable for attackers with limited expertise. This paper proposes a more realistic threat model. The proposed proxy metric shows a strong correlation with the feature distance to the victim data.
Weaknesses: As the authors mention, this approach relies heavily on the quality of open-source generative models. While Stealix is tested on various victim model architectures (e.g., ResNet, VGG, MobileNet), the paper does not extensively explore more complex or state-of-the-art architectures (e.g., Transformers). This paper assumes that the victim model only provides hard-label outputs as a defense mechanism. However, the authors do not explore how Stealix would perform against more sophisticated defenses. The paper would benefit from a more detailed computational cost analysis.
Other Comments or Suggestions
None.
We thank the reviewer for recognizing the realism of our threat model, the scalability of our approach, and the effectiveness of our proposed proxy metric. We address these concerns below.
W1. While Stealix is tested on various victim model architectures (e.g., ResNet, VGG, MobileNet), the paper does not extensively explore more complex or state-of-the-art architectures (e.g., Transformers).
We do address this in Section 5.3: Stealing Model Based on Proprietary Data, where we apply Stealix to a real-world Vision Transformer (ViT) model trained on proprietary data and demonstrate better performance than other methods.
W2. This paper assumes that the victim model only provides hard-label outputs as a defense mechanism. However, the authors do not explore how Stealix would perform against more sophisticated defenses.
Thanks for the suggestion. Most defenses such as [1, 2] perturb the posterior prediction to reduce the utility of stolen models, while keeping the predicted class (argmax) unchanged to preserve original performance for benign users. This pushes attackers to rely on hard labels, which are less informative but immune to such perturbations. Our work directly targets this setting, where attackers proactively use hard labels to circumvent the defenses. We view exploring additional defenses as complementary and will add this discussion in the revision. We are open to evaluating Stealix against defenses that the reviewer has in mind.
[1] Taesung Lee, Benjamin Edwards, Ian Molloy, and Dong Su. "Defending against machine learning model stealing attacks using deceptive perturbations." IEEE Security and Privacy Workshops (SPW) 2019.
[2] Mantas Mazeika, Bo Li, and David Forsyth. "How to steer your adversary: Targeted and efficient model stealing defenses with gradient redirection." ICML 2022.
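For concreteness, a minimal numerical sketch (ours, not from the paper or the cited defenses) of why argmax-preserving posterior perturbations leave hard-label attackers unaffected: the perturbed probabilities change, but the predicted class, which is the only signal Stealix consumes, does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_posterior(p, eps=0.2):
    """Toy argmax-preserving defense: add noise to the posterior and
    renormalize, but keep the top-1 class unchanged for benign users."""
    noisy = p + eps * rng.random(p.shape)
    noisy /= noisy.sum()
    if noisy.argmax() != p.argmax():
        i, j = p.argmax(), noisy.argmax()
        noisy[i], noisy[j] = noisy[j], noisy[i]  # swap to restore the original top-1 class
    return noisy

p = np.array([0.55, 0.30, 0.15])            # clean victim posterior
q = perturb_posterior(p)                    # defended posterior (less useful to soft-label attackers)
print(q, p.argmax() == q.argmax())          # hard label is unchanged: True
```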
W3. The paper would benefit from a more detailed computational cost analysis.
We appreciate the reviewer's suggestion. We have reported the runtime comparison across methods in Appendix C: Comparison of Computation Time. The results show that Stealix maintains competitive computational efficiency while outperforming other baselines.
Thank you for the response. After reading other reviews and rebuttals, I decide to keep my current score.
This paper proposes a model stealing attack method, named Stealix, to steal the functionality of an image classification victim model. Stealix generates synthetic images through a diffusion model and fine-tunes the image-generation prompt based on the victim model's responses. An iterative prompt refinement and reproduction process is employed to capture the features of the training data distribution, so that the synthetic dataset is closer to the training distribution, leading to higher accuracy of the attacker's rebuilt model. Stealix enables automatic prompt selection and does not require knowledge of class names, given that a few seed images are available. Comprehensive experiments are conducted, and Stealix consistently outperformed baseline methods.
Questions for Authors
Please see the weaknesses section.
Claims and Evidence
Yes. The paper is well-organized and well-written. The experiment results seem to be convincing.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Not applicable.
Experimental Design and Analysis
The authors conduct experiments on four representative datasets and compare their proposed method to six other methods, which looks great to me. Also, an experiment on stealing a model trained on a private dataset is conducted, further enhancing the reliability of the authors' findings. Nevertheless, why is the comparison with the PEZ method deferred to the appendix and conducted under only one setting? It would be great if PEZ were also included in the comparison.
Supplementary Material
I checked the missing algorithms and Appendix C, D, F, G, H, J, K. They all look sensible.
Relation to Existing Literature
N/A.
Missing Important References
The authors mention that the proposed method degrades to PEZ (Wen et al., 2024) if the image triplet contains only the seed image. PEZ therefore seems strongly related to this paper. Could you please clarify the contributions of this work and its differences from PEZ?
Other Strengths and Weaknesses
Strengths
- The proposed method, as the authors highlighted, avoids the need for pre-defined prompts or class names to generate synthetic images. This direction of soliciting queries is promising.
- This paper is highly complete with sufficient experiments.
Weaknesses
It would be great if the authors could clarify a few questions:
- While the attacker aims to steal the entire model, Alg 1 requires a specific target class. Can you clarify which setting is used and correct any inconsistency?
- How large is the seed image set needed, and used in the experiments?
- Line 129-131, it is somewhat incorrect to claim that other methods do not utilize the victim model's outputs. My understanding is that PEZ also optimizes the prompt using the victim's responses, isn't it?
- Does the proposed method apply to stealing regression models?
Other Comments or Suggestions
Typos: Line 218, Section 4 should be Figure 3, I guess.
Ethics Review Issues
N/A
We appreciate the reviewer's recognition of this direction of soliciting queries as promising, and we're glad that the completeness and thoroughness of our experiments came through clearly. We aim to answer the questions below.
Q1. Why is the comparison with the PEZ method deferred to the appendix and conducted under only one setting? It would be great if PEZ were also included in the comparison.
We placed the PEZ comparison in the appendix due to page limits. Since PEZ is not originally a model stealing method but a prompt tuning technique, we include it as part of an ablation study, not as a baseline comparison, to isolate the impact of our proposed components: prompt refinement with victim feedback, prompt consistency, and prompt reproduction. Please see our answer to Q2 for a detailed comparison.
Q2. Could you please clearly clarify the contribution and difference of PEZ and this work?
The key difference is that PEZ optimizes prompts using only the seed image, whereas our prompt refinement reformulates prompt optimization as a contrastive loss over an image triplet, guided by the victim model's predictions. This enables Stealix to capture class-relevant features more effectively. Stealix further introduces prompt consistency as a proxy metric for evaluation and prompt reproduction using genetic algorithms, forming a complete, victim-aware model stealing framework.
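For illustration, here is a minimal sketch of the kind of triplet contrastive objective described above, using random vectors in place of embeddings from a real vision-language encoder such as CLIP; the function names and the exact form of the loss are placeholders, not the paper's implementation.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_prompt_loss(prompt_emb, seed_emb, pos_emb=None, neg_emb=None, margin=0.2):
    """Contrastive objective over an image triplet (seed, positive, negative):
    pull the prompt embedding toward the seed and the victim-confirmed positive,
    push it away from the victim-rejected negative. Missing elements are simply
    dropped, which recovers a PEZ-style seed-only objective as the degenerate case."""
    loss = -cosine(prompt_emb, seed_emb)                        # match the seed image
    if pos_emb is not None:
        loss -= cosine(prompt_emb, pos_emb)                     # reward class-consistent synthesis
    if neg_emb is not None:
        loss += max(0.0, margin + cosine(prompt_emb, neg_emb))  # penalize misleading features
    return loss

# Random stand-ins for embeddings that would come from a CLIP-like encoder.
rng = np.random.default_rng(1)
prompt_emb, seed_emb, pos_emb, neg_emb = rng.normal(size=(4, 512))
print(triplet_prompt_loss(prompt_emb, seed_emb, pos_emb, neg_emb))
```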
W1. While the attacker aims to steal the entire model, Alg 1 requires a specific target class. Can you clarify which setting is used and correct any inconsistency?
We apologize for the inconsistency and will correct it in the paper. The target class is not required, as we steal the entire model: Stealix iterates over all classes to collect synthetic images. We will revise Algorithm 1 to wrap Lines 3–25 in a loop over each class, move Line 26 outside the loop with an updated description stating that the model is trained on the image sets of all classes, and update the algorithm input accordingly to reflect that it processes all classes rather than requiring a specified target class.
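For concreteness, a minimal Python sketch of the revised control flow described above; all helper functions here (init_prompts, evolve_prompt, synthesize, query_victim, update_population, train_attacker) are illustrative stand-ins rather than the paper's code.

```python
import random

# Hypothetical stand-ins for the real components (diffusion model, victim API, training loop).
def init_prompts(seed):                       return [f"prompt-from-{seed}"]
def evolve_prompt(population):                return random.choice(population) + " evolved"
def synthesize(prompt, n=10):                 return [f"img({prompt},{i})" for i in range(n)]
def query_victim(images):                     return [random.randint(0, 9) for _ in images]
def update_population(pop, prompt, imgs, y):  return pop + [prompt]
def train_attacker(class_image_sets):         return sum(len(v) for v in class_image_sets.values())

def stealix_all_classes(classes, seeds, budget_per_class):
    """Revised Algorithm 1 control flow: the per-class prompt evolution (original
    Lines 3-25) is wrapped in a loop over all classes, and the attacker model is
    trained once, outside the loop, on the union of all class image sets."""
    class_image_sets = {}
    for c in classes:                                  # new outer loop over classes
        images, spent = [], 0
        population = init_prompts(seeds[c])
        while spent < budget_per_class:                # budget checked once per iteration
            prompt = evolve_prompt(population)         # refinement + genetic reproduction
            batch = synthesize(prompt)                 # diffusion-model generation
            labels = query_victim(batch)               # hard-label feedback (counts toward the budget)
            spent += len(batch)
            images.extend(zip(batch, labels))
            population = update_population(population, prompt, batch, labels)
        class_image_sets[c] = images
    return train_attacker(class_image_sets)            # Line 26 moved outside the class loop

print(stealix_all_classes(range(3), {c: f"seed{c}" for c in range(3)}, budget_per_class=30))
```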
W2. How large is the seed image set needed, and used in the experiments?
We use only a single seed image per class in all our experiments, as noted in Line 205 (left column).
W3. Line 129-131, it is somewhat incorrect to claim that other methods do not utilize the victim model's outputs. My understanding is that PEZ also optimizes the prompt using the victim's responses, isn't it?
We are sorry for the confusion: we do not claim that PEZ and other methods do not utilize the victim model's outputs. More precisely, they use the victim's predictions only during attacker model training; in contrast, Stealix additionally uses them for optimizing prompts (refinement, consistency check, and reproduction). PEZ, as detailed in our response to Q2, does not use victim responses to optimize the prompt, as the class of the seed image is known. We will update the paper to clarify this point.
W4. Does the proposed method apply to stealing regression models?
Stealix can potentially be extended to regression tasks. For example, during prompt refinement, a low regression error, such as a low mean squared error (MSE), could be interpreted as "positive" feedback and a high error as "negative," similar to our classification setting. This would allow the triplet-based optimization and the prompt consistency metric to operate analogously.
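A minimal sketch of this hypothetical mapping, with an illustrative error threshold that is not part of the paper:

```python
import numpy as np

def regression_feedback(victim_preds, intended_targets, tol=0.1):
    """Map the victim's regression error to the positive/negative feedback used in
    prompt refinement: synthesized images whose victim output lands close to the
    intended target value act as positives, the rest as negatives (tol is illustrative)."""
    errors = np.abs(np.asarray(victim_preds) - np.asarray(intended_targets))
    return ["positive" if e <= tol else "negative" for e in errors]

print(regression_feedback([0.42, 1.30], [0.40, 0.40]))   # ['positive', 'negative']
```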
Thank you for your response. The authors have addressed most of my concerns. I decided to keep my score but lean to accept.
This paper introduces Stealix, a new model stealing method that leverages images synthesized from diffusion models to steal victim models. Compared with existing diffusion model-based model stealing attacks, the key improvement is that Stealix can automatically construct attack prompts for the stealing-image generation, thus eliminating the need for human-crafted prompts. Experiments demonstrate that Stealix enhances both query efficiency and stolen model performance in black-box query scenarios.
Update after rebuttal
After reading the rebuttal, I think this paper has novel results, but the authors need to improve their presentation (especially Algorithm 1) to make the paper clearer. So I decide to keep my current score but lean toward acceptance.
The authors should update their paper according to my and the other reviewers' reviews.
Questions for Authors
See Weaknesses & Suggestions & Questions.
Claims and Evidence
There are many technical details in this paper that need to be further clarified. (See Weaknesses & Suggestions & Questions)
Methods and Evaluation Criteria
The query budget comparison in Table 1 might not be fair. (See Weaknesses & Suggestions & Questions)
Theoretical Claims
N/A
Experimental Design and Analysis
See Weaknesses & Suggestions & Questions.
Supplementary Material
I have checked part of the supplementary material to find some experimental details but unfortunately failed to find them. (See Weaknesses & Suggestions & Questions)
Relation to Existing Literature
N/A
Missing Important References
N/A
Other Strengths and Weaknesses
Strengths:
- The attack is conducted in a very strict black-box setting, which I appreciate.
- The idea of automatically constructing prompts for generating model stealing images is promising.
Weaknesses & Suggestions & Questions:
- In Algorithm 1, are you performing Stealix with only a single class? Is that really effective? I would be interested in seeing the performance when stealing the model with samples from multiple classes.
- In Algorithm 1, Line#5 constructs the initial sample set from the seed, positive, and negative sets. However, according to Line#3, both the positive and negative sets are initialized as empty sets. Wouldn't this result in the initial sample set being empty, so that the overall Algorithm 1 could not continue (because the for-loop in Line#8 could never start)? Please clarify.
- The query budget comparison in Table 1 may not be fair. Unlike other methods listed, the proposed Stealix method requires further querying the victim model during the prompt-constructing stage, as it needs to repeatedly query the victim model with newly synthesized images (see Lines#19-20 in Algorithm 1). I suggest the authors provide the exact equation for calculating the overall query budget and list all related hyperparameters in a single table for clarity.
- The experiments only consider a single victim model backbone (i.e., ResNet-34), which I think is insufficient to demonstrate the effectiveness of Stealix. I suggest including additional experiments on ResNet-like/CLIP-like victim backbones.
- In Algorithm 1, Lines#9-11 are redundant and can be removed.
Other Comments or Suggestions
Please note that while I give a score of 3 (Weak Accept), it actually means that I consider this a borderline paper. As such, my final score will be based on the authors' response. If my concerns are not addressed, I will decrease the score accordingly.
We thank the reviewer for appreciating our strict black-box threat model and recognizing the novelty of our automatic prompt construction approach for model stealing. We answer their questions below.
Q1. In Algorithm 1, are you performing Stealix with only a single class? Is that really effective? I would be interested in seeing the performance when stealing the model with samples from multiple classes.
We apologize for the mistake in Algorithm 1 and will fix it in the revised paper. To clarify, Stealix considers all classes simultaneously. Algorithm 1 illustrates the process (Lines 3–25) for a single class, but in practice it is applied to all classes in parallel. After processing all classes, the generated images are collected and used to train the attacker model A, as outlined in the method overview (Lines 201–203, left column).
We will revise Algorithm 1 to wrap Lines 3–25 in a loop over each class, move Line 26 outside the loop with an updated description stating that the model is trained on the image sets of all classes, and update the algorithm input accordingly to reflect that it processes all classes rather than requiring a specified target class.
Q2. Wouldn't the initially empty positive and negative sets result in an empty initial sample set?
We will revise the algorithm to fix the notation. While the positive and negative sets are initially empty, the seed set is not, so the initial sample set is populated from the seed images. For generality, we allow any of the elements in the triplet to be null. Our prompt optimization supports learning from as little as a single image (Equation 3), ensuring the process works in the early stages, when positive and/or negative samples are unavailable.
Q3. The query budget comparison in Table 1 may not be fair. Unlike other methods listed, the proposed Stealix method requires further querying the victim model during the prompt-constructing stage, as it needs to repeatedly query the victim model with newly synthesized images (see Lines#19-20 in Algorithm 1). I suggest the authors provide the exact equation for calculating the overall query budgets and list all related hyperparameters in a single table for clarity.
Thanks for the suggestion. We clarify that Table 1 presents a fair comparison, because all images synthesized in Lines#19-20 of Algorithm 1 are included in training the attacker model, with each prompt synthesizing a fixed number of images (Line 15 in Algorithm 1). The full query budget therefore corresponds exactly to the queries made during prompt construction, as noted in Line 200 (left column).
Taking CIFAR-10 in Table 1 as an example, dividing the total per-class budget by the number of queries per prompt (Line 15 of Algorithm 1) yields 50 prompts per class. Across 10 classes, every query in this budget also produces a training image for the attacker model, so the total query budget is the same as that used for the other methods.
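As a worked example of this bookkeeping (only the 50 prompts per class and the 10 classes are taken from the response; the per-prompt image count below is a hypothetical placeholder):

```python
# Illustrative budget bookkeeping for CIFAR-10. images_per_prompt is an assumed value.
images_per_prompt = 100                                    # queries (= training images) per prompt -- assumption
prompts_per_class = 50
num_classes = 10

budget_per_class = images_per_prompt * prompts_per_class  # every query also yields a training image
total_queries = budget_per_class * num_classes
print(budget_per_class, total_queries)                    # 5000 per class, 50000 total (under this assumption)
```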
Q4. The experiments only consider a single victim model backbone (i.e., ResNet-34), which I think is insufficient to demonstrate the effectiveness of Stealix. I suggest including additional experiments on ResNet-like/CLIP-like victim backbones.
We clarify that our paper does include comparisons across multiple victim model architectures. Specifically, Appendix H provides results with different victim backbones, and Appendix G covers variations in attacker model architectures. For both setups, we study two ResNet variants, one VGG, and one MobileNet. Additionally, in Section 5.3, we demonstrate Stealix's effectiveness against a Vision Transformer (ViT)-based victim model trained on proprietary data.
Q5. In Algorithm 1, Lines#9-11 are redundant and can be removed.
Removing Lines 9–11 would allow the consumed budget to exceed the total allowed budget, since the budget counter is updated within the inner loop (Line 17). Nevertheless, we appreciate the reviewer's suggestion and will revise the algorithm to test the budget constraint only once.
Thanks to the authors for their rebuttal. After reading the rebuttal, I decided to keep my current score (but tend to accept).
Please update your paper according to my and the other reviewers' reviews.
The paper proposes a new method for model stealing attacks on computer vision classification models. Specifically, they note that prior work uses a pretrained text-to-image generator to synthesize images similar to the victim data. However, this step requires the attacker to have the knowledge to craft useful prompts, an assumption the paper claims is often not met in more specialized domains.
Hence, they introduce Stealix, which uses genetic algorithms to find the right prompt to synthesize useful images for model stealing. Specifically, they optimize a useful prompt under a contrastive loss, using features extracted by a vision-language model from the prompt itself, and further improve the prompt with a genetic algorithm that uses a proxy metric as the fitness function.
They compare their method to different methods from the literature, all of which rest on stronger assumptions about the attacker. Across 4 datasets, they find Stealix to work better (higher accuracy on the victim model's test set). They also provide qualitative results, showing that images synthesized with Stealix are more similar to the original, real data.
Questions for Authors
- Can the authors provide some examples of optimized prompts? For instance, it would be nice to have the optimal prompt for each class label determined by Stealix, such as in Table 7. Additionally, can you give some examples of how the prompts change during the evolution? This would shed more light on why Stealix works so well compared to human-crafted prompts.
- The main argument for the proposed method is when the task is highly specialized and requires specific prompts. What is the authors' intuition for why the attack works so much better than the baselines on a simple dataset like CIFAR-10? What do the prompts look like in this case?
- Do you think the same attack method/philosophy would work for other kinds of models? (e.g. image segmentation, text classification).
- How would stealix perform on datasets with even more classes?
Claims and Evidence
Yes. They claim their method removes an assumption made by previous work on model stealing, and convincingly show that their attack still works on a variety of setups, and even improves upon other attacks.
Methods and Evaluation Criteria
Yes
Theoretical Claims
NA
Experimental Design and Analysis
Yes, the experiments seem to make sense.
Supplementary Material
No.
Relation to Existing Literature
They specifically focus on model stealing for image classification models. They identify that many methods rely on certain expertise or knowledge of the attacker to craft useful prompts to generate useful images for training the proxy model. They remove this assumption, providing a way to craft useful prompts automatically, and find model stealing to work better than previous work, especially for more specialized datasets.
Missing Important References
NA
Other Strengths and Weaknesses
Strengths:
- The paper identifies an assumption which prior methods make (an attacker being able to identify useful prompt to synthesize images to train the proxy model), argues that this assumption is not always met in practice and offers an effective method as a solution.
- While complicated, the method is explained very carefully and formalized well.
- The method outperforms previous methods although these are based on stronger attacker assumptions.
- A substantial amount of ablations give confidence in the method and the quality of the work.
Weaknesses:
- Limited interpretability insights into why this method works so well (for suggestions, see the questions).
Other Comments or Suggestions
NA
We thank the reviewer for recognizing our contribution in addressing a key limitation of prior work, and for their appreciation of our method's clarity, effectiveness, and thorough evaluation. We answer the questions below. The reviewer can try the provided prompts in the Stable Diffusion 2.1 demo. Note that the demo may use a different generation setup from the one in our experiments.
Q1. Can authors provide some examples of optimized prompts?
We provide examples of optimized prompts in the following table and will include them in the revised paper. The table showcases prompts corresponding to the high Prompt Consistency (PC) values as described in Appendix E. One immediate observation is that the optimized prompts are not always interpretable to humans, echoing our motivation that human-crafted prompts may be suboptimal for model performance. Moreover, our approach often supplements class-specific details that may be overlooked by humans. For example, gps crop emphasizes geospatial context for AnnualCrop, jungle suggests dense vegetation for Forest, and floodsaved, port, and bahamas convey water-related cues for River and SeaLake. These examples illustrate how Stealix uncovers latent features that the victim learns.
| Class | Prompt with high PC |
|---|---|
| AnnualCrop | sdc ngc icular gps crop scaled farming pivot plane ⃣@ seen piszurich t colton 2 |
| Forest | colombian seva jungle spectral रrgb visible sp detected slicresolution ��ि xxl sdk |
| River | nxt nav nasa ifclearer ouk floodsaved immensalzburg port overlooking salzburg deplo_ thumbs |
| SeaLake | fiawec apurwreck bahamas visible sli(!) rh sd usaf calf y infront nearby visible usaf |
Q2. Additionally, can you give some examples of how the prompts change during the evolution? This would shed more light on why Stealix works so well compared to human-crafted prompts.
We provide an example to illustrate prompt evolution. In Figure 5, the seed image for the "Person" class includes a prominent dog, leading to the first-generation prompt — "chilean vaw breton cecilia hands console redux woodpecker northwestern beagle sytracker collie relaxing celticsped" — which generates dog images and results in prompt consistency (PC) of 0. Stealix then uses the misclassified image as a negative example and refines the prompt to — "syrian helene pasquspock hands thumbcuddling sheffield stuck smritihouseholds vulnerable kerswednesday humormindy intestin" — removing dog-related features and achieving PC = 1. This example shows how Stealix evolves prompts by filtering out misleading features using victim feedback. We will revise the paper to include this example for better clarity.
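For readers who prefer code, here is a generic sketch of the evolutionary loop described above, with prompt consistency as the fitness proxy; the crossover/mutation operators and the stand-in generator/victim are illustrative, not the paper's implementation.

```python
import random

def prompt_consistency(prompt, synthesize, query_victim, target_class, n=16):
    """Fitness proxy: the fraction of images synthesized from `prompt` that the
    victim classifies as the target class (no access to victim data required)."""
    images = synthesize(prompt, n)
    return sum(label == target_class for label in query_victim(images)) / n

def reproduce(parents):
    """Toy crossover/mutation over token lists; the real operators differ."""
    a, b = random.sample(parents, 2)
    cut = random.randrange(1, min(len(a), len(b)))
    child = a[:cut] + b[cut:]
    if random.random() < 0.3:
        child[random.randrange(len(child))] = "<mutated>"
    return child

# Toy usage with a stand-in generator and victim (random hard labels).
synth = lambda p, n: [f"img{i}" for i in range(n)]
victim = lambda imgs: [random.choice([0, 1]) for _ in imgs]
population = [["a", "photo", "of", "a", "person"], ["crowded", "street", "scene"]]
population.append(reproduce(population))
scores = [prompt_consistency(" ".join(p), synth, victim, target_class=1) for p in population]
print(max(zip(scores, population), key=lambda t: t[0]))   # the fittest prompt survives to the next generation
```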
Q3. What is the authors' intuition for why the attack works so much better than the baselines on a simple dataset like CIFAR-10? What do the prompts look like in this case?
Thank you for the question. As discussed in the Diversity comparison (Line 377, right column) and shown in Table 3, the better performance stems from the greater diversity in our synthetic data, enabled by prompt evolution. E.g., one optimized prompt for the cat class — "punisher desktop kittens siamese beef twins personality bosnicorgi schnautuxedo 일tuxedo satellite consecutive desktop" — includes fine-grained categories such as "siamese" and relational cues like "twins" to encourage multiple distinct instances. Additionally, terms like "desktop" provide varied contextual environments. In contrast, simple prompts like "a photo of a cat" tend to produce less diverse images. These diverse prompts are generated automatically by Stealix, without requiring a human in the loop.
Q4. Do you think the same attack method/philosophy would work for other kinds of models? (e.g. image segmentation, text classification).
It might be possible to generalize to other tasks. For image segmentation, recent work [1] shows that prompts can guide image generation toward specific segmentation layouts, suggesting that prompt refinement based on mask consistency could be a feasible direction. For text classification, the image generator could be replaced with a language model, with prompts optimized to generate inputs aligned with what the classifier has learned.
[1] Yumeng Li, Margret Keuper, Dan Zhang, and Anna Khoreva. "Adversarial supervision makes layout-to-image diffusion models thrive." ICLR 2024.
Q5. How would stealix perform on datasets with even more classes?
Stealix is expected to scale to more classes. We optimize prompts per class and treat all non-target classes as negatives, ensuring focus on class-specific features. Unlike baselines that generate images without considering predicted classes, Stealix actively steers synthesis toward the intended class.
This paper considers a more realistic setting for model stealing attacks, where there is no requirement for prompt design or knowledge of class names. The reviewers found this setting reasonable and realistic. The paper proposes a new method, Stealix, to perform model stealing under this setting. Stealix refines prompts with a genetic algorithm to synthesize images for model stealing. The experiments demonstrate the effectiveness of the proposed method.
Overall, the paper studies a more realistic setting for model stealing attacks, proposes a new algorithm, and conducts extensive experiments to verify its effectiveness. All reviewers agree to accept this paper. Thus, I would like to recommend Accept.