Data-Free Model Extraction for Black-box Recommender Systems via Graph Convolutions
A black-box model extraction method under data-free settings for recommender systems.
Abstract
Reviews and Discussion
This paper proposes a new method for data-free model extraction via graph convolutions. This is an adversarial attack setting in which an adversary wants to extract the model or parameters of a recommender system via system queries.
Strengths and Weaknesses
-- New approach to the problem and results that seem positive.
-- This topic, while relevant enough to the venue, is fairly niche. There have only been a few papers on the problem across several years. I'm not totally convinced as to how realistic of an attack setting this is, which may be related to the obscurity of the topic.
-- Relatively small datasets, not certain about the scalability of the approach.
-- The member data rates and query rates seem to be unrealistic in an attack setting.
Questions
-- Naive question, but what is meant by "member data rate" in what's a "data free" setting? I would have thought only queries were impossible and that data is not available, though I assume this is a misunderstanding on my part.
-- The member data and query rates seem not realistic from an attack perspective. Can you describe a realistic attack setting in which the rates presented here are reasonable? How does the method work for very small rates?
-- The datasets look fairly small. What is the scalability profile of the method on larger datasets?
Limitations
yes
Final Justification
The authors addressed my clarification questions to some extent, and, calibrating against the other reviews, I have increased my score slightly.
Formatting Issues
None
Thank you for your review and for acknowledging the novelty and relevance of our work. Below, we provide detailed responses to clarify our method and address your questions:
Q1: What is meant by "member data rate" in what's a "data free" setting? I would have thought only queries were impossible and that data is not available, though I assume this is a misunderstanding on my part
A1: Member data, i.e., the data used by companies to train their recommender models, is typically inaccessible, as it is considered proprietary and commercially sensitive. The "data-free" setting refers to the assumption that attackers do not have access to such internal training data. Accordingly, we do not use any member data in our method.
Q2: The member data and query rates seem not realistic from an attack perspective. Can you describe a realistic attack setting in which the rates presented here are reasonable? How does the method work for very small rates?
A2: Thank you for raising this concern. We provide the following clarifications:
[Clarification on member data and query rates] We believe this question refers to the investigation study section. In that part, we perform a systematic analysis to examine how varying the member data rate (when accessible) and the query budget affect the performance of model extraction. At the data level, this helps to understand the dependency of extraction effectiveness on information availability; at the model level, we also study how different model architectures respond to these factors.
[On realistic rates] Access to member data is generally unrealistic in real-world attack settings, which is why our proposed method focuses on the data-free scenario. The member data/query rates are used purely for comparative and analytical purposes.
Q3: The datasets look fairly small. What is the scalability profile of the method on larger datasets?
A3: Thank you for the suggestion. We have added a larger dataset, Steam (334,730 users and 13,047 items), to our experiments. The results further demonstrate the effectiveness and robustness of our proposed method.
| Dataset | Victim | DFME | Sim4Rec | DBGRME |
|---|---|---|---|---|
| Steam | BPR | 0.0600 | 0.0916 | 0.2553 |
| | NCF | 0.1296 | 0.1599 | 0.2201 |
| | GCMC | 0.3401 | 0.3037 | 0.3552 |
| | NGCF | 0.1098 | 0.1142 | 0.3265 |
| | LGN | 0.3649 | 0.3903 | 0.5662 |
[1] Wang Y, Su J, Chen C, et al. Sim4Rec: Data-Free Model Extraction Attack on Sequential Recommendation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(12): 12766-12774.
[2] Yue Z, He Z, Zeng H, McAuley J. "Black-box attacks on sequential recommenders via data-free model extraction." Proceedings of the 15th ACM Conference on Recommender Systems. 2021.
Dear reviewer:
Thank you for your time and contribution in reviewing our paper. We noticed that you acknowledged the rebuttal. We would appreciate it if you could let us know whether our response addressed your concerns. We’d be happy to engage in further discussion.
Yes your response satisfactorily addresses the clarification questions that I'd asked. Though given that I've already entered a positive score which is mostly consistent with other reviews, I haven't updated my score.
This paper focuses on model extraction attacks in recommender systems. Since existing methods often rely on unrealistic assumptions about data accessibility, the authors propose Data-free Black-box Graph convolution-based Recommender Model Extraction (DBGRME). The core of DBGRME is an interaction generator that compensates for the lack of member data, together with a Generalization-Aware Graph Convolution-based Surrogate Model that captures complex recommendation interaction patterns. Finally, the effectiveness of the proposed strategy is demonstrated on multiple datasets and recommendation models.
Strengths and Weaknesses
Strengths:
- The data accessibility and generalization that this work focuses on are two core issues of model extraction.
- The proposed interaction generator avoids dependence on sensitive member data.
- The experimental results show that the proposed method seems to be effective.
Weaknesses:
- The superiority of graph convolution-based recommender models in the empirical experiments on three recommendation models is unconvincing. Are there similar findings on more models? Or analyze its advantages theoretically?
- Although the paper claims to attack under the "black box" setting and limited queries, it is usually difficult to obtain the top-k recommendation list in the strict real world, which limits the applicability of the proposed attack. The paper should discuss the performance of the attack under more stringent query responses.
- How is the query budget selected by default? From Table 6, this choice is strange.
- Minor: Yelp in Line 321 should be ML-1M?
- Are the conclusions of the ablation experiments and sensitivity analysis on the Yelp and Gowalla datasets similar? I understand the space limit, but it is recommended to add it in the appendix.
- Although the baseline methods in the experiments are related, it would be more convincing if they could be compared with the recent model stealing method [1].
[1] Wang, Yihao, et al. "Sim4Rec: Data-Free Model Extraction Attack on Sequential Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 12. 2025.
Questions
Please refer to "Strengths And Weaknesses".
Limitations
More discussion on limitation is needed.
Final Justification
The authors' response addresses my major concerns regarding the experimental setup and evaluation.
Formatting Issues
N/A
Thank you very much for your thoughtful and constructive feedback. We appreciate your recognition of our method's motivation and technical design. We respond to each point below.
Q1: More investigation study experiments:
A1: Thank you for your suggestion. We reached this conclusion through extensive experiments combined with theoretical analysis. Specifically:
[Empirical Evidence] We have conducted comprehensive cross-model extraction experiments on BPR, NCF, GCMC, NGCF, and LightGCN across several datasets. Due to the character limit, we provide the PTD results with a 50% member data rate on ML-100K. The results clearly show that graph convolution-based models demonstrate superior performance.
[Theoretical Analysis] As discussed on Page 4, Lines 150–156, BPR, NCF, and LightGCN represent a progressively enhanced modeling hierarchy. BPR is a classical matrix factorization model; NCF builds on it by incorporating MLPs to capture nonlinear user-item interactions; and LightGCN further introduces graph convolutional layers, enabling the model to better exploit high-order collaborative signals. This progression not only reflects increasing modeling capacity but also empirically supports the superiority of graph convolution-based recommender models in model extraction tasks.
| Target/Surrogate | BPR | NCF | GCMC | NGCF | LightGCN |
|---|---|---|---|---|---|
| BPR | 0.2718 | 0.3822 | 0.5504 | 0.4480 | 0.6073 |
| NCF | 0.2740 | 0.4558 | 0.5806 | 0.5000 | 0.6617 |
| GCMC | 0.2628 | 0.3452 | 0.5199 | 0.4553 | 0.5462 |
| NGCF | 0.2893 | 0.4109 | 0.5772 | 0.4970 | 0.6723 |
| LightGCN | 0.2875 | 0.4032 | 0.5835 | 0.4882 | 0.6714 |
Q2: Why assume access to the top-k recommendation list:
A2: For this important point, our setting is both theoretically grounded and practically feasible:
[Theoretical basis] The top-k feedback setting is consistent with existing model extraction studies [1,2].
[Practical case analysis] In real-world recommendation platforms, such as social recommendation systems (e.g., X) or multimedia platforms like YouTube, users can easily access hundreds or even thousands of recommended items, whether by entering a search query or browsing the homepage. Therefore, obtaining top-k recommendation lists is both practical and realistic in real deployment scenarios.
Q3: How is the query budget selected:
A3: As this work is the first to study data-free black-box model extraction in recommender systems that relies solely on user and item identifiers as input features, we adopt a stringent and practical definition: each new user history interaction list (not each new user) submitted to the victim model is counted as a separate query. Accordingly, we empirically select a query budget that balances extraction performance against cost. A detailed analysis is provided in Sec 5.3.
Q4: Minor: Yelp in Line 321 should be ML-1M?
A4: Thank you for pointing this out. It is indeed a typo and will be corrected.
Q5: More experiments on other datasets.
A5: Thank you for the suggestion.
[Experiment results] Due to page limits, we present representative results on the additional datasets. It can be observed that the conclusions of the ablation studies and sensitivity analyses on the remaining datasets are consistent with those reported in the main paper.
[Revision plan] Considering the paper length and readability during submission, we included results on a subset of datasets in the main text. We will incorporate the necessary experimental results for the remaining datasets into the final Appendix to provide a more comprehensive presentation.
[Ablation Study]
| Dataset | Victim | Black-box R@50 | w/o GA R@50 | w/o GA Agr@50 | w/o constraint R@50 | w/o constraint Agr@50 | w/o GA&constraint R@50 | w/o GA&constraint Agr@50 | DBGRME R@50 | DBGRME Agr@50 |
|---|---|---|---|---|---|---|---|---|---|---|
| Yelp | BPR | 0.0833 | 0.0234 | 0.0656 | 0.0242 | 0.0676 | 0.0231 | 0.0653 | 0.0235 | 0.0703 |
| | NCF | 0.0777 | 0.0228 | 0.1757 | 0.0200 | 0.1952 | 0.0217 | 0.1637 | 0.0232 | 0.2004 |
| | GCMC | 0.1059 | 0.0389 | 0.1201 | 0.0401 | 0.1422 | 0.0361 | 0.1084 | 0.0618 | 0.1942 |
| | NGCF | 0.1065 | 0.0408 | 0.1233 | 0.0424 | 0.1552 | 0.0394 | 0.1168 | 0.0652 | 0.2872 |
| | LightGCN | 0.1266 | 0.0814 | 0.3426 | 0.1046 | 0.5855 | 0.0731 | 0.3216 | 0.1037 | 0.6062 |
| Gowalla | BPR | 0.1857 | 0.0616 | 0.0587 | 0.0646 | 0.0673 | 0.0632 | 0.0580 | 0.0654 | 0.0704 |
| | NCF | 0.1853 | 0.0415 | 0.0518 | 0.0636 | 0.0744 | 0.0414 | 0.0510 | 0.0659 | 0.0763 |
| | GCMC | 0.2079 | 0.0894 | 0.1319 | 0.1329 | 0.2162 | 0.0799 | 0.1165 | 0.1350 | 0.2252 |
| | NGCF | 0.2408 | 0.1504 | 0.2412 | 0.1609 | 0.2694 | 0.1413 | 0.2201 | 0.1662 | 0.2828 |
| | LightGCN | 0.2758 | 0.1770 | 0.4529 | 0.1874 | 0.5389 | 0.1716 | 0.4395 | 0.2177 | 0.6154 |
[Budget Analysis] Yelp:
| Budget | 3M | 5M | 7M | 9M | 11M | 13M | 15M |
|---|---|---|---|---|---|---|---|
| DBGRME | 0.5448 | 0.5674 | 0.5803 | 0.6062 | 0.6108 | 0.6131 | 0.6168 |
| Random-DBGRME | 0.4903 | 0.5375 | 0.5397 | 0.5751 | 0.5792 | 0.5823 | 0.5854 |
Gowalla:
| Budget | 1M | 2M | 3M | 4M | 5M | 6M | 7M |
|---|---|---|---|---|---|---|---|
| DBGRME | 0.4476 | 0.5365 | 0.5703 | 0.6154 | 0.6176 | 0.6209 | 0.6228 |
| Random-DBGRME | 0.4425 | 0.4957 | 0.5077 | 0.5871 | 0.5874 | 0.5987 | 0.6023 |
Q6: Comparison with Sim4Rec (AAAI 2025)
A6: Thank you for the suggestion. We have conducted a comparison with Sim4Rec. The results further demonstrate the effectiveness of DBGRME. Although Sim4Rec (published on April 11, 2025) falls under the official NeurIPS definition (cutoff March 1, 2025) of "Contemporaneous Work", we consider its topical relevance and plan to include the comparison results in the final Appendix for completeness.
| Dataset | Victim | DFME | Sim4Rec | DBGRME |
|---|---|---|---|---|
| ML-100K | BPR | 0.2181 | 0.2292 | 0.2976 |
| | NCF | 0.3034 | 0.3161 | 0.3197 |
| | GCMC | 0.2579 | 0.2725 | 0.4707 |
| | NGCF | 0.3373 | 0.3938 | 0.4891 |
| | LGN | 0.5803 | 0.4388 | 0.7727 |
| ML-1M | BPR | 0.0609 | 0.2618 | 0.2664 |
| | NCF | 0.2911 | 0.2520 | 0.3185 |
| | GCMC | 0.2982 | 0.2787 | 0.3467 |
| | NGCF | 0.2626 | 0.2401 | 0.2773 |
| | LGN | 0.2775 | 0.2845 | 0.6047 |
| Yelp | BPR | 0.0593 | 0.0446 | 0.0703 |
| | NCF | 0.1470 | 0.1108 | 0.2004 |
| | GCMC | 0.1504 | 0.1054 | 0.1942 |
| | NGCF | 0.1462 | 0.1898 | 0.2872 |
| | LGN | 0.2830 | 0.2653 | 0.6062 |
| Gowalla | BPR | 0.0697 | 0.0677 | 0.0704 |
| | NCF | 0.0584 | 0.0487 | 0.0763 |
| | GCMC | 0.1528 | 0.2034 | 0.2252 |
| | NGCF | 0.1237 | 0.1395 | 0.2828 |
| | LGN | 0.3665 | 0.3998 | 0.6154 |
[1] Yue Z, He Z, Zeng H, McAuley J. "Black-box attacks on sequential recommenders via data-free model extraction." Proceedings of the 15th ACM Conference on Recommender Systems. 2021.
[2] Wang Y, Su J, Chen C, et al. Sim4Rec: Data-Free Model Extraction Attack on Sequential Recommendation. Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(12): 12766-12774.
Thanks to the author's detailed response, most of my questions have been addressed. I will consider adjusting the score before the deadline. Furthermore, for models that only have recommendation results but lack top-k results, is the proposed strategy feasible? Could the author further discuss its feasibility?
Thank you for your follow-up comment. If the system only returns a single recommendation result, i.e., a top-1 recommendation list, our proposed method remains feasible: we can obtain more recommendation sequences through consecutive queries to support surrogate training. Moreover, in most real-world scenarios, such as social media, multimedia platforms, and e-commerce systems, users can typically access a large number of recommended items by browsing, often far exceeding top-100, making our strategy practical in many cases.
We sincerely appreciate the reviewer for raising this important point, and we will emphasize this discussion in the revised version. We are also very grateful for all the valuable feedback, which has helped improve the quality of our work. Finally, we truly appreciate your kind support and consideration to raise the score for our submission.
Thanks for the response. I have no more questions. I have adjusted my score accordingly.
The paper focuses on the issue of data-free, black-box model extraction in recommender systems. It presents a novel approach, DBGRME (Data-free Black-box Graph Convolution-based Recommender Model Extraction), to overcome limitations in previous methods. DBGRME utilizes two key components: an interaction generator that alleviates the need for member data and a graph convolution-based surrogate model that improves generalization and mitigates overfitting. The method is validated through experiments on multiple datasets and various victim models.
Strengths and Weaknesses
Strength:
- The introduction of the surrogate model and the interaction generator addresses the key challenge of model extraction in data-free black-box settings.
- The paper includes experiments demonstrating DBGRME’s superior performance.
Weakness:
- My concern is the lack of comparison with some SOTA works [1][2][3].
- Is the query budget set too large compared to the existing works?
- Some common dataset settings (Steam and Beauty) are missing.
- I suggest the authors add some discussions about the possible defense methods.
[1] Wang Y, Su J, Chen C, et al. Sim4Rec: Data-Free Model Extraction Attack on Sequential Recommendation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(12): 12766-12774.
[2] Wang C, Sun J, Dong Z, et al. Data-free knowledge distillation for reusing recommendation models[C]//Proceedings of the 17th ACM Conference on Recommender Systems. 2023: 386-395.
[3] Lin Z, Xu K, Fang C, et al. Quda: Query-limited data-free model extraction[C]//Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security. 2023: 913-924.
Questions
Please see the weaknesses.
Limitations
NA
Final Justification
The authors have addressed all of my concerns. I hope the authors can include the relevant experimental details and definitions in the revised version.
Formatting Issues
NA
Thank you for your thoughtful review and constructive suggestions. We appreciate your recognition of our method's motivation. We have carefully addressed each of them in our point-by-point response below.
Q1: More comparison works:
A1:
[Paper Review] We analyzed the three suggested papers from multiple perspectives, including task type and research domain, as shown in the table below. Among them, Sim4Rec is the most recent and closely related to our work; therefore, we included it in our comparative experiments.
| Method | Task Type | Domain |
|---|---|---|
| DBGRME(ours) | Model Extraction | Recommendation |
| Sim4Rec[1] | Model Extraction | Recommendation |
| DFKD[2] | Model Distillation | Recommendation |
| Quda[3] | Model Extraction | Computer Vision |
[Comparison experiments] Based on the comparison results, it can be observed that Sim4Rec performs between DFME and DBGRME, further highlighting the superiority of DBGRME. Moreover, although Sim4Rec (published on April 11, 2025) falls under the official NeurIPS definition (cutoff March 1, 2025) of "Contemporaneous Work", we consider its topical relevance and plan to include the comparison results in the final Appendix for completeness.
| Dataset | Victim | DFME | Sim4Rec | DBGRME |
|---|---|---|---|---|
| ML-100K | BPR | 0.2181 | 0.2292 | 0.2976 |
| | NCF | 0.3034 | 0.3161 | 0.3197 |
| | GCMC | 0.2579 | 0.2725 | 0.4707 |
| | NGCF | 0.3373 | 0.3938 | 0.4891 |
| | LGN | 0.5803 | 0.4388 | 0.7727 |
| ML-1M | BPR | 0.0609 | 0.2618 | 0.2664 |
| | NCF | 0.2911 | 0.2520 | 0.3185 |
| | GCMC | 0.2982 | 0.2787 | 0.3467 |
| | NGCF | 0.2626 | 0.2401 | 0.2773 |
| | LGN | 0.2775 | 0.2845 | 0.6047 |
| Yelp | BPR | 0.0593 | 0.0446 | 0.0703 |
| | NCF | 0.1470 | 0.1108 | 0.2004 |
| | GCMC | 0.1504 | 0.1054 | 0.1942 |
| | NGCF | 0.1462 | 0.1898 | 0.2872 |
| | LGN | 0.2830 | 0.2653 | 0.6062 |
| Gowalla | BPR | 0.0697 | 0.0677 | 0.0704 |
| | NCF | 0.0584 | 0.0487 | 0.0763 |
| | GCMC | 0.1528 | 0.2034 | 0.2252 |
| | NGCF | 0.1237 | 0.1395 | 0.2828 |
| | LGN | 0.3665 | 0.3998 | 0.6154 |
Q2: Query budget setting comparison:
A2: Thank you for your question. Compared to existing works, our query budget is actually relatively modest. Specifically:
[Budget definition] Firstly, we would like to clarify that there exists a difference between the query budget definitions in prior work (e.g., [1,4]) and ours (“per user/history interaction query”). Specifically, prior works typically count each user as one query, regardless of the update of interaction lists (refer to the description in Sec 4.1.1 of [4]). In contrast, we adopt a stricter definition: each new interaction list, even from the same user, is counted as a separate query. This leads to a more conservative and realistic budget estimation.
[Comparison with existing work] Under our strict definition, recommendation model extraction methods such as [4] incur a budget of up to 1M on ML-1M (refer to the description in Sec 4.1.1 and Sec 5.2.1 of [4], i.e., 5000 budgets × 200 sequence length), which significantly exceeds our defined budget of 400k.
Q3: Other common datasets
A3:
Thanks for your advice. To further strengthen the evaluation, we compared various public datasets by interaction scale and sparsity. Based on this comparison, and as time permitted, we conducted experiments on the Steam dataset, which contains significantly more users and interactions. The results on another dataset will be posted during the discussion stage. We will include the corresponding results in the final Appendix.
| Dataset | #Users | #Items | #Interactions | Density |
|---|---|---|---|---|
| ML-100K | 944 | 1,683 | 100,000 | 0.0630 |
| ML-1M | 6,040 | 3,416 | 1,000,209 | 0.0447 |
| Yelp | 31,668 | 38,048 | 1,561,406 | 0.0013 |
| Gowalla | 29,858 | 40,981 | 1,027,370 | 0.0008 |
| Beauty | 40,226 | 54,542 | 400,000 | 0.0002 |
| Steam | 334,730 | 13,047 | 3,700,000 | 0.0008 |
| Dataset | Victim | DFME | Sim4Rec | DBGRME |
|---|---|---|---|---|
| Steam | BPR | 0.0600 | 0.0916 | 0.2553 |
| | NCF | 0.1296 | 0.1599 | 0.2201 |
| | GCMC | 0.3401 | 0.3037 | 0.3552 |
| | NGCF | 0.1098 | 0.1142 | 0.3265 |
| | LGN | 0.3649 | 0.3903 | 0.5662 |
Q4: Discussions about possible defense:
A4:
Thank you for the suggestion. Due to space constraints, we briefly discussed potential defense strategies on Page 9, Lines 360–364. Following your valuable feedback, we will expand this discussion from three perspectives (detection, model-level defense, and legal & ethical considerations), and include a more comprehensive analysis in the Appendix.
Here, we discuss the future extension of potential countermeasures from three perspectives: detection-based defense, model-level defense, and legal & ethical considerations.
[Detection-based defense] Recent attacks [1,4] often rely on autoregressive querying or generator-driven interaction synthesis. Monitoring query frequency, interaction entropy, and behavioral anomalies can help detect suspicious users. Platforms may then apply query rate limiting or introduce noise to hinder extraction.
[Model-level defense] This approach involves injecting perturbations into the model's representation or output space to mislead attackers. However, in hard-label recommendation settings, such defenses require careful calibration to maintain model accuracy while degrading extraction effectiveness.
[Legal & ethical considerations] In addition to technical defenses, platform operators should implement responsible deployment practices such as strict API access control, policy enforcement, and legal safeguards. Raising public awareness and establishing regulatory frameworks can further mitigate the misuse of recommendation models.
Moreover, this study is conducted on benchmark datasets with the aim of improving system robustness and promoting awareness of potential vulnerabilities.
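As a purely hypothetical illustration of the detection-based idea above (not part of the paper), a platform could monitor the entropy of each account's query stream and flag statistically abnormal patterns; the thresholds below are invented for illustration:

```python
import math
from collections import Counter

def query_entropy(item_ids):
    """Shannon entropy (bits) of one account's queried-item distribution."""
    counts = Counter(item_ids)
    n = len(item_ids)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def is_suspicious(item_ids, low=1.0, high=10.0):
    # Hypothetical thresholds: flag streams that are near-deterministic
    # (replayed probes) or near-uniform (random synthetic interactions).
    h = query_entropy(item_ids)
    return h < low or h > high

organic = [1, 1, 2, 3, 1, 2, 5, 1]  # skewed, human-like history
probe = [7] * 50                    # a single item replayed 50 times
print(is_suspicious(probe), is_suspicious(organic))  # True False
```

In practice such a monitor would be combined with rate limiting; the point is only that generator-driven query streams tend to be distributionally unlike organic user behavior.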
[1] Wang Y, Su J, Chen C, et al. Sim4Rec: Data-Free Model Extraction Attack on Sequential Recommendation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(12): 12766-12774.
[2] Wang C, Sun J, Dong Z, et al. Data-free knowledge distillation for reusing recommendation models[C]//Proceedings of the 17th ACM Conference on Recommender Systems. 2023: 386-395.
[3] Lin Z, Xu K, Fang C, et al. Quda: Query-limited data-free model extraction[C]//Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security. 2023: 913-924.
[4] Yue Z, He Z, Zeng H, McAuley J. "Black-box attacks on sequential recommenders via data-free model extraction." Proceedings of the 15th ACM conference on recommender systems. 2021.4.
I would like to thank the authors for their responses, which have addressed some of my concerns. However, I still have the following questions:
- Could you provide the query budget compared to the existing works based on the strict definition and the previous definition?
- Why is the performance improvement the most significant on LGN?
Thank you for your continued feedback. We would like to respond to your comments point by point as follows.
Q1: Could you provide the query budget compared to the existing works based on the strict definition and the previous definition?
A1:
[Budget Definition and Calculation]
- Previous Definition: num_users
- Our Strict Definition: num_query * num_users * num_iters
Among them, num_users refers to the number of synthesized users, num_query denotes the number of queries per iteration, and num_iters indicates the total number of iterations involving querying. Hence, the budget comparison (numbers in brackets denote the agreement metric Agr@50) is:
| Budget Type | DFME | Sim4Rec | DBGRME-200k | DBGRME |
|---|---|---|---|---|
| Previous | 5k(0.2775) | 1k(0.2845) | 1k(0.5259) | 1k(0.6047) |
| Strict (Ours) | 1M (0.2775) | 200k+(0.2845) | 200k(0.5259) | 400k(0.6047) |
Unlike the previous definition, which counts each user as a single query, our strict definition counts every submitted interaction list as an independent query, even when multiple lists originate from the same user.
Hence, we have key observations:
- Under the previous definition, all methods are compared at a similar budget level (1k/5k). Even in this case, our DBGRME significantly outperforms others (0.6047 vs. 0.2845).
- Under our strict definition, Sim4Rec and our DBGRME-200k share the same 200k query budget, yet our method outperforms Sim4Rec by a large margin (0.5259 vs. 0.2845).
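To make the two accounting schemes concrete, the totals can be computed directly; only the totals (1k users, 400k strict queries) come from the table above, while the split of 400k into num_query × num_iters below is an illustrative assumption:

```python
def previous_budget(num_users):
    # Prior definition: each synthesized user counts as one query,
    # no matter how often its interaction list is updated.
    return num_users

def strict_budget(num_users, num_query, num_iters):
    # Strict definition: every submitted interaction list counts,
    # even when it comes from the same user.
    return num_users * num_query * num_iters

# Hypothetical factorization: 1,000 synthetic users, 4 queries per
# iteration, 100 querying iterations -> 400k strict-definition queries,
# while the previous definition would report only 1k.
print(previous_budget(1_000))        # 1000
print(strict_budget(1_000, 4, 100))  # 400000
```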
Q2: Why is the performance improvement the most significant on LGN?
A2: We believe the underlying reason lies in the inherent vulnerability of the LGN architecture. The LGN model relies solely on basic graph convolution, without incorporating non-linear transformations or attention mechanisms. Compared to other architectures, it is shallower and more linear, making its recommendation behavior more transparent and easier to approximate. As a result, attackers can more effectively infer its internal mechanisms and parameter patterns from limited feedback. These factors contribute to the most significant performance improvement observed on LGN.
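To illustrate why LGN's behavior is comparatively easy to approximate: its propagation is just repeated multiplication by the normalized adjacency matrix followed by a layer average, with no learned transformation weights or activations. A minimal NumPy sketch (illustrative only, not the paper's implementation):

```python
import numpy as np

def lightgcn_embeddings(adj_norm, emb0, num_layers=3):
    """LightGCN-style propagation: each layer is one linear smoothing step
    over the user-item graph; the final embedding averages all layers."""
    embs = [emb0]
    e = emb0
    for _ in range(num_layers):
        e = adj_norm @ e           # purely linear graph convolution
        embs.append(e)
    return np.mean(embs, axis=0)   # layer-averaged final embeddings

# Toy bipartite graph: 2 users + 2 items, symmetric degree normalization.
A = np.array([[0, 0, 1, 0],
              [0, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
deg = A.sum(axis=1)
adj_norm = A / np.sqrt(np.outer(deg, deg))
emb0 = np.random.default_rng(0).normal(size=(4, 8))
final = lightgcn_embeddings(adj_norm, emb0)
print(final.shape)  # (4, 8)
```

Because the whole map from initial embeddings to scores is linear, limited top-k feedback constrains the model's behavior more directly than for architectures with non-linear layers.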
Moreover, we are truly grateful for your valuable and constructive feedback, which has significantly contributed to enhancing our work. Thank you again for your positive and engaged response. We sincerely hope that you will consider revisiting your score in light of our response and clarifications.
I am grateful for the authors' detailed responses, which have addressed all of my concerns. Hence, I will adjust my scores accordingly.
The paper tackles model-extraction attacks on recommender systems under the strictest data-free black-box setting: the adversary sees only top-k outputs and must generate all queries. The authors first run a careful diagnostic and observe that graph-convolution surrogates generalise better than architecture-matched ones. Guided by that insight they build DBGRME, which couples: 1) an interaction generator that synthesises user–item pairs and, GAN-style, maximises disagreement with the victim, and 2) a generalisation-aware LightGCN-like surrogate with frozen user embeddings and an ℓ2 constraint on item updates to curb over-fitting.
On four public datasets, DBGRME outperforms prior data-free baselines and, on average, even beats partial-data+query (PTQ) by 17 % Agr@50 on LightGCN. An ablation verifies that both the GA module and the constraint are useful. Overall, the attack pipeline is novel, and empirical gains are convincing.
Strengths and Weaknesses
Strengths
- Data-free is the worst case for defenders and mirrors scenarios where member data are protected by privacy law.
- The initial cross-architecture study motivates the choice of graph surrogates instead of blindly copying the victim architecture, an idea that may transfer to other domains.
- Provides an end-to-end view; gradients through discrete sampling are handled via re-parameterisation
- Comprehensive ablations and evaluation : Five victims spanning MF, deep MLP and GCN families, four datasets of varying sparsity, plus budget sweeps give a robust empirical picture.
- Ablations and hyper-params show thoughtful tuning. Code is to be released, aiding reproducibility .
Weaknesses
- Query efficiency: budgets of 4-9M queries (Gowalla/Yelp) are two orders of magnitude above typical commercial rate limits. It is unclear how applicable these results are to realistic settings.
- The interaction generator requires extensive training (100 iterations with 1,000 synthetic users per loop), implying over 10^8 forward passes. Is that quite expensive?
- The paper lacks discussion of attack detectability and potential countermeasures.
Questions
- Would be good to provide some wall clock time or compute utilisation for generator cost
- The primary metric (Agr@50) treats all items equally, whereas practical scenarios prioritize high-rank agreement. Additional metrics such as high-rank agreement (e.g., Agr@10) would strengthen the evaluation.
Limitations
Yes
Formatting Issues
None
We sincerely thank you for your positive and encouraging feedback on our work. We truly appreciate your recognition of our scenario design, methodology, and experimental setting. Below, we would like to respond to your comments point by point.
Q1: Query budgets discussion.
A1:
[Comparison to existing work and discussions] Firstly, prior studies typically count each user as one query, regardless of how many times their interaction list changes (see Sec 4.1.1 of [1]). In contrast, we adopt a stricter and more realistic approach: each new interaction list, even from the same user, is treated as a separate query. Under this definition, methods like [1] incur a budget of up to 1M on ML-1M (5000 users × 200 interactions), which is significantly larger than our setting of 400k. Moreover, as this is the first attempt to explore data-free model extraction for recommender systems based on free user/item ID embeddings, our current setup may not fully match industrial-scale deployment, but we believe it provides a solid and conservative foundation for understanding query efficiency in data-free recommendation model extraction.
Q2: Lacks discussion of attack detectability and potential countermeasures:
A2:
Thank you for the suggestion. Due to space constraints, we briefly discussed potential defense strategies on Page 9, Lines 360–364. Following your valuable feedback, we will expand this discussion from three perspectives (detection, model-level defense, and legal & ethical considerations), and include a more comprehensive analysis in the Appendix.
Here, we discuss the future extension of potential countermeasures from three perspectives: detection-based defense, model-level defense, and legal & ethical considerations.
[Detection-based defense] Recent attacks [1,2] often rely on autoregressive querying or generator-driven interaction synthesis. Monitoring query frequency, interaction entropy, and behavioral anomalies can help detect suspicious users. Platforms may then apply query rate limiting or introduce noise to hinder extraction.
[Model-level defense] This approach involves injecting perturbations into the model's representation or output space to mislead attackers. However, in hard-label recommendation settings, such defenses require careful calibration to maintain model accuracy while degrading extraction effectiveness.
[Legal & ethical considerations] In addition to technical defenses, platform operators should implement responsible deployment practices such as strict API access control, policy enforcement, and legal safeguards. Raising public awareness and establishing regulatory frameworks can further mitigate the misuse of recommendation models.
Moreover, this study is conducted on benchmark datasets with the aim of improving system robustness and promoting awareness of potential vulnerabilities.
Q3: Computation cost about generator:
A3: Thank you for raising this point. Regarding the mentioned "100 iterations with 1,000 synthetic users per loop": in our setting, the total budget is at most 10^5 forward passes (100 × 1,000), not 10^8. Furthermore, if each full batch of 1,000 synthetic users is processed as a single forward pass, the total is only 100 forward passes. Therefore, we believe the overall computational cost of the generator remains manageable in practice.
Q4: High-rank agreement:
A4: Thanks for your advice. We have re-evaluated the high-rank agreement (Agr@10) between the trained surrogate model and the victim model. The results further confirm the effectiveness of DBGRME. We would like to add this to the final Appendix.
| Dataset | Method | BPR | NCF | GCMC | NGCF | LGN |
|---|---|---|---|---|---|---|
| ML-100K | random-DBGRME | 0.1546 | 0.1128 | 0.2177 | 0.1395 | 0.5965 |
| | DFME | 0.1370 | 0.1189 | 0.1372 | 0.2159 | 0.3353 |
| | DBGRME | 0.1895 | 0.1672 | 0.2584 | 0.2854 | 0.6285 |
| ML-1M | random-DBGRME | 0.0874 | 0.2090 | 0.1690 | 0.1477 | 0.4378 |
| | DFME | 0.0410 | 0.1989 | 0.1210 | 0.1506 | 0.2109 |
| | DBGRME | 0.1424 | 0.2176 | 0.1896 | 0.1675 | 0.4725 |
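For reference, the agreement metric can be sketched as follows; we assume Agr@k is the per-user overlap between the victim's and surrogate's top-k lists averaged over users (a common definition — the paper's exact formula may differ):

```python
def agr_at_k(victim_topk, surrogate_topk, k):
    """Average per-user overlap between victim and surrogate top-k lists.

    victim_topk / surrogate_topk: dict mapping user id -> ranked item list.
    Assumed definition: |top-k intersection| / k, averaged over users.
    """
    total = 0.0
    for user, v_items in victim_topk.items():
        s_items = surrogate_topk.get(user, [])
        total += len(set(v_items[:k]) & set(s_items[:k])) / k
    return total / len(victim_topk)

# Toy check with two users and k = 3.
victim = {0: [5, 1, 9], 1: [2, 7, 3]}
surrogate = {0: [1, 5, 4], 1: [8, 6, 0]}
print(round(agr_at_k(victim, surrogate, 3), 4))  # user 0: 2/3, user 1: 0 -> 0.3333
```

Under this definition, Agr@10 weights only the head of the ranking, which is why it serves as the high-rank variant reported above.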
[1] Yue Z, He Z, Zeng H, McAuley J. "Black-box attacks on sequential recommenders via data-free model extraction." Proceedings of the 15th ACM Conference on Recommender Systems. 2021.
[2] Wang Y, Su J, Chen C, et al. Sim4Rec: Data-Free Model Extraction Attack on Sequential Recommendation[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2025, 39(12): 12766-12774.
This submission proposes DBGRME, a data-free black-box model extraction method for recommender systems using graph convolutions. The paper received consistently positive reviews.
The work addresses two fundamental limitations in existing model extraction research: unrealistic data accessibility assumptions and generalization problems with architecture-constrained surrogate models. The key insight that graph convolution-based surrogates outperform architecture-matched ones is novel and well-validated across multiple victim models and datasets.
The authors demonstrated exceptional responsiveness during the rebuttal period, addressing all major reviewer concerns including SOTA comparisons (adding Sim4Rec), scalability (adding Steam dataset), and query budget realism (providing detailed comparative analysis). Multiple reviewers explicitly stated that their concerns were satisfied and increased their scores accordingly.
While the attack scenario may seem niche, the problem is important for understanding recommender system security, and the work establishes a new research direction with both theoretical insights and practical techniques.
I recommend ACCEPT.