ObCLIP: Oblivious CLoud-Device Hybrid Image Generation with Privacy Preservation
We introduce ObCLIP, a cloud-device hybrid oblivious image generation framework that achieves private Stable Diffusion inference (protecting the sensitive attributes in user prompts) with an enhanced efficiency-utility trade-off.
Abstract
Reviews and Discussion
The paper tackles text-to-image generation via diffusion models with a focus on improving privacy and maintaining efficiency in a hybrid cloud-client generation pipeline. To preserve privacy, the paper proposes to obfuscate sensitive attributes in the text prompts (e.g., concerning gender, age, ethnicity, etc.) by generating a set of augmented prompts where the sensitive attributes are randomly replaced. To offset the heavy increase in compute cost such a simple approach would cause, the paper uses a combination of hybrid generation (where some steps are performed in the cloud and some on a device on a smaller model) and techniques to leverage redundancy in the diffusion process on the server side (i.e., due to similar semantics among the set of augmented prompts and redundancy among different diffusion steps). Experiments are performed on various combinations of server/client models on MS-COCO and MJHQ, and the proposed approach is compared primarily to baselines consisting of full server/client-side generation and existing hybrid approaches.
Strengths and Weaknesses
Strengths:
- The paper is well presented and easy to follow.
- The approach to increase privacy through prompt augmentation is sensible and straightforward. Techniques to leverage redundancy in the resulting set of prompts (some of which appear to be novel) also make sense and appear effective in accelerating server-side computation and reducing the negative impact on overall compute.
Weaknesses:
- A bit unclear what the novel technical contributions are. Many of the techniques used for improving the efficiency of the system appear to be from existing works (i.e., the hybrid processing and leveraging temporal redundancy in the diffusion process). The paper appears to be more a combination of existing techniques to a large extent (which is ok, but limits technical contributions).
- Comparison to HE-Diffusion and other methods listed in Table 1 is lacking. It is unclear why only latency is considered for HE-Diffusion in Table 2. It also appears from Table 1 that the MPC methods would be preferable (though I might misunderstand the meaning of some of the columns).
- The qualitative examples shown are not very convincing in my opinion. For example, it is not obvious that one of the examples in Fig. 3b is clearly better than the other (both show obvious defects).
- Though the approach is sensible (i.e., obfuscation by augmentation + improve efficiency via hybrid processing), it is also rather straightforward, and the paper therefore does not provide a lot of novel insights.
- Given that the obfuscation mechanism appears to be a key contribution, I found that there is limited detail and evaluation of the approach. It appears to be rule-based and relies on some prior model to find sensitive attributes. This seems potentially rather limiting, and I have some doubts about the generality of it. For example, while Fig. 8 highlights "young" as a sensitive attribute (age), the information that the person should be 18 years old is not highlighted (though it is likewise referring to age).
Questions
It would be great if the authors could address some of my concerns listed in the weaknesses above. In particular, I would like to know better about:
- How the method compares to prior work like HE-Diffusion?
- What are the novel technical contributions?
- What are the limitations of the obfuscation approach to identify and replace sensitive attributes?
Limitations
I mentioned some potential technical limitations above. I don't see any negative societal impacts from the work.
Final Justification
After considering the extensive author response (which addresses several of my concerns) and the other reviews, I decided to raise my rating to "borderline accept". I think the paper is sufficiently valuable and interesting to the community to warrant publication.
Formatting Issues
We sincerely thank Reviewer pbWf for the positive feedback on "Techniques to leverage redundancy ... (some of which appear to be novel) also make sense and appear effective" and the helpful review. We address the concerns raised and respectfully hope that our clarifications will help prompt a positive rating of the paper.
Q1: Explanation for the comparison to cryptographic methods.
Apologies for the misunderstanding. We first clarify that the comparison with HE-Diffusion is presented in Section 5.2 on page 7. We report only the latency because the original HE-Diffusion paper (in Table 23) only provides runtime numbers for efficiency and implicit similarity metrics for utility (to measure the error between plaintext and encrypted generations), but lacks a full utility evaluation. Due to the considerable computation overhead of HE-Diffusion (one denoising step takes over 100 seconds), we were unable to conduct a complete evaluation. During the rebuttal, we found a recent HE-based paper [1] that uses plaintext inference to approximate the utility. We adopt this methodology, which should provide a utility upper bound for these two HE-based works, as it avoids the approximation errors introduced by encrypted inference. We provide the supplementary evaluation results in the table below. We take the FID score reported in [1] due to the lack of source code. MPC-based approaches do offer strong privacy and somewhat comparable utility with delicate approximation methods. However, they are typically orders of magnitude slower (due to heavy cryptographic computations). For example, as shown in the table below, HE-Diffusion is far slower than ObCLIP, and [1] is slower still. This drawback greatly hinders the practicality of MPC-based approaches. In contrast, ObCLIP achieves significantly better efficiency with comparable utility.
| Generation Method | | FID | Latency (s) |
|---|---|---|---|
| Non-private | SD-v1.4 | 13.86 | 1.12 |
| | BK-SDM-small | 18.30 | 0.78 |
| | Hybrid SD () | 14.29 | 0.55 |
| | Hybrid SD () | 15.48 | 0.28 |
| Private | HE-Diffusion | | 106 |
| | [1] | | 660 |
| | ObCLIP | 15.73 | 0.57 |
| | ObCLIP | 16.45 | 0.33 |
Regarding the qualitative comparison in Table 1, to better highlight the trade-offs, we update the table with four levels of performance: High, Medium, Low, and Very Low. Specifically, MPCViT focused on secure inference of vision transformer, which is different from our approach (U-Net based diffusion). Therefore, we do not include quantitative comparison to this work. Note that, due to LaTeX syntax incompatibility, we here use textual indicators instead of circular symbols. We hope this resolves your concern and promise to update the table in the final version.
| Method | | | Domain | Privacy | Server Cost | Utility |
|---|---|---|---|---|---|---|
| Non-private | Standalone | Server-Only | Text-to-Image | Low | Low | High |
| | | Client-Only | Text-to-Image | High | High | Low |
| | Hybrid | Hybrid SD | Text-to-Image | Low | Medium | Medium |
| Private | MPC | MPCViT | Text-to-Image | High | Very Low | Medium |
| | | HE-Diffusion | Text-to-Image | High | Very Low | Medium |
| | | [1] | Text-to-Image | High | Very Low | ~High |
| | Hybrid | ObCLIP | Text-to-Image | High | Medium | Medium |
[1]: Secure Diffusion Model Unlocked: Efficient Inference via Score Distillation
Q2: Clarification on the novel contributions.
We argue that in caching-based acceleration, the core challenge lies in devising a caching strategy that effectively balances efficiency and utility. Specifically in this paper, we incorporate both temporal- and batch- redundancy-based accelerations designed for the client-server hybrid generation paradigm. For temporal redundancy, we first analyze server-side redundancy across self-attention, cross-attention, and block outputs, as shown in Figure 6. Guided by these insights, we develop a multi-granularity acceleration strategy that further reduces overhead compared to prior work. For example, different from T-Gate[1], we cache both the self-attention and cross-attention in the server-side early denoising (T-Gate only caches self-attention). In addition, we propose a novel batch-redundancy-based acceleration technique that reduces the total attention computation by a factor of . By jointly leveraging both temporal and batch redundancy, our approach achieves a better efficiency–utility trade-off, which is thus non-trivial. We hope this approach offers useful insights for future work, particularly in real-world high-concurrency inference services, where similar forms of redundancy across different user requests could be exploited.
We also appreciate the reviewer’s positive feedback that "the approach is sensible." We believe that, despite its simplicity, our method offers an effective and practical solution for private image generation while maintaining good utility.
[1]: Faster Diffusion via Temporal Attention Decomposition. 2025
Q3: Clarification on the obfuscation approach.
Thanks for your careful reading. We provide the detailed design of the obfuscation mechanism in Algorithm 1 in Appendix B. We clarify that it mainly involves two steps: (i) detecting sensitive tokens, and (ii) constructing the corresponding candidate set. The first step, also known as named entity recognition (NER), uses an existing technique [1] in this work. One limitation is that we depend on the underlying model to detect sensitive tokens. We argue that this could be of independent interest. For example, there are recent works that use LLMs for NER [2], which yield much better performance.
Regarding Fig. 8, we apologize for the insufficient elaboration due to page limit. In this figure, the parentheses indicate an example age (e.g., "18-year-old") to represent the "young" category. For general tasks that require age as input, we discretize age into three categories: young (18–35), middle-aged (36–59), and old (60+). We hope this resolves your concern and will include a more detailed explanation in the final version.
[1] Isotonic. Distilbert fine-tuned for ai4privacy task, 2025.
[2] UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. 2023
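The age discretization described in the Q3 response can be sketched as a small helper. This is only an illustration: the function name `age_category` is hypothetical, and rejecting ages below 18 is an assumption, since the response does not say how such inputs are handled.

```python
def age_category(age: int) -> str:
    """Map a numeric age to the three buckets named in the response.

    Buckets: young (18-35), middle-aged (36-59), old (60+).
    Assumption: ages below 18 fall outside the stated buckets, so we reject them.
    """
    if age < 18:
        raise ValueError("age below 18 is not covered by the stated buckets")
    if age <= 35:
        return "young"
    if age <= 59:
        return "middle-aged"
    return "old"
```

Under this mapping, a prompt mentioning an "18-year-old" would be treated as belonging to the "young" category before obfuscation.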
Q4: Examples in Fig. 3b
We acknowledge that as shown in the figure, even server-only generation yields obvious defects. However, server-first hybrid generation is clearly more influenced by server-side guidance. This supports our design choice to offload the early denoising stages to the server, enabling stronger semantic alignment and improved generation quality.
Dear Reviewer pbWf,
Thank you once again for your thorough and constructive review of our submission. We have carefully considered your constructive comments and sincerely hope that our responses have adequately addressed the concerns you raised.
Please kindly let us know if you have any follow-up concerns in the author-reviewer discussion period.
Best regards
Dear Reviewer pbWf,
We hope this message finds you well. We noticed that you submitted the acknowledgement but did not leave any comment. As the discussion is nearing its end with about two days remaining, please kindly let us know if there are any unresolved points you'd like us to consider. Your insights are invaluable to us, and we are eager to address any remaining issue to improve our work.
Thank you sincerely for your time and effort in reviewing our paper.
Best regards,
The paper proposes ObCLIP to achieve privacy protection in image generation under a cloud-device hybrid architecture. The method combines obfuscation across multiple prompts with an oblivious computing mechanism to construct a cloud-device collaborative privacy-enhancing framework, which obfuscates and hides the user's generation intent (the prompt) and the computing process to resist privacy leakage risks. ObCLIP consists of components such as the local processing module on the device side and the cloud-side collaborative generation module. Experiments show that ObCLIP can effectively reduce the privacy leakage risks of user prompts and generation processes while maintaining the quality of image generation.
Strengths and Weaknesses
Pros:
- The paper is clearly written.
- Uses a cloud–device hybrid pipeline where the server only handles partial denoising and the client completes the rest, cutting server load without compromising generation quality.
- Integrates cache-based accelerations exploiting temporal and batch redundancies to drastically reduce server-side computation with minimal impact on utility.
Cons:
- Insufficient innovation in core privacy mechanism. The paper relies on generating multiple prompts to obfuscate the real prompt as the core privacy method. Relying solely on prompt-level obfuscation, this single-dimensional strategy shows limited innovation in resisting attacks that exploit prompt correlations or sequences.
- The paper uses single-dimensional attributes (e.g., age, gender) for illustration. Real-world privacy risks involve multi-attribute correlations and temporal prompt patterns. Limiting experiments to simple cases fails to validate defense against complex threats, weakening applicability in vertical domains.
- Limited generative model generalization: The paper conducts experiments mainly on specific Stable Diffusion models, while not extending to other mainstream generative architectures (such as Transformer-based DALL-E-like models or hybrid-structured generative framewo
Questions
- The experiments mainly use single-dimensional attributes like age and gender for illustration. But real-world privacy risks often involve multi-attribute correlations and temporal prompt patterns.
- Could you extend your experiments to more complex scenarios involving multi-attribute correlations? For instance, in a healthcare domain where patient information includes age, gender, medical history, and treatment records, how would your method perform in protecting privacy against potential threats that exploit correlations among these attributes? Including such experiments would strengthen the applicability of your work in vertical domains.
- The experiments are mainly conducted on specific Stable Diffusion models, without extending to other mainstream generative architectures such as Transformer-based DALL-E-like models or hybrid-structured generative frameworks.
- What are the key differences between Stable Diffusion models and other mainstream generative architectures like DALL-E-like models that might affect the performance of your privacy protection method? A detailed comparison and analysis would be beneficial.
Limitations
yes
Final Justification
The rebuttal addressed the issues I was concerned about, and most of them have been resolved. I will maintain my "borderline accept" rating.
Formatting Issues
No formatting issues found.
We thank the reviewer 7DQm for the positive feedback on "ObCLIP can effectively reduce the privacy leakage risks of user prompts" and thoughtful reviews, which is quite encouraging for us. We hereby answer the specific questions below and hope the explanations resolve your concerns.
Q1: Experiments for multi-dimensional attributes
We clarify that our experiments actually encompass multi-dimensional attributes. To ensure a comprehensive evaluation, we include experiments involving two-attribute combinations (gender × age) and three-attribute combinations (gender × age × ethnicity), presented respectively in Table 2 and Table 6 in the Appendix. The results consistently demonstrate the effectiveness of ObCLIP across these attribute combinations.
We summarize the utility data in the table below, with ObCLIP yielding performance comparable to the non-private baselines. Besides, compared to the private baseline vanilla OG, ObCLIP is more efficient (and even orders of magnitude more efficient than cryptographic methods like HE-Diffusion). For the healthcare scenario, attributes such as treatment records can be pre-discretized similarly (e.g., headache, cancer). The generation process can then be conducted as described in this paper by leveraging different attribute combinations.
| Generation Method | | 2-Attribute (gender × age) | | | 3-Attribute (gender × age × ethnicity) | | |
|---|---|---|---|---|---|---|---|
| | | FID | IS | CLIP | FID | IS | CLIP |
| Non-Private | Realistic Vision v4.0 | 113.39 | 5.32 | 0.3215 | 111.87 | 4.78 | 0.3322 |
| | small-sd | 118.19 | 5.11 | 0.2980 | 115.86 | 5.02 | 0.3034 |
| | Hybrid SD () | 116.15 | 5.05 | 0.3117 | 113.57 | 5.02 | 0.3108 |
| | Hybrid SD () | 114.05 | 5.02 | 0.3226 | 112.75 | 5.04 | 0.3214 |
| Private | Vanilla OG | 113.39 | 5.32 | 0.3215 | 4.78 | 4.78 | 0.3322 |
| | ObCLIP | 116.15 | 5.05 | 0.3117 | 113.57 | 5.02 | 0.3108 |
| | cache | 117.29 | 5.00 | 0.3091 | 114.07 | 5.04 | 0.3083 |
| | reuse | 113.92 | 4.87 | 0.3076 | 113.30 | 4.66 | 0.3077 |
| | ObCLIP | 114.05 | 5.02 | 0.3226 | 112.75 | 5.04 | 0.3214 |
| | cache | 115.65 | 5.02 | 0.3174 | 113.31 | 5.03 | 0.3171 |
| | reuse | 109.76 | 4.94 | 0.3152 | 110.22 | 4.62 | 0.3138 |
Q2: Insufficient innovation in core privacy mechanism.
We clarify that our method obfuscates all sensitive attributes within the real prompt, rather than only single-dimensional attributes. This design explicitly addresses potential attribute correlations that could otherwise lead to information leakage, thereby ensuring perfect indistinguishability. This approach is applicable to realistic scenarios involving diverse and inherently multi-dimensional sensitive attributes. Despite its simplicity, our method offers an effective and practical solution for private image generation while maintaining good utility.
Moreover, as discussed in the Conclusion section, the obfuscation mechanism could be further optimized by incorporating differential privacy-based private subset sampling, which offers a better privacy-utility trade-off. Notably, this direction is orthogonal to our current approach and fully compatible with it. We would leave it as future work.
Q3: Limited generative model generalization
We argue that this paper focuses on U-Net based diffusion models, which is consistent across all compared baselines. Furthermore, we have conducted extensive experiments on a diverse set of stable diffusion models, including SD-v1.4, Realistic Vision, SDXL-1.0, and LCM-SDXL, to ensure a comprehensive evaluation. We believe these results sufficiently demonstrate the effectiveness of our proposed approach.
Regarding the extension to other Transformer-based diffusion models, we provide a theoretical discussion as follows. The primary difference lies in the denoiser backbone—Transformer-based models replace the U-Net architecture. Nevertheless, both architectures rely heavily on self-attention and cross-attention modules. Our cache-based acceleration primarily targets these attention modules, and therefore, the proposed method remains potentially applicable to Transformer-based models, provided that temporal and batch-wise redundancies still exist. Notably, temporal redundancy has already been observed in prior work[1], supporting the feasibility of our approach in this broader context. This is actually orthogonal to this work and we leave a comprehensive study into Transformer-based architectures as future work.
[1]: Faster diffusion via temporal attention decomposition.
Dear Reviewer 7DQm,
Thank you once again for your thorough and constructive review of our submission. We have carefully considered your constructive comments and sincerely hope that our responses adequately addressed the concerns you raised.
Please kindly let us know if you have any follow-up concerns in the author-reviewer discussion period.
Best regards
Thank you for your thorough response and the clarifications provided. Your efforts in addressing my concerns through the rebuttal have resolved most issues. Accordingly, I will maintain my "borderline accept" rating.
This paper introduces a novel cloud-device collaborative framework for image generation. By splitting the diffusion process between the device and the cloud, the framework enhances prompt privacy while reducing server-side costs. To further protect private prompts, the paper proposes an oblivious prompt transformation and a local feature extraction strategy. Additionally, two acceleration techniques are developed to reduce computational overhead.
Strengths and Weaknesses
Strengths:
S1: The paper presents a novel and well-motivated cloud-device collaborative framework for image generation, which effectively protects prompt privacy and reduces computational costs.
S2: The paper is well organized and clearly written, making it easy to follow.
S3: Experimental results demonstrate both the effectiveness and efficiency of the proposed method.
Weaknesses:
W1: The privacy guarantee heavily depends on the completeness of candidate prompt enumeration. In scenarios involving high-dimensional sensitive attributes, the exponential growth of the candidate set may either weaken privacy protection or lead to significant computational overhead, resulting in a fundamental trade-off that is difficult to resolve.
W2: Some important implementation details are missing. For instance, it remains unclear how the oblivious prompt transformation can be applied in general scenarios. The paper only provides dataset-specific implementations in the experimental section, without offering broader methodological guidance.
Questions
Please address the identified weaknesses. Additionally, I have the following questions:
- How is the diffusion process divided between the device and the cloud in general scenarios? A more detailed explanation should be supplemented.
- How many devices were used in the experiments? Does the distribution of client data have a significant impact on performance?
Limitations
Yes
Final Justification
The rebuttal has helped resolve most of the issues I raised. I will maintain my "borderline accept" rating.
Formatting Issues
None
We thank the reviewer YVsw for the positive feedback on "presents a novel and well-motivated cloud-device collaborative framework" and quite insightful reviews, which is quite encouraging for us. We hereby answer the specific questions below and hope the explanations resolve your concerns.
Q1: Trade-off for large candidate prompt set
Thanks for your clear understanding of our proposal. We briefly discuss this efficiency limitation in the Conclusion section. The computation cost indeed grows with the size of the candidate prompt set. To handle large candidate sets, we propose using a smaller subset of the candidate prompts to greatly lower the computation cost. This can be achieved by differential-privacy-based top-k selection. By choosing an appropriate privacy budget, we can achieve measurable privacy, guaranteed by differential privacy theory, while providing much better efficiency. This optimization is orthogonal to the method in this paper, and we leave it as future work.
Q2: Extension of oblivious prompt transformation to general scenarios
We clarify that the oblivious transformation can be seamlessly extended to general scenarios. This transformation basically involves two steps: (i) detecting private tokens, and (ii) constructing the corresponding candidate set. For general applications, only the first step requires adaptation, as the definition of privacy is highly context-dependent. For instance, in location- or occupation-sensitive scenarios, one can employ regex-based rules or NER models to identify such tokens. The subsequent transformation process remains unchanged. We will include the discussion of this extension to general scenarios in the final version to enhance clarity.
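The two steps named above (detect private tokens, then construct the candidate set) can be illustrated with a minimal sketch. It assumes regex-based detection over a small hand-written vocabulary; the names `SENSITIVE_VALUES`, `detect_sensitive`, and `build_candidates` are hypothetical, and this is not the paper's Algorithm 1, which relies on an NER model for detection.

```python
import itertools
import re

# Hypothetical attribute vocabulary; a real deployment would define its own.
SENSITIVE_VALUES = {
    "gender": ["man", "woman"],
    "age": ["young", "middle-aged", "old"],
}


def detect_sensitive(prompt: str) -> dict:
    """Step (i): find at most one sensitive token per attribute via regex rules."""
    found = {}
    for attr, values in SENSITIVE_VALUES.items():
        pattern = r"\b(" + "|".join(re.escape(v) for v in values) + r")\b"
        match = re.search(pattern, prompt)
        if match:
            found[attr] = match.group(1)
    return found


def build_candidates(prompt: str) -> list:
    """Step (ii): enumerate every combination of values for the detected attributes."""
    found = detect_sensitive(prompt)
    if not found:
        return [prompt]
    attrs = list(found)
    token_to_attr = {token: attr for attr, token in found.items()}
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in token_to_attr) + r")\b"
    )
    candidates = []
    for combo in itertools.product(*(SENSITIVE_VALUES[a] for a in attrs)):
        choice = dict(zip(attrs, combo))
        # Replace each detected token with the value chosen for its attribute.
        candidates.append(
            pattern.sub(lambda m: choice[token_to_attr[m.group(1)]], prompt)
        )
    return candidates
```

For the prompt "the young man is reading", this sketch yields six candidates (two gender values times three age values), one of which is the original prompt. Adapting it to location- or occupation-sensitive scenarios would only change the detection step, which matches the point made in the response.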
Q3: Detailed diffusion process between the device and the cloud in general scenarios
We define a switch point that divides the denoising steps between the cloud and the device. Specifically, the early steps (up to the switch point) are conducted on the cloud, with the candidate prompt set as input. The remaining steps are then conducted on the device, with the cloud-generated intermediate noisy latent as input. In the paper, we tried multiple switch points for comprehensive evaluation. A later switch point means more steps are conducted on the cloud, which can improve the quality of the generated images.
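The division of steps described above can be sketched as a toy control loop. Everything here is an assumption made for illustration: latents are plain floats standing in for tensors, `cloud_step` and `device_step` are hypothetical callables for one denoising step, and the device is assumed to keep only the latent that corresponds to the real prompt.

```python
from typing import Callable, List, Sequence

# Hypothetical signature of one denoising step: (latent, prompt, step_index) -> latent.
Step = Callable[[float, str, int], float]


def hybrid_generate(
    candidates: Sequence[str],
    real_index: int,
    total_steps: int,
    switch_point: int,
    cloud_step: Step,
    device_step: Step,
    init_latent: float = 0.0,
) -> float:
    """Run the first `switch_point` steps on the cloud, the rest on the device."""
    if not 0 <= switch_point <= total_steps:
        raise ValueError("switch_point must lie in [0, total_steps]")

    # Cloud side: denoise every candidate prompt for the early steps.
    cloud_latents: List[float] = []
    for prompt in candidates:
        latent = init_latent
        for t in range(switch_point):
            latent = cloud_step(latent, prompt, t)
        cloud_latents.append(latent)

    # Device side: continue from the intermediate latent of the real prompt only.
    latent = cloud_latents[real_index]
    for t in range(switch_point, total_steps):
        latent = device_step(latent, candidates[real_index], t)
    return latent
```

In this sketch, moving `switch_point` closer to `total_steps` shifts more work to the cloud, which mirrors the quality versus server-cost trade-off mentioned in the response.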
Q4: Devices in experiments.
We clarify that we adopt a standard client-server scheme. Therefore, only one device is used in the experiments. Besides, since we focus on image generation (i.e., inference), with no fine-tuning of the diffusion model, the distribution of client data should not have a significant impact on performance. As evidenced by the experimental results, ObCLIP yields performance comparable to the server-side model on both the MSCOCO and MJHQ datasets.
Thank you for your detailed response. I appreciate the clarifications and the efforts made to address the concerns. Your rebuttal has helped resolve most of the issues I raised. I will maintain my "borderline accept" rating.
Dear Reviewer YVsw,
Thank you for your response. I am glad to hear that our previous discussion has resolved your concerns. Please kindly let us know if you have any follow-up concerns in the author-reviewer discussion period.
Best regards.
The paper proposes a cloud-device collaborative framework to achieve privacy protection in image generation. It integrates the obfuscation of prompts and the oblivious computing mechanism to mitigate privacy risks. The effectiveness of the proposed method is demonstrated by extensive experiments. The paper is well written and its cloud-device design has practical value. However, the technical novelty of the proposed method is a concern, and its effectiveness against a broader range of real-world attacks is yet to be verified, as also pointed out by the reviewers.
During the rebuttal, the authors addressed most of the reviewers' questions, leading to consensus positive scores. Based on the reviewers' final scores and justifications, I would like to recommend the acceptance of the paper.