6.5

/10

Poster4 位审稿人

最低6最高8标准差0.9

2.3

置信度

正确性2.8

贡献度3.0

表达2.8

ICLR 2025

SelectFormer in Data Markets: Privacy-Preserving and Efficient Data Selection for Transformers with Multi-Party Computation

Xu Ouyang,Felix Xiaozhu Lin,Yangfeng Ji

OpenReview PDF

提交: 2024-09-25更新: 2025-03-16

TL;DR

Secure and fast data selection & appraisal over MPC (Multi-Party Computation), for training NLP/CV Transformer models

摘要

Critical to a free data market is $ private data selection$, i.e. the model owner selects and then appraises training data from the data owner before both parties commit to a transaction. To keep the data and model private, this process shall evaluate the target model to be trained over Multi-Party Computation (MPC). While prior work suggests that evaluating Transformer-based models over MPC is prohibitively expensive, this paper makes it practical for the purpose of data selection. Our contributions are three: (1) a new pipeline for private data selection over MPC; (2) emulating high-dimensional nonlinear operators with low-dimension MLPs, which are trained on a small sample of the data of interest; (3) scheduling MPC in a parallel, multiphase fashion. We evaluate our method on diverse Transformer models and NLP/CV benchmarks. Compared to directly evaluating the target model over MPC, our method reduces the delay from thousands of hours to tens of hours, while only seeing around 0.20% accuracy degradation from training with the selected data.

关键词

Secure Multiparty ComputationMachine LearningEfficiencyTransformer model

评审与讨论

审稿意见

评分: 8置信度: 22024-11-04

The paper focuses on private data selection for transformers over MPC. The authors consider a two-party setting, where a model owner wishes to purchase data points from a data owner. The model owner needs this data to fine-tune a target model. To find the most relevant data points without revealing the rest of the dataset, the two parties engage in a multiparty computation.

Computational and communication overhead is a key challenge in making MPC practical for data selection. In an ideal world, the model owner could evaluate its target model on data points with MPC. However, this is not practical for large transformer models, which contain nonlinearities (such as layer norm or softmax) that are prohibitively expensive to compute with MPC. Thus, an alternative approach is to use a cheaper proxy model to approximate the target model efficiently while still selecting useful datapoints.

This paper proposes a new way of building such proxy models, by replacing costly nonlinearities by more MPC-efficient, trainable, multilayer perceptrons. The authors also introduce a multiphase selection approach where datapoints are selected progressively, thereby using previously selected datapoints to improve the selection of future datapoints. Delay can also be optimized by overlapping computation and communication.

The paper is evaluated on vision and NLP tasks, and shows significant improvements in both delay and accuracy compared to prior work.

优点

Main strengths:

SelectFormer shows strong improvements in delay and accuracy over baselines, which include two naive baselines (Oracle and Random) and one recent MPC inference paper (MPCFormer).
Multiphase selection is a nice technique. The key idea is that we can significantly reduce delay by selecting the first datapoints with coarse model proxies, and progressively use more accurate and slower proxies (trained on a now larger collection of purchased datapoints) to better select the following datapoints. It turns out that this technique also improves accuracy a bit.
Another useful contribution is MLP approximation of nonlinearities, where the MLPs are trained by generating a large synthetic dataset using metrics coming from the small number of datapoints already purchased.
The paper's evaluation is broad and strong, across both vision and NLP tasks. I appreciated the numerous experiments and very granular ablation studies, such as Fig 4 and Fig 6.

Other strengths:

It is useful to compare the accuracy drop from different MLP approximations.

Handling imbalanced and unlabeled data is important.
Using computation/communication parallelism sounds intuitive, but it might not be done by other works, so it is valuable to evaluate it. This parallelism offers a nice gain in performance.

缺点

Main weaknesses:

I don't know the literature extensively, but I wonder if MPCFormer is really the strongest baseline against which we should evaluate SelectFormer (the other baselines, Oracle and Random, are useful to set the delay/accuracy range but are not real alternatives to SelectFormer). Indeed, as the authors note, "MPCFORMER’s model distillation approach is ill-suited to data selection", and I wonder if this is the reason behind MPCFormer's particularly poor accuracy (Table 3). The paper mentions THE-X (Chen et al, 2022), which could be a stronger baseline. Another potentially relevant paper is Bolt, published at S&P 24: https://eprint.iacr.org/2023/1893. I don't know these works in detail, but they might be more amenable to data selection than MPCFormer if they do not rely on data distillation or data-dependent approximations. In short, I am worried that MPCFormer might be a strawman against which SelectFormer shines too easily.
Another concern is that the techniques proposed in this paper seem to only apply to a quite specific application, namely MPC data selection. Hence, depending on how widespread MPC data selection is, SelectFormer could have a pretty limited impact. Indeed, the authors note that their "MLP approximation is specifically suitable for data selection while impractical for model inference directly", which might limit the applicability of SelectFormer to other problems.
Finally, I am not an expert in MPC systems, but I am a bit skeptical of the blanket claim that "No prior MPC systems exploit such parallelism". I remember a similar idea being mentioned by Meta in a research whitepaper (https://research.facebook.com/publications/private-computation-framework-2-0/), and a cursory web search returned a preprint showing how "it is possible to carefully orchestrate the computation and communication steps to overlap" in ML MPC training and inference (https://arxiv.org/pdf/2209.13643). The paper might still be making valuable contributions in MPC parallelism, but highlighting such contributions might benefit from a more detailed comparison to prior work.

Minor comments:

typo: "Note that we cannot directly compare to PUMA Dong et al. (2023), which is designed under 3 compute parties. and three computing parties."
I can't find the data showing that MPS reduces total delay by 33% to 61%. Is this from the difference between PM and PMT in Figure 6? Maybe this would be easier to break down if there was a delay column in Table 4?

问题

Have you considered alternatives to MPCFormer, such as THE-X (that you cite, Chen et al 2022) or other modern 2PC inference papers for transformers, especially if they use data-independent nonlinearity approximations? Or are there reasons why such baselines don't apply or are already outperformed by MPCFormer in a data selection setting where distillation data is scarce?
If I understand correctly, computing ReLUs is still a bottleneck in MPC. This might be why FFNs take most of the computation time in Fig 1. Yet, your method introduces even more ReLUs by adding MLPs. Have you considered other learnable approximations as an alternative to MLPs, such as polynomials with learnable coefficients, which could be more MPC-friendly? Why should we expect MLP approximation to be the best way to obtain MPC-friendly proxy models?

审稿意见

评分: 6置信度: 32024-11-04

This paper presents an efficient approach to private data selection, leveraging multi-party computing (MPC) for data appraisal in complex models such as transformers and ViT. The authors propose using a proxy model that approximates the complex model by emulating non-linear modules with MLP for data appraisal. While experimental results demonstrate improved efficiency compared to random selection and MPCFormer, the paper would benefit from clearer algorithmic descriptions and more comprehensive experimental analysis.

优点

The paper addresses the important and challenging problem of efficient private data selection.
The proposed hierarchical algorithm demonstrates thoughtful consideration of multiple components, including multi-phase selection and dimension optimization, to enhance efficiency.

缺点

The pipeline of the algorithm is not clear. In multi phase selection, why do you need to query the candidate data? Whether or not to select this subset of data depends on the test data performance. It seems that when you train the MLP, except for the selected data, you also generate an Gaussian distribution for the layer, but how to use this distribution to supervise fine-tuning is not clear. In sum, I hope there will be an algorithm pipeline to clearly show each steps and which parameters are hyper-parameters.
The results in Table 2 is surprising. Does it mean that MLP is enough and we can drop all non-linear layer since some of your results show that all emulations with MLP outperform no emulations.
The gap between MLP and non-linear module is simply shown by the final accuracy of your task, which may contain much randomness. Could you explain the gap in embedding layer? Like how much embedding difference for different layer emulation.
The experimental results are not clear. E.g., in Table 3, did you test the model under the same communication and computation budget? In Figure 5, what does 1 phase mean? How many phased do you use in the "ours"? Why not compare your delay with the baseline MPCFormer?
Lack of analysis. As your work focus on improving the efficiency and keeping performance, it is important to specifically analyze the computation and communication costs.

问题

See weaknesses.

审稿意见

评分: 6置信度: 22024-11-12

The paper considers the private data selection problem, i.e. the problem of a model trainer selecting data to purchase from a data owner without revealing the purchased data to the model trainer. The authors focus on the active learning framework, which targets examples which have the largest prediction entropy, i.e. which the model can learn more from, and does not require labelled examples. Ideally multi-party computation can be used to have the model trainer and data owner compute the examples' entropies (or, if entropies are themselves sensitive, a relative ordering of the examples' entropies) without revealing any other information about the examples themselves. For transformers, MPC is infeasible because transformers use high-dimensional, non-linear operations. Past work got around this issue by approximating the non-linear operations using linear operations. However, using MPC to compute these approximations remains expensive and the approximation comes at a cost in accuracy.

The authors instead propose a number of techniques to make MPC evaluation of transformers more efficient. First, they fuse multiple non-linear steps. Next, they use multi-layer perceptrons (MLPs) to approximate the fused steps and reduce their dimension (as opposed to past work, which use MLPs to reduce a single non-linear step without dimension reduction). Third, they employ a multi-phase selection protocol, where initially a small model is used to filter the initial dataset $S$ into $S_1$ , and in each successive step a larger model is used to filter $S_i$ into $S_{i+1}$ until the final dataset is acquired. To find the MLP approximation and also construct the smaller models, the model trainer purchases a small arbitrary bootstrap dataset up front, and uses the inputs/outputs of the large model on this dataset to train the MLP to approximate layers of the large model. The number of layers and width of the layers, as well as the dimension of the MLPs, can be reduced to construct a smaller model.

The authors perform an empirical comparison of their data selection protocol to (1) picking data at random (2) an oracle which chooses the highest-entropy examples without MPC, and (3) MPCFormer. At varying dataset sizes, the authors' result is competitive with baseline (2) in terms of accuracy and achieves a ~200x speedup in runtime over (2), and has large accuracy improvements over (1) and (3), on a variety of empirical tasks.

优点

The paper studies a problem that has been of interest in past work. It proposes a method that is both very efficient and achieves high utility in the empirical studies. The authors can operate in a more general setting (where examples are unlabelled) than past work. It is clear from the presentation what techniques the authors' method uses to improve upon the past work, and the techniques are made understandable at a high-level even to someone who is not an expert in the area. Furthermore, techniques such as using MLPs for dimension-reduction and fusing multiple operations are themselves sufficiently different from past work and a possibly interesting contribution independent of the problem of data selection.

缺点

The MLP dimensions, proxy layers, and selectivity per round are chosen via grid search, which might limit the practicality of the method. It would be nice to have either "standard" guidance for choosing these parameters or a justification for grid search being practical. See Questions below.

The paper also needs some editing, there are some major typography errors, though I expect this is easy to handle in a revision. For example:

Line 82, citation to ??
Line 398 - "under 3 compute parties" appears twice
Line 498, $d_i$ is not formatted properly

问题

Is there a way to make the grid search practical? We probably do not have the luxury of trying MPC with a data vendor multiple times with different parameters in practice. Maybe we can do a grid search to see what parameters perform the best on a non-private test dataset, and then use these values in all actual interactions with data owners?

Can you include a 'baseline' for Figure 1? i.e., show how long the corresponding operations take without using MPC and the memory requirement for each (communication rounds could be omitted, of course).

审稿意见

评分: 6置信度: 22024-11-13

This paper proposes SelectFormer, a method for privately selecting data for transformer finetuning using MPC. The setting is that a data holder is trying to sell data to a user, who wants to finetune a transformer on only a subset of the holder’s data. However, the data holder doesn’t want to expose all fo their data, and the data user doesn’t want to expose their model weights to the data holder. The main idea is to iteratively learn a sequence of models, each of which is used to privately rank points in the dataset in terms of their informativeness (entropy of the output distribution of the target model). These models replace nonlinear operations like softmax with an MLP with ReLU activation, where the MLP is trained by collecting input-output pairs from the intermediate model. The authors show that their method significantly reduces the time to select a good training dataset relative to prior private baselines, without sacrificing much in terms of quality (i.e. it selects data points that lead to a good downstream model).

优点

The problem formulation is interesting and new
The algorithm appears to cleverly use the available information
The empirical performance and experiments are promising

Overall, I think the paper is proposing an interesting new idea (both formulation and algorithm), and hence I gave it a positive score. However, I think the level of polish and writing is rather low, and I think the paper would need a significant clean-up prior to publication.

缺点

The paper is not polished or easy to read/understand
It’s not clear if the paper compares against relevant baselines
Imprecise threat model and privacy guarantees. The threat model does not clearly specify who should not learn what data
How does the MPC handle nonlinearities in the MLP?
The level of formalism in the paper is very low

In your privacy guarantees, please specify who learns or does not learn what information. For instance, if the data holder knows the architecture of the model(s) and they know the ranked entropy of their samples, why can’t they train the model locally on the most informative samples to approximate the private model? This leakage was not explored in the paper, to my knowledge.

The paper’s evaluation considers time-accuracy tradeoffs. However, these are mostly provided implicitly in tables that do not clearly show the tradeoff. It may be helpful to show a plot with selection time on one axis and accuracy on the other, and then see different baselines, including variants of your method. Specifically, based on the results in the Appendix, I think the 1-phase variant with a dimension 2 MLP may do well enough relative to the other hyperparameter settings, especially if you consider its lower delay. But it’s hard to judge from the tables provided. Can you produce this plot and compare it to the suggested hyperparameter settings?

A lot of notation seemed to be undefined or imprecisely defined. This made the paper difficult to read. For instance, in Sec. 4.1, “with a selectivity αi = Si/Si−1” if S_i is a dataset, do you mean the ratio of the cardinalities of S_i and S_{i-1}? In Sec 2.1 and 2.2 -- What is the difference between M_t and M_target? Why do you want to query all the samples in D on M_t instead of M_target? I thought M_t was the finetuned model, which is unknown until you finetune with the selected data? The notation $\hat M_i$ was not defined, and the proxy model was previously defined as $M_p$ . And what is $M_g$ in Figure 2? It is not defined until later in the paper, in Section 4.2. But the definition doesn’t match Figure 2 (is it the bottom K or L layers?) What is the difference between $W_i$ and $w_i$ ?

In the same vein, the paper does not seem to formally write out the MPC algorithm or prove any guarantees.

The paper is missing several relevant references, including one potentially important baseline for comparison:

Pang et al, “Bolt: Privacy-Preserving, accurate, and efficient inference for transformers” (S&P 2024)

Table 1 is hard to read. Please enlarge. Also what metric is it providing? Please make the table and caption self-contained. This is true also of the tables in the appendix, many of which don't specify what metric they are listing.

Minor comments:

Intro: “As shown in ??,”
The section “Workflow” is very difficult to follow—it’s not clear what you mean by ranking samples by entropy, for instance
The model owner has a private, small validation set, on which she wants to maximize the test accuracy.  Do you mean validation accuracy?
“offline generate a random triple called Beaver triple Beaver (1992)” wrong type of citation typo

I am not an expert in MPC and may be missing relevant literature and requirements.

问题

Can you share a plot of accuracy vs. delay for baselines and the variants of your method, including the variant with 1 phase and a dimension 2 MLP?
Why is replacing one nonlinearity with a different nonlinearity useful for MPC?
Can/should you compare to BOLT in the evaluation?

AC 元评审

2024-12-20

The reviewers were supportive of the paper. There were a few concerns that if addressed in the final version that would be great,

a) The threat model should clearly define who should not learn what data. b) The paper should specify more clearly how the MPC handles nonlinearities in the MLP. c) The level of formalism is a bit low, and the privacy guarantees should be more specific. d) The section on "Workflow" is difficult to follow.

审稿人讨论附加意见

See above.

最终决定Accept (Poster)

2025-01-22

Accept (Poster)