PaperHub
Overall rating: 7.0/10 (Spotlight; 4 reviewers)
Individual ratings: 8, 8, 6, 6 (min 6, max 8, std 1.0)
Average confidence: 3.3
ICLR 2024

GIO: Gradient Information Optimization for Training Dataset Selection

OpenReview | PDF
Submitted: 2023-09-22 · Updated: 2024-03-20
TL;DR

We introduce a domain- and task-agnostic, scalable and efficient KL divergence-based method for training dataset selection and show it can achieve outstanding results with very small train sets on a variety of text tasks and an image task.

Abstract

Keywords
data selection, data-centric AI, information theory, KL divergence, gradient, natural language processing, computer vision

Reviews and Discussion

Official Review
Rating: 8

The authors propose a method called Gradient Information Optimization (GIO) to select a subset of a large training dataset that preserves model performance. GIO is formulated as the problem of minimizing the KL divergence between the subset and the target distribution. For this minimization problem, the authors propose a greedy algorithm based on the gradient of the KL divergence. The authors experimentally demonstrate that the proposed method outperforms existing methods on NLP and image recognition tasks.
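For concreteness, a minimal sketch of a plain greedy loop for the KL-minimization objective (the "naive greedy algorithm" referred to later in this discussion) might look as follows. This is not the authors' implementation: the function names are illustrative, and a Gaussian-fit KL estimate stands in for the paper's kNN-based estimator.

```python
import numpy as np

def gaussian_kl(sample_p: np.ndarray, sample_q: np.ndarray) -> float:
    """KL(P || Q) between Gaussian fits of two samples.
    Stand-in for the paper's kNN-based KL estimator."""
    mu_p, mu_q = sample_p.mean(0), sample_q.mean(0)
    d = sample_p.shape[1]
    cov_p = np.cov(sample_p, rowvar=False) + 1e-6 * np.eye(d)
    cov_q = np.cov(sample_q, rowvar=False) + 1e-6 * np.eye(d)
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff - d
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

def greedy_select(X: np.ndarray, G: np.ndarray, budget: int) -> list:
    """Greedy baseline: at each step add the point of G that most reduces
    the estimated KL(X || V).  One full scan of G per step, so this is only
    a reference loop; GIO replaces the scan with a gradient step."""
    # seed with a few points of G nearest the target mean so a covariance can be fit
    seed = np.argsort(np.linalg.norm(G - X.mean(0), axis=1))[: X.shape[1] + 1]
    selected = list(seed)
    while len(selected) < budget:
        best_i, best_kl = None, np.inf
        for i in range(len(G)):
            if i in selected:
                continue
            kl = gaussian_kl(X, G[selected + [i]])
            if kl < best_kl:
                best_i, best_kl = i, kl
        selected.append(best_i)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(0.0, 1.0, size=(200, 2))    # target sample
    G = rng.uniform(-4.0, 4.0, size=(400, 2))  # general pool
    idx = greedy_select(X, G, budget=50)
    print(f"selected {len(idx)} points, KL(X || V) ~ {gaussian_kl(X, G[idx]):.3f}")
```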

Strengths

It seems reasonable to select the subset so as to minimize the KL divergence. The authors propose a feasible algorithm for this minimization problem. Experimental results suggest that the proposed method works better than existing methods on machine translation and spelling-correction tasks.

Weaknesses

  1. There is a gap between the two formulations: the solution obtained by the greedy method in equation (2) and the optimal solution of equation (1) can differ. However, the authors do not discuss this gap.
  2. Subset selection given a target distribution is not sufficiently motivated, and GIO does not seem suitable for the natural subset-selection setting $D=\emptyset$, $G=X$. In this setting, $V=X$ (selecting all data) is the obvious optimal solution to equation (1), as the authors state in Section 3.4.
  3. It is not clear whether $\mathbf{v}_{opt}$ obtained by the gradient method matches the optimal solution of equation (3). From Appendix A.2, the authors assume that the two solutions match if the data density is sufficient. However, from Algorithm 1, points are selected only from the k-means centroids ($G_c$). This assumption does not seem to hold, especially in high-dimensional spaces, and there is not enough discussion of this.

Questions

  1. Could you discuss the gap between the solutions of equations (1) and (2), either theoretically or experimentally? For example, on the theoretical side, if submodularity holds for the objective function, the approximation ratio is bounded. On the experimental side, could you evaluate the approximation ratio among the solutions obtained from equation (1), equation (2), and Algorithm 1?
  2. Could you discuss in which real-world situations target distributions can be obtained? Could you also discuss what kind of data should be set as the target distribution? For example, in the experiment, to improve performance on the WMT14 test set, the authors set the WMT08-13 dev sets as the target instead of the WMT14 training set, and report that performance on the WMT14 test set indeed improves. Could you discuss this result, for example?

Comments

  1. $\Omega$ is not defined.
  2. In Algorithm 1, $D_c \gets D_c + \{\mathbf{v}_b\}$ should be $D_c \gets D_c \cup \{\mathbf{v}_b\}$.
  3. The definition of the relaxed $\mathbf{v}$ and of $\mathbf{v}_{opt}$ is ambiguous. Is it $\mathbf{v} \in \Omega$?

Post-rebuttal

My main concerns were (1) about the gap between Eq.1 and Eq.2 and (2) about the usefulness of the proposed method in real-world tasks.

There is a gap between the optimization problem (i.e., minimization of KL-divergence; "information-theoretic approach") and the greedy algorithm in this paper. The relationship between them was not clearly discussed. In their response, the authors clarified that the objective function lacks certain properties, such as submodularity. Unfortunately, they couldn't demonstrate the (empirical) approximation ratio achieved by the greedy algorithm. However, it is acknowledged that the greedy algorithm is one of the natural approaches to discrete optimization.

In their experiments, the proposed method performs well compared to baseline methods, including one using submodular optimization. The authors' response also highlighted that the proposed method outperformed existing methods in an experiment more closely resembling realistic scenarios. Furthermore, this paper proposes an effective algorithm, which has actually been applied to more than 10M data points, while minimizing the KL divergence with a naive greedy algorithm is intractable. The response to C7wm states that the proposed method runs in ~40 min on a dataset containing 35M data points.

Based on the above points, I agree that this paper holds significant value for the community, prompting me to revise my score from 5 to 8.

Comment

We thank reviewer NSKq for their insightful feedback. We are pleased that they find the theoretical concept underpinning the method interesting and appreciate the algorithm’s competitive experimental results. We outline our response to your questions below:

W1/Q1 You are right that the greedy hill-climb method does not correspond exactly to the overall optimal solution. In the method section under Limitations, we note that the space is non-convex and methods like hill-climbing (gradient descent) will likely find local rather than globally optimal minima. We propose alternative methods more suited to non-convex spaces, like particle swarm optimization, and look forward to exploring those techniques and seeing their impact on experimental results in future work. However, we acknowledge that the Limitations paragraph is phrased more as a limitation of gradient descent itself, and not of the gap between equation (1) and the hill-climbing formulation of equation (2). Making that gap the focus of the Limitations paragraph, rather than gradient descent specifically, would make the insight and discussion more general, and we will update this shortly in our revision. Along similar lines, regarding your first question, submodularity unfortunately does not hold for KL divergence, so we cannot bound the approximation rate directly. The closest related submodular function and technique we could find is submodular optimization using submodular mutual information, which we evaluate (the Submodular Optimization baseline in our experiments). We find our method outperforms that technique, which implies our approximation rate must be less than that of the submodular function (submodular mutual information). We will add these details in the revision shortly. This is an extremely interesting insight and we can't thank you enough for pointing this out and helping us improve our presentation and insight!

W2 We are unsure about your point; the concept of selecting a subset that matches a target set is commonly adopted as a data selection problem, see Yao et al. (BM25 baseline), Kaushal et al., 2022 (submodular optimization), Xie et al., Brown et al. (GPT large language models), etc. Our setup using G (general data pool), X (a desired data distribution to follow) and D (data we must include) allows us to flexibly represent many data selection problems. Our method is then based on incorporating all three variables. The data selection problem as you describe it is to find the best subset that represents the full set, which, as you correctly point out, can be initialized in our setup as $D=\emptyset$, $X=G$, and the algorithm can simply be applied accordingly. We investigate the performance of our method under this setup in the FashionMNIST experiment with good results, surpassing the baseline.

Q2 Absolutely; in response to your and another reviewer's feedback, we will incorporate details and guidance on how to choose an X (and D) in real-world setups. Selecting X is specific to the goal of what you are data-selecting for; for example, WMT provides many dev sets of high quality which are close to the desired task, so we were confident those were a good target (which we collected from 2008 to 2013 to have more data). The WMT14 training dataset comes from a variety of sources (e.g. CommonCrawl) of variable quality, so we felt finding the best subset that represents this full set (X=G) would not yield the most gains, when instead we have much higher quality dev sets available to serve as our target. In the real world, for foundation models, for example, which use web-scraped training data, choosing X to be the Wikipedia corpus (as the authors of GPT do when data-selecting) makes the most sense, as this represents high-quality language. Or in industry, choosing real customer traffic your model may see in production makes the most sense, as this will get the training set closest to the real production application of the model. We feel you have brought up an important discussion on how to practically apply this algorithm to new settings; we will provide the above examples and some further guidance on how to select X and update the revision shortly. Thank you!

Comments 1. $\Omega$ is the measure space on which the sets of data and probability densities are defined. 2. Thank you for pointing this out; we will fix it in our revision. 3. You are absolutely correct: the relaxation is from $\mathbf{v} \in G$ to $\mathbf{v} \in \Omega$, and $\mathbf{v}_{opt}$ is also in $\Omega$ (we then find the closest point to it in G).

Comment

Thanks for your response. My concerns have mostly been addressed by the response. At this point, I intend to raise my score. I feel that I can raise my score further if one of the following is met. (1) A theoretical derivation of the rate of approximation between the solution obtained by the proposed method and the optimal solution to the integer programming problem in Eq.1. (2) Experimental measurement of the empirical approximation ratio between the solution obtained by the proposed method and the optimal solution (Eq. 1). (3) Demonstration of the usefulness of the proposed method in an experiment simulating a real-world task (such as an experiment using web-scraped data and the Wikipedia corpus in the response).

Comment

We thank you for the feedback and are glad we were able to address most of your concerns. We thank you for the additional suggestions and agree they would make our method more interesting. For (3), we chose the spelling correction experiment to demonstrate that our algorithm selects mostly high-quality data from a mixed set of high- and low-quality data (given a target set of real human mistakes, the highest quality). However, after reviewing your feedback, we agree that a separate demonstration on a real-world task would add to the strength of the argument. Based on your suggestion, we conducted an additional real-world experiment as follows: the task is to extract appropriate data from a general corpus and train a translation model intended for biomedical use (for translating medical articles, for example, in EN-DE), given a small target set of sample biomedical phrases. With a general corpus of web-crawled data (CommonCrawl), a few other sources (EuroParl, UN) and biomedical data extracted from patents (for a total of ~20% biomedical data), we ran our algorithm and the best-performing baselines using the small target set X of biomedical phrases, then trained a translation model (EN-DE) and evaluated translation quality on the Khresmoi biomedical test set, translating biomedical queries. 100% biomedical data would be the theoretically optimal selection for this task. We report the results:

| Method    | % Biomedical Data Selected | Data Size | BLEU |
|-----------|----------------------------|-----------|------|
| Ours      | 72%                        | 500k      | 28.7 |
| Submodlib | 49%                        | 500k      | 27.7 |
| Pruning   | 37%                        | 500k      | 25.7 |

We are pleased to report that our method selected 72% biomedical text and performed better than comparable baseline methods on the translation itself, demonstrating its capability in a real-world scenario. It is worth noting that on spelling correction data it yielded 73% high-quality data (a very similar number), so both that and this experiment's results together seem to empirically suggest an approximation ratio of around $1/0.72$ or $1/0.73$, i.e. $\approx 1.4$. We appreciate your insightful suggestion and look forward to including these additional results in our final paper; they deepen our argument in very useful ways. Thank you!

Comment

Thank you for conducting the additional experiment. I can confirm that the proposed method demonstrates superior performance compared to the existing method in more realistic scenarios through the additional experiment. I am pleased with the results of this additional experiment. However, I believe that (1/% Biomedical Data Selected) does not represent an empirical approximation ratio, although it might be a suitable metric for this experiment. I believe the empirical approximation ratio is typically defined by the value of the objective function within the optimization problem (the KL-divergence in Equation 1). To calculate the empirical approximation ratio, it is necessary to compare the minimum value of the KL-divergence (i.e., the global optimal solution) with the KL-divergence calculated by the solution obtained through the proposed algorithm.

Comment

Thank you, we are glad that you find the method superior compared to existing methods in a realistic setting and that our responses and additional experiments support increasing the score. Regarding the empirical approximation ratio, we thought it referred to the content of the set selected by the optimization objective, but we see that you mean the value of the optimization objective itself (the KL divergence). In our experiment this is a little tricky, as we don't truly know the "optimal" set V: it probably contains mostly biomedical samples, but it could theoretically be optimal to also include some non-biomedical samples from the general corpus, and we just don't know which ones. To find out, we computed the KL divergences between our selected data and X and between the pure biomedical data and X, and found that our selected data's KL had a value of 60, whereas pure biomedical data had a larger value of 85. We believe this does indicate that the "optimal" set must include some non-biomedical data; unfortunately, we have no further indication of how we could construct this optimal set (as the exact algorithm is not feasible). Nevertheless, we are glad that the added experimental result on the real-world scenario shows our method's superior performance over the competitive baselines. We are glad you are pleased with the additional experiment and hope you feel added confidence in supporting our method. We can't thank you enough for this interesting discussion and look forward to including work on this topic in the paper!
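For reference, the kind of non-parametric KL estimate being compared here (60 vs. 85) can be computed with a standard k-nearest-neighbor estimator from two samples. The sketch below is a generic estimator of that family, not necessarily the exact one used in the paper, and the usage names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(p: np.ndarray, q: np.ndarray, k: int = 1) -> float:
    """k-NN estimate of KL(P || Q) from two samples (rows = points).
    Generic Wang/Kulkarni/Verdu-style estimator; the paper's exact estimator
    may differ in details such as k or distance scaling."""
    n, d = p.shape
    m = q.shape[0]
    # rho: distance from each p_i to its k-th nearest neighbor within P (skip self)
    rho = cKDTree(p).query(p, k=k + 1)[0][:, -1]
    # nu: distance from each p_i to its k-th nearest neighbor within Q
    nu = cKDTree(q).query(p, k=k)[0]
    nu = nu if nu.ndim == 1 else nu[:, -1]
    eps = 1e-12  # guard against zero distances from duplicate points
    return float(d * np.mean(np.log((nu + eps) / (rho + eps))) + np.log(m / (n - 1)))

# hypothetical usage mirroring the comparison above:
#   knn_kl_divergence(embed(selected_data), embed(target_X))
#   knn_kl_divergence(embed(pure_biomed),   embed(target_X))
```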

Comment

As a quick follow-up, we were able to test the empirical approximation ratio in an idealized experiment. We needed access to the powerset to evaluate every single subset in order to calculate the optimal set and its KL value, so the set could not be larger than about 15 points for feasibility. We performed the following experiment, all in 768 dimensions to mimic embedding vector sizes: the target distribution X is a normal distribution centered at 0 with a covariance matrix of ones (15 datapoints), and the general set G is a uniform set spread from -8 to 8 (15 datapoints). We ran GIO, which selected a set with a KL divergence of 1557. We then evaluated every single subset to find the true optimal set, which had a KL divergence of 1555. In this experiment, therefore, the approximation ratio is about 1, nearly perfectly optimal. It is also worth noting that in a 2-dimensional version of this experiment, GIO actually selected the exact optimal set! We look forward to including this discussion in the paper and thank you again for your insights and suggestions!
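For completeness, the exhaustive check described above can be expressed as a brute-force enumeration over subsets. The sketch below assumes a sample-based KL estimator such as `knn_kl_divergence` above and, as noted, is only feasible for pools of roughly 15 points.

```python
from itertools import combinations
import numpy as np

def best_subset_bruteforce(X, G, kl, min_size=2):
    """Enumerate every subset of G (of at least min_size points, so the KL
    estimate is defined) and return the subset indices with the smallest
    KL(X || V).  Exponential in |G|; only for the tiny idealized setting."""
    best_idx, best_kl = None, np.inf
    for size in range(min_size, len(G) + 1):
        for idx in combinations(range(len(G)), size):
            v = kl(X, G[list(idx)])
            if v < best_kl:
                best_idx, best_kl = idx, v
    return best_idx, best_kl

# empirical approximation ratio, as discussed:
#   ratio = kl(X, G[gio_selected]) / best_kl   (e.g. 1557 / 1555 above)
```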

Official Review
Rating: 8

This paper proposes a scheme to perform training data subset selection, connected to ideas in information theory, by minimizing the (approximate kNN) KL divergence between the selected subset and the original dataset. Since this is intractable in practice, the authors approximate it by finding the optimal data point (not necessarily contained within the training data) to add and then projecting it onto the nearest point contained within the training set. This scales only with the number of gradient steps (and not the size of the original dataset), which is much more computationally efficient. They run this gradient trick over K-means clusters rather than individual data points to make the problem more tractable.
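A rough sketch of the step just described (continuous update on the KL objective, followed by projection to the nearest k-means centroid) is given below. The function names are placeholders, `selected` is assumed to be a non-empty array of already-chosen points, and a finite-difference gradient stands in for the KL gradient the method actually computes.

```python
import numpy as np

def gio_step(X, selected, centroids, kl, lr=0.1, steps=50, fd_eps=1e-3):
    """One selection step: relax v to the continuous space, descend a numerical
    gradient of KL(X || selected + {v}), then snap v_opt to the nearest
    centroid.  `kl(p, q)` is any sample-based KL estimator (e.g. the kNN one)."""
    v = centroids.mean(axis=0).copy()
    for _ in range(steps):
        grad = np.zeros_like(v)
        for j in range(v.size):               # finite differences stand in for
            e = np.zeros_like(v)              # the gradient used by the method
            e[j] = fd_eps
            grad[j] = (kl(X, np.vstack([selected, v + e]))
                       - kl(X, np.vstack([selected, v - e]))) / (2 * fd_eps)
        v -= lr * grad
    return int(np.argmin(np.linalg.norm(centroids - v, axis=1)))
```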

The authors evaluate the proposed approach through multiple different qualitative and quantitative metrics: (1) self-consistency - coming from the same distribution, (2) negative-consistency - essentially ignoring outliers in the data distribution, (3) the selected subset should maintain similar values of KL divergence given a quantization of the data (i.e. K-means).

Strengths

  • The authors propose various methods/approximations to select a subset of training data with greater computational efficiency, and these approximations are tied to an underlying naive approach based on information theory. In addition, the method is simple and easy to understand.

  • The paper is well-motivated and solves an important issue; improving the efficiency of pretraining (through shrinking the dataset size) makes the training of large models more available to researchers at lower costs.

  • This approach sees good empirical results on multiple machine translation tasks and an image classification task, and the authors' intuitions are also demonstrated in synthetic experiments.

  • The authors also provide ablations to demonstrate the proposed method is not very sensitive to different underlying embeddings or the hyperparameters of the K-means quantization.

Weaknesses

  • I'm a bit confused as to the stopping criterion in Section 4.3. It seems to come out of nowhere and be rather ad hoc. Why do you introduce the second stage of reselecting instances from G until the KL divergence begins to decrease? Why couldn't you simply run the first stage here?

  • Not a big issue, but I also think the presentation in Section 4.3 is a bit confusing. The first sentence of the section seems to claim that GIO selects a subset of high-quality data with respect to a target set that is a combination of both high- and low-quality data (which is confusing, as something that minimizes the KL divergence should select both). However, this isn't the experiment being run; the target set $X$ here is only a dataset of high-quality data.

Questions

  1. Why were the other baselines not included in the FashionMNIST experiments?
  2. See the weaknesses section regarding the stopping criterion in Section 4.3.
Comment

We thank reviewer Km99 for their insightful review and feedback and are thrilled that they find our method simple, well-motivated and theoretically grounded, and that it demonstrates strong empirical results in multiple domains. We agree that our method is highly relevant to the current research trend in pretrained models, and look forward to it being used to train models more efficiently on smaller data while achieving similar or superior performance.

W1 GIO is flexible enough to have many stopping criteria depending on what the goal of data selection is (just match the target with unique samples, select only up to a data size, etc) which we explore in Appendix A. For Section 4.3 Spelling Correction experiments specifically, we wanted to experiment with a different stopping criterion, and previous expertise in that particular domain implied that selecting good duplicates (i.e. implicitly weighting some data more) improves model performance. However, we acknowledge we did not discuss this decision in depth and thank the reviewer for pointing this out. We will add more background to this decision in the revision shortly.

W2 We thank you for your suggestion on the phrasing of Section 4.3; you are absolutely correct, the experiment is selecting a set of data from a pool of mixed high and low quality data with reference to an X that is real human mistakes (very high quality), and shows that given an X of high quality data, GIO identifies and selects high quality data (73%) from the pool of mixed quality data. We acknowledge that the wording confuses readers as to which set is X and G, and will rephrase this section in our revision shortly.

Q1 The goal of the FashionMNIST experiment was primarily to show that GIO works well in domains other than text, and also to show another use case: reducing a dataset's size using GIO (X=G) under a data constraint, rather than picking data to conform to some other target X. The experiment showed GIO was able to do so effectively; further analysis in Appendix A has some fascinating clues into why this works. Within a class most images are similar (e.g. most bags are rectangular in the "bag" class), but there are outliers and some may even look like other classes. We found that random selection picked both the "core" of the classes (e.g. rectangular bags) but also included outliers, whereas our method focused almost exclusively on choosing the "core" of the classes first and therefore had fewer outliers. Under the 25% data constraint, it makes sense to focus on getting the "core" images correct and not waste data on outliers we are unlikely to get right anyway, which is why our method's data led to better performance. We showed this phenomenon visually with a tSNE and an analysis of the average distance from one class to another, and confirmed that our method selects data that is clustered closer together (the "core" data) and more separable from other classes. We wish we could have spent further time and paper space both diving deeper into the performance of GIO on FashionMNIST and running many comparative image data selection baselines, as it is an interesting domain in which to see GIO work visually, and we look forward to expanding on this area of application in future work.
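As a rough illustration of the separability analysis mentioned above (not the exact Appendix A code; all names are assumed), one could compare the average within-class spread of a selected subset against the average distance between class centroids:

```python
import numpy as np

def class_compactness(embeddings: np.ndarray, labels: np.ndarray):
    """Average distance of points to their own class centroid (intra) vs. the
    average distance between class centroids (inter).  Lower intra / higher
    inter suggests the subset kept the 'core' of each class and dropped outliers."""
    classes = np.unique(labels)
    centroids = np.stack([embeddings[labels == c].mean(0) for c in classes])
    intra = np.mean([np.linalg.norm(embeddings[labels == c] - centroids[i], axis=1).mean()
                     for i, c in enumerate(classes)])
    pair = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    inter = pair[np.triu_indices(len(classes), k=1)].mean()
    return float(intra), float(inter)
```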

Comment

Thanks for the thorough responses and clarifications regarding stopping criteria in Section 4.3. Overall, all of my (minor) concerns have been addressed.

Official Review
Rating: 6

This paper proposes GIO, a new method for selecting a subset of the training data in order to reduce data/compute cost while retaining performance similar to training on the original dataset. It is a task- and domain-agnostic scheme that requires a small set of unlabeled examples representing the target distribution. It minimizes the KL divergence between the target distribution and the subset in an iterative manner, with approximations that make the procedure tractable. Experiments show that this approach is competitive with existing baselines such as submodular optimization and data pruning.

Strengths

  • task agnostic and domain agnostic approach to dataset pruning
  • does not require labels to apply GIO

Weaknesses

  • Algorithm 1 is incomplete on its own. It does not define the inputs X, D, G. How does one find v_opt? Which equation is being referred to? Ideally, one should be able to understand the algorithm by reading this procedure alone.
  • There is no benchmarking of the compute cost of the proposed method against the baselines (submodular optimization, BM25, etc.). It is hard to judge how computationally expensive this approach is compared to the various baselines.
  • It is unclear whether the proposed method could scale to large datasets (for instance, on the image recognition task, only FashionMNIST is used).
  • It could be challenging to select the target state X and the initial state D.

Questions

  • In Table 4, why is $D_{KL}$ lower for the random method than for the proposed scheme, and why is random's $D_{KL}$ much closer to that of the full dataset?
  • Why does Table 4 not include other baselines for subset selection in the classification setting?
  • How do you define high quality in Table 3?
  • Have you compared the training cost of the proposed method against other baselines?
  • How does the method perform when scaled to larger datasets? FashionMNIST seems too small a dataset (even a random subset is not that bad in performance).
  • Have you done any ablations to see the impact of the target state X and the initial state D? How does one go about initializing these for a new task/domain?

Ethics Review Details

N/A

Comment

We deeply thank C7wm for their thoughtful and in-depth response. We are glad they find the approach competitive against recent comparable methods in this area and appreciate the task, domain and label agnostic nature of the algorithm. Below we outline our response to your questions:

W1 Algorithm 1 is intended as a supplement to the introduction, related work and method sections, which define the variables X, D and G when we introduce the paper and problem setup (G is the general pool of data, X is the target set representing the desired target distribution, and D is a set, possibly empty, of data we absolutely want to include). v_opt is found using gradient descent until convergence (line 5 of Algorithm 1 and Section 3.3 in the Method). However, we acknowledge that Algorithm 1 does not stand alone without the context of the introduction and method, as you correctly point out. We thank you for your feedback and will update Algorithm 1 to define the variables and be explicit about the gradient descent step so that it is a standalone figure.

W2/Q4 We agree a study of computational complexity would be beneficial and will update the revision to include one. BM25 is between O(n) and O(n^2) with parallelization, submodular optimization is O(n), and pruning is O(n log n) (GIO is O(S·n), as noted in the paper, where S is the number of gradient descent steps). In practice, we found submodular optimization to be fastest, followed by GIO, pruning and finally BM25, which aligns with these complexities. Training iterations are fixed across all methods. On our largest set with 35M training examples, GIO takes only about 40 min end-to-end to complete (on CPUs only)! We feel the speed and simplicity of GIO make it a great candidate for dataset filtering. We provide more details on how long each experiment took in Appendix C; it ranges from ~20 min to ~40 min. We thank the reviewer for this feedback, which makes our case for GIO stronger, and look forward to elaborating on these datapoints in our paper!

W3/Q5 We agree that FashionMNIST is a smaller dataset. The largest dataset we used (first experiment, WMT14) was 35M samples for WMT14 EN-FR, which is a relatively large dataset, and GIO took ~40 min to complete. The model trained on GIO-selected data surpassed the performance of a model trained on all data and all competitive baselines, which we feel is strong evidence that GIO performs well on large datasets. We also used 14M samples for spelling correction and 5M samples for WMT14 EN-DE. We wish we could have spent further time and paper space diving deeper into the performance of GIO on FashionMNIST, CIFAR-10, ImageNet, etc., and look forward to investigating this in future work!

W4/Q6 Selecting X is important to the success of GIO (and of submodular optimization and BM25), and as we mention in the limitations section, a bad choice of X could lead the model to generalize poorly. Selecting X and D is specific to the goal; for example, WMT14 provides dev sets of high quality which are close to the task, so we were confident those were a good target. For WMT14 EN-DE, given the small amount of data at the $D=\emptyset$ initialization, we ablated providing other Ds (D=25, D=50), and as our experiments suggest, this has a positive impact (but not for EN-FR, where there is plenty of data). We hope these details give you some insight into how we made these decisions. We feel you have brought up an important discussion on how to apply GIO to new settings and deeply appreciate it; we will provide the examples and further guidance on how to select D and X and update the revision shortly. Thank you!

Q1 Yes, this is interesting and we investigated it in Appendix D. It makes sense that random has the closest KL divergence to the target X, as here X=G and random is a subset of G. Here is a summary of the findings: we found that random selection picked both the "core" of the classes (e.g. rectangular bags) but also included outliers (e.g. triangular bags), whereas GIO focused almost exclusively on choosing the "core" of the classes first and therefore had fewer outliers. We confirmed this with a tSNE and an analysis of the separability of classes, which showed that GIO's data clustered closer within classes (the "core") and was more separable from other classes, hence the better performance. We hope you find this analysis as interesting as we did and encourage you to see Appendix D for more information and some diagrams.

Q2 The goal of the FashionMNIST experiment was primarily to show that GIO works well on other domains besides text, and also to show another use case of reducing a dataset size using GIO (X=G) under a data constraint, rather than picking from data to conform to some other target X. The experiment showed GIO was able to do so effectively. We wish we could have spent further time and paper space both diving deeper into the performance of GIO on FashionMNIST and running many comparative image data selection baselines, and look forward to expanding on this area of application in future work!

Comment

Thank you again for the insightful review. We believe we responded to all concerns, but please let us know if we can supply any additional information before the discussion period closes. We appreciate your feedback, and thanks again!

Comment

Thank you for the rebuttal. I appreciate the clarification of the compute cost of the proposed method. I believe the paper could be strengthened by including a discussion on how to select the sets D and X. Further, studying the impact of the proposed method on larger datasets in domains other than text (such as image, speech, or other modalities) would be helpful. I'll raise my score accordingly.

Official Review
Rating: 6

The paper presents a new approach for data selection called Gradient Information Optimization, based on a natural information-theoretic objective of minimizing the KL divergence between the sampled and target distribution. The paper then evaluates the approach thoroughly in a number of different settings.

Strengths

  • Data selection is an increasingly important problem, and the paper presents a new sound approach
  • Overall, very polished and well-written; communicates their contributions clearly. Also, placed well within the context of related work
  • Mostly thorough evaluation of proposed approach, including self-consistency checks, ablations, and different domains (language and vision)

Weaknesses

  • It's not clear to me whether the lack of a hyperparameter/stopping criterion (which is argued as a strength) is real, and if so, whether it is actually a strength. For example, in the self-consistency section there is still an implicit hyperparameter for GIO when the authors write "$d(p_X(x), p_G(x)) \gg 0$". How do you define what is too big a distance? So it seems a bit misleading to present the lack of a dataset-size parameter as a strength if the hyperparameter just shows up in a different form. Also, intuitively, I think there should naturally be a hyperparameter, as the optimal dataset really depends on the different constraints you have (whether that be size or something else).

  • Lack of comparison to a similar recent approach: how does the approach here compare (both conceptually and empirically) to that of Xie et al. [1]? They also use a notion of KL distance over a feature space. I am not sure it is quite right to list their approach under the "heuristic" methods.

[1] Xie, Santurkar, Ma, Liang. Data Selection for Language Models via Importance Resampling.

Questions

Other clarifications:

  • In 4.2, "GIO works with different embedding models": are the selected examples themselves similar when using different embeddings? More broadly, why should GIO be invariant to this choice though? For example, it's plausible that certain embeddings focus on certain features more than others, distorting the distribution.
  • Did you try using simpler feature spaces, such as the n-grams one considered in Xie et al.?
Comment

We thank reviewer u6CK for their insightful review and feedback. We are glad to hear they have recognized GIO as a new sound approach to an important problem and appreciate their acknowledgement of the paper writing quality and thorough evaluation of the approach.

W1 The presence or lack of hyperparameters is an interesting subject. Our method can choose to stop on its own based on a theoretically optimal criterion: whenever the KL divergence stops decreasing, any new datapoint selected is no longer adding information about X, and the algorithm and selected data have reached their optimal point. We believe this is a strength compared to many data selection methods which require a stopping criterion, most commonly a desired data size, to be chosen beforehand. How can we know the best data size? Maybe we chose too little, or too much. By using the KL divergence itself as a stopping criterion, the algorithm can find that optimal state. That said, we acknowledge that there are problems where data size should be chosen beforehand, e.g. under a certain compute budget, in which case choosing a data size beforehand as Xie et al. do is both necessary and important (and a primary motivator for their method). Our algorithm can also terminate early at a desired data size if that is a concern, or even finish before that point if the theoretically optimal state is achieved earlier. In short, we believe GIO's ability to use either a theoretically optimal state to terminate or a predetermined criterion is a strength of the method. We acknowledge that this point could be expanded in the paper and presented as a strength of our method, not a weakness of other methods, and will include that shortly in our revision.
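A minimal sketch of the stopping rule described here, assuming a per-step selection function `add_next` and a sample-based KL estimator `kl` (both hypothetical names standing in for the method's components):

```python
def select_until_no_gain(X, G, kl, add_next, seed):
    """Keep adding points while the estimated KL(X || V) still decreases;
    stop the moment a step no longer improves it, so no target data size
    has to be fixed in advance."""
    V = list(seed)                      # small initial selection so kl is defined
    best = kl(X, G[V])
    while True:
        i = add_next(X, G[V], G)        # one selection step returning a candidate index
        new_kl = kl(X, G[V + [i]])
        if new_kl >= best:              # the new point adds no information about X
            return V
        V.append(i)
        best = new_kl
```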

W2 We selected a diverse breadth of peer-reviewed data selection approaches to compare our method against. We selected BM25 for its information-theoretic foundation and identical problem setup (matching distributions using information theory with a desired target set X), self-pruning for both its strong empirical results in the category of data pruning and its ability to apply to any domain beyond text (like our method), and submodular optimization as a very similar method optimizing information on vector spaces with a desired target X, but using a different class and concept of functions. We believe these baselines are representative of the available accepted methods in various categories of data selection and have shown strong results. Xie et al.'s method does use probability theory, but also some heuristics fundamental to the method (vector size), which is why we initially classified it as a heuristic method. However, considering that the concept and theory are well fleshed out and used as the background for their algorithm, we will reclassify their method under n-gram information-theoretic methods along with BM25 and update this shortly in our revision. We thank the reviewer for pointing out the merits of Xie et al.'s method.

Q1 This is an excellent point. Broadly speaking, GIO only seeks to find the data in G that best represents the target X, both of which are represented by the same model in the same embedding space. As long as an embedding model is (a) deterministic over the whole space and (b) consistent for both X and G, the result should hold no matter how good or bad the embedding model is. If an embedding model is bad, so long as it is bad for both G and X in the same way, the algorithm still works. Based on the reviewer's feedback, we decided to try a radical experiment: if we embedded our text with a terrible encoder, would GIO still work? We repeated the EN-DE@25% translation task, but this time used a Spanish-language encoder, designed exclusively for Spanish, to encode our English data. Here are the results of the translation:

| Embedding Model                  | BLEU |
|----------------------------------|------|
| GIO with regular embedding model | 27.0 |
| GIO with Spanish embedding model | 27.1 |

Encoding the English text with a Spanish language model led not only to similar, but 0.1 BLEU better, performance on the test set. We will definitely include this interesting experiment in our robustness section, as it demonstrates our method is robust to bad embedding models. We deeply appreciate the reviewer's insightful feedback on this subject!

Q2 We considered using simpler feature spaces like the n-grams used by Xie et al. Xie et al.'s decision to operate on n-grams has the distinct advantage that it scales more efficiently to very large datasets than embeddings, which would require more storage, for example. However, we wanted our algorithm to be domain- and task-agnostic; for this reason, and because semantic embeddings can capture more complex relationships, we designed it to operate on any arbitrary vector space from any domain (and demonstrated this with successful results on both text and image tasks).

Comment

Thank you again for the insightful review. We believe we responded to all concerns, but please let us know if we can supply any additional information before the discussion period closes. We appreciate your feedback, and thanks again!

AC Meta-Review

Gradient Information Optimization (GIO) is a scalable, task-agnostic method for selecting high-quality training data from a larger set, using a small subset of unlabeled examples. It simplifies a complex information-theoretic objective for practical use, demonstrating significant effectiveness in machine translation, spelling correction, and image recognition with minimal training data, and is adaptable to various domains and datasets.

All the reviewers are happy with the results, and I agree with the consensus. This paper would be a great addition to ICLR!

Why Not a Higher Score

N/A

Why Not a Lower Score

This paper presents strong theoretical and empirical results, and all the reviewers found the paper worth accepting to ICLR!

Final Decision

Accept (spotlight)