Retraining with Predicted Hard Labels Provably Increases Model Accuracy
We theoretically show that retraining a model with its predicted hard labels can improve its accuracy when the given labels are noisy, and empirically demonstrate that retraining significantly improves label DP training at no extra privacy cost.
Abstract
Reviews and Discussion
This paper investigates the benefits of retraining a model using its own predicted hard labels in scenarios where training data contains noisy labels.
There are two strategies for retraining the model:
- Full Retraining: The model is retrained on the entire dataset using its own predicted hard labels.
- Consensus-Based Retraining: Only samples for which the model's predicted label matches the original noisy label are used for retraining. (A minimal code sketch of both strategies follows.)
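A minimal sketch of the two strategies, assuming a generic scikit-learn-style classifier; `model_class`, `X`, and `y_noisy` are hypothetical placeholders, not the authors' code:

```python
def retrain(model_class, X, y_noisy, consensus=False):
    """Train on noisy labels, predict hard labels, then retrain from scratch.

    model_class: any classifier class with .fit/.predict (placeholder).
    consensus:   if True, retrain only on the consensus set, i.e., samples
                 where the initial prediction agrees with the given label.
    """
    initial = model_class().fit(X, y_noisy)   # initial training on noisy labels
    y_pred = initial.predict(X)               # predicted hard labels

    if consensus:
        mask = y_pred == y_noisy              # consensus set
        X_rt, y_rt = X[mask], y_pred[mask]    # predicted == given on this set
    else:
        X_rt, y_rt = X, y_pred                # full retraining on predictions

    return model_class().fit(X_rt, y_rt)      # retrained model
```

For example, with scikit-learn arrays one could call `retrain(LogisticRegression, X, y_noisy, consensus=True)`.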
The paper provides a rigorous theoretical analysis showing that full retraining with predicted hard labels can improve a model's population accuracy. In a linearly separable binary classification setting with randomly flipped labels, the authors derive error bounds and sufficient conditions under which retraining is beneficial.
The authors also conduct extensive experiments on datasets such as CIFAR-10, CIFAR-100, and AG News Subset (a language dataset). The results show that both full retraining and consensus-based retraining enhance model performance, with consensus-based retraining providing the most significant improvements.
Update after rebuttal
The detailed response has resolved my concerns. Thus, I raise my score after the rebuttal.
Questions For Authors
- In scenarios where the initial model has low accuracy, how does the retraining process behave? Addressing this question could clarify the robustness of your method and whether it remains effective when the initial model is weak.
- Regarding the training details (Lines 1439–1440), why must the number of gradient steps and the initial learning rate be chosen based on the performance of the baseline method? Are the retraining methods particularly sensitive to these hyperparameters?
- What are the noise rates corresponding to different values of ε?
- The authors explicitly state that the forward correction algorithm is applied in the initial training stage for the experiments in Table 5. However, what loss function is used in the initial training stage for the experiments in Tables 1, 2, 4, and 6? Is it the standard cross-entropy loss? Clarifying this would improve the reproducibility of the reported results.
Claims And Evidence
The paper supports its main claims with a combination of rigorous theoretical analysis and extensive empirical validation.
However, there are some aspects where the evidence is less complete:
- The theoretical analysis focuses on full retraining under a uniform label noise model, while the consensus-based retraining, which empirically shows superior performance, lacks a corresponding theoretical analysis.
- The experiments are conducted on moderate-scale datasets, so the scalability and generalizability of the approach to larger or more complex settings (e.g., experiments on the ImageNet dataset) remain to be further explored.
Methods And Evaluation Criteria
The proposed methods, namely full retraining and consensus-based retraining, make sense for tackling the challenges of learning with noisy labels and label differential privacy. Additionally, benchmark datasets like CIFAR-10, CIFAR-100, CIFAR-100N, and AG News Subset are widely recognized.
Theoretical Claims
I checked the main theorems (Theorems 4.1, 4.2, 4.8, and 4.9).
Experimental Designs Or Analyses
The benchmark datasets are widely recognized in the community, and the experimental designs are reasonable. But some experimental settings remain unclear (see questions).
Supplementary Material
I have reviewed the problem setting, the proofs, the experimental details, and the experiment on the real-world dataset (CIFAR-100N).
Relation To Broader Scientific Literature
The key contributions of the paper relate to two broader bodies of literature:
- Learning with Noisy Labels: There is a lot of work on training models in the presence of noisy labels, which often involves robust loss functions or noise-correction techniques. The paper contributes to this literature by offering the first theoretical guarantees showing that full retraining with predicted hard labels can provably improve model accuracy under uniform label noise.
- Label Differential Privacy (DP): In the context of privacy-preserving machine learning, label DP has emerged as an important concept. Prior works have proposed various noise-injection mechanisms (such as randomized response) to ensure privacy for sensitive label information. This paper shows that retraining methods (full retraining and consensus-based retraining) can enhance the model's performance without additional privacy costs.
Essential References Not Discussed
All essential related works are cited or discussed in the paper.
Other Strengths And Weaknesses
This paper investigates the benefits of retraining a model using its own predicted hard labels for label differential privacy (DP) and provides theoretical analysis. However, it has several limitations.
First, the theoretical analysis is confined to binary classification using linear models. Consequently, the derived results and error bounds are limited in scope and may not extend to practical scenarios where many tasks involve multiclass classification and complex nonlinear models. In real-world applications, sufficiently powerful nonlinear models can potentially memorize all the noisy labels. As a result, the model's predicted hard labels would simply replicate the noisy labels, rendering full retraining ineffective. This limitation suggests that while the theoretical contributions are valuable for understanding retraining in controlled settings, their applicability to more complex, realistic models remains questionable.
Second, although consensus-based retraining shows superior performance empirically, the paper does not provide a corresponding theoretical framework to analyze its behavior or guarantees.
Third, the effectiveness of retraining is heavily dependent on the accuracy of the initial model's predictions. In scenarios where the initial model performs poorly, the retraining process might not yield significant improvements.
Other Comments Or Suggestions
The meaning of ε is not explained in the introduction, yet it appears in both the abstract and the conclusion of the introduction. Readers who are not familiar with Label Differential Privacy (DP) may be confused. It would be beneficial to provide an intuitive explanation of ε in the introduction to enhance clarity and accessibility.
Thanks for your review and questions! We address your concerns below.
Other Strengths And Weaknesses:
1. "First, the theoretical analysis is…remains questionable.":
- We agree that our analysis on linear models for binary classification will not fully explain what happens in the case of non-linear models for multi-class classification, and we don't intend to oversell the scope of our theoretical results. But we believe it is valuable as a first step; after all, ours is the first work to analyze retraining with hard labels in any setting. Moreover, we believe that some of our proof ideas could be useful even in the analysis of non-linear models. For instance, the proof technique of constructing dummy predicted labels that match the actual predicted labels with high probability (see lines 307-319 left column) should be useful in general, because the issue of dependence of each predicted label on the entire training set is universal regardless of the model type.
- Regarding your point about complex models perfectly fitting noisy labels, we completely agree. And that is why, for such expressive models, it is important to apply (both in theory and practice) some kind of regularization when training them with noisy labels; e.g., ℓ2 regularization, early stopping, etc. Applying regularization is reasonable in scenarios such as label DP, where we already know that the labels will be noisy.
2. "Second, although consensus-based retraining…its behavior or guarantees." / first bullet point under Claims And Evidence: Agreed. We have admitted this limitation in Section 6, and plan to analyze consensus-based retraining in the future. Please note that the analysis of full retraining is itself pretty non-trivial (main technical challenges have been discussed after Thm. 4.8) and interesting in our opinion.
We do acknowledge the above two weaknesses. However, it is usually very difficult to perfectly align theoretical analysis with practical settings, and it is common to analyze simplified settings. So we believe these weaknesses do not fundamentally undermine the significance of our work.
3. "Third, the effectiveness of retraining…yield significant improvements." Indeed, retraining should intuitively only be beneficial when the initial model’s predictions are more accurate than the given (noisy) labels used to train the initial model. We have discussed/demonstrated this in several parts of the paper – Fig. 1 (see its caption), Tables 3 and 7 (these are on real data), and the comment on the range of after Remark 4.10 (specifically, regarding the lower bound on ). Moreover, in Appendix J & Table 10, we did an ablation study with and without a validation set. The initial model is naturally weaker w/o a validation set (due to overfitting); despite this, retraining is still beneficial but the gains are less than those with a val set. This observation is not surprising.
Second bullet point under Claims And Evidence: As mentioned in Section 6, performing larger experiments is left for future work. While we did not have time to train on ImageNet from scratch now, we ran experiments on the DomainNet dataset (available in Tensorflow), which has 345 classes and is much larger than CIFAR. We did linear probing (due to lack of time) with features extracted from a ResNet-50 pretrained on ImageNet. The setup is similar to our full fine-tuning experiments. DomainNet results:
| ε | Baseline | Full RT | Consensus-based RT |
|---|---|---|---|
So even here RT (especially consensus RT) yields large gains.
Questions For Authors:
1. Please see the response to weakness 3 above (especially the last two sentences about the ablation).
2. They need not be chosen based on the baseline’s performance. We did this to avoid any further hyper-parameter tuning based on retraining – to demonstrate that retraining is not very sensitive to hyper-parameters. If one were to optimize the hyper-parameters based on retraining’s performance as well, the gains would only increase.
3. If randomized response (RR) is used as the baseline, then with K classes and for ε-label DP, each sample retains its true label w.p. e^ε/(e^ε + K − 1) and receives any other particular label w.p. 1/(e^ε + K − 1) (this has been explained in lines 177-181 left column). If the method of Ghazi et al. (2021) is used, then their first stage is RR (so the same as before), but the noise level of subsequent stages depends on the performance of the previous stage's model. (A small code sketch of RR appears after this list.)
4. Standard cross-entropy loss was used; we’ll mention this in the next version. Thanks for pointing this out!
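As referenced in point 3 above, here is a small sketch of K-ary randomized response; the helper and its names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def randomized_response(y, num_classes, epsilon, rng=None):
    """K-ary randomized response for epsilon-label DP.

    Keeps the true label w.p. e^eps / (e^eps + K - 1); otherwise replaces it
    with one of the other K - 1 labels uniformly at random, so each wrong
    label has probability 1 / (e^eps + K - 1).
    """
    rng = rng or np.random.default_rng()
    K = num_classes
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + K - 1)
    y_noisy = np.array(y, copy=True)
    flip = rng.random(len(y_noisy)) > p_keep             # labels to randomize
    offsets = rng.integers(1, K, size=int(flip.sum()))   # uniform over {1,...,K-1}
    y_noisy[flip] = (y_noisy[flip] + offsets) % K        # a uniformly random *different* label
    return y_noisy
```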
We hope to have resolved your concerns and we're happy to discuss further. If you’re satisfied, we sincerely hope you will raise your score!
Thanks for your detailed rebuttal and extra experiments. My concerns have been resolved. Thus, I have decided to raise my recommendation score.
Thanks for raising your score! We’ll add the extra experiments (and important clarifications from the rebuttal) in the next version.
The authors theoretically analyze retraining in a linearly separable binary classification problem and show that it can improve model accuracy relative to the initial training in the presence of label noise. They show that retraining is particularly helpful with high levels of label noise. The paper then empirically shows that the proposed consensus-based retraining works better than plain retraining.
Update after rebuttal
After reading all the reviews carefully and considering the additional effort made by the authors, I decided to raise my score from 3 to 4. I think this is an excellent paper.
Questions For Authors
- How do you relate the retraining technique to the problem of memorization of noisy samples? That is a well-known problem in the noisy-labels literature and I am afraid that retraining could worsen the memorization effect. Can you provide an empirical analysis of the memorization effect when using retraining? [a,b,c]
- What happens when we increase the number of gradient steps? Does the gap between the accuracy achieved with and without retraining decrease? Is there a point at which, if we train the model for X steps, retraining lowers the accuracy? Maybe this would be an interesting ablation study.
- You use the baseline in Ghazi et al. 2021. Which objective function do you use to train your neural networks? I assume you use the cross-entropy. However, you did not study how the performance of your algorithm would change when changing the baseline or the objective function. This could raise questions about the general validity of retraining. Could you study the performance under these changes?
[a] Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., ... & Lacoste-Julien, S. (2017, July). A closer look at memorization in deep networks. In International conference on machine learning (pp. 233-242). PMLR.
[b] Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. ICLR 2017.
[c] Liu, S., Niles-Weed, J., Razavian, N., & Fernandez-Granda, C. (2020). Early-learning regularization prevents memorization of noisy labels. Advances in neural information processing systems, 33, 20331-20342.
Claims And Evidence
The claims are almost all clear and convincing.
- The main claim for which the clarity could be improved is the specification (especially in the abstract) that they theoretically analyze a binary classification problem.
- In line 123 you claim that your work is on the fully supervised setting. Isn't the label noise scenario considered weakly-supervised?
Methods And Evaluation Criteria
The proposed evaluation criteria make sense for the problem considered.
Theoretical Claims
I checked the theoretical claims superficially and they seem correct and well written.
Experimental Designs Or Analyses
I would have preferred to see a comparison with other algorithms that perform classification with label noise, but I only see a minor result on the combination of retraining and forward correction. I think that a wider comparison would help in understanding whether the contribution of this paper is mainly theoretical or whether there is also a possible advancement for state-of-the-art techniques. I don't understand why the authors did not share the code. This raises concerns about the reproducibility of their results.
Supplementary Material
The code is not provided. The appendix is well written.
Relation To Broader Scientific Literature
The paper's contributions are incremental, as the retraining technique is well known. However, the theoretical analysis is interesting and novel in my opinion.
Essential References Not Discussed
I am not aware of important related work that is not cited in the paper.
Other Strengths And Weaknesses
Strengths
- The paper is well written and even though it is theoretically heavy, it can be easily read by non-experts
- The related work section is very useful
- The experimental results reinforce the theoretical claims
Weaknesses
- No code
- No comparison with other techniques for classification with label noise (apart from Forward correction)
- No theoretical analysis or comments for the multi-class classification problem
Other Comments Or Suggestions
- In line 165 at the beginning of p. 4, I would prefer the authors to use a different notation.
Thanks for your review and great questions! We address your questions/concerns below.
Claims And Evidence:
- We will clarify "binary" in the abstract.
- Here we simply meant a setting where we have labels for all samples - to distinguish it from the setting of self-training where we are not given labels for all the samples. We’ll clarify this.
Experimental Designs Or Analyses / first two weaknesses:
- Regarding comparisons with other noise-robust methods, please note that we are not claiming retraining is a SOTA general-purpose noise-robust method (see lines 100-103 left column). We are just advocating it as a straightforward post-processing step that can be applied on top of vanilla training or a noise-robust training method. In case it wasn't clear, Table 5 shows results wherein initial training (baseline) was done with forward correction applied to the method of Ghazi et al. 2021, and retraining was done on top of this. Please also see our response to your third question (under Questions For Authors) below, where we show that retraining is beneficial as a post-processing step even when using a noise-robust loss function instead of the usual cross-entropy loss. Moreover, it's not straightforward to apply many existing noise-robust methods to sophisticated label DP mechanisms (such as Ghazi et al.); retraining is very easy to apply in contrast.
- We didn't release the code because, at the time of submission, we hadn't obtained our organization's approval to release it. We weren't sure if code could be shared in the rebuttal because the email on rebuttal instructions didn't mention anything about code. We will release the code upon paper acceptance.
No theoretical analysis or comments for the multi-class case (Weakness 3): Extending our analysis to the multi-class case is left for future work. Here is a starting point: in the case of K classes, the labels y_i will be K-dimensional one-hot vectors, the ground truth will be a d × K matrix (the features x_i are still d-dimensional vectors, but the y_i need to be defined appropriately in terms of the ground-truth matrix and the x_i), and our predictor will also be a d × K matrix.
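One plausible formalization of this starting point (the notation below is our reconstruction of the elided symbols, not verbatim from the paper):

```latex
% Multi-class analogue of the binary classifier in eq. 4.3 (a sketch):
% labels y_i \in \{0,1\}^K are one-hot, features x_i \in \mathbb{R}^d.
\[
  \widehat{\Theta} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i\, y_i^{\top}
    \;\in\; \mathbb{R}^{d \times K},
  \qquad
  \widehat{y}(x) \;=\; \operatorname*{arg\,max}_{k \in [K]}
    \big(\widehat{\Theta}^{\top} x\big)_k .
\]
```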
Questions For Authors:
1. Yes, memorization of noisy labels with powerful models is an issue. And if initial training is done naively, retraining may exacerbate this issue. That is why in almost all our experiments (except in Appendix J), we assume access to a clean validation set; please also see footnote 5 for the practical version of this assumption. This prevents the model from heavily memorizing. Moreover, as we show in Tables 3 & 7, the accuracy of the predicted (= given) labels on the consensus set is much higher than the accuracy of both the predicted and given labels on the full set. This shows that regularized initial training is effective at avoiding memorization. Further, as shown in Appendix J, even in the absence of a validation set, retraining is still beneficial but the gains are less – this is expected because the initial model's performance is degraded due to more memorization/overfitting here.
2. Indeed, the benefit of retraining decreases when initial training is done for a larger number of steps. We studied this in Appendix J – here we don’t have a validation set and trained blindly for 100 epochs. Due to more overfitting here, the gains of retraining are lower than the corresponding gains with a validation set where we stopped at 40 epochs. If we train for even longer, the initial model will heavily memorize the noisy labels and this will probably render retraining ineffective.
3. Yes, we used the cross-entropy (CE) loss; we'll state this in the next version. Our baseline for AG News is actually randomized response (see lines 425-426 left column) to demonstrate the generality of retraining w.r.t. the baseline. Further, in Table 5, our baseline is forward correction applied to the method of Ghazi et al. 2021. So we do have results with other baselines. And based on your suggestion, we performed experiments with the noise-robust symmetric CE loss function proposed in [1] (1k+ citations) instead of the vanilla CE loss. In their loss (eq. 7 of [1]), we set the two weighting parameters α and β to fixed values. Here are the results for CIFAR-100 w/ ResNet-34.
| ε | Baseline | Full RT | Consensus-based RT |
|---|---|---|---|
Thus, consensus RT yields meaningful gains even with the loss function of [1].
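For reference, a minimal PyTorch-style sketch of the symmetric CE loss of [1] (eq. 7 there); this is our reconstruction for illustration, with `alpha`/`beta` as the two weighting parameters, not the exact training code:

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha, beta, num_classes, A=-4.0):
    """SCE of Wang et al. (2019): alpha * CE + beta * reverse CE (RCE).

    RCE swaps prediction and one-hot target inside the cross entropy; log(0)
    arising from the one-hot target is clamped to the constant A.
    """
    ce = F.cross_entropy(logits, targets)

    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes).float()
    # log of the one-hot target: 0 for the true class, A for all others
    log_target = torch.where(one_hot > 0,
                             torch.zeros_like(one_hot),
                             torch.full_like(one_hot, A))
    rce = (-pred * log_target).sum(dim=1).mean()

    return alpha * ce + beta * rce
```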
We hope to have resolved your concerns and we are happy to discuss further. If you’re satisfied with our answers, we sincerely hope you will raise your score!
[1]: Wang, Yisen, et al. "Symmetric cross entropy for robust learning with noisy labels." ICCV 2019.
Thank you for the answers. I will keep my score as it is.
Thanks for your reply.
We are adding some new results on a bigger dataset to show that retraining is effective when applied on top of label noise correcting methods. Specifically, we show results when the baseline is forward correction and backward correction (from Patrini et al. 2017 cited in the paper) applied to the first stage of Ghazi et al. 2021 (similar to Table 5 in the paper); these results are in (A) and (B) below, respectively. For comparison in (C) below, we also show results when the baseline is just Ghazi et al. 2021 (i.e., no correction is applied). These results are on the DomainNet dataset (available on Tensorflow) which has 345 classes and is much larger than CIFAR. We did linear probing (using cross-entropy loss) with features extracted from a ResNet-50 pretrained on ImageNet. The setup is similar to our full fine-tuning experiments.
(A) Baseline = Forward Correction (Patrini et al. 2017) + Ghazi et al. 2021:
| ε | Baseline | Full RT | Consensus-based RT |
|---|---|---|---|
(B) Baseline = Backward Correction (Patrini et al. 2017) + Ghazi et al. 2021:
| ε | Baseline | Full RT | Consensus-based RT |
|---|---|---|---|
(C) Baseline = Ghazi et al. 2021 (no correction):
| ε | Baseline | Full RT | Consensus-based RT |
|---|---|---|---|
As expected, forward and backward correction lead to better initial model performance (compared to no correction). The main thing to note, however, is that consensus-based RT yields significant gains even with forward and backward correction, consistent with our earlier results. Thus, consensus-based RT is a very effective post-processing step for improving learning with noisy labels. (It is worth noting that for certain values of ε, consensus-based RT leads to similar accuracy with and without noise correction.)
We hope you will take these extra results into consideration.
The paper gives a theoretical treatment of when learning with predicted hard labels is more beneficial than learning with the original noisy labels.
Questions For Authors
- Line 072, Figure 1: what's the classifier used to get the result? (Linear, MLP, or the one in Eqn. 4.3/4.8?)
- Line 217: why is "u" needed? It seems a bit redundant and does not seem to play a key role.
- Eqn. 4.3 & 4.8: my biggest concern: why does the classifier take this form? It does not correspond to any standard classifier, and what is the multi-class version of it?
- Theorem 4.6 (minimax lower bound): I am aware of three (non-parametric) lower bounds in [1-3], so I would like to know the position of this lower bound in the literature.
- Theorem 4.6 also applies to the "retraining classifier" in Eqn. 4.8; therefore, predicted hard labels do not provide a gain in terms of rate/sample complexity. Then is the benefit of using the hard labels only in terms of the constants?
[1] T Tony Cai and Hongji Wei. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 49(1):100–128, 2021.
[2] Hyungki Im and Paul Grigas. Binary classification with instance and label dependent label noise. arXiv preprint arXiv:2306.03402, 2023.
[3] Yilun Zhu, Jianxin Zhang, Aditya Gangrade, and Clayton Scott. Label Noise: Ignorance Is Bliss. In Advances in Neural Information Processing Systems, 2024
Claims And Evidence
Yes, the claims were proved.
Methods And Evaluation Criteria
Overall, it makes sense to me, though I am not quite sure why the "label DP" setup is considered; it seems to me that a standard label noise setup suffices.
Theoretical Claims
I've skimmed through the proofs, but have not checked the details.
Experimental Designs Or Analyses
Yes.
Supplementary Material
I've skimmed through the proofs and read the experimental setups.
Relation To Broader Scientific Literature
The benefit of using predicted hard labels has been studied empirically; the theoretical treatment is new.
Essential References Not Discussed
Essential references are included.
(Optional) There are some related supplementary references; see below.
Other Strengths And Weaknesses
Overall I enjoyed reading the paper. The theoretical treatment of hard labels is new to me, and I think it is a good contribution to the literature. I have some concerns at this point; I will be happy to read the authors' comments and re-assess my review.
My biggest theoretical concerns are:
- The form of the classifier considered: it does not correspond to any standard classifier. (At first glance, I would expect an ERM or logistic-regression-type classifier.)
I look forward to seeing an experimental setup that is more aligned with the theoretical setting:
- a data simulation that corresponds exactly to the theoretical setting, e.g., a 2-dimensional mixture of two Gaussians (using the exact form of the classifier from the theory).
- I think it is also possible to align CIFAR experiments with the theoretical setting, e.g., use a pretrained NN to extract the feature, then apply linear classifier on top of it (aka, "linear probing" in self-supervised learning).
These should provide stronger evidence to the theory.
Other Comments Or Suggestions
- Notation: in the label noise literature, people usually use ỹ to denote the noisy label, rather than ŷ; ŷ is usually used to denote the label predicted by the classifier. This is a bit confusing.
- Line 197: "perfectly separable setting": I don't think it's separable, because a Gaussian has infinite support, so the positive and negative classes overlap. Do you mean that the Bayes decision boundary is linear?
- Eqn. 4.4: the notation is overloaded; here it integrates over x, while in Theorem 4.1 it is conditioned on x.
Thanks for the review and great questions!
(A) Label DP setting. We focused on this because it’s not clear how to apply existing noise-robust techniques on top of existing label DP mechanisms, while retraining is a simple post-processing step. For e.g., as mentioned in lines 365-366 right column, it’s not obvious how to apply forward correction to the second stage of Ghazi et al 2021.
(B) Form of the classifier (eqs. 4.3 & 4.8). This is a simplification of the least squares solution (LSS) obtained by removing the empirical covariance matrix's inverse. The way to analyze the LSS would be to bound the deviation of the empirical covariance matrix from the population covariance matrix (which shrinks as the sample size n grows), and then analyze with the features pre-multiplied by the covariance matrix's inverse. This would just make the math more tedious w/o adding any meaningful insights. Also, as we wrote around eq. 4.3, our classifier is similar to kernel methods with the inner product kernel & it has been used in the highly-cited work of Carmon et al. 2019.
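To make this concrete, a short worked version of the point above (our reconstruction of the elided symbols; the exact form is eq. 4.3 in the paper):

```latex
% Least squares solution vs. the simplified classifier (a sketch):
\[
  \widehat{\theta}_{\mathrm{LS}}
    = \Big(\frac{1}{n}\sum_{i=1}^{n} x_i x_i^{\top}\Big)^{-1}
      \Big(\frac{1}{n}\sum_{i=1}^{n} y_i x_i\Big)
  \quad\longrightarrow\quad
  \widehat{\theta} = \frac{1}{n}\sum_{i=1}^{n} y_i x_i ,
\]
% i.e., dropping the inverse empirical covariance matrix leaves the simpler
% estimator, avoiding tedious matrix-concentration steps in the analysis.
```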
(C) Experimental setup more aligned with theory setting
- The setting of Fig. 1 corresponds to our theory setting; see Appendix A.
- We did linear probing (LP) for CIFAR-100 with features extracted from a ResNet-50 pretrained on ImageNet. The setup is similar to our full fine-tuning (FT) experiments; we omit details here due to lack of space. Results:
| ε | Baseline | Full RT | Consensus-based RT |
|---|---|---|---|
So even here RT (especially consensus RT) yields good gains. Note that LP performs much better than full FT; this is often the case when training with noise, due to less overfitting with LP.
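A minimal sketch of this linear-probing protocol (library calls and names are illustrative; data loading and the retraining step are omitted):

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from sklearn.linear_model import LogisticRegression

# Frozen ImageNet-pretrained ResNet-50 as a feature extractor
backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()   # drop the final classification head
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for x, y in loader:             # images + (noisy) labels
        feats.append(backbone(x))
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Hypothetical usage: fit a linear probe on the frozen features, then apply
# full / consensus-based retraining on top of its hard predictions.
# X, y = extract_features(train_loader)
# probe = LogisticRegression(max_iter=1000).fit(X, y)
```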
(D) Line 197: perfectly separable setting. Our modified GMM setting (eq. 4.1) is separable. As discussed in lines 172-176 right column, the ground truth θ* separates the data perfectly.
We’ll fix/clarify notational ambiguities.
(E) Questions for Authors
1. The ones in eqs. 4.3 & 4.8. Also see Appendix A.
2. We introduced u so that the margin along θ* is not the same for every data point. As explained in lines 172-175 right column, the margin depends on u. If there were no u, all the points would have the same margin. We agree that, from the analysis perspective, u is not very important.
3. For the binary case, see (B) above. Even in the multi-class case, something like this has been studied in reference [A] (see Section 2.2). Specifically, for K classes, the labels y_i will be K-dimensional one-hot vectors, the ground truth will be a d × K matrix (the features x_i are still d-dimensional, but the y_i need to be defined appropriately in terms of the ground-truth matrix and the x_i), and our predictor will also be a d × K matrix.
4. These lower bounds are for much more general settings than ours, and so they are weaker than ours. In [1], our setting corresponds to a special case of their framework (per Definition 2 of [1] and the paragraph after Remark 3). Instantiating their Thm 3.2 accordingly, the resulting lower bound on the error yields a much worse sample complexity than our result in Thm 4.6 in the regime we consider. In [2], the lower bound on the error (Thm 2) does not decrease with the number of samples, so even with infinitely many samples, 0 error is unattainable in the worst case. As for [3], the lower bound on the error (Thm 1) also has a non-diminishing term. In the special case where that term is small, there is a dependence on the noise level but no dependence on the dimension (or a related quantity). But their upper bound in Thm 2 does have a dependence on a VC dimension-like quantity as expected, so their lower bound is probably loose w.r.t. dimension.
5. Yes, Thm. 4.6 also applies to the retraining (RT) classifier. As per Remark 4.10, the minimum number of samples needed for RT to be better exceeds the lower bound of Thm. 4.6. And as discussed after Remark 4.10, this requirement on the sample size is probably tight (modulo log factors), because we can only hope for the RT classifier to be better if the accuracy of the labels with which it is trained – namely, the initial model's predicted labels – exceeds the accuracy of the noisy labels with which the initial model is trained; per Thm. 4.6, this itself requires at least as many samples as that lower bound. So yes, RT cannot improve the sample complexity beyond it.
We hope to have resolved your concerns and are happy to discuss further. If you’re satisfied, we hope you will raise your score!
[A]: "Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View", Thrampoulidis et al., 2020
Thank you for the detailed response.
It would be nice to see a more comprehensive discussion of the lower bound in the next version of the paper (can be a formalized version of the response). Given the limited exploration of lower bounds in the existing label noise literature, I think it is a good addition.
Regarding the debate on "linear separability," my understanding is that distributions are separable if and only if their supports do not overlap. However, this difference in perspective is minor, and I am comfortable moving forward despite differing views.
Overall, the authors have adequately addressed my concerns, and I anticipate the next version of the manuscript will provide further clarity. I recommend acceptance (and have raised the score from 3 to 4).
Thank you for raising your score! Yes, we will add a discussion on the lower bounds in the next version and we agree, it’ll be a good addition. Thanks for pointing out these papers! We’ll also clarify what we mean by separability and add the extra experiments.
This is a case where the authors' response resulted in two reviewers increasing their original scores, so the overall paper score is now 4,4,3. The responses have addressed the concerns, and the reviewers have increased their scores on the condition that the authors include the additional results (e.g., the discussion and formalization of the lower bounds, and the additional experimental results) in the camera-ready. I recommend acceptance.