PaperHub

Distribution Aware Active Learning via Gaussian Mixtures

ICLR 2024 | Decision: Rejected | 3 reviewers
Ratings: 3, 6, 8 (average 5.7/10, min 3, max 8, std 2.1) | Average confidence: 4.3
OpenReview | PDF
Submitted: 2023-09-21 | Updated: 2024-02-11
TL;DR

This work aims to mitigate overfitting in AL by reducing the distributional discrepancy between the labeled set and the unlabeled set, with the aid of distributional information obtained by Gaussian Mixture Models (GMM).

Abstract

Keywords
Active Learning, Uncertainty Estimation

Reviews and Discussion

Review (Rating: 3)

This paper presents a robust distribution-aware learning and sample selection strategy that employs a Gaussian mixture model (GMM) to effectively encapsulate both the labeled and unlabeled sets for AL. A regularization method and an informativeness metric are further proposed to detect overfitted areas. Experiments on multiple datasets, including balanced and imbalanced datasets, demonstrate the validity of the proposed method.

Strengths

(1) The approach of applying distribution alignment to active learning is valid. (2) Multiple datasets, including balanced and imbalanced datasets, are evaluated to show the advantages of the proposed method.

Weaknesses

(1) It seems that the paper poses a relatively strong assumption that the data follows a Gaussian mixture model (GMM) and derives the statistics from the GMM. However, the authors fail to show that the GMM is a good model to approximate the data distribution in active learning. In fact, because of the nature of active learning, there are frequently samples from new classes that do not follow the estimated GMM. In that case, the estimation based on the GMM will not be accurate. (2) The novelty of the paper is limited, as distribution alignment has been widely studied in previous work: (i) Zhang et al., “Distribution Alignment: A Unified Framework for Long-tail Visual Recognition”, CVPR 2021; (ii) “Agreement-Discrepancy-Selection: Active Learning with Progressive Distribution Alignment”, AAAI 2021. Reference (ii) also addresses distribution alignment with adversarial learning, but under a more general setting without the GMM assumption. Thus, it seems this paper is describing an existing method under some special conditions.

Questions

As mentioned in the weaknesses, the fundamental problem of this paper is that it is merely a special case of a solved general problem, which makes the novelty of the paper very limited. Without a detailed discussion of the differences from the existing works listed above, the novelty of the paper does not stand on safe ground.

Moreover, the authors need to show why the GMM is a good approximation of the data distribution in active learning, as there can be many outliers and unknown classes. What if the data follows a long-tailed distribution? The assumption of the paper is a bit too strong.

Last but not least, the paper does not offer any new insights into this field. The results are merely a small extension and special cases of existing approaches.

Comment

We thank the reviewer for their time and effort, and for providing helpful comments and valuable feedback. Below, we provide responses to the reviewer's comments.

Is GMM a good model in real-world AL settings?

As the reviewer correctly points out, in early learning cycles of AL, the labeled set ($X_L$) does not generally align well with the true distribution of the entire dataset. We agree that a single GMM would not give a good estimate here. However, note that in our approach we utilize two different GMMs, $\psi_L$ and $\psi_{UL}$, for $X_L$ and $X_{UL}$; we fit new GMMs $\psi_L$, $\psi_{UL}$ at every AL cycle to reflect changes in $X_L$ and $X_{UL}$, such that the new $\psi_L$ can encompass samples from new classes. This is how our method can handle outliers and samples from new classes. Related to this comment, we clarify that our intention is not to estimate the true distribution using $\psi_L$ alone, but to detect the discrepancy between $X_L$ and $X_{UL}$ using the GMMs $\psi_L$ and $\psi_{UL}$. Our goal is then to “reduce this discrepancy” via distributional alignment in Section 3.4 and our informativeness-based data selection in Section 3.5.

Moreover, our work does not assume special cases that completely exclude outliers or new classes, yet we achieve high performance in the general AL settings adopted in previous AL works. To verify this further, we conducted additional experiments in two setups the reviewer suggested: 1) an unseen-classes setting where the initial labeled dataset only includes 70% of the entire label set, and 2) a long-tail imbalance setting that follows the common setup of [1,2]. The results are provided in the tables below. It can be seen that our approach still outperforms other AL methods under these challenging settings. This further confirms the applicability and advantage of our method that uses two GMMs (for insights on why our method is more reliable, see later).

  • Table 1: Accuracy (%) in the unseen-classes setting on CIFAR10

| # of labeled samples | 1K | 2K | 3K | 4K | 5K | 6K | 7K | 8K | 9K | 10K |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 39.19 | 65.69 | 74.62 | 78.46 | 80.77 | 82.55 | 83.54 | 85.50 | 86.02 | 86.30 |
| TOD | 40.83 | 68.73 | 76.91 | 80.43 | 83.47 | 84.79 | 87.03 | 88.18 | 88.82 | 89.07 |
| ALFA-Mix | 39.19 | 67.86 | 76.36 | 79.92 | 83.29 | 84.83 | 87.23 | 88.52 | 89.03 | 89.31 |
| TA-VAAL | 39.19 | 69.01 | 77.81 | 80.03 | 82.94 | 84.56 | 86.11 | 87.64 | 88.03 | 88.44 |
| DAAL | 42.95 | 72.51 | 79.37 | 82.47 | 84.58 | 85.98 | 88.07 | 89.36 | 89.59 | 90.04 |
  • Table 2: Accuracy (%) in the long-tailed imbalance setting on CIFAR10 (imbalance ratio = 0.1)

| # of labeled samples | 1K | 2K | 3K | 4K | 5K | 6K | 7K | 8K | 9K | 10K |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 45.59 | 57.33 | 68.42 | 71.36 | 76.07 | 76.86 | 78.04 | 79.93 | 81.64 | 82.37 |
| CBAL [2] | 45.59 | 61.32 | 71.67 | 74.93 | 78.09 | 79.20 | 81.87 | 82.36 | 83.12 | 85.51 |
| DAAL | 50.92 | 71.87 | 77.98 | 81.47 | 82.48 | 85.34 | 86.53 | 87.09 | 87.74 | 88.53 |

These results are now included in Sections A.4 and A.5 of the Appendix. Additionally, in Section A.7 of the Appendix, we include experimental results on a different dataset (i.e., TinyImageNet) and backbone (i.e., ResNet-50).

In summary, the use of two GMMs as described above provides good estimates of the data distribution in realistic active learning settings, as the empirical results indicate.

[1] Cui et al., "Class-balanced loss based on effective number of samples." CVPR 2019.
[2] Bengar et al., “Class-balanced active learning for image classification”, WACV 2022.

Comment

Further Comment on the novelty of the proposed method

The novelty of our work lies in the incorporation of GMM into training and selection methods that aim to minimize the discrepancy between $X_L$ and $X_{UL}$, thereby mitigating the risk of overfitting to a small $X_L$. In particular, our approach involves the estimation of the latent distributions of $X_L$ and $X_{UL}$ using GMM, a novel feature not observed in existing AL works. GMM is a useful tool for active learning since it can estimate the original data distribution even in the absence of certain categories in $X_L$, by leveraging the rich data in $X_{UL}$. Our proposed GMM-aided methods constitute distinctive contributions that differ from prior works, which either train the task model using only $X_L$ or indirectly estimate the latent distribution without using probabilistic models like GMM. Also, as demonstrated in Table 1 of the main paper, our GMM-aided components can be easily integrated into existing works, potentially enhancing performance.

Again, we thank the reviewer for their time and effort. Your concerns are clear, and we feel we have successfully addressed all the issues you raised. Please let us know in case there are remaining questions/concerns; we hope to have an opportunity to further answer them.

Comment

Is novelty limited in light of prior works? ((i) CVPR 2021 (ii) AAAI 2021)

High-level differences from distributional alignment (DA) methods: With regard to DA methods, including references (i) and (ii), note that our contribution is not DA itself but rather capturing distributional information via GMM, for both DA and informativeness metric design. As illustrated in Fig. 5 of the main paper, the discriminator of traditional DA methods tends to overfit to $X_L$, thereby generating overconfident and unreliable predictions (i.e., low entropy). In contrast, we propose to feed GMM statistics to the discriminator to improve the efficiency of DA and the predicted informativeness. Aside from distributional alignment, other features of our proposed method are distribution-aware regularization and distribution-aware data selection; they work synergistically such that the distribution of $X_L$ quickly takes after the distribution of $X_{UL}$, thereby contributing to the significant performance improvement. Specifically, the idea of computing informativeness based on the difference between likelihoods has not been considered before; this is a key contribution of our work that is made possible by taking advantage of GMMs.
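To make the idea of feeding GMM statistics to the discriminator concrete, below is a minimal PyTorch sketch of a Wasserstein-style critic update in the latent space. This is not the authors' implementation: the layer sizes, the use of weight clipping as a crude approximation of the 1-Lipschitz constraint, and the names `Discriminator`, `critic_step`, and `gmm_stats` are illustrative assumptions.

```python
# Minimal sketch of a Wasserstein-style critic update for latent-space alignment,
# with per-sample GMM statistics concatenated to the latent features.
# Assumptions (not from the paper): weight clipping approximates the 1-Lipschitz
# constraint, and "GMM statistics" are two scalars per sample (e.g., log-likelihoods
# under psi_L and psi_UL).
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, feat_dim, stat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + stat_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, z, gmm_stats):
        # z: latent features; gmm_stats: relative positional info from the two GMMs
        return self.net(torch.cat([z, gmm_stats], dim=1))

def critic_step(disc, opt, z_l, z_ul, stats_l, stats_ul, clip=0.01):
    """One critic update: maximize D(x_ul) - D(x_l) by minimizing its negative."""
    loss_d = -(disc(z_ul, stats_ul).mean() - disc(z_l, stats_l).mean())
    opt.zero_grad()
    loss_d.backward()
    opt.step()
    for p in disc.parameters():            # crude 1-Lipschitz approximation
        p.data.clamp_(-clip, clip)
    return loss_d.item()

# toy usage with random tensors standing in for latent features and GMM statistics
disc = Discriminator(feat_dim=512, stat_dim=2)
opt = torch.optim.RMSprop(disc.parameters(), lr=5e-5)
z_l, z_ul = torch.randn(64, 512), torch.randn(64, 512)
stats_l, stats_ul = torch.randn(64, 2), torch.randn(64, 2)
critic_step(disc, opt, z_l, z_ul, stats_l, stats_ul)
```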

More detailed technical differences from (i) and (ii): With regard to the ADS work (reference (ii)), our method differs in the following ways. Whereas ADS leverages the difference between the outputs of two adversarial classifiers to approximate the latent distribution, we directly apply the EM algorithm to the latent space. Taking the example of ResNet18 on CIFAR10, we believe that our approach (which optimizes a GMM on the 512-dimensional latent space) is a more plausible design than ADS (which compares 10-dimensional classifier outputs) for estimating and aligning the latent distribution. Also, whereas ADS adopts adversarial learning that minimizes or maximizes the discrepancy between two adversarial classifiers, our work is based on a more solid theoretical background that optimizes a 1-Lipschitz discriminator to minimize the Wasserstein distance between $X_L$ and $X_{UL}$. Furthermore, we enhance the efficiency of distributional alignment by feeding the relative positional information (likelihoods, posteriors) of training samples to the discriminator. We wanted to provide comparative results as well. However, note that reference (i) requires prior information about the true category distribution, which is not available in AL except for a small $X_L$. For ADS (reference (ii)), we ran experiments on CIFAR10 and CIFAR100. The table is provided below and in Section A.6 of the Appendix. It can be seen that our work outperforms ADS. Finally, we note that we have already compared our method with other well-known DA-based AL methods, TA-VAAL and WAAL, in our original manuscript and validated our superiority (please refer to Figures 9 and 10 in our manuscript).

  • Table 3: Accuracy (%) on CIFAR10

| # of labeled samples | 1K | 2K | 3K | 4K | 5K | 6K | 7K | 8K | 9K | 10K |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 51.76 | 63.50 | 72.17 | 76.94 | 80.56 | 82.32 | 84.17 | 85.00 | 86.11 | 87.27 |
| ADS | 56.92 | 70.18 | 77.75 | 81.78 | 83.46 | 85.83 | 87.76 | 88.74 | 89.55 | 90.30 |
| DAAL | 58.98 | 74.15 | 79.21 | 83.42 | 85.63 | 86.75 | 88.09 | 89.39 | 90.01 | 90.87 |
  • Table 4: Accuracy (%) on CIFAR100

| # of labeled samples | 2K | 4K | 6K | 8K | 10K | 12K | 14K | 16K | 18K | 20K |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 20.68 | 31.76 | 41.34 | 47.56 | 51.65 | 55.80 | 58.07 | 59.66 | 61.53 | 63.07 |
| ADS | 23.40 | 35.25 | 43.05 | 49.29 | 53.37 | 57.32 | 60.84 | 62.77 | 63.74 | 65.54 |
| DAAL | 28.42 | 41.82 | 51.14 | 56.13 | 59.37 | 61.94 | 63.89 | 66.14 | 67.38 | 68.85 |

In summary, our method is very different from existing DA methods and also gives superior results.

(i) Zhang et al., “Distribution Alignment: A Unified Framework for Long-tail Visual Recognition”, CVPR 2021.
(ii) Fu et al., "Agreement-Discrepancy-Selection: Active Learning with Progressive Distribution Alignment”, AAAI 2021.

Comment

Rationale behind the choice of GMM as an approximation model

Compared to other clustering algorithms such as K-means clustering, agglomerative clustering, and DBSCAN, GMM has the advantage of considering the desired number of modes along with the variance; this allows GMM to capture complex patterns in data with non-linear or non-spherical cluster shapes. Additionally, GMM is effective when clusters have overlapping boundaries due to its ability to perform soft assignment, which takes ambiguity or uncertainty into account. Due to these advantages, GMM has been proven to be effective for approximating general real-world data (e.g., ImageNet, Cityscapes) in existing works [3,4]. Adopting GMM thus enables us to directly and accurately model the latent distribution with a probabilistic model, providing a quicker and more reliable acquisition of distributional information compared to previous distributional-alignment-based AL approaches [5,6] that indirectly infer the latent distribution by relying on pairwise distances or the output of additional modules.

[3] Lee et al., "Meta-GMVAE: Mixture of Gaussian VAE for Unsupervised Meta-Learning." ICLR 2020.
[4] Liang et al., “GMMSeg: Gaussian Mixture based Generative Semantic Segmentation Models”, NeurIPS 2022.
[5] Fu et al., "Agreement-Discrepancy-Selection: Active Learning with Progressive Distribution Alignment", AAAI 2021.
[6] Shui et al., "Deep Active Learning: Unified and Principled Method for Query and Training", AISTATS 2020.
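As a small illustration of the soft-assignment property mentioned above, the toy scikit-learn snippet below contrasts K-means and a GMM on two overlapping clusters; the data, component counts, and query point are placeholders, not an experiment from the paper.

```python
# Toy illustration of the soft-assignment property: K-means gives a hard label for a
# boundary point, while a GMM returns responsibilities that express the ambiguity.
# Data, component counts, and the query point are illustrative only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng.normal(loc=[1.5, 0.0], scale=1.0, size=(200, 2)),  # overlaps with the first blob
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

x_boundary = np.array([[0.75, 0.0]])      # a point between the two overlapping clusters
print(kmeans.predict(x_boundary))         # hard assignment, e.g. [0]
print(gmm.predict_proba(x_boundary))      # soft responsibilities, e.g. [[0.47, 0.53]]
```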

Insights into the advantage of our method in realistic settings (e.g., unseen classes & long-tail distribution)

  1. The proposed likelihood informativeness ($I_{Like}$ in Equation (4)) can be calculated without any modification of our algorithm, even when the number of categories in $X_L$ and $X_{UL}$ differs (i.e., the number of modes in the two GMMs is different). This is because we compare the likelihood scores computed from each GMM (a minimal sketch is given after this list). In a realistic imbalanced data setting, the likelihood difference between the two GMMs ($\psi_L$, $\psi_{UL}$) is large around minor/unseen classes (i.e., large likelihood from $\psi_{UL}$ and low likelihood from $\psi_L$). Thus, the proposed informativeness score $I_{Like}$ prioritizes samples from minor/unseen classes for selection, effectively addressing the imbalance problem.
  2. Our semi-supervised learning with GMM also works well in an imbalanced setting. This is because, even if there are no or few samples for certain classes in $X_L$, the GMM can capture modes for all classes based on information from $X_{UL}$. Therefore, our GMM-aided semi-supervised learning does not suffer from data imbalance, but rather encourages distributional alignment with the aid of GMM statistics from $X_{UL}$.
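A minimal sketch of the likelihood-difference score from point 1 above, assuming the two GMMs are fitted on latent features with scikit-learn using spherical (isotropic) components; the feature dimension, component counts, and random data are placeholders, and the exact form of Equation (4) may differ from this sketch.

```python
# Sketch: likelihood-difference informativeness, where a sample is scored highly
# when it is likely under psi_UL but unlikely under psi_L (minor/unseen classes).
# Feature dimension, component counts, and random data are placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
feats_labeled = rng.normal(size=(500, 64))     # latent features of X_L
feats_unlabeled = rng.normal(size=(5000, 64))  # latent features of X_UL

psi_L = GaussianMixture(n_components=10, covariance_type="spherical").fit(feats_labeled)
psi_UL = GaussianMixture(n_components=10, covariance_type="spherical").fit(feats_unlabeled)

# score_samples returns the per-sample log-likelihood under each GMM
i_like = psi_UL.score_samples(feats_unlabeled) - psi_L.score_samples(feats_unlabeled)

# candidates: samples psi_UL explains well but psi_L does not
candidates = np.argsort(-i_like)[:100]
```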
Comment

We appreciate your response. We would like to make the following points clearer:

Generality/applicability of our work

We respectfully reiterate that our algorithm is applicable to general AL settings, and the use of GMM does not limit the applicability of our work. We also do not assume any special cases. To support this, we have provided comprehensive responses, conducted extensive experiments under the same fair settings as the other AL baselines in the manuscript, and shown general superiority across various datasets and AL scenarios (balanced, imbalanced, long-tail, new classes), including the ones the reviewer requested.

Open-set annotation

In the original review, we believe the reviewer did not mention the open-set annotation setting but instead suggested the “new classes” and “long-tailed” setups. This is why we have focused on the experiments mentioned above and have explained why our scheme still works well, with clear insights.

Open-set annotation is a recently proposed AL setting [CVPR’22] that deviates from the typical AL setting, and there is still limited literature on it. It is assumed that the human annotator is not able to label certain unknown classes that are irrelevant to the main task. Open-set annotation and conventional AL settings target different applications with different focuses. For example, as provided in [CVPR’22], during AL for sports image classification, a large portion of images in the unlabeled set could be irrelevant to the main task (e.g., houses, dogs); the open-set annotation setting becomes important here. However, the conventional AL setting (without open-set annotations) is still important in many cases: in healthcare applications such as disease prediction, which is a primary application of AL, all images in the unlabeled set are directly related to the main task (e.g., X-ray images from patients), which requires us to consider the conventional AL setup.

To handle this new open-set setting, an approach orthogonal to existing AL strategies is required: as shown in [CVPR’22], existing AL methods do not work well in this setting. Although this setup has not been actively studied and remains largely unexplored, we can potentially address the challenge by orthogonally introducing an OOD sample detector, as in [CVPR’22], to exclude OOD samples from the unlabeled set, and then applying our GMM-based algorithm.

[CVPR’22] Active Learning for Open-set Annotation

Novelty

We would like to highlight that the key philosophy of our work does not lie in distributional alignment (DA). DA (in Section 3.4) is just one of our components, and our contribution spans from GMM modeling (in Section 3.2) to regularization (in Section 3.3) and informativeness metrics (in Section 3.5), encompassing the entire AL scheme. The core of this work is the use of GMM to better capture the dataset distribution and build a unified framework for "distribution-aware AL" that facilitates effective regularization, effective DA-based learning, and effective informativeness-based sampling. We successfully integrate GMM into both the training and selection stages, aiming to identify and mitigate the discrepancies between the labeled and unlabeled sets.

During experiments, we also showed that GMM can enhance the performance of various existing DA methods and provided related analyses in Figure 5 of the manuscript and Section A.6 of the Appendix. Moreover, as illustrated in Figure 12, each proposed component exhibits a progressive enhancement in performance. Furthermore, our compatibility analysis in Table 1 demonstrates that our components can be easily plugged into the existing methods, leading to notable performance improvements.

Importantly, our work introduces a mathematical modeling of datasets employing GMM to enhance AL performance effectively.

Comment

Thanks to the authors for their reply and comments.

I have read through the response from the author and other reviewers' comments.

I still have the fundamental concern that for active learning, due to open-set annotations, using a GMM for modelling is too restrictive, as it imposes strong assumptions on the data. It is difficult to model other types of distributions. Namely, one of my main concerns is with the fundamental mathematical model.

Moreover, regarding distribution alignment, I feel the current manuscript merely takes an existing technique and applies it to a restrictive model (GMM) while ignoring other, more common cases. The situation discussed here is a special case of a more general problem, which results in small novelty and limited scope.

Therefore, based on the restrictive mathematical modeling, the limited novelty and scope, and the overlap with previous work, I maintain my review score.

Review (Rating: 6)

This paper introduces Gaussian mixtures of the labeled and unlabeled datasets to facilitate model training, regularization, and sample selection for active learning.

Strengths

  1. The improvement in performance over baselines is significant.
  2. Well-organised and easy to follow and understand, even though there are many components in the proposed method.
  3. The idea is original, and it uses the classic GMM to facilitate model training, as well as high-quality and diverse sample selection for active learning.

Weaknesses

  1. The main concern would be the computational burden it introduces. Within each optimization iteration for a batch, there are 10 EM runs for both labelled and unlabelled samples. So it would be very time-consuming, even though some tricks are used to reduce it. Both theoretical and experimental analyses of the time complexity would therefore be expected.

  2. Some details about the loss functions are missing for a better understanding of the training. For example, there is a trade-off parameter $\alpha$ to balance the cross-entropy loss and the regularization loss. The value of this constant was not given. Besides, there is another loss for adversarial learning. How this loss is balanced with the others is not clear to me.

  3. Some intuitive explanation to demonstrate the superiority would be great. For example, within each active learning cycle, for each class, how do the selected samples from the proposed method differ from those of the baselines? Or use the selected samples to verify the claim of 'high quality and diverse'.

Questions

See the weakness points above.

Comment

We appreciate the reviewer's valuable comments on our paper. Our answers to the reviewer's comments are given below:

GMM time complexity

We appreciate the reviewer’s concern regarding the complexity issue. In our work, we compute the GMM in both the “supervised learning stage” and the “semi-supervised learning stage”. The GMM calculations impose an increased computational burden of approximately 5.66% and 15.19% in each respective stage. For detailed explanations, please refer to the information provided below together with our Algorithm in Section A.2 of the Appendix.

Supervised learning stage. We calculate the GMM parameters $\psi_L$ for the labeled set ($X_L$) once per minibatch (we do not compute a GMM for $X_{UL}$ at this stage). As we have label information, we can directly compute $\psi_s := (\pi_s, \mu_s, \sigma_s^2)$ without multiple iterations of the EM algorithm. A detailed analysis of the operational time is provided below.

  • Table 1: Analysis of the time required to process one batch in the supervised learning stage

| Is GMM-related? | Algorithm procedure | Time required |
|---|---|---|
| x | Draw a mini-batch ($B$) from the dataloader | 33.1505 msec. |
| o | Draw an i.i.d. subset ($X_s$) from the dataloader | 4.5133 msec. |
| o | Pass $X_s$ through the feature extractor $f_\theta$ to get $f_\theta(X_s)$ | 6.1559 msec. |
| o | Calculate GMM $\psi_s$ from $f_\theta(X_s)$ | 0.0470 msec. |
| x | Compute $Loss_{CE}$, $Loss_{GMM}$ using mini-batch ($B$) and GMM $\psi_s$ | 3.7744 msec. |
| x | Update model by minimizing $Loss_{CE}$, $Loss_{GMM}$ | 141.7344 msec. |
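The EM-free computation of $\psi_s$ described above can be sketched as follows; the per-class isotropic variance, the function and variable names, and the toy data are assumptions for illustration, not the authors' code.

```python
# Sketch of the EM-free computation of psi_s = (pi_s, mu_s, sigma_s^2) from a labeled
# subset, assuming one isotropic Gaussian mode per class: class frequency as the
# mixing weight, per-class mean, and a per-class scalar variance.
import numpy as np

def fit_labeled_gmm(features, labels, num_classes):
    n, d = features.shape
    pi = np.zeros(num_classes)
    mu = np.zeros((num_classes, d))
    var = np.ones(num_classes)
    for k in range(num_classes):
        feats_k = features[labels == k]
        if len(feats_k) == 0:
            continue  # class absent from this subset
        pi[k] = len(feats_k) / n
        mu[k] = feats_k.mean(axis=0)
        var[k] = feats_k.var() + 1e-6  # isotropic (scalar) variance per mode
    return pi, mu, var

# toy usage: 512-d latent features of an i.i.d. subset X_s with 10 classes
rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 512))
labels = rng.integers(0, 10, size=256)
pi_s, mu_s, var_s = fit_labeled_gmm(feats, labels, num_classes=10)
```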

Semi-supervised learning stage. This stage consists of a total of 10,000 iterations (batches), and the GMMs ($\psi_L$, $\psi_{UL}$) are computed every 500 iterations for semi-supervised learning. Similarly to the above, $\psi_L$ can be directly computed using the labels, and only $\psi_{UL}$ is computed via the iterative EM algorithm. The computation of the GMMs accounts for 15.19% of the entire time required for semi-supervised learning, and a detailed time analysis is provided below.

  • Table 2: Analysis of the time required for the semi-supervised learning stage

| Algorithm procedure | Time required |
|---|---|
| Time required for 500 iterations (including GMM calculation) | 144.9939 sec. |
| Time required for GMM calculation | 22.0256 sec. |
| Ratio of time required for GMM calculation during semi-supervised learning | 15.1906 % |

These additional delays are the costs for achieving significant improvements in AL performance. Based on the reviewer’s comments, we have now included these results in Section A.8 of the Appendix.

Theoretical analysis of GMM time complexity. Basically, the time complexity of computing a vanilla GMM is $O(NKID^2)$, where $N, K, I, D$ represent the number of data points, the number of Gaussian modes, the number of EM iterations, and the dimensionality of the data. However, we reduced the complexity to $O(NKID)$ by making the isotropic Gaussian assumption, which helps accelerate the GMM computation. Moreover, as evident from the above experimental analysis, the time required for GMM computation is not significantly demanding compared to the time spent on forward/backward propagation for the model update.
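For reference, a sketch of an E-step under the isotropic assumption is shown below; the dominant per-iteration cost is the $N \times K$ squared-distance computation, i.e. $O(NKD)$. The function name and the fully vectorized (memory-heavy) implementation are illustrative choices, not the authors' code.

```python
# Sketch of one E-step under the isotropic Gaussian assumption. The dominant cost is
# the N x K matrix of squared distances, i.e. O(NKD) per EM iteration, instead of the
# O(NKD^2) Mahalanobis terms required with full covariances. Fully vectorized here
# for clarity (memory is O(NKD)); a loop over components would use less memory.
import numpy as np
from scipy.special import logsumexp

def isotropic_e_step(X, pi, mu, var):
    # X: (N, D) features, pi: (K,) weights, mu: (K, D) means, var: (K,) scalar variances
    n, d = X.shape
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)        # (N, K)
    log_prob = -0.5 * (sq_dist / var + d * np.log(2 * np.pi * var))  # log N(x | mu_k, var_k I)
    log_weighted = np.log(pi) + log_prob
    log_resp = log_weighted - logsumexp(log_weighted, axis=1, keepdims=True)
    return np.exp(log_resp)                                          # responsibilities (N, K)
```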

Comment

Values of hyperparameters that balance multiple losses

We appreciate the constructive comments. The value of $\alpha$ is fixed to 1e-4, and details on the hyperparameters including $\alpha$ can be found in Section A.1 of the Appendix. In accordance with the reviewer’s comments, we compared the performance obtained by varying the value of $\alpha$ from 0.00001 to 0.001 and report the results below. When $\alpha$ is too large, the distributional regularization on the latent space compromises the performance of the main task. On the other hand, when $\alpha$ is too small, the effect of the regularization and the performance improvement diminish. These results are now included in Section A.9 of the Appendix.

  • Table 3: Accuracy (%) on CIFAR10

| # of labeled samples | 1K | 2K | 3K | 4K | 5K | 6K | 7K | 8K | 9K | 10K |
|---|---|---|---|---|---|---|---|---|---|---|
| $\alpha=0.001$ | 59.24 | 73.96 | 77.81 | 82.81 | 85.22 | 85.88 | 87.60 | 88.13 | 88.96 | 89.53 |
| $\alpha=0.0005$ | 59.21 | 74.07 | 78.98 | 82.73 | 85.40 | 86.67 | 87.57 | 89.04 | 89.57 | 90.19 |
| $\alpha=0.0001$ | 60.64 | 74.15 | 79.21 | 83.42 | 85.63 | 86.75 | 88.09 | 89.39 | 90.01 | 90.87 |
| $\alpha=0.00005$ | 58.98 | 73.63 | 77.45 | 82.16 | 84.76 | 85.88 | 87.43 | 88.95 | 89.63 | 90.38 |
| $\alpha=0.00001$ | 59.58 | 73.41 | 76.52 | 81.78 | 84.25 | 86.22 | 87.32 | 88.83 | 89.44 | 90.01 |

As for the loss functions during adversarial training, we use two losses: the cross-entropy loss $L_{CE}$ and the discriminator loss $D_\phi(x_{ul}) - D_\phi(x_l)$. Without a special constant weight to balance them, we add the two losses in a 1:1 ratio and conduct the adversarial semi-supervised learning. We have made this clearer in Section A.9 of the Appendix.

Intuitive explanation of the superiority of our method over other baselines.

In response to the reviewer’s comment, in Fig. 20 of Section A.10 in the Appendix, we illustrate the candidate sets selected by various AL methods under imbalanced CIFAR10. From the t-SNE visualization of the candidate sets, we can intuitively compare the properties of the various AL methods. For instance, Snapshot_AL [1], which calculates uncertainty using the output of the classifier, exhibits redundancy by selecting multiple samples from narrow regions. Coreset [2], which considers only diversity, can select diverse samples across broad areas but lacks the ability to target particularly informative regions. CoreGCN [3], which favors samples expected to come from the unlabeled set, is observed to select samples from dense areas of $X_L$, as well as outlier samples that hardly represent $X_{UL}$. In contrast, the candidate set from the proposed method demonstrates the desired properties: 1) it is diverse enough to encompass a broad area, while 2) it represents the distribution of $X_{UL}$ well. Our method compares the GMMs of $X_L$ and $X_{UL}$ to select samples that are unfamiliar in $X_L$ but representative of $X_{UL}$; especially in the imbalanced setting, this comparison leads to selecting minor classes that are rare in $X_L$, thereby contributing to the significant performance gain seen in Fig. 10 of the main paper.

Additionally, we reported the class count entropy of $X_L$ under an imbalanced setting in Figure 11 (b) of the original manuscript (i.e., higher entropy corresponds to a more diverse $X_L$). The results show that the proposed method actively selects minor classes, exhibiting the highest entropy.

[1] Jung et al., “A Simple Yet Powerful Deep Active Learning With Snapshots Ensembles”, ICLR 2023.
[2] Sener et al., “Active Learning for Convolutional Neural Networks: A Core-Set Approach”, ICLR 2018.
[3] Caramalau et al., “Sequential graph convolutional network for active learning”, CVPR 2021.

We again appreciate the reviewer for providing helpful comments. We would be grateful if you could let us know if there are any other comments/questions.

Comment

I have carefully examined the response from the authors, and most of my concerns have been addressed.

Review (Rating: 8)

This paper introduces a new active learning strategy that addresses the issue of mismatched distributions between labeled and unlabeled samples by incorporating a Gaussian Mixture Model (GMM). The strategy combines various informativeness metrics for sample selection. Tests on different datasets show that this GMM-based method performs better than existing approaches and can be combined with other active learning methods to enhance results further.

Strengths

This paper introduces a new active learning strategy by incorporating a Gaussian Mixture Model (GMM). This strategy aims to address the issue of mismatched distributions between labeled and unlabeled samples.

This GMM-based method performs better than existing approaches on several datasets.

Weaknesses

If I understand correctly, in Section 3.2, two GMMs are fitted for the labeled and unlabeled data, respectively. Is this correct? Equation (2) provides a means to compare two Gaussian distributions, but it is not immediately clear how this extends to the comparison of two GMMs, and whether the weight of each Gaussian component in the mix is considered in this comparison.

Regarding Equation (3), the purpose of the optimization seems to be to minimize the disparity between the unlabeled data, $X_{UL}$, and the labeled data, $X_L$, within the embedding space. If the model is trained effectively, it appears that $X_{UL}$ and $X_L$ would become indistinguishable in the embedding space. It raises the question of whether this is the intended outcome, as it would seem important for the embedding space to retain the information that differentiates $X_{UL}$ from $X_L$.

The combination method outlined in Equation (5) is not entirely convincing as the best approach. An alternative sampling strategy might involve drawing from each of the three separate rankings and then using the highest-ranked samples from each ranking as the data set for annotation. It would also be insightful to see the results of an ablation study where each component of Equation (5) is used independently as the final selection criterion.

In Equation (5), it seems that only $I_{Ent}$ leverages the labels from previous iterations. The other two components are influenced by $X_{UL}$ and $X_L$ but are label-independent. It may be beneficial for the authors to make this point clearer in their writing.

Questions

  1. Are two GMMs fitted for the labeled and unlabeled data, respectively?
  2. How does Equation (2) extend to the comparison of two GMMs?
  3. What is the rationale behind Equation (3)?
  4. Why is Equation (5) a good way to combine the information?
  5. In Equation (5), is $I_{Ent}$ the only component that leverages the labels from previous iterations, while the other components are label-independent?
Comment

Ablation study on Equation (5)

We thank the reviewer for the insightful comments. As can be seen in Fig. 8 (a) and (c) of the main text, two of the metrics ($I_{Like}$, $I_{Ent}$) exhibit strong redundancy in their candidate sets. Therefore, separately considering each metric ($I_{Like}$, $I_{Dis}$, $I_{Ent}$) can lead to redundant and overlapping selections. However, when considering the combined indicator as in our method ($I_{Total}$), one can select diverse yet informative samples, as shown in Fig. 8 (d). In response to the reviewer’s comment, we conducted an additional experiment in which data selection is performed based on separate rankings for each informativeness score ($I_{Like}$, $I_{Dis}$, $I_{Ent}$). The table below confirms that our integrated indicator achieves higher performance than using the independent indicators.

  • Table 1: Comparison with the sampling strategy based on separate rankings for each informativeness score on CIFAR10

| # of labeled samples | 1K | 2K | 3K | 4K | 5K | 6K | 7K | 8K | 9K | 10K |
|---|---|---|---|---|---|---|---|---|---|---|
| Random | 51.76 | 63.50 | 72.17 | 76.94 | 80.56 | 82.32 | 84.17 | 85.00 | 86.11 | 87.27 |
| Separate rankings for ($I_{Like}$, $I_{Dis}$, $I_{Ent}$) | 58.98 | 71.79 | 78.79 | 82.68 | 84.96 | 86.31 | 87.69 | 88.63 | 89.37 | 90.08 |
| DAAL | 58.98 | 74.15 | 79.21 | 83.42 | 85.63 | 86.75 | 88.09 | 89.39 | 90.01 | 90.87 |

Does $I_{Ent}$ leverage label information?

We would like to clarify that after training is finished, we measure the three informativeness metrics ($I_{Like}$, $I_{Dis}$, $I_{Ent}$) on the unlabeled set $X_{UL}$ to select unlabeled samples with high informativeness. Specifically, we only utilize the label information of the labeled set $X_L$ when training the model in the supervised and semi-supervised learning stages. In the sample selection stage, none of the three informativeness metrics utilizes label information: the goal is to select the best samples from the "unlabeled set" to be added to the labeled set. Regarding $I_{Ent}$, we compute the Shannon entropy as $I_{Ent}(x_{ul}) = -\sum_{k=1}^{K} p(y=k|x_{ul}) \log p(y=k|x_{ul})$, where $p(y=k|x_{ul})$ is the classifier output (not label information) after the softmax function for class $k$ given the unlabeled sample $x_{ul}$. As for the remaining metrics $I_{Like}$ and $I_{Dis}$, the two GMMs ($\psi_L$, $\psi_{UL}$) for $X_L$ and $X_{UL}$ are utilized to compute $I_{Like}$, and the discriminator trained on $X_L$ and $X_{UL}$ is utilized to compute $I_{Dis}$. Based on the reviewer’s comment, we have made these points clearer in Section 3.5 of the revised manuscript.
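A minimal sketch of this entropy computation on classifier softmax outputs is shown below; the toy probabilities are illustrative, not numbers from the paper.

```python
# Minimal sketch of the Shannon-entropy informativeness computed from classifier
# softmax outputs on unlabeled samples; no label information is used.
import numpy as np

def entropy_informativeness(probs, eps=1e-12):
    # probs: (N, K) softmax outputs p(y=k | x_ul) for N unlabeled samples, K classes
    return -(probs * np.log(probs + eps)).sum(axis=1)

probs = np.array([
    [0.96, 0.02, 0.02],   # confident prediction -> low I_Ent
    [0.34, 0.33, 0.33],   # uncertain prediction -> high I_Ent
])
print(entropy_informativeness(probs))  # approximately [0.20, 1.10]
```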

Again, we thank the reviewer for their time and positive stance. Your comments made us think carefully, and we feel we have successfully addressed and clarified all the points you raised. Please let us know in case there are remaining questions/concerns; we hope to have an opportunity to further answer them.

Comment

We appreciate the reviewer for taking a positive stance and providing helpful comments on unclear aspects and constructive suggestions. Below, we tried to carefully address the concerns raised by the reviewer.

Are two GMMs fitted for $X_L$ and $X_{UL}$, respectively?

As the reviewer correctly pointed out, we need to obtain GMM parameters ($\psi := (\pi, \mu, \sigma^2)$) in the latent space for the labeled set ($X_L$) and the unlabeled set ($X_{UL}$), respectively. For $X_{UL}$, a GMM $\psi_{UL}$ is fitted via the EM algorithm. However, for $X_L$, since the label information is given, the GMM parameters $\psi_L$ for each mode (i.e., class) can be efficiently obtained by directly computing the mean and variance of each class without performing the iterative EM algorithm.

Detailed explanation regarding Equation (2).

As mentioned above, since label information is available when obtaining the GMM $\psi_L$, each Gaussian mode of the fitted GMM can be matched to a category (class) in the labeled data. However, when a GMM $\psi_{UL}$ is fitted to the unlabeled samples via the EM algorithm, the order of the clusters can be scrambled since label information is unknown for the unlabeled samples. In this context, the Bhattacharyya distance in Eq. (2) is utilized to align each mode in the unlabeled GMM $\psi_{UL}$ with the corresponding category (mode) in the labeled GMM $\psi_L$, by solving a Hungarian matching problem. Note that the intention behind using Eq. (2) is not to compare or calculate the distance between the GMMs as wholes, but rather to compute the distance between two Gaussian modes, one from each GMM for $X_L$ and $X_{UL}$, for alignment. Therefore, the mixing coefficient ($\pi_k$) does not play a meaningful role in this context. Furthermore, the mixing coefficient ($\pi_k$) in a GMM reflects the occurrence frequency of samples for each category. Therefore, in real-world scenarios such as imbalanced settings where the class distributions of the labeled and unlabeled sets differ, considering this coefficient ($\pi_k$) when aligning the Gaussian modes from the two different GMMs can hamper correct matching. For this reason, we solve the matching problem through Equation (2) without considering the coefficient ($\pi_k$).
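A minimal sketch of this mode-matching step is given below, assuming isotropic Gaussian modes (one scalar variance each) and using `scipy.optimize.linear_sum_assignment` for the Hungarian matching; the exact distance form used in Eq. (2) of the paper may differ.

```python
# Sketch of the mode-alignment step: pairwise Bhattacharyya distances between the
# labeled and unlabeled Gaussian modes, followed by Hungarian matching.
# Isotropic modes (scalar variance each) are assumed for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment

def bhattacharyya_isotropic(mu1, var1, mu2, var2):
    d = mu1.shape[0]
    avg_var = 0.5 * (var1 + var2)
    term_mean = np.sum((mu1 - mu2) ** 2) / (8.0 * avg_var)
    term_var = 0.5 * d * np.log(avg_var / np.sqrt(var1 * var2))
    return term_mean + term_var

def match_modes(mu_L, var_L, mu_UL, var_UL):
    k = mu_L.shape[0]
    cost = np.array([[bhattacharyya_isotropic(mu_L[i], var_L[i], mu_UL[j], var_UL[j])
                      for j in range(k)] for i in range(k)])
    rows, cols = linear_sum_assignment(cost)   # Hungarian matching on the cost matrix
    return dict(zip(rows, cols))               # labeled mode i -> unlabeled mode cols[i]

# toy usage: 3 modes in a 4-dimensional latent space
rng = np.random.default_rng(0)
mu_L, mu_UL = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(match_modes(mu_L, np.ones(3), mu_UL, np.ones(3)))
```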

Rationale behind Equation (3)

As the reviewer mentioned, the goal of Equation (3) is to perform adversarial training based on GMM information to reduce the distribution discrepancy between $X_L$ and $X_{UL}$ in the latent space, and this offers important advantages in AL. This is because models are susceptible to overfitting due to the limited size of $X_L$ in the early stages of AL (the distinct difference between the distributions of $X_L$ and $X_{UL}$ can be seen in Fig. 1 of the main paper). Considering the nature of AL, we believe it is almost impossible for Equation (3) to align $X_L$ with $X_{UL}$ to the point of indistinguishability in the embedding space during the challenging early AL cycles. As pointed out by the reviewer, there might be a moment when Equation (3) can align $X_L$ and $X_{UL}$ once the size of $X_L$ grows large enough. However, this would imply that training the model using only $X_L$ can achieve performance similar to using $X_L$ + $X_{UL}$, obviating the need to apply active learning.

AC Meta-Review

The paper introduces a novel active learning strategy using Gaussian Mixture Models (GMM) to address the distributional discrepancies between labeled and unlabeled datasets. This approach aims to reduce overfitting and enhance the diversity of the feature representation by employing a combination of regularization, distributional alignment, and a hybrid sample selection strategy.

The reviewers acknowledged the novel integration of GMM in AL, the extensive experiments across various datasets, and the substantial theoretical contributions. While the majority of the reviewers are positive, there is a concern regarding the complexity and the assumptions about the GMM's ability to model data distributions in AL, particularly in scenarios involving new classes or outliers. This issue unfortunately was not fully resolved even after the discussion phase. The authors are encouraged to consider these points in future work, e.g., to adjust the positioning of the paper accordingly to further clarify and enhance the applicability of their work.

Why not a higher score

The paper does not receive a higher score primarily due to the concerns about its foundational assumptions and the novelty of its contributions. The reliance on GMM to model data distributions in active learning scenarios, especially those with new or outlier classes, is seen as a potential limitation. This modeling choice raises questions about the applicability and generality of the proposed method in diverse real-world scenarios.

Despite the novel integration of GMM, the core concept of distribution alignment has been previously explored in the literature, which limits the perceived novelty of this work. These aspects suggest that while the paper is a strong contribution to AL, it may not be sufficiently novel to warrant a higher score.

Why not a lower score

The novel use of GMM in this context and the development of new regularization and sample selection strategies are well recognized by the reviewers. The extensive experiments and results showcase significant performance improvements over existing methods. The authors' engagement in the review discussions, particularly in addressing questions about the runtime complexity and providing additional empirical evidence, further supports the paper's standing.

Final Decision

Reject