Non-Stationary Predictions May Be More Informative: Exploring Pseudo-Labels with a Two-Phase Pattern of Training Dynamics
In contrast to existing pseudo-labeling methods, this paper explores a new type of pseudo-labels --- predicted labels that do not exhibit high confidence scores or high training stationarity.
Abstract
Reviews and Discussion
The paper proposes a novel and interesting 2-phasic metric for two-phase pseudo-label learning. The 2-phasic metric characterizes the two-phase pattern through both spatial and temporal measures. Extensive experimental results show the effectiveness of the proposed 2-phasic metric, especially when the number of labeled samples is very small.
update after rebuttal
Thanks for the rebuttal! Most of my concerns have been addressed. I keep my score unchanged.
Questions For Authors
In the equation at Line 241, there is a typo in the average change of other categories.
Claims And Evidence
The proposed 2-phasic metric is novel for pseudo-label learning.
The motivation should be explained in further detail. The authors claim that the two-phase samples are important and carry more information gain, but they do not explain why the two-phase samples are important.
The authors give many observations to demonstrate the importance of the two-phase samples and the training dynamics used to design the spatial and temporal measures. I am curious whether these observations generalize to all kinds of datasets. Moreover, more theoretical analysis behind the observations should be presented, which would make the paper more solid.
Methods And Evaluation Criteria
The authors conduct extensive experiments to validate the effectiveness of the proposed method. The experimental design is comprehensive.
Theoretical Claims
The theoretical claims are reasonable. However, more theoretical analysis behind the observations in this paper should be presented.
Experimental Design And Analyses
The authors conduct comprehensive and solid experiments from multiple perspectives, including booster test, complementary analysis, ablation study, and hyperparameter analysis. The experimental results reveal the effectiveness of the proposed method and each component.
Supplementary Material
I have reviewed all parts of the supplementary material.
Relation To Broader Scientific Literature
The paper addresses the pseudo-label learning problem from a new perspective. In contrast to existing works that focus on samples exhibiting high confidence scores and high training stationarity, this work exploits the potential of the two-phase samples. Moreover, the paper designs a 2-phasic metric to identify the two-phase samples with both spatial and temporal measures.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
Strengths:
- The paper is very well written and easy to understand.
- The proposed 2-phasic metric is novel.
- The experiment design is comprehensive and the experimental results show the effectiveness of the proposed method.
Weaknesses:
- More theoretical analysis behind the observations should be presented, which would make the paper more convincing.
- The additional overhead of computing the spatial and temporal measures makes the proposed method less efficient, especially when the dataset is large. The authors could analyze the computational efficiency of the method experimentally.
- The authors could perform an ablation study on the design of the 2-phasic metric and analyze the effectiveness of the spatial measure and the temporal measure.
Other Comments Or Suggestions
N/A
Thank you very much for your constructive comments! We have carefully studied them and revised the paper accordingly.
Weakness 1: More theoretical analysis behind the observations should be presented, which could make the paper more convincing.
Response: We appreciate your insightful feedback. Due to limited time during the rebuttal, we would like to propose a preliminary theoretical idea as a potential direction for addressing this concern.
Our idea is to use the local elasticity hypothesis to prove that adding 2-phase labels can promote feature separability. Local elasticity is the core concept introduced in [1], and it is defined as follows: when the model is updated at a sample $x$ via SGD, the prediction change at another sample $x'$ is positively correlated with the similarity between $x$ and $x'$. Specifically, local elasticity describes the phenomenon that if $x$ and $x'$ belong to the same class (are similar), the prediction change is significant; if they belong to different classes (are dissimilar), the change is small.
Reference [2] utilizes the local elasticity assumption to derive conditions for feature separability. This study describes the temporal evolution of the features of two classes of samples in a binary classification context. Furthermore, Theorem 2.1 in [2] indicates that:
Given the feature vectors of the two classes, as the number of samples and the training time grow large,
- if the intra-class effect is larger than the inter-class effect, they are asymptotically separable with probability tending to one,
- if the intra-class effect is smaller than the inter-class effect, they are asymptotically separable with probability tending to zero.
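Written compactly (the symbols here are our own shorthand: $\alpha$ for the intra-class effect and $\beta$ for the inter-class effect; [2] may use different notation), the dichotomy in Theorem 2.1 reads:

```latex
\Pr\big[\text{the two classes are asymptotically separable}\big] \longrightarrow
\begin{cases}
1, & \text{if } \alpha > \beta \quad \text{(intra-class effect dominates)},\\[2pt]
0, & \text{if } \alpha < \beta \quad \text{(inter-class effect dominates)},
\end{cases}
```

as the number of samples and the training time grow large.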
Our 2-phasic metric is designed to identify samples that exhibit a two-stage characteristic in their training dynamics. As illustrated in Figure 4, the 2-phasic samples tend to lie closer to the decision boundary. Intuitively, incorporating these samples may help reduce the inter-class effect, thereby promoting stricter feature separability. In the future, we plan to study the local elasticity of 2-phase samples and estimate the intra-class and inter-class effects before and after adding 2-phase samples. This analysis will allow us to verify our hypothesis and provide a theoretical foundation for the effectiveness of our method.
[1] He H, Su W J. The local elasticity of neural networks. In International Conference on Learning Representations, 2020.
[2] Zhang J, Wang H, Su W. Imitating deep learning dynamics via locally elastic stochastic differential equations. Advances in Neural Information Processing Systems, 2021.
Weakness 2: The additional overhead of computing the spatial and temporal measures makes the proposed method less efficient, especially when the dataset is large. The authors could analyze the computational efficiency of the method experimentally.
Response: Thank you for pointing out this concern. We have conducted an empirical analysis of the computational efficiency of our method. Specifically, on the Cora dataset using GCN as the backbone, the measured time costs are as follows:
- Training dynamics recording time: 0.11625 seconds
- Two-phasic metric computation time: 0.00253 seconds
- Total training time: 2.24 seconds
These results indicate that the combined overhead of recording training dynamics and computing the two-phasic metric accounts for approximately 5% of the total training time. Therefore, we consider this additional computational cost to be acceptable in practice.
Weakness 3: The authors could perform an ablation study on the design of the 2-phasic metric and analyze the effectiveness of the spatial measure and the temporal measure.
Response: Thank you for the valuable suggestion. In response, we conducted an ablation study using GCN as the backbone on the Cora dataset to evaluate the contributions of the temporal and spatial characteristics in the 2-phasic metric. The experimental results are summarized in Table A. The results demonstrate that both the temporal and spatial components contribute positively to the overall performance, with the temporal features having a more significant impact. We will include the results and corresponding discussion in the final revised version of the paper.
Table A: Ablation study on the temporal and spatial characteristics of the two-phasic metric
| L/C | 3 | 5 | 10 |
|---|---|---|---|
| Temporal only | 69.76 | 73.36 | 75.99 |
| Spatial only | 68.24 | 72.70 | 73.87 |
| 2-phasic metric | 70.27 | 73.40 | 76.24 |
Other Comments Or Suggestions: In the equation at Line 241, there is a typo in the average change of other categories.
Response: Thank you for pointing out the error. We have corrected the typo in the average change of other categories in the formula at Line 241.
This paper discovers a new type of predicted labels suitable for pseudo-labeling, termed two-phase labels, which exhibit a two-phase pattern during training and are informative for decision boundaries. This finding is different from existing methods which typically select predicted labels with high confidence scores and high training stationarity as pseudo-labels to augment training sets, and thus offers new insights for pseudo-labeling. Besides, this paper proposes a 2-phasic metric to mine the two-phase labels, and a loss function tailored for two-phase pseudo-labeling learning, allowing models not only to learn correct correlations but also to eliminate false ones.
Questions For Authors
- In pseudo-labeling tasks, what metrics have been proposed by existing works to identify effective samples?
- How do you construct an effective memory bank to support the calculation of the proposed metric?
Claims And Evidence
The authors claim that they discover a new type of predicted labels suitable for pseudo-labeling. First, they analyze the rationale of two-phase labels for pseudo-labeling from different perspectives. Second, extensive experiments were conducted on eight benchmark datasets, including image and graph datasets, and the results show that using two-phase labels can significantly improve the performance of existing pseudo-labeling methods. Specifically, the average classification accuracy on image datasets and graph datasets increased by 1.73% and 1.92%, respectively.
Methods And Evaluation Criteria
The proposed methods make sense for this task. Experiments are performed on CIFAR-100, EuroSAT, STL-10, Semi-Aves, Cora, Citeseer, PubMed, and AmazonComputers. The evaluation criteria include classification accuracy, correctness, information gain, and the overlap of the two pseudo-label sets, which are commonly used metrics.
Theoretical Claims
The authors provide a theoretical rationale for the two-phase labels, and I find no apparent issues with their analysis.
Experimental Design And Analyses
Experiments are performed on the dataset of CIFAR100, EuroSAT, STL-10, Semi-Aves, Cora, Citeseer, PubMed, and AmazonComputers, and clearly verify the effectiveness of the proposed methods.
Supplementary Material
I reviewed the supplementary material which helped me understand the methodology.
Relation To Broader Scientific Literature
The newly identified type of predicted labels, characterized by a two-phase pattern during training, is not only well-suited for pseudo-labeling and highly informative for decision boundaries, but also provides novel insights for other related learning tasks.
Essential References Not Discussed
Key related works are all discussed.
Other Strengths And Weaknesses
Strengths: This is a well-written and interesting paper. It introduces novel insights into two-phase labels for pseudo-labeling, proposes a two-phase metric to mine these labels, and presents a loss function specifically designed for two-phase pseudo-labeling learning. These new insights have the potential to make a significant impact on the community.
Weaknesses: The method introduces hyperparameters that must be carefully selected to ensure the effectiveness of the proposed approach.
Other Comments Or Suggestions
A more detailed discussion on the theoretical guarantees of the proposed metric would enhance its reliability for practical use.
Thank you very much for your thorough and constructive feedback! Below, we present our responses to each of your concerns and questions.
Weaknesses: The method introduces hyperparameters that must be carefully selected to ensure the effectiveness of the proposed approach.
Response: We agree that the sensitivity of the hyperparameters in the 2-phasic metric is crucial for its practical applicability. As shown in Figure 6, although the model’s performance exhibits fluctuations with certain hyperparameter variations, the overall performance remains relatively stable. This indicates that some hyperparameters have low sensitivity and that there may be correlations among them. Therefore, in future work, we plan to reduce the number of hyperparameters by employing hyperparameter fusion strategies, aiming to simplify the deployment of the 2-phasic metric while retaining its performance advantages.
Other Comments Or Suggestions: A more detailed discussion on the theoretical guarantees of the proposed metric would enhance its reliability for practical use.
Response: We appreciate your insightful feedback. Due to limited time during the rebuttal, we propose a preliminary theoretical idea as a potential direction for addressing this concern.
Our idea is to use the local elasticity hypothesis to prove that adding 2-phase labels can promote feature separability. Local elasticity [1] describes the following phenomenon: when the model is updated at a sample $x$ via SGD, the prediction change at another sample $x'$ is positively correlated with the similarity between $x$ and $x'$.
Reference [2] utilizes the local elasticity assumption to derive conditions for feature separability. This study describes the temporal evolution of the features of two classes of samples. Furthermore, Theorem 2.1 in [2] indicates that:
Given the feature vectors of the two classes, as the number of samples and the training time grow large,
- if the intra-class effect is larger than the inter-class effect, they are asymptotically separable with probability tending to one,
- if the intra-class effect is smaller than the inter-class effect, they are asymptotically separable with probability tending to zero.
As illustrated in Figure 4, our 2-phasic samples tend to lie closer to the decision boundary. Intuitively, incorporating these samples may help reduce the inter-class effect, thereby promoting stricter feature separability. In the future, we plan to study the local elasticity of 2-phase samples and estimate the intra-class and inter-class effects before and after adding 2-phase samples. This analysis will allow us to verify our hypothesis and provide a theoretical foundation for the effectiveness of our method.
[1] The local elasticity of neural networks. ICLR, 2020.
[2] Imitating deep learning dynamics via locally elastic stochastic differential equations. NeurIPS, 2021.
Q1: In pseudo-labeling tasks, what metrics have been proposed by existing works to identify effective samples?
Response: Pseudo-label selection metrics mainly fall into two categories: confidence-based and uncertainty-based. Confidence-based methods typically use the maximum softmax score as the selection criterion [1]. Enhancements include FlexMatch [2], which applies dynamic class-wise thresholds for more flexible filtering, and SoftMatch [3], which weights samples by confidence to balance pseudo-label quality and quantity.
Uncertainty-based metrics prioritize samples with low prediction uncertainty. A representative approach is Monte Carlo dropout [4], which estimates uncertainty via the variance of predictions under multiple dropout runs. Another method leverages training stationarity (or time consistency), selecting labels with stable predictions over time [5]. Additionally, recent work proposes learning a metric via a neural network that models the relationship between label embeddings and feature embeddings to assess pseudo-label quality [6].
[1] Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. AAAI, 2021.
[2] Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. NeurIPS, 2021.
[3] Softmatch: Addressing the quantity-quality tradeoff in semi-supervised learning. ICLR, 2023.
[4] Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. ICML, 2016.
[5] Temporal self-ensembling teacher for semi-supervised object detection. IEEE Transactions on Multimedia, 2021.
[6] Semireward: A general reward model for semi-supervised learning. ICLR, 2024.
Q2: How do you construct an effective memory bank to support the calculation of the proposed metric?
Response: In our experiments, we adopted a uniform sampling strategy to construct the memory bank. Specifically, model predictions were sampled and stored at fixed training steps. With only 50 recorded snapshots of training dynamics, we achieved satisfactory accuracy on the image datasets.
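The uniform sampling strategy can be sketched as follows (a minimal illustration with names of our own choosing, not the authors' code): predictions for the unlabeled samples are stored only at fixed step intervals, so at most |T| snapshots (e.g., 50) are ever kept.

```python
import numpy as np

class MemoryBank:
    """Memory bank that records model predictions at fixed training steps
    (uniform sampling), capped at a fixed number of snapshots."""

    def __init__(self, max_snapshots=50, record_every=10):
        self.max_snapshots = max_snapshots
        self.record_every = record_every
        self.snapshots = []  # each entry: an (N, C) prediction matrix

    def maybe_record(self, step, predictions):
        """Store predictions only every `record_every` steps, up to the cap."""
        if step % self.record_every == 0 and len(self.snapshots) < self.max_snapshots:
            self.snapshots.append(np.asarray(predictions).copy())

    def dynamics(self):
        """Return the recorded training dynamics as a (|T|, N, C) array."""
        return np.stack(self.snapshots, axis=0)
```

Storing only these snapshots keeps the bank at O(N·C·|T|) floats, or O(N·|T|) if only the argmax labels are retained.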
This paper investigates the potential of a novel type of pseudo-labels—two-phase labels—in semi-supervised learning. Unlike conventional methods that rely on high-confidence and stable predictions, two-phase labels exhibit relatively low correctness and demonstrate a unique two-phase training dynamic: they initially predict one category in the early training stages and switch to another in later epochs. The authors demonstrate that these labels are highly informative for decision boundaries. To effectively identify such labels, they propose a 2-phasic metric for their quantitative characterization. Furthermore, a loss function is designed to enable models to learn correct correlations while simultaneously eliminating false ones. Extensive experiments are conducted to validate the effectiveness of incorporating two-phase labels.
Questions For Authors
Please see Other Strengths And Weaknesses and Other Comments Or Suggestions.
Claims And Evidence
The claims made in the submission are indeed supported by clear and convincing evidence. First, the authors analyze the rationale of why two-phase labels are effective from both perspectives of pattern learning and decision boundaries. Furthermore, extensive experiments demonstrate that two-phase labels serve as an effective booster for existing pseudo-labeling methods. These experiments cover a wide range of datasets and various state-of-the-art (SOTA) baseline pseudo-labeling methods. Besides, the authors demonstrate that two-phase labels are high-quality pseudo-labels, which are often overlooked by existing pseudo-labeling methods.
Methods And Evaluation Criteria
Methods and evaluation criteria are well-suited for the problem and application at hand, including eight image and graph datasets (e.g., Cora and CIFAR-100), the baseline pseudo-labeling methods for comparison (e.g., Confidence and the SOTA SoftMatch), and commonly used evaluation metrics (such as pseudo-labeling accuracy and IoU).
Theoretical Claims
This paper does not propose explicit theoretical claims. The in-depth analysis of the rationale provides valuable insights, helping readers better understand why two-phase labels are well-suited for pseudo-labeling tasks.
Experimental Design And Analyses
The experimental setup and analyses appear to be well-structured and appropriate for assessing the claims made, including the booster test (Table 1), complementary analysis (Table 2), ablation study (Table 3), and parameter sensitivity (Figure 6).
Supplementary Material
I reviewed all of the supplementary material.
Relation To Broader Scientific Literature
This paper focuses on the research problem of pseudo-label selection in semi-supervised learning. Unlike previous works that primarily emphasize the role of high-confidence, high-training-stationarity labels (Type I labels in Figure 1) in pseudo-labeling, this study explores the potential of a novel type of predicted labels (i.e., Type II labels in Figure 1), which exhibit relatively high accuracy but low training stationarity. This exploration provides new and interesting insights. Furthermore, this paper identifies a subset within Type II labels, termed two-phase labels, which not only provide significant information gain but also mitigate the risk of misclassification.
Essential References Not Discussed
The key contribution is a novel pseudo-label selection method. The literature discussed by the authors is comprehensive and highly relevant to the topic.
Other Strengths And Weaknesses
Strengths:
1. This paper is well-written and highly engaging.
2. The paper conducts an insightful exploration, investigating the potential of a type of predicted labels in pseudo-labeling—specifically, labels that exhibit relatively high accuracy but low training stability, which have often been overlooked in previous works. This offers valuable inspiration for future research.
3. This paper uncovers two-phase labels. Extensive experiments demonstrate that incorporating two-phase labels can significantly improve the accuracy of existing pseudo-labeling algorithms (1.73% on image datasets and 1.92% on graph datasets), indicating that two-phase labels can serve as a valuable complement to pseudo-labels provided by existing methods.
4. The authors analyze the rationale behind the effectiveness of two-phase labels from the perspectives of both pattern learning and decision boundaries.
Weaknesses:
1. In the Booster test experiments, the authors adopted a three-stage protocol: in the early stage, pseudo-labels were generated using the baseline method, while in the later stage, pseudo-labels were generated using both the baseline method and the proposed method for comparison. This design is not common in general pseudo-labeling approaches. However, the authors did not clearly specify how the training epochs were divided into early and later stages, nor did they analyze the impact of this division on the experimental results. Therefore, the authors should provide a detailed analysis of this hyperparameter setting to validate its robustness.
2. As shown in Figure 4, on the Cora dataset, two-phase labels are significantly closer to the decision boundaries than high-confidence labels. However, this phenomenon is not as evident on the CIFAR-100 image dataset. The authors should further analyze and clarify the reasons behind this discrepancy, such as whether factors like data distribution and model complexity play a critical role.
Other Comments Or Suggestions
The SoftMatch method achieves a balance between the quality and quantity of pseudo-labeled samples by weighting them according to their confidence, leading to significant performance improvements in pseudo-labeling tasks. Could this idea also be applied to the selection of two-phase labels to further enhance their effectiveness?
Thank you very much for your thorough and constructive feedback! Below, we present our responses to each of your concerns and questions.
Weakness 1: In the Booster test experiments, the authors adopted a three-stage protocol. This design is not commonly seen in general pseudo-labeling approaches. The authors should provide a detailed analysis of this hyper-parameter setting to validate its robustness.
Response: The three-stage training protocol was designed to validate whether the proposed two-phase labels can effectively enhance existing pseudo-labeling methods. It is worth emphasizing that the primary contribution of this paper lies in the identification of the two-phase labels, which serve as a complementary enhancement to baseline pseudo-labels. During Stage 2, we use the baseline pseudo-labeling method while recording training dynamics. These recorded dynamics are subsequently used to calculate the 2-phasic metric that identifies pseudo-labels in Stage 3.
We determine the length of Stage 2 based on two principles:
- In graph datasets, the transition from Stage 2 to Stage 3 is determined by model convergence criteria. Specifically, Stage 2 concludes when additional training iterations using the baseline pseudo-labeling method no longer yield performance improvements, at which point Stage 3 is initiated.
- In image datasets, a fixed schedule is adopted, where Stage 3 begins after recording four epochs of training dynamics in Stage 2, followed by six training epochs in Stage 3.
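For graph datasets, the convergence-based transition can be sketched as a simple patience rule (our illustration; the function name, patience value, and threshold are assumptions, not the authors' exact criterion):

```python
def stage2_finished(val_acc_history, patience=5, min_delta=1e-4):
    """Return True when the baseline pseudo-labeling stage has converged,
    i.e., the best validation accuracy has not improved by more than
    `min_delta` over the last `patience` evaluations."""
    if len(val_acc_history) <= patience:
        return False
    best_before = max(val_acc_history[:-patience])
    recent_best = max(val_acc_history[-patience:])
    return recent_best <= best_before + min_delta
```

Once this check fires, Stage 2 ends and Stage 3 (adding two-phase pseudo-labels) begins.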
Weakness 2: As shown in Figure 4, on the Cora dataset, two-phase labels are significantly closer to the decision boundaries compared to high-confidence labels. However, this phenomenon is not as evident in the CIFAR-100 image dataset. The authors should further analyze and clarify the reasons behind this discrepancy.
Response: Thank you for highlighting this critical observation. We find that this phenomenon is primarily attributed to the use of pre-trained models in image datasets. Specifically, pre-trained models learn diverse feature representations from large-scale pre-training datasets, which may cause samples from the same class to be scattered across multiple clusters in the latent space. This dispersion effect reduces the observable distinction between high-confidence pseudo-labels and the decision boundaries.
To validate this point, we conducted an additional controlled experiment using a ResNet-18 model trained on CIFAR-100, replacing the original pre-trained Vision Transformer (ViT) backbone. This modification eliminates the representational biases introduced by pre-training. Our empirical results demonstrate that, under this setting, high-confidence pseudo-labels exhibit clearer separation from the decision boundaries. We will include these findings and their analysis in the camera-ready version.
Other Comments Or Suggestions: The SoftMatch method achieves a balance between the quality and quantity of pseudo-labeled samples by weighting them according to their confidence, leading to significant performance improvements in pseudo-labeling tasks. Could this idea also be applied to the selection of two-phase labels to further enhance their effectiveness?
Response: Thank you for this insightful suggestion. The SoftMatch method effectively balances pseudo-label quality and quantity by weighting samples based on their confidence scores. Inspired by this motivation, we adopt a similar strategy in our approach.
Specifically, we first normalize the 2-phasic metric to the $[0,1]$ range, obtaining a score $\phi(\mathbf{p})$ for each sample. We then calculate the mean $\hat{\mu}$ and variance $\hat{\sigma}^2$ of $\phi(\mathbf{p})$ and assign loss weights to samples using the formula in SoftMatch:

$$\lambda(\mathbf{p})=\begin{cases}\phi(\mathbf{p})\exp\left(-\dfrac{(\phi(\mathbf{p})-\hat{\mu})^2}{2\hat{\sigma}^2}\right), & \text{if } \phi(\mathbf{p})<\hat{\mu},\\[4pt] \phi(\mathbf{p}), & \text{otherwise}.\end{cases}$$

As shown in Table A, the experimental results show that integrating the SoftMatch weighting into our 2-phasic metric led to a slight decrease in classification accuracy. We are conducting further analysis to investigate the underlying causes of this phenomenon.
Table A: Applying SoftMatch-inspired weighting to our 2-phasic metric
| L/C | 3 | 5 | 10 |
|---|---|---|---|
| Confidence | 66.21 | 71.38 | 73.73 |
| +2-phasic | 70.41 | 74.07 | 77.17 |
| +2-phasic & Softmatch | 67.69 | 73.78 | 76.80 |
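The SoftMatch-style weighting we applied can be sketched in a few lines of numpy (our notation: `phi` holds the normalized 2-phasic scores in [0, 1]; the batch mean and variance play the roles of the truncated-Gaussian parameters):

```python
import numpy as np

def softmatch_weights(phi):
    """Gaussian down-weighting of samples whose normalized score falls below
    the batch mean; scores at or above the mean keep their full weight."""
    mu_hat = phi.mean()
    var_hat = phi.var() + 1e-8  # guard against zero variance
    gauss = np.exp(-((phi - mu_hat) ** 2) / (2.0 * var_hat))
    return np.where(phi < mu_hat, phi * gauss, phi)
```

Low-scoring samples are thus suppressed twice, by the score itself and by the Gaussian factor, which may explain the slight accuracy drop when combined with a metric that deliberately selects non-stationary samples.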
This paper introduces a novel type of pseudo-labels that hold significant potential for enhancing pseudo-labeling strategies and complementing existing methods. The authors further propose a metric to efficiently identify these two-phase labels. Extensive experiments on eight datasets demonstrate that the 2-phasic metric significantly boosts the performance of existing pseudo-labeling methods.
Questions For Authors
- The paper identifies limitations, including parameter sensitivity and computational overhead. Could you discuss any ongoing or planned future work to address these issues?
- Have you ever considered evaluating the 2-phasic metric across various types of neural networks, such as RNNs and Transformers?
Claims And Evidence
NA
Methods And Evaluation Criteria
The proposed methods and evaluation criteria are appropriate for the problem of enhancing pseudo-labeling in semi-supervised learning. The 2-phasic metric and loss function effectively capture the unique characteristics of two-phase labels. The criteria used in the paper are standard and relevant, ensuring applicable results.
Theoretical Claims
The paper supports its claims through experimental results and qualitative reasoning, rather than formal proofs. It demonstrates that incorporating two-phase labels improves model performance across multiple datasets and explains their effectiveness by their proximity to decision boundaries and ability to capture complex patterns.
Experimental Design And Analyses
The experimental designs and analyses in the paper are sound and valid. It compares the 2-phasic metric against various baselines across diverse datasets, demonstrating notable performance gains. It also includes ablation studies to validate the contributions of different components. Overall, the experiments effectively substantiate the proposed method’s efficacy.
Supplementary Material
Yes. The paper provides appendices that describe many of the details of two-phase labels. We mainly focus on parts B (Validation of LMO Entropy), D (2-phasic based Pseudo-labeling Algorithm), and E (Details of Experiments).
Relation To Broader Scientific Literature
The paper’s core contributions are deeply connected to the broader fields of semi-supervised learning and training dynamics.
Essential References Not Discussed
NA
Other Strengths And Weaknesses
Strengths
- The paper introduces novel two-phase labels and a tailored metric and loss function to enhance semi-supervised learning.
- Extensive experiments show significant performance gains, especially with limited labeled data.
- The approach is practical, easily integrated into existing frameworks and effective in scenarios with limited labeled data.
Weaknesses
- Requires careful parameter adjustment, which can be time-consuming.
- Generalizability to other domains (e.g., text, time-series) is untested.
Other Comments Or Suggestions
- Add equation numbers for all equations to enhance clarity and referencing.
- Ensure figure references are accurate and well-positioned.
- Enhance captions and explanations for figures and tables to better guide readers.
Thank you so much for your detailed and constructive comments! We have carefully studied them and revised the paper accordingly.
Q1: The paper identifies limitations, including parameter sensitivity and computational overhead. Could you discuss any ongoing or planned future work to address these issues?
Response: Thank you for raising these important points. Below is our response to these limitations.
- Hyperparameter determination
As shown in Figure 6, although the model’s performance exhibits fluctuations with certain hyperparameter variations, the overall performance remains relatively stable. This indicates that some hyperparameters have low sensitivity and that there may be correlations among them. Therefore, in future work, we plan to reduce the number of hyperparameters by employing hyperparameter fusion strategies, aiming to simplify the deployment of the 2-phasic metric while retaining its performance advantages.
- Computational Overhead
Calculating the 2-phasic metric requires recording training dynamics but does not introduce significant computational overhead. Specifically, it introduces an additional space complexity of O(N|T|) and time complexity of O(NC|T|), where N is the number of unlabeled samples, C is the class count, and |T| is the number of recorded training dynamics. Empirical validation (e.g., C=100 and |T|=50 for CIFAR-100) shows that this overhead remains modest on real-world datasets.
Moreover, the complexity limitations can be effectively mitigated through two strategies:
- Parallelization: The recording of training dynamics and the computation of the 2-phasic metric are decoupled from the backpropagation process of the neural network. This allows for parallel execution, effectively hiding the additional computational latency within the training loop.
- Efficient Sampling for the Memory Bank: To further optimize memory usage, we propose an adaptive sampling strategy for memory bank construction. Rather than storing the training dynamics at every epoch, we selectively retain a representative subset chosen adaptively from the recorded training dynamics.
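To make these complexity figures concrete, here is a back-of-the-envelope estimate of the memory-bank footprint (our illustration, assuming float32 values; the dataset sizes are the CIFAR-100-like setting mentioned above):

```python
def dynamics_footprint_mb(n_samples, n_classes, n_snapshots, bytes_per_value=4):
    """Return (full probability storage, argmax-labels-only storage) in MiB."""
    full = n_samples * n_classes * n_snapshots * bytes_per_value   # O(N*C*|T|)
    labels_only = n_samples * n_snapshots * bytes_per_value        # O(N*|T|)
    return full / 2**20, labels_only / 2**20

# A CIFAR-100-like setting: 50k unlabeled samples, C=100, |T|=50 snapshots
full_mb, labels_mb = dynamics_footprint_mb(50_000, 100, 50)
# full_mb ≈ 954 MiB for full probability vectors, labels_mb ≈ 9.5 MiB for labels only
```

Storage grows linearly in |T|, which is why keeping a small, representative subset of snapshots (the adaptive sampling strategy above) keeps the bank cheap.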
Q2: Have you ever considered evaluating 2-phasic metric across various types of neural networks, such as RNNs and Transformers?
Response: The principle of the 2-phasic metric is to capture the transition of neural network learning patterns from simple to complex. The universal law of neural networks learning from easy to hard has been well established [1][2], endowing the 2-phasic metric with broad applicability across various neural network architectures. In our experiments, we used GCN and ViT as backbones for node classification and image classification, respectively; ViT is a Transformer-based architecture. In the appendix, we additionally validated the effectiveness of the 2-phasic metric using GAT as the backbone for node classification. Moving forward, we plan to apply the 2-phasic metric to diverse data types and additional backbone architectures to further verify its generalizability.
[1] Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In International Conference on Machine Learning, pp. 233–242, 2017.
[2] Siddiqui, S. A., Rajkumar, N., Maharaj, T., Krueger, D., and Hooker, S. Metadata archaeology: Unearthing data subsets by leveraging training dynamics. In The Eleventh International Conference on Learning Representations, 2023.
Other Comments Or Suggestions:
- Ensure figure references are accurate and well-positioned.
- Enhance captions and explanations for figures and tables to better guide readers.
Thank you for your valuable suggestions. In response, we have made the following revisions:
- We carefully reviewed all figure and table references, including the main paper and the Appendix.
- We examined the captions of all figures and tables and revised some of them, such as the caption of Table 2.
Thanks for your responses. The authors have addressed my previous concerns.
The paper introduces a novel approach to pseudo-labeling in semi-supervised learning by identifying and leveraging "two-phase labels"—predicted labels that exhibit a distinct two-phase pattern during training: initially predicted as one category in early epochs and later switching to another. The authors argue that these labels, despite their low confidence and non-stationary nature, are highly informative for decision boundaries. To operationalize this insight, they propose a 2-phasic metric to quantify the two-phase pattern and a tailored loss function that learns correct correlations while mitigating false ones. Experiments across eight datasets (including image and graph benchmarks) demonstrate that incorporating two-phase labels boosts the performance of existing pseudo-labeling methods.
The paper received 4 positive reviews, with reviewers acknowledging its novelty, theoretical grounding, and empirical effectiveness. Thus, I recommend acceptance.