PaperHub
Rating: 5.8 / 10 · Poster · 4 reviewers
Individual ratings: 5, 6, 6, 6 (min 5, max 6, std 0.4)
Confidence: 3.8
Correctness: 3.0 · Contribution: 2.5 · Presentation: 2.5
ICLR 2025

Improving Data Efficiency via Curating LLM-Driven Rating Systems

OpenReview · PDF
Submitted: 2024-09-22 · Updated: 2025-03-01
TL;DR

We systematically analyze the inherent errors of the rated scores generated by advanced LLMs, which are typically used for data selection, and then provide a score curation method to rectify and ensure the quality of LLM rated scores.

Abstract

Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce $DS^2$, a **D**iversity-aware **S**core curation method for **D**ata **S**election. By systematically modeling error patterns through a score transition matrix, $DS^2$ corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that "more can be less".
Keywords
Data Selection · LLM · Instruction Tuning

Reviews & Discussion

Review
Rating: 5

The paper presents DS2, a method for improving data efficiency in instruction tuning for LLMs. DS2 corrects inaccuracies and biases in LLM-generated data quality scores using a score transition matrix and promotes diversity in the data selection process. The authors demonstrate that using a curated subset can outperform larger datasets, challenging traditional data scaling laws and achieving better results than human-curated datasets like LIMA. Key contributions include modeling score errors across various LLMs, developing a novel data curation pipeline, and conducting extensive experiments to validate DS2's effectiveness against multiple baselines.

Strengths

  1. The paper argues that a smaller, high-quality dataset can outperform larger datasets. The introduction of a score transition matrix to model LLM-based errors adds a novel approach to data curation, combining insights from label noise reduction and LLM evaluation.
  2. The experimental section is substantial and persuasive, featuring a rich number of comparative experiments that cover various scenarios and conditions, effectively demonstrating the robustness and efficacy of the proposed algorithm. Furthermore, the experimental results indicate significant performance improvements, further validating the superiority of the proposed approach.

Weaknesses

  1. While the score transition matrix is central to DS2, the paper lacks an in-depth analysis of its limitations or conditions under which it may not be effective. For instance, the independence assumption (that the transition matrix is not sample-specific) might limit its ability to model certain data-specific errors.
  2. The paper heavily relies on the capabilities of pre-trained LLMs to generate initial quality ratings and the transition matrix to correct the final quality score.
  3. If there is a strong LLM evaluator (e.g., GPT-4o or o1), the score error correction process seems unnecessary.

Questions

  1. Can the authors provide more examples of typical errors in LLM-rated scores and explain how DS2 specifically addresses these? This would clarify the practical benefits of the score transition matrix.
  2. Could the authors discuss how the diversity score metric performs in datasets with different distributions or in tasks with varying levels of data complexity? Providing case studies or examples would help demonstrate the versatility of this metric.
  3. How sensitive is DS2's performance to the choice of embedding model used for the k-NN agreement score? Exploring different embeddings beyond the BGE model may provide insights into the robustness of the approach.
  4. In an open-ended question, the responses may not be similar at all (resulting in very different embedding vectors), but they could all be equally excellent answers. In this method, I believe the two responses may be curated into completely different scores, leading to errors. The stated intuition, "Intuitively, a lower cosine similarity score indicates a higher likelihood of a rating error", does not adequately explain the example above.
  5. "please tell me the answer of below math question '1+1=?' answer:2" "please tell me the answer of below math question '1+1=?' answer:3" The two questions have similar embedding vectors, but they should have completely different scores. This seems to contradict the notion that samples with high embedding vector similarity are more likely to have the same labels.
Comment

We want to thank the reviewer for their positive feedback and comments. We will address individual comments below.

W1: In-depth analysis of the limitations

Thank you for raising this concern. This independence assumption is a mathematical simplification that allows us to develop our statistical estimation approach. Our assumption follows the settings in the learning-with-noisy-labels literature [1, 2, 3]. As also studied in that literature, when no assumption is made about the transition matrix's dependence on sample-level features, it becomes extremely hard to identify the correct noise structure of this noisy rating problem [4], which prevents the development of corresponding detection approaches. Besides, this assumption can also be viewed as an "average case" estimation of the amount of error for samples with the same true rating (though we agree with the reviewer that it can be imperfect and non-ideal). Empirically, our final experimental results demonstrate the efficacy of our method under this assumption. Enhancing performance under weaker assumptions, such as group-dependent or sample-dependent ones, could be a valuable direction for future research. We have included the limitations in the revised manuscript (Appendix A):

  • Sample-independent assumption: The sample-independent assumption is critical for deriving the transition matrix T and the true score probability distribution p. However, this assumption may be somewhat strong and could inevitably introduce certain data-specific errors. Exploring weaker assumptions, such as group-dependent approaches, could be a valuable direction for future research.
  • Base model scale: Our experiments are primarily conducted on pre-trained models at the 7B/8B scale. It remains uncertain how well the method would perform on larger-scale pre-trained models.
  • Rating models: Due to cost considerations, we use the more affordable GPT-4o-mini to generate GPT-level scores. It is unclear whether the score curation mechanism works for more powerful GPT models (e.g., GPT-4 or GPT-o1).
  • k-NN clusterability. The k-NN clusterability definition implies that similar embedding vectors should correspond to the same rating score or class, a characteristic commonly leveraged in image classification tasks. However, in text-related tasks, highly similar text can convey opposite semantic meanings due to subtle differences, such as a single word change. To address this challenge, powerful embedding models are essential to accurately distinguish these subtle differences and effectively capture the underlying semantic meaning.

[1] Learning with noisy labels, NeurIPS 2013.

[2] Classification with noisy labels by importance reweighting, TPAMI 2015.

[3] Peer loss functions: Learning from noisy labels without knowing noise rates, ICML 2020.

[4] Identifiability of Label Noise Transition Matrix, ICML 2023.

W2: The paper heavily relies on the capabilities of pre-trained LLMs

Thank you for pointing out this concern. We would like to clarify that this characteristic of our work is not a weakness. LLM-based data selection procedures have been demonstrated to be both efficient and widely adopted in practice. This is evident from several newly published works, including AlpaGasus (ICLR 2024), DEITA (ICLR 2024), Qurating (ICLR 2024), InsTag (ICLR 2023), LESS (ICML 2024), and Tree-instruct (ACL 2024), to name a few. More details regarding these methods are discussed in Section 2 (Related Work). Beyond academic research, LLM-based ratings are also extensively used in industry for data selection tasks, serving as a cost-effective alternative to manual annotation.

Furthermore, even though the embeddings used are extracted from the newly released model BGE, our transition matrix does not heavily depend on its performance. In Appendix D, we investigate the impact of using SentenceBert as the embedding model. When compared to Figure 3 (BGE), the transition matrix shown in Figure 8 (SentenceBert) exhibits a striking similarity, demonstrating that our transition matrix is largely independent of the specific embedding model used.

Comment

W3: The score error correction process seems unnecessary if using a strong LLM evaluator

Thank you for pointing this out. We agree that a strong LLM evaluator could be beneficial for generating accurate ratings. However, due to the higher cost of using a strong LLM evaluator, current open-sourced LLMs (such as LLaMA-3.1 and Mistral) offer a cost-effective and efficient alternative. The score error correction process plays a crucial role in bridging the performance gap between GPT-based and open-sourced LLMs. Our experimental results further highlight the necessity and effectiveness of this score error correction process, particularly for GPT-4o-mini. From the experimental results of GPT-4o-mini, we believe that the phenomenon of score errors is widespread and unavoidable, even for strong LLM evaluators.

Besides, we would like to clarify that GPT may not evaluate perfectly across all tasks. For instance, in certain domain-specific tasks, GPT may lack the necessary knowledge to provide accurate evaluations due to the privacy and restricted access to such specialized data. In such cases, GPT might produce inaccurate evaluations, leading to potential scoring errors. Therefore, implementing a score curation mechanism to mitigate potential score errors remains crucial for ensuring more reliable and accurate evaluations.

Q1: Examples of Typical Errors in LLM-Rated Scores and How DS2 Addresses Them

Thank you for your comment. Here, we present a target sample from our data pool along with its 2-NN samples to illustrate this. By analyzing these samples manually, it becomes evident that the target sample is likely misrated and should have a score of 3.

  • Target Example (Score: 1): User: "You need to complete the following task: Calculate 15% of the following number: 100", Assistant: "15% of 100 is 15."
  • KNN Example 1 (Score: 3): User: "Calculate 15% of 500", Assistant: "75"
  • KNN Example 2 (Score: 3): User: "Calculate 50% of 300.", Assistant: "50% of 300 is 150."

First, DS2 derives the score transition matrix based on the concept of k-NN clusterability. Specifically, it aggregates the score information from all 2-NN clusters and calculates the average score proportion probability, which serves as the consensus information for constructing the score transition matrix [Section 3.2].

The score transition matrix provides statistical transition probabilities but does not identify specific samples that are likely to be misrated. To address this, DS2 computes the k-NN agreement score for each sample [Section 4.1, Line 282]. Intuitively, if a sample's k-NN agreement score is low (indicating that its score significantly differs from the scores of its k-NN neighbors), it is likely to have been misrated. Samples are then ranked by their k-NN agreement scores, and the bottom M samples are flagged as misrated, where M is determined based on the error threshold derived from the statistical transition matrix [Section 4.1, Line 291].

To address imbalances in LLM-based scores, a confidence probability is introduced to adjust the size of the identified misrated sample set. Finally, DS2 curates (replaces) the scores of these misrated samples with candidate scores suggested by the k-NN agreement (majority vote) [Section 4.1, Line 306].

Therefore, in the above example, the score of the target sample will be curated to 3 by DS2, consistent with its 2-NN neighbors.
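To make the flow above concrete, here is a minimal Python sketch of the k-NN agreement and majority-vote curation steps. It simplifies several details (2-NN neighbor indices are assumed to be precomputed from embeddings, and the number of flagged samples is passed in directly rather than derived from the transition-matrix error threshold and confidence probability), so it illustrates the idea rather than reproducing the paper's implementation.

```python
import numpy as np

def knn_agreement_scores(scores, neighbor_idx, num_classes=6):
    """Cosine similarity between each sample's one-hot rated score and the
    score histogram of its k nearest neighbors (higher = more agreement)."""
    scores = np.asarray(scores)
    agreement = np.zeros(len(scores))
    for i in range(len(scores)):
        one_hot = np.zeros(num_classes)
        one_hot[scores[i]] = 1.0
        hist = np.bincount(scores[neighbor_idx[i]], minlength=num_classes).astype(float)
        agreement[i] = (hist @ one_hot) / (np.linalg.norm(hist) * np.linalg.norm(one_hot) + 1e-12)
    return agreement

def curate_scores(scores, neighbor_idx, num_flagged):
    """Flag the num_flagged lowest-agreement samples as likely misrated and
    replace their scores with the majority vote of their k-NN scores."""
    scores = np.asarray(scores)
    agreement = knn_agreement_scores(scores, neighbor_idx)
    flagged = np.argsort(agreement)[:num_flagged]      # bottom-M samples
    curated = scores.copy()
    for i in flagged:
        votes = np.bincount(scores[neighbor_idx[i]])
        curated[i] = votes.argmax()                    # k-NN majority vote
    return curated, flagged

# Toy usage mirroring the target / 2-NN example above: sample 0 is rated 1,
# its two nearest neighbors are rated 3, so its score is curated to 3.
scores = [1, 3, 3, 3]
neighbor_idx = np.array([[1, 2], [0, 2], [0, 1], [1, 2]])
curated, flagged = curate_scores(scores, neighbor_idx, num_flagged=1)
print(curated, flagged)  # -> [3 3 3 3] [0]
```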

Comment

Q2: Explore the impact of diversity score in datasets with different distributions using case studies

The importance of diversity for LLM data selection has been extensively explored by previous work [1, 2, 3]. Note that our data pool is composed of five distinct subsets, each characterized by varying levels of complexity and diversity. The statistical analysis of diversity scores across subsets, as illustrated in Figure 14, confirms this. More details can be found in Appendix I of the revised version. To evaluate the versatility of the diversity score, we further conduct additional comparison experiments here. In particular, we rank the samples of each subset solely by the diversity score and then independently select the Top-k and Bottom-k samples to construct datasets for LLM instruction finetuning, where k = 10,000. The corresponding performance results are presented in the following table. For cost considerations, we employ LLaMA-3.2-3B as the base model. The experimental settings are consistent with those outlined in our paper. From the table, it is evident that the diversity score is not universally effective across all datasets. To achieve better results, it should be complemented with other metrics, such as LLM rating scores.

Table 1. Performance comparison across different subsets. Bottom-k (Top-k) refers to the samples with the lowest (highest) diversity scores, where k is fixed at 10,000.

| Model | MMLU | BBH | GSM8K | TruthfulQA (MC2) | TydiQA | Average |
|---|---|---|---|---|---|---|
| flan_v2 (Bottom-k) | 55.56 | 44.91 | 24.5 | 38.63 | 55.9 | 43.900 |
| flan_v2 (Top-k) | 54.84 | 45.00 | 29.5 | 41.69 | 60.5 | 46.306 |
| wizardlm (Bottom-k) | 56.68 | 45.83 | 30.5 | 46.64 | 37.7 | 43.470 |
| wizardlm (Top-k) | 56.60 | 47.69 | 28.5 | 48.15 | 31.2 | 42.428 |
| Alpaca (Bottom-k) | 56.50 | 46.30 | 28.5 | 40.19 | 48.4 | 43.978 |
| Alpaca (Top-k) | 55.13 | 47.13 | 26.0 | 40.58 | 39.5 | 41.668 |
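For reference, the Top-k/Bottom-k split used in this comparison reduces to a simple ranking step; the sketch below assumes a precomputed per-sample `diversity_scores` array (how the long-tail diversity score itself is computed follows the paper, not this sketch).

```python
import numpy as np

def split_by_diversity(diversity_scores, k=10_000):
    """Indices of the k highest- and k lowest-diversity samples."""
    order = np.argsort(diversity_scores)   # ascending order
    return order[-k:], order[:k]           # (Top-k, Bottom-k)

# Hypothetical usage on one subset (e.g., flan_v2) with stand-in scores.
rng = np.random.default_rng(0)
diversity_scores = rng.random(100_000)     # placeholder for real diversity scores
top_k, bottom_k = split_by_diversity(diversity_scores, k=10_000)
```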

[1] How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources, NeurIPS 2023.

[2] What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning, ICLR 2024.

[3] Self-Instruct: Aligning Language Models with Self-Generated Instructions, ACL 2023.

Q3: Explore the impact of embedding models

Thank you for your insightful comment. In Appendix D (Line 1036), we use the typical SentenceBERT model to extract the embeddings of data samples. The corresponding score transition matrix, as well as the score pattern shown in Figure 8, is similar to the matrix obtained with the BGE model (Figure 3). Therefore, one can reasonably claim that the impact of the embedding space is limited: the choice of embedding model does not significantly affect the error patterns produced by LLMs, which also highlights the robustness of our approach.

Q4: Concerns Regarding Cosine Similarity Assumptions and Potential Errors in Open-Ended Questions

We would like to clarify that the k-NN agreement score is defined as the average cosine similarity LLM score distance among k-NN samples. The intuition is that when samples with similar embedding vectors have significantly different scores, resulting in a lower k-NN agreement score, the uncertainty of the LLM rating for this example increases. Consequently, such samples are more likely to require curation. Whether a sample needs curation and whether a curated score should be assigned finally is determined jointly by an error threshold and confidence probability.

When the embedding vectors of two responses to the same open-ended question differ significantly, these two samples can be considered independent, as their k-NN samples may not include each other. In such cases, their cosine similarity scores are calculated based on their respective k-NN samples independently.

Comment

Q5: Contradiction in Embedding Similarity and Label Consistency for Math Questions

Thank you for bringing up this important point! We partially agree that these two questions might conflict with our notion. The scores of samples are influenced not just by correctness but also by broader quality metrics, such as rarity, complexity, and informativeness, as emphasized in our prompt template. Focusing solely on correctness with binary scores (0 or 1) treats correct and incorrect answers distinctly. In contrast, when assessing overall quality on a granular scale (e.g., [0–10], later compressed to [0–5]), both questions could score 0 for simplicity and low informativeness. Thus, considering overall quality reduces the direct emphasis on correctness while reinforcing the evaluation framework. To evaluate this, we generated scores for the two questions above (Example 1 & Example 2) under the same settings as in the paper.

  • Example 1: "please tell me the answer of below math question '1+1=?' answer:2"
  • Example 2: "please tell me the answer of below math question '1+1=?' answer:3"
  • Example 3: "please tell me the answer of below math question '10+10=?' answer:20"
  • Example 4: "Calculate 15% of 500.\n\n answer: 75"
  • Example 5: "Calculate 50% of 300. answer: '50% of 300 is 150.'"

Table 2. LLM rating scores comparison between Example 1 and Example 2.

| Metric | LLaMA Ex. 1 | LLaMA Ex. 2 | Mistral Ex. 1 | Mistral Ex. 2 | Gemma Ex. 1 | Gemma Ex. 2 | GPT Ex. 1 | GPT Ex. 2 |
|---|---|---|---|---|---|---|---|---|
| Rarity | 2 | 1 | 2 | 1 | 3 | 2 | 2 | 1 |
| Complexity | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Informativeness | 1 | 4 | 5 | 4 | 3 | 3 | 1 | 2 |
| Overall Rating | 3 | 3 | 4 | 3 | 4 | 3 | 2 | 1 |
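As an illustration of how such multi-metric ratings can be requested and parsed, here is a minimal sketch; the prompt wording and JSON field names are assumptions for demonstration, not the paper's exact template.

```python
import json

RATING_PROMPT = """Rate the following instruction-response pair for rarity,
complexity, and informativeness, and give an overall rating (integers 0-5).
Return only a JSON object with keys: rarity, complexity, informativeness, overall.

{sample}"""

def parse_rating(llm_output: str) -> dict:
    """Parse the JSON rating string returned by the rating model."""
    rating = json.loads(llm_output)
    return {key: int(rating[key])
            for key in ("rarity", "complexity", "informativeness", "overall")}

# Mocked model response (no API call), matching GPT's ratings for Example 1 above.
mock_response = '{"rarity": 2, "complexity": 1, "informativeness": 1, "overall": 2}'
print(parse_rating(mock_response))
```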

Furthermore, in practice, the embedding models are able to capture the semantic similarities and dissimilarities instead of accumulating the similarities at the token level. We generate the embedding vectors of several examples using four embedding models, including GPTs. Here, Examples 4 and 5 are taken from the data pool, while Example 3 is a revised version of Example 1. We observe that even small differences result in much larger embedding cosine similarity distances, effectively capturing the additional semantic information.

Table 3. Cosine similarity distances between any two examples.

| Embedding Model | Example 1 & 2 | Example 2 & 3 | Example 4 & 5 |
|---|---|---|---|
| GPT text-embedding-3-small | 0.0208 | 0.1380 | 0.1415 |
| GPT text-embedding-3-large | 0.0490 | 0.2043 | 0.2464 |
| GPT text-embedding-ada-002 | 0.0050 | 0.0487 | 0.0697 |
| bge-large-en-v1.5 | 0.00772 | 0.04975 | 0.09283 |
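For reference, a minimal sketch of how such cosine distances can be computed with the open-source BGE model via the sentence-transformers library; the exact preprocessing and model settings used in the paper may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

examples = [
    "please tell me the answer of below math question '1+1=?' answer:2",     # Example 1
    "please tell me the answer of below math question '1+1=?' answer:3",     # Example 2
    "please tell me the answer of below math question '10+10=?' answer:20",  # Example 3
]

# With normalized embeddings, cosine distance = 1 - dot product.
emb = model.encode(examples, normalize_embeddings=True)
print("Example 1 & 2:", round(1.0 - float(np.dot(emb[0], emb[1])), 5))
print("Example 2 & 3:", round(1.0 - float(np.dot(emb[1], emb[2])), 5))
```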

In our revised manuscript, we include some target samples with their 2-NN examples from our data pool in Table 10 (Page 21) to evaluate k-NN clusterability. For simplicity of illustration, we filter out overly long samples. From these randomly selected examples, we find that our concept holds for the data pool and that such extreme questions are generally unlikely to occur, so the issue is significantly mitigated. Besides, from our observation, the data samples in our data pool (Flan_v2, WizardLM, Oasst1, Alpaca and Dolly) are almost all correct but vary in quality level. Therefore, correctness is not our main focus (we do not evaluate data samples by correctness alone), and the problem described above, which stems from correctness, has limited impact in our paper.

Comment

Based on Reviewer XKFp’s recommendation, we have conducted a more systematic analysis instead of several examples to further validate and demonstrate the practicality of k-NN clusterability, as outlined below.

Systematic analysis of the k-NN clusterability definition

Due to the unavailability of samples' ground-truth scores, we evaluate k-NN clusterability by examining the distribution of average score gaps, which measure the score difference within one k-NN cluster. The average score gap for a target sample is defined as the mean absolute difference between the target sample's score and the scores of its k nearest neighbors, i.e., average score gap = Mean(|target sample's score − kNN sample's score|). In our work, we focus on 2-NN clusterability and frame our analysis within this context. Specifically, for each 2-NN cluster, we consider a target sample and its two nearest neighbors. For example, given a 2-NN cluster with the score tuple (target sample: 1, kNN sample 1: 2, kNN sample 2: 3), the score gap is calculated as: Average score gap = (|1 − 2| + |1 − 3|) / 2 = 1.5. Table 1 below summarizes the statistical distribution of score gaps across all 2-NN clusters.

Table 1. Average score gap statistical information of all 2-NN clusters from our data pool.

| Curation | Model | Score Gap 0.0-1.0 (%) | Score Gap 1.5 (%) | Score Gap 2.0 (%) | Score Gap >2.0 (%) |
|---|---|---|---|---|---|
| w/o Curation | GPT | 81.0 | 12.0 | 4.9 | 2.1 |
| w/o Curation | LLaMA | 58.3 | 18.0 | 12.2 | 11.5 |
| w/o Curation | Mistral | 70.2 | 16.5 | 8.1 | 5.4 |
| w/ Curation | GPT | 82.5 | 10.9 | 4.5 | 1.7 |
| w/ Curation | LLaMA | 78.8 | 9.4 | 7.3 | 4.1 |
| w/ Curation | Mistral | 80.5 | 10.8 | 5.6 | 4.3 |
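For reference, a minimal sketch of how the average score gaps summarized above could be computed, assuming rating scores and 2-NN neighbor indices are already available (the bucket edges mirror the table):

```python
import numpy as np

def average_score_gaps(scores, neighbor_idx):
    """Mean absolute score difference between each target sample and its k-NN."""
    scores = np.asarray(scores, dtype=float)
    return np.array([np.mean(np.abs(scores[i] - scores[np.asarray(nbrs)]))
                     for i, nbrs in enumerate(neighbor_idx)])

def gap_distribution(gaps):
    """Percentage of 2-NN clusters in each score-gap bucket."""
    return {"0.0-1.0": round(100 * np.mean(gaps <= 1.0), 1),
            "1.5":     round(100 * np.mean(gaps == 1.5), 1),
            "2.0":     round(100 * np.mean(gaps == 2.0), 1),
            ">2.0":    round(100 * np.mean(gaps > 2.0), 1)}

# Toy check: the cluster (target=1, neighbors=2 and 3) has gap (|1-2| + |1-3|) / 2 = 1.5.
scores = [1, 2, 3]
neighbor_idx = [[1, 2], [0, 2], [0, 1]]
gaps = average_score_gaps(scores, neighbor_idx)
print(gaps[0])                 # 1.5
print(gap_distribution(gaps))
```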

From Table 1, we observe that without score curation, GPT has a higher proportion of samples in the 0.0–1.0 score gap range (81.0%) compared to Mistral (70.2%) and LLaMA (58.3%). This reveals that more powerful rating models, such as GPT, tend to exhibit smaller average score gaps, which aligns more closely with the concept of k-NN clusterability and contributes to improved performance.

Moreover, when comparing the settings with and without score curation, we observe that all three rating models show an increased proportion of samples in the 0.0–1.0 score gap range after score curation. From Table 2, this shift in the score gap distribution correlates strongly with the performance improvements observed on LLM leaderboard tasks (as detailed in Table 3 of the manuscript).

Table 2. The proportion of samples in the 0.0–1.0 score gap range both with and without score curation for each rating model. For comparison, the corresponding average performance on LLM Leaderboard tasks is included in parentheses.

| Rating Model | 0.0-1.0 score gap w/o Curation (Avg. Performance) | 0.0-1.0 score gap w/ Curation (Avg. Performance) |
|---|---|---|
| GPT | 81.0% (60.2) | 82.5% (61.4) |
| LLaMA | 58.3% (59.2) | 78.8% (60.2) |
| Mistral | 70.2% (60.7) | 80.5% (61.1) |

Therefore, these results demonstrate the validity of the k-NN clusterability definition proposed in our paper. For a clearer visualization of score gap proportions before and after score curation, we encourage the reviewer to refer to the newly added Figure 8. For your reference, the average embedding distance distribution across all 2-NN clusters from our data pool is also provided in the newly added Figure 9 in the revised manuscript.

Review
Rating: 6

The paper proposes a data selection pipeline named "DS^2", which utilizes k-NN statistical information based on the rating scores to curate ratings and thereby achieve better data selection. The authors validate their method by conducting experiments with various rating models and benchmark tasks.

Strengths

Originality: The paper proposed a novel method to utilize the score transition matrix to detect scoring errors. Although the score transition matrix is proposed by previous works, the application of that into LLM data selection is novel.

Quality: The paper provides detailed explanations of their methods and validates the results using thorough experiments, in which they include three models (GPT-4o-mini, Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3) and various benchmarking tasks (MMLU, TruthfulQA, etc). Moreover, they also evaluated the alignment performance. All these greatly improve the quality of the paper.

Clarity: The paper demonstrated great clarity when exhibiting the error patterns of LLM scores, as well as when explaining the experiment setup and results.

Significance: The significance is validated with the experiment results and the ablation studies.

Weaknesses

The paper does not include discussions about the efficiency of the proposed method and the efficiency comparison with other baseline methods. Part of the computation relies on estimating the score transition matrix, but the details about this estimation are lacking. Although the score transition matrix is proposed and discussed in previous works, it is still necessary to analyze their cost within the current framework. Moreover, the paper does not include any baselines that also improve data quality by modifying the score ratings of the data. The current framework essentially proposed a method to modify the rating scores and then conduct data selection based on new scores. It would be helpful to further validate the method by comparing other methods that also study LLM ratings.

Questions

I find the discussion around line 200, saying that similar embeddings should belong to the same category, not very convincing. Because each class here is a rating score, in my opinion two samples could have similar textual embeddings but completely different correctness or quality, which leads to distinct rating scores. I think it is crucial to better explain the intuition behind the proposed method.

Comment

We want to thank the reviewer for their positive feedback and comments. We will address individual comments below.

W1: The efficiency of the proposed method and the efficiency comparison with other baseline methods

Thank you for your insightful suggestion. We have summarized the storage and runtime of our approach alongside three baselines below. The wall-clock time was measured on a Microsoft Azure cluster with 8 A100 (80GB) GPUs. Note that our score curation mechanism relies primarily on linear programming (LP), which runs exclusively on the CPU. As shown in the table, LLM rating systems are advantageous over the gradient-based method LESS in terms of both storage and runtime. Notably, compared to AlpaGasus and DEITA, our method avoids any significant computation costs on the GPU.

Table 1. Comparison of storage and running time over baselines.

| Method | Storage | Rating/Gradient Time | Diversity Score Time | CPU-only Curation Time | Data Selection Time | Base Model Free | Validation Set |
|---|---|---|---|---|---|---|---|
| LESS | 20GB | 66 hours | - | - | <1 min | No | Required |
| AlpaGasus | <10MB | 6 hours | - | - | <1 min | Yes | Not Required |
| DEITA | <10MB | 6 hours | 10 mins | - | <1 min | Yes | Not Required |
| Ours | <10MB | 6 hours | - | 15 mins | <1 min | Yes | Not Required |

W2: Baselines by modifying the score ratings

Thank you for your thoughtful suggestion. We would like to clarify that, to the best of our knowledge, our work is the first to incorporate a score curation mechanism into rating-based LLM data selection procedures. Although the development of LLMs has been progressing rapidly, the application of LLM ratings remains a relatively new area of research. Starting from AlpaGasus [1] (ICLR 2024), existing works that study LLM ratings directly apply the raw scores generated by LLMs for data selection. DEITA [2] (ICLR 2024) is a follow-up work that jointly considers data diversity. We would really appreciate it if the reviewer could provide any references about modifying score ratings.

[1] AlpaGasus: Training a Better Alpaca with Fewer Data, ICLR 2024.

[2] What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning, ICLR 2024.

Comment

Q1: Concerns about the KNN clusterability definition that similar embeddings should belong to the same category

Thank you for your insightful comment! We agree with the reviewer that the proposed method depends on the quality of embeddings. For some tasks, using a general embedding model, e.g., a general textual embedding model for math problems, may not be the best choice, since changing a single number can easily destroy the correctness of the answer and thereby result in a different LLM rating score; correctness is the most important factor in changing the LLM rating scores of such samples. This suggests choosing an appropriate embedding model for specific tasks, i.e., a domain-specific embedding model [1]. The statement (a.k.a. the k-NN clusterability definition) is based on the assumption that embeddings capture semantic and contextual similarity for textual data, which often correlates with quality and correctness. Similar to image classification tasks, these high-dimensional representations map semantically similar texts to nearby points in the vector space while positioning dissimilar texts farther apart, enabling clustering that aligns with classification categories. While we acknowledge that samples with subtle token-level differences can yield different scores due to variations in correctness (the most important factor leading to distinct rating scores), our scoring approach considers not just correctness but also overall quality metrics such as rarity and informativeness, as outlined in our prompt template. This helps mitigate the influence of correctness alone on the final score. To evaluate this, we generated scores for the two simple but extreme examples below (Example 1 and Example 2) under the same settings as in the paper.

  • Example 1: "please tell me the answer of below math question '1+1=?' answer:2"
  • Example 2: "please tell me the answer of below math question '1+1=?' answer:3"
  • Example 3: "please tell me the answer of below math question '10+10=?' answer:20"
  • Example 4: "Calculate 15% of 500.\n\n answer: 75"
  • Example 5: "Calculate 50% of 300. answer: '50% of 300 is 150.'"

Table 1. LLM rating scores comparison between Example 1 and Example 2.

| Metric | LLaMA Ex. 1 | LLaMA Ex. 2 | Mistral Ex. 1 | Mistral Ex. 2 | Gemma Ex. 1 | Gemma Ex. 2 | GPT Ex. 1 | GPT Ex. 2 |
|---|---|---|---|---|---|---|---|---|
| Rarity | 2 | 1 | 2 | 1 | 3 | 2 | 2 | 1 |
| Complexity | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Informativeness | 1 | 4 | 5 | 4 | 3 | 3 | 1 | 2 |
| Overall Rating | 3 | 3 | 4 | 3 | 4 | 3 | 2 | 1 |

Furthermore, in practice, using a powerful embedding model is essential for capturing the semantic information under subtle token-level differences. We estimate the embedding vectors using several embedding models, including GPTs. Here, Examples 4 and 5 are taken from the data pool, while Example 3 is a revised version of Example 1. We observe that even small differences result in much larger embedding cosine similarity distances, effectively capturing the additional semantic information.

Table 2. Cosine similarity distances between any two examples.

| Embedding Model | Example 1 & 2 | Example 2 & 3 | Example 4 & 5 |
|---|---|---|---|
| GPT text-embedding-3-small | 0.0208 | 0.1380 | 0.1415 |
| GPT text-embedding-3-large | 0.0490 | 0.2043 | 0.2464 |
| GPT text-embedding-ada-002 | 0.0050 | 0.0487 | 0.0697 |
| bge-large-en-v1.5 | 0.00772 | 0.04975 | 0.09283 |

In our revised manuscript, we include some 2-NN examples from our data pool in Table 10 (Page 21) to evaluate k-NN clusterability. From these randomly selected examples, we find that our concept holds for the data pool and that such extreme questions are generally unlikely to occur, so the issue is significantly mitigated. Besides, from our observation, the data samples in our data pool (Flan_v2, WizardLM, Oasst1, Alpaca and Dolly) are almost all correct but vary in quality level. Therefore, correctness is not our main focus (we do not evaluate data samples by correctness alone), and the problem described above, which stems from correctness, has limited impact in our paper.

[1] Do We Need Domain-Specific Embedding Models? An Empirical Investigation, arxiv 2024.

Comment

Thanks to the authors for the responses. To address my concern, more systematic analysis might be needed instead of a few extreme examples. Thus the responses address my concerns only to a very limited extent. I will maintain my original score.

Comment

Dear Reviewer XFFp,

It is great to hear that some of your concerns have been addressed! Thank you for your thoughtful follow-up comments! We would like to provide more systematic analysis here to address your concerns.

Systematic analysis of the k-NN clusterability definition

Due to the unavailability of samples' ground-truth scores, we evaluate k-NN clusterability by examining the distribution of average score gaps, which measure the score difference within one k-NN cluster. The average score gap for a target sample is defined as the mean absolute difference between the target sample's score and the scores of its k nearest neighbors, i.e., average score gap = Mean(|target sample's score − kNN sample's score|). In our work, we focus on 2-NN clusterability and frame our analysis within this context. Specifically, for each 2-NN cluster, we consider a target sample and its two nearest neighbors. For example, given a 2-NN cluster with the score tuple (target sample: 1, kNN sample 1: 2, kNN sample 2: 3), the score gap is calculated as: Average score gap = (|1 − 2| + |1 − 3|) / 2 = 1.5. Table 1 below summarizes the statistical distribution of score gaps across all 2-NN clusters.

Table 1. Average score gap statistical information of all 2-NN clusters from our data pool.

| Curation | Model | Score Gap 0.0-1.0 (%) | Score Gap 1.5 (%) | Score Gap 2.0 (%) | Score Gap >2.0 (%) |
|---|---|---|---|---|---|
| w/o Curation | GPT | 81.0 | 12.0 | 4.9 | 2.1 |
| w/o Curation | LLaMA | 58.3 | 18.0 | 12.2 | 11.5 |
| w/o Curation | Mistral | 70.2 | 16.5 | 8.1 | 5.4 |
| w/ Curation | GPT | 82.5 | 10.9 | 4.5 | 1.7 |
| w/ Curation | LLaMA | 78.8 | 9.4 | 7.3 | 4.1 |
| w/ Curation | Mistral | 80.5 | 10.8 | 5.6 | 4.3 |

From Table 1, we observe that without score curation, GPT has a higher proportion of samples in the 0.0–1.0 score gap range (81.0%) compared to Mistral (70.2%) and LLaMA (58.3%). This reveals that more powerful rating models, such as GPT, tend to exhibit smaller average score gaps, which aligns more closely with the concept of k-NN clusterability and contributes to improved performance.

Moreover, when comparing the settings with and without score curation, we observe that all three rating models show an increased proportion of samples in the 0.0–1.0 score gap range after score curation. From Table 2, this shift in the score gap distribution correlates strongly with the performance improvements observed on LLM leaderboard tasks (as detailed in Table 3 of the manuscript).

Table 2. The proportion of samples in the 0.0–1.0 score gap range both with and without score curation for each rating model. For comparison, the corresponding average performance on LLM Leaderboard tasks is included in parentheses.

| Rating Model | 0.0-1.0 score gap w/o Curation (Avg. Performance) | 0.0-1.0 score gap w/ Curation (Avg. Performance) |
|---|---|---|
| GPT | 81.0% (60.2) | 82.5% (61.4) |
| LLaMA | 58.3% (59.2) | 78.8% (60.2) |
| Mistral | 70.2% (60.7) | 80.5% (61.1) |

Therefore, these results demonstrate the validity of the k-NN clusterability definition proposed in our paper. For a clearer visualization of score gap proportions before and after score curation, we encourage the reviewer to refer to the newly added Figure 8. For your reference, the average embedding distance distribution across all 2-NN clusters from our data pool is also provided in the newly added Figure 9 in the revised manuscript.

Finally, we would really appreciate it if Reviewer XFFp could provide any suggestions about the systematic analysis.

Comment

Dear Reviewer XKFp,

We have revised the paper and added additional systematic analyses of the k-NN clusterability hypothesis to address your comments. Since the rebuttal period is closing very soon, could you please check the response to see whether it mitigates your concerns? We would greatly appreciate it!

Thank you,

The authors

Review
Rating: 6

This study tackles the problem of rating samples using LLMs, more specifically, how to improve the accuracy of LLM-based rating methods. To this end, DS2 (Diversity-aware Score Curation for Data Selection) is proposed to model rating error patterns using a score transition matrix. Experiments on machine-alignment benchmarks show that strong results can be achieved by using just 3.3% of the original dataset.

Strengths

  • The idea of detecting error patterns is novel. Its implementation using a score transition matrix was validated experimentally.

Weaknesses

  • The description of "deriving the score transition matrix" can be improved. Not sure I can reproduce it based on the given explanation. Some examples would help; they could be included in the appendix.
  • It is assumed that "the transition matrix is independent of sample-level features". This can be a limitation as each LLM has its own strengths and weaknesses (i.e., better/worse at some samples).
  • Sometimes, the improvement is small, within the variance of different runs.
  • Some discussions about limitations would be appreciated.

Questions

  • It looks like there is an advantage to using GPT-4o mini. How can embeddings be obtained from proprietary LLMs?
  • Figure 1 reads like ratings of multiple LLMs are combined. The experimental results say otherwise.
  • The proposed approach sometimes performs much worse (e.g., in Table 2). Is it possible to decide the appropriate conditions for using the proposed approach (and which LLM)?
Comment

We would like to thank Reviewer 81Qt for the time and effort invested in reviewing this work. We will address individual comments below.

W1: Improving the Explanation of Score Transition Matrix Derivation with Examples

Thank you for your insightful comment! In Appendix C.1 of the revised manuscript, we add one intuitive binary example to describe the derivation of the score transition matrix in detail. In particular, we introduce more descriptions about the consensus equations. For more details, we refer the reviewer to Appendix C.1 (Lines 858-900). For reproduction, we will release the code as well as the generated dataset.

W2: Concerns about the impact of sample-level independent assumption for score transition matrix

Thank you for raising this insightful concern. We acknowledge that the assumption could be a limitation of this work and may potentially lead to data-specific errors. However, we would like to clarify that this assumption is a mathematical simplification that allows us to develop our statistical estimation approach. Our assumption follows the settings in the learning-with-noisy-labels literature [1, 2, 3]. As also studied in that literature, when no assumption is made about the transition matrix's dependence on sample-level features, it becomes extremely hard to identify the correct noise structure of this noisy rating problem [4], which prevents the development of corresponding detection approaches.

Besides, this assumption can also be viewed as an "average case" estimation of the amount of error for samples with the same true rating (though we agree with the reviewer that it can be imperfect and non-ideal). Empirically, our final experimental results demonstrate the efficacy of our method under this assumption. Enhancing performance under weaker assumptions, such as group-dependent or sample-dependent ones, could be a valuable direction for future research. In Appendix A of the revised manuscript, we highlight some potential limitations, including the sample-level independence assumption.

[1] Learning with noisy labels, NeurIPS 2013.

[2] Classification with noisy labels by importance reweighting, TPAMI 2015.

[3] Peer loss functions: Learning from noisy labels without knowing noise rates, ICML 2020.

[4] Identifiability of Label Noise Transition Matrix, ICML 2023.

W3: Limited performance improvement

We would like to clarify that in Figure 7, we analyze the data scaling effects of baselines across various rating models under three random seeds. The results clearly show that our method consistently outperforms the baselines. Moreover, the improvements are significantly larger than the variance observed across different runs.

W4: Discussions about the limitations

Thank you for raising this concern. While our proposed method demonstrates competitive performance compared to other baselines, we acknowledge that there are still potential limitations. We have included these limitations in the revised manuscript (Appendix A):

  • Sample-level independent assumption: The sample-independent assumption is critical for deriving the transition matrix T and the true score probability distribution p. However, this assumption may be somewhat strong and could inevitably introduce certain data-specific errors. Exploring weaker assumptions, such as group-dependent approaches, could be a valuable direction for future research.
  • Base model scale: Our experiments are primarily conducted on pre-trained models at the 7B/8B scale. It remains uncertain how well the method would perform on larger-scale pre-trained models.
  • Rating models: Due to cost considerations, we use the more affordable GPT-4o-mini to generate GPT-level scores. It is unclear whether the score curation mechanism works for more powerful GPT models (e.g., GPT-4 or GPT-o1).

For your interest, we add more discussion about the K-NN clusterability definition in Appendix A.

Q1: How to obtain embeddings from proprietary LLMs

In practice, leveraging more powerful LLMs (e.g., GPT-4o-mini) can offer the advantage of reducing score errors. In our work, we consistently use the newly open-sourced BGE embedding model (as mentioned in Line 173) to extract data sample embeddings for implementing the score curation mechanism. Besides, in the Appendix, we also utilize SentenceBERT as the embedding model to explore the impact of embedding models, as shown in Figure 8. We observe that the impact of the embedding space is limited: the choice of embedding model does not significantly affect the error patterns produced by LLMs.

Comment

Q2: Concerns about Figure 1

Thank you for bringing this to our attention. To clarify, in our paper, the three LLMs are used to independently generate raw scores for data samples, as reflected in the experimental results. To avoid misunderstanding, we have updated Figure 1 in the revised version. Additionally, in Appendix G.5 (Table 17, Line 1491), out of interest, we explore a new baseline, where ratings from multiple LLMs are combined.

Q3: Concerns about the performance of our proposed method

Thank you for your comment. For clarification, in the revised manuscript, we highlight the best results in boldface and the second-best results with underlining for different settings in our main results (Table 3). From this updated Table 3, it is evident that, in most cases, our proposed method achieves either the best or the second-best performance across various evaluation tasks. Notably, for the MMLU dataset, which emphasizes factual knowledge, prior studies have already underscored the difficulty of achieving significant performance improvements [1, 2, 3]. Thus, it is reasonable to expect that the performance gains on the MMLU task may be relatively modest.

[1] Measuring Massive Multitask Language Understanding, ICLR 2021.

[2] GPT-4 Technical Report, arxiv 2023.

[3] Language Models are Few-Shot Learners, arxiv 2020.

Review
Rating: 6

The authors introduce a data curation algorithm, DS^2, to correct inaccuracies in LLM-based data quality evaluation strategies. The authors first show how, leveraging recent work in noisy label estimation given k-NN clusterability, both the transition matrix and the true probability distribution of LLM scores may be estimated by solving a linear program. Under this k-NN clusterability assumption, the authors demonstrate that widely used LLM scorers (GPT-4o-mini, Llama-3.1 8B, and Mistral 7B) provide incorrect scores. Using both the transition matrix and the (true score) probability distribution, they lay out the steps of their DS^2 algorithm and the calculation of critical quantities: the error threshold for classifying samples as misrated, the confidence probability for mitigating imbalances in LLM scores, and the long-tail diversity score.

The authors then run several data selection experiments leveraging the DS^2 algorithm over a large data pool (300k samples) containing several widely used datasets (i.e., Alpaca, Dolly, Flan V2, etc.). Compared to several data selection heuristics and competitor LLM-based data selection algorithms (i.e., AlpaGasus and DEITA), the authors show that their method outperforms competitors per LLM judge and averaged over common natural language benchmarks (MMLU, TruthfulQA, GSM, BBH, and TydiQA) at a data selection size of 10k samples. Representative experiments then compare the performance of DS^2 selecting 1k samples (from the 300k data pool) against the LIMA dataset (which was heavily vetted and selected using human feedback), once again outperforming LLama-3.1-8B and Mistral 7B models finetuned on the LIMA dataset. Scaling-law experiments as well as hybrid data selection schemes (e.g., AlpaGasus leveraging DS^2's data curation scores) further show the efficacy of the presented methods.

Post-rebuttal update

The authors have posted additional experiments to address my major concerns. Particularly, an apples to apples comparison to AlpaGasus demonstrating DS^2's superiority and a discussion/experiments tackling the practical aspects of k-NN clusterability for this use case. I am raising my score from 5 to 6.

Strengths

The procedure to estimate the transition matrix from LLM scores is intuitive. The beginning of the paper is also well written, and the examples showing the deficiencies in LLM-based scores is compelling. The subsequent probability quantities leveraged throughout the DS^2 algorithm also largely make sense. The large number of experiments demonstrating the effectiveness of the approach is also compelling.

Weaknesses

Firstly, I would like to congratulate the authors on the DS^2 algorithm and the idea of leveraging k-NN clusterability to correct LLM-based scores. Such a line of work seems promising. With that said, I have several concerns that I am hoping can be addressed (I am greatly looking forward to the authors feedback).

While the beginning of the paper is well written, there seemed to be either missing details or a lack of clarity for important concepts. In particular, Algorithm 1 is difficult to follow; "CuratedScores" is not defined in Algorithm 1. Are "ConfidenceProbs" the curated scores? Furthermore:

the average likelihood of identifying the sample n as misrated over multiple epochs.

How is this calculated in practice? Given a dataset with an arbitrary ordering of samples, how can a statistic dealing with the arbitrary index n be meaningful? Put another way, this calculates the statistic at the sample index, but the data pool samples (line 172) are not stated as being ordered in any specific way. Thus, how can a statistic calculated over the sample index be anything other than arbitrary?

Furthermore, while the notion of k-NN score clusterability (as defined in Definition 3.2) is assumed, this notion is never actually demonstrated for the LLM judges and datasets considered. Do the score distributions depicted in Figure 2 abide by k-NN score clusterability? If not, could the authors speak to how this affects the transition matrix and (true score) probability distribution calculations from the consensus vectors (Equation 1)? If k-NN score clusterability is violated in practice, it does not seem possible to guarantee the correctness of estimating the transition matrix and (true score) probability distribution in this case.

There were some concerns regarding the evaluations of AlpaGasus:

"AlpaGasus (Chen et al., 2023) utilizes ChatGPT to rate data samples and solely select high-rated samples"

One of the major innovations of AlpaGasus was the prompt template. Was this used for the AlpaGasus results in Table 3? Note that they also showed the training data size is a very important hyperparameter for Alpaca; thus, a fairer/more apples-to-apples comparison would have been DS^2 applied to Alpaca with a budget of 9k samples and compared to fine-tuning performance (Table 3) of the AlpaGasus dataset.

Questions

For the k-NN agreement score, assuming the k-NN clusterability characteristic, isn't the similarity score just always unity? I.e., due to clusterability, the frequency will only occur at the same index as the one-hot encoding vector, and thus the numerator and denominator turn out to be the same. Also, have the authors empirically explored whether using the embedding feature vector or the one-hot encoded rated-score vector works better in practice?

On the initial read through the paper, there was some initial confusion between the avoidance of relying on the ground truth scores for DS^2 and the appearance of the ground truth per-class probability in Section 4. However, after careful rereading, we have the following: assuming k-NN score clusterability, we are thus able to leverage the consensus vectors and solve the LP for $\mathbf{T}$ and $\mathbf{p}$ (I understand this is stated, but it is easy to miss this simple statement buried in lines 231-241). I would recommend either reiterating that solving the LP formed from Equation 1 provides both the transition matrix and the ground truth probability vector, or breaking up the paragraph on lines 231-241 and emphasizing this.

"the average likelihood of identifying the sample n as misrated over multiple epochs." <- How is this calculated in practice?

For Completion Length, please cite the following: Zhao, Hao, et al. "Long is more for alignment: A simple but tough-to-beat baseline for instruction fine-tuning." arXiv preprint arXiv:2402.04833 (2024).

For Table 3, please specify the data pool is listed in Table 2, e.g., " Performance comparison on OpenLLM leaderboard, using the data pool listed in Table 2."

Could the authors please bold the winners of each column (per grouping) in Table 3?

Comment

We want to thank reviewer ZhKe for the positive feedback and comments. We will address individual comments below.

W1: Missing algorithmic details and lack of clarity for important concepts

Thank you for highlighting this. To clarify, "ConfidenceProbs" are not the curated scores used in this paper. The "CuratedScores" are the raw rating scores after they may have been curated by our score curation mechanism. The confidence probabilities are only used to identify misrated samples, whose raw scores are then replaced by the majority vote of their k-NN scores. We realized that one line was missing at the end of the score curation procedure, namely `CuratedScores = ScoreCuration(MisratedSamples, ConfidenceProbs)`. Besides, in Appendix C of the revised manuscript, we add one intuitive binary example to describe the derivation of the score transition matrix in detail.

W2: The practical calculation of $\overline{p}_n$

Thank you for pointing this out. There seems to be a misunderstanding. In Line 314, $\overline{p}_n$, which represents the average likelihood of identifying sample $n$ as misrated over multiple epochs, is used to calculate the confidence probability. We would like to clarify that the sample index $n$ is used purely for identification purposes and not for ordering. This means that the value of $\overline{p}_n$ for an individual sample remains unchanged regardless of the calculation order between any two examples, such as (5th example, 6th example) or (6th example, 5th example). In practice, we determine whether a sample is misrated over multiple cleaning epochs by assigning a value of 1 if the sample is identified as misrated and 0 otherwise. Specifically, the value of $\overline{p}_n$ is calculated as the average of these misrated labels (values of 0 or 1) across multiple epochs, effectively reflecting the proportion of times the sample is identified as misrated, thereby making the detection process more robust against noise. For example, suppose the misrated labels of one example over five epochs are {0, 1, 0, 1, 1}. Then $\overline{p}_n$ will be 0.6.
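As a tiny illustration of this averaging (a sketch of the bookkeeping only, not the paper's code):

```python
import numpy as np

# Misrated flags for one sample over five cleaning epochs (1 = flagged as misrated).
misrated_flags = [0, 1, 0, 1, 1]
p_bar_n = np.mean(misrated_flags)  # average likelihood of being misrated
print(p_bar_n)                     # 0.6
```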

W3: Concerns on k-NN score clusterability and its impact on transition matrix estimation

We would like to clarify that the score distribution in Figure 2 solely represents the statistical characteristics of rating scores and is not influenced by the embeddings of the samples. As a result, the score distribution alone cannot capture k-NN score clusterability, which is inherently linked to the samples’ embeddings. In fact, the k-NN clusterability ensures that k-NN information is meaningful and captures the relationship between embedding vectors and quality rating scores, such as potential score errors or score transition probabilities.

  • Target Example (Score: 1): User: "You need to complete the following task: Calculate 15% of the following number: 100", Assistant: "15% of 100 is 15."
  • KNN Example 1 (Score: 3): User: "Calculate 15% of 500", Assistant: "75"
  • KNN Example 2 (Score: 3): User: "Calculate 50% of 300.", Assistant: "50% of 300 is 150."

For instance, consider an example selected from our data pool where a target sample is rated as 1, while its two nearest neighbors are both rated as 3. Intuitively, the quality of these three samples is similar. Therefore, the target sample would be expected to be rated as 3 instead.

Statistical probability information from k-NN clusters for each sample, i.e., the prior probability constants ($v^{[1]}$, $v^{[2]}_{l}$, $v^{[3]}_{r,s}$) shown on the left-hand side of Eq. (1), is used to construct the consensus vectors. Violations of the k-NN clusterability assumption can affect the accuracy of k-NN information for individual samples, leading to overestimation or underestimation of the transition probabilities as well as the true score probability distribution. We acknowledge that the k-NN clusterability assumption may be violated in practice. However, the consensus vectors rely on the average probabilities across all 2-NN clusters, allowing statistical information from the remaining samples to mitigate corruption caused by a small number of violations. As a result, our method can tolerate a proportion of k-NN violations. Intuitively, prior work [1] has demonstrated that even in image classification tasks (e.g., CIFAR-10, Table 3), where 20% of data samples violate the k-NN clusterability assumption, the method still outperforms other baselines. Empirically, our experimental results support this claim. Furthermore, due to the absence of ground-truth scores, it is infeasible to conduct experiments to explicitly detect such violations.

[1] Clusterability as an Alternative to Anchor Points When Learning with Noisy Labels, ICML 2021.
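To make the consensus construction above more tangible, here is a minimal sketch that only gathers empirical first-, second-, and third-order score statistics over 2-NN clusters, i.e., the kind of counts a linear program like Eq. (1) would consume; the exact consensus-vector definitions and the LP itself follow the paper and [1], not this sketch.

```python
import numpy as np

def gather_2nn_score_statistics(scores, neighbor_idx, num_classes=6):
    """Empirical score statistics over all 2-NN clusters: the marginal score
    distribution, joint (target, neighbor-1) frequencies, and joint
    (target, neighbor-1, neighbor-2) frequencies."""
    scores = np.asarray(scores)
    first = np.bincount(scores, minlength=num_classes) / len(scores)
    second = np.zeros((num_classes, num_classes))
    third = np.zeros((num_classes, num_classes, num_classes))
    for i, (n1, n2) in enumerate(neighbor_idx):
        t, a, b = scores[i], scores[n1], scores[n2]
        second[t, a] += 1
        third[t, a, b] += 1
    return first, second / len(neighbor_idx), third / len(neighbor_idx)

# Toy usage with the cluster from the example above (target rated 1, neighbors rated 3).
scores = [1, 3, 3]
neighbor_idx = [[1, 2], [0, 2], [0, 1]]
v1, v2, v3 = gather_2nn_score_statistics(scores, neighbor_idx)
print(v1)        # marginal score distribution
print(v2[1, 3])  # fraction of clusters with (target=1, first neighbor=3)
```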

Comment

W4: Performance comparison with AlpaGasus

We would like to clarify that, to ensure consistency and manage costs, the raw scores used in this study were generated with our prompt template. However, our prompt template largely follows the format and criteria of AlpaGasus (as the first rating prompt template), maintaining alignment with established standards. A significant improvement in our approach is the use of JSON format to return evaluation scores, allowing us to capture the scores accurately. This JSON formatting approach is inspired by the official LLaMA-3.1 chat template, as detailed in the LLaMA-3.1 model documentation. To address your concern, we conducted experiments comparing our method with AlpaGasus under the same 4-bit quantization and LoRA settings, adhering closely to the same experimental configurations. The AlpaGasus-2-7B-QLoRA model originates from a related repository highlighted in the official AlpaGasus repository, with LLaMA-2-7B as the base model. The rating scores used in our method were generated by GPT-4o-mini, which is much weaker than the GPT-4 used in AlpaGasus. The table below shows that DS2 clearly outperforms AlpaGasus even when using the weaker rating model GPT-4o-mini.

Table. Performance comparison between AlpaGasus and Ours (DS2) using Alpaca dataset.

| Model | MMLU | TruthfulQA | GSM | BBH | TyDiQA | Average |
|---|---|---|---|---|---|---|
| Vanilla LLaMA-2-7B | 41.9 | 28.4 | 6.0 | 38.3 | 35.7 | 30.1 |
| AlpaGasus-2-7B-QLoRA | 37.8 | 39.0 | 3.0 | 36.1 | 35.9 | 30.4 |
| Ours (DS2, 9k Alpaca samples) | 44.1 | 40.2 | 10.5 | 37.2 | 40.6 | 34.5 |

Q1: Concerns about k-NN agreement score

Thank you for raising this thoughtful question. Please note that k-NN clusterability concerns the unknown true ratings, whereas the k-NN agreement score is calculated on the observed "noisy" ratings. Therefore, the observed ratings of samples with similar embeddings can differ, and the k-NN agreement score reflects this difference. The aim of the k-NN agreement score is to capture the information in the LLM ratings, so using embedding feature vectors rather than rated scores would not capture this information.

Q2: Further clarification for ground truth score

Thank you for highlighting this. In the revised manuscript (Line 292), we have further emphasized that the transition matrix and ground truth probability vector are derived from the LP solution.

Q3: Clarification about the calculation of $\overline{p}_n$

Please refer to the response to W2.

Q4, Q5, Q6: Reference for Completion Length baseline and Table 3's presentation

Thank you for your detailed comments. We have updated these in the revised manuscript.

Comment

Firstly, I thank the authors for their extensive reply to my questions.

The question regarding the k-NN agreement score seems to have been misunderstood. Consider two one-hot encoded vectors, $v_1$ and $v_2$. $v_1^T v_2$ is one only if the two vectors are identical, and zero otherwise. Thus, my original question was asking what vectors are actually used to calculate the agreement scores in practice.

the k-NN agreement score is defined as the average cosine similarity LLM score distance among k-NN samples

Can the authors please clarify what this quantity is (given lines 282-285)? Are the vectors one-hot encoded LLM scores?

Comment

Thanks to the authors for going at length to address my comments.

Can the authors please add and highlight this new result in the main paper? As previously stated, this exact experiment (W4's table) is by far the most apples-to-apples comparison to AlpaGasus, and I believe it is one of the key pieces of evidence demonstrating the effectiveness of DS^2.

I have only one remaining request, regarding the assumption of k-NN score clusterability in practice. I completely agree with the authors that this property does not need to hold exactly in practice, but a demonstration that it generally holds is required. Can the authors please condense the argument above and forward-reference it from the main text to an appendix section? I would also argue in favor of a quick two (or at most three) sentences on the practicality of this property in the main text (including the information from the last paragraph of W3 in Response 1/3).

Comment

W3: Concerns on k-NN score clusterability and its impact on transition matrix estimation

Furthermore, we would like to analyze in more detail why the k-NN clusterability definition holds in our setting, using the following examples. Intuitively, even when two examples have very similar embeddings (e.g., Example 1 & Example 2 below), their scores may differ because of correctness, which is the factor that most easily changes the score.

  • Example 1: "please tell me the answer of below math question '1+1=?' answer:2"
  • Example 2: "please tell me the answer of below math question '1+1=?' answer:3"
  • Example 3: "please tell me the answer of below math question '10+10=?' answer:20"
  • Example 4: "Calculate 15% of 500.\n\n answer: 75"
  • Example 5: "Calculate 50% of 300. answer: '50% of 300 is 150.'"

However, in our paper, the scores of samples are influenced not just by correctness but also by broader quality metrics, such as rarity, complexity, and informativeness, as emphasized in our prompt template. Scoring solely for correctness with binary labels (0 or 1) would treat correct and incorrect answers very differently. When assessing overall quality on a more granular scale (e.g., [0–10], later compressed to [0–5]), however, both questions could receive similarly low scores because of their simplicity and low informativeness. Considering overall quality therefore reduces the direct emphasis on correctness while keeping the evaluation framework intact. To verify this, we generated scores for the two questions above (Example 1 & Example 2) under the same setting as in the paper.

Table 1. LLM rating scores comparison between Example 1 and Example 2.

| Metric | LLaMA Example 1 | LLaMA Example 2 | Mistral Example 1 | Mistral Example 2 | Gemma Example 1 | Gemma Example 2 | GPT Example 1 | GPT Example 2 |
|---|---|---|---|---|---|---|---|---|
| Rarity | 2 | 1 | 2 | 1 | 3 | 2 | 2 | 1 |
| Complexity | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Informativeness | 1 | 4 | 5 | 4 | 3 | 3 | 1 | 2 |
| Overall Rating | 3 | 3 | 4 | 3 | 4 | 3 | 2 | 1 |

In practice, the embedding models capture semantic similarities and dissimilarities rather than accumulating similarities at the token level. We generated embedding vectors for several examples using four embedding models, including GPT embeddings. Here, Examples 4 and 5 are taken from the data pool, while Example 3 is a slightly revised version of Example 1. We observe that even small semantic differences yield much larger embedding cosine distances, showing that the embeddings effectively capture the additional semantic information.

Table 2. Cosine similarity distances between any two examples.

| Embedding Model | Example 1 & 2 | Example 2 & 3 | Example 4 & 5 |
|---|---|---|---|
| GPT text-embedding-3-small | 0.0208 | 0.1380 | 0.1415 |
| GPT text-embedding-3-large | 0.0490 | 0.2043 | 0.2464 |
| GPT text-embedding-ada-002 | 0.0050 | 0.0487 | 0.0697 |
| bge-large-en-v1.5 | 0.00772 | 0.04975 | 0.09283 |
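
For concreteness, the distances above can be reproduced from the embedding vectors with a computation like the sketch below. How the embeddings are obtained (a GPT embedding endpoint, bge-large-en-v1.5, etc.) is left abstract, the `embed` helper is hypothetical, and the "distance" is assumed to mean one minus cosine similarity.

```python
# Minimal sketch: cosine distance between two embedding vectors.
# Assumption: `embed` is whatever embedding backend is used and returns a
# 1-D numpy array per input text; the distance convention (1 - cosine
# similarity) is an assumption about how Table 2 is computed.
import numpy as np


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    # 1 - cosine similarity of the two vectors.
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical usage with embeddings e1, e2 of Example 1 and Example 2:
# e1 = embed("please tell me the answer of below math question '1+1=?' answer:2")
# e2 = embed("please tell me the answer of below math question '1+1=?' answer:3")
# print(cosine_distance(e1, e2))   # small value, e.g. ~0.02 for one of the models above
```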

Besides, in our revised manuscript, we include several target samples together with their 2-NN examples from our data pool in Table 10 (page 21) to assess k-NN clusterability; for ease of illustration, overly long samples are filtered out. From these randomly selected examples, we find that the concept holds in our data pool and that such extreme cases are generally unlikely to occur, which significantly mitigates the issue. Moreover, from our observation, the samples in our data pool (Flan_v2, WizardLM, Oasst1, Alpaca, and Dolly) are almost always correct but vary in quality level. Correctness is therefore not our main concern, and the correctness-driven failure case described above has little impact in our paper.

Comment

Condense the argument of k-NN clusterability to the appendix

We sincerely appreciate your support for our argument. Regarding your insightful follow-up suggestions w.r.t. $k$-NN clusterability, in the revised manuscript we have included three sentences, as suggested, to briefly discuss the practicality of $k$-NN clusterability (Lines 249–253). Furthermore, we provide more detailed explanations in Appendix C.3. We fully agree on the importance of demonstrating general evidence for this property. Additionally, based on Reviewer XKFp's recommendation, we have conducted a more systematic analysis to further validate and demonstrate the practicality of $k$-NN clusterability, as outlined below.

**A demonstration of some general holding of this property.** Due to the unavailability of ground-truth scores, we evaluate k-NN clusterability by examining the distribution of average score gaps, which measure the score differences within a k-NN cluster. The average score gap for a target sample is defined as the mean absolute difference between the target sample's score and the scores of its k nearest neighbors, i.e., $\text{average score gap} = \text{Mean}(|\text{target sample's score} - \text{kNN sample's score}|)$. In our work, we focus on 2-NN clusterability and frame our analysis within this context: for each 2-NN cluster, we consider a target sample and its two nearest neighbors. For example, given a 2-NN cluster with the score tuple (target sample: 1, kNN sample 1: 2, kNN sample 2: 3), the score gap is $\text{Average score gap} = \frac{|1 - 2| + |1 - 3|}{2} = 1.5$. Table 1 below summarizes the statistical distribution of score gaps across all 2-NN clusters.

Table 1. Average score gap statistical information of all 2-NN clusters from our data pool.

| Curation | Model | Score Gap 0.0–1.0 (%) | Score Gap 1.5 (%) | Score Gap 2.0 (%) | Score Gap >2.0 (%) |
|---|---|---|---|---|---|
| w/o Curation | GPT | 81.0 | 12.0 | 4.9 | 2.1 |
| w/o Curation | LLaMA | 58.3 | 18.0 | 12.2 | 11.5 |
| w/o Curation | Mistral | 70.2 | 16.5 | 8.1 | 5.4 |
| w/ Curation | GPT | 82.5 | 10.9 | 4.5 | 1.7 |
| w/ Curation | LLaMA | 78.8 | 9.4 | 7.3 | 4.1 |
| w/ Curation | Mistral | 80.5 | 10.8 | 5.6 | 4.3 |

From Table 1, we observe that without score curation, GPT has a higher proportion of samples in the 0.0–1.0 score gap range (81.0%) compared to Mistral (70.2%) and LLaMA (58.3%). This reveals that more powerful rating models, such as GPT, tend to exhibit smaller average score gaps, which aligns more closely with the concept of k-NN clusterability and contributes to improved performance.

Moreover, when comparing the settings with and without score curation, we observe that all three rating models show an increased proportion of samples in the 0.0–1.0 score gap range after score curation. From Table 2, this shift in the score gap distribution correlates strongly with the performance improvements observed on LLM leaderboard tasks (as detailed in Table 3 of the manuscript).

Table 2. The proportion of samples in the 0.0–1.0 score gap range both with and without score curation for each rating model. For comparison, the corresponding average performance on LLM Leaderboard tasks is included in parentheses.

| Rating Model | 0.0–1.0 score gap w/o Curation (Avg. Performance) | 0.0–1.0 score gap w/ Curation (Avg. Performance) |
|---|---|---|
| GPT | 81.0% (60.2) | 82.5% (61.4) |
| LLaMA | 58.3% (59.2) | 78.8% (60.2) |
| Mistral | 70.2% (60.7) | 80.5% (61.1) |

Therefore, these results demonstrate the validity and practicality of the proposed k-NN clusterability hypothesis. For a clearer visualization of score gap proportions before and after score curation, we encourage the reviewer to refer to the newly added Figure 8.
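
A minimal sketch of the average-score-gap computation described above is given below. The neighbor-search backend, the exact bucket boundaries, and the variable names are illustrative assumptions rather than the exact implementation used to produce the tables.

```python
# Minimal sketch: average score gap per 2-NN cluster and its bucketed distribution.
# Assumptions: `embeddings` is an (N, d) array, `scores` an (N,) integer array;
# with k = 2 and integer scores, gaps fall on multiples of 0.5, so the bucket
# boundaries below (mirroring the tables above) can use exact comparisons.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def score_gap_distribution(embeddings: np.ndarray, scores: np.ndarray, k: int = 2):
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)                 # column 0 is the sample itself
    neighbor_scores = scores[idx[:, 1:]]               # shape (N, k)
    gaps = np.abs(neighbor_scores - scores[:, None]).mean(axis=1)  # average gap per sample

    buckets = {
        "0.0-1.0": np.mean(gaps <= 1.0),
        "1.5": np.mean(gaps == 1.5),
        "2.0": np.mean(gaps == 2.0),
        ">2.0": np.mean(gaps > 2.0),
    }
    return {name: 100.0 * frac for name, frac in buckets.items()}  # percentages
```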

Comment

I thank the authors again for the additional experiments, discussion, and clarifications, and for addressing my main concerns; I am raising my score.

Comment

Thank you for taking the time to carefully consider our responses and for your thoughtful suggestions. We are grateful for your positive impression and support for our work. Your encouraging comments mean a lot to us, and we deeply appreciate your recognition of our efforts.

Comment

Dear Reviewer ZhKe,

It is great to hear that most of your concerns have been addressed! Thank you for your thoughtful follow-up suggestions!

The calculation of kNN agreement score

We apologize for our misunderstanding, and we sincerely appreciate your clarification! One of the vectors (e.g., $v_1$) is a one-hot encoded score vector, while the other vector, $v_2$, represents the soft $k$-NN scores of the $n$-th sample, denoted as $\tilde{\boldsymbol{y}}_{n}^{\text{k-NN}}$. This vector is calculated by counting the score agreements among the $k$ nearest neighbors. For better understanding, we provide a simple example here. Suppose the one-hot encoded vector $v_1$ is (1, 0, 0, 0, 0, 0). To construct the soft score vector $v_2$, we count the number of k-NN samples in each score category, with k = 50 in this case. For example, if a k-NN cluster contains 30 samples with a score of 0, 10 samples with a score of 1, and 10 samples with a score of 2, the resulting soft k-NN score vector $v_2$ is (30, 10, 10, 0, 0, 0). Then, $\text{kNN agreement score} = \frac{v_1^T v_2}{\|v_1\|_2 \|v_2\|_2} = \frac{30}{1 \times 33.17} \approx 0.90$.
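
To make the worked example above concrete, here is a minimal sketch of the agreement-score computation; the function and variable names are illustrative, not the paper's implementation.

```python
# Minimal sketch: k-NN agreement score as cosine similarity between a one-hot
# score vector and the soft k-NN score-count vector (worked example above).
import numpy as np


def knn_agreement_score(target_score: int, neighbor_scores: np.ndarray, num_classes: int = 6) -> float:
    v1 = np.zeros(num_classes)
    v1[target_score] = 1.0                                                   # one-hot target score
    v2 = np.bincount(neighbor_scores, minlength=num_classes).astype(float)   # soft k-NN counts
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))


# Worked example from the response: 50 neighbors with 30 zeros, 10 ones,
# 10 twos -> v2 = (30, 10, 10, 0, 0, 0) and a score of ~0.90.
neighbors = np.array([0] * 30 + [1] * 10 + [2] * 10)
print(round(knn_agreement_score(0, neighbors), 2))   # ~0.9
```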

Apples-to-apples performance comparison with AlpaGasus

We sincerely appreciate your suggestion and the idea of conducting this apples-to-apples comparison! We fully agree that this comparison serves as additional strong evidence to demonstrate the superiority of DS^2. Due to the page limit and the substantial space occupied by W4’s table (along with its explanations), we have included a brief statement (two sentences) in the revised manuscript (Lines 425–428) directing readers to Appendix G.6, where a comprehensive analysis and detailed empirical results (including W4’s table) are provided. After the discussion period concludes, and following your suggestion, we will reorganize the main content and move this performance comparison with AlpaGasus to the main body of the paper.

Comment

Dear Reviewers and Area Chairs,

We sincerely appreciate all your time and effort in reviewing our submitted work! We have revised the paper to address the reviewers' concerns. Below we summarize the major revisions (marked in blue text in the revised paper), and we reply to each reviewer's comments separately.

The major revisions are:

  • Limitations: In Appendix A, we discuss the limitations of this work, including the sample-level independence assumption, k-NN clusterability, base model scale, and rating models. (Reviewer 81Qt, gzew)
  • Warm-up binary example illustrating how to derive the score transition matrix: In Appendix C.1, we added a warm-up binary example that illustrates how the samples' scores are used to derive the score transition matrix. (Reviewer ZhKe, 81Qt)
  • Practicality of the k-NN clusterability hypothesis: In Appendix C.3, we provide a detailed k-NN clusterability analysis explaining why this definition holds in our work. In addition, randomly selected examples along with their 2-NN samples are included to demonstrate the validity of k-NN clusterability in our data pool. (Reviewer ZhKe, XKFp, gzew)
  • Comparative analysis of computational complexity: In Appendix H, we report a comparative analysis of the computational complexity and runtime of gradient-based methods (LESS) and LLM rating-based approaches such as AlpaGasus and DEITA. (Reviewer XKFp)
  • The separate impact of diversity scores: In Appendix I, we provide case studies exploring the separate impact of diversity scores in data selection under varying data distributions and complexities. The subsets used to simulate different distributions include Flan_v2, WizardLM, and Alpaca. (Reviewer gzew)

Thank you again for your reviews!

Best,

Authors

AC Meta-Review

The paper introduces a curation method for finetuning data that selects data samples and demonstrates that a model finetuned on a small dataset curated with the method can outperform a model trained on the larger full-scale dataset.

All reviewers find the proposed approach interesting and clever (Reviewer ZhKe finds the use of k-NN clusterability interesting, Reviewers 81Qt and gzew find the idea of detecting error patterns novel, and Reviewer XKFq finds the method original). The paper also provides relatively comprehensive experiments and demonstrates that the method performs well.

The reviewers raised a variety of concerns (such as a required comparison to AlpaGasus and a number of clarity issues), and the authors addressed them sufficiently.

Like the reviewers, I find the idea introduced clever and the result that a relatively small finetuning dataset curated with the proposed method can outperform the full dataset very interesting. Therefore I recommend acceptance.

Additional Comments on Reviewer Discussion

see meta review

Final Decision

Accept (Poster)