PaperHub
Average rating: 5.5 / 10
Poster · 4 reviewers
Individual ratings: 2, 3, 3, 4 (min 2, max 4, std 0.7)
ICML 2025

PRIME: Deep Imbalanced Regression with Proxies

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

This paper introduces PRIME, a novel proxy-based representation learning scheme for imbalanced regression.

Abstract

Keywords
Imbalanced Regression · Representation Learning · Proxy

Reviews and Discussion

Review (Rating: 2)

The paper introduces PRIME, a novel representation learning method for deep imbalanced regression tasks. PRIME leverages synthetic reference points called "proxies" to guide the learning of balanced and well-ordered feature representations, even for minority samples. Unlike previous methods that rely solely on sample relationships within individual batches, PRIME utilizes proxies as global anchors to shape the desired feature distribution. PRIME also enables the seamless application of class imbalance techniques from classification to regression setups, bridging the gap between the two tasks. Proposed experiments demonstrate the effectiveness of PRIME, achieving good performance on various real-world regression benchmarks.

Questions for Authors

I find this paper interesting, and the proposed method is novel. However, the claim of "demonstrating state-of-the-art performance on four real-world regression benchmarks across diverse target domains" needs further justification.

Claims and Evidence

Not all. The authors assert that PRIME achieves SOTA performance, but they appear to ignore current best models that perform better than PRIME.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes.

Experimental Design and Analysis

The paper mentions that the authors followed the "same experimental setup" and "previous state-of-the-art methods" for each dataset, indicating that they have adopted well-established experimental protocols.

However, several SOTA baselines are missing from the paper; I list some of them below.

By reviewing the VIR paper, it is evident that VIR outperforms PRIME in all shot settings on AgeDB-DIR and surpasses all other methods on IMDB-WIKI-DIR. Why did the authors not report this?

Additionally, the paper "IM-Context: In-Context Learning for Imbalanced Regression Tasks" outperforms both VIR and ConR across all metrics, establishing it as the current state-of-the-art (SOTA) method. However, the authors did not mention this approach.

Does this imply that PRIME is unable to outperform VIR, PFN-localized, and GPT2-localized?

Supplementary Material

Yes, all.

Relation to Existing Literature

The PRIME method, as presented in the paper, addresses a significant gap in the existing literature on imbalanced regression by introducing the concept of proxy learning into this domain for the first time. It leverages novel ideas surrounding sample-proxy relationships and integrates these with established imbalanced learning techniques to achieve state-of-the-art performance across various regression benchmarks.

Essential References Not Discussed

By reviewing the VIR paper, it is evident that VIR outperforms PRIME in all shot settings on AgeDB-DIR and surpasses all other methods on IMDB-WIKI-DIR. Why did the authors not report this?

Additionally, the paper "IM-Context: In-Context Learning for Imbalanced Regression Tasks" outperforms both VIR and ConR across all metrics, establishing it as the current state-of-the-art (SOTA) method. However, the authors did not mention this approach.

Does this imply that PRIME is unable to outperform VIR, PFN-localized, and GPT2-localized?

Other Strengths and Weaknesses

The main concern is why the authors assert that PRIME achieves state-of-the-art (SOTA) performance while not reporting the results of prior methods that outperform PRIME. I believe the authors need to address this issue.

Other Comments or Suggestions

None, but the authors need to report the performance figures mentioned above.

Author Response

We appreciate the reviewer’s feedback and the opportunity to clarify our claim regarding state-of-the-art (SOTA) performance. Below, we provide detailed comparisons with VIR and IM-Context, along with additional experiments to ensure a fair evaluation.

1. Comparison with VIR

Our claim of achieving SOTA performance is based on evaluations under a unified experimental protocol: using the same train/val/test splits, backbone (vanilla ResNet-50), and training settings as in prior works such as LDS, RankSim, and ConR. This setup allows for a fair and controlled comparison that isolates the contribution of the proposed PRIME framework.

In contrast, VIR uses a different data split and a more complex model architecture, incorporating a calibration network and reconstruction modules. These differences make direct comparisons of reported performance potentially misleading, as the results reflect not only the learning objective but also auxiliary components and dataset configurations.

In response to the reviewer’s suggestion, we conducted an additional experiment comparing PRIME to VIR under VIR’s official setting. Since the code and data splits for VIR are only publicly available for AgeDB-DIR, we focused our comparison on this dataset. We trained PRIME using the same train/val/test splits as VIR, while keeping all other settings unchanged. We then compared our results against the official pre-trained VIR model provided by the authors. For PRIME, results are averaged over five runs. As shown in Table 1, PRIME consistently outperforms VIR across all evaluation metrics under this setting.

We will revise the manuscript to include this experiment and clarify the scope of our SOTA claim accordingly.

Table 1. Comparison with VIR.

| Method | MAE All | MAE Many | MAE Med. | MAE Few | GM All | GM Many | GM Med. | GM Few |
|---|---|---|---|---|---|---|---|---|
| VIR (author-provided) | 7.12 | 6.69 | 7.73 | 9.59 | 4.56 | 4.26 | 5.11 | 6.29 |
| PRIME | 7.06 | 6.44 | 7.61 | 9.28 | 4.39 | 4.08 | 4.93 | 6.00 |

2. Comparison with IM-Context

We thank the reviewer for bringing IM-Context to our attention.

First, we note that IM-Context is built upon a substantially different model configuration. It employs a pre-trained CLIP image encoder to extract visual features, followed by in-context learning using large-scale models such as GPT-2 and PFN. In contrast, PRIME uses a ResNet-50 backbone trained from scratch, without leveraging any pre-trained models. Therefore, direct comparisons may not be meaningful, as the two approaches differ significantly in both model capacity (i.e., CLIP, GPT-2, PFN vs. ResNet-50) and training paradigm (i.e., leveraging pre-trained models vs. training from scratch).

Second, we emphasize that PRIME is a proxy-based representation learning framework designed to be independent of model architecture. Indeed, our method is broadly applicable and can benefit from stronger backbones. We expect that using more powerful models would further improve performance.

To examine this, we conducted additional experiments using the pre-trained CLIP image encoder (ViT-B/32) as the backbone, which is also used in IM-Context. Specifically, we fine-tuned the CLIP backbone together with a two-layer MLP regression head and trained the model using the PRIME loss. To ensure robustness, we report the average performance of PRIME over five independent runs. We then compared our results on AgeDB-DIR and IMDB-WIKI-DIR with those reported by IM-Context.
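For readers unfamiliar with this kind of setup, the sketch below shows one way the CLIP-based variant described above could be wired together, assuming the OpenAI `clip` package. The hidden width of the MLP head, the float cast, and the class name `CLIPRegressor` are our assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP

class CLIPRegressor(nn.Module):
    """Sketch: pre-trained ViT-B/32 image encoder fine-tuned jointly with a
    two-layer MLP regression head, as described in the rebuttal above."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.backbone, _ = clip.load("ViT-B/32", device="cpu")
        self.head = nn.Sequential(
            nn.Linear(512, hidden_dim),  # ViT-B/32 image embedding dim is 512
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, images):
        feat = self.backbone.encode_image(images).float()   # (B, 512) features
        return feat, self.head(feat).squeeze(-1)            # features + scalar prediction
```

In such a setup, the returned features would feed the proxy and alignment losses while the scalar prediction feeds the regression loss.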

As shown in Tables 2 and 3, PRIME substantially outperforms both PFN-localized and GPT2-localized across all evaluation metrics. These findings confirm that PRIME remains effective even on top of strong pre-trained models, demonstrating its flexibility across backbone choices. We believe these additional experiments further support our SOTA claim under a unified and fair evaluation protocol.

We will revise the manuscript to include these results and reflect the comparison with IM-Context.

Table 2. Results for AgeDB-DIR.

| Method | MAE All | MAE Many | MAE Med. | MAE Few | GM All | GM Many | GM Med. | GM Few |
|---|---|---|---|---|---|---|---|---|
| PFN-localized | 6.58 | 5.61 | 8.49 | 10.49 | 4.29 | 3.58 | 6.30 | 8.19 |
| GPT2-localized | 6.05 | 5.67 | 6.71 | 7.83 | 3.79 | 3.59 | 4.17 | 4.90 |
| PRIME | 5.47 | 5.46 | 5.48 | 5.57 | 3.48 | 3.45 | 3.64 | 3.35 |

Table 3. Results for IMDB-WIKI-DIR.

| Method | MAE All | MAE Many | MAE Med. | MAE Few | GM All | GM Many | GM Med. | GM Few |
|---|---|---|---|---|---|---|---|---|
| PFN-localized | 8.96 | 8.71 | 10.79 | 16.33 | 5.26 | 5.17 | 6.00 | 9.42 |
| GPT2-localized | 7.76 | 7.35 | 11.15 | 17.71 | 4.29 | 4.13 | 5.96 | 11.00 |
| PRIME | 6.42 | 5.98 | 9.92 | 16.28 | 3.49 | 3.33 | 5.17 | 9.41 |
Reviewer Comment

Thank you for your rebuttal.

I believe there may have been a misunderstanding regarding the VIR paper. It uses the same datasets as other related works—for example, AgeDB-DIR, IMDB-DIR, NYUD2-DIR, and STS-B-DIR.

Therefore, the authors' statement that VIR uses a different data split is not accurate.

Additionally, the model architecture in their paper follows the same structure as DIR; the only difference is the incorporation of uncertainty modeling. From my perspective, this should not be a reason to exclude their results or to include only the AgeDB-DIR comparison. The results can likely be reproduced with minimal effort.

That said, given the rebuttal and the current results provided by the authors, I am willing to either raise or keep my score; I will submit my reasoning to the AC. In addition, I strongly encourage the authors to include full comparisons in the camera-ready version.

Author Comment

Thank you for your comments.

We would like to clarify that the data splits used in VIR are different from those used in DIR, RankSim, ConR, and our work, despite all methods using the same underlying datasets.

This difference is clearly evident by comparing the agedb.csv files (which define the train/val/test splits) provided in the official repositories of VIR and other related works. For example, the first sample in the split provided by VIR, 715_RonaldReagan_53_m.jpg, is included in the training set, whereas the exact same sample is assigned to the validation set in the splits used by DIR, RankSim, ConR, and our work.
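One quick way to verify such a split discrepancy is to diff the two split files directly. A hedged sketch follows; the local file paths and the `path`/`split` column names are assumptions about the repositories' CSV layout, not details confirmed here.

```python
import pandas as pd

# Load the split files from the two repositories (paths are placeholders).
vir = pd.read_csv("VIR/agedb.csv")
dir_ = pd.read_csv("imbalanced-regression/agedb.csv")  # split used by DIR/RankSim/ConR

# Join on image path and count images assigned to different splits.
merged = vir.merge(dir_, on="path", suffixes=("_vir", "_dir"))
mismatch = merged[merged["split_vir"] != merged["split_dir"]]
print(f"{len(mismatch)} of {len(merged)} images have different split assignments")
print(mismatch.head())  # e.g., 715_RonaldReagan_53_m.jpg: train in VIR vs. val in DIR
```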

We invite the reviewer to directly compare the official repositories of VIR and the other related works to confirm this discrepancy.

We emphasize that such discrepancies in data splits can significantly affect reported performance, and thus using consistent splits is essential for fair and meaningful comparison.

Regarding the model architecture, while VIR adopts the same backbone as prior DIR works, its use of uncertainty modeling introduces additional implementation complexity. That said, we agree that this difference alone should not preclude a full comparison.

For more complete comparisons, we conducted additional experiments with VIR using its official implementation under our setup, which is shared across DIR, RankSim, and ConR, for the AgeDB-DIR and IMDB-WIKI-DIR datasets. Tables 4 and 5 below summarize the results. For AgeDB-DIR, we report the performance of PRIME with $C=40$ (as discussed in our rebuttal to Reviewer h5o4). Across both datasets, PRIME consistently outperforms VIR.

Building on both the previous comparison in Table 1 of our rebuttal (evaluating PRIME under VIR’s setup) and the additional experiments presented here (evaluating VIR under our setup), we believe the results consistently demonstrate the superior performance of PRIME over VIR. We will also include results on the remaining datasets in the camera-ready version to ensure a thorough and comprehensive comparison.

We hope that our clarifications and the additional results have addressed your concerns. We would greatly appreciate it if you considered updating your score to an accept.

Table 4. Comparison with VIR on AgeDB-DIR under our setup.

| Method | MAE All | MAE Many | MAE Med. | MAE Few | GM All | GM Many | GM Med. | GM Few |
|---|---|---|---|---|---|---|---|---|
| VIR | 7.31 | 6.69 | 8.29 | 10.44 | 4.59 | 4.12 | 5.53 | 7.52 |
| PRIME | 7.03 | 6.35 | 8.24 | 9.90 | 4.35 | 4.00 | 5.29 | 6.09 |

Table 5. Comparison with VIR on IMDB-WIKI-DIR under our setup.

| Method | MAE All | MAE Many | MAE Med. | MAE Few | GM All | GM Many | GM Med. | GM Few |
|---|---|---|---|---|---|---|---|---|
| VIR | 7.51 | 6.86 | 12.89 | 23.31 | 4.17 | 3.88 | 7.75 | 16.90 |
| PRIME | 7.36 | 6.73 | 12.48 | 23.01 | 3.98 | 3.73 | 7.17 | 14.38 |
Review (Rating: 3)

This paper presents PRIME, a method for handling regression tasks with imbalanced data distributions. PRIME introduces synthetic proxies as reference points that uniformly represent the continuous target space, aiming to mitigate representation collapse toward majority-target regions. The method uses two main loss components: a proxy loss, which positions these reference proxies in the feature space according to their relative positions in the target space, and an alignment loss, which encourages features of individual samples to move closer to appropriate proxies based on target similarities. PRIME treats each proxy like a class prototype, allowing it to adapt imbalanced classification methods. PRIME is empirically evaluated on four benchmark datasets.

Questions for Authors

Please see weaknesses

[1] The theoretical analysis assumes that proxies are well-positioned (i.e., $\mathcal{L}_{\text{proxy}} = 0$). How does PRIME perform in scenarios where this assumption does not hold? Is there empirical evidence to support the robustness of PRIME under such conditions?

Claims and Evidence

The paper presents experimental evaluations using four benchmark datasets: AgeDB-DIR, IMDB-WIKI-DIR, NYUD2-DIR, and STS-B-DIR. However, several important baselines or combinations of existing methods appear to be missing or selectively reported. Specifically, previous methods are typically evaluated by combining multiple techniques (e.g., LDS + FDS + RankSim), and other baselines are missing, such as VIR (Variational Imbalanced Regression: Fair Uncertainty Quantification via Probabilistic Smoothing).

Methods and Evaluation Criteria

Yes

Theoretical Claims

The paper provides theoretical generalization bounds intended to justify the effectiveness of PRIME, in Theorem 4.1. However, the theoretical analysis assumes that proxies are well-positioned, which may not always hold in practice.

Experimental Design and Analysis

The paper follows standard benchmarks for evaluating imbalanced regression methods, specifically those introduced by Yang et al. (2021).

Supplementary Material

Yes, all.

Relation to Existing Literature

NA

Essential References Not Discussed

No

Other Strengths and Weaknesses

Strengths:

  • The paper is clearly written, logically structured, and easy to follow.
  • It includes extensive experimental analyses, with a thorough exploration of the approach through ablation studies and evaluations across multiple benchmarks.
  • Introducing synthetic proxies to guide feature learning and achieve a balanced representation in regression tasks is innovative and conceptually clear.

Weaknesses:

  • Although the paper provides a sensitivity analysis, PRIME introduces several hyperparameters (e.g., $\lambda_p$, $\lambda_a$, $\tau_f$, $\tau_t$, $\alpha$). The detailed impact of these parameters is not thoroughly explored, raising concerns about the robustness and practicality of tuning the method for different datasets.
  • The initialization of proxies plays a central role in PRIME’s success, yet the paper does not adequately investigate how different initialization strategies affect performance. Without this, the claimed advantages might be overly dependent on optimal initial placements of proxies.
  • The theoretical analysis relies heavily on the assumption that proxies are optimally positioned, an assumption that may not hold in practical scenarios. The robustness of PRIME under deviations from this ideal condition is not discussed.
  • Leveraging concepts from classification for regression problems, such as the use of class prototypes or proxies, is not entirely novel. The paper should clarify and distinguish more explicitly how PRIME significantly differs from prior approaches that adopt classification techniques in regression contexts.
  • The related work section is missing references.

Other Comments or Suggestions

NA

Author Response

We appreciate the reviewer’s comments. We have addressed all points and will revise the manuscript accordingly.

1. Impact of hyperparameters

We have already provided detailed analyses of the impact of each hyperparameter ($\lambda_p$, $\lambda_a$, $\tau_f$, $\tau_t$, and $\alpha$) in Tables 16–20 in the appendix. The results show that PRIME consistently outperforms the w/o PRIME baseline regardless of the choice of hyperparameters, demonstrating strong robustness and practical stability. We will revise the manuscript to make this analysis more prominent.

2. Clarification on proxy initialization

We believe there may be a misunderstanding. PRIME does not rely on any optimal or specific initial placements of proxies. As stated on the right side of line 115 in the manuscript, all proxies are randomly initialized and jointly learned during training. Moreover, the effect of initialization randomness is already reflected in the repeated experiments with different random seeds, as reported in our main tables. The consistent performance across these runs confirms that PRIME is robust to proxy initialization. We will revise the manuscript to make this clearer and avoid potential confusion.

3. Theoretical analysis under non-optimal proxy positioning

We note that the derivation of the generalization error bound in Theorem 4.1 remains valid regardless of whether the proxies are optimal. When proxies are not optimally positioned, the resulting discrepancy can be incorporated as an additional term in the bound, rather than invalidating the analysis. To support this, we provide a sketch of how the generalization bound can be extended to the non-optimal case.

Let $\tilde{\mathbf{z}}_j^p$ for $j=1,\ldots,C$ denote the optimal proxy features, and let $\tilde{p} := \tilde{p}_{\theta}(\xi \mid \mathbf{x})$ be the corresponding feature association distribution. The empirical (i.e., non-optimal) proxies can be written as $\mathbf{z}_j^p := \tilde{\mathbf{z}}_j^p + \epsilon_j$, where $\epsilon_j$ represents the estimation error, and $p := p_{\theta}(\xi \mid \mathbf{x})$ denotes the corresponding feature association.

We revisit the balanced alignment risk term in Eq. (21) of the appendix, which was originally formulated based on optimally positioned proxies. To analyze the non-optimal case, we substitute $\log \tilde{p}$ using the identity $\log \tilde{p} = \log p + [\log \tilde{p} - \log p]$.

Then, the first term, $\log p$, follows the same derivation as in Theorem 4.1. The second term, $\log \tilde{p} - \log p$, captures the discrepancy introduced by the deviation between the empirical and optimal proxies. Since $p$ and $\tilde{p}$ are defined via softmax, we can bound this residual term using the inequality $\frac{\sum_i a_i}{\sum_i b_i} \leq \max_i \frac{a_i}{b_i}$ for positive values $a_i$, $b_i$, which leads to:
$$|\log \tilde{p} - \log p| \leq 2\tau_f \max_j \big| d_f(\mathbf{z}, \tilde{\mathbf{z}}_j^p + \epsilon_j) - d_f(\mathbf{z}, \tilde{\mathbf{z}}_j^p) \big|.$$

Finally, applying the inequality above to Eq. (21) yields an additional bounded term, which can be directly incorporated into the generalization error bound derived in Theorem 4.1. Importantly, as training progresses and the proxies become more accurate (i.e., $\epsilon_j$ becomes smaller), the residual decreases accordingly, leading to a tighter bound. Our empirical results also show that PRIME performs robustly with random initialization of proxies. We will include this extended derivation in the revised manuscript with a full proof.

4. Distinction from classification-based methods

To the best of our knowledge, PRIME is the first to introduce proxies specifically designed for imbalanced regression. Moreover, as discussed in Section C.4 of the appendix, PRIME differs substantially from prior classification-based regression methods in its formulation and learning objective.

First, existing approaches typically quantize continuous targets into discrete bins and treat each bin as a class, which inevitably introduces quantization error. In contrast, PRIME assigns proxies based on soft associations (Eq. (6)) derived from pairwise target distances, effectively mitigating such errors.

Second, instead of predicting proxy indices (i.e., classes), PRIME learns to minimize the distance between features and their associated proxies in the embedding space. This objective aligns more naturally with the continuous nature of regression and promotes representations that reflect target similarities.

Furthermore, we also empirically compare PRIME with the most recent classification-based method, Hierarchical Classification Adjustment (HCA) [CVPR’24], and observe consistently superior performance, as shown in Tables 1–3 of the manuscript. We will revise the main text to better highlight these distinctions and clarify the novelty of our method.

5. Related work

As the comment does not specify which references are missing, we would appreciate any suggestions the reviewer might have.

Reviewer Comment

Thank you for your detailed and thoughtful rebuttal. That said, I still have a few remaining concerns regarding PRIME.

Regarding proxy initialization, I was referring to the number of supports C, as the results on the few-shot regions are sensitive to this parameter. As seen in Table 15, with C = 10 and C = 40, the MAE in the Few category (11.47 and 11.20 respectively) is higher than not using PRIME at all (10.71). Sorry for not making this clearer in my initial review.

While you use soft associations, considering proxies (as you refer to them as classes) is still a form of discretization. It may differ technically from binning, but semantically, it serves the same purpose.

As noted in the Claims and Evidence section of my review, several important baselines and combinations of existing methods appear to be missing from your evaluation. Methods like VIR (Wang & Wang, 2023, Variational Imbalanced Regression, NeurIPS) and IM-Context (Nejjar et al., 2024, In-Context Learning for Imbalanced Regression Tasks, TMLR) are relevant and should be discussed.

Regarding the experimental results, I noticed that your comparisons do not reflect the full landscape of prior techniques. Many recent works evaluate using combinations of methods (e.g., LDS + FDS + RankSim + ConR), which are notably absent from your main tables. Including such combinations is important to fairly assess the performance of PRIME. For example:

AgeDB-DIR (MAE, lower is better):

| Method | All | Many | Median | Few |
|---|---|---|---|---|
| LDS+FDS+RankSim+ConR | 6.81 | 6.32 | 7.45 | 9.21 |
| VIR | 6.99 | 6.39 | 7.47 | 9.51 |
| PRIME | 7.09 | 6.38 | 8.39 | 10.13 |
| PRIME + PRW | 7.06 | 6.67 | 7.27 | 9.91 |
| PRIME + CB | 7.12 | 6.61 | 8.07 | 9.29 |
| PRIME + LDAM | 7.24 | 6.85 | 7.84 | 9.29 |

IMDB-WIKI-DIR (MAE, lower is better):

| Method | All | Many | Median | Few |
|---|---|---|---|---|
| FDS + ConR | 7.29 | 6.90 | 12.01 | 21.72 |
| VIR | 7.19 | 6.56 | 11.81 | 20.96 |
| PRIME | 7.36 | 6.73 | 12.48 | 23.01 |
| PRIME + PRW | 7.37 | 6.74 | 12.04 | 22.34 |
| PRIME + CB | 7.48 | 6.90 | 12.05 | 22.71 |
| PRIME + LDAM | 7.49 | 6.91 | 12.23 | 22.32 |

These comparisons suggest that, as currently presented, PRIME underperforms relative to combinations of existing methods in several important cases. Including these baselines—or combining PRIME with previous techniques like VIR or ConR—could help show its complementary value and improve the overall narrative. Without this, the current results may give a misleading impression of PRIME's relative performance.

If you show that combining PRIME with other methods leads to improvements, I'd be open to reconsidering my recommendation. As it stands, however, the evaluation does not provide a fully fair or comprehensive picture.

Author Comment

Thank you for your comments. We have addressed all points and will revise the manuscript accordingly. We hope this resolves any remaining concerns and would appreciate your reconsideration of the overall recommendation.

1. Number of proxies

For simplicity, we fixed the number of proxies $C$ and did not extensively tune the hyperparameters. Still, PRIME consistently outperforms the baseline in overall performance (i.e., All) across different values of $C$ (Table 15 in the appendix), suggesting that it is relatively robust to the choice of $C$.

Moreover, appropriate tuning could further improve its performance. As shown in Table 3 of our rebuttal to Reviewer h5o4, tuning for $C=40$ led to improved results, particularly in the Few category (9.90).

2. Beyond binning

While PRIME introduces discrete proxies, its purpose fundamentally differs from binning. Traditional binning discretizes the label space to produce classification targets, whereas PRIME uses soft associations with multiple proxies to guide representation learning in regression. As other reviewers acknowledged the use of proxies in imbalanced regression as an interesting direction, we believe PRIME makes a meaningful contribution to representation learning for DIR.

3. Comparison with VIR

We note that VIR employs different train/validation/test splits. As such, direct comparisons of the reported performance may not be meaningful.

For a fair comparison, we evaluated PRIME under VIR’s setting. Since VIR’s code and data splits are available only for AgeDB-DIR, we limit our comparison to this dataset. As the official code does not reproduce the reported performance, we compare against the pre-trained VIR model provided by the authors. As shown in Table 1 below, PRIME consistently outperforms VIR across all evaluation metrics.

In addition, we further evaluated VIR under our setup, which is shared across DIR, RankSim, and ConR, on both AgeDB-DIR and IMDB-WIKI-DIR. PRIME again demonstrates consistently better performance, as shown in Tables 4 and 5 in our rebuttal to Reviewer SddV.

Table 1. Comparison with VIR on AgeDB-DIR.

| Method | MAE All | MAE Many | MAE Med. | MAE Few | GM All | GM Many | GM Med. | GM Few |
|---|---|---|---|---|---|---|---|---|
| VIR (author-provided) | 7.12 | 6.69 | 7.73 | 9.59 | 4.56 | 4.26 | 5.11 | 6.29 |
| PRIME | 7.06 | 6.44 | 7.61 | 9.28 | 4.39 | 4.08 | 4.93 | 6.00 |

4. Comparison with IM-Context

IM-Context builds on large-scale pre-trained models. Specifically, it uses a pre-trained CLIP image encoder to extract features, followed by in-context learning with large models such as GPT-2 and PFN.

Importantly, PRIME is a proxy-based representation learning scheme designed to be independent of the model architecture. It is broadly applicable and can benefit from stronger backbones.

We conducted additional experiments using the same pre-trained CLIP encoder (ViT-B/32) employed by IM-Context. We compare our results on AgeDB-DIR and IMDB-WIKI-DIR with those reported by IM-Context. As shown in Tables 2 and 3 below, PRIME substantially outperforms both PFN-localized and GPT2-localized models across all evaluation metrics, demonstrating its effectiveness under the same backbone.

Note: Further details regarding VIR and IM-Context are provided in our rebuttal to Reviewer SddV.

5. Combining PRIME with other methods

As noted (lines 210–213), PRIME can be easily integrated into existing methods. In particular, PRIME focuses on aligning samples with proxies, making it complementary to recent approaches that leverage sample-wise feature relationships, such as FDS, RankSim, and ConR.

As suggested, we added PRIME to the best-performing combinations: LDS+FDS+RankSim+ConR (AgeDB-DIR) and FDS+ConR (IMDB-WIKI-DIR). As shown in Tables 2 and 3, combining PRIME with existing techniques consistently improves performance, highlighting its complementary role and broad applicability.

Table 2. Results for AgeDB-DIR.

| Method | MAE All | MAE Many | MAE Med. | MAE Few | GM All | GM Many | GM Med. | GM Few |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 | | | | | | | | |
| LDS+FDS+RankSim+ConR | 6.81 | 6.32 | 7.45 | 9.21 | 4.39 | 3.81 | 5.01 | 6.02 |
| LDS+FDS+RankSim+ConR+PRIME | 6.76 | 6.29 | 7.37 | 9.11 | 4.24 | 3.80 | 4.90 | 5.98 |
| CLIP | | | | | | | | |
| PFN-localized | 6.58 | 5.61 | 8.49 | 10.49 | 4.29 | 3.58 | 6.30 | 8.19 |
| GPT2-localized | 6.05 | 5.67 | 6.71 | 7.83 | 3.79 | 3.59 | 4.17 | 4.90 |
| PRIME | 5.47 | 5.46 | 5.48 | 5.57 | 3.48 | 3.45 | 3.64 | 3.35 |

Table 3. Results for IMDB-WIKI-DIR.

| Method | MAE All | MAE Many | MAE Med. | MAE Few | GM All | GM Many | GM Med. | GM Few |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 | | | | | | | | |
| FDS+ConR | 7.29 | 6.90 | 12.01 | 21.72 | 4.02 | 3.83 | 6.71 | 12.59 |
| FDS+ConR+PRIME | 7.25 | 6.85 | 11.45 | 21.22 | 3.99 | 3.78 | 6.56 | 12.41 |
| CLIP | | | | | | | | |
| PFN-localized | 8.96 | 8.71 | 10.79 | 16.33 | 5.26 | 5.17 | 6.00 | 9.42 |
| GPT2-localized | 7.76 | 7.35 | 11.15 | 17.71 | 4.29 | 4.13 | 5.96 | 11.00 |
| PRIME | 6.42 | 5.98 | 9.92 | 16.28 | 3.49 | 3.33 | 5.17 | 9.41 |
Review (Rating: 3)

The authors propose a novel method for imbalanced regression, leveraging learnable proxies as global reference points to achieve a balanced and well-structured feature distribution and aligning sample features with these proxies. Extensive experiments are conducted on multiple benchmarks, yielding strong and impressive results.

Questions for Authors

N/A

Claims and Evidence

Yes, I believe the claims are supported by clear evidence.

Methods and Evaluation Criteria

Yes, the method is evaluated on standard benchmarks alongside other representative baselines.

Theoretical Claims

I reviewed the theoretical claims and did not find any apparent issues. However, I am not fully certain about their rigorous correctness in the supplementary material, and a more thorough verification may be necessary.

Experimental Design and Analysis

Yes, I find the current experimental designs and ablation studies to be valid and sound.

Supplementary Material

I reviewed the supplementary material, except for the mathematical derivations.

Relation to Existing Literature

The authors have done a good job in covering the related work.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths: The manuscript is in a clear and easy-to-follow presentation flow. The methodology and experimentation are solid.

Weakness: the overall idea is interesting and the demonstration is solid, but I have concerns about the proxy features. It would enhance the contribution if the authors discussed the choice of proxy features in more detail. The proposed method uses a learnable proxy feature bank, but what if the method used non-learnable proxy features? For example, one could discretize the data points into bins as in LDS [1], and then give each bin its own proxy feature (e.g., the centroid of all features within the bin). Do the authors have any insights? How would this type of proxy compare to learnable proxies? It would be even better if the authors could do an ablation study.

I am open to raising my score based on the authors' response.

[1] Yang et al., Delving into Deep Imbalanced Regression, ICML 2021

Other Comments or Suggestions

I have one curious question about the Figure 2(b): why is there a twisted line pattern in the top right?

Author Response

We thank the reviewer for the constructive and insightful comments. We have addressed all points and will revise the manuscript accordingly.

1. Ablation on non-learnable proxy

As noted by the reviewer, PRIME is flexible with respect to how proxies are constructed—proxies can be either learnable or non-learnable. Following the suggestion, we conducted an additional ablation experiment on a non-learnable variant of PRIME, where proxy features are updated as the centroids of sample features assigned to each proxy. Specifically, to ensure the proxy loss and alignment loss can be properly backpropagated, we update the proxy features within each mini-batch based on the current sample-to-proxy assignments.
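For clarity, the sketch below shows one way such a centroid-based (non-learnable) proxy update could be implemented per mini-batch. Falling back to the previous proxy value for bins that receive no samples is our assumption for illustration, not a detail stated in the rebuttal.

```python
import torch

def centroid_proxies(feat, assoc, prev_proxies, eps=1e-6):
    """Recompute proxies as soft centroids of the sample features associated
    with them within the current mini-batch, so the proxy and alignment losses
    can still backpropagate into the encoder.

    feat:         (B, D) sample features
    assoc:        (B, C) sample-to-proxy association weights (rows sum to 1)
    prev_proxies: (C, D) proxies from the previous step (used for empty bins)
    """
    weights = assoc.sum(dim=0)                          # (C,) total weight per proxy
    centroids = assoc.t() @ feat / (weights[:, None] + eps)
    empty = weights < eps                               # proxies with no assigned samples
    return torch.where(empty[:, None], prev_proxies, centroids)
```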

Table 1 compares the performance of the centroid-based proxy update with our PRIME using learnable proxies on the AgeDB-DIR dataset. Results are averaged over five runs. While the centroid-based method achieves slightly better performance than the learnable proxy in the Median category, it suffers from notable performance degradation in the other regions. In particular, we observe a significant performance drop in the Few category, indicating that the centroid-based proxies struggle under severe data sparsity.

We attribute this performance gap to the inherent limitations of the centroid-based method. Since proxies are updated as the centroids of the assigned sample features, their quality heavily depends on the number of assigned samples. When only a few samples are available—as is often the case in the Few category—the estimated centroids become unstable and unreliable. Moreover, because centroids are computed within each mini-batch, their estimates can fluctuate significantly depending on the batch composition. In contrast, learnable proxies are global parameters that are updated via backpropagation, offering greater stability and robustness, particularly in data-sparse regions.

This issue is further exacerbated in regression settings, where some proxy bins may contain no samples at all. In such cases, the centroid-based approach cannot update the corresponding proxies, leaving them inactive throughout training. In contrast, learnable proxies remain effective even without sample assignments, serving as global reference points that guide other samples and support representation learning.

We will add this ablation study and discussion to the revised manuscript.

Table 1. Results with non-learnable proxies.

| Method | MAE All | MAE Many | MAE Median | MAE Few | GM All | GM Many | GM Median | GM Few |
|---|---|---|---|---|---|---|---|---|
| Non-learnable (centroid) | 7.21 ± 0.09 | 6.57 ± 0.10 | 8.20 ± 0.13 | 10.89 ± 0.33 | 4.67 ± 0.11 | 4.24 ± 0.12 | 5.42 ± 0.15 | 7.64 ± 0.22 |
| PRIME | 7.09 ± 0.08 | 6.38 ± 0.11 | 8.39 ± 0.26 | 10.13 ± 0.36 | 4.39 ± 0.08 | 3.91 ± 0.10 | 5.58 ± 0.22 | 6.57 ± 0.49 |

2. Clarification on Figure 2(b)

The twisted line pattern in Figure 2(b) appears due to suboptimal alignment between features and their corresponding proxies in the Few category. Although the proxies represent a balanced feature distribution, the alignment process in Eq. (7) still faces challenges under sample imbalance. Minority samples, which occur infrequently, often fail to align properly with their proxies, resulting in distorted feature–proxy alignment.

The use of class imbalance techniques (e.g., PRW, CB, and LDAM) provides better alignment focus on minority samples, mitigating this issue. To empirically validate their effect, we conducted an additional analysis on the AgeDB-DIR dataset, measuring the Spearman correlation between the proxy–feature similarity matrix (as visualized in Figure 2(b)) and the label similarity matrix. A higher correlation indicates better alignment and reduced distortion in the learned feature space.

Table 2 reports the Spearman correlation values when PRIME is combined with various class imbalance techniques. Results are averaged over five runs. Incorporating class imbalance techniques significantly improves the correlation, confirming their effectiveness in facilitating better alignment, particularly for samples in the Few category.

We will revise the manuscript to include this additional analysis and the corresponding figures alongside Figure 2(b).

Table 2. Alignment between proxies and features.

| Method | Spearman $\rho$ ($\uparrow$) |
|---|---|
| PRIME | 0.722 ± 0.020 |
| PRIME + PRW | 0.802 ± 0.021 |
| PRIME + CB | 0.800 ± 0.023 |
| PRIME + LDAM | 0.837 ± 0.015 |
Review (Rating: 4)

For Deep Imbalanced Regression (DIR), the authors propose Proxy-based Representation learning for IMbalanced rEgression (PRIME).

They generate synthetic proxies in the feature space and align instances to the proxies. The proxies are distributed uniformly across the target values. While the corresponding target values are fixed, the proxy features themselves are learned as model parameters. Inspired by t-SNE, they define $p_{i,j}$ and $q_{i,j}$ to be similarities between $y_i$ and $y_j$ (target space) and between $z_i$ and $z_j$ (feature space), and they seek to minimize the KL divergence between the two distributions. To reduce trivial solutions, they encourage distance between proxies and "feature space uniformity" (Wang & Isola 2020). The proxy loss ($\mathcal{L}_{\text{proxy}}$) is the KL divergence plus this regularization.

For each instance, they calculate the association to each proxy in the feature space via distance and softmax. Similarly, they calculate the association in the target space. The alignment loss ($\mathcal{L}_{\text{align}}$) is the cross entropy between the two associations (similar to a classification loss). The overall loss is the regression loss plus $\mathcal{L}_{\text{proxy}}$ and $\mathcal{L}_{\text{align}}$.
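To make this summary concrete, here is a minimal PyTorch sketch of the two loss terms as described in this review. The specific similarity kernels, temperatures, and the uniformity regularizer below are our assumptions for illustration; they do not reproduce the paper's Eq. (4)–(7).

```python
import torch
import torch.nn.functional as F

def prime_losses(feat, targets, proxies, proxy_targets,
                 tau_f=1.0, tau_t=1.0, alpha=0.01):
    """Sketch of the proxy and alignment losses summarized above.

    feat:          (B, D) sample features from the backbone
    targets:       (B,)   continuous regression targets
    proxies:       (C, D) learnable proxy features (an nn.Parameter)
    proxy_targets: (C,)   fixed target values placed uniformly over the label range
    """
    # Proxy loss: match pairwise similarities in target space vs. feature space (KL).
    dt = torch.cdist(proxy_targets[:, None], proxy_targets[:, None])  # (C, C) target distances
    df = torch.cdist(proxies, proxies)                                # (C, C) feature distances
    p = F.softmax(-dt / tau_t, dim=1)          # target-space similarity distribution
    q = F.log_softmax(-df / tau_f, dim=1)      # feature-space log-distribution
    kl = F.kl_div(q, p, reduction="batchmean")
    # Repulsion / uniformity regularizer: push normalized proxies apart.
    zn = F.normalize(proxies, dim=1)
    uniform_reg = torch.pdist(zn).pow(2).mul(-2).exp().mean().log()
    proxy_loss = kl + alpha * uniform_reg

    # Alignment loss: cross entropy between target- and feature-space associations.
    a_t = F.softmax(-torch.cdist(targets[:, None], proxy_targets[:, None]) / tau_t, dim=1)
    a_f = F.log_softmax(-torch.cdist(feat, proxies) / tau_f, dim=1)
    align_loss = -(a_t * a_f).sum(dim=1).mean()

    return proxy_loss, align_loss
```

The overall objective would then be something like `loss = F.mse_loss(pred, targets) + lambda_p * proxy_loss + lambda_a * align_loss`, with the weights corresponding to the paper's $\lambda_p$ and $\lambda_a$.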

For evaluation they use 4 datasets and compare with 8 existing techniques. Empirical results indicate adding PRIME is somewhat beneficial. Ablation studies were performed to indicate the contribution of some of the components.

Update after rebuttal

After reading and responding to the authors' rebuttal, I decided to raise my rating to Accept -- between Weak Accept and Accept to be more precise.

Questions for Authors

  1. To be a self-contained paper, can you explain why the second term in Eq 4 can achieve "feature space uniformity" (Wang & Isola 2020)?

  2. How should the number of proxies be determined? Regarding Table 15 in the appendix: is there any insight into why C=20 seems to perform better? I would have expected more proxies to generally improve the learned feature space.

Claims and Evidence

For evaluation they use 4 datasets and compare with 8 existing techniques. Empirical results indicate that adding PRIME is somewhat beneficial. PRIME generally outperforms existing methods, and PRIME alone outperforms them in 3 out of 4 datasets in the Few category.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are reasonable.

Theoretical Claims

I am not familiar with some of the terms in Theorem 4.1 and hence did not check the proof in the Appendix.

Experimental Design and Analysis

Tables of results and visualizations are helpful. In the tables, since the existing methods do not have the benefit of being combined with additional techniques, PRIME alone should also be compared against the existing methods, perhaps with a different highlight.

Supplementary Material

I quickly reviewed Further Experiments and Analyses (Part C) of the supplementary materials.

Relation to Existing Literature

The proposed method is different from existing methods on representation learning in imbalanced regression. While the different components are borrowed from tSNE, (Wang & Isola 2020), and classification loss, the combination seems interesting.

Essential References Not Discussed

I am not aware of essential references that are not discussed.

Other Strengths and Weaknesses

While the different components are borrowed, the combination is interesting.

Other Comments or Suggestions

More explanation is needed on why the second term in Eq. 4 can achieve "feature space uniformity" (Wang & Isola 2020).

Ethics Review

n/a

Author Response

We thank the reviewer for the constructive feedback. We have addressed all points and will revise the manuscript accordingly.

1. PRIME’s effectiveness in the Few category

We would like to point out that PRIME alone achieves state-of-the-art performance in the Few category on three out of four datasets, outperforming existing methods by 0.7%, 0.5%, and 3.6% on AgeDB-DIR, NYUD2-DIR, and STS-B-DIR, respectively. Furthermore, a key contribution of PRIME lies in its ability to seamlessly integrate class imbalance techniques into regression tasks. This facilitates balanced feature learning, substantially enhancing performance in the Few category.

To further validate PRIME’s effectiveness in the Few category, we conducted an additional analysis on the test set of AgeDB-DIR, measuring the Spearman correlation between the feature similarity matrix and the label similarity matrix. A higher correlation indicates that the learned features are more well-ordered and better reflect the continuity of the label space, which is crucial for learning effective representations in regression tasks.
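For reference, this kind of rank-correlation analysis could be computed as in the following sketch. Using cosine similarity for the feature-similarity matrix and negative absolute difference for the label-similarity matrix is our assumption; the rebuttal does not specify the exact similarity measures.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_rank_correlation(features, labels):
    """Spearman correlation between pairwise feature similarities and
    pairwise label similarities over a set of test samples."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    feat_sim = f @ f.T                                      # (N, N) cosine similarities
    label_sim = -np.abs(labels[:, None] - labels[None, :])  # (N, N) label similarities
    iu = np.triu_indices(len(labels), k=1)                  # compare upper triangles only
    rho, _ = spearmanr(feat_sim[iu], label_sim[iu])
    return rho
```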

As shown in Table 1, we compared the Spearman correlations for all samples and for samples in the Few category across PRIME, RankSim, and ConR. Results are averaged over five runs. While RankSim and ConR exhibit reasonable correlation values on the full dataset, their performance significantly degrades in the Few category. In contrast, PRIME maintains a high correlation within the Few category, indicating that it learns well-ordered feature representations, even for minority samples. This highlights the strength of our proxy-based formulation, which provides holistic guidance for feature positioning, allowing minority samples to be embedded in alignment with the overall label structure.

We will incorporate these additional results and analyses into the revised manuscript to better highlight PRIME’s effectiveness in the Few category.

Table 1. Spearman correlation between feature and label similarities.

| Method | All | Few |
|---|---|---|
| RankSim | 0.804 ± 0.008 | 0.587 ± 0.036 |
| ConR | 0.790 ± 0.024 | 0.614 ± 0.043 |
| PRIME | 0.942 ± 0.008 | 0.828 ± 0.020 |

2. Feature space uniformity

The second term in Eq. (4) encourages proxies to be repelled from one another, as it increases the angle between $\mathbf{z}_i^p$ and $\mathbf{z}_j^p$, thereby inducing a more dispersed (i.e., uniform) proxy distribution. Unlike the original uniformity loss (Wang & Isola, 2020), which pushes all pairs equally apart, our formulation increases the pairwise cosine distance proportionally to the label distance. This not only promotes uniformity in the proxy space but also preserves label ordinality.

Since sample features are aligned to proxies via the alignment loss in Eq. (7), the proxy uniformity is naturally transferred to the feature distribution. In this way, the second term in Eq. (4) promotes feature space uniformity by shaping the proxy distribution, which in turn guides the feature distribution through alignment.

To validate this, we trained a variant of PRIME without the second term in Eq. (4) (i.e., setting $\alpha=0$) and measured feature space uniformity on the test set of AgeDB-DIR using the $\mathcal{L}_{\text{uniform}}$ metric proposed by (Wang & Isola, 2020).
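For reference, the Wang & Isola (2020) uniformity metric is the log of the average pairwise Gaussian potential between L2-normalized features (lower values indicate a more uniform distribution). A minimal sketch, with the default temperature $t=2$ assumed:

```python
import torch
import torch.nn.functional as F

def uniformity_loss(feat, t=2.0):
    """Wang & Isola (2020) uniformity metric on a batch of features."""
    z = F.normalize(feat, dim=1)                # project features onto the unit sphere
    sq_dists = torch.pdist(z, p=2).pow(2)       # pairwise squared Euclidean distances
    return sq_dists.mul(-t).exp().mean().log()  # log E[exp(-t * ||z_i - z_j||^2)]
```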

Table 2 presents the results averaged over five runs. Removing the second term leads to a higher $\mathcal{L}_{\text{uniform}}$ (i.e., lower uniformity), confirming that this term is critical for inducing feature space uniformity. We will include this result and discussion in the revised manuscript.

Table 2. Feature space uniformity.

| Method | $\mathcal{L}_{\text{uniform}}$ ($\downarrow$) |
|---|---|
| PRIME ($\alpha=0$) | -1.458 ± 0.022 |
| PRIME | -1.544 ± 0.016 |

3. Larger number of proxies

For simplicity, we set the number of proxies $C$ such that each proxy would roughly cover a 5-year age span for AgeDB-DIR and IMDB-WIKI-DIR, and kept it fixed while tuning the other hyperparameters. As shown in Table 15 in the appendix, even without tuning, increasing the number of proxies leads to improved performance in the Many and Median regions. This suggests that with proper hyperparameter tuning, a larger number of proxies could further improve performance.

To support this, we conducted an additional experiment with $C = 40$, tuning the associated hyperparameters ($\lambda_p = 1$, $\lambda_a = 5$, $\tau_f = 5$, $\tau_t = 1$, $\alpha = 0.005$), and obtained even better results, as reported in Table 3. Results are averaged over five runs. We will revise the manuscript to include this result.

Table 3. Results with more proxies on AgeDB-DIR.

| Method | MAE All | MAE Many | MAE Median | MAE Few | GM All | GM Many | GM Median | GM Few |
|---|---|---|---|---|---|---|---|---|
| PRIME ($C=40$) | 7.03 ± 0.08 | 6.35 ± 0.12 | 8.24 ± 0.07 | 9.90 ± 0.25 | 4.35 ± 0.09 | 4.00 ± 0.07 | 5.29 ± 0.24 | 6.09 ± 0.13 |
Final Decision

The paper has received scores 4,3,3,2. There were various concerns and detailed discussions focusing primarily on confirming the state-of-the-art performance of the method over previous baselines. For example, there have been particular concerns from Reviewers SddV and RZfa about missing comparisons to recent methods. The authors appear to have responded adequately to this by presenting additional experiments in the rebuttal, overall confirming improvements over other methods.

Unfortunately, there has been no further discussion of the more foundational merits and interpretations of the method itself, nor of the claimed theoretical support. From the discussions, it looks clear that there is agreement on the improved practical performance of the algorithm over baselines. On the other hand, I can't help but note that, in my view, the method itself lacks particular clarity and insight into its conceptual components: it combines two regularization loss terms, which are then further combined with classification losses. It does appear to give gains, but I am not sure how many genuinely new ideas can be drawn from this.

The theory provided is rather disappointing on that front: there is some bound derived on the balanced loss, but the unwieldy form of the involved terms does not make it particularly useful (in my opinion) for explaining the specific roles of the 'align' or 'proxy' loss terms or the specific choices made to model them. Instead, it simply seems to capture that the loss is (in upper bound) proportional to a very rough term that measures imbalance (Eq. 15). Unfortunately, none of the reviewers appears to have really read or commented on the theory and its value.

Given these limitations in conceptual clarity and theoretical contribution, coupled with the late addition of experimental results during the rebuttal period, and after consultation with senior area chairs, I recommend a borderline reject. I encourage the authors to incorporate the feedback to strengthen a future submission by considering moving the current theoretical analysis to the appendix or strengthening it, enhancing the discussion of the method's components and their interactions, and including the additional experiments, comparisons, and discussions from the rebuttal.