PaperHub
NeurIPS 2024 · Poster
Overall rating: 6.0 / 10 (4 reviewers; ratings 5, 8, 5, 6; lowest 5, highest 8, std 1.2)
Confidence: 3.5 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.3

PSL: Rethinking and Improving Softmax Loss from Pairwise Perspective for Recommendation

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

This paper rethinks Softmax Loss (SL) from a pairwise perspective, introducing a novel family of robust DCG surrogate losses to address the limitations of SL, termed Pairwise Softmax Loss (PSL).

Abstract

Keywords
Recommender Systems, Ranking Metrics Optimization, Surrogate Loss, Distributionally Robust Optimization

Reviews and Discussion

Review (Rating: 5)

This paper examines the limitations of softmax loss in recommender systems, focusing on its weak correlation with ranking metrics like DCG and sensitivity to false negatives. The authors propose pairwise softmax loss (PSL), which replaces the exponential function in softmax loss with activation functions such as ReLU, Tanh, and Atan to improve alignment with ranking metrics and enhance robustness. PSL optimizes the score gap between positive and negative item pairs, aiming for a tighter correlation with ranking metrics and improved noise resistance. Empirical evaluations on various datasets under different conditions (IID, OOD, and noisy) indicate that PSL generally outperforms SL and other baseline loss functions in recommendation accuracy and robustness.

Strengths

  • The paper is well-organized with clear explanations and presentation of theoretical and empirical results.
  • The authors provide strong theoretical analysis and comprehensive empirical evaluations on multiple datasets and testing conditions.
  • The proposed PSL method addresses key limitations of softmax loss, potentially leading to more accurate and robust recommender systems.

Weaknesses

  • The proposed activation functions may interact differently with various underlying recommendation models. While the paper tests on a few models (MF, LightGCN, XSimGCL), the results might not hold for other models or architectures, limiting the generalizability of the method.
  • The method may introduce biases due to the specific choice of activation functions, potentially favoring certain types of data distributions or interaction patterns.
  • The reliance on temperature scaling for robustness could lead to parameter sensitivity, requiring careful tuning that might not be practical in real-world scenarios.

Questions

  • The proposed activation functions (ReLU, Tanh, Atan) have been tested on models like MF, LightGCN, and XSimGCL, but these models do not cover the entire spectrum of recommendation systems. Different models may have unique characteristics and architectural dependencies that interact differently with these activation functions. As a result, the performance improvements observed in the tested models may not generalize to other recommendation models or architectures.

    • Have the authors considered testing the activation functions on a broader range of recommendation models? How do the activation functions perform on models with different embedding sizes or layer structures?
    • The authors could extend their experiments to include a wider variety of recommendation models to validate the generalizability of the proposed activation functions. Additionally, conducting ablation studies to understand the interaction between the activation functions and various model components could provide deeper insights.
  • The method might introduce biases because the chosen activation functions could favor specific types of data distributions or interaction patterns. For instance, ReLU may perform better on sparse data, while Tanh might be more effective in scenarios with more uniformly distributed interactions. This bias can limit the applicability of the method to a narrower range of scenarios.

    • What criteria were used to select the specific activation functions (ReLU, Tanh, etc.)? Providing a rationale for the selection of each activation function based on empirical evidence would also strengthen the validity of their choices.
  • The method relies on temperature scaling to control the influence of different data points, which introduces parameter sensitivity. Finding the optimal temperature setting is crucial for achieving the desired robustness and performance, but this process can be labor-intensive and impractical in real-world scenarios. Moreover, the optimal temperature may vary across different datasets, requiring frequent re-tuning.

    • I wonder how sensitive the results are to variations in the temperature parameter. Could the authors provide guidelines or automated methods for tuning the temperature parameter effectively?

Limitations

  • The proposed activation functions have been tested on a limited number of recommendation models, which may not represent the full diversity of models used in practice. This limits the generalizability of the findings.

  • The specific choice of activation functions may introduce biases, favoring certain data distributions or interaction patterns, which could reduce the method's applicability across diverse datasets.

  • The reliance on temperature scaling introduces sensitivity to the temperature parameter, requiring careful and potentially impractical tuning for different datasets and scenarios.

Author Response

Response to Reviewer m2Qo (1/2)

Dear Reviewer m2Qo,

Many thanks for your detailed comments. In the following, we provide responses to the concerns you have raised:

[C1] While the paper tests on a few models, the results might not hold for other models or architectures, limiting the generalizability of the method.

[Response] Thank you for the constructive suggestions. Our experimental setup, which follows previous works (SL, BSL, etc.), tests on three typical backbones: basic MF, graph-based LightGCN, and the SOTA XSimGCL. However, we understand your concerns about generalizability and have subsequently tested our loss on a broader range of recommendation models, different embedding sizes, and different numbers of neural layers:

  • We conducted additional experiments on various model architectures, including neural-based NCF [a1], neighbor-based SimpleX [a2], VAE-based Mult-VAE [a3], and causality-based PDA [a4].
  • We conducted a sensitivity analysis on embedding sizes.
  • We tested the performance with different numbers of layers in LightGCN.

The results are presented in the following tables in terms of NDCG@20 under the IID setting. Our PSL consistently achieves the best performance across all tested scenarios. These results demonstrate the generalizability of PSL, aligning with our theoretical analyses -- the theoretical merits of PSL do not depend on any specific assumptions about model architecture, making it broadly applicable across diverse backbones.

Table A1. Recommendation performance (NDCG@20) with various backbones (NCF, SimpleX, Mult-VAE, PDA).

NCF backbone:
Loss | Gowalla | Movie | Electronic
SL | 0.1469 | 0.0787 | 0.0475
AdvInfoNCE | 0.1443 | 0.0780 | 0.0467
BSL | 0.1475 | 0.0787 | 0.0476
PSL-tanh | 0.1515 | 0.0806 | 0.0482
PSL-relu | 0.1520 | 0.0817 | 0.0492
PSL-atan | 0.1508 | 0.0814 | 0.0493

SimpleX backbone:
Loss | Gowalla | Movie | Electronic
SL | 0.1620 | 0.0920 | 0.0493
AdvInfoNCE | 0.1613 | 0.0919 | 0.0495
BSL | 0.1620 | 0.0921 | 0.0493
PSL-tanh | 0.1637 | 0.0932 | 0.0504
PSL-relu | 0.1641 | 0.0936 | 0.0506
PSL-atan | 0.1638 | 0.0932 | 0.0503

PDA backbone:
Loss | Gowalla | Movie | Electronic
SL | 0.1630 | 0.0927 | 0.0529
AdvInfoNCE | 0.1626 | 0.0934 | 0.0526
BSL | 0.1630 | 0.0928 | 0.0529
PSL-tanh | 0.1647 | 0.0941 | 0.0533
PSL-relu | 0.1647 | 0.0944 | 0.0535
PSL-atan | 0.1646 | 0.0941 | 0.0533

Mult-VAE backbone:
Loss | Gowalla | Movie | Electronic
SL | 0.1626 | 0.0930 | 0.0528
AdvInfoNCE | 0.1624 | 0.0932 | 0.0519
BSL | 0.1626 | 0.0931 | 0.0528
PSL-tanh | 0.1645 | 0.0944 | 0.0533
PSL-relu | 0.1648 | 0.0947 | 0.0537
PSL-atan | 0.1645 | 0.0945 | 0.0533

Table A2. Recommendation performance (NDCG@20) with varying embedding size d in MF.

Gowalla:
Loss | d = 16 | d = 32 | d = 64 | d = 128
SL | 0.1343 | 0.1496 | 0.1624 | 0.1689
AdvInfoNCE | 0.1346 | 0.1492 | 0.1624 | 0.1696
BSL | 0.1346 | 0.1495 | 0.1627 | 0.1688
PSL-tanh | 0.1398 | 0.1524 | 0.1646 | 0.1708
PSL-relu | 0.1407 | 0.1534 | 0.1647 | 0.1706
PSL-atan | 0.1401 | 0.1524 | 0.1646 | 0.1708

Electronic:
Loss | d = 16 | d = 32 | d = 64 | d = 128
SL | 0.0378 | 0.0449 | 0.0529 | 0.0566
AdvInfoNCE | 0.0384 | 0.0454 | 0.0527 | 0.0566
BSL | 0.0386 | 0.0459 | 0.0530 | 0.0564
PSL-tanh | 0.0405 | 0.0464 | 0.0535 | 0.0570
PSL-relu | 0.0411 | 0.0469 | 0.0541 | 0.0572
PSL-atan | 0.0401 | 0.0464 | 0.0535 | 0.0569

Table A3. Recommendation performance (NDCG@20) with varying number of layers L in LightGCN.

Gowalla:
Loss | L = 1 | L = 2 | L = 3
SL | 0.1547 | 0.1628 | 0.1551
AdvInfoNCE | 0.1542 | 0.1625 | 0.1548
BSL | 0.1547 | 0.1628 | 0.1551
PSL-tanh | 0.1567 | 0.1648 | 0.1572
PSL-relu | 0.1566 | 0.1648 | 0.1570
PSL-atan | 0.1567 | 0.1648 | 0.1571

Electronic:
Loss | L = 1 | L = 2 | L = 3
SL | 0.0482 | 0.0526 | 0.0486
AdvInfoNCE | 0.0483 | 0.0528 | 0.0485
BSL | 0.0482 | 0.0526 | 0.0486
PSL-tanh | 0.0501 | 0.0532 | 0.0504
PSL-relu | 0.0504 | 0.0536 | 0.0503
PSL-atan | 0.0501 | 0.0532 | 0.0504

[References]

[a1] Neural collaborative filtering, WWW'17

[a2] SimpleX: A simple and strong baseline for collaborative filtering, CIKM'21

[a3] Variational Autoencoders for Collaborative Filtering, WWW'18

[a4] Causal intervention for leveraging popularity bias in recommendation, SIGIR'21

Comment

Response to Reviewer m2Qo (2/2)

[C2] The method may introduce biases due to the specific choice of activation functions.

[Response] We appreciate your concern. We would like to clarify that PSL does not introduce biases. Although we offer three choices of activation functions, PSL consistently outperforms the compared methods with any choice, with few exceptions, as evidenced in Table 1, Table 2, Table A1, Figure 2, Figures C.1-C.4.

In practice, we recommend using PSL-relu, as it achieves superior performance in most cases. Moreover, the linear ReLU function is simpler than the others, making it potentially easier and more efficient to optimize.

[C3] The reliance on temperature scaling for robustness could lead to parameter sensitivity, requiring careful tuning that might not be practical in real-world scenarios.

[Response] Thank you for raising this concern. However, it is important to note that PSL does not introduce additional hyper-parameters compared to the classic SL, which also utilizes the temperature parameter. SL has been widely applied in practical recommender systems like Taobao and TikTok.

Besides, our empirical experiments on the temperature (presented in the following Table A4) indicate that PSL achieves strong performance within a narrow search range for the temperature, [0.025, 0.05, 0.1, 0.15], across four datasets. More importantly, PSL consistently outperforms basic SL at any temperature within this range. This suggests that PSL could serve as a better DCG surrogate loss than SL in practice, without necessitating additional tuning efforts.

In practice, when facing scenarios with significant resource constraints that preclude extensive hyperparameter tuning, we recommend limiting the search for the temperature to only two values {0.05, 0.1}. At this setting, the performance of PSL may be slightly below the optimum but is still significantly better than BPR and SL. In more extreme scenarios, we may even opt to simply set τ = 0.05, as this value yields decent performance in the majority of cases. It is important to note that other advanced loss functions, such as BSL and AdvInfoNCE, are less effective in these scenarios because they require the tuning of more hyperparameters to maintain their performance.

Table A4. Recommendation performance (NDCG@20) with varying temperature τ.

Gowalla:
Loss | τ = 0.025 | τ = 0.05 | τ = 0.1 | τ = 0.15
SL | 0.1543 | 0.1624 | 0.1248 | 0.0980
PSL-tanh | 0.1543 | 0.1646 | 0.1308 | 0.1014
PSL-relu | 0.1541 | 0.1647 | 0.1323 | 0.1032
PSL-atan | 0.1547 | 0.1646 | 0.1308 | 0.1013

Movie:
Loss | τ = 0.025 | τ = 0.05 | τ = 0.1 | τ = 0.15
SL | 0.0755 | 0.0929 | 0.0885 | 0.0755
PSL-tanh | 0.0750 | 0.0941 | 0.0910 | 0.0771
PSL-relu | 0.0755 | 0.0945 | 0.0920 | 0.0783
PSL-atan | 0.0750 | 0.0941 | 0.0910 | 0.0771

Electronic:
Loss | τ = 0.025 | τ = 0.05 | τ = 0.1 | τ = 0.15
SL | 0.0274 | 0.0360 | 0.0529 | 0.0516
PSL-tanh | 0.0292 | 0.0417 | 0.0535 | 0.0521
PSL-relu | 0.0292 | 0.0416 | 0.0541 | 0.0526
PSL-atan | 0.0293 | 0.0417 | 0.0535 | 0.0522

Book:
Loss | τ = 0.025 | τ = 0.05 | τ = 0.1 | τ = 0.15
SL | 0.1210 | 0.1174 | 0.0807 | 0.0611
PSL-tanh | 0.1225 | 0.1207 | 0.0856 | 0.0641
PSL-relu | 0.1227 | 0.1214 | 0.0876 | 0.0658
PSL-atan | 0.1226 | 0.1207 | 0.0856 | 0.0642

We sincerely appreciate your feedback and hope these additional experiments and analyses have addressed your concerns. Please feel free to reach out if you have any further questions or suggestions.

Comment

Thanks for the authors' responses, which have addressed my concerns.

Comment

Thank you for raising your score and recommending acceptance!

Review (Rating: 8)

Starting from the unique characteristics of recommendation systems (RS), this paper highlights the incompatibility of the exponential function in the softmax loss when applied to RS. To address these issues, the authors propose the PSL loss, introducing two main modifications: the replacement of the activation function and the adjustment of the temperature factor's position. These changes result in multiple benefits, which are thoroughly discussed in the paper. Compared to several SOTA loss functions in RS, the proposed method demonstrates substantial improvements in user-item recommendation experiments conducted across three different scenarios.

Strengths

  1. The paper is well-written and accessible to readers, with the clearly-stated and intriguing motivation. The topic is fundamental to the field of recommendation systems. The proposed method effectively alleviates existing problems, despite the minimal changes.
  2. The proposed PSL loss could serve as a tighter surrogate of DCG compared to SL and could be considered as a DRO-empowered BPR loss. The bound relations with AdvInfoNCE and BPR underscore the sound merits of PSL. The logical coherence of several theorems and lemmas contributes to the comprehensiveness of this work.
  3. In the experiments, the detailed parameter settings for all compared methods are provided. The inclusion of the IID, OOD, and noisy settings is thoughtfully designed to evaluate the applicability of PSL.
  4. Despite the straightforward implementation, it is encouraging that the proposed method demonstrates substantial improvements on widely-used datasets compared to several SOTA methods in RS.

Weaknesses

  1. The proposed PSL loss is derived from the pair-wise form of softmax loss; however, the role of the original softmax loss in the subsequent discussion is not thoroughly addressed, leading to a minor gap in the exposition.
  2. While the paper provides some justification for the chosen activation functions, the theoretical rationale for selecting ReLU, Tanh, and Atan over others is not fully developed. The impact of the chosen activation functions on the stability of the training process is not sufficiently discussed.
  3. The paper mentions the use of grid search for hyperparameter optimization but does not provide a detailed analysis of the sensitivity of PSL to the temperature factor.
  4. There are a few typos in the paper, such as 'faciliating' in line 73, 'Scenerio' in line 256, 'outperforms' in line 314, 'furture' in line 325, and a mismatch in lines 157-158.

Questions

  1. Recommendation systems often suffer from data sparsity issues. How does PSL perform under conditions of extreme data sparsity? Have the authors conducted experiments to evaluate the effectiveness of PSL in such scenarios, and what mitigation strategies, if any, are suggested?
  2. Generally, contrastive loss is sensitive to noise (false negative instances). Is it feasible to apply PSL to general contrastive learning?
  3. In the experiments, PSL-ReLU consistently achieves sound performance across all settings. Can authors provide any rudimentary insights or intuitive understandings for the selection of activation function for broader application?
  4. The introduction of new activation functions in PSL may alter the training dynamics of the backbone models. Apart from performance metrics, are there differences in the objective function values or convergence rates between PSL and SL?

Limitations

See weaknesses.

Author Response

Response to Reviewer ufck (1/2)

Dear Reviewer ufck,

We greatly appreciate your acknowledgement of our contributions and your insightful comments. In what follows, we provide responses to the Weaknesses (W) and Questions (Q) you have raised:

[W1] The proposed PSL loss is derived from the pair-wise form of softmax loss; however, the role of the original softmax loss in the subsequent discussion is not thoroughly addressed, leading to a minor gap in the exposition.

[Response] Thank you for underlining this issue. We will offer more explanations in the next version.

Note that the original form (cf. Eq.(2.4)) is derived by Maximum Likelihood Estimation (MLE), where SL maximizes the probability of positive items:

\mathcal{L}_{\text{SL}}(u) = -\sum_{i\in\mathcal{P}_u}\log\left(\dfrac{\exp(f(u, i)/\tau)}{\sum_{j\in\mathcal{I}}\exp(f(u, j)/\tau)}\right)

Recent works (e.g., BSL) have rewritten SL into pointwise form for analyses, where positive and negative instances are treated separately:

\mathcal{L}_{\text{SL}}(u) = -\sum_{i\in\mathcal{P}_u}f(u,i)/\tau+\sum_{i\in\mathcal{P}_u}\log\left(\sum_{j\in\mathcal{I}}\exp(f(u,j)/\tau)\right)

Although the original and pointwise forms are equivalent, improving SL from the pointwise perspective may disrupt the pairwise nature -- i.e., the loss cannot be expressed as a function of d_{uij} = f(u,j) - f(u,i). As discussed in Section 3, ranking metrics are pairwise -- they optimize the relative order of instances rather than their absolute values. Once the pairwise nature is disrupted, the theoretical relation with ranking metrics is no longer maintained. That is why we derive PSL from the pairwise form (cf. Eq.(3.1), Eq.(4.1)).
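
Concretely, the same original form can be rewritten purely in terms of score gaps (a sketch of this step; the paper's Eq.(3.1)/(4.1) may use slightly different notation):

\mathcal{L}_{\text{SL}}(u) = \sum_{i\in\mathcal{P}_u}\log\sum_{j\in\mathcal{I}}\exp(d_{uij})^{1/\tau}, \qquad d_{uij}=f(u,j)-f(u,i)

PSL keeps exactly this pairwise structure and only replaces \exp(\cdot) with a milder activation \sigma(\cdot):

\mathcal{L}_{\text{PSL}}(u) = \sum_{i\in\mathcal{P}_u}\log\sum_{j\in\mathcal{I}}\sigma(d_{uij})^{1/\tau}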

[W2] While the paper provides some justification for the chosen activation functions, the theoretical rationale for selecting ReLU, Tanh, and Atan over others is not fully developed. The impact of the chosen activation functions on the stability of the training process is not sufficiently discussed.

[Response] Thank you for the constructive comments. According to our Lemma 3, the selection of these three activations satisfies the condition \delta(d_{uij}) \le \sigma(d_{uij}) \le \exp(d_{uij}). Admittedly, while the selected activations may not be optimal, they are simple and are sufficiently better than the \exp(\cdot) function adopted in SL, serving as a closer approximation to the ideal \delta(\cdot) function. Our experiments further validate the effectiveness of the proposed PSL.

Activations can affect training stability, specifically impacting the gradient distribution (cf. Eq.(4.3)). In fact, PSL exhibits more moderate gradient distributions compared to SL (cf. Fig.1(b)), thereby enhancing training stability.

[W3] The paper mentions the use of grid search for hyperparameter optimization but does not provide a detailed analysis of the sensitivity of PSL to the temperature factor.

[Response] Thank you for the constructive suggestions. In Table A1, we have conducted a sensitivity analysis of the temperature τ on four IID datasets with the MF backbone (metric: NDCG@20). The results show that PSL and SL exhibit similar temperature sensitivity, and PSL consistently outperforms SL under the same τ.

Table A1. Sensitivity analysis of temperature τ (metric: NDCG@20).

Gowalla:
Loss | τ = 0.025 | τ = 0.05 | τ = 0.1 | τ = 0.15
SL | 0.1543 | 0.1624 | 0.1248 | 0.0980
PSL-tanh | 0.1543 | 0.1646 | 0.1308 | 0.1014
PSL-relu | 0.1541 | 0.1647 | 0.1323 | 0.1032
PSL-atan | 0.1547 | 0.1646 | 0.1308 | 0.1013

Movie:
Loss | τ = 0.025 | τ = 0.05 | τ = 0.1 | τ = 0.15
SL | 0.0755 | 0.0929 | 0.0885 | 0.0755
PSL-tanh | 0.0750 | 0.0941 | 0.0910 | 0.0771
PSL-relu | 0.0755 | 0.0945 | 0.0920 | 0.0783
PSL-atan | 0.0750 | 0.0941 | 0.0910 | 0.0771

Electronic:
Loss | τ = 0.025 | τ = 0.05 | τ = 0.1 | τ = 0.15
SL | 0.0274 | 0.0360 | 0.0529 | 0.0516
PSL-tanh | 0.0292 | 0.0417 | 0.0535 | 0.0521
PSL-relu | 0.0292 | 0.0416 | 0.0541 | 0.0526
PSL-atan | 0.0293 | 0.0417 | 0.0535 | 0.0522

Book:
Loss | τ = 0.025 | τ = 0.05 | τ = 0.1 | τ = 0.15
SL | 0.1210 | 0.1174 | 0.0807 | 0.0611
PSL-tanh | 0.1225 | 0.1207 | 0.0856 | 0.0641
PSL-relu | 0.1227 | 0.1214 | 0.0876 | 0.0658
PSL-atan | 0.1226 | 0.1207 | 0.0856 | 0.0642

[W4] There are a few typos in the paper...

[Response] Thanks for pointing them out. We have made corrections.

Comment

Response to Reviewer ufck (2/2)

[Q1] How does PSL perform under conditions of extreme data sparsity? Have the authors conducted experiments to evaluate the effectiveness of PSL in such scenarios, and what mitigation strategies, if any, are suggested?

[Response] Thank you for raising this concern. Note that PSL is theoretically superior to SL and does not rely on any assumptions about the dataset. Empirically, we have conducted extensive experiments on six datasets with varying sparsity (cf. Tab.B.1). The data density ranges from the highest in Amazon-Electronic (0.00208) to the lowest in Amazon-Book (0.00026), a nearly tenfold difference. We observe that PSL consistently achieves SOTA performance across these sparsity levels.

To mitigate the impact of extreme data sparsity, we believe leveraging cold-start recommendation strategies, such as those in [a1, a2], could be useful. These strategies can be seamlessly integrated with PSL, requiring only the substitution of their loss functions (e.g., BPR, SL) with PSL.

[References]

[a1] Contrastive Learning for Cold-Start Recommendation, MM'21

[a2] A heterogeneous graph neural model for cold-start recommendation, SIGIR'21

[Q2] Generally, contrastive loss is sensitive to noise (false negative instances). Is it feasible to apply PSL to general contrastive learning?

[Response] Insightful suggestions! PSL has the potential to be adapted as a loss function for contrastive learning (CL). In the context of CL, for a sample z \sim X, where the probability distributions of positive and negative samples z_i^+, z_j^- are P_z^+, P_z^-, respectively, PSL can be defined as:

\mathcal{L}_{\mathrm{PSL}}(z) = \mathbb{E}_{i \sim P_z^+}\left[\log\mathbb{E}_{j \sim P_z^-}\left[\sigma(f(z, z_j^-)-f(z, z_i^+))^{1/\tau}\right]\right]

When \sigma = \exp, PSL degenerates into InfoNCE. We expect that PSL equipped with the ReLU or Tanh activation functions could mitigate the noise sensitivity by avoiding gradient explosion (cf. Fig.1(b)). We plan to further explore this promising topic in future research.

[Q3] In the experiments, PSL-ReLU consistently achieves sound performance across all settings. Can authors provide any rudimentary insights or intuitive understandings for the selection of activation function for broader application?

[Response] In practice, we recommend using PSL-relu directly, as it achieves superior performance in most cases. Moreover, the ReLU function is simpler than the others, making it potentially easier and more efficient to optimize.

The reason for the effectiveness of PSL-relu could be explained as follows: as illustrated in Fig.1(a), on the negative half-axis, ReLU is closer to the ideal δ compared to the others. Note that d_{uij} = f(u, j) - f(u, i), and in many cases the negative score f(u, j) is smaller than the positive score f(u, i), indicating that d_{uij} is more likely to lie on the negative half-axis. Therefore, PSL-relu acts as a de facto closer surrogate to DCG than the others in practice, leading to better recommendation accuracy.
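
To make this concrete, below is a minimal PyTorch-style sketch of a PSL objective over sampled negatives, following the per-pair form \log\sum_{j}\sigma(d_{uij})^{1/\tau} with ReLU as \sigma. The use of sampled negatives, the eps guard against \log(0), and all function and argument names are illustrative assumptions rather than the paper's reference implementation:

```python
import torch

def psl_relu_loss(pos_scores, neg_scores, tau=0.05, eps=1e-8):
    """Illustrative PSL-style loss (a sketch, not the official implementation).

    pos_scores: (B,)   scores f(u, i) of positive items
    neg_scores: (B, N) scores f(u, j) of sampled negative items
    Computes  L(u, i) = log sum_j sigma(d_uij)^(1/tau)  with sigma = ReLU,
    where d_uij = f(u, j) - f(u, i); eps is an assumed guard against log(0).
    """
    d = neg_scores - pos_scores.unsqueeze(1)     # pairwise score gaps d_uij, shape (B, N)
    w = torch.relu(d).pow(1.0 / tau)             # sigma(d_uij)^(1/tau); zero when j ranks below i
    return torch.log(w.sum(dim=1) + eps).mean()  # average over positive (u, i) pairs

# Swapping torch.relu(d).pow(1.0 / tau) for torch.exp(d / tau) recovers a
# sampled-softmax (InfoNCE-style) loss, mirroring the remark above that
# sigma = exp degenerates to SL/InfoNCE.
```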

[Q4] The introduction of new activation functions in PSL may alter the training dynamics of the backbone models. Apart from performance metrics, are there differences in the objective function values or convergence rates between PSL and SL?

[Response] Thanks for your question. As stated in the Response to [W2], PSL can in fact enhance the training stability by moderating the gradient distribution. We have provided the loss curves of SL and PSL in Figure R1 (in Author Rebuttal), which show that the training process of PSL is stable and converges fast.

Comment

Thanks for the responses. I will improve my score slightly after reconsidering this work, since most of my concerns are addressed.

Comment

Thank you for your insightful comments, raising your score, and recommending acceptance!

Review (Rating: 5)

The authors re-examine the connection between the Softmax Loss (SL) and the evaluation metric Discounted Cumulative Gain (DCG), highlighting the inadequate tightness of SL as a surrogate loss for DCG. They propose minimal yet effective modifications (PSL-tanh/relu/atan) based on the pair-wise formulation of SL, showing provable tightness and controllable weight distributions.

Strengths

  1. The authors provide a unified framework that helps to design a tighter and stronger DCG surrogate loss.
  2. Extensive experiments have been conducted in IID, OOD, and Noise settings. Specifically, the proposed PSL-relu loss shows superior performance in both OOD and Noise settings.

Weaknesses

  1. In the IID setting, PSL demonstrates only marginal improvement even when compared to the standard Softmax Loss (SL).
  2. The authors argue that SL is highly sensitive to noise, while PSL with an appropriate activation function can mitigate this sensitivity. I would expect that as the noise ratio increases, an activation function with a moderate weight distribution should perform better. In other words, PSL-tanh/atan should be a better choice than SL/PSL-relu if the noise ratio is high. However, this is not the case according to Figures C.1-4, in which PSL-relu consistently outperforms the other methods.

Questions

Why does BSL only stay on par with SL in both IID and OOD settings? It appears that the connection to Distributionally Robust Optimization (DRO) determines the performance in OOD settings, but why is this not the case for BSL?

Limitations

The authors have addressed the limitations.

Author Response

Response to Reviewer kuk6

Dear Reviewer kuk6,

We sincerely appreciate your recognition of our theoretical contributions and experimental setup. Below are our detailed responses to the Weaknesses (W) and Questions (Q) you have raised:

[W1] In the IID setting, PSL demonstrates only marginal improvement even when compared to the standard Softmax Loss (SL).

[Response] Thank you for raising this concern. Admittedly, the improvements of PSL over SL in the IID setting are not particularly large, typically ranging from 1% to 3.5%. But we would like to highlight the following aspects:

  • The improvements of PSL are consistent across multiple datasets, backbones, and metrics. This consistency demonstrates the effectiveness and generalizability of PSL, while some existing advanced losses like AdvInfoNCE and LLPAUC may perform worse than SL in some cases.
  • The improvements of PSL are significant -- we have conducted statistical tests (p < 0.05) to validate that these improvements are statistically significant.
  • PSL exhibits much stronger robustness than SL under distribution shifts or noise. From Table 2 and Figure 2, we observe that PSL shows notable improvements over SL, sometimes reaching improvement levels of 5% and 10% in the OOD and noisy settings, respectively. Given that distribution shifts and noise are common in practical RS, we anticipate that PSL could achieve impressive performance in practice.
  • The modification required to implement PSL over SL is minimal and practical. PSL only necessitates a simple replacement of the activation function without introducing additional parameters or hyperparameters, making it highly practical for industrial applications. In contrast, losses like AdvInfoNCE and BSL require more parameters and hyperparameters, necessitating higher training costs and extensive tuning efforts.

Overall, we believe that PSL can serve as a superior alternative to SL in practical settings, given its greater effectiveness and robustness, while not incurring extra implementation complexity or hyperparameter tuning efforts.

[W2] The authors argue that SL is highly sensitive to noise, while PSL with an appropriate activation function can mitigate this sensitivity. I would expect that as the noise ratio increases, an activation function with a moderate weight distribution should perform better. In other words, PSL-tanh/atan should be a better choice than SL/PSL-relu if the noise ratio is high. However, this is not the case according to Figures C.1-4, in which PSL-relu consistently outperforms the other methods.

[Response] Thanks for your insightful question. We have indeed observed that PSL-ReLU achieves slightly better performance than PSL-Tanh/Atan in high-noise scenarios. The reasons may stem from the following two aspects:

  • The disparity in the weights of noisy data between PSL-ReLU and PSL-Tanh/Atan is not as pronounced as the disparity between PSL-ReLU and SL. The following Table A1 presents the weights (\nabla_d \mathcal{L}) of those extremely noisy instances where d_{uij} = 1 across different losses. The differences between PSL-ReLU and PSL-Tanh/Atan are not as large as those between PSL-ReLU and SL.
  • PSL-ReLU employs the widely-adopted linear activation function, ReLU, which has been shown to be more easily optimized and exhibits greater numerical stability than the non-linear Tanh/Atan activation functions.

Both of these aspects contribute to the results. It appears that the second aspect outweighs the first, and PSL-ReLU exhibits slightly better performance even in highly noisy settings.

Table A1. Weights (\nabla_d \mathcal{L}) of extremely noisy instances with d_{uij} = 1 across different losses.

Loss | SL | PSL-ReLU | PSL-Tanh | PSL-Atan
\nabla_d \mathcal{L} | 742 | 80 | 20 | 25

[Q1] Why does BSL only stay on par with SL in both IID and OOD settings? It appears that the connection to Distributionally Robust Optimization (DRO) determines the performance in OOD settings, but why is this not the case for BSL?

[Response] Thanks for raising this concern. BSL enhances SL by incorporating Distributionally Robust Optimization (DRO) on positive samples. However, this approach exhibits certain limitations:

  • BSL adopts a pointwise rather than a pairwise perspective, treating positive and negative instances separately. Given that ranking metrics depend on pairwise comparisons (as discussed in Section 3), this approach disrupts the theoretical alignment with ranking metrics, potentially rendering BSL an unsuitable surrogate for the DCG loss.
  • In typical RS, positive samples are sparse, with a user often having fewer than 15 positive items. BSL applies KL-divergence-based DRO on these positive instances. Consequently, the support of the perturbed distributions within DRO is confined to these limited positive items. This means the set of uncertainty distributions used in DRO lacks flexibility and may not adequately cover the test distribution, thus compromising the robustness.

Given these two drawbacks, BSL might outperform SL, but the improvement is not substantial. In fact, based on our experience with hyperparameter tuning, tuning the hyperparameters to align BSL closely with SL often yields optimal performance.

Comment

Thanks for the authors' response and I will keep my rating.

Comment

Thank you for your insightful comments and recommending acceptance!

Review (Rating: 6)

This paper investigates the effectiveness of softmax loss in recommendation models. To overcome the limitations of SL, the authors propose a new pairwise softmax loss (PSL). Based on the analysis, the authors argue that replacing the exponential function with other activation functions can benefit the ranking performance. Experiments on IID, OOD, and noisy datasets demonstrate the effectiveness of the proposed PSL.

Strengths

  1. The analysis of softmax loss on the DCG metric is impressive and reasonable. The pairwise softmax combines the BPR loss with an activation operation, which is easy to incorporate into current models.
  2. The technique is sound and well motivated according to the analysis.
  3. The experiment design is reasonable and follows the related work, which verifies the effectiveness.

Weaknesses

  1. Overall, the softmax loss is utilized in both the training and test stages, which means the model's output can be used as a probability. This probability can be used in downstream tasks, such as ECPM calculation in online advertisements. So, replacing the operation related to ranking metrics may harm this practical use.
  2. More attention should be paid to the distinction between AUC, Recall, and other metrics. The motivation behind PSL prioritizes the DCG metric.
  3. The experiment design is reasonable, but the analysis needs to be further enhanced.

Questions

  1. How does PSL affect other metrics? For example, AUC is important in pointwise click scenarios.

  2. In the experiments, only results for NDCG@20 are given. Results at more cutoffs should be provided for recommendation.

Limitations

Yes

Author Response

Response to Reviewer VFmg

Dear Reviewer VFmg,

We sincerely appreciate your recognition of our work. Your detailed comments are highly valued, and the questions you raised are both interesting and practical. Below are our detailed responses to the Weaknesses (W) and Questions (Q):

[W1] Overall, the softmax loss is utilized in both the training and test stages, which means the model's output can be used as a probability. The probability can be used in downstream tasks, such as ECPM calculation in online advertisements. So, replacing the operation related to ranking metrics may harm this practical use.

[Response] Thank you for the insightful comments. It is not straightforward to derive the probability from PSL, but we can construct it by leveraging the classic Plackett-Luce ranking model [a1]. The Plackett-Luce model establishes relationships between permutation probabilities and the prediction scores, and SL was first introduced to the recommendation community from this perspective [a2].

To elaborate, in the Plackett-Luce ranking model, the probability that an item occupies the top position (i.e., the probability that it is the item the user likes most) can be expressed as:

P_s(i)=\dfrac{\phi\left(s_i\right)}{\sum_{j \in \mathcal{I}} \phi\left(s_j\right)}

where s_i denotes the Plackett-Luce score of an item i. The function \phi(\cdot) can be any increasing and strictly positive function. In our PSL, we express s_i = f(u,i) - b_{ref} and \phi(\cdot) = \sigma(\cdot)^{1/\tau}. Here a reference term b_{ref} is introduced, which can be set as the prediction value f(u,j) of a specific item j (e.g., the top-K-th item, which serves as a threshold score for an item being considered positive). The introduction of b_{ref} is important, as PSL optimizes the relative magnitudes of f(u,i) from a pairwise perspective rather than their absolute values. Introducing a reference term can enhance the stability of the results.

This Plackett-Luce model possesses some interesting properties:

  • When we set \phi(\cdot)=\exp(\cdot), the probability P_s(i) degenerates to the Softmax form.
  • The objective of PSL can be viewed as optimizing the log-likelihood of the Plackett-Luce model with specific reference terms. That is, for each positive user-item pair (u,i), we have \mathcal{L}_{\text{PSL}}(u,i) = \log\left(\sum_{j \in \mathcal{I}} \sigma(d_{uij})^{1/\tau} \right) = -\log\left( \dfrac{\sigma(f(u,i)-b_{ref})^{1/\tau}}{\sum_{j \in \mathcal{I}}\sigma(f(u,j)-b_{ref})^{1/\tau}} \right) = -\log P_s(i), where we set b_{ref}=f(u,i). A small numerical sketch of this construction follows below.
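
As an illustration of this construction (a sketch only: the choice of tanh for \sigma, the clipping of non-positive activation values, and all names below are our assumptions, not the paper's specification):

```python
import numpy as np

def top1_probability(scores, ref_index, tau=0.05, sigma=np.tanh):
    """Plackett-Luce top-1 probabilities P_s(i) = phi(s_i) / sum_j phi(s_j),
    with s_i = f(u, i) - b_ref and phi(.) = sigma(.)^(1/tau) as described above.
    The clip guards against non-positive sigma values (an assumption)."""
    s = scores - scores[ref_index]                        # shift by the reference score b_ref
    phi = np.clip(sigma(s), 1e-12, None) ** (1.0 / tau)   # phi(s_j) = sigma(s_j)^(1/tau)
    return phi / phi.sum()

# Hypothetical prediction scores for five items; the item at index 2 plays the
# role of the reference (threshold) item whose score defines b_ref.
probs = top1_probability(np.array([1.2, 0.7, 0.9, 0.3, -0.1]), ref_index=2)
```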

[Reference]

[a1] The analysis of permutations, Journal of the Royal Statistical Society Series C: Applied Statistics, 1975.

[a2] Learning to rank: from pairwise approach to listwise approach, ICML'07

[W2] More attention should be paid to the distinction between AUC, Recall, and other metrics. The motivation behind PSL prioritizes the DCG metric.

[Q1] How does PSL affect other metrics? For example, AUC is important in pointwise click scenarios.

[Response] Thank you for the constructive comments. This work prioritizes the DCG metric due to its widespread application in evaluating model performance within industrial RS. Furthermore, optimizing DCG remains a challenging problem. We appreciate the reviewer's suggestion to consider other metrics. Consequently, we have conducted further experiments to assess the performance of PSL using other metrics, including Precision, MRR, and AUC, detailed in the following Table A1, while Recall has been reported in Table 1 of our manuscript. From these tables, we observe that PSL consistently outperforms the other baselines across these varied metrics.

Table A1. Precision@20, AUC, and MRR@20 on IID datasets and MF backbone.

Gowalla:
Loss | Precision@20 | AUC | MRR@20
SL | 0.0637 | 0.9666 | 0.1870
AdvInfoNCE | 0.0638 | 0.9667 | 0.1872
BSL | 0.0638 | 0.9635 | 0.1881
PSL-tanh | 0.0646 | 0.9683 | 0.1886
PSL-relu | 0.0647 | 0.9674 | 0.1885
PSL-atan | 0.0646 | 0.9683 | 0.1886

Electronic:
Loss | Precision@20 | AUC | MRR@20
SL | 0.0156 | 0.8004 | 0.0702
AdvInfoNCE | 0.0156 | 0.8004 | 0.0685
BSL | 0.0156 | 0.8003 | 0.0686
PSL-tanh | 0.0158 | 0.8074 | 0.0704
PSL-relu | 0.0159 | 0.8059 | 0.0718
PSL-atan | 0.0158 | 0.8071 | 0.0704

[W3] The experiment design is reasonable, but the analysis needs to be further enhanced.

[Response] Thank you for highlighting this aspect. We will augment our experiment section with more comprehensive analyses, including more detailed comparisons of baseline performance, deeper analyses across the three distinct scenarios, and further analysis of the impact of noise on different losses.

[Q2] In the experiments, only results for NDCG@20 are given. Results at more cutoffs should be provided for recommendation.

[Response] Thank you for your constructive suggestions. Table A2 presents the results of PSL in comparison with other baselines across various NDCG@K cutoffs. These results consistently demonstrate that PSL outperforms the baselines, thereby validating its effectiveness and robustness.

Table A2. NDCG@K on IID datasets and MF backbone, K ∈ [5, 10, 20, 50].

Gowalla:
Loss | NDCG@5 | NDCG@10 | NDCG@20 | NDCG@50
SL | 0.1323 | 0.1417 | 0.1624 | 0.1994
AdvInfoNCE | 0.1325 | 0.1420 | 0.1627 | 0.1997
BSL | 0.1329 | 0.1423 | 0.1630 | 0.2001
PSL-tanh | 0.1343 | 0.1439 | 0.1646 | 0.2013
PSL-relu | 0.1345 | 0.1440 | 0.1647 | 0.2012
PSL-atan | 0.1343 | 0.1439 | 0.1646 | 0.2013

Electronic:
Loss | NDCG@5 | NDCG@10 | NDCG@20 | NDCG@50
SL | 0.0352 | 0.0429 | 0.0529 | 0.0696
AdvInfoNCE | 0.0343 | 0.0422 | 0.0527 | 0.0699
BSL | 0.0345 | 0.0425 | 0.0530 | 0.0694
PSL-tanh | 0.0357 | 0.0434 | 0.0535 | 0.0702
PSL-relu | 0.0362 | 0.0439 | 0.0541 | 0.0705
PSL-atan | 0.0357 | 0.0434 | 0.0535 | 0.0702
Comment

Thanks for the authors' response; my concerns are resolved, and I raise my score.

Comment

Thank you for raising your score and recommending acceptance!

Author Response

Overall Rebuttal

Dear Reviewers VFmg, kuk6, ufck, and m2Qo,

We thank all reviewers for taking the time to review our paper and for providing valuable and insightful feedback. We are delighted to see that our work has been recognized for its contributions and inspiration to the recommendation loss community, as mentioned by Reviewers VFmg, kuk6, ufck, and m2Qo. We also note that some reviewers raised in-depth questions, and we have endeavored to provide comprehensive responses.

We would like to express our gratitude to the reviewers for affirming the reasonable motivation, solid theoretical foundation, and the simplicity and effectiveness of our PSL loss, as highlighted by Reviewers VFmg, kuk6, ufck, and m2Qo. We are pleased to see that the reviewers found the experiment design and results to be comprehensive and convincing, as mentioned by Reviewers VFmg, kuk6, and ufck. We also appreciate the insightful comments and suggestions for our experiments, including more recommendation metrics (Reviewer VFmg), temperature sensitivity analysis (Reviewers ufck and m2Qo), and more backbones and model structures (Reviewer m2Qo). We have tried our best to conduct these additional experiments, and we believe that the results are consistent with our claims.

In our rebuttal, we carefully considered the comments and suggestions provided by the reviewers, and we have addressed them point by point. We believe and hope that our responses adequately address the reviewers' concerns.

Once again, we sincerely thank the reviewers for their valuable feedback, which helps us improve the quality of our work.

Final Decision

This paper reveals the limitations of the commonly used Softmax Loss (SL) for ranking tasks in recommendation systems and introduces an improved method, Pairwise Softmax Loss (PSL), to address these shortcomings. The proposed method is both well-motivated and supported by theoretical guarantees. Reviewers found the paper to be novel, with a simple and practical approach, and the analysis and results were effective. Most of the reviewers' concerns were satisfactorily addressed in the rebuttal. However, the authors are encouraged to incorporate the reviewers' suggestions in the final version, such as including discussions on additional metrics like AUC and Recall.