PaperHub
Overall score: 5.5/10
Poster · 4 reviewers
Ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Loss Functions and Operators Generated by f-Divergences

OpenReview | PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose to generate loss functions from f-divergences and conduct experiments on language modeling tasks.

Abstract

The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback-Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on $f$-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with $f$-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous $f$-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator, that we dub $f$-softargmax. We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence. On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including on pre-training, post-training (SFT) and distillation. We show that the loss function generated by the $\alpha$-divergence (which is equivalent to Tsallis $\alpha$-negentropy in the case of unit reference measures) with $\alpha=1.5$ performs well across several tasks.
Keywords
loss functions, f-divergences, entropies, Fenchel conjugates

Reviews and Discussion

Review
Rating: 3

This paper proposes a generalization of entropy-based loss functions (such as logistic loss and softmax) by incorporating f-divergences. Specifically, the generalization is formulated using Fenchel-Young duality, where the standard Shannon entropy regularization is replaced with f-entropies. The authors demonstrate that several existing loss functions, including sparsemax and entmax, emerge as special cases of this framework under different choices of f-divergences. Furthermore, the paper provides detailed practical considerations on the computation and differentiation of the proposed loss functions, ensuring their feasibility for large-scale learning tasks. The empirical evaluation on ImageNet and language modeling datasets validates the effectiveness of this approach, with the α-divergence (α = 1.5) achieving the best performance.

Questions for Authors

NA

Claims and Evidence

The abstract claims two primary contributions:

  1. Generalizing Shannon entropy to f-entropy produces loss functions that are advantageous.

This is well supported by theoretical analysis (Fenchel-Young framework) and empirical results (image classification and language modeling).

  2. The generalization allows for non-uniform reference measures, which could be useful.

This is not well addressed. The authors themselves note in line 343 (lhs) that using a non-uniform reference measure did not lead to performance improvements. Although I recognize the possibility of using a non-uniform reference to incorporate prior knowledge, the paper does not include information on how this incorporation could be beneficial.

Given this, I recommend that the authors reconsider making Claim 2 in the abstract or clarify under what conditions non-uniform reference measures might be useful.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are generally appropriate for the problem.

  1. Theoretical insight is strong: The framework unifies prior approaches like sparsemax and entmax under a broader family of entropy-based loss functions.
  2. Practical implementation is well considered: The paper addresses efficient computation of the loss and gradient, making the method feasible for large-scale applications.
  3. Empirical evaluation is sufficient: The experiments on image classification (ImageNet) and language modeling show consistent performance improvements, supporting the claims.

However, there is a key issue that remains unaddressed:

  1. What differentiates members of the f-divergence family?
  • It is unclear why certain f-divergences (e.g., α-divergence with α = 1.5) improve performance, while others (e.g., Jensen-Shannon divergence) degrade it.
  • The paper lacks a theoretical explanation for why some choices lead to better optimization or generalization.
  • A deeper analysis of the effect of different f-divergences on training dynamics or model representations would strengthen the claims.

Theoretical Claims

I reviewed the proofs and did not find any major issues.

Experimental Design and Analysis

The experiments are well designed and support the primary claim that incorporating f-divergence could improve the training process and bring improvements in performance.

Supplementary Material

I read the appendix. There is no supplementary material.

Relation to Prior Literature

The topic of this paper has the potential to impact the broader community of machine learning.

  • It introduces a new family of loss functions under a unified framework based on f-divergences, offering a fresh perspective on entropy-based regularization. Nevertheless, the authors should provide insights into the selection of specific f-divergences.

  • The variational problem under f-divergences is also relevant to automatic implicit differentiation techniques, which are increasingly used in optimization and deep learning. This paper provides a concrete example.

Missing Important References

I suggest including additional references on variational inference techniques using f-divergences. The f-divergence has been extensively studied in variational inference. For instance:

  • Nowozin et al., f-GAN: Training generative neural samplers using variational divergence minimization. NeurIPS 2016.
  • Li & Turner. Rényi divergence variational inference. NeurIPS 2016.

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-written and easy to follow.
  2. The appendix provides substantial details that assist understanding of the paper.

Weaknesses:

  1. Limited comparison beyond f-divergence losses. The proposed loss function is only compared within the f-divergence family, including sparsemax and entmax. How does it perform compared to temperature-scaled softmax or label-smoothed softmax? Including these comparisons could provide insights into why certain f-divergences are more beneficial than others and whether the proposed approach offers advantages beyond the existing methods.

Other Comments or Suggestions

Typos:

  • Line 161 and throughout the paper: Jeffreys (instead of Jeffrey) divergence is the correct spelling.
  • Line 617: Propositon --> Proposition
  • Line 636: coresponding --> corresponding

Author Response

We thank the reviewer for their constructive feedback.

clarify under what conditions non-uniform reference measures might be useful.

Thank you for your suggestion. Non-uniform reference measures may be relevant in classification tasks with known imbalanced classes. We tested the approach directly on pre-training in the Nanodo codebase and did not find any particular gains in that setting.

The paper lacks a theoretical explanation for why some choices lead to better optimization or generalization.

We agree that such insights—knowing when and why a loss outperforms others on specific tasks—would be beneficial. However, this is very challenging. Indeed, even in the case of the standard cross-entropy loss, its induced loss landscape in model parameters and the ease or difficulty of optimizing over it are generally hard to analyze for practically relevant tasks. We believe that, just as the rates of convergence of optimizers have not always dictated deep learning practices, theoretical properties of f-divergence losses may not be reflected in deep learning experiments. For this reason, we prefer to present a methodological approach that showcases the actual performance of different losses, upon which the community can build. Through these experiments, we observed that the alpha divergence with $\alpha=1.5$ appears to provide a good trade-off between the standard KL divergence and a sparsemax loss.

including additional references in variational inference techniques using f-divergences.

Thank you for your suggestions. We will include them.

How does it perform compared to temperature-scaled softmax or label-smoothed softmax? Including these comparisons could provide insights

Regarding temperature, we believe temperature scaling is not useful at training time, as $\beta$ can be absorbed into the logits $\theta$. However, it could indeed be useful at inference time. We will add a remark to make this clarification.

Regarding label-smoothing, we believe it would be interesting to try, though this technique can be applied to any loss and is therefore relatively orthogonal to our work.

In addition to $f$-divergence generated losses, we tried the (multiclass) hinge loss. However, despite trying various learning rates and warm-up, we were unable to make the hinge loss work well on ImageNet. We suspect that the hinge loss, as a non-smooth loss, requires a completely different hyper-parameter tuning. Our observation here aligns with findings from the paper by Zhu et al. (2023) mentioned by Reviewer EvEL, which demonstrates that applying the multiclass hinge loss to Tiny ImageNet results in accuracy only slightly better than random (see their Table 7, where “CS” denotes the Crammer-Singer multiclass hinge loss, which we also used).

[1] Zhu et al, 2023, ICML, Label Distributionally Robust Losses for Multi-class Classification: Consistency, Robustness and Adaptivity.

Typos

Thanks for spotting them. They are now corrected.

Review
Rating: 3

The authors propose a framework including operators (f-softmax, f-softargmax, f-softplus and f-sigmoid) and loss functions generated by f-divergences for multi-class classification. Mathematical derivations and an efficient computation algorithm are provided. The practical performance is demonstrated on ImageNet classification and language model settings.

Update after rebuttal

I thank the authors for their rebuttal, and I'd like to keep my evaluation score at weak accept.

Questions for Authors

Please see previous comments.

Claims and Evidence

The claims are clear and convincing to me.

Methods and Evaluation Criteria

The methods and evaluation criteria make sense to me.

Theoretical Claims

I didn't check details for the proof but the theoretical claims in the main content make sense to me.

Experimental Design and Analysis

The experimental designs and analyses generally make sense to me.

However, it would be better to include performance variance or standard deviation for multiple replicates with different random seeds or data splits.

Supplementary Material

I didn't check the supplementary material.

Relation to Prior Literature

The proposed method could be used for multi-class classification task.

Missing Important References

There is another previous work that shares a partially similar idea of generating loss functions with convex regularization, which can recover the logistic and SVM losses with the KL divergence.

Zhu, Dixian, Yiming Ying, and Tianbao Yang. "Label distributionally robust losses for multi-class classification: Consistency, robustness and adaptivity." International Conference on Machine Learning. PMLR, 2023.

Other Strengths and Weaknesses

Strengths:

  1. The paper's writing and presentation are good.
  2. The motivation is clear and the flow of the mathematical derivations is neat.
  3. The authors ensure the efficiency of the proposed method, which incurs only negligible overhead in practice (Appendix A.3).

Weaknesses:

  1. There is no theoretical insight into why a certain variant ($\alpha$-divergence with $\alpha=1.5$) performs better than others.
  2. The experiments don't include performance variance across multiple replicates (as mentioned before).
  3. The proposed method could be made more convincing by comparing against other classification losses in the literature, such as SVM losses and other CE loss variants. The currently compared baselines are a little limited.

Other Comments or Suggestions

Please see previous comments.

Author Response

We thank the reviewer for their constructive feedback. We hope we have addressed your comments.

it would be better to include performance variance

Thank you for this suggestion. We repeated our experiments with multiple random seeds and reported the standard deviation across independent runs. Specifically, for each $f$-divergence generated loss, we used 5 random seeds, and applied the loss to the ImageNet classification, finetuning, and distillation experiments. The results are below. Overall, the standard deviations of all losses are small, which we believe is because the training datasets are large enough to diminish the effect of different realizations of the random initial parameters.

ImageNet:

| divergence | accuracy mean | accuracy std | accuracy min | accuracy max |
|---|---|---|---|---|
| cs | 0.7604 | 0.0015 | 0.7587 | 0.7621 |
| js | 0.7246 | 0.0014 | 0.723 | 0.7266 |
| squared_hellinger | 0.7281 | 0.0008 | 0.7272 | 0.729 |
| alpha divergence ($\alpha=1.5$) | 0.7758 | 0.0013 | 0.7743 | 0.7776 |
| kl | 0.7684 | 0.0007 | 0.7676 | 0.7692 |

Supervised finetuning experiment:

| divergence | ROUGE-2 mean | ROUGE-2 std | ROUGE-2 min | ROUGE-2 max |
|---|---|---|---|---|
| cs | 11.15 | 0.1 | 11.02 | 11.31 |
| js | 7.95 | 0.08 | 7.87 | 8.07 |
| rcs | 9.55 | 0.1 | 9.44 | 9.7 |
| alpha divergence ($\alpha=1.5$) | 14.27 | 0.04 | 14.2 | 14.32 |
| kl | 9.77 | 0.02 | 9.75 | 9.8 |

Distillation experiment:

| divergence | ROUGE-2 mean | ROUGE-2 std | ROUGE-2 min | ROUGE-2 max |
|---|---|---|---|---|
| cs | 14.17 | 0.09 | 14.01 | 14.26 |
| js | 16.51 | 0.06 | 16.43 | 16.6 |
| rcs | 16.3 | 0.05 | 16.25 | 16.38 |
| alpha divergence ($\alpha=1.5$) | 17.43 | 0.13 | 17.19 | 17.6 |
| kl | 16.64 | 0.05 | 16.57 | 16.71 |

There is another previous work that shares a similar idea by Zhu, Dixian, Yiming Ying, and Tianbao Yang. "Label distributionally robust losses for multi-class classification: Consistency, robustness and adaptivity." International Conference on Machine Learning. PMLR, 2023.

We will add this citation, thank you.

There is no theoretical insight why certain variants (alpha-divergence with alpha = 1.5) perform better than others.

We agree that such insights—knowing when and why a loss outperforms others on specific tasks—would be beneficial. However, this is very challenging. Indeed, even in the case of the standard cross-entropy loss, its induced loss landscape in model parameters and the ease or difficulty of optimizing over it are generally hard to analyze for practically relevant tasks. We believe that, just as the rates of convergence of optimizers have not always dictated deep learning practices, theoretical properties of f-divergence losses may not be reflected in deep learning experiments. For this reason, we prefer to present a methodological approach that showcases the actual performance of different losses, upon which the community can build. Through these experiments, we observed that the alpha divergence with $\alpha=1.5$ appears to provide a good trade-off between the standard KL divergence and a sparsemax loss.

compare to other classification losses in the literature, such as SVM losses and other CE loss variants.

In addition to our $f$-divergence generated losses, we tried the (multiclass) hinge loss. However, despite trying various learning rates and warm-up, we were unable to make the hinge loss work well on ImageNet. We suspect that the hinge loss, as a non-smooth loss, requires a completely different hyper-parameter tuning. Our observation here aligns with findings from the paper of Zhu et al. (2023) you mention, which demonstrates that applying the multiclass hinge loss to Tiny ImageNet results in accuracy only slightly better than random (see their Table 7, where “CS” denotes the Crammer-Singer multiclass hinge loss, which we also used).

[1] Zhu et al, 2023, ICML, Label Distributionally Robust Losses for Multi-class Classification: Consistency, Robustness and Adaptivity.

Review
Rating: 3

The paper proposes using Fenchel-Young losses derived from f-divergences to perform image classification and language model pretraining, fine-tuning, and distillation. The authors present an efficient bisection method for solving for the f-softmax function involved in optimizing the proposed loss. The paper analyzes f-Fenchel-Young losses in the context of image classification and language model pretraining, fine-tuning, and distillation, along with ablations for softmax aggregation of f-trained logits at generation time.

Questions for Authors

What is the novelty compared to the original Fenchel-Young losses? Blondel et al. 2019, "Learning with Fenchel-Young Losses"

  • a similar bisection scheme appears in their Algorithm 1
  • Tsallis 1.5, sparsemax, etc. are discussed. JSD and Hellinger appear to fall under the original framework, even if not stated explicitly or tested empirically
  • generalization to a reference $q$ is a distinction from Blondel et al. 2019 but appears to be known (perhaps up to $0 \in \text{dom}(f^\prime)$)
  • I invite the authors to spell these out for those of us reviewers and ACs less versed in this line of work.
    I will strongly argue for acceptance if this concern is met. I do value the empirical study, especially in language modeling settings.

My remaining questions arise mostly from a genuine interest in the work rather than a critical evaluation informing the review score.

Do the authors have any comment on why modifying the FY loss improves performance in general? (It's ok not to, I often dread this question when deriving generalized losses!)

  • In particular, for SFT/Distillation, one of my hypotheses in reading the paper was that allowing for sparsity might be useful for next-token prediction, where greedy/top-k/nucleus procedures perform well.
  • However, the soft-max decoding from f-softmax logits appears to undermine this hypothesis
  • Do we expect any interesting interaction between sparse-f finetuning/distillation and decoding procedures beyond standard (temperature=1) decoding?

Is the f-soft-argmax invariant to addition of a constant? ($\text{softargmax}_f(\theta + c, q) = \text{softargmax}_f(\theta, q)$)

  • $\text{softmax}_f(\theta + c, q) = \text{softmax}_f(\theta, q) + c$ is stated in Terjek 2021, and I've proven it myself at some point. This corresponds to shifting $\tau^* \rightarrow \tau^* + c$ and appears to not modify the softargmax.
  • This would be a useful property to specify in general.
  • I came to this by thinking whether one might also consider f-softmax aggregation from standard logits (the reverse setting of Lines 353-376R). Although "open-logits" is an uncommon access model, one could recover softmax logits up to a constant from next-token probabilities and use this for f-softmax decoding (if the above holds).

Temperature scaling during training ($\text{softmax}_{\beta f}$) or inference (scaling $\beta$ with given $\theta$) might also be considered (?). Some initial experiment or commentary could serve to highlight this as a direction for exploration for practitioners.

Claims and Evidence

The paper clearly steps through the derivation of the proposed losses using the analogy with the well-known softmax and soft-argmax functions. I have minor questions regarding specific claims in Other Comments below.

Methods and Evaluation Criteria

The methods and evaluation criteria appear sound.

It appears that one challenge of comparing training with different $f$ is that the training losses are different, and thus validation or test losses cannot be easily compared. Thus, the authors use accuracy for classification or next-token prediction, and downstream task (summarization) scores for finetuning and distillation.

Theoretical Claims

I am familiar with Proposition 1 from existing work, and the convergence of the algorithm appears correct.

Experimental Design and Analysis

Experiments appear to be soundly designed. Language model distillation and fine-tuning are particularly active areas of research.

Supplementary Material

I appreciate the care in providing numerically stable implementations of the operations in the Appendix. While I did not investigate, I envision this will be very useful to myself and other researchers!

"Differentiating through ff-softmax" should be detailed in the Appendix for completeness.

Relation to Prior Literature

The representation of the f-softmax in Proposition 1 moves far beyond Wang et al. (2024), Thm 1. Even Eq 13 needs to be rearranged using $(f_*^\prime)^{-1} = f^\prime$ to recover their result. Since this discussion may not be necessary, the citation could also be dropped.

Missing Important References

One might also wonder how to directly optimize the Fenchel-Young loss in Eq 11. It seems that CvxPy Layers [1] might be used, although this optimization over the vocabulary (rather than bisection over a scalar) seems less efficient and convenient.

[1] Agrawal et. al 2019, "Differentiable Convex Optimization Layers"

Other Strengths and Weaknesses

The paper is well-written and comprehensive.

It would be useful to explicitly emphasize novelty (see below), for the review process at the very least.

In any case, the applications to language models are creative and interesting.

Other Comments or Suggestions

Notation for the Loss Function

I initially thought it might be more insightful to write the loss in Eq. 5 as $\min_{\boldsymbol{\theta}} -\langle \boldsymbol{\theta}, \boldsymbol{y} \rangle + \text{softmax}_{\Omega}(\boldsymbol{\theta})$ (i.e. a conjugate optimization), but I also see why the authors need to include this in Eq. 11.

  • The main concern is that $\Omega(\boldsymbol{y})$ is a constant in Eq. 5 and the reader has to parse that this term does not contribute to the optimization (amidst use of $\boldsymbol{p}$ above and $\boldsymbol{q}$ later). Perhaps the authors could add a comment here.
  • However, the payoff is that we can optimize Eq. 11 with respect to $q$.

Minor questions:

m1) Why is the relation between (5) and (6) an upper bound? Is it tight, e.g., for $f$ of Legendre type?

m2) Do the authors have any comment on the role of $q=1$ vs. $q=1/K$? The latter seems more principled. Panel 3 in Fig 10 ($q = 1/K$) resembles Panel 5 of Fig 9 ($q=1$).

Author Response

We thank the reviewer for their constructive feedback and their interest in this work.

m1) Why is the relation between (5) and (6) an upper bound? Is it tight e.g. for f of Legendre type?

There was a typo in these equations: the lower-or-equal sign should be replaced by a greater-or-equal sign. This bound is detailed in Proposition 3 of (Blondel et al., 2020). Indeed, it is tight if $\Omega$ is of Legendre type, that is, if $f$ is of Legendre type on $(0, +\infty)$. This is the case for the KL, reverse KL, and Jensen-Shannon divergences, to cite a few examples, but not for the chi-squared or $\alpha$-divergences, for example.

m2) Do the authors have any comment on the role of q=1 vs. q=1/K? The latter seems more principled. Panel 3 in Fig 10 (q=1/K) resembles Panel 5 of Fig 9 (q=1)

This is an excellent question. $f$-entropies are recovered with $q=\mathbf{1}$, not $q=\mathbf{1}/k$. Using $q=\mathbf{1}$ or $q=\mathbf{1}/k$ can lead to slightly different regularization functions $\Omega$, and losses, due to the fact that $f$ can be non-homogeneous (that is, $f(p/q)\,q \neq f(p)$). Mathematically, the losses differ since

$$\mathrm{softmax}_f(\theta; \mathbf{1}/k) = \alpha \sup_{p \in k \triangle^k} \langle p, \theta \rangle - D_f(p, \mathbf{1}) \neq \alpha \, \mathrm{softmax}_f(\theta; \mathbf{1})$$

where $k \triangle^k = \{ k p,\; p \in \triangle^k\}$ is a scaled simplex. Numerically, we did not observe changes in training curves when using one or the other.

1) What is the novelty compared to the original Fenchel-Young losses? Blondel et. al 2019 "Learning with Fenchel-Young Losses"

Our paper sets out to study Fenchel-Young losses when the regularization $\Omega$ is set to an $f$-divergence, which to our knowledge hadn't been studied before. In doing so, we draw an interesting parallel between entropies already used in Blondel et al. (Shannon, Gini, Tsallis) and $f$-divergences (KL, chi-square, alpha divergences).

Proposition 1 in our paper can be thought of as a generalization of Proposition 9 in Blondel et al. When $q$ is non-uniform, their proposition does not apply, while ours does. When $q$ is uniform, our proposition tackles the case where $f$ is not differentiable at 0. In addition, we prove that there is a unique solution to the root problem on the considered interval, which Blondel et al. did not prove. Our proof does not go through KKT conditions and rather relies on conjugate calculus. We also provide detailed computations in the appendix that take into account some potential numerical instabilities of naive implementations.
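
To make the "root problem" above concrete, here is a minimal, hypothetical sketch of such a scalar bisection in NumPy. It is not the paper's Algorithm 1 (which is described as parallelizable and handles non-differentiability at 0): the fixed-point form, the bracketing interval, and the helper names (`f_prime_inv`, `lo`, `hi`) are our assumptions for illustration. For divergences whose $(f')^{-1}$ can be negative (e.g., sparsemax-like cases), the inverse would additionally be clipped at 0.

```python
import numpy as np

def f_softargmax_bisection(theta, q, f_prime_inv, lo, hi, n_iter=60):
    """Hypothetical sketch (not the paper's Algorithm 1): solve
        max_{p in simplex} <p, theta> - D_f(p, q)
    assuming the optimality condition p_i = q_i * (f')^{-1}(theta_i - tau)
    and bisecting on the scalar dual variable tau until sum_i p_i = 1."""
    theta = np.asarray(theta, dtype=float)
    q = np.asarray(q, dtype=float)

    def mass(tau):
        # Total probability mass implied by a candidate tau; decreasing in tau.
        return np.sum(q * f_prime_inv(theta - tau))

    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        if mass(tau) > 1.0:
            lo = tau   # too much mass: tau must be larger
        else:
            hi = tau   # too little mass: tau must be smaller
    tau = 0.5 * (lo + hi)
    p = q * f_prime_inv(theta - tau)
    return p / p.sum()  # tiny renormalization to absorb bisection tolerance

# Sanity check with the KL divergence, f(t) = t log t - t + 1, so (f')^{-1} = exp:
theta = np.array([2.0, 0.5, -1.0])
q = np.full(3, 1.0 / 3.0)
p = f_softargmax_bisection(theta, q, np.exp, lo=theta.min() - 30.0, hi=theta.max() + 30.0)
print(p)  # matches softmax(theta) for uniform q, up to bisection tolerance
```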

On the empirical side, we demonstrate the proposed losses on tasks of different data modalities, including both vision (ImageNet) and text generation tasks, which cover different training strategies (from scratch, finetuning, and distillation). In addition, we obtained a novel empirical insight: using the classical soft-argmax works well even if we trained with our f-divergence based losses. This suggests that the choice of the loss used at training time, not the choice of the $f$-softargmax used at inference time, impacts accuracy the most. We hope that our results give a good glimpse of the losses’ potential as well as their limitations.

2) Do the authors have any comment on why modifying the FY loss improves performance in general? (It's ok not to, I often dread this question when deriving generalized losses!)

Unfortunately, we do not have a good theoretical answer to this question, see also the answer we provided to reviewers 9x5b, EvEL and aCJJ. Our goal here was first and foremost to provide experimental results on recent tasks (pretraining, fine-tuning, distillation of LLMs) with a methodological approach to build these losses. We hope that such experimental results may help the community understand the relevance of different losses. We thank the reviewer for the numerous avenues they already proposed.

3) Is the f-soft-argmax invariant to addition by a constant?

Yes, this is the case. This comes from the fact that $\langle p, \theta + c \rangle - \Omega(p) = \langle p, \theta \rangle - \Omega(p) + c$ when $p$ belongs to the simplex.
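
Spelling this out (the display below and the notation $c\mathbf{1}$ for the constant vector are ours, not the paper's): for $p \in \triangle^k$ we have $\langle p, c\mathbf{1} \rangle = c$, hence

$$\operatorname*{argmax}_{p \in \triangle^k} \big[ \langle p, \theta + c\mathbf{1} \rangle - \Omega_f(p; q) \big] = \operatorname*{argmax}_{p \in \triangle^k} \big[ \langle p, \theta \rangle - \Omega_f(p; q) \big], \qquad \max_{p \in \triangle^k} \big[ \langle p, \theta + c\mathbf{1} \rangle - \Omega_f(p; q) \big] = c + \max_{p \in \triangle^k} \big[ \langle p, \theta \rangle - \Omega_f(p; q) \big],$$

consistent with the reviewer's observation that $\mathrm{softmax}_f$ shifts by $c$ while $\mathrm{softargmax}_f$ is unchanged.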

4) Temperature scaling during training or inference might also be considered (?).

We believe temperature scaling is not useful at training time, as $\beta$ can be absorbed into the logits $\theta$. However, it could indeed be useful at inference time. We will add a remark to make this clarification.
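
A sketch of the absorption argument (our interpretation, assuming the temperature enters as a multiplier $\beta > 0$ on the regularizer, i.e. $\beta\,\Omega_f$):

$$\operatorname*{argmax}_{p \in \triangle^k} \big[ \langle p, \theta \rangle - \beta\, \Omega_f(p; q) \big] = \operatorname*{argmax}_{p \in \triangle^k} \big[ \langle p, \theta/\beta \rangle - \Omega_f(p; q) \big] = \mathrm{softargmax}_f(\theta/\beta;\, q),$$

so a model that outputs unconstrained logits $\theta$ can learn $\beta\,\theta$ directly, which is why a fixed training-time temperature adds no expressive power.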

Reviewer Comment

Thanks to the authors for the detailed reply. I maintain my score due to the borderline question of novelty compared to Fenchel-Young losses. While I believe choosing the special case of $f$-divergences and considering the case of non-uniform $q$ (discarded in the empirical study) are relatively minor contributions, I appreciate the language model experiments and the technical care to encompass a large class of divergences.

Review
Rating: 3

The paper investigates a general framework for generating loss functions using f-divergences, extending the well-known logistic loss (cross-entropy). It introduces a new set of operators, namely f-softmax and f-softargmax, and develops a novel bisection algorithm for computing them. The experimental results focus on evaluating the effectiveness of these loss functions in image classification (ImageNet) and language modeling, including pretraining, supervised fine-tuning (SFT), and distillation.

Questions for Authors

  1. What is the underlying (even fundamental) reason why a particular case of f-divergence provides some gains (not significant ones though)? Does it have to do with the properties it follows? (in terms of (a)symmetry, or satisfying the partition inequality, etc.).

  2. How interesting are such loss functions in state-of-the-art systems in language processing? Can you report some gains or interesting results on networks based on attention using different (more complex) loss functions?

  3. Do the so-called new entropies satisfy the axiomatic formulation of entropy? What are their properties and their interest?

Claims and Evidence

The paper is very clear in its content, and all claims are clearly articulated. However, not all claims are justified, and some are a bit overstated. Generalizing or extending loss functions by replacing the KL divergence with the class of f-divergences is a well-studied topic, with numerous contributions (and techniques), and it is a fairly standard procedure. There is no novelty in that aspect here. Note that the generalization of the logistic loss by replacing the KL divergence with more general f-divergences has been done at least in the Bayesian setting (see work by Andrea Tonello and others). Works like Nguyen et al. (2009), Martins & Astudillo (2016), and Go et al. (2023) already explored similar directions. This paper builds on Fenchel–Young losses rather than fundamentally innovating on them, and the use of Fenchel–Young losses (Blondel et al., 2020) as a general framework was already well established. The new "entropies" cannot be claimed to be entropies simply because they are generated in the same way as the KL or Tsallis divergence. Do they satisfy the axioms of information measures?

Methods and Evaluation Criteria

The potential benefits or interest of the proposed loss functions are demonstrated on image classification (ImageNet), language model post-training, and distillation. The experimental study is rather limited, failing to test these ideas on other SOTA architectures, so how general could the results and the claims be?

Theoretical Claims

Yes, the theoretical claims have been checked. The study rigorously defines a new class of operators that generalize the softmax; this is correctly done, although there is no technical difficulty. The convexity and differentiability analyses are also correct, based on the framework of Fenchel–Young losses.

Experimental Design and Analysis

The experiments are valid and correct, but they are limited and not representative, and the gains are not very pronounced or significant (nor are they sufficiently supported by technical explanations and justifications of why certain things work and others do not). Stating the accuracy without the variance makes it hard to see whether the proposed loss functions outperform others on ImageNet. Also, while α = 1.5 performs well, fine-tuning across different datasets and architectures might require additional adjustments (the parameter sensitivity needs to be further studied).

Supplementary Material

There is no supplementary material per se, but the paper has a long appendix containing well-known results on f-divergences as well as proofs of the main theoretical results.

Relation to Prior Literature

The use of f-divergences in loss functions is already well studied and extensions beyond KL for cross-entropy have been done before. Many ideas here are reformulations or extensions rather than fundamentally new insights.

The theoretical part, in terms of defining f-divergences, the new entropies, etc. is well established in the information theoretic community and there are many, more general results, alternatives (Sharma-Mittal), etc.

Missing Important References

Although the literature on this topic is very large, there are no essential references missing, apart from examples where f-divergences have been used in loss functions (specifically in the cross-entropy), such as "f-Divergence Based Classification: Beyond the Use of Cross-Entropy" by Novello and Tonello, and https://arxiv.org/pdf/2501.18537v1. A clearer comparison highlighting the differences between this work and prior approaches should be provided.

Other Strengths and Weaknesses

Strengths:

  • The authors explore the potential gains by extending KL divergence in cross-entropic loss functions.
  • There is an attempt to provide some analytical results (not always new or of technical depth though).
  • The algorithm seems to be relevant in practice as its implementation cost seems to be low. The main computational contribution is the bisection method for efficiently solving the f-softargmax transformation.

Weaknesses:

  • Limited novelty, both in terms of f-divergence results (per se) and in terms of extending loss functions (including the cross-entropy) using f-divergences. There is a rich literature that has explored the benefit of replacing the KL divergence with f-divergences.
  • The gains are not important, and there is no theoretical/analytical or even experimental justification of why a particular value of the α-divergence (as defined by the Tsallis entropy) provides gains (but not some other widely used/well-known cases of this rich family of divergences).

Overall, a well-structured, incremental improvement on f-divergence-based losses, but not an important (or groundbreaking) contribution in theory (and practice).

Other Comments or Suggestions

Nothing major to comment.

Author Response

We thank the reviewer for their feedback.

The experimental study is rather limited

We disagree that our experimental study is limited. We evaluate the proposed losses on tasks of different data modalities, including both vision (ImageNet) and text generation tasks, which cover different training strategies (from scratch, finetuning, and distillation). All other reviewers appreciated the experimental efforts ("experimental designs [...] make sense to me", "experiments are well designed", "Experiments appear to be soundly designed").

accuracy without variance

Thank you for this suggestion. Please see the results in our response to reviewer EvEL, where we report the mean, standard deviation (std), minimum (min), and maximum (max) across 5 independent runs for each divergence in the ImageNet, SFT, and distillation experiments. To summarize, we found that the standard deviation is low.

underlying reason why a particular case of f-divergence provides some gains

We agree that such insights—knowing when and why a loss outperforms others on specific tasks—would be beneficial. However, this is very challenging. Indeed, even in the case of the standard cross-entropy loss, its induced loss landscape in model parameters and the ease or difficulty of optimizing over it are generally hard to analyze for practically relevant tasks. We believe that, just as the convergence rates of optimizers have not always dictated deep learning practices, theoretical properties of f-divergence losses may not be reflected in deep learning experiments. For this reason, we prefer to present a methodological approach that showcases the actual performance of different losses, upon which the community can build. Through these experiments, we observed that the alpha divergence with $\alpha=1.5$ appears to provide a good trade-off between the standard KL divergence and a sparsemax loss.

This paper builds on Fenchel–Young losses rather than fundamentally innovating

At the beginning of Section 3, we clearly state “In this paper, we propose to study Fenchel–Young losses and associated operators when the regularizer is defined as $\Omega_f(p; q) = D_f(p, q)$". We believe this hasn’t been done before.
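
For readers less familiar with the framework, the generic Fenchel-Young construction from Blondel et al. (2020) that this choice plugs into is (our paraphrase, not a quote of the paper's Eq. 11; the paper's exact domain and normalization conventions may differ):

$$L_{\Omega}(\theta; y) = \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle, \qquad \Omega^*(\theta) = \sup_{p \in \triangle^k} \big[ \langle p, \theta \rangle - \Omega(p) \big],$$

with $\Omega(\cdot) = \Omega_f(\cdot\,; q) = D_f(\cdot, q)$, so that $\Omega^*$ plays the role of the $f$-softmax and its gradient plays the role of the $f$-softargmax. As a consistency check, taking $D_f$ to be the KL divergence with a uniform reference $q = \mathbf{1}/k$ recovers the logistic loss (the $\log k$ constants cancel) and the usual softargmax.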

The use of f-divergences in loss functions is already well studied

As we cover in our ample related work section, there are indeed existing works using $f$-divergences to derive loss functions, such as Nguyen et al. (2009), as you mention. However, these works do not use the same mathematical formulation: they are not based on Fenchel-Young losses. We are the first to study $f$-divergences as the regularizer in Fenchel-Young losses.

"f-Divergence Based Classification: Beyond the Use of Cross-Entropy" Novello, Tonello

We will add this citation, thank you.

How interesting are such loss functions in state-of-the-art systems in language processing?

Our pretraining experiments use a model with 1.2 billion parameters. While state-of-the-art models often use a larger number of parameters, we argue that this is a size where models already start to be very useful. In fact, many organizations have a 1 billion-parameter model in their offering (e.g., Gemma, Mistral, etc). In addition, our training pipeline relies on state-of-the-art components and training recipes. Our pretraining experiments are implemented with nanodo [1], we use modern decoder-only transformers with rotary embeddings and QK layer normalization, and train them on the large-scale C4 corpus, following the empirical scaling laws found in prior work [2]. This approach allows us to explore the impact of our loss functions within a framework that reflects current practices. To summarize, we argue that our experiments are conducted in a real-world setting and are therefore informative.

[1] Peter Liu, et al. NanoDO: A minimal Transformer decoder-only language model implementation in JAX. http://github.com/google-deepmind/nanodo, 2024.

[2] Mitchell Wortsman, et al. "Small-scale proxies for large-scale transformer training instabilities." ICLR, 2024.

Can you report some gains on networks based on attention using different (more complex) loss functions?

We are not sure if we understand the question correctly. While it would be possible in principle, our work does not explore using f-softargmax as attention layers (i.e. intermediate layers). Our work focuses on loss functions (i.e. output layers).

Do the so-called new entropies satisfy the axiomatic formulation of entropy? What are their properties and their interest?

An $f$-divergence gives rise to a well-defined negative $f$-entropy (Cichocki & Amari, 2010) if $f$ is strictly convex (Blondel et al., 2020, Section 4.1). These entropies have different sensitivities to changes in the probability of an event, which we visualize in Figure 1.
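
For concreteness (our paraphrase, using the standard convention $D_f(p, q) = \sum_i q_i\, f(p_i/q_i)$; the paper's exact normalization of $f$ may differ):

$$\Omega_f(p; \mathbf{1}) = D_f(p, \mathbf{1}) = \sum_{i=1}^{k} f(p_i),$$

so with the unit reference measure the regularizer is a separable sum over coordinates: $f(t) = t \log t$ gives the Shannon negentropy, and the $\alpha$-divergence generator gives the Tsallis $\alpha$-negentropy mentioned in the abstract (up to constants that are immaterial on the simplex).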

Reviewer Comment

Thank you for your response, your constructive approach, and the clarifications provided.

The authors consider a specific case of Fenchel-Young losses using f-divergences, and indeed, this particular combination has not been previously studied within this framework, although related work leveraging the same framework does exist (e.g., https://arxiv.org/pdf/1901.02324, https://openreview.net/pdf?id=7Dep87TMJs). It is worth noting that f-divergences themselves have been previously used as loss functions in other contexts. Thus, the paper may be of interest to a community focused on Fenchel-Young losses and their generalizations.

A key limitation, however, lies in the inability to interpret or explain why only a specific form of α-divergence (in addition to KL) appears to provide performance gains and we appreciate the acknowledgment from the authors. This remains an elusive and unexplored aspect of the work and stands as its primary weakness. That said, since the paper’s main focus appears to be computational and dealing with loss functions that interpolate between softmax and sparsemax, this limitation could be somewhat mitigated.

One possible explanation for the observed behavior may lie in the interesting and relatively unique information-geometric properties of Amari’s α-divergence. This divergence, while part of the f-divergence family, sits at the intersection of f- and Bregman divergences and aligns well with the underlying geometry of the statistical manifold. This presents an intriguing direction for future research, and the present work can be viewed as a first, empirical indication of the potential of such divergences in this context. A small clarification would be helpful in this regard: when referring to “α-divergence,” the authors should specify that they are using Amari’s formulation (and not even the original α-divergence, in Amari’s paper or as introduced by Chernoff previously, which has a different scaling), in order to avoid confusion with other α-type divergences, such as Rényi’s, which does not belong to the f-divergence class.

Regarding the comment on the proposed entropies and Figure 1, it is important to acknowledge that there is already a rich body of literature on the topic (e.g., [Ben-Bassat, 1978], https://link.springer.com/chapter/10.1007/978-3-031-68208-7_5, early work by Daróczy in the 1970s, or more recent efforts to define Jensen–Shannon-type entropies and others, https://arxiv.org/abs/2106.14874). While defining entropies based on divergences (as with KL or Rényi) is a common approach, such definitions do not always yield valid uncertainty measures under standard axiomatic frameworks. It would be beneficial for the authors to either acknowledge this limitation or provide a more rigorous justification for their formulation. Specifically, it would be worth discussing whether the proposed entropies satisfy any of the well-established axioms of uncertainty measures, such as those proposed by Faddeev, Khinchin, or more recent formulations (e.g., https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.111.230401). This is ultimately a minor issue, as the main focus of the paper lies elsewhere. However, it would strengthen the presentation to either tone down these claims or indicate that a formal axiomatic analysis is left for future work.

Given the clarifications and after a careful reconsideration of the paper’s focus and contribution, the recommendation will be reconsidered.

Final Decision

This paper studies Fenchel-Young losses when the regularizer is defined as an $f$-divergence. The framework recovers previously studied losses (cross-entropy, sparsemax, $\alpha$-entmax) and leads to new ones, potentially using non-uniform reference measures.

Reviewers highlighted the clarity of the paper and the solid experimental design as the main strengths of this work. They pointed out as weaknesses the limited novelty -- many ideas here are reformulations or extensions rather than fundamentally new insights, and the best-performing loss setting corresponds to a previously studied loss. Yet, the framework (the combination of Fenchel-Young losses and $f$-divergences) appears never to have been studied before in its full generality, and the proposed Algorithm 1 is also new, which makes me lean towards acceptance.