PaperHub
Score: 6.6 / 10 · Poster · 4 reviewers (lowest 3, highest 4, std. dev. 0.5)
Individual ratings: 3, 4, 3, 4
ICML 2025

Theoretical Performance Guarantees for Partial Domain Adaptation via Partial Optimal Transport

Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
Partial Domain Adaptation, Optimal Transport, Generalization Bounds

Reviews and Discussion

Review (Rating: 3)

The paper studies the problem of Partial Domain Adaptation (PDA), where the target label space is a subset of the source label space. The authors propose a theoretically grounded approach based on Partial Optimal Transport (POT) to tackle PDA, deriving generalization bounds that justify the use of the Partial Wasserstein Distance (PWD) as a domain alignment term. These bounds also provide explicit weight formulations for the empirical source loss, distinguishing their method from prior heuristic weighting strategies. The authors introduce WARMPOT, an algorithm that leverages the derived bounds to optimize domain adaptation performance. They validate WARMPOT through extensive numerical experiments, demonstrating competitive results compared to state-of-the-art (SOTA) methods. The paper claims that their weighting strategy improves upon existing approaches and provides better theoretical justification for PDA methods.

Questions for Authors

Given that solving the partial optimal transport problem is computationally expensive, how does WARMPOT compare in training time to previous approaches?

Claims and Evidence

  • WARMPOT is theoretically justified – the paper derives generalization bounds that incorporate PWD as a domain alignment term and explicitly define source loss weights. Evidence: Theoretical analysis and derivations in Section 3.

  • WARMPOT outperforms existing PDA methods – The proposed algorithm achieves better accuracy compared to methods like MPOT, PWAN, and ARPM. Evidence: Empirical results in Section 5.4, showing improved performance on the Office-Home dataset.

  • The weighting strategy is more effective than existing ones – WARMPOT's weights successfully reduce the influence of outlier classes, improving adaptation. Evidence: Table 1 shows that the weighting strategy leads to better classification accuracy and mitigates negative transfer.

  • Most claims are well-supported by theoretical arguments and experimental validation. However, additional sensitivity analyses on weight selection would further strengthen the claims.

Methods and Evaluation Criteria

The authors adopt the Office-Home dataset as the main benchmark, comparing WARMPOT against existing PDA algorithms. Evaluation focuses on classification accuracy across different domain shifts. The Partial Wasserstein Distance is used to measure alignment between source and target distributions. The chosen dataset and metrics are well-aligned with the problem, and comparisons with recent PDA approaches (e.g., MPOT, ARPM) provide a meaningful assessment.

Theoretical Claims

The paper presents theoretical results regarding generalization bounds for PDA, particularly:

  • Feature-based bound (Theorem 3.2) – Establishes a bound on the target loss using PWD between empirical feature distributions.

  • Joint distribution-based bound (Theorem 3.3) – Extends the previous bound to account for target labels.
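Schematically, and only as our paraphrase of how the reviews describe these results (the exact constants, weights, and residual terms are stated in the paper), both bounds take the form of a weighted empirical source loss plus a partial Wasserstein alignment term plus residual terms:

```latex
% Schematic form only -- not the paper's exact statement.
\mathcal{L}_{T}(f)
  \;\le\; \sum_{i=1}^{n_s} p_i \,\ell\!\left(f(x_i^{s}),\, y_i^{s}\right)
  \;+\; c \,\mathbb{PW}_{\alpha}\!\left(\widehat{P},\, \widehat{Q}\right)
  \;+\; \text{residual terms (e.g., } L_f,\ \tilde{L}_f,\ \Xi\text{)}.
```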

Experimental Design and Analysis

The experimental design includes comparisons on a standard PDA dataset, tuning of the key hyperparameters $\alpha$ and $\beta$, and ablation studies on weighting strategies. A notable strength is the comparison against both heuristic and theoretically motivated weighting methods, providing insight into the benefits of WARMPOT's approach. However, additional statistical significance tests or error bars would improve confidence in the reported results.

Supplementary Material

The supplementary material provides useful clarifications, especially regarding implementation details and theoretical derivations.

Relation to Prior Literature

The paper builds upon prior work in domain adaptation, optimal transport, and partial optimal transport (POT). The paper proposes a theoretically grounded approach based on Partial Optimal Transport to tackle PDA, deriving generalization bounds that justify the use of the Partial Wasserstein Distance (PWD) as a domain alignment term.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Some weaknesses:

  • Limited exploration of additional datasets beyond Office-Home.

  • Lack of error bars to support experimental results.

Other Comments or Suggestions

No.

Author Response

We thank the reviewer for their comprehensive evaluation and helpful suggestions.

Questions for authors: Given that solving the partial optimal transport problem is computationally expensive, how does WARMPOT compare in training time to previous approaches?

Indeed, solving the optimal transport problem is generally computationally expensive. One common way to circumvent this problem is via mini-batch approaches. We follow this approach. Specifically, in the numerical experiments reported in the paper, we use the mini-batch partial optimal transport framework proposed in Improving mini-batch optimal transport via partial transportation by Nguyen et al. (2022). The best-performing algorithms in the literature, including PWAN and ARPM, also require one to solve optimal transport problems as part of their algorithms.

We would also highlight that different weighting strategies come with different computational costs. In WARMPOT, the weights, which need to be computed per mini-batch, are obtained directly from the solution of the partial optimal transport problem without any additional overhead. In contrast, a weight update in both the BA3US and the ARPM weighting strategies involves the entire dataset. Additionally, we see from the tables inserted below (see the final comment for further details) that there is a trade-off between performance and computational cost for these weighting strategies. Such a trade-off is not present in WARMPOT.
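To illustrate how such weights can be read off a transport plan, below is a minimal sketch (our own, not the authors' implementation) using the exact partial OT solver from the POT library; the squared-Euclidean ground cost, the marginal scaling, and all variable names are assumptions made for illustration only.

```python
import numpy as np
import ot  # Python Optimal Transport (POT) library


def minibatch_pot_weights(feat_s, feat_t, alpha=1.0, beta=0.5):
    """Sketch: per-sample source weights from a mini-batch partial OT plan.

    feat_s: (n_s, d) source features of the mini-batch
    feat_t: (n_t, d) target features of the mini-batch
    alpha:  total mass to transport
    beta:   source-marginal scaling (the scaled source mass is 1/beta >= 1)
    """
    n_s, n_t = feat_s.shape[0], feat_t.shape[0]
    a = np.full(n_s, 1.0 / (beta * n_s))   # scaled source marginal
    b = np.full(n_t, 1.0 / n_t)            # uniform target marginal
    # Ground cost: squared Euclidean distance between features (an assumption here).
    M = ot.dist(feat_s, feat_t, metric="sqeuclidean")
    # Partial transport plan moving a total mass of `alpha`.
    plan = ot.partial.partial_wasserstein(a, b, M, m=alpha)
    # Per-sample source weights are the row marginals of the plan;
    # source outliers should receive (close to) zero mass.
    p = plan.sum(axis=1)
    return p / p.sum()  # normalize so the weighted source loss is an average
```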

Claims and Evidence: Additional sensitivity analyses on weight selection would further strengthen the claims.

We have added sensitivity analyses for the $\alpha_{max}$ and $\beta$ parameters on ImageNet → Caltech (see response to QCqK).

Weakness 1: Limited exploration of additional datasets beyond Office-Home.

We have added results for ImageNet → Caltech (see response to 9cQ1 for details).

Weakness 2: Lack of error bars to support experimental results.

In order to address this point, we have re-run our experiments using additional random seeds for OfficeHome (6 seeds) and computed the average and standard deviation. We have also conducted a weighting-scheme comparison on ImageNet → Caltech using 3 random seeds. In the tables shown below, we considered for BA3US and ARPM the weight update interval (in epochs) indicated in parentheses. We see from the tables that the performance of BA3US depends on the update interval (the smaller the interval, the better the performance). Gu et al. (2024) recommend using 500 and 2000 as the update interval for ARPM on OfficeHome and ImageNet → Caltech, respectively. Adopting the same update intervals for BA3US, to maintain a fair comparison, we see that, as claimed in the paper, WARMPOT results in better performance than MPOT and ARPM(500), and yields performance comparable to BA3US(500) (overlapping confidence intervals) on both datasets.

Also, the confidence intervals of the best performing algorithms in Table 2 reported in the paper (ARPM and ARPM+our weights) overlap.

| Weighting scheme | BA3US(100) weights | BA3US(500) weights | BA3US(750) weights | WARMPOT (ours) | MPOT weights | ARPM(500) weights |
| --- | --- | --- | --- | --- | --- | --- |
| Avg. test acc. on OfficeHome | 78.1 (0.4) | 77.6 (0.4) | 77.4 (0.4) | 77.6 (0.7) | 76.0 (0.4) | 72.9 (0.3) |

| Weighting scheme | BA3US(750) weights | BA3US(1500) weights | BA3US(2000) weights | BA3US(5000) weights | WARMPOT (ours) | MPOT weights | ARPM(2000) weights |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test acc. on ImageNet → Caltech | 86.1 (0.6) | 85.0 (0.2) | 84.7 (0.7) | 83.1 (1.4) | 84.8 (0.1) | 78.6 (1.2) | 79.2 (1.4) |
Reviewer Comment

I thank the authors for providing the responses. I will keep my initial score due to my lack of expertise.

Review (Rating: 4)

This paper deals with Partial Domain Adaptation (PDA), a setting where source and target domain distributions differ, and where the target domain label space is a subset of the source domain label space. The authors propose to tackle this important problem through Optimal Transport (OT), an established field of mathematics that has previously contributed to domain adaptation in general. The authors use this framework not only to propose new algorithms, but also to provide theoretical generalization bounds.


Post-rebuttal: The authors did a good job with their rebuttal; in particular, they provided new experiments on a large-scale adaptation task. Overall, this is a good paper with strong theoretical contributions and convincing experiments. Hence, my final score is 4 (Accept).

Questions for Authors

N/A. See the previous section.

Claims and Evidence

Here is a list of high-level claims made in the introduction:

  1. "we provide theoretically motivated algorithms for PDA"
  2. "we derive generalization bounds on the target population loss and devise training strategies that minimize them"
  3. "our bounds give rise to weights that, when combined with the ARPM algorithm of Gu et al. (2024) lead to SOTA results for the Office-Home data set."

Claims 1 and 2 are well supported by the theoretical parts of the paper. While claim 3 is also true, it is fairly limited in scope; see my comments in the next section.

Methods and Evaluation Criteria

On this point, the paper falls short of the acceptance criteria. While the authors provide a comprehensive comparison with the state-of-the-art, they do so on a single benchmark, i.e., the Office-Home benchmark. For instance, other papers such as (Gu et al., 2024) and (Cao et al., 2018) have considered the following (on top of Office-Home):

  • Office-31
  • ImageNet → Caltech
  • VisDA2017 (Real → Synthetic and Synthetic → Real)

These provide more thorough comparisons. In my view, the authors should complete their experiments with at least one other benchmark (preferably ImageNet → Caltech or VisDA2017, which are more complex and large-scale than Office-31).

Theoretical Claims

The authors provide two theorems, alongside one lemma and two corollaries. Overall, they provide generalization bounds in terms of the partial Wasserstein distance between the distributions of extracted features. These generalization bounds are novel and in line with previous research on domain adaptation theory.

I checked the appendix provided by the authors, and, as far as my knowledge goes, the proofs of Theorems 3.2 and 3.3 are correct. I did not check the proof of Lemma 3.4.

Experimental Design and Analysis

The experimental design and analysis are in line with current domain adaptation practice and are, in this regard, correct. The authors could have explored more benchmarks to validate their method, as I highlighted in Methods and Evaluation Criteria.

Supplementary Material

I reviewed most of the appendices, which are good. The proofs are clear and easy to follow.

Relation to Prior Literature

The current paper goes in a similar direction to previous papers (Fatras et al., 2021; Khai et al., 2022) that propose Optimal Transport techniques for partial domain adaptation. An important feature of this paper is that the authors provide generalization bounds in terms of the partial Wasserstein distance.

(Fatras et al., 2021) Fatras, Kilian, et al. "Unbalanced minibatch optimal transport; applications to domain adaptation." International Conference on Machine Learning. PMLR, 2021.

(Khai et al., 2022) Nguyen, Khai, et al. "Improving mini-batch optimal transport via partial transportation." International Conference on Machine Learning. PMLR, 2022.

Essential References Not Discussed

The authors do a good job summarizing partial Optimal Transport. However, this is not the only way of tackling partial DA. In particular, the authors neither discuss nor compare against the use of unbalanced Optimal Transport (Fatras et al., 2021) for partial domain adaptation.

(Fatras et al., 2021) Fatras, Kilian, et al. "Unbalanced minibatch optimal transport; applications to domain adaptation." International Conference on Machine Learning. PMLR, 2021.

Other Strengths and Weaknesses

Here I give a summary of strengths and weaknesses. Please use this list when writing your rebuttal. If the authors provide a rebuttal that answers the following weaknesses, I will raise my score accordingly.

Strengths

  1. Sound theoretical analysis with novel results for Partial DA
  2. New algorithm with promising results on Office-Home benchmark

Weaknesses

  1. Most importantly, the empirical evaluation of this paper is very limited. The authors should complete their empirical validation with other benchmarks in Partial DA.

Other Comments or Suggestions

Comment 1. The term $L_f$ looks like $\lambda$ in (Redko et al., 2017) and other theoretical DA works. Could the authors provide additional discussion on the potential similarities?

Comment 2. The cost in (10) looks a lot like the joint cost proposed in (Courty et al., 2017; reference in the main paper). I think the authors could add some discussion about the similarities as well. This discussion could also make links to the feature importance factor $\xi\gamma$ weighting the features in the ground cost.

Comment 3. Given the practical applications of their work, I think the authors could comment on the restrictiveness of their hypotheses. For instance, they assume

  1. The encoder $g$ is $\gamma$-Lipschitz
  2. The label loss function is a distance and is $\xi$-Lipschitz

In common DA practice, neither of these two hypotheses is met. For instance, since WGANs, we know that enforcing $\gamma$-Lipschitzness on neural nets is tricky. Furthermore, neural nets are often trained with the cross-entropy loss, which is not a metric on $\mathcal{Y}$.

(Redko et al., 2017) Redko, Ievgen, Amaury Habrard, and Marc Sebban. "Theoretical analysis of domain adaptation with optimal transport." Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18–22, 2017, Proceedings, Part II 10. Springer International Publishing, 2017.

Author Response

We thank the reviewer for their careful reading and helpful comments.

Methods and Weaknesses: While the authors provide a comprehensive comparison with the state-of-the-art, they do so on a single benchmark, i.e., the Office-Home benchmark. In my view the authors should complete their experiments with at least one other benchmark (preferably ImageNet → Caltech or VisDA2017, which are more complex and large-scale than Office-31).

We have now benchmarked our algorithm on ImageNet → Caltech. The test accuracy of WARMPOT on this data set is 84.8%, while ARPM+our weights achieves 85.1%. For reference, the test accuracy of ARPM is 84.1%. Hence, this additional experiment provides further indication that WARMPOT, and specifically its weights, provides a robust approach for PDA tasks.

| Algorithm | ImageNet → Caltech |
| --- | --- |
| ResNet-50 | 69.7 |
| DAN | 71.3 |
| DANN | 70.8 |
| IWAN | 78.1 |
| PADA | 75.0 |
| ETN | 83.2 |
| DRCN | 75.3 |
| BA3US | 84.0 |
| ISRA+BA3US | 85.3 |
| SLM | 82.3 |
| SAN++ | 83.3 |
| AR | 85.4 (0.2) |
| ARPM | 84.1 (1.4) |
| PWAN | 86.0 (0.5) |
| WARMPOT (ours) | 84.8 (0.1) |
| ARPM+our-weights | 85.1 (0.9) |

Essential References: The authors do a good job summarizing partial Optimal Transport. However, this is not the only way of tackling partial DA. In particular, the authors neither discuss nor compare against the use of unbalanced Optimal Transport (Fatras et al., 2021) for partial domain adaptation.

In Table 4 of Nguyen et al. (2022), a comparison is made between their MPOT approach and a mini-batch unbalanced OT (UOT) approach. The results indicate that the POT-based approach works better. We have revised the paper to extend the discussion of UOT approaches and include numerical comparisons.

Comment 1: The term $L_f$ looks like $\lambda$ in (Redko et al., 2017) and other theoretical DA works. Could the authors provide additional discussion on the potential similarities?

$L_f$ indeed plays a similar role as, e.g., $\lambda$ in Redko et al. (2017) and admits the same interpretation, in the sense that it is related to the difficulty of the domain adaptation problem. However, while $\lambda$ is the smallest achievable sum of population losses, $L_f$ is the smallest achievable loss maximized over the empirical source and target sets. Meanwhile, $\tilde{L}_f$ relates to the minimal achievable target loss, and $\Xi$ measures the performance gap between considering source and target tasks jointly or separately.

Comment 2: The cost in (10) looks a lot like the joint cost proposed in (Courty et al., 2017; reference in the main paper). I think the authors could add some discussion about the similarities as well. This discussion could also make links to the feature importance factor $\zeta\gamma$ weighting the features in the ground cost.

The cost in (10) is indeed the same as the one proposed in Courty et al. (2017). We will make this point clearer in the revised version of the manuscript. From a theoretical perspective, the factor $\zeta\gamma$ (which appears both in our cost and in the one of Courty et al. (2017)) corresponds to Lipschitz parameters that are generally not available (see below). Hence, from a practical point of view, we treat it as a hyperparameter in our algorithm.
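As a concrete illustration (our schematic, not the paper's code), such a joint ground cost over a mini-batch can be assembled as a feature term scaled by the factor $\zeta\gamma$ (treated as a hyperparameter, as noted above) plus a label term; the specific Euclidean feature distance and L1 label cost used here are assumptions for illustration.

```python
import numpy as np
import ot  # Python Optimal Transport (POT) library


def joint_ground_cost(feat_s, labels_s, feat_t, pred_t, zeta_gamma=1.0):
    """Schematic joint (feature + label) ground-cost matrix of shape (n_s, n_t).

    feat_s:   (n_s, d) source features;  feat_t: (n_t, d) target features
    labels_s: (n_s, K) one-hot source labels
    pred_t:   (n_t, K) predicted target label distributions
    zeta_gamma: Lipschitz-derived factor, treated as a hyperparameter in practice
    """
    feature_cost = ot.dist(feat_s, feat_t, metric="euclidean")
    # Label cost: L1 distance between source labels and target predictions
    # (chosen for illustration; the paper's label loss may differ).
    label_cost = np.abs(labels_s[:, None, :] - pred_t[None, :, :]).sum(axis=2)
    return zeta_gamma * feature_cost + label_cost
```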

Comment 3: Given the practical applications of their work, I think the authors could comment on the restrictiveness of their hypotheses. For instance, they assume 1) the encoder $g$ is $\gamma$-Lipschitz, 2) the label loss function is a distance and is $\zeta$-Lipschitz. In common DA practice, neither of these two hypotheses is met. For instance, since WGANs, we know that enforcing $\gamma$-Lipschitzness on neural nets is tricky. Furthermore, neural nets are often trained with the cross-entropy loss, which is not a metric on $\mathcal{Y}$.

Indeed, the Lipschitz assumptions required for our theoretical results are often not satisfied for loss functions used to train neural nets in practice. Such Lipschitz assumptions are commonly invoked in the literature to obtain generalization bounds that depend on Wasserstein distances. Mathematically, the Lipschitz assumption allows one to relate the loss function to the cost appearing in the Wasserstein metric (see Lemma A.4). Note that, in our numerical experiments, we consider loss functions that do not necessarily conform to the theoretical assumptions. The fact that we still observe good performance indicates that these assumptions are not critical for the general insights obtained from our theoretical results to hold.

Reviewer Comment

Thank you for your rebuttal.

I consider that my questions have been correctly addressed, and the new experiments are convincing. I will raise my score accordingly, i.e., from 3 (Weak Accept) to 4 (Accept).

Other than that, I have one small remark. The authors should be careful in their claims, especially when saying

The fact that we still observe good performance indicates that these assumptions are not critical for the general insights obtained from our theoretical results to hold.

While I do agree that the regularity assumptions for the loss function may not be necessary for good empirical performance, good performance does not count as evidence that the theoretical results hold, especially as there may be other reasons why the method works in practice.

Author Comment

We thank the reviewer for their thoughtful feedback and for considering our responses convincing. We also appreciate the updated evaluation. Note, though, that the score has not been updated in the system yet.

We agree with the reviewer’s remark regarding the following statement in our response:

The fact that we still observe good performance indicates that these assumptions are not critical for the general insights obtained from our theoretical results to hold.

We will make no claims of this sort in the revised version of our paper.

Review (Rating: 3)

This submission studies generalization bounds and an empirical model for the partial domain adaptation problem, where the label spaces across domains are different. The key idea of this paper is to use the weights deduced from the partial transportation mass, which are claimed to be able to filter out the outliers (which are redundant for the target domain). A theoretical bound is provided to show that the target error can be upper-bounded by a weighted source risk, a partial Wasserstein discrepancy, an estimation bias on the target distribution, and an (intractable) worst-case error term. The proposed method is compared with other SOTA PDA and OT methods on a PDA dataset.

Questions for Authors

I would like the authors to address my concerns on the claims and missing references, which are detailed in the Claims and Evidence and Essential References Not Discussed parts. In addition, here are some further questions:

Q1. It seems that the joint distribution-based bound in Thm. 3.3 induces a larger error than Thm. 3.2. Though Eq. (5) has a factor of 2 for $\mathbb{PW}$, note that the discrepancy in Eq. (9) is induced by the joint cost function (which naturally combines the two costs, i.e., the feature cost and the label cost). Thus, the partial Wasserstein term is generally the same, while Eq. (9) induces more complex terms that are intractable, i.e., the term in Eq. (11). Some justification would be highly appreciated.

Q2. The main difference between this submission and the existing GLS correction method [r4] is that this work adopts POT as a discrepancy metric. However, as discussed in the Claims and Evidence part, $\mathbb{PW}$ is not guaranteed to exclude the outliers. Thus, how should one understand the essential advantages of the proposed method?

Q3. The improvement over ARPM is only 0.2% on the Office-Home dataset, and no other empirical results are provided. This result seems to imply that the essential function of the proposed method is already largely captured by the existing ARPM. Justification or additional experiments would be highly appreciated.

Claims and Evidence

There are several concerns regarding the claims:

C1. The merits seem to be overclaimed, e.g., the generalization bound for PDA and partial alignment. Note that several works have also derived bounds for PDA with OT or with even more general frameworks, where the ideas of partial alignment and weighted risk estimation are also present. More details are discussed in the Essential References Not Discussed part.

C2. In line 171, there is a claim that “when $\alpha = 1$, we have that $q_j = 1/n_t$”. This claim seems problematic: if $\beta$ is not small enough (i.e., $1/(\beta n_s)$ is not big enough) or $n_t$ is small (i.e., $1/n_t$ is large), then this claim cannot be satisfied due to the inequality constraints of POT.

C3. In line 174, assume that $q_j = 1/n_t$ (even though it seems this cannot be guaranteed); why does this equality ensure that the outliers are ignored? Considering the same case as in C2, where the mass of source outliers is large (i.e., the mass sum of the shared samples in $P_s/\beta$ is still smaller than the total mass requirement $\alpha$), there always seems to be transported mass from the outliers.

Methods and Evaluation Criteria

The methodology and evaluation criteria are generally appropriate.

Theoretical Claims

The theoretical results and proofs look correct.

Experimental Design and Analysis

The comparison experiment is only conducted on a single dataset, while a consistent improvement over different datasets is necessary for demonstrating the empirical performance of the proposed method.

Supplementary Material

The technical parts w.r.t. the proofs are roughly checked.

Relation to Prior Literature

The key idea is related to recent progress on sample-level weight estimation for (label) shift and generalization bounds for PDA (with optimal transport), where the main difference is that this work considers the partial optimal transport framework.

Essential References Not Discussed

The missing references (which are closely related to this submission) can be summarized from the following aspects:

  1. Generalization bounds. In fact, from the viewpoint of distribution shift (particularly label shift), PDA, Open-Set DA (OSDA), and Universal DA (UniDA) can all be considered as label shift (LS) or generalized label shift (GLS), where these extreme shift scenarios imply that the supports of the label distributions are different. Therefore, there are works that provide the same innovation, i.e., 1) GLS bounds: upper bounds with a weighted source risk, a shift on the (conditional) representation distribution, and a shift on the label distribution [r1, r2]; 2) GLS bounds with optimal transport as the discrepancy measure [r3].

  2. The idea of employing the marginals of the transport plan (of POT) as a tool to automatically address the extreme shift has been studied in Unified OT [r4], which addresses a more general setting (i.e., PDA, OSDA, and UniDA) and considers a hard threshold for the weights (whereas this work considers soft weights $p_i$).

References

[r1] Tachet des Combes, Remi, et al. "Domain adaptation with conditional distribution matching and generalized label shift." Advances in Neural Information Processing Systems 33 (2020): 19276-19289.

[r2] Luo, You-Wei, and Chuan-Xian Ren. "When Invariant Representation Learning Meets Label Shift: Insufficiency and Theoretical Insights." IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).

[r3] Kirchmeyer, Matthieu, et al. "Mapping conditional distributions for domain adaptation under generalized target shift." International Conference on Learning Representations. 2022.

[r4] Chang, Wanxing, et al. "Unified optimal transport framework for universal domain adaptation." Advances in Neural Information Processing Systems 35 (2022): 29512-29524.

Other Strengths and Weaknesses

Pros:

  1. The organization is clear and easy to follow.

  2. Theoretical analysis is provided for the proposed OT framework.

Cons:

  1. The interpretations of the main technique are unrigorous and insufficient.

  2. The related works are not properly discussed, which makes it hard to assess the essential merits.

  3. The empirical improvement is limited, and the experiments are conducted on a single dataset only.

Other Comments or Suggestions

  1. The definition of $p_i, q_j$ could be confusing due to the incomplete definition of $\Pi^*$. It is necessary to clarify which problem $\Pi^*$ corresponds to (i.e., $\mathbb{PW}_\alpha(\cdot,\cdot)$). Though they are defined in the proof part of the appendix, this should be clarified in the main body.

Author Response

We thank the reviewer for their careful reading of the paper, relevant references, and constructive comments.

C1 and Essential References:

We thank the reviewer for pointing out these relevant references. While all of them are relevant, consider similar settings and techniques, and deserve to be reviewed in the introduction, there are a few key differences between our work and the mentioned papers. First, to the extent that the mentioned papers present bounds in terms of a weighted source loss, they all rely on classwise weights defined in terms of unknown data distributions. These are then estimated using a method based on Detecting and correcting for label shift with black box predictors by Lipton et al. (2018). However, these estimates are only guaranteed to be accurate if GLS holds exactly, i.e., if the feature representation $Z = g(X)$ of the input $X$ is such that $P(Z \mid Y = y) = Q(Z \mid Y = y)$ for source $P$ and target $Q$. Our results require no such assumptions, and yield explicitly computable weights from only the observed data. Additional differences are detailed below.

  1. The results in [r1] that contain the weighted source loss are bounds on distribution discrepancies, and not target loss. Furthermore, the weights therein are class-level weights that depend on the unknown underlying distributions, and the domain discrepancy is measured in terms of Jensen-Shannon divergence. In [r2], the same kind of weights are used, and domain discrepancy is measured in terms of metrics like the total variation. Finally, the risk bound in [r3] does not include source loss weights, and depends on the 1-Wasserstein distance rather than its partial counterpart considered in our paper.
  2. While [r4] considers a more general setting, aiming for, e.g., private class discovery, the proposed algorithm is not accompanied by any theoretical analysis. Furthermore, as pointed out by the reviewer, they use unbalanced rather than partial optimal transport, and consider binary rather than soft weights.

In summary: the listed references are very relevant, and we have updated the discussion of related work to include them. However, our theoretical results provide test risk bounds in terms of empirically computable weights and apply without any assumptions on the specific form of distribution shift. We have updated our stated contributions to clarify these points and avoid overclaiming the merits of our work relative to prior art.

C2, C3, and Q2:

First, note that we assume $\beta \in (0, 1]$. Hence, the mass of the scaled source distribution is always at least $1$. When $\alpha = 1$ and all entries of $Q_{\tilde{X}}$ equal $1/n_t$, the condition below Eq. (4)

$$\mathbf{1}_{n_s}^{T} \Pi \mathbf{1}_{n_t} = 1$$

requires the $n_t$ column sums of $\Pi$ to add up to $1$, whereas the condition

$$\Pi^{T} \mathbf{1}_{n_s} \leq Q_{\tilde{X}}$$

limits each column sum of $\Pi$ to be at most $1/n_t$. Combined, these two conditions imply that $q_j = 1/n_t$.
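As a toy numerical illustration of this argument (our own check, not taken from the paper), one can verify with the POT library that transporting a total mass of $\alpha = 1$ against the scaled source marginal forces every column marginal of the optimal plan to equal $1/n_t$:

```python
import numpy as np
import ot  # Python Optimal Transport (POT) library

n_s, n_t, beta = 8, 5, 0.5
rng = np.random.default_rng(0)
M = rng.random((n_s, n_t))                  # arbitrary ground cost

a = np.full(n_s, 1.0 / (beta * n_s))        # scaled source marginal, total mass 1/beta >= 1
b = np.full(n_t, 1.0 / n_t)                 # uniform target marginal, total mass 1
plan = ot.partial.partial_wasserstein(a, b, M, m=1.0)  # alpha = 1

q = plan.sum(axis=0)                        # column marginals q_j
print(np.allclose(q, 1.0 / n_t, atol=1e-6)) # True: each q_j equals 1/n_t
```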

Regarding guaranteeing that outliers are ignored: the parameter $\beta$ needs to be chosen small enough to enable the algorithm to ignore outliers. In particular, if $\alpha = 1$, $\beta$ can at most equal the outlier proportion. However, this still does not guarantee that outliers are ignored. In the same way, while the approach of [r4] is designed to avoid outliers, there is also no guarantee that this works. Empirically, as shown in Fig. 1, WARMPOT does significantly downweight outlier samples in practice.

The essential advantages of WARMPOT for PDA compared to the approach of [r4] are: (a) WARMPOT is endowed with theoretical guarantees; (b) the transport plan for POT is more interpretable than the one for unbalanced OT; and (c) our soft weights enable different samples of the same class to have different influences on the final hypothesis.

Q1:

As the reviewer notes, Thm. 3.2 includes only a feature cost whereas Thm. 3.3 additionally incorporates a label cost. Intuitively, one would expect the feature-only approach to suffice when the distribution shift is restricted to covariate shift and a setting where the supports of the input distributions mostly overlap. However, in cases of label shift, incorporating the label cost may be helpful. The factor 2 in Thm. 3.2 essentially compensates for the absence of the label loss, and stems from the use of Lemma A.4 (where we need to use the triangle inequality twice). A similar point applies to the uncomputable terms that capture the difficulty of the specific task ($L_f$ and $\tilde{L}_f$). Whether or not one of the bounds leads to better performance is a largely empirical question, and our experiments indicate that, for the tasks under consideration, it is generally beneficial to include a label cost.

Q3:

We have added results for ImageNet → Caltech (see response to 9cQ1 for details).

Other:

We now use $\tilde{p}_i$ and $\tilde{q}_j$ for the weights in Thm. 3.3 to avoid confusion, and clarify the difference compared to $p_i$ and $q_j$.

Reviewer Comment

I thank the authors for providing detailed responses, where most of the concerns are addressed. Thus, I would raise the score accordingly.

Review (Rating: 4)

The paper presents (PAC) bounds on the (expected) empirical loss using the partial Wasserstein distance between either the marginal (features only) or joint (features and labels) distributions. The first two terms of the bound are the source loss weighted by the marginals of the partial transport plan and the partial Wasserstein distance itself. Minimizing these leads to an algorithm that is performant on a benchmark dataset and that, when the weights are used within a more complicated heuristic, achieves state of the art.

Update after rebuttal

Further experiments suggested by the reviewers confirm the performance of the approach. A limitation due to the bias caused by mini-batch sampling, which has been noted in the continuous conditional flow matching case, could be noted.

Questions for Authors

Is a metric actually required for $\ell$? For instance, cross-entropy/log loss/KL divergence doesn't satisfy the requirements for a metric.

Line 820 "Through a parameter search," on what dataset and what performance metric? Especially in a dataset that is claimed to be unsupervised, this could be 'information leakage' from the test set performance. Generally, the percent of outliers is unknown.

Although empirical measures are used, the results appear to be for the whole sample, not the mini-batch. The implications of the reduced sample size of the mini-batch aren't clear in the bounds. Is there an effect?

Claims and Evidence

The theory is developed rigorously, and the experimental results (one dataset but many source/target pairs) show consistent performance.

Methods and Evaluation Criteria

For the ground distance in the label space, the loss should be a metric, but this is not specified.

Theoretical Claims

I did not check them rigorously but they seem logical.

Experimental Design and Analysis

The second concern regards how hyperparameter search can be performed on a dataset that is unsupervised...

Supplementary Material

I reviewed the appendix.

Relation to Prior Literature

Domain transfer and partial domain alignment is a very practical problem that is well established. The contributions are meaningful and will have impact if reproduced.

Essential References Not Discussed

Not that I noticed.

Other Strengths and Weaknesses

The paper is clear and well written. It should have significance because it is an important problem due to its widespread nature.

Other Comments or Suggestions

Author Response

We thank the reviewer for their helpful and constructive comments.

Q1. For the ground distance in the label space, the loss should be a metric, but this is not specified. Is a metric actually required for $\ell$? For instance, cross-entropy/log loss/KL divergence doesn't satisfy the requirements for a metric.

The theoretical derivations require the loss function to be symmetric in the model prediction and true label, and that the triangle inequality holds. Note, though, that in our numerical experiments we consider loss functions that do not necessarily conform to the theoretical assumptions. The fact that we still observe good performance indicates that these assumptions are not critical for the general insights obtained from our theoretical results to hold.

Q2. Line 820 "Through a parameter search," on what dataset and what performance metric? Especially in a dataset that is claimed to be unsupervised, this could be 'information leakage' from the test set performance. Generally, the percent of outliers is unknown.

In order to assess the impact of this hyperparameter selection, we have conducted a sensitivity analysis for the $\alpha_{max}$ and $\beta$ parameters on the ImageNet → Caltech dataset. We set the following values for the hyperparameters: $\eta_1 = 0.92$, $\eta_2 = 5.47$, $\varepsilon = 5.59$. In the experiment on $\alpha_{max}$, we set $\beta = 0.72$, and in the experiment on $\beta$, we set $\alpha_{max} = 0.08$ (see Appendix E in our paper for the notation used).

As seen in Figure 1, the impact of varying $\beta$ is not large (about 2%) over the entire range $0 < \beta \le 1$. Similarly, in Figure 2, when $0 < \alpha_{max} \le 0.1$, the performance difference is not large (about 3%). However, for $\alpha_{max} > 0.1$ there is a significant drop, which is due to the large number of outliers in the source sample. Using a larger value of $\alpha_{max}$ results in positive weights on outlier instances, thus degrading the performance. The results indicate that the specific choice of these parameters has a minor impact over a range of reasonable values.

Furthermore, note that for the ARPM+our-weights algorithm, we use the same parameters as the original ARPM algorithm.

Q3. Although empirical measures are used, the results would appear to be for the whole sample not the mini-batch. The implications of the reduced sample size from the mini-batch isn't clear in the bounds. Is there an effect?

As the reviewer notes, we solve the mini-batch partial optimal transport problem during training rather than the full-data transport problem due to computational considerations. To study the effect that this has on the resulting weights, we compared:

(i) the average weights applied to each sample during training, as obtained from the mini-batch partial optimal transport problem; and

(ii) the weights arising from the full-data transport problem at the end of training.

The results indicate that the two approaches lead to very similar weight distributions, and in particular, the weight proportion assigned to shared samples is 86.37% with mini-batch weights and 90.54% with full-sample weights. This aligns with discussions in Improving mini-batch optimal transport via partial transportation by Nguyen et al. (2022) regarding the possibility of using mini-batches to approximate full-data transport problems.
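A minimal sketch of the comparison described above (ours; the variable names are hypothetical): the reported proportions are simply the share of total weight mass falling on shared-class source samples.

```python
import numpy as np


def shared_weight_proportion(weights, shared_mask):
    """Fraction of the total weight mass assigned to shared-class source samples."""
    weights = np.asarray(weights, dtype=float)
    return weights[shared_mask].sum() / weights.sum()

# Hypothetical usage: w_minibatch holds per-sample weights averaged over training
# mini-batches (case (i)), w_full the weights from the full-data transport problem
# (case (ii)), and is_shared marks source samples whose class appears in the target.
# shared_weight_proportion(w_minibatch, is_shared)  # ~0.86 in the experiment above
# shared_weight_proportion(w_full, is_shared)       # ~0.91 in the experiment above
```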

Final Decision

The paper addresses the problem of Partial Domain Adaptation and proposes to solve it using a partial optimal transport framework. The main contributions of the paper are a new algorithm for matching source/target distributions and theoretical generalization bounds supporting the approach.

All reviewers found that the paper deserves acceptance as it proposes novel and solid contributions.