PaperHub

Rating: 5.3/10 · Withdrawn · 3 reviewers
Individual ratings: 5, 5, 6 (min 5, max 6, std 0.5)
Average confidence: 3.7
ICLR 2024

Wording Image for Domain-Invariant Representation in Domain Generalization

Submitted: 2023-09-23 · Updated: 2024-03-26
TL;DR

Learning domain-invariant embedding on joint vision-language embedding space for domain generalization without prior.

Abstract

Keywords
Vision-Language Alignment, Domain Generalization, Domain-Invariant Representation Disentanglement, Long-Tail Learning

Reviews and Discussion

Review 1
Rating: 5

This paper studies domain generalization, where the key challenge is to learn domain-invariant visual representations for each category. The authors argue that language embeddings of a particular category are naturally domain-invariant. Moreover, the difference between the pseudo-language embedding (prompted with the input image) and the original language embedding (prompted with the class description) represents the domain-specific counterpart. To this end, the authors propose WIDIn to learn domain-invariant visual representations using language embeddings. Empirical evaluation on various domain generalization benchmarks demonstrates the effectiveness of the proposed method.
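As a reading aid, the core arithmetic this summary describes could be sketched as follows; this is a minimal, hypothetical illustration assuming all embeddings live in a shared CLIP-style joint space, not the authors' code:

```python
import torch

def estimate_domain_invariant(x, t_x, t_c):
    """Subtract the estimated domain-specific component (t_x - t_c) from x.

    x   : visual embedding of the input image
    t_x : pseudo-language embedding, prompted with the input image
    t_c : language embedding of the class description
    """
    return x - (t_x - t_c)

# toy usage with random 512-d embeddings (the dimension is an assumption)
d = 512
x_e = estimate_domain_invariant(torch.randn(d), torch.randn(d), torch.randn(d))
```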

Strengths

  1. This paper is well-written and easy to follow.
  2. The motivation is intuitive and seems reasonable.
  3. The figures vividly illustrate the pipeline of the proposed method, especially the left part of Figure 1 and Figure 2b.
  4. The improvements over baselines are relatively significant.
  5. The proposed method is also capable of long-tailed image classification.

Weaknesses

  1. Lack of evidence to support the core insight. As illustrated in Figure 2b, t_x - t_c is parallel to x - x_e, but in Figure 3, no such phenomenon can be observed. The authors are encouraged to draw a parallelogram composed of x, x_e, t_x, t_c for each category, on both the source domain and the target domain (a minimal version of this check is sketched after this list).

  2. Missing ablations. There are three objectives in total, including L_ia, L_ca, and L_feat. It is better to study the effectiveness of each component step by step.

  3. The underlying motivation of L_ca. Specifically, L_ca aims to minimize the distance between the domain-specific text embedding and the domain-invariant text embedding. As the optimization goes on, the difference between these two embeddings, which is used to measure the domain-specific part, becomes small. However, the domain-specific part always exists and never becomes small. Therefore, minimizing L_ca seems strange to me.
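For concreteness, a distance term of the kind point 3 describes might look like the sketch below; this is only a guess reconstructed from the review's wording, not the paper's actual definition of L_ca:

```python
import torch.nn.functional as F

def l_ca(t_x, t_c):
    """Pull the pseudo-language embedding t_x toward the class embedding t_c."""
    return (1.0 - F.cosine_similarity(t_x, t_c, dim=-1)).mean()
```

If such a loss were driven to zero, t_x - t_c (the estimated domain-specific component) would vanish as well, which is exactly the tension the reviewer points out.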

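Likewise, the visualization requested in point 1 reduces to a simple direction check between the two difference vectors; below is a hypothetical helper, assuming batched embeddings in the same joint space:

```python
import torch
import torch.nn.functional as F

def parallelism(x, x_e, t_x, t_c):
    """Cosine similarity between (t_x - t_c) and (x - x_e); values near 1 mean parallel."""
    return F.cosine_similarity(t_x - t_c, x - x_e, dim=-1)

# toy usage with a batch of 8 random 512-d embeddings
scores = parallelism(*(torch.randn(8, 512) for _ in range(4)))
```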
Questions

I do not have further questions; please refer to the weaknesses section.

Review 2
Rating: 5

The authors propose to project images into the language space by representing each image as a word token, which is attached to a hand-crafted prompt and fed into the language encoder. The difference between the resulting embedding and the language embedding of the class description is used to estimate the domain-specific counterpart, which facilitates domain-invariant representation learning. Experiments demonstrate the effectiveness of this approach.
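The projection step this summary describes could look roughly like the following; this is a hypothetical sketch in which the module names, dimensions, and prompt handling are placeholder assumptions (a real text encoder would tokenize the prompt properly):

```python
import torch
import torch.nn as nn

d_img, d_word = 768, 512                    # assumed encoder / token dimensions
project = nn.Linear(d_img, d_word)          # maps an image feature to a pseudo-word

image_feature = torch.randn(1, d_img)       # output of the (frozen) image encoder
v_token = project(image_feature)            # the [V] token for this image

# stand-in embeddings for the hand-crafted prompt "an image of"
prompt_tokens = torch.randn(1, 3, d_word)
sequence = torch.cat([prompt_tokens, v_token.unsqueeze(1)], dim=1)
# `sequence` would then pass through the language encoder to yield the
# pseudo-language embedding t_x, compared against t_c from the class description.
```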

Strengths

  1. Reported results outperform baselines: experimental studies on two domain generalization benchmark datasets and two long-tail benchmark datasets demonstrate the effectiveness of this approach.
  2. The presentation is clear and easy to follow.

Weaknesses

  1. The novelty is limited. Representing an image as a word token is very common in many recent multimodal models (e.g., BLIP, LLaVA, CM3, RA-CM3, CM3Leon, etc.). The overall idea and pipeline are still similar to LADS. The difference is that this paper learns from domain-invariant representations while LADS learns from domain-specific representations, and the way a domain-specific or domain-invariant representation is obtained is similar.

Technical details:

  1. From my understanding, t_x is the "unified" representation of domain-invariant and domain-specific features. In this case, I wonder why you use a text encoder to encode "an image of [V]" to represent t_x. Given that you claim "an image" represents the invariant domain, this would introduce some bias (towards the invariant domain). Why not directly encode the [V] token alone?

  2. I am concerned about using "image" to represent the domain-invariant space. Have you tried using other words in place of "image"? For example, using "painting" instead would project everything into the painting domain. I doubt that "an image of {...}" works because of the authors' claim that "image" corresponds to an invariant space; I suspect other words would yield similar results (a minimal ablation loop is sketched after this list).

  3. Code is not provided.
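The ablation suggested in point 2 would amount to sweeping the anchor word, roughly as below; `evaluate` is a hypothetical hook standing in for re-training and evaluating the method with the given prompt template:

```python
anchor_words = ["image", "photo", "painting", "sketch", "cartoon"]

for word in anchor_words:
    article = "an" if word[0] in "aeiou" else "a"
    prompt_template = f"{article} {word} of [V]"
    # accuracy = evaluate(prompt_template)   # hypothetical evaluation hook
    print(prompt_template)
```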

Questions

Please see the weaknesses.

Review 3
Rating: 6

The paper introduces a method called WIDIn, which connects visual and language information to create domain-invariant features. This is achieved by representing images as word tokens and using the difference between the image's and its class description's language embeddings to promote domain-invariant representation learning. Experimental results show the effectiveness of WIDIn on benchmark datasets.

Strengths

  1. The proposed method is novel. With the added F_p, F_D, and F_C, it adds learnable parameters to LADS and thus leads to better performance.
  2. The experimental results show the power of the proposed method.
  3. The ablation study with different prompts, contrastive learning w/ and w/o labels, and training vs. freezing the language model is interesting.
  4. The extension experiments on the long-tail case further support the power of the proposed method.

Weaknesses

  1. Is the image encoder trainable? If so, it might be unfair to compare with Linear Clf. and MLP Clf.
  2. It would be great if the authors could compare to other, more general domain generalization methods, where only the training domain is available and the domain descriptions of the testing domains remain unknown.
  3. It would be great if the proposed method could be evaluated on bigger benchmarks (DomainNet and Office-Home) with more domains, as the proposed method does not require any access to the target domain.
  4. The presentation of this paper is a little awkward. For example, training details, such as the loss for each step, are not presented in the main paper.

Questions

See weaknesses.