CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions
We introduce CulTwin, a synthetic cultural dataset of visually similar concept pairs with contextualized captions, and CultureCLIP, a CLIP-based model fine-tuned to better distinguish visually similar yet culturally distinct concepts.
Abstract
Reviews and Discussion
The paper introduces CultureCLIP, an enhanced version of the CLIP vision-language model (VLM), designed to improve fine-grained cultural concept differentiation while maintaining generalization capabilities.
Reasons to Accept
Reasonable Motivation and Novelty: The motivation behind the paper is well-articulated and addresses a critical gap in vision-language models (VLMs). By empowering CLIP to distinguish culturally distinct yet visually similar concepts through fine-grained visual cues, the work presents an interesting and valuable contribution to multimodal learning. The introduction of CultureCLIP, which enhances cultural awareness, aligns with the growing need for culturally sensitive AI systems.
High-Quality CulTwin Dataset: The CulTwin dataset is a significant strength, constructed through a meticulous three-stage data curation pipeline. The use of synthetic concept-caption-image triplets, enriched with cultural context via vision-language models and text-to-image diffusion models, ensures high quality and diversity. This dataset not only supports the proposed method but also has the potential to advance multimodal alignment research in cultural contexts.
Comprehensive Experimentation: The experimental setup is robust, featuring a variety of culture-specific and culture-agnostic benchmarks. The ablation studies systematically validate the effectiveness of the proposed contrastive learning framework.
Reasons to Reject
Need for Concrete Examples in Cultural Taxonomies: While the cultural taxonomy is well-defined across eight categories, the paper would benefit from providing specific examples for each category to enhance clarity.
Lack of Visualization Results: The paper claims improved fine-grained cultural understanding, but it lacks visualizations to substantiate these claims. For example, showing matching scores between an image (e.g., Yuelao with a red thread) and captions describing similar cultural concepts would provide concrete evidence of CultureCLIP’s ability to distinguish subtle differences. Such visualizations would make the results more interpretable and compelling.
Limited Innovation Beyond CLIP Framework: Although CultureCLIP introduces a novel dataset and training objective, the core methodology heavily relies on the existing CLIP architecture and contrastive learning paradigm. The current approach, while effective, risks being perceived as an incremental extension of CLIP.
Thank you for your careful and insightful comments. We sincerely appreciate your recognition that (i) our work addresses a critical gap in vision-language models by enhancing cultural awareness through fine-grained visual distinctions, and (ii) the CulTwin dataset is a valuable and well-constructed resource that can support future research in culturally grounded multimodal learning. We also thank you for acknowledging the strength of our experimental design and ablation studies. Below, we address your concerns in detail.
Weakness 1: Need for Concrete Examples in Cultural Taxonomies and Visualization Results
Thank you for these helpful suggestions. We agree that providing concrete examples for each cultural taxonomy category and including visualizations can significantly enhance clarity and interpretability. Due to space limitations, we currently include three representative examples in Figure 3. A more complete set of examples for all eight categories (Cuisine, Clothing, Animals & Plants, Art, Architecture, Daily Life, Symbols, and Festivals) can be found at this anonymous link: https://postimg.cc/K3yRbpd1.
Additionally, we plan to include representative image–caption pairs along with their matching scores to better illustrate CultureCLIP’s ability to capture subtle cultural distinctions. Some preliminary visualizations can be found at this anonymous link: https://postimg.cc/Y4S4Znmy. These enhancements will be added to the appendix in the revised version.
Weakness 2: Limited Innovation Beyond CLIP Framework
We sincerely thank the reviewer for acknowledging the novelty of our dataset and training objective. We would like to point out that while our method builds upon CLIP without architectural modifications, its core contribution lies in addressing a key limitation of CLIP—its difficulty in capturing fine-grained, culturally grounded distinctions. To this end, we propose a novel dataset construction approach and a tailored training strategy that enhances cultural understanding without compromising CLIP’s original image–text alignment. Importantly, prior works such as MedCLIP (Wang et al., 2022), BioCLIP (Stevens et al., 2023), and LongCLIP (Zhang et al., 2024) have demonstrated that meaningful contributions can be achieved without modifying CLIP’s architecture. Similarly, our work introduces a novel data curation and training strategy that, while motivated by cultural understanding, offers generalizable insights for adapting CLIP to a wide range of domain-specific tasks, thus making a valuable contribution to the community. As CLIP is a fundamental model for multimodal learning, we believe our work will inspire the development of more culturally aware and nuanced vision-language models of even greater parameter scale and sophistication.
The author's response adequately addressed the reviewer's concerns. Therefore, I maintain my original score of 6.
Dear Reviewer n9zq,
We sincerely appreciate the time and effort you have invested in offering us your valuable feedback.
We look forward to your comments on our responses and would be delighted to address any further questions or concerns you might have.
Thank you once more for your insightful comments and the time you have dedicated to our submission.
Best,
The Authors
This paper introduces a synthetic dataset containing culture-specific concept-caption-image triplets and presents contrastive approaches inspired by NegCLIP and TripletCLIP to train CultureCLIP using the proposed dataset. The dataset is designed to amplify similar yet distinct cultural concept pairs that current VLMs struggle with. They show that, using hard image and caption/concept negatives together with LoRA, they can improve the baseline CLIP on culture-specific tasks without sacrificing general task performance.
Reasons to Accept
I find the focus on cultural awareness in VLMs to be a relatively new topic that deserves more attention in the field. I also appreciate the design of the dataset construction pipeline and the corresponding training strategy. These insights can be helpful for people who are interested in adapting VLMs for specific domains.
Reasons to Reject
Reliability of the Proposed Dataset: I think certain descriptions in the paper are too high-level and missing specific details:
- Where do the images come from (real images from Wikipedia or synthetically generated)? How do you guarantee that the pair constructed is culturally appropriate? Please provide more details of MLLM-as-a-Judge and why Qwen2.5-VL is proficient in judging culturally related questions. For human eval, how many examples are annotated? Are the annotators qualified to annotate concepts across many countries?
Unclear Model results
- What is the CultureCLIP set up in Table 2? Some ablation results have even better performance compared to what’s reported in Table 2. (E.g. Table 3 max performance on GlobalIRG-G is 69.76 compared to 69.47 in table 2)
- One of the conclusions in ablation is filtered dataset enhances model performance. Could the authors please clarify what the ablation experiments were done for this section?
- It seems that the culturally specific data tested are biased towards questions involving concepts. Could the author explain why datasets like CultureVerse or CVQA are not included as a benchmark? Or VLMs like CultureVLM and other closed-source VLMs are not tested?
Questions for the Authors
- Why are NegCLIP and TripleCLIP not evaluated on GlobalIRG and CROPE datasets?
- Why does CLIP++, which is trained on more cultural data, perform significantly worse?
- The metrics in Table 2 are not consistent between cultural-specific and agnostic tasks. It’s better to use percentages for both for consistency.
- On line 144, “Finally, the back side is generated by the LLM, which matches visually similar concepts with the front side within the same cultural category, following the same format.” Could you clarify how you measure visually similar matches?
Responses to the questions:
Thanks for your valuable advice; we respond to the questions one by one:
Q1: Why are NegCLIP and TripleCLIP not evaluated on GlobalRG and CROPE?
Thank you for raising this point. NegCLIP and TripleCLIP are pre-training methods trained on datasets (CC3M and CC12M) that are significantly smaller than CLIP’s original pre-training corpus. As a result, their performance on general benchmarks (e.g., MSCOCO and Flickr30K), as shown in Table 2, is substantially lower than that of CLIP and our CLIP-based fine-tuned models. Furthermore, since these models were not exposed to culturally relevant data during training, their performance on cultural datasets is considerably weaker and thus not meaningful for comparison. Strictly speaking, these models should not be used as direct baselines in the main evaluation; however, we retained them in the tables for completeness. More meaningful comparisons are provided by NegCLIP++ and TripleCLIP++, which apply the same pre-training strategies but use more pre-training data and are further fine-tuned on our cultural dataset. We will revise the tables and clarify this point in the updated manuscript.
Q2: Why does CLIP++ perform worse despite being trained on more cultural data?
Thank you for this important observation. As discussed in Section 5.1 (lines 258–263), CLIP++ is trained solely on cultural image–caption pairs (containing textual and visual cues) without incorporating hard negatives. In contrast, NegCLIP++ and TripleCLIP++ use our constructed hard negative pairs, which are essential for helping CLIP learn fine-grained visual distinctions. This result underscores that simply fine-tuning on domain-specific data without hard negatives can negatively affect the original alignment and lead to performance degradation, even when using LoRA to preserve parametric knowledge. We will clarify this point further in the revised version.
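For illustration, below is a minimal sketch of a NegCLIP-style contrastive loss in which one hard-negative caption per image is appended to the text side of the batch; the function and tensor names are illustrative, and this is a simplified stand-in rather than our exact implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img_emb, cap_emb, hard_neg_emb, tau=0.07):
    """Simplified NegCLIP-style loss: in-batch negatives plus one hard-negative
    caption per image concatenated onto the text side (illustrative only)."""
    img = F.normalize(img_emb, dim=-1)                              # (B, D)
    txt = F.normalize(torch.cat([cap_emb, hard_neg_emb]), dim=-1)   # (2B, D)

    logits = img @ txt.t() / tau                                    # (B, 2B)
    targets = torch.arange(img.size(0), device=img.device)

    # Image -> text: each image must prefer its own caption over both
    # in-batch captions and the constructed hard negatives.
    loss_i2t = F.cross_entropy(logits, targets)
    # Text -> image: the symmetric term uses only the positive captions.
    loss_t2i = F.cross_entropy(logits[:, : img.size(0)].t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```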
Q3: Inconsistent metric formats in Table 2
Thank you for pointing this out. We will revise the manuscript to report the results of the cultural-agnostic tasks using percentage units as well, to ensure consistency across all metrics.
Q4: How is visual similarity determined during back-side concept generation (l.144)?
Thank you for this insightful question. As shown in Appendix D, Figure 6, we prompt the LLM to generate visually similar but culturally distinct negative samples based on a given concept, its context, and key visual features. This simple yet effective strategy guides the LLM to focus on visual attributes and generate concept pairs with similar visual features but different cultural meanings. To ensure the quality of these generated pairs, we compute consistency scores using Qwen2.5-VL during the image quality check stage—specifically evaluating whether the positive and negative images preserve the key visual features of their respective concepts. This helps us retain image pairs that are visually comparable but reflect fine-grained cultural differences. While we acknowledge that this process relies on the capabilities of the underlying LLM and MLLM, our human evaluation following the image quality check shows that the retained samples are reasonably high-quality. Moreover, experimental results indicate that these contrastive pairs significantly improve the model’s ability to capture subtle cultural distinctions. We appreciate the reviewer’s insight and agree that further improving the quality of hard negatives is a promising direction. We plan to explore more robust and interpretable generation strategies in future work.
Weakness 2: Unclear model results
Thank you for your helpful comments.
- We sincerely thank the reviewer for pointing out the inconsistency between the ablation results in Table 3 and those in Table 2. After careful investigation, we found that although the experimental settings for Table 3 were intended to match those of Table 2 (i.e., all models trained on the same unfiltered 100k dataset using LoRA with rank 4), an issue occurred during the evaluation of the culture-related datasets in Table 3: some images were not properly loaded because the experiments were run by different individuals. This led to inconsistent and confusing results. We have now re-evaluated all ablation settings under the correct configuration and present the updated results on the cultural benchmarks for Table 3 below.
| Configuration | λ (cap/con) | GlobalRG-G | GlobalRG-R | CROPE |
|---|---|---|---|---|
| Single Branch (No Negative) | ||||
| Caption-only w/o neg | 1.0 / -- | 66.95 | 77.43 | 79.19 |
| Concept-only w/o neg | -- / 1.0 | 64.24 | 77.70 | 79.19 |
| Single Branch (With Negative) | ||||
| Caption-only w/ neg | 1.0 / -- | 66.27 | 77.53 | 79.25 |
| Concept-only w/ neg | -- / 1.0 | 65.83 | 77.29 | 79.19 |
| Mixed Branches | ||||
| Cap (w/o neg) + Concept (w/o neg) | 0.5 / 0.5 | 67.29 | 77.27 | 79.25 |
| Cap (w/ neg) + Concept (w/o neg) | 0.5 / 0.5 | 67.12 | 78.70 | 79.25 |
| Cap (w/o neg) + Concept (w/ neg) | 0.5 / 0.5 | 68.81 | 76.87 | 79.19 |
| Full (Both with Negative) | ||||
| Both w/ neg (Ours) | 0.7 / 0.3 | 65.93 | 78.80 | 79.37 |
| Both w/ neg (Ours) | 0.5 / 0.5 | 67.12 | 78.25 | 78.60 |
| Both w/ neg (Ours) | 0.3 / 0.7 | 69.47 | 78.60 | 78.84 |
- The ablation results supporting our claim that quality filtering enhances performance are shown in Table 4. Specifically, the “+QF” (Quality Filtered) configurations refer to training on the 73.8k filtered samples that passed our image quality check, while those without “+QF” use the full unfiltered 100k dataset. For example, Config 5 (LoRA r=4 + QF) outperforms Config 3 (LoRA r=4), and Config 6 (LoRA r=8 + QF) outperforms Config 4 (LoRA r=8), demonstrating that filtering significantly improves data quality and model performance, even with fewer samples. Moreover, Config 2 (QF with full fine-tuning) underperforms compared to Configs 5 and 6, suggesting that LoRA not only offers more efficient training but also better preserves the original image–text alignment than full-parameter tuning. We will make these configurations and observations clearer in the revised manuscript; a minimal sketch of the LoRA setup is also included below for reference.
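For reference, the following is a minimal sketch of the kind of LoRA setup used in these configurations (rank 4, adapters on the attention projections of both CLIP encoders, via the Hugging Face peft library); the backbone checkpoint and hyperparameters here are illustrative assumptions rather than our exact training script:

```python
from transformers import CLIPModel
from peft import LoraConfig, get_peft_model

# Illustrative: wrap a CLIP backbone with rank-4 LoRA adapters and freeze the
# original weights; only the low-rank matrices are updated during fine-tuning.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # placeholder checkpoint

lora_cfg = LoraConfig(
    r=4,                 # LoRA rank (r=8 in the alternative configuration)
    lora_alpha=16,       # illustrative scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # confirms only adapter parameters are trainable
```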
Weakness 3: Missing baselines and benchmarks
Thank you for raising this interesting question.
- Regarding why we do not compare with VLMs like CultureVLM, our method is based on CLIP and focuses on contrastive multimodal learning to enhance cultural awareness. It is primarily designed for feature representation in classification and retrieval tasks. In contrast, CultureVLM adopts a different architecture that combines the vision encoder of CLIP with an LLM, enabling more complex generation. As such, a direct comparison would be less meaningful due to the substantial differences in architecture and model scale. We instead view CultureCLIP as a culturally-aware vision backbone, which can be integrated with LLMs to build stronger VLMs in the future. Additionally, CultureVLM is a concurrent work that has not yet been open-sourced, so we did not include it as a baseline at this stage. We are glad to incorporate it into future comparisons once it becomes publicly available.
- For benchmark selection, we adapted GlobalRG and CROPE into a statement-ranking task to evaluate cultural sensitivity in CLIP-based models. This transformation ensures a fair comparison and aligns with CLIP's capabilities, making our choice of benchmarks purposeful rather than biased. Regarding CVQA and CultureVerse (not yet released), these datasets involve high-level multi-hop reasoning, which typically requires generative capabilities to arrive at the correct answer. As such, they are not suitable for testing fine-grained visual feature discrimination. For example, asking CLIP to determine the specific cultural use of an object in a given country is unreasonable, as CLIP is primarily designed for visual recognition rather than reasoning.
Thank you for your careful and insightful comments. We sincerely appreciate your recognition that (i) our work is meaningful and can help draw the community’s attention to cultural awareness in vision-language models, and (ii) our dataset construction approach and corresponding training strategy may offer useful insights for adapting VLMs to specific domains. Below, we address your concerns in detail.
Weakness 1: Missing details in data construction process
Thank you for pointing out the need for more specific details regarding our dataset construction process, the use of MLLM-as-a-Judge, and the human evaluation setup.
- Image source: real vs. synthetic: As illustrated in Figure 2, we first collect cultural concepts across different countries and categories using both bottom-up (from real-world data) and top-down (LLM-driven) approaches, and then extract or generate the corresponding context and key visual features. Based on this information, we generate images using a text-to-image model (Stable Diffusion 3.5 Large Turbo). Thus, all training images are synthetic, while real images are only used during the bottom-up phase to help extract context and visual cues. We chose not to use real web images (e.g., from Wikipedia) for the following reasons: (1) High-quality images strongly tied to cultural concepts are scarce and require costly manual annotation. (2) Captions from the web are often long and may not align well with the image content. (3) It is difficult to find fine-grained, visually similar hard negatives from natural sources. In contrast, generating images from culturally relevant captions allows for scalable and controllable data creation.
- Ensuring cultural appropriateness of matched pairs: During the pair matching stage, we use Qwen2.5-VL to generate visually similar but culturally distinct negative samples for each concept, given its context and key visual features. While we initially considered using stronger models like GPT-4o, we ultimately chose Qwen2.5-VL due to its lower cost and better scalability. Despite being smaller, it achieves reasonable quality, with a passing rate of around 75%, making the trade-off acceptable.
- Details on MLLM-as-a-Judge: We use Qwen2.5-VL to evaluate three dimensions of image quality: Authenticity, Consistency, and Cultural Fidelity. We understand the reviewer’s concern about Qwen2.5-VL’s cultural reasoning ability. To mitigate this, as shown in Appendix D (Figure 10), we provide explicit context and key visual features, and ask the model to assess the appropriateness of elements within that cultural setting—rather than answer open-ended cultural questions. We also employ few-shot prompting to guide more reliable scoring behavior. While gaps remain in abstract categories (e.g., Art, Architecture), we find that in more common categories, MLLM scores follow human evaluation trends.
- Human evaluation setup and annotator qualification: The evaluation was conducted on 240 filtered images (120 positive–negative pairs) across eight cultural categories. Each image was accompanied by a reference image, detailed cultural context (including the concept, its definition, and the generated caption), as well as a Wikipedia link. To ensure the reliability of our human evaluation, we invited three PhD-level volunteers with familiarity in cultural concepts. Annotators were encouraged to consult the provided references if they were unfamiliar with a particular concept. They were instructed to assess each generated image along three dimensions: authenticity (whether the image violates physical commonsense), consistency (whether it accurately depicts the intended concept), and cultural fidelity (whether any inappropriate or culturally irrelevant visual elements are present). While authenticity does not require specific cultural knowledge, both consistency and cultural fidelity can be reasonably evaluated using the provided references, even without deep cultural expertise.
Dear Reviewer 25A8,
We are truly grateful for your time and the thoughtful feedback you have provided.
We look forward to your comments on our responses and would be delighted to address any further questions or concerns you might have.
Thank you again for your valuable insights and for taking the time to review our work.
Sincerely,
The Authors
The paper addresses the issue of lack of cultural representation in CLIP, i.e. the model's inability to distinguish similar but culturally distinct concepts. The authors create a synthetic dataset of pairs of visually similar concepts. Each concept is accompanied by an image and caption which feature the concept. CLIP is finetuned on this data using LoRA, using the contrastive loss from NegCLIP (which incorporates hard negatives), applied not just to image-caption pairs but also to image-concept pairs. Results on various culture-specific benchmarks indicate that the training has effectively instilled cultural knowledge into CLIP, and ablations show that this is in larger part thanks to the concept-image pairing.
The work is clearly written, although some details are missing (see below). I can't say that it is particularly original, as there is nothing about the objective, the creation of the data, or the training of the model that is particularly innovative. That being said, a CLIP model with improved cultural knowledge is needed, and this work improves considerably on some relevant metrics, so perhaps members of the community will find the model (if publicly released) a useful tool for their work.
Reasons to Accept
The paper is mostly well written.
The data creation procedure sounds reasonable.
It addresses a timely issue and shows very good results on appropriate benchmarks.
Reasons to Reject
The evaluation of how well the culture-specific training preserved the general capabilities of the CLIP base model is limited to retrieval on COCO and Flickr30k. At the very least, some image classification benchmarks should be included, such as ImageNet and CIFAR, as they are indicative of the generalization of the model to a broad range of objects. On a related note, I think it's important to introduce the cultural benchmarks in at least two-three sentences, so the reader can make better sense of the results and discussion.
Key details about the dataset are missing, pertaining to its size, caption length, and the computational cost of its creation (e.g., in terms of time).
The authors do not state whether they will make the dataset and model public.
Error analysis or outlook for future work is missing. Some examples of CLIP v CultureCLIP outputs on the culture benchmarks would be nice to get a sense for how off-target CLIP was and where CultureCLIP might still be lacking in performance.
Questions for the Authors
The authors state that "A natural approach is to fine-tune CLIP on a large-scale dataset containing image-text pairs that explicitly describe culturally significant details of concepts." It is not immediately obvious to me why that's the natural way to go. The mention of the significant details, if anything, might give the model less incentive to store knowledge about the distinctive features of entity X, since those details are provided in the textual context and as such do not need to be memorized. Training with hard negatives is the natural way to go, as I see it, without providing any textual cues to the distinctive features, in order to force the model to actually learn them from the data. Indeed, at the end we see that concepts are more useful than captions, which I interpret to mean that when the model fully has to rely on the concept with no additional signal from the textual context to be used as a shortcut, it learns the visual features of the entity best. So I'm not sure why the authors chose to introduce the approach in this way.
I don't understand why the "Efficient Contrastive Pre-training" is called that.
"A VLM then evaluates their cultural relevance, associating each concept with a specific country and cultural category." - what are the possible categories? How is the list compiled?
"Concepts that do not meet the criteria are discarded." -> what are the criteria?
It is not clear what the authors mean by front side and back side on l 144-146
What is the final size of the dataset? What is the average length of the captions?
What text-to-image model is used to generate the images?
Is Qwen2.5-vl used for all steps involved in the generation of the data? What gives the authors confidence that this model is powerful enough to reliably carry out all tasks? Was any human validation performed?
Why is it a better idea to generate images of the concepts rather than source natural images from the web/existing image repositories? Is there any concern around the synthetic nature of the data? This should be discussed in the paper.
"This structure jointly anchors abstract cultural concepts to specific visual-textual cues" - since there is no loss that concerns concepts and captions, I think the cues are only visual, right?
I assume concepts are encoded with the same text encoder as used for captions. Are they presented as solo words or contextualized in a template of some sort? Is there concern for using the same module to encode two distinct types of inputs, effectively doing multi-tasking?
Would it make sense to extend the pairs to cross-cultural concepts which may look alike but are in fact highly distinct? Such as a clay vessel which in one culture might be used to store food and in another to collect waste? Such examples were presented in the CVQA paper by Romero et al. 2024.
The data creation section is somewhat tedious and could be shortened, with parts of it moved to the appendix.
Q5: What do “front side” and “back side” refer to (l.144–146)?
We use the metaphor of a Twin Card to describe two culturally related concepts that are visually similar, each represented by its own concept–caption–image triplet. As illustrated in the top part of Figure 2, the front side refers to the original concept obtained through either the top-down or bottom-up approach, while the back side is a counterpart concept generated by an LLM. Together, they form a pair of conceptually linked examples designed to highlight cultural distinctions. In the revised version, we will (i) modify the original text to make this metaphor clearer, and (ii) add additional annotations to Figure 2 to help readers better understand the front–back relationship of the Twin Card.
Q6 & Q7: Dataset size, caption length, and image generation model
Please see details in Weakness 2.
Q8: Why Qwen2.5-VL, and how is reliability ensured?
The VLM plays a crucial role in our approach. We initially used GPT-4o but ultimately switched to Qwen2.5-VL, which, despite being smaller, provides sufficiently reliable results for our purposes. While more powerful LLMs could potentially improve data quality, from a practical standpoint, Qwen2.5-VL strikes a good balance between performance and cost. To validate the data quality, we included a human validation step in the final image quality check. As shown in Table 1, the passing rates for each cultural category hover around 75%, indicating reasonable quality. This passing rate is satisfactory, and with more time or resources, the dataset could be further expanded since intermediate generation steps require no human supervision.
Q9: Why use synthetic rather than real images?
As discussed in Q1, collecting high-quality, culturally relevant real-world data is challenging due to limited availability, annotation costs, and weak alignment between images and captions. In contrast, generating images from culturally grounded captions using text-to-image models allows for scalable and controllable data creation. One concern with synthetic data is that it may lack diversity and lead to overfitting. To address this, as described in Section 3.2, we diversify the generated captions by varying artistic style, setting, and scene composition—enriching the visual representation of each concept. Our experimental results indicate that this variation helps mitigate overfitting to some extent.
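For illustration, a minimal sketch of this caption-driven generation with simple style variation, using the diffusers StableDiffusion3Pipeline, is shown below; the checkpoint id, prompt template, and sampling settings are illustrative assumptions rather than our exact generation pipeline:

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Illustrative caption-to-image generation with style variation in the prompt.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large-turbo", torch_dtype=torch.bfloat16
).to("cuda")

caption = "A woman wearing a kimono with a wide obi sash at a spring festival"  # placeholder caption
styles = ["documentary photograph", "watercolor illustration", "street scene at dusk"]

for i, style in enumerate(styles):
    image = pipe(
        prompt=f"{caption}, {style}",
        num_inference_steps=4,   # turbo checkpoints are designed for few sampling steps
        guidance_scale=0.0,      # turbo checkpoints are typically run without CFG
    ).images[0]
    image.save(f"concept_variant_{i}.png")
```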
Q10: Are cues only visual?
We appreciate this insightful question. Although we do not explicitly enforce a concept-caption loss, this does not imply that the model ignores textual cues. First, we omit the concept-caption loss because captions are considered context-enriched extensions of concepts, rendering such a loss relatively trivial compared to cross-modal losses. Second, our training objective anchors images—with their visual cues—to concepts while maintaining the original alignment between images and captions (noting that concepts and captions share the same text encoder). Through this process, the three components—concept, caption, and image—gradually become tightly aligned, enabling concepts to capture both visual and textual cues. We will clarify this point in the revised manuscript to avoid potential misunderstandings.
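To make this dual anchoring concrete, the following minimal sketch combines a caption–image branch and a concept–image branch, each with hard negatives, into a single weighted objective; the λ weights and all names are illustrative rather than our exact training code:

```python
import torch
import torch.nn.functional as F

def branch_loss(img_emb, pos_txt_emb, neg_txt_emb, tau=0.07):
    """Contrastive loss for one text branch (captions or concepts) with hard negatives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(torch.cat([pos_txt_emb, neg_txt_emb]), dim=-1)
    logits = img @ txt.t() / tau
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits[:, : img.size(0)].t(), targets))

def dual_branch_objective(img, cap, neg_cap, con, neg_con, lam_cap=0.3, lam_con=0.7):
    """Weighted sum of the caption-image and concept-image branches; there is no
    explicit concept-caption term, since both text types share the same encoder."""
    return lam_cap * branch_loss(img, cap, neg_cap) + lam_con * branch_loss(img, con, neg_con)
```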
Q11: Any concern with using the same text encoder for concepts and captions?
Concepts are represented as single words (see Appendix D, Figure 7), while captions are context-rich descriptions generated by the LLM based on the concept and visual features. We intentionally use a shared text encoder to preserve the original alignment between images and captions while aligning concepts and images. Training a separate encoder for concepts alone is less meaningful and tends to degrade generalization performance. Thus, despite some multi-tasking concerns, we consider this a necessary trade-off.
Q12: Suggestion on extending pairs to cross-cultural concepts
Thank you for the suggestion. While we organize concepts by country and category, our twin matching only requires that paired concepts belong to the same cultural category and are visually similar, allowing for cross-cultural examples (e.g., kimono and kebaya in Figure 3). We appreciate the insightful example of the clay vessel. However, when visually similar objects serve different cultural functions, this kind of semantic disambiguation is likely beyond the capabilities of CLIP-like models alone. In such cases, we believe VLMs (CLIP-ViT + LLM) are better suited to perform high-level, multi-hop reasoning that incorporates background knowledge. Accordingly, our work focuses on enhancing CLIP’s ability to associate fine-grained visual features with cultural concepts, thereby laying the foundation for more effective downstream reasoning by LLMs.
Q13: Can the data creation section be shortened?
Thanks for the advice. We will move part of the detailed data creation process to the appendix.
Thank you for your comment and for recognizing the addressed points.
We sincerely appreciate your suggestion regarding openness. We plan to make both the model and the dataset publicly available via the Hugging Face platform. Additionally, we intend to share the code for data curation and fine-tuning on GitHub. We believe that this will help facilitate reproducibility and encourage further research based on our work. Hope this addresses your concerns.
Thank you for the response. I find that weaknesses 1, 2 and 4 are adequately addressed. I expect the authors to incorporate all new details and clarifications into the next revision of the paper.
Can the authors please comment on whether the model and dataset will be made public? With this kind of work, that is an extremely important aspect of the contribution.
Dear Reviewer 9Kid,
Thanks again for your recognition of our work!
We have prepared a simplified version of our code and placed it at this link. Hope this will address your concerns. We truly appreciate the valuable suggestions you have provided throughout the review process and will carefully incorporate your feedback as we revise the paper.
Sincerely,
The Authors.
Weakness 2: Missing details about the dataset and lack of clarity on whether the dataset and model will be released
Thank you for the helpful suggestion. We agree that providing more details about the dataset will improve clarity and transparency. Our filtered dataset currently contains 73,823 high-quality samples, selected from an initial pool of 99,996 concept–caption–image triplets with negatives—a scale significantly larger than existing benchmarks such as CROPE (~1k samples). Each concept is a single word, and the captions have an average length of 14.55 words. All images are synthetically generated from these captions using Stable-Diffusion-3.5-Large-Turbo, with an efficient generation throughput of approximately 3,000 images per hour on a single H20 GPU. We will incorporate these details into the main text in the revised version and release the dataset, model, and code once they are ready, with the aim of supporting future research and fostering continued progress in the community.
Weakness 3: Error analysis or outlook for future work is missing
Thank you for this valuable suggestion. The revised version will include a dedicated section analyzing failures, limitations, and future directions. As an example error case, we observe that both CLIP and CultureCLIP struggle with test cases where the visual distinction is highly abstract. One such case involves distinguishing between gongbi (meticulous style) and xieyi (freehand style) Chinese paintings. While both models misclassified a gongbi example, CultureCLIP showed lower confidence (72% vs CLIP's 78%), suggesting modest calibration improvement, though abstract distinctions remain challenging. Future work may explore the following directions:
- Enhancing CLIP’s ability to recognize abstract visual cues, such as artistic style or symbolic meaning.
- While our current approach utilizes Qwen2.5-VL as an MLLM-as-a-judge to assess cultural relevance, future work could explore more robust or interpretable alternatives to improve the reliability of cultural image generation.
- Investigating the distributional gap between synthetic and real-world images, and exploring training strategies such as mixing real and synthetic data to improve generalization and visual grounding.
Responses to the questions:
Thanks for your valuable advice; we respond to the questions one by one:
Q1: Suggestion that hard negatives without textual cues might better encourage visual learning
We agree that fine-grained hard negatives are more natural and crucial for helping CLIP learn subtle visual distinctions. Our original phrasing aimed to emphasize that high-quality image–caption pairs are typically required to train CLIP effectively. However, in the cultural domain, there are practical challenges: (1) such data is scarce and costly to annotate, (2) real-world captions are often lengthy and misaligned with visual content, and (3) constructing fine-grained hard negatives is difficult. To address this, we generate images from curated concepts using captions primarily as prompts to guide image generation and preserve image–caption alignment during fine-tuning. While we agree that removing textual cues may better force the model to learn visual distinctions directly, we consider the inclusion of captions a reasonable trade-off to maintain alignment and generalization ability.
Q2: Why is it called “Efficient” Contrastive Pre-training?
We agree that the term “efficient” may be misleading, as efficiency is not a primary focus of our paper. To avoid confusion, we will remove the word “Efficient” from the subsection title in the revised version and focus the discussion on contrastive pre-training methods relevant to our work.
Q3–Q4: What are the cultural categories and criteria for discarding concepts?
We manually defined eight major cultural categories: Cuisine, Clothing, Animals & Plants, Art, Architecture, Daily Life, Symbols, and Festivals (as shown in Appendix A). These are consistently applied in both the bottom-up and top-down pipelines. While the categorization involves a degree of subjectivity and may not be exhaustive, these categories capture a broad and representative set of cultural elements. In the bottom-up stage, we first use the Qwen2.5-VL model to assess the cultural relevance of each concept, and then perform classification on those deemed culturally related. A concept is discarded if the associated image or text is not strongly related to any of the eight predefined cultural categories. To encourage strict filtering and avoid overly loose assignments, we designed the prompt to favor rejection over uncertain inclusion, which helps us prioritize data quality over quantity.
Thank you for your careful and insightful comments. We sincerely appreciate your recognition that (i) our model, with enhanced cultural knowledge, addresses an important need and can contribute meaningfully to the community, and (ii) our proposed method is reasonable and demonstrates strong performance on appropriate benchmarks. Below, we address your concerns in detail.
Weakness 1: Limited benchmarks on general capabilities and missing introduction of cultural benchmarks
We sincerely thank the reviewer for the insightful suggestion. To provide a more comprehensive assessment of our model’s generalization ability, we have extended our evaluation to include several widely-used image classification benchmarks that were previously omitted due to space constraints. Specifically, we report results on FER2013, ImageNet-1k, ImageNet-A, ImageNet-O, ImageNet-R, VOC2007, CIFAR-10, and CIFAR-100, covering facial expression recognition, large-scale classification, out-of-distribution robustness, multi-label tasks, and both coarse- and fine-grained categories. Notably, the model’s performance on these general benchmarks remains stable and in some cases improves slightly, indicating that our method preserves — and can even enhance — generalization despite additional training. This supports the robustness and versatility of our approach. We report Top-1 Accuracy (Acc1), Top-5 Accuracy (Acc5), and Mean Per-Class Recall (MPCR) for each dataset. Complete results of performance on general benchmarks (%) are summarized below:
| Model | FER2013 | ImageNet-1k | ImageNet-A | ImageNet-O | ImageNet-R | VOC2007 | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|---|---|---|---|
| OpenAI CLIP | ||||||||
| Acc1 | 41.22 | 63.37 | 31.51 | 47.55 | 69.31 | 76.45 | 89.77 | 64.24 |
| Acc5 | 94.78 | 88.82 | 64.23 | 78.30 | 88.81 | 95.93 | 99.61 | 88.78 |
| MPCR | 36.10 | 63.36 | 32.63 | 48.85 | 67.92 | 80.59 | 89.83 | 64.21 |
| Caption (w/o neg) | ||||||||
| Acc1 | 41.57 | 63.37 | 31.61 | 47.80 | 69.28 | 76.52 | 89.84 | 64.48 |
| Acc5 | 94.68 | 88.82 | 64.31 | 78.10 | 88.80 | 95.98 | 99.62 | 88.91 |
| MPCR | 36.40 | 63.36 | 32.75 | 49.09 | 67.92 | 80.60 | 89.86 | 64.44 |
| Caption (w/o neg) + Concept (w/o neg) | ||||||||
| Acc1 | 41.20 | 63.35 | 31.59 | 47.65 | 69.33 | 76.42 | 89.86 | 64.40 |
| Acc5 | 94.82 | 88.83 | 64.20 | 78.25 | 88.00 | 95.94 | 99.63 | 88.84 |
| MPCR | 36.00 | 63.37 | 32.72 | 49.01 | 67.94 | 80.64 | 89.87 | 64.40 |
| Caption (w/ neg) | ||||||||
| Acc1 | 41.59 | 63.39 | 31.61 | 47.65 | 69.26 | 76.52 | 89.80 | 64.45 |
| Acc5 | 94.69 | 88.82 | 64.33 | 78.10 | 88.79 | 95.96 | 99.63 | 88.87 |
| MPCR | 36.42 | 63.39 | 32.72 | 49.04 | 67.87 | 80.58 | 89.83 | 64.41 |
| CultureCLIP | ||||||||
| Acc1 | 41.26 | 63.37 | 31.60 | 47.65 | 69.31 | 76.48 | 89.83 | 64.47 |
| Acc5 | 94.80 | 88.80 | 64.29 | 78.10 | 88.85 | 95.95 | 99.63 | 88.85 |
| MPCR | 36.21 | 63.37 | 32.77 | 49.00 | 67.96 | 80.60 | 89.85 | 64.41 |
Regarding the cultural benchmarks, detailed descriptions are currently provided in Appendix C. We appreciate the reviewer’s point and agree that including a brief summary in the main text would improve clarity. In the revised version, we will incorporate the following: To evaluate cultural understanding, we adapt three benchmarks—GlobalRG-Grounding, GlobalRG-Retrieval, and CROPE—into statement-ranking tasks suitable for CLIP-based models. Each task requires selecting the most semantically accurate description of a given image from several culturally grounded statements, thereby testing the model’s ability to capture fine-grained, culture-specific visual cues.
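For clarity, a minimal sketch of how such statement ranking can be performed with a CLIP-style model, using the Hugging Face transformers API, is shown below; the checkpoint, image path, and candidate statements are placeholders rather than actual benchmark items:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")     # placeholder checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image
statements = [                     # placeholder culturally grounded candidates
    "A woman wearing a kimono with a wide obi sash.",
    "A woman wearing a kebaya with a batik sarong.",
]

inputs = processor(text=statements, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image contains image-text similarity scores; the top-ranked
# statement is taken as the model's answer in the statement-ranking task.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
print(statements[scores.argmax().item()])
```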
We sincerely appreciate the time and effort the reviewers have dedicated to evaluating our paper. Their insightful feedback has greatly contributed to improving the manuscript and clarifying key aspects of our work. We are grateful that the reviewers recognized the following strengths:
- The focus on cultural awareness in VLMs is timely and important, addressing an underexplored challenge in multimodal learning (25A8, n9zq).
- The proposed method is reasonable and well-designed, including a clear data curation approach (9Kid, n9zq) and effective training strategy (25A8).
- The CulTwin dataset is high-quality and diverse (n9zq), and the CultureCLIP model is considered a useful tool for the community (9Kid).
- The paper is clearly written, with strong experimental results and ablations validating the approach (9Kid, n9zq).
During the rebuttal period, we were delighted to receive the reviewers’ positive feedback and glad that our responses effectively addressed each concern raised. We will incorporate their invaluable suggestions into the revision process. Thanks again for all the effort and support.
The Authors
This paper introduces CultureCLIP, a culturally aware CLIP that is fine-tuned on a synthetic cultural dataset with triplets that represent different cultural contexts. The reviewers highlighted several strengths of the paper, including the timeliness of the need for culturally aware models, the dataset construction methods, and the strong experiments. The reviewers also raised several issues, which were all addressed by the authors either with additional results or with promises to modify the paper. For example, the authors provided multiple satisfying clarifications, generalized their results to several other image classification benchmarks, provided visualizations, and committed to adding sections analyzing the errors of their model. The authors are asked to add all the promised changes to their final paper.