Object-centric binding in Contrastive Language-Image Pretraining
Object-centric inductive biases to address the binding problem in CLIP-like pretraining
Abstract
Reviews and Discussion
This paper presents OC-CLIP, a CLIP-based vision-language model enhanced with object-centric inductive biases for compositional understanding. The proposed method integrates a binding module (inverted cross-attention) with a structured similarity score guided by scene graphs. Experimental results on multiple benchmarks (SugarCrepe, GQA-spatial, COCO-spatial) show significant gains over baseline models, highlighting the benefits of slot-based image representation and relational modeling.
Strengths and Weaknesses
Strengths:
- A clearly articulated motivation around the binding problem in CLIP.
- Introduction of object-centric inductive biases via slot-attention-like mechanism and structured similarity score.
- Strong results on targeted compositional benchmarks.
Weaknesses:
- The text-only Vera baseline from the SugarCrepe paper is not reported or discussed, despite the fact that the ARO benchmark is explicitly designed to be partially solvable by language priors. This omission significantly weakens the claims regarding model improvements on ARO.
- The paper emphasizes that it avoids hard-negative captions during training. However, hard-negative scene graphs derived from text are used instead. These serve an equivalent function, so the distinction seems more rhetorical than substantive.
- OC-CLIP is fine-tuned on COCO, GQA, and VG, which are also the sources of the evaluation benchmarks. Meanwhile, some baselines are evaluated under stricter zero-shot settings. Without proper controls, it is unclear whether the observed gains stem from architectural improvements or data distribution alignment.
Questions
At this stage, due to the above concerns—particularly the lack of fair and comprehensive comparisons, the potentially misleading claims about hard-negative usage, and the unclear generalization performance—I am unable to recommend acceptance. That said, I find the direction promising and technically sound, and would be open to increasing my score if the authors can address these issues in rebuttal.
Limitations
Yes
Final Justification
The rebuttal fully addresses my concerns; I lean towards accept as promised.
Formatting Concerns
No
We thank the reviewer for their insightful feedback and take the opportunity to address the listed concerns below:
Q1 - Missing Text-only Vera baseline
We acknowledge the importance of including the Vera baseline in our analysis and will incorporate it into our updated manuscript for both Table 1 and Table 2. Our results demonstrate that our model's enhancements are robust and extend beyond the capabilities of text-only baselines, which strengthens our contribution.
| | ARO-R | ARO-A | SC - swap att | SC - swap obj |
|---|---|---|---|---|
| Vera | 61.7 | 82.6 | 49.4 | 49.2 |
| OpenCLIP-ft | 50.1 | 60.0 | 63.1 | 72.4 |
| NegCLIP | 80.2 | 71.0 | 75.4 | 75.2 |
| OC-CLIP | 84.9 | 84.0 | 88.9 | 83.5 |
We would like to emphasize that the Vera text-only baseline was initially used by the SugarCrepe paper as a motivation to develop their benchmark and to challenge models beyond mere plausibility and grammatical correctness. In fact, the Vera baseline performs at random chance on all the SugarCrepe splits.
We will update our results tables with the Vera scores for the sake of completeness of baseline coverage. We thank the reviewer for this suggestion!
Q2 - Hard-negative usage clarification
We appreciate the reviewer's concern regarding our statement about not using hard negatives during training. Upon reflection, we believe this statement may have been misleading, as our proposed method does utilize a form of negatives (although not fine-grained) derived from the scene graph structure.
Therefore, we'd like to contrast the negatives used in OC-CLIP with the hard negatives leveraged by other approaches. Our method employs "negative" scene graphs, as defined in Eq. 5, which are derived from the scene graph at a coarse level (manipulating relationship directions only, which is needed to learn a non-symmetric relation score). Other hard-negative methods, however, operate at a finer-grained level, generating new negatives by altering specific attributes or part-of-speech tags in the ground-truth sentence (modifying both node and edge content as well as relationship directions). Our approach does not employ such fine-grained modifications.
We in fact refer to other methods as employing “finegrained hard negatives” as specified in lines 6, 204, 214 and 373.
We believe this distinction in granularity is essential to consider when comparing our approach to other hard-negative-based methods, especially in our controlled experiments (Section 4.1) focused on attribute-binding where fine-grained hard negatives consist of swapping the attributes of two objects in the scene. In those controlled experiments, no hard negatives are used for OC-CLIP (the controlled scene graphs do not have any edges).
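To make the granularity difference concrete, here is a toy illustration (hypothetical scene-graph format, not the paper's actual parser output) contrasting our coarse direction-swapped negative graphs with a fine-grained textual hard negative:

```python
# Illustrative example only: hypothetical scene-graph format, not the paper's exact data structure.
caption = "a red apple to the left of a blue car"

# Parsed positive scene graph: nodes + directed edges (subject, relation, object).
positive_graph = {
    "nodes": ["red apple", "blue car"],
    "edges": [("red apple", "left of", "blue car")],
}

# OC-CLIP-style coarse negative (cf. Eq. 5): only the edge direction is flipped;
# node contents are left untouched.
def direction_swapped_negative(graph):
    return {
        "nodes": graph["nodes"],
        "edges": [(o, r, s) for (s, r, o) in graph["edges"]],
    }

coarse_negative = direction_swapped_negative(positive_graph)
# -> edges: [("blue car", "left of", "red apple")]

# Fine-grained textual hard negative (NegCLIP/DAC-style methods): the attributes
# or parts of speech inside the caption itself are rewritten.
fine_grained_negative_caption = "a blue apple to the left of a red car"
```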
In order to avoid any confusion, we will update our manuscript with a dedicated paragraph clearly stating the difference between the graph-based relationship negatives (as defined in Eq 5) used in OC-CLIP and the fine-grained negatives used by other approaches. We will clarify the claim accordingly.
Q3 - In-distribution data vs. architectural improvements
We would like to emphasize that the OpenCLIP-ft baseline is finetuned on the same data mixture as ours (COCO, VG, GQA). Therefore, we believe this comparison isolates the effectiveness of the proposed modules versus the use of data from a distribution similar to ARO/SugarCrepe. When compared to this baseline, OC-CLIP shows improvements of +15.1% on the hard swap-attribute split and +23.2% on ARO-A.
Moreover, the controlled experiments in Section 4.1 are specifically designed to further isolate the effectiveness of our model's contributions by comparing OC-CLIP and CLIP trained on a 3D simulated domain with a fixed vocabulary size.
The results of Section 4.3 further confirm that the improvements come from our architectural biases. In this setting we train both CLIP and OC-CLIP from scratch on CC3M and CC12M and show that OC-CLIP not only has better sample efficiency but also outperforms CLIP by a large margin in zero-shot ImageNet classification and compositional understanding (e.g., SugarCrepe).
We will update the baselines description paragraph of our manuscript (l. 273) accordingly to avoid any confusion and clearly state that we finetune the OpenCLIP-ft baseline on the same data as OC-CLIP.
Regarding the other baselines, we would like to emphasize that all of them except DAC are finetuned using data in the COCO/VG domain. The exception, DAC, recaptions 3M images from the CC3M dataset along with dense and fine-grained hard negatives.
We hope this clarifies our approach and addresses the reviewer's concerns. We are happy to answer any additional questions the reviewer might have.
Thank you for your new results and discussions. My concerns have been fully resolved. I tend to raise my score to 4 (Borderline accept) before discussion with other reviewers.
We thank the reviewer for supporting our work! We are happy to answer any additional questions the reviewer might have during the discussion period.
This paper proposes Object-Centric CLIP (OC-CLIP), a novel method that enhances compositional scene understanding in CLIP-like models by incorporating object-centric inductive biases. Instead of relying on hard-negative samples, OC-CLIP uses a binding module to align a slot-based image representation with a scene graph extracted from text, capturing object attributes and spatial relationships more effectively. The model achieves significant improvements on compositional benchmarks (e.g., +16.5% on SugarCrepe, +89% on COCO-spatial) and shows strong sample efficiency and scalability. However, it depends on a parser for scene graph extraction and requires further scaling to match larger CLIP variants.
Strengths and Weaknesses
Strengths:
- Introducing the binding module to enhance the compositional understanding capabilities of CLIP-like models intuitively makes sense and is technically sound.
- The proposed algorithm is easy to integrate into the existing CLIP-like models' training pipeline.
Weaknesses:
- The proposed algorithm appears unable to capture directional spatial relations. Take the relationship "A red apple to the left of a blue car" (Line 142) as an example. I don't think the proposed structured similarity score is able to learn the spatial relationship between the apple and the car, because in Eq. 2 the relationship embedding is learnt from f_s([r, S_s]) + f_o([r, S_o]), where the two terms are simply added. This operation makes "A to the left of B" and "B to the left of A" exactly the same, because A + B = B + A.
Questions
Please address the question listed under the Weaknesses section. I will consider raising my final score if these concerns are adequately addressed.
Limitations
yes
Final Justification
I have carefully reviewed the authors' response and acknowledge that it addresses my main concerns. Therefore, I have decided to raise my score from 3 to 4.
Formatting Concerns
No Paper Formatting Concerns
We thank the reviewer for their feedback and appreciate the opportunity to clarify the limitations mentioned. We understand the importance of accurately capturing directional spatial relationships in our proposed algorithm. We would like to clarify how our structured similarity score is in fact designed to handle such directional nuances.
Explanation of Directional Handling
In our model, the functions f_s(·) and f_o(·) are designed on purpose with different learnt parameters (θ_s and θ_o) to capture the directionality of relationships and break the symmetry of the addition. This means that the embeddings for the subject and object are processed differently, allowing the model to distinguish between "A left of B" and "B left of A."
Simplified Example
To illustrate this, consider a simplified example where the slots S are 2D embeddings containing the (x, y) coordinates of a visual object.
In that case, f_s and f_o can easily model a simple 'left of' relationship by extracting the x coordinate with opposite signs (+ for f_o and - for f_s), such that:
f_s([left, S_A]) = -x_A and f_o([left, S_B]) = +x_B.
Then Score('A left of B') = f_s([left, S_A]) + f_o([left, S_B]) = x_B - x_A,
while Score('B left of A') = f_s([left, S_B]) + f_o([left, S_A]) = x_A - x_B,
so Score('A left of B') ≠ Score('B left of A'),
hence capturing the directionality of the spatial relationship 'left of'.
This approach is easily generalizable to higher dimensions. In a multi-dimensional space, the learnt functions f_s and f_o can be extended to operate on vector embeddings, allowing the model to capture complex spatial relationships by considering all relevant dimensions. The distinct parameterization ensures that directional relationships are accurately modeled regardless of dimensionality.
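As a purely illustrative sketch of this asymmetric parameterization (module and variable names are ours, not the paper's exact implementation), two separately parameterized branches already break the subject/object symmetry:

```python
import torch
import torch.nn as nn

class RelationScore(nn.Module):
    """Asymmetric relation score: f_s and f_o have distinct parameters (theta_s, theta_o)."""
    def __init__(self, slot_dim, rel_dim, hidden=128):
        super().__init__()
        self.f_s = nn.Sequential(nn.Linear(slot_dim + rel_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.f_o = nn.Sequential(nn.Linear(slot_dim + rel_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, rel_emb, subj_slot, obj_slot):
        # score(s, r, o) = f_s([r, S_s]) + f_o([r, S_o]); the two branches differ,
        # so swapping subject and object generally changes the score.
        return self.f_s(torch.cat([rel_emb, subj_slot], dim=-1)) + \
               self.f_o(torch.cat([rel_emb, obj_slot], dim=-1))

score_fn = RelationScore(slot_dim=256, rel_dim=256)
r, s_a, s_b = torch.randn(256), torch.randn(256), torch.randn(256)
print(score_fn(r, s_a, s_b).item(), score_fn(r, s_b, s_a).item())  # differ in general
```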
Learning Directionality with a local relational loss
To further enhance the model's ability to learn directional relationships, we employ a local relational loss (as described in Eq. 3.2 and Line 197). This involves swapping the subject and object indices to create a local negative example, effectively enabling our module to learn a non-symmetric relation score. We have ablated this local loss in Appendix 3.2 to demonstrate its impact.
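One plausible instantiation of such a swap-based local loss, reusing the hypothetical RelationScore sketch above (the actual formulation is the one given in Eq. 3.2 of the paper; this is only an illustrative contrastive variant):

```python
import torch.nn.functional as F

def relational_loss(score_fn, rel_emb, subj_slot, obj_slot):
    # Positive: ground-truth subject/object order; local negative: the swapped order.
    pos = score_fn(rel_emb, subj_slot, obj_slot)
    neg = score_fn(rel_emb, obj_slot, subj_slot)
    # Encourage the correctly ordered triplet to score higher than its swap.
    return F.softplus(neg - pos).mean()
```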
Empirical Validation
Our results on the GQA-spatial and VG-spatial splits from the What's Up benchmark (Table 2), which are designed to exclusively test for spatial relationship understanding (e.g., 'a chair to the left of the table'), further support the model's capability. We improve accuracy from random chance (for CLIP) to over 90%, indicating the model's effectiveness in learning non-symmetric relation scores.
Conclusion
By employing distinct parameter sets for f_s and f_o, and leveraging the local relational loss, our model is capable of learning and distinguishing directional spatial relationships. This design choice ensures that the model does not treat "A left of B" and "B left of A" as equivalent, thus addressing the concern raised. We will emphasize this asymmetrical parametrization in the main method description to avoid any confusion.
We hope this explanation clarifies how our score module can handle directional spatial relationships. We are open to answering any additional questions the reviewer might have.
I have carefully reviewed your response and acknowledge that it addresses my main concerns. Therefore, I have decided to raise my score from 3 to 4.
We thank the reviewer for their answer and for supporting our work! We are happy to answer any additional questions the reviewer might have during the remainder of the discussion period.
Existing contrastive models like CLIP excel at recognizing individual objects but have limitations in understanding spatial relationships in multi-object scenes and correctly binding object attributes, a problem known as the "binding problem". This paper introduces Object-Centric CLIP (OC-CLIP), a new pretraining method aimed at improving the compositional understanding of Vision-Language Models (VLMs) for complex scenes. OC-CLIP boosts performance in real-world benchmarks for attribute binding and spatial relationship understanding, and achieves notable improvements in zero-shot classification and compositional understanding tasks.
Strengths and Weaknesses
Strengths:
- The binding module utilizes an inverted cross-attention mechanism and introduces default query tokens to process visual information not explicitly mentioned in the text, a design that is unique.
- The proposed structured similarity score combines object and relationship scoring. It is also supplemented by a relational loss to enforce the learning of non-symmetric relationships, providing a novel and more refined approach to image-text alignment.
- The paper demonstrates that enhancing compositional understanding through model architectural design (inductive bias) can be more effective than solely relying on large-scale hard-negative augmentation. This holds significant practical implications for resource-constrained or rapid-deployment scenarios.
Weaknesses:
- The OC-CLIP method relies on a parser to extract object-centric attributes and spatial relationships from text descriptions. Does the paper quantify the impact on OC-CLIP's downstream performance if the parser produces low-quality, incomplete, or erroneous scene graphs?
- Although the paper demonstrates OC-CLIP's scaling potential at the 15M scale, the model still needs to be further scaled to at least the 400M scale to be fully comparable to all CLIP variants.
- OC-CLIP introduces a notable computational overhead due to the binding module requiring additional cross-attention operations. Despite the paper implementing mitigation strategies, such as using a smaller text encoder and operating in a reduced embedding space for the binding module, there is still a significant 2.2x overhead for the base architecture.
- The core of the binding module is to generate "slots". However, the paper does not extensively discuss how the number of slots or their granularity affects performance. A fixed number of slots might not adapt well to scenes with a dynamic number of objects, and the choice of slot granularity could also influence the model's understanding capabilities. The paper does not provide insights into how to select or dynamically adjust the number of slots.
Questions
- Despite the implemented mitigation strategies, such as using a smaller text encoder and operating the binding module in a reduced embedding space to lower computational overhead, the base architecture still exhibits a significant 2.2x overhead. Aside from the strategies mentioned in the paper, are there any other potential optimization methods that could further reduce this computational cost?
- The paper discusses error modes of different parsing methods in Appendix A.4 and qualitatively analyzes the advantages of LLM-based parsers in relation understanding. To what extent would OC-CLIP's downstream performance be impacted if the parser generates low-quality scene graphs, e.g., incomplete or containing erroneous objects/relations? Could this impact be mitigated through the model's inherent robustness or specific training strategies?
- The paper does not extensively discuss how the number of slots or their granularity affects performance. In designing the binding module, how is an appropriate number of slots chosen to adapt to a dynamic number of objects in a scene? Has there been exploration into mechanisms for dynamically allocating the number of slots, or the impact of combining different granularities on the model's understanding capabilities? What types of visual information do the "default query tokens" introduced in the binding module learn during training, and do they offer insights into the model's "background understanding" or "non-core object recognition" abilities?
- Considering how OC-CLIP enhances CLIP's compositional understanding through object-centric inductive biases and a structured similarity score, when applying OC-CLIP's core mechanisms to large Vision-Language Models, could this effectively improve their visual comprehension and significantly mitigate visual hallucination, specifically attribute hallucination and relation hallucination? More precisely, how might OC-CLIP's slot-structured representations and scene graph constraints help VLMs more accurately bind object attributes and understand spatial relationships, thereby reducing the generation of factually incorrect image descriptions?
Limitations
yes
Final Justification
After careful consideration of their responses and alignment with fellow reviewers, I will maintain my original rating.
Formatting Concerns
none
We thank the reviewer for their insightful feedback and for supporting our work! We take the opportunity to answer the listed questions below:
Q1 - Additional strategies to further optimize the computational cost
Since the bottleneck of the binding module is the number of query-key comparisons (where each query node from the scene graph needs to be compared to each key candidate from the visual backbone), any method that reduces the number of candidate keys could drastically improve computational efficiency during training. In that spirit, combining our binding module with token merging methods such as Token Merging (ToMe, referenced below) could constitute a good avenue for further computational cost improvement.
We would also like to emphasize that this 2.2x overhead is present at training time only (because each element of the batch needs to perform the binding query-key attention with all the other elements of the batch to compute the CLIP loss). At inference time the overhead is minor.
Moreover, we would like to emphasize that while our model introduces a computational overhead, our training-from-scratch results in Section 4.3 show better sample efficiency on downstream zero-shot performance when compared to CLIP. For that reason we believe the computational overhead introduced by the binding module is a reasonable trade-off for better general performance and compositional understanding.
Token Merging: Your ViT but Faster
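As a rough illustration of this direction, below is our own simplified sketch of ToMe-style bipartite token merging applied to the visual key candidates before the binding attention; this is not part of OC-CLIP's implementation, and the function and variable names are ours:

```python
import torch
import torch.nn.functional as F

def merge_tokens(x, r):
    """Simplified ToMe-style bipartite merging: reduces N tokens to N - r.

    x: (N, D) visual key candidates from the ViT backbone; r: number of merges.
    Fewer key candidates directly shrink the query-key comparisons in the binding module.
    """
    a, b = x[0::2].clone(), x[1::2].clone()              # alternating bipartite split
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()
    best_sim, best_idx = sim.max(dim=-1)                 # most similar partner in b for each a-token
    to_merge = best_sim.argsort(descending=True)[:r]     # a-tokens selected for merging
    dst = best_idx[to_merge]
    b[dst] = 0.5 * (b[dst] + a[to_merge])                # absorb each selected a-token into its partner
                                                         # (duplicate destinations keep one merge; fine for a sketch)
    keep = torch.ones(a.shape[0], dtype=torch.bool)
    keep[to_merge] = False
    return torch.cat([a[keep], b])                       # N - r tokens remain

keys = torch.randn(196, 256)                             # e.g. ViT-B/16 patch tokens
print(merge_tokens(keys, r=98).shape)                    # torch.Size([98, 256])
```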
Q2 - Discussion on Robustness of the Parser
The quality of the parser definitely constitutes a bottleneck in terms of performance. We also comment on the quantitative impact of the parser quality in Figure 6, where the parser's ability to extract good relationship structure greatly impacts performance on the relational tasks (a 12 to 15% performance drop when using a spaCy-based parser versus an LLM-based parser).
We anticipate that the following mitigation strategies could greatly improve the robustness of the parsing:
- Fine-tuning an LLM-based parser on noisy captions to extract good scene graphs would be straightforward (although costly).
- Including more noisy and hard parsing examples in the context prompt of the LLM-based parser.
- Allowing the LLM-based parser to filter out bad data points by prompting it to output an empty graph (which is then ignored) when unsure about the scene graph parsing.
We believe those directions constitute great avenues for future extensive research on the parsing method itself and are orthogonal to our approach.
Q3a - Number of slots analysis
We would like to clarify that our approach in fact handles a dynamic number of slots, as opposed to fixed-slot (Slot Attention-like) methods. The extraction of the slots in our binding module is conditioned on the input scene graph, such that the number of visual slots equals the number of query nodes in the input graph (corresponding to the number of parsed textual objects). We believe this design is key to handling the inherently ill-defined notion of an "object" in a visual scene, which can be understood at different levels of granularity (as emphasized by the reviewer). In our case the visual patches are grouped conditioned on the objects (i.e., query nodes) mentioned in the input graph, and the number of resulting slots equals the number of nodes in the scene graph.
To give a more concrete example, let us consider an image containing several colored chairs. For that same image, different scene graphs will lead to different slot decompositions depending on the granularity of the caption:
- (1) "a group of chairs" will be parsed as a single-node graph ("a group of chairs") that groups together the visual information coming from all the chairs into one slot;
- (2) "a red chair and a blue chair" will be decomposed into a graph with 2 separate query nodes ("red chair" and "blue chair"), and each node will group the visual patches that correspond to a specific chair into a separate slot, resulting in 2 slots.
We believe this conditioning mechanism is key to the dynamic handling of plausible scene decomposition and hope we have clarified this point.
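As a toy illustration of this conditioning (hypothetical parser output format, not the paper's exact data structure):

```python
# Two captions of the same image, parsed at different granularities.
graph_coarse = {"nodes": ["a group of chairs"], "edges": []}
graph_fine = {"nodes": ["red chair", "blue chair"], "edges": []}

# The binding module instantiates one query (hence one slot) per parsed node,
# so the number of slots follows the caption's granularity:
num_slots_coarse = len(graph_coarse["nodes"])  # 1 slot grouping all chairs
num_slots_fine = len(graph_fine["nodes"])      # 2 slots, one per mentioned chair
```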
Q3b - Default Query Token role
Since the perceptual grouping is done using inverted cross-attention (i.e., by applying the softmax along the query dimension), this inductive bias pushes all the visual key candidates to be softly matched to at least one query (as explained in the Slot Attention paper referenced below). However, natural captions are far from being an extensive description of the visual content and can arbitrarily focus on a single part of the image only. Since we only want the relevant visual information (as captured by the input scene graph) to be extracted from the candidate patches, we need to re-route the "unnecessary" visual information towards other tokens, which we coin default query tokens. These tokens participate in the competitive grouping attention but are not taken into account when computing the score. Their role is thus to absorb any remaining visual information that is not mentioned in the input scene graph.
Object-centric learning with Slot Attention
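For readers who prefer code, here is a minimal sketch of inverted (query-normalized) cross-attention with default query tokens; variable names and simplifications (single head, no iterative refinement) are ours and do not reflect the paper's exact implementation:

```python
import torch
import torch.nn as nn

class BindingAttention(nn.Module):
    """Inverted cross-attention sketch: softmax over queries, so queries compete for each patch."""
    def __init__(self, dim, num_default=4):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learned default queries that absorb visual content not mentioned in the scene graph.
        self.default_queries = nn.Parameter(torch.randn(num_default, dim) * 0.02)

    def forward(self, node_queries, patches):
        # node_queries: (N_nodes, D) text-derived query nodes; patches: (N_patches, D) visual candidates.
        q = self.q_proj(torch.cat([node_queries, self.default_queries], dim=0))
        k, v = self.k_proj(patches), self.v_proj(patches)
        attn = (q @ k.t()) / q.shape[-1] ** 0.5      # (N_nodes + num_default, N_patches)
        attn = attn.softmax(dim=0)                   # inverted: normalize over queries, not keys
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)  # weighted mean per query
        slots = attn @ v                             # one slot per query
        return slots[: node_queries.shape[0]]        # default-query slots are excluded from the score

binder = BindingAttention(dim=256)
slots = binder(torch.randn(2, 256), torch.randn(196, 256))  # 2 parsed nodes -> 2 slots
print(slots.shape)                                          # torch.Size([2, 256])
```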
Q4 - Extension to MLLMs general framework
While a direct extension of OC-CLIP to a more general MLLM framework is not straightforward, a good avenue for future work could be to study the impact of the representation learnt by the OC-CLIP vision backbone, when trained from scratch (at a larger 400M or 2B scale, like its CLIP counterpart), on a downstream LLaVA-style MLLM (using the OC-CLIP backbone stitched to an LLM via an MLP connector). Indeed, without any specific patch-token constraints, CLIP-like training does not force any locality on the visual patch representations, which might in turn lead to poor feature binding at the visual representation level of the MLLM. Techniques like masked image modeling or local iBOT losses (as used in SigLIP 2 and DINOv2) aim to improve the locality and density of the visual patch features. Therefore, it could be interesting future work to study how OC-CLIP's local graph-matching constraints might similarly impact the locality of the visual patches used by a downstream MLLM. As this question requires full-scale training from scratch, we leave it for future work.
We hope we have addressed the reviewer concerns and are happy to answer any additional questions the reviewer might have.
We would like to thank the reviewer again for their time and for supporting our work! We hope our rebuttal addressed the reviewer's main concern. We are happy to answer any additional questions the reviewer might have during the remainder of the discussion period.
Dear Reviewer ArR9,
could you please let us know your feedback based on the rebuttal provided for the paper?
Best! Your friendly AC
The authors introduce Object-Centric CLIP (OC-CLIP), which enhances CLIP with object-centric inductive biases to improve compositional understanding, which CLIP alone cannot achieve. The proposed model not only enhances the performance of CLIP-based models in multi-object compositional understanding, but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
Strengths and Weaknesses
Strengths:
1. It tries to deal with complex compositional scenes involving multiple objects and their spatial relationships, which CLIP cannot handle.
2. It can work with noisy data in a zero-shot manner.
Weaknesses:
1. These modules do not seem to be a good plug-in for current vLLM (not end-to-end), and they may decrease training and inference speed.
2. The performance does not seem to be good on some tasks in Table 1.
3. The parsing procedure at the beginning may limit the performance.
Questions
1. How can the proposed method be combined with current vLLM without decreasing training and inference speed, given that it is not end-to-end?
2. How do you explain that the performance of the proposed method is not so good on some tasks?
3. Figures 2(b) and (c) do not seem to be straightforward and easy to understand; they misled me into thinking of correlation.
4. How will the parser impact the performance?
Limitations
1. How to combine the proposed method with current VLLM is not discussed, especially the training and inference performance.
2. How to further improve the performance of the method, as on some tasks it is not so good.
3. This work has not discussed the impact of the parser on the whole workflow, which may lead to a disaster.
Formatting Concerns
No formatting concern.
We thank the reviewer for their feedback and appreciate the opportunity to clarify the limitations mentioned. Below are our responses to the specific points raised:
Q1-a: Compatibility with vLLM
vLLM is a framework that optimizes inference and serving of LLM-based models. We do in fact use vLLM to optimize the scene graph parsing efficiency (Appendix A.4). However, vLLM is generally not considered useful for training and inference of embedding models like CLIP, because the primary optimizations vLLM implements (e.g., chunked prefill, automatic prefix caching) focus on generation efficiency.
We would, however, like to clarify that our model is an embedding model and is in fact end-to-end. It can therefore benefit from any additional optimizations applicable to CLIP-like models.
We would also like to emphasize that the scene graph parsing step is not part of the model but a data preprocessing step done offline, as specified in l. 136. The model itself is end-to-end and can benefit from further wrapping and optimizations in the same way as CLIP. We will update the main figure to clarify this aspect.
Q1-b: Computational Efficiency:
Our module introduces a trade-off in computational efficiency, which we discuss in detail in the "Computational Analysis" section of Appendix A.3 (l752). We quantitatively measure the computational overhead of our modules and demonstrate that for larger-scale ViT models (e.g., ViT-L), the binding module does not become a bottleneck, as it remains fixed when scaling the ViT backbone. To mitigate overheads when training our model from scratch, we propose:
- Using a smaller text encoder (fewer layers and reduced width), since the OC-CLIP text encoder only needs to encode information about single objects and relationships.
- Operating the binding module on a reduced embedding space (256 vs. 512 for the original CLIP).
Q2 - Drop in Performance on Some Tasks:
The binding module's relatively low training vocabulary size (e.g., the COCO domain), due to being trained from scratch, is the primary reason it does not outperform (and is not expected to outperform) all baselines on the "replace" and "add" tasks of Table 1. We anticipate this performance gap would close if the model were trained on the original OpenCLIP pretraining set (LAION-400M), as indicated by the results in Section 4.3 at the 15M scale.
However, we want to highlight that on tasks that require fine-grained object and attribute binding (spatial and swap splits), our model outperforms all the other baselines, even when they are trained on orders of magnitude more data (e.g., DAC, fine-tuned on CC3M augmented with hard negatives).
Specifically, the "add obj" (91.1% vs. 93.4% for its CLIP counterpart) and "replace obj" (94.6% vs. 95.4% for its CLIP counterpart) splits in SugarCrepe show slightly lower performance for our model. While both models achieve high accuracy, this difference stems from two factors:
- Vocabulary Generalization: Our binding module and multimodal alignment scores are trained from scratch, limiting their generalization to the vocabulary they were exposed to during training.
- Out-of-Distribution Negatives: The "replace" and "add" negatives in SugarCrepe are generated using an LLM (GPT-4). This process does not guarantee that the objects/attributes introduced are within the distribution of our training data (COCO/VG). These LLM-generated negatives may include objects/attributes outside the COCO domain but within OpenCLIP's broader pretraining domain, leading to a slight, albeit minimal, drop in our model's accuracy. Training on the original OpenCLIP pretraining set (Laion400m) is expected to mitigate this issue.
Q3 - Lack of Parser Impact Discussion:
We would like to clarify that we already extensively discuss the impact of scene graph parsing in Section A.4 of the Appendix (l. 777), as referenced in lines 136 and 382 of the main text, where we:
- Compare different parsing methods (automatic, fine-tuned lightweight parser, and LLM-based parser) and describe their qualitative and quantitative performance impacts, along with the error types each can introduce.
- Comment on the limitations of LLM-based parsing, especially when ambiguous parsing requires visual grounding, providing interesting directions for future work.
- Discuss and reference the scene graph parsing cost in the preprocessing step.
L3 -Clarity of section 4.1 figures
Figures 2(b) and 2(c) aim at comparing CLIP and OC-CLIP sample efficiency in a controlled setting where we vary the training data size (y-axis) and the amount of hard negatives (x-axis). They show that, without any hard negatives and with only 30% of the possible animal-pair combinations, OC-CLIP outperforms CLIP even when CLIP is trained on 70% of the colored animal-pair combinations and 70% of all the possible swap hard negatives. We will update the figure titles of Section 4.1 to emphasize that the numbers correspond to accuracy and not correlation. We apologize for the confusion!
We hope this clarifies our approach and addresses the reviewer's concerns. We are happy to answer any additional questions the reviewer might have.
As the discussion period is almost over, we would like to know whether our rebuttal addressed the reviewer's concerns. Is there any point that we can clarify during that time? We are happy to answer any additional questions the reviewer might have.
Dear Reviewer Hp4d,
could you please let us know your feedback based on the rebuttal provided for the paper?
Best! Your friendly AC
Dear Reviewer Hp4d,
As the discussion period is coming to an end, we would like to know whether our rebuttal adequately addressed your initial concerns.
Specifically, if we understood your concerns correctly, we believe we addressed the following:
- Clarification about the applicability of vLLM to LLM-based models: we use vLLM for the parsing preprocessing step, and our model is end-to-end and can thus benefit from any other optimization applicable to CLIP-like embedding models.
- Emphasis that the parser discussion (both qualitative and quantitative analyses) is actually present in our manuscript; we pointed out the references in the main text to clarify that point.
- Clarification about the content and intent of the controlled-experiment Figures 2(b) and (c).
We hope that the reviewer is satisfied with our answers and are happy to provide any further clarifications before the end of the discussion period.
Sincerely,
The authors
The paper proposes an approach to integrate inductive biases into the pretraining of CLIP-like models to improve their compositional understanding. This is done via a binding module that introduces scene graph information into the training of the model. The evaluation focuses on multi-object compositional understanding and shows good results.
The paper was considered by four reviewers. The final ratings were: BR (reviewer did not answer the rebuttal) - A - BA - BA. Reviewers mainly highlighted the paper's approach towards compositional understanding of CLIP-style models as well as the demonstrated scaling behaviour. Most questions and weaknesses could be resolved during the rebuttal.
The AC follows the majority of the reviewer voting and recommends accepting the paper. The AC would encourage the authors to integrate the findings of the rebuttal in the CR version of the paper.