Thank you very much for dedicating your time and expertise to reviewing our submission. The depth of your analysis and your constructive criticism are immensely beneficial. We sincerely hope our reply answers your questions.

W4: The core concept is "negotiation," which implies a dynamic, give-and-take process. But your negotiator is trained once and then fixed. How is this different from a multi-input fusion network? The term feels like a bit of a stretch.

Q1: The same as point 4 in the Weaknesses. How is this different from a fusion network?

The Negotiator does resemble a multi-input fusion network in structure. However, the two play fundamentally different functional roles within the collaborative framework, as we elaborate below:

Input/Output: The Negotiator takes as input the local representations that each modality's encoder produces from the same perspective of the environment, and outputs a common representation. In contrast, a multi-input fusion network takes as input the features that agents encode from different perspectives of the environment, and outputs features fused with multi-perspective information.
Estimator: The estimator in the Negotiator evaluates each modality's contribution to the common representation, whereas the foreground estimator in a multi-input fusion network assesses the probability that a foreground object appears at the corresponding location in BEV.
Role of Module: The Negotiator's role is to negotiate the common representation and implicitly preserve its information in the model parameters, whereas a multi-input fusion network is designed to integrate multi-view information.
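
The Input/Output and Estimator distinctions above can be illustrated with a minimal sketch. It is a simplified illustration, not the actual NegoCollab architecture: the function names, the softmax weighting, and the use of plain lists as features are all assumptions made for this example. The point is only that the inputs are per-modality features of the same view, and that the scores weight each modality's contribution rather than predict foreground probability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def negotiate(local_feats, contribution_scores):
    """Hypothetical negotiation step (assumed form, for illustration only).

    local_feats: one feature vector per modality, all describing the SAME view.
    contribution_scores: one score per modality; in the real model an estimator
    would produce these, here they are simply passed in.
    Returns the common feature and the per-modality weights.
    """
    weights = softmax(contribution_scores)
    dim = len(local_feats[0])
    common = [
        sum(w * feat[i] for w, feat in zip(weights, local_feats))
        for i in range(dim)
    ]
    return common, weights
```

With equal scores the common feature is the plain average of the modality features; a larger score shifts the common feature toward that modality.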

In NegoCollab, "negotiation" refers to the mutual compromise among the agents in the initial alliance as they negotiate the common representation from their respective local representations. New agents that join dynamically only need to align with the common representation negotiated by the initial alliance.

W3: The paper introduces a highly complex loss function with distribution, structural, and even a "pragmatic" alignment component, kind of over-engineered.

Q2: The loss function is complex. The "pragmatic alignment" part, which needs a whole separate network, seems like a lot of extra work. Have you tried removing it? I'm curious how much of your performance gain is due to this versus the core idea.

The multi-dimensional alignment loss plays a critical role in facilitating better alignment between unimodal local representations and the multimodal common representation. In Section 4.3, we perform comprehensive ablation studies on the individual components of the multi-dimensional alignment loss. The comparative results after completing the first training phase are summarized in the table below:

uni-dis	uni-stru	uni-pragma	AP@0.5	Inc.	AP@0.7	Inc.
✓			0.609	-	0.496	-
✓	✓		0.655	+7%	0.532	+6.8%
✓		✓	0.671	+9.2%	0.538	+7.8%
✓	✓	✓	0.711	+14.3%	0.566	+12.4%

Table 1. Ablation study of the multi-dimensional alignment loss. The collaborating agents are m1 and m2.

Take AP@0.5 as an example: adding only the structural consistency loss raises performance from 0.609 to 0.655 (+7% in Table 1); adding only the pragmatic consistency loss raises it to 0.671 (+9.2%); adding both raises it to 0.711 (+14.3%). These results demonstrate the importance of the multi-dimensional alignment loss for heterogeneous collaboration performance. Furthermore, the separate network required for the auxiliary pragmatic alignment is a 2D occupancy detection head, implemented as a single-layer 2D convolutional network with approximately 1.3K parameters, so it introduces virtually no additional training overhead.
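
As a quick check of the arithmetic behind Table 1, the sketch below recomputes the gains from the reported AP values. The variant names are labels chosen for this example. The Inc. column matches the gain divided by the improved score; dividing by the baseline score instead gives slightly larger figures, which the second helper reports for comparison.

```python
# AP values copied from Table 1; the baseline uses the distribution loss only.
BASELINE = {"AP@0.5": 0.609, "AP@0.7": 0.496}
VARIANTS = {
    "dis+stru": {"AP@0.5": 0.655, "AP@0.7": 0.532},
    "dis+pragma": {"AP@0.5": 0.671, "AP@0.7": 0.538},
    "dis+stru+pragma": {"AP@0.5": 0.711, "AP@0.7": 0.566},
}

def gain_over_new(base, new):
    """Gain as a percentage of the improved score (matches the Inc. column)."""
    return round(100 * (new - base) / new, 1)

def gain_over_base(base, new):
    """Gain as a percentage of the baseline score."""
    return round(100 * (new - base) / base, 1)

def summarize():
    """Return {variant: {metric: (gain_over_new, gain_over_base)}}."""
    return {
        name: {
            metric: (
                gain_over_new(BASELINE[metric], score),
                gain_over_base(BASELINE[metric], score),
            )
            for metric, score in scores.items()
        }
        for name, scores in VARIANTS.items()
    }

if __name__ == "__main__":
    for name, gains in summarize().items():
        print(name, gains)
```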

The design of the multi-dimensional alignment loss stems from the following analysis. Aligning a local representation to the common representation essentially means transforming single-modal features into multi-modal features, and a conventional distribution consistency loss, which only constrains feature distributions, cannot achieve this adequately. The reason is that the common representation is negotiated among multiple types of agents and contains multi-modal information, while the local representation is encoded by the local encoder from the agent's own observations and contains only single-modal information. Take the negotiation between m1 and m2 as an example: m1 and m2 use LiDAR and camera sensors respectively, so each local representation contains only the single-modal information of its sensor, whereas the common representation negotiated from m1 and m2 fuses camera and LiDAR information. The camera and LiDAR modalities differ inherently in data format, spatial structure, and pragmatic information, and these differences carry over to the gap between single-modal (camera or LiDAR) features and fused-modal features. Aligning such cross-modal features therefore requires multi-dimensional consistency constraints covering the distribution, the structural relationships, and the pragmatic information of the features.
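
To make the three constraints concrete, here is a toy sketch of a loss with a distribution term, a structural term, and a pragmatic term. Every concrete choice in it is an assumption for illustration and not the formulation in the paper: mean squared error stands in for the distribution term, pairwise cosine similarities stand in for structural relationships, and a tiny linear-plus-sigmoid head (with hypothetical parameters `head_w`, `head_b`) stands in for the occupancy head.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine(u, v):
    """Cosine similarity with a small epsilon for stability."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v + 1e-8)

def similarity_matrix(feats):
    """Flattened matrix of pairwise cosine similarities between locations."""
    return [cosine(u, v) for u in feats for v in feats]

def occupancy(feats, head_w, head_b):
    """Stand-in occupancy head: linear map plus sigmoid per location."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(head_w, f)) + head_b)))
        for f in feats
    ]

def alignment_loss(local, common, head_w, head_b):
    """Sum of the three assumed terms; features are lists of per-location vectors."""
    flat_local = [x for f in local for x in f]
    flat_common = [x for f in common for x in f]
    l_dis = mse(flat_local, flat_common)                                # distribution
    l_stru = mse(similarity_matrix(local), similarity_matrix(common))  # structure
    l_prag = mse(occupancy(local, head_w, head_b),
                 occupancy(common, head_w, head_b))                     # pragmatics
    return l_dis + l_stru + l_prag
```

In this toy form the loss is zero when the local and common features coincide, and each term penalizes a different kind of mismatch between them.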

In the revised version, we will refine the descriptions of the multi-dimensional alignment loss and its ablation studies (Lines 189-195 and 293-301) to better clarify their functions.

W1: This paper tackles a "closed-world" problem where the set of agent types is predefined.

W2: The core design of NegoCollab is inherently non-extensible.

Q3: The paper's biggest blind spot is scalability. What's the plan when a new agent type, one that wasn't part of the initial "negotiation," wants to join the collaboration? Do you have to retrain everything from scratch? If so, the "low training cost" claim from the abstract seems misleading, as the real cost is in re-integration. If you can clarify this, I would be pleased to improve the rating.

A new agent that joins dynamically can participate in collaboration by training a plug-and-play sender-receiver pair to align with the negotiated common representation. Training updates only the parameters of the new agent's sender and receiver and does not require retraining the negotiator network, which provides an efficient solution for heterogeneous collaboration in "open-world" scenarios at minimal training cost.

We explain the training process for a newly joining agent in Section 4.1. Specifically, in the first stage, the common representation is generated by the negotiator and the encoders of the agents in the initial alliance. The cyclic consistency loss and the multi-dimensional alignment loss are calculated in the same way as described in Section 3.2.2 and Section 3.2.3. In the second stage, the collaborative task loss is calculated as the collaborative detection loss between the new agent and the agents in the initial alliance. We will add a diagram of the training process for a new agent to the appendix of the revised version.
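
The key property of this procedure, that only the new agent's modules are updated while the common representation stays fixed, can be shown with a toy sketch. The scalar-gain sender, the squared-error objective, and the function name are simplifying assumptions for illustration; the actual sender and receiver are networks trained with the losses described above.

```python
def train_sender(local_feats, common_feats, lr=0.1, steps=200):
    """Fit a scalar gain w so that w * local approximates the fixed common target.

    common_feats are treated as constants: they stand in for the output of the
    already trained (frozen) negotiator, so nothing upstream is updated.
    """
    w = 0.0  # the only trainable parameter in this toy example
    n = len(local_feats)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * (w * x - c) * x
                   for x, c in zip(local_feats, common_feats)) / n
        w -= lr * grad
    return w
```

For example, with local features `[1.0, 2.0]` and fixed common targets `[2.0, 4.0]`, gradient descent drives `w` toward 2 without touching the targets.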

Notably, since new agents do not participate in the negotiation process, aligning them to the common representation inevitably incurs greater information loss. This is indeed a limitation of NegoCollab in "open-world" collaboration scenarios. To mitigate it, we investigate NegoCollab's collaborative performance when the common representation is negotiated by different initial alliances. Our analysis concludes that selecting initial-alliance agents with the best-performing perception encoders, and including more diverse agent types in the initial alliance, can enhance the generalization capability of the common representation and thereby improve collaborative performance in "open-world" settings. The relevant results are presented in Appendix B.1, with selected findings shown below:

			AP@0.5					AP@0.7
Initial Alliance	m1m2	m3m4	m1m3	m2m4	All	m1m2	m3m4	m1m3	m2m4	All
Protocol	0.792	0.785	0.772	0.499	0.676	0.615	0.564	0.710	0.289	0.457
m1m3	0.869	0.832	0.951	0.484	0.830	0.761	0.720	0.904	0.280	0.718
m1m2	0.872	0.770	0.911	0.512	0.745	0.759	0.578	0.805	0.319	0.555
m3m4	0.727	0.840	0.914	0.506	0.737	0.550	0.726	0.840	0.289	0.562

Table 2. Performance comparison when negotiating common representations from different initial alliances.