Thank you for your valuable feedback and for highlighting the areas that require clarification in our paper. We appreciate your insights, which allow us to improve our manuscript. Below, We summarize each of your questions and provide detailed responses.

Q1: Clarification on how graph connectivity is learned and visual similarity is utilized.

A: Thank you for your question. The node features in our approach only contain textual characteristics. The visual information, is provided by the segmentation model during training. Specifically, the graph neural network outputs embeddings and adjacency matrix that are utilized in the segmentation network's inference process. For the image training samples, the segmentation network encodes visual embedding features. The GNN makes predictions by classifying and mapping these visual features to labels. The loss function is computed using formulas (4-6), and the loss value is backpropagated through the GNN model to update the parameters.

Q2: Concerns about using an older architecture and its competitive performance.

A: We appreciate your observations regarding the architecture used in our studies. To ensure a fair comparison, we selected this architecture based on the configuration outlined in Mseg [23]. Training on seven datasets is indeed time-consuming (approximately one week on four 80G A100 GPUs), which limits our capability to evaluate more competitive architectures at this stage. However, our results (Table 11) from five datasets show potential for improvement, as we've enhanced Cityscapes' mIoU to 82.2%. This indicates that there is room for further optimization in our method. We will explore more competitive base architectures, such as transformer-based models, in our future research.

Q3: Performance evaluation relative to the MSeg approach on indoor/outdoor datasets.

A: Thank you for raising this question. It's worth noting that MSeg combines categories and overlooks some difficult classes, which simplifies the learning task. Many of these difficult classes correspond to IoU values lower than the overall mean IoU (e.g., for ADE, mIoU for omitted/merged classes is 37.5 vs. all categories at 42.0; for COCO, it's 37.1 vs. 46.7). MSeg's merged categories also take into account label alignment across different datasets, which eases the learning process and thus may not represent a fair comparison to our method. We have provided examples of merged categories and their IoUs to help illustrate this point:

MSeg Label	ADE Label	IoU
table	coffee table	55.9
	table	50.6
terrain	land	0.4
	earth	31.1
	field	22.1
	grass	64.7
	sand	36.4
	dirt track	6.4
mountain_hill	mountain	56.2
	hill	8.9
car	car	84.3
	van	35.9
stairs	stairs	28.5
	stairway	28.9
railing_banister	railing	26.4
	bannister	10.0
unlabeled	computer	66.3
	signboard	36.3
	monitor	45.5
	crt screen	5.0
	screen	63.9
	canopy	8.8
	plant	44.0
	bar	23.8
	step	11.7

MSeg Label	COCO Label	IoU
building	house	30.0
	roof	52.7
	building-other-merged	46.6
floor	floor-wood	36.0
	floor-other-merged	47.8
table	table-merged	38.3
	dining table	75.8
terrain	grass-merged	37.6
	sand	14.5
	dirt-merged	22.5
wall	wall-brick	24.9
	wall-stone	54.9
	wall-tile	30.3
	wall-wood	16.0
	wall-other-merged	43.1
vegetation	flower	10.9
	tree-merged	49.5

Q4: Potential issues with hallucinated label descriptions through ChatGPT.

A: Thank you for your invaluable input regarding the potential inaccuracies in generated label descriptions. We agree that using official label descriptions from datasets like Cityscapes greatly enhances the reliability of our methodology. Given that some datasets may lack corresponding label descriptions or have inconsistent language styles, this can pose challenges for the model. Therefore, in the future, we plan to use the label descriptions provided by official datasets as prompts, employing GPT models to generate more accurate and stylistically consistent label descriptions to help the model learn better. This adjustment may help improve the model's learning process by reducing the risks associated with hallucinated descriptions.

Limitation: Discussion on societal impacts and scalability challenges.

A: Thank you for the addition to the limitations section. Indeed, compared to unsupervised or weakly supervised methods, we still require complete annotated data. However, our approach leverages the available annotation information in existing datasets to reduce the reliance on annotated data. Errors in the fully automated construction of a unified label space do present some safety risks for autonomous driving tasks. Therefore, we also recommend introducing a manual review mechanism to address these issues. Current automatic label generation methods may introduce significant safety risks, so we recommend incorporating a manual review mechanism for generated labels to ensure accuracy and mitigate these concerns.

Regarding scalability, we concur that it's not necessary to load all datasets on a single node. Instead, as long as the computations for weight gradient updates include representations from all datasets, we can distribute this across multiple nodes, allowing more efficient processing. This approach can help facilitate scaling while avoiding bottlenecks in performance. We will include additional descriptions of these limitations in the final version of the paper, particularly focusing on safety considerations for autonomous driving scenarios.

We hope that these responses provide the clarity you were seeking and address your concerns adequately. We sincerely appreciate your constructive feedback, which is invaluable in improving our manuscript.