I appreciate the authors' response, but I still have some remaining concerns:

The authors claim that a 5-layer MLP, trained per scene, can compress the embeddings of a large VLM to a six-dimensional representation without any loss of information. This assertion appears counterintuitive. Open-world queries typically have high-dimensional embeddings, yet the proposed 5-layer MLP is trained on a very limited subset of data—the features of objects within a single scene. While it is true that the text and image embeddings are well-aligned before compression, the encoding process performed by the 5-layer MLP could disrupt these correlations. For instance, prior to compression, the text embeddings for 'bird' and 'chair' might be significantly distant from each other, reflecting their semantic disparity. However, after compression, these embeddings might become undesirably close, undermining their original relationships. The authors need to provide evidence that such distortions do not occur; otherwise, the method may fail to effectively handle open-vocabulary queries.
The authors state that text queries are provided at test time. However, line 209 mentions that the method requires a set of text queries 𝑇, which is even used during training to supervise the autoencoder, as indicated in Eq. 1. Furthermore, line 304, 305 reveal that 𝑇 is utilized in scene optimization to compute the cross-entropy loss. This clearly demonstrates that 𝑇 is employed during training. Could the authors clarify this apparent contradiction between the claim that text queries are provided at test time and their evident use during training?
The authors claim that the proposed method is capable of performing multi-scale segmentation. However, there are no comparisons or experiments provided to validate this claim. Does the method require retraining to perform multi-scale segmentation? Additionally, the proposed design does not appear to incorporate hierarchical semantics. Could the authors elaborate on the exact technical process through which multi-scale segmentation is achieved?

In summary, I find that the authors' response does not adequately address my concerns. Unless the authors provide additional information or clarification to resolve these issues, I have decided to lower my score.