Topo-Field: Topometric mapping with Brain-inspired Hierarchical Layout-Object-Position Fields
Abstract
Reviews and Discussion
This paper targets an interesting problem of topometric mapping but is not ready for publication. The quality is poor regarding writing, organization, annotations, and experimental setups.
Strengths
- The idea of constructing a topometric map using an implicit neural field is interesting
Weaknesses
The writing is far from satisfactory. The corresponding authors should revise the manuscript beyond the abstract and introduction.
- Though the paper proposes a Topo-Field to integrate layout-object-position, this representation is not clearly presented in Sec. 3. The definition of the topometric map (or the graph structure in Eq. 3) is vague and hard to follow, and the generation of the graph from the dense field F (L199) is unclear. Note that since the implicit neural field F is similar to previous methods with distilled feature fields, the novelty and contribution of the proposed method are unclear.
- The hierarchical structure of point-object-room is common in scene graph generation. However, no relevant work (e.g., CLIO, HOV-SG) is referred to in the related work section or the experiments section.
- Multiple annotations are not formally defined in the paper (e.g., the functions ). The training stage in Sec. 4.4 should be carefully revised to make it clear.
- The experimental setups lack clear demonstration, and comparisons against recent methods are missing.
Questions
Given the issues raised above, the authors should revise the paper accordingly.
Thank you for the valuable advice. We apologize for the ambiguity; proofreading and clarification of the formulations in the methodology and settings have been applied, and the revised sections are highlighted in red in the rebuttal PDF.
For the ambiguity problems, the formulations in Sections 3 and 4 are clarified; the topo-mapping pipeline and matching method are explained in detail in Section 4.3; and more training details of the neural implicit representation are added in Section 4.4.
The novelty and the contribution of the proposed method are unclear.
A cognitive map is a mental representation used by an individual to order their personal store of information about the spatial environment and the relationships of its component parts (Tolman, 1948, Psychological Review). The cognitive map is embodied by place cells (O'Keefe et al., 1971, Brain Research), and the population code in POR is more strongly tuned to spatial layout than to content (LaChance et al., 2019, Science). Although encoding the layout and contents to form a cognitive map seems a straightforward idea, it has been more than 70 years since the original concept was raised.
We mimic the neural mechanisms of spatial representation in three key aspects: 1) The cognitive map corresponds to a topometric map, which uses graph-like representations to encode relationships among its components, e.g., layouts and objects. 2) The population of place cells is analogous to a neural implicit representation with position encoding, enabling location-specific responses. 3) POR, which prioritizes spatial layouts over content, aligns with our spatial layout encoding of connected regions.
We believe this work takes a step forward in mimicking and applying mechanisms of spatial cognition to robotics. Our method describes a clear pipeline with details for reproducibility, and experiments show the ability to manage layout-related tasks and the effectiveness of the topometric map.
No relevant work (e.g., CLIO, HOV-SG) is referred to in the related work section or the experiments section.
As suggested, the two mentioned very recent works (Maggio et al., Oct 2024, RA-L; Werby et al., July 2024, RSS) have been added to the related work and discussed. CLIO (Maggio et al., 2024) built a task-driven scene graph inspired by the Information Bottleneck principle to form task-relevant clusters of primitives. Meanwhile, HOV-SG (Werby et al., 2024) proposed a hierarchical scene understanding pipeline, clustering a feature point cloud of zero-shot embeddings in a fusion scheme and realizing the mapping incrementally. Unlike these incremental-mapping and clustering-based graph construction methods, we propose to build the topometric map by querying the trained neural field, which serves as a knowledge-like memory base; its nodes and edges carry attributes representing object and layout information explicitly learned when training the neural encoding.
I appreciate the authors' efforts in revising the manuscript and addressing the issues raised. However, it is unfortunate that they continue to emphasize the significance of their 'cognitive realization.' This bio-inspired perspective could be briefly mentioned as the motivation behind the paper, rather than being regarded as a key contribution. The design merely shares a philosophy with the cognitive map, but it can hardly be considered a 'cognitive realization.' The map structures themselves are quite common in the vision community. The information transfer and interactions among the maps have nothing to do with neuroscience. Simply identifying three related concepts separately does not justify calling the proposed method a 'cognitive realization.' I strongly recommend that the authors focus on the specific advantages of their proposed method and the particular designs, as the hybrid map with semantic-aware topology is not novel. Take HOV-SG and CLIO for instance:
- Topometric map: HOV-SG and CLIO maintain Voronoi graphs that explicitly model the topology of free space. These methods also maintain scene graphs to encode objects and their relationships, which represent the realization of a 'topometric map.'
- Place cells: HOV-SG and CLIO maintain queryable features on the graph nodes, with dense point clouds or triangle meshes for each instance. By constructing a kd-tree of the point cloud/vertices, we can achieve the mapping function F (Eq. 1) to retrieve features for any given coordinate x through nearest neighbor search (see the sketch after this list). I don't think a hash-encoded MLP is more 'cognitive', as both can achieve the same function.
- POR: If the proposed method, with annotated room-type information, can be considered a realization of POR, then HOV-SG and CLIO already maintain explicit nodes for room types within their graph structure (without human annotations).
Therefore, from a high-level perspective, HOV-SG and CLIO effectively achieve what the proposed method claims.
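For concreteness, a minimal sketch of the nearest-neighbor lookup described in the place-cells bullet above, assuming an explicit point cloud `points` with per-point features `feats` (illustrative names, not either paper's actual code):

```python
import numpy as np
from scipy.spatial import cKDTree

class ExplicitFeatureField:
    """Approximate the mapping F: coordinate -> feature via nearest-neighbor
    search over an explicit point cloud with per-point features."""
    def __init__(self, points, feats):
        self.tree = cKDTree(points)   # points: (N, 3) point cloud / mesh vertices
        self.feats = feats            # feats:  (N, D) per-point features

    def __call__(self, x):
        _, idx = self.tree.query(x)   # index of the stored point nearest to x
        return self.feats[idx]

field = ExplicitFeatureField(np.random.rand(1000, 3), np.random.rand(1000, 512))
feature = field(np.array([0.5, 0.5, 0.5]))  # feature for an arbitrary coordinate
```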
Besides the lack of novelty, there is a significantly misleading argument regarding the learned 'NeRF'. NeRF is an abbreviation for neural radiance field; the field should contain color information and is usually optimized through differentiable rendering. The term 'rendered feature' (response to Reviewer ms6U) is also adopted by the authors. However, as illustrated in Fig. 3 and according to Eq. 1, there is no such radiance field (no color and density channels, no differentiable rendering process) at all. The method simply supervises the mapping function F between coordinates and features (L268, 294, 295) given the projected point cloud (coordinates) and the pixel-wise feature pairs.
We double-checked the revised paper; as mentioned, we employ an implicit neural scene representation in the style of Instant-NGP (Müller et al., 2022) rather than a NeRF-based network. The mentioned implicit neural representation, or feature field, is not the same as a neural radiance field.
Thank you for your effort in making this paper better. Although these two works (HOV-SG and CLIO) are very recent, one published in October and one in July, we discussed the differences between our work and theirs in Section 2.2. They construct the map via feature point cloud clustering and incremental mapping, while we learn a neural implicit representation and construct the map by querying it. We are also the first to explain the theoretical basis and neuroscience references for the hierarchical encoding of spatial layouts and contents in the form of objects and connected regions.
The differences between our approach and theirs are listed as follows to show our novelty: 1) Different map construction approach: CLIO (Maggio et al., 2024) built a task-driven scene graph forming task-relevant clusters of primitives; HOV-SG (Werby et al., 2024) utilized feature point cloud clustering and managed the mapping incrementally. We learn the spatial embeddings with an implicit neural representation and form the topometric graph by querying the learned representation (see the sketch below). 2) Different graph structure: Their Voronoi graph is built from exploration-path-guided point cloud embeddings and a clustering process. We query the learned representation at two levels (object or region) separately, with fewer vertices; each vertex clearly represents exactly one object or region with its attributes. 3) Different scene representation approach: CLIO and HOV-SG form a point cloud with features to explicitly represent the scene, while we learn an implicit function mapping 3D positions to embeddings. This means our approach can interpolate and predict the embeddings of unseen areas and of places with sparse or no point cloud.
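To make difference 1) concrete, a minimal sketch of query-based node construction, assuming a trained field `field(xyz) -> embeddings`, a text encoder `text_embed`, sampled positions `grid`, and a threshold `thresh` (all names are hypothetical stand-ins, not the paper's code):

```python
import numpy as np

def build_object_nodes(field, text_embed, labels, grid, thresh=0.6):
    """Query a trained implicit field at sampled 3D positions and group
    positions matched to each text label into one graph node per object."""
    emb = field(grid)                                    # (N, D) queried embeddings
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    nodes = []
    for label in labels:
        t = text_embed(label)
        t = t / np.linalg.norm(t)
        hits = grid[emb @ t > thresh]                    # positions matching the label
        if len(hits):
            nodes.append({"label": label,
                          "center": hits.mean(axis=0),
                          "bbox": (hits.min(axis=0), hits.max(axis=0))})
    return nodes
```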
As suggested, we will update our contributions as follows:
- We develop a brain-inspired Topo-Field, which combines detailed neural scene representation with high-level, efficient topometric mapping for hierarchical robotic scene understanding and navigable path planning. Various quantitative and qualitative experiments on real-world datasets are conducted, showing high accuracy and low error in position-attribute inference and multi-modal localization tasks. Examples of topometric map construction and path planning are also provided.
- We explain the theoretical basis and neuroscience references for the hierarchical encoding of spatial layouts and contents in the form of objects and connected regions, following the spatial mechanism of the cognitive map with the POR population and place cells.
- We propose to learn a Layout-Object-Position-associated implicit neural representation with target features from separately encoded object instances and background contexts, serving as objects and layouts. The process is explicitly supervised by an LFM-powered strategy with little human labor.
- We propose a topometric map construction pipeline that queries the learned neural representation in a two-stage mapping-and-updating approach, leveraging an LLM to validate the edges conducted among vertices.
I am glad that the authors have presented concrete arguments highlighting their differences against the relevant work. Although both studies appeared on ArXiv before May, I will take HOV-SG as an exemplary case as CLIO was formally accepted after the ICLR submission deadline.
The authors' response validates my earlier argument that, CONCEPTUALLY, these scene-graph-based methods encompass complete forms of topometric maps, place cells, and POR, just as the proposed method does. The primary differences lie in the implementation, and neither the proposed method nor the two relevant works directly replicate the representations found in neuroscience. For further revision of the manuscript, the authors are encouraged to provide solid experimental evidence that the proposed method brings advantages given its distinct design choices. The remaining issues are summarized below:
- Theoretical basis. Please specify the paragraph that provides 'the theoretical basis' of the method, as this is regarded as the key contribution of the paper. In the current form, I can only see abstract evidence that similar behaviors exist in the human brain. Is each module necessary? Do different modules communicate in a manner similar to the proposed method? Is such a design superior to other existing designs in the vision community?
- Differences with relevant works. I want to stress the fact that HOV-SG (and CLIO) has object nodes and region nodes besides the Voronoi graph and the point clouds/meshes. The graph structure of the proposed method does not show any difference in terms of the object-region hierarchy, only missing pieces. The major difference between the proposed method and the two relevant works is the feature-query manner through a hash-encoded network, where the relevant methods maintain the semantic features explicitly on the nodes. The authors are encouraged to provide evidence that querying through a network, instead of maintaining features on sparse nodes, leads to better accuracy or efficiency.
- Interpolate embeddings on unseen areas: Please provide concrete evidence that such interpolation/prediction behavior leads to practical advantages instead of providing noisy and inaccurate semantics.
- Map construction. Please clarify your map construction process. I note the following terms and definitions: 'the learned neural field' F (eq.1, F is formally denoted as the topo-field later at L375), a 'topometric map' G = (V, E) (eq. 3), a 'Topo-Field' with and (L269). As the authors claim that 'we construct the topometric map in a mapping and updating strategy based on the learned Topo-Field F', I wish the authors could clarify how the topometric map is constructed in the MAPPING phase based on the Topo-Field F as I don't find the involvement of F in the paragraph between L377 and L414.
- Navigable path. Please explain how the maintained map representation facilitates 'navigable path planning' (claimed in the first contribution). Note that the Topo-Field only contains the mapping between coordinates and semantic features, and the topometric map only contains sparse nodes and edges without a metric map (as defined in Eq. 4 & 5). How is the A* algorithm (L568) applied to this map representation to generate the 'navigable path'?
There are also numerous issues regarding the experiments.
- The authors first present the region inference results in Sec. 5.1. I don't understand why this evaluation is conducted, as the region is manually annotated.
- Regarding the localization with text queries in Sec. 5.2, the failure cases of other methods (as shown in Fig. 4) consistently demonstrate that the text queries identify correct object semantics but in incorrect rooms. The results do not convincingly demonstrate the superiority of the proposed method, and the comparison is unfair since the room type information requires manual annotation in the paper.
- The paper mentions 10 test scenes (Table 2) but provides comparative results for only four of them (Tables 3 and 4). Please clarify this inconsistency.
- Please specify the scene IDs of Scene 1-10 so that future work can make fair comparisons.
The BRAIN-INSPIRED Topo-Field draws intuitive lessons from biological evidence; the primary differences lie in our perspective, which then leads to different subsequent implementations. We were inspired by evidence that neurons in the postrhinal cortex (POR) exhibit a preference for the spatial layout of local scenes, defined by the geometry of regions, such as a room's boundaries. Based on this, we abstract the spatial representation of regions to align with our spatial layout encoding of connected regions. However, our goal is not to elucidate the neural mechanisms of POR firing patterns but rather to leverage these principles for practical spatial representation.
HOV-SG (and CLIO) and Topo-field share similar intuitive ideas about graph-like representations with nodes for rooms and objects, likely stemming from shared human common sense. Supported by the preference for spatial representations observed in POR, we specifically conceptualize rooms as spatial layouts of local scenes, establishing a one-to-one node correspondence with regions in our topometric map. In contrast, the Voronoi graph in HOV-SG serves primarily to provide traversable areas with more detailed nodes and edges, whereas our topometric map emphasizes integrating spatial layout information and semantic details about rooms and objects. While the underlying ideas may seem similar, the distinct starting points result in significantly different implementation methods.
For the remaining issues:
- We incorporate the spatial layout information into the Topo-Field, supported by the biological evidence that the POR population prefers the spatial layout of the local scene, corresponding to the geometry of regions, i.e., connected rooms in the topometric map.
- Neurons in POR prefer layouts of local scenes, inspiring the one-to-one correspondence of rooms, whereby the map maintains fewer nodes with explicit room and object semantic information.
- Unlike point clouds or meshes, our proposed method can take any 3D position in the map as input and predict its semantic information.
- We build the topometric map by sampling positions to obtain the semantic and metric information of room and object nodes, as shown in Figure 2(c).
- As shown in Eq. 4 & 5, the bounding_box includes the center and extent information of objects and regions, clearly providing the metric for the topometric map, which can be used for A* path planning (see the sketch below).
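For illustration, a minimal sketch of A* over such a graph, assuming hypothetical node centers (e.g., the bounding-box centers from Eq. 4 & 5) and an adjacency list derived from the validated edges (names are illustrative, not the paper's code):

```python
import heapq
import math

def astar(nodes, edges, start, goal):
    """A* over a topometric graph: `nodes` maps id -> 3D center,
    `edges` maps id -> list of neighbor ids. Euclidean distance serves
    as both the edge cost and the admissible heuristic."""
    def dist(a, b):
        return math.dist(nodes[a], nodes[b])
    frontier = [(dist(start, goal), 0.0, start, [start])]
    visited = set()
    while frontier:
        _, g, cur, path = heapq.heappop(frontier)
        if cur == goal:
            return path
        if cur in visited:
            continue
        visited.add(cur)
        for nxt in edges[cur]:
            if nxt not in visited:
                g2 = g + dist(cur, nxt)
                heapq.heappush(frontier, (g2 + dist(nxt, goal), g2, nxt, path + [nxt]))
    return None

nodes = {"living": (0, 0, 0), "hall": (3, 0, 0), "bedroom": (6, 0, 0)}
edges = {"living": ["hall"], "hall": ["living", "bedroom"], "bedroom": ["hall"]}
print(astar(nodes, edges, "living", "bedroom"))  # ['living', 'hall', 'bedroom']
```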
For the issues in the experiments:
- We learn and evaluate the annotated region information to validate that the neural representation is able to establish a relationship between observed image background contexts and region vision-language embeddings. This is realized by mapping back-projected 3D locations to region embeddings.
- It is our contribution to explicitly represent and learn the region layouts. For the other compared methods, this does not mean that they have not considered context information; the open-world embeddings distilled in their feature fields implicitly include objects and contexts. The experiments show our advantage in explicitly constructing the layouts and their relationships with objects and 3D positions.
- We choose 2 large-scale and 2 small-scale scenes as representative scenes from the Matterport3D dataset for query localization, as mentioned in Section 5.1.
- The scene IDs will be included in the paper and provided with our reproducible code as a demo if accepted.
This paper presents a method for training a neural implicit field that utilizes supervisory signals from pre-trained foundation models to capture semantic features. The proposed model is applicable to several critical downstream tasks in robotics, including text/image query localization, semantic navigation, and path planning. Experimental results demonstrate significant improvements in performance metrics, supported by qualitative evidence.
Strengths
The paper addresses a compelling problem in semantic mapping and its applications for enabling robots to navigate real-world environments. The experimental results demonstrate impressive improvements in performance. Additionally, the supplementary materials, such as the code snippets and prompts, enhance the understanding of the proposed method's details and implementation.
Weaknesses
Despite its potential, the system heavily relies on various input types, such as annotated room maps and camera poses, as well as off-the-shelf object detection methods for generating bounding boxes and masks. This dependence poses challenges in real-world applications, where inaccuracies in these inputs can lead to errors. Additionally, the system's reliance on ChatGPT complicates debugging and explanation when errors occur in complex real-world environments.
Encoding semantic information and supervising it with pre-trained features alleviates some annotation burdens; however, this approach is already a common practice in the field of implicit representation for semantic mapping [1][2]. The overall system resembles a large engineering project, making it challenging to distill its theoretical contributions.
[1] V. Tschernezki, I. Laina, D. Larlus, and A. Vedaldi, "Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations," 2022 International Conference on 3D Vision (3DV), Prague, Czech Republic, 2022, pp. 443-453, doi: 10.1109/3DV57658.2022.00056.
[2] S. Zhu et al., "SNI-SLAM: Semantic Neural Implicit SLAM," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Table 4 could benefit from clearer labeling, where Baselines 1-4 are not explicitly defined. A reference to Figure 7 could help.
The authors frequently reference the postrhinal cortex from the biological literature, but the connection to the proposed method is not clearly articulated. Topological mapping is indeed a common computer vision task relevant to navigation.
Questions
The experimental results are impressive when compared to baseline performances; however, it is unclear whether the benchmarks used are newly proposed by the authors or follow existing ones, which raises concerns about the fairness of the evaluation. What are the primary factors driving the significant improvements?
Computation: the paper mentions a large batch size of 12,544. It would be helpful to clarify what specific data is contained within this batch size.
Thank you for recognizing our contribution. After proofreading and clarification of the formulations, the paper includes more details and is easier to understand and reproduce. The revised version is attached in the rebuttal with revised sections highlighted in red.
For the ambiguity problem, Table 4 and Figure 7 are revised, referenced, and described in Section 5.4.
For other problems:
The system heavily relies on various input types, which poses challenges in real-world applications. The system's reliance on ChatGPT complicates debugging and explanation when errors occur in complex real-world environments.
As with NeRF, posed images are needed, as mentioned in Section 4.1; COLMAP (Schönberger & Frahm, 2016) is a widely employed method to provide these poses, the same as in most other reconstruction and scene representation works. Off-the-shelf large foundation models are also widely employed to provide labels without human labor. Besides Matterport3D, which is a real-world dataset, a real-world apartment environment (Zhu et al., 2022) is also employed for evaluation, which demonstrates effectiveness and practicability. For the usage of GPT to help evaluate vertex relationships, additional details have been clarified in Section 4.3.2. As mentioned, GPT is used in our approach to filter out unreasonable relationships and to check vertex relations, mainly deciding problems such as (1) whether a bike located in a bedroom is plausible, and (2) the 3D location relationship of b-box [x1, y1, z1] and b-box [x2, y2, z2]. GPT is well prompted, as shown in the appendix, and this does not cause large errors.
The overall system resembles a large engineering project, making it challenging to distill its theoretical contributions; the connection to the proposed method is not clearly articulated:
A cognitive map is a mental representation used by an individual to order their personal store of information about the spatial environment and the relationships of its component parts (Tolman, 1948, Psychological Review). The cognitive map is embodied by place cells (O'Keefe et al., 1971, Brain Research), and the population code in POR is more strongly tuned to spatial layout than to content (LaChance et al., 2019, Science). Although encoding the layout and contents to form a cognitive map seems a straightforward idea, it has been more than 70 years since the original concept was raised.
We mimic the neural mechanisms of spatial representation in three key aspects: 1) The cognitive map corresponds to a topometric map, which uses graph-like representations to encode relationships among its components, e.g., layouts and objects. 2) The population of place cells is analogous to a neural implicit representation with position encoding, enabling location-specific responses. 3) POR, which prioritizes spatial layouts over content, aligns with our spatial layout encoding of connected regions.
We believe this work takes a step forward in mimicking and applying mechanisms of spatial cognition to robotics. Our method describes a clear pipeline with details for reproducibility, and experiments show the ability to manage layout-related tasks and the effectiveness of the topometric map.
Whether the benchmarks used are newly proposed by the authors or follow existing ones, which raises concerns about the fairness of the evaluation.
Existing scene representation methods either evaluate semantic segmentation results like mIoU (Shafiullah et al., 2022; Huang et al., 2023) or simply evaluate localization accuracy (Kerr et al., 2023). Building on these existing evaluation strategies, and for more detailed quantification, we refine the metric by employing the point-cloud distance between prediction and target, as well as region localization accuracy, in line with our proposal. For fairness, our method and the compared ones share the same input, metrics, and evaluation strategy.
Computation: the paper mentions a large batch size of 12,544. It would be helpful to clarify what specific data is contained within this batch size.
As mentioned in (Shafiullah et al., 2022), a larger batch size helps CLIP-style (Radford et al., 2021) methods reduce the variance of the contrastive loss. As a reliable baseline, CLIP-Field (Shafiullah et al., 2022) used a batch size of 12,544 to maximize VRAM usage. For fairness, we keep the same settings for the MHE network in our approach.
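For illustration, a minimal sketch of a CLIP-Field-style batch-level contrastive objective, where each batch entry pairs a 3D point's predicted embedding with its target feature; names and shapes are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pred, target, temperature=0.07):
    """InfoNCE-style loss over a batch of point embeddings.
    pred:   (B, D) embeddings predicted by the field at B sampled 3D points
    target: (B, D) target features (e.g., CLIP) for the same points.
    Each point's own target is its positive; every other batch entry acts
    as a negative, so a larger B yields more negatives per positive."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(pred.shape[0], device=pred.device)
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(128, 512), torch.randn(128, 512))
```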
Thank you to the authors for responding to the questions. However, I feel that some of the issues have not been fully addressed. Please see the details below.
- ChatGPT
Thank you for including the prompts in the Appendix. However, I am still unclear about the exact inputs and demo outputs for the system.
Additionally, the phrase "would not cause big error" is somewhat ambiguous. Could you clarify what potential errors might arise, and how the system can address these to ensure robust deployment in real-world, open-world environments?
- Cognitive Map
While I understand that the theory of cognitive maps is well-established, I am unclear about what novel aspects are introduced into this paper from a cognitive perspective, as opposed to existing well-defined knowledge in the topology domain. While the narrative is compelling, I feel there is a weak logical connection between the ideas.
- Contributions and Related Work
Most importantly, the authors have not addressed my primary concern regarding the contribution of this work relative to other papers. Specifically:
"Encoding semantic information and supervising it with pre-trained features alleviates some annotation burdens; however, this approach is already a common practice in the field of implicit representation for semantic mapping [1][2]."
- Evaluation
I understand the evaluation metrics presented. However, my key concern lies with the dataset splits. Specifically, I would like to know if the training and evaluation data used are consistent with those commonly employed in previously published work in this area.
Thank you for the active discussion. For the remaining problems:
ChatGPT: As declared in Section 4.3.2, we leverage an LLM to help validate the construction of topometric graph edges. The input is two JSON files including region vertices and object vertices, whose attributes are declared in Section 4.3.2, with samples given in Appendix A.6. With the input and the listed prompts, the LLM is supposed to output a new JSON file including the graph edges among vertices, where the edge attributes are declared in Section 4.3.2 and samples are given in Appendix A.6.
As for the phrase "would not cause big error", we mean that in our practice we have not found obvious errors. One shortcoming may arise as follows: the position relation of regions is defined as one of 1) a is to the east of b, 2) a is to the west of b, 3) a is to the north of b, 4) a is to the south of b. In fact, a room could be located to the southeast of another room, and the LLM may decide the relationship to be 1) a is to the east of b. As mentioned, "the GPT is used in our approach to filter out unreasonable relationships and check vertices relations which mainly decide problems like (1) whether a bike located in a bedroom is possible (2) the 3D location relationship of b-box[x1,y1,z1] and b-box[x2,y2,z2]." The LLM has the common knowledge and ability to deal with these easy problems.
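As a rough illustration of this validation step, the sketch below assumes the OpenAI chat API and a stand-in prompt; the JSON schema and wording are hypothetical placeholders for the actual prompts in Appendix A.6:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def validate_edges(region_vertices, object_vertices):
    """Ask an LLM to propose and validate graph edges between vertices.
    The schema and prompt here are illustrative, not the paper's own."""
    prompt = (
        "Given region vertices and object vertices of an indoor scene, "
        "output a JSON list of edges, each as {'from': id, 'to': id, "
        "'relation': str}. Reject implausible relations (e.g., a bike "
        "inside a bedroom) and infer spatial relations from the 3D "
        "bounding boxes.\n"
        f"Regions: {json.dumps(region_vertices)}\n"
        f"Objects: {json.dumps(object_vertices)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON, as instructed by the prompt.
    return json.loads(resp.choices[0].message.content)
```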
Cognitive Map: As discussed in Section 2.2, traditional topological maps did not include semantics (Zhang, 2015; Zhang et al., 2015; Garrote et al., 2018; Oleynikova et al., 2018; Badino et al., 2012). ConceptGraphs (Gu et al., 2024) takes a step forward, utilizing LFMs to model the object structure with a topological map, which introduces open-world semantics. CLIO (Maggio et al., 2024) and HOV-SG (Werby et al., 2024) propose feature point cloud clustering and incremental mapping, which is not a cognition-inspired approach.
In contrast, building on the mental representation of the cognitive map, LaChance et al. (2019) discovered that the population code in POR is more strongly tuned to spatial layout than to content. More recently, Zeng et al. (2022) proposed that geometric representations of local layouts relative to environmental centers are needed to form a high-level cognitive map, from egocentric perception to allocentric understanding. We propose to encode spatial layout and contents with a layout-object-position field. By querying the neural representation built from egocentric perceptions, we form an allocentric, high-level, graph-like topometric map representing layouts as connected regions relative to their centers.
Contributions and Related Work: In short, most semantic feature fields learned in existing methods (Zhi et al., 2021; Fan et al., 2022; Xie et al., 2021; Shafiullah et al., 2022; Huang et al., 2023; Kerr et al., 2023) focus on object semantics but do not include layout-level features. Works like RegionPLC (Yang et al., 2023) considered region information by fusing multi-modal features, but no explicit representation of layout features is learned. The discussion has been included in Sections 2.1 and 2.2 in more detail.
Evaluation: The dataset split used for our localization evaluation follows a general setting of about 4:1 to 5:1, like most learning-based localization works (Kendall et al., 2016).
This article proposes a novel approach for encoding scene information into a topometric map to improve localisation and planning. The proposed approach is based on a Layout-Object-Position (LOP) representation: layout information comes from knowledge of the environment's rooms; object information comes from semantic segmentation (Detic) and a joint encoding of the segmented object patch using CLIP and of the object-region labels using Sentence-BERT; finally, position information is produced by a 3D reconstruction of the scene using Multi-scale Hash Encoding (MHE). This information is combined into a single topometric map coined Topo-Field. The proposed method is evaluated on the inference of position attributes and on localisation, and appears to clearly outperform the presented baselines.
Strengths
- The combination of structural and semantic information in a way that is efficient for robotic system to query and plan on is a critical problem for robotics.
- The approach seems to perform very well on the evaluation, clearly outperforming the presented baselines on those tasks.
- The proposed approach is also reasonable in computational terms, as all experiments were performed on a single GPU (no information is given on training time though).
Weaknesses
- The description of the approach is lacking specifics, and the reader has to infer the architecture and information flow from the provided diagrams rather than formal description in mathematical and algorithmic terms.
- It is not fully clear how the partitioning of the environment (location) into rooms is performed and how well it would generalise to new environments.
- The motivation of the work from neuroscience is interesting, but remains very vague. Little discussion is provided on how well the proposed approach may model the neural structures it claims to be inspired by.
- The performance is very good compared to the discussed baselines, but it would seem that the proposed approach also benefits from significantly more task-specific information for those tasks (i.e., the room information is provided directly). This is not a critical issue in my view, but it would be good to discuss the limitations of the presented baselines and the issue of fairness of comparison some more.
- I note that the reference for Reimers & Gurevych should probably cite the published version of the article rather than the pre-print.
Questions
- In line 235: what are C and S? I assume they are the outputs of CLIP and Sentence-BERT? How are the regions r_p defined?
- In line 239: what is m in this equation?
- In page 5, line 241: It would seem that the partitioning of the space requires human labeling? If that is the case, it is a significant limitation of the approach.
- In line 242: Could you clarify the sentence "the predicted implicit representation outputs are targeted to match the features from the pre-trained models separately", what it means in practice or how this is achieved. I assume this is what is described in 4.2, but it would be good to make it unambiguous if that is the case, as F is not referenced in that section.
- Could you make the description in Section 4.2 more specific and formal? The only description of the inputs/outputs and process is via the diagram in Figure 1; it would be good to have a proper formal description of the process, a description of the architecture, the format/dimensionality of inputs and outputs for each component, and a formal algorithm.
- In line 254: Could you provide a more in-depth argument for using MHE? The computational cost of standard NeRFs is well known, but is MHE the only possible solution? How does it compare with other fast approaches discussed in the literature, like, for example, Gaussian Splatting?
- In line 258: Could you describe the mapping in more formal terms? Fig. 2 only provides a schematic description of the process.
- In line 268: How is the similarity between E_pi and {C_R, S_R} calculated? It would be good to have a formal equation for this operation.
Details of Ethics Concerns
None
Thank you for recognizing our contribution and for the careful reading to make this paper better. We apologize for the ambiguity; proofreading and clarification of the formulations in the methodology have been applied, and the revised sections are highlighted in red in the rebuttal PDF.
For the ambiguity problems, the environment partitioning detail is described in Section 4.1; a discussion of the proposed approach in relation to the bio-inspired theory is added to the Introduction and contributions; the citation error is fixed; the formulations in Sections 4.1, 4.2, 4.3, and 4.4 are checked and clarified; the neural encoding, including MHE and the separate heads, is declared in Section 4.2; and the topo-mapping pipeline and matching method are explained in detail in Section 4.3.
As for other questions:
How well the proposed approach may model the neural structures it claims to be inspired by:
A cognitive map is a mental representation used by an individual to order their personal store of information about the spatial environment and the relationships of its component parts (Tolman, 1948, Psychological Review). The cognitive map is embodied by place cells (O'Keefe et al., 1971, Brain Research), and the population code in POR is more strongly tuned to spatial layout than to content (LaChance et al., 2019, Science). Although encoding the layout and contents to form a cognitive map seems a straightforward idea, it has been more than 70 years since the original concept was raised.
We mimic the neural mechanisms of spatial representation in three key aspects: 1) The cognitive map corresponds to a topometric map, which uses graph-like representations to encode relationships among its components, e.g., layouts and objects. 2) The population of place cells is analogous to a neural implicit representation with position encoding, enabling location-specific responses. 3) POR, which prioritizes spatial layouts over content, aligns with our spatial layout encoding of connected regions.
We believe this work takes a step forward in mimicking and applying mechanisms of spatial cognition to robotics. Our method describes a clear pipeline with details for reproducibility, and experiments show the ability to manage layout-related tasks and the effectiveness of the topometric map.
It would seem that the proposed approach also benefits from significantly more task-specific information for those tasks:
For fairness, the same input, metrics, and evaluation strategy are employed for our method and all compared ones. The better performance therefore comes from our method explicitly modeling the layout structure and object information, together with their hierarchical integration, powered by the whole pipeline mentioned before.
It would seem that the partitioning of the space requires human labeling? If that is the case, it is a significant limitation of the approach.
We agree that labeling the scene regions requires human labor. However, partitioning buildings in fact needs little of it: in most human-made buildings, spatial layouts are readily divided by straight walls. This is clarified in Section 4.1. Layout information is available in datasets like Matterport3D; if not provided, the region distribution can be annotated with little human labor. In our practice, region annotation of a house with 8 rooms takes only about 3 minutes: we draw lines from a top-down view along the walls to form a rule that separates (x, y) coordinates, assigning 3D points to different regions (a sketch of this assignment follows below).
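A minimal sketch of such a rule, assuming hypothetical top-down room polygons traced along walls (names and coordinates are illustrative, not from the paper):

```python
import numpy as np
from matplotlib.path import Path

# Hypothetical top-down room polygons traced along walls.
regions = {
    "kitchen": Path([(0, 0), (4, 0), (4, 3), (0, 3)]),
    "bedroom": Path([(4, 0), (8, 0), (8, 3), (4, 3)]),
}

def assign_regions(points_xyz):
    """Label each 3D point by the top-down polygon its (x, y) falls in."""
    labels = np.full(len(points_xyz), "unknown", dtype=object)
    xy = points_xyz[:, :2]
    for name, poly in regions.items():
        labels[poly.contains_points(xy)] = name
    return labels

pts = np.array([[1.0, 1.0, 0.5], [6.0, 2.0, 1.2]])
print(assign_regions(pts))  # ['kitchen' 'bedroom']
```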
Is MHE the only possible solution? How does it compare with other fast approaches discussed in the literature, like, for example, Gaussian Splatting:
Surely MHE is not the only possible solution; as you mention, other methods such as Plenoxels, octrees, or feature grids could also serve. Gaussian Splatting is a popular explicit scene representation of late; however, we believe NeRF-style encoding, being an implicit way to store information, is closer to a neural, bio-inspired encoding. There is no research proving whether a NeRF-style implicit strategy or a GS-style explicit one is better. Further comparison could be good research but will not be included in this paper.
Thanks for your response to my earlier questions.
I don't think the answer really addresses my question concerning the additional information used by the authors' method: the other baselines do not have access to the layout information and are therefore at an explicit disadvantage. This seems to be confirmed by the fact that all of the model variations in the ablation study outperform the other baselines by a significant margin (Table 4). It would be useful to devise an additional evaluation protocol that disentangles the performance increase due to the simplification of the problem by the added layout information from the increase due to the proposed architecture.
The motivation of the model from the evidence of places cells in POR is just too vague in my opinion. The paper would need some evaluation demonstrating that the proposed model and encoding are similar to the population coding observed in POR. Similarly, the statement that NeRF is more biologically plausible as an implicit coding would require a more rigorous argumentation based on the known properties from neuroscience evidence.
Thank you for your insightful comments and suggestions.
Regarding the motivation of our model, we were inspired by evidence that neurons in the postrhinal cortex (POR) exhibit a preference for the spatial layout of local scenes, which is determined by the geometry of regions, such as a room's boundaries. Based on this neurobiological evidence, we abstracted the spatial representation of regions to align with our spatial layout encoding of connected regions. This encoding aims to capture the spatial structure in a way that is consistent with the principles observed in POR.
For the integration of layout information in our method, the addition to the Topo-Field is motivated by this brain-inspired approach, incorporating the spatial layouts connected by regions. To evaluate the impact of this addition, we conducted comparisons with and without the layout information, as shown in our ablation studies. These results demonstrate the contribution of the layout information to the model's performance. As for architectural improvements, the ablations from Baseline 1 to Topo-Field already cover them, as shown in Figure 7 and Table 4.
The paper introduces Topo-Field, a framework designed to enhance mobile robot navigation by integrating detailed semantic information about layouts, objects, and their positions (LOP) into a neural field representation. Interestingly, this structure is inspired by the role of postrhinal cortex neurons in encoding spatial layout. By querying a learned NeRF, Topo-Field constructs a semantically rich yet computationally efficient topometric map for hierarchical robotic scene understanding. Experimental results demonstrate its effectiveness in tasks like position inference, localization, and planning, bridging the gap between detailed scene understanding and efficient robotic navigation.
Strengths
- The authors tackle the problem of hierarchical robotic scene understanding, which is an interesting and important topic
- The proposed LOP is bio-inspired; to me this concept seems interesting.
Weaknesses
Unclear descriptions of target feature processing in Sec 4.1
- How do you know if a 3D point belongs to the object, or the background? Do you use the GT annotations from the dataset? (the Matterport3D you show has that information I believe?)
- For the background features, you will get only a single feature for each image. How do you fuse those features from different views?
- Also, wouldn't it make more sense to take per-pixel CLIP features from models like LSeg/OpenSeg and fuse that information?
Unclear descriptions of neural scene encoding in Sec 4.2
- Related to the questions above. In this section you mention that there are object-level local features and layout-level region features, and MHE seems to be a good representation for learning such a hierarchy. However, how exactly do you learn these two sets of features under MHE? No details are given.
- To learn MHE or NeRF in general, you need to actually shoot a ray for each pixel and sample along the ray. The final features are the weighted sum of all values along the ray, with volume rendering. How do you make sure your features on the 3D surface point are exactly the feature you render?
Unclear Topometric Mapping in Sec 4.3
- Line 309, what is ? What are the differences to (I know this is the embeddings for region) in Line 304, and in Line 314? You did not specify them before. It is confusing and making it hard to understand
- Figure 3 (b) does not really match with what you write in “localization with text/image query” between Line 306-318. In the figure, all you get are the per-point features, and try to match with query features, omitting many important details in your description.
- “Matching” in Figure 3 is never really discussed. What kind of matching? Do you mean calculating the cosine similarity among the features, and take the one with the highest score?
Text query localization in experiments
- How do you decide the similarity threshold for the bounding box? Do you need to choose a different threshold for each text query? My own experience is that it is not really possible to get a single threshold for every query.
- One more thing: once you have the right threshold, how exactly do you get the bounding boxes out from thresholding?
- What are the “samples” in Table 1?
- How many queries are you considering for each scene, and how do you obtain the GT? Same question applies to Table 3 as well.
Image query localization in experiments
If I understand correctly, you show the heatmap of the query. You claim that "Topo-Field constrains the localization results to a smaller range in the exact region". However, that is not really true to me. If you look at the washbasin in the bathroom, you also have many points highlighted in other regions, like the kitchen, and even some points in the bedroom. In such a case, how can you get such good numbers in Table 3?
Ablation study: How come your ablation in Table 4 only evaluates the region prediction accuracy, which does not even require most parts of your method (objects, the graph you build, etc.)? Why not evaluate other things as well? And even there, your default strategy does not seem to outperform the baselines by much, not even the very simple Baseline 1 in some scenes.
Writing
- Overall I think the writing is not good since many things are not justified well.
- There are many cases where the authors write '(' without a space before it, e.g., L214 '…mapping(SLAM)', L259 'Multi-layer Perceptron(MLP)', etc.
Questions
It would be very important if you can justify those points in the weakness part.
Thank you for showing interest in our novel idea and for the careful reading to make this paper better. We apologize for the ambiguity; we carefully applied proofreading and clarification of the formulations in the methodology, and the revised sections are highlighted in red in the rebuttal PDF.
For the ambiguity problems, the pixel-wise encoding strategy is clarified in Section 4.1; the neural encoding, including MHE and the separate heads, is declared in Section 4.2; the topo-mapping pipeline and matching method are explained in detail in Section 4.3; the formulations in Figures 2 and 3 have been clarified; and a space is added before all '('.
For other questions:
Bounding-box in text query localization: Actually, it is not the usual bounding box from object detection: we filter the points with similarity over a threshold (0.6 in our practice) and simply draw a bounding box covering these points for visualization (a sketch follows below). 'Samples' is the number of text queries, which is clarified in the revised PDF. Ground truth comes from the object instance labels in Matterport3D. For Table 3, as mentioned in Section 5.2, more than 40 images are sampled from each scene, and ground truth comes from back-projecting image pixels into 3D according to the ground-truth pose and depth.
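A minimal sketch of this filter-then-box step, with illustrative names (`points` is an (N, 3) array, `feats` the (N, D) per-point features queried from the field):

```python
import numpy as np

def localize_text_query(points, feats, query_feat, thresh=0.6):
    """Keep points whose cosine similarity with the query feature exceeds
    `thresh`, then cover them with an axis-aligned box for visualization."""
    feats_n = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    hits = points[feats_n @ q > thresh]
    if len(hits) == 0:
        return None
    return hits.min(axis=0), hits.max(axis=0)  # opposite corners of the box
```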
Image query localization: Table 3 shows the weighted average distance over all samples in a scene, using similarity as the weight. Given that a few noisy points may appear in other rooms, the maximum distance of such a single point (whose similarity is around 0.3~0.6) would be less than 6~8 m, which contributes relatively little.
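In our notation (not necessarily the paper's symbols), this similarity-weighted metric can be written as

$$\bar{d} = \frac{\sum_i s_i \,\lVert \hat{p}_i - p_i \rVert}{\sum_i s_i},$$

where $s_i$ is the similarity weight of sample $i$, $\hat{p}_i$ the predicted 3D location, and $p_i$ the ground truth.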
Ablation study: For better understanding, we add the original feature encoding strategy of CLIP-Field (Shafiullah et al., 2022) to Figure 7 and Table 4 as a comparison, to show the improvement more clearly. As for the improvement from Baseline 1 to the current Topo-Field, this metric is evaluated on 100k~200k position samples per scene; even if the gain seems modest (1%~3%), it is robust and meaningful progress.
For the metrics: while there have been many works distilling object features and comparing semantic segmentation, our work focuses on layout-level encoding and its integration with the object level. As can be seen, the single object feature encoding branch remains nearly the same in our work. As for the topometric graph, we aim to provide a pipeline to build this map based on the neural implicit representation and to evaluate its effectiveness with a traditional graph-based planning method. More quantification for optimization and evaluation on robots is our ongoing future work.
By the way, for fairness, the same input, metrics, and evaluation strategy are employed for our method and the compared ones.
I appreciate the authors' effort in all the responses! Still, there are many points on which I am not convinced or which remain unclear.
For example, there are still some unclear points from your method, I am just listing a few below.
- Line 243: why can CLIP give you a "per-pixel" feature? Isn't it a global feature for each image? You still did not really answer my question about using LSeg/OpenSeg for per-pixel features.
- You still did not address my question "How do you make sure your features on the 3D surface point are exactly the features you render?" Even after Sec. 4.2, I still don't understand it.
Moreover, there are still unjustified points in experiments:
- OK, so you chose a cosine similarity threshold of 0.6; why this value? How can such a single threshold work well? For example, my personal experience with CLIP cosine similarity scores is that a threshold of 0.4 might work well for "bed" but might not work well for "bed with a pillow on it" (just an example). Therefore, an experiment justifying your choice of threshold, and an explanation of why a single value of 0.6 works well for all scenarios, is necessary.
- Now I understand better why your image localization performance is good, but in your paper, you still did not add an explanation
- Ablation: you partially answer my question, I appreciate it. However, you still did not answer the key question: why your ablation is only on the region prediction, not any other parts of your method (objects, the graphs, etc)
Based on my concerns above, I am afraid that the paper right now has not reached my standard for publication. Please incorporate all the comments from every reviewer and submit them to the next venue.
Thank you for the active discussion. For the remaining problems:
Feature encoding: I now understand your concern. In our setting, CLIP provides the per-pixel features, where all pixels in the bounding box share the same object feature. As for the background, all pixels outside the bounding boxes share the same feature representing the unified concept of "region". In contrast, LSeg provides per-pixel features that vary from pixel to pixel. However, encoding the region information of an image is more like an image classification task than a segmentation task: the aim is to supervise all background pixels with a unified region label, and it is hard to make scattered per-pixel features converge to a unified concept.
Feature contrastive learning: Regarding "How do you make sure your features on the 3D surface point are exactly the feature you render?", we have added a more detailed discussion of training the neural representation in Section 4.2. Following the neural encoding discussed in 4.2 and the target feature processing in 4.1, given a posed RGB-D image, the target feature of each pixel is computed as in 4.1 (the feature on the 3D surface point). At the same time, the corresponding pixel in the depth image is back-projected into 3D space according to the depth and pose values, and this 3D point is processed as in 4.2 to form the predicted (rendered) feature. A contrastive loss is applied between the target and predicted features to train the neural representation. Training details are declared in Section 4.4, and Figure 2 shows the feature contrastive learning pipeline.
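For reference, a minimal sketch of the standard pinhole back-projection used in this pairing (illustrative names; this mirrors the textbook model rather than the paper's code):

```python
import numpy as np

def backproject(u, v, depth, K, T_wc):
    """Back-project pixel (u, v) with metric depth into world coordinates.
    K: (3, 3) camera intrinsics; T_wc: (4, 4) camera-to-world pose."""
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    p_cam = np.array([x, y, depth, 1.0])   # homogeneous point in camera frame
    return (T_wc @ p_cam)[:3]              # 3D point in the world frame
```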
Threshold: As mentioned, "we filter the points with similarity over threshold (0.6 in our practice), and simply draw a bounding box to cover these points for visualization". Generally, there are more than 30~50 points on a single object, so there is some tolerance in the threshold choice. The exact value of 0.6 is an empirical choice based on our experiments on the tens of scenes we tested in the Matterport3D dataset. In our practice, values in the range 0.4~0.6 do not disturb the results much.
Image query localization revision in the paper: The declaration of the metric and sampling strategy has been added to Section 5.2.
Ablation: Our main contribution is the allocentric, layout-based scene encoding with a neural representation and the construction of a topological map based on it. As can be seen in Fig. 7, the object encoding branch remains nearly the same as in previous methods, so we do not ablate object-related metrics such as semantic segmentation, which would be nearly identical to previous works. As for the graph, we propose a topometric map construction pipeline based on the learned neural representation and the queried object and region features; consequently, the topo-map construction result relies on the learned object and region embedding metrics. Evaluating and improving the topo-map is our ongoing future work on graph-based path planning and locomotion, which will not be included in this paper.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.