In Table 5, it is unclear whether the proposed CMD is effective under a voxel size of 4cm. It would be better to provide the results that remove the CMD under a voxel size of 4cm and compare with the results at the 3-th row.
Is the proposed method sensitive to the hyperparameters ”? More discussions are required.
In Tables 1-4, it is unclear why the authors ignore the results of Open3DIS with both 2D and 3D supervision. More discussions are required.
In Eq(4), why not use cosine similarity to measure the distance between the mask features and the associate features? Note that using cosine similarity is common practice to measure the distance between features, and is used by OpenSeg and CLIP. Besides, on line 341, should be .
As the three types of associate features lie in different feature spaces, it is difficult for a model to learn the mask features that aggregate all the advantages of associate features by Eq(4). The results in Table 6 also show that using all three types of association may not be the best choice. Why not use separate mask features for each associate feature and concatenate them together?
On line 457, the authors argue that “small voxel size can save the memory requirements”. In fact, using larger voxel size can reduce the number of voxels and is more memory efficient.
Some of the references need to be updated. For example, Open3DIS is currently published in CVPR 2024.