We sincerely thank you for your valuable feedback and constructive comments. We have carefully considered each point raised and provide our responses below.

1. Computational Complexity

We acknowledge that our method may be more time-consuming due to the presence of two loops in the selection strategy. However, it is important to highlight that this increased computational complexity is offset by our method's unique capability to predict unknown viewpoints, a feature not available in alternative methods. Additionally, our approach demonstrates superior performance in scene segmentation compared to other methods.

2. Benefit of the Viewpoint Selection Strategy

We appreciate your observation regarding the role of the viewpoint selection strategy in predicting images from unknown viewpoints.

The viewpoint selection strategy is integral to enhancing the quality of viewpoint-independent object-centric representations. It ensures that the most informative viewpoints are selected for observation, thereby reducing redundancy and improving prediction accuracy. Accurately predicting images from unknown viewpoints is both a necessary condition for the effectiveness of our viewpoint selection strategy.
can be obtained in image prediction tasks by inputting the target viewpoint timestep into the viewpoint encoder. The model uses to guide the prediction of the image from target viewpoint.

3. Necessity of the Prediction Set

We apologize if Figure 2 is not sufficiently clear. In our framework, the images from the Unknown Set are not directly accessible to the model. Only the viewpoint timesteps of the Unknown Set are available. The Prediction Set consists of images predicted from these viewpoint timesteps. Generating the Prediction Set is a necessary step to estimate the information increment of the unknown viewpoints, which plays a crucial role in our viewpoint selection strategy.