ICLR 2024 (withdrawn)
Ratings: 5, 5, 1, 5 — average 4.0/10 (min 1, max 5, std. dev. 1.7), 4 reviewers; average confidence: 3.3

Text-driven Editing of 3D Scenes without Retraining

Submitted: 2023-09-16 · Updated: 2024-03-26

Abstract

Keywords
3D scene editing, text-driven, generalizable, without retraining

Reviews and Discussion

Review
Rating: 5

To tackle the issue of methods being designed for specific editing types, this work proposes a framework that allows the direct acquisition of a NeRF model with universal editing capabilities, eliminating the requirement for retraining. After modifying the 3D scene images, a filtering process discards poorly edited images, and a generalizable NeRF is used to obtain consistent views. Cross-view regularization terms are employed here.

Strengths

Extensive experiments covering different editing types are conducted.

Weaknesses

  1. A generalizable NeRF model can be limited to novel-view generation within a very limited view range;
  2. The robustness of the filtering technique is questionable to me;
  3. The ablation study cases are very limited.

Questions

  1. When using nearby-view supervision, how the depth is obtained is not well explained;
  2. How Eq. 3 is deduced to Eq. 4 is not clear to me;
  3. How many images are used for the generalizable NeRF synthesis?

Review
Rating: 5

The paper proposes a framework for text-driven 3D scene editing tailored to image-based rendering. The authors claim to achieve a generalizable editing capability that facilitates training-free editing. The key contribution is to generate a large amount of multi-view training data by adding perturbations to 3D scenes using a BLIP→GPT→Null-text inversion pipeline. Then, a generalizable image-based rendering model is trained to get rid of scene-specific training. The experiments show promising results on 3D editing. However, the generalization to novel text/scenes is not validated.

Strengths

  • The idea to combine generalizable image-based rendering with 3D scene editing is interesting.
  • The BLIP→GPT→Null-text inversion pipeline for generating edited images on a caption-free multi-view dataset is promising.
  • The visual results are compelling, with good demonstrations on various types of scenes, including object-centric, forward-facing, and unbounded scenes.

Weaknesses

Given the lack of validation of the generalizability, I lean toward rejection at this point. I’m happy to be convinced by the authors’ response.

  • My major concern is the generalizability. In the paper and supplementary material, I cannot find any description of the train/test split for the experiments. If all the results presented in the paper are seen during training, the paper definitely overclaims the generalizability. In that case, the title’s “without retraining” is also misleading.
  • I hope to see more details about the experimental settings and the corresponding evaluation protocols:
    • The train/test split of text captions. If all captions used in the qualitative results are seen during training, please present results of editing the scene with unseen captions.
    • The train/test split of 3D scenes. If all scenes presented in the qualitative results are seen during training, please exclude some of them, retrain the model, and then test on those unseen scenes. Otherwise, the method still needs retraining for user-specific 3D scenes, which does not support the main claim.

Questions

  • Please see the weakness section for my concern about generalizability.
  • What is the motivation for src_a and src_b? Please clarify the motivation for these two separate models.

Ethics Concerns

None

Review
Rating: 1

This paper uses a diffusion model for text-driven 2D edits and consolidates the edits in a NeRF. It is claimed to work across several appearance-editing and style-transfer tasks and also to reduce the time overhead.

Strengths

  1. The paper addresses a highly relevant NeRF editing topic of broad interest.
  2. Extensive results are presented.

Weaknesses

  1. As the Introduction mentions, existing methods typically rely on knowing the editing types in advance and have limited modification capabilities. Please elaborate on this claim and name which editing types, or which works, heavily rely on known editing types. To my knowledge and experience, Instruct-NeRF2NeRF or CLIP-based methods can handle any text input; they are not confined to the specific editing types demonstrated in their papers.

The statement "These techniques are often less user-friendly" seems to overlap semantically with the two previously claimed challenges and does not add new information. I suggest removing it.

  2. 2D editing does not guarantee multi-view consistency for each input. However, each edited single frame is well-structured, and the jumps between multi-view outputs may not be simply noisy perturbations. For example, changing a selfie into Fauvism style can produce multiple (colorful/high-contrast) plausible outputs, each well-structured but quite different from the others. The difference between diffusion-model outputs may therefore not be a noisy perturbation.

  3. The writing is complex and needs much more clarity. For example, what is the purpose of the input caption? Does it aim to provide a description whose nouns or subjects are replaced to create a target caption for editing? What does "generalizable" mean in this context?

  4. There is not enough information about the inference stage; only the content filter is mentioned in the paragraph. In Fig. 2 (inference time), how can the NeRF generate a closed-eye image if it has not seen such examples at training time? From the figure, the volume rendering produces an open-eye image, but in the next step G's output abruptly shows a closed-eye image.

How the filtered images are used at inference time to avoid retraining the NeRF is also not clear to me.

  5. The abstract claims to perform appearance editing, weather transition, object changing, and style transfer. However, the results seem to cover only style transfer and appearance editing. It is not convincing to call changing to snow-covered roads a weather change. The pineapple and strawberry examples seem to change appearance only; they do not create new geometry.

Questions

  1. What exactly does the term "generalizable" mean for a NeRF? The term appears many times in the paper but remains vague. For example, what can this NeRF model attain that a vanilla NeRF model cannot?

  2. In Eq. 6, how is M obtained for the overlapping areas? Is it provided in the data?

  3. Fig. 7 mentions a total of 6 methods for comparison, but why does the exemplar Fig. 18 contain only 3 videos in comparison (or is it just an example)? How are the ratios of one method against the others in Fig. 7 computed? It is hard to understand how the subjective test result in Fig. 7 is calculated. Besides, using only 4 scenes for a subjective test is far from sufficient in my opinion. Also, please add a statistical-significance p-value to validate the significance of the study.

Comment

Q8: What exactly does the term "generalizable" mean for a NeRF? What can this NeRF model attain that a vanilla NeRF model cannot?

R8: Our model possesses the ability to generalize and render novel views for new scenes without retraining, whereas vanilla NeRF must be trained for each individual scene.

For more information on the design of generalizable NeRF models, please refer to IBRNet, MVSNet, Neuray, and GeoNeRF.

Q9: In Eq. 6, how is M obtained for the overlapping areas? Is it provided in the data?

R9: This is achieved by projecting points from one viewpoint onto another using depth information. A certain region of viewpoint A may be invisible in viewpoint B, and in that case it is not used when calculating the loss.
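
To make this concrete, here is a minimal sketch (not the authors' code; the intrinsics K_a/K_b, the relative pose T_ab, and the omission of occlusion checks are illustrative assumptions) of projecting view A's pixels into view B with A's depth map to obtain such an overlap mask:

```python
import numpy as np

def overlap_mask(depth_a, K_a, K_b, T_ab, H, W):
    """Boolean mask over view A's pixels that project inside view B's image.
    depth_a: (H, W) depth map of view A; K_a, K_b: 3x3 intrinsics;
    T_ab: 4x4 rigid transform from A's camera frame to B's camera frame.
    Occlusion (depth-ordering) checks are omitted for brevity."""
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x (H*W)
    # Back-project A's pixels to 3D points in A's camera frame.
    pts_a = np.linalg.inv(K_a) @ pix * depth_a.reshape(1, -1)
    # Move the points into B's camera frame and project with B's intrinsics.
    pts_b = T_ab[:3, :3] @ pts_a + T_ab[:3, 3:4]
    proj = K_b @ pts_b
    z = proj[2]
    u_b = proj[0] / np.maximum(z, 1e-8)
    v_b = proj[1] / np.maximum(z, 1e-8)
    # A pixel contributes to the overlap only if it lands in front of B's camera
    # and inside B's image bounds.
    visible = (z > 0) & (u_b >= 0) & (u_b < W) & (v_b >= 0) & (v_b < H)
    return visible.reshape(H, W)
```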

Q10: Why does the exemplar Fig. 18 contain only 3 videos in comparison (or is it just an example)? Using only 4 scenes for a subjective test is far from sufficient in my opinion.
How are the ratios of one method against the others in Fig. 7 computed? Besides, please add a p-value to validate the significance of the study.

R10: Figure 18 is just an example.

Although we used 4 scenes, multiple types of edits were performed on each scene and we collected a total of 1700 votes. As a comparison, StylizedNeRF collected 1000 responses across 4 scenes, while ARF collected 1150 votes across 5 scenes.

Regarding how the ratios in Fig. 7 are computed, here is a simple example: we showed users two videos (one generated by DN2N and one generated by CLIP-NeRF) and asked them to choose the video they thought had better consistency. If 30 users chose DN2N and 20 users chose CLIP-NeRF, the ratio of DN2N would be 30/(30+20) = 60% and the ratio of CLIP-NeRF would be 20/(30+20) = 40%.

We performed a binomial test on each category of voting results, and the p-values are listed below. In approximately half of the cases, there was a clear preference among users (with p-value < 0.05, indicating a tendency to prefer our model over others).

                        (a) 3D consistency   (b) preservation of the content   (c) faithfulness to the text description
p(DN2N, InstructN2N)    0.887                3e-8                              0.202
p(DN2N, NeRF-Art)       0.479                5e-6                              0.322
p(DN2N, ClipNeRF)       0.0003               0.202                             0.0003
p(DN2N, DFF)            9e-5                 0.479                             5e-6
p(DN2N, ARF)            0.671                3e-8                              0.887
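
For reference, the preference ratio and the binomial test described above can be reproduced as follows (a minimal sketch using the illustrative 30/20 vote split from the example above, not the actual study data; SciPy ≥ 1.7 is assumed):

```python
from scipy.stats import binomtest

# Illustrative vote counts for one pairwise comparison (e.g. DN2N vs. CLIP-NeRF).
votes_ours, votes_other = 30, 20
n = votes_ours + votes_other

ratio_ours = votes_ours / n    # 0.60 -> reported as 60%
ratio_other = votes_other / n  # 0.40 -> reported as 40%

# Two-sided binomial test against the null hypothesis of no preference (p = 0.5).
# For a 30/20 split this gives a p-value of roughly 0.20.
result = binomtest(votes_ours, n=n, p=0.5)
print(f"DN2N ratio: {ratio_ours:.0%}, p-value: {result.pvalue:.4f}")
```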

Thank you for your reviews and comments. Please let us know if you have further questions after reading our response.

Comment

Q1: Please elaborate on this claim and name which editing types, or which works, heavily rely on known editing types. To my knowledge and experience, Instruct-NeRF2NeRF or CLIP-based methods can handle any text input and are user-friendly.

R1: In the introduction, we listed three potential flaws of existing methods. We want to clarify that each method has one or more of these flaws, rather than every method having all three. To summarize the flaws of each approach:

  1. Rely on known editing types in advance: DFF, Learning-to-Stylize, ARF.
  2. Require retraining an editing model for each specific 3D scene: CLIP-NeRF, Instruct-NeRF2NeRF, NeRF-Art, DFF, Learning-to-Stylize, ARF.
  3. Less user-friendly: DFF, Learning-to-Stylize, ARF.

Our main contribution in this paper is a new editing framework that overcomes all three of these flaws.

Q2: Each edited single frame is well-structured, and the jumps between multi-view outputs may not be simply noisy perturbations.

R2: The differences between the outputs of diffusion models may indeed not be noise perturbations; they may have their own inherent patterns. We refer to them as 'noise' perturbations in the article simply to borrow the idea of 'denoising' for convenience: denoising methods are used to remove noise, while our method is used to remove the differences between the outputs of the diffusion model. Regardless of what this difference specifically is, the purpose of training our model is to eliminate it, and our experimental results support the validity of this approach.

Q3: The writing is complex and needs much more clarity. For example, what is the purpose of the input caption?

R3: The roles of the input and target captions are explained in Eq. 1 of the article. Our method uses a 2D editing model called Null-text inversion, which requires both an input caption and a target caption when editing an image. The input caption describes the original image, and the target caption describes the desired edited image.

Q4: What does "generalizable" mean in this context?

R4: 'Generalizable' refers to the ability of our model to edit new scenes without retraining. In contrast, methods such as CLIP-NeRF and Instruct-NeRF2NeRF require retraining the model when editing a new scene.

Q5: There is not enough information about the inference stage; only the content filter is mentioned in the paragraph. In Fig. 2 (inference time), how can the NeRF generate a closed-eye image if it has not seen such examples at training time? From the figure, the volume rendering produces an open-eye image, but in the next step G's output abruptly shows a closed-eye image. How the filtered images are used at inference time to avoid retraining the NeRF is also not clear to me.

R5: Detailed explanations and examples of the inference and filtering process can be found on pages 16 and 17.

Regarding the explanation of the transition from open eyes to closed eyes in Fig. 2:

During the training phase (Fig. 2a), we do not use closed-eye data as ground truth (GT) to train the model. The data pairs used for training are {the slightly perturbed image $I_{in}$, the clean original image $I_{gt}$}, and the goal of training our model is to remove the differences between these two images. Since only slight perturbations are applied during editing, the eye is not actually closed. Furthermore, this is just one type of edit on one scene: in the training stage we used 1198 scenes, each contributing 405 training data pairs, for a total of 485,190 training pairs. After training, our model is able to eliminate the differences between the outputs of the 2D editing model.

During the inference phase (Fig. 2b), we apply normal-amplitude edits to images of a 3D scene. We first filter out images with poor editing results and then use the model obtained in the previous step to remove the remaining perturbations. Thus, the inference phase can produce the closed-eye result.
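
To illustrate the paired-training idea in isolation, here is a minimal runnable sketch (not the authors' model: it ignores the image-based rendering / generalizable NeRF components and uses random tensors and a toy network as stand-ins for the {$I_{in}$, $I_{gt}$} pairs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy network standing in for the consolidation model: it is trained on pairs
# {slightly perturbed image I_in, clean original I_gt} to remove the perturbation.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):
    I_gt = torch.rand(4, 3, 64, 64)              # clean "originals" (random placeholders)
    I_in = I_gt + 0.05 * torch.randn_like(I_gt)  # slightly perturbed versions
    loss = F.mse_loss(net(I_in), I_gt)           # learn to map I_in back toward I_gt
    opt.zero_grad()
    loss.backward()
    opt.step()
```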

Q6: It is not convincing to say that changing to snow-covered roads is a weather change.

R6: Our method also simulates the effect of raindrops. Please see Supplementary Material Fig. 20, where the target caption reads "a red flower with green leaves in the heavy rain". We also referenced other articles such as "ClimateNeRF: Extreme Weather Synthesis in Neural Radiance Fields", presented at ICCV this year. Similarly, that article only covers snow weather effects.

We will clarify this point in our final article version.

Q7: In Fig. 1, the pineapple and strawberry examples seem to change appearance only; they do not create new geometry.

R7: The geometric changes may not be obvious in this example; we suggest referring to Supplementary Material Fig. 20, where we demonstrate edits such as turning flowers into apples and eggs.

Review
Rating: 5

This paper presents a way to perform editing of 3D scenes represented with NeRFs using a text description of the target. In the training step, they use BLIP for image captioning, GPT to generate target captions, text-based image editing, etc., and train a generalisable NeRF model. During inference, text-based image editing is performed and the images that are 3D-inconsistent are removed by a filtering step. The remaining images are used to generate new views with a network derived from IBRNet. The paper presents a variety of editing results. The paper presents many ideas, which are mostly explained in the long supplementary/appendix section.

Strengths

Text-based editing of 3D scenes is an important problem and provides opportunities to create many variations of 3D representations if it can be done well. The paper leverages many recent methods to achieve this task in some way. They train the NeRF model using many more losses to achieve "generalisability". There are many good results shown.

Weaknesses

Presentation/Writing: The presentation needs to improve, as it is not clear what their training achieves. It is impossible to understand the method to a reasonable extent without reading the appendix/supplementary material, which runs to 15 pages. See my questions below.

Significance: While text-based editing will be useful, what are the predictability and controllability of such a method? How does one know whether the generated model matches the textual target given? There seem to be few ways to do that beyond basic qualitative or visual assessments. This is a serious limitation of many methods that rely on a generative tool, as there is little controllability or predictability of what such a tool generates. The DN2N method presented has that problem too: what does the text-based editing block generate at inference time? The method can only filter out inconsistent images; that is, it can only discard generated images, it cannot obtain a better image by influencing the editing module. How do we know whether only selected good results are shown here? The discussion in Section 4.5 needs to be far more elaborate.

Questions

I have several doubts/questions about the inference or editing stage. These should be addressed clearly in the main paper.

  • What is the "effort" involved during inference? How many 2D edited images are generated? How many are found to be inconsistent on average and discarded? How much time does this take?
  • Were there failure cases where a sufficient number of consistent images could not be generated? Is the whole process iterated if that happens?
  • Are edited images generated for the same camera poses as the input images used for training? Can other viewpoints be used?
  • Is there any way to ensure or control what we want from the text-based image editing module? Your method depends very critically on this module generating good edits, and your editing capability is seriously limited by this aspect. Please see my comments under "Weaknesses".
  • The filtering step is rather too simple. How are the 4 measures in the tuple prioritised or weighted? The details of the sorting could not be found anywhere. Why do you eliminate the top 10% of the matches? Aren't they the best?

Also, why does DN2N's output fall short of some methods in the user study?

Ethics Concerns

Nothing special beyond what applies to other methods that generate 3D objects, which can be misused.