PaperHub
Overall score: 5.6 / 10 (Poster; 5 reviewers; min 4, max 8, standard deviation 1.5)
Individual ratings: 8, 4, 6, 6, 4
Confidence: 4.0 | Correctness: 2.8 | Contribution: 2.2 | Presentation: 3.0
NeurIPS 2024

GaussianCut: Interactive segmentation via graph cut for 3D Gaussian Splatting

OpenReview | PDF
Submitted: 2024-05-15 · Updated: 2025-01-14
TL;DR

Object selection and segmentation using graph cut for scene represented using 3D gaussian splatting

Abstract

Keywords
3D Vision, Segmentation, Graph Cut, Gaussian Splatting

Reviews and Discussion

Review (Rating: 8)

The paper proposes a method for interactive multiview segmentation of scenes represented as a set of 3D Gaussians (3D Gaussian splatting). The method takes as input multiple views of a scene, constructs a 3D Gaussian splatting representation of the scene, takes interaction from the user (scribbles, points, or text prompts), uses a video segmentation method (prior work) to get object masks from the user input in each view, and then constructs a graph with nodes corresponding to the Gaussians. A binary energy function on this graph is designed which encourages similar Gaussians (both in space and color) to be connected with edges of higher weight. Additionally, for each Gaussian there is a likelihood term representing how likely it is to be foreground/background. Finally, this energy function is optimized with a graph cut and gives a separation of the cloud of Gaussians into two sets: foreground and background. After separation into two sets, foreground and background can be visualized separately using the corresponding set of Gaussians. Experimental results outperform prior work.
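For reference, graph-cut segmentation of this kind typically minimizes a binary pairwise energy of the generic form

$$E(\mathbf{x}) = \sum_{g \in \mathcal{G}} U_g(x_g) + \lambda \sum_{(g,h) \in \mathcal{N}} B_{gh}\,\mathbb{1}[x_g \neq x_h],$$

where $x_g \in \{0,1\}$ is the foreground label of Gaussian $g$, $U_g$ is the unary (t-link) term derived from the per-Gaussian foreground likelihood, and $B_{gh}$ is the pairwise (n-link) similarity between neighboring Gaussians. This notation is illustrative and not necessarily the paper's exact formulation.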

Strengths

The proposed method is well motivated. The energy function makes sense.

Weaknesses

Computational cost is not real time.

Mapping user input to Gaussians could be better described. For someone not familiar with prior work on 3D Gaussian splatting, it is hard to understand how it works. An illustration could help.

The cluster similarity in the unary terms (Eq. 3) seems to have a big impact on the performance, but is somewhat ad hoc and not so well motivated. The problem might be a result of the 'shortcutting' bias in the energy formulation (cuts separating a small component are cheaper). What you are doing here is somewhat reminiscent of saying that pixels close to the foreground scribbles should be more likely to be in the foreground (if I think of an analogy to 2D segmentation).

Questions

Without the mask refinement method (just from user scribbles), by how much does the performance go down?

Limitations

Limitations are discussed.

Author Response

We thank the reviewer for their thoughtful feedback and kind words about our work.

Computational cost is not real time.

The reviewer is right in noting that the graph cut algorithm does not run in real time. However, once the scene is segmented into foreground and background Gaussians, it can be rendered in real time. We kindly refer the reviewer to Table 2 (global response) for a detailed analysis of the pre-processing, training, and segmentation time of our method. Since our method does not make changes to the 3DGS optimization, the overall time is much less than the baselines. Exploring real-time segmentation is an exciting future direction for this project.

Mapping user input to Gaussians could be better described.

In 3DGS, we can keep track of the Gaussians splatting to a particular pixel in 2D while rasterizing from that viewpoint. So, for a particular Gaussian, we can compute the ratio between the number of pixels it splats onto in the foreground (as per the 2D mask) and the total number of pixels it splats onto. This gives us the coarse estimate. Thank you for highlighting this. We will add an equation and an illustration to make this clearer to the reader.

Intuition for the cluster similarity term

The intuition behind adding a cluster similarity term is to improve the accuracy of foreground identification. If a Gaussian is in a region with other Gaussians that are likely foreground, it might also be foreground. We start with a coarse estimate of each Gaussian's likelihood of being foreground or background, derived from coarse splatting. However, this initial estimate can have errors due to inaccuracies in the 2D segmentation masks. Nodes with high confidence of belonging to the foreground (weights $w_g$ close to 1) can serve as prototypes for other similar nodes. Directly finding the closest high-confidence node for each Gaussian is computationally expensive. Therefore, we cluster these high-confidence nodes to reduce the computational load. We also present an ablation study on the number of clusters in Table 8. After experimenting with various heuristics, this clustering approach proved to be quite effective.
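As an illustration of this idea (our own sketch, not the authors' code), one could cluster the high-confidence Gaussians and score every Gaussian by its similarity to the nearest cluster center; the confidence threshold, feature choice (position plus color), and kernel width below are assumptions made for illustration only.

```python
# Hedged sketch of the cluster-similarity idea described above.
import numpy as np
from sklearn.cluster import KMeans

def cluster_similarity(features, w, conf_thresh=0.9, n_clusters=16, sigma=1.0):
    """features: (N, D) per-Gaussian features (e.g., position and color).
    w: (N,) coarse foreground likelihoods from splatting."""
    prototypes = features[w > conf_thresh]                  # high-confidence nodes
    centers = KMeans(n_clusters=n_clusters, n_init=10).fit(prototypes).cluster_centers_
    # distance of every Gaussian to its closest cluster center
    d = np.min(np.linalg.norm(features[:, None, :] - centers[None], axis=-1), axis=1)
    return np.exp(-d**2 / (2 * sigma**2))                   # similarity in (0, 1]
```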

Without the mask refinement method (just from user scribbles), by how much does the performance go down?

We kindly refer the reviewer to Figure 12, which shows this effect qualitatively. If we do not use any 2D mask (and just rely on the user scribbles), we only get the area around the scribble from coarse splatting. Applying graph cut has much more significant benefits in this case. We show the quantitative results on five scenes from the LLFF dataset (scribbles following the NVOS benchmark). The numbers in the table are IoU, classification accuracy.

Scene    | Scribbles    | Scribbles (with graph cut) | GaussianCut
Fern     | 8.17, 74.50  | 47.97, 84.52               | 83.06, 94.60
Flower   | 7.48, 79.16  | 85.30, 97.61               | 95.37, 98.91
Fortress | 15.12, 84.15 | 95.67, 99.56               | 97.95, 99.61
Trex     | 6.74, 88.53  | 50.44, 91.96               | 83.43, 97.83
Orchids  | 6.17, 84.84  | 85.25, 97.55               | 95.80, 99.31

The performance drop is natural, as the default implementation uses masks from multiple views, but graph cut on 3D Gaussians can still retrieve major parts of the objects even with just the scribbles. We can include more qualitative results in the appendix to show this.

Review (Rating: 4)

The paper proposes GaussianCut for interactive multi-view scene segmentation using 3D Gaussian Splatting. It first accepts user input (clicks, scribbles, or text, similar to SAM) on single images, and then aims to segment the corresponding 3D Gaussians. The method constructs a graph based on the scene Gaussians and then uses a graph-cut algorithm to minimize the energy function. A segmentation tracking method is employed to provide the initial segmentation masks. The experiments are performed on LLFF, Shiny, SPIn-NeRF, and 3D-OVS.

Strengths

  1. The paper has a good presentation and clear writing, where figures are clean for the readers to understand the method pipeline.

  2. The paper has a good motivation on extracting objects from explicit 3D Gaussians. Graph cut is well known for image segmentation on pixels, and the extension to 3D Gaussians is interesting.

  3. The quantitative results in Tables 1, 2, 3, and 4 show the reasonable segmentation performance brought by GaussianCut.

Weaknesses

  1. Although extracting objects from explicit 3D Gaussians has a good motivation, this has been studied in previous works like LangSplat [a] and Gaussian Grouping [b], yet these two very relevant methods are missing quantitative comparisons in the main paper. It is not clear what the advantages of the proposed GaussianCut are compared to [a] and [b]. In Table 10 (which should be moved to the main paper), [a] and [b] are compared, but the performance of Gaussian Grouping is still on par with or even better than the proposed GaussianCut. To make the paper's novelty clear, the paper should clarify these main differences / advantages in the main paper, and also include a detailed running speed comparison.

  2. Using video tracking masks to obtain a coarse segmentation is not new; it has been explored in [b, c], but not compared. Since [a] and [b] both lift SAM's masks to 3D, these two methods can also perform click/scribble-based 3D segmentation.

  3. The performance improvement from the introduced spatial, color, and cluster similarities in Table 5 is limited, which shows that the improvement from the proposed n-links and t-links is minor. Also, the extension of graph cut from 2D pixels to 3D Gaussians seems very straightforward.

  4. More comparisons on benchmarks like Replica (setting proposed by Panoptic Lifting [d]) or Lerf-Mask [b] are desired.

[a] Langsplat: 3d language gaussian splatting. CVPR, 2024.

[b] Gaussian Grouping: Segment and Edit Anything in 3D Scenes. ECCV, 2024.

[c] CoSSegGaussians: Compact and Swift Scene Segmenting 3D Gaussians with Dual Feature Fusion. arXiv, 2024.

[d] Panoptic Lifting for 3D Scene Understanding with Neural Fields. CVPR, 2023.

Questions

How will GaussianCut perform when video segmentation/tracking fails due to large motion, which may lead the coarse segmentation to contain large portions of errors?

How many Gaussians are considered during the graph construction? The most concerning part for me is the technical novelty of the paper.

Limitations

The limitation is discussed in the paper.

Author Response

We thank the reviewer for assessing our work and providing valuable feedback. We provide clarifications to the concerns and questions they raise.

Justify the technical novelty. Graph cut from 2D pixels to 3D Gaussians seems very straightforward.

Extending graph cut from 2D pixels to 3D Gaussians involves several non-trivial design choices and findings. We kindly refer the reviewer to the global response.

Comparison with LangSplat and Gaussian Grouping and detailed running speed comparison

We kindly refer the reviewer to Table 1 (global response) for a comparison with LangSplat and Gaussian Grouping, and to Table 2 (global response) for a detailed speed comparison. For the 3D-OVS dataset, we provided a comparison with these two baselines (Table 10). We would like to clarify that the last line of Table 10 is the average over the Lawn scene (not the overall average). Our overall average is much higher than the other baselines, as shown below. A reason behind this is that Gaussian Grouping and LangSplat optimize features for all the Gaussians, which limits interactivity with specific objects. Interactivity here refers to the choices made when selecting the 2D supervision mask. Once a supervision mask is chosen, the underlying feature field is optimized based on it. On the other hand, for GaussianCut, the 2D masks are chosen based on the user input.

Overall average on the 3D-OVS dataset. The reported metric is IoU.

Method      | Gaussian Grouping | LangSplat | CGC   | Ours
Average IoU | 82.92             | 67.78     | 87.50 | 94.38

Since the code for CoSSegGaussians is not public yet, we cannot provide a comparison against it.

Dependence on video segmentation model

We use video segmentation models for better performance, but our method can work with just one mask as well (Table 4 and Figure 1 in the rebuttal PDF). While our model's performance does suffer when using a single mask, it can still produce a reasonable segmentation. LangSplat, Gaussian Grouping, and CoSSegGaussians, however, require a segmentation mask for all training views.

Limited improvement from proposed n-links and t-links

Results in Table 5 show the average performance over 7 scenes. While the effect of each individual component might not seem large on average, the effect can be significant depending on the scene. We show two scenes from the SPIn-NeRF dataset where removing the n-links can have a significant effect. GaussianCut also retrieves fine details (as shown in Figures 4 and 10) which are otherwise missed when just using the semantic maps. The numbers in the table are IoU / classification accuracy.

Scene | Coarse        | w/o n-spatial | w/o n-color   | w/o cluster   | GaussianCut
Lego  | 88.03 / 99.63 | 88.43 / 99.66 | 88.52 / 99.66 | 89.18 / 99.69 | 89.18 / 99.70
Truck | 93.32 / 97.76 | 95.47 / 98.49 | 95.60 / 98.54 | 95.67 / 98.57 | 95.70 / 98.60

Comparison on more datasets like Replica and Lerf-mask

We thank the reviewer for suggesting additional datasets. We show quantitative results on four datasets: LLFF from NVOS, SPIn-NeRF, 3D-OVS, and Shiny, and have also compared performance against SA3D, ISRF, SAGA, SAGD, LangSplat, Gaussian Grouping. In addition, we take scenes from mip-NeRF and LERF (Figure 8) to show qualitative results. If needed, we can provide quantitative results on more benchmarks for the revised paper.

How will GaussianCut perform when video segmentation/tracking fails due to large motion, which may lead the coarse segmentation to contain large portions of errors?

We kindly refer the reviewer to Figures 5-7. We have also added an extreme-case (no video segmentation model) qualitative result in Figure 1 (rebuttal PDF). The quantitative results (IoU / classification accuracy) are shown below:

Scene | Single Image (Coarse) | Single Image (graph cut) | GaussianCut
Truck | 55.63 / 83.37         | 71.83 / 90.23            | 95.7 / 98.6
Lego  | 72.92 / 98.88         | 79.98 / 99.26            | 89.2 / 99.7

How many Gaussians are considered during the graph construction?

We consider all the Gaussians to construct the graph. The number is the same as in the base 3DGS model, ranging from ~850k to 4M across our scenes (the time taken is between 40 seconds and 4 minutes).

Comment

I read and appreciate the authors' response to my review. After thoroughly considering the feedback from the other reviewers, I am inclined to uphold my original score of "borderline reject" due to the paper's unclear technical contributions in the submitted writing and the limited performance improvement. To make the novelty clear, the next revision should highlight and clarify the main differences from / advantages over existing works in the main paper, and also include the detailed running speed comparison in the paper as well.

Comment

We thank the reviewer for their suggestion. As per the reviewers' suggestion, we performed a detailed run-time comparison against the feature-learning methods during the rebuttal. We initially did not compare the runtime against feature-based methods as we operate directly on a pre-trained 3DGS model. Since none of the published prior work does training-free segmentation of a 3DGS model, we did not have a direct baseline to compare against. To make our contribution clear, we have highlighted that our approach is training-free in the abstract (line 15), in Figure 1, in the method section (lines 125-126), and in the discussion section (lines 319-320). We have differentiated our work from feature-based methods (lines 84-85 and lines 99-100) in the related work section. While we work on improving the paper, we are curious to know which other running-speed comparisons the reviewer would like to see beyond the ones provided during this discussion period.

Regarding the limited performance improvement, we would again like to highlight that for a training-free approach to perform on par with or even better than training-based baselines is not only novel but also surprising. As for the improvement, we show a considerable gain on the 3D-OVS dataset (+6.88 absolute IoU) and smaller gains on NVOS (+1.6 absolute IoU) and SPIn-NeRF (+0.5 absolute IoU). The reason for this is that the performance on the two latter benchmarks is already quite high even for the baselines, and thus the room for improvement is smaller as these benchmarks are becoming saturated.

Review (Rating: 6)

This paper proposes GaussianCut for interactive 3D segmentation. GaussianCut takes a trained 3DGS model and a user prompt as inputs. The SAM model first converts the user prompt into an initial mask. Then the 3DGS model renders multiple view images, and an existing video-tracking model is used to segment 2D masks across multiple views. With the masks on multiple views, the splatted Gaussians are identified with two likelihood parameters. Then the graph-cut method is applied to the Gaussian points, where each Gaussian is a node and the edges model the foreground and background relations. Results on multiple benchmarks show the effectiveness of the proposed method.

Strengths

  • This paper is well-written and easy to follow.
  • Using Graph Cut to segment the 3D Gaussians makes sense and sounds interesting.
  • The result on multiple benchmarks shows the proposed method achieves new SOTA performance.

Weaknesses

  • The proposed method is kind of straightforward. The framework of GaussianCut is a combination of existing models: the SAM model is used to obtain masks from user prompts, the SAM-Track model is used to generate multi-view masks, the 3DGS model explicitly models relations between 3D Gaussians and 2D pixels, and the graph-cut model is used to separate the Gaussian points. This paper should explain the key contributions of GaussianCut itself.

  • Details of associating 3D Gaussians with 2D masks should be given. Since alpha blending assigns different weights to different Gaussians for a single pixel: for the Gaussians splatted onto one pixel, are they assigned different weights just for that pixel, or is a hard (binary) assignment used?

  • The time cost of the proposed method is shown in Tables 6 and 7. However, the comparisons with other SOTA methods in terms of time cost should be included to illustrate the speed advantage/disadvantage of the proposed method.

Questions

The novelty of the proposed method should be justified. The proposed method is more like a post-processing step. Some important method details and comparisons are missing.

Limitations

The authors have adequately addressed the limitations.

Author Response

We thank the reviewer for their constructive review. We would like to respond to the questions and concerns that they have posed.

Key contributions clarification

We kindly refer the reviewer to the global response which provides a detailed explanation of our key contributions.

Details of associating 3D Gaussians with 2D masks

Gaussians are assigned coarse estimates based on their weights during alpha-rendering at rasterization, i.e., each Gaussian splatted onto a pixel is assigned a different weight for that pixel. This weight is based on the Gaussian's contribution during alpha blending. We also experimented with binary weights and did not observe any major performance difference; the binary weights give a similar performance in our experiments. We will include these details about the mapping of masks to Gaussians in the revised paper. Thank you for the feedback.

Time cost analysis

For a detailed speed analysis, we kindly refer the reviewer to Table 2 (global response).

Comment

Thanks for the detailed response. The reviewer still retains the following concerns:

  1. Novelty. This paper extends the graph cut-based segmentation method to 3D Gaussians, which has not been explored before. However, 3D Gaussians are essentially a bunch of 3D points, and using graph cut for 3D segmentation, including point cloud segmentation, has been explored [1, 2]. The current technical contribution is limited.

  2. It's still unclear how the soft/hard weights are combined in the proposed method. Also, what are the experimental results of these two different choices?

[1] Zhang, Zihui, et al. "Growsp: Unsupervised semantic segmentation of 3d point clouds." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[2] Guo, Haoyu, et al. "Sam-guided graph cut for 3d instance segmentation." arXiv preprint arXiv:2312.08372 (2023).

Comment

We thank the reviewer for the follow-up.

Coarse splatting

We elaborate on the coarse splatting in further detail below:

Consider an optimized 3DGS model for a scene, $\mathcal{G}$. For $n$ viewpoints $\mathcal{I} := \{I^j\}_{j=1}^n$, we obtain masks $\mathcal{M} := \{M^j\}_{j=1}^n$ from a video segmentation model, where $M^j$ indicates the set of foreground pixels in viewpoint $I^j$.

For each Gaussian $g \in \mathcal{G}$, we maintain a weight, $w_g$, that indicates the likelihood of this Gaussian belonging to the foreground. To obtain the likelihood term $w_g^j$ pertaining to mask $j$ for Gaussian $g$, we unproject the posed image $I^j$ back to the Gaussians using inverse rendering and utilizing the mask information,

$$w_g^j = \frac{\sum_{\mathbf{p} \in M^j} \sigma_g(\mathbf{p})\, T_g^j(\mathbf{p})}{\sum_{\mathbf{p} \in I^j} \sigma_g(\mathbf{p})\, T_g^j(\mathbf{p})}$$

where $\sigma_g(\mathbf{p})$ and $T_g^j(\mathbf{p})$ denote the opacity and transmittance from pixel $\mathbf{p}$ for Gaussian $g$. If $g$ does not contribute to $\mathbf{p}$, the transmittance is taken to be $0$. Combining over all the masks,

$$w_g = \frac{\sum_{j}\sum_{\mathbf{p} \in M^j} \sigma_g(\mathbf{p})\, T_g^j(\mathbf{p})}{\sum_j\sum_{\mathbf{p} \in I^j} \sigma_g(\mathbf{p})\, T_g^j(\mathbf{p})}$$

As mentioned in the paper, we use the same formulation as proposed by GaussianEditor [a] and kindly refer the reviewer to their paper for further details.

For binary weights (hard assignment), we simply keep a count of the number of foreground and background pixels the Gaussian $g$ splats to in $I^j$,

$$w_g = \frac{\sum_{j}\sum_{\mathbf{p} \in M^j} \mathbb{I}(T_g^j(\mathbf{p}) > 0)}{\sum_j\sum_{\mathbf{p} \in I^j} \mathbb{I}(T_g^j(\mathbf{p}) > 0)}$$

This $w_g$ is used directly in Equation 3 in the paper. We show the IoU on several scenes comparing soft and hard assignments below. Since the soft assignment has marginally better performance, it is our default implementation.

Scene                | Soft assignment | Hard assignment
Fern (NVOS)          | 83.06           | 82.56
Fortress (NVOS)      | 97.94           | 98.12
Leaves (NVOS)        | 95.95           | 95.60
Lego (SPIn-NeRF)     | 89.18           | 88.95
Pinecone (SPIn-NeRF) | 91.89           | 91.99
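For concreteness, below is a minimal sketch (not the authors' code) of how such soft and hard weights could be accumulated from the rasterizer's per-pixel contributions; the `contribs` and `masks` inputs are hypothetical stand-ins for quantities the rasterizer is assumed to expose.

```python
# Hedged sketch: accumulate per-Gaussian foreground weights from 2D masks.
# contribs[j] is assumed to list, for view j, tuples of
# (gaussian_id, pixel_index, opacity * transmittance) produced by the rasterizer;
# masks[j] is the flattened binary foreground mask for that view.
import numpy as np

def coarse_weights(contribs, masks, num_gaussians, hard=False):
    fg = np.zeros(num_gaussians)
    total = np.zeros(num_gaussians)
    for j, view_contribs in enumerate(contribs):
        for g, pix, alpha_T in view_contribs:
            val = 1.0 if hard else alpha_T     # hard = binary count, soft = alpha-weighted
            total[g] += val
            if masks[j][pix]:                   # pixel is foreground in mask M^j
                fg[g] += val
    return np.divide(fg, total, out=np.zeros_like(fg), where=total > 0)
```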

Novelty concerns

Regarding the novelty, the graph construction proposed in our method is significantly different from the point cloud literature. Our proposed method is much simpler, and both of the approaches mentioned by the reviewer involve training networks to obtain the segments.

  • GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds: This method does not employ graph cut for segmentation. It learns per-point features, extracts superpoints, progressively grows those superpoints, and performs clustering for segmentation.
  • SAM-guided Graph Cut for 3D Instance Segmentation: While this method does use graph cut on point clouds, their formulation differs significantly from our approach. They apply graph cut on superpoints which are obtained from another pre-segmentation model. Therefore, when an object to be selected is part of a superpoint, this technique is unable to segment it out. Our approach, on the other hand, provides finer user control as we do not rely on any point cloud pre-segmentation module. Their primary method involves training a graph neural network (GNN) using high-quality pseudo labels whereas our approach is training-free. Their non-GNN baseline also requires masks from multiple views to assign edge weights. Our proposed approach can work even with a single mask. Moreover, since this method stores SAM features, adapting it for 3DGS would require a much higher memory footprint than our approach (please see discussion section).

More generally, while some advancements in point cloud segmentation may be transferable to 3DGS, our coarse splatting (Section 3.3) and terminal links (Section 3.4) are heavily tailored to 3DGS.

[a] GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting, CVPR 2024

Comment

Thanks for the detailed response. The reviewer appreciates it. Please include those details in the revised version. I will increase my rating accordingly.

Comment

We thank the reviewer for the positive feedback and for expressing the intention to increase the rating. We appreciate the constructive review and will ensure that the discussed changes are included in the revised version of the paper.

Review (Rating: 6)

This paper presents GaussianCut, a new method for foreground/background segmentation of 3D Gaussian scenes. GaussianCut relies on 2D image/video segmentation masks, which are generated for a subset of the training images. GaussianCut propagates these masks to the 3D Gaussians by projecting each Gaussian onto each mask and averaging the corresponding mask values. It then assembles the Gaussians into a graph where each Gaussian is connected to its nearest neighbors, defines several energy terms (for nodes and edges), and uses these energy terms alongside a graph cut to partition the Gaussians. Unlike several baselines, GaussianCut can be applied to pre-trained 3D Gaussian scenes without retraining/fine-tuning.

Strengths

  • In Figure 4, the qualitative results of the proposed method look much sharper than those of the baselines. Also, the quality metrics for the proposed method improve over the baselines slightly.
  • Figures 5 and 7 make it clear that the proposed method is fairly robust to bad 2D input segmentations.
  • The paper is clear and easy to follow.
  • The proposed method is simple.

Weaknesses

  • One major weakness is that none of the tables include the runtimes of the proposed method and the baselines. This is important because segmentation is arguably much more useful if it runs at interactive speeds. The paper also does not list the runtime of the simpler coarse splatting variant/ablation. I think it is crucial for the paper to list both the training-time overhead and segmentation runtime of each method.
  • Another major weakness is that the paper includes/omits baselines in the tables very inconsistently. For example, SAGD and ISRF are featured in Figure 4, but not in Table 3. On the other hand, MVSeg and SAGA are featured in the table, but not the figure. I would be much more confident in the results if the baselines were always included (when appropriate).

Overall, I would be happy to raise the paper's score if the above two major weaknesses were addressed. My score for contribution is held back by the fact that some of the baselines (e.g., SAGA) are significantly faster to run despite having similar quality.

  • Another smaller weakness is that the paper does not include visualizations of the energy terms, high-confidence clusters, etc. These would be very helpful for building the reader's intuition for the method.

Minor nitpicks:

  • Line 55/56: The following sentence should perhaps include citations: "Recent works have also explored segmentation with Gaussians as the choice for scene representation."
  • Line 101: I was confused about what is meant by "decomposing the boundary Gaussians."
  • Line 144: "Guassians"
  • Line 169: The graph is defined as $(|\mathcal{G}|, \mathcal{E})$. This should be $(\mathcal{G}, \mathcal{E})$ instead.
  • Line 170: The definition of the neighborhood is listed as $\mathcal{N} \subseteq |\mathcal{G}| \times |\mathcal{G}|$, but the neighborhood is defined in terms of nodes ($\mathcal{N} \subseteq \mathcal{G}$) and not edges (which would be $\mathcal{N} \subseteq \mathcal{G} \times \mathcal{G}$ anyway) in the next sentence.
  • Line 264: Units (dB) should be listed for the PSNR differences.
  • Line 296-297: The time cost for segmentation technically grows linearly, but the constant factor is big enough that this doesn't matter. I would update the sentence here to be more precise.
  • The single mask ablation (line 298) should probably be included in Table 5.

Questions

  • Line 171-172: The authors state that "Gaussians that map to the same object would be closer spatially." This seems reasonable, but isn't always the case. For example, dull specular highlights are often represented via transparent surfaces and "clouds" inside objects. Does the proposed method handle these cases well?
  • The proposed method has a number of hyperparameters. How sensitive is it to these hyperparameters?
  • How good is the coarse splatting baseline with well-chosen hyperparameters? I think visualizing a sweep of the threshold in the appendix would convince the reader that the proposed approach works better no matter what threshold is chosen.
  • How are the frames ordered before they are passed to the video segmentation model? How sensitive is the method to this ordering?
  • To what extent does the proposed method's performance rely on SAM-Track's quality? Do any baselines rely on different video segmentation models, and if so, how do the metrics change when those are updated to use SAM-Track?

Limitations

Yes.

Author Response

We thank the reviewer for their thoughtful and constructive feedback. We have provided clarifications throughout and added additional experiments to address the concerns they raised. We appreciate the reviewer's attention to detail and will incorporate all of their suggestions.

Runtimes of proposed method and baselines

We have added all the runtimes suggested by the reviewer and we kindly refer them to Table 2 (global response).

Inconsistent baselines

We ran all the baselines on the SPIn-NeRF dataset (10 scenes). For ISRF, we get OOM issues at 1008 resolution for 360-degree inward scenes, so we resize the images to 1/4 resolution.

Method           | IoU  | Acc
MVSeg            | 90.9 | 98.9
SA3D             | 92.4 | 98.9
SAGD             | 89.7 | 98.1
SAGA             | 88.0 | 98.5
ISRF             | 71.5 | 95.5
Coarse Splatting | 91.9 | 98.9
GaussianCut      | 92.9 | 99.2

For the qualitative results in Figure 4, since MVSeg (part of SPIn-NeRF) is designed for the inpainting task, extracting the segmentation module was challenging. For SAGA, we ran out of memory for the garden scene on an RTX 4090 GPU. This is because the scene has considerably more images (185 images, ~4.4 million Gaussians) and SAGA stores a 32-dimensional feature for each Gaussian. Therefore, we were not able to include SAGA in Figure 4. We will include all the baselines on all the datasets in the final version.

Visualizations of the energy terms

We will be happy to provide visualizations that can improve the clarity of our method. However, it is not clear to us how the energy function can be visualized. The weights are assigned to edges, not nodes, which makes visualization tougher. Also, since the energy function is not optimized over time (it is minimized using min-cut algorithms), we cannot plot the overall energy either. We would like to ask the reviewer for suggestions on how meaningful visualizations could be included.

I was confused about what is meant by "decomposing the boundary Gaussians."

SAGD [a] also does not require any segmentation-aware training given an optimized 3DGS model. It decomposes Gaussians (splits a single Gaussian into two) that may lie at the boundary of an object.

The single mask ablation (line 298) should probably be included in Table 5.

We provide the ablation with a single mask for the NVOS benchmark below. We will also include this in Table 5 in the revised paper.

Method      | IoU
Single mask | 86.6
Coarse      | 91.2
GaussianCut | 92.6

"Gaussians that map to the same object would be closer spatially." assumption justification

Our proposed method is scene-agnostic, i.e., it works regardless of the distribution of the underlying Gaussians. With recent advances in point-based and Gaussian-based techniques, these artifacts might be mitigated, and since our model works on a pre-trained 3DGS, it can adapt to such advances. This assumption worked well for all the scenes we tested.

Sensitivity to hyper-parameters

We share a default setting in the paper which performs reasonably on all our datasets. The sensitivity of each parameter can be very scene-dependent. For instance, in a scene where parts of an object have different colors, a very high weight on the color similarity can adversely affect performance. We show the effect of $\lambda$ (which controls the pull of neighboring vs. terminal edges) and $\sigma$ (the decay constant of the similarity function) on two scenes. The reported metric is IoU.

Scene    | λ = 0.5 | λ = 1 | λ = 2 | λ = 4
Fortress | 97.67   | 97.99 | 97.95 | 97.80
Lego     | 89.15   | 89.18 | 89.18 | 88.49

Scene    | σ = 0.5 | σ = 1 | σ = 2 | σ = 4
Fortress | 96.12   | 97.95 | 97.56 | 96.04
Lego     | 89.20   | 89.18 | 89.18 | 89.19

Coarse splatting baseline with well-chosen hyperparameters

For the four 360-degree inward scenes in the SPIn-NeRF benchmark, we show a sweep of the threshold (the default is 0.3). GaussianCut outperforms coarse splatting at all the thresholds considered. It is worth noting that adjusting the threshold of coarse splatting also improves the quality of the graph cut (as we directly use these weights for the terminal links).

Threshold                 | IoU           | Acc
Coarse@0.1                | 88.47 ± 4.85  | 98.96 ± 0.53
Coarse@0.3                | 89.67 ± 3.18  | 98.94 ± 0.72
Coarse@0.5                | 87.76 ± 3.06  | 98.45 ± 1.50
Coarse@0.7                | 83.30 ± 6.04  | 97.58 ± 2.84
Coarse@0.9                | 72.13 ± 11.26 | 96.08 ± 4.69
GaussianCut w/ Coarse@0.3 | 90.55 ± 3.76  | 99.18 ± 0.41

How are the frames ordered before they are passed to the video segmentation model? How sensitive is the method to this ordering?

We run the camera on a fixed trajectory to obtain renderings from different viewpoints (a spiral trajectory for front-facing scenes and a circular one for 360-degree inward scenes). We limit the number of frames to 30 for front-facing and 40 for 360-degree scenes. However, all the training images can also be used for coarse splatting (Table 6), although this might not be preferred for scenes that have a large number of training images. SAM-Track is quite good for unordered multi-frame images as well. The results in Table 6 were obtained after directly giving all training images to SAM-Track.

To what extent does the proposed method's performance rely on SAM-Track's quality? Do any baselines rely on different video segmentation models, and if so, how do the metrics change when those are updated to use SAM-Track?

All our experiments are based on SAM-Track and we did not experiment with other models. Our method can work with any video segmentation model, and its quality can affect the final performance. Although we start with SAM-Track, our mask quality improves significantly (Figure 11).

[a] SAGD: Boundary-Enhanced Segment Anything in 3D Gaussian via Gaussian Decomposition. arXiv, 2024.

Comment

Thank you to the authors for providing detailed runtimes, more baseline results, additional hyperparameter sweeps, and comparisons against the coarse splatting baselines. These additional details adequately address my concerns about the evaluation, and so I have raised my overall score from 5 to 6. I think the paper's direct impact may be limited by its less-than-interactive runtime and the fact that Gaussian segmentation is a very specific niche, but runtime could be improved in follow-up research on training-free Gaussian segmentation, and so I think the paper is worth publishing.

I would encourage the other reviewers to consider raising their scores above borderline reject. While the proposed method has the disadvantage of not providing interactive segmentation speed, it takes a fundamentally different approach to segmentation compared to the baselines (training-based vs. training-free), and this is valuable in and of itself. The proposed method has the potential to serve as a stepping stone towards more practical segmentation approaches, and with the additional results the authors have provided, I think the reader will have a good sense of the method's strengths and weaknesses. In other words, although the proposed method is not perfect, I think the paper merits more than a borderline reject.

Comment

Thanks for your kind feedback and for recognizing the value of our work!

Review (Rating: 4)

The paper presents a method for segmentation in 3D Gaussian Splatting. Given a prompt in 2D, the proposed approach is capable of segmenting objects of interest from the 3D Gaussians. Specifically, the method first performs 2D segmentation in all training views, which are then propagated into 3D using a technique similar to visual culling. This achieves a coarse 3D segmentation in the initial stage. Subsequently, the method employs graph cuts to refine the 3D segmentation. The paper validates the method on multiple datasets using 2D segmentation metrics and demonstrates the potential of the proposed approach.

Strengths

The paper proposes refining 3D segmentation via graph cuts using carefully-designed energy functions. This step is practical and effective.

It’s surprising to see that the method, even without 2D mask supervision, can still outperform baselines like LERF.

Weaknesses

The paper has several weaknesses as below:

Writing:

  • The method is not well-motivated in the introduction section. Specifically, in lines 24-39, the problem defined in the paper is hard to understand. The authors might want to rewrite this section to make the motivation clearer.
  • Part of the implementation is not clear to me. Specifically, I’m uncertain if the method is training-free for segmentation when provided with a pretrained GS model. If so, it would be beneficial to make this clear, as being training-free is a significant strength compared to other methods.

Method:

  • I’m not entirely convinced by the training-free approach in the proposed method. In my view, learning a per-Gaussian feature for segmentation offers more flexibility. Therefore, I suggest the authors justify their method by demonstrating what the proposed method can achieve that learnable features cannot.

Experiments:

  • The important comparison against baselines that use 3DGS, such as GaussianEditor and LangSplat, is missing. The paper only reports baselines that use NeRF, which is generally worse than 3DGS for the segmentation task. In other words, the baselines are insufficient to justify the proposed method.
  • Both qualitative and quantitative results of 3D segmentation are missing from the results section. From my viewpoint, this is essential to the paper since the proposed method aims to perform 3D segmentation. While 2D segmentation partially shows the performance of the proposed method, it is not the correct metric to validate it. The paper should not avoid this evaluation just because 3D GT segmentation is missing in the datasets used in the paper. In this case, I would suggest the authors try other datasets like ScanNet where 3D GT is available.

Questions

Please address the concerns described above. During the rebuttal, I would expect the authors to better justify the proposed method according to the suggestions I made above. For now, I vote for reject.

Limitations

Yes, they did it well.

Author Response

We appreciate the reviewer's feedback on our paper and thank them for the suggestions that help improve the clarity of our work.

Clearer motivation for the introduction

Our work is motivated by leveraging the explicit nature of 3DGS representation. With user inputs and the underlying geometry of the scene captured by the Gaussians, objects can be segmented from a scene, without requiring any additional optimization of the 3DGS model. We provide a clearer motivation below and we will include the changes in the revised version of the paper.

3D Gaussian Splatting represents a scene using a set of Gaussians, thereby offering an explicit representation of the scene. Prior works on 3DGS scene segmentation augment each Gaussian with a feature that is optimized during the Gaussian fitting, supervised by 2D features. These features provide semantic information and can be used for segmentation. Since 3DGS stores the parameters of each Gaussian explicitly, the size of the feature embedding exacerbates the already high memory footprint of the method. Therefore, such methods have relied on learning low-dimensional features per Gaussian. While this enables a 3D-consistent segmentation, optimizing a per-Gaussian feature significantly increases the fitting time of the Gaussians to the scene. Our proposed methodology is a post hoc technique that operates directly on an optimized 3DGS field without requiring any segmentation-aware fitting. We directly tap into the representation and map each Gaussian to its corresponding object(s). We do this by formulating the Gaussians in a scene as an undirected graph and partitioning the set of Gaussians, allowing for the extraction of subsets that represent specific objects.

Is the method training-free for segmentation when provided with a pretrained GS model?

Yes, our approach is training-free when provided with a pre-trained GS model. We have highlighted this in the abstract, and we kindly refer the reviewer to Figure 1, which shows that GaussianCut operates directly on a pre-trained GS model and user inputs. We will also modify our writing to explain this more clearly. To summarize our implementation, we take an already optimized GS model with user inputs on any one image. The user input is processed into dense masks, and the masks are also propagated to multiple images using a video segmentation model. We then perform a "coarse splatting" step, which assigns each Gaussian a likelihood ratio: the fraction of pixels it splats onto in the foreground (as per the 2D mask) out of the total number of pixels it splats onto. This step does not require any further GS optimization; it is computed by simply rasterizing each view that has a corresponding 2D segmentation mask. Finally, we formulate a graph over the Gaussians that uses this additional likelihood term and the inherent properties of the Gaussians already learned in the initial fitting.
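As a reader's aid, the flow described above can be summarized by the following sketch. Every function here (`run_sam`, `propagate_masks`, `coarse_splatting`, `graph_cut`) is a hypothetical placeholder injected as a parameter, not the released implementation; only the sequence of steps mirrors the response.

```python
# Hedged, high-level sketch of the training-free pipeline described above.
def gaussian_cut(pretrained_gaussians, user_prompt, reference_view, views,
                 run_sam, propagate_masks, coarse_splatting, graph_cut):
    """All callables are placeholders for the components named in the text."""
    mask0 = run_sam(reference_view, user_prompt)               # user input -> dense 2D mask
    masks = propagate_masks(mask0, views)                      # video segmentation across views
    w = coarse_splatting(pretrained_gaussians, masks, views)   # per-Gaussian foreground likelihood
    labels = graph_cut(pretrained_gaussians, w)                # partition Gaussians via min-cut
    return labels                                              # binary foreground/background labels
```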

As mentioned by the reviewer, this is indeed significant as it provides segmentation without requiring any changes to the fitting process. As noted in Table 2 (global response), it also saves substantial time compared to optimizing features.

Comparison with other 3DGS and per-Gaussian feature baselines

We thank the reviewer for suggesting additional baselines. We would like to clarify that we do provide 3DGS-based baselines. We provide SAGA[a] and SAGD[b] results on the LLFF and SPIn-NeRF datasets (Tables 1, 3, 9), both of which are based on 3DGS. We also compare against Gaussian Grouping, LangSplat, and CGC (all of which are based on 3DGS) in Table 10 (appendix). We had not compared against them on all datasets because our method is focused more on interactive segmentation and these methods propose learning a feature field for all the scene elements (a more detailed distinction between these baselines and our method is provided in the global response). Based on the feedback, we have reported Gaussian Grouping and LangSplat comparison on the NVOS benchmark in Table 1 (global response).

3D evaluations missing

We completely agree with the reviewer that 3D GT is indeed a better metric to evaluate 3DGS segmentation. As noted by the reviewer, most datasets and approaches do not provide such ground-truth and benchmarks for this evaluation. SPIn-NeRF, Shiny, and 3D-OVS, while not 3D consistent, provide masks for multiple views to show the efficacy of methods to some extent. Obtaining the ScanNet dataset requires approval from the authors, which could take up to a week. Since we did not have that much time and the 3DGS baselines (SAGA, LangSplat, Gaussian Grouping) have a larger runtime, especially for scenes with a higher number of images, we could not include the quantitative results for the rebuttal. We will include results from Replica or ScanNet in the final paper.

[a] Segment Any 3D Gaussians. arXiv, 2023.
[b] SAGD: Boundary-Enhanced Segment Anything in 3D Gaussian via Gaussian Decomposition. arXiv, 2024.

Author Response (Global)

We thank the reviewers for their helpful and valuable comments and appreciate that they found our paper well-written, easy to follow (DeYw, 24kC, BR3k), and well-motivated (MadC, DeYw). We are delighted to see that the reviewers recognize the performance improvement of our training-free approach on 3D segmentation benchmarks (Qza2, BR3k, 24kC, MadC), and found the extension of graph cuts to 3D Gaussian Splatting interesting (BR3k, DeYw).

We want to use this general response to clearly highlight our key contributions following the suggestions of reviewers BR3k and DeYw, and differentiate our work from feature-based learning approaches (Qza2, DeYw). We ran additional experiments to compare against feature-based baselines (LangSplat[a] and Gaussian Grouping[b]) following the suggestion of reviewers Qza2 and DeYw and show that our model outperforms these baselines. Additionally, we also show a detailed time analysis in Table 2 (global response) following suggestions from reviewers BR3k and 24kC. We found the feedback constructive and will happily incorporate changes suggested by the reviewers.

Key contribution (DeYw, BR3k)

Our key contribution is in proposing a method for object selection that is training-free given a pretrained 3DGS model.

  • Training-free: Our method is a post hoc technique, and unlike prior work on optimizing per-Gaussian features [a, b], our technique saves significant optimization time (Table 2) and memory (as we only store one additional parameter per Gaussian). This is not only novel in the sense of a new capability, but it is also a surprising result: given the larger computational budget of training-based approaches, one would expect them to achieve higher performance.
  • Extension of graph cuts to 3D Gaussians: Our work also contributes to graph cut-based segmentation research. Although thoroughly explored for image segmentation, the extension to 3D Gaussians is non-trivial and has not been considered in previous work. The energy function proposed in our work contains several non-trivial design choices, including modeling distances to neighbors (which is typically not considered for images) and the design of the n-links and t-links (see the sketch after this list).
  • Leveraging underlying geometry information: 2D semantic maps (like SAM), while very robust, can struggle to segment finer details (Figure 11). Gaussians optimized for a scene capture the geometry of the scene, and our method utilizes these (through the color and position similarity) to retrieve fine details. Feature field-based methods can also miss finer details (e.g., the plant decorations in Figure 4).
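As a rough illustration of how such a Gaussian-level graph cut could be set up (our own sketch, not the authors' implementation), the example below uses the PyMaxflow library as an assumed solver choice; the edge and terminal weights, the neighbor list, and the distance measure are placeholders rather than the paper's exact definitions.

```python
# Hedged sketch of a foreground/background graph cut over Gaussians with PyMaxflow.
# positions, colors: (N, 3) per-Gaussian attributes; w: (N,) coarse foreground
# likelihoods; neighbors: precomputed list of (i, j) nearest-neighbor pairs.
import numpy as np
import maxflow

def segment_gaussians(positions, colors, w, neighbors, lam=1.0, sigma=1.0, eps=1e-6):
    n = len(positions)
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(n)
    # t-links: terminal capacities derived from the coarse likelihoods
    # (the mapping to source/sink follows the chosen labeling convention)
    for i in range(n):
        g.add_tedge(nodes[i], -np.log(max(w[i], eps)), -np.log(max(1.0 - w[i], eps)))
    # n-links: similarity between neighboring Gaussians in position and color
    for i, j in neighbors:
        d = np.linalg.norm(positions[i] - positions[j]) + np.linalg.norm(colors[i] - colors[j])
        cap = lam * np.exp(-d**2 / (2 * sigma**2))
        g.add_edge(nodes[i], nodes[j], cap, cap)
    g.maxflow()
    return np.array([g.get_segment(nodes[i]) for i in range(n)])  # 0/1 label per Gaussian
```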

Comparison against feature-based methods (Qza2, DeYw)

Per-Gaussian feature optimization baselines [a-c] alter the fitting process of 3DGS by adding an additional attribute for each Gaussian. Below, we compare our approach with feature-based methods:

  • Use case: Feature-based methods learn features for everything in the scene. While useful, it can limit the flexibility of interactivity with a single object. Our method is more flexible in choosing specific object(s) using positive/negative clicks, or scribbles as we generate the 2D masks after the user interaction.
  • Optimization time: The fitting time of feature-based methods (Table 2 global response) as well as the memory footprint of storing additional features increases significantly, which might not be desirable in all applications.
  • Reliance on video segmentation model: Since we use the 2D masks only while rasterization, we do not require masks from all the viewpoints and can also work with just a single mask (Figure 12, Table 4). However, feature-based models require masks for all training views.
  • Complementary rather than competing: Rather than seeing feature-based methods as a replacement for our method, we see them as complementary. Our energy function can be modified to also include a feature similarity term in Equation 2. We see this as an interesting extension of our work.

Table 1: NVOS results using Gaussian Grouping (GG) and LangSplat (LS). Our approach gives an overall better performance. LangSplat also fails to give good segmentation results for two scenes (Figure 2 of rebuttal pdf).

Scene    | IoU (GG) | IoU (LS) | IoU (ours)
Horns    | 93.61    | 95.99    | 97.03
Fern     | 87.06    | 83.97    | 83.07
Orchids  | 86.80    | 96.25    | 95.81
Flower   | 95.27    | 95.24    | 95.37
Leaves   | 93.00    | 29.26    | 95.96
Fortress | 97.06    | 97.70    | 97.95
Trex     | 81.66    | 19.46    | 83.44
Average  | 90.64    | 73.98    | 92.66

Run time analysis (BR3k, 24kC, MadC)

We compare the run time of our method with [a-d]. We take an average of the run time over 7 scenes from the NVOS benchmark. We divided the run time analysis into 3 stages: pre-processing (this step involves obtaining features for images), fitting time (which involves optimizing the 3DGS/NeRF model), and segmentation time (time between obtaining user prompts and producing segmentation output). Since we operate directly on the pre-trained 3DGS model, our fitting time is significantly lower than other 3DGS-based approaches.

Table 2: Run-time breakdown (in seconds) on the NVOS benchmark

Method            | Preprocessing time | Fitting time     | Segmentation time | Performance (IoU)
SA3D (NeRF-based) | 0                  | 309.14 ± 18.85   | 33.89 ± 13.04     | 90.3
Gaussian Grouping | 13.72 ± 4.63       | 2096.07 ± 251.96 | 0.55 ± 0.09       | 90.6
LangSplat         | 2000.34 ± 1222.19  | 1346.92 ± 247.00 | 0.82 ± 0.02       | 74.0
SAGA              | 71.17 ± 22.74      | 1448.50 ± 205.07 | 0.35 ± 0.05       | 90.9
Coarse Splatting  | 6.11 ± 0.38        | 510.97 ± 106.42  | 19.48 ± 4.31      | 91.2
GaussianCut       | 6.11 ± 0.38        | 510.97 ± 106.42  | 88.77 ± 33.68     | 92.5

[a] LangSplat: 3D Language Gaussian Splatting. CVPR, 2024.
[b] Gaussian Grouping: Segment and Edit Anything in 3D Scenes. ECCV, 2024.
[c] Segment Any 3D Gaussians. arXiv, 2023.
[d] Segment Anything in 3D with NeRFs. NeurIPS, 2023.

Final Decision

This paper proposes a new interactive segmentation method for 3D Gaussian scenes. The reviews of this paper are slightly mixed: two reviewers gave "Borderline reject" while three reviewers were very positive about the paper (two "Weak Accept" and one "Strong Accept"). After carefully reading the authors' rebuttal and all reviewers' comments, the AC agrees with the majority of reviewers that this paper has made valuable contributions to this challenging task and shows satisfactory segmentation results on 3D Gaussian scenes. Furthermore, the idea of introducing graph cut to segment 3D Gaussians without retraining/fine-tuning is novel and interesting. To prepare the camera-ready version of this paper, the authors are required to make the necessary revisions suggested by the reviewers. This decision was discussed with and approved by the SAC.