Learning Segmentation from Point Trajectories
Abstract
Reviews and Discussion
The authors propose a loss function that seeks to group the trajectories into low-rank matrices where the motion of object points can be approximately explained as a linear combination of other point tracks. Experiments on the synthetic MOVi-F variant of the Kubric dataset and the real datasets DAVIS 2016, SegTrackv2 and FBMS show that the proposed method outperforms single-sequence methods, single-stage end-to-end methods and multi-stage methods.
Strengths
- The authors address key issues in the field and the contribution is original even if somewhat incremental.
- The proposed method is detailed and reproducible.
- Experiments are relatively well conducted on synthetic and real datasets showing the superiority of the proposed method.
Weaknesses
About the presentation, please clearly state a name/acronym for the proposed method and replace "ours" with it in the comparison tables.
Questions
No questions
Limitations
Yes
We are happy the reviewer has found our work to be original, detailed, reproducible, addressing key issues, and with well-conducted experiments. We also thank the reviewer for suggestions to improve the clarity of our work.
About the presentation, please clearly state a name/acronym for the proposed method and replace "ours" with it in the comparison tables.
We will update the text and tables using the tag "LRTL (Ours)" to denote our Low-Rank Trajectory Loss.
This paper proposes a novel loss function that allows training image object segmentation models based on object motion in videos. Motivated by recent work on self-supervised learning of segmentation using optical flow, the authors propose to use longer point trajectories as an additional self-supervision signal. Related to subspace clustering, the proposed loss function encourages the network to predict segments whose trajectories can be well explained by a few basis trajectories. The predicted segments are merged into a binary segmentation mask and evaluated on standard, real-world segmentation benchmarks. Previous methods based only on optical flow are consistently outperformed, demonstrating the effectiveness of the proposed method.
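To make the subspace-clustering intuition concrete, below is a minimal, hedged sketch (not the authors' exact formulation) of what such a low-rank grouping loss could look like in PyTorch: for each predicted segment, trajectories are softly selected by the mask, and the energy of the weighted trajectory matrix beyond its top-`r` singular values is penalized, so that trajectories grouped together should lie near a rank-`r` subspace. The function name, the weighting scheme and the default `r=4` are illustrative assumptions.

```python
import torch

def low_rank_trajectory_loss(masks, trajs, r=4, eps=1e-6):
    """
    masks: (K, N)  soft assignment of N tracked points to K predicted segments
    trajs: (N, 2T) x/y coordinates of N trajectories over T frames, flattened per point
    r:     assumed rank of the motion within a single segment
    """
    loss = trajs.new_zeros(())
    for k in range(masks.shape[0]):
        w = masks[k] / (masks[k].sum() + eps)   # (N,) normalized soft weights for segment k
        M = w.sqrt()[:, None] * trajs           # weighted trajectory matrix, (N, 2T)
        s = torch.linalg.svdvals(M)             # singular values, in descending order
        loss = loss + (s[r:] ** 2).sum()        # penalize energy outside the top-r subspace
    return loss
```

In practice, the soft mask weights would come from the segmenter's output sampled at the tracked point locations.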
Strengths
- Unsupervised learning of segmentation is an important problem. Several recent methods approached this task using optical flow as a self-supervision signal, extending this line of research to trajectories is a well-motivated idea.
- The mathematical motivation of the loss is very well explained. Without having a deep mathematical background, I could follow the derivation of the loss function without issues.
- Standard benchmarks and modelling components are used for evaluation, which makes it easy to compare the proposed method to previous approaches.
Weaknesses
- It is not described clearly enough what kind of segmentation task is targeted. From the introduction and method section it seems to me that multi-object segmentation is addressed; only at the very end of the method section is it mentioned that the predicted segments are merged into a binary segmentation in some cases.
- To my understanding the task is multi-object segmentation for MOVi and binary segmentation for all other datasets. This should be clearly stated in the experiment section.
- It should be stated in the introduction and method section more clearly that the main task is binary segmentation.
- The proposed method is not compared to models that do not use optical flow for self-supervision. It would be interesting to see how the proposed method compares to other self-supervised segmentation approaches, for example:
- CutLER (Wang et al. 2023) and VideoCutLER (Wang et al. 2023)
- DINOSAUR (Seitzer et al. 2023) and VideoSAUR (Zadaianchuk et al. 2023)
The masks predicted by these models could be merged to obtain a binary segmentation in the same way as for the proposed method.
Questions
- What do the predicted segments look like before merging? Visualization would help to better understand the capabilities and limitations of the proposed method.
- The principle of common fate is not cited in the paper; a reference to the literature on Gestalt psychology would be appropriate (e.g., Wertheimer 1912).
- How well does the proposed method perform on MOVi when estimating trajectories using RAFT and CoTracker? This would allow for better judging how much the proposed method could be improved in the future by using more accurate trajectory estimation methods.
Limitations
The authors address limitations of their work in a dedicated paragraph. Their discussion is brief but adequate in my view.
We thank the reviewer for thoughtful comments and suggestions and are happy that they recognized our work as well motivated, well explained, and easy to compare.
To my understanding the task is multi-object segmentation for MOVi and binary segmentation for all other datasets. This should be clearly stated in the experiment section. [...] the main task is binary segmentation.
Correct! While this is already noted on L261, we will make this clear throughout in the final version.
The reason why several experiments focus on binary segmentation is to allow comparison to relevant prior art which used video benchmarks with binary masks, as these are far more common. The "over-segmentation and merging" approach is also common in the literature, so we also adopt it to be more comparable to prior works.
However, as shown in the MOVi experiment, our loss is indeed more general and not limited to binary segmentation.
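For illustration, here is a minimal sketch of one common over-segment-and-merge heuristic; the thread does not specify the exact rule used by the method, so the border-based background selection below is an assumption rather than the authors' procedure.

```python
import numpy as np

def merge_to_binary(segments):
    """
    segments: (K, H, W) boolean masks predicted by the segmenter for one frame
    returns:  (H, W)    binary foreground mask
    """
    # Mark the image border; the segment touching it most is assumed to be background.
    border = np.zeros(segments.shape[1:], dtype=bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    overlap = [int((m & border).sum()) for m in segments]
    bg = int(np.argmax(overlap))
    # Union of all remaining segments forms the foreground.
    fg = [m for k, m in enumerate(segments) if k != bg]
    return np.any(np.stack(fg), axis=0) if fg else ~segments[bg]
```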
The proposed method is not compared to models that do not use optical flow for self-supervision. It would be interesting to see how the proposed method compares to other self-supervised segmentation approaches, for example CutLER (Wang et al. 2023) and VideoCutLER (Wang et al. 2023), DINOSAUR (Seitzer et al. 2023) and VideoSAUR (Zadaianchuk et al. 2023). The masks predicted by these models could be merged to obtain a binary segmentation in the same way as for the proposed method.
In the table below, we provide a comparison of VideoCutLER and VideoSAUR on DAVIS using the same merging strategy for combining multiple predictions to a binary segmentation as in our method:
| Method | DAVIS |
|---|---|
| Ours | 82.2 |
| VideoCutLER | 67.2 |
| VideoSAUR | 17.5 |
Our method shows a significant advantage. We observe that VideoCutLER has trouble segmenting instances from crowds in the background. VideoSAUR produces imprecise object boundaries, which severely impacts performance when measured using the Jaccard score.
We also note that VideoCutLER and VideoSAUR are trained on much larger video datasets as they, ultimately, attempt to derive a learning signal from DINO features computed across adjacent frames. The goal of our paper is to study whether point trajectories can be used to train a segmentation network, which is a complementary rather than an alternative approach: these forms of supervision could be combined.
What do the predicted segments look like before merging? Visualization would help to better understand the capabilities and limitations of the proposed method.
We include the visualization of segments before merging them in the rebuttal PDF, Figures R1 and R2. Mostly, different entities or parts of entities (legs, heads, other limbs) are segmented out. This suggests that the model learns that both individual instances and articulated parts of the bodies are likely to have coherent motion.
The principle of common fate is not cited in the paper, a reference to the literature on Gestalt psychology would be appropriate (e.g., Wertheimer 1912).
Thank you, we will include a suitable reference.
How well does the proposed method perform on MOVi when estimating trajectories using RAFT and CoTracker? This would allow for better judging how much the proposed method could be improved in the future by using more accurate trajectory estimation methods.
We repeat the experiments in Tab. 1 using CoTracker-estimated trajectories for our method and show the results as "Ours (CoTracker)" below. Note that our experiments on MOVi-F used only the trajectory loss (no optical flow) to match the same input modalities employed by other methods.
| Method | ARI | FG-ARI |
|---|---|---|
| K-Means | 15.26 | 42.53 |
| Ours (GT traj.) | 46.07 | 65.76 |
| Ours (CoTracker) | 38.20 | 57.85 |
The estimated (imperfect) tracks reduce performance, but it still exceeds that of simple K-Means and prior methods. This shows that improving the tracks can further improve the results achieved by the loss.
I thank the authors for their detailed response. My main concerns were addressed well by the additional comparisons and clarifications. With the proposed changes, I am happy to recommend the paper for acceptance.
This paper introduces a method for training a segmentation network using long-term point trajectories as a supervisory signal that complements optical flow. It proposes a novel loss function aimed at grouping these trajectories into low-rank matrices, allowing the motion of object points to be approximately represented as a linear combination of other point tracks. The proposed approach surpasses previous methods in motion-based segmentation, demonstrating the value of long-term motion and the effectiveness of the new formulation.
Strengths
- The introduction describes the problem in detail when introducing the issue.
- The structure of the article is good.
- The experimental results of the method proposed in this paper show a significant improvement.
Weaknesses
- The main contribution of this paper is the proposal of two losses, but whether the losses remain effective with other segmentation methods is not verified experimentally.
- The contribution of the paper in Subspace Clustering is not described clearly.
- The resolution of Fig 3 is relatively low.
- There is a lack of comparison in terms of inference speed.
Questions
- Does the method proposed in this paper work on other segmentation networks as well, and would additional experiments on other segmentation networks help demonstrate the generality of the proposed loss function?
- How is the training time and inference speed of the method proposed in the paper? It would be better to include some quantitative comparison experiments.
- What is the specific contribution of this paper to Subspace Clustering?
- The Flow Estimator and Point Tracker are both frozen during training in this work. Is it possible to also update them during training to leverage the information in the dataset?
Limitations
The authors have acknowledged some limitations of their work. I suggest the authors could further describe the limitations in the generalizability of their approach.
We are happy the reviewer has found our work detailed, well-structured and showing significant improvement. We thank the reviewer for their thoughtful questions and suggestions. We reply to each comment below.
The contribution of the paper in Subspace Clustering is not described clearly. [...] What is the specific contribution of this paper to Subspace Clustering?
We do not consider our loss function a contribution to the area of Subspace Clustering. The main question we explore is how to supervise segmentation networks with the motion information contained in point trajectories (L37).
The reason why we discuss Subspace Clustering is because our loss is inspired by it, and because Subspace Clustering has historically been applied to clustering motion trajectories. However, as we show in our experiments, general Subspace Clustering algorithms such as SSC or LRR might not be suited to training neural networks.
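For context (these formulas are standard and not taken from the paper), the SSC and LRR objectives from the subspace-clustering literature make the difference clear: both solve a joint optimization for an $N \times N$ coefficient matrix $C$ over all $N$ trajectories $X$ at once, which is what makes them awkward to use directly as per-batch training losses for a neural network.

```latex
% Standard SSC (Elhamifar & Vidal) and LRR (Liu et al.) objectives, shown for context.
\[
\text{SSC:}\;\; \min_{C}\ \|C\|_1 \;\; \text{s.t.}\;\; X = XC,\ \operatorname{diag}(C)=0
\qquad
\text{LRR:}\;\; \min_{C,E}\ \|C\|_* + \lambda\,\|E\|_{2,1} \;\; \text{s.t.}\;\; X = XC + E
\]
```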
The resolution of Fig 3 is relatively low.
Thank you; we shall update the size of the picture for the final version.
Does the method proposed in this paper work on other segmentation networks as well, and would additional experiments on other segmentation networks help demonstrate the generality of the proposed loss function?
Yes, the proposed loss function is network-architecture agnostic, as it only requires mask predictions. Thus, any network which predicts masks or has a mask-like representation could be used. In our experiments, we used (a) a scaled-down version of a UNet network when considering single-sequence optimization (L270), and (b) the same network architecture as used in prior work (MaskFormer + DINO) in video segmentation (L273).
In the table below, we experiment with changing the segmenter architecture in the DAVIS benchmark. This shows that we can swap different network architectures with relative ease.
| Network | DAVIS |
|---|---|
| MaskFormer + DINO | 82.2 |
| MaskFormer + Swin-Tiny | 81.2 |
| UNet | 80.6 |
In these results, we did not further optimize batch size, learning rate and other parameters that typically depend on the type/architecture of neural network trained (except training UNet for 60k iterations rather than 5k as it is randomly initialized), so these numbers likely underestimate the actual performance that could be obtained with such networks after hyperparameter tuning.
There is a lack of comparison in terms of inference speed. [...] How is the training time and inference speed of the method proposed in the paper? It would be better to include some quantitative comparison experiments.
In the table below, we provide the inference time comparison using different networks as average FPS during DAVIS evaluation. Note that since our contribution is a loss function, it is network architecture agnostic. Using it does not affect inference time; only the choice of network architecture does. We matched the architecture with prior work for the best comparisons. Below, we show inference speed for different network choices.
| Network | Inference Speed |
|---|---|
| MF + DINO | 3.3 FPS |
| UNet | 6.4 FPS |
As expected, with smaller networks, the inference time is reduced.
A good question is how much of a computational burden our trajectory loss imposes. We measure the time spent in the loss function during the forward pass as 0.3s, or about 14% of the total batch time, which is small.
As we reported in Appendix E.3, training a network for our experiments takes about 3 hours. With our proposed losses, one can train a chosen segmentation network quickly on the benchmark data.
The Flow Estimator and Point Tracker are both frozen during training in this work. Is it possible to also update them during training to leverage the information in the dataset?
This is an interesting idea! As the task under consideration is completely unsupervised, we would not recommend continuing to train the flow and point-tracking networks while training the segmenter. That is because there is a trivial degenerate solution: if the flow and/or point tracker predicts no motion (e.g., by ignoring the input), then the segmentation network can predict an arbitrary segmentation with minimal loss, so training would likely collapse.
However, one could alternate training segmenters and motion predictors on data where point annotations are available. In this work, we concentrate on establishing how trajectory data can be used as an "objectness" signal.
I suggest the authors could further describe the limitations in the generalizability of their approach.
While we showed good results on quite general real-world videos, as mentioned in L324, extending our loss to multi-object segmentation in videos where objects undergo non-rigid articulated motion with the same effectiveness is not trivial. This is, ultimately, because the precise number of objects is unknown, so recovering the appropriate number of instances from masks that might have been separated into parts is not straightforward. This could be overcome by introducing an appearance-based loss across groups of predicted masks, effectively encouraging the network to predict a set of objects, each composed of up to a fixed number of parts, by predicting the corresponding total number of components.
Thanks for the authors' rebuttal; I will keep my score.
This paper proposes a model to process long-term motion and short-term motion simultaneously to achieve motion-based segmentation. Specifically, motivated by subspace clustering, this work proposes a loss function that enables training a neural network to learn motion grouping from both optical flows and point trajectories. It outperforms the previous method in the unsupervised video segmentation task. The qualitative comparison also shows obvious improvement, giving a clearer boundary.
Strengths
- The motivation and method explanation seem to be clear. The paper writing is easy to follow.
- Using a simple example to introduce the low-rank intuition is convincing and reasonable. Based on this core idea, the other smoothing losses and the regular optical-flow loss make learning more effective.
- Experiments show the strength of the proposed strategy. A comprehensive ablation study has been performed to illustrate the impact of each factor.
Weaknesses
- As mentioned in the limitations, the paper's principle assumes that the object is rigid. However, the task that this paper works on includes not only rigid objects -- it's a general video segmentation task. Then it seems that the low-rank theory cannot extend to a general setting. And why not consider local rigidity, like an ARAP loss (SpatialTracker)?
- The paper does not give corner cases or failure cases, especially for non-rigid objects. I hope to see some corner cases, like multiple objects that behave similarly in the short term but differently in the long term. Then it can better demonstrate the motivation of the paper.
Questions
Why does solely using the long-term loss get worse performance than solely using the optical-flow loss (7 percent drop in Table 4)? Though the paper gives a short explanation that it is due to the sparse set of points and noise, long-term motion also has its advantages, e.g., it is more stable than short-term information.
Limitations
As mentioned in the weaknesses, the principle the paper proposes is reasonable, but it seems that it does not fully support the motivation and the ultimate goal of the task. More analysis and experiments are needed to show the right practice when applying the proposed method to real-world videos.
We are glad that the reviewer finds the motivation and method clear, easy to follow, convincing and reasonable, with strong results and comprehensive ablation. We thank the reviewer for constructive comments and helpful suggestions.
the paper's principle assumes that the object is rigid. However, the task that this paper works on includes not only rigid objects -- it's a general video segmentation task. Then it seems that the low-rank theory cannot extend to a general setting.
We do not assume objects are rigid. As we state on L202, relaxing the rank constraint enables supporting non-rigid objects too.
And why not consider local rigid like ARAP loss? (SpatialTracker).
Using an ARAP loss would require additional depth input and training an additional encoder network to embed trajectories. In SpatialTracker, this is done using RGBD data for the trajectory prediction task. Additionally, as with other adjacency-matrix methods, the scaling of a pairwise trajectory loss is quadratic (L111): SpatialTracker uses 256 trajectories during training on 8×A100 GPUs, while we can train with 70k trajectories on a single GPU.
The paper does not give corner cases or failure cases, especially for non-rigid objects. I hope to see some corner cases, like multiple objects that behave similarly in the short term but differently in the long term. Then it can better demonstrate the motivation of the paper.
We included additional visualisations in the Rebuttal PDF showing predicted components as well as the motion in the frame.
Note that our trained network does not observe motion during inference. Because of this, it is robust to lack of motion for all horses in the 2nd sequence (Fig. R1), or similar flow patterns of people in the 3rd sequence (Fig. R1), or even noisy optical flow in the 2nd sequence (Fig. R2). It has learned to separate instances or parts. We show predicted components to highlight that the network learns to predict instances and sometimes limbs as separate masks as they might not move together.
We also include some failure cases in Fig. R2, such as missing very small objects and learning to identify and segment shadows as a component.
Why does solely using the long-term loss get worse performance than solely using the optical-flow loss (7 percent drop in Table 4)? [...] Long-term motion also has its advantages, e.g., it is more stable than short-term information.
Optical flow carries less information than long-term tracking, but it has some advantages. First and foremost, it can be computed at full resolution (even CoTracker, which is efficient, cannot cope with more than 70k tracks, which are fewer than the number of image pixels). This is extremely beneficial when the task requires pixel-level accuracy, such as segmentation. Second, instantaneous motion is easier to predict and model than long-term motion. Hence, when optical flow is sufficient to segment the objects, it generally allows doing so robustly and easily. Long-term tracking is thus complementary to optical flow.
As mentioned in the weaknesses, the principle the paper proposes is reasonable, but it seems that it does not fully support the motivation and the ultimate goal of the task. More analysis and experiments are needed to show the right practice when applying the proposed method to real-world videos.
The main goal of our paper was to establish whether long-term trajectories are a useful source of "objectness" for training segmentation networks and to propose a principled way to accomplish this. We have established a positive answer, showing improvement over prior art in a series of benchmarks. Additionally, we analysed and showed why alternative implementations of this idea do not work as well.
The paper tackles video object segmentation by incorporating into the loss function not only instantaneous optical flow information but also long-term pixel tracking information. RAFT was used for optical flow and CoTracker was used for long-term pixel tracking in the experiments. The experiments show a marginal improvement in performance when combining the two information sources in the loss function.
Strengths
The paper flows quite well: it makes clear that video object segmentation is the problem space and that the focus is on the loss function (Figure 2), and the layout appears clear as well. There are a handful of datasets and comparison methods used in the experiments.
Weaknesses
Table 2, where the experimental results are presented, lists a collection of methods categorized into different groupings. Perhaps these groupings and methods could be better discussed in the literature review; it appears that the categories in the literature review do not correlate nicely, and I do not know the difference between these methods unless I look at the references and read the papers myself. The improvement is incremental. It is expected that there would be some improvement, however in what cases do we actually get the improvement? A bit more depth in the analysis would make this a better paper. I assume that the camera is static, correct? If not, perhaps making this clearer would help. I have no idea how long the long-term point trajectories were; perhaps analyzing this would help. Also, depending on the trajectories, whether there were occlusions or other interesting factors that contribute to the loss function would be interesting to know.
Questions
- I found the related works section reads like a laundry list. You divided the categories into unsupervised video object segmentation, motion segmentation, trajectory-based motion segmentation and subspace clustering. That is fine; however, your focus is only video object segmentation. Why is that, and how can you address the other problem areas?
- I would imagine that if we had 3D scene flow, perhaps by combining monocular depth and optical flow, this would result in good results without long-term tracking?
- why not incorporate appearance information as well?
- Appearance information would suffice for segmentation in the examples; it would be interesting to focus on cases where appearance info is not sufficient for segmentation and we require motion information.
Limitations
I am not sure that the examples actually illustrate that motion segmentation is necessary for these cases. I would focus on cases where appearance information is not enough. Can this system deal with a moving camera or does the camera have to be static? How well does the system work under occlusion? Different motions of the object of interest will result in different performance; perhaps diving into this analysis would be informative. Both sources of info, optical flow and long-term pixel tracking info, are based on 2D info, the projection of 3D info. This has limitations. The paper should have explored different object movements. It does state that non-rigid objects are an issue when dealing with multiple objects; however, an in-depth exploration for a single non-rigid object would be informative.
We thank the Reviewer for constructive comments and suggestions. We reply to concerns below.
Table 2 where the experimental results are presented lists a collection of methods categorized into different groupings. Perhaps these groupings and methods could be better discussed in the lit review.
While our literature review is structured based on related sub-fields, the methods in Tab. 2 are arranged based on the approach to the task: using single-video optimization (L95), training a network in multiple stages (L89) or single-stage training. For clarity, we will summarise the groupings in the respective section.
The improvement is incremental.
We disagree with this assertion. Firstly, we propose a principled loss to derive an objectness signal from a novel modality. Secondly, our proposal shows effective improvement across several benchmarks where improvements have been historically difficult. For example, after OCLR [66] first reached a Jaccard score of 80.9 on the DAVIS benchmark, follow-up works [54,37,59] could only approach or match it for several years, showing the difficulty of the task.
I assume that the camera is static, correct? If not, perhaps making this clearer would help. How well does the system work under occlusion? Different motions of the object of interest will result in different performance; perhaps diving into this analysis would be informative.
The camera does not need to be static (and in almost all scenes it is not); in fact, our loss explicitly models a time-varying camera (see L185). Similarly, we make no assumptions about occlusions or object motion.
In general, the method is robust: the included result videos contain complex camera motion, "zooming" (e.g., dance videos), strong occlusions (e.g., bicycle or dog scenes), self-occlusions and non-rigid motion. Our method works well in quantitative studies and qualitative examples.
I have no idea how long the long-term point trajectories were; perhaps analyzing this would help.
The default trajectory length is 41 frames (L267). We studied the effect of trajectory length in Tab. 7 (Appendix). Longer tracks suffer from increasing tracking errors over longer periods. On the other hand, very short tracks might not be sufficiently informative. Our results in Table 7 support this.
Also, depending on the trajectories, whether there were occlusions or other interesting factors that contribute to the loss function would be interesting to know.
As can be seen in the example tracking visualization video from the supplement, occlusions (in blue) are very common. Point trackers estimate the position of tracked points even when they are occluded, and we make use of this prediction in our loss. We also experimented with different trackers (L566-568, Tab. 8). We found that TAPIR and BootsTAP, which are about as good as CoTracker for visible points, are less reliable for occluded points, which hinders performance.
That is fine; however, your focus is only video object segmentation. Why is that, and how can you address the other problem areas?
The difference between video object segmentation and motion segmentation is that the latter only segments objects in the frames where they are moving, whereas the former segments objects as long as they move at some point. As we note in L80-82, the difference between these two tasks is relatively small, and in fact, often, the same benchmark data is used to assess both. In our case, the network is trained to segment the object given a single image as input at a time, and so it should be regarded as solving video object segmentation as it cannot perceive motion directly. However, it is easy to reuse it for motion segmentation by checking if the segmented region actually moves in a given frame, e.g., by measuring optical flow.
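As a hedged illustration of this post-hoc check (not code from the paper), one could keep a segment as "moving" in a given frame only if the optical-flow magnitude inside its mask clearly exceeds that of the background; the function name and thresholds below are illustrative placeholders.

```python
import numpy as np

def is_moving(mask, flow, ratio=2.0, min_mag=0.5):
    """
    mask: (H, W)    boolean segmentation for one frame
    flow: (H, W, 2) optical flow for the same frame
    """
    mag = np.linalg.norm(flow, axis=-1)                    # per-pixel flow magnitude
    fg = np.median(mag[mask]) if mask.any() else 0.0       # typical motion inside the segment
    bg = np.median(mag[~mask]) if (~mask).any() else 0.0   # typical motion of the background
    return fg > min_mag and fg > ratio * (bg + 1e-6)
```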
I would imagine that if we had 3D scene flow, perhaps by combining monocular depth and optical flow, this would result in good results without long-term tracking?
This is an interesting idea, but we are not aware of works that have done so yet, at least for video object segmentation. One challenge is that it is difficult to obtain robust 3D predictions (see, e.g., the extensive literature on video depth estimation). Should this information be available, our formulation can be extended trivially to 3D trajectories simply by replacing 2D points with 3D ones. In principle, by avoiding camera projection, these 3D trajectories should be statistically "simpler", so a lower value of the rank parameter, e.g., 3 or 4, should suffice.
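For completeness, the bound of 3 or 4 follows from a standard factorization argument (added here for context, not quoted from the paper): the 3D points of a rigidly moving object satisfy $X_t = R_t X_0 + T_t$ at each frame $t$, so the stacked trajectory matrix factors through a 4-column motion matrix and has rank at most 4 (at most 3 after centering out the translation).

```latex
% Standard rank bound for 3D trajectories of a rigidly moving object.
\[
\underbrace{\begin{bmatrix}
X^{(1)}_1 & \cdots & X^{(N)}_1 \\
\vdots    &        & \vdots    \\
X^{(1)}_T & \cdots & X^{(N)}_T
\end{bmatrix}}_{3T \times N}
=
\underbrace{\begin{bmatrix}
R_1 & T_1 \\
\vdots & \vdots \\
R_T & T_T
\end{bmatrix}}_{3T \times 4}
\underbrace{\begin{bmatrix}
X^{(1)}_0 & \cdots & X^{(N)}_0 \\
1 & \cdots & 1
\end{bmatrix}}_{4 \times N}
\]
```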
Appearance information would suffice for segmentation in the examples; it would be interesting to focus on cases where appearance info is not sufficient for segmentation and we require motion information.
Our method does not ignore appearance. First, motion is used as a learning signal to train a segmentation network that takes as input an image, and is thus appearance-based. Second, our approach also uses appearance indirectly, as optical flow and point trajectories are predicted from RGB data.
In any case, the main goal of our paper is to show how point trajectories can be used to derive an "objectness" signal to train a segmentation network. We propose a new loss to do so. We note that our new loss could be combined with other losses that capture complementary sources of supervision, but doing so is orthogonal to our main goal.
For completeness, and as suggested by other reviewers, we do compare with SoTA video object segmentation methods that are appearance-based, i.e., VideoCutLER and VideoSAUR, and obtain better results than them. This also suggests that appearance-based approaches are not sufficient for this data yet (as the gap is considerable).
Both sources of info, optical flow and long-term pixel tracking info, are based on 2D info, the projection of 3D info.
Perspective projection indeed complicates the analysis. As we state on L198, we increase the rank parameter to generalize to a wide range of scenarios in real-world data.
We thank the Reviewers for their thoughtful comments and suggestions. We are happy they found our presentation clear, well-flowing, well-motivated, convincing and reasonable, our results strong and our experiments comprehensive. We reply to each comment individually.
To aid the replies, we also provide a Rebuttal PDF that contains two figures (R1, R2), which we reference. The figures show more examples, including failure cases, from FBMS sequences, alongside motion and predicted components. We shall include these figures, as well as the additional results and discussion presented in the replies, in the final paper.
All of the reviewers are in favor of accepting this paper, to varying degrees. They appreciate the novelty of the loss function, which incorporates long-term temporal information to complement short-term optical flow, and the thorough experiments. There was a robust response to the reviews from the authors that alleviated concerns and contributed to the acceptance of the paper.