A Motion-aware Spatio-temporal Graph for Video Salient Object Ranking
Abstract
Reviews and Discussion
This work proposes a trajectory-aware spatial-temporal graph for video salient object ranking. The proposed model includes a spatial correlation graph and a temporal correlation graph. Unlike previous VSOR methods, this work proposes to model instance-level temporal relations. The authors conduct experiments to demonstrate the advantage of their method, and also show its effectiveness in the video retargeting task.
Strengths
- The idea of this work is clear. Unlike previous VSOR methods, which focus on frame-level temporal relations, this work proposes to explore instance-level temporal relations.
- They apply their VSOR model to the video retargeting task and achieve good results.
Weaknesses
- The authors claim that they explicitly model the motion trajectories of each instance (L8-9, L42, L69). However, the proposed method does not support this claim, as it neither tracks each instance nor re-identifies the same instance. Instead, it aggregates local contextual features from the same position across frames to approximate the instance trajectory. This claim may not be accurate.
- Lack of comparison with the latest SOR methods. VSOR is highly related to image-based SOR. In Table 1, the compared image-based methods are [2, 5] (2021); however, more recent methods are available, such as "Bi-directional Object-context Prioritization Learning for Saliency Ranking" (2022 CVPR), "Partitioned Saliency Ranking with Dense Pyramid Transformers" (2023 ACM MM), and "SeqRank: Sequential Ranking of Salient Objects" (2024 AAAI).
- There is a real lack of experimental results to substantiate the claimed advantages. In Table 1, only three methods are compared: two image-based salient object ranking (SOR) methods from 2021 and one video salient object ranking (VSOR) method from 2022. To strengthen the evaluation, it would be prudent to compare the proposed method with more approaches from related fields. For instance, consider video salient object detection methods such as "Shifting More Attention to Video Salient Object Detection" (2019 CVPR) and "Dynamic Context-Sensitive Filtering Network for Video Salient Object Detection" (2021 ICCV). Additionally, exploring video shadow detection methods such as "SCOTCH and SODA: A Transformer Video Shadow Detection Framework" (2023 CVPR) could provide valuable insights.
- Why are the SA-SOR scores on DAVSOD much worse than those on RVSOD? This needs an explanation. In addition, it would be better to include a naive solution as a comparison, since the SA-SOR score is very low. For example, a naive solution is to rank the instances according to their sizes, to see how much the model outperforms such a baseline.
Questions
- How does the model benefit from "Trajectory-wise Contrast" if the motion is too dynamic? For example, in Figure 2, the persons move away from their original positions and leave the expanded bounding box at frame t+1. In that case, the future trajectory cannot be tracked by comparing the features at the same position, as the person has gone outside the bounding box. If the model fails to track the object, how does the model benefit from "Trajectory-wise Contrast"?
- It would be beneficial if the authors included studies investigating how the bounding box area impacts VSOR performance. Is a larger bounding box more suitable for faster motion?
- What does the "||" mean in Eq. 1, Eq. 2, Eq. 3 and Eq. 4? Does it mean concatenation? It would be better to specify this just after Eq. 1.
- Why are the numerical results in Table 1 significantly different from those reported in [1]?
- Eq. 2 and Eq. 3 are confusing and inconsistent with Figure 2. One symbol denotes an instance at frame t while another denotes an instance at the current frame; if so, why does the corresponding term in Eq. 2 not include a superscript? The corresponding symbols in Figure 2 also miss some superscripts. Which terms does the node in Eq. 2 aggregate information from?
- How long does it take to train the model?
Limitations
The authors have discussed the limitations, and I agree that there is no potential negative societal impact of their work.
### 1. Weaknesses
1. Inaccurate claim of "explicitly model the motion trajectories of each instance."
Response
Yes, thank you for pointing out this problem. According to the human visual attention mechanism, our motivation is to measure the magnitude of instance motion across adjacent frames to infer instance-level temporal saliency. Therefore, tracking or re-identifying each instance is helpful but not necessary for our task: instances that move fast and fail to be tracked will obtain a large inter-frame contrast score and consequently a high saliency priority. Also, without tracking labels, it is difficult to model the motion trajectories of each instance accurately. Replacing 'model the motion trajectories' with 'approximate the motion trajectories' seems a more accurate claim.
2. Lack of comparison with the latest image SOR methods.
3. Lack of comparison with related video SOD methods.
Response to 2 and 3
- Since our primary focus is on modeling temporal saliency cues and effectively combining them with spatial saliency information, image-based SOR methods are not the most pertinent way to highlight these key contributions. Therefore, we choose the most related video SOR works to showcase these advancements.
- While video SOD typically highlights the most salient region without explicit instance modeling, video SOR detects more objects to rank their saliency, often relying on a detector for object localization. This results in significant differences in the inference stage.
- However, we do include a comparison in Table R1, which demonstrates that our method achieves significant performance improvements over traditional image-based SOR and video SOD methods. This clearly showcases the effectiveness of our temporal saliency modeling and the proposed spatial-temporal fusion strategy.
4. Why are the SA-SOR scores on DAVSOD much worse than those on RVSOD?
Response
The reason stems from the varying difficulty between the two datasets. As shown in Fig. R4, DAVSOD has low resolution and low-quality appearance, making it challenging to detect all objects successfully. The missing objects consequently result in low SA-SOR scores.
### 2. Questions
1. How does the model benefit from "Trajectory-wise Contrast" if the motion is too dynamic?
Response
As shown in Fig. R1 in the one-page PDF, an instance (e.g., the person in the red box) that moves rapidly and leaves its local context will obtain a large inter-frame contrast score between its features and the features of its local context in the adjacent frame. This results in the instance receiving a higher saliency score, as expected.
2. Is a larger bounding box more suitable for faster motion?
Response
- As mentioned previously, an instance moving rapidly and leaving its local context will obtain a low inter-frame similarity between its features and those of its local context in the adjacent frame, and thus receive a high saliency score as expected.
- In our view, a bounding box that is too large will cause the instance-level contrast to become local-global contrast, increasing the risk of introducing another instance into the comparison. This can generate confusing instance temporal motion cues.
- The comparison shown in Table R2 verifies our argument. A larger bounding box fails to achieve better performance.
3. Some issues in our writing.
Response
The symbol (||) in Eq. 1-4 represents feature concatenation.
4. Why are the numerical results in Table 1 significantly different from those reported in [1]?
Response
Regarding Lin [1], their paper mentions that annotators were invited to re-annotate the data, resulting in differences between our dataset and Lin's dataset. As they declined to release their labels, we retrained their model on our version of the dataset for a fair comparison. The results of the other papers are also based on our reproduction using their open-source code, ensuring that the reported results are authentic and reliable.
5. How long does it take to train the model?
Response
We set the batch size to 1, and training takes approximately 20 hours to converge on an RTX 4090 GPU.
Weakness
- I agree that fast motion leads to a large contrast (between adjacent frames) and may result in a higher saliency score. If so, we can identify the fast-moving objects without the need to track the instances. However, it sounds like the idea is closer to "motion-aware" than "trajectory-aware", as used in the title and throughout the paper.
2, 3: The authors have included two additional image-based SOR methods for comparison in the attached PDF, which helps support the advantages of their method. There are typos in Table R1 (Fang (2021) and Liu (2021) instead of 2022?).
It would be better to include some visual comparison. The visual results are very limited. Overall, I think that the current evaluation part is much stronger than the previous version, but it may still fall short of the required standard.
- The authors claim that "DAVSOD has low resolution and low-quality appearance, making it challenging". Based on my knowledge, I cannot agree with or accept this explanation. First, there are no statistics showing that the resolution and quality of DAVSOD are worse than those of RVSOD. I have also gone through some examples in DAVSOD, and I think resolution and quality should not be the major problems behind such a low SA-SOR score. Second, the caption in Figure R4 suggests that "Our detector is affected by a large number of non-salient instances"; such an explanation may be too casual. Third, since all methods fail to do well on the DAVSOD dataset, it would be beneficial if the authors could include a deeper analysis of the DAVSOD dataset (covering both images and annotations, as both affect the training results).
Question
Authors' responses have addressed my questions posted in this section.
Summary
However, I maintain my initial rating; my justifications are weaknesses 1-4 above, and I agree with reviewer KoDU that this work seems like a simple extension of the IRSR method (Liu et al., Instance-level relative saliency ranking with graph reasoning).
We feel sincerely grateful to receive your timely and insightful feedback, which has been very helpful in improving the quality of our paper. Following your suggestions, we have conducted additional in-depth experiments and analyses, and we hope the information provided in the following responses will further address your concerns.
Response to Weakness 1
Determining the title has been a challenge. "Motion-aware" seemed too ambiguous, as global temporal contrast can also be considered "motion-aware". We wanted to emphasize instance-level motion modeling, so we used "trajectory-aware" as an approximation, but this somewhat overclaims, since we cannot accurately track each instance. "Instance-wise motion-aware" may be a better choice to convey our focus.
Response to Weakness 2 and 3
- Sorry for the typos. The publication year is 2021, not 2022.
- Due to limited space in the PDF, we are unable to provide extensive visual comparisons. However, we agree that more visual examples would greatly help verify the advantages of our method. We carefully compared the visual results of different methods across various scenes and draw the following conclusions:
- Image SOR vs. Video SOR: Image SOR methods tend to highlight spatial saliency cues like large, close objects or distinct appearances, while ignoring the temporal saliency from dynamic cues like fast motion or large postural changes. Such visual differences can be seen in Fig. 1, 3 and 4 in the manuscript.
- VSOD vs. VSOR: VSOD methods sometimes cannot obtain complete salient instances, as they lack instance-level priors (e.g., VSOD incompletely segments the partially visible pedestrian in the select_0247 video). In contrast, our VSOR model can fully segment salient instances by leveraging object detection. Additionally, VSOR focuses on the contrast among salient instances, while VSOD may be distorted by background noises. While VSOD can only identify salient objects from backgrounds, VSOR can further rank their saliency.
We will add more diverse visual comparisons in the future.
Response to Weakness 4
We perform an in-depth study of the DAVSOD test set. The results in the table below classify scenes by challenge, proportion, and examples, and report ranking (SA-SOR) and detection (mAP) performance using our model trained on the full DAVSOD training set. Analysis of these results reveals two key reasons for the low performance:
| Category | SA-SOR | mAP | Proportion | Example |
|---|---|---|---|---|
| (a) hard to detect | -0.07 | 0.50 | 5/22 | select_0557, select_0208, select_0572 |
| (b) low quality of labeling | 0.16 | 0.44 | 7/22 | select_0607, select_0345, select_0577 |
| (c) others | 0.45 | 0.70 | 10/22 | - |
a) Severe occlusion among multiple objects, or objects with very small sizes, making it difficult to successfully detect all salient objects (e.g., in the select_0557 video, the instructor and the skydiver blocking each other are perceived as a single entity).
b) Severe variance in salient objects between adjacent frames: The DAVSOD salient object annotations are based on subjective eye fixations from multiple testers, which exhibit increased variance as the number of objects increases. This results in inconsistent and unreliable SOD and ranking labels, as fixations shift significantly across frames in scenes with diverse objects. Compared to RVSOD, DAVSOD contains much more such scenes with multiple objects, where the saliency of individual objects can flicker between salient and non-salient in adjacent frames (e.g., the bull and person in the select_0607 bullfighting video).
We also test Liu’s method on the ‘(c) others’ scenes with relatively reliable labels, and our method achieves a significant ranking improvement over Liu’s (0.45 vs. 0.39) (Liu et al., Instance-level relative saliency ranking with graph reasoning).
In summary, the varying number of objects and instability of salient objects make it challenging to train robust detection, SOD, and ranking models, leading to excessive false positives or missed detections and poor ranking performance.
In the future, we plan to re-label the DAVSOD dataset by introducing human annotations to respect the temporal dynamics while having better temporal consistency.
Difference to Liu’s IRSR
- IRSR focuses on spatial saliency cues. We focus on two new key problems for video SOR: 1) modeling diverse temporal saliency cues, especially instance-level motion; 2) jointly optimizing spatial and temporal cues. These two problems are at the core of VSOR. Our method is simple yet effective and achieves a large improvement over IRSR.
- Additionally, we propose a simple yet effective VSOR-based video retargeting method, largely improving the retargeting performance.
Thanks for the detailed response! Most of my concerns have been addressed. I decide to raise my rating. I hope that:
- The authors can include the additional experimental results (discussed in the rebuttal) in Table 1 and include more visual results (ideally indicating the video name and dataset name for each visual sample) in the revised version.
- Explain concisely in the revision why the results on DAVSOD are much worse than those on RVSOD.
- Releasing a statistics table on both datasets would be helpful. The statistics may include the number of objects per video/image in each split (train and test sets), the performance (SA-SOR, MAE, mAP/IoU) for each split, and the number of images/videos in each split. Lastly, since the authors have re-split DAVSOD, it would be great if they could release the video names for each split.
This paper proposes a graph-based video salient object ranking method. It introduces a spatial-temporal graph to integrate trajectory-wise spatial and temporal saliency cues. Based on VSOR, this paper proposes a video retargeting method to adjust the videos to different aspect ratios adaptively. Extensive experiments also demonstrate the effectiveness of the video salient object ranking method and the video retargeting method.
Strengths
- A graph-based model is proposed for video salient object ranking.
- This paper synchronizes the spatio-temporal saliency cues in a single graph for joint optimization to facilitate video saliency ranking.
- Based on VSOR, this paper proposes a simple but efficient retargeting method.
- Experiments and ablation studies validate the effectiveness of the method and its components.
Weaknesses
- In Figure 4, the ablation visual results in the first and second examples are very similar. I cannot see the effectiveness of different ablation models according to these visual results.
- The main purpose of retargeting is to find the correct window for each frame. The seam carving [12] can find better windows for the salient instances. Even if some distortions exist in these instances, they can be easily solved by a post-processing method. For example, we can first get the window position (x1, y1, x2, y2) by calculating the top-left and bottom-right points, then crop the original image according to the window position, which eliminates the distortions in these salient instances (see the sketch after this list). By the way, the cropped regions in Figure 5 are inconsistent with Figure 6.
- Some symbols are confusing, e.g., the symbol (||) in Equation (4).
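For concreteness, here is a minimal sketch of the post-processing described in the second point above (it illustrates the reviewer's suggestion under the assumption that salient-instance boxes are already available; it is not part of the paper's method):

```python
# Sketch of the suggested post-processing: take the union of the salient-instance boxes
# as the crop window and crop the ORIGINAL frame, so no seam-carving distortion remains.
import numpy as np

def crop_window_from_boxes(frame, boxes):
    """frame: [H, W, 3] image; boxes: list of (x1, y1, x2, y2) for the salient instances."""
    boxes = np.asarray(boxes, dtype=int)
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()   # top-left corner of the union window
    x2, y2 = boxes[:, 2].max(), boxes[:, 3].max()   # bottom-right corner of the union window
    return frame[y1:y2, x1:x2], (x1, y1, x2, y2)
```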
Questions
The writing of this paper needs to be further improved. Some concepts are not clearly illustrated. For example, the paper mentions trajectory features often (e.g., in the title, abstract, and method) but does not explain where and how they are generated.
Limitations
Yes, the authors clearly illustrate the limitations of the paper.
### 1. Weaknesses
1. Similar ablation visual results in the first and second examples in Figure 4. I cannot see the effectiveness of different ablation models according to these visual results.
Response
- Yes. The methods 'w/o TRM', 'w/ GTRM', and 'w/ ITRM' indeed share similar results, highlighting the instances whose static appearance cues are very distinctive (e.g., the person with a large size and close to the camera in the first example), as they all lack instance-wise motion modeling. However, when it is difficult to differentiate saliency by static appearance alone (e.g., the instances in the second example), these methods show different performance. As shown in the last two columns, 'w/o TRM' and 'w/ GTRM' fail to identify the person as salient, due to their reliance on spatial cues or rough temporal cues obtained by comparing two frames globally. 'w/ ITRM' improves by additionally identifying the person, thanks to eliminating the background influence and considering inter-frame cross-instance contrast. However, these variants are still unable to effectively model instance-wise motion. As a result, they give a wrong saliency ranking.
- In contrast, by incorporating our trajectory-aware temporal correlation modeling into the graph, our method ('Ours' in Fig. 4) can effectively identify instances with noticeable motion cues, leading to the successful highlighting of motion instances and accurate saliency ranking.
2. The main purpose of retargeting is to find the correct window for each frame. The seam carving [12] can find better windows for the salient instances. Even if some distortions exist in these instances, they can be easily solved by a post-processing method. For example, we can first get the window position (x1,y1,x2,y2) by calculating the left top and right bottom points, then crop the original image according to the window position, which can eliminate the distortions in these salient instances. By the way, the cropped regions in Figure 5 are inconsistent with Figure 6.
Response
I may not fully understand your method, but I'm concerned the traditional seam-carving approach is semantic-agnostic, without instance-level awareness or semantic prioritization. It simply calculates image gradients and crops low-gradient regions, which could inadvertently remove uniform areas within instances. As a result, I'm worried it may be difficult to reliably retain all key semantics when determining the cropping window solely by optimizing the top-left and bottom-right coordinates.
3. Some symbols are confusing, e.g., the symbol (||) in Equation (4).
Response
The symbol (||) in Equation (4) represents feature concatenation.
### 2. Questions
1. Unclear concepts. Where and how were the trajectory features generated.
Response
Thank you for your suggestion; we will detail the key concepts to reduce any confusion. To model the trajectory features, we project the absolute position of an instance in the current frame onto the adjacent frames to capture changes in motion over time. To account for potential camera movement and drastic scene changes, we double the size of the instance's bounding box when performing this projection. This enhancement is intended to improve the model's robustness and its ability to reliably track objects, even in the face of significant contextual variations. An instance moving fast will consequently obtain a low inter-frame similarity between its features and those of its projected local context, and thus receive a large saliency score, as expected.
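A minimal sketch of this projection-and-contrast idea is given below (the helper names, feature stride, and the use of RoIAlign-style pooling are illustrative assumptions, not the authors' exact implementation):

```python
# Hypothetical sketch of trajectory-wise contrast: project each instance box of frame t
# onto frame t+1 at the same position, enlarge it 2x, pool features from both frames
# at that location, and use their dissimilarity as an instance-level motion cue.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def expand_box(box, scale, img_w, img_h):
    """Enlarge a box (x1, y1, x2, y2) around its center by `scale`, clipped to the image."""
    x1, y1, x2, y2 = [float(v) for v in box]
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return torch.tensor([max(cx - w / 2, 0.0), max(cy - h / 2, 0.0),
                         min(cx + w / 2, img_w), min(cy + h / 2, img_h)])

def trajectory_contrast(feat_t, feat_t1, boxes_t, stride=16, img_hw=(480, 854)):
    """feat_t, feat_t1: [1, C, H, W] backbone features of frames t and t+1.
    boxes_t: [N, 4] instance boxes at frame t in image coordinates.
    Returns one motion cue per instance; a larger value means stronger apparent motion."""
    img_h, img_w = img_hw
    ctx_boxes = torch.stack([expand_box(b, 2.0, img_w, img_h) for b in boxes_t])
    # Pool the local context at the SAME image position in both frames.
    f_t = roi_align(feat_t, [ctx_boxes], output_size=7,
                    spatial_scale=1.0 / stride, aligned=True).flatten(1)
    f_t1 = roi_align(feat_t1, [ctx_boxes], output_size=7,
                     spatial_scale=1.0 / stride, aligned=True).flatten(1)
    # Low cross-frame similarity -> large contrast -> likely salient motion.
    return 1.0 - F.cosine_similarity(f_t, f_t1, dim=1)
```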
This paper introduces a graph model for the video salient objects ranking task. Distinguishing itself from prior research, this study incorporates instance trajectory modeling to amplify temporal saliency cues. Additionally, the authors present a cohesive optimization approach that seamlessly integrates spatial and temporal saliency cues in an adaptive manner. Also, a VSOR-based video retargeting method is introduced, demonstrating notable advantages over existing techniques. Experimental results underscore the exceptional performance of the proposed spatial-temporal graph for VSOR, as well as the effectiveness of the accompanying video retargeting model.
Strengths
- In general, the presentation of the methodology is clear, concise, and easy to follow, making it accessible to readers.
- The literature review is thorough and well-analyzed, providing a solid foundation for introducing instance trajectory cues as a reasonable motivation.
- The proposed method effectively addresses the limitations of previous temporal saliency modeling approaches, demonstrating its well-grounded nature. The trajectory-wise temporal graph introduced is highly effective, and the joint optimization strategy for integrating spatial and temporal graphs seems to be both reasonable and adaptive.
- The authors comprehensively consider various spatial and temporal saliency cues, employing an integration strategy that fully explores VSOR, resulting in significant experimental improvements.
- The proposed video retargeting method based on VSOR is intriguing. The concept of utilizing salient object ranking to determine the cropping center is both intuitive and innovative, offering a simple yet effective solution.
Weaknesses
- I would appreciate seeing some failure cases and their corresponding analyses for the proposed method. Including these would provide a more comprehensive understanding of the properties and limitations of the approach.
- While the proposed video retargeting method outperforms previous crop-based solutions in localizing the saliency center and preserving key semantics, it's important to acknowledge that crop-based approaches inherently struggle when salient objects are dispersed across different areas, potentially leading to the loss of foreground regions.
- To clarify the dataset collection and annotation process in Section 4.1, presenting a flowchart would be a helpful addition.
Questions
- I am interested in understanding the testing time performance of both the proposed graph model and the video retargeting model. This information would provide valuable insights into the practicality and efficiency of the approach.
- I am curious about how the proposed ITRM and TTRM contribute to the salient object detection and ranking processes. A comparison involving representative samples would be insightful, showcasing the specific benefits of these modules in inferring salient objects.
- Regarding the determination of local context size in line 141, the authors mention doubling the size. However, more details on the criteria or heuristic used to determine this size would be beneficial for a complete understanding.
- In Figure 3 and Figure 4, it would be helpful to specify which dataset the samples were chosen from. Additionally, in Figure 2, a brief explanation of the different colors or line types used to represent various elements or metrics would improve clarity.
Limitations
Yes.
### 1. Weaknesses
1. Some failure cases.
Response
As shown in Fig. R2, the final saliency inference is highly influenced by the object detector performance. Poor detector output leads to inaccurate instance features and wrong saliency ranking.
In the future, we plan to address this in two ways:
- Adapting the detector to incorporate saliency priors, enabling joint optimization of detection and saliency estimation.
- Exploring a unified model that learns detection and saliency estimation simultaneously, allowing the components to benefit from each other's signals.
2. Limitation of crop-based approaches when salient objects are dispersed across different areas, potentially leading to the loss of foreground regions.
Response
Yes, the cropping-based method does have the common limitation of being constrained by the original image aspect ratio. However, from a practical business implementation perspective, the cropping approach is the safest option as it faithfully preserves the content of the original image without introducing any distortions.
This is an important consideration, as users may become uncomfortable with visual distortions introduced by more aggressive techniques like seam carving. Maintaining the integrity of the original image can help avoid potential user criticism or dissatisfaction, which is a crucial factor for real-world deployment.
3. The flowchart for dataset collection and annotation.
Response
Thank you for your suggestion. The flowchart can be found in Fig. R3 in the one-page PDF, and we will add it in the final version if the paper is published. We select videos with varying object numbers and then determine the saliency rank by the given instance masks and comparing the given fixation numbers for each instance.
### 2. Questions
1. Testing time.
Response
For video salient object ranking, our model can process 20 frames per second on an RTX 4090 GPU. In practical applications, increasing the batch size can further improve the inference speed. The inference time for the retargeting part is 820 frames per second.
2. How do the proposed ITRM and TTRM contribute to the salient object detection and ranking processes.
Response
The ITRM explores global temporal cues by comparing all instances in adjacent frames, while the TTRM further captures instance-wise motion cues by comparing each instance and its local context in adjacent frames to approximate the instance trajectory.
As shown in Figure 4 in the manuscript, comparing the results of the fifth and sixth rows, we can see that the ITRM highlights an instance distinguishing from all instances in adjacent frames (e.g., the person with a large size in the first example), but tends to overlook instances without a distinctive appearance but exhibiting rich motion patterns (e.g., the blurry person with a small size in the first example).
In contrast, by capturing instance-wise temporal motion cues, our proposed TTRM method better highlights instances with significant temporal motion, even if they are smaller in size and undistinguished in appearance. This suggests that our TTRM is more effective at capturing the nuanced saliency cues associated with object dynamics and movement.
3. Determination criteria of local context size.
Response
We set it empirically without experimental analyses. To further validate this design choice, we conducted ablation experiments to test the impact of different scaling factors, examining 1x, 2x, 3x, and 4x scaling.
The experimental results presented in Table R2 demonstrate that doubling the bounding box area, effectively a 2x scaling, achieved the best overall performance. This suggests that this level of spatial context expansion was the most beneficial for accurately capturing the relevant saliency cues, without introducing too much extraneous information.
4. Specify which dataset the samples were chosen from in Figure 3 and Figure 4. A brief explanation on different colors or line types in Figure 2.
Response
The scenes in Figure 3 are taken from the RVSOD dataset, while the scenes in Figure 4 are from both the RVSOD and DAVSOD datasets. In Fig. 2, the different colors represent different instances.
Thank you for your suggestions, and we will make improvements based on your feedback.
This paper proposes a video salient object ranking approach based on spatio-temporal graph, leveraging instance trajectories and spatio-temporal saliency cues to improve SOR accuracy. Experiments demonstrate the superiority of the proposed model.
Strengths
Originality: The proposed approach has some original aspects.
Quality: The overall quality of the paper is average.
Clarity: Certain sections of the paper lack clarity.
Significance: The proposed approach is significant to some extent.
Weaknesses
● It seems that the proposed approach is a temporal extension of [3]; I cannot see many new insights in this paper.
● In the introduction, the motivation for introducing a GNN for video salient object ranking is not well described: why use a GNN in both the spatial and temporal dimensions of videos? The authors should briefly describe why and how they construct GNNs for SOR in this section.
● Experiments are insufficient; some typical SOR approaches are not compared.
● The approach for approximating instance trajectories may not be sufficiently accurate or generalizable in real-world scenarios, as it merely compares the features of an instance at the same spatial position across different frames.
● Some missing related works:
○ [1] Qiao et al. HyperSOR: Context-aware graph hypernetwork for salient object ranking
○ [2] Guan et al. SeqRank: Sequential Ranking of Salient Objects
● Reference:
○ [3] Liu et al. Instance-level relative saliency ranking with graph reasoning
Questions
How do you obtain the saliency rankings and instance masks for the DAVSOD database?
Why set the batch size to 1 during training?
Limitations
From my point of view, the "Instance interaction" only considers temporal correlations of instances between two adjacent frames, which may be insufficient for modelling long-range temporal dependencies.
### 1. Weaknesses
1. It seems that the proposed approach is a temporal extension of [1].
Response
Unlike [1] that focuses on spatial saliency cues for static images, we focus on two new key problems for video SOR: 1) Modeling diverse temporal saliency cues, especially instance-level motion variations; 2) A unified graph to optimize spatial and temporal cues jointly. Additionally, we propose a simple yet effective VSOR-based video retargeting method, which significantly improves the performance of retargeting in preserving the key semantic information from the original video content.
2. The motivation for introducing a GNN for video SOR: why use a GNN in both the spatial and temporal dimensions of videos? How are the GNNs for SOR constructed?
Response
- As noted in prior work [1], the intricate interaction and competition relationships among different object instances are the most critical cues for inferring their relative saliency.
- Similarly, optimizing the delicate balance between the cooperative and competitive relationships of spatial and temporal saliency cues is the key to accurately determining the overall saliency priority in videos.
- However, these complex relational dynamics cannot be effectively captured by traditional CNNs. In contrast, GNNs offer a powerful framework to explicitly model diverse relationships and optimize their combination for joint saliency inference by learning the edge connections among nodes.
- The features at different spatial scales (i.e., each instance, its local context, and the global context), together with each instance's local context in adjacent frames, are treated as nodes. Edges are then constructed among them to build multi-scale spatial relations and instance-wise temporal correlations. By optimizing these edges, the GNN can adaptively combine diverse spatio-temporal relationships and saliency cues for joint inference. A schematic sketch is given below.
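The following toy sketch illustrates one way such a node/edge structure could be assembled (the node types and the dense instance-instance connections are simplifying assumptions for illustration, not the paper's exact graph definition):

```python
# Simplified sketch of assembling a spatio-temporal graph for one frame:
# instance nodes, their local/global context nodes, and the instances'
# local contexts in the adjacent frame, connected by spatial and temporal edges.
import itertools
import numpy as np

def build_graph(num_instances):
    """Return node labels and an edge list for a frame with `num_instances` instances."""
    nodes, edges = [], []
    g = len(nodes); nodes.append("G")                  # one global-context node per frame
    for i in range(num_instances):
        inst = len(nodes); nodes.append(f"I{i}")       # instance node
        loc = len(nodes); nodes.append(f"L{i}")        # local-context node (2x enlarged box)
        tmp = len(nodes); nodes.append(f"T{i}")        # same local context in the adjacent frame
        edges += [(inst, loc), (inst, g), (loc, tmp)]  # multi-scale spatial + temporal edges
    # Instance-instance edges for cross-instance competition.
    inst_ids = [k for k, n in enumerate(nodes) if n.startswith("I")]
    edges += list(itertools.combinations(inst_ids, 2))
    return nodes, edges

nodes, edges = build_graph(num_instances=3)
adj = np.zeros((len(nodes), len(nodes)), dtype=int)
for a, b in edges:
    adj[a, b] = adj[b, a] = 1                          # undirected adjacency for message passing
```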
3. Insufficient comparison to some typical SOR approaches.
Response
- Our primary focus is on modeling temporal saliency cues and effectively combining them with spatial saliency information. Therefore, we believe directly comparing our approach to image-based SOR methods is not the most pertinent way to highlight these key contributions.
- However, we do include such a comparison in Table R1, which demonstrates that our method achieves significant performance improvements over traditional SOR methods, even when they adopt a more powerful backbone (e.g., the Swin Transformer in SeqRank) than ours (ResNet-50). This clearly showcases the effectiveness of our temporal saliency modeling and the proposed spatial-temporal fusion strategy.
4. The approach for approximating instance trajectories may not be sufficiently accurate or generalizable in real-world scenarios, as it merely compares the features of an instance at the same spatial position across different frames.
Response
- Our core motivation is to measure the magnitude of motion of individual object instances across adjacent frames, in order to infer instance-level temporal saliency in accordance with the human attention mechanism. Therefore, while tracking or re-identifying each object instance can be helpful, it is neither our original motivation nor a strict requirement for our task.
- In fact, instances that move quickly and fail to be accurately tracked will consequently exhibit large inter-frame feature contrast, and thus obtain a high saliency priority (see Fig. R1 in the PDF). Conversely, even if we successfully track a rigid instance (e.g., a fast-moving football), directly comparing the appearance features of the two complete instances will result in a high similarity score and low saliency. Hence, the key to determining temporal saliency lies not in the tracking itself, but in the quantification of instance-level motion dynamics.
- Additionally, without access to ground truth tracking annotations, it becomes quite challenging to model the motion trajectories of each individual instance accurately. Instead, our approach focuses on directly quantifying the magnitude of motion at the instance level, which can effectively capture the temporal saliency cues even in the absence of precise tracking information.
5. Some missing related works.
Response
Thank you for providing these related works. Both are image-based SOR models, and therefore focus on exploring spatial saliency cues. Specifically, Qiao et al. investigate the influence of scene context on saliency ranking by constructing a new dataset with varying contexts and building a hypergraph to model diverse spatial relations. Guan et al. formulate the image-based SOR as a sequential and continuous process.
Unlike these approaches that focus on spatial cues, our core motivation lies in modeling temporal saliency cues and adaptively combining spatial and temporal ones. We will include more related works in the manuscript to enhance the literature review.
### 2. Questions
1. How do you obtain the saliency rankings and instance masks for the DAVSOD database?
Response
The DAVSOD dataset provides us with eye fixation distributions and instance-level masks. We determine the saliency level of different instances based on the total number of fixation points assigned to each, i.e., an instance with more total fixation points will be assigned a higher saliency priority.
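A minimal sketch of this fixation-count ranking rule (array shapes and input formats are illustrative assumptions):

```python
# Sketch: accumulate fixation points inside each instance mask and rank instances
# by the accumulated count (rank 1 = most salient).
import numpy as np

def rank_by_fixations(fixation_map, instance_masks):
    """fixation_map: [H, W] non-negative fixation counts or density.
    instance_masks: [N, H, W] boolean masks, one per instance."""
    scores = np.array([fixation_map[m].sum() for m in instance_masks])
    order = np.argsort(-scores)                 # descending by accumulated fixations
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks
```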
2. Why set the batch size to 1 during training?
Response
This setting inherits the configuration from [1] without any additional tuning, for the sake of fair comparison.
### 3. References
[1] Instance-level relative saliency ranking with graph reasoning.
After reading the author's rebuttal, I feel that all my concerns have been addressed. I realize that the authors have done sufficient innovative extensions compared to the work of [1]. Besides, the motivations for using GNN and object motions have been well elaborated. Since the authors have addressed my concerns, I revise my rating to accept.
The figures and tables can be found in the attached PDF.
This paper proposes a graph-based video salient object ranking method. It introduces a spatial-temporal graph to integrate trajectory-wise spatial and temporal saliency cues. It also presents a cohesive optimization approach that seamlessly integrates spatial and temporal saliency cues in an adaptive manner. The presented experiments demonstrate the advantage of the proposed method and also show its effectiveness in the video retargeting task. Synchronizing the spatio-temporal saliency cues in a single graph for joint optimization is appreciated as a way to facilitate video saliency ranking. The main concerns raised by the reviewers were clarification of the novelty, insufficient experiments, failure case discussion, and writing. The authors' rebuttal resolved these concerns and convinced the reviewers. During the post-rebuttal discussion, an in-depth analysis of the low performance on DAVSOD was suggested. Using "motion-aware" instead of "trajectory-aware" was also suggested, because the proposed method does not accurately track each instance. The authors addressed these concerns adequately. The reviewers unanimously support this paper in the end. The paper should be accepted, accordingly. All the discussion with the reviewers should be properly incorporated into the final version.