PaperHub
Overall rating: 4.7 / 10 (Poster; 3 reviewers; min 4, max 6, std 0.9)
Individual ratings: 4, 4, 6 (average 4.7)
Correctness: 2.7 | Contribution: 2.0 | Presentation: 3.3
NeurIPS 2024

OnlineTAS: An Online Baseline for Temporal Action Segmentation

Submitted: 2024-04-29 | Updated: 2024-11-06

Keywords
Temporal Action Segmentation, Video Understanding

Reviews and Discussion

Review
Rating: 4

This paper proposes an online baseline for temporal action segmentation. The method is built upon causal TCN and integrates a GRU, attention-based feature aggregation, and a memory bank. A heuristic online post-processing method is also proposed.

Strengths

  1. I like the statement claiming the contribution as a baseline for action segmentation. It is a modest and accurate claim. The paper provides a reasonable online baseline for this task. Online action segmentation is also a meaningful task.

  2. The proposed post-processing method is intuitive and effective.

  3. The experiments and the ablation studies show good results.

Weaknesses

  1. The major concern is whether the method can achieve real-time inference speed, which is crucial for the online setting. Currently, no inference speed is reported (e.g., FPS including feature extraction). Although the authors discuss this in the limitations section, it remains a major weakness given that the method is proposed as a baseline for the online setting.

  2. Experiments on new benchmarks such as Assembly101 would be interesting.

  3. I would recommend the authors remove the statement "we are the first to establish a fully supervised online framework for TAS". There are some concurrent works that became available before the NeurIPS submission deadline or slightly after that, e.g., O-TALC and Progress-Aware Online Action Segmentation.

Questions

NA

Limitations

The limitation regarding real-time inference has been discussed. This is good.

Author Response

We thank the reviewer for finding our contribution claims modest and accurate, and for recognizing online TAS as a meaningful task. We address the reviewer's comments below.

Weaknesses


W1. Inference speed

A: Runtime evaluation, conducted on an Intel Xeon Gold 6442Y (2.6 GHz) CPU and a single Nvidia A40 GPU, is reported as time (ms) per frame / FPS in the table below. Note that the inference speed is identical for the online and semi-online modes due to their identical input length.

| Optical Flow Algorithm | Optical Flow Calculation | I3D - RGB | I3D - Optical Flow | Ours (I3D input) | Ours (raw RGB input) |
|---|---|---|---|---|---|
| TV-L1 | 166.5 / 6 | 9.3 / 107 | 11.2 / 89 | 4.2 / 238 | 191.2 / 5 |
| NVOFA* | 1.4 / 714 | 9.3 / 107 | 11.2 / 89 | 4.2 / 238 | 26.1 / 38 |

[*]: NVOFA SDK. https://developer.nvidia.com/opticalflow-sdk

These are our findings:

  1. Our architecture achieves an inference speed of 238 FPS with precomputed I3D features.

  2. Feature extractions for both RGB and optical flow inputs are fast, reaching at least 3x the real-time rate of 25 FPS.

  3. The TV-L1 algorithm used by I3D, not optimized for online requirements, supports 6 FPS. However, Nvidia Optical Flow SDK (NVOFA) with hardware acceleration can boost this speed to 714 FPS.

In conclusion, when using I3D features, the optical flow calculation is the main bottleneck. Real-time application is feasible with more advanced acceleration algorithms.
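For concreteness, the arithmetic behind the table can be reproduced with a few lines of Python (a minimal sketch; the stage names are illustrative, and the numbers are copied from the TV-L1 row above):

```python
# Per-stage latencies (ms per frame) add up along the pipeline,
# and FPS = 1000 / total ms per frame.
stages_tvl1 = {
    "optical flow (TV-L1)": 166.5,
    "I3D - RGB": 9.3,
    "I3D - Optical Flow": 11.2,
    "segmentation model": 4.2,
}

total_ms = sum(stages_tvl1.values())  # 191.2 ms per frame
print(f"{total_ms:.1f} ms/frame -> {1000.0 / total_ms:.1f} FPS")  # ~5.2 FPS

# Swapping TV-L1 (166.5 ms) for NVOFA (1.4 ms) gives 26.1 ms per frame,
# i.e. ~38 FPS, above the 25 FPS real-time threshold mentioned above.
```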


W2. Results on Assembly101

A: Due to limited time, we could only experiment with a subset of a single view (C10095). Results are shown in the table below. Our approach still achieves reasonable performance relative to the offline setup. The pure online setting on Assembly101 is very challenging in terms of segmental metrics (row 2), while our proposed post-processing significantly boosts segmental performance.

| Method | Acc | Edit | F1@10 | F1@25 | F1@50 |
|---|---|---|---|---|---|
| Offline | 37.1 | 30.7 | 31.6 | 27.8 | 20.6 |
| Online | 36.8 | 10.9 | 8.7 | 6.4 | 4.0 |
| Online + p.p. | 33.5 | 28.2 | 29.1 | 25.4 | 17.6 |

W3. Contribution claim

A: Thanks for pointing out these concurrent works. After close examination, O-TALC was released on arXiv in mid-April, indicating acceptance as a short, unindexed paper at TAHIR, while Progress-Aware Online Action Segmentation only became available at CVPR one month after our submission. We thank the reviewer and will revise our statement in the updated manuscript.

Comment

Thanks for the rebuttal.

W1: The method cannot achieve real-time speed using TV-L1. The method can achieve real-time speed using NVOFA, but it is unclear how using NVOFA affects the performance.

I am satisfied with W2 and W3 at this stage.

Overall, I tend to keep my score. But I am fine with acceptance if all others vote for accept.

Comment

We thank the reviewer for acknowledging our response to the weaknesses and for their prompt feedback.

To clarify, “online” refers to processing data incrementally as it becomes available. It does not necessarily imply zero latency (i.e., real-time processing). Many existing video understanding methods that claim to be (and are accepted as) online do not run in real time. For instance, (semi-)online video segmentation methods such as GenVIS [1] operate at 6 FPS and DEVA [2] at 6.6 FPS, while online action detection methods such as Testra [3] run at 12 FPS, MAT [4] at 8.1 FPS, E2e-load [5] at 8.7 FPS, and MATR [6] at 6.0 FPS. Among these, [3,4,6] are not “real-time” because of their optical flow extraction.

Additionally, we have included the speed of the CUDA version of TV-L1 for optical flow calculation, using the same hardware configuration as mentioned above, in the updated table below. With the CUDA version, our approach can achieve a real-time inference speed of up to 33 FPS. We believe that using the CUDA version of the same TV-L1 algorithm will have minimal impact on the overall performance.

All entries are time (ms) per frame / FPS.

| Optical Flow Algorithm | Optical Flow Calculation | I3D - RGB | I3D - Optical Flow | Ours (I3D input) | Ours (raw RGB input) |
|---|---|---|---|---|---|
| TV-L1 (CPU) | 166.5 / 6 | 9.3 / 107 | 11.2 / 89 | 4.2 / 238 | 191.2 / 5 |
| TV-L1 (CUDA) | 4.8 / 208 | 9.3 / 107 | 11.2 / 89 | 4.2 / 238 | 29.5 / 33 |

We hope this clarification addresses the reviewer's concern regarding the inference speed of our approach.

[1] A Generalized Framework for Video Instance Segmentation. CVPR 2023

[2] Tracking Anything with Decoupled Video Segmentation. ICCV 2023

[3] Real-time Online Video Detection with Temporal Smoothing Transformers. ECCV 2022

[4] Memory-and-anticipation transformer for online action understanding. ICCV 2023

[5] E2e-load: End-to-end long-form online action detection. ICCV 2023

[6] Online Temporal Action Localization with Memory-Augmented Transformer. ECCV 2024

Review
Rating: 4

This paper presents the first online framework for temporal action segmentation. At the core of the framework is an adaptive memory designed to accommodate dynamic changes in context over time, alongside a feature augmentation module that enhances the frames with the memory. A post-processing approach is proposed to mitigate the severe over-segmentation in the online setting. The method achieves state-of-the-art performance on three common segmentation benchmarks.

Strengths

  1. The paper is well written and easy to follow.
  2. Exploring online paradigms in temporal action segmentation is meaningful.

Weaknesses

  1. Novelty is a big issue. The proposed adaptive memory bank mechanism has been explored in video object segmentation tasks [1]. Also, a semi-online inference scheme has been proposed in video instance segmentation tasks [3]. [1] Video Object Segmentation with Dynamic Memory Networks and Adaptive Object Alignment, ICCV. [2] Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement, NeurIPS. [3] A Generalized Framework for Video Instance Segmentation, CVPR.

  2. The context-aware feature augmentation (CFA) module mainly consists of self-attention and cross-attention, and the adaptive memory bank has been widely used in video detection and segmentation tasks, so the proposed module is not novel enough.

  3. Is the Trans. Decoder necessary in the CFA module? For example, directly using memory as K and V of CA or performing SA together with ck.

  4. In semi-online inference, why use non-overlapping clips? Would using overlapping clips for sampling and voting classification yield better results?

  5. The improvement in Edit and F1 brought by post-processing is remarkable. Why is there such a big improvement while Acc decreases?

  6. Compared to the latest offline methods, there is still a significant gap in the performance of online methods. The online method has advantages in terms of training cost and inference speed.

Questions

As an online paradigm, why use offline models as baselines instead of building online models from scratch? The solution of replacing standard convolutions with causal convolutions is not very elegant.

Limitations

See weaknesses and questions.

Author Response

We thank the reviewer for acknowledging the task importance and our effective presentation of the work.

Weaknesses


W1&2. Novelty on memory bank, semi-online inference, self- and cross-attention

Thanks for your comments. Indeed, attention and memory banks are common architectural components and semi-online inference is also an established concept. Note that we don’t claim the memory bank, attention mechanisms and semi-online inference as our technical contributions.

The distinction and novelty lie in how the objective and design are tailored to the specific task, rather than in the architectural components or abstract concepts themselves. Our focus is on the unique integration of the memory bank, attention mechanism, and semi-online inference for the online TAS task.

Here are detailed distinctions:

Memory bank

In [1,2], the memory design supports instance matching, where each memory token represents an object instance and is selectively updated via feature similarities [2]. In contrast, our memory design accumulates contextual information for feature enhancement, storing context at different granularities: clip-wise coarse long-term memory and frame-wise fine-grained short-term memory. Our memory capacity adapts over time to retain as much contextual information as possible.

[1,2] update Q directly based on K and V, whereas ours retrieves relevant memory for the query clip with the Trans. Decoder and then enhances the features with SA and CA.
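A minimal sketch of such a two-granularity, capacity-bounded memory is given below; it is an illustration under our own naming and simplifications, not the paper's actual module:

```python
import torch

class TwoLevelMemory:
    """Illustrative two-level context memory (hypothetical naming): frame-wise
    short-term entries plus clip-wise pooled long-term entries, each trimmed
    to a fixed capacity as the stream grows."""

    def __init__(self, short_cap: int = 128, long_cap: int = 64):
        self.short_cap, self.long_cap = short_cap, long_cap
        self.short, self.long = [], []

    def update(self, clip_feats: torch.Tensor) -> None:
        # clip_feats: (T, D) features of the newest clip from the stream
        self.short.extend(clip_feats)              # fine-grained frame tokens
        self.long.append(clip_feats.mean(dim=0))   # one coarse token per clip
        self.short = self.short[-self.short_cap:]  # keep only the most recent
        self.long = self.long[-self.long_cap:]

    def read(self) -> torch.Tensor:
        # (M, D) memory tokens to attend over when enhancing the current clip
        return torch.stack(self.long + self.short, dim=0)
```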

In addition, our method shows over a 30% increase in Acc (Table 9) compared to existing work like LSTR [44], which uses similar attention and memory techniques.

Semi-online Inference

Thank you for pointing out [3]. Semi-online inference is not a claimed contribution; it is presented to offer a more comprehensive evaluation due to the clip-based nature of the model's inputs and outputs.

In summary, as stated in L43-49, our technical contributions are:

  1. Addressing the novel online TAS problem;

  2. Developing CFA leveraging adaptive memory;

  3. A fast and effective post-processing for over-segmentation in the online setup.


W3. Trans. Decoder necessity

A: We include the Trans. Decoder to gather context information specific to the current clip, which we then use to enhance the clip features for better online segmentation.

| Trans. Dec | Mem | Acc | Edit | F1@10 | F1@25 | F1@50 |
|---|---|---|---|---|---|---|
| w/ | w/ | 82.4 | 32.8 | 43.0 | 41.1 | 34.7 |
| w/o | w/ | 80.6 | 29.1 | 39.5 | 36.2 | 29.3 |
| w/o | w/o | 77.3 | 28.3 | 35.9 | 33.6 | 23.1 |

The ablation results above (on 50Salads) indicate that performance decreases when the component is removed and memory is directly fed as K and V for CA (row 2). Further removal of the memory (row 3) leads to an even greater performance drop, highlighting the importance of the context information in the online task.
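For illustration, a rough PyTorch sketch of this data flow is shown below; the dimensions, module choices, and exact wiring are assumptions for exposition rather than the released implementation:

```python
import torch
import torch.nn as nn

class CFASketch(nn.Module):
    """Rough sketch of the described pipeline (not the authors' exact code):
    a Trans. Decoder retrieves memory relevant to the current clip, then the
    clip features are enhanced with self-attention (SA) and cross-attention (CA)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.retriever = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                                    batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, clip: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, D) current clip features; memory: (B, M, D) accumulated context
        context = self.retriever(tgt=clip, memory=memory)          # clip-specific context
        enhanced, _ = self.self_attn(clip, clip, clip)              # SA over the clip
        enhanced, _ = self.cross_attn(enhanced, context, context)   # CA with the context
        return enhanced
```

Removing the retriever (row 2 above) corresponds to feeding the raw memory directly as K and V of the cross-attention.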


W4. Overlap semi-online performance

A: The table below shows that with a stride of δ = 64 and a window size of ω = 128, overlaps lead to better results than the frame-wise online mode but lag behind the non-overlapping mode.

| Method | Voting | Acc | Edit | F1@10 | F1@25 | F1@50 |
|---|---|---|---|---|---|---|
| Semi-online | no | 82.4 | 32.8 | 43.0 | 41.1 | 34.7 |
| ½ overlap | yes | 81.1 | 29.5 | 40.4 | 39.3 | 30.1 |
| Online | no | 79.1 | 29.0 | 38.5 | 35.5 | 28.3 |

Overlapping results are affected by multiple predictions with various contexts, while the non-overlapping case benefits from a shared context, leading to better accuracy and less over-segmentation.


W5. Acc decreases after post-processing

A: Segmental metrics and frame-wise accuracy are not well aligned in TAS. High accuracy does not necessarily imply less over-segmentation, and vice versa. Consider the extreme case where every other frame is assigned the wrong action: the segmental measures are very poor, but the accuracy is still 50%. This trade-off is also observed in [R1].

Our post-processing prioritizes reducing over-segmentation over frame-wise accuracy. Fig. RA in the rebuttal PDF (same example as Fig. 3) shows that post-processing: 1) removes fragments (black boxes); and 2) may reduce Acc, particularly at action boundaries (red boxes).
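To make the extreme case and the fragment-removal behavior concrete, here is a small self-contained illustration; the minimum-duration filter below is a generic heuristic for exposition, not the paper's actual post-processing:

```python
def segments(labels):
    """Collapse a frame-wise label sequence into (label, length) segments."""
    segs = []
    for l in labels:
        if segs and segs[-1][0] == l:
            segs[-1][1] += 1
        else:
            segs.append([l, 1])
    return segs

# Every other frame is mislabeled: frame-wise accuracy is 50%,
# yet the prediction is heavily over-segmented.
gt   = ["cut"] * 8 + ["mix"] * 8
pred = [g if i % 2 == 0 else "pour" for i, g in enumerate(gt)]
acc = sum(p == g for p, g in zip(pred, gt)) / len(gt)
print(acc, len(segments(pred)), len(segments(gt)))   # 0.5, 16 segments vs 2

def merge_short(labels, min_len=3):
    """Generic fragment removal: absorb segments shorter than min_len frames
    into the previous segment (one possible heuristic, for illustration)."""
    out = []
    for label, length in segments(labels):
        if length < min_len and out:
            out[-1][1] += length          # absorb the fragment
        else:
            out.append([label, length])
    return [l for l, n in out for _ in range(n)]

smoothed = merge_short(pred)
print(len(segments(smoothed)))            # far fewer segments; Acc may shift
```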

[R1] Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation, ECCV 2022.


W6. Performance gap between online and offline

A: We respectfully disagree that a performance gap between online and offline models is a weakness.

Such a gap is well expected and is directly attributable to the online nature of the task, in which future temporal context is not available when making predictions. In a similar fashion, [R2] found that under the same offline setup, performance drops when the input window size is reduced and cannot account for sufficient context.

[R2]. How much temporal long-term context is needed for action segmentation? ICCV 2023.

Questions


Q1. Why use offline models?

A: We thank the reviewer for the question.

Successfully solving TAS requires merging visual features over long temporal ranges of context, while preserving local semantics. Existing offline models are designed specifically for this task and excel at it, making them a natural starting point rather than building from scratch.

Additional advantages include: 1) using the same backbone to facilitate performance evaluations and comparisons to the offline model and 2) analyzing specific performance discrepancies to identify meaningful areas for improvement.
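As an illustration of the causal-convolution adaptation the reviewer asked about, a minimal sketch is given below; the kernel size and module structure are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Illustrative causal temporal convolution: pad only on the left so the
    output at time t depends on inputs at times <= t (no future leakage)."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # left padding only
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) frame features; an offline TCN layer would pad
        # symmetrically, letting each output peek at future frames,
        # which is invalid in the online setting.
        return self.conv(F.pad(x, (self.pad, 0)))
```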

Comment

Thanks for your response. I have read the rebuttal carefully.

Comment

Dear Reviewer xPL1,

Thank you for taking the time to carefully read our rebuttal. If there are any aspects of our response that require further clarification or if you have any additional questions, we are more than happy to engage in further discussion.

Review
Rating: 6

This paper introduces OnlineTAS, the first fully-supervised online framework for temporal action segmentation (TAS). The main contributions include:

  1. A context-aware feature augmentation (CFA) module that incorporates an adaptive memory to enhance frame representations with temporal context.
  2. An adaptive memory bank that accumulates short-term and long-term context information.
  3. A post-processing technique for online boundary adjustment to mitigate over-segmentation.
  4. State-of-the-art performance on three TAS benchmarks in an online setting.

Strengths

  1. The CFA module and adaptive memory bank provide an innovative approach to incorporating temporal context in an online setting.
  2. The proposed post-processing technique effectively mitigates the over-segmentation problem common in online TAS.
  3. The authors demonstrate the effectiveness of their approach on three standard TAS benchmarks, showing consistent improvements over baselines.

Weaknesses

  1. The paper lacks a thorough discussion of the computational complexity and resource requirements of OnlineTAS. For an online method intended for real-time applications, this is a critical aspect. The authors mention that their approach uses a "single-stage TCN" for efficiency, but don't provide concrete details on inference time, memory usage, or how these scale with video length. A comparison of computational requirements with offline methods and other online video understanding tasks would be beneficial. This analysis should include considerations for both the online and semi-online inference modes.

  2. The paper presents quantitative results but lacks a qualitative error analysis. A detailed examination of failure cases could provide valuable insights into the limitations of the approach and guide future improvements. For instance, analyzing scenarios where OnlineTAS performs poorly compared to offline methods or where over-segmentation persists despite post-processing could be informative.

  3. While the paper introduces an adaptive memory bank, there's limited discussion on how this approach scales to very long videos or continuous video streams. It's unclear how the method would perform in scenarios where the video length greatly exceeds the memory capacity, which is a likely scenario in many real-world applications.

  4. For an online method, real-time performance is crucial. However, the paper doesn't provide metrics such as frames per second processing speed or end-to-end latency on standard hardware. This information is essential for assessing the practical applicability of the method in real-time scenarios.

Questions

The following questions are mostly related to weaknesses directly or indirectly:

  1. Have you explored the impact of different feature extractors and different backbones on the performance of OnlineTAS?

  2. Can you elaborate on how OnlineTAS might be adapted for weakly-supervised or unsupervised settings?

  3. How sensitive is the method to the choice of hyperparameters, particularly the clip size and memory length?

  4. Your post-processing technique seems effective in mitigating over-segmentation. Have you explored how this technique might be integrated into the model itself during training, rather than as a post-processing step?

  5. The paper introduces both online and semi-online inference modes. How do these two modes compare in terms of latency and accuracy trade-offs? In what scenarios would one be preferable over the other?

Limitations

The authors acknowledge some limitations, such as the computational intensity of the offline segmentation model used as a backbone and the reliance on pre-extracted features. These factors could hinder real-time application of the framework. However, the discussion of limitations could be expanded to include potential challenges in adapting the method to more diverse datasets or real-world streaming scenarios.

Author Response

We thank the reviewer for recognizing the novelty of our CFA module for the online TAS problem and the effectiveness of the post-processing for mitigating the over-segmentation issue.

Weaknesses


W1&4. Runtime analysis and computational requirements

A: We thank the reviewer for the comment.

Runtime

Kindly refer to our response to W1 for Reviewer aGP1.

Computation requirement

| Method | GPU Memory Requirement |
|---|---|
| MSTCN | 551M (10k frames) |
| ASFormer | 716M (10k frames) |
| Ours | 1.2G (fixed) |

We compare GPU memory requirements for inference between offline and online models. Offline models like MSTCN and ASFormer use 551M and 716M per 10,000 frames, respectively. Note that their memory requirement scales linearly with the actual video length, as these models operate on the full sequence. In contrast, our model has a fixed memory requirement of 1.2G per inference, regardless of the video length. This facilitates its application to online streaming videos that can be infinitely long.


W2. Error analysis.

A: Thanks for the suggestion. We show some qualitative examples in Fig. 3, and we will add more to the revision. For failure cases, we have the following two observations:

  1. Action starts are often delayed due to the need for more frame information to predict new actions, especially when facing semantic ambiguities at action boundaries.

  2. Persistent over-segmentation happens when the network makes incorrect but confident predictions, which could be improved with a stronger backbone or better temporal context modeling.


W3. How memory scales with video lengths

A: We thank the reviewer for the question. When the memory capacity is exceeded, our method discards the earliest memory entries in a FIFO manner.

Our ablation study (Table 5) includes scenarios where the video length exceeds the memory limit. For memory sizes of 16, 32, and 64, the earliest memory entries are discarded, as the average length of 50Salads videos is ~5.8K frames. As the size decreases, performance gradually drops, reaching a low of 79.8 Acc compared to the peak of 82.4. Note that with the memory size set to 16, our approach retains long-term information from only up to 192 frames, 30x less than the average video length.
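A tiny illustration of this FIFO eviction (hypothetical capacity and entries, for exposition only):

```python
from collections import deque

# With a capacity of 4 entries, the earliest ones drop out as new clips
# stream in, so arbitrarily long videos fit in a fixed memory budget.
memory = deque(maxlen=4)
for clip_id in range(7):
    memory.append(f"clip_{clip_id}")
print(list(memory))  # ['clip_3', 'clip_4', 'clip_5', 'clip_6']
```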


Questions

Q1. Different feature extractor and backbone

A: Thanks for the suggestion. We experiment on three common TAS datasets, replacing the MS-TCN backbone with ASFormer, and report the results in Table RA in the rebuttal PDF. Our approach is still effective in boosting online segmentation performance.

Unfortunately, we were not able to conduct experiments with other feature extraction backbones due to the time and resource constraints of the rebuttal. We will add this to the camera-ready.

Q2. Online setup under other forms of supervision

A: Thanks for the interesting question. Weakly- and unsupervised approaches typically involve iterative model training and label estimation. Our post-processing technique, as a label refinement process, could be used in such setups to refine network predictions and generate temporally consistent pseudo-labels. The two lines of work are orthogonal and present interesting directions for future work on the online problem.


Q3. Hyperparameter sensitivity

A: These studies are already reported in Table 5. Our approach is reasonably tolerant to changes in the clip size and memory length. For example, performance only drops by 1% when the size is reduced to 1/4 (32) of the original default (128).


Q4. Integrating post-processing for training

A: That is a good question. The post-processing refines the frame labels predicted by the model; we can see its application in weakly- and unsupervised setups to refine pseudo-labels, as explained in our response to Q2. However, as our current approach is fully supervised, there is no need for label correction during training since the ground-truth labels are provided.


Q5. Latency accuracy trade-off analysis between online and semi-online modes

A: The inference speed for one pass is identical for both modes as their input sizes are the same. However, their latency differs. Online inference operates on a frame basis, so its latency depends only on the inference speed, while the semi-online mode incurs additional latency to gather frames up to the clip length (128 frames at a standard 25 FPS corresponds to 5.12 seconds).

In terms of performance, the online approach is less competitive than the semi-online one, with an average gap of around 2-5% (see Table 1). This is likely because of the better preservation of temporal continuity in the semi-online setup, which we discuss in L.219-222.

To summarize, online inference offers better real-time responsiveness, while semi-online inference achieves better performance. The choice between the two modes depends on the application priority: if real-time inference is required, online is preferred; if accuracy matters more and the task is less time-sensitive, semi-online is suggested.

We thank the reviewer for the constructive question and will include these discussions in our updated manuscript.


Limitations

L1. Potential challenges in diverse and real-world streaming cases

A: Thank you for your suggestion. Handling diverse and real-world videos presents several challenges.

One common scenario involves interrupted actions, where a subject abruptly switches to a different action, leaving the ongoing action unfinished. These interruptions can be challenging for the model to handle effectively.

Additionally, the extended length of the video poses another challenge. Streaming videos can be infinitely long, so effectively managing and preserving long-form history within a fixed memory budget becomes a critical issue.

Comment

Thank you very much for your detailed rebuttal! I am convinced by your responses and happy to change my rating after the discussions.

Comment

Dear Reviewer itXK,

Thank you for taking the time to thoroughly review our rebuttal. We greatly appreciate your constructive feedback and are pleased that our responses have addressed your concerns.

Author Response

We thank all reviewers for their constructive comments and address each of their concerns in separate rebuttals. The attached global rebuttal file contains:

  1. 1 figure (Fig. RA) in response to Reviewer xPL1's W5 regarding the segment-accuracy tradeoff in post-processing.

  2. 1 table (Table RA) in response to Reviewer itXK's Q1 regarding switching the backbone (ASFormer) for online TAS.

Final Decision

The paper received 3 borderline rejects during the initial review period. While the reviewers appreciated the problem setup and the clarity of presentation, concerns were raised about the novelty, limited error analysis, computational overhead for truly online segmentation, the heavy reliance on post-processing for reducing over-segmentation, and the performance gap between offline and online models. The author's rebuttal provided a response that addresses many of the concerns. While some concerns exist, the strengths outweigh the weaknesses, and I recommend acceptance since the proposed model will serve as a convincing baseline for online TAS. The authors are encouraged to include the details from the rebuttal in their final version.