Spatio-temporal Diffusion Transformer for Action Recognition
Reviews and Discussion
This paper introduces a dual-path network consisting of a spatio-temporal pathway and a temporal pathway. The authors also propose a series of network blocks designed to enhance action modeling capability. However, the proposed blocks lack novelty.
Strengths
- The writing is neat.
Weaknesses
- Outdated Paradigm: Proposing spatio-temporal network blocks and small models specific to the action recognition problem in order to achieve state-of-the-art (SOTA) performance is an outdated paradigm, popular before 2021 (e.g., TDN [1]). Modern paradigms focus on developing general visual models, such as CLIP and large multi-modal models, and transferring their visual representations to action recognition tasks via parameter-efficient tuning (PET); ST-Adapter [2] is a representative work (see the adapter sketch after the references below).
- Lack of Novelty in Network Blocks: The proposed network blocks are not novel. The mixed convolution + attention block (Fig. 2) was proposed in [3]; the cross transformer module (Fig. 3) was introduced in [4]; the spatio-temporal diffusion module (Fig. 4), which is essentially a factorized 3D convolution (sketched after the references below), was proposed in [5]; and the motion excitation module (Fig. 5) was introduced in [6][7].
- Poor Performance: The performance of the proposed model is subpar.
[1] Wang L, Tong Z, Ji B, et al. TDN: Temporal difference networks for efficient action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 1895-1904.
[2] Pan J, Lin Z, Zhu X, et al. ST-Adapter: Parameter-efficient image-to-video transfer learning. Advances in Neural Information Processing Systems, 2022, 35: 26462-26477.
[3] Pan X, Ge C, Lu R, et al. On the integration of self-attention and convolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 815-825.
[4] Chen C F R, Fan Q, Panda R. CrossViT: Cross-attention multi-scale vision transformer for image classification. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 357-366.
[5] Tran D, Wang H, Torresani L, et al. A closer look at spatiotemporal convolutions for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6450-6459.
[6] Wang H, Tran D, Torresani L, et al. Video modeling with correlation networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 352-361.
[7] Li Y, Ji B, Shi X, et al. TEA: Temporal excitation and aggregation for action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 909-918.
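For context on the PET paradigm raised in the first weakness: below is a minimal, illustrative PyTorch sketch of an ST-Adapter-style bottleneck in the spirit of [2] — down-projection, depthwise 3D convolution over time, up-projection — inserted as a small trainable residual module into an otherwise frozen image backbone. The class name, dimensions, and kernel size are assumptions for illustration, not taken from [2] or from the paper under review.

```python
import torch
import torch.nn as nn

class STAdapterSketch(nn.Module):
    """Illustrative ST-Adapter-style bottleneck (cf. [2], dims assumed).

    Projects frozen ViT patch tokens down, applies a depthwise 3D conv
    across time to inject temporal modeling, then projects back up.
    Only this module is trained; the image backbone stays frozen.
    """
    def __init__(self, dim=768, bottleneck=128, kernel=(3, 1, 1)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.dwconv = nn.Conv3d(bottleneck, bottleneck, kernel,
                                padding=tuple(k // 2 for k in kernel),
                                groups=bottleneck)  # depthwise over time
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x, t, h, w):
        # x: (batch, t*h*w, dim) patch tokens from a frozen backbone
        b, n, _ = x.shape
        z = self.down(x)                                   # (b, n, c)
        z = z.view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)  # (b, c, t, h, w)
        z = self.act(self.dwconv(z))
        z = z.permute(0, 2, 3, 4, 1).reshape(b, n, -1)
        return x + self.up(z)                              # residual adapter

# Usage sketch: 8 frames of 14x14 patch tokens from a frozen ViT-B.
tokens = torch.randn(2, 8 * 14 * 14, 768)
out = STAdapterSketch()(tokens, t=8, h=14, w=14)  # same shape as input
```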
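And to make the factorized-convolution point in the second weakness concrete: a (2+1)D factorization in the spirit of [5] replaces a full k×k×k 3D convolution with a 1×k×k spatial convolution followed by a k×1×1 temporal convolution, roughly as in this hedged sketch (channel sizes are illustrative):

```python
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """(2+1)D factorization of a 3D conv in the spirit of [5]:
    a 1xkxk spatial conv followed by a kx1x1 temporal conv."""
    def __init__(self, c_in, c_out, c_mid=None, k=3):
        super().__init__()
        c_mid = c_mid or c_out  # intermediate width; [5] tunes this
        self.spatial = nn.Conv3d(c_in, c_mid, (1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(c_mid, c_out, (k, 1, 1),
                                  padding=(k // 2, 0, 0))

    def forward(self, x):  # x: (batch, c_in, t, h, w)
        return self.temporal(self.relu(self.spatial(x)))
```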
Questions
- Why stick to the old "network block" game when modern paradigms are available? Specialized small models are outdated, and general models are the future.
- Clearly highlight any unique contributions or improvements over existing methods to establish the novelty of the proposed modules.
- Why not benchmark the model on more challenging video datasets such as Kinetics-700?
This paper proposes a spatio-temporal diffusion (really?) transformer to improve the action recognition task. Several plug-and-play modules are proposed, including a Parallel Transformer Module, a Cross Transformer Module, a Spatiotemporal Diffusion (really?) Module, and a Motion Excitation Module.
Strengths
The experiments on the Something-Something V1 and V2 datasets show the effectiveness of the proposed modules.
Weaknesses
- This paper is titled "Spatio-temporal Diffusion Transformer", but I do not see any content about "diffusion". I am not sure whether this is a typo (for example, "fusion" may be more appropriate) or whether the authors misuse the concept of diffusion.
- Utilizing spatio-temporal and temporal information to improve video tasks is well known in this field, but I do not learn any new ideas from this work, either in research insights or in engineering implementation.
Questions
I strongly suggest that the authors double-check the concept of diffusion.
This paper proposes a method for video action recognition that designs a two-branch transformer for extracting spatio-temporal features. It then applies a spatio-temporal diffusion model to fuse the features from the two branches and enhance the video features. To better process temporal information, it constructs a salient motion extraction module that converts spatial information from adjacent frames into motion features. Experiments are conducted on Something-Something V1 and V2.
Strengths
- The idea is simple and straightforward.
- Applying a diffusion model to action recognition is a new attempt.
Weaknesses
- Overall, the writing is poor. The motivation of the proposed method is not clear, and the experiments are not sufficient to support it.
- Abstract: "the high similarity of video backgrounds and the long time span of action bring serious challenges to action recognition". How did existing methods handle this issue? What are the shortcomings of existing methods on this issue? Why is the proposed method able to solve it?
- Introduction: The introduction reads more like a Related Work section; no motivation is introduced. From L82, it generally describes the proposed method, but it is not clear why the method is designed step by step in this way, or how the design solves the issue raised in the Abstract.
- Method:
- It mentions that one branch extracts spatio-temporal features of the video and the other extracts temporal features of the actions. What is the difference between video features and action features here? Why are extra temporal features needed? Is there any redundancy between the spatio-temporal and temporal features? Is there an ablation study on this?
- Why use 12 modules in the PTM? Is this a hyper-parameter? Is there an ablation study on it?
- It is not clear how the diffusion model works on the spatio-temporal dimension here. What problem does it solve?
- The SMEM is difficult for me to understand. What is the motivation for designing this module? How important is it to the proposed method?
- Experiment:
- From Table 2, the results of the individual modules are not significantly different. The table does not convincingly show that the proposed modules are necessary.
- The same holds for Table 4: there is no significant difference between the settings.
- Experiments on Something-Something alone are not enough to demonstrate effectiveness and generalizability.
- From Table 2, the diffusion model does not make a big difference, and it is only a very small part of the method. I do not understand why the title is "Spatio-temporal Diffusion Transformer"; what does spatio-temporal diffusion contribute here?
- All figures and tables lack detailed captions, which makes reading difficult.
Questions
See the above Weaknesses.
This paper introduces a dual-path network consisting of a spatio-temporal pathway and a temporal pathway for action recognition. Although the experiments on the Something-Something V1 and V2 datasets show the effectiveness of the proposed modules, all three reviewers point out that the proposed method lacks novelty and rate the paper as "reject". After reading the paper, the AC agrees with the reviewers' comments on novelty.
Additional Comments from Reviewer Discussion
The authors did not provide a response to the reviewers' comments.
Reject