PaperHub
6.0 / 10 · Poster · 3 reviewers (min 6, max 6, std 0.0)
Ratings: 6, 6, 6
Confidence: 3.0 · Correctness: 2.0 · Contribution: 2.7 · Presentation: 3.0
NeurIPS 2024

ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation

OpenReview · PDF
Submitted: 2024-05-12 · Updated: 2024-12-23
TL;DR

a unified diffusion model for temporal action segmentation and long-term action anticipation

Abstract

Keywords
temporal action segmentation, long-term action anticipation

Reviews and Discussion

Official Review
Rating: 6

This paper extends DiffAct to perform both action segmentation and action anticipation. Anticipative masking with a learnable mask token is proposed. Experiments are conducted on three common benchmark datasets.

Strengths

  1. The motivation for unifying action segmentation and action anticipation is reasonable, given the task similarity. It is also intuitive and reasonable to extend a generative framework from segmentation to anticipation, given the generative nature of anticipation.

  2. The learnable mask token is interesting.

  3. The ablation studies are relatively comprehensive.

  4. Codes are provided in the supplementary.

Weaknesses

  1. The major technical problem is that the proposed method assumes the ground-truth video length is known for action anticipation at test time. Without the ground-truth video length T, the anticipative mask M^A cannot be constructed at inference time for anticipation. This is also shown in the provided code, where 'full_len' is passed into the ddim inference function. This is problematic, conflicts with the goal of anticipation, and leads to incomparable experimental results.

  2. The technical novelty is limited. The main contribution is extending the DiffAct method with an anticipative mask, while other modules are from existing methods. But given the contribution of unifying the segmentation and anticipation tasks, this is not a deciding factor for me.

  3. It would be better to conduct experiments on Assembly101. Given the small data size and the saturated performance on GTEA, I would recommend transitioning from GTEA to Assembly101 for this task.

Minor:

  • In Figure 1c, what do the triangle and the circle mean?
  • In Table 1, some recent methods are missing, such as MVGA ICCV23 and RTK ICCV23.

Questions

NA

Limitations

The limitations were briefly discussed.

Author Response

[Inference setup in LTA]

We would like to clarify that our model can predict future actions of arbitrary length by adjusting the number of mask tokens used for prediction; the model itself does not require a ground-truth anticipation length, and the ground-truth length is used only to generate predictions of the same length for convenience of evaluation at test time. Note that previous methods [19, 24, 43] also commonly used ground-truth video lengths during inference to generate their final predictions, which can be seen in the released original codes. We obtained the code for [19] from the authors. We hope this clarifies the reviewer's concern; otherwise, please let us know.
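For illustration, a minimal sketch of this point (the tensor names, feature dimension, and helper are hypothetical stand-ins, not the actual ActFusion code): the anticipation horizon at test time is determined solely by how many mask tokens are appended after the observed frames.

```python
import torch

def build_anticipation_input(obs_feats, mask_token, num_future_frames):
    """Append learnable mask tokens after the observed frames.

    obs_feats:         (T_obs, D) features of the observed part of the video
    mask_token:        (D,) learnable embedding shared by all masked positions
    num_future_frames: anticipation horizon chosen freely at test time
    """
    future = mask_token.unsqueeze(0).expand(num_future_frames, -1)
    return torch.cat([obs_feats, future], dim=0)  # (T_obs + num_future_frames, D)

# Example: anticipate 150 future frames from 300 observed frames.
D = 2048
obs_feats = torch.randn(300, D)
mask_token = torch.nn.Parameter(torch.zeros(D))
x = build_anticipation_input(obs_feats, mask_token, num_future_frames=150)
print(x.shape)  # torch.Size([450, 2048])
```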

[Novelty]

Please refer to the general response regarding novelty.

[Experiments on Assembly101]

Due to limited resources and time, we were unable to finish the full-scale experiment in time. In response to the suggestion regarding Assembly101, we instead conducted a smaller-scale experiment: we randomly sampled 10% of the entire training set to train the models and evaluated performance on the full validation set to ensure a fair comparison.

Table R5 compares the performance of ActFusion and LTContext [5], the state-of-the-art TAS model on Assembly101, in this setting. Our model outperforms LTContext across all metrics, showing the potential advantages of the proposed method. We will provide a full-scale experiment in the final manuscript.

[Table R5. Experiments on Assembly101]

| method | F1@10 | F1@25 | F1@50 | Edit | Acc | Avg. |
|---|---|---|---|---|---|---|
| LTContext [5] | 18.7 | 15.9 | 11.2 | 17.6 | 27.3 | 18.2 |
| ActFusion (ours) | 21.7 | 19.3 | 14.0 | 19.8 | 28.1 | 20.6 |

[Explanation of Figure 1(c)]

Please refer to the general response for a detailed explanation of Figure 1(c).

[Missing references]

Thank you for letting us know. We will include RTK [R4] and MVGA [R5] in our final manuscript.

[R4] B. Jiang et al. RTK: Video action segmentation via contextually refined temporal keypoints. In ICCV’23.
[R5] N. Aziere and S. Todorovic. Markov game video augmentation for action segmentation. In ICCV’23.

Comment

Thanks very much for the response.

W1: I am not convinced.

This is not an evaluation choice as argued by the authors. This is a test data leakage issue. For any machine learning model, you should not use any ground truth when obtaining predictions on test data.

One possible solution might be using the mean length of training videos as the 'full_len' during testing. But I guess this will lower the results.

As for the previous methods mentioned by the authors, I did not check their code. Did other previous methods, besides those mentioned, also use this ground truth? Even if ground truth was used in some of the previous codes, I would consider this an issue to be fixed in following works rather than a convention to be inherited.

W2: I appreciate the novelty of the unification, but not the methodology. This is subjective though.

W3 and minor have been addressed.

Comment

Thanks very much. I really appreciate the additional experiments. It is helpful to see the new results without using ground-truth length, which are quite reasonable.

It is surprising that all previous methods have utilised the ground-truth length during testing. I would flag this as a critical issue worth the community's attention. Revealing this issue using the above experiments might be a significant contribution to the community.

Therefore, I have greatly increased my score, on the condition that the additional experiments will be included in the later version. And all the other experiments should also be 'totally' updated to this no ground-truth version. However, this might be too substantial and there is no mechanism to guarantee that.

Overall, this is really a difficult case. Therefore, I have also changed my rating to the lowest confidence, and I would leave this to the chairs for the final judgement.

Comment

We sincerely appreciate the reviewer’s constructive feedback, which has been invaluable in raising an important issue within the community and guiding us to improve our submission. We will ensure that all experiments in our final manuscript are fully updated without the use of ground-truth length during inference.

As the reviewer zKDZ mentioned, we also believe that revealing this critical issue and rectifying it would be a significant contribution to the community. We will thoroughly analyze and bring to light the issues shared by previous methods (at least, including [19, 24]), ensuring that all models are compared under adequate and fair conditions.

Comment

[W1: Using the ground-truth video length during inference]

Thanks to your response, we now fully understand the point regarding the use of the ground-truth video length during inference. To address this concern, we conducted additional experiments in which no ground-truth length is used at test time; following your suggestion, we fixed the length of future frames (i.e., the number of mask tokens) to the maximum number of future frames in the training set.
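A minimal sketch of this evaluation protocol (hypothetical function and variable names, not the released code): the anticipation horizon is fixed from training-set statistics, and the ground-truth horizon is used only to crop the output for scoring, never as model input.

```python
def max_future_frames(train_video_lengths, obs_ratio=0.2):
    # Largest number of unobserved frames over the training videos;
    # used as a fixed anticipation horizon at test time.
    return max(length - int(obs_ratio * length) for length in train_video_lengths)

def predict_without_gt_length(model, obs_feats, fixed_future_len, eval_len):
    # The model always anticipates `fixed_future_len` frames; the ground-truth
    # horizon `eval_len` is used only to crop the output for evaluation.
    pred = model(obs_feats, num_future_frames=fixed_future_len)
    return pred[:eval_len]
```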

Experimental results in this setting on the 50 Salads dataset are reported in Table R6. In this table, the column ‘use of GT length’ indicates if each model exploits ground-truth length during testing. We found that, following the first paper introducing long-term dense action anticipation [2], all previous methods utilize the ground-truth length during testing, except for those whose codebases and/or inference setups are not available [25, 51, 65]; we marked these methods 'unknown' in the column.

In the table, ActFusion* represents our original model using a fixed number of mask tokens without retraining. To ensure a fair comparison, we applied the same testing scheme to Farha et al [19] (denoted by Farha et al*) and FUTR [24] (denoted by FUTR*), both of which originally utilize the ground-truth length during testing. In this setting, ActFusion* still outperforms Farha et al* and FUTR*, and as the reviewer expected, all of these methods perform worse.

This performance drop is due to the use of the ground-truth video length in training, which results in a discrepancy between the training and inference setups. To mitigate this issue, we retrained our model while fixing the number of mask tokens to cover the maximum number of future frames in the training set. The retrained results, presented in the last row of Table R6 and denoted as ActFusion†, achieve the state of the art in long-term action anticipation (LTA). We observed that fixing the number of mask tokens brings performance gain when the prediction ranges are relatively short, as the model benefits from more stable predictions. However, for longer predictions, particularly when the prediction ratio is set to 0.5, we observed performance degradation compared to ActFusion. This degradation is probably due to the fact that it becomes more difficult for the model to determine the end of an activity. Nonetheless, these results demonstrate that our method can be flexibly adapted to different numbers of mask tokens, ultimately achieving the state of the art in LTA.

We sincerely hope this clarification addresses your concerns. We believe that the experimental setup you suggested will contribute significantly to the field by providing more realistic and reasonable evaluation protocols, and we will include all of the above results in the revision.

[Table R6. LTA results with and without using ground truth length]

| method | use of GT length | α=0.2, β=0.1 | α=0.2, β=0.2 | α=0.2, β=0.3 | α=0.2, β=0.5 | α=0.3, β=0.1 | α=0.3, β=0.2 | α=0.3, β=0.3 | α=0.3, β=0.5 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Temporal Agg. [51] | unknown | 25.50 | 19.90 | 18.20 | 15.10 | 30.60 | 22.50 | 19.10 | 11.20 | 20.26 |
| A-ACT [25] | unknown | 35.40 | 29.60 | 22.50 | 16.10 | 35.70 | 25.30 | 20.10 | 16.30 | 25.13 |
| Object Prompt [65] | unknown | 37.40 | 28.90 | 24.20 | 18.10 | 28.00 | 24.00 | 24.30 | 19.30 | 25.53 |
| Farha et al. [19] | ✓ | 34.76 | 28.41 | 21.82 | 15.25 | 34.39 | 23.70 | 18.95 | 15.89 | 24.15 |
| Farha et al.* [19] | - | 29.07 | 23.83 | 20.49 | 12.77 | 26.51 | 17.78 | 14.35 | 11.19 | 19.50 |
| FUTR [24] | ✓ | 39.55 | 27.54 | 23.31 | 17.77 | 35.15 | 24.86 | 24.22 | 15.26 | 25.96 |
| FUTR* [24] | - | 28.84 | 20.01 | 16.65 | 11.37 | 22.48 | 16.49 | 13.21 | 9.21 | 17.28 |
| ActFusion (ours) | ✓ | 39.55 | 28.60 | 23.61 | 19.90 | 42.80 | 27.11 | 23.48 | 22.07 | 28.39 |
| ActFusion* (ours) | - | 34.50 | 26.17 | 20.27 | 11.87 | 34.58 | 22.75 | 17.31 | 11.33 | 22.75 |
| ActFusion† (ours) | - | **41.30** | **30.83** | **24.40** | **16.10** | **41.70** | **28.08** | **22.48** | **19.56** | **28.06** |
  • In the table, the bolded values represent the highest accuracy among the models that do not use ground truth length, ensuring a fair comparison.
Official Review
Rating: 6

This paper proposes a unified diffusion model called ActFusion, which addresses temporal action segmentation (TAS) and long-term action anticipation (LTA) in a joint learning framework. To unify the two tasks, the model handles visible and invisible parts of the sequence during training: the visible part corresponds to observed video frames, while the invisible part corresponds to future frames to be anticipated. Experiments show that the model achieves state-of-the-art performance on the standard benchmarks 50 Salads, Breakfast, and GTEA.

Strengths

  1. The design of the ActFusion model is novel, with its anticipative masking strategy and random masking method unifying the tasks of TAS and LTA.
  2. The model enhances its performance on both tasks through mutual promotion of TAS and LTA.
  3. The reported results in the paper surpass existing techniques on multiple evaluation metrics, demonstrating significant performance improvements.

Weaknesses

  1. What is the difference between the Diffusion model used in the paper and the DiffAct model? They appear to be similar overall, despite using different Masks to unify the tasks of TAS and LTA.
  2. The paper employs so many loss functions for model training; the authors should analyze the impact of different losses rather than simply exploring the effects of the encoder and decoder.
  3. Since the denoising process of the diffusion model is a time-consuming task, I am interested in the computational efficiency of the model proposed in the paper. The authors are advised to provide information on the model's GFLOPs, parameter scale, and inference time.
  4. The authors need to further improve the interpretation of the figures, such as what the circles and triangles in Figure 1(c) represent.

Questions

  1. What is the difference between the Diffusion model used in the paper and the DiffAct model? They appear to be similar overall, despite using different Masks to unify the tasks of TAS and LTA.
  2. The paper employs so many loss functions for model training; the authors should analyze the impact of different losses rather than simply exploring the effects of the encoder and decoder.
  3. Since the denoising process of the diffusion model is a time-consuming task, I am interested in the computational efficiency of the model proposed in the paper. The authors are advised to provide information on the model's GFLOPs, parameter scale, and inference time.
  4. The authors need to further improve the interpretation of the figures, such as what the circles and triangles in Figure 1(c) represent.

Limitations

In summary, the authors have creatively used different masks to unify the tasks of TAS and LTA in a single model and achieved significant results. However, compared to existing models, there does not seem to be much modification, which limits the innovation.

Author Response

[Comparison with DiffAct]

Please refer to the general response for the comparison with DiffAct.

[Loss ablation studies]

We conduct ablation studies on the loss functions: boundary loss, smoothing loss, and cross-entropy loss. Table R4 presents the results, demonstrating that the combination of the boundary loss $L_{bd}$ and the smoothing loss $L_{smo}$ is effective for both TAS and LTA. While the effectiveness of these losses in TAS is well documented in previous research [18, 64, 42], their impact on LTA has been less explored. Notably, the smoothing loss leads to significant performance gains in both tasks, indicating that smoothed predictions are beneficial.
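For reference, a minimal sketch of the three loss terms in the flavor commonly used in the action-segmentation literature; the exact formulation, weighting, and boundary-target construction in ActFusion may differ, so this should be read as an illustrative assumption rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def segmentation_losses(logits, labels, boundary_logits, boundary_target, tau=4.0):
    """logits: (T, C) per-frame class scores; labels: (T,) ground-truth classes;
    boundary_logits / boundary_target: (T,) per-frame action-boundary scores/targets."""
    # Frame-wise cross-entropy.
    l_ce = F.cross_entropy(logits, labels)
    # Temporal smoothing: truncated MSE between log-probs of adjacent frames (MS-TCN style).
    logp = F.log_softmax(logits, dim=-1)
    delta = torch.clamp((logp[1:] - logp[:-1].detach()).abs(), max=tau)
    l_smo = (delta ** 2).mean()
    # Boundary alignment: binary cross-entropy against (soft) boundary targets.
    l_bd = F.binary_cross_entropy_with_logits(boundary_logits, boundary_target)
    return l_ce, l_smo, l_bd
```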

[Table R4. Loss ablation studies]

(a) Results on TAS

| $L_{bd}$ | $L_{smo}$ | $L_{ce}$ | F1@10 | F1@25 | F1@50 | Edit | Acc | Avg. |
|---|---|---|---|---|---|---|---|---|
|  |  | ✓ | 88.4 | 86.5 | 79.1 | 82.5 | 84.9 | 84.3 |
|  | ✓ | ✓ | 91.3 | 90.0 | 84.5 | 86.3 | 88.8 | 88.2 |
| ✓ | ✓ | ✓ | 91.6 | 90.7 | 84.8 | 86.0 | 89.3 | 88.5 |

(b) Results on LTA

| $L_{bd}$ | $L_{smo}$ | $L_{ce}$ | α=0.2, β=0.1 | α=0.2, β=0.2 | α=0.2, β=0.3 | α=0.2, β=0.5 | α=0.3, β=0.1 | α=0.3, β=0.2 | α=0.3, β=0.3 | α=0.3, β=0.5 |
|---|---|---|---|---|---|---|---|---|---|---|
|  |  | ✓ | 35.62 | 27.04 | 20.17 | 15.93 | 34.38 | 22.33 | 19.96 | 16.94 |
|  | ✓ | ✓ | 39.19 | 28.99 | 23.13 | 19.45 | 39.53 | 25.19 | 22.67 | 19.88 |
| ✓ | ✓ | ✓ | 39.55 | 28.60 | 23.61 | 19.90 | 42.80 | 27.11 | 23.48 | 22.07 |

[Computational efficiency]

Table R5 compares the computational cost of our model with ASFormer [64] for TAS and FUTR [24] for LTA in terms of the number of parameters, GPU memory usage during inference, and inference time. For TAS, as shown in Table R5 (a), although ASFormer has fewer parameters and a shorter inference time than our 25-step model, it requires more GPU memory during inference and obtains lower performance. To improve computational efficiency, we reduce the number of DDIM inference steps to 10 and to 1. This reduction decreases inference time while maintaining superior performance over ASFormer.

For LTA, as shown in Table R5 (b), our model is approximately eleven times smaller than FUTR, uses less GPU memory, but has a longer inference time. By reducing the DDIM inference steps to 1, our model achieves a similar inference time to FUTR. Overall, our model is practical and efficient since it can handle both TAS and LTA tasks with a unified model, eliminating the need for separate models and reducing GPU resource usage and the time required for separate training. Note that we use model checkpoints from the official GitHub repositories for all comparisons.
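As a rough illustration of how such timings can be obtained, a hedged sketch of a timing loop over different DDIM step counts (the `model.ddim_sample` entry point and its arguments are hypothetical, not the released interface):

```python
import time
import torch

@torch.no_grad()
def time_inference(model, feats, ddim_steps, n_runs=10):
    # Average wall-clock time of DDIM sampling for a given number of steps.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        _ = model.ddim_sample(feats, num_steps=ddim_steps)  # hypothetical entry point
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

# Usage sketch: compare 1, 10, and 25 denoising steps on the same input features.
# for steps in (1, 10, 25):
#     print(steps, time_inference(model, feats, steps))
```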

[Table R5. Computational efficiency]

(a) Results on TAS

| method | # inference steps | Avg. performance | # parameters (M) | memory (GB) | inference time (s) |
|---|---|---|---|---|---|
| ASFormer [64] | 1 | 81.9 | 1.134 | 0.272 | 1.66 |
| ActFusion (ours) | 1 | 86.0 | 1.576 | 0.164 | 0.42 |
| ActFusion (ours) | 10 | 87.9 | 1.576 | 0.164 | 1.17 |
| ActFusion (ours) | 25 | 88.5 | 1.576 | 0.164 | 2.01 |

(b) Results on LTA

| method | # inference steps | Avg. performance | # parameters (M) | memory (GB) | inference time (s) |
|---|---|---|---|---|---|
| FUTR [24] | 1 | 26.0 | 17.38 | 0.156 | 0.21 |
| ActFusion (ours) | 1 | 26.2 | 1.576 | 0.151 | 0.26 |
| ActFusion (ours) | 10 | 27.8 | 1.576 | 0.151 | 1.04 |
| ActFusion (ours) | 25 | 28.4 | 1.576 | 0.151 | 2.14 |

[Explanation of Figure 1(c)]

Please refer to the general response for a detailed explanation of Figure 1(c).

Comment

Thank you for your detailed rebuttal. The authors have clarified most of my concerns. I keep my score at weak accept.

Comment

We thank reviewer uA7n for the motivating feedback. We are pleased to hear that most of the concerns have been addressed by our rebuttal. The results discussed in the rebuttal will be included in the final manuscript.

Official Review
Rating: 6

The authors introduce a unified diffusion model for temporal action segmentation (TAS) and long-term action anticipation (LTA), dubbed ActFusion, in which a single model is jointly trained to address the two problems effectively. A new anticipative masking is presented for the effective unification of the two tasks, along with random masking to learn intra-action relations. ActFusion achieves state-of-the-art performance on both TAS and LTA, demonstrating the effectiveness of jointly learning the two tasks across the standard benchmark datasets 50 Salads, Breakfast, and GTEA.

优点

  • Originality
    • The paper presents an approach to integrating two popular vision tasks, temporal action segmentation and long-term action anticipation, within a unified model. This is the first time these two problems have been investigated together in a single framework, highlighting the originality of the research.
  • Clarity
    • The paper is clearly articulated and provides sufficient details. It offers an adequate background on diffusion models and how they are utilized. The inclusion of pseudo-code and actual code facilitates a better understanding for the reader. Additionally, the experimental settings are thoroughly described, which aids the reproducibility of the results.

Weaknesses

  • The methodology lacks novelty. The model architecture, loss function (Cross-Entropy Loss, Temporal Smoothness Loss, Boundary Alignment Loss) and mask strategy (no mask, relation mask, boundary mask) closely resemble those used in the Diffusion Action Segmentation (https://arxiv.org/pdf/2303.17959v2). The primary distinction in this work is only its extension to include the long-term action anticipation task, and the introduction of anticipative masking.
  • The performance improvement relative to other state-of-the-art works is marginal. On TAS task, when compared to DiffAct, this model shows an approximate 1-point improvement across various metrics (F1, edit score, frame-wise accuracy) on different benchmarks (50 Salads, Breakfast, GTEA).
  • The model consistently achieves sub-optimal results when assessed using frame-wise accuracy as the metric on TAS task. As the authors point out, this could potentially be addressed by employing reconstruction methods for masked features. I am eager to see how these adjustments could enhance the model's performance.
  • More ablation studies for the LTA task could be included, e.g., how important is past context for the model (α)? How far into the future can the model predict (β)? Table 2 reports only a limited range of settings. A more thorough analysis of these aspects would be highly valuable.

Questions

  • The approach is currently limited to predicting action labels within a closed set. To extend this work to predict open-set action labels, what can be done?
  • While "segmentation helps anticipation" is evident from Table 3, "anticipation helps segmentation" is considerably less significant in Table 4. What is the reason behind this?
  • If further training the model to reconstruct the original features from the masked features, would it also improve the LTA task?
  • Paper writing
    • What do the circle and triangle mean in Figure 1(c)?
    • In Table 2, why are there multiple underlined values (supposed to be the second-highest value) in each column?

Limitations

  • Limitation
    • The authors note the sub-optimal performance when using frame-wise accuracy to evaluate the TAS task and suggest a potential solution, though its effectiveness remains unproven. I am eager to see how these proposed adjustments might improve the model's performance.
  • Potential negative societal impact
    • There is no potential negative societal impact of this work.
Author Response

[Novelty]

Please refer to the general response regarding novelty.

[Marginal performance]

We would like to clarify that the performance improvements achieved by our model are significant. Figure R1 in the PDF file illustrates the performance of the top-10 TAS models for each dataset listed in Table 1, based on their average performance across all metrics. The average performance gap between adjacent models is 0.5 percentage points (pp), 0.7 pp, and 0.6 pp for 50 Salads, Breakfast, and GTEA, respectively. Our model, ActFusion, achieves gains of 0.7 pp, 0.4 pp, and 1.1 pp over the second-best model on each dataset, and 1.1 pp, 0.4 pp, and 1.5 pp over DiffAct. We believe the performance gains are meaningful, with notable improvements on two datasets.

[Reconstruction of masked features]

Masked auto-encoding is a technique used in training NLP models such as BERT [R1] and has recently been adapted to vision models [21, 27, 68]. Inspired by this approach, we train our model to reconstruct the input video features from the masked tokens as an auxiliary task. Specifically, we employ MLP layers on the encoder embeddings to reconstruct the input video features and apply a mean squared error (MSE) loss $L_{recon}$ during training.
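A minimal sketch of such an auxiliary reconstruction head (the layer sizes and the restriction of the loss to masked positions are assumptions for illustration, not necessarily the exact choice in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureReconstructionHead(nn.Module):
    """MLP head that reconstructs the original frame features from encoder embeddings."""

    def __init__(self, emb_dim, feat_dim, hidden_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, feat_dim)
        )

    def forward(self, enc_emb, target_feats, masked_pos):
        # enc_emb: (T, emb_dim) encoder embeddings; target_feats: (T, feat_dim)
        # original input features; masked_pos: (T,) bool, True where input was masked.
        recon = self.mlp(enc_emb)
        return F.mse_loss(recon[masked_pos], target_feats[masked_pos])
```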

Table R2 shows the overall results on both TAS and LTA tasks. In TAS, overall performance increases. We conjecture that reconstruction helps the model gain a deeper understanding of the underlying data structure and temporal dynamics by predicting the missing parts of the input. In LTA, we find that reconstruction is more effective on relatively short-term anticipation. Since short-term predictions are often based on more immediate context, there is less uncertainty. As a result, reconstructing masked features helps the model capture immediate patterns and transitions more accurately. However, for long-term predictions, as the model attempts to predict further into the future, the uncertainty increases significantly. Long-term predictions involve more variables and potential changes, making them inherently less predictable. This increased uncertainty might cause performance degradation, making reconstruction less effective for action anticipation.

[Table R2. Effects of reconstruction loss]

(a) Results on TAS

| $L_{recon}$ | F1@10 | F1@25 | F1@50 | Edit | Acc | Avg. |
|---|---|---|---|---|---|---|
| - | 91.6 | 90.7 | 84.8 | 86.0 | 89.3 | 88.5 |
| ✓ | 92.0 | 90.9 | 86.6 | 86.9 | 89.6 | 89.2 |

(b) Results on LTA

| $L_{recon}$ | α=0.2, β=0.1 | α=0.2, β=0.2 | α=0.2, β=0.3 | α=0.2, β=0.5 | α=0.3, β=0.1 | α=0.3, β=0.2 | α=0.3, β=0.3 | α=0.3, β=0.5 |
|---|---|---|---|---|---|---|---|---|
| - | 39.55 | 28.60 | 23.61 | 19.90 | 42.80 | 27.11 | 23.48 | 22.07 |
| ✓ | 40.80 | 31.02 | 25.59 | 13.94 | 46.56 | 26.22 | 18.56 | 16.15 |

[R1] J. Devlin et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL’19.

[More analysis on LTA]

Table R3 shows the LTA performance across different observation (α\alpha) and prediction (β\beta) ratios. Average anticipation performance improves as the observation range increases. With more observations, uncertainty about future actions is relatively reduced, leading to more accurate predictions. Conversely, average anticipation performance decreases as the prediction range increases. Predicting further into the future presents more challenges due to increased uncertainty, as future actions become less predictable and more variable.

[Table R3. Analysis of observation and prediction ranges in LTA]

|  | α=0.2 | α=0.3 | α=0.4 | α=0.5 | α=0.6 | α=0.7 | α=0.8 | Avg. |
|---|---|---|---|---|---|---|---|---|
| β=0.1 | 39.56 | 42.81 | 29.87 | 37.07 | 32.74 | 27.53 | 36.73 | 35.2 |
| β=0.2 | 28.60 | 27.11 | 25.41 | 27.16 | 27.68 | 26.91 | 42.28 | 29.3 |
| β=0.3 | 23.61 | 23.48 | 22.46 | 24.50 | 27.99 | 29.94 | - | 25.3 |
| β=0.4 | 22.71 | 21.04 | 22.44 | 25.03 | 31.08 | - | - | 24.5 |
| β=0.5 | 19.90 | 22.07 | 22.85 | 28.08 | - | - | - | 23.2 |
| β=0.6 | 19.28 | 22.78 | 25.71 | - | - | - | - | 22.6 |
| β=0.7 | 19.93 | 25.62 | - | - | - | - | - | 22.8 |
| β=0.8 | 23.34 | - | - | - | - | - | - | 23.3 |
| Avg. | 24.6 | 26.4 | 24.8 | 28.4 | 29.9 | 28.1 | 39.5 |  |
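For clarity, a hedged sketch of the evaluation convention behind these numbers, assuming the standard mean-over-classes protocol for dense long-term anticipation (the helper name and exact rounding are illustrative assumptions):

```python
import torch

def lta_moc_accuracy(pred, gt, alpha=0.2, beta=0.3):
    """Mean-over-classes accuracy of predicted future frames: observe the first
    alpha*T frames, evaluate the following beta*T frames.
    pred, gt: (T,) per-frame action labels over the full video."""
    T = gt.numel()
    start = int(alpha * T)
    end = start + int(beta * T)
    p, g = pred[start:end], gt[start:end]
    # Average frame accuracy separately per ground-truth class, then over classes.
    accs = [(p[g == c] == c).float().mean() for c in g.unique()]
    return torch.stack(accs).mean()
```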

[Extension to open-set action recognition]

We appreciate the reviewer’s insight on extending our work towards open-set action recognition. To achieve this, we can use frozen image and text encoders from CLIP [R2] to obtain shared representations for actions and text embeddings. Similar to [R3], these embeddings can then be integrated into our model to enable open-set action recognition. We plan to explore this as a future direction for improving our approach.

[R2] A. Radford et al. Learning transferable visual models from natural language supervision. In arXiv’21.
[R3] D. Chatterjee et al. Opening the vocabulary of egocentric videos. In NeurIPS’23.
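To make the direction concrete, a minimal sketch of zero-shot scoring against an open action vocabulary with frozen CLIP text embeddings via Hugging Face Transformers; it assumes the frame embeddings have already been projected into CLIP's joint space, and the prompt template and checkpoint name are illustrative choices, not part of the paper.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

@torch.no_grad()
def open_set_action_scores(frame_embeddings, action_names,
                           clip_name="openai/clip-vit-base-patch32"):
    """Score per-frame embeddings against an arbitrary (open) action vocabulary
    by cosine similarity with frozen CLIP text embeddings.
    frame_embeddings: (T, D) with D matching CLIP's projection dim (512 here)."""
    tokenizer = CLIPTokenizer.from_pretrained(clip_name)
    clip = CLIPModel.from_pretrained(clip_name).eval()
    prompts = [f"a video frame of a person {name}" for name in action_names]
    text = clip.get_text_features(**tokenizer(prompts, padding=True, return_tensors="pt"))
    text = F.normalize(text, dim=-1)                # (K, D) one embedding per action name
    frames = F.normalize(frame_embeddings, dim=-1)  # (T, D)
    return frames @ text.T                          # (T, K) similarity scores
```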

[Reasons: segmentation is more helpful for anticipation]

We find that segmentation greatly enhances anticipation, while the effect of anticipation on segmentation is less significant (L271-272). Segmentation directly improves anticipation by providing accurate contextual cues and action boundaries of the observations, enabling the model to make more precise future anticipation. In contrast, anticipation helps segmentation more indirectly. Anticipation encourages the model to consider long temporal relations of actions within an activity, which may not result in immediate performance improvement in segmentation.

[Explanation of Figure 1(c)]

Please refer to the general response for a detailed explanation of Figure 1(c).

[Multiple underlined values in Table 2]

Thank you for pointing this out. The underline on the performance of DiffAct on the 50 Salads dataset will be removed.

Comment

Dear reviewer FW3D,

We would like to thank you again for the effort and time dedicated to our submission. We have addressed your concerns in our rebuttal, and it would be very helpful if you could share any further thoughts and update your score before the author-reviewer discussion phase ends. Your opinion is invaluable to us in improving our work, and we would be glad to respond to any further questions. Thank you for your consideration.

Best regards,
Authors

Comment

Thank you to the authors for providing detailed explanations and conducting additional experiments. These have addressed all of my questions and concerns. Please ensure to include these in the final version of the manuscript. I will be adjusting my rating accordingly.

Comment

Thank you for the response. We are glad to hear that most of the concerns have been addressed by our rebuttal. We would like to thank reviewer FW3D once again for the insightful comments, which led to extensive experiments and new directions for our work. We will make sure to include all experimental results in the final manuscript.

Author Response

We thank all the reviewers for their insightful comments and suggestions. We are happy to see that the reviewers have given our work a positive evaluation, noting that “this is the first time these two problems have been investigated together in a single framework, highlighting the originality” (FW3D), “the design of the ActFusion model is novel” (uA7n), “the model enhances its performance on both tasks through mutual promotion of TAS and LTA” (uA7n), and “it is also intuitive and reasonable to extend a generative framework from segmentation to anticipation, given the generative nature of anticipation” (zKDz).

Nevertheless, the reviewers also raise important points:

  1. the novelty of the proposed method should be explained,
  2. revealing the effects of learning reconstruction of masked features for TAS is suggested,
  3. further analyses on LTA, loss functions, computational cost, and dataset are suggested.

Through this rebuttal, we aim to clearly articulate our novelty and provide further experimental results and analyses. We will revise the manuscript by incorporating the detailed comments from the reviewers.

In the general response, we address the questions posed by all reviewers regarding novelty and the explanation of Figure 1(c).

[Novelty]

The primary novelty of the proposed method lies in unifying two popular video tasks, temporal action segmentation (TAS) and long-term action anticipation (LTA). The unification is not merely an extension but a novel framework that leverages the bi-directional benefits between TAS and LTA, maximizing the synergy between the two tasks. None of the previous works [42, 24, 51] has introduced a single unified model to tackle the two tasks or explored the bi-directional benefits. This integration is crucial for practical applications, such as human-assistant robots, which need to recognize and anticipate future actions simultaneously.

To achieve successful task integration, we introduce two types of masking strategies: anticipative masking ($M^A$) and random masking ($M^R$). Anticipative masking plays a crucial role in effective task integration, while random masking helps the model learn intra-action relations within a video. However, simply incorporating these masking strategies does not necessarily guarantee optimal performance on both TAS and LTA. In Table R1, we applied anticipative and random masking to DiffAct by replacing visual embeddings with zero vectors after encoder processing and using them as conditions for the diffusion process in the decoder. The results in Table R1 show that these masking strategies improve TAS performance but remain below existing state-of-the-art models for LTA [51, 19, 25, 24, 65].

We hypothesize that this is likely due to the limited information in the zero vectors used for future anticipation, which does not fully leverage the information from the visible tokens. To address this, we propose a learnable masking strategy, in which input visual features are replaced with learnable mask tokens before being fed to the encoder. These tokens are trained to capture temporal relations between visible and invisible parts through attention mechanisms. In our model, both the encoder and decoder are trained to handle visible and invisible parts for effective task unification. The introduced masking strategy maximizes the synergy between the two tasks, leading to state-of-the-art performance on both TAS and LTA in Tables 1 and 2. We believe our approach presents a novel integrative framework that unifies the two tasks by introducing an effective learnable masking strategy with two types of masking.
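A minimal sketch of the learnable-mask-token idea described above (module and variable names are illustrative, not the actual implementation): masked frame features are replaced with a shared learnable token before the encoder, so attention can propagate information from visible to masked positions.

```python
import torch
import torch.nn as nn

class MaskedEncoderInput(nn.Module):
    """Replace masked frame features with a learnable mask token before the encoder."""

    def __init__(self, feat_dim):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))

    def forward(self, feats, mask):
        # feats: (T, D) frame features; mask: (T,) bool, True where the frame is hidden
        # (future frames for anticipative masking, random frames for random masking).
        out = feats.clone()
        out[mask] = self.mask_token  # broadcast the (D,) token to all masked frames
        return out
```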

[Table R1. Effects of a learnable masking strategy]

(a) Results on TAS

| method | $M^A$ | $M^R$ | F1@10 | F1@25 | F1@50 | Edit | Acc | Avg. |
|---|---|---|---|---|---|---|---|---|
| DiffAct | - | - | 90.1 | 89.2 | 83.7 | 85.0 | 88.9 | 87.6 |
| DiffAct | ✓ | ✓ | 91.1 | 89.8 | 84.1 | 85.9 | 88.9 | 87.9 |
| ActFusion (ours) | ✓ | ✓ | 91.6 | 90.7 | 84.8 | 86.0 | 89.3 | 88.5 |

(b) Results on LTA

| method | $M^A$ | $M^R$ | α=0.2, β=0.1 | α=0.2, β=0.2 | α=0.2, β=0.3 | α=0.2, β=0.5 | α=0.3, β=0.1 | α=0.3, β=0.2 | α=0.3, β=0.3 | α=0.3, β=0.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| DiffAct | - | - | 11.8 | 11.3 | 11.3 | 10.7 | 20.0 | 17.2 | 16.5 | 16.6 |
| DiffAct | ✓ | ✓ | 30.3 | 27.0 | 19.1 | 11.3 | 37.4 | 22.1 | 15.6 | 13.0 |
| ActFusion (ours) | ✓ | ✓ | 42.8 | 33.9 | 26.0 | 20.7 | 43.1 | 25.8 | 21.3 | 20.7 |

[Explanation of Figure 1(c)]

We apologize for not providing detailed explanations for Figure 1(c). In this figure, the circles represent the main tasks the models are proposed to address, while the triangles indicate auxiliary tasks used during training but not evaluated. We will include the descriptions in the final manuscript.

Final Decision

All three reviewers were inclined to accept the paper. All reviewers agree that the paper tackles an important problem in an interesting way, and the results show that the method improves significantly over the previous state of the art. The AC has read all reviews and, aligned with the reviewer recommendations, recommends that the paper be accepted to NeurIPS.