PaperHub
NeurIPS 2024 · Poster · 4 reviewers
Rating: 5.8 / 10 (individual ratings: 6, 5, 5, 7; min 5, max 7, std 0.8)
Confidence: 3.3 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.5

OPEL: Optimal Transport Guided ProcedurE Learning

Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

An unsupervised optimal transport guided method for procedure learning, which achieves state-of-the-art results

Abstract

Keywords
Procedure learning, Egocentric vision, EgoProceL, Optimal Transport

Reviews and Discussion

Official Review
Rating: 6

The paper proposes an unsupervised method for procedure learning that identifies the key steps and their order across several videos of the same task. The paper treats video frames as samples from a distribution and formulates the distance computation between key steps as an optimal transport (OT) problem. To handle variations across videos, the paper introduces a regularization with prior distributions. The proposed method achieves SOTA on third-person and first-person video benchmarks.

Strengths

  • Unlike prior methods that assume ordering constraints, the paper relaxes this assumption by formulating the problem as OT.

  • The paper addresses the variation between videos by introducing a regularization with priors to enhance the OT formulation.

  • The paper shows SOTA results in both first-person and third-person evaluations.

Weaknesses

  • The explanation of how the priors mitigate video variations, i.e., action speeds, non-monotonic sequences, or different action start points, is not clear.
  • Besides the ablation studies of different prior distributions, the paper should explain why the Laplace prior outperforms other distributions.

Questions

The authors should give a clear explanation of the regularization usage, as mentioned in the weaknesses.

Limitations

The authors have mentioned the limitations.

Author Response

Thanks for your encouraging feedback. Please find the answers to your specific comments:

Explanation of priors: The concept of the Optimality Prior is crucial when dealing with video alignment, especially in challenging scenarios. When two videos are perfectly aligned (case 1 of Fig. 2B), the transport matrix $\hat{T}$ becomes strictly diagonal. However, the real value of the optimality prior emerges in more complex situations, such as speed variation, non-monotonicity, etc.

Speed variation: Consider the scenario in Fig. 2B, case 3: video 1, shown at the top, takes two frames to complete task A, while video 2, displayed on the left, completes task A in one frame. Here, both frames 1 and 2 from video 1 should be aligned with frame 1 of video 2. The optimality prior enables this alignment, as depicted by the blue curve in Fig. 2C and explained by Eqn. 2. The procedure involves finding, for each frame $i$ from video 2, the closest matching frame from video 1 based on feature similarity, even if the corresponding frame $j$ from video 1 has a different index. In Fig. 2C, for example, frame $j$ from video 1 has the highest likelihood of aligning with frame $i$ from video 2, with the assignment likelihood decaying exponentially for frames further away from $j$.

The optimality prior also addresses non-monotonic cases (Fig. 2B, case 4). Here, though the first two frames from video 1 align with those in video 2, the third frame of video 1 reverts to an earlier task A. To correctly align this frame, it must be matched with its closest counterpart in video 2, which is frame 1 in this case. The optimality prior facilitates this by aligning non-monotonic frames that share a higher feature-wise match (similar to Fig. 2C).

However, relying solely on the optimality prior is insufficient because it overlooks the temporal ordering inherent in videos, which is crucial for maintaining temporal coherence. As discussed in the regularization section on page 4 of our paper and illustrated in Fig. R4 (rebuttal pdf), optimizing only for optimality can result in temporal incoherence. Proper alignment should account for the temporal relationship between frames, ensuring that corresponding frames in one sequence align closely with adjacent frames in the other.

This necessity leads to a second critical factor: the Temporal Prior. Unlike the optimality prior, which seeks the best feature match regardless of temporal distance, the temporal prior promotes alignment between frames that are temporally adjacent, thereby preserving the overall temporal coherence. Similar coherence-based concepts have been utilized in other temporal alignment works, such as TCC [a] and contrastive regularization [b]. This temporal prior encourages the alignment matrix to exhibit peak values along the diagonal, with diminishing values away from the diagonal; we model this with a Laplace distribution (Eqn. 3 and the red curve in Fig. 2C).

Essentially, there are two factors at play: (i) optimality, which tries to find the best match between frames irrespective of their temporal distance (and may therefore result in temporally incoherent alignment), and (ii) the temporal factor, which promotes transport between nearby frames without considering their feature match. We hypothesize (and later validate with results) that the optimal solution requires a balance between both, and therefore propose to merge these two priors addressing the above factors, as expressed in Eqn. 4. This combined prior ensures accurate alignment between videos considering all the factors, as further illustrated in Fig. 2D and supported by the results in Table 5 of the main paper.
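
A minimal NumPy sketch of this combination (illustrative only; the bandwidths b_opt/b_tmp, the weighted merge with alpha, and the variable names are assumptions rather than the paper's exact Eqns. 2-4):

```python
# Illustrative sketch only -- not the paper's exact Eqns. 2-4. The Laplace
# bandwidths b_opt/b_tmp, the weighted sum, and the normalization are assumptions.
import numpy as np

def laplace_kernel(dist, b):
    """Unnormalized Laplace density exp(-|d| / b)."""
    return np.exp(-np.abs(dist) / b)

def combined_prior(feat1, feat2, b_opt=1.0, b_tmp=0.1, alpha=0.5):
    """feat1: (N, D) frame features of video 1; feat2: (M, D) of video 2."""
    N, M = feat1.shape[0], feat2.shape[0]

    # Optimality prior: the likelihood of aligning (i, j) decays with their
    # feature-space distance, regardless of how far apart i and j are in time.
    feat_dist = np.linalg.norm(feat1[:, None, :] - feat2[None, :, :], axis=-1)
    p_opt = laplace_kernel(feat_dist, b_opt)

    # Temporal prior: the likelihood peaks on the (normalized) diagonal and
    # decays with the temporal offset |i/N - j/M|, preserving temporal coherence.
    offset = np.arange(N)[:, None] / N - np.arange(M)[None, :] / M
    p_tmp = laplace_kernel(offset, b_tmp)

    # Merge the two factors and renormalize into a joint prior over (i, j).
    prior = alpha * p_opt + (1.0 - alpha) * p_tmp
    return prior / prior.sum()

# Toy usage: two short "videos" with 4-dimensional frame features.
rng = np.random.default_rng(0)
P = combined_prior(rng.normal(size=(6, 4)), rng.normal(size=(5, 4)))
print(P.shape, round(P.sum(), 6))  # (6, 5) 1.0
```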

When the actions start at different points (case 2, Fig. 2B), video 2 starts with a background frame while video 1 starts with task A. As mentioned on page 5 of our paper, we introduce a 'virtual frame' to handle such cases. Even here, the combined prior comes into play: we assign a frame to the virtual frame if the likelihood of that frame aligning with any task-related frame from the other video falls below a predefined threshold. This phenomenon is also illustrated in Fig. 4(B) of the main paper.
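
A simple sketch of this virtual-frame rule (illustrative only; the threshold tau, the -1 label, and the function name are assumptions, not the actual implementation):

```python
# Hypothetical sketch of the virtual-frame rule: a frame whose best alignment
# likelihood against every frame of the other video falls below a threshold is
# routed to a virtual (background) frame. `tau` and the -1 label are assumptions.
import numpy as np

def assign_with_virtual_frame(prior, tau=1e-3):
    """prior: (N, M) alignment likelihoods of video-1 frames vs video-2 frames.
    Returns, per video-1 frame, the index of its video-2 match, or -1 (virtual)."""
    best_match = prior.argmax(axis=1)   # strongest counterpart for each frame
    best_score = prior.max(axis=1)      # likelihood of that counterpart
    return np.where(best_score < tau, -1, best_match)

# Toy usage: the third frame has no good counterpart, so it becomes virtual.
prior = np.array([[0.60, 0.02], [0.05, 0.55], [0.0004, 0.0007]])
print(assign_with_virtual_frame(prior))  # [ 0  1 -1]
```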

Laplace distribution: As shown in appendix Table A6 (main paper), the Laplace prior outperforms other distributions. To analyze this, we plot the distributions in Fig. R3 (rebuttal pdf). Note that we use the same distribution for both the optimality and the temporal prior.

For the optimality prior, the x-axis is the difference of frames in the feature space (a 1-d representation for illustration purposes), and the y-axis denotes the corresponding probability of alignment. We want the point representing the most likely alignment (as per $\hat{T}$) to have the highest likelihood, while the assignment probability should decay exponentially further away. The graph clearly shows that the Laplace distribution captures this behavior better than the Uniform and Gaussian. Similarly, for the temporal prior, the x-axis denotes the temporal distance between the frames, and the y-axis denotes the corresponding probability of alignment. The graph shows that the Laplace distribution facilitates alignment of the frames when they are temporally close (near the center), and its long-tailed shape enables better correlation of non-monotonic frames compared to the Gaussian or Uniform. As a result, as shown in Fig. R3, even at locations far away from the center (temporally distant frames), alignment is possible if these frames indeed have a high feature-wise match. In this case, unlike the other distributions, the Laplace temporal prior assigns a non-zero probability to that far-away frame (due to its long tail), while the optimality prior gives it a large score (due to the feature match), resulting in improved handling of non-monotonicity.
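
A quick numeric illustration of this long-tail argument (with arbitrary unit scales, not values taken from Fig. R3):

```python
# Arbitrary unit scales, purely to illustrate the heavier Laplace tail.
import math

def laplace_pdf(d, b=1.0):
    return math.exp(-abs(d) / b) / (2.0 * b)

def gaussian_pdf(d, s=1.0):
    return math.exp(-d * d / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))

for d in (1, 3, 6):  # temporal (or feature) distance from the center
    print(d, f"{laplace_pdf(d):.2e}", f"{gaussian_pdf(d):.2e}")
# d=6: Laplace ~1.24e-03 vs Gaussian ~6.08e-09 -- the Laplace prior still allows
# a distant but feature-matched frame to be aligned; the Gaussian effectively does not.
```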

Refs:

[a] Dwibedi et al. Temporal cycle-consistency learning. CVPR 2019.

[b] Haresh et al. Learning by aligning videos in time. CVPR 2021.

Comment

Dear Reviewer, thanks again for your feedback. We hope you will consider our responses to your comments; we look forward to the discussion.

Comment

I thank the authors for their responses. I am satisfied with the answers. After reading the other reviews, I will raise my score to 6.

Comment

We thank the reviewer for the consideration and are glad that the reviewer liked our responses.

Official Review
Rating: 5

The authors propose a novel approach for procedure learning leveraging optimal transport, OPEL. OPEL integrates optimality and temporal priors, and incorporates a novel inter-video contrastive loss. OPEL achieves significant improvements on egocentric and third-person benchmarks.

Strengths

  1. An interesting and relevant topic regarding the treatment of procedure learning as an optimal transport problem.
  2. The authors thoroughly consider deviations in real-world sequences, including background frames, redundant frames, and non-monotonic frames.
  3. OPEL achieves excellent results across several benchmarks.

Weaknesses

  1. The presentation of the paper needs improvement, especially in Section 3, where an excessive number of formulas obstructs readability.
  2. Some figures are unclear, such as Figure 1. Are V1, V2, V3, V4 from four different videos, or are they from the same video? Additionally, what distinguishes V3 from V4, where one represents background frames and the other represents redundant frames?

Questions

  1. I still have some doubts regarding the difference between procedure learning and action segmentation. I believe their final output forms are the same. From the related work, I gather that action segmentation involves classifying each frame, while procedure learning only needs to identify key steps. Is this correct? Could you please explain in detail?
  2. I also feel that the paper doesn't clearly present the problem formulation, including the input and output, which makes it somewhat confusing. Does the output video contain a fixed number of K key steps? According to the paper, setting K to 7 is optimal. However, depending on the dataset, the average number of steps for each task may be greater or smaller than K. When it exceeds K, are some steps merged? And what happens when it is less than K? Is it reasonable to pre-define K?
  3. Could you provide more results of the ablation on K for additional tasks?

Limitations

Yes

Author Response

Thanks for your encouraging feedback. Please find answers to your specific comments:

Presentation: We agree that the equations are too dense in Section 3, and the formulations can be simplified. We will update accordingly as per the reviewer’s suggestion during revision.

V1 to V4 in Fig. 1 are from four different videos, while the frames of each video are shown along the x-axis. We apologize for the ambiguity between background frames and redundant frames; Fig. 1 has been updated as Fig. R1 (rebuttal pdf). Both refer to frames not directly related to any key-step (hence they do not belong to any key-step cluster) and are thus grouped together into a background cluster (Fig. R1, right). So, we remove the term 'redundant frames' for clarity.

PL vs AS: Procedure learning (PL) and action segmentation (AS) are related but not the same. PL, when applied to a set of videos of the same task, involves two primary steps: (a) assigning each frame to one of the k key-steps (including background elements), and (b) determining the logical sequence of the key-steps to complete the task. As illustrated in Fig. 1, PL addresses multiple videos of a given task, enabling the identification of repetitive key-steps across these videos [a, b]. In contrast, AS [c] focuses on a single video, thereby lacking the ability to discern repetitive key-steps across videos. Additionally, AS does not consider the sequence of individual events, which is often crucial for accurately identifying procedures. For instance, AS fails to capture the variations in the order of key-steps, such as those observed between V1 and V2 in Fig. 1.

In our work, the OPEL loss is used to train a representation learning algorithm to obtain an embedding space where similar frames are located close by. Then, as described in the 'Clustering and Key-step Ordering' part of page 6 of the main paper, the frames are clustered into key-steps, and their sequential order is determined. A PyTorch function (with a toy example) is provided in codeblock R1 of the rebuttal pdf to illustrate this process of finding the sequential order (part (b) of the PL definition above) from the frame-wise key-step predictions (the output of part (a)).

Problem formulation: Given a set of videos of the same task (e.g. making brownies), PL aims to find the constituent key-steps (e.g. break egg, mix contents, add oil, mix egg) and their sequential order (break egg --> mix egg --> add water --> add oil --> mix contents).

During training: Inputs are untrimmed and unlabeled videos. We train with pairs of videos at a time (formulation detailed in Section 3 of the main paper). Output is the frames clustered into different key steps and their sequential order. For evaluation metrics, we use framewise cluster predictions, following other SOTA works [a,b,d].

During inference: Input: a single video $P = [p_1, p_2, \ldots, p_N]$. Output: each frame is assigned to a phase using the trained model $f_{\theta}$: $f_{\theta}(p_i) = l_i$, where $l_i \in \{c\}_{c=1}^{k}$ represents the key-step corresponding to the phase of each frame $p_i$ in the video $P$.

From all $l_i$, we determine the sequential order of the sub-tasks (e.g., 0, 3, 2, 4, 1, ...). Note that the cluster numbering is arbitrary, with each cluster denoting a key-step. We provide PyTorch code illustrating this process of finding the sequential order in codeblock R1 of the rebuttal pdf.
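
Since codeblock R1 of the rebuttal pdf is not reproduced on this page, a hypothetical minimal stand-in is sketched below: it orders key-step clusters by the average normalized position at which their frames occur across videos (the function name, the -1 background label, and the averaging rule are assumptions, not necessarily codeblock R1 itself):

```python
# Hypothetical stand-in for codeblock R1: given frame-wise key-step labels l_i
# for one or more videos, order the key-steps by the average normalized position
# at which their frames occur. The averaging rule is an illustrative assumption.
import torch

def keystep_order(frame_labels, background_label=-1):
    """frame_labels: list of 1-D LongTensors, one per video, holding l_i per frame.
    Returns key-step ids sorted from earliest to latest average occurrence."""
    sums, counts = {}, {}
    for labels in frame_labels:
        n = labels.numel()
        positions = torch.arange(n, dtype=torch.float32) / max(n - 1, 1)
        for step in labels.unique().tolist():
            if step == background_label:
                continue  # the background cluster carries no ordering information
            mask = labels == step
            sums[step] = sums.get(step, 0.0) + positions[mask].sum().item()
            counts[step] = counts.get(step, 0) + int(mask.sum())
    return sorted(sums, key=lambda s: sums[s] / counts[s])

# Toy example: two videos of the same task, frame-wise labels from {0, 1, 2, 3},
# with -1 marking background frames.
v1 = torch.tensor([0, 0, 3, 3, 2, 2, 1, 1])
v2 = torch.tensor([0, 3, 3, 2, 1, 1, 1, -1])
print(keystep_order([v1, v2]))  # -> [0, 3, 2, 1]
```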

Choice of k: In unsupervised PL, k is not set during training (learning of the embeddings); rather, k is used only during inference to fix the number of output clusters, and each frame of the video being inferred is mapped to one of those k clusters. Note that an output video may not contain exactly k key-steps (e.g., in making a sandwich, steps such as adding jelly or butter may be present in some videos and absent in others).

We obtain the best results with k=7 (Table 7), consistent with other SOTA methods on the same datasets [a,b,d]. We agree that, depending on the dataset, the average number of steps per task may be greater or smaller than k. If it exceeds k, some steps get merged; e.g., for PC Disassembly, though the ground-truth number of steps is 9, three are quite similar (remove hard disk, remove motherboard, remove RAM), effectively placing them close together in the feature space. As a result, choosing k=7 merges these three key-steps. When the average number of key-steps is smaller than k, each subtask gets split into multiple smaller clusters with very similar embeddings. This phenomenon is illustrated in Fig. R2 (rebuttal pdf), where a larger k can produce several split clusters spanning small windows (blue and red boxes in Fig. R2).

Note that this demarcation of subtasks (i.e., the number of clusters) varies from task to task and dataset to dataset, and is subjective, as some may consider semantically similar steps (e.g., pouring oil vs. water) to be the same subtask while others may not. So, for best performance, k might be adjusted task-wise (using methods such as the elbow method or the AIC criterion). However, optimizing a task-wise k is not the goal of this work; our contribution is the OT-based learning, and the clustering is only used as an inference post-processing step. So, for fair comparison, we experiment with the same reasonable values of k as prior art [a,b,d] and consistently outperform them across all k and all datasets, with similar trends.
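
A simplified illustration of this inference-time clustering step and the elbow heuristic (the embedding size, the candidate range of k, and the random placeholder embeddings are assumptions, not the actual pipeline):

```python
# Simplified sketch of inference-time clustering and an elbow scan over k.
# The 128-d random embeddings and the candidate range are placeholder assumptions.
import numpy as np
from sklearn.cluster import KMeans

def cluster_frames(frame_embeddings, k=7, seed=0):
    """Map each frame embedding to one of k key-step clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(frame_embeddings)
    return km.labels_, km.inertia_

def elbow_scan(frame_embeddings, candidates=(5, 7, 10, 12, 15)):
    """Within-cluster inertia per candidate k; the 'elbow' suggests a task-wise k."""
    return {k: cluster_frames(frame_embeddings, k)[1] for k in candidates}

# Toy usage with random embeddings standing in for the learned frame features.
emb = np.random.default_rng(0).normal(size=(400, 128))
labels, _ = cluster_frames(emb, k=7)
print(labels[:10])
print(elbow_scan(emb))
```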

We report an additional ablation on varying k in the rebuttal pdf, Table R1. We find that the performance degrades as k is increased from 7 → 10 → 12 → 15; thus, overall, k=7 provides the best performance. Note that our contribution is the OT-based learning; the clustering is only used as an inference post-processing step. Overall, we observe similar trends for varying k as the SOTA works [a,b,d].

[a] Bansal et al. My view is the best view: Procedure learning from egocentric videos. ECCV 2022.

[b] Bansal et al. United we stand, divided we fall: Unitygraph for unsupervised procedure learning from video. WACV 2024.

[c] Kumar et al. Unsupervised Action Segmentation by Joint Representation Learning and Online Clustering. CVPR 2022.

[d] Shah et al. Steps: Self-supervised key step extraction and localization from unlabeled procedural videos. ICCV 2023.

Comment

Dear Reviewer, thanks again for your feedback; we hope you will consider our responses to your comments. Please let us know if you have any further queries; we look forward to the discussion.

Official Review
Rating: 5

The paper presents OPEL, a novel framework for procedure learning from videos that leverages optimal transport (OT) to align key steps across different video instances. OPEL treats video frames as samples from an unknown distribution and formulates the distance calculation between them as an optimal transport problem, allowing for more flexible alignment compared to frame-to-frame mappings. The authors introduce two regularization terms to improve the OT formulation, and experiments show that OPEL significantly outperforms state-of-the-art methods on benchmark datasets.

Strengths

  1. This paper proposes a novel optimal transport-based procedure learning framework that aligns frames with similar semantics together in an embedding space.
  2. The paper enhances the OT formulation with two regularization terms that address temporal and semantic relationships, contributing to better alignment and learning.
  3. OPEL demonstrates effective performance improvements over previous state-of-the-art methods on benchmark datasets.

Weaknesses

  1. I believe the authors can compare with recent action segmentation methods to further strengthen the experiments.
  2. I believe the authors can also compare with methods that use temporal alignment techniques like dynamic time warping to strengthen their contributions.

Questions

I do not have specific questions on the model design. I only have some experiment suggestions which are the same as I listed in "weaknesses".

Limitations

N/A

Author Response

Thanks for your encouraging and insightful feedback. Please find the answers to your specific comments:

Comparison with AS methods: Procedure learning (PL) and action segmentation (AS) are related but not the same. PL, when applied to a set of instructional videos depicting the same task, involves two primary steps: (i) assigning each video frame to one of the k key-steps (including background elements), and (ii) determining the logical sequence of these key-steps necessary to complete the task. As illustrated in Fig. 1 of the main paper, PL addresses multiple videos of a given task, enabling the identification of repetitive key-steps across these videos [i, j]. In contrast, AS [k] focuses on a single video, thereby lacking the ability to discern repetitive key-steps across different videos.

Despite the differences between PL and AS, as per the reviewer's suggestion, we compare our approach against existing SOTA unsupervised AS models [a-f] and present the results in Table X below. Our model demonstrates a significant performance improvement compared to these works. In [b, f], the authors report a high recall score for CrossTask because the majority of the frames are assigned to a single key-step - a phenomenon also reported by [j]. While achieving high recall is important for ensuring that most positive instances are correctly identified, it can result in a greater number of false positives, which in turn lowers precision and leads to undesirable results. Therefore, it is crucial to balance recall with precision to develop an effective model. This balance is reflected in the superior performance of our model, as evidenced by the F1-score results across various benchmarks. Note that in Table X our approach is compared with SOTA unsupervised AS methods only on third-person datasets, as these methods do not report any results on egocentric datasets.

Table X: Comparison of our approach with SOTA Unsupervised Action Segmentation methods. Note '-' denotes authors have not provided any data on those metrics.

| Action Segmentation Papers | ProceL P | ProceL R | ProceL F1 | CrossTask P | CrossTask R | CrossTask F1 |
|---|---|---|---|---|---|---|
| JointSeqFL (2019) [a] | - | - | 29.8 | - | - | - |
| Elhamifar et al. (2020) [b] | 9.5 | 26.7 | 14.0 | 10.1 | 41.6 | 16.3 |
| Fried et al. (2020) [c] | - | - | - | - | 28.8 | - |
| Shen et al. (2021) [d] | 16.5 | 31.8 | 21.1 | 15.2 | 35.5 | 21.0 |
| Dvornik et al. (2022) [e] | - | - | - | - | - | 25.3 |
| StepFormer (2023) [f] | 18.3 | 28.1 | 21.9 | 22.1 | 42 | 28.3 |
| OPEL | 33.6 | 36.3 | 34.9 | 35.6 | 34.8 | 35.1 |
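
For reference, the precision/recall balance noted above follows from F1 being the harmonic mean of the two; the numbers below are hypothetical, chosen only to show that a high-recall, low-precision segmenter still obtains a low F1.

$$
F_1 = \frac{2PR}{P + R}, \qquad \text{e.g. } P = 10\%,\; R = 40\% \;\Rightarrow\; F_1 = \frac{2 \cdot 10 \cdot 40}{10 + 40} = 16\%.
$$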

Temporal Alignment methods: Following the reviewer's suggestion, we have included additional comparisons with temporal alignment techniques in Table Y. Specifically, we have compared our model with methods like TCC [g] and LAV [h], which incorporate temporal cycle consistency (TCC) and dynamic time warping (DTW), respectively. Other methods like CnC [i] use TC3I (TCC + contrastive inverse difference moment, C-IDM) as the loss function, while GPL [j] uses a graph-based representation for temporal alignment. Our results clearly demonstrate the efficacy of our approach compared to these established methods, further validating the effectiveness of our model in maintaining temporal alignment while delivering superior PL performance. Note that in Table Y our approach is compared with SOTA unsupervised temporal alignment methods only on egocentric (first-person) datasets, as these methods do not report any results on third-person datasets.

Table Y: Comparison of our approach with existing Unsupervised Temporal Alignment methods.

| Temporal Alignment Methods | CMU-MMAC Prec | CMU-MMAC F1 | CMU-MMAC IoU | MECCANO Prec | MECCANO F1 | MECCANO IoU | EPIC-Tent Prec | EPIC-Tent F1 | EPIC-Tent IoU | EGTEA-GAZE+ Prec | EGTEA-GAZE+ F1 | EGTEA-GAZE+ IoU | PC Assembly Prec | PC Assembly F1 | PC Assembly IoU | PC Disassembly Prec | PC Disassembly F1 | PC Disassembly IoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCC [g] | 18.5 | 19.7 | 9.5 | 15.1 | 17.9 | 8.7 | 14.2 | 14.9 | 7.8 | 17.5 | 19.7 | 8.8 | 19.9 | 21.7 | 11.6 | 22.0 | 23.4 | 12.2 |
| LAV (DTW) [h] | 20.6 | 21.1 | 9.4 | 14.6 | 17.4 | 7.1 | 15.2 | 15.8 | 8.3 | 17.4 | 19.1 | 8.0 | 21.5 | 22.7 | 11.7 | 26.4 | 26.5 | 12.9 |
| LAV + TCC | 18.8 | 19.7 | 9.0 | 13.4 | 15.6 | 7.3 | 16.0 | 16.7 | 8.5 | 16.4 | 18.6 | 7.5 | 21.6 | 21.1 | 10.8 | 21.0 | 24.3 | 12.3 |
| CnC (TC3I) [i] | 21.6 | 22.7 | 11.1 | 15.5 | 18.1 | 7.8 | 17.1 | 17.2 | 8.3 | 19.6 | 21.7 | 9.5 | 25.0 | 25.1 | 12.8 | 28.4 | 27.0 | 14.8 |
| GPL [j] | 30.3 | 31.7 | 17.9 | 18.8 | 20.7 | 10.0 | 17.9 | 19.8 | 9.1 | 23.8 | 27.1 | 16.0 | 27.1 | 27.5 | 15.2 | 28.1 | 26.7 | 15.2 |
| OPEL | 32.8 | 36.5 | 18.8 | 28.9 | 39.2 | 20.2 | 18.8 | 20.7 | 10.6 | 24.3 | 29.5 | 13.2 | 32.5 | 33.7 | 17.9 | 29.6 | 32.2 | 16.7 |

Thanks again for the suggestions to strengthen our paper, looking forward to your kind consideration.

Refs.:

[a] Elhamifar et al. Unsupervised Procedure Learning via Joint Dynamic Summarization, ICCV 2019.

[b] Elhamifar et al. Self-supervised multi-task procedure learning from instructional videos, ECCV 2020.

[c] Fried et al. Learning to segment actions from observation and narration, ACL 2020.

[d] Shen et al. Learning to segment actions from visual and language instructions via differentiable weak sequence alignment, CVPR 2021.

[e] Dvornik et al. Flow Graph to Video Grounding for Weakly-Supervised Multi-step Localization, ECCV 2022.

[f] Dvornik et al. StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos, CVPR 2023.

[g] Dwibedi et al. Temporal cycle-consistency learning, CVPR 2019.

[h] Haresh et al. Learning by aligning videos in time, CVPR 2021.

[i] Bansal et al. My view is the best view: Procedure learning from egocentric videos. ECCV 2022.

[j] Bansal et al. United we stand, divided we fall: Unitygraph for unsupervised procedure learning from video. WACV 2024.

[k] Kumar et al. Unsupervised Action Segmentation by Joint Representation Learning and Online Clustering. CVPR, 2022.

Comment

Dear Reviewer, thanks again for your feedback; we hope you will consider our responses. Please let us know if you have any further queries; we look forward to the discussion.

Official Review
Rating: 7

OPEL is a novel technique for procedure learning. Procedure learning is the task of finding key steps in an activity (such as cooking brownies) and aligning the videos based on these key steps. OPEL proposes to use the optimal transport distance between two videos as the similarity metric rather than a direct frame-by-frame mapping or assuming a strict monotone mapping. This work also proposes two regularizers to incorporate priors about monotone mapping and about increasing the correspondence between nearby frames. The results on both egocentric and third-person videos show significant improvement over the previous SOTA methods. The ablation studies analyze the effect of each of the losses and regularizers. It seems that the temporal prior and the inter/intra cross-entropy are the most important factors in terms of F1 accuracy.

Strengths

This is a sound and novel technique for procedure learning. Using optimal transport between frames significantly improves the results and adds leniency when the frames are not exact temporal matches. This is common in real-world scenarios, since not all the steps of a procedure need to happen in the same chronological order. Two sub-procedures can be interchangeable, and using optimal transport with only a regularizer on temporal monotony can account for this.

The experiments sufficiently evaluate the method on several datasets. Over all the datasets, this method outperforms the prior work.

The manuscript is written in a detailed fashion, and full information is provided for the sake of replicability. Also, the provided ablation study improves the clarity of what is important in this method.

Weaknesses

There are many factors added together to make this method work. Although many of them are common in other methods as well, the method and the ablation seem to suggest that some of the losses may be redundant. It is not clear if the difference between some of the lines in Table 5 is statistically significant.

Page 6 is written in such a compact way that it is not easy to read and follow. Indices such as i, j can probably be dropped. Words such as 'temperature' in the formulas can probably be replaced with τ.

Questions

Why do you think the number of steps makes such a large difference in Table 7? Going from 7 to 10 drops the performance significantly. Another anomaly is 12 being worse than both 10 and 15.

Limitations

Not explicitly.

Author Response

Thanks for your encouraging and insightful feedback. Please find the answers to your specific comments. Also please refer to our overall rebuttal response to all reviewers and the corresponding pdf containing additional figures and tables to support our claims.

Method factors: The reviewer correctly points out that not all the components are equally critical for OPEL's performance. We show the contribution of each factor individually to analyze its effect on the overall result in Table 5. Comparing row 3 with row 9, we observe that the priors jointly play a critical role; without them (row 3), the F1 and IoU scores drop by ~5 points. Specifically, the optimality prior has a significant impact (~2 points), while the temporal prior affects the score by ~1 point. Similar to the combined priors, the intra- and inter-video contrastive losses together (see row 6 vs row 9) have a significant effect (~3.5 points) on the overall performance. The individual effect of the virtual frame is negligible, as it only plays a role in the case of excessive background frames - a scenario that is not prevalent in most datasets. Furthermore, due to the IDM structure of $M(\hat{T})$, $\hat{T}$ and $\hat{Q}$ are similar by formulation. This results in $KL(\hat{T} \parallel \hat{Q})$ already being small. As a consequence, adding the KL divergence as a standalone loss component in the proposed pipeline has a minimal impact. Overall, while some loss components may have a smaller individual impact, they do contribute to performance improvements, even if incrementally. Therefore, our proposed approach incorporates all of them to achieve the best possible results. We are encouraged by the reviewer's note that the provided ablation study improves the clarity of what is important in this method.

Writing: The equations are an integral part of explaining the proposed approach and the different loss terms. However, we agree that page 6 is too dense and that the formulations can be simplified to improve readability. We will update the notation and improve the overall writing as per the reviewer's suggestion during revision.

Effect of k: Note that we obtain the best results with k=7 (Table 7), and performance drops sharply as k goes from 7 to 10. This observation is consistent with all the other SOTA methods on the same datasets [a, b, c]. We hypothesize that k=7 works best as it is the optimal number of clusters considering the average number of distinct key-steps (subtasks) of the datasets. For example, for PC Disassembly, although the ground-truth (GT) number of steps is 9, three steps are quite similar (remove hard disk, remove motherboard, remove RAM), effectively placing them close together in the feature space. This results in k=7 being a better estimate of the number of distinct clusters. Note that this demarcation of subtasks (hence the number of clusters) is subjective and varies from dataset to dataset as well as from task to task, as some may consider semantically similar steps (e.g., pouring oil vs. water) to be one subtask while others may not. As k grows beyond the actual number of distinct clusters, each subtask gets split into multiple clusters with very similar embeddings, which upon comparison with the GT leads to inferior results. This phenomenon is illustrated in Fig. R2 (see rebuttal pdf), where a larger k can result in several erroneous clusters with very small windows (blue and red boxes in Fig. R2). This leads to large fluctuations (jittery predictions) within a single GT phase, thus deteriorating the overall performance.

Secondly, the anomalous trend (k=12 being slightly worse than 10 and 15) might be a dataset-specific (PC Assembly) issue and not unique to our approach, as a similar trend for this dataset has been reported in [a, b]. However, for PC Disassembly and other datasets (reported as additional results in the rebuttal pdf, Table R1), we consistently find that the performance degrades as k is increased from 10 → 12 → 15. Note that our contribution is the optimal transport (OT)-based representation learning; the clustering is only used as an inference post-processing step. Overall, we achieve similar trends with respect to the SOTA works but consistently outperform them across all k and datasets.

Refs.:

[a] Bansal et al. My view is the best view: Procedure learning from egocentric videos. ECCV 2022.

[b] Bansal et al. United we stand, divided we fall: Unitygraph for unsupervised procedure learning from video. WACV 2024.

[c] Shah et al. Steps: Self-supervised key step extraction and localization from unlabeled procedural videos. ICCV, 2023.

Comment

Thank you for your response. In terms of statistical significance, what is the standard deviation for a set of 5 runs for example?

Comment

Thanks for your response. In our paper, we reported only the mean values obtained over multiple runs and not the standard deviations, as the results did not vary significantly. Also, previous SOTA works [a, b, c] do not report the standard deviation across runs either. However, as per the reviewer's suggestion, we report the mean ± standard deviation (SD) over 5 separate runs in Table A below. Note that in all cases the SD is quite low and we get consistent results over the multiple runs, further demonstrating the statistical significance of the results.

Table A: Results showing mean ± SD over 5 runs for all the datasets

| Dataset | F1 (mean ± SD) | IoU (mean ± SD) |
|---|---|---|
| CMU-MMAC | 36.5 ± 0.138 | 18.8 ± 0.106 |
| EGTEA-GAZE+ | 29.5 ± 0.147 | 13.2 ± 0.145 |
| MECCANO | 39.2 ± 0.319 | 20.2 ± 0.258 |
| EPIC-Tents | 20.7 ± 0.165 | 10.6 ± 0.101 |
| PC Assembly | 33.7 ± 0.311 | 17.9 ± 0.184 |
| PC Disassembly | 32.2 ± 0.317 | 16.9 ± 0.203 |
| ProceL | 34.9 ± 0.095 | 21.3 ± 0.037 |
| CrossTask | 35.1 ± 0.142 | 21.5 ± 0.111 |

Please let us know if you have any further queries. Thanks again, and we hope you will reconsider your score based on our responses.

Refs.:

[a] Bansal et al. My view is the best view: Procedure learning from egocentric videos. ECCV 2022.

[b] Bansal et al. United we stand, divided we fall: Unitygraph for unsupervised procedure learning from video. WACV 2024.

[c] Shah et al. Steps: Self-supervised key step extraction and localization from unlabeled procedural videos. ICCV, 2023.

Author Response

We thank all the reviewers for their insightful comments and feedback. We are encouraged that the reviewers appreciate the soundness and novelty of our approach for procedure learning, along with its comprehensive evaluation (Reviewer XDhK), and the improvement in results over current state-of-the-art (SOTA) works (Reviewer pvnX). Additionally, Reviewer Yy5N finds our work interesting and relevant, as we thoroughly consider deviations in real-world sequences, including background and non-monotonic frames. Lastly, Reviewer 3Xtn appreciates the novelty of our work and its SOTA performance.

The reviewers have also raised some concerns that we have addressed in their individual rebuttal responses. Furthermore, as per the reviewers' suggestions, we have added additional relevant figures and results to support our claims in the attached one-page pdf and referred to them in our responses. Specifically,

  1. Figure R1 is a modified version of Figure 1 of the main paper.
  2. Figure R2 illustrates the qualitative effect of varying k for clustering the key-steps.
  3. Figure R3 shows the reasoning for the choice of Laplace distribution as a prior over other distributions.
  4. Figure R4 emphasizes the importance of both the optimality and temporal priors to combat real-world non-idealities like non-monotonic frames, speed variation, etc., while maintaining temporal coherence.
  5. Table R1 shows additional ablation study on the effect of k on MECCANO and EPIC-Tents datasets.
  6. Codeblock R1 depicts a PyTorch function to determine the sequential ordering of tasks from frame-wise key-step predictions.

In general, we agree that the formulations in Section 3 can be simplified to improve readability. We will update accordingly as per the reviewers' suggestions in the camera-ready version, if accepted. Please note that, in our rebuttal responses, Fig./Table X (e.g., Fig. 2/Table 5) refers to the main submitted manuscript, while Fig./Table RX (e.g., Fig. R2/Table R1) refers to the attached rebuttal pdf.

Once again, we sincerely appreciate your time and consideration. Please let us know if you have any further queries. We look forward to your responses.

Final Decision

This paper received active discussions between authors and reviewers, with the outcome that the reviewers reached an accept consensus (5, 5, 6, 7). The reviewers appreciate the novelty of the method, the new insights that come with bringing optimal transport into the video recognition domain, and the strong empirical performance. The reviewers have pointed out limitations when it comes to writing/explanations as well as the overlap with action segmentation. In the rebuttals, the authors have performed additional experiments with comparisons on action segmentation, which further strengthens the paper. The AC finds that the strengths of the paper outweigh the limitations given by the reviewers, especially after the rebuttal stage. Hence the recommendation to accept.

Public Comment

This paper has heavily copied existing works and has very limited novelty. It is merely a copy of previous works in the same self-supervised video alignment and procedure learning topics, i.e., VAVA [1] (for self-supervised video alignment) and CnC [2] (for clustering and key-step ordering). In Sec. 3:

  • "Optimal Transport Formulation", "Background and Redundant Frames", "Training Methodology", "Contrastive Regularization" are exactly copied from VAVA [1].

  • "Regularization with Priors" is copied from VAVA [1]. Temporal priors and Optimality priors were proposed in VAVA [1]. The only difference here is the use of Laplace function in Eq 2 and 3 in OPEL vs. Gaussian function in Eq 4 and 6 in VAVA [1], but it is a minor novelty.

  • "Clustering and Key-step Ordering" is exactly copied from CnC [2].

The authors make it sound as if the above modules were proposed by them by (1) using new names for existing modules, and (2) omitting citations of the original works and discussions of how their modules differ from those works in Sec. 3.

The authors submitted this paper to a machine learning conference, so it had a lower chance of being assigned to reviewers who work on video alignment/action segmentation/procedure learning. All of the competing methods in Tabs. 1, 2, 3 are published in computer vision conferences like ICCV/CVPR/ECCV/WACV.

Since the authors have "succeeded" with this NeurIPS 2024 paper, they are now applying the same "trick" again at ICLR 2026. Their new submission (which you can find by searching for the second author on Google Scholar) copies VAVA [1]/OPEL [2], and my VAOT [3]/RGWOT [4] works, with very limited and questionable novelty. I have added my detailed comments on their OpenReview submission.

[1] Liu et al. Learning to align sequential actions in the wild. CVPR 2022. (VAVA)

[2] Bansal et al. My view is the best view: Procedure learning from egocentric videos. ECCV 2022. (CnC)

[3] Ali et al. Joint self-supervised video alignment and action segmentation. ICCV 2025. (on ArXiv since March 2025) (VAOT, VASOT)

[4] Mahmood et al. Procedure learning via regularized gromov-wasserstein optimal transport. WACV 2026. (on ArXiv since July 2025) (RGWOT)

P.S.: I am Quoc-Huy Tran, an expert in self-supervised video alignment (e.g., LAV at CVPR 2021, LA2DS at ECCV 2024, VAOT at ICCV 2025), self-supervised action segmentation (e.g., TOT at CVPR 2022, UFSA at WACV 2024, VASOT at ICCV 2025), and self-supervised procedure learning (e.g., RGWOT at WACV 2026). Two of my works are cited in this paper. I have been aware of this work since March 2025, when I started working on procedure learning. I didn't know I could add public comments until just now.