Learning Skill-Attributes for Transferable Assessment in Video
We propose a method to transfer video-based skill assessment knowledge across various sports domains.
Abstract
Reviews and Discussion
The paper proposes CROSSTRAINER, a method for video-based skill assessment that learns transferable skill-attributes (e.g., control, balance) shared across sports. It uses a multimodal LLM to predict incorrect attributes, generate actionable feedback, and estimate proficiency.
Strengths and Weaknesses
Strengths:
- Demonstrates consistent improvements over SOTA baselines across skill-attribute prediction, feedback generation, and proficiency estimation.
- Handles unseen skills and sports with minimal performance drop, showing strong generalization across tasks and domains.
- Decomposing feedback into generalizable “skill-attributes” makes the model more interpretable and adaptable.
Weaknesses:
- While skill-attributes reduce label cost, they still rely on curated expert feedback for supervision, which may not scale well.
- Current formulation lacks mechanisms for reasoning about external factors (e.g., terrain, opponent behavior) that may influence skill.
- Focuses only on individual physical tasks; unclear if the approach extends to multi-agent, strategic, or cognitive skills.
- Heavy reliance on LLM prompting and proprietary components (e.g., LLaMA-3) without detailed insight into prompt design or tuning.
Questions
How do you ensure the extracted skill-attributes are accurate and consistent? Are they verified or clustered? Do you expect CROSSTRAINER to work for non-physical tasks (e.g., surgery, industrial skills)? Any thoughts on adapting it?
Limitations
The method still depends on expert commentary for supervision. Addressing how this could be scaled or replaced with weak/noisy supervision would be useful.
Final Justification
The rebuttal is clear to me; I will raise the score.
Formatting Issues
No
We thank the reviewer for the feedback. All the concerns are addressed below:
1: While skill-attributes reduce label cost, they still rely on curated expert feedback for supervision, which may not scale well.
Curated expert feedback is indeed expensive, which motivates our approach of zero-shot transfer of feedback across sports, effectively reducing the need for curated supervision. That is, our proposed generalization lets us scale better than all the baselines that are tied to a specific sport [87, 46, 52, 67, 8] (Tab 1, Fig. 4). Our method achieves better performance in novel scenarios and sports without explicit expert knowledge of the sport (Fig. 4 and L267-275).
2: Can the current formulation handle external factors (e.g., terrain, opponent behavior), multi-agent, strategic, cognitive skills, surgery, industrial skills?
Terrain and multi-agent opponent behavior: That’s a good observation. We do not factor in terrain or opponent behavior in this work. This assumption is consistent with the datasets Ego-Exo4D [36] and QEVD [67], which are both single-person and do not consider external factors. Furthermore, prior work also considers single-person skill assessment/feedback [11, 36, 67, 69, 70, 72]. There is research on multi-agent cooperation [A, B], but it is restricted to simulation and to simpler objectives than performance feedback.
Extending this method to external factors and opponent behavior would require additional skill-attributes like coordination, relative placement, spatial awareness, etc. and a similar training approach as ours. We do not consider this scenario due to the lack of training data, and prior methods; this is an interesting extension of our work.
Strategic and cognitive skills: We do not focus on strategic or cognitive skills since they are not necessarily visual—hence using visual encoders and visual models might not be required.
Surgery and industrial skills: We believe our method can be extended to surgery and industrial skills. However, adapting it to different domains would require a careful choice of visual encoders, e.g., surgery may only need hand pose encodings, and corresponding skill-attributes. Nonetheless, it is an interesting future work, provided there are datasets for such supervision with actionable feedback.
References:
A: Liu et al., ICLR 19, “Emergent Coordination Through Competition”
B: Lowe et al., NeurIPS 17, “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments”
3: Insights into the generated skill-attributes — prompt design tuning, accuracy and consistency, and if they are verified and clustered.
Prompt design tuning and consistency with a different language model: Thank you for the suggestion to study prompt design and to check consistency with a different open-source language model (Llama-3 is itself open-source). For simplicity, we deployed only one SOTA LLM in our pipeline, but we appreciate the interest in how sensitive the full pipeline is to that choice.
We try two prompt variants and a different language model (Mistral 8B, mistralai/Ministral-8B-Instruct-2410), and compare the similarity of the generated skill-attributes. We use Hungarian matching to find the most similar pairing between the new skill-attribute set and our original skill-attribute set, and then compute the average BERTScore (Zhang et al., ICLR 20) over the matched pairs. The similarity is tabulated below:
| Comparing model (Ours vs) | Score |
|---|---|
| Llama-3 8B w/ prompt choice 1 | 0.99 |
| Llama-3 8B w/ prompt choice 2 | 0.98 |
| Mistral 8B | 0.98 |
We see a very high similarity between the skill-attributes generated from different prompts and from a different language model, implying that our idea is independent of the choice of any reasonable language model.
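For reference, the set-to-set similarity protocol described above can be sketched as follows. This is a minimal, hypothetical sketch: it uses SciPy's Hungarian solver, substitutes token-level Jaccard similarity for BERTScore so the example stays self-contained, and the attribute strings are made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def jaccard(a: str, b: str) -> float:
    # Stand-in for BERTScore: token-level Jaccard similarity between two strings.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def set_similarity(set_a, set_b, sim=jaccard) -> float:
    # Pairwise similarity matrix between the two skill-attribute sets.
    S = np.array([[sim(a, b) for b in set_b] for a in set_a])
    # Hungarian matching finds the one-to-one pairing with maximum total similarity.
    rows, cols = linear_sum_assignment(S, maximize=True)
    # Average similarity over the matched pairs.
    return float(S[rows, cols].mean())

# Hypothetical attribute sets from two prompt variants.
ours = ["hip rotation", "knee bend", "arm extension"]
variant = ["knee bend", "arm extension", "hip rotation"]
print(set_similarity(ours, variant))  # identical sets up to ordering -> 1.0
```

In the actual pipeline, `jaccard` would be replaced by BERTScore between the two attribute phrases; the matching and averaging steps are unchanged.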
Accuracy of the skill-attributes: We ensure the accuracy of the skill-attributes by manually verifying them for the videos in the test set. The manual verification involves checking that each skill-attribute correctly reflects the actionable feedback mentioned by the expert. The test set contains 1272 videos, and we had to edit only 0.2% of the skill-attributes. We will add this in the paper.
Are they clustered?: We do not restrict the scope of the skill-attributes, and hence, we do not cluster the skill-attributes. The skill-attributes are generated by our proposed model (Fig 3 and Sec. 3.3), and then the predicted skill-attribute is used for actionable feedback generation, and proficiency estimation (Sec. 3.4).
4. Addressing how expert feedback training supervision could be scaled or replaced with weak/noisy supervision would be useful.
In this work, we focus on maximizing the ability to provide skill assessment given the current scale of the expert commentary (L45-47). We also demonstrate the ability to transfer the assessment across the long tail of sports (L33-35).
The scale of the expert feedback can be increased by having more annotations by subject experts (costly), or leveraging knowledge from the internet, e.g., YouTube, Reddit (L155-156). Increasing the expert feedback scale for training and evaluation is an orthogonal line of research; any improvement on that front will further expand the applicability of our approach. Moreover, the supervision cannot be scaled equally for all sports, due to the interest and the availability of the data, e.g., soccer is more popular, and has more data, than korfball. Therefore, the problem of long-tail transfer will always be crucial, as shown in our work.
We are aware that Reviewer BeW2 has completed the Mandatory Acknowledgement, but we are not aware of how they interpreted our rebuttal, or which points are resolved or remain open.
We'll stand by to address any further questions before the author-reviewer discussion deadline (which is now Aug 8, AoE).
This paper tackles the problem of video-based skill assessment, aiming to overcome a major limitation of existing methods: their poor transferability across different sports or activities. The authors propose an approach centered on the idea of learning a set of universal, sport-agnostic skill-attributes like balance, control, and body positioning. The core contribution is a two-stage training paradigm. First, they cleverly bootstrap these skill-attributes by using an LLM to parse existing expert commentaries from datasets, thus creating weakly-supervised labels. Second, they pre-train a multimodal model to predict these attributes from video, and then finetune it to generate actionable feedback and estimate a performer's proficiency level. The experiments, conducted on datasets like Ego-Exo4D and QEVD, demonstrate that this approach improves in-domain performance and shows significantly better generalization to novel sports and drills in various zero-shot settings compared to state-of-the-art methods.
Strengths and Weaknesses
Strengths:
-
The paper addresses the problem of the lack of transferability in action quality assessment. The core idea of identifying and using intermediate skill-attributes as a transferable representation is insightful. It provides a structured way to reason about why a performance is good or bad, which aligns well with the principles of AQA and effective coaching.
-
The setup for testing transferability with its multiple zero-shot configurations (ZS-1, ZS-2, ZS-3), is rigorous and provides strong evidence for their main claim. The results show their method outperforming strong baselines, especially in the challenging zero-shot scenarios where it degrades much more gracefully.
Weaknesses:
-
The quality of the bootstrapped skill-attributes is fundamental to the entire pipeline, yet this process relies heavily on an LLM prompted to extract concepts from text. The robustness of this extraction process to different LLMs, prompts, or noisy/ambiguous commentary is not deeply analyzed.
-
The overall framework (generating soft labels using an LLM, then finetuning with LoRA) and the underlying model architecture (a frozen video encoder, a trainable mapping network, and a frozen LLM) follow a relatively standard pattern in recent multimodal research and lack novelty.
-
While the experimental evaluation is extensive, the central claim about the necessity of the skill-attribute pre-training stage could be further substantiated with more direct ablation studies, as detailed in the questions below.
Questions
-
The paper compellingly argues that the two-stage, attribute-centric approach is superior to other SOTA models. To more directly isolate and quantify the benefit of the proposed pre-training stage, have you considered running an ablation study where the exact same architecture is trained end-to-end? Specifically, a baseline trained to go directly from video to actionable feedback, without the intermediate skill-attribute discovery and pre-training stage. This would provide a very clean and powerful demonstration of the value added by your proposed pre-training methodology.
-
Regarding the crucial skill-attribute discovery stage (Sec 3.3): Could you provide more analysis on the quality and stability of the LLM-generated attributes? How consistent are the extracted attributes if a different LLM or slightly different prompts are used?
-
The paper mentions creating open-vocabulary skill-attributes. Is it a fully open vocabulary discovered on the fly, or is there a process to consolidate or cluster similar LLM outputs into a more canonical set of attributes?
-
Have you conducted any human studies to assess whether the generated feedback is perceived as actionable and correct, especially for novel sports?
-
In the ablation-like discussion, you mention that adding explicit body pose features didn't provide a significant accuracy boost relative to the computational cost. Could you quantify this trade-off?
Limitations
Yes
Final Justification
The rebuttal addresses most of my concerns; I will raise my assessment of the paper to 4.
Formatting Issues
No
We thank the reviewer for their feedback and questions. We provide the clarifications below, including a requested baseline that we had already implemented:
1. Ablation studies on skill attributes
Please see our response to reviewer PsTp’s #5.
2. Overall framework lacks novelty
While our implementation is indeed inspired by the success of recent multimodal architectures [43, 45, 52, 55], our key novelty is in identifying skill-attributes as the medium to unlock transferability between sports (L5-8). Moreover, we outperform vanilla fine-tuning of existing LLMs by a large margin, e.g., LLaVA-FT [87] (Tab 1). These baselines also follow the standard pattern of LLM plus LoRA finetuning, even when trained with the same dataset. Overall, our method uses skill-attributes both to improve in-domain performance and to generalize better to novel scenarios than the vanilla LLM-finetuning framework.
3. End-to-end training vs two-stage, attribute-centric approach
Good point, and please note that we have this exact baseline. ExpertAF [8] and Stream-VLM [67] precisely do what the reviewer described (Tab 1). In actionable feedback generation, these methods train an end-to-end model — going directly from video to actionable feedback. The implementation in [8] fully fine-tunes the LLM; reproducing that in our LoRA training setting with a more recent LLM gives slightly lower performance. Thus, we compare directly with the reported number and outperform it by up to 3% (L254-256).
Furthermore, for skill-attributes generation, we use the end-to-end actionable feedback output to extract the skill-attributes.
4. Quality and stability of LLM-generated attributes with different LLM or different prompt
Please see our response to reviewer BeW2’s #3 where we show high similarity between the generated attribute set with different prompts, and a different LLM (Mistral 8B).
5. Are the skill-attributes fully open vocabulary discovered on the fly?
Yes, it is a fully open vocabulary. At the test time, we use a generative model (Sec 3.3) to generate skill-attributes. This ensures the model is free to generate skill-attributes that are not verbatim seen in the training, but related to it (L164-166).
6: Have you conducted any human studies to assess whether the generated feedback is perceived as actionable and correct, especially for novel sports?
For the novel sports collected from YouTube, the data collection process ensures every video contains ground truth description of the actionable feedback. Human subjects not associated with the project rated 75% of the generations as actionable and correct, compared to the ground truth—even when the sport is unseen during training. Additionally, for actionable feedback generation, ExpertAF [8] conducted a human study, and found a positive correlation between the automatic metrics and human preference.
7. Trade-off in using explicit body-pose
Consistent with the observation in ExpertAF [8], using explicit body-pose increases the performance by ~0.5% in BLEU-4, ROUGE and METEOR (Tab 1, right). However, in order to use the additional pose input, we need to train a pose encoder (similar to PCT, Geng et al. CVPR 2023, “Human Pose as Compositional Tokens”), and a pose mapper, similar to the visual mapper (L193). Since explicit body-pose is not the primary contribution of this work, and we obtain stronger results than ExpertAF [8] without pose, we do not use it.
I thank the authors for the detailed and timely rebuttal. The new experiments and clarifications have been very helpful in addressing my initial concerns.
Regarding Novelty: I appreciate the clarification. I agree that the primary novelty of the work lies in the conceptual contribution of using transferable skill-attributes, and the empirical results effectively demonstrate the value of this idea.
Regarding the End-to-End Ablation: While I recognize that comparing against ExpertAF provides a functionally equivalent baseline, a cleaner ablation would still be beneficial. You mentioned that "reproducing that [the end-to-end approach] in our LoRA training setting with a more recent LLM gives slightly lower performance." If this result was generated using an identical architecture to your proposed CROSSTRAINER (just trained end-to-end without the attribute pre-training stage), I strongly recommend adding this specific result to the experimental tables. This would provide the direct, apples-to-apples comparison I was looking for and would powerfully substantiate the necessity of your proposed pre-training methodology.
Regarding Attribute Quality and Stability: Thank you for the comprehensive new ablation studies. The experiments analyzing robustness to different LLMs, different prompts, and noisy attributes are convincing. They successfully demonstrate the stability of the skill-attribute discovery phase and quantify its positive impact on the final task performance. My concerns on this point have been fully resolved.
Regarding the Open Vocabulary: Thank you for the clarification. My question is fully resolved.
Regarding Human Studies: Relying on the conclusions from ExpertAF [8] to validate your evaluation metrics is indirect. I would strongly encourage you to include the details of your own small-scale human study—specifically, the finding that "Human subjects not associated with the project rated 75% of the generations as actionable and correct"—in the main paper. This would add valuable direct evidence for the quality of your model's outputs, especially for novel sports.
Regarding the Pose-Tradeoff: Thank you for the clarification. This question is fully resolved.
In summary, I thank the authors again for the additional experiments, which have resolved most of my questions. I would be willing to raise my score from a 3 to a 4 (Borderline Accept) if the suggestions for Question 2 (the end-to-end ablation) and Question 5 (the inclusion of the human study results) are incorporated into the final version of the paper.
We thank the reviewer for agreeing to increase the score to borderline accept, provided we add details about end-to-end ablation (Q2) and human study results (Q5).
Thank you for acknowledging our responses to all the points mentioned above. We will update the main paper to reflect all the suggestions.
Q2: In particular, we will add the end-to-end baseline details, as discussed. This end-to-end baseline has the same architecture as ExpertAF and CrossTrainer, but is trained with a smaller Llama model (Llama-3.1-8B-Instruct instead of Llama-3-70B) in the LoRA setting. This ensures an apples-to-apples comparison with CrossTrainer, the only difference being the use of skill-attributes for pretraining. The language model and the LoRA training are the same for our method and this end-to-end baseline. We ran this baseline before submitting, and we alluded to its weaker results in the main text (L255-256), as the reviewer noted; we are happy to elaborate with the numbers.
For your reference, the performance is shown below for Ego-Exo4D actionable feedback generation (Tab 1, right):
| Method | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|
| End-to-end baseline | 43.8 | 48.8 | 52.3 |
| ExpertAF [8] (already reported) | 44.9 | 49.6 | 54.6 |
| CrossTrainer (ours) (already reported) | 45.6 | 51.7 | 57.8 |
The table clearly shows the advantage of using skill-attributes for pretraining.
Q5: Additionally, we will add the details of the human study we performed.
Thank you!
Thanks to the authors for the response; the rebuttal addresses most of my concerns, and I will raise my assessment of the paper to 4.
We are looking forward to your further thoughts about our paper, in light of the rebuttal. We have provided clarifications that we think should help your final assessment, and we'll stand by to address any further questions before the author-reviewer discussion deadline (which is 6th August AoE).
The paper addresses video-based skill assessment—including incorrect skill-attribute detection, actionable feedback generation, and proficiency estimation—and introduces the CROSSTRAINER framework. The core idea is to incorporate an intermediate representation, namely skill attributes, into the multimodal LLM (MLLM) pretraining and prediction process. Results show that CROSSTRAINER outperforms existing methods across various tasks, datasets (Ego-Exo4D, QEVD, and self-collected YouTube videos), and settings, particularly in diverse zero-shot scenarios such as testing on unseen skills from previously unseen sports during training.
Strengths and Weaknesses
Strengths:
-
The paper establishes a strong connection between the proposed method and cognitive science (e.g., the transferability of skills across different sports), enhancing its overall coherence and plausibility (Ln 55-62).
-
The experimental results presented in Tab. 1, Fig. 4, and Fig. 5 are promising, highlighting the superiority of CROSSTRAINER over many state-of-the-art approaches, including ExpertAF [1], in both in-domain and zero-shot setups.
-
A few failure cases are presented in Fig. 5 and discussed in the supplementary material, which is reasonable and highly helpful for understanding the method.
Major Weaknesses:
-
Lack of essential training details for zero-shot setups: According to Section 3.3, the proposed method derives skill attributes from 32k Ego-Exo4D and QEVD commentaries. In zero-shot setups, were the skill attributes extracted from the “test” sports or skills excluded from all stages of training, including pretraining? Clarifying this is crucial to ensure that no information from the test sports or skills leaks into the training process.
-
Potential unfair comparison: For actionable feedback generation, the proposed method utilizes video features, predicted skill attributes, and a prompt (Lines 204–206) as input. This prompt appears to include a placeholder for the activity name (e.g., <activity>). Does this imply that the actual activity name (e.g., dribbling) is inserted during both training and testing? If so, do the competing methods listed in Tab. 1 employ a similar mechanism for generating actionable feedback? Otherwise, the proposed method may have an advantage, as it is already provided with contextual information (e.g., the sport or skill) and only needs to generate feedback based on the attributes within this known context.
-
Limited number of tested sports and skills: Although the proposed method aims to transfer skill knowledge across different sports and skills, the range of evaluated activities is quite narrow, comprising only 3 sports and 5 skills in Ego-Exo4D, 23 workout moves in QEVD, and 3 sports in the collected YouTube dataset. This limited scope makes it difficult to assess whether the observed performance gains are genuinely attributable to the method’s capabilities or simply to the specific choice of tasks. In fact, it is easy to imagine scenarios where skill transfer between sports may not be effective. For example, leaning forward may be beneficial in skiing but detrimental in basketball defense. Negative transfer [2] is a well-known challenge in machine learning, further underscoring the need for the proposed method to be evaluated across a broader and more diverse set of sports and skills to validate its generalizability.
-
The paper does not provide justification for the selection of the three sports, frisbee, water polo, and a juggling drill in soccer, for collecting the YouTube videos. Clarifying the rationale behind choosing these specific activities is important to assess whether the proposed method can truly generalize to a broader range of sports and skills.
Minor Weaknesses:
- The paper lacks ablation studies on skill attributes, particularly regarding how their diversity or quality impacts final performance. This analysis is important to support the claim in Ln 155–157 that the proposed method can effectively leverage commentaries collected from online video-sharing platforms. Without such an evaluation, it remains unclear how robust or sensitive the method is to variations in the source and quality of skill attributes.
[1] Ashutosh, Kumar, et al. "ExpertAF: Expert actionable feedback from video." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.
[2] Zhang, Wen, et al. "A survey on negative transfer." IEEE/CAA Journal of Automatica Sinica 10.2 (2022): 305-329.
Questions
-
Please refer to Weakness 1 and provide additional training details for the zero-shot setups. Specifically, clarify whether any skill attributes derived from the test sports or skills were used during training, including the pretraining stage, to rule out potential information leakage.
-
Please address Weakness 2 by clarifying whether the activity name (e.g., dribbling) is provided in the prompt during both training and testing. Additionally, indicate whether competing methods listed in Tab. 1 use a similar strategy, to ensure a fair comparison in actionable feedback generation.
-
In response to Weaknesses 3 and 4, please explain why testing on a limited number of sports and skills is still considered meaningful. Furthermore, clarify the rationale behind selecting the three specific sports for YouTube video collection.
-
Please address Weakness 5 by including additional ablation studies on skill attributes, if possible. In particular, analyze how variations in their diversity or quality influence overall performance.
Limitations
Yes.
Final Justification
Although I still believe it would be better to evaluate the proposed method on a more diverse set of sports and skills to fully validate its claims on transferability, most of my concerns have been addressed within the authors’ rebuttal, including the addition of ablation studies on skill attributes, a key component of the proposed method. As a result, I have decided to slightly raise my score for the paper.
Formatting Issues
n/a
We thank the reviewer for their feedback. All the requested clarifications and additional ablations for the choice of LLM are addressed below:
1: Lack of essential training details for zero-shot setups.
This is a possible misunderstanding. Our model generates skill-attributes, as opposed to retrieving them. Thus, we do not need a list of skill-attributes at test time; the model is capable of generating it (Sec. 3.1 and 3.3). We do not maintain any list of skill-attributes to use with the test or train set. In Sec. 3.3, we mention 32k skill-attributes to signify the scale and diversity of the skill-attributes from all the training samples. In summary, there is no information leak between the training and the testing set.
2: Clarification regarding the use of <activity> in the prompt.
It is correct that the activity name is inserted during both training and testing. Importantly, the same is done for all the baselines. In particular, the prompt format is the same in ExpertAF [8], LLaVA [52], VideoChat2 [46], and Stream-VLM [67]. In retrieval baselines, the candidates are only from the mentioned activity. Note that <activity> here refers to the sport, e.g., soccer, and not the particular drill, e.g., dribbling (same as L189). Knowing which sport is being played is a reasonable assumption that real use cases would make, i.e., the system would know it is being asked for feedback on tennis, basketball, etc.
3: Limited number of tested sports and skills, and potential negative transfer.
We believe this total of 31 skills/activities with ground truth commentary is sizable and represents the total available relevant data for evaluation. We welcome pointers from the reviewer if we have overlooked any additional dataset that would broaden evaluation even further.
To evaluate coaching feedback, the required ground truth must be of ‘expert quality’. Ego-Exo4D [36] and QEVD [67] hired sports professionals to curate the datasets (both papers are from industry labs). We leverage these datasets for our task. Prior work evaluates physical skill at the same or a much smaller scale: [11] uses only 48 videos from a single sport (basketball); AQA [69] contains six activities totaling 803 videos; [36] and [8] are comparable since they use Ego-Exo4D [36], but we additionally use QEVD [67]. In comparison, we have 31 skills/activities spanning 2800 videos, if anything increasing the breadth of testing for this area.
We agree that negative transfer is an important risk in transfer and multi-task learning. Precisely because of this variability between sports, we perform zero-shot transfer at the skill-attribute stage, as opposed to performing zero-shot transfer from the actionable feedback. We hypothesize that the exact actionable feedback depends jointly on the incorrect skill-attribute and the current sport. For example, for the reviewer’s example of skiing/basketball, we expect the model to flag ‘leaning’ as an incorrect skill-attribute. Next, actionable feedback for skiing would indicate leaning to be better, vs in basketball, it would advise against leaning when defending. This observation is discussed in L209-212.
Nevertheless, we acknowledge that transfer between certain sports is more suitable than others. Fig. 5 (bottom left) shows a better transfer between soccer and basketball, than rock climbing.
4: Justification for the selected sports for the YouTube dataset (3rd dataset we tested).
We create a list of novel sports and novel drills within the Ego-Exo4D and QEVD sports (basketball, soccer, rock climbing, exercise), and search YouTube for their coaching videos. Our criterion is that the video should contain a mistake made by a learner and a coach giving feedback. Note that many videos show only the correct demonstration; hence, obtaining videos for our task is challenging.
Specifically, we randomly selected some common sports, e.g., soccer (juggling) and basketball (dribbling), and some rarer ones like water polo, korfball, polo, jai alai, kin ball, and frisbee. We searched through more than 50 videos with keywords like “Coaching video of <sport/drill>”, “<sport/drill> training session for beginners”, “<sport/drill> common mistakes for beginners”, and “Dos and don’ts in <sport/drill>”. The videos were manually watched to find the desired coaching instances. The process took 12 hours overall and was done by graduate students not associated with this project. We will add this detail in the paper.
5. Ablation studies on skill attributes
We deployed only one SOTA LLM in our pipeline (Llama 3.1 8B), for simplicity, but we appreciate interest in knowing how sensitive our full pipeline is with respect to that choice. To show the robustness and sensitivity, we perform the following additional experiments on Ego-Exo4D actionable feedback generation (Tab 1, right):
-
Robustness to the choice of the language model: We train and evaluate skill-attributes using different language models. We first train the model using Mistral’s 8B language model (mistralai/Ministral-8B-Instruct-2410) and compare it against the skill-attributes test set obtained using Llama-3.1-8B-Instruct, and vice-versa. See the table below for results. We see that the model is robust to the choice of the language model, and using any strong language model helps achieve a good performance.
-
Actionable feedback generation w/ noisy skill-attributes: We inject noise in the actionable feedback generation evaluation. We replace X% of inferred skill-attributes with a random skill-attribute and observe the performance at various levels of noise X. See results in the table below. We observe that adding noise degrades the performance, with the performance matching that of end-to-end direct training at X=20%. We can conclude that the performance is positively correlated to the quality of the generated skill-attribute. Improving that will also improve the actionable feedback performance.
| Skill-attribute source | Test-set LLM | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|
| Llama-3 8B (original) | Llama-3 8B (original) | 45.6 | 51.7 | 57.8 |
| Mistral 8B | Llama-3 8B (original) | 45.3 | 51.8 | 57.5 |
| Llama-3 8B (original) | Mistral 8B | 45.8 | 51.8 | 57.8 |

| Skill-attribute source | Test-set LLM | BLEU-4 | METEOR | ROUGE-L |
|---|---|---|---|---|
| Llama-3 8B (original) w/ 10% noise | Llama-3 8B (original) | 45.2 | 50.5 | 56.2 |
| Llama-3 8B (original) w/ 20% noise | Llama-3 8B (original) | 45.1 | 50.0 | 54.9 |
| Llama-3 8B (original) w/ 30% noise | Llama-3 8B (original) | 44.3 | 49.4 | 53.1 |
| Llama-3 8B (original) w/ 50% noise | Llama-3 8B (original) | 42.7 | 47.3 | 50.2 |
| Llama-3 8B (original) w/ 70% noise | Llama-3 8B (original) | 41.9 | 46.3 | 49.6 |
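The noise-injection procedure described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function name, the example vocabulary, and the attribute lists are our own.

```python
import random

def inject_attribute_noise(predicted_attrs, attribute_vocab, noise_frac, seed=0):
    """Replace a fraction of predicted skill-attributes with random ones.

    Hypothetical helper mirroring the ablation: `predicted_attrs` is the
    list of inferred skill-attributes for a clip, `attribute_vocab` is the
    full skill-attribute vocabulary, and `noise_frac` (e.g. 0.2 for X=20%)
    controls how many entries are swapped for a uniformly random attribute.
    """
    rng = random.Random(seed)
    attrs = list(predicted_attrs)
    n_noisy = round(noise_frac * len(attrs))
    # Pick distinct positions to corrupt, then overwrite each with a
    # random draw from the vocabulary.
    for i in rng.sample(range(len(attrs)), n_noisy):
        attrs[i] = rng.choice(attribute_vocab)
    return attrs

vocab = ["balance", "control", "footwork", "timing", "posture"]
clean = ["balance", "control", "timing", "posture"]
noisy = inject_attribute_noise(clean, vocab, noise_frac=0.5)
```

The corrupted attribute lists would then be fed to the feedback generator in place of the model-inferred ones, and the generation metrics recomputed at each noise level.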
These ablation studies further showcase the effectiveness of using skill-attributes for actionable feedback. We will add this ablation in the paper.
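For context on the metrics reported in the tables, ROUGE-L scores the longest-common-subsequence overlap between a generated feedback sentence and the reference. A minimal self-contained sketch follows (plain F1 with whitespace tokenization; published toolkits add tokenization and stemming choices that can shift absolute scores):

```python
def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L F1: LCS-based overlap between reference and hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, h in enumerate(hyp):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == h else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("keep elbow high", "keep your elbow")` shares the subsequence "keep elbow" (length 2), giving precision and recall of 2/3 each.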
I would first like to thank the authors for addressing my concerns during the short rebuttal period.
- Regarding the zero-shot setups: My actual concern is that, since the paper includes four different evaluation setups (i.e., FS, ZS-1, ZS-2, and ZS-3), it is unclear whether each part of the training, particularly the pretraining stage described in Lines 186-200, was conducted independently to reflect the conditions of each zero-shot setup. For example, in the ZS-3 setup, was pretraining performed using only n-1 sports? Based on the authors’ response, I would assume this was the case, although the paper does not explicitly state it.
- Regarding the use of <activity>: Thank you for the clarification. This seems reasonable to me. However, I would suggest unifying the two representations (i.e., <sport name> in Line 189 and <activity> in Line 204) to accurately reflect that only the sport name (and not a particular skill) was used.
- Regarding the limited number of tested sports and skills: Thank you for the clarification. Since the paper focuses on transferable assessment across sports and skills, I still find the use of only 31 skills (from 6 sports and general physical workouts) unconvincing. Particularly, I would expect a broader range of YouTube videos covering different sports to be included. However, I acknowledge that prior work has evaluated at a similar or smaller scale, even if those studies did not emphasize transferability.
- Regarding the YouTube videos: Thank you for providing the details. They will be very helpful for the revised manuscript.
- Regarding the ablation studies on skill attributes: Thank you for providing the additional ablation studies. They support the claim that the proposed method can potentially use commentaries collected from online video-sharing platforms.
In summary, while I believe the proposed method should be validated on a broader set of sports and skills to fully justify its claims on transferability, the rebuttal addresses most of my concerns, including the addition of solid ablation studies. I will raise my assessment of the paper to 4.
We thank the reviewer for considering our responses, and deciding to raise the score to borderline accept. We acknowledge the reviewer's comments below:
- Regarding the zero-shot setups: Yes, that is correct. We will update the final text to reflect this.
- Regarding the use of <activity>: Thanks for the suggestion; we will unify it.
- Regarding the limited number of tested sports and skills: Thanks for acknowledging the scale in prior work. As the scale of these datasets increases in the future, our method can be used to show generalizability to an even broader set of sports.
- Regarding the YouTube videos and the ablation studies: Thanks for the acknowledgement.
The article suggests using a cross-training approach, since multiple sports share common skill-set requirements. The trained model performs better because it has already learned the features that are common across sports. Multiple sets of experiments are conducted: fully supervised, zero-shot, similar-sport zero-shot, and unfamiliar-sport zero-shot. The results, presented in both quantitative and qualitative forms, show that the approach works well. They demonstrate that the approach is effective and can help in real-world scenarios where datasets for all the different sports are not available.
Strengths and Weaknesses
Strengths:
- The paper is well written.
- Thorough experiments are conducted under different sets of conditions to demonstrate the effectiveness of the approach.
- The approach is simple and is represented well mathematically in the article.
- The approach has many applications, since creating individual datasets for each sport is a known concern in the literature.
Weaknesses:
- The idea is not novel and has been used in different forms in many other fields, such as computer vision and medical imaging.
- The approach is largely similar to fine-tuning of existing models; it differs only slightly in that it learns from other datasets.
Questions
Please check comments above
Limitations
Please check weaknesses mentioned above.
Final Justification
The authors have submitted the rebuttal and clarified the comments. I have decided to raise the rating. However, the authors need to incorporate the changes mentioned in the rebuttal.
Formatting Issues
No concerns about the writing.
We thank the reviewer for their feedback.
1: The idea is not novel and has been used in many other fields in different forms such as computer vision and medical imaging
Ours is the first work to specifically design transferable video representations for skill assessment (L5). We identify skill-attributes as the intermediary to help transfer feedback and proficiency between sports and drills (L40-43).
Specifically, our novelty lies in discovering the physical skill-attributes, and learning video representations that transcend the boundary between sports. We acknowledge and contrast with existing works in L94-103 using attributes in image recognition (L98), and general purpose zero-shot learning methods (L96-98).
If the reviewer could provide specific reference(s) that conflict with our novelty claims, we would be happy to contrast them with our key contributions.
2: The approach is pretty much similar to fine-tuning of the existing models. It is slightly different as it is learning from the other datasets.
Our implementation is inspired by the success in recent multimodal architectures [43, 45, 52, 55] (L170). However, our key novelty is in identifying skill-attributes as the medium to unlock transferability between sports (L5-8).
Moreover, using a different dataset is not our contribution. Our baselines include existing fine-tuning methods that use the same dataset, and we outperform them because of our two-stage training approach that leverages skill-attributes (Sec 3 and Fig. 3). We outperform vanilla fine-tuning of existing models by a significant margin, even when trained with the same dataset: retrieval baselines by >7% and multimodal finetuning by >3% (Tab 1, right).
We are looking forward to your further thoughts about our paper, in light of the rebuttal. We have provided clarifications that we think should help your final assessment, and we'll stand by to address any further questions before the author-reviewer discussion deadline (which is 6th August AoE).
Thanks for the rebuttal. I received responses to my reviews. I don't think anything further is needed from the authors on my side. I need time to weigh the novelty of the idea.
This paper proposes "CrossTrainer", a method for video-based skill assessment that learns transferable "skill-attributes" (e.g., balance, control) to generalize across different sports. The core idea is to discover these sport-agnostic attributes from expert commentaries and then use them to train a multimodal model to provide feedback and proficiency scores.
The main strength is the novel concept of skill-attributes, which provides a more interpretable and effective way to achieve transfer learning in skill assessment, a significant challenge in the field. The paper is supported by thorough experiments, demonstrating substantial gains over state-of-the-art methods, especially in various zero-shot settings.
Initial weaknesses raised by reviewers included concerns about the novelty of the overall architecture (Reviewers 98Cb, ph2p), the robustness of the LLM-based attribute generation (Reviewer ph2p), and the need for more direct ablations (Reviewer ph2p). During the rebuttal, the authors successfully clarified that the novelty lies in the concept of skill-attributes, not the architecture. They provided new ablation studies showing the robustness of their attribute generation process to different LLMs and prompts. These clarifications and additional results effectively addressed the reviewers' concerns, leading multiple reviewers to raise their scores and resulting in a consensus for acceptance.