Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding
Abstract
Reviews and Discussion
This paper introduces Animal-Bench, a video question answering benchmark focused on animals, which are usually overlooked in previous video benchmarks. Animal-Bench is sourced from six datasets and includes 13 tasks. Eight video-language models are evaluated on the benchmark, and the results reveal shortcomings in the models. Moreover, the paper evaluates the robustness of the models by simulating weather and shooting parameter changes, which are challenging in real-world applications.
Strengths
- While current video question answering benchmarks focus on human activities, Animal-Bench is the first animal-centric benchmark. It can not only boost AI applications in animal studies, but also help the development of video-language models.
- Animal-Bench aligns with real-world applications by 1) including domain-specific tasks, e.g., breeding monitoring, and 2) simulating realistic scenarios (weather and shooting parameter changes) via video editing.
- The paper thoroughly evaluates recent video-language models.
Weaknesses
Missing dataset statistics:
The paper provides many detailed dataset statistics about the number of videos and the long-tail distribution of animal categories. But some important statistics are still missing:
- The number of questions per video. If multiple questions can be generated from one video, both the number of videos and the number of questions should be listed in Section 3.1 and Table 3.
- The distribution of the video durations.
Answer correctness after video outpainting:
In the robustness evaluation, outpainting may introduce new animals into the video, e.g., row 2, column 2 in Figure 11. This may change the answer to questions in tasks like object existence and object count. As a result, the robustness evaluation could be inaccurate. It would be better to manually check a subset of the outpainted videos to see how well outpainting preserves the correct answer.
Minor weaknesses:
When a pre-trained model is applied to a specific downstream domain, it is natural to improve performance by fine-tuning it on that domain. However, this benchmark only provides a test set, without training and validation sets. It would have greater impact if a training set were included and the fine-tuned model performance were evaluated.
Writing:
- It is better to explain the task abbreviations in Table 1 (either in the text or in the table caption).
- Showing the accuracy drop in Table 2 is a great way to demonstrate the model robustness. But it would be better to also list the absolute model accuracy.
Questions
- Is the accuracy drop in Table 2 relative drop or absolute drop?
Limitations
Limitations are discussed in Appendix F. No potential negative societal impacts are mentioned.
Thank you for your valuable comments and thoughtful suggestions, and hope our response will address your concerns.
W1: Missing dataset statistics
For each task, each included video corresponds to only one question. For example, for the object recognition task, we only selected videos containing a single species to avoid potential ambiguities. However, the same video may be reused across two or more tasks; for instance, some videos are used to evaluate both the object recognition task and the action recognition task. This data reuse can, to some extent, alleviate the scarcity of animal videos. We will supplement this in subsequent versions.
Please kindly refer to "Author Rebuttal" Q1 for statistics on video duration.
W2: Answer correctness after video outpainting
It is indeed possible for new objects to appear during outpainting. To address this, we conducted manual filtering to avoid potential impacts of newly outpainted species on the answers. In practice, such changes to the answers are virtually nonexistent. For tasks like object existence and object count, the questions mention specific species, such as "Is there a monkey?", so the example in the second column of Figure 11 does not change the answer. Due to the diversity of species in nature, the likelihood that the species mentioned in the question coincidentally matches a new species appearing in outpainting is very low, reducing the risk of changing the original answer. Of course, outpainting does carry this risk, and we plan to reduce the appearance of new species in the future by adding negative prompts and conducting further exploration.
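For illustration, here is a minimal per-frame sketch of how a negative prompt could be added to a diffusers-style inpainting call to discourage new animals during outpainting. This is not our actual editing pipeline; the model id, prompts, padding scheme, and resolution below are placeholders.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Placeholder checkpoint; the actual editing pipeline may use a different model.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

def outpaint_frame(frame: Image.Image, border: int = 64) -> Image.Image:
    """Pad the frame with a border and ask the model to fill only the padded region."""
    w, h = frame.size
    canvas = Image.new("RGB", (w + 2 * border, h + 2 * border))
    canvas.paste(frame, (border, border))
    mask = Image.new("L", canvas.size, 255)                    # white = area to generate
    mask.paste(Image.new("L", (w, h), 0), (border, border))    # black = keep original pixels
    return pipe(
        prompt="natural background, continuous scenery",
        negative_prompt="animal, bird, insect, person, new object",  # discourage new subjects
        image=canvas.resize((512, 512)),
        mask_image=mask.resize((512, 512)),
    ).images[0]
```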
W3: Minor weaknesses
Although the amount of data included in our Animal-Bench is sufficient to support model evaluation, it may not be entirely adequate for model training. This is related to the difficulty of collecting animal videos, especially those of wild animals. Currently, we have only taken the first step. In the future, we will add more data to Animal-Bench and conduct model fine-tuning to ultimately achieve the goal of applying the pre-trained model in the field of wildlife protection.
W4: Writing
Thank you for your valuable suggestions. We will make the necessary improvements in the subsequent revisions.
Q1: Is the accuracy drop in Table 2 relative drop or absolute drop?
It is the relative drop, i.e., the accuracy decrease expressed relative to the accuracy on the original, unedited videos. We report the relative drop because it provides a clearer demonstration of the model's performance changes and differences under various conditions.
L1: Potential negative impacts
Please kindly refer to "Author Rebuttal" Q3.
I read all the reviews and responses. I am satisfied with the authors' responses and highly appreciate their efforts to construct the Animal-Bench dataset. I adjusted my rating to 7 (Accept) and suggest that the authors include the missing dataset statistics in the updated paper upon acceptance.
Thank you for your reply and positive response. We will ensure that we incorporate the reviewers' suggestions in the future version.
Proposes an automated pipeline for an animal-centric large-scale multimodal video benchmark, Animal-Bench, that simulates real-world conditions such as snowing via a diffusion-based video editing approach. The data is sourced from 6 datasets, and multiple filtering steps have been applied to ensure diversity and lack of bias. ChatGPT is utilized to create 3 types of questions for each task, and one is randomly assigned to each data point. Furthermore, the authors simulate closeness, distance, and different angles by changing the shooting parameters. Animal-Bench covers 13 tasks across 7 animal categories (reptile, fish, insect, amphibian, mammal, sea animal, and bird) and 822 animal species. The authors have tested 8 current baselines in multimodal video understanding and have analyzed the results.
Strengths
- Data and code will be publicly available, and will accelerate animal research.
- To the best of my knowledge, this is the first large-scale multi-category, multi-species multimodal dataset with a focus on animals across 13 tasks. This paper will open lots of avenues in the field of animal research (e.g., animal surveillance) as well as introducing new multidisciplinary collaborations.
- Experiments are well-described and reproducible upon release of code.
Weaknesses
- Missing some important implementation details (still quite reproducible).
- The main paper is 9 pages.
- Great research into prior work.
- Authors have not mentioned any potential negative impact of the paper. However, potential misuse of the technology by bad actors for hunting, poaching, and animal exploitation are some examples of how it could be leveraged incorrectly.
Questions
1- How did you determine the appropriate level of difficulty for options in multiple-option questions, especially for tasks that don’t have existing QA data?
2- Did you notice any limitations or biases that were introduced by the outpainting method (line 209)?
3- Which LLM are you using? “7B LLM backend versions” line 242
4- Regarding the hallucination problem in line 260, have you considered including negative examples or adjusting the loss function to mitigate the problem?
5- Which version of ChatGPT did you use? How much was the cost of using ChatGPT per data point and also overall?
Suggested references
1- Van Horn, Grant, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. "The inaturalist species classification and detection dataset." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8769-8778. 2018.
2- Parashar, Shubham, Zhiqiu Lin, Yanan Li, and Shu Kong. "Prompting scientific names for zero-shot species recognition." arXiv preprint arXiv:2310.09929 (2023).
3- Parashar, Shubham, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, and Shu Kong. "The Neglected Tails in Vision-Language Models." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12988-12997. 2024.
Suggestions:
1- Please use "Schiappa et al." instead of "Madeline Chantry Schiappa's work" (line 99)
2- For each video, we sample 16 frames and resize them to 224 (line 246) – Please change to 224x224
3- line 246 stablediffusion-inpainting → StableDiffusion-inpainting
4- line 248 stablediffusion-v1.5 → StableDiffusion-v1.5
5- Please provide how much VRAM your GPU has (24GB?), how much RAM and which CPUs. How much space does your dataset take. Line 486 → Please note this should be in the main paper not appendix.
6- Please have a look at https://forum.inaturalist.org/t/published-papers-that-use-inaturalist-data-wiki-3-2022-and-2023/34753 which is a list of published papers that made use of iNaturalist data.
Limitations
- Lack of exploration of input language parameter sensitivity.
- Relying on StableDiffusion, which has its own set of limitations (training data bias, manipulated content, lack of diversity).
- Limited dataset (while they have combined multiple data points, the variety of animals out there still makes this quite a small dataset). Further, I am curious how many animals may only have a small sample of videos while animals like cats and cows may have way more samples present.
- The potential hallucination that happens should warn the users that may use this dataset to train their model for tasks such as animal conservation.
- While 13 tasks is a great start on its own (given there's not much parallel research going on), there's a lot more that could be explored and expanded.
Thank you for your valuable comments and thoughtful suggestions, and hope our response will address your concerns.
W1: Potential negative impacts
Please kindly refer to "Author Rebuttal" Q3.
Q1: Option-setting methods
Please kindly refer to "Author Rebuttal" Q2.
Q2: Limitations of the outpainting method
We use the StableDiffusion model for outpainting, which means the effectiveness of the outpainting depends on the capabilities of the StableDiffusion model. Additionally, employing video editing models for outpainting inevitably introduces some artifacts, such as unnatural transitions at the boundaries or the potential introduction of new animals. However, we manually reviewed the edited videos and found that these issues do not alter the answers to the questions.
Q3: Types of LLMs
In Table 15, we present the types of large language models (LLMs) we used. Except for mPLUG-Owl, which uses LLaMA-7B, all others use Vicuna-7B.
Q4: Hallucination problem
Thank you for your suggestions. In fact, our ultimate goal is indeed to fine-tune a video-language model suitable for animal video understanding and wildlife protection. Recently, research on hallucinations has been emerging continuously, and we will further consider how to mitigate this issue in the future.
Q5: Regarding the GPT used
After manually pre-filtering the data according to the task, we used the gpt-3.5-turbo-0125 model on the remaining 48,043 data points. Each data point cost approximately 0.000035 dollars, with a total expenditure of around 1.7 dollars.
L1: Parameter sensitivity
Please kindly refer to "Author Rebuttal" Q4.
L2: The limitation of StableDiffusion model
As you mentioned, the editing model does have some limitations. Therefore, we conducted manual reviews of the results from StableDiffusion to filter out a certain amount of substandard outputs. Moving forward, we will further address this issue by developing automated tools and combining them with manual reviews to ensure the quality of the edited data.
L3: The limitation in the number of animal species
Although the 822 animal species included in our benchmark are still insufficient compared to the vast diversity in nature, they cover the characteristics of most animals in the seven main animal categories (and similar species often exhibit similar behaviors). Moving forward, we plan to further expand the dataset to encompass a broader range of animals. The distribution of the number of different animal species follows a long-tail distribution, which aligns with the general pattern where some species are more abundant while others are less so in the natural world.
L4: The potential hallucination
Thank you for your suggestion. We will emphasize the potential hallucination issues in future versions to ensure that users pay close attention to and consider these issues when using our dataset.
L5: More tasks
Thank you for your suggestion. The selection of these 13 tasks helps us understand the core capabilities of the model from a coarse to a fine-grained perspective. We hope our work will provide insights for the community and stimulate further exploration.
I read the other reviewers' reviews as well as the authors' responses to them. I am keeping my rating as is, but I highly suggest that the authors include their responses to reviewers' questions, as well as the reviewers' suggestions, in the updated paper upon acceptance. There are certain details missing from the paper that reviewers have pointed out and the authors have responded to; I would like to make sure these answers are also reflected in the actual paper, such as potential negative societal harms or dataset statistics. That is to say, my rating assumes that the promised open-sourced materials will be made available to the public so researchers can build upon them.
Thank you for your comment and suggestions. We are committed to including the relevant details in the final version of the paper and will also open-source the data and code.
- This paper introduces Animal-Bench, an animal-centric video understanding benchmark. The benchmark includes 13 tasks, spanning 7 major animal categories and 822 species.
- The authors collect data primarily from 6 open datasets, such as TGIF-QA and Animal Kingdom, and apply data filtering based on diversity and temporal sensitivity. They then generate task-specific QA pairs.
- To evaluate the robustness of video understanding models, the authors use realistic simulations based on video editing, including variations in weather conditions and shooting parameters.
- Eight video understanding models based on 7B LLMs are tested on Animal-Bench.
Strengths
- This work focuses on animal-centric visual understanding, which not only aids the community in better evaluating video understanding models but also holds significant social value.
- It provides a comprehensive classification of animal video understanding tasks, summarizing and categorizing existing open datasets.
- The study uses Animal-Bench to evaluate existing video models, conducting detailed performance and robustness analyses.
Weaknesses
- According to Table 3, the dataset suffers from severe class imbalance.
- If my understanding is correct, all videos and annotations in the dataset are derived from existing open datasets. Therefore, this benchmark primarily serves to summarize and convert data formats to a conversational format, without adding new manual annotations. Also, this dataset-building strategy poses a potential risk of data leakage.
Questions
- What are the resolution and duration of the dataset?
- How is data filtering specifically performed (e.g., manual filtering or automated filtering using other models)? Additionally, for each source dataset, what proportion of data is filtered?
- For the option-setting method described in L176-183, has the rationality and differentiation of this method been experimentally verified? For example, if a different random seed in setting options is used, how much would it affect the final results? For some questions, is the difference between the correct and incorrect options too large, making it easy to choose the correct one?
Limitations
The web page for accessing the data and code is still unavailable.
Thank you for your valuable comments and thoughtful suggestions, and hope our response will address your concerns.
W1: Class imbalance
Firstly, the size shown in Table 3 represents the number of data points for each task, rather than the number of animal classes or behavior classes. Our evaluations for each task are independent, so the numbers of test samples for different tasks do not affect each other.
Secondly, the primary design goal of our benchmark is to evaluate the understanding capabilities of current video-language models for animal-centric videos. We believe that each task contains sufficient data for evaluation (refer to MVBench (CVPR 2024), where each task has only 200 videos).
Additionally, for each task, we have collected as much data as possible to ensure the richness of our benchmark. For a few tasks, due to the limited availability of related videos, the number of videos included in the benchmark is relatively small. However, as mentioned earlier, the data volume for these tasks is still sufficient to validate model capabilities. We also plan to collect more data for these tasks to further enhance the richness of the dataset.
W2: No new annotations and potential data leakage
Firstly, we chose to use open datasets because they have been widely utilized and validated by the community, ensuring better data quality and label reliability. Additionally, collecting animal videos, especially wildlife footage, is challenging, and it is difficult for ordinary people to annotate such data. Therefore, using open datasets significantly saves manpower.
Secondly, we selected data for the designed tasks instead of simply stitching together existing datasets. We examined multiple datasets and chose those capable of accomplishing specific tasks while being as diverse as possible. Many of the videos in these datasets primarily feature humans rather than animals, requiring us to filter the data accordingly. We used a combination of manual and automated tools for data filtering. For details on the filtering process, please see Q2 below.
Thirdly, converting the original annotations into question-answer pairs is not that easy. It requires careful consideration of how to design questions and options to be as fair and reasonable as possible. In the "Rules" column of Table 3, we present the detailed rules for generating our question-answer pairs. Additionally, in "Author Rebuttal" Q2, we elaborate on our thought process and experiments to make the generation of question-answer pairs more reasonable when no directly available pairs were present.
In Table 5, we present the pre-training datasets of the models we tested, which do not include the datasets used by our Animal-Bench. Additionally, datasets we used such as Animal Kingdom, MammalNet, and LoTE-Animal were proposed after 2022, whereas the datasets used for model pre-training are primarily from before 2022, thus preventing any data leakage issues.
Q1: The resolution and duration
Please kindly refer to "Author Rebuttal" Q1.
Q2: Data filtering details
We adopt a combination of manual and automated filtering methods. First, after designing and determining the specific tasks, we manually select datasets that can support those tasks from a large number of animal category datasets, animal behavior datasets, and general QA datasets. Specifically, for tasks related to "Action," "Object," and "Time," we mainly obtain annotated data for animals, actions, and locations from Animal Kingdom, LoTE-Animal, and MammalNet. For "Counting" and "Reasoning" tasks, our evaluation data comes from TGIF-QA, MSRVTT-QA, and NExT-QA. For special tasks, we select data from Animal Kingdom, LoTE-Animal, and MammalNet.
Secondly, we filter the data to ensure that Animal-Bench only contains animal data. For datasets like TGIF-QA, MSRVTT-QA, and NExT-QA, where most videos are human-centric and a few are animal-centric, we use GPT-3.5 to filter animal videos based on question-answer pairs and annotations. The system prompt is: "Determine if the data is about animals, not humans, based on the question and answer." We then perform further manual filtering on the automatically filtered results to ensure that the data is solely about animals.
After that, we filter each dataset according to the designed rules (as shown in the last column of Figure 3). We believe these rules can ensure the fairness and moderate difficulty of the evaluation as much as possible. We have written data filtering code to perform automated filtering.
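For concreteness, here is a minimal sketch of the automated animal/human pre-filtering step described above. It is illustrative only, not our released filtering code; the prompt and model name follow the description above, while the function names, user-message format, and candidate data are simplified placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = ("Determine if the data is about animals, not humans, "
                 "based on the question and answer.")

def is_animal_sample(question: str, answer: str) -> bool:
    """Ask GPT-3.5 whether a QA pair is about animals; return True if it answers yes."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Question: {question}\nAnswer: {answer}\nReply 'yes' or 'no'."},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

# Hypothetical candidate pool; samples passing this check still go through manual review.
candidate_qa_pairs = [
    {"question": "What is the monkey doing?", "answer": "climbing a tree"},
    {"question": "What is the man holding?", "answer": "a guitar"},
]
animal_samples = [qa for qa in candidate_qa_pairs
                  if is_animal_sample(qa["question"], qa["answer"])]
```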
The overall data volume and filtered data volume for each dataset are as follows:
| Dataset | Overall volume | Filtered volume | Proportion |
|---|---|---|---|
| Animal Kingdom(AR) | 30100 | 13577 | 45.11% |
| Animal Kingdom(VG) | 18744 | 2718 | 14.50% |
| MammalNet | 18395 | 1626 | 8.84% |
| LoTE-Animal | 9991 | 602 | 6.03% |
| MSRVTT-QA | 244337 | 102 | 0.04% |
| TGIF-QA | 165165 | 3394 | 2.05% |
| NExT-QA | 52021 | 1007 | 1.94% |
Q3: Option-setting methods
Please kindly refer to "Author Rebuttal" Q2.
L1: Code and data
To present our data and code more clearly and comprehensively, we are currently working on organizing them and will make them publicly available soon.
Thanks for the detailed rebuttal, which resolves most of my concerns. Therefore, I would like to raise my score to borderline accept.
This work introduces Animal-Bench, a novel benchmark for evaluating multimodal video models in animal-centric video understanding. The benchmark covers 13 tasks spanning 7 major animal categories and 822 species. It proposes an automated pipeline for data filtering and question-answer pair generation, reducing human effort and potential biases. To simulate real-world shooting conditions, it employs video editing methods based on diffusion models to evaluate model robustness under various scenarios. This work evaluates 8 popular multimodal video models on Animal-Bench, identifying considerable room for improvement on animal-centric tasks.
Strengths
- This work introduces a comprehensive animal-centric benchmark covering a diverse range of tasks, including several that have been previously under-explored in the field.
- The authors claim to open source the code and data, which could be beneficial for the research community.
- By evaluating multiple recent multimodal video models on Animal-Bench, the work provides insights into current model capabilities and limitations, and highlights potential directions for future research and development.
Weaknesses
- The answer accuracy of the QA pairs. For example, in the "Reasoning" task illustrated in Figure 2, the correct answer appears to be "to fight with dog" rather than "cat".
- The question quality needs further improvement. 1) Ambiguity: in the "Time" task shown in Figure 2, the presence of multiple objects in the video frames renders the subject of the action ambiguous. 2) Inconsistency between video frames and question description: in the "Object Count" task in Figure 13, the setting appears to be a "grassland" rather than a "forest".
- The simulated changes intended to mimic real-world shooting scenarios exhibit noticeable artifacts and unrealistic situations. For example, Figure 11 shows visible boundaries from outpainting and implausible weather conditions (e.g., snow added to scenes with green grassland). To address this, the authors could consider implementing an aesthetic score-based filter or a specially trained discriminator to get rid of data with severe artifacts.
- Section 4.1 mentions resizing input videos to 224. For non-square videos (particularly those with highly disproportionate size), it's unclear whether additional operations (such as padding or cropping) were employed to accommodate the inputs. If such operations were used, an analysis of their potential impact on model performance across various tasks would be beneficial.
Questions
See [Weaknesses]
Limitations
The uniform set of parameters used for all models in the evaluation, as mentioned in Table 4, may not align with each model's recommended settings, such as temperature. This could potentially prevent the evaluation from fully leveraging the capabilities of individual models.
Thank you for your valuable comments and thoughtful suggestions, and hope our response will address your concerns.
W1, W2: Question and answer quality
In fact, although the example of the "Time" task in Figure 2 shows multiple objects, only one object is performing an action; the other objects remain stationary or their actions differ from the action in the question, thus avoiding action ambiguity. Since the question-answer pairs for the "Reasoning" and "Object Count" tasks are directly sourced from existing data annotations (NExT-QA, MSRVTT-QA), the quality of these question-answer pairs is influenced by the existing data annotations to some extent. In reality, we have manually filtered them to make as few errors as possible. Despite the minor errors, these errors do not pertain to key information that can affect the ability of humans and video language pre-training models to choose the correct answer.
W3: Artifacts and unrealistic situations
Thank you for your valuable suggestion. In fact, our goal is to mimic shooting parameters and weather changes in real-world scenarios to evaluate the practical applicability of the model. This is very different from the previous method of generating counterfactual and unnatural videos through video editing (such as the crane appearing tilted on the grassland or butterflies appearing in the water, as shown in the lower left corner of Figure 1) for model evaluation. As shown in Figure 11, our method can address the issue of unnatural scenarios. Even though using editing methods inevitably results in some artifacts, this does not affect the basic adherence to natural conditions. For example, our editing results would not show scenarios like butterflies flying in the water. In the future, we will pay more attention to improving the editing effects based on your suggestion.
W4: Video preprocessing
Setting 1: The specific video preprocessing process is as follows: if H > W, the frame is scaled to (224, 224 × H / W); if W > H, the frame is scaled to (224 × W / H, 224). After scaling, the video frames are center-cropped to obtain a central (224, 224) region.
Setting 2: We also experimented with padding non-square videos along the shorter side to make them square before scaling them to (224, 224).
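For clarity, here is a minimal sketch of the two preprocessing settings. It is an illustrative reconstruction using Pillow; the actual implementation may differ in interpolation and padding details.

```python
from PIL import Image, ImageOps

TARGET = 224

def setting1_resize_center_crop(frame: Image.Image) -> Image.Image:
    """Scale the shorter side to 224, then center-crop a 224x224 region (Setting 1)."""
    w, h = frame.size
    if h > w:
        frame = frame.resize((TARGET, round(TARGET * h / w)))
    else:
        frame = frame.resize((round(TARGET * w / h), TARGET))
    w, h = frame.size
    left, top = (w - TARGET) // 2, (h - TARGET) // 2
    return frame.crop((left, top, left + TARGET, top + TARGET))

def setting2_pad_then_resize(frame: Image.Image) -> Image.Image:
    """Zero-pad the shorter side to make the frame square, then scale to 224x224 (Setting 2)."""
    w, h = frame.size
    side = max(w, h)
    squared = ImageOps.pad(frame, (side, side), color=(0, 0, 0))  # pad to a square canvas
    return squared.resize((TARGET, TARGET))
```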
The results are as follows:
| | OE | OR | OC | AR | RS | AC | AS | AP | AL |
|---|---|---|---|---|---|---|---|---|---|
| Setting1 | 53.63 | 83.90 | 65.69 | 57.57 | 40.71 | 28.38 | 47.26 | 40.41 | 24.44 |
| Setting2 | 53.64 | 79.15 | 64.71 | 56.51 | 40.02 | 28.29 | 45.89 | 38.70 | 24.24 |
It can be seen that the video preprocessing method has a certain impact on the test results. Using padding causes a slight decrease in test accuracy, but the decrease is not significant.
L1: The parameter set
Please kindly refer to "Author Rebuttal" Q4.
I appreciate the authors' response and the additional experiments they have conducted. These efforts have partially addressed my initial concerns. After consideration, I have decided to maintain my original rating, primarily due to the remaining concerns with the models' uniform parameter setting.
We thank all reviewers for their valuable comments.
Due to the word limit for responses to each reviewer, we respond to common questions that were mentioned more than once here, and respond to other questions in the individual reviewer responses. In the numbering, "W" indicates a response to the "Weaknesses" part of the review, "Q" indicates a response to the "Questions" part, and "L" indicates a response to the "Limitations" part. We hope our replies will resolve your concerns.
Q1: The resolution and duration
Since our data is selected from multiple datasets, the resolutions of the original videos are not uniform. Most of the data have a resolution greater than or equal to 640×360, with the highest resolution being 1920×1440. A small portion of the data has a resolution less than 640×360, with the lowest resolution being 176×132.
The duration of all our videos ranges from 0.13 seconds to 9.52 minutes, with an average duration of 9.07 seconds. Due to different tasks requiring varying amounts of temporal information, the average video duration differs across tasks. For instance, the average duration for Object tasks is 4.13 seconds, while for Time tasks, it is 35.50 seconds.
Q2: Option-setting methods
Quantifying the difficulty of options is inherently challenging. In this study, we employ a qualitative analysis approach to achieve a moderate level of difficulty for the options.
For the action recognition task, we examined the frequency of various actions and found that they adhere to a long-tail distribution. We categorize common actions, or "head actions," such as "running" and "eating," as simple options that can be identified without specialized knowledge. In contrast, rare actions, or "tail actions," such as "molting" in birds, require specialized knowledge to identify and are thus classified as difficult options. Our approach involves incorporating the correct answer along with two simple options and one difficult option, thereby ensuring that the difficulty of the options is balanced and reflective of the natural frequency distribution of actions.
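For illustration, here is a minimal sketch of this frequency-based option rule. It is illustrative only, not the benchmark's released code; the tail-fraction threshold, label pool, and function names below are hypothetical.

```python
import random
from collections import Counter

def build_action_options(answer: str, all_action_labels: list[str],
                         tail_fraction: float = 0.3, seed: int = 0) -> list[str]:
    """Correct answer + two easy ('head', frequent) and one hard ('tail', rare) distractor."""
    rng = random.Random(seed)
    ranked = [a for a, _ in Counter(all_action_labels).most_common()]  # most frequent first
    split = int(len(ranked) * (1 - tail_fraction))
    head = [a for a in ranked[:split] if a != answer]   # common actions -> simple options
    tail = [a for a in ranked[split:] if a != answer]   # rare actions -> difficult options
    options = [answer] + rng.sample(head, 2) + rng.sample(tail, 1)
    rng.shuffle(options)
    return options

# Hypothetical label pool illustrating the long-tail distribution of actions.
labels = ["running"] * 50 + ["eating"] * 40 + ["swimming"] * 30 + ["molting"] * 2 + ["keeping still"] * 1
print(build_action_options("swimming", labels))
```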
For the object recognition task, we tested four situations:
- Random selection: Besides the correct answer, the other three options are randomly selected from all the animal species involved.
- Different major categories: Besides the correct answer, the other three options are randomly selected from different major animal categories than the correct answer. This setting makes the question-answer pairs easier because it is a coarse-grained judgment: the difference between the other three options and the correct answer is large, and if the model can identify correctly at the coarse-grained level, it can answer correctly.
- Same major category: Besides the correct answer, the other three options are randomly selected from the same major animal category as the correct answer. This setting makes the question-answer pairs more difficult because it is a fine-grained judgment: the difference between the other three options and the correct answer is small.
- Rules designed in this paper: Besides the correct answer, two options come from a different major animal category than the correct answer, and one option comes from the same major animal category as the correct answer. This design makes the question-answer pairs neither too difficult nor too easy.
The following are the accuracy rates of VideoChat2's responses:
| Option-setting method | Accuracy (%) |
|---|---|
| Random selection | 91.97 |
| Different major categories | 97.60 |
| Same major category | 73.83 |
| Rules designed in the paper | 83.90 |
The results indicate that the selection of options affects the experiment results, which also supports our theoretical analysis above. Our design can moderate the difficulty of the question-answer pairs, making the evaluation of the model more aligned with real-world scenarios.
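For concreteness, here is a minimal sketch of the designed rule for object recognition options. It is illustrative only, not the benchmark's released code; the species lists, category mapping, and function names below are hypothetical.

```python
import random

def build_species_options(answer_species: str, answer_category: str,
                          species_by_category: dict, seed: int = 0) -> list[str]:
    """Correct answer + one hard distractor (same major category) + two easy ones (other categories)."""
    rng = random.Random(seed)
    same_pool = [s for s in species_by_category[answer_category] if s != answer_species]
    other_pool = [s for cat, species in species_by_category.items()
                  if cat != answer_category for s in species]
    options = [answer_species, rng.choice(same_pool)] + rng.sample(other_pool, 2)
    rng.shuffle(options)
    return options

# Hypothetical category-to-species mapping.
species_by_category = {
    "mammal": ["snow leopard", "giant panda", "red fox"],
    "bird": ["bald eagle", "kingfisher"],
    "reptile": ["green iguana", "king cobra"],
}
print(build_species_options("snow leopard", "mammal", species_by_category))
```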
Q3: Potential negative impacts
Although the technology itself is intended to protect and study wildlife, if it falls into the hands of malicious actors, it could be used for illegal hunting and animal exploitation. These actions could cause damage to wildlife populations. Also, excessive reliance on technology for animal monitoring and protection may lead to neglect of manual patrols and traditional conservation methods. Considering these potential negative impacts, we hope that the community adheres strictly to laws and regulations when applying relevant data and technology, to ensure the correct and safe use of technology.
Q4: Parameter sensitivity
While using a uniform set of parameters may not perfectly align with each model's best settings, this approach ensures fairness and standardization in evaluating all models, which helps in identifying relative performance differences due to model architecture rather than parameter tuning. Moreover, in real-world applications, optimal settings are not always known or achievable, so using uniform parameters also serves to probe the robustness of models to parameter variations. In the future, we will also conduct further research to test each model with other settings, thereby gaining a more comprehensive understanding of each model's potential.
Thanks for your detailed rebuttal. It would be great if all reviewers could read the rebuttal, at least acknowledge it, and respond if necessary.
This article presents an animal-centric dataset for the assessment of multimodal video models. The work is timely and will certainly be of interest; thus it should be accepted.
Firstly, the paper certainly needs careful grammar checking. Secondly, the reviewers provided a number of suggestions that the authors should consider for the final version. I have also identified a number of additional issues that I will mention below.
Comments on the manuscript:
The authors argue that previous benchmarks have little focus on animals. However, this quantification is based on MVBench (Fig 1, top left), which seems anecdotal. Please support this statement also based on other data.
A table comparing different benchmarks is missing (incl. the various sources).
Examples in Figure 2 seem to be full of errors:
- Formatting, space at wrong spot in: “(A )Shaking head”
- It appears to me that the insect in the “action prediction” task is not the same and that it shows 2 different insects arriving (rather than one exiting)? [Temporal order unclear]
- For the answer to “Time”, how did the animal arrive in frame 2 vs. frame 1, if there was no locomotion at the beginning of the video? [“at the end” is marked as a correct example for this example]
- I’m honestly unsure about this one, but why would “circumanal gland signing behavior” be a social interaction?
- Typos in the reasoning example: “why does the cat jumps [..]”; the answer is also grammatically and factually incorrect (it's a dog, not a cat).
- Typos in the counting examples “how many antelope does a leopard attack?”, “How many times does the panda shake its body back and forth (extra space) ?” — Also how can one know the correct answer (4) from 3 still images?
- The question of reviewer xc4z along those lines was answered with: "In reality, we have manually filtered them to make as few errors as possible. Despite the minor errors, these errors do not pertain to key information that can affect the ability of humans and video language pre-training models to choose the correct answer.”
If these typos are intended, I think the rationale for the typos should be explained and also indicated in the caption.
The figure captions are very sparse and do not explain the relevant context.
- Typos in lines 137, 140, 246 (pixel missing)
- Figure 7 misses an x-axis.
- Figure 5: clarify if change is in percent vs. in accuracy points.
- Apart from [25], references in 4.1 missing (for models).
- Table 1, can you calculate a confidence interval for the “random” model, and then indicate which models are actually significantly better than the random baseline?
- Formatting: Main table 1 -- the task acronyms seem to be nowhere defined (at least not in the main text)?
I'd find it great if the authors also reported performance for the questions from TGIF-QA, MSRVTT-QA, and NExT-QA individually and compared the performance to SOTA models from those datasets. This would be interesting to 1) assess leakage (question from reviewer yXxA), as the datasets are older than the MLMs, and 2) show more baselines (and compare with the prior literature for those widely used datasets). Is it true that models are worse on the animal subset of those benchmarks?
Ceiling performance:
One of the arguments in the paper (also abstract) is that there is much room for improvement. The authors then focus on 7B models (for comparability). Simply evaluating bigger models would thus also be interesting to assess this claim. Given that the authors do not train, but just evaluate the models that seems feasible as well.