PaperHub
ICLR 2025 · Withdrawn
Average rating: 4.8/10 from 5 reviewers (scores: 6, 5, 3, 5, 5; min 3, max 6, std 1.0)
Average confidence: 3.8 · Correctness: 2.4 · Contribution: 2.4 · Presentation: 3.0

R-Bench: Are your Large Multimodal Model Robust to Real-world Corruption?

Submitted: 2024-09-27 · Updated: 2024-11-27
TL;DR

Benchmarking the Real-world Robustness of LMMs

Abstract

Keywords
Large Multimodal Models, Robustness, User Generated Content, Perceptual Quality, Corruption

Reviews and Discussion

Review (Rating: 6)

This paper introduces R-Bench, a benchmark focused on the robustness of LMMs (Large Multimodal Models) under real-world corruptions. The authors propose a novel Relative Robustness (R_r) metric to address the limitations of existing metrics, particularly Absolute Robustness (R_a). R-Bench evaluates LMM performance from comprehensive perspectives, including 33 corruption dimensions spanning 7 steps and 7 groups of low-level visual attributes, as well as a reference/distorted image dataset with 2,970 QA pairs. Extensive experiments are conducted on 20 mainstream LMMs across three tasks: Multiple Choice Questions (MCQ), Visual Question Answering (VQA), and Captioning (CAP). This benchmark points to a promising direction for advancing LMM evaluation in real-world situations.

Strengths

S1: Overall, the authors have provided extensive experiments with detailed discussions.

S2: The authors propose a new metric, Relative Robustness (R_r), with the aim of addressing the limitations in existing benchmarks.

S3: R-Bench covers more corruption dimensions than other benchmarks such as MMC-Bench, spans more tasks (MCQ, VQA, and CAP), and provides assessments of both Absolute and Relative Robustness.

S4: Figures are presented in high quality.

S5: Practical guidelines and takeaways are provided for practitioners regarding both training and usage, as well as LMM evaluation.

S6: Comparison with human performance is presented in Section 4.5 to show the gap in real-world robustness.

S7: The proposed benchmark R-Bench has the potential to evaluate robustness for LMMs, which can facilitate further improvements in LMM robustness.

Weaknesses

W1: Since Relative Robustness (R_r) is the main proposed metric aimed at addressing the limitations of existing evaluations, the major issue is that it is unclear whether the advantages of the proposed R_r in R-Bench over existing metrics, particularly Absolute Robustness (R_a), are demonstrated through experimental results.

While the authors have provided the conceptual advantages of R_r in Figure 3 with an illustrative example, as well as comprehensive results (Figs. 4-6, Table 3) for 20 LMMs with detailed discussions (which is appreciated), it is crucial to justify experimentally, not just through illustrative conceptual examples, how and to what extent R_r, or the combination of R_r and R_a, is more effective than R_a alone in capturing LMM robustness against real-world corruption. Further experiments are encouraged as follows:

W1.1: Demonstrate when and why R_a cannot capture certain aspects of robustness mentioned in Figure 3 (the four reference/distorted combinations), analyze the performance across the three LMM tasks, and provide case studies (such as distributions or concrete examples for each robustness type), or conduct a correlation analysis between R_r and robustness or real-world performance. Figure 1 contains nice illustrative examples, but the authors are encouraged to provide quantitative results from the dataset.

W1.2: Demonstrate how R_r helps resolve this limitation by revealing the missing aspect of robustness through detailed comparisons. It would be beneficial to quantify these differences and to provide some qualitative examples, which could go in the Appendix. Without these key comparisons, the findings may be seen as just another set of comprehensive results, making it difficult to substantiate the proposed R_r's advantage as stated in the third contribution: “...we have proposed the concepts of absolute/relative robustness…”

W1.3: Subsequently, in Table 3, the rankings from R_r (GPT4o, InternX2, MPlugOwl3, …) and R_a (GPT4o, GPT4Turbo, GeminiPro, …) differ. The authors have analyzed the differences with detailed discussions (Lines 408-421). Line 262 states: “R-Bench will calculate the performance of all LMMs individually based on the above two metrics (Relative and Absolute Robustness) and use the average value for the final robustness ranking.” Since the Overall Ranking summarized in Figure 4(a) incorporates R_r, the authors are encouraged to justify why this ranking is better than the one provided by R_a alone, given that R_r is the newly proposed robustness metric.

W2: Missing R_r in Section 4.5 (GPT4o vs. Human). My interpretation of this omission is twofold: (1) LMMs are more proficient in R_a; if the gap (LMMs vs. humans) is already shown to be large in R_a, it will be even larger in R_r, so there is no need for further experiments; (2) human answers are almost equivalent to the ground truth, suggesting no statistical difference. However, since one main contribution is the proposed R_r, its absence makes the R-Bench results seem incomplete. Including R_r results, or providing more convincing arguments for their omission, would make the work more complete.

In summary of the weaknesses, although the comprehensive experiments provide valuable insights (e.g., the incorporation of machine-related distorted images in the training process), the role of the proposed R_r in supporting R_a remains unclear from the experimental evidence in Figs. 4-6 and Table 3.

These comprehensive experiments and analyses result in an overall rating of 6, marginally above the acceptance threshold. However, the identified weaknesses relate to the central claim of the proposed Relative Robustness: the lack of experimental results demonstrating how R_r improves on existing metrics, especially R_a, prevents a higher score. It would be helpful if the authors could clarify any key messages that may be missing from this review.

Presentation

P1: Terminology Consistency: Distortion vs. Corruption. This might be a little strict on terminology: the term “corruption” is used as a general term to describe the visual attribute degradation of an image throughout the paper, while “distorted image” is also commonly used. Do “corruption” and “distortion” mean exactly the same thing, or is “distortion” a specific kind of “corruption”? Both terms are used throughout the paper, and they seem to be used interchangeably to refer to the same thing. Although they have similar meanings, they focus on different aspects, especially in vision: “distortion” is more about altering shape or appearance, while “corruption” relates to data integrity or degradation.

Taking the visualizations in Figure 8 as concrete examples, “Block Exchange” and “Block Lost” are types of corruption, yet most of the text uses “distorted image”. It is unclear whether this term refers to exactly the same types of corruption listed in Figure 8. It would be clearer if the authors provided both definitions if the terms have distinct meanings, or further clarification if they are being used interchangeably.

P2: Line 251 “...but since the baseline of the reference is already high, this does not necessarily lower the appearance of R_a.” R_a is the absolute robustness defined as Score(GT, LMM(I_dis)); however, the authors' intended meaning of the phrase “lower the appearance of the absolute robustness” is unclear. It seems the authors intended to convey that the value of R_a may appear high yet fail to capture true robustness (a false high). Rephrasing for greater clarity would be helpful.

P3: Line 400 “Figures (b) and (c) demonstrate that GPT4o” seems to refer to Figure 4, following the context of the text, but it is placed next to Figure 5, which also has sub-figures (b) and (c). This placement is initially confusing. It would be clearer to explicitly mention Figure 4 when referring to (b) and (c), or to place the text and figure closer together.

P4: Line 404 “This reflects that their training process may have incorporated machine-related distorted images, especially compressed and partially masked ones (which correspond to steps 4 and 5, where LMMs are currently most proficient), thus having some robustness”. This is a nice summary, and it would be beneficial if specific references to these LMMs' training processes were provided for practitioners.

P5: It is unclear what the cyan and orange colors represent in Figure 4(a) until later in the text on Line 376. It might be beneficial to place the explanation closer (e.g., in the caption) for easier reference.

P6: In Figure 6, the “Quality” label in the legend is blue, while the corresponding bars in the chart appear cyan, making it unclear whether the brightness of the bars encodes some value. No explanation is provided in the caption or text. The way “Quality” is quantified is also unclear; no formal definition is provided. This leaves room for multiple interpretations, such as subjective human scoring or metrics like PSNR or SSIM, but these are absent from the paper. It would be great if the authors could provide a clear definition of how “Quality” is measured and include the specific metric or method in the figure caption or main text.

Minor Issues (Typos or Grammatical Errors)

M1: Figure 3 caption - “3. LMM are able to … 4. LMM are able to”: “are” -> “is”, or “LMM are” -> “LMMs are”?

M2: Figure 3 caption - “But corruption make it correct as coincidence.” may be corrected as “But the corruption makes it appear correct by coincidence.” (False Positive)?

M3: Line 212 The reference mark is placed after the period - “...on KADID-10K. (Lin et al., 2019) As space limits…”.

M4: Line 475 “Figure 6 further analyzes the low-level features of the reference/distorted image in R-Bench.We cal…” There is no space between two sentences.

Questions

Q1: How is the score computed (e.g. score(Clear,Sunny)) in Figure 3? Although it is described in Equation (1) and in Line 248 - “function Score(·) compute the similarity between ground truth answer GT and the LMM(·)” - it remains unclear to me. It would be great to provide the specific mathematical formulation of score(). Is it based on cosine similarity or inverse distance (such as Euclidean Distance)? Is it computed in a latent space, and if so, which pre-trained model is used? Providing these details seems necessary as the score can vary dramatically using various similarity functions.
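
For concreteness, one hypothetical implementation (a purely lexical TF-IDF cosine similarity, shown only to make the question concrete, not taken from the paper) could look like:

    # Hypothetical Score(.) implementation, used only to illustrate the question above.
    # The paper does not state that it uses TF-IDF cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def score(gt: str, lmm_answer: str) -> float:
        """Similarity in [0, 1] between the ground-truth answer and an LMM answer."""
        vectors = TfidfVectorizer().fit_transform([gt, lmm_answer])
        return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

    print(score("Clear", "Sunny"))   # 0.0: no lexical overlap
    print(score("Clear", "Clear"))   # 1.0: identical answers

An embedding-based scorer would instead assign score("Clear", "Sunny") a relatively high value because the two answers are semantically close, which is exactly why the specific choice of Score(·) can change the reported robustness numbers.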

Q2: Figure 1 caption - “Experiments demonstrate that the LMMs solve the original image but hallucinate against corruption”. What does “solve the original image” mean from a technical perspective? It seems to imply that LMMs can solve a specific task (e.g., MCQ, VQA, or CAP) by providing decent answers given the original reference image. It is difficult to translate “solve the original image” into a specific, technically meaningful action.

Q3: Line 503 “...it achieved a perfect score in the R-Bench evaluation.” It is unclear which perfect score the authors are referring to, as no “perfect score” (e.g. 1.0) has been presented in the previous results. It seems that the authors intend to convey “best score” or “best performing model”. Further clarification would be beneficial.

Details of Ethics Concerns

The Ethics Statement states that this research complies with the ICLR Code of Ethics. Users' personal information collected for the question-pair dataset is kept confidential.

Review (Rating: 5)

The paper introduces R-Bench, a benchmark aimed at evaluating and enhancing the robustness of large multimodal models (LMMs) against real-world corruption. R-Bench simulates the corruption process, from user image capture to LMM processing, and includes 33 corruption dimensions, providing a dataset with 2,970 human-labeled question-answer pairs. The benchmark results indicate that while LMMs perform well on original images, their performance declines significantly with distorted images, highlighting a robustness gap compared to human visual perception.

Strengths

  1. The paper presents a comprehensive pipeline to assess LMM robustness against real-world corruption, covering both data collection and evaluation metrics.

  2. It introduces 33 corruption dimensions, grouped into 7 categories, including in-the-wild corruptions like environmental and camera interference.

  3. The study analyzes model robustness across different corruption types and identifies areas needing improvement, providing useful insights.

Weaknesses

  1. The paper lacks a discussion on improving robustness against machine-related distortions through fine-tuning with simulated data. It should provide a baseline to determine whether improvements are easily achievable.

  2. Details on dataset collection, especially for in-the-wild corruptions, are insufficient.

  3. The definition of corruption levels, particularly the "high" level, is questionable. If corruption is so severe that it misleads humans, is it meaningful to explore robustness under such conditions?

  4. The paper does not address how image enhancement methods like SUPIR[1] and Real-ESRGAN[2] might affect LMM performance.

  5. The analysis of the gap between how image corruption is perceived at the signal-processing level, the human subjective level, and the LMM level is insufficient.

[1]. Yu, Fanghua, et al. "Scaling up to excellence: Practicing model scaling for photo-realistic image restoration in the wild." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2]. Wang, Xintao, et al. "Real-esrgan: Training real-world blind super-resolution with pure synthetic data." Proceedings of the IEEE/CVF international conference on computer vision. 2021.

Questions

Please refer to the weaknesses.

Review (Rating: 3)

The paper proposes a benchmark for multi-modal LLMs (here called LMMs): R-Bench includes 33 types of corruption, organised into seven sequences and groups, along with a dataset of reference and corrupted images with 2,970 labelled question-answer pairs for evaluation. LMMs perform well on clean images, but their robustness against distorted images falls short compared to humans, highlighting the need for improvement in real-world applications.

Strengths

The paper addresses an important problem. The low robustness of perception models concerning common, real-world image corruptions/degradations/nuisance shifts is an issue that needs to be studied and improved.

The proposed benchmark is a combination of existing samples from other benchmarks and newly collected images and annotations. The new samples ensure that large VLMs have not yet been trained on this data, resulting in a fair test performance.

Weaknesses

Positioning

The research community has been investigating model robustness for many years before the advent of VLMs/LMMs. The discussion of the related work and the positioning of the paper are missing a large portion of this work. For example:

Barbu et al., 2019; ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models

Recht et al., 2019; Do ImageNet classifiers generalize to ImageNet?

Wang et al., 2019; Learning robust global representations by penalizing local predictive power

Hendrycks et al., 2021a; The many faces of robustness: A critical analysis of out-of-distribution generalization

Hendrycks et al., 2021b; Natural adversarial examples

Zhao et al., 2022; A benchmark for robustness to out-of-distribution shifts of individual nuisances in natural images

Geirhos et al., 2022; ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness

Idrissi et al., 2022; Imagenet-x: Understanding model mistakes with factor of variation annotations

Findings

A discussion of the contribution of this paper in light of this related work is important to position the paper and to understand which findings are new. Of course, many of the earlier works target different tasks, such as image classification or representation learning, but the findings are the same: better models are more robust, larger models are more robust, more data helps, etc.

Robustness Definition

The paper uses a simple score for robustness: multiply the performance score of the clean image with the one from the corrupted image. This has some known disadvantages: it mainly entangles model performance with robustness. Many attempts have been made to formalise and benchmark robustness. An overview can be found in Drenkow et al. (2021) “A Systematic Review of Robustness in Deep Learning for Computer Vision: Mind the gap?” It would be important to understand why the proposed metric was chosen in favour of more sophisticated ones.
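
A hypothetical numeric illustration of this entanglement (assuming the product form described above; the numbers are invented for illustration, not taken from the paper): a model scoring 0.5 on clean images and 0.5 on corrupted images has not degraded at all, yet its product is 0.5 × 0.5 = 0.25, while a model scoring 0.9 on clean images that drops to 0.6 under corruption gets 0.9 × 0.6 = 0.54. The product therefore ranks the stronger but less stable model higher, because it rewards raw accuracy as much as stability under corruption.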

Questions

In addition to the weaknesses above, several questions remain after reading the submission.

  • Will the benchmark be released? The paper only mentions releasing the code.
  • With new benchmarks, it is very helpful to show many example images in the supplementary material so that the reader can get an impression of the breadth of the dataset. The current version of the paper contains only a very small number of examples and only one sample with many corruptions.
Review (Rating: 5)

This paper proposes a dataset benchmark for evaluating the robustness of current major vision LLMs. Specifically, they collect ~3K images and add manual distortions, e.g., blur, noise, and compression. They then evaluate most of the SOTA vision LLMs on the introduced dataset (but without a human baseline) for three tasks, i.e., MCQ, VQA, and CAP. They also present statistics showing the limited robustness of current vision LLMs.

Strengths

  1. The experiments are comprehensive and cover almost all SOTA vision LLMs.

  2. The presentation is clear and nice. The demonstration of statistics is straightforward and easy to read.

  3. The proposed benchmark is considered useful for comparing vision LLMs quantitatively.

Weaknesses

  1. There might be a reproduction issue: the failure cases might not always occur for vision LLMs. The authors need an error bar for each experiment to demonstrate how large the variance is. Examples are included in the Questions section below.

  2. The questions might be ambiguous. For example, in the leftmost image of the teaser, I would say there are only two signs in the picture, since the top one is a text instruction instead of a sign (when the image is given to ChatGPT 4o, it also selects B, which is reasonable from my point of view).

  3. A 3K-image dataset might be too small to cover the variance of real-world object/scene distributions.

  4. It is necessary to include single human performance as a baseline in Table 3.

Questions

  1. I tried to screenshot the corruption examples in the teaser figure and feed them to ChatGPT 4o, but could not reproduce the issue. Instead, all three answers seem reasonable to me. It is important to generate the corrupted image multiple times and feed it to the LLMs to analyze randomness and avoid adversarial examples generated by chance.

a) B. 2 signs. There are two signs mounted on the same pole in this image. why? The image shows two distinct sign panels mounted on a single pole, one above the other: Top sign panel: This panel contains information about specific parking or permit requirements. Bottom sign panel: This panel has a "No Parking" symbol with additional details about restrictions or timings. While they are on the same pole, each panel conveys separate information, making it count as two signs.

b) The green logo on the car in the image is the Monster Energy logo. It's recognizable by its three jagged green "claw" marks, which resemble the letter "M." This logo represents the energy drink brand, known for its association with motorsports and extreme sports events.

c) The image shows a blurred, abstract indoor scene with soft lighting, rectangular light reflections, a vertical red structure, and a green exit sign, creating a dreamlike, motion-blurred effect.

  2. What is the instruction used when prompting the LLMs? I would expect a template for the output to be passed as part of the instruction to the LLMs; otherwise, if you just ask the question (e.g., "Please describe this image in detail."), the output of the LLM will be too long and complex.
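
As a hypothetical example of the kind of constrained instruction this question has in mind (invented for illustration, not quoted from the paper):

    MCQ: "Answer with only the letter of the correct option (A, B, C, or D). Do not explain."
    CAP: "Describe the image in one sentence of no more than 20 words."

Whether R-Bench uses such templates, and how free-form answers are then matched against the ground truth, would be useful to specify.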

Details of Ethics Concerns

They include a discussion of ethics concerns.

Review (Rating: 5)

The paper introduces a dataset called R-Bench for benchmarking the robustness of large multimodal models (LMMs) against real-world corruptions. Unlike existing benchmarks, which typically focus on adversarial attacks or machine corruptions, the paper highlights that R-Bench is the first to address in-the-wild corruptions, such as environmental illumination interference and obstacle occlusion. Additionally, the paper presents a relative robustness metric that jointly considers the correctness and self-consistency of LMMs. Through extensive experiments on 20 LMMs across three tasks using R-Bench, the paper provides some insights into further optimizing LMMs to withstand real-world corruptions.

Strengths

  1. The paper is well-structured and easy to follow. The included figures and tables, such as Figures 1 and 2, are especially clear and helpful for understanding the paper's core concepts.

  2. The paper addresses an important general problem: how can we apply LLMs more effectively to real-world scenarios?

  3. The defined corruption steps, groups, and dimensions are comprehensive, covering the entire pipeline from image capture to LMM analysis.

  4. To study in-the-wild corruptions, the proposed R-Bench includes newly collected data --- both clean and corrupted --- obtained by operating robots in various environments, a process known to be relatively challenging.

  5. The LLMs examined in the paper, such as GPT4o and Qwen2-VL, are representative.

Weaknesses

Major:

  1. The reasons provided by the paper (Section 3 and Figure 3) for introducing a new metric (relative robustness, as shown in Equation 2) are not convincing:

To assess the robustness of LLMs when ground truth is available, why not simply present the difference between Score(GT, LMM(I_ref)) and Score(GT, LMM(I_dis)), which would be more natural and straightforward? As the paper mentions, incorporating the self-consistency metric Score(LMM(I_ref), LMM(I_dis)) may be unreasonable, as a poorly performing model that produces consistently incorrect outputs could still receive a high score.

The paper does not provide real examples or discussions demonstrating what the proposed relative robustness metric captures that a typical absolute metric does not. From Table 3 and Figure 5, although there are value differences, the general trends captured by the two metrics do not seem significantly different. Additionally, the two metrics might inherently capture similar aspects by definition. Specifically, high relative robustness = high similarity between GT and LMM(I_ref) + high similarity between LMM(I_ref) and LMM(I_dis) ≈ high similarity between GT and LMM(I_dis) = high absolute robustness.
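
Written out in the Score(·) notation, the argument above is roughly (a restatement of this reviewer's reasoning, not an equation taken from the paper):

    Score(GT, LMM(I_ref)) high  and  Score(LMM(I_ref), LMM(I_dis)) high
        ⇒  Score(GT, LMM(I_dis)) high, i.e., high absolute robustness R_a,

so the two metrics would be expected to largely move together.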

  2. Although the paper emphasizes that a major contribution of R-Bench is the inclusion of in-the-wild corrupted data collected by the authors, it appears that this data constitutes only a relatively small portion (15%) of the entire dataset, as shown in Figure 2. A large portion of the data is sourced from existing benchmarks and synthetically corrupted. Additionally, the paper does not clarify how certain types of in-the-wild corrupted data (e.g., bright or dark illumination interference) can be systematically collected, nor does it explain why certain types of corruptions (e.g., occlusions [1]) cannot be synthetically generated.

  3. The paper states that GPT4o has a significant gap compared to human performance (Ln 502). However, according to Table 4, the differences seem relatively small (mostly within 10%) except for the MCQ task. A more detailed analysis comparing GPT-4o and human performance would provide valuable insights for the community. For example:

-- Although the paper emphasizes that LLMs perform exceptionally poorly on Step: EI, Step: CI, and Group: Wild, humans show similar patterns. How should this be interpreted? Does it imply that certain questions in the dataset become nearly impossible to answer correctly when specific types of corruption are applied? Please provide a few examples.

-- Provide detailed performance results of the MCQ task, as it shows the largest discrepancies.

Minor:

  4. The study on human robustness was conducted with only five average subjects (Ln 299), which may not be sufficiently representative.

  5. There are a few errors in the paper:

-- The compression symbols (+) are missing in Table 2.
-- Figure 3 is included but is never referenced in the paper.
-- Figure 5(b) and Figure 5(d) are identical.

[1] Ghiasi, Golnaz, et al. "Simple copy-paste is a strong data augmentation method for instance segmentation." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

Questions

  1. Please clarify the significance of the proposed relative robustness metric in response to Weaknesses-1.

  2. Please provide further details about the collected in-the-wild data in response to Weaknesses-2.

  3. Based on Weaknesses-3, please include additional discussions and details on the comparison between GPT-4o and human performance.

  4. Please double-check the manuscript in response to Weaknesses-5.

Details of Ethics Concerns

Not applicable.

Withdrawal Notice

Dear Reviewers,

Thank you for your insightful comments. Due to time constraints, we have not yet completed some of the experiments requested in the comments. Unfortunately, we have to withdraw the paper.

Anyway, we are still grateful for the inspiration provided by most of the reviewers. We will further polish R-Bench based on the reviews for a possible new submission.

Best regards,

R-Bench Author Team