Dog-IQA: Standard-guided Zero-shot MLLM for Mix-grained Image Quality Assessment
We propose Dog-IQA, an IQA framework that uses scoring standards to guide an MLLM in assessing image quality in a mix-grained manner.
Reviews and Discussion
In this paper, the authors propose Dog-IQA, a training-free image quality assessment (IQA) method designed to address limitations in out-of-distribution generalization and high training costs. By leveraging the advanced prior knowledge of multimodal large language models (MLLMs), Dog-IQA aligns IQA scores more closely with human evaluations. Specifically, two main improvements are introduced: 1) discrete scoring rather than continuous value inference, and 2) a segmentation-based mix-grained aggregation mechanism to refine final quality scores. Competitive results across cross-dataset scenarios validate the method's advantages.
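For concreteness, a minimal sketch of the discrete-scoring idea follows; the level names and the seven-point mapping here are illustrative assumptions, not the paper's exact standard or prompt:

```python
# Illustrative 7-level standard: map the MLLM's discrete quality-level answer
# to an integer score instead of asking it to regress a continuous value.
# The level names below are hypothetical, not the paper's exact wording.
LEVEL_TO_SCORE = {
    "terrible": 1, "bad": 2, "poor": 3, "fair": 4,
    "good": 5, "excellent": 6, "outstanding": 7,
}

def parse_discrete_score(mllm_answer: str) -> int:
    """Turn the MLLM's level answer (e.g., 'good') into a score in {1..7}."""
    level = mllm_answer.strip().lower().rstrip(".")
    return LEVEL_TO_SCORE.get(level, 4)  # fall back to the middle level
```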
Strengths
- Enhances MLLM capabilities for image quality inference in a training-free manner, effectively reducing training costs.
- Proposes a human rating-inspired quality inference approach, incorporating discrete scoring and local-global quality aggregation.
Weaknesses
- The authors use SAM for segmentation during quality inference, but running such a complex model as a preprocessing step adds computational cost. If a secondary model is used before the MLLM for image preprocessing anyway, one might instead integrate an IQA-specific model to improve performance.
- Ablation studies were conducted only on the SPAQ and AGIQA-3k datasets. However, testing mPLUG-Owl3 on other datasets with an approach similar to Q-Bench's achieves comparable performance without the proposed strategy, suggesting that its effectiveness is limited across datasets.
- Generalization to other MLLMs should be verified, as the method should apply across diverse MLLMs for practical use; broader testing would confirm the proposed strategy's robustness.
- Segmentation may be less effective in cases of severe distortion, which is common in IQA tasks. In such cases, segmentation failures could potentially negatively impact quality inference.
Questions
- Ablation studies on additional existing datasets are recommended to further validate the effectiveness of each design component.
- Verifying generalization across other MLLMs is also suggested to ensure robustness.
This paper introduces Dog-IQA, a stanDard-guided zero-shot mix-grained image quality assessment (IQA) method that exploits multimodal large language models (MLLMs). Dog-IQA objectively scores the entire input image and the local patches segmented from it, within a pre-defined 7-level discrete range. These scores are then aggregated, with each local score weighted in proportion to the area of its patch. In experiments, Dog-IQA outperforms previous methods even without training.
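A minimal sketch of that aggregation, assuming a normalized area weighting and an even global/local split (the paper's exact combination rule may differ):

```python
import numpy as np

def mix_grained_score(global_score: float, local_scores, areas) -> float:
    """Combine the whole-image score with area-weighted local patch scores.
    `areas` holds each segment's pixel area; each weight is that segment's
    share of the total segmented area."""
    local_scores = np.asarray(local_scores, dtype=float)
    areas = np.asarray(areas, dtype=float)
    weights = areas / areas.sum()
    weighted_local = float(weights @ local_scores)
    return 0.5 * global_score + 0.5 * weighted_local  # assumed even split
```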
Strengths
- Insight 2 is a good way to exploit the knowledge of MLLMs.
- Table 1 gives a reason why 7-level grading is sufficient.
- The detailed ablation study of each component of Dog-IQA makes it easier for readers to understand the method.
Weaknesses
- Insight 1 seems to be just a hypothesis and is not verified; it needs some citations at least.
- The final score is not bounded and its maximum value varies across benchmarks. In addition, the score may be affected by the number of segmented objects. Can Dog-IQA then serve as a consistent IQA metric for general use cases? For example, how should the score be interpreted for real-world images? If image A scores 6.10 and image B scores 6.20 but they have very different numbers of segments, which one is actually better (see the toy sketch after this list)? I don't think the proposed method guarantees fairness.
- The performance largely comes from the underlying backbone MLLM, mPLUG-Owl3. For a fair comparison, how about CLIP-IQA with mPLUG-Owl3, or Dog-IQA with CLIP?
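To make the boundedness concern concrete, here is a toy sketch under two assumed aggregation rules (not necessarily the paper's exact formula): without normalization the score scale depends on image size and segment count, while with normalization the two images become comparable.

```python
def unnormalized(global_s, local_s, areas):
    # areas in raw pixels: the sum grows with resolution and segment count
    return global_s + sum(s * a for s, a in zip(local_s, areas))

def normalized(global_s, local_s, areas):
    # area fractions: the local term stays within the 7-level score range
    total = sum(areas)
    return global_s + sum(s * a for s, a in zip(local_s, areas)) / total

# Image A: 3 large segments; image B: 12 small segments, same quality level.
a_scores, a_areas = [6, 6, 6], [40_000, 30_000, 30_000]
b_scores, b_areas = [6] * 12, [8_000] * 12
print(unnormalized(6, a_scores, a_areas), unnormalized(6, b_scores, b_areas))
print(normalized(6, a_scores, a_areas), normalized(6, b_scores, b_areas))  # both 12.0
```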
Questions
- It seems "remain mask" contains unsegmented areas then those may not important region of interest for IQA. I'm not sure if it's good to use those information. Could the authors provide some examples of the remain masks and some analysis?
- Could the authors compare with the following methods?
- Re-iqa: Unsupervised learning for image quality assessment in the wild
- Quality-aware pre-trained models for blind image quality assessment
- Blind image quality assessment via vision-language correspondence: A multitask learning perspective
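One plain reading of the "remain mask" (an assumption, since the paper may compute it differently) is simply the set of pixels not covered by any segment:

```python
import numpy as np

def remain_mask(segment_masks, height: int, width: int) -> np.ndarray:
    """Boolean mask of pixels not covered by any segmentation mask.
    `segment_masks` is a list of HxW boolean arrays from the segmenter."""
    covered = np.zeros((height, width), dtype=bool)
    for mask in segment_masks:
        covered |= mask
    return ~covered
```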
The authors propose a standard-guided zero-shot mix-grained image quality assessment method (Dog-IQA). The method takes advantage of pre-trained multimodal large language models (MLLMs) in a training-free manner. To better align with human evaluation, they design an inference pipeline around MLLMs. In the experiments, they show that Dog-IQA achieves SOTA results under the training-free setting. Such a training-free pipeline performs well and contributes substantially to IQA development.
Strengths
The idea is simple yet effective. As more and more powerful MLLMs are released, how to make better use of them for IQA attracts increasing attention. The authors propose a training-free method, Dog-IQA, without fine-tuning, which is a clever practice.
The authors propose the standard-guided scoring mechanism and provide several insights supported by results (i.e., Tab. 1). Such a mechanism helps keep the scores consistent and aligned with the standard. As a result, the method performs robustly across different cases.
Another key component is the mix-grained aggregation mechanism, which helps refine the final score by aggregating global and local assessments. These global and local views mirror how humans assess image quality.
The experiments, like the ablation study, are very extensive and demonstrate the effectiveness of each proposed component. Furthermore, the main comparisons on several datasets and metrics show the superior performance of Dog-IQA.
The authors also discuss the limitations of Dog-IQA and give detailed analyses, which provide clues for future works and can further motivate the development of MLLM-based IQA.
The supplementary file provides more details like the segmentation settings, visualization, and prompt. Those materials allow readers to understand and reimplement this work better. Plus, the authors promise to make the code public, which makes this work more solid.
The writing and organization are pretty good. The authors provide extensive different forms of presentation, such as figures (e.g., overall comparison in Fig. 1, pipeline in Fig. 3), tables, algorithms, visualization (e.g., Figs. 4 and 5), and prompts in the supplementary file. Such a good presentation also makes this work more convincing.
Weaknesses
The authors claim that the standard-guided mechanism helps ensure consistent quality evaluation. How can this consistency be shown or supported with more specific results? A metric such as variance could be used to demonstrate consistency.
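A simple sketch of the suggested check: score each image several times and report the mean per-image variance (`score_image` is a hypothetical callable wrapping the full Dog-IQA pipeline):

```python
import numpy as np

def mean_score_variance(score_image, images, runs: int = 5) -> float:
    """Lower values indicate more consistent scoring across repeated runs
    (MLLM decoding can be stochastic)."""
    variances = [np.var([score_image(img) for _ in range(runs)]) for img in images]
    return float(np.mean(variances))
```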
As an MLLM-based IQA method, Dog-IQA could take much longer to process inputs, especially when the image resolution is large (e.g., 2K, 4K, or 8K), and GPU memory usage can also grow. Such resource consumption may hinder the practical usage of Dog-IQA.
In Tab. 3, Dog-IQA performs only third best on KADID-10k under both the KonIQ and SPAQ training-set settings. The authors did not give sufficient analysis of this.
It could be better to provide parameter comparisons between the proposed Dog-IQA and others. I am wondering if the superior performance comes from a larger number of parameters.
Questions
Given that it relies on MLLMs, how does the method process high-resolution images efficiently?
Can Dog-IQA also be trained under the same settings as related works like Q-Align? If so, it would be better to provide those results.
In Tab. 3, under the training-based IQA case, Dog-IQA still performs well on SPAQ and AGIQA-3k datasets and ranks third on KADID-10k. What are the possible reasons behind this?
In Tab. 6, mPLUG-Owl3 outperforms other MLLMs by a large margin. Can we assume that the excellent performance of Dog-IQA mainly comes from the strong base model, mPLUG-Owl3?
How about using multiple MLLMs for IQA? Would such a combination obtain better results?
This submission proposes a standard-guided zero-shot mix-grained image quality assessment method, named Dog-IQA. To deal with the problems of poor out-of-distribution generalization and expensive training costs, Dog-IQA proposes two techniques: 1) developing a standard-guided scoring system that aims to establish a clear mapping between quality levels and scores and restricts the MLLM to scoring within a predefined range; 2) utilizing segmentation models to provide the MLLM with the whole image and object-centered sub-images. Although the experiments show the strong performance of the proposed method, there remain some concerns regarding its acceptance.
Strengths
The experiments show the strong performance of the proposed method. The method provides a training-free IQA model that benefits from powerful foundation models.
Weaknesses
- The mapping between specific quality levels and annotated quality scores has already been proposed and demonstrated effective in [1], so the idea of "standard-guided" scoring is not novel. As for the second technique, introducing a pre-trained segmentation model (SAM2) to provide local quality scores is a common, and possibly computationally expensive, tool in IQA.
- The better IQA performance may be largely attributable to the stronger foundation model used (mPLUG-Owl3). As shown in Table 6, the raw mPLUG-Owl3 model already reaches 0.858 SRCC on SPAQ, much better than the 0.347 of mPLUG-Owl2. The performance gains are much smaller than those of the SOTA Q-Align, considering that its pre-trained model is mPLUG-Owl2.
- To validate the second technique, I would like to see an internal analysis of the mask mechanism on different data or datasets, such as how the (quality-sensitive) segmented patches are obtained and how their number is determined. At least from Table 4, s_seg seems to be ineffective.
Questions
Please refer to the above comments.