All-in-One Image Coding for Joint Human-Machine Vision with Multi-Path Aggregation
Reviews and Discussion
The paper proposes a unified image compression method with multi-path aggregation for joint human-machine vision tasks. The authors utilize a lightweight predictor to generate masks that allocate features to main and side paths. By leveraging the pre-trained main-path module, shared features can be reused to support varied tasks while fine-tuning a relatively small number of parameters.
Strengths
- The authors provide an innovative unified image compression method with multi-path aggregation that can support multiple human-machine vision tasks with shared features.
- Experimental results show that the proposed method achieves SOTA performance.
- The paper is easy to understand, and the experimental results are well presented.
Weaknesses
- For the experimental results, the authors compare several SOTA methods but omit the basic TinyLIC model. Besides, I think it is necessary to compare with SOTA unified-model baselines in terms of parameter count, time complexity, and computational complexity.
- The effectiveness of the predictor remains to be verified. According to Table 3, the proposed predictor provides only a tiny performance improvement, with a mere 0.19% bitrate reduction.
Questions
Can the authors provide further experiments to verify the effectiveness of the predictor module?
As for the base model, is it possible to apply the proposed MPA to other, more recently published learned image compression models?
Limitations
From my point of view, the proposed unified model needs to fine-tune as many sub-modules as there are tasks, and multiple training stages must be adjusted according to the specific tasks. The proposed framework still needs to accommodate different models adapted to specific tasks, rather than being an all-in-one image compression method.
We thank the reviewer for acknowledging the strengths of the proposed approach. We address the raised concerns below:
Response to the Weaknesses
- Comparisons to TinyLIC and other SOTA baselines.
[Reply] We thank the reviewer for the valuable suggestion. It is indeed necessary to provide a more comprehensive evaluation that includes TinyLIC as a baseline and compares complexity with other SOTA methods. To address this, we have added the relevant comparisons, shown in Figure 1 and Table 3 of the global response attachment. The TinyLIC baseline is a variable-rate version optimized for MSE, while the perception-optimized version is equivalent to MPA (α=0). Since CRDR is also a modification of MRIC, we include MRIC as a baseline for the complexity comparison. As shown, our implementation has lower parameter and computational complexity than MRIC while maintaining similar latency. The similar latency arises because MRIC, composed solely of convolutions, is more GPU-friendly than TinyLIC, whose transformer structures are not yet well optimized for GPUs. These comparisons will be included in the final version.
- Effectiveness of the predictor.
[Reply] We understand the reviewer's concerns regarding the effectiveness of the predictor. In fact, Table 3 in our submission evaluates only the encoder, primarily to verify the coding-efficiency gains of MPA. This table therefore reflects only a small part of the predictor's effectiveness; its role in the decoder is even more important and cannot be overlooked, as it enables the base model to support multi-task coding. Our experiments in the submission have already demonstrated this effectiveness. As shown in Figure 3 of our submission, our approach achieves performance comparable to methods such as MRIC and CRDR, which are optimized solely for distortion and realism, and matches the vision-task performance of fully fine-tuned models, all within a unified model. The key to achieving this is that the predictor decouples features across tasks by predicting their importance for each task, so the side paths always focus on the features most important for task optimization at any ratio, easing the difficulty of multi-task optimization. Figures 4 and 5 and Appx. E.2 in the submission show that the predictors assign different importance scores for different tasks, demonstrating that the predictor enables the additional side paths to capture the most critical features for each task. Related explanations are given in Section 5.3 of the submission.
Response to the Questions
- Effectiveness of the predictor.
[Reply] Yes. To address this question, we refer to the explanation of the predictor module's effectiveness in our response to Weakness #2.
- Applying to other base models.
[Reply] Yes. In recent years, MetaFormer structures [1] (including Swin Transformer [2] and ConvNeXt [3], which have channel MLPs) have been widely adopted as general backbones in the latest learned image coding research [4-8], demonstrating their effectiveness and generalization. Our MPA builds on the channel MLPs within such MetaFormer structures, enabling learned image coding models to support multi-task coding in a natural way. Therefore, the proposed MPA can be applied to any model using a MetaFormer backbone.
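For illustration, below is a minimal PyTorch-style sketch of how a channel MLP in a MetaFormer block could host MPA-style routing. The module names, the single side path, the top-ρ token selection, and the way the side path refines the shared output are simplifications for exposition, not our exact implementation.

```python
import torch
import torch.nn as nn

class MPAChannelMLP(nn.Module):
    """Illustrative sketch: route tokens between a shared main channel MLP
    and a lightweight task-specific side path by predicted importance."""
    def __init__(self, dim: int, rho: float = 0.5):
        super().__init__()
        self.main = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                  nn.Linear(4 * dim, dim))  # shared, pre-trained
        self.side = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                  nn.Linear(dim, dim))      # task-specific, fine-tuned
        self.predictor = nn.Linear(dim, 1)                  # lightweight importance scores
        self.rho = rho                                      # fraction sent to the side path

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, N, C) tokens
        scores = self.predictor(x).squeeze(-1)              # (B, N)
        k = max(1, int(self.rho * x.shape[1]))
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask.scatter_(1, scores.topk(k, dim=1).indices, True)
        shared = self.main(x)                                # shared features for all tokens
        # For simplicity the side path is applied densely and masked in;
        # an efficient version would gather only the selected tokens.
        return torch.where(mask.unsqueeze(-1), self.side(shared), shared)
```

During task fine-tuning, only the side path and predictor would be trained while the main path stays frozen, matching the two-stage strategy described in our responses.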
Response to the Limitations
- Not all-in-one.
[Reply] We sincerely thank the reviewer for the insightful feedback. We acknowledge that our method needs to fine-tune some of the parameters, as the reviewer mentioned, which is indeed a common practice when optimizing coding for multiple tasks. We would like to clarify that our understanding of an all-in-one model is grounded in practical applications. As the reviewer noted, even a unified model still requires fine-tuning for specific tasks. During the actual deployment phase, however, the optimized unified model handles all required tasks using a single encoder-decoder pair. Our MPA achieves this by performing one encoding followed by user-controllable decoding, allowing a unified model to reconstruct an image for any desired target. This is what we mean by "all-in-one" coding. We appreciate the reviewer's understanding, and if the reviewer still finds the term unsuitable, we are open to adjusting the wording in the final version.
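To make this deployment view concrete, the following is a purely hypothetical usage sketch; the class and method names are illustrative placeholders (stubs) rather than a released API. One encoding produces a single bitstream, and the decoding target is selected afterwards.

```python
import torch

class AllInOneCodec:
    """Hypothetical stub illustrating one-encode / many-decode deployment."""
    def encode(self, image: torch.Tensor) -> torch.Tensor:
        return image  # stub: a real codec would return a compressed bitstream

    def decode(self, bits: torch.Tensor, path: str = "mse") -> torch.Tensor:
        # stub: `path` would activate the corresponding side path / predictor
        return bits

codec = AllInOneCodec()
bits = codec.encode(torch.rand(1, 3, 256, 256))  # encode once
recon_human = codec.decode(bits, path="mse")     # distortion-oriented viewing
recon_perc  = codec.decode(bits, path="gan")     # realism-oriented viewing
recon_task  = codec.decode(bits, path="cls")     # machine-analysis-oriented
```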
References
[1] W. Yu, et al. MetaFormer Is Actually What You Need for Vision. In CVPR 2022.
[2] Z. Liu, et al. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In ICCV 2021.
[3] Z. Liu, et al. A ConvNet for the 2020s. In CVPR 2022.
[4] Y. Zhu, et al. Transformer-based Transform Coding. In ICLR 2022.
[5] R. Zou, et al. The Devil Is in the Details: Window-based Attention for Image Compression. In CVPR 2022.
[6] J. Liu, et al. Learned Image Compression With Mixed Transformer-CNN Architectures. In CVPR 2023.
[7] Z. Duan, et al. Lossy Image Compression With Quantized Hierarchical VAEs. In WACV 2023.
[8] H. Li, et al. Frequency-Aware Transformer for Learned Image Compression. In ICLR 2024.
This paper explores image coding for multi-task applications and introduces Multi-Path Aggregation (MPA), integrated into existing models to facilitate joint human-machine vision through a unified architecture. The MPA employs a predictor to distribute latent features among task-specific paths according to their importance, thus maximizing the utility of shared features and preserving task-specific features for further refinement. Additionally, a two-stage optimization strategy is proposed to mitigate multi-task performance degradation. Experimental results show that MPA achieves performance comparable to state-of-the-art methods in both task-specific and multi-objective optimization across human viewing and machine analysis tasks.
Strengths
- This paper explores image coding for multi-task applications and introduces Multi-Path Aggregation, integrated into existing models to facilitate joint human-machine vision through a unified architecture.
- Experimental results show that MPA achieves performance comparable to state-of-the-art methods in both task-specific and multi-objective optimization across human viewing and machine analysis tasks.
- The paper is exceptionally well-written and has clearly articulated the problem, proposed method, and the significance of their contributions.
Weaknesses
I am sorry, but this paper is quite far from my research field, and I cannot make an accurate judgment on it. However, I think the paper is exceptionally well written and clearly articulates the problem, the proposed method, and the significance of its contributions. Therefore, I have no issues with it and am willing to defer to the opinions of the other reviewers before making a final decision.
Questions
Please see the Weaknesses section.
Limitations
The authors have discussed the limitations of the work in Sec. 5.4 and Appx. A.
We would like to express our sincere gratitude to you for the considerate feedback. We appreciate your recognition of the clarity and quality of our paper, particularly in articulating the problem, proposed method, and significance of our contributions.
We want to affirm that your understanding of the advantages of our proposed method aligns well with the insights provided by other reviewers. Your perspective is valuable and contributes to a well-rounded evaluation of our work.
To further assist you in understanding the contributions and significance of our work, we encourage you to review our responses to the other reviewers' comments. These responses provide additional context and detailed explanations that might help in appreciating the strengths and innovations of our approach.
Thank you again for your supportive comments and for taking the time to review our work. We hope this information enhances your confidence in your initial assessment.
Thank you to the authors for the response, which addressed my concerns. Overall, I think the motivation is clear and the writing is satisfactory. However, considering the scores of the other reviewers, I have decided to keep my score unchanged.
Thank you for your response and for your recognition of our work.
The paper introduces a Multi-Path Aggregation (MPA) architecture designed to unify image coding for both human perception and machine vision tasks. By integrating the side path, the authors aim to optimize performance across various tasks while maintaining efficiency in terms of parameter and bitrate usage. This approach promises seamless transitions between tasks and improved performance.
Strengths
- This paper is presented very well, including but not limited to its readability and the rational presentation of experiments.
- Each part is well motivated. For example, in the MPA module, a learnable mask is first trained to decouple the features into task-shared and task-specific features; by optimizing the side path, the model adapts to various downstream tasks. Techniques such as Gumbel-Sigmoid are used to handle the non-differentiability. By introducing MPA, the paper unifies the encoder and decoder models. These techniques can be referenced and adopted in the corresponding fields. Additionally, the experiments are very thorough.
Weaknesses
- If we set aside the issue of feature decoupling, introducing and fine-tuning a side path to adapt to downstream tasks is not a novel idea. For instance, a similar bypass fine-tuning method was introduced in [1].
- The proposed two-stage training strategy is somewhat complex and introduces a large number of training losses and hyperparameters, which may hinder its generalization and ease of use.
[1] Y. Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing, R. Feris. SpotTune: Transfer Learning through Adaptive Fine-tuning. In CVPR 2019.
Questions
- In the paper, MPA is used for classification and segmentation tasks, achieving good results. Can MPA be applied to other fundamental vision tasks, such as object detection? Alternatively, can MPA be used for knowledge transfer in cross-domain tasks, such as shifting from learning "task-specific" features to learning "domain-specific" features?
- If possible, I would like to know whether this method is sensitive to changes in hyperparameters (such as those mentioned in Section 4) and how these hyperparameters affect the experimental results.
- In terms of LPIPS and FID (from bitrate 0.3 to 0.7), it seems that CRDR performs better. I completely accept that such a situation can occur, but I am curious why this happens.
- Can MPA be compared with parameter-efficient fine-tuning methods such as LoRA? Is it possible to conduct a simple performance comparison between them?
Limitations
Please refer to the Weaknesses and Questions sections. During the rebuttal process, I am willing to discuss actively and promptly with the authors and other reviewers. If the authors adequately address my main concerns, I am willing to increase my score.
First of all, we really appreciate the reviewer's careful comments. We offer the following responses to the reviewer's concerns:
Response to the Weaknesses
- Comparison to SpotTune.
[Reply] We appreciate the reviewer's attention to the differences between MPA and SpotTune. It is essential to emphasize that feature decoupling is central to MPA's effectiveness and should not be overlooked; it is especially crucial for multi-task coding scenarios. By considering feature decoupling, MPA supports smooth transitions between tasks, allowing controllable and flexible reconstruction orientations that cater to diverse coding applications. In the encoding phase, feature decoupling enhances compression efficiency; in the decoding phase, it enables controllable reconstruction, which is vital for task-controllable image coding. In contrast, SpotTune considers only the impact of fine-tuning across different layers and applies indiscriminate routing to all features, which restricts its utility in task-controllable image coding.
- Complexity of training strategy.
[Reply] We thank the reviewer for the concerns regarding the generalization and usability of MPA. In fact, our proposed strategy originates from the widely used multi-stage training strategy for generative image compression [2], whose stability and generalization have been broadly validated, and it is not overly complex. Here, we provide a further explanation. First, the initial stage trains a general base model. In this stage, the losses and hyperparameters strictly follow MRIC, with the only differences being that we fix MRIC's β at 2.56 (cf. Eq. (4) in [3]) and add a ratio loss to optimize the predictor in the encoder. Second, the subsequent stage optimizes only the side path and the predictor in the decoder. When optimizing for MSE, we use the MRIC loss with β fixed at 0 (cf. Eq. (4) in [3]) plus the ratio loss. For vision-task optimization, we remove the GAN loss from the first stage and add a vision-task-specific loss. Therefore, the proposed method remains generalizable and easy to use.
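For concreteness, the following sketch illustrates one way the second-stage objective could be composed. The weights and the function handles `lpips_fn` and `task_fn` are placeholders, not our exact hyperparameters, which follow [3].

```python
import torch
import torch.nn.functional as F

def stage2_loss(recon, target, side_ratio, rho,
                lpips_fn=None, task_fn=None,
                w_lpips=1.0, w_task=1.0, w_ratio=1.0):
    """Illustrative second-stage objective: side path + decoder predictor only.
    The rate term is omitted because the encoder and entropy model are frozen."""
    loss = F.mse_loss(recon, target)                     # distortion (beta = 0: no GAN term)
    loss = loss + w_ratio * (side_ratio - rho) ** 2      # ratio loss on predicted mask usage
    if lpips_fn is not None:
        loss = loss + w_lpips * lpips_fn(recon, target)  # perceptual term
    if task_fn is not None:
        loss = loss + w_task * task_fn(recon)            # vision-task-specific loss
    return loss
```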
Response to the Questions
- Applications for other fundamental tasks and cross-domain tasks.
[Reply] We thank the reviewer for the insightful question regarding the application of MPA to other tasks. As mentioned in our response to Weakness #2, our method is generalizable and can adapt to various application scenarios. Here, we conducted additional tests of MPA's performance on object detection, shown in Figure 1 of the global response attachment, using the same object detection model and task loss as TransTIC [4]. Regarding knowledge transfer in cross-domain tasks, our exploration mainly focuses on task-specific optimization and has not yet extended to domain-specific features. This is a promising direction for future research, and we will continue to follow up on it.
- Hyperparameters and loss terms.
[Reply] We thank the reviewer for the insightful question. Regarding the hyperparameters, we use the same settings as in [3], which were thoroughly evaluated by its authors (cf. Section 5.2 in [3]). It is difficult to reproduce the ablations on hyperparameters within the limited rebuttal period, so we respectfully refer the reviewer to [3] for details. As for the loss terms, we provide detailed ablation studies in Table 1 of the global response attachment to demonstrate that our current combination achieves a competitive trade-off. Regarding the role of each term: LPIPS enriches the semantic information of the reconstructed images and improves generalization, MSE constrains pixel-value consistency between the reconstructed and original images, and the task loss directly optimizes accuracy. The rate is determined by the encoder and entropy model, which are frozen and thus do not affect the second-stage optimization. We hope these ablations help convey the role of each loss and the likely effect of adjusting the hyperparameters.
- Performance gap.
[Reply] There are two main reasons for the observed performance differences. First, model size limits performance, especially at relatively high bitrates. As shown in Table 3 of the global response attachment, TinyLIC, the base model on which we implement MPA, is much smaller than MRIC, the base model of CRDR. Second, as is commonly observed in the image coding community, optimizing over a narrower bitrate range benefits the coding performance of variable-rate models. As shown in Figure 3 of our submission, the bitrate coverage of CRDR (0.08-0.72 bpp) is much narrower than that of our implementation (0.07-1.20 bpp). Thus the performance gap is reasonable and acceptable.
- Comparison to LoRA.
[Reply] We appreciate the reviewer's suggestion regarding LoRA. Our experiments confirm the effectiveness of LoRA in Table 2 of the global response attachment. Although MPA involves more fine-tuning parameters, it also achieves better performance. Moreover, our work has distinct features: we utilize predictors to support multi-task coding and smooth transitions between tasks within an all-in-one framework. The low-rank design of LoRA inspires us to consider improvements to the side path in future work.
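For readers comparing the two designs, a minimal LoRA-style adapter is sketched below (the rank and scaling values are illustrative). Unlike MPA's side path, which is a separate parallel module selected per feature, LoRA folds a trainable low-rank update into a frozen weight.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weight plus trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no update at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```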
References
[1] Y. Guo, et al. SpotTune: Transfer Learning through Adaptive Fine-tuning. In CVPR 2019.
[2] F. Mentzer, et al. High-Fidelity Generative Image Compression. In NeurIPS 2020.
[3] E. Agustsson, et al. Multi-Realism Image Compression with a Conditional Generator. In CVPR 2023.
[4] Y. Chen, et al. TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception. In ICCV 2023.
We appreciate all reviewers for their recognition of the strengths of MPA, as well as their insightful feedback and constructive suggestions. We have identified several common themes in the comments and would like to address them comprehensively in this global response.
Advantages of MPA and Comparisons to Other Methods
The reviewers have shown interest in the advantages of MPA. MPA's core lies in its feature-decoupling capability, which enables smooth task transitions. A significant advantage of MPA is its all-in-one coding approach: all tasks are served by extended paths within a single model, eliminating the need to train multiple models for different tasks. Compared to fine-tuning-oriented methods such as SpotTune [3] and LoRA [4], MPA leverages the differing importance of features across tasks to enable multi-task coding and task transitions. Compared to other unified models such as MRIC [1] and CRDR [2], MPA achieves comparable performance and easily supports more tasks with lower complexity. To address the reviewers' concerns, we have added comparisons to TinyLIC's performance and MRIC's complexity as Figure 1 and Table 3 in the attachment, which will be presented in the final version.
Effectiveness of the Predictor
The reviewers have raised concerns regarding the predictor's effectiveness. As shown in Figure 3 of our submission, our approach achieves performance comparable to methods such as MRIC and CRDR, which are optimized solely for distortion and realism, and matches the vision-task performance of fully fine-tuned models, all within a unified model. The key to achieving this is that the predictor decouples features across tasks by predicting their importance for each task, so the side paths always focus on the features most important for task optimization at any ratio, easing the difficulty of multi-task optimization. Figures 4 and 5 and Appx. E.2 in the submission show that the predictors assign different importance scores for different tasks, demonstrating that the predictor enables the additional side paths to capture the most critical features for each task. Related explanations are given in Section 5.3 of the submission.
Training Strategy
Regarding the training strategy, the reviewers have highlighted the importance of discussing complexity, loss terms, and hyperparameters. Here we provide a further explanation:
- First Stage: Training a generalized base model. During this stage, the loss terms and hyperparameters strictly follow those used in MRIC, with the addition of a ratio loss to optimize the predictor in the encoder. The primary difference is that we fix MRIC's β at 2.56 (cf. Eq. (4) in [2]), since we do not consider task transitions in this stage.
- Second Stage: Optimizing the side path and predictor in the decoder. Transitions between tasks are controlled only by the predictor, which is optimized by minimizing a separate ratio loss. Thus, for MSE optimization, we use the MRIC loss with β fixed at 0 (cf. Eq. (4) in [2]), along with the ratio loss. For vision-task optimization, we remove the GAN loss, since GAN is no longer used, and add a vision-task-specific loss.
This strategy ensures that our method remains aligned with established practices and achieves stable training. We use the same hyperparameters as in the MRIC paper, which have been thoroughly evaluated for their influence. To address the reviewers' concerns, we have conducted a detailed ablation study (cf. Table 1 in the attachment) to showcase the necessity of each loss term; the results will be included in the final version. Note that the rate term theoretically does not affect task performance, since it is determined by the encoder and the entropy model, which are frozen in the second stage.
Generalizability and Applicability
The reviewers have raised concerns regarding the generalizability and applicability of MPA to various tasks and base models. Our additional experiments and the detailed results in the attachment's Figure 1 and the main paper's Figures 4, 5, and Appx. E.2 demonstrate that MPA adapts well to different application scenarios, including object detection. The generalizability is ensured by our proposed training strategy, which is inherited and developed from widely used optimization methods, as discussed in the previous section. Furthermore, MPA can be integrated into any model using a MetaFormer [5] backbone with channel MLPs (such as Swin Transformer [6] and ConvNeXt [7]), which are widely used in the learned image coding community [8-12], ensuring the versatility and practical applicability of the proposed MPA. As for applying MPA to cross-domain tasks, we consider this a promising direction for exploration and will continue to follow up on it.
References
[1] E. Agustsson, et al. Multi-Realism Image Compression with a Conditional Generator. In CVPR 2023.
[2] S. Iwai, et al. Controlling Rate, Distortion, and Realism: Towards a Single Comprehensive Neural Image Compression Model. In WACV 2024.
[3] Y. Guo, et al. SpotTune: Transfer Learning through Adaptive Fine-tuning. In CVPR 2019.
[4] E. Hu, et al. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR 2022.
[5] W. Yu, et al. MetaFormer Is Actually What You Need for Vision. In CVPR 2022.
[6] Z. Liu, et al. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In ICCV 2021.
[7] Z. Liu, et al. A ConvNet for the 2020s. In CVPR 2022.
[8] Y. Zhu, et al. Transformer-based Transform Coding. In ICLR 2022.
[9] R. Zou, et al. The Devil Is in the Details: Window-based Attention for Image Compression. In CVPR 2022.
[10] J. Liu, et al. Learned Image Compression With Mixed Transformer-CNN Architectures. In CVPR 2023.
[11] Z. Duan, et al. Lossy Image Compression With Quantized Hierarchical VAEs. In WACV 2023.
[12] H. Li, et al. Frequency-Aware Transformer for Learned Image Compression. In ICLR 2024.
This paper proposes Multi-Path Aggregation (MPA), an image coding method that can serve multi-task scenarios, satisfying both human perception and machine vision. The problem addressed is applicable to various computer vision tasks, and the method's effectiveness has been thoroughly evaluated through experiments across multiple tasks. One significant contribution of the proposed method is its simplicity in training, as it handles various tasks with a single model. The readability and strong motivation of the paper have been recognized by most reviewers, and all reviewers have given evaluations leaning towards acceptance. The authors' rebuttal effectively addressed the reviewers' concerns, resulting in the maintenance of the reviewers' scores. Some reviewers' suggestions, such as comparisons with LoRA, remain as future work. Additionally, one reviewer noted that the proposed method is not entirely an all-in-one approach, citing this as a limitation. Responses to this concern are expected to be reflected in the final manuscript.