5.0

/10

Poster4 位审稿人

最低3最高6标准差1.2

4.5

置信度

正确性2.8

贡献度2.0

表达2.3

NeurIPS 2024

COSMIC: Compress Satellite Image Efficiently via Diffusion Compensation

Ziyuan Zhang,Han Qiu,Zhang Maosen,Jun Liu,Bin Chen,Tianwei Zhang,Hewu Li

OpenReview PDF

提交: 2024-05-14更新: 2024-11-06

摘要

关键词

Image compressionSatellite imagesGenerative modelProbabilistic diffusion models

评审与讨论

审稿意见

评分: 5置信度: 42024-07-01

The authors propose COSMIC, a simplified and efficient compression method for satellite earth observation images. Due to the increasing number of satellites and the volume of image data, existing compression schemes are difficult to deploy with the limited computing power and energy available on satellites. COSMIC designs a lightweight encoder to significantly reduce computation while achieving a high compression ratio. For ground-based decoding, it uses a diffusion-based model to compensate for the detail loss caused by the simplified encoder, leveraging the multi-modal nature of satellite data (such as coordinates and timestamps) to improve image reconstruction quality. Experimental results show that COSMIC outperforms existing methods in both perceptual quality and distortion metrics.

优点

(1)The writing and presentation of the paper are excellent, with a clear and logical flow. (2)The authors introduce a substantial amount of knowledge on deep learning-based image compression and diffusion models, making it easy for readers who are not in this field to understand. (3)Satellite image compression is a novel topic, and the lightweight coding framework used by the authors is of practical significance.

缺点

(1)More performance comparison tests of various models need to be included, such as those introduced in Section 2.2 on deep learning-based remote sensing image compression [1-3] and some of the latest works [4,5] in deep learning-based image compression. (2)The process of handling metadata is not clearly explained. I only saw the Metadata Encoder mentioned in the paper. How is this part of the data processed, and how is it aligned with the satellite image data? (3)The authors need to further report the spatiotemporal complexity of COSMIC compared to other methods to demonstrate its excellent lightweight architecture.

[1] Fu, Chuan, and Bo Du. "Remote sensing image compression based on the multiple prior information." Remote Sensing 15.8 (2023): 2211. [2] Xiang, Shao, and Qiaokang Liang. "Remote sensing image compression based on high-frequency and low-frequency components." IEEE Transactions on Geoscience and Remote Sensing (2024). [3] Zhang, Lei, et al. "Global Priors with Anchored-stripe Attention and MultiScale Convolution for Remote Sensing Images Compression." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2023). [4] He, Dailan, et al. "Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. [5] Liu, Jinming, Heming Sun, and Jiro Katto. "Learned image compression with mixed transformer-cnn architectures." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023.

问题

See Weaknesses

局限性

See Weaknesses

作者回复

2024-08-06

Thank you for your careful review and valuable feedback.

Q1. More performance comparison tests of various models need to be included.

Please see our general response 2 above.

Q2. The process of handling metadata is not clearly explained.

Sorry for the confusion in writing. First, we normalize each metadata. And then, inspired by the way that the timestep $t$ processed by diffusion model, we use sinusoidal embedding to encode each metadata. For each metadata embedding, we use its corresponding linear function to map it to CLIP space. Finally, we use CLIP to align it with the satellite image.

Q3. Spatiotemporal complexity comparison.

In this paper, we use two lightweight modules, LCB and CAM, while other methods use ordinary convolution. We list their time and space complexity as follows. Assuming that the number of input channels is $c_{in}$ , the number of output channels is $c_{out}$ , the input feature map size is $n^2$ and the convolution kernel size is $k$ .

Modules	LCB	CAM	ordinary convolution
Temporal complexity	$O(n^2\times c_{in} \times k^2 + \frac{1}{2} \times n^2 \times c_{in} \times c_{out})$	$O(k\times n^2 \times c_{in} \times c_{out})$	$O(n^2 \times k^2 \times c_{in} \times c_{out})$
Spatio complexity	$O(k^2 \times c_{in} + \frac{1}{2} \times c_{in} \times c+{out})$	$O(k\times c_{in} \times c_{out})$	$O(k^2 \times c_{in} \times c_{out})$

评论- A Gentle Reminder of the Final Feedback

2024-08-11

Please allow us to thank you again for your careful review and valuable feedback, and in particular for recognizing the strengths of our paper in terms of clear writing, novel topic and practical significance.

Kindly let us know if our response and the new experiments have properly addressed your concerns. We are more than happy to answer any additional questions during the post-rebuttal period. Your feedback will be greatly appreciated.

评论- A Gentle Reminder of the Final Feedback

2024-08-13

Dear Reviewer AGki,

Thank you very much again for your initial comments. They are extremely valuable for improving our work. We hope our response has adequately addressed your concerns. We shall be grateful if you could kindly give any feedback to our rebuttal.

Best Regard,

#Paper9518 Author(s)

审稿意见

评分: 3置信度: 52024-07-07

This paper presents a novel method to address the challenge of transmitting the increasing volume of satellite images to ground stations. The core innovation lies in designing a lightweight encoder that reduces computational complexity on satellites, coupled with a diffusion-based compensation model on the ground to enhance image quality. The experimental results demonstrate COSMIC's superior performance over existing methods in terms of both perceptual and distortion metrics.

优点

Lightweight Encoder: The design of a lightweight encoder significantly reduces the computational load on satellites, making the solution feasible for in-orbit deployment.
Diffusion-Based Compensation: Utilizing a diffusion-based model to enhance image details during the decompression process effectively addresses the limitations of the lightweight encoder.
Comprehensive Evaluation: The extensive experiments and comparisons with state-of-the-art baselines highlight the robustness and superiority of COSMIC in various metrics.
Multi-Modal Integration: Incorporating sensor data as conditions for diffusion generation leverages the multi-modal nature of satellite images, enhancing the overall reconstruction quality.

缺点

Encoder Degradation: The lightweight encoder's reduced feature extraction capability may limit image quality at extremely low bit rates.
Training Specificity: The reliance on a pre-trained stable diffusion model that lacks specific priors for satellite images could limit the model's performance under certain conditions.
Limited Power Supply Considerations: While the lightweight encoder addresses computational constraints, the paper does not thoroughly discuss the power supply limitations on satellites and their impact on the proposed method.
Real-World Application Scenarios: The paper could benefit from more detailed discussions on practical deployment scenarios and the associated challenges, such as real-time processing requirements and potential bottlenecks.
Satellite images typically consider using pixel-level distortion metrics instead of perceptual metrics. Due to the sensitivity of satellite images to compression method distortion, lossless or near-lossless compression methods are usually used.

问题

Encoder Degradation at Low Bit Rates: a) Question: How does the encoder's degradation specifically affect the image quality at low bit rates? Are there specific types of image details that are consistently lost? b) Suggestion: Provide a detailed analysis of the types of image features that are most affected by the lightweight encoder at low bit rates. This can include examples or case studies highlighting these issues.
Training Specificity and Pre-Trained Models: a) Question: How does the performance of the pre-trained stable diffusion model compare with a model specifically trained on satellite images? Have any preliminary experiments been conducted in this regard? b) Suggestion: Discuss any preliminary results or plans for training a diffusion model specifically on satellite images. This could include potential improvements or challenges identified during these experiments.

局限性

The authors have acknowledged the following limitations: a) Encoder Degradation at Low Bit Rates: The lightweight encoder's performance drops at very low bit rates, affecting image quality. b) Training Specificity: The use of a pre-trained stable diffusion model, which lacks specific prior knowledge of satellite images, may limit performance under certain conditions.

作者回复

2024-08-06

Q1. Encoder Degradation and Training Specificity.

These two questions are just repeating our last section, i.e. Limitations & Future work. Moreover, we are frustrated to find two detectors (GPT-zero and Scribbr) have 100% confidence of AI-generation on your review. It's clearly doubtful if the review can really help this paper's decision process.

Anyway, let us to explain these two questions again.

We never claim COSMIC is perfect. In experiments, we notice the performance of COSMIC is slightly degraded at extremely low bit rate and we report this limitation honestly. The reason is simple since the less image content the image encoder provides, the more compensation from diffusion is needed at decompression. It is hard to rely only on diffusion to decompress.
We believe that training a diffusion model specifically for satellite images, which has sufficient prior knowledge of satellite images, can solve this problem to some extent. However, this is not the scope of this paper.

Q2. Limited Power Supply Considerations & Real-World Application Scenarios.

Please see our general response 1 above.

Q3. Satellite images typically consider using pixel-level distortion metrics instead of perceptual metrics.

In the paper, we considered both distortion metrics and perceptual metrics, and experimental results show that COSMIC can achieve SOTA performance in both metrics.

2024-08-12

I think satellite images should be coded and decoded in a pixel-level controllable way, and the author's method can only ensure that the distribution of the generated image is as close as possible to the distribution of the original image, and the distortion is uncontrollable and unsuitable for satellite image compression.

评论- Author Response (Reviewer upSj)

2024-08-12

Thanks for the reply.

First, thanks for acknowledging our work can ensure that the distribution of the generated image is as close as possible to the distribution of the original image which is the general goal of any lossy image compression methods.

The history of satellites' photography is much longer than image compression algorithms. Of course there are pixel-level image compression methods used by Mars rover Viking [1] launched in 1975, way before the invention of JPEG. Other pixel-level methods are mostly used for lossless compression, which is neither within our scope nor used by satellites launched since 2000. Today, most satellites are using lossy compression methods like JPEG which is not coded and decoded in a pixel level . We list satellites using JPEG as follows: Soloar-B[2], BILSAT-1[3], Cartosat-2[4], TacSat-2[5], TEAMSAT[6], SPOT-5 [7], Cartosat-1[8], CartoSat-2E[9], TurkSat-3USat[10], SAC-C[11], etc.).

Moreover, JPEG also results in uncontrollable distortion but is still widely used by various satellites. JPEG uses different quantization table only to tune the compression ratio which cannot control the distortion performance which depends on the image contents. In our paper, we have demonstrated that COSMIC surpasses JPEG2000 across both distortion metrics and perceptual metrics.

Reference

[1] The Martian Landscape, https://www.nasa.gov/wp-content/uploads/2024/01/sp-425-the-martian-landscape.pdf

[2] Soloar-B, https://www.eoportal.org/satellite-missions/solar-b#spacecraft

[3] Bradford A, Gomes L M, Sweeting M, et al. BILSAT-1: A low-cost, agile, earth observation microsatellite for Turkey[J]. Acta astronautica, 2003, 53(4-10): 761-769.

[4] Cartosat-2, https://space.skyrocket.de/doc_sdat/cartosat-2.htm

[5] TacSat-2, https://space.skyrocket.de/doc_sdat/tacsat-2.htm

[6] TEAMSAT, https://www.eoportal.org/satellite-missions/teamsat#overview

[7] SPOT-5, https://www.eoportal.org/satellite-missions/spot-5#launch

[8] Cartosat-1, https://earth.esa.int/eogateway/missions/irs-p5

[9] CartoSat-2E, https://www.eoportal.org/satellite-missions/cartosat-2e#spacecraft

[10] TurkSat-3USat, https://www.eoportal.org/satellite-missions/turksat-3usat#transponder

[11] SAC-C, https://www.eoportal.org/satellite-missions/sac-c#mmrs-multispectral-medium-resolution-scanner

审稿意见

评分: 6置信度: 42024-07-10

The authors propose a novel method to compress satellite images using a learned algorithm that relies on a diffusion model on the ground to decode the compressed image. The proposed method is designed for deployment on satellites.

优点

A novel method to compress satellite images by using a diffusion model to reconstruct the encoded image is presented. The diffusion and decoder models are used to compensate for the lightweight encoder used. The method is original and may have the potential to be extended to other edge devices.

缺点

Authors claim in the literature reviews that no algorithm for compressing data on satellite exist. A simple search will show that this is not true. Below are two examples Artificial Intelligence Based On-Board Image Compression for the Φ-Sat-2 Mission A Simple Lossless Algorithm for On-Board Satellite Hyperspectral Data Compression
The difference between training and testing networks is unclear in the text or the figure.
No comparisons with efficient compression algorithms used on satellites or edge devices
Authors claim that the advantages of the proposed method are particularly visible on image seams (page 7); elsewhere in the paper, it is mentioned that the proposed method deals with image patches. How can the proposed method show advantages on image seams while it is working on the individual patches as input and not the stitched image?

问题

Please rewrite the text and the figures to clarify the distinction between training and testing stages.
Update the literature review with models used for compression on-board satellites. (suggested search terms: on-board satellite compression, in-orbit satellite compression) *Does the proposed compression and decompression work with the complete large image? Or just the patches? When do the reconstructed patches get stitched? See the last comment on the "Weaknesses" section.

局限性

Yes

作者回复

2024-08-06

Thanks for the very detailed review and suggestions. Fig.S2 can be found in global response PDF.

Q1. Please rewrite the text and the figures to clarify the distinction between training and testing stages.

Sorry for the confusion in writing. The training is divided into two stages.

In the first stage, we train the compression model. Since the Image decoder $\mathcal{D}$ needs two parts of information for decoding, they correspond to $y\prime$ in Fig.S2, which represents the feature map extracted by the on-board image compression encoder, and $z_0$ , which represents the compensation information. Therefore, we introduce another Image encoder, corresponding to the Image encoder in the Compensation Model part of Fig.S2, to extract compensation information $z_0$ from the original image. In the first stage, $\mathcal{E}$ , $\tilde{\mathcal{E}}$ and $\mathcal{D}$ are trained together.

In the second stage of training, we freeze the parameters of $\mathcal{E}$ , $\tilde{\mathcal{E}}$ and $\mathcal{D}$ , and train the noise prediction network, with the goal of making the information generated by the diffusion model as close to $z_0$ as possible, denoted as $z_0\prime$ , so as to generate the compensation information required by the decoder.

In the testing stage, the trained diffusion model can generate compensation information $z_0\prime$ . Therefore, we no longer need $\tilde{\mathcal{E}}$ . The $z_0\prime$ generated by the diffusion model replaces the $z_0$ extracted by $\tilde{\mathcal{E}}$ to help the image decoder decompress the image.

We will improve the text and figures to guarantee a better reading experience.

Q2. Update the literature review with models used for compression on-board satellites.

We will add the following content to the Background section.

There are many methods for remote sensing image compression, but most of them don't focus on onboard deployment. There are also some works for compressing data on satellites. [1] used the CAE model to extract image features and reduce the image dimension to achieve compression. However, this method only considers the reduction of image dimension and does not consider the arithmetic coding process in the actual transmission process, resulting in the image compression rate being fixed at 8 and the bpp not being able to be flexibly adjusted. [2] proposed the complexity-reduced VAE, which reduces the amount of calculation by reducing the number of model channels and the entropy model structure. However, violently reducing the number of model channels will lead to a significant decline in model performance.

Q3. No comparisons with efficient compression algorithms used on satellites or edge devices.

On terrain, there have been some efficient compression algorithms for edge devices. However, on terrain, edge devices are usually used to decompress images at the receiving end [3][4][5], for example, receiving a picture on a smartphone. Therefore, existing efficient compression algorithms for ground-edge devices usually focus on decoder efficiency, which is not applicable to satellite scenarios. Few works focus on encoder efficiency on satellites for image compression. We choose the European Space Agency (ESA)'s paper [1] as a baseline and list comparison as follows and will add them in revision. [1] conducts experiments on three platforms, respectively NVIDIA GeForce GTX 1650, Intel Myriad 2 VPU, and Intel CoreTM i7-6700 Processor. However, the power consumption of GPU and CPU shown in the article is 83W and 45.5W respectively (the results are shown in Table V in [1]), which is unpractical for power-constrained satellites and no satellites deploy GeForce GTX 1650 or Intel i7-6700 as payloads. For VPU, according to Table V, taking 10.92 seconds to process 2048 $256\times256$ patches, we can infer that the throughput onboard of [1] is 98.36Mbps, while COSMIC can achieve 507.37Mbps within satellite supported power range [6].

Q4. Authors claim that the advantages of the proposed method are particularly visible on image seams (page 7); elsewhere in the paper, it is mentioned that the proposed method deals with image patches. How can the proposed method show advantages on image seams while it is working on the individual patches as input and not the stitched image?

Earth-observation satellites take large photos, for example, the swath width of WorldView-3 is 13.1km with 1m GSD [7] which will lead to several GBs raw data per photo. The constrained computing resources on the satellites cannot support to compress an entire photo. Therefore, satellite photos are typically compressed by tiling to small images and compression onboard. After receiving on the ground, they are decompressed and stitched together. COSMIC and all baselines all follow this process and do not specially process seams. Generally speaking, without any special processing on the seams, the higher the decompression image fidelity, the less noticeable the seams will be. Moreover, stitching satellite image algorithms is orthogonal to compressing each tile, and we’ll discuss them in revision.

Reference

[1] Artificial intelligence based on-board image compression for the Φ-Sat-2 mission. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023.

[2] Reduced-complexity end-to-end variational autoencoder for on board satellite image compression. Remote Sensing, 2021.

[3] Computationally efficient neural image compression. arXiv:1912.08771, 2019.

[4] Computationally-efficient neural image compression with shallow decoders. ICCV 2023.

[5] Complexity-guided slimmable decoder for efficient deep video compression. CVPR 2023.

[6] Wildfires From Space. https://blogs.nvidia.com/blog/ororatech-wildfires-from-space/, 2021.

[7]WorldView-3 DG_WorldView3_DS_2014.pdf (spaceimagingme.com)

评论- Reviewer response

2024-08-10

Thanks alot for clarifying. Q4 is clear now but the remaining points require major revisios to the paper especially Q3.

评论- Major revision for Q3

2024-08-11

This is the response to Q3.

Q3. Add a baseline to compare with efficient compression algorithms used on satellites

To further alleviate your concerns, we select a representative work [4] of ESA for onboard compression for comparison. We use the same model structure as in [4] and triple the number of channels to adapt RGB image as input. We retrain the models with fMoW dataset for a fair comparison with Adam optimizer with $lr = 1 × 10^{−4}$ for 100 epochs with a batchsize of 64.

COSMIC can still achieve SOTA results on distortion and perception metrics. The results show that under similar PSNR, COSMIC achieve higher MS-SSIM, lower LPIPS and FID at lower bpp. As [4] only considers the reduction of image dimension, only some fixed bpps can be achieved. And due to the serious reduction of image dimension, a large amount of information is lost, and the image reconstruction quality is poor.

Method	bpp↓	PSNR↑	MS-SSIM↑	LPIPS↓	FID↓
ESA_2023[4]	1.0	28.07	0.979	0.1229	71.55
COSMIC(ours)	0.61	28.68	0.980	0.0462	19.44
ESA_2023[4]	2.0	29.51	0.986	0.0863	57.46
COSMIC(ours)	0.76	29.42	0.986	0.0349	16.95

For efficiency comparison, COSMIC can reduce FLOPs by $3\times$ and increase throughput by $5\times$ , as shown in the following table. [4] conducts experiments on VPU platform, and we use the same level computing power edge devices, Jetson Xavier NX. For detailed information, please refer to Global response Q1 and Rebuttal Q3.

Method	FLOPs (G) ↓	Throughput (Mbps)↑
ESA_2023[4]	15.4	98.36
COSMIC(ours)	4.9	507.37

Reference

[1] Remote sensing image compression based on high-frequency and low-frequency components. IEEE Transactions on Geoscience and Remote Sensing, 2024.

[2] Remote sensing image compression based on the multiple prior information. Remote Sensing, 15(8):2211, 2023.

[3] Global priors with anchored-stripe attention and multiscale convolution for remote sensing images compression. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023.

[4] Artificial intelligence based on-board image compression for the Φ-Sat-2 mission. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023.

2024-08-12

Thanks for the feedback. I raise my score to 6 borderline accept because of the new edits.

评论- Thank You for Your Positive Feedback and Gentle Reminder

2024-08-12

Dear Reviewer GEfY,

Thank you so much for your positive feedback! It encourages us a lot!

We noticed that you mentioned in your response that you would raise your score to 6. Again, we sincerely appreciate this! However, the current score remains unchanged (as 4). We speculate that you may have forgotten to make changes to your original rating in your busy schedule. We would be very grateful if you could kindly change the score before the end of the author-reviewer discussion at your convenience, in case of potential misunderstandings during the reviewer discussion period.

Best Regard,

#Paper9518 Author(s)

评论- Major revision for Q2

2024-08-11

This is the response to Q2. We’ll make the revision as follow:

Q2. Update the literature review in Background section

We revise the Background section as follow:

Remove the claim that no algorithm for compressing data on satellite exist.
Update the literature review of onboard image compression algorithm.

The final revision in Sec2.1 from L112 to L119 is as follow (~~strike-through~~ means the removed content and Italic+Bold indicates newly added):

~~Although there are some works on remote sensing image compression, none of the previous work is targeted at on-board computing scenarios.~~ There are some compression methods specifically for remote sensing images [1,2,3]. [1] uses discrete wavelet transform to divide image features into high-frequency features and low-frequency features, and design a frequency domain encoding-decoding module to preserve high-frequency information, thereby improving the compression performance. [2] explore local and non-local redundancy through a mixed hyperprior network to improve entropy model estimation accuracy. Few of these work focus on onboard deployment. [4] use the CAE model to extract image features and reduce the image dimension to achieve compression, and deploy the model on VPU. However, this method only considers the reduction of image dimension and does not consider the arithmetic coding process in the actual transmission process, resulting in the image compression rate can only be adjusted by changing the model architecture.

Reference

[1] Remote sensing image compression based on high-frequency and low-frequency components. IEEE Transactions on Geoscience and Remote Sensing, 2024.

[2] Remote sensing image compression based on the multiple prior information. Remote Sensing, 15(8):2211, 2023.

[4] Artificial intelligence based on-board image compression for the Φ-Sat-2 mission. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023.

评论- Major revision for Q1

2024-08-11

Thank you for the feedback.

Given the character limit (5000), we have to response each question separately in three comment. This is the response to Q1. We’ll make the revision as follow:

Q1. Rewrite the training and testing stage

We revise the Method section as follow:

Add more detailed training process and rearrange the order for clearer expression.
Clarify the distinction between training and testing stages.

The final revision in Sec4.3 from L198 to L205 is as follow (~~strike-through~~ means the removed content and Italic+Bold indicates newly added):

~~We first determine what details are lost in $\lfloor \mathbf{y} \rceil$ .~~ The training is divided into two stages. In the first stage, we train the compression model. Here, the image decoder receives two parts of information, one of which is the latent representation $\lfloor \mathbf{y} \rceil$ extracted by the encoder $\mathcal{E}$ on the satellite, and the other part is the information $\mathrm{z}_0$ extracted from the original image $\mathbf{x}_0$ by the image encoder $\tilde{\mathcal{E}}$ on the ground, which is used as compensation to $\lfloor \mathbf{y} \rceil$ .Since the Image decoder $\mathcal{D}$ needs two parts of information (i.e. $y\prime$ and $z_0$ in Figure 2) for decoding, we introduce another image encoder $\tilde{\mathcal{E}}$ to extract compensation information $z_0$ from the original image. In the first stage, $\mathcal{E}$ , $\tilde{\mathcal{E}}$ and $\mathcal{D}$ are trained together. In the second stage of training, we freeze the parameters of $\mathcal{E}$ , $\tilde{\mathcal{E}}$ and $\mathcal{D}$ , and train the noise prediction network, with the goal of making the information generated by the diffusion model as close to $z_0$ as possible, denoted as $z_0\prime$ , so as to generate the compensation information required by the decoder. During the inference phase, ~~$\mathrm{z}_0$ is replaced by $\mathrm{z}_0\prime$ generated from Gaussian noise by diffusion under the guidance of specific conditions.~~ the trained diffusion model can generate compensation information $z_0\prime$ . Therefore, we no longer need $\tilde{\mathcal{E}}$ . The $z_0\prime$ generated by the diffusion model replaces the $z_0$ extracted by $\tilde{\mathcal{E}}$ to help the image decoder decompress the image.

审稿意见

评分: 6置信度: 52024-07-12

This paper presents COSMIC, a coding scheme designed for satellite-to-ground image transmission. It addresses the disparity in computing performance between the satellite and ground station. COSMIC features a lightweight encoder on the satellite, reducing FLOPs by 2.6 to 5 times, to achieve a high image compression ratio and save bandwidth. On the ground, a diffusion-based model compensates for image detail loss during decoding. Together, these components facilitate efficient satellite-to-ground image transmission.

优点

Unlike traditional methods that rely on arithmetic coding, this paper employs a generative model to reduce the precious information bandwidth, making for an interesting and novel approach.
The use of a diffusion model to supplement missing details is highly feasible. Despite the availability of various encoding and decoding techniques, this choice is both wise and practical.
Experimental results demonstrate that this method achieves better rate-distortion (RD) performance compared to other approaches.

缺点

The application scenario restrictions are not comprehensive. In particular, the paper overlooks an important limitation of satellites: their power capability. Generally, satellites are powered by photovoltaic panels, so power consumption must be considered when calculating hardware demands. Compared to power consumption, the influence of channel width may be less critical. Therefore, I would like to see a more in-depth discussion on power challenges.
There are significant issues in the writing of the thesis. For example, section 4.3 states: "We first determine what details are lost in y. In the initial training stage (Figure 2(b)), we train the image compression encoder $\mathcal{E}$ , image encoder $\widetilde{\mathcal{E}}$ , and image decoder $\mathcal{D}$ jointly." However, in this figure, the image compression encoder, image encoder, and image decoder are not correctly annotated, which seriously affects the interpretation and assessment of this paper. There are many similar instances throughout the text.
Different satellites may carry different sensors, leading to variations in the type of metadata ( $m$ ), which may affect the method's performance. This potential variability requires more discussion.

问题

See the weaknesses.

局限性

See the weaknesses.

作者回复

2024-08-06

We thank the reviewer for this thoughtful review and we are glad to see their positive assessment.

Note that Fig.S2 can be found in the PDF attached to the global response.

Q1. More in-depth discussion on power challenges.

Please see our general response 1 above.

Q2. Issues in the writing.

Sorry for the confusion. The image compression encoder is deployed on the satellite, which is lightweight and used to extract satellite image features. In Fig.S2, it corresponds to the image encoder on the satellite in the Compression Model part. The image encoder corresponds to the Image Encoder module in the Compensation Model part in Fig.S2. This module is only used during training to extract the information lost in the lossy compression process to provide compensation for the decompression process, and help finetune the diffusion to learn how to generate the compensation information as a target. During the inference, this module is not used, and the compensation information is generated by diffusion. Image Decoder corresponds to the Image Decoder module in the Compression Model part of Fig.S2, which is used for the image decompression process. We make detailed annotations in Fig.S2, and will further explain and modify them in revision.

Q3. Different satellites may carry different sensors, leading to variations in the type of metadata (m), which may affect the method's performance.

Good question. We demonstrate that COSMIC has SOTA results with only three common metadata (i.e. location, timestamp, and GSD) on satellites. Different satellites carry different sensors but there are several sensors on almost all satellites. Take LANDSAT8 launched by NASA in 2013 [1] and Sentinel-2 launched by ESA in 2015 [2] as examples, they can all collect location, timestamp, and GSD. If we use these three metadata and take a bpp of 0.46 to experiment with COSMIS, PSNR and MS-SSIM decrease slightly, respectively from 27.31 to 27.20 and from 0.969 to 0.968， which can still guarantee the SOTA results. We believe that more metadata can achieve better results, and COSMIC can still achieve SOTA performance even with commonly used metadata. We’ll add them in revision.

Reference

[1] Landsat 8 (L8) Data Users Handbook Version 5.0 https://www.usgs.gov/landsat-missions/landsat-8-data-users-handbook

[2]Sentinel-2 User Handbook sentinel.esa.int/documents/247904/685211/Sentinel-2_User_Handbook

2024-08-12

Your explanations effectively addressed my concerns. Considering the technical reliability and potential impact, I will maintain my original rating.

评论- A Gentle Reminder of the Final Feedback

2024-08-11

Please allow us to thank you again for reviewing our paper and the insightful comments, and in particular for recognizing the strengths of our paper in terms of novel method, highly feasible, good soundness, and good contribution.

作者回复

2024-08-06

We thank the reviewers for their insightful comments and acknowledging that the paper is well-written and with clear logical flow (AGki), the method is interesting and novel (gvA5/GEfY/AGki), highly feasible (gvA5/upSj/AGki) and potential to be extended to other edge devices (GEfY), and the evaluation is comprehensive (gvA5/upSj). We have carefully considered your comments and will take them into account to further improve our work. Before we respond to each reviewer individually, we address common concerns as follows.

Q1. Power consumption in real-world application

Thanks to Reviewer AGki for useful suggestions. We plan to add the following contents in revision. COSMIC can be deployed on a real satellite by deploying COSMIC's encoder $\mathcal{E}$ on an embedded GPU (i.e. Nvidia Jetson Xavier NX) which was already deployed on various satellites in-orbit (FOREST-1, 2 [1], Chaohu-1 [2], Optimus [3], etc.). After training, we convert and deploy the image compression encoder $\mathcal{E}$ compatible with the NX system via TensorRT 8.2.1 SDK (Jetpack 4.6.1, CUDA 10.2, cuDNN 8.2.1). We use the tegrastats tool to monitor NX's power consumption. During the compression, the power consumed by $\mathcal{E}$ is between 5.7W and 7.7W, which can be fully supported by satellites like FOREST-1, 2 since their payloads can support NX running at 15W [1].

Q2. More baselines for comparison

Different from natural images, remote sensing images are actually multimodal data, which contain many sensor data in addition to images (e.g. timestamps and location). Our insight is that multimodal information in sensor data is instructions of image contents to a certain extent. For example, a satellite's location can roughly determine whether its photo content is a city, desert, or ocean. Thus, COSMIS is novel since we use the multimodal data nature of sensing images for better decompression which cannot be generally extended by natural images without such multimodal information as instructions. Moreover, various existing methods rely on complex encoders so they are hard to meet the computing and power constraints of satellites.

According to the suggestions of reviewers GEfY and AGki, we add the representative work HL-RS [5] in remote sensing image compression and a latest work Elic [6] as baselines. Please note that LIC_TCM [7] has FLOPs of 51.19G, which is 10 times larger than COSMIC, so we do not experiment it as a baseline for comparison. The results of the encoder efficiency analysis (FLOPs) are shown as follows (* represents the newly added baseline).

Method	Elic*[6]	HL-RS*[5]	CDC	COLIC	Hific	mbt-2018	cheng-2020	COSMIC
FLOPs(G)	21.78	11.87	13.1	26.4	26.4	8.07	24.45	4.9

We also show the rate-distortion (perception) results across 12 metrics in Fig.S1 in the PDF attached to the global response. In 8 of them, COSMIC achieves SOTA results at all bpp. In the other 3 metrics, COSMIC achieves SOTA results at certain bpp.

Reference

[1] Wildfires From Space. https://blogs.nvidia.com/blog/ororatech-wildfires-from-space/, 2021.

[2] "Chaohu 1". Gunter's Space Page. Retrieved July 12, 2024, from https://space.skyrocket.de/doc_sdat/chaohu-1.htm.

[3]Space Machines Company, Optimus OTV, https://space.skyrocket.de/doc_sdat/optimus-otv.htm, 2024.

[4] Jetson Modules https://developer.nvidia.com/embedded/jetson-modules

[5] Remote sensing image compression based on high-frequency and low-frequency components. IEEE Transactions on Geoscience and Remote Sensing, 2024.

[6] Elic: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding CVPR 2022.

[7]Learned image compression with mixed transformer-cnn architectures. CVPR 2023.

最终决定Accept (poster)

2024-09-25

This paper received 3 positive and 1 negative scores. Most reviewers believe that the rebuttal has addressed their concerns well. The negative review is not grounded on facts, and with a good chance was generated by GenAI - therefore we ignore it. After reading the rebuttal and discussions, I believe that the authors can resolve these concerns well in the revision. Of course, the authors are encouraged to refine the work based on the comments.