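PaperHub
Average rating: 6.0/10
Decision: Rejected · 4 reviewers
Ratings: 6, 8, 5, 5 (min 5, max 8, std 1.2)
Average confidence: 4.5
Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.0
ICLR 2025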

FourierMamba: Fourier Learning Integration with State Space Models for Image Deraining

Submitted: 2024-09-27 · Updated: 2025-02-05

Keywords: Fourier transform, mamba, image deraining

Reviews and Discussion

Review
Rating: 6

The paper presents a novel approach to image deraining by integrating Fourier transformation with a state-space model. Unlike previous methods, the proposed approach applies the state-space model in both the spatial and frequency domains, introducing a new scanning strategy referred to as zigzag scanning. This strategy gradually processes both low- and high-frequency components, aiming to enhance the model's ability to capture rain-related distortions. The key contribution of the paper lies in the introduction of this zigzag scanning mechanism, which differentiates it from existing methods.

Strengths

The proposed model involving zigzag scanning is novel and intuitive.

The paper provides a thorough and intuitive discussion of the advantages of zigzag scanning in the context of the deraining task.

The proposed model demonstrates state-of-the-art performance, balancing accuracy with reasonable computational cost and model size.

Weaknesses

Although the motivation for zigzag scanning is well-discussed, the paper lacks details on its implementation. Considering the complexity involved, the implementation is non-trivial and deserves attention. The way zigzag scanning is executed, along with pixel organization strategies, can significantly impact inference speed. Including a comparison of running times would enhance the evaluation.

The visual comparison baselines could also be updated. It would be beneficial to include comparisons with more recent methods (post-2022) and provide additional zoomed-in patches or error maps to better highlight performance differences.

There are several confusing typographical issues, such as "FCS" being used as the abbreviation for "Fourier Channel Evolution" and "C-FFT" lacking a clear explanation (possibly referring to "channel FFT"). These should be clarified.

Furthermore, the paper overlooks several closely related works, including: [1] Wavelet Approximation-Aware Residual Network for Single Image Deraining, 2023. [2] A Hybrid Transformer-Mamba Network for Single Image Deraining, 2024. [3] Image Deraining with Frequency-Enhanced State Space Model, 2024.

Additionally, applying the L1 norm directly in the frequency domain may be less accurate than alternative metrics, such as coherence-based distances. An ablation study on the impact of different frequency-domain loss functions would strengthen the analysis.

Finally, the zigzag scanning strategy appears intuitive for other low-level vision tasks, such as deblurring. It would be valuable to investigate the model's performance on additional tasks to assess its generalizability.

Questions

The main points of confusion, as outlined in the Weakness section, are as follows:

Implementation and Efficiency of Zigzag Scanning: The paper does not provide sufficient details on how the zigzag scanning is implemented, despite its potential complexity. Additionally, it is unclear how efficient the scanning strategy is in practice. A more thorough discussion of the implementation and its impact on inference speed would strengthen the work.

Generalization to Other Tasks: Given that the motivation for zigzag scanning is not specific to rain removal, it would be valuable to evaluate the performance of the proposed method on other low-level vision tasks. Demonstrating its effectiveness beyond the deraining scenario would provide further evidence of its broader applicability.

Comment

Q1: Although the motivation for zigzag scanning is well-discussed, the paper lacks details on its implementation. Considering the complexity involved, the implementation is non-trivial and deserves attention. The way zigzag scanning is executed, along with pixel organization strategies, can significantly impact inference speed. Including a comparison of running times would enhance the evaluation. (Implementation and efficiency of zigzag scanning.)

A1: Our implementation of zigzag scanning involves two steps: obtaining the path and executing the scan. We obtain the scanning path by combining the zigzag scanning algorithm used in JPEG compression with the characteristics of the Fourier domain. For instance, given a Fourier spectrum, we take the high frequency in the upper left corner as the starting point, scan to the low frequency in the center according to the path of the zigzag algorithm, and then return from the low frequency to the high frequency in the upper right corner based on the symmetry of the Fourier spectrum. This process generates a complete scanning path. We then save the obtained path as a list, so that each scan only needs to read the indices in the list and follow the preset path, which significantly reduces running time. We compare the runtime of our method with several others using 512×512 images on an NVIDIA 4090 GPU. The results, shown below, indicate that the runtime of our method is comparable to that of other approaches. According to the reviewer’s suggestion, we have provided this comparison result in Appendix A.3.

Table 1: Runtime comparison between our method and other methods.

| Method | MambaIR | VmambaIR | FreqMamba | Restormer | Ours |
| --- | --- | --- | --- | --- | --- |
| Runtime (s) | 0.534 | 0.423 | 1.837 | 0.253 | 0.523 |
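
The two-step procedure described in A1 (compute the path once, then reuse it as an index list) can be illustrated with a minimal sketch. It assumes the plain JPEG-style zigzag order over an H×W grid starting at the top-left corner. The authors' actual path (corner to spectrum center and back, exploiting Fourier symmetry) is not described in enough detail here to reproduce, and the names `zigzag_indices`, `scan`, and `unscan` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def zigzag_indices(height, width):
    """Return the JPEG-style zigzag visiting order as a tuple of flat indices.

    Cached, so the path is computed once per (height, width) and later
    scans only perform index lookups.
    """
    order = []
    for s in range(height + width - 1):          # s = row + col of an anti-diagonal
        rows = range(max(0, s - width + 1), min(s, height - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)                # even diagonals run bottom-left to top-right
        for r in rows:
            order.append(r * width + (s - r))
    return tuple(order)


def scan(flat_values, height, width):
    """Reorder a row-major flattened grid along the cached zigzag path."""
    return [flat_values[i] for i in zigzag_indices(height, width)]


def unscan(sequence, height, width):
    """Invert `scan`, restoring row-major order."""
    out = [None] * (height * width)
    for position, i in enumerate(zigzag_indices(height, width)):
        out[i] = sequence[position]
    return out
```

Because the path is cached, repeated scans at the same resolution cost only a gather over precomputed indices, which is the property A1 relies on for its runtime claim.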

Q2: The visual comparison baselines could also be updated. It would be beneficial to include comparisons with more recent methods (post-2022) and provide additional zoomed-in patches or error maps to better highlight performance differences.

A2: According to the reviewer’s suggestion, we have updated the visual comparison baseline by including visual results of more recent methods and providing additional zoomed-in patches, as shown in Figures 5 and 6 of the revision.

Q3: There are several confusing typographical issues, such as "FCS" being used as the abbreviation for "Fourier Channel Evolution" and "C-FFT" lacking a clear explanation (possibly referring to "channel FFT").

A3: Thank you for the reviewer's corrections. We have addressed these issues in the revision. We have updated the abbreviation for "Fourier Channel Evolution" to "FCE." Additionally, "C-FFT" indeed refers to the channel-dimension Fourier transform.
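
For readers unfamiliar with the term, a channel-dimension Fourier transform applies a 1-D FFT along the channel axis rather than the spatial axes. The following is a minimal NumPy sketch, assuming a feature map stored as (channels, height, width); the function names are hypothetical and this is not the authors' code.

```python
import numpy as np


def channel_fft(features):
    """1-D FFT along the channel axis of a (C, H, W) feature map."""
    return np.fft.fft(features, axis=0)


def channel_ifft(spectrum):
    """Inverse of `channel_fft`; returns the real part for real-valued inputs."""
    return np.fft.ifft(spectrum, axis=0).real


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))      # 8 channels, 4x4 spatial grid
X = channel_fft(x)                      # complex array, same shape as x
x_back = channel_ifft(X)                # round trip recovers the input
```

The zero-frequency slice `X[0]` equals the sum over channels at each spatial position, so each channel-frequency coefficient aggregates information from every channel.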

Q4: The paper overlooks several closely related works.

A4: Thank you for the reviewer’s suggestion. Wavelet Approximation-Aware Residual Network for Single Image Deraining proposes a wavelet approximation-aware residual network, which efficiently removes rain from low-frequency structures and high-frequency details at each level separately. A Hybrid Transformer-Mamba Network for Single Image Deraining introduces a network combining Transformer and Mamba to capture long-range dependencies related to rain. Image Deraining with Frequency-Enhanced State Space Model achieves effective deraining by parallelizing frequency-domain processing branches with the Mamba branch. In contrast to these methods, our work explores how to integrate information across all frequency bands in the Fourier domain using the Mamba architecture, fully leveraging the complementarity between different frequency bands to enhance image deraining performance. According to the reviewer’s suggestion, we have included these closely related studies in the related work section of the revision to provide a more comprehensive overview of the research background and the connections between the methods.

Comment

The comparison results for image dehazing are presented in the following table.

Table 4: Comparison of methods on Dense-Haze and NH-HAZE datasets.

| Method | Dense-Haze PSNR | Dense-Haze SSIM | NH-HAZE PSNR | NH-HAZE SSIM |
| --- | --- | --- | --- | --- |
| DCP | 10.06 | 0.3856 | 10.57 | 0.5196 |
| DehazeNet | 13.84 | 0.4252 | 16.62 | 0.5238 |
| GridNet | 13.31 | 0.3681 | 13.80 | 0.5370 |
| MSBDN | 15.37 | 0.4858 | 19.23 | 0.7056 |
| AECR-Net | 15.80 | 0.4660 | 19.88 | 0.7173 |
| FreqMamba | 17.35 | 0.5827 | 19.93 | 0.7372 |
| Ours | 18.91 | 0.6763 | 20.03 | 0.7508 |

It can be seen that our method demonstrates good generalization and potential for other low-level vision tasks beyond rain removal. According to the reviewer’s suggestion, we have provided the results for these additional tasks in Appendix A.8 of the revised version.

[1] Focal Frequency Loss for Image Reconstruction and Synthesis, 2021.

[2] Fourmer: An Efficient Global Modeling Paradigm for Image Restoration, 2023.

[3] FreqMamba: Viewing Mamba from a Frequency Perspective for Image Deraining, 2024.

Comment

Q5: Applying the L1 norm directly in the frequency domain may be less accurate than alternative metrics, such as coherence-based distances. An ablation study on the impact of different frequency-domain loss functions would strengthen the analysis.

A5: According to the reviewer’s suggestion, we evaluate three additional frequency-domain loss functions: Phase Consistency Loss (PCL), Frequency Distribution Loss (FDL), and Focal Frequency Loss (FFL) [1]. The Phase Consistency Loss measures the mean squared error of the phase difference between two images in the frequency domain, and thus assesses the similarity of their phase information. The Frequency Distribution Loss quantifies the difference between the frequency-domain amplitude distributions of the reconstructed image and the reference image. The Focal Frequency Loss adaptively focuses on frequency components that are hard to synthesize by down-weighting the easy ones. We perform ablation experiments on these loss functions as follows.

Table 2: Comparison results of different frequency-domain loss functions.

| Metric | PCL | FDL | FFL | Ours |
| --- | --- | --- | --- | --- |
| PSNR | 39.67 | 39.69 | 39.75 | 39.73 |
| SSIM | 0.9848 | 0.9852 | 0.9859 | 0.9856 |

The results indicate that the performance achieved with these four loss functions is similar. The focus of this work is on the design of the network architecture, so we follow existing methods [2] to use the L1 norm in the frequency domain. We will explore more frequency-domain loss functions in future work. According to the reviewer’s suggestion, we have provided the description and results of the above ablation experiments in Appendix A.19 of the revision.
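
To make two of the losses discussed in A5 concrete, here is a minimal NumPy sketch of an L1 loss on Fourier spectra and of a phase consistency loss. The exact formulations used by the authors are not given in this discussion, so the definitions below (L1 over real and imaginary parts, wrapped phase difference) and the function names are assumptions for illustration only.

```python
import numpy as np


def frequency_l1_loss(pred, target):
    """Mean absolute difference between the 2-D spectra of two images."""
    diff = np.fft.fft2(pred) - np.fft.fft2(target)
    # L1 over real and imaginary parts, a common way to apply L1 to complex values.
    return np.mean(np.abs(diff.real) + np.abs(diff.imag))


def phase_consistency_loss(pred, target):
    """Mean squared phase difference, wrapped to (-pi, pi]."""
    phase_diff = np.angle(np.fft.fft2(pred)) - np.angle(np.fft.fft2(target))
    wrapped = np.angle(np.exp(1j * phase_diff))   # wrap so 2*pi jumps do not count
    return np.mean(wrapped ** 2)
```

Under these definitions, uniformly brightening an image changes the frequency L1 loss but leaves the phase loss at zero, which shows how the two losses penalise different kinds of error.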

Q6: It would be valuable to evaluate the performance of the proposed method on other low-level vision tasks. Demonstrating its effectiveness beyond the deraining scenario would provide further evidence of its broader applicability.

A6: The reviewer’s suggestion is insightful. Following FreqMamba [3], we evaluate our method on low-light enhancement and image dehazing. We use the LOL-V1 and LOL-V2-synthetic datasets to evaluate the performance of our method on low-light enhancement, and the Dense-Haze and NH-HAZE datasets are used to evaluate the performance of our method on real-world image dehazing. The results for low-light enhancement are shown in the table below.

Table 3: Comparison of methods on LOL-V1 and LOL-V2-Syn datasets.

| Method | LOL-V1 PSNR | LOL-V1 SSIM | LOL-V2-Syn PSNR | LOL-V2-Syn SSIM |
| --- | --- | --- | --- | --- |
| RetinexNet | 18.38 | 0.7756 | 19.92 | 0.8847 |
| KinD | 20.38 | 0.8248 | 22.62 | 0.9041 |
| ZeroDCE | 16.80 | 0.5573 | 17.53 | 0.6072 |
| KinD++ | 21.30 | 0.8226 | 21.17 | 0.8814 |
| URetinex-Net | 21.33 | 0.8348 | 22.89 | 0.8950 |
| FECNet | 22.24 | 0.8372 | 22.57 | 0.8938 |
| SNR-Aware | 23.38 | 0.8441 | 24.12 | 0.9222 |
| FreqMamba | 23.57 | 0.8453 | 24.46 | 0.9355 |
| Ours | 23.78 | 0.8467 | 24.75 | 0.9452 |

Comment

Dear Reviewer NHz8:

Thanks for your valuable suggestions and recognition. Your recognition means a great deal to us. We sincerely appreciate your effort and support once again!

Best Regards,

Authors of #8897

Review
Rating: 8

This paper proposes a new network for image deraining by introducing state space models, or more specifically the recently proposed Mamba structure, in the Fourier space. By virtue of Mamba, the frequency correlations can be better captured and utilized in the Fourier domain, so that better deraining performance can be expected. Experiments on various deraining datasets have been conducted to demonstrate the effectiveness of the proposed method.

Strengths

  • The application of Mamba to the Fourier domain is new and worth exploring.

  • The experimental results are promising.

Weaknesses

  • The motivation should be strengthened. Currently, it seems the work is mostly about applying Mamba in the Fourier domain, though this task itself could be non-trivial. However, it is unclear why Mamba is more effective than other architectures for processing the Fourier frequencies. Visualizing the intermediate features might be useful.

  • As reviewed by the authors, another work (Zhen et al., 2024) considers Mamba in the frequency domain but with wavelet transformation. It is better to discuss more about wavelet transformation and Fourier transformation in dealing with frequencies. Besides, as shown in Table 1, the performance of (Zhen et al., 2024) is close to the proposed method on several datasets, but the comparison with this method on SPA-Data is missing.

  • The quantitative comparison is conducted with respect to PSNR and SSIM. However, as more and more researchers realized, these two metrics can be often inconsistent with human perceptions and thus some newly developed metrics that can better reflect human perceptions should be considered.

Questions

  • In Section 4.1, the authors mentioned the "progressive training strategy". What does it mean? And why is it called "progressive"?

  • There are several writing issues that I can find:

  1. Line 75-76: "...build correlation the correlation among...";
  2. The rightmost of Line 73: ".arrangement";
  3. According to the formatting of other parts of this manuscript, the "Fourier transformation." in Section 3.1 should not take the whole line.

Comment

Q3: The quantitative comparison is conducted with respect to PSNR and SSIM. However, as more and more researchers realized, these two metrics can be often inconsistent with human perceptions and thus some newly developed metrics that can better reflect human perceptions should be considered.

A3: The reviewer's suggestion is very insightful. We use widely adopted perceptual metrics, including BRISQUE, NIQE, and SSEQ, to conduct quantitative evaluations, with some results shown in the table below. The complete evaluation results are provided in Appendix A.17. In future work, we plan to explore additional metrics that better reflect human perception.

Table 2: Performance comparison of different methods on Test2800 dataset.

| Method | BRISQUE ↓ | NIQE ↓ | SSEQ ↓ |
| --- | --- | --- | --- |
| MPRNet | 15.782 | 6.251 | 9.470 |
| MAXIM | 15.272 | 6.114 | 8.760 |
| Restormer | 18.601 | 6.169 | 9.579 |
| MambaIR | 13.246 | 6.165 | 8.332 |
| VmambaIR | 13.465 | 6.114 | 8.306 |
| FreqMamba | 19.942 | 5.439 | 10.371 |
| Ours | 12.895 | 5.258 | 8.286 |

Q4: In Section 4.1, the authors mentioned the "progressive training strategy". What does it mean? And why is it called "progressive"?

A4: The progressive learning strategy refers to training the network on smaller image patches during the initial stages and gradually increasing the patch size in later stages. This approach is referred to as "progressive" because the resolution increases incrementally. Many previous methods, such as Restormer and MambaIR, also adopt this training strategy.
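
The strategy described in A4 amounts to a lookup from training iteration to patch size. A minimal sketch follows; the stage boundaries and patch sizes in `SCHEDULE` are hypothetical illustrative values, not the settings used in the paper, and `patch_size_at` is a made-up helper name.

```python
def patch_size_at(iteration, schedule):
    """Return the training patch size for a given iteration.

    `schedule` is a list of (start_iteration, patch_size) pairs sorted by
    start_iteration; the last stage whose start has been reached applies.
    """
    size = schedule[0][1]
    for start, patch in schedule:
        if iteration >= start:
            size = patch
    return size


# Hypothetical schedule: values are illustrative, not the paper's settings.
SCHEDULE = [(0, 128), (100_000, 192), (200_000, 256), (250_000, 384)]
```

In practice the batch size is usually reduced as the patch size grows so that memory use stays roughly constant, but that detail is not stated in the response above.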

Q5: There are several writing issues that I can find.

A5: Thank you to the reviewer for pointing these out. We have addressed and resolved these issues in the revised version.

Comment

Q1: It is unclear why Mamba is more effective than other architectures for processing the Fourier frequencies. Visualizing the intermediate features might be useful.

A1: First, Mamba utilizes sequence modeling to integrate information across all frequency bands, effectively leveraging the complementary relationships between different bands. In contrast, convolution, as a local operation, struggles to holistically model global features across all frequency bands when processing frequency information in the Fourier domain. This limitation significantly constrains its capacity in the Fourier space.

Second, Mamba's sequence modeling is orderly, which can help the network establish an orderly dependency relationship between different frequencies. This characteristic is critical for modeling image degradation information. Conversely, convolution is insufficient in capturing the dependencies between high and low frequencies in the Fourier domain, thereby weakening its ability to accurately represent degradation features.

In summary, based on these two advantages, Mamba achieves better coordination of high-frequency and low-frequency information in the Fourier domain during the image restoration process. On the one hand, it effectively captures rain streaks, and on the other, it enhances background reconstruction, significantly improving the quality of image restoration. According to the reviewer's suggestion, we have shown feature visualizations in Figure 8 of Appendix A.11.

Q2: It is better to discuss more about wavelet transformation and Fourier transformation in dealing with frequencies. Besides, as shown in Table 1, the performance of (Zhen et al., 2024) is close to the proposed method on several datasets, but the comparison with this method on SPA-Data is missing.

A2: FreqMamba (Zhen et al., 2024) employs Mamba scanning in the wavelet-transformed domain, which lacks the unique advantages of the Fourier domain. In image restoration tasks, the Fourier domain offers two critical characteristics: 1) Decoupling of degradation and background content: The Fourier transform can decouple the degradation components and content components of an image to some extent, effectively providing a prior related to degradation information. 2) Global characteristics: Each pixel in the Fourier domain is associated with all pixels in the spatial domain, offering a global perspective that is particularly important for capturing long-range correlations.

However, the design of FreqMamba is based on the wavelet transform, which divides the image into several patches and performs spatial domain scanning within each patch. Since the wavelet-transformed domain lacks the global characteristics of the Fourier domain, FreqMamba has relatively limited capability for modeling the correlations between frequencies.
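
The global characteristic claimed for the Fourier domain can be checked numerically. The sketch below, which uses an arbitrary 8×8 random image as an assumption, perturbs a single spatial pixel and measures how much every Fourier coefficient changes.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))

perturbed = image.copy()
perturbed[3, 5] += 1.0                       # change a single spatial pixel

spectrum_change = np.abs(np.fft.fft2(perturbed) - np.fft.fft2(image))
# A unit impulse has a flat magnitude spectrum, so every coefficient moves by 1.
changed_everywhere = bool(np.all(spectrum_change > 0.5))
```

Every coefficient shifts by the same magnitude, which is the sense in which each Fourier-domain element is tied to all spatial pixels. The sketch does not test the comparative claim about wavelet-domain scanning.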

To further evaluate the performance difference, we use FreqMamba's recently released open-source code to train and test on the SPA-Data dataset. The results are shown below.

Table 1: Quantitative comparison with FreqMamba on the SPA-Data.

| SPA-Data | FreqMamba | Ours |
| --- | --- | --- |
| PSNR | 48.47 | 49.18 |
| SSIM | 0.9923 | 0.9931 |

It can be seen that the performance of our method is better than FreqMamba. According to the reviewer’s suggestion, we have provided this comparison result in Appendix A.15.

Furthermore, the performance of FreqMamba shown in Table 1 of the manuscript is actually obtained by training and testing separately on each dataset. In contrast, like most other methods, we train on Rain13k and then test on individual datasets. This discrepancy may lead to an overestimation of FreqMamba's performance in Table 1 of the manuscript. Therefore, we provide the performance of FreqMamba using our experimental settings in Table 11 of Appendix A.9. Experiments show that under the same experimental conditions, our method can achieve better performance than FreqMamba.

Comment

The authors have addressed my concerns, and thus I decide to raise my rating.

Comment

Dear Reviewer xRKZ:

Thanks for your acknowledgment of our work and responses. We'll carefully revise our final paper. Your positive rating means a lot to us. We appreciate your constructive feedback that has helped refine our research.

Best Regards,

Authors of #8897

Review
Rating: 5

This paper proposes Fourier Mamba, an image rain removal framework. It improves the rain removal effect by applying the Mamba technique in Fourier space and utilizing the correlation between different frequencies. This method uses zigzag encoding to rearrange the frequency order in the spatial dimension and directly uses Mamba for frequency correlation in the channel dimension. Experiments have shown that Fourier Mamba has achieved more competitive results in image rain removal tasks.

Strengths

  1. The proposed framework combines Fourier priors and state space models to correlate different frequencies in Fourier space to enhance the rain removal effect of images. This combination utilizes the advantages of the Fourier transform in global modeling and introduces the efficient feature modeling capability of the Mamba model.

  2. To rearrange the order from low frequency to high frequency in the Fourier space of the spatial dimension, Fourier Mamba proposed a scanning method based on zigzag encoding, which systematically associates different frequencies. This method introduces zigzag encoding in Fourier space to rearrange the frequency order, thereby orderly associating the connections between frequencies.

Weaknesses

  1. The experiment is insufficient and lacks comparison with methods such as DRSformer [1] and FADformer [2].
  2. I am a bit concerned about the quantitative results in this paper. The difference between the results shown in Table 1 and FreqMamba is not significant. Please further explain the differences between the proposed method and FreqMamba, as they both use Fourier correlation strategies. It is recommended to add visualization results to supplement this.
  3. It is worth noting that FFT is a complex-valued computation. I suggest the authors add a comparison of the actual inference time of the model. A FLOP comparison is not enough, because many works now have lower FLOPs but slower running speed, and other issues such as memory consumption may slow down computation.

[1] Chen X, Li H, Li M, et al. Learning a sparse transformer network for effective image deraining.

[2] Gao N, Jiang X, Zhang X, et al. Efficient Frequency-Domain Image Deraining with Contrastive Regularization.

Questions

Please see the Weaknesses.

Comment

It can be seen that under the same experimental conditions, our method outperforms FreqMamba across all metrics. We provide a detailed description and results of this comparison in Appendix A.9. Additionally, to further validate our performance, we also train and test our method on the Test2800 dataset using the experimental setup of FreqMamba (see Appendix A.16). The results of the above experiments demonstrate that our method outperforms FreqMamba.

Q3: It is worth noting that FFT is a complex valued computation. Suggest the author to add a comparison of the actual inference time of the model.

A3: We understand the reviewer's concern regarding the use of the Fast Fourier Transform (FFT), which indeed involves complex-valued computations. However, the actual inference time of our model is not significantly higher than that of other methods. This is primarily due to two reasons. First, our network only uses half the number of channels and blocks of MambaIR, which inherently reduces runtime. Second, we leverage PyTorch's built-in torch.fft, which benefits from optimized FFT operations and hardware acceleration. The inference time comparison on 512×512 images using an NVIDIA 4090 GPU is presented below. According to the reviewer’s suggestion, we have provided a comparison of inference times in Appendix A.3.

Table 3: Runtime comparison between our method and other approaches.

| Method | MambaIR | VmambaIR | FreqMamba | Restormer | Ours |
| --- | --- | --- | --- | --- | --- |
| Runtime (s) | 0.534 | 0.423 | 1.837 | 0.253 | 0.523 |
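
The response does not say how the runtimes were measured. For reference, a typical wall-clock measurement can be sketched as below. The helper `benchmark` and its defaults are hypothetical, and this is not the authors' protocol. On a GPU, kernels launch asynchronously, so the timed callable would additionally need to wait for completion (for example by calling `torch.cuda.synchronize()` in PyTorch) before returning.

```python
import time


def benchmark(fn, warmup=3, repeats=10):
    """Return the mean wall-clock time of `fn()` in seconds.

    Warm-up calls are executed but not timed, so one-off costs such as
    lazy initialisation or caching do not distort the average.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats
```

Averaging over several timed runs after a warm-up is what makes single-image figures such as those in the table comparable across methods.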
Comment

Q1: The experiment is insufficient and lacks comparison with methods such as DRSformer and FADformer.

A1: According to the reviewer's suggestion, we add comparisons with methods such as DRSFormer and FADFormer. Following their experimental settings, we evaluate our method on Rain200L, Rain200H, DID-Data, DDN-Data, and SPA-Data. The comparison results are provided below.

Table 1: Performance comparison of methods across various datasets.

| Method | Rain200L PSNR ↑ | Rain200L SSIM ↑ | Rain200H PSNR ↑ | Rain200H SSIM ↑ | DID PSNR ↑ | DID SSIM ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| DRSformer | 41.23 | 0.9894 | 32.17 | 0.9326 | 35.35 | 0.9646 |
| FADformer | 41.80 | 0.9906 | 32.48 | 0.9359 | 35.48 | 0.9657 |
| Ours | 42.27 | 0.9908 | 32.71 | 0.9395 | 35.49 | 0.9659 |

It can be seen that our method consistently achieves superior deraining performance on most datasets. We have provided the complete comparison of this experimental setup in Appendix A.10.

Q2: The difference between the results shown in Table 1 and FreqMamba is not significant. Please further explain the differences between the proposed method and FreqMamba, as they both use Fourier correlation strategies. It is recommended to add visualization results to supplement this.

A2: Our method focuses on customized design based on the characteristics of Fourier space, combining Fourier priors with state space models and exploring the potential of introducing Mamba directly in the Fourier domain. In contrast, FreqMamba operates in the Fourier space using only 1×1 convolutions, which fails to fully utilize the rich frequency information inherent to the Fourier domain.

Specifically, FreqMamba applies Mamba scanning in a wavelet-transformed domain. However, the wavelet-transformed domain lacks the notable advantages of the Fourier domain, such as the Fourier transform's ability to decouple degradations and its global representation properties. Additionally, after wavelet decomposition, FreqMamba divides the image into multiple patches and performs spatial scanning within each patch. This design limits FreqMamba's ability to effectively model frequency correlations.

In contrast, our method performs Mamba scanning directly in the Fourier domain, fully leveraging the global characteristics of the Fourier transform. This allows our approach to better capture rain streaks, which often exhibit high apparent repetitiveness. Consequently, from a visual perspective, our method demonstrates significantly better performance in removing rain streaks. We present the relevant visual comparisons in Figure 7 of Appendix A.9, clearly illustrating this advantage.

Furthermore, the quantitative results of FreqMamba listed in Table 1 of the manuscript are taken directly from its original paper, where the model is trained and tested separately on each dataset. This training approach may explain why the performance gap between our method and FreqMamba appears less pronounced. For a fairer comparison, we retrain FreqMamba using our experimental setup and evaluate it on various datasets. The results are shown in the table below.

Table 2: Performance comparison with FreqMamba.

| Dataset | FreqMamba PSNR ↑ | FreqMamba SSIM ↑ | Ours PSNR ↑ | Ours SSIM ↑ |
| --- | --- | --- | --- | --- |
| Test100 | 31.89 | 0.921 | 32.07 | 0.925 |
| Rain100H | 31.67 | 0.910 | 31.79 | 0.913 |
| Rain100L | 39.08 | 0.977 | 39.73 | 0.986 |
| Test2800 | 33.96 | 0.943 | 34.23 | 0.949 |
| Test1200 | 33.31 | 0.925 | 34.76 | 0.938 |
| Average | 33.98 | 0.9352 | 34.52 | 0.9422 |

Comment

Dear Reviewer 1tV5:

Thanks for your valuable comments! This has greatly encouraged us and made our paper more complete and coherent. It would be very appreciated if you raise your rating since we have generally addressed your concerns. Your support is really important for our work. Thanks again!

Best Regards,

Authors of #8897

Review
Rating: 5

This paper presents a Mamba-based image deraining algorithm. The proposed FourierMamba mainly involves two components: spatial interaction SSM and channel evolution SSM. The spatial branch further consists of spatial mamba and frequency mamba. The main contributions of the paper primarily lie in the adaptation of mamba in the frequency domain for image deraining and the scanning strategy for the spatial-dimensional spectra.

Strengths

  1. The paper applies mamba in the frequency domain to enhance frequency interactions. To this end, it devises a zigzag scanning strategy.
  2. The proposed methods are evaluated on synthetic and real-world deraining datasets and achieve promising performance.

Weaknesses

  1. The paper lacks the theory analyses to support the importance of frequency interactions for image restoration.
  2. The channel fft has already been proposed in REVITALIZING CHANNEL-DIMENSION FOURIER TRANSFORM FOR IMAGE ENHANCEMENT. Why does the channel mamba outperform the 1x1 channel version? The reviewer thinks a 1x1 convolution can also model interactions between channel spectra. Moreover, the idea of interactions between frequencies is already implemented in Fourmer via convolutions.
  3. The authors replace mamba with 1x1 conv for the first ablation studies. However, the flops or parameter comparisons are not provided. Is it possible that the improvement of mamba versions is derived from the additional computation overhead compared to the 1x1 conv version?
  4. Regarding the experimental results, the paper only provides results on four datasets in Tab.1, lacking the test100 dataset. Moreover, the competitors in Fig.6 are a little outdated. From my side, the experiments are insufficient. Specifically, compared to other mamba-based methods, the evaluation data/tasks are fewer. Compared to other deraining methods like drsformer, the authors adopt only a rain13 dataset plus spad for evaluation, which is also insufficient.

Questions

Please refer to Weaknesses

Comment

Table 6: Performance comparison of methods across various datasets.

| Method | Rain200L PSNR ↑ | Rain200L SSIM ↑ | Rain200H PSNR ↑ | Rain200H SSIM ↑ | DID PSNR ↑ | DID SSIM ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| DRSformer | 41.23 | 0.9894 | 32.17 | 0.9326 | 35.35 | 0.9646 |
| FADformer | 41.80 | 0.9906 | 32.48 | 0.9359 | 35.48 | 0.9657 |
| Ours | 42.27 | 0.9908 | 32.71 | 0.9395 | 35.49 | 0.9659 |

It can be observed that the combined effect of Mamba and the Fourier prior enables us to achieve superior performance. We provide the complete comparison results for this experimental setup in Appendix A.10.

Comment

Q5: Regarding the experimental results, the paper only provides results on four datasets in Tab.1, lacking the test100 dataset. Moreover, the competitors in Fig.6 are a little outdated. From my side, the experiments are insufficient. Specifically, compared to other mamba-based methods, the evaluation data/tasks are fewer. Compared to other deraining methods like drsformer, the authors adopt only a rain13 dataset plus spad for evaluation, which is also insufficient.

A5: We add comparisons on the Test100 dataset, as presented below. According to the reviewer’s suggestion, we have added results on the test100 dataset in Appendix A.4.

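Table 3: Performance comparison on Test100.

| Metric | PReNet | MPRNet | Restormer | MambaIR | VmambaIR | FreqMamba | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PSNR ↑ | 24.81 | 30.27 | 32.00 | 31.82 | 31.84 | 31.89 | 32.04 |
| SSIM ↑ | 0.851 | 0.897 | 0.923 | 0.922 | 0.918 | 0.921 | 0.925 |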

In response to the reviewer's comment that the comparison methods in Figure 6 are outdated, we have added visual results of several recent methods, as shown in Figure 6 of the revised version.

In response to the reviewer's comment regarding the limited number of evaluation tasks, we follow FreqMamba and conduct experiments on low-light enhancement and image dehazing tasks. For low-light enhancement, we use the LOL-V1 and LOL-V2-Synthetic datasets for evaluation, as shown in the table below.

Table 4: Comparison of methods on LOL-V1 and LOL-V2-Syn datasets.

| Method | LOL-V1 PSNR | LOL-V1 SSIM | LOL-V2-Syn PSNR | LOL-V2-Syn SSIM |
| --- | --- | --- | --- | --- |
| RetinexNet | 18.38 | 0.7756 | 19.92 | 0.8847 |
| KinD | 20.38 | 0.8248 | 22.62 | 0.9041 |
| ZeroDCE | 16.80 | 0.5573 | 17.53 | 0.6072 |
| KinD++ | 21.30 | 0.8226 | 21.17 | 0.8814 |
| URetinex-Net | 21.33 | 0.8348 | 22.89 | 0.8950 |
| FECNet | 22.24 | 0.8372 | 22.57 | 0.8938 |
| SNR-Aware | 23.38 | 0.8441 | 24.12 | 0.9222 |
| FreqMamba | 23.57 | 0.8453 | 24.46 | 0.9355 |
| Ours | 23.78 | 0.8467 | 24.75 | 0.9452 |

For real-world dehazing, we evaluate on the Dense-Haze and NH-HAZE datasets, with results presented in the table below.

Table 5: Comparison of methods on Dense-Haze and NH-HAZE datasets.

| Method | Dense-Haze PSNR | Dense-Haze SSIM | NH-HAZE PSNR | NH-HAZE SSIM |
| --- | --- | --- | --- | --- |
| DCP | 10.06 | 0.3856 | 10.57 | 0.5196 |
| DehazeNet | 13.84 | 0.4252 | 16.62 | 0.5238 |
| GridNet | 13.31 | 0.3681 | 13.80 | 0.5370 |
| MSBDN | 15.37 | 0.4858 | 19.23 | 0.7056 |
| AECR-Net | 15.80 | 0.4660 | 19.88 | 0.7173 |
| FreqMamba | 17.35 | 0.5827 | 19.93 | 0.7372 |
| Ours | 18.91 | 0.6763 | 20.03 | 0.7508 |

It can be demonstrated that our method is effective for other image restoration tasks. We have added the results of the evaluation on these two tasks in Appendix A.8.

According to the reviewer's suggestion, we conduct evaluations on Rain200L, Rain200H, DID-Data, DDN-Data, and SPA-Data based on the experimental settings of DRSFormer. A subset of the comparison results is presented below.

Comment

The 1×1 convolution merely serves as a linear combination of input channels, limiting its interactions to a single frequency band. This is inadequate for handling the complex, periodic nature of rain streak noise. In contrast, our proposed channel Mamba architecture employs sequential modeling to fully integrate information across all frequency bands. This allows it to utilize non-rain frequency bands to effectively suppress rain streak noise. This deep interaction of all frequency bands not only achieves accurate removal of rain streaks but also avoids damaging the image content.
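To make the "linear combination" claim concrete, here is a minimal plain-Python sketch of a 1×1 convolution. The function name `conv1x1`, the nested-list layout, and the toy values are illustrative assumptions, not code from the paper, and the bias term is omitted. It shows only that each output value is a fixed weighted sum of the channel values at the same pixel.

```python
def conv1x1(feature, weight):
    """Apply a 1x1 convolution (no bias) to a feature map.

    feature: list of C_in channels, each an H x W list of lists.
    weight:  C_out x C_in matrix given as a list of lists.
    Returns a list of C_out channels, each H x W.
    """
    c_in = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    out = []
    for w_row in weight:
        # Each output channel is one weighted sum of the input channels,
        # evaluated independently at every pixel location.
        channel = [
            [sum(w_row[c] * feature[c][y][x] for c in range(c_in))
             for x in range(width)]
            for y in range(height)
        ]
        out.append(channel)
    return out


# Toy example: two input channels on a 2x2 grid, mixed into one output channel.
feature = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
]
weight = [[0.5, 0.25]]
mixed = conv1x1(feature, weight)
```

The entries of `weight` are applied identically at every position, so the operation never looks beyond the channel values at one location. The response argues that sequential modeling over the channel spectrum integrates the bands more fully than this.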

Q3: Moreover, the idea of interactions between frequencies is already implemented in Fourmer via convolutions.

A3: Fourmer processes information in the Fourier space solely through convolution, which limits its utilization of Fourier priors in the following ways:

  1. Locality limits global modeling capabilities: Convolution is inherently a local modeling operation, while rain streak noise exhibits multi-scale characteristics and is often distributed across the entire image. Relying only on convolution to process frequency information makes it difficult to sufficiently model and interact with rain streak noise across all scales, thereby restricting the ability to capture global features. In contrast, our approach, based on Mamba, integrates global information across different frequency bands through sequential modeling. This method is better suited to the characteristics of rain image restoration, as rain streak noise has a multi-scale distribution, strong repetitiveness, and involves both high- and low-frequency degradation information.

  2. Lack of ordered dependency modeling between frequencies: Convolution cannot establish ordered dependencies between different frequencies in the Fourier space, which are critical for modeling image degradation. Our method, utilizing Mamba, effectively captures the ordered relationships between high and low frequencies, enabling a more comprehensive representation of rain streak degradation features.

  3. Insufficient interaction in the channel dimension: Fourmer concatenates multiple channels along the channel dimension and processes them with a simple 1×1 convolution, resulting in outputs that are merely linear combinations of each channel. This approach does not fully exploit the interaction between channels, thereby limiting the expressiveness of the features. In contrast, our method introduces a channel Fourier Transform (CFT) in the channel dimension, followed by Mamba to scan and integrate information in the frequency domain. This allows for comprehensive integration of all frequency bands along the channel dimension, enabling different bands to complement and compensate for each other, resulting in more effective and precise rain streak removal.
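Point 2 above depends on feeding frequencies to the sequence model in a meaningful order. As a hedged illustration, the sketch below builds a JPEG-style zigzag traversal of a 2D grid, assuming the lowest frequency sits at index (0, 0) and that frequency grows with row + column. The helper names are hypothetical, and the paper's actual zigzag scan over the Fourier spectrum may differ, for example in how it handles the symmetric layout of an FFT.

```python
def zigzag_order(height, width):
    """Return (row, col) index pairs of a height x width grid in JPEG-style zigzag order."""
    order = []
    for s in range(height + width - 1):  # s = row + col selects one anti-diagonal
        rows = range(max(0, s - width + 1), min(s, height - 1) + 1)
        diagonal = [(r, s - r) for r in rows]
        # Alternate the traversal direction on successive anti-diagonals.
        if s % 2 == 0:
            diagonal.reverse()
        order.extend(diagonal)
    return order


def zigzag_flatten(grid):
    """Flatten a 2D grid into a 1D sequence ordered from low to high index sum."""
    return [grid[r][c] for r, c in zigzag_order(len(grid), len(grid[0]))]


def zigzag_unflatten(sequence, height, width):
    """Inverse of zigzag_flatten: place a 1D sequence back onto the 2D grid."""
    grid = [[None] * width for _ in range(height)]
    for value, (r, c) in zip(sequence, zigzag_order(height, width)):
        grid[r][c] = value
    return grid


# Toy 3x3 "spectrum" whose entries are just labels, to make the order visible.
spectrum = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
]
sequence = zigzag_flatten(spectrum)
restored = zigzag_unflatten(sequence, 3, 3)
```

A sequence model would consume `sequence` token by token, so neighbouring tokens correspond to neighbouring frequencies. `zigzag_unflatten` maps the processed sequence back to the grid before an inverse transform.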

In conclusion, our design not only addresses the limitations of convolution in the Fourier space but also aligns better with the specific requirements of rain image restoration. It significantly enhances global modeling capability and improves the effectiveness of information interaction.

Q4: The authors replace mamba with 1x1 conv for the first ablation studies. However, the flops or parameter comparisons are not provided. Is it possible that the improvement of mamba versions is derived from the additional computation overhead compared to the 1x1 conv version?

A4: Following the reviewer's suggestions, we perform a revised ablation study on the FSI-SSM module in Table 3 of the manuscript. By further reducing the number of channels and blocks, we compress our model to match the computational cost of the model without FSI-SSM and compare the results, as shown in the table below. The results show that Mamba still achieves better performance.

Table 1: The computational overhead of the ablation study on FSI-SSM.

| Method | PSNR | SSIM | FLOPs (G) | Params (M) |
|---|---|---|---|---|
| w/o FSI-SSM | 39.05 | 0.9835 | 14.42 | 10.82 |
| Ours | 39.37 | 0.9845 | 14.64 | 10.12 |

In addition, the ablation study on w/o FCE-SSM in Table 3 of the manuscript already considers the impact of computational cost. The variant without FCE-SSM (w/o FCE-SSM) stacks several cascaded 1x1 convolutions with residual connections to achieve a computational cost similar to that of Mamba. The computational costs of w/o FCE-SSM and our proposed method are provided below. We have added these results in Appendix A.5.

Table 2: The computational overhead of the ablation study on FCE-SSM.

| Method | PSNR | SSIM | FLOPs (G) | Params (M) |
|---|---|---|---|---|
| w/o FCE-SSM | 39.08 | 0.9836 | 21.08 | 17.81 |
| Ours | 39.73 | 0.9856 | 22.56 | 17.62 |

Comment

Q1: The paper lacks the theory analyses to support the importance of frequency interactions for image restoration.

A1: The goal of image deraining is to restore image information damaged or obscured by rain streaks. A frequency-interaction mechanism enables complementary integration of information across frequency bands, reducing the risk of loss or bias caused by relying on a single band. This matters because rain streaks affect image information differently in different frequency bands. By modeling these bands separately and fusing them through interactions, the mechanism allows for more efficient and comprehensive image restoration. From a theoretical perspective, frequency interaction is essentially equivalent to multi-scale processing, enabling the model to capture global and local features simultaneously. In the frequency domain, images often exhibit sparsity, with most energy concentrated in the low-frequency region and relatively little in the high-frequency region. This sparsity provides an implicit prior constraint for frequency-band interaction, functioning similarly to a regularization mechanism. When high-frequency and low-frequency information are not well aligned, artifacts resembling rain streaks may appear in the restored image. The frequency-band interaction mechanism can smooth these artifacts while enhancing authentic high-frequency details, thereby improving restoration quality. The theoretical value of frequency interactions can be summarized as follows:

  1. Multi-scale modeling: Through frequency division, the model can simultaneously process global and local characteristics, capturing both structural and detailed features of the image comprehensively.

  2. Information complementarity: Information from different frequency bands can complement each other, mitigating potential deficiencies that arise from processing a single frequency band.

  3. Noise suppression: In the frequency domain, frequency interaction effectively filters out interference, smoothing high-frequency artifacts.

  4. Implicit regularization: Through sparsity constraints, frequency interaction provides prior knowledge for image restoration, enhancing the generalization ability of the model.

  5. Fusion of global and local features: The frequency interactions mechanism organically integrates characteristics at different scales, improving the model’s robustness and accuracy.
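The sparsity claim in the answer above (most spectral energy sits at low frequencies) can be checked on a toy example. The sketch below is an illustration under stated assumptions, not an experiment from the paper: the 8×8 ramp image, the ±4 column stripes standing in for a high-frequency disturbance, the square `cutoff` region, and all function names are hypothetical. It uses a direct 2D DFT, which is only practical for tiny inputs.

```python
import cmath


def dft2(image):
    """Unnormalised 2D DFT of an H x W real image (direct sum; fine for tiny inputs)."""
    h, w = len(image), len(image[0])
    return [
        [
            sum(
                image[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            for v in range(w)
        ]
        for u in range(h)
    ]


def low_frequency_energy_fraction(image, cutoff):
    """Share of spectral energy whose |frequency index| is <= cutoff on both axes."""
    h, w = len(image), len(image[0])
    spectrum = dft2(image)
    low = total = 0.0
    for u in range(h):
        for v in range(w):
            energy = abs(spectrum[u][v]) ** 2
            total += energy
            # Indices above the midpoint represent negative frequencies; fold them back.
            fu, fv = min(u, h - u), min(v, w - v)
            if fu <= cutoff and fv <= cutoff:
                low += energy
    return low / total


# Smooth toy image: a diagonal intensity ramp.
smooth = [[x + y for x in range(8)] for y in range(8)]
# Same image with alternating +4/-4 columns added as a crude high-frequency pattern.
striped = [[x + y + (4 if x % 2 == 0 else -4) for x in range(8)] for y in range(8)]

smooth_fraction = low_frequency_energy_fraction(smooth, cutoff=1)
striped_fraction = low_frequency_energy_fraction(striped, cutoff=1)
```

On this toy input the smooth image keeps most of its energy inside the low-frequency region, and adding the stripes lowers that share because the new energy lands at the highest horizontal frequency. Real rain streaks are far less regular, so this only illustrates the general idea.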

In conclusion, the frequency-interaction mechanism provides multifaceted theoretical support and practical advantages for the image deraining task. It significantly enhances the robustness and quality of image restoration, establishing a solid basis for high-quality image deraining.

Q2: The channel fft has already been proposed in REVITALIZING CHANNEL-DIMENSION FOURIER TRANSFORM FOR IMAGE ENHANCEMENT. Why does the channel mamba outperform the 1x1 channel version? The reviewer thinks a 1x1 convolution can also model interactions between channel spectra.

A2: REVITALIZING CHANNEL-DIMENSION FOURIER TRANSFORM FOR IMAGE ENHANCEMENT introduces the channel-dimension Fourier Transform (FFT), but our study differs from this work in several key aspects.

First, in terms of research purpose, the referenced paper focuses on image enhancement tasks, leveraging channel-dimension FFT primarily to enhance the discriminative ability of global features and thereby improve image expressiveness. In contrast, our study focuses on image deraining, which belongs to the field of image restoration. The core objective of this task is to accurately extract and represent degraded information. Therefore, we utilize channel-dimension FFT to enrich the global representation of degradation features. This highlights a fundamental difference in motivation between our work and the referenced paper.

Second, in terms of implementation, the core operation in the referenced paper is a 1×1 convolution, which facilitates relatively simple interactions across different frequency bands. However, for image deraining, especially in addressing high-frequency rain streak noise, this approach has two major limitations: (1) Rain streak noise often gets confused with the structural edges of objects, making it difficult for simple interactions to effectively distinguish between the two; (2) Rain streaks exhibit multi-scale characteristics, and relying solely on 1×1 convolution fails to capture the complex relationships between frequency bands. To address these challenges, we design the Mamba architecture for image deraining, incorporating channel-dimension FFT to specifically target the modeling of degraded information in the frequency domain.
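For readers unfamiliar with the channel-dimension transform discussed here, the sketch below shows one plausible reading of it: a 1D discrete Fourier transform taken along the channel axis at each spatial location. This is an assumption for illustration; the referenced paper and this submission may define the operation differently (for instance with pooling beforehand), and the names `dft` and `channel_dft` are hypothetical. In practice one would call an FFT routine from a numerical library instead of the direct sum used here.

```python
import cmath


def dft(values):
    """Direct 1D DFT of a short sequence (O(n^2); for illustration only)."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]


def channel_dft(feature):
    """Transform a C x H x W feature map along its channel axis.

    feature: nested lists indexed as feature[channel][row][col].
    Returns a complex C x H x W "channel spectrum" with the same indexing,
    where the first index now enumerates channel frequencies.
    """
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    spectrum = [[[0j] * w for _ in range(h)] for _ in range(c)]
    for y in range(h):
        for x in range(w):
            # Collect the channel vector at this pixel and transform it.
            coeffs = dft([feature[ch][y][x] for ch in range(c)])
            for m in range(c):
                spectrum[m][y][x] = coeffs[m]
    return spectrum


# Toy feature map: 4 channels, a single pixel, channel values 1, 2, 3, 4.
feature = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
spectrum = channel_dft(feature)
```

Each entry of `spectrum` depends on every channel at that pixel, which is what makes the representation "global" along the channel axis. Whatever module follows (a 1×1 convolution or a sequence model) then operates on these channel-frequency components.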

Comment

Dear Reviewer yyRc:

Thanks for your valuable suggestions! Your feedback is incredibly meaningful to us. If we have addressed your concerns, we earnestly look forward to you reconsidering your score. Your support is really important to us. Once again, thanks for your time and effort!

Best Regards,

Authors of #8897

Comment

Dear reviewers:

Thanks a lot for your previous constructive comments. We would like to know if our revisions address your concerns. If you still have any concerns, we are eager to hear from you, please let us know and we are more than happy to discuss them with you.

Best regards,

Authors of #8897

AC Meta-Review

This paper proposes a novel framework named FourierMamba for image deraining, which integrates a Fourier representation with state space models (SSMs). The authors claim that their method can effectively capture and utilize frequency correlations in the Fourier space, leading to improved image deraining performance. The paper introduces a zigzag coding strategy to scan frequencies in the spatial dimension Fourier space and directly apply SSMs in the channel dimension Fourier space. The authors report state-of-the-art performance on several image deraining datasets.

Strengths:

  • The paper presents a novel idea of integrating Fourier space with SSMs for image deraining.
  • The proposed zigzag coding strategy is well-motivated and technically sound.
  • The experimental results demonstrate the effectiveness of the proposed method.

Weaknesses:

  • Limited novelty of the channel FFT concept.
  • The experimental validation could be more comprehensive, including additional datasets and comparisons with state-of-the-art methods.
  • The advantages of the proposed Mamba structure over other architectures for processing Fourier frequencies are not clearly demonstrated.
  • The paper could benefit from visualizing intermediate features to support the claims of improved frequency correlation modeling.
  • Limited novelty with respect to previous work (in particular FreqMamba).

This paper presents a novel and interesting idea, but requires further revisions before acceptance. While the rebuttal addressed some concerns, significant changes to the paper's body are needed to clearly establish its key contributions and differentiate it from existing work.

Specifically, the authors should:

  • Broaden the scope: Explore the applicability of the proposed innovations to more general tasks beyond image deraining. This will increase the potential impact and relevance of the work.
  • Clarify the paper's unique contributions: More explicitly highlight how this work advances the field beyond existing approaches to image deraining. Right now the paper seems like an incremental change over FreqMamba, which is interesting, but it's limited in terms of impact.

By addressing these points, the authors can strengthen the paper and better communicate its value to the research community.

Additional Comments on Reviewer Discussion

The reviewers raised several concerns during the discussion period, including the novelty of the approach, the choice of Mamba for processing Fourier frequencies, the comprehensiveness of the experimental validation, and the inference speed. The authors provided responses and made some revisions to address these concerns. However, some issues remain unresolved, such as the lack of clear justification for using Mamba, the limited experimental validation, and the relation and advantages to previous work (FreqMamba).

Final Decision

Reject