CosAE: Learnable Fourier Series for Image Restoration
Abstract
Reviews and Discussion
This paper formulates the latent space of the autoencoder as a Fourier series space: encoded images are represented by the corresponding amplitude and phase coefficients, which yields a highly compressed latent space with faithful reconstruction ability. Extensive experiments on natural images and face images are conducted.
Strengths
- The reformulated Fourier latent space seems to work well, offering reconstruction ability through the encoded coefficients.
- Experiments on natural images and face images are conducted.
Weaknesses
- It is not clear why it is necessary to formulate a highly compressed latent space (the so-called information bottleneck) with detailed reconstruction ability for image restoration, as skip connections will compensate for the downsampling loss, which is also common practice in current restoration networks. On the other hand, the compressed latent space may be useful for latent generation, e.g., LDM; however, the corresponding experiments are lacking.
- In order to establish a highly compressed latent space for an autoencoder, comparisons with VAE or VQ-VAE are needed; without them it is hard to analyze whether progress has been made.
- As all experiments are conducted on image restoration, and it is hard to connect the significance of a compressed latent space with image restoration tasks, the effectiveness of the proposed method cannot be evaluated from the current experimental results.
- Are there any ablations in which the basis of the latent space is formulated with the original Fourier basis instead of being learnable?
- Is the encoder capable of encoding higher-resolution images, such as 512, 1024, 2048, etc.? Will the reconstruction performance decline? The compression ratio should also be ablated.
Questions
Please see the paper weakness.
Limitations
The authors have adequately addressed the limitations and broader impact of their work.
We thank the reviewer for the valuable feedback. Regarding your questions, please see our responses below:
Q1: It is not clear why it is necessary to formulate a highly compressed latent space (the so-called information bottleneck) with detailed reconstruction ability for image restoration, as the skip connection will compensate for the downsampling loss.
Thank you for your insightful question. We emphasize that exploring the information bottleneck is valuable for image restoration, for the following reasons:
- First, the key to image restoration is learning the intrinsic structure from noisy data, not merely preserving details. Creating an information bottleneck that captures the main structure while removing noise has been studied extensively, such as in all the previous work on Denoising Autoencoders (DAE) (see Sec. 2). Those methods are less popular for image restoration due to their long-standing limitations in detail preservation with a narrow bottleneck, and our work addresses exactly this limitation. Our CosAE explores an effective way of making use of the information bottleneck, without loss of details.
- Second, our CosAE architecture has demonstrated state-of-the-art performance on numerous image restoration tasks, which strongly supports the value of our approach. This demonstrates that CosAE is valuable not only for research exploration on information bottlenecks, but also as a practical application.
- Additionally, we want to point out that networks with skip connections (e.g., RestoreFormer, LTE, ITNSR, etc.) or wider bottlenecks (LIIF-4x) can also retain noise and degradation signals. For example, our LIIF-4x performs poorly under larger degradations, demonstrating inconsistency and less robustness compared to narrower networks like LIIF-64x and CosAE. On the other hand, many recent works also explore bottleneck architectures to balance detail preservation and noise reduction, such as CodeFormer. Our work aligns with this trend but introduces a novel Fourier-based approach for a more compact and effective representation.
Q2: The compressed latent space may be useful for latent generation, e.g., ldm, however, the corresponding experiments are lacked.
While latent diffusion models (LDM) with KL or VQ regularization support generative tasks, the scope of CosAE in this paper is image restoration. The two goals and applications are very different, making direct comparisons impractical.
In addition, please refer to the answer to Q4 from Reviewer gTQ7, for the discussion of how to further develop CosAE to have image generation capability. Again, we regard it as a different, future direction.
Q3: The comparisons with VAE or VQ-VAE are lacking.
First, VAEs and VQ-VAEs differ in that they are primarily proposed to enable image generation, by including a sampling module in the bottleneck. In contrast, CosAE does not include such a module and is not designed for image generation. Also, VAEs and VQ-VAEs do not directly work for blind image restoration or super-resolution tasks. Consequently, a direct comparison is not applicable.
However, it is important to note that while VQ-VAE is not directly applicable, CodeFormer, built on top of it, facilitates blind face restoration. We compare CosAE with CodeFormer in Figures 5 and 13 and Tables 3 and 6 across multiple datasets. Since both models utilize similar encoder and decoder architectures from VQ-VAE [6], this allows for an indirect comparison of our method to the VQ-based image restoration approach.
Q4: It is hard to connect the significance of compressed latent space with image restoration tasks.
Please refer to Q1 for why the proposed method is effective.
Note that to validate "the significance of compressed latent space", we conducted ablation studies comparing wider and narrower bottlenecks, such as LIIF-4X, LIIF-64x, and CosAE under the same settings (see Figures 3, 4, 9, 10, 12, and Tables 1 and 2). The results consistently show that a narrow bottleneck network performs favorably.
Q5: Are there any ablations in which the basis of the latent space is formulated with the original Fourier basis instead of being learnable?
Yes, we did include that. Since the term "original Fourier basis" is ambiguous, we discuss the following possibilities:
- If "original Fourier basis" means conducting a Fourier transformation in RGB space, it is important to note that without any network learning capability, one can only perform basic Fourier transforms or inverse transforms. This allows fundamental image processing techniques such as low-pass or high-pass filtering, but it does not enable advanced tasks like image restoration.
- We have an ablation model named CosAE-imcos, introduced in lines 269-272, which encodes the RGB image using the original Fourier basis. To facilitate image restoration, we utilize the same auto-encoder, but without the learnable Fourier module in the bottleneck, to process the encoded input signals. Both quantitative and qualitative results are reported in Figure 3, Table 1, and Figure 12.
- Additionally, we experimented with a uniform Fourier basis in the latent space. Although it remains in the latent space, it mimics the original Fourier basis instead of being learnable. For quantitative and qualitative results, please refer to Q3 of Reviewer Gqnq, Table 1, and Figure 1.
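To ground the distinction drawn in the first point above, here is a generic NumPy sketch of classical fixed-basis (non-learnable) low-pass filtering; this is an illustration of the standard technique, not code from the paper, and all names are ours:

```python
import numpy as np

def lowpass_filter(image, cutoff):
    """Keep only frequencies whose radius is at most `cutoff` (in cycles
    per image); zero out the rest. A purely fixed-basis operation with
    no learned parameters."""
    F = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[-h // 2:(h + 1) // 2, -w // 2:(w + 1) // 2]
    mask = np.sqrt(xx**2 + yy**2) <= cutoff
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# A pure low-frequency image passes through almost unchanged...
x = np.cos(2 * np.pi * 2 * np.arange(64) / 64)[None, :] * np.ones((64, 1))
assert np.allclose(lowpass_filter(x, cutoff=8), x, atol=1e-8)
# ...while a high-frequency image is suppressed entirely.
y = np.cos(2 * np.pi * 20 * np.arange(64) / 64)[None, :] * np.ones((64, 1))
assert np.abs(lowpass_filter(y, cutoff=8)).max() < 1e-8
```

Such filtering can remove or keep bands of frequencies, but it has no mechanism to hallucinate plausible detail, which is why a learnable module is needed for restoration.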
Q6: Is the encoder capable of encoding higher resolution images, such as 512, 1024, 2048, etc.?
Yes, all experiments, except for FR-SR on face images and 4x SR on ImageNet, involve restoring images with resolutions of 512x512 or higher. For instance, we perform blind face restoration on 512x512 images and SR on DIV2K with a maximum resolution over 2K (see Figures 4, 9, 10, 16, and 17). We also show SR results for face images at various resolutions, from 64x64 to 512x512, in Figure 11. CosAE can accept any resolution as input, and all these experiments validate its effectiveness on high-resolution images.
Thanks for the authors' rebuttal. Most of my concerns have been addressed; however, it is regretful that the experiments on image generation were not conducted, which misses a valuable application and makes the work seem somewhat incomplete. Therefore, I maintain my score.
Thank you for your feedback. We are pleased to see that most concerns have been addressed.
Again, we want to emphasize that the title and the topic of this paper are about image restoration.
As all the other reviewers have acknowledged, the paper demonstrated both solid theoretical analysis and strong experimental results on several image restoration tasks. Every paper has its focus, and image generation is simply NOT the task or focus of this work, even though the reviewer is particularly interested in that task. This is the same as requesting that an object detection paper perform generative learning. We find it unfair for the rejection to be based on the reviewer's personal interest in a different task, rather than an objective assessment of the work, particularly given the broad range of tasks already addressed in the paper.
The paper introduces the cosine autoencoder method for image restoration. CosAE encodes frequency coefficients to enable high spatial compression. Experiments on flexible-resolution super-resolution and blind image restoration demonstrate its effectiveness and generalization.
Strengths
- Nicely presented paper. The paper is well-written, and the figures and tables convey their information clearly.
- The idea is interesting. The novel observation is that Fourier space can enable an extreme compression ratio.
- Rich experiments across face image restoration and natural image restoration. The proposed method achieves strong performance across multiple image restoration benchmarks.
Weaknesses
- The paper lacks theoretical interpretations about the superiority of Fourier space over previous methods involving latent space.
- Can the method be applied to other image restoration tasks, such as image deblurring?
- The authors should add more visual comparisons with previous methods (not just LIIF).
Questions
- Does the method have the potential to construct a very low dimensional latent space to facilitate image generation?
Limitations
The authors have discussed the limitations carefully.
We appreciate the reviewer's recognition of our idea, paper presentation, and solid experiments, as well as their valuable feedback. Regarding the weaknesses and the questions, please see our responses below:
Q1: The paper lacks theoretical interpretations about the superiority of Fourier space over previous methods involving latent space.
We have detailed theoretical derivations in Sec. 3.1 to 3.5, as well as B.1 and B.2 in the supplementary material. As acknowledged by Reviewer Gqnq, these derivations, grounded in well-established Fourier theory, are easy to follow and well-justified. To summarize the theoretical part of the paper, our approach has the following advantages:
- Compact Representation (line 128): Unlike most existing architectures that preserve details by maintaining a wider bottleneck or using skip links, our narrow bottleneck representation is highly compact due to the inherently compressive nature of Fourier space, yet it still models both low and high-frequency details.
- Learnable Fourier Coefficients (Sec. 3.2): CosAE designs amplitude and phase to be learnable, allowing flexible and adaptive encoding of spatial information, e.g., via HCM, compared to fixed transformations in the latent spaces of conventional networks.
- Consistency and Robustness: Fourier-based representations are intrinsically less sensitive to variations in image resolution and degradation types. The harmonic functions used in CosAE ensure consistent performance across different image resolutions and degradation scenarios.
We will further discuss this to strengthen the paper.
Q2: Can the method be applied to other image restoration tasks, such as image deblurring?
Yes. Our model works favorably on common types of degraded images, including Gaussian and Poisson noise, generalized Gaussian blurring, and JPEG artifacts. This is because we explicitly synthesize these degradation operators to generate the training data (line 298). Since these operators mimic the most common degradations caused by camera sensors and the image compression process, our model works well on most real-world blurred images, even if the degradation is severe, as shown in Figure 2 (a) in the rebuttal PDF.
On the other hand, we did not include any motion blur kernels in the data synthesis pipeline, nor corresponding pairs of training data. However, we found that CosAE can still generalize well to mild motion blur, as shown in Figure 2 (b). The model performs less effectively on severe motion blur, as shown in Figure 2 (c). We anticipate that this can be resolved by further incorporating synthetic training images augmented with diverse blur kernels.
Q3: The authors should add more visual comparisons with previous methods (not just LIIF)
Thanks for the suggestion! We performed visual comparisons with LIIF for the FR-SR task because other approaches, such as LTE and ITNSR, do not perform well when the same combination of objectives (i.e., LPIPS and GAN losses) is added to their original models. For fair comparisons, we used models with only the MSE loss, as shown in the upper part of Table 1. However, the upsampled images predicted by these models lack details with the single MSE loss, making visual validation of detail preservation capability difficult.
However, we note that for other tasks such as blind restoration, we include the latest and, so far, best methods for visual comparison, such as GFPGAN, RestoreFormer, CodeFormer, as well as SCUNet. These provide comprehensive qualitative comparisons to highlight the strengths of our method relative to others.
Q4: Does the method have the potential to construct a very low dimensional latent space to facilitate image generation?
Yes. Although this is beyond the scope of this paper, as a plain auto-encoder already performs favorably for blind image restoration, CosAE is suitable for facilitating image generation. A direct way is to equip a KL or a VQ sampling block in the bottleneck. The advantage is obvious: a compact, low-dimensional latent space could potentially benefit high-resolution image generation. It could also benefit latent diffusion models for high-resolution image generation, or VLMs for compressive tokenization of high-resolution images.
However, we note that this exploration may be non-trivial. For instance, it raises questions such as: (a) whether the basis functions need to be conditioned on the sampling block, and (b) the best way to define the dictionaries for Amplitude and Phase across different channels (basis functions), etc. Therefore, we consider this a new topic for future work that is beyond the scope of this paper.
Thanks for the detailed rebuttal. The authors address most of my questions. I raise my score to 7.
This paper proposed CosAE, a novel autoencoder architecture integrated with the Fourier series for image restoration tasks. Unlike traditional autoencoders that use spatially compressed latent spaces, CosAE encodes images using frequency coefficients, which allows for significant spatial compression while preserving fine details. CosAE outperforms in continuous super-resolution and blind image restoration, with its ability to generalize across various types of image degradations.
Strengths
- Simple idea, but powerful and intuitive framework
- Paper is overall well-written
Weaknesses
- No major concern exists, please check Questions
Questions
- In line 232, GAN loss is adopted for LIIF for a fair comparison. Is the LPIPS loss also adopted, considering that the original LIIF does not include LPIPS loss?
- In lines 238 and 239, doesn't CosAE also require the parameter corresponding to the upsampling ratio? What is the fundamental difference between CosAE and other methods for blind super-resolution regarding the required hyperparameters?
- In line 283, while the authors describe that CosAE does not support a wider bottleneck, isn't increasing the number of channels playing a similar role (i.e., increasing the dimension of intermediate features)?
- Does citation [38] refer to blind face image restoration? This paper is cited several times, but it does not contain any content on face image restoration (referred to in Section 4.3) or dictionary learning (referred to in line 89). Please check if this citation is correct.
- In line 211, is the ratio an integer or a rational number?
Minor comments and typos
- Caption at Figure 5 STOA --> SOTA
Reference [38]: Guangming Liu, Xin Zhou, Jianmin Pang, Feng Yue, Wenfu Liu, and Junchao Wang. Codeformer: A GNN-nested transformer model for binary code similarity detection. Electronics, 12(7):1722, 2023.
Limitations
The authors adequately addressed the limitations.
We appreciate the reviewer's acknowledgment that our approach is simple but insightful. We also thank the reviewer for the valuable feedback. Regarding the questions, please see our responses below:
Q1: Is the LPIPS loss function also adopted, considering that the original LIIF does not include LPIPS loss?
Yes, we employ the same loss modules for LIIF, including LPIPS and GAN loss, to ensure complete alignment with CosAE.
Q2: Doesn't CosAE also require the parameter corresponding to the upsampling ratio? What is the fundamental difference between CosAE and other methods for blind super-resolution regarding the required hyperparameters?
Thank you for the good question! Most previous methods, such as LIIF, LTE, and ITNSR, require explicitly providing a "cell" map to the decoder, which corresponds to the upsampling ratio. In other words, their networks need this ratio as input guidance.
This does not impact super-resolving a low-resolution (LR) image where the ratio, or cell map, can be obtained by dividing the desired output size by the LR image size. However, for random LR images from the internet that may have been zoomed by an unknown factor, determining the cell map requires knowing the actual LR image size, which is difficult. In contrast, CosAE can still handle this by rescaling the image to the desired HR size as the network input; it does not need such an upsampling ratio as guidance.
We will further clarify it in the paper.
Q3: While CosAE does not support a wider bottleneck, isn't increasing the number of channels c playing a similar role?
Increasing the number of channels is not equivalent to increasing the size of the bottleneck. Typically, a larger bottleneck results from an encoder with fewer pooling operations, which better preserves spatial information. While increasing the number of channels in a narrow bottleneck expands the latent space's capacity, it does not compensate for the loss of spatial information due to pooling. In CosAE, increasing the number of channels means using more cosine basis functions, which is fundamentally different from preserving larger spatial resolutions.
Ideally, we would compare a wider version of CosAE with the narrower one we proposed. However, since CosAE does not support a wider bottleneck design, we instead compare LIIF-4x with LIIF-64x to demonstrate the impact of bottleneck size. The comparison shows that wider bottlenecks maintain more consistent performance across different upsampling ratios. Although our comparison isn't direct, it provides valuable insights: networks with narrower bottlenecks tend to perform more consistently regardless of the input resolution. We will further clarify it in the revised paper.
Q4: Does citation [38] refer to the blind face image restoration?
Thank you for pointing it out! It is a typo; the correct citation is the following. We will fix it in the revised paper.
Shangchen Zhou, Kelvin C.K. Chan, Chongyi Li, Chen Change Loy. "Towards robust blind face restoration with codebook lookup transformer." NeurIPS 2022.
Q5: In Line 211, is the ratio an integer or rational number?
The ratio can be a rational number; for example, non-integer upsampling ratios are used in our experiments (see Tables 1 and 2).
I thank the authors for the clear and detailed response. It effectively addresses my concerns and questions.
I have one additional question,
In Figure 11 on page 20, why does the image to the right of the LR input (with the red bounding box) seem weird? It seems the network super-resolves well for factors of 2, 3, 5, ..., but it fails at a factor of 1 (which is just identity mapping).
Thank you for your feedback. We are pleased to see that most concerns have been addressed.
This is a really good question. To briefly revisit the context and motivation: Sec. 4 (lines 209-218) notes that CosAE is trained with a varying T to enable flexible output ratios. Figure 11 investigates whether T is effectively learned to control the output resolution. Instead of using the SR method proposed in the paper, where the LR image is upscaled to the desired output size before inference, we explore an alternative approach: we upscale the image to a larger, fixed size in Figure 11, and then vary T across the range [T_min, T_max]. For faces, T varies from 4 to 32 (line 214). This "identity mapping" thus goes through two steps: (i) upscaling, and (ii) inference with the smallest T.
However, recall that T is essentially the range of the 2D cosine maps. A 4x4 grid is too small to accurately represent a valid 2D cosine function, as four discrete points are insufficient to form a recognizable cosine shape. As observed, when using T = 4 during inference on the upscaled image, noticeable artifacts are introduced. As T increases, the cosine functions become more accurately shaped and the artifacts diminish.
Again, we thank the reviewer for pointing out the phenomenon and will provide further clarification in the appendix. It’s also worth noting that when following the super-resolution inference pipeline introduced in the paper, these artifacts do not occur.
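The too-few-samples intuition above can be checked numerically. The following is our own illustrative sketch (names and setup are ours, not from the paper): with only four samples per axis, distinct frequencies collapse onto the same discrete values, so a small-T cosine map is ambiguous.

```python
import numpy as np

def cosine_map(T, f):
    """Sample cos(2*pi*f*t/T) on a 1D grid of T points, i.e., one axis
    of a TxT 2D cosine map. Purely illustrative."""
    t = np.arange(T)
    return np.cos(2 * np.pi * f * t / T)

# With T = 4 there are only four samples per axis: frequencies 1 and 3
# alias onto identical values, so the cosine shape is unrecognizable.
assert np.allclose(cosine_map(4, 1), cosine_map(4, 3))
# With T = 32, the same two frequencies are clearly distinct.
assert not np.allclose(cosine_map(32, 1), cosine_map(32, 3))
```

This mirrors the artifact pattern in Figure 11: the smallest T cannot form a valid cosine, while larger T values can.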
The paper introduces a novel autoencoder that represents an input image using a series of paired Fourier coefficients, representing amplitude and phase. Each pair corresponds to a specific frequency, with all frequencies being learnable parameters of the autoencoder, shared across all images. During decoding, the coefficients are used to construct 2D harmonic functions on a predefined grid. These functions are then input into a decoder network, which outputs the reconstructed image. Through experiments on super-resolution and blind image restoration, the authors demonstrate the effectiveness of the proposed cosine autoencoder in significantly compressing input images into compact representations while preserving both low and high-frequency details
Strengths
- The paper is well-written, with a clear and comprehensive presentation of the background and relevant literature.
- The derivations of the proposed autoencoder are easy to follow and well-justified, resulting in a simple yet elegant solution.
- The proposed encoder, grounded in well-established Fourier theory, is likely to have a significant impact on the community, providing a strong foundation for further research.
- The ability to construct harmonic functions from an image representation allows for visualization of the learned representation, aiding in analysis and interpretation.
- The authors provide sufficient experiments to demonstrate the effectiveness of their method compared to state-of-the-art approaches.
Weaknesses
- While most of the work is well-justified and grounded in well-established concepts, this does not extend to the decoding part, specifically the decoding network. The encoding part can be viewed (in a simplified manner) as a non-uniform Fourier transform, so it would be expected that the decoding part would mimic its inverse or at least be more structured than a standard network accepting harmonic images and outputting the recovered image. Although the decoding part is discussed in the paper, I would like to see both the discussion and the ablation study on this point expanded, providing more explanation as to why simple summation does not work in the authors' opinion.
- The authors state that the learned frequencies effectively capture both low and high frequencies without significant deviation from their initial uniform values. While I agree that the learned frequencies do capture both low and high frequencies, Figure 7 suggests they do deviate in practice from their initial values, as there are clear regions of high and low density of learned frequencies. Furthermore, if the frequencies do not deviate significantly from their initial uniform values, is it necessary to learn them? Does fixing them to a uniform grid lead to significantly degraded performance? Finally, does fixing them facilitate the decoding part, leading to a more structured decoding network?
Questions
Please address weaknesses.
Limitations
I find that the discussion and limitations in the supplementary material adequately address the major limitations of the proposed work.
We appreciate Reviewer Gqnq’s positive assessment of our work regarding the presentation, the technical contribution, and the soundness of our experiments. We address the questions and concerns in the following.
Q1: More justification for the decoding part.
The Fourier inverse transform is explicitly mimicked by CosAE through two modules: (i) the HCM module, which composes the learned amplitude, phase, and cosine functions exactly as in Eq. (2); and (ii) the Decoder, which maps the harmonics directly to RGB features.
If we exactly followed the classical Fourier inverse transform in Eq. (1), we would need to perform (i) summation of the harmonics, and (ii) mapping from the latent space to the RGB space via a Decoder. This is exactly what CosAE-FT does in the paper, as introduced in lines 273-276 and evaluated in Table 1 and Figure 12. The results show that our CosAE, which does not sum the harmonics, performs better. We also consistently observed that removing the summation yields much better results, from the very beginning of our exploration of CosAE.
To explain intuitively, the bottleneck space is a latent space very different from the RGB space. Simple summation in this space does not equate to summation in RGB and can cause high-frequency information loss. Instead, the Decoder aligns the summation operator with a learnable network, resulting in better performance.
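To make the harmonic-composition step concrete, here is a minimal sketch of building per-channel 2D cosine harmonics from amplitude and phase coefficients. This is our illustrative rendition of the general form A*cos(2*pi*(u*x + v*y)/T + phi); the exact formulation is Eq. (2) in the paper, and all names here are hypothetical:

```python
import numpy as np

def harmonic_maps(amp, phase, freqs, T):
    """Compose a stack of 2D cosine harmonics from learned coefficients.

    amp, phase: (C,) per-channel amplitude and phase coefficients.
    freqs: (C, 2) frequency pairs (u, v), one per channel.
    Returns a (C, T, T) stack of harmonic maps that a decoder network
    could consume (in CosAE they are NOT summed into one image).
    """
    y, x = np.mgrid[0:T, 0:T]
    u = freqs[:, 0, None, None]
    v = freqs[:, 1, None, None]
    arg = 2 * np.pi * (u * x + v * y) / T + phase[:, None, None]
    return amp[:, None, None] * np.cos(arg)

H = harmonic_maps(np.array([1.0, 0.5]),
                  np.array([0.0, np.pi / 2]),
                  np.array([[1.0, 0.0], [0.0, 3.0]]), T=32)
assert H.shape == (2, 32, 32)
# Channel 0 is an amplitude-1 cosine varying along x only.
assert np.allclose(H[0, 0], np.cos(2 * np.pi * np.arange(32) / 32))
```

A classical inverse transform would sum this stack over the channel axis; CosAE instead hands the whole stack to the learnable Decoder, which is the design choice justified in the answer above.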
Q2: Do the learned frequencies deviate from the initial values?
By saying "the frequencies are not deviating significantly", we mean the learned frequencies are still widely distributed across both low and high frequencies. This is advantageous because low frequencies occupy a larger area than other bandwidths in natural images. Previous models, like LTE [32], tend to converge to predominantly low frequencies. In contrast, our method maintains a more balanced frequency distribution, which is what we meant by "not significantly deviate".
We acknowledge that our presentation was not entirely accurate. The learned frequencies differ from the initialized ones, with mid-frequencies being less prominent compared to low and high frequencies. We will analyze this further and revise this part to provide a more accurate depiction.
Q3: Does fixing the frequencies to a uniform grid lead to significantly degraded performance? Should we learn the frequencies?
Within the settings of our paper, fixing the frequencies to a uniform grid results in mildly degraded performance. However, learning them remains necessary to achieve a more generalizable network design. We discuss both points below.
First, we retrained the model with the frequencies fixed to a uniform grid and reported the results in Table 1 and Figure 1 of the rebuttal PDF. As shown, CosAE-uniform underperforms CosAE on all the metrics. Figure 1 illustrates that images recovered by CosAE-uniform generally exhibit less high-frequency detail in the skin, hair, and teeth regions, compared to those restored by CosAE with learnable frequencies.
To explain this phenomenon: in our paper, CosAE regularizes the learned frequencies by the period T. Since we set T = 32, initializing the frequency pairs on the integer grid over [0, 15] resulted in 256 pairs, which correspond to 256 basis maps (channels). This uniform sampling works reasonably well in our setting because the frequencies are quite dense.
However, if one increases T (e.g., for higher-resolution training) or reduces the number of channels (e.g., for better model efficiency), uniform sampling will result in sparsely sampled frequencies, e.g., sampled as 0, 4, 8, ..., 32. Since frequencies in natural images are not uniformly distributed, not allowing them to adjust during training prevents effective modeling of these frequencies. Additionally, uniform sampling requires the number of basis maps to always be the square of an integer, which is overly restrictive. Thus, considering the superior performance and the generality of the network design, we suggest making the frequencies learnable. We will include the ablation in the revised paper.
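The dense-versus-sparse trade-off above can be sketched in a few lines. The following is our own illustration (function name and exact ranges are assumptions, not the paper's code); note the perfect-square restriction on the channel count, and that 81 channels is our inferred example for the 0, 4, 8, ..., 32 spacing:

```python
import numpy as np

def uniform_freq_grid(n_channels, f_max):
    """Uniformly sample (u, v) frequency pairs on a square grid over
    [0, f_max]. Requires n_channels to be a perfect square -- the
    restriction noted in the discussion above."""
    side = int(round(np.sqrt(n_channels)))
    if side * side != n_channels:
        raise ValueError("uniform grid needs a square channel count")
    f = np.linspace(0.0, f_max, side)
    u, v = np.meshgrid(f, f)
    return np.stack([u.ravel(), v.ravel()], axis=1)

# Dense case matching the paper's setting: 256 channels give a 16x16
# grid of integer frequencies 0..15.
g = uniform_freq_grid(256, 15)
assert g.shape == (256, 2)
assert np.allclose(np.unique(g[:, 0]), np.arange(16))

# Fewer channels over a wider range force sparse spacing: 81 channels
# over [0, 32] yield the stride-4 sampling 0, 4, 8, ..., 32.
g = uniform_freq_grid(81, 32)
assert np.allclose(np.unique(g[:, 0]), np.arange(0, 33, 4))
```

With a learnable initialization, the same 81 channels could migrate toward the frequency bands that actually dominate natural images instead of staying on this fixed stride-4 lattice.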
Q4: Does fixing them facilitate the decoding part, leading to a more structured decoding network?
No. As explained in Q1, we do not directly sum over the harmonics simply because it yields worse performance. It is irrelevant whether the frequencies are uniform or not.
I thank the authors for their thoughtful rebuttal and detailed explanations. I consider this work a valuable contribution, particularly appreciating its rigorous and well-justified methodology, and I hope to see more research that follows this standard. I believe my current score accurately reflects the merits of this work.
We thank the reviewers for recognizing the technical contribution of our work (Reviewer Gqnq), the high-quality presentation (Reviewers Gqnq, Yyg3, gTQ7), the simple and intuitive idea (Reviewers Yyg3, gTQ7), and acknowledging its potential impact (Reviewer Gqnq).
While we address the individual questions and concerns in detail below, we have included the following experiments and comparisons in the PDF:
(A) Additional ablation studies.
We include the CosAE-uniform model, whose frequencies are uniformly sampled on a grid and not learned during training. Both qualitative and quantitative results are shown in Figure 1 and Table 1.
Please refer to the answers for Q3 (Reviewer Gqnq) and Q5 (Reviewer Fx9w), for more discussion.
(B) Qualitative evaluation on image deblurring.
In Figure 2, we show how our method performs on blurry images, including (i) real, severely degraded face images from WebPhoto-test [38], and (ii) motion blur samples synthesized from CelebA and TextOCR, as publicly available on Kaggle.
Please refer to the answers for Q2 (Reviewer gTQ7) for more discussion.
The paper proposes an autoencoder with Fourier modulation of the feature space. Most reviewers appreciated the technical advancements and the contribution, leaning toward acceptance, although one of them tended to reject it. The raised concerns are the clarity of writing (principle of the algorithm, details of the decoder), lack of baselines (i.e., VAE or VQ-VAE), and the need to demonstrate scalability. The authors resolved most of the raised concerns, and one reviewer raised their score during the rebuttal phase, forming a consensus toward acceptance. One reviewer remains unsatisfied with the lack of generative experiments, but this AC believes the requested generative experiments are less relevant to the original content of the paper, which is designed for image restoration, as already stated in the title. With this consensus toward acceptance, I recommend acceptance of this paper.