Xformer: Hybrid X-Shaped Transformer for Image Denoising
Abstract
Reviews and Discussion
In the proposed architecture, the authors design two branches to conduct these interaction modes. Both branches use an encoder-decoder setup to capture multi-scale features. An essential addition to this structure is the "Bidirectional Connection Unit (BCU)", which couples the learned representations from the two branches and facilitates better information fusion.
The combined designs enable the Xformer to effectively model global information in both spatial and channel dimensions. Through extensive experiments, the authors demonstrate that the Xformer achieves state-of-the-art performance on both synthetic and real-world image denoising tasks, all while maintaining comparable model complexity.
Strengths
The paper stands out in its innovative approach to image denoising by proposing a hybrid X-shaped Transformer. The clear presentation, combined with extensive experiments and state-of-the-art results, underscores its significance in the domain. The novel components, especially the BCU, and the creative combination of spatial and channel-wise blocks, emphasize its originality. The potential impact of this work on the broader image processing community is considerable.
Weaknesses
The hybrid nature of the model, with its dual branches and BCU, might be challenging for some readers to grasp fully. While the description seems structured, visual aids might be lacking.
Questions
Please see the weakness.
Q1: The hybrid nature of the model, with its dual branches and BCU, might be challenging for some readers to grasp fully. While the description seems structured, visual aids might be lacking.
A1: Thanks for your suggestions. We have tried to introduce our work using simple and clear descriptions, but it might still be challenging for some readers. We will therefore consider adding visual aids in the supplementary material, and we will also work to improve the writing of the main paper.
Q2: Could you provide more insights into the design rationale behind the Bidirectional Connection Unit (BCU)? How does it ensure efficient information fusion between the two branches?
A2: Thanks for your questions. We will give our replies as follows.
(1) The BCU is proposed for information fusion between the two types of feature extraction branches. Theoretically, a gap exists between the spatial-wise and channel-wise self-attention mechanisms, which is why we adopt a dual-branch network. To construct an effective hybrid network, a simple concatenation operation cannot effectively exploit information from the different branches. We therefore propose the Bidirectional Connection Unit (BCU) to couple the deep features from their respective feed-forward processes for feature complementarity.
(2) The BCU provides effective refinement for different types of features. Its purpose is to enable each branch to capture useful feature information from the opposite branch, so the refinement of each feature type matters. A simple and effective approach is to use a lightweight module to perform this processing. Since the convolution operation is commonly used and meets our needs, we adopt it for the feature refinement. Because the BCU relies only on simple convolution operations, it can be considered efficient.
(3) The experimental results validate the effectiveness of the BCU. Under the premise that the design of the BCU is reasonable, we conduct extensive experiments to validate its effectiveness. We provide fair ablation study results demonstrating that the BCU brings a clear performance improvement. Besides, we analyze the effectiveness of the single-direction connection unit. Lastly, we provide visual comparisons of the output features of models with and without the BCU. All the experimental results support that the BCU provides efficient information fusion between the two branches.
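Structurally, the bidirectional coupling discussed above can be sketched as follows. This is our illustrative reading of the description (a depth-wise conv refining the spatial-branch features, a vanilla conv refining the channel-branch features); the function names and the additive fusion are assumptions, not the authors' actual code.

```python
# Toy structural sketch of the Bidirectional Connection Unit (BCU).
# refine_spatial / refine_channel stand in for the 3x3 depth-wise conv
# and 3x3 vanilla conv mentioned in the reply; the additive fusion is
# an illustrative assumption, not the authors' exact implementation.

def refine_spatial(feat):
    # placeholder for the 3x3 depth-wise conv on spatial-branch features
    return "dw(" + feat + ")"

def refine_channel(feat):
    # placeholder for the 3x3 vanilla conv on channel-branch features
    return "conv(" + feat + ")"

def bcu(spatial_feat, channel_feat):
    """Each branch receives the refined features of the opposite branch."""
    to_channel = refine_spatial(spatial_feat)   # spatial -> channel direction
    to_spatial = refine_channel(channel_feat)   # channel -> spatial direction
    return spatial_feat + "+" + to_spatial, channel_feat + "+" + to_channel

print(bcu("s", "c"))   # ('s+conv(c)', 'c+dw(s)')
```

The two placeholder functions correspond to the two single-direction variants (BCU-1 and BCU-2) examined in the ablation.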
The authors proposed a hybrid X-shaped transformer for high-quality image denoising. The idea is good. Specifically, the technique consists of spatial-wise transformer blocks (STB) and channel-wise transformer blocks (CTB) to model global information. The authors provide extensive ablation studies to support the effectiveness of each proposed component, such as the STB, CTB, and BCU. The main comparisons with recent methods further show that the proposed Xformer achieves better performance than others quantitatively and visually.
Strengths
The idea is good and novel. The proposed Xformer exploits stronger global representation of tokens with a hybrid implementation of spatial-wise and channel-wise Transformer.
The bidirectional connection unit (BCU) is proposed to couple the learned representations from two branches of Xformer. It is simple but effective according to the ablation.
The authors provide extensive ablations to show the effects of some key components, like STB, CTB, BCU, and shift operation.
The main comparisons are also extensive. The authors provide both Gaussian and real image denoising results, where the proposed Xformer achieves better average quantitative results and also shows better visual results.
The writing is good and the work is well-prepared. The overall paper framework is well-organized.
The authors provide more results and analyses in the supplementary file, where sample code is also available. The included code strengthens reproducibility.
Weaknesses
Some details are not clear enough for better understanding. How did the authors determine the final model when training is finished? For example, did the authors choose the model based on the best validation performance or just use the model from the final iteration?
In the ablation study, Table 1 (b), it seems that w/o BCU and BCU-1 are comparable, as are BCU-2 and the complete BCU. Please give more analyses of their differences.
Could the proposed method be used for other image restoration tasks? If so, please give some comments and discussion. Or is it specifically designed for image denoising?
The Xformer shows very good performance. Are there any failure cases for image denoising, i.e., cases where the proposed method also struggles to recover fine details?
Questions
Please refer to the Weaknesses for details.
Q1: Some details are not clear enough for better understanding. How did the authors determine the final model when training is finished? For example, did the authors choose the model based on the best validation performance or just use the model from the final iteration?
A1: Thanks for your questions. In our experiments, we choose the model from the final training iteration. By inspecting the loss and validation curves, we find that our model performs stably after the given number of training steps. We therefore keep the training protocol consistent and take the final-iteration checkpoint as the last model.
Q2: In the ablation study, Table 1 (b), it seems that w/o BCU and BCU-1 are comparable, as are BCU-2 and the complete BCU. Please give more analyses of their differences.
A2: Thanks for your comments. We will give our replies as follows.
(1) About the differences. The differences among them are given in the ablation study of the main paper; we add more details here. (a) The model without BCU, named w/o BCU, does not use the BCU in the constructed network. Note that the BCU has two independent fusion modules, which enable the bilateral interactions. (b) BCU-1 and BCU-2 denote models using a single-direction BCU, i.e., only the depth-wise convolution (DWConv) or the vanilla convolution (Conv) provides information transmission in a single direction. Specifically, BCU-1 denotes the model using DWConv to refine the deep features from the spatial-wise branch, and BCU-2 denotes the model using Conv to refine the deep features from the channel-wise branch. (c) Complete BCU means that we use the whole BCU in the network.
(2) More analyses. We analyze the results in the ablation study of the main paper. It is true that w/o BCU and BCU-1 are comparable, as are BCU-2 and the complete BCU. The experimental results demonstrate that the information flow from the channel-wise branch has a larger impact. Since the spatial-wise branch handles local patch interactions, it requires guidance from globally distributed feature information across channels.
Q3: Could the proposed method be used for other image restoration tasks? If so, please give some comments and discussion. Or is it specifically designed for image denoising?
A3: Thanks for your comments. Our Xformer is not specifically designed for image denoising. We have tried Xformer on the motion deblurring task, keeping the training settings the same as Restormer and using the same Xformer architecture without any change. We provide the comparative results with Restormer here; PSNR and SSIM are reported on four benchmarks. More details can be found in the supplementary materials.
| Motion Deblurring | Metrics | GoPro | HIDE | RealBlur-J | RealBlur-R |
|---|---|---|---|---|---|
| Restormer | PSNR(dB)/SSIM | 32.92/0.961 | 31.22/0.942 | 28.96/0.879 | 36.19/0.957 |
| Xformer (ours) | PSNR(dB)/SSIM | 33.06/0.962 | 31.19/0.942 | 29.02/0.883 | 36.19/0.957 |
As we can see, our Xformer achieves performance comparable to Restormer. On GoPro, Xformer obtains a PSNR gain of 0.14 dB over Restormer. This indicates that our Xformer can also perform well on other image restoration tasks.
Q4: The Xformer shows very good performance. Are there any failure cases for image denoising, i.e., cases where the proposed method also struggles to recover fine details?
A4: Thanks for your comments. As the extensive visual examples in the supplementary material show, our Xformer handles many challenging denoising cases. Of course, Xformer also has some failure cases for image denoising, which we are willing to discuss and share. Because of the limited space, we will add these contents to the revised supplementary material.
In this paper, the authors propose a hybrid X-shaped vision transformer for image denoising, named Xformer. Xformer has two branches with one containing the spatial-wise transformer blocks and the other containing the channel-wise transformer blocks. Between these two branches, there are the bidirectional connection units which couple the learned representations from these two branches. The experimental results show that the proposed method performs well on the synthetic image denoising dataset, but the method does not achieve the SOTA on the real-world image denoising dataset.
Strengths
- The X-shaped architecture is elegant and reasonable.
- The experimental results on the synthetic dataset are good.
- The overall paper writing is good.
Weaknesses
There are several places that are not intuitive or clear:
- The authors claim that "we make the last encoder involving STBs of two branches share parameters for the purpose of computational efficiency." However, it is unclear how much the performance will be influenced by the parameter-sharing strategy. It is also not clear why it is critical to share parameters for this place in the network. Why not share parameters in other places?
- The authors claim that "In short, the STB utilizes non-overlapping windows to generate shorter token sequences for the self-attention computation, which can enable the network to obtain fine-grained local patches interactions." Shorter token sequences compared to what? And why can shorter token sequences enable the network to obtain fine-grained local patch interactions?
- The authors claim that "In order to introduce contextualized information into self-attention computation, we choose to use 3×3 depth-wise convolution (Conv) following 1×1 Conv to generate query (Q), key (K), and value (V)." Why not directly use a vanilla 3×3 convolution?
- The authors claim that "Specifically, we use a 3×3 depth-wise convolution layer to refine the deep features from the spatial-wise branch for the purpose of saving computational consumption." Why not also use the 3×3 depth-wise convolution layer to refine the deep features from the channel-wise branch? Would it affect the performance compared to using the 3×3 vanilla convolution?
Questions
See the weakness section.
Q1: The experimental results show that the proposed method performs well on the synthetic image denoising dataset, but the method does not achieve the SOTA on the real-world image denoising dataset.
A1: Thanks for your comments. It is true that our Xformer performs well on the synthetic image denoising datasets. For real-world image denoising, our Xformer achieves much better performance on the DND dataset and slightly lower performance on the SIDD dataset. We give some explanations here.
(1) The commonly used testing datasets for the real-world image denoising task include SIDD and DND, and both deserve attention. In our experiments, the performance of Xformer on DND is much better than Restormer's: our Xformer obtains a 0.16 dB higher PSNR score than Restormer on DND.
(2) Furthermore, the performance of Xformer on SIDD is influenced by the characteristics of the data itself. The SIDD test set consists of 1280 image patches of size 256×256 pixels cropped from 40 high-resolution images. Having carefully inspected these 1280 patches, we find that they have very smooth textures and most come from uniformly colored backgrounds. Evaluation on such images has limited ability to reveal the strengths of our model, because Xformer's strength lies in combining local fine-grained features with global features across channels. Therefore, Xformer's results on SIDD are slightly below those of Restormer, which does not consider feature extraction along the spatial dimension.
Q2: The authors claim that "we make the last encoder involving STBs of two branches share parameters for the purpose of computational efficiency." However, it is unclear how much the performance will be influenced by the parameter-sharing strategy. It is also not clear why it is critical to share parameters for this place in the network. Why not share parameters in other places?
A2: Thanks for your questions. (1) The performance is not influenced by the parameter-sharing strategy. In our work, we use the parameter-sharing strategy in the last encoder of the U-Net-style network to reduce model complexity and improve computational efficiency. During our study, we validated that this strategy has no negative influence on model performance. We constructed two networks, one using the parameter-sharing strategy and one without it, and conducted a fair experimental comparison on Gaussian color image denoising with noise level 15 under identical settings. We show the experimental results here.
| PSNR results (dB) | Params (M) | FLOPs (G) | CBSD68 | Kodak24 | McMaster | Urban100 |
|---|---|---|---|---|---|---|
| w/o params-sharing | 37.18 | 40.07 | 34.42 | 35.38 | 35.65 | 35.20 |
| w/ params-sharing | 25.07 | 39.11 | 34.42 | 35.37 | 35.65 | 35.22 |
According to the results in the table above, the parameter-sharing strategy has little influence on model performance.
(2) About the position of the parameter-sharing strategy. We apply parameter sharing at the lowest level of the U-Net-style network. On the one hand, the module at the lowest level processes the feature map with the lowest resolution, so the computational complexity of the self-attention interaction is low. On the other hand, we replace the CTB with a parameter-shared STB because the input features at the lowest level already contain sufficient global information; the STB can therefore also capture global features by conducting self-attention interactions in the spatial dimension. In short, the last encoder of the U-Net-style network is the best place to apply the parameter-sharing strategy.
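A quick back-of-the-envelope calculation illustrates why the lowest encoder level is the cheap place for global spatial attention. The 256×256 input size and four encoder levels are illustrative assumptions, not necessarily the paper's exact configuration:

```python
# Back-of-the-envelope: spatial token counts per encoder level of a
# U-Net-style network with 2x downsampling per level. The 256x256
# input is an illustrative assumption; per-level self-attention cost
# grows with n^2, where n is the number of spatial tokens.

H0, W0 = 256, 256
for level in range(4):                 # encoder levels 0..3
    h, w = H0 >> level, W0 >> level    # spatial size after `level` halvings
    n = h * w                          # number of spatial tokens
    print(level, n, n * n)

# At the lowest level (32x32 -> 1024 tokens), even global spatial
# attention is ~4096x cheaper than at full resolution, which matches
# the reply's argument for placing the parameter-shared STB stage there.
```
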
Q3: The authors claim that "In short, the STB utilizes non-overlapping windows to generate shorter token sequences for the self-attention computation, which can enable the network to obtain fine-grained local patches interactions." Shorter token sequences compared to what? And why can shorter token sequences enable the network to obtain fine-grained local patch interactions?
A3: Thanks for your questions. We give our explanations as follows.
(1) About the shorter token sequences. We describe the token sequences as shorter in comparison with the token sequence produced by the standard self-attention block of the original vision Transformer. Since we utilize non-overlapping windows to extract tokens, each token sequence covers only a local window rather than the whole image area, and is therefore shorter.
(2) About fine-grained local patch interactions. The fine-grained interaction does not come from the sequence length itself. The network obtains fine-grained local patch interactions because it extracts tokens along the spatial dimension: each token represents a local image patch, so the interactions among tokens can be viewed as interactions among local patches. From this point of view, the proposed STB enables the network to obtain fine-grained local patch interactions.
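The sequence-length comparison above can be made concrete with simple arithmetic. The 128×128 feature size and window size 8 are illustrative assumptions, not the paper's exact settings:

```python
# Sequence lengths: global attention vs. non-overlapping window attention.
# The 128x128 feature size and window size 8 are illustrative assumptions.

H, W, win = 128, 128, 8

global_tokens = H * W                  # one sequence over the whole map
window_tokens = win * win              # one sequence per 8x8 window
num_windows = (H // win) * (W // win)  # windows processed independently

# Per-sequence attention cost is quadratic in length: 16384^2 vs. 64^2
# (the latter repeated 256 times), a large overall reduction.
print(global_tokens, window_tokens, num_windows)
```
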
Q4: The authors claim that "In order to introduce contextualized information into self-attention computation, we choose to use 3×3 depth-wise convolution (Conv) following 1×1 Conv to generate query (Q), key (K), and value (V)." Why not directly use a vanilla 3×3 convolution?
A4: Thanks for your comments. We use a 3×3 depth-wise Conv following a 1×1 Conv because we want to reduce the computational cost and model parameters. In practice, this combination is known as a depthwise separable convolution and has been widely used in recent works. Therefore, we do not use a vanilla 3×3 convolution.
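The parameter savings of the depthwise-separable design can be illustrated with a quick count. The channel width C = 48 is an assumed, illustrative value:

```python
# Parameter count for a projection layer: vanilla 3x3 conv vs. the
# depthwise-separable form (1x1 pointwise conv + 3x3 depth-wise conv).
# C = 48 channels is an illustrative assumption, not the paper's setting.

C, k = 48, 3

vanilla_params = C * C * k * k         # dense 3x3 conv: 20736 weights
separable_params = C * C + C * k * k   # 1x1 pointwise + 3x3 depthwise: 2736

print(vanilla_params, separable_params)   # roughly 7.6x fewer parameters
```
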
Q5: The authors claim that "Specifically, we use a 3×3 depth-wise convolution layer to refine the deep features from the spatial-wise branch for the purpose of saving computational consumption." Why not also use the 3×3 depth-wise convolution layer to refine the deep features from the channel-wise branch? Would it affect the performance compared to using the 3×3 vanilla convolution?
A5: Thanks for your questions. We give our replies as follows.
(1) Firstly, since the spatial information of the features passed to the STB is important, we use vanilla convolutions to process the deep features from the channel-wise branch. The vanilla convolution has more learnable parameters and can thus extract more spatial information, so we choose the 3×3 vanilla convolution.
(2) Secondly, we provide specific experimental results to demonstrate the necessity of the 3×3 vanilla convolution. We train a new network, named Xformer-change, by replacing the 3×3 vanilla convolution in the BCU with a 3×3 depth-wise convolution, on the Gaussian color denoising task with noise level 15, and compare it with our current Xformer. We keep the experimental settings identical for the two models; for this ablation study, we train them for 150k iterations with batch size 16. We report the compared results here.
| PSNR results (dB) | Params (M) | FLOPs (G) | CBSD68 | Kodak24 | McMaster | Urban100 |
|---|---|---|---|---|---|---|
| Xformer-change | 24.71 | 40.90 | 34.27 | 35.15 | 35.25 | 34.76 |
| Xformer | 25.23 | 42.24 | 34.38 | 35.41 | 35.55 | 35.08 |
As we can see, Xformer-change, which does not use the 3×3 vanilla convolution, has noticeably lower performance. This demonstrates that using the 3×3 vanilla convolution in the BCU is necessary.
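A toy example illustrates the underlying reason a vanilla (dense over channels) convolution matters here: a depth-wise convolution cannot mix information across channels. The numbers and weights below are purely illustrative:

```python
# Why a dense (vanilla) conv matters for refining channel-branch features:
# a depth-wise conv scales each channel independently and cannot mix
# channels. Toy single-pixel, 3-channel example with made-up weights.

feat = [1.0, 2.0, 3.0]                 # one pixel, three channels

# depth-wise: one weight per channel, no cross-channel interaction
dw_weights = [0.5, 0.5, 0.5]
dw_out = [w * x for w, x in zip(dw_weights, feat)]

# dense conv (shown as 1x1 for simplicity): a full CxC matrix mixes channels
conv_weights = [[0.0, 1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0]]
conv_out = [sum(w * x for w, x in zip(row, feat)) for row in conv_weights]

print(dw_out)    # [0.5, 1.0, 1.5] -- channels scaled independently
print(conv_out)  # [2.0, 1.0, 3.0] -- information exchanged across channels
```
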
This paper proposes Xformer for image denoising, which aims to combine SwinIR, which utilizes spatial-wise self-attention, and Restormer, which designs channel-wise self-attention for image denoising, thereby leveraging the advantages of both methods. The designs, including the dual-branch architecture and the bidirectional connection unit for bilateral interactions between the two branches, are straightforward. The paper is well motivated, whilst the technical novelty is incremental, especially compared to SwinIR and Restormer.
Strengths
- This paper is motivated well. It is reasonable to combine the advantages of both spatial-wise self-attention and channel-wise self-attention to capture both the local fine-grained features and global features across channels.
- The paper is organized well and easy to follow despite some typos.
Weaknesses
- The technical novelty is incremental. There are two core designs: the dual-branch architecture and the bilateral interactions between two branches, which are both typical designs and have been extensively explored in other work. Thus, the technical novelty is limited, especially compared to SwinIR and Restormer.
- Compared to Restormer, Xformer has limited performance improvement, especially on real image denoising scenarios, which are more important for evaluation.
Questions
It is suggested to investigate both theoretically and experimentally what kind of denoising scenarios (noises) are STB and CTB suitable for, respectively.
Q2: Compared to Restormer, Xformer has limited performance improvement, especially on real image denoising scenarios which is more important for evaluation.
A2: Thanks for your comments. We propose Xformer and evaluate its performance on the synthetic and real-world image denoising tasks. Our replies are as follows.
(1) About the performance on the synthetic datasets. (a) Compared to Restormer, Xformer achieves clearly better performance on the synthetic image denoising tasks. For example, our Xformer obtains a 0.34 dB higher PSNR score than Restormer on Urban100 for Gaussian color image denoising with noise level 50. (b) Besides, we provide extensive visual results in the main paper and supplementary material to validate the superiority of Xformer. Therefore, Xformer has an obvious performance improvement on the synthetic image denoising tasks.
(2) About the performance on real image denoising. (a) The commonly used testing datasets for the real image denoising task include SIDD and DND, and both deserve attention. In fact, the performance of Xformer on DND is much better than Restormer's: our Xformer obtains a 0.16 dB higher PSNR score than Restormer on DND. (b) Furthermore, the performance of Xformer on SIDD is influenced by the characteristics of the data itself. The SIDD test set consists of 1280 image patches of size 256×256 pixels cropped from 40 high-resolution images. Having carefully inspected these patches, we find that they have very smooth textures and most come from uniformly colored backgrounds. Evaluation on such images has limited ability to reveal the strengths of our model, because Xformer's strength lies in combining local fine-grained features with global features across channels. Therefore, Xformer's results on SIDD are slightly below those of Restormer, which does not consider feature extraction along the spatial dimension.
(3) Conclusion. We have explained the reasons for the limited performance of Xformer on the SIDD dataset. According to the analyses above, we confirm that our Xformer achieves a clear performance improvement over Restormer.
Q3: It is suggested to investigate both theoretically and experimentally what kind of denoising scenarios (noises) are STB and CTB suitable for, respectively.
A3: Thanks for your suggestions. In our paper, we discuss the performance of different network structures using the STB or the CTB; the corresponding results can be found in the ablation study. (a) Theoretically, the STB is suited to denoising scenarios involving spatially uniformly distributed noise, such as synthetic Gaussian noise, while the CTB is suited to scenarios involving globally consistent noise distributions. (b) Experimentally, we find that the network using the STB outperforms the one using the CTB on the Gaussian color image denoising task. In the future, we will conduct more experiments to investigate the suitable denoising scenarios for different Transformer blocks.
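The structural difference between the two blocks can be made concrete by comparing attention-map sizes: the STB attends over spatial tokens, the CTB over channels. The dimensions below are illustrative assumptions, not the actual Xformer configuration:

```python
# Size of the self-attention map for spatial-wise (STB-style) vs.
# channel-wise (CTB-style, Restormer-like "transposed") attention.
# The dimensions are illustrative assumptions.

H, W, C = 64, 64, 48           # feature-map height, width, channels

# Spatial-wise attention: one token per pixel -> an N x N attention map.
N = H * W
spatial_attn_entries = N * N   # 4096 x 4096 = 16,777,216 entries

# Channel-wise attention: tokens are channels -> a C x C attention map.
channel_attn_entries = C * C   # 48 x 48 = 2,304 entries

# The channel-wise map is global over space but tiny, while the spatial
# map scales quadratically with resolution -- hence windowing in the STB.
print(spatial_attn_entries, channel_attn_entries)
```
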
Thanks for the response, which addresses most of my concerns.
Q1: The technical novelty is incremental. There are two core designs: the dual-branch architecture and the bilateral interactions between two branches, which are both typical designs and have been extensively explored in other work. Thus, the technical novelty is limited, especially compared to SwinIR and Restormer.
A1: Thanks for your comments. We consider your concern carefully and give our replies as follows.
(1) The technical novelty is sound. (a) First, the technical novelty should be judged in the context of the research motivation and background. According to our survey, building a dual-branch network with different Transformer blocks is not common in the low-level vision field. (b) Besides, our motivation is not to validate the effectiveness of dual-branch networks per se; we want to explore stronger representation learning by utilizing different types of self-attention mechanisms in a unified neural network. Although the dual-branch architecture and bilateral interactions are not new in themselves, they have not previously been investigated as a way to exploit the advantages of both spatial-wise and channel-wise self-attention. Therefore, our core designs are novel.
(2) Compared to SwinIR and Restormer. (a) We are willing to discuss the differences between our method and these two works in light of the research motivation; it is important to understand that our research subjects are the core self-attention mechanisms, not the specific modules proposed in SwinIR and Restormer. (b) Our starting point is not to improve existing Transformer blocks for better performance; in other words, we do not focus on which self-attention module from previous works to use. We consider improvements at the level of feature capture and interaction. Since existing Transformer-based denoising methods perform feature extraction along a single dimension, spatial or channel-wise, our method creatively uses a concurrent network to model global information from both dimensions, and our proposed BCU is shown to effectively fuse the two types of features. Extensive experimental results validate that our method has promising denoising performance. Here, we provide comparative results with SwinIR and Restormer on Gaussian color image denoising.
| PSNR results (dB) | CBSD68 (15/25/50) | Kodak24 (15/25/50) | McMaster (15/25/50) | Urban100 (15/25/50) |
|---|---|---|---|---|
| SwinIR | 34.42/31.78/28.56 | 35.34/32.89/29.79 | 35.61/33.20/30.22 | 35.13/32.90/29.82 |
| Restormer | 34.40/31.79/28.60 | 35.35/32.93/29.87 | 35.61/33.34/30.30 | 35.13/32.96/30.02 |
| Xformer (ours) | 34.43/31.82/28.63 | 35.39/32.99/29.94 | 35.68/33.44/30.38 | 35.29/33.21/30.36 |
As we can see, our Xformer achieves the best performance on all benchmark datasets across three noise levels, validating that the technical novelty brings Xformer an obvious performance improvement.
(3) Conclusion. In our work, the core issues we address are the extraction and fusion of different types of features from the spatial and channel dimensions. We therefore design Xformer to improve global information modeling from both dimensions and conduct extensive experiments demonstrating its promising denoising performance. From this perspective, our work does not lack novelty.
We thank all reviewers for their precious review time and valuable comments. We would like to give a brief summary here.
What we have done during the rebuttal phase.
(1) We give detailed replies to the concerns of Reviewer V2SV. The technical novelty of Xformer is consistent with its motivation and, we argue, sound. We provide detailed statements about the novelty of our method, along with theoretical analyses and strong experimental results demonstrating significant performance improvement.
(2) We explain some unclear parts of our paper and provide more details. Besides, we provide additional experimental results on another image restoration task and analyze some failure cases of our Xformer.
(3) We reply to all the concerns given by Reviewer wiTn. We explain that our method can perform well on different denoising tasks. Besides, we explain some of the confusion and analyze the necessity of some designs of the proposed method. We provide corresponding experimental results to support our viewpoint. Analyses and experimental verifications can validate that our proposed network is designed well.
(4) We take some valuable suggestions proposed by reviewers, e.g., investigating the suitable scenarios for STB and CTB, adding visual aids, and so on. These suggestions can help us extend our work well.
The paper introduces Xformer, a novel hybrid X-shaped transformer for high-quality image denoising. Xformer integrates spatial-wise transformer blocks (STB) and channel-wise transformer blocks (CTB) to enhance the representation of global information. Through ablation studies and comparisons with recent methods, the authors demonstrate the effectiveness of their design, particularly the Bidirectional Connection Unit (BCU), in both synthetic and real-world denoising tasks. The BCU is highlighted as a mechanism to couple the learned representations from the dual branches, aiming to achieve state-of-the-art performance.
(a) Strengths of the Paper:
The novel design combines spatial and channel-wise attention effectively, capturing fine-grained features and global contexts. The structure and rationale behind the XFormer are clear, with the BCU being a standout feature for its innovative approach to information fusion. Extensive ablation studies and empirical results show the model's superior performance over existing methods. The writing and presentation of the paper are of high quality, with the reviewers praising the paper's clarity and organization. The submission includes supplementary material, like sample code, which improves transparency and reproducibility.
(b) Weaknesses and Potential Omissions:
Reviewers V2SV and WiTn express concerns about the incremental nature of technical novelty and the clarity of certain methodological aspects. There is a call for more detailed analysis on the generality of the model for other image restoration tasks, as indicated by Reviewer GouV, and a better understanding of design choices such as parameter-sharing strategies, as pointed out by Reviewer WiTn. Additionally, the paper could benefit from further discussions on the BCU design rationale, as suggested by Reviewer Md3A.
Considering the rebuttal and the detailed positive feedback from Reviewer Md3A and Reviewer GouV, along with the constructive criticism provided by Reviewer V2SV and Reviewer WiTn, the paper appears to have strong motivation, a solid foundation, and robust experimental results. While there are areas for improvement, particularly in clarifying the incremental aspects of the design and expanding the discussion of applicability to other tasks as indicated by Reviewer GouV and Reviewer WiTn, the strengths noted by all reviewers suggest a potential for acceptance.
Why Not a Higher Score
- Incremental Novelty: The technical novelty of the paper is recognized but seen as incremental. Reviewers V2SV and WiTn have noted that while the paper's approach is valid, it does not represent a significant gain over existing methods. For a spotlight or oral presentation, a higher degree of innovation is typically expected.
- Performance on Benchmark Datasets: While the model performs well, Reviewer V2SV notes that it does not achieve state-of-the-art results on all benchmarks, especially in real-world scenarios, which are crucial for establishing the model's practical value.
Why Not a Lower Score
The decision not to assign a lower score and reject the submission can be justified by several key points highlighted across the reviewers' feedback, for example, its innovative elements and strong foundation, its quality of presentation and extensive validation, and its potential for related fields and future work.
All reviewers have provided constructive feedback that aims to improve the paper rather than dismiss its contributions. Two reviewers gave acceptance ratings, with positive comments on the paper's contribution and presentation, reflecting a consensus that the paper's merits outweigh its limitations.
Accept (poster)