Prune and Repaint: Content-Aware Image Retargeting for any Ratio
Abstract
Reviews and Discussion
The paper presents an innovative method for image retargeting. It addresses two core challenges in image retargeting (preserving the main information and avoiding artifacts on key objects) simultaneously by carefully devising a content-aware seam-carving method and an adaptive repainting method respectively. Both quantitative and qualitative results demonstrate the advantages of the method over others.
Strengths
- Significant improvement over prior works on both preserving semantic completeness and maintaining the harmony of the retargeted contents.
- The paper is well-motivated and the method presented is well-structured.
- The proposed idea is exciting. The semantic-guided local repainting solution is very reasonable and inspiring for future researchers on this topic. The two key designs are effective.
- The paper is well organized and written, with clear explanations of the method and effective visual representations.
Weaknesses
- Detailed explanations of some key concepts are missing, and adding them would further enhance the clarity of this paper. For example, in Line 142, the process for computing x_0 should be presented. Is it the saliency center of the entire image or the saliency center of a row or column?
- Including recent deep learning works on image editing/generation for comparison, especially global generation methods, would better demonstrate the advantages of this paper. As stated, the proposed method selectively regenerates the abrupt pixels and preserves foreground consistency and local smoothness, whereas previous global generation methods do not maintain foreground consistency with the original image. Although Figure 1 shows an example, more comparisons would be highly beneficial to verify these advantages.
- The authors perform subjective evaluation by designing a User Study Metric. However, it lacks explanations with corresponding visual samples to show the complementarity among the four different evaluation metrics in Table 2. For example, the authors should include some samples with different degrees of deformation or distortion for comparison and explanation.
- The ARRD module in Figure 2 does not clearly show how the choice between inpainting and outpainting is determined. Expanding the mask map seems to be an optional step that only applies in special cases where the foreground regions are large and direct cropping would lose too much object content. However, in Figure 2, it appears to be always required for any input and ratio.
Questions
- From Figure 6, in addition to addressing the drawback of BR, which is unable to handle discontinuities in foreground pixels, I also notice that the inclusion of AR can more faithfully restore the depth and proportion of the original image (such as the size relationship between the cylinder and the background building in architectural images). Can the authors briefly discuss the reasons behind this advantage?
- I am very interested in the performance on videos. Intuitively, the proposed method will work well on each frame. However, inter-frame smoothness will be a new problem for videos. Without explicit inter-frame consistency regularization, will the proposed model result in cross-frame mutation and jitter?
- In line 266, what is the detailed difference between BR and AR? Both of them regenerate background regions, so why is BR unable to address discontinuities in foreground pixels?
- From Table 1, I notice that the performance for different target ratios varies a lot. Please explain this difference.
- In line 178, how is the threshold η decided?
Limitations
Limitations and analyses are included by the authors.
Weakness 1: Some detailed concepts are missing, such as in line 142.
Response: Thanks for pointing this out. The center coordinate x_0 is the saliency center of the entire image. Its height coordinate is the average height of all salient pixels, $\bar{h} = \frac{1}{N}\sum_{i=0}^{N-1} h_i$, where $h_i$ is the height of the $i$-th salient pixel and $N$ is the total number of salient pixels.
The width coordinate is computed in the same way.
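The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the saliency map has already been binarized into a 0/1 mask; the function name `saliency_centroid` is hypothetical and not taken from the authors' code.

```python
def saliency_centroid(mask):
    """Return (height, width) centroid of the salient pixels in a 0/1 mask.

    Hypothetical helper: mask[i][j] == 1 marks a salient pixel at row i,
    column j. The centroid is the mean row index and mean column index
    over all N salient pixels.
    """
    coords = [(i, j) for i, row in enumerate(mask)
              for j, v in enumerate(row) if v]
    if not coords:
        raise ValueError("mask contains no salient pixels")
    n = len(coords)
    h_mean = sum(i for i, _ in coords) / n
    w_mean = sum(j for _, j in coords) / n
    return h_mean, w_mean

square = [[0, 0, 0, 0],
          [0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
print(saliency_centroid(square))  # (1.5, 1.5)
```

Because every salient pixel is weighted equally, a large salient object pulls the centroid more strongly than a small one.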
Weakness 2: Lack of comparison with recent image edit/generation methods.
Response: Thanks for your suggestion. Comparisons with full-image repainting (FR) and InGAN are included in Tab. R1, Tab. R2, Fig. R1 and Fig. R2 of the PDF. Both the qualitative and quantitative results show that our method significantly outperforms these approaches, with CSC preserving the key information and AR enhancing local smoothness.
Weakness 3: Lack of explanations with corresponding visual samples to show the complementarity among the four different evaluation metrics in Table 2.
Response: An example illustrating different degrees of content loss, deformation, distortion and aesthetics is presented in Fig. R3 of the PDF. Although the overall deformation in the third and fourth images is quite noticeable, the third image is locally smooth, so it counts as deformation rather than distortion, whereas the fourth image exhibits both deformation and distortion. Further examples and the scoring criteria will be included in the appendix.
Weakness 4: The ARRD module in Figure 2 is not very clear.
Response: Thanks for pointing out this issue. Expanding the mask map is an optional step that is performed only when the retargeted image cannot accommodate the salient objects. We will revise the flowchart with a dashed box and a legend to avoid this confusion.
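A minimal sketch of when the optional expansion would trigger, assuming a width-only comparison between the salient width and the target width. The `tolerable_loss` parameter is a hypothetical stand-in for the tolerable saliency loss ratio mentioned elsewhere in the rebuttal; the exact rule used by ARRD is not specified here.

```python
def choose_repaint_mode(saliency_width, target_width, tolerable_loss=0.0):
    """Return 'outpaint' if the target canvas cannot hold the salient region.

    Hypothetical rule: the retargeted image may discard at most a fraction
    `tolerable_loss` of the salient width. If even the reduced salient width
    exceeds the target width, the mask is expanded and the border is
    outpainted; otherwise only local inpainting is needed.
    """
    if not 0.0 <= tolerable_loss < 1.0:
        raise ValueError("tolerable_loss must be in [0, 1)")
    required = saliency_width * (1.0 - tolerable_loss)
    return "outpaint" if required > target_width else "inpaint"

print(choose_repaint_mode(300, 400))        # inpaint
print(choose_repaint_mode(500, 400))        # outpaint
print(choose_repaint_mode(500, 400, 0.25))  # inpaint (375 <= 400)
```

Making the tolerance explicit shows why expansion is the exception: it only fires when the salient content is wider than the target even after the allowed loss.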
Question 1: Why does AR restore the depth and proportion of the original image better than BR?
Response: The expansion operation of AR reduces the number of seams that would otherwise be removed from the foreground, which minimizes pixel displacement and alleviates misalignment of foreground objects. This better preserves the relative positions and depths of foreground objects.
Question 2: Performance on the video.
Response: An example of video retargeting is presented in Fig. R4 of the PDF. As expected, because PruneRepaint has no inter-frame consistency constraint, the objects it generates in the background area are inconsistent across frames.
Question 3: The detailed difference between BR and AR in line 266.
Response: To achieve background harmony and foreground preservation, BR identifies and repaints the background areas based on the saliency map. Since it leaves the foreground untouched, it cannot repair discontinuities in foreground pixels. In contrast, AR targets harmony across the entire image: it adaptively identifies the areas where CSC removed more seams, thereby mitigating discontinuities in both the foreground and the background.
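One way to picture "areas where CSC removed more seams" is to track, for each retargeted pixel, the column it came from in the original image; a jump between neighbouring pixels marks removed seams. The sketch below is a hypothetical illustration of that idea, not AR's actual implementation; `kept_cols`, `min_gap` and `radius` are assumed names.

```python
def displacement_mask(kept_cols, min_gap=2, radius=1):
    """Mark retargeted pixels adjacent to removed seams (hypothetical sketch).

    kept_cols[r][c] is the column index in the ORIGINAL image of the pixel
    that ended up at column c of row r after seam removal. A jump of
    `min_gap` or more between neighbours means at least `min_gap - 1` seams
    were removed there. Pixels within `radius` columns of such a jump are
    marked 1 (to be repainted); all others stay 0.
    """
    mask = []
    for row in kept_cols:
        width = len(row)
        m = [0] * width
        for c in range(1, width):
            if row[c] - row[c - 1] >= min_gap:
                # The discontinuity lies between columns c-1 and c.
                for k in range(max(0, c - radius), min(width, c + radius)):
                    m[k] = 1
        mask.append(m)
    return mask

kept = [[0, 1, 2, 5, 6, 7],   # two seams removed between columns 2 and 3
        [0, 1, 2, 3, 4, 7]]   # two seams removed between columns 4 and 5
print(displacement_mask(kept))  # [[0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]]
```

Raising `min_gap` restricts repainting to places where many seams were removed, which matches the intuition that those regions are the most likely to show discontinuities.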
Question 4: Explain why the performance for different target ratios varies a lot.
Response: Typically, the larger the difference between the aspect ratios of the retargeted and original images, the greater the loss of image saliency. We analyzed the RetargetMe dataset and found that the average aspect ratio is 0.7405, which among the target ratios is closest to 9/16 (0.5625), so the saliency loss is smallest at this ratio. As the target aspect ratio deviates further from the original one, the Saliency Discard Ratio (SDR) increases.
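The ratio comparison can be checked with a few lines of arithmetic, assuming aspect ratio means width divided by height (consistent with 9/16 = 0.5625 above). Both an absolute distance and a log-scale distance pick 9:16 for 0.7405; the `log_scale` option is an illustrative addition, not something stated in the rebuttal.

```python
import math

# Target ratios from the paper, expressed as width / height (assumption).
TARGETS = {"16:9": 16 / 9, "4:3": 4 / 3, "1:1": 1.0, "9:16": 9 / 16}

def closest_target(aspect, targets=TARGETS, log_scale=False):
    """Return the name of the target ratio nearest to `aspect`."""
    def distance(name):
        if log_scale:
            # Treats stretching and squeezing symmetrically: |log(a / t)|.
            return abs(math.log(aspect / targets[name]))
        return abs(targets[name] - aspect)
    return min(targets, key=distance)

print(closest_target(0.7405))                  # 9:16  (|0.7405 - 0.5625| = 0.178)
print(closest_target(0.7405, log_scale=True))  # 9:16
```

For comparison, the next-closest target, 1:1, is 0.2595 away on the absolute scale.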
Question 5: How to decide the threshold η in line 172?
Response: The threshold is the mean value of the entire saliency map.
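A minimal sketch of this thresholding rule. It assumes a pixel is salient when it is strictly greater than η; the rebuttal does not say how pixels exactly equal to the mean are treated.

```python
def binarize_by_mean(saliency):
    """Threshold a 2-D saliency map at its global mean eta.

    Assumption: pixels strictly above eta are salient (1), all others 0.
    Returns (eta, mask).
    """
    values = [v for row in saliency for v in row]
    eta = sum(values) / len(values)
    mask = [[1 if v > eta else 0 for v in row] for row in saliency]
    return eta, mask

eta, mask = binarize_by_mean([[0, 4], [1, 3]])
print(eta, mask)  # 2.0 [[0, 1], [0, 1]]
```

Using the map's own mean makes η adapt per image, so no hand-tuned constant is needed.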
Thank you for the detailed responses. I have reviewed the rebuttal and other reviews. The authors have sufficiently addressed all my concerns, and I am inclined to uphold the initial score.
This paper proposes a new image retargeting framework that prunes the background and repaints local connections. It improves traditional seam-carving with semantic guidance to make the pruning content-aware, avoiding deformation and loss of important objects. Meanwhile, the authors introduce an adaptive repainting module using an image-conditioned diffusion model to selectively inpaint or outpaint local regions and achieve local smoothness. The authors also design two evaluation metrics: the Saliency Discard Ratio (SDR) and a user study metric. The proposed method shows superior performance over other methods and generalizes to varying target ratios.
Strengths
Good motivation and key problems are clearly summarized.
The method is technically sound and novel. I am very interested in this task and like the proposed idea. It appears to be a pioneering work in the image retargeting community using diffusion models.
Impressive results and a large improvement over previous works in preserving object completeness and coherence, with better generalization.
Weaknesses
As pointed out by the authors, the inference speed has a large room to be improved. I suggest the authors adopt some accelerated diffusion models to improve it.
The authors argue that their method can work well on any ratio, such as 1:1 and 4:3, yet these visual results are not illustrated.
Minor writing or grammar mistakes, such as missing the point in line 70, and missing “the” in the caption of Figure 3.
How does the text prompt work? I am interested in its contributions to the retargeting results.
Questions
In equation (5), what does W_s denote? Is it the same as W in equation (2), meaning the width of the input image?
For the proposed metric SDR, the authors use the width change of the saliency map to measure semantic completeness. How is this measurement implemented? If my understanding is correct, is it computed by summing the width of all lines in the saliency map and taking the maximum as the width of the saliency map? Why not use salient object detection metrics such as F-measure, or IoU, for evaluation?
In Table 2, I think the content completeness and aesthetic scores are more important than the other two scores, as they measure the two key aspects of image retargeting. So I wonder whether there is a better strategy for obtaining an overall score than simply averaging them.
Limitations
The authors explicitly discussed the limitations of this work.
Weakness 1: Inference speed to be improved.
Response: Thanks for your suggestion. We will employ accelerated diffusion models to improve the inference speed.
Weakness 2: Results on ratio 1:1 and 4:3 are not visualized.
Response: Thanks for pointing out this issue. We visualize results only for the two extreme ratios because they demonstrate the advantages of PruneRepaint most clearly; the other ratios are less difficult. Due to the page limit, visualized results for the other two aspect ratios will be included in the appendix.
Weakness 3: Minor writing or grammar mistakes.
Response: Thanks for your detailed review. We will carefully proofread the manuscript.
Weakness 4: About the text prompts.
Response: We regard image retargeting as an image-to-image generation task and do not use additional text prompts to assist it. The positive text prompts used in the experiments are the defaults 'best quality, high quality', and the negative prompts are 'monochrome, lowres, bad anatomy, worst quality, low quality'. The weight of the image prompt is set to 1, so the text prompts contribute minimally.
Question 1: The meaning of W_s in Eq. (5).
Response: W_s is different from the image width W. W_s denotes the saliency width, which is defined in Eq. (4). We will clarify this distinction in the description of Eq. (2).
Question 2: How is W_s implemented in the SDR metric? Is it computed by summing the width of all lines in the saliency map and taking the maximum as the width of the saliency map? Why not use salient object detection metrics such as F-measure, or IoU, for evaluation?
Response:
- For the first and second questions, we are afraid there might be a slight misunderstanding. W_s takes the union of the salient widths of all lines in the saliency map rather than the maximum. In the implementation, we use a list of zeros of length W to record whether each column contains salient pixels; if a column has a salient pixel, the corresponding entry is set to 1. Summing the list gives the salient width.
- For the last question, both saliency detection metrics and IoU are only suitable for comparing two images with identical resolutions. However, image retargeting changes both the aspect ratio and the resolution of the original image. We therefore proposed a simple metric that provides a rough estimate of saliency preservation. Thank you for the inspiring question; we will explore further retargeting metrics in future research.
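The column-flag procedure described above can be written out as a short sketch. `saliency_width` follows the rebuttal's description; `saliency_discard_ratio` is a hypothetical form (relative loss of salient width), since the exact SDR formula of Eq. (5) is not reproduced in this discussion.

```python
def saliency_width(mask):
    """W_s: number of columns containing at least one salient pixel.

    A zero list of length W flags every column that has a salient pixel;
    summing it gives the union of the per-row salient widths.
    """
    width = len(mask[0])
    has_salient = [0] * width
    for row in mask:
        for j, v in enumerate(row):
            if v:
                has_salient[j] = 1
    return sum(has_salient)

def saliency_discard_ratio(mask_before, mask_after):
    """Hypothetical SDR form: fraction of salient width lost by retargeting.

    The two masks may have different widths, since retargeting changes the
    resolution. A negative value would mean the salient width grew.
    """
    w_before = saliency_width(mask_before)
    w_after = saliency_width(mask_after)
    return (w_before - w_after) / w_before

before = [[0, 1, 0, 0],
          [0, 0, 1, 0],
          [0, 1, 0, 0]]
after = [[1, 0],
         [1, 0],
         [1, 0]]
print(saliency_width(before))                 # 2 (union), whereas the per-row max would be 1
print(saliency_discard_ratio(before, after))  # 0.5
```

The example shows why the union matters: each row has only one salient column, but together they span two columns.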
Question 3: The weights of different subjective metrics.
Response: We agree with you that different people and tasks have different biases towards subjective metrics, thus specific weights can be tailored to specific needs. We take averages just to compare the overall performance in a more intuitive and concise way.
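Tailoring the weights amounts to a weighted mean. The sketch below uses metric names and score values that are made-up placeholders, not numbers from Table 2.

```python
def overall_score(scores, weights=None):
    """Weighted mean of per-metric user-study scores.

    With weights=None every metric gets weight 1, i.e. the plain average
    used in the paper.
    """
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total

# Placeholder scores, not results from the paper.
scores = {"content": 4.0, "deformation": 3.0, "distortion": 3.0, "aesthetics": 4.0}
print(overall_score(scores))  # 3.5
emphasis = {"content": 2.0, "deformation": 1.0, "distortion": 1.0, "aesthetics": 2.0}
print(round(overall_score(scores, emphasis), 3))  # 3.667
```

Doubling the weight of content completeness and aesthetics, as the reviewer suggests, shifts the overall score toward those two metrics without changing the scale.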
This work contributes a new image retargeting model named PruneRepaint, which adapts to any target ratio.
The authors first improve the traditional seam-carving method with saliency priors to achieve content-aware pruning and protect important semantic regions.
After that, they introduce an adaptive repainting module using the diffusion model to maintain local smoothness after pruning.
The method is unique compared to previous ones and also effective. The newly proposed metrics are reasonable. Experiments demonstrate the effectiveness of the key designs and the large advantages over other methods.
Strengths
- I believe the perspective from which the authors address the image retargeting task is novel and very important. This work takes a step towards spatially variable diffusion models, whereas most diffusion models focus on a fixed spatial size.
- The method is well designed and reasonable. The proposed idea differs substantially from previous cropping-based or global generation methods, and the authors provide a more reasonable and effective solution path.
- The authors also introduce two reasonable evaluation metrics.
- Ablation studies verify the effectiveness of each design of the proposed method.
- The proposed method achieves fairly superior results compared to previous models.
Weaknesses
- It would be more convincing if more methods, such as the ‘InGAN’ model in Figure 1, were included in the comparison. Previous methods address this task from varying perspectives, such as cropping, scaling, seam-carving, and generative models. This work seems to provide a new idea, so I expect a more comprehensive comparison with other solutions to show its advantages.
- Although the proposed method achieves a large improvement over previous works in avoiding artifacts, the results still contain some distorted regions. In my view, strengthening local correlations would help solve this problem.
- The dataset contains limited samples. Although the authors output different ratios and the results show advantages and better generalization over previous works, the results and comparisons would be much stronger with more samples. I suggest the authors collect a new high-quality dataset with diverse scenarios for this task and evaluate the model on a larger dataset in the future.
Questions
- The authors point out in the limitation section that ‘the repainting region generated by ARRD is not complete enough and contains certain distorted regions’. Please explain this limitation further.
- The results of ‘+CSC’ in Table 3 and ‘background repainting’ in Table 4 have the same score. This should not be a coincidence; please explain the reason.
- With the IP-Adapter, the proposed method does not seem to rely on text prompts to produce or improve the retargeting result. However, I am still interested in which text prompts are used in your design. Have you tried different text prompts such as “keep key semantics”, and what were the results?
- The output ratios are set to 16:9, 1:1, 4:3, and 9:16. Why were these ratios selected as targets? In previous image retargeting works, I also notice other ratios such as 2:3 and other extreme ratios.
- The choice between inpainting and outpainting is based on a hyperparameter and decided by comparing the target ratio with the foreground size. I would expect an adaptive strategy for this decision. Can you suggest potential solutions to this problem?
Limitations
The authors discussed the limitations adequately.
Weakness 1: Lack of comparison with more retargeting methods.
Response: Thanks for your advice. We have added experiments on InGAN and full-image repainting (FR) in Fig. R1, Fig. R2 and Tab. R1 of the PDF. Both the quantitative and qualitative comparisons show the large advantage of our method.
Weakness 2: Suggestions to solve the distorted regions with local correlations.
Response: Yes, thanks for the constructive suggestion. Although our method achieves significant improvements in both preserving semantic completeness and avoiding local artifacts, strengthening local correlations is indeed a promising way to further improve local smoothness. However, the spatial misalignment between the original and retargeted images makes such a local consistency constraint difficult to implement. We will study this problem carefully in future work.
Weakness 3: Lack of a larger high-quality dataset.
Response: Thanks for your advice. We are in fact working on a retargeting dataset, collecting images with a wider variety of objects, input ratios, foreground scales, and object layouts for comprehensive evaluation.
Question 1: Explain the limitation of ARRD.
Response: The reason is that ARRD searches for local pixel-displacement areas without global understanding. For example, comparing the 'Original image' and '+CSC+AR' images in the second row of Figure 6, some seams passing through the streetlight were removed, causing misalignment. Ideally, the entire streetlight should be repainted, but AR only repaints the pixels near the deleted seams in the middle, so the streetlight remains misaligned in the generated image.
Question 2: Explain why the results of '+CSC' in Table 3 and ‘background repainting’ in Table 4 are the same.
Response: Both BR and AR are built on top of CSC. Since BR identifies the background for repainting based on the saliency map, it does not change the salient regions, so the two settings have the same score.
Question 3: About text prompts.
Response:
- For the first question, we regard image retargeting as an image-to-image generation task. Moreover, designing specific prompts for each image is expensive, so we do not use additional text prompts to assist the task. The positive text prompts used in the experiments are the defaults 'best quality, high quality', and the negative prompts are 'monochrome, lowres, bad anatomy, worst quality, low quality'. The weight of the image prompt is set to 1, so the text prompts contribute minimally.
- For the second question, text prompts are expected to be captions that describe an image. Prompts such as 'keep key semantics, preserve the main structure' are not image captions, so they provide no meaningful guidance to the generation process.
Question 4: The reasons for selecting the reported ratios.
Response: Most existing devices have aspect ratios ranging from 16:9 to 9:16. The models that handle extreme ratios tend to perform better on the more common ratios as well. Therefore, we only select 4:3 and 1:1 as the middle-ground aspect ratios to test. Among them, 16:9 and 4:3 are common aspect ratios for computers and televisions, 1:1 is a preferred image and video size for social media (such as Instagram), and 9:16 is commonly seen on smartphone screens.
After checking the response and other reviews, I am inclined to increase the rating.
Thank you for your detailed rebuttal and for thoroughly addressing all of my concerns. I appreciate the additional experiments and explanations you provided, particularly regarding the comparisons with retargeting methods, the handling of local correlations, and the dataset enhancement.
Given the improvements and clarifications, I will maintain my initial positive rating.
The work presents an add-on to Seam Carving that uses diffusion models to perform content-aware resizing of images.
Strengths
- None
Weaknesses
- The work seems to be a rehash of Seam Carving, with some diffusion models added to perform retargeting. There is no novelty in the method.
- One page of the paper consists of equations taken directly from Seam Carving. How can that be a contribution?
- The application is quite dated and has not made any significant progress through the proposed methods, which are logical but lack impact.
- The writing is hazy, and there is barely any motivation for the approach and the application.
Questions
Please see above.
Limitations
Limitations not well discussed.
Weakness 1: No novelty in the method.
Response:
- Seam-carving is a semantics-agnostic approach that often causes severe foreground loss and distortion (see Figs. 4, 5 and 7). In contrast, our content-aware seam-carving (CSC) incorporates semantic awareness to preserve key objects, leveraging visual saliency cues and a careful integration strategy (see Section 3.2).
- Additionally, to address the inconsistent style and structure of previous global generation methods, we introduce adaptive repainting (AR). AR adaptively identifies discordant artifacts with a masking strategy and adjusts the mask to accommodate any aspect ratio.
- The qualitative and quantitative comparisons in Sections 4.3 and 4.4 demonstrate the significant superiority of our method over previous approaches, highlighting the advantages of our designs.
Weakness 2: One page of the paper is direct equations from Seam Carving.
Response: We respectfully disagree that CSC is merely a rehash of seam-carving (SC). SC determines the importance of different structures in an image using low-level gradient information, which often leads to loss or deformation of key foreground objects. To address SC's lack of semantic awareness, we introduce high-level saliency priors alongside spatial priors (see lines 138 to 148), as presented in Eq. (2). Additionally, a tolerable saliency loss ratio (see lines 154 to 161) is set to avoid excessive foreground loss at extreme ratios and to accommodate potential image expansion in AR. CSC thus preserves key information: as the results in Tab. 3 and Figs. 6 and 7 demonstrate, it achieves significant improvements over traditional SC.
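To illustrate the general idea of adding a high-level prior to Seam Carving's low-level gradient energy, the sketch below adds a weighted saliency term to an L1 gradient energy and finds the cheapest 8-connected vertical seam by dynamic programming. The additive combination and the `weight` parameter are assumptions for illustration only; this is not the paper's Eq. (2), and it omits the spatial prior and the tolerable saliency loss ratio.

```python
def combined_energy(gray, saliency, weight=1.0):
    """Per-pixel energy: L1 gradient magnitude plus weighted saliency.

    Assumed additive form for illustration, not the paper's Eq. (2).
    Central differences are clamped at the image borders.
    """
    h, w = len(gray), len(gray[0])
    energy = []
    for i in range(h):
        row = []
        for j in range(w):
            dx = abs(gray[i][min(j + 1, w - 1)] - gray[i][max(j - 1, 0)])
            dy = abs(gray[min(i + 1, h - 1)][j] - gray[max(i - 1, 0)][j])
            row.append(dx + dy + weight * saliency[i][j])
        energy.append(row)
    return energy

def min_vertical_seam(energy):
    """Column index per row of the lowest-cost 8-connected vertical seam."""
    h, w = len(energy), len(energy[0])
    cost = [row[:] for row in energy]
    back = [[0] * w for _ in range(h)]
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 1, w - 1)
            k = min(range(lo, hi + 1), key=lambda c: cost[i - 1][c])
            back[i][j] = k
            cost[i][j] += cost[i - 1][k]
    j = min(range(w), key=lambda c: cost[h - 1][c])
    seam = [j]
    for i in range(h - 1, 0, -1):
        j = back[i][j]
        seam.append(j)
    return seam[::-1]

def remove_seam(img, seam):
    """Drop one pixel per row at the seam's column."""
    return [row[:j] + row[j + 1:] for row, j in zip(img, seam)]

gray = [[0, 0, 0, 0] for _ in range(3)]      # flat image: no gradient cue
sal = [[9, 9, 0, 0] for _ in range(3)]       # salient object in columns 0-1
print(min_vertical_seam(combined_energy(gray, sal, weight=0.0)))  # [0, 0, 0]
print(min_vertical_seam(combined_energy(gray, sal)))              # [2, 2, 2]
```

On a flat image, the gradient alone gives no preference, and the seam cuts through the salient column; with the saliency term, the seam moves to the background.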
Weakness 3: The application is quite dated and has not made any significant progress through the proposed methods.
Response:
- Regarding applications, the reviewer may refer to popular tools such as Adobe Photoshop's "Content-Aware Scaling" and Instagram's "Auto Crop", which improve the efficiency of media professionals and enhance the entertainment experience of the general public.
- Our method significantly advances the two key aspects of retargeting, semantic completeness and local smoothness, which greatly improves the user experience when retargeting is deployed in real-world applications.
Weakness 4: The writing is hazy, and there is barely any motivation about the approach and the application.
Response: Our motivation is to address two challenging and key issues in image retargeting tasks: preserving the main information and avoiding artifacts, as clearly stated in line 25 of the manuscript. CSC and AR are proposed to solve them respectively.
We thank all reviewers for their efforts and are glad that you recognized the work's innovation, its suitability and importance for image retargeting, its significant outperformance, and its writing. We sincerely thank all the reviewers for their constructive remarks and useful suggestions, which have greatly helped us improve the quality of our manuscript.
We thank all the reviewers for their efforts. We are encouraged that they found our idea to be sound, clear and effective (RHY7, HZC2, oiR7). We are glad they found our approach to be intuitive and superior in model performance (RHY7, HZC2, oiR7). We are pleased that reviewer RHY7 recognizes the potential of our research and that HZC2 and oiR7 offered insights for future work. We address reviewer comments below.
This paper was a unanimous accept by the reviewers. They all appreciated the novelty and the improvement over baselines, found the paper well motivated, and felt it is a good contribution to the field. The author rebuttal was reviewed, with discussion between the authors and the reviewers.
Congratulations!
In revising your paper for the camera-ready version, please include any comments and clarifications that were made or proposed during the rebuttal and discussion phase.