AnimateQR: Bridging Aesthetics and Functionality in Dynamic QR Code Generation
The first animated QR code generation method.
Abstract
Reviews and Discussion
This paper proposes an animated QR code generation framework, AnimateQR, which utilizes a hierarchical luminance guidance strategy and progressive spatio-temporal control to generate high-quality dynamic QR codes, achieving a good balance between visual quality and scannability. Experimental results validate the effectiveness of the proposed method.
Strengths and Weaknesses
Strengths:
- Extract multi-scale luminance maps from natural images and QR codes to train ControlNet, obtaining HLG-ControlNet for better control over QR code generation.
- Combine the animation generation model AnimateDiff to create animated QR codes, giving them higher artistic expression.
- An error-driven HLG update strategy was proposed.
Weaknesses:
- It is unclear why multi-scale luminance extraction is performed—the motivation and principles are not well-defined. Specifically, lines 43 to 47 are difficult to correlate or align with the content in Section 3.3 later in the text.
- The robustness of QR code scanning appears to have only been tested on static images, with no scanning tests conducted on continuous 16-frame QR codes.
- No visualization results for consecutive frames of animated QR codes.
- The training details and costs of HLG-ControlNet are missing.
Questions
- I'm not an expert in this field, but I don't understand why multi-scale luminance extraction is necessary, and why M3 to M1 simply masks more of the image rather than dividing it into finer-grained grids to extract more detailed luminance.
- Since this article generates a continuous 16-frame animation, the quality of the animation will have a direct impact. Therefore, each of the 16 frames should be tested separately, and scanning tests for key frames and non-key frames should also be conducted independently.
Limitations
The limitations are discussed in the paper.
Formatting Issues
None.
Response to Weakness #1 & Question #1:
We sincerely appreciate the reviewer's insightful feedback. Below we clarify the motivation and principles of multi-scale luminance extraction, and describe how we will improve the alignment between Sections 1 and 3.3.
Motivation:
Our multi-scale design stems from empirical observations in prior work: overly fine-grained control (e.g., per-pixel luminance) causes instability, while overly coarse control (e.g., block-wise) introduces artifacts (Table 4).
To balance robustness and visual quality, we propose spatially adaptive control. Each module's scale is activated by a learned β (Fig. 2, green boxes), dynamically adjusting to local QR readability: stronger control in low-readability regions (enhancing scannability) and weaker control in high-readability areas (preserving aesthetics). Table 5 qualitatively validates that fixed-scale controls underperform our adaptive approach.
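For concreteness, the following is a minimal sketch of block-wise luminance extraction at several granularities, in the spirit of the multi-scale maps described above; the `module_size` and `scales` values are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def multiscale_luminance_maps(gray, module_size=16, scales=(1, 2, 4)):
    """Average luminance per block at several granularities (illustrative only).

    gray: HxW array in [0, 1]. module_size and scales are assumed values,
    not the actual HLG configuration.
    """
    h, w = gray.shape
    maps = []
    for s in scales:
        block = module_size * s                      # coarser blocks at larger scales
        hb, wb = h // block, w // block
        pooled = (gray[:hb * block, :wb * block]
                  .reshape(hb, block, wb, block)
                  .mean(axis=(1, 3)))                # mean luminance of each block
        maps.append(np.kron(pooled, np.ones((block, block))))  # upsample back to image size
    return maps                                      # one luminance map per scale
```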
Principle Clarification:
We now explicitly map Lines 43-47 to Section 3.3:
- (1) Module-aligned Encoding → Lines 136-140: Feature extraction per QR module.
- (2) Hierarchical Control → Lines 141-155: Multi-scale interaction via β-weighted fusion.
- (3) Adaptive Constraint → Lines 156-158: Train with stochastic relaxation to make the model robust to spatially dynamic control strengths.
We will strengthen the motivation discussion in Section 1, insert cross-references between Sections 1 and 3.3 for better flow, and re-polish the description for clarity.
Response to Weakness #2 & Question #2:
We sincerely appreciate the reviewer’s valuable feedback regarding the robustness testing of QR code scanning.
- Dynamic QR Code Scanning Test: We clarify that the scannability of dynamic QR codes was indeed tested under real-world conditions, by displaying the 16-frame animation on a screen and performing scanning tests. This aligns with the intended use case of dynamic QR codes. We will provide additional details about the experimental setup (e.g., display specifications, scanning distance/angle) in the revised manuscript to enhance reproducibility.
- Per-Frame Scannability Analysis: We fully agree with the reviewer's suggestion to evaluate individual frames. Here we supplement a per-frame scannability test:

| Frame Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Keyframe? | Yes | No | No | No | No | No | No | Yes | No | No | No | No | No | No | No | Yes |
| Scannability (%) | 98.7 | N/A | N/A | N/A | N/A | N/A | N/A | 99.2 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 98.5 |

Results confirm that keyframes maintain high scannability (>98%), while non-keyframes are intentionally non-scannable by design. This table will be included in the revised manuscript. The experimental setting follows Table 1 with the same size and angle.
Response to Weakness #3:
We sincerely appreciate the reviewer's feedback. In the submitted PDF (viewable with Adobe Acrobat), all dynamic QR codes (Figures 1 and 3) are clickable for playback. To further improve readability, we will provide visualizations of consecutive frames in the supplementary material. This update will ensure readers can easily visualize the results without depending on a specific PDF viewer.
Response to Weakness #4:
We appreciate the reviewer's attention to the training details. As noted in Lines 128-130, the training protocol of HLG-ControlNet follows the original ControlNet paper for consistency. Since our core contribution lies in the proposed image-HLG map data generation pipeline (detailed in Section 3.3), we focused the main text on this novel component.
To address the reviewer's concern, we will expand the main text with a brief summary of training hyperparameters (e.g., batch size, epochs) in Section 3.3. Furthermore, we will add a subsection in the supplement with full training costs (GPU hours, hardware specs) and implementation details. We believe this clarification will better highlight our contribution while ensuring reproducibility.
We sincerely thank Reviewer bZch for the insightful feedback. We have addressed the concerns point-by-point with detailed explanations and additional experimental results. As the author-reviewer discussion period deadline approaches, we kindly ask whether any concerns remain and warmly welcome any further discussion.
This paper proposes a dynamic QR code generation method based on AnimateDiff. To improve scannability, an HLG-ControlNet is trained to progressively control the luminance level of the generated video content. A key contribution of this paper is the proposed progressive spatiotemporal control mechanism developed for the denoising process.
Strengths and Weaknesses
Strengths:
- It is inspiring to partition control strength across both spatial and temporal dimensions.
- I tested the QR codes presented in this paper and the supplementary material, and they guided me to the iccv and nips.cc websites, which verifies the scannability of the generated codes.
Weakness:
- The organization can be improved. Section 3.3, "Training HLG-ControlNet," actually contains no training procedure for the HLG-ControlNet, only the computation process of HLG maps for an image.
Questions
- What is the effect of the Reshuffle and Interp operations on the initial video sequence? It would be nice to show the intermediate results to ease the understanding of the proposed pipeline.
- Why should the activation vector beta have 3 values? Is it just for increasing randomness, or is it necessary for color QR codes?
- Why is the animation small for non-character QR codes? Please show the initial video sequence to reveal whether this is due to the initial video or the subsequent operations.
Limitations
Yes
Final Justification
AnimateQR is visually appealing. I am positive towards the acceptance of this paper.
Formatting Issues
No
Response to Weakness #1:
We sincerely appreciate the reviewer's insightful observation. In Section 3.3, we indeed focused on explaining the computation of HLG maps (our core contribution) while briefly mentioning the training procedure in Lines 129-130. For clarity, we:
- Followed the standard ControlNet training framework [43], with the key difference being our use of (image, HLG map) pairs instead of (image, edge/segmentation map) pairs.
- Illustrated the full pipeline in Figure 2 (red box), where the HLG-ControlNet is trained end-to-end by feeding HLG maps as conditional inputs.
We will add a dedicated paragraph in Section 3.3 to explicitly describe the (image, HLG map) paired training scheme, the loss function, and the optimization details (matching ControlNet's framework). Furthermore, we will clarify that the section's emphasis on HLG computation aims to highlight our methodological novelty, while maintaining reproducibility via standard ControlNet training protocols.
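To illustrate what such a paragraph might describe, here is a schematic training step for conditioning on (image, HLG map) pairs, loosely following a diffusers-style ControlNet training loop; all module names (`vae`, `unet`, `hlg_controlnet`, `noise_scheduler`, `text_encoder`) are placeholders, and this is a sketch rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def hlg_training_step(batch, vae, unet, hlg_controlnet, noise_scheduler, text_encoder):
    # All modules are placeholders; the step mirrors a standard ControlNet recipe,
    # with the HLG map used as the conditioning image.
    latents = vae.encode(batch["image"]).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    text_emb = text_encoder(batch["caption_ids"])[0]

    # Only change vs. vanilla ControlNet training: condition on the HLG map.
    down_res, mid_res = hlg_controlnet(noisy_latents, t,
                                       encoder_hidden_states=text_emb,
                                       controlnet_cond=batch["hlg_map"],
                                       return_dict=False)
    pred = unet(noisy_latents, t, encoder_hidden_states=text_emb,
                down_block_additional_residuals=down_res,
                mid_block_additional_residual=mid_res).sample
    return F.mse_loss(pred, noise)   # standard epsilon-prediction loss
```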
We believe these changes will better balance innovation disclosure with technical completeness. Thank you for helping us improve this clarity.
Response to Question #1:
We appreciate the reviewer's suggestion to visualize the intermediate effects of Reshuffle and Interp operations. While we omitted implementation details in the main paper (as these components serve as preprocessing steps rather than core innovations), we provide a detailed explanation in the supplementary material. Below we clarify the key points:
- Reshuffle Operation: Based on the XOR closure property of Reed-Solomon encoding (the core of QR decoding), [5] demonstrated that rearranging black/white blocks to match a target image's luminance distribution preserves scannability. We adopt this to align the initial control signal (QR blocks) with the generated content's luminance, reducing control difficulty.
- Interp Operation: Applying Reshuffle to every frame would degrade temporal coherence (causing flickering artifacts). Therefore, as stated in Lines 171-173, we strategically use Reshuffle only for keyframes (which must be scannable) to avoid frame flickering. For non-keyframes, linear interpolation ensures smooth control-signal transitions (better visual quality) while preserving scannability (only keyframes need to be scannable).
This hybrid approach optimally balances scannability requirements with video generation quality. While we cannot add figures here, we will include visualization of intermediate results in the revised manuscript.
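As a rough illustration of the keyframe/non-keyframe schedule described above, the sketch below reshuffles only keyframes and linearly interpolates the control maps in between; `reshuffle_qr_blocks` is a stub standing in for the Reed-Solomon-aware reshuffle of [5], and the keyframe indices simply mirror the 16-frame example (frames 1, 8, 16).

```python
import numpy as np

def reshuffle_qr_blocks(qr_map):
    # Placeholder for the Reed-Solomon-aware block reshuffle of [5];
    # the real operation permutes code blocks to match the target luminance.
    return qr_map

def build_control_sequence(qr_maps, keyframes=(0, 7, 15)):
    """Reshuffle only keyframes; linearly interpolate the control maps in between."""
    controls = [None] * len(qr_maps)
    for k in keyframes:
        controls[k] = reshuffle_qr_blocks(np.asarray(qr_maps[k], dtype=float))
    for a, b in zip(keyframes, keyframes[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)                    # linear blend between adjacent keyframes
            controls[i] = (1.0 - w) * controls[a] + w * controls[b]
    return controls
```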
Response to Question #2:
We appreciate the reviewer's question regarding the design of our activation vector β. The three binary values (0/1) in β are not for randomness but serve a necessary functional purpose in our color QR code framework, as explained below:
- Multi-scale Control Mechanism: Each value in β corresponds to one of the three scales in our hierarchical control module (Fig. 2, green boxes). When a scale is activated (β=1), its corresponding blocks contribute to the content control strength. For example, single-scale activation (e.g., [1,0,0]) modifies only coarse-scale blocks, yielding weaker control, whereas full activation ([1,1,1]) lets all scales participate, achieving maximal control strength.
- Spatial Adaptivity: The dynamic update of β across stages enables adaptive control intensity in different spatial regions. This balances scannability (requiring strong control in error-prone areas) and aesthetic quality (weaker control in other regions).
As evidenced in Table 5, our adaptive β achieves the optimal balance. We will clarify this rationale in the revised manuscript. Thank you for highlighting this important point!
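For intuition, a minimal sketch of how a binary β could gate per-scale control contributions is shown below; `scale_residuals` and the additive fusion are assumptions made for exposition, not the exact formulation in the paper.

```python
import torch

def fuse_control(scale_residuals, beta):
    """Sum only the activated scales' control residuals (illustrative fusion).

    scale_residuals: list of 3 tensors (coarse, medium, fine), same shape.
    beta: binary activation vector, e.g. [1, 0, 0] = weak coarse-only control,
          [1, 1, 1] = full-strength control.
    """
    fused = torch.zeros_like(scale_residuals[0])
    for b, residual in zip(beta, scale_residuals):
        if b:                                   # a scale contributes only when activated
            fused = fused + residual
    return fused
```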
Response to Question #3:
The small animation of non-character QR codes simply reflects the weaker motion in their original videos (e.g., static objects/scenes). The generated QR motion scales proportionally to the source video's movement. We can adjust the motion intensity via the motion scale parameter, but the relative motion pattern remains consistent with the original. As suggested, we'll include the initial videos in the revised version for comparison.
The rebuttal addresses my concerns in my review comments. I am positive towards the acceptance of this paper.
This study proposes AnimateQR to generate QR-decodable animated content. AnimateQR trains a ControlNet on Stable Diffusion v1.5 with HLG data, and combines it with AnimateDiff and the proposed ProST sampling technique. The results show that the aesthetic quality is better than existing approaches, while providing seamless QR codes.
Strengths and Weaknesses
S1. As an application study, the animated QR codes are interesting and practical in terms of creative AI.
S2. The proposed approach, ControlNet and ProST, makes sense.
W1. I think this paper's main weaknesses lie in limited experiments. Please refer to the section below.
W2. The sampling technique is hard to understand, and the writing needs significant improvement.
Questions
Q1. Why are the experiments limited to Stable Diffusion v1.5 + AnimateDiff, which is an outdated checkpoint? There are many open-sourced image and video generative models, and the authors should have shown the generalizability of the proposed approach across different models.
Q2. Why is the domain of the generated content limited to the anime style? Can the model not generate photorealistic animations?
Q3. How important is the ProST technique in terms of quality and QR decodability? Considering that recent image/video generative models commonly use diffusion-distillation techniques to reduce the number of sampling steps, I am worried about the incompatibility of the proposed method with distilled diffusion models.
Limitations
The limitation is discussed in the paper.
Final Justification
First, I sincerely apologize that I did not notice the rebuttal period because my affiliation changed and my email address became invalid. I have read the authors' responses, and they address my major concerns. Thus, I increase my score to "borderline accept", considering that the authors showed the method itself is model-agnostic unless the model is a one-step generator.
I still think that the proposed method requires more qualitative results on various styles, because the fact that the compared models focus on anime styles is not a sufficient reason to accept the limited experiments. In addition, while many recent generative models aim at one-step generation, the proposed approach might not be optimal for one-step generators (though I agree this would be worth exploring in future research).
Formatting Issues
N/A
Response to Question #1:
We sincerely appreciate the reviewer's insightful question regarding model generality. Our response is twofold:
- Fair Comparison: The choice of Stable Diffusion v1.5 (SD-v1.5) + AnimateDiff-v2 as the primary setup aligns with the implementations of all compared methods (QRBTF, Text2QR, GladCoder). This ensures a fair comparison to validate our method's superiority under identical conditions.
- Methodological Generality: Importantly, AnimateQR is not architecture-specific; it is compatible with any diffusion + AnimateDiff combination. To demonstrate extensibility, we conducted additional experiments with Stable Diffusion XL (SDXL) + AnimateDiff-XL, showing significant improvements in both quality and metrics. AnimateQR-XL achieves +7.1% Q-Bench and +6.5% SimpleVQA over the SD-v1.5 variant (see table below).

| Method | Q-Bench↑ | SimpleVQA↑ |
|---|---|---|
| AnimateQR | 0.6217 | 3.5872 |
| AnimateQR-XL | 0.6926 | 3.8216 |
The results validate our approach's extensibility to newer architectures. We will include them in the final paper (Section 4) with expanded discussion on cross-model compatibility.
Response to Question #2:
We appreciate the reviewer’s insightful question regarding the style generality of our model. The anime-style outputs in our experiments are not a deliberate design choice, but a natural consequence of using the base Stable Diffusion model (without additional style-control plugins like LoRA or IP-Adapter) for fair comparison with existing methods.
- Baseline Fairness & Focus of This Work: As shown in Table 2, all compared methods (e.g., Text2QR, GladCoder) adopt the same base model for QR code generation, and thus their outputs similarly exhibit anime-style aesthetics. Introducing photorealism-specific modules would create an unfair advantage in evaluation, as our core contribution focuses on QR pattern alignment, not style control.
- Scalability & Flexibility in Practice: While the current experiments focus on anime-style outputs (generated by the base model without plugins), our framework is model-agnostic. As mentioned above, integrating photorealism-controlling tools (e.g., SDXL + LoRA) is straightforward. Photorealistic results via LoRA will be provided in the revision.
We thank the reviewer for raising this important point and will clarify the model’s extensibility in the revised manuscript.
Response to Question #3:
We sincerely appreciate the reviewer’s insightful question regarding ProST’s compatibility with distilled diffusion models. Here, we clarify three key aspects:
- Theoretical Adaptability: ProST is designed to decouple the denoising process into stages, updating the HLG map (control signal) at each stage boundary. This architecture ensures compatibility with any number of sampling steps > 1, regardless of whether the backbone is a full diffusion model or a distilled variant (e.g., LCM).
- Empirical Validation with Distilled Models: We tested ProST with AnimateDiff-LCM under 4 sampling steps, denoted as AnimateQR-LCM. We observed that AnimateQR-LCM retains its ability to balance quality and QR scannability. We will include these results in the revised manuscript to demonstrate ProST's robustness under low-step regimes.

| Method | Q-Bench↑ | SimpleVQA↑ |
|---|---|---|
| AnimateQR | 0.6217 | 3.5872 |
| AnimateQR-LCM | 0.6138 | 3.5921 |

- Extension to One-Step Sampling (Edge Case): For extreme cases (e.g., one-step sampling), ProST could be extended by iteratively regenerating the output with updated control signals, effectively simulating multi-step optimization.
We agree with the reviewer that exploring ProST’s synergy with distillation is valuable, and we will add a dedicated analysis in the revision.
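As a rough picture of the staged control-update idea described above, the sketch below re-derives the control map between denoising stages; `denoise_step`, `estimate_errors`, and `update_hlg` are hypothetical placeholders, and the step/stage split is arbitrary.

```python
def prost_sampling(latents, hlg_map, denoise_step, estimate_errors, update_hlg,
                   num_steps=25, num_stages=5):
    """Progressive control (sketch): refresh the HLG map at every stage boundary."""
    steps_per_stage = num_steps // num_stages
    for stage in range(num_stages):
        for i in range(steps_per_stage):
            t = num_steps - 1 - (stage * steps_per_stage + i)  # descending timestep index
            latents = denoise_step(latents, t, hlg_map)        # one guided denoising step
        error_map = estimate_errors(latents)        # where would QR decoding likely fail?
        hlg_map = update_hlg(hlg_map, error_map)    # strengthen control in error-prone regions
    return latents
```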
We sincerely thank Reviewer i59R for the insightful feedback. We have addressed the concerns point-by-point with detailed explanations and additional experimental results. As the author-reviewer discussion period deadline approaches, we kindly ask whether any concerns remain and warmly welcome any further discussion.
This paper presents AnimateQR, a novel framework that synthesizes animated QR codes which are both aesthetically pleasing and robustly scannable. The authors address the long-standing trade-off between visual quality and decoding reliability by jointly optimizing animation consistency, visual fidelity to the target image, and robustness to QR decoding constraints. AnimateQR introduces a hybrid optimization framework combining deep generative priors and a differentiable decoder, enabling stylized animation with temporal coherence. Experiments demonstrate that AnimateQR achieves high-quality stylized QR animations while maintaining high scanning success rates, outperforming existing artistic QR approaches in both subjective quality and usability.
Strengths and Weaknesses
Strengths:
- The paper demonstrates both quantitative robustness (e.g., high scanning accuracy across devices) and qualitative improvements (e.g., user preference studies).
- The authors release a benchmark dataset and code, which enhances reproducibility and encourages follow-up research in a novel domain.
Weaknesses:
- The paper does not deeply explore when or why AnimateQR fails (e.g., under low lighting, extreme motion, or unconventional scanning angles), which would be valuable for deployment.
- While the paper compares against traditional artistic QR methods, it would be useful to include comparisons with more recent diffusion-based stylization models or dynamic code embedding methods.
Questions
How does AnimateQR perform under real-world scanning conditions with motion blur, varying screen brightness, or camera shake—particularly for longer or more complex animations? Have you evaluated its robustness in such unconstrained environments, and how might the method be adapted to handle them?
Limitations
yes
Formatting Issues
NA
Response to Weakness #1 & Question #1:
We sincerely thank the reviewer for the insightful comments on AnimateQR’s robustness in real-world scenarios. Here we provide a detailed discussion, which will be included in the revision.
To thoroughly assess robustness, we performed three types of tests: varying illumination, motion blur, and scanning angles. All tests involved side-by-side comparisons of AnimateQR, Text2QR, and QRBTF on 50 QR code sequences (each sized ), ensuring fair and comprehensive evaluation. All the numbers in the tables represent the scanning success rates.
- Varying Illumination. Using a calibrated LUX meter, we varied ambient illumination from near 0 to 500 LUX across six levels. The phone was fixed at a 0° angle to avoid motion or angular interference. Each method (AnimateQR, Text2QR, QRBTF) scanned the same set of animated QR codes 10 times per lighting level, and we recorded the average success rate to evaluate robustness under different lighting conditions.

| Method | 500 LUX | 400 LUX | 300 LUX | 200 LUX | 100 LUX | 5 LUX |
|---|---|---|---|---|---|---|
| AnimateQR | 100% | 100% | 98% | 88% | 66% | 8% |
| Text2QR-d | 100% | 100% | 96% | 82% | 64% | 2% |
| QRBTF-d | 98% | 96% | 90% | 62% | 48% | 0% |
- Motion Blur. To simulate realistic handheld scanning, we used the phone’s accelerometer and gyroscope to control and record the speed (0.1–2 m/s) and acceleration (≤2 m/s²) along a fixed 3 m path. During each test, the phone remained directly facing the screen (0° angle) under controlled lighting at 500 LUX. Each speed was tested separately for AnimateQR, Text2QR, and QRBTF, with each method scanning the QR codes 500 times (10 times per sample). We recorded and averaged the success rates to assess robustness against motion blur.

| Method | 0.1 m/s | 0.5 m/s | 1 m/s | 1.5 m/s | 2 m/s |
|---|---|---|---|---|---|
| AnimateQR | 100% | 100% | 96% | 92% | 88% |
| Text2QR-d | 100% | 100% | 96% | 90% | 88% |
| QRBTF-d | 98% | 98% | 92% | 82% | 70% |
- Scanning Angles. To assess performance under non-standard angles, we tested five angles from 0° to 60°, with fixed lighting at 500 LUX and the phone held stationary. Each method (AnimateQR, Text2QR, QRBTF) performed 10 scans per angle. We calculated the average success rate to evaluate robustness against angular deviation.

| Method | 0° | 15° | 30° | 45° | 60° |
|---|---|---|---|---|---|
| AnimateQR | 100% | 100% | 100% | 100% | 98% |
| Text2QR-d | 100% | 100% | 100% | 100% | 98% |
| QRBTF-d | 98% | 100% | 98% | 100% | 96% |
Our experiments show that AnimateQR consistently outperforms Text2QR and QRBTF across realistic scenarios, including low light, motion blur, and non-standard angles. Under ≤100 LUX, traditional methods, especially QRBTF, suffered decoding rates below 60%, while AnimateQR maintained significantly higher success. At ≥300 LUX, all methods performed well, but AnimateQR still led slightly.
When scanning speed exceeded 1 m/s, Text2QR and QRBTF showed clear performance drops, with QRBTF falling to 70% at 2 m/s. AnimateQR, by contrast, remained within the 80–90% range, demonstrating robustness from its design.
Across scanning angles up to 60°, AnimateQR maintained superior success rates with minimal degradation, reflecting strong geometric tolerance to unconventional scanning angles.
Overall, AnimateQR achieves greater stability and decoding reliability under challenging conditions. We will include these results in the revised manuscript to highlight its practical robustness.
Response to Weakness #2:
We sincerely appreciate the reviewer's suggestion regarding comparisons with state-of-the-art methods. In fact, our evaluation already includes comprehensive comparisons with the latest open-source artistic QR code methods:
- QRBTF (online project, diffusion-based)
- Text2QR (CVPR 2024, diffusion-based)
- GladCoder (IJCAI 2024, diffusion-based)
To address the dynamic code embedding aspect, we note that no open-source implementations of dynamic QR code generation are currently available. Therefore, we extended these static methods to dynamic versions (denoted as MethodName-d) through per-frame generation, ensuring fair comparison under identical conditions (see Figure 3 and Table 2). We will further emphasize this experimental design in the revised manuscript to improve clarity.
We sincerely thank Reviewer LKx1 for the insightful feedback. We have addressed the concerns point-by-point with detailed explanations and additional experimental results. As the author-reviewer discussion period deadline approaches, we kindly ask whether any concerns remain and warmly welcome any further discussion.
This paper introduces AnimateQR, the pioneering generative framework for creating animated QR codes that seamlessly blend aesthetic flexibility with scannability. All reviews are highly positive, specifically noting BA, BA, BA, BA. Reviewer LKx1 emphasized the robust quantitative results, such as high scanning accuracy across various devices, as well as qualitative enhancements evident in user preference studies. Reviewer i59R deemed the animated QR code application both practical and innovative. Reviewer J5mD independently verified the scannability of the generated QR codes. Reviewer bZch acknowledged the novel innovation. The authors' rebuttal has adeptly addressed the majority of concerns raised by the reviewers. Overall, I recommend the acceptance of this submission. Additionally, I expect that the authors will incorporate the new results and suggested modifications from the rebuttal phase into the final version.