PaperHub
Average rating: 6.0/10 (Poster, 4 reviewers; min 4, max 7, std. dev. 1.2)
Individual ratings: 7, 7, 4, 6
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.5
NeurIPS 2024

Optical Diffusion Models for Image Generation

OpenReview | PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

Light propagation can be programmed to perform denoising diffusion efficiently by transmitting through learned modulation layers.

Abstract

Keywords

Diffusion-based models · image generation · optical computing · efficient computing

Reviews and Discussion

Official Review
Rating: 7

This work presents an image generation framework using denoising diffusion models implemented through optical computing. The method uses passive diffractive optical layers that are trained to manipulate light propagation through the system, performing effective denoising. By exploiting the physical properties of light, the framework allows for rapid image processing with significantly reduced power consumption. The authors demonstrate the effectiveness of their framework on MNIST, Quick Draw, and Fashion MNIST.

Strengths

The paper is well-written and clear. The study addresses the environmental impact of traditional computing methods for image processing — the presented framework is 400 times more energy-efficient than conventional GPU approaches.

Weaknesses

The paper does not discuss the scalability of the proposed models in detail, especially when they are deployed in challenging real-world scenarios.

A further comparison against available GPU-based denoising frameworks would be valuable. The work could benefit from a broader comparison with existing approaches, particularly in terms of image quality and convergence time or processing speed under different scenarios.

The maintenance of these systems in practical, long-term deployments is not discussed. For example, liquid crystal displays (e.g., SLMs) are temperature-sensitive; how does this affect their performance?

Questions

I noticed a typo in line 175: ==autodiff citation ==.

How well does this framework scale with resolution and complexity of the images?

The motivation of the work is to address the environmental footprint of classical computing hardware. The comparison of the method against GPU hardware is presented in the appendix; I recommend discussing it in the main text, as it would strengthen the work.

Limitations

The authors do not explicitly discuss the limitations of their work. Although the checklist refers to Section 4.2, it is not clear what the authors define as limitations.

Post-rebuttal: All my questions have been carefully addressed by the authors. I believe this work presents a novel approach while addressing the environmental impact of traditional computing methods. I am updating my score to 7.

Author Response

We thank the reviewer for their time and attention to our work. We reply below to the issues raised; these replies will be incorporated into the main text in the next revision.

Scalability of the Proposed Model and Comparison with GPU-based Denoising Frameworks

Thank you for highlighting this opportunity for improvement. We agree that the scaling of the framework with the resolution and complexity of the images is important. Accordingly, we examined the scalability of our approach from several angles in the common author rebuttal and compared the performance of the ODU with digital models in terms of energy and image quality.

For different output image resolutions, while the models are kept fixed, we demonstrate that ODU consistently generates higher-quality images than MFLOP-scale convolutional and fully connected digital networks with a smaller energy budget (Rebuttal Figure 1). This trend persists even when a more challenging task is benchmarked, as shown in Reb. Fig. 2.

Similarly, when the task is fixed and the number of model parameters is changed, ODU has nearly the same scaling trend as large-scale digital image generation networks. The slope of the trend shown in Reb. Fig. 3, which is plotted with the data from the main text's Fig. 3, is approximately the same as the power-law scaling trend of generative digital models in [1].

Practical Aspects of the Hardware

We thank the reviewer for bringing up the important practical deployment aspects of the proposed method. In general, we acknowledge that analog hardware is more challenging to operate in terms of stability and degradation over time; these challenges are usually tackled with feedback, and only when the advantages gained significantly outweigh the costs. We believe our demonstration is in this category. Moreover, the Liquid Crystal on Silicon (LCoS) SLM is one of the most mature and reliable opto-electronic devices, widely used in networking infrastructure, with solutions provided by companies such as Coherent, Lumentum, and Huawei. For example, we invite the reviewer to check Coherent's press release on its ultrareliable Datacenter Lightwave Cross-Connect for datacenter optical circuit switches. Note also that LCoS SLMs are used in undersea telecommunications applications, where maintenance is severely restricted, if not impossible. Hence, we are confident in the feasibility of industrial-grade deployment while acknowledging the additional challenges common to all analog computing hardware.

Limitations of the Proposed Work

  • As the framework reaches its best performance for a given hardware configuration, such as the angles of the input beam and the mirror, transferring a trained model to new hardware with potential alignment differences would require fine-tuning of the model before deployment. This requirement should be analyzed quantitatively in follow-up studies. However, considering that a deployed model can be used for years once trained, this occasional fine-tuning should have a negligible overall effect.

  • We demonstrated this proof-of-principle study with easily available, off-the-shelf opto-electronic equipment such as cameras and light modulators. Even though these devices are ideal for initial studies, their speed and energy efficiency are not optimized; consequently, the speed and energy consumption of the proposed framework are also limited by their performance. Similarly, reaching higher accuracies with the ODU requires larger models, as in other modalities, and currently the size of the optical model is limited by the modest number of pixels on light modulators. For these reasons, reaching competitive performance and addressing real-world problems efficiently with optics will require the development of specialized hardware for this purpose.

[1] Henighan, Tom, et al. "Scaling laws for autoregressive generative modeling." arXiv preprint arXiv:2010.14701 (2020).

Comment

Thank you for the clarifications and the new figures. All my concerns have been addressed.

Official Review
Rating: 7

This paper proposes an opto-electronic realization of a toy diffusion model for image generation. The system consists of a DMD for optical image projection, SLMs for phase modulation, and a CMOS photosensor for signal intensity recording. These physical components, together with a digital processor (for noise addition), work collaboratively in an iterative manner to implement a multi-step diffusion process (i.e., iterative denoising and noise synthesis). To account for the imperfection of the physical setup (e.g., systematic errors stemming from the fabrication and misalignment), it proposes an online learning scheme to optimize the real physical parameters (of SLMs). It demonstrates that such a system can be used to generate simple images (in toy datasets) in an energy-efficient way in comparison to the dominant electronic approach.

Strengths

  • Innovative demonstration of an opto-electronic implementation of a toy diffusion model

  • The proposed online learning scheme sounds effective for bridging the gap between digital simulation and real experiments.

Weaknesses

Though the overall idea (the opto-electronic diffusion model realization and the online training algorithm to alleviate the calibration issues) sounds interesting and reasonable to me, the paper itself seems to have been written in a hurry; the content is not well polished, and its clarity could be improved.

Online training algorithm

I'm particularly interested in the online learning and the experimental system parts, but some technical details are unclear or missing in the paper. In L216-223, the authors state, "The digital twin is calibrated to approximate an experiment, and a 20% misalignment exists between its calibration and the actual experiment's alignment, which is simulated by another diffraction model," but how the digital twin is calibrated to approximate the experiment is not mentioned. What the 20% misalignment means is also vague to me; do you mean the MSE between the simulation and the experimental results on average? Also, I don't understand the meaning of "simulated by another diffraction model." Please further elaborate on this point to improve the reproducibility of the proposed method.

In Algorithm 1, it seems the authors maintain two sets of phase variables (\theta_layer and \theta_alignment) during training. \theta_layer will be used to configure the SLM in the real experiment, while \theta_alignment is used in the digital twin to account for imperfections in the system calibration. If so, I would like to see the final difference between the SLM's phase profile and the one used in the digital model. This would be helpful for understanding the misalignment between the physical system and its digital twin.

Besides, the online learning algorithm without Digital Twin Refinement is exactly the hardware-in-the-loop approach invented in computer-generated holography [A], which should be mentioned in the related work.

I am not familiar with the experimental error backpropagation algorithm compared in the paper. Could the authors provide additional discussion to clarify the differences between the proposed algorithm and this method?

Finally, did you first pretrain the digital model in simulation and then finetune it with the online learning algorithm, or train the model with online learning from scratch? Since online learning requires the hardware-in-the-loop scheme, I doubt its implementation efficiency in practice.

[A] Peng, Yifan, et al. "Neural holography with camera-in-the-loop training." ACM Transactions on Graphics (TOG) 39.6 (2020): 1-14.

Experimental setup

Regarding the experimental system setup in the supplementary material, Figure 6 only provides an incomplete illustration of the system, so it's a bit unclear how the authors realized the experimental setup in reality. Specifically, how many SLMs did you use in the experiment? From Figure 6, it seems four independent SLMs are used to implement four optical diffraction layers, and their weights are dynamically reconfigured to support time-aware iterative denoising.

It's therefore necessary to discuss the Time Multiplexing technique used in this experiment.

Limitations incurred by opto-electronic conversion

One of the major limitations and bottlenecks of such an opto-electronic system is not discussed, i.e., the repeated analog-to-digital and digital-to-analog conversions, which are especially energy-consuming since many conversion cycles are needed in the iterative inference process.

Benchmark performance

Finally, it would be informative to understand the performance of this optical diffusion model by situating it in the landscape of pure digital diffusion models. For example, what is the pure digital model that is comparable to the proposed model in terms of generation quality (FID, etc.)? I understand that at this stage, the competing digital model would be quite simple and naive (e.g., just a simple MLP with small numbers of parameters). However, providing such information would shed light on the path to further improve the performance of the optical neural network.

Minor suggestions:

  • Please add a space between the citation bracket and the preceding word

  • L175, " (==autodiff citation==).", the citation TODO is not addressed.

  • L180, "300x300", please replace the letter "x" by the multiplication operator "\times".

  • L211, "1 − CorrelationCoefficient" looks a bit ugly. Please also clarify how CorrelationCoefficient is computed.

  • It is preferable to have self-contained captions for each figure so that readers won't need to consult the main paper's text in order to understand the figure.

  • A recent paper regarding the optical neural network [B] might be worthy of being mentioned in the paper.

[B] Wei, Kaixuan, et al. "Spatially varying nanophotonic neural networks." arXiv preprint arXiv:2308.03407 (2023).

Questions

See above.

Post-rebuttal:
The authors have carefully addressed my prior concerns and several follow-up questions. Overall, this work is novel to me in terms of both the new application (an optical realization of the diffusion model) and the unique online training algorithm that bridges the gap between simulation and actual experimentation. Therefore, I vote for acceptance of this paper.

Limitations

See above

Author Response

We thank the reviewer for the in-depth analysis of our study and the numerous valuable suggestions for significantly improving the quality of the manuscript. We hope to address the points raised with the following responses and the common author rebuttal.

Online training algorithm

Re: "...Also, I don't understand the meaning of "simulated by another diffraction model." ..."

We will clarify the experiment in Fig. 5 as follows with an addition to the main text and the appendix:

"In Fig. 5, we explore the efficiency of the proposed algorithm by modeling a possible experiment scenario. In this scenario, we define two different optical models, while actually both of them are simulations of the optical wave propagation for an exact insight into the algorithm, we designate the first one as the “optical experiment” by configuring it with the calibration angles obtained from the physical experiment. These four angles account for the slight misalignment of the experiment and define the input angle of the beam to the cavity and the angle between the mirror and the SLM, in x and y axes, all being in the range of a few milliradians and their measurement details being provided in A.7. The second model, considered as the digital twin, is initialized with their calibration angles 20% higher."

Re: "... I would like to see the final difference between the SLM's phase profile and the one used in the digital model..."

Thanks for pointing out this distinction to be clarified. The digital twin and the physical system share the same layer parameters, denoted by $\theta_{layer}$, which indicate the phase value for each pixel of the layers. On the other hand, $\theta_{alignment}$ holds the expected alignment parameters of the physical system, including the beam and mirror angles and the layer distances.

Re: "...the experimental error backpropagation algorithm compared in the paper...."

The main difference between the Neural Holography approach and the one in [27] is that in [27], backpropagation of the experimental loss is realized over a pre-trained neural-network-based emulator of the physical system instead of the wave-propagation-based physical model. Thanks for the suggestion; we will rewrite the passage starting from L97: "Calculating the gradients of the experimental loss through the optical wave propagation model allows for high-quality computer-generated holograms [A]. Similarly, a pre-trained neural-network-based emulator of a physical system can be used for the same purpose [27]."

Re: "...did you first pretrain the digital model in simulation and then finetune it with the online learning algorithm ..."

In our implementation, $\theta_{alignment}$ was pre-trained while $\theta_{layer}$ was trained from scratch in the optical experiment, mainly because the experiment can create outputs faster than its digital model. Still, as mentioned, for deploying this framework at scale it is not practical to train models from scratch for each hardware instance. Fortunately, we expect that with a few fine-tuning epochs, pre-trained models can be deployed to different hardware instances with small differences. The original model can be trained completely digitally with a vaccination approach, slightly perturbing the optical alignment during training so that the model becomes resilient to small mismatches at some performance penalty [1]. On-hardware fine-tuning can then further improve its performance.
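For concreteness, a minimal PyTorch sketch of this two-parameter-set loop as we describe it above; `run_experiment`, `digital_twin`, and `loader` are hypothetical placeholders for the physical setup, its differentiable model, and the data pipeline, and the gradient-substitution step is one standard hardware-in-the-loop formulation rather than our exact implementation:

```python
import torch

# run_experiment(), digital_twin(), and loader are hypothetical placeholders for
# the physical setup (returns camera images, no gradients), its differentiable
# model, and the training data pipeline, respectively.
theta_layer = torch.zeros(4, 256, 256, requires_grad=True)   # trained from scratch
theta_align = torch.zeros(3, requires_grad=True)             # pre-trained offline
opt_layer = torch.optim.Adam([theta_layer], lr=1e-2)
opt_align = torch.optim.Adam([theta_align], lr=1e-4)

for noisy, target_noise in loader:
    y_exp = run_experiment(noisy, theta_layer)               # hardware forward pass

    # Task update: the loss is evaluated on the experimental output, but the
    # gradient is routed through the digital twin (hardware-in-the-loop).
    y_twin = digital_twin(noisy, theta_layer, theta_align)
    y_hat = y_twin + (y_exp - y_twin).detach()               # value from hardware, grad from twin
    task_loss = torch.nn.functional.mse_loss(y_hat, target_noise)
    opt_layer.zero_grad(); task_loss.backward(); opt_layer.step()

    # Digital-twin refinement: keep the alignment parameters matched to the
    # (drifting) hardware by fitting the twin's output to the camera images.
    y_fit = digital_twin(noisy, theta_layer.detach(), theta_align)
    twin_loss = torch.nn.functional.mse_loss(y_fit, y_exp.detach())
    opt_align.zero_grad(); twin_loss.backward(); opt_align.step()
```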

Experimental setup

We apologize for the confusion, and we will rework the presentation. In the schematic of Figure 6, we intended to show that we use one SLM device, shown by the gray outer box. Within this single device, we show four modulation layers placed in sub-regions of the device. Using a mirror positioned across from it, light bounces back and forth and gets modulated by the layers (sub-regions) successively. On the right-hand side of Figure 6, we show a photograph of the experimental system, indicating the single SLM display along with the highlighted light path and the mirror position. Since we used a single device, we did not need time multiplexing. But it would be an interesting direction to look into in the future, when many devices are used together to scale up the system.

Limitations incurred by opto-electronic conversion

We have provided the power comparisons in the common rebuttal, showing a significant advantage for the ODU, and we will modify the manuscript accordingly. Our calculations incorporate board-level energy consumption, meaning that opto-electronic transformations, including analog-to-digital conversions and data transfer in the loops, are included. Since these repeated conversions and data transfers happen at the kHz scale, the power consumption does not increase dramatically, as the dissipated power depends on the frequency. We agree with the reviewer that these calculations would be quite different if we wanted to operate at the GHz scale. While kHz-scale looping already yields higher performance, as presented by our results, we acknowledge that one major future research direction is efficient conversion and data transfer at the GHz scale.

Benchmark performance

We thank you for the suggestion. Correspondingly, we compared the generation performance with two pure digital models, a convolutional U-Net architecture and a fully connected network, for different output resolutions of the Fashion-MNIST dataset and on the more complex AFHQ dataset. The results show that ODU performs and scales in the same manner as large-scale digital models while benefiting from the energy efficiency of optics, showing promise for real-life tasks. We refer you to the author rebuttal for the details.

Minor suggestions

Thank you for the suggestions; they will be incorporated, and we believe they will improve the paper's quality and readability.

[1]: Mengu, Deniz, et al. "Misalignment resilient diffractive optical networks." Nanophotonics 9.13 (2020): 4207-4219.

Comment

Thanks for the clarification. I have some follow-up questions regarding the response provided by the authors. Specifically,

  1. How do you model the beam angle and the mirrors in the simulation? Equation (2) is too generic, and it is hard to work out from it how the proposed system is actually simulated.

  2. How do you make \theta_alignment differentiable? In general, the layer distance seems non-differentiable, so it is unclear how you backpropagate gradients to these parameters.

Comment

Thanks for your questions. We will also include the following explanation about the modeling of the experimental system in the appendix, section A.3, of the revised version of the manuscript.

Differentiable Modelling of Light Propagation in ODU

To benefit from parallelized and optimized FFT algorithms and automatic differentiation, the wave propagation in the proposed system is modeled in the PyTorch environment with a Split-Step Fourier formalism derived from Eq. 2. The diffraction step of the propagation is calculated with a nonparaxial diffraction kernel [1] in the Fourier domain, and effects such as reflection or modulation of the light beam by the layer parameters are applied in the spatial domain, so that the electric field after propagating a distance $\Delta z$ becomes:

$$E(x, y, z+\Delta z) = \mathcal{F}^{-1}\left\{ \mathcal{F}\left\{ E(x,y,z)\, R(x,y) \right\} e^{-\frac{j \Delta z \left(k_x^2 + k_y^2\right)}{k + \sqrt{k^2 - k_x^2 - k_y^2}}} \right\}$$

In addition to the parameters of the layers $L_n^m(x, y)$ (or simply $\theta_{layers}$), the spatial term $R(x,y)$ can include angle changes of the beam. For instance, if the beam is not perpendicular to the SLM or the mirror, the reflection changes the angle of the beam by $\Delta \alpha = (\alpha_x, \alpha_y)$. Then, for example, on the SLM plane $R_m(x, y) = L_m(x, y)\, e^{-jk(x \sin \alpha_x + y \sin \alpha_y)}$, where $R_m(x,y)$ is the compound spatial term, $L_m(x,y)$ holds the modulation parameters at layer $m$, and $e^{-jk(x \sin \alpha_x + y \sin \alpha_y)}$ is the operator that changes the direction of the wave propagation vector. As with the SLM, the angle of the mirror also determines the propagation direction of the beam, which can be included in the model as $R_{\text{mirror}}(x, y) = e^{-jk(x \sin \alpha_x + y \sin \alpha_y)}$.

To calibrate the model of the experiment against the actual experiment, we define three trainable alignment parameters: $z_{gap}$ (the distance between the mirror and the SLM), $\Delta \alpha_{mirror}$ (twice the angle between the mirror and the SLM), and $\Delta \alpha_{beam}$ (twice the angle between the input beam and the SLM). This group of trainable model parameters is called $\theta_{alignment}$, and since its constituents appear in the forward model within differentiable functions, the auto-differentiation algorithm [2] can calculate their derivatives with respect to the error between the predicted camera images and the acquired ones. $\theta_{alignment}$ is initially pre-trained with experiments placing square-shaped $\pi$ phase differences randomly on the SLM. During the online training procedure, it is further trained with the data from the denoising experiments.
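For concreteness, a minimal PyTorch sketch of one such differentiable propagation step; the grid size, parameter values, and single-layer setup are illustrative simplifications rather than our full implementation:

```python
import math
import torch

# Illustrative grid: 256x256 pixels, 8 um pitch, 850 nm wavelength.
N, dx, wl = 256, 8e-6, 850e-9
k = 2 * math.pi / wl

x = (torch.arange(N) - N / 2) * dx
X, Y = torch.meshgrid(x, x, indexing="ij")           # spatial coordinates
fx = 2 * math.pi * torch.fft.fftfreq(N, d=dx)
KX, KY = torch.meshgrid(fx, fx, indexing="ij")       # spatial frequencies

# Trainable parameters: per-pixel layer phases (theta_layer) and
# alignment parameters (beam/mirror tilts and the SLM-mirror gap).
theta_layer = torch.zeros(N, N, requires_grad=True)
alpha_x = torch.tensor(1e-3, requires_grad=True)     # tilt, rad
alpha_y = torch.tensor(1e-3, requires_grad=True)
z_gap = torch.tensor(0.02, requires_grad=True)       # distance, m

def step(E):
    # Spatial domain: phase modulation and the tilt operator e^{-jk(x sin ax + y sin ay)}.
    R = torch.exp(1j * theta_layer) * torch.exp(
        -1j * k * (X * torch.sin(alpha_x) + Y * torch.sin(alpha_y)))
    # Fourier domain: nonparaxial diffraction kernel over the distance z_gap.
    kz = torch.sqrt((k**2 - KX**2 - KY**2).clamp(min=0.0))
    H = torch.exp(-1j * z_gap * (KX**2 + KY**2) / (k + kz))
    return torch.fft.ifft2(torch.fft.fft2(E * R) * H)

E_out = step(torch.ones(N, N, dtype=torch.complex64))  # one bounce in the cavity
loss = (E_out.abs() ** 2).mean()                       # camera records intensity
loss.backward()                                        # grads reach both parameter sets
```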

[1]: M. D. Feit and J. A. Fleck, "Beam nonparaxiality, filament formation, and beam breakup in the self-focusing of optical beams," J. Opt. Soc. Am. B 5, 633-640 (1988)

[2]: Paszke, Adam, et al. "Automatic differentiation in PyTorch." (2017).

Comment

Thanks for the explanation. Now I can better grasp the method details.

I'm still confused by the following statement: "we show four modulation layers by placing them in sub-regions of the single device. Using a mirror positioned across, light bounces back and forth and gets modulated by the layers (sub-regions) successively"

Particularly, how do you ensure that the light diffracted by an SLM sub-region bounces back and forth to hit precisely the region of the SLM you want? Could you explain this formally instead of using the ray-optics analogy (since in wave optics, light can propagate in diverse directions)?

Anyway, I think these technical deltas constitute novel innovations suitable for publication at NeurIPS. Note that unlike the Nature/Science series, NeurIPS places more weight on the technical and method parts, so it would be better to recompose the paper and elaborate on these novel technical parts in the Methods section.

Comment

Thank you for the questions and your constructive feedback, we deeply appreciate it. It is also clear to us now that in the next version, more technical details should be included in the main text.

For the beam position on the SLM, instead of precisely controlling it, we manually and roughly aligned the optical setup with the location and angle of the SLM so that the beam performs the intended number of reflections and exits the cavity formed by the SLM and the mirrors. During this procedure, the beam is collimated with a beam waist on the order of meters and is not modulated, so we did not need to consider its wave-optics aspects. Then, with the help of two 4f imaging systems directed at the input and output reflections on the SLM, we imaged the entry and exit positions of the beam. These locations, plus two more evenly spaced locations, are manually assigned as our four modulation areas on the SLM.

However, due to the slight misalignment in the beam and mirror angles (milliradian scale), the four reflections are not perfectly evenly spaced on the SLM. Since the forward model of the system is informed about this mismatch through the definition of $\Delta \alpha_{beam}$ and $\Delta \alpha_{mirror}$, we observed that the trained modulation patterns at these locations of the SLM correct for this effect automatically through the application of a weak phase ramp. If we analyze this linear phase ramp in 1D for a correction angle of $\beta$, the phase pattern becomes $\Phi(x) = k \sin(\beta)\, x$. For a wavelength of 850 nm and $\beta = 10^{-3}$, this weak phase ramp performs a few cycles of $2\pi$ over the 2.4 mm beam width.
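As a quick standalone check of this estimate, using only the constants quoted above:

```python
import math

wl, beta, width = 850e-9, 1e-3, 2.4e-3    # wavelength (m), angle (rad), beam width (m)
k = 2 * math.pi / wl                      # wavenumber, rad/m
total_phase = k * math.sin(beta) * width  # Phi(x) = k sin(beta) x evaluated at x = width
print(total_phase / (2 * math.pi))        # ~2.8, i.e. a few cycles of 2*pi
```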

While the optimization algorithm defines this ramp itself to center the beam in the defined modulation regions, we did not observe any further overall shift in the location of the beam due to other diffractive effects. Additionally, as the digital model of the system is based on wave optics and can track light diffracted out of the beam path, the trained phase patterns keep the light in the previously defined modulation regions by focusing it. This is because the diffracted-out light can carry relevant information, and losing it increases the cost function. In summary, after the initial definition of the modulation areas on the SLM, the rest of the beam propagation is fully controlled by the trainable phase modulations.

Comment

Thank you for the discussion. Any other thoughts from other reviewers?

Official Review
Rating: 4

Summary: This paper trains free-space diffractive optical neural networks as an implementation of a diffusion model for image denoising applications. On-chip learning and hybrid training are used to improve robustness to noise and variations.

Strengths

The application of optical computing hardware is interesting. Robustness has been considered in the training procedure with online learning for error calibration.

Weaknesses

  1. The novelty is limited. No new hardware system is proposed. Diffusion model uses the standard algorithm. On-chip learning and error calibration are also previously proposed methods in the literature. Most of the method section is standard DDPM, diffusion, on-chip learning algorithms. No new contributions.
  2. What are the major advantages of free-space optical diffusion model compared to other accelerators? Rigorous area/speed/throughput/energy efficiency evaluation is missing.

Questions

  1. The novelty is limited. No new hardware system is proposed. Diffusion model uses the standard algorithm. On-chip learning and error calibration are also previously proposed methods in the literature. Most of the method section is standard DDPM, diffusion, on-chip learning algorithms. No new contributions.

  2. What are the major advantages of free-space optical diffusion model compared to other accelerators or other diffusion neural network models? Rigorous area/speed/throughput/energy efficiency evaluation is missing. Thorough comparison to other diffusion models is missing.

  3. Only toy examples are shown on MNIST, Fashion MNIST. High-resolution image denoising benchmarks are required.

  4. The free-space phase mask is not reconfigurable. SLMs also have low resolution or slow reprogramming speeds. The flexibility and speed/efficiency raise major concerns.

Limitations

  1. The novelty is limited. No new hardware system is proposed. Diffusion model uses the standard algorithm. On-chip learning and error calibration are also previously proposed methods in the literature. Most of the method section is standard DDPM, diffusion, on-chip learning algorithms. No new contributions.

  2. What are the major advantages of free-space optical diffusion model compared to other accelerators or other diffusion neural network models? Rigorous area/speed/throughput/energy efficiency evaluation is missing. Thorough comparison to other diffusion models is missing.

  3. Only toy examples are shown on MNIST, Fashion MNIST. High-resolution image denoising benchmarks are required.

  4. The free-space phase mask is not reconfigurable. SLMs also have low resolution or slow reprogramming speeds. The flexibility and speed/efficiency raise major concerns.

Author Response

We thank the reviewer for their time and efforts. We reply to the comments under four main sections:

Limitations - 1

This paper shows diffusion-based image generation with the advantages of the optical modality for the first time. We use an architecture that incorporates off-the-shelf hardware in an innovative way while maintaining near-future deployment practicality and the reproducibility of the study.

Moreover, the algorithmic novelty of the study is two-fold:

1- The proposed time-aware denoising algorithm allows image generation with minimal changes to the set of passive optical weights; this results in a substantial decrease in the latency and energy consumption of the process.

2- To the best of our knowledge, we propose and show the effectiveness of the online learning algorithm for the first time; it differs from the existing literature by continuously updating the physical alignment parameters of the system during training. This significantly alleviates the model drift issue, where a fixed digital model of a physical system becomes insufficient as the system's operating point changes over the course of training. We experimentally show that online training achieves convergence with a physical system, which is very challenging with existing methods.

Limitations - 2

Our study demonstrates that optical wave propagation can be a new modality for realizing diffusion models with the proposed algorithms. This is crucial because, as opposed to electronics, optics has intrinsic near-zero loss and parallelization capability. We also hope this will motivate the community to realize machine learning algorithms with different modalities and address the environmental impact of AI models.

Even though this initial study mainly aims to be a proof of concept with non-optimized hardware, we discuss the energy consumption in Appendix 4, indicating that the proposed method can significantly enhance the energy efficiency of diffusion models. Furthermore, to address the need for a deeper analysis you mentioned, we now provide a more extensive comparison between our approach and two digital models of comparable accuracy in the shared rebuttal section. Please refer to Reb. Table 1 for the details of the comparison. With this new set of measurements, Reb. Fig. 1 shows that ODU scales in the same way as digital implementations in terms of output image complexity and performs comparably with models requiring MFLOPs of compute. Reb. Fig. 3, plotted by re-organizing the information in Fig. 4, shows the scaling of the ODU's performance with the number of model parameters. Remarkably, the scaling constant closely matches the power-law scaling constant of large-scale image generation models from the literature [1]. Even though ODU has a completely different architecture, the similar scaling trend shows its potential for tackling high-resolution real-life scenarios. We plan to include these additional findings in the article.

Limitations - 3

In this study, we aimed to show the realizability of diffusion-based image generation with optics while working with a number of parameters that can be realized with easily accessible devices, so that the findings can be reproduced by different teams. The scaling studies in Fig. 3 and Reb. Fig. 1 and 3 provide evidence that the method scales similarly to digital models, which we know are capable of high-resolution tasks once they reach sufficient complexity.

Another new piece of evidence in this regard is the comparison of ODU and two other digital models on a more complex and higher-resolution dataset of cat images with 40-by-40 pixels from the AFHQ dataset. Even though none of the models has enough complexity to create realistic images, as shown in Reb. Fig. 2, ODU still obtains better generations and a correspondingly better FID score than the digital models on this dataset. This result further strengthens the expectation that the proposed method will perform competitively at high resolution with a larger number of parameters.

Limitations - 4

When the model is deployed with passive phase masks for inference, the phase masks will not be reconfigurable but will provide a significant power efficiency advantage. Please check the common rebuttal for detailed speed/power/efficiency/scaling results, augmented and re-presented. The SLM is used for cases that require reconfigurability: training, and modalities with interchangeable inference tasks. We do not agree that the resolution is low: the typical pixel pitch is 8 microns, and there are models with smaller pixel pitch employing more than 10 million pixels in a single cm-scale device, which can be cascaded for larger networks. Regarding speed, 1 kHz models are commercially available, with ongoing research into faster ones; the reviewer can find the mentioned models and specifications in various vendors' catalogs available online (e.g., Holoeye). We use one set of masks for 100 passes, meaning that the SLM pattern does not need to change for 100 passes; this is one of the novelties of our algorithm. With 1 kHz taken as the refresh rate, the SLM can then be fed with data at up to 100 kHz, completing 1000 passes in 10 milliseconds.
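For concreteness, the arithmetic behind this throughput estimate, as a standalone sketch:

```python
refresh_hz = 1_000                             # SLM refresh rate (1 kHz)
passes_per_mask = 100                          # one set of masks is reused for 100 passes
data_rate_hz = refresh_hz * passes_per_mask    # 100 kHz effective input rate
print(1_000 / data_rate_hz)                    # time for 1000 passes: 0.01 s = 10 ms
```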

[1] Henighan, Tom, et al. "Scaling laws for autoregressive generative modeling." arXiv:2010.14701 (2020).

Comment

Thanks for the responses. The programming frequency concern is addressed. The scalability or compactness of using an SLM-based DONN solution for reconfigurability is not justified, as it is only suitable for lab demonstration. Also, this diffractive hardware with fixed phase masks for 100 passes is designed for a single-channel diffusion model; the expressivity on complicated tasks is still not demonstrated. I will give a 4 due to these fundamental limitations.

Comment

Thank you for your comments. We would like to briefly offer some arguments that we believe are relevant to your concerns.

Although this study focused on an initial, proof-of-principle demonstration of diffusion models with optics through algorithmic innovations with easy-to-access hardware, it is worth noting that SLM-based systems have shown potential for wide-scale deployment in industrial scenarios. One example is the compact LCoS-based wavelength selective switches (WSS) used in telecommunications.

As you recognized, due to the larger wavelength of light, using this type of architecture in tiny devices such as smartphones could be challenging. However, the majority of AI workloads are currently handled by large-scale datacenters, and SLM-based rack-scale devices, such as optical cross-connects, are already used for connectivity in these datacenters without issues related to their form factors. We believe the proposed method could be particularly suitable for sampling with diffusion models at low energy costs in a datacenter setting.

On the expressivity for complicated tasks, our scalability results, provided in the author rebuttal, indicate that as datasets become more complex, ODU continues to compare favorably with various digital NNs. Furthermore, increasing the trainable parameter count of the optical system improves generation quality at the same rate as digital implementations. We believe these findings indicate that, so far, there is no fundamental limitation of the proposed model for tackling more complicated tasks.

Additionally, we agree that adding multi-channel capabilities to this framework is a crucial step. We are already exploring ways to achieve this in future studies and expect significant improvements in generation performance with it. Optics offers different means to add this additional dimensionality without significantly altering the architecture, such as utilizing different properties of light like wavelength [1], polarization [2], or space-multiplexing.

[1]: Feldmann, Johannes, et al. "Parallel convolutional processing using an integrated photonic tensor core." Nature 589.7840 (2021): 52-58.

[2]: Li, Jingxi, et al. "Polarization multiplexed diffractive computing: all-optical implementation of a group of linear transformations through a polarization-encoded diffractive network." Light: Science & Applications 11.1 (2022): 153.

Official Review
Rating: 6

The authors present a hardware-based implementation of a denoising diffusion model. Instead of using a neural network, the authors propose to use an optical setup to perform the denoising steps during image generation. The system is trained using a digital twin simulating the optical setup. The authors evaluate their method on different datasets and report the relevant quality metrics.

Strengths

  • Building an efficient hardware-based diffusion model can be dramatically impactful as generative AI becomes more and more popular and finds more and more applications.

  • The potential speed up compared to conventional networks mentioned by the authors would be a game changer.

  • I believe the solution the authors came up with to enable training using the digital twin is quite elegant and makes a lot of sense.

Weaknesses

  • There are no baselines in the experimental evaluation. Obviously, this is only a proof of concept, and we cannot expect competitive numbers. Still, there may be insightful baselines to compare the system against. E.g., how does the performance compare to the simulated digital twin? It would also be great to show qualitative results for comparison. Another baseline could be a simple neural network with the same runtime.

  • The presented implementation uses purely optical elements that are limited to linear operations. Even though the authors hint that this may be fixed in future work, I think this is an important caveat. After all, the power of conventionally used artificial neural networks relies heavily on their non-linearities.

  • I am missing a qualitative evaluation of the denoising task itself. Can the authors include images directly showing inputs and outputs of the optical system, instead of the results of the diffusion process?

  • I am not completely sure NeurIPS is the correct venue for this type of publication, which is very much focused on the experimental hardware aspect.

Questions

Can the authors discuss how their denoiser implementation differs from the one in [23]? Could other optical denoisers be utilised as well?

Limitations

I feel the fact that the denoiser is fully linear is quite limiting. How could non-linear elements be included in the setup in the future? Would they increase execution time?

Author Response

We thank the reviewer for their overall positive appreciation of our manuscript. We start with point-by-point replies to the mentioned weaknesses (W1-4), questions (Q), and limitations (L), respectively. We will incorporate the additional results presented here, along with the revisions, into the main text.

W1. As can now be seen in the common rebuttal (Reb. Table 1), we examined both fully connected and convolutional architectures to compare against our performance. We found that, across output image resolutions, we have similar scaling to these purely digital networks, as depicted in Reb. Figure 1. Moreover, Reb. Figure 3 shows the scaling of task accuracy with model parameter count, following the well-established power law of digital neural networks [10.48550/arXiv.2010.14701] with an on-par slope.

W2. We thank the reviewer for allowing us to clarify this point. In the proposed framework, several factors create a nonlinear transformation of the input information. One of these is the inherently nonlinear nature of intensity detection. While the programmable parameters and free-space propagation act on the wave by affecting both phase and magnitude (complex numbers in mathematical terms), the detection records only real numbers by applying an absolute-square operation, as indicated in Fig. 2. Note that the programmable parameters are introduced with phase encoding, which also creates a nonlinear relation between them and the detected output intensity. The performance similar to nonlinear neural networks shown in Reb. Fig. 1-3 is obtained thanks to these factors introducing nonlinearity into the system. We discuss possible additional nonlinearities in our reply to the comment in the Limitations section (please see below).
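To make these two nonlinearities concrete, a minimal PyTorch sketch (shapes and values are purely illustrative):

```python
import torch

E_in = torch.randn(256, 256, dtype=torch.complex64)  # incident field (illustrative)
theta = torch.rand(256, 256, requires_grad=True)     # programmable phase parameters

E_mod = E_in * torch.exp(1j * theta)   # phase encoding: nonlinear in theta
intensity = E_mod.abs() ** 2           # square-law detection: nonlinear in the field
intensity.mean().backward()            # and still differentiable end to end
```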

W3. We thank the reviewer for this feedback. We agree that images corresponding to the output intensities of the ODU will help readers follow the method more easily. We include these images in Reb. Fig. 4, and they will be added to the main text after acceptance. The camera-plane images contain more pixels; they are matched to the data size by downsampling and normalization, and these steps are part of the forward pass and error backpropagation on the digital twin to match it to the experiment. The reviewer can notice the effect of downsampling on the camera plane through the emerging high-intensity grid matching the dataset pixelation.

W4. While we propose new hardware for denoising diffusion-based image generation, since this study aims to create a different type of machine learning infrastructure, we hope that it fits the description given in NeurIPS's Call for Papers: "Infrastructure (e.g., libraries, improved implementation, and scalability, distributed solutions)...Machine learning is a rapidly evolving field, and so we welcome interdisciplinary submissions that do not fit neatly into existing categories." We also wanted to build an interdisciplinary bridge so that the development of novel hardware can draw on feedback from the machine learning community. Moreover, programming the hardware to perform the image generation task required the development of a novel algorithm for efficient and reliable operation. Our main contributions in this direction are the removal of time-embedding steps for multi-step denoising, and the online learning algorithm for training with non-ideal experimental systems. These algorithmic tools are not limited to specific hardware and can be beneficial in virtually any analog computing architecture, especially where updating weights is the expensive procedure.

Q. Both implementations denoise images physically with multiple modulation layers in free space, while they differ in aspects such as their physical architectures, electromagnetic spectra, denoising methods, and noise types. Ref. 23 utilizes a single set of modulation layers to filter out salt-and-pepper noise in images and predict images with lower noise. In contrast, our work is a multi-pass architecture where the layers change with respect to the timestep, and the prediction is the Gaussian noise term in the input images, not the denoised image. Hence we can perform image generation, which is not possible for Ref. 23.

Other differences are that Ref. 23 works with terahertz-frequency electromagnetic waves and 3D-printed, fixed modulation layers, where the wavelength and feature sizes are on the order of millimeters, while this study uses multiple reflections of an optical light beam off a single reconfigurable spatial light modulator, where the wavelengths and features are on the micrometer scale. This allows for compatibility with mature optical technologies for an efficient prototype and a compact form factor.

L. While our system benefits from multiple nonlinear effects and shows scaling trends similar to digital nonlinear networks, it is also feasible to add more nonlinear interactions through different optical mechanisms. The addition of a saturable absorber [1] or an optically limiting film [2] would provide optical nonlinearities without increasing the energy budget substantially. Moreover, these physical effects can have lifetimes as low as picoseconds, so they would not increase the latency of the system either.

[1] Zubyuk, Varvara V., et al. "Low-power absorption saturation in semiconductor metasurfaces." ACS Photonics 6.11 (2019): 2797-2806.

[2] Vivien, L., et al. "Carbon nanotubes for optical limiting." Carbon 40.10 (2002): 1789-1797.

Comment

Thank you for the rebuttal. I don't have any additional questions.

Author Response

We sincerely thank the reviewers for their time and effort in providing their insightful comments. They recognized our work as an "innovative demonstration of an opto-electronic implementation of a toy diffusion model" (5uf9), agreed with its potential impact ("efficient hardware-based diffusion model can be dramatically impactful" (hbsN); "addresses the environmental impact of traditional computing" (heiE)), and found the proposed solution useful ("training using the digital twin is quite elegant" (hbsN); "proposed online learning scheme sounds effective for bridging the gap" (5uf9)).

We would like to start by briefly reiterating our findings and motivation. This study demonstrated for the first time how programmed optical wave propagation can be used as a computing engine for image generation. The main potential benefit of this new method is the sampling of diffusion models with a much smaller energy budget compared to electronics, since optical wave propagation has very small intrinsic loss, while achieving comparable or better quality (Rebuttal (Reb.) Fig. 1 and 2). This is especially interesting because diffusion models are currently among the most costly generative AI models due to their repetitive denoising process, with a correspondingly large environmental impact: the generation of a single image can emit more than a kilogram of $CO_2$ [1].

Moreover, the scaling behavior of the optical implementation follows the same trend as the digital one. The scaling of the proposed method was a common point in the majority of the reviews, and we address it by introducing additional data (Reb. Fig. 1 and 3).

To enable this proof of principle, we developed a novel method, the time-aware denoising algorithm, which allows image generation with minimal changes to the set of passive optical weights and a substantial decrease in the latency and energy consumption of the process. In addition, the proposed online learning algorithm achieved convergence, showing the promise of the framework for real-life scenarios. Thanks to these flexible algorithms, the method can be used to implement image generation with different physical hardware.

Scaling Image Resolution and Comparison with Digital Neural Networks

The overlapping comments of the reviewers on this topic made it obvious that the study should provide further details on the scaling of the proposed system, both in terms of parameter counts and output dimensions, along with comparisons against purely digital implementations. As detailed in the attached Reb. Table 1, two digital architectures, one fully connected and one a convolutional U-Net [2], are trained on the same image generation tasks as the Optical Diffusion Unit (ODU). Here we would like to emphasize that the energy consumption and speed of the ODU are reported for the simple laboratory implementation, where efficiency is not optimized; hence the energy efficiency can be significantly improved. For the digital implementations, we performed the benchmark on an Nvidia L4 GPU, one of the most energy-efficient devices available today. Due to the changes in the digital hardware and models, we will update the corresponding appendix section. The results shown in Reb. Fig. 1 indicate that ODU outperforms the two digital neural networks, and that all three scale in a similar manner, both in denoising and generation performance, when the generated image dimension is increased while the model sizes are kept constant. To extend the analysis of output image size to higher resolution and real-life images, we trained the same three networks on cat images with 40-by-40 pixels from the AFHQ dataset [3]. As shown in Reb. Fig. 2, while this problem requires more capable models for realistic images, ODU kept achieving better FID scores than the digital neural networks at this scale.

Scaling of Model Parameter Counts

In addition to the scaling with respect to image resolution, by combining and replotting the first two tiles of Fig. 4 of the main paper, Reb. Fig. 3 shows that the ODU follows the same widely accepted power-law trend of performance versus parameter count as digital networks [4]. Most significantly, when the optical implementation is fitted to a power-law equation, the exponent of the power law (-0.15) is approximately the same as the value (-0.16) reported for large-scale image generation networks in [4]. This fit parameter gives the slope of the line in the logarithmic plot, indicating how fast the generation performance scales with the number of parameters; in this case, it shows that ODU improves its performance at a similar rate to large-scale image generation networks as its parameter count is increased. Finally, we remark that the single outlier in this trend is the case with only a single modulation layer, which does not benefit from multiple optical modulations in the proposed architecture.
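For reference, a minimal sketch of such a power-law fit on a log-log scale; the data points below are placeholders chosen only to illustrate the procedure, not our measurements:

```python
import numpy as np

params = np.array([1e4, 3e4, 1e5, 3e5])    # parameter counts (placeholders)
fid = np.array([60.0, 51.0, 43.0, 36.5])   # FID scores (placeholders)

# Fit log(FID) = b*log(N) + log(a); the slope b is the power-law exponent.
b, log_a = np.polyfit(np.log(params), np.log(fid), 1)
print(f"exponent = {b:.2f}")               # -0.15 with these placeholder numbers
```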

[1] Luccioni, Sasha, Yacine Jernite, and Emma Strubell. "Power hungry processing: Watts driving the cost of AI deployment?." The 2024 ACM Conference on Fairness, Accountability, and Transparency. 2024.

[2] Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer International Publishing, 2015.

[3] Choi, Yunjey, et al. "Stargan v2: Diverse image synthesis for multiple domains." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[4] Henighan, Tom, et al. "Scaling laws for autoregressive generative modeling." arXiv preprint arXiv:2010.14701 (2020).

Comment

Dear reviewers, do the authors' responses answer your questions or address your concerns? Thanks.

Comment

Dear reviewers, as we approach the final two days, please take a moment to review the author's responses and join the discussion. Thank you!

Comment

Dear reviewers, the authors are eagerly awaiting your response. The author-reviewer discussion closes on Aug 13 at 11:59 pm AoE. Thanks!

Final Decision

The paper proposes a novel approach for implementing denoising diffusion models using optical computing, leveraging passive diffractive optical layers to perform efficient denoising with minimal energy consumption. The method addresses the high energy demands of traditional diffusion models on electronic hardware by utilizing light propagation through trained modulation layers, offering a promising alternative for energy-efficient image generation. The paper received mixed opinions. After going through all the reviews and discussions, we borderline accept the paper. While it shows promise, it would benefit from further development in scalability, practical deployment, and detailed comparisons with existing methods. The potential impact on energy-efficient AI is significant, and with additional refinements, this work could make a valuable contribution to the field. Lastly, we encourage the authors to polish their paper significantly.