PaperHub
Overall score: 6.8 / 10
Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 3.5 · Quality: 2.8 · Clarity: 2.8 · Significance: 3.3
TL;DR

Loong-X enables hands-free image editing using multimodal neural signals, achieving performance comparable to text-driven methods by combining BCIs with the proposed diffusion-based generative methods.

Abstract

Keywords
Brain-Computer Interface, Image Editing, Visual Generation, Generative Models, Diffusion Models

Reviews and Discussion

Review
Rating: 4

The paper introduces LoongX, an image editing approach based on neural and speech signals. The authors collect a comprehensive set of human stimuli for image editing including EEG, fNIRS, PPG, motion and speech signals. The proposed LoongX approach demonstrates image editing capabilities when conditioned on these signals, providing a new paradigm beyond textual image editing. Further, the authors provide novel ablations into the importance of different brain region signals for image manipulation.

Strengths and Weaknesses

Strengths

  • General purpose image editing with neural conditioning is a novel task and field. This paper marks a potentially seminal contribution
  • Strong user study design and data collection process. Excellent, controlled study setup, good description of participant demographics and prior IRB approval
  • LMind as strong dataset contribution
  • Experimental results look promising
  • Paper provides some interesting insights on importance of different signals and brain regions for image manipulation

Weaknesses

Missing details and strong assumptions. In many places the paper fails to provide relevant information, references, or experiments:

  • Strong architectural assumptions on the input encoding using the CS3 and DGF modules. While this setup generally seems suitable, there is no clear justification for these exact design choices. Either a strong justification, references to prior work demonstrating the effectiveness of the submodules, or comprehensive ablations would be needed.
  • Additionally, LoongX seemingly arbitrarily combines EEG/PPG with T5 embeddings and fNIRS/motion with CLIP, with no clear intuition.
  • There is virtually no information provided on the actual DiT backbone or how it is conditioned on the input image. What is the specific architecture and parameter count? Is it pre-trained or trained from scratch? What encoding/framework is used for input image conditioning?
  • L 205 mentions that existing neural-text corpora are used for pre-training, without a citation.
  • Information on pre-training and fine-tuning is sparse (and also not present in the appendix). For how many epochs/steps is the model trained? Are there any interesting observations in terms of training stability, loss curves, etc. for this new form of conditioning?
  • Obvious questions about the overall setup are also not addressed. From Eq. (10) we can see that the model is trained using a flow objective. But what kind of flow? Does the model use CFG? If so, what does the null-conditioning/negative prompt look like?
  • The visualization in Fig. 6 is incredibly hard to follow

Experimental design. Additionally, there are some issues with the experimental design and the conclusions drawn from the results:

  • The choice of metrics is unconventional/unsuitable. Specifically, the pixel-based L1 and L2 distances are ill-suited to assessing image editing since they do not actually measure human-perceived similarity. Consequently, the standard choice in the image editing literature is to use LPIPS [1] instead.
  • Similarly, reporting performance as a single value per method in a table does not accurately reflect the performance of an image editing method. There is an inherent trade-off between similarity to the input image and alignment with the edit instruction. This trade-off is much better reported as a curve over different hyperparameter configurations/conditioning strengths (see InstructPix2Pix for example).
  • It is also unclear what the DINO metric is supposed to measure here
  • OmniControl as the sole baseline for textual editing does not make for a fair comparison. Firstly, OmniControl's strength is not even in textual image editing but in strong reference usage. Consequently, the reported text-only baseline does not actually reflect the strength of current textual image editing approaches. For reference, I ran the cherry-picked examples in Fig. 7 through Flux-Kontext and it strongly outperforms all depicted examples. Consequently, the conclusions drawn when comparing against textual conditioning are heavily skewed as well. These limitations are only exacerbated by the lack of information on the DiT size and training FLOPs.
  • There are a plethora of recent textual image editing works (a lot of which the authors cited themselves) that would have given a better baseline comparison for textual image editing.

Additionally, the paper would benefit from actually demonstrating "hands-free image editing", where users interact with the trained model in the same way as during data collection. The transfer from performance reported on a held-out test set to actual downstream usability seems less obvious for this form of interaction.

Minor Comments

  • Missing related work. DreamConnect [2] also performs image editing using neural signals.

[1] Zhang et al. "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric" (2018) CVPR

[2] Yasheng Sun et al. "Connecting Dreams with Visual Brainstorming Instruction" (2025) Visual Intelligence

Questions

Please address the questions posed in the weaknesses section.

Limitations

The authors discuss some key limitations in Section 5.5

Final Justification

The rebuttal addresses the majority of my concerns. While some design choices in the paper could be improved upon, I believe it makes a valuable contribution to the community and vote for acceptance.

Formatting Issues

None

Author Response

Thank you for your very detailed comments and suggestions. However, due to the 10,000-word limit, we had to heavily compress the reply and omit some content.

A1

On CS3 and DGF design choices. We agree that clearer motivation is needed. CS3 captures multi-scale temporal and structural patterns in neural signals, consistent with findings that multi-band EEG features improve intent decoding [1]. DGF performs selective multimodal fusion through dual gating, which follows prior successes of gating and normalization strategies in multimodal learning [2,3,4]. Evidence:

  1. In a 35-class editing-type classification pre-study, CS3 plus DGF achieved the best F1 (0.285) and mAP (0.411).
  2. Replacing an MLP encoder with CS3 alone raised F1 from 24.9% to 28.1%.
  3. DGF beat simple concatenation by 1.2 mAP points and cross attention by 2.2 mAP points, which is important under low-SNR neural signals.
  4. In downstream editing, adding CS3 plus DGF improved CLIP-I to 63.19% while keeping CLIP-T comparable to a simple MLP-plus-concatenation baseline. We will integrate the justification, citations, and ablations into the revised paper.
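For intuition, the sketch below shows one generic way a dual-gated fusion of two modality embeddings could be written in PyTorch. This is purely illustrative: the class name, dimensions, and gating layout are assumptions for exposition and do not reproduce the paper's actual DGF implementation.

```python
import torch
import torch.nn as nn

class DualGatedFusion(nn.Module):
    """Illustrative dual-gated fusion of two modality embeddings (hypothetical sketch,
    not the paper's DGF): each gate decides how much of its modality to pass through."""
    def __init__(self, dim):
        super().__init__()
        self.gate_a = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat_a, feat_b):
        joint = torch.cat([feat_a, feat_b], dim=-1)   # joint view of both modalities
        g_a = self.gate_a(joint)                      # gate for modality A (e.g., EEG/PPG)
        g_b = self.gate_b(joint)                      # gate for modality B (e.g., fNIRS/motion)
        return self.proj(g_a * feat_a + g_b * feat_b)

# Example: fuse two 768-dim embeddings for a batch of 4.
fused = DualGatedFusion(768)(torch.randn(4, 768), torch.randn(4, 768))
```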

[1] Altaheri, H., Muhammad, G., & Alsulaiman, M. (2022). Physics-informed attention temporal convolutional network for EEG-based motor imagery classification. IEEE transactions on industrial informatics, 19(2), 2249-2258.

[2] Choi, Y., Uh, Y., Yoo, J., & Ha, J. W. (2020). Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8188-8197).

[3] Yin, D., Ren, X., Luo, C., Wang, Y., Xiong, Z., & Zeng, W. Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph. In International Conference on Learning Representations.

[4] Rahman, W., Hasan, M. K., Lee, S., Zadeh, A., Mao, C., Morency, L. P., & Hoque, E. (2020, July). Integrating multimodal information in large pretrained transformers. In Proceedings of the conference. Association for computational linguistics. Meeting (Vol. 2020, p. 2359).


A2

On pairing EEG plus PPG with T5, and fNIRS plus Motion with CLIP. The rationale is not arbitrary. T5 provides fine-grained, token-level semantics that help precise instruction following, which complements the fast neural dynamics in EEG and the lightweight hemodynamics from PPG. CLIP provides robust global semantics that align with slower, cortex-wide fNIRS signals and intentional head motion.

Evidence from the same classification pre-study:

  1. EEG plus PPG with T5 outperformed the CLIP variant by 3.0 F1 points and 3.6 mAP points.
  2. fNIRS plus Motion with CLIP outperformed the T5 variant by 5.5 F1 points and 5.2 mAP points.
  3. In editing, the aligned pairing yielded a higher CLIP-T (25.88%) than a heterogeneous swap (24.96%). We will add this rationale and the supporting numbers.

A3

On the DiT backbone and conditioning. Backbone is FLUX.1 dev, a pretrained diffusion transformer with about 860 million parameters at 512 by 512 resolution. We integrate CS3 encoders and DGF to produce conditioning vectors that enter the DiT through cross attention, analogous to text prompts. Editing uses latent inversion of the input image and then applies neural or text conditioning during denoising. We will provide exact architecture blocks, parameter counts, and a conditioning diagram.


A4

On the missing citation for neural-text pretraining. We will add citations and clarify a two-phase modular pretraining. Phase one pretrains CS3 encoders per modality using public datasets such as Thinking Out Loud for EEG [1] and an open fNIRS resource [2], combined with our internal corpora that are disjoint from evaluation. Phase two aligns paired encoders such as EEG plus PPG and fNIRS plus Motion to instruction embeddings with a symmetric NT-Xent objective. We will list dataset names, sizes, and disjoint splits in the appendix.

[1] Nieto, N., Peterson, V., Rufiner, H. L., Kamienkowski, J. E., & Spies, R. (2022). Thinking out loud, an open-access EEG-based BCI dataset for inner speech recognition. Scientific Data, 9(1), 52.

[2] Ning, M., Duwadi, S., Yücel, M. A., Von Lühmann, A., Boas, D. A., & Sen, K. (2024). fNIRS dataset during complex scene analysis. Frontiers in Human Neuroscience, 18, 1329086.
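For clarity, here is a minimal sketch of a symmetric NT-Xent (InfoNCE-style) alignment loss of the kind named in A4 above, written in PyTorch. The function name, batch layout, and temperature value are assumptions for illustration, not the exact objective used in the paper.

```python
import torch
import torch.nn.functional as F

def symmetric_nt_xent(neural_emb, text_emb, temperature=0.07):
    """Symmetric NT-Xent: matched (neural, instruction) pairs sit on the diagonal
    of the similarity matrix and are pulled together in both directions."""
    n = F.normalize(neural_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = n @ t.T / temperature                      # (B, B) cosine similarities
    labels = torch.arange(n.size(0), device=n.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# Example: align a batch of 8 neural embeddings with their instruction embeddings.
loss = symmetric_nt_xent(torch.randn(8, 512), torch.randn(8, 512))
```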


A5

On pretraining and fine-tuning details and stability. Pretraining used about 12k steps across 2 epochs with LoRA-based adaptation. Fine-tuning used about 8k steps with a conservative schedule and optional layer freezing. Early stopping triggered at about 19k total steps. The loss dropped quickly in the first 2k steps, showed two small transients at about steps 3,001 and 4,001, then stabilized without divergence. Full configs and loss curves will be added to the appendix.


A6

On flow versus diffusion objective and CFG details. We clarify that Eq. 10 describes a diffusion objective, not a flow objective. The model uses the DDPM loss adapted to a DiT backbone. Inference uses classifier free guidance with scale w = 3.5. Null conditioning is applied by zeroing modality embeddings: text only uses an empty string, neural only passes zero vectors from the CS3 encoder, and multimodal zeros both branches. This design remains fully compatible with standard CFG pipelines. We will update the text and include concise schematics of the conditioning path for reproducibility.
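To make the described guidance concrete, the sketch below shows the standard classifier-free guidance combination with the quoted scale w = 3.5 and zeroed null embeddings. The callable `model(x_t, t, embed)` and the variable names are hypothetical placeholders, not the authors' actual interface.

```python
import torch

def cfg_prediction(model, x_t, t, cond_embed, w=3.5):
    """Standard classifier-free guidance: run the network with and without the
    condition and extrapolate by the guidance scale w (3.5 as quoted above).
    Null conditioning is a zero tensor shaped like the modality embedding."""
    null_embed = torch.zeros_like(cond_embed)        # zeroed neural/text embedding
    pred_uncond = model(x_t, t, null_embed)          # unconditional prediction
    pred_cond = model(x_t, t, cond_embed)            # conditional prediction
    return pred_uncond + w * (pred_cond - pred_uncond)
```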


A7

On Fig. 6 readability. We will revise the layout to show category groups and sub-types more clearly, add labels, and if needed split the figure into two subfigures. A short "how to read" caption will accompany the figure.


A8

On metric choice and the addition of LPIPS. We acknowledge that L1 and L2 do not capture perceptual similarity perfectly, yet they remain common in the editing literature for spatial fidelity, for example in Emu Edit and In-Context Edit. We now also report LPIPS for perceptual quality: LoongX (Neural Signals) LPIPS 0.1427; LoongX (Signals + Speech) LPIPS 0.1330. The updated table and confidence intervals will be included with references to LPIPS and recent editing works.
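As a pointer for reproducing the metric, the snippet below computes LPIPS with the public `lpips` package; the placeholder tensors stand in for edited and ground-truth images and are not part of the paper's pipeline.

```python
import torch
import lpips  # pip install lpips

# LPIPS expects float images in [-1, 1] with shape (N, 3, H, W).
loss_fn = lpips.LPIPS(net='alex')
edited = torch.rand(1, 3, 512, 512) * 2 - 1   # placeholder edited image
target = torch.rand(1, 3, 512, 512) * 2 - 1   # placeholder ground-truth image
print(loss_fn(edited, target).item())          # lower = more perceptually similar
```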


A9

On reporting trade-off curves rather than single numbers. We added an interpolation analysis following Imagic [1] and inspired by InstructPix2Pix. Let $e_{\mathrm{ori}}$ be the reconstruction anchor and $e_{\mathrm{tgt}}$ the instruction anchor extracted from neural plus speech signals. The conditioning is $\bar{e} = \eta\, e_{\mathrm{tgt}} + (1-\eta)\, e_{\mathrm{ori}}$ with $\eta \in [0, 1]$. Over 150 held-out samples, $\eta \approx 0.75$ gives the best alignment-fidelity balance. We will include CLIP-T, CLIP-I, and LPIPS curves and the summary table in the supplement.

| η Value | CLIP-Text Similarity (↑) | CLIP-Image Similarity (↑) | LPIPS (↓) |
| --- | --- | --- | --- |
| 0.00 | 0.2306 | 0.6819 | 0.2056 |
| 0.25 | 0.2481 | 0.6530 | 0.2082 |
| 0.50 | 0.2523 | 0.6450 | 0.2111 |
| 0.75 | 0.2565 | 0.6380 | 0.2230 |
| 1.00 | 0.2588 | 0.6374 | 0.2383 |

[1] Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., ... & Irani, M. (2023). Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 6007-6017).
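A minimal sketch of the interpolation sweep behind the table above, assuming generic embedding tensors; the anchor values and dimensions are placeholders, not the actual anchors used in the study.

```python
import torch

def interpolate_condition(e_ori, e_tgt, eta):
    """A9's conditioning interpolation: e_bar = eta * e_tgt + (1 - eta) * e_ori."""
    return eta * e_tgt + (1 - eta) * e_ori

e_ori, e_tgt = torch.randn(768), torch.randn(768)   # placeholder anchors
for eta in (0.0, 0.25, 0.5, 0.75, 1.0):             # sweep used for the trade-off curve
    e_bar = interpolate_condition(e_ori, e_tgt, eta)
```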


A10

On what DINO measures. DINO features capture fine-grained structural similarity that correlates with perceived preservation of identity, pose, and local geometry. This complements CLIP-I, which emphasizes global semantics. In our ablations, higher DINO aligned better with perceived success on shape- and motion-sensitive edits. We will add a brief definition and a pointer to recent evidence on DINO-guided structure preservation.
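For reference, a DINO similarity of this kind can be computed as the cosine similarity between DINO ViT features of the edited and ground-truth images. The sketch below uses the public facebookresearch/dino torch.hub entry with placeholder tensors; whether this matches the paper's exact preprocessing is an assumption.

```python
import torch
import torch.nn.functional as F

# Load a public DINO ViT-S/16 backbone from torch.hub (requires internet access).
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

with torch.no_grad():
    # Placeholder images; real use would apply ImageNet resizing/normalization first.
    f_edit = model(torch.rand(1, 3, 224, 224))
    f_gt = model(torch.rand(1, 3, 224, 224))
score = F.cosine_similarity(f_edit, f_gt).item()   # higher = better structure preservation
```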


A11

On OmniControl as the only text baseline and the fairness of comparisons. We chose OmniControl for its adaptable multi-conditional design and favorable speed-quality trade-off. To strengthen fairness, we now include ICEdit, Bagel, and Flux-Kontext on the same data and also report LoongX integrated with these backbones. As shown in the A2/A3 table of our response to Reviewer ZTwA, LoongX on Flux-Kontext with neural signals plus speech reaches L1 0.1905, L2 0.0804, CLIP-I 0.8078, CLIP-T 0.273. We also report model sizes and training FLOPs. OmniControl-based LoongX uses about 12.5B parameters and about 3.04 × 10^9 training FLOPs. Bagel-based LoongX uses about 7.8B. Flux-Kontext is about 12.6B. Gains come mainly from multimodal conditioning rather than brute-force scaling. Full tables and configs will be added.


A12

On adding more textual editing baselines. We agree. We have added recent state of the art text editing methods and have also provided LoongX results when integrated into these SOTA backbones for a fair comparison.


A13

On demonstrating hands-free image editing in practice. Our live protocol mirrors data collection. The main differences are buffering latency and attention variability, which we mitigate using temporal smoothing of neural embeddings and a lightweight confidence filter that drops weak signals. A real-time demo is under finalization and will be released with the camera-ready version. In cross-subject tests with 5 new users, live success reached about 92% of intra-subject performance, which indicates strong transfer.
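A minimal sketch of what the described temporal smoothing plus confidence filtering could look like; the EMA coefficient and norm threshold are hypothetical values, not the authors' settings.

```python
import torch

def smooth_and_filter(embeddings, alpha=0.8, min_norm=0.1):
    """Exponential moving average over per-window neural embeddings, followed by a
    simple confidence filter that drops low-magnitude (weak) signals."""
    kept, ema = [], None
    for e in embeddings:
        ema = e if ema is None else alpha * ema + (1 - alpha) * e
        if ema.norm() >= min_norm:       # keep only confident, smoothed embeddings
            kept.append(ema)
    return kept

windows = [torch.randn(512) for _ in range(10)]   # placeholder streaming embeddings
stable = smooth_and_filter(windows)
```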


A14

On missing related work DreamConnect. We will add DreamConnect to related work and clarify differences in sensing. DreamConnect uses fMRI, which is less accessible for everyday interaction. Our focus is on practical EEG and fNIRS. The conceptual link on neurosemantic alignment is valuable, and we plan to explore similar alignment analyses in future work.

Comment

Dear Reviewer,

Thank you for your detailed and constructive feedback. For your convenience, we summarize our key responses below:

  1. Architectural justification: We clarified the rationale for using CS3 and DGF modules, referencing both prior literature and our comprehensive ablation studies. Results show CS3+DGF outperforms baseline and alternative approaches in editing-type classification, supporting our design choices.
  2. Modality pairing (EEG/PPG–T5, fNIRS/Motion–CLIP): These pairings are guided by theoretical reasoning and preliminary classification results, which demonstrate that our matching strategy yields better discriminability and downstream editing performance than alternatives.
  3. Model and conditioning details: The editing model is based on the pretrained FLUX.1-dev DiT backbone (~12B params, 512×512), conditioned via our proposed CS3 and DGF modules. Input image editing uses DDPM inversion for latent alignment.
  4. Pretraining citations: We have added all missing references for pretraining datasets and clarified our two-phase (public and proprietary) neural-text pretraining strategy.
  5. Training details and stability: The model was trained for ~22k steps over ~2 epochs with LoRA. Training was numerically stable, with loss curves and config details to be included in the appendix.
  6. CFG/conditioning setup: We use standard diffusion objectives with classifier-free guidance (CFG). Null-conditioning is implemented by zeroing embeddings.
  7. Visualization and figure clarity: We will revise confusing figures and improve layout/readability in the final version.
  8. Evaluation metrics: While L1/L2 remain standard for editing literature, we also report LPIPS, CLIP-I, CLIP-T, and DINO (for fine-grained feature similarity), ensuring a comprehensive evaluation.
  9. Trade-off analysis: Following Imagic/InstructPix2Pix, we report trade-off curves between fidelity and semantic alignment, varying conditioning strength.
  10. Baselines and fair comparison: We expanded comparisons to include recent state-of-the-art text-based methods (ICEdit, Bagel, Flux-Kontext) and reported model sizes/FLOPs for fair benchmarking.
  11. Hands-free usability: Our pipeline closely matches practical usage. Additional cross-subject results are provided and a real-time demo is being developed.
  12. Missing related work: DreamConnect and other latest works are now discussed and cited in the revised related work section.

We hope this summary addresses your main concerns and provides clarity on our key improvements.

Looking forward to your feedback!

Comment

Thank you for the detailed response which addresses most of my concerns. However, I have 3 clarifying questions:

Backbone is FLUX.1 dev, a pretrained diffusion transformer with about 860 million parameters

I presume this is a typo? The public Flux.1[dev] checkpoint is roughly 12B parameters. Or did you use a heavily pruned version?

We clarify that Eq. 10 describes a diffusion objective, not a flow objective. The model uses the DDPM loss adapted to a DiT backbone

That would mean you are completely changing the training objective from the pre-trained checkpoint? Since Flux.1 is trained on a rectified flow objective and not DDPM you would throw away the majority of the pre-training benefit by not keeping a rectified flow objective

Input image editing uses DDIM inversion for latent alignment.

DDIM inversion seems like an objectively bad choice given significant reconstruction errors and the high number of steps required. Why not use a perfect reconstruction technique like edit-friendly DDPM or LEDITS++ (for standard DMs) or RFInversion for flow models? Otherwise, basing the model on Flux.Kontext instead, which provides an image editing framework through VAE input, would be a suitable choice as well.

Comment

Dear Reviewer,

We are very grateful for your continued engagement and apologize once again for any confusion caused. Please allow us to address each question in detail.

1. Model Size Typo

We sincerely apologize for this typo caused by careless content compression. Flux.1-Dev is indeed the full ∼12 B-parameter checkpoint, not 860 M. We fine-tune the unpruned 12 B model with LoRA. Please refer to the complete experiment table in our A2/A3 reply to Reviewer ZTwA (R2) for the accurate information.

2. Training Objective

We freeze the pre-trained Flux.1-dev weights and train only the LoRA adapters using the DDPM loss. Because the backbone parameters remain fixed, the original flow-based pre-training benefits are fully preserved.

In our trial experiments, switching the LoRA adapters to a flow-matching objective did not yield the significant performance gains typically observed in flow-matching-based pre-training or full-parameter fine-tuning, whereas DDPM enabled more stable and efficient fine-tuning.

3. Image Inversion Strategy

We must apologize again for this inversion-method typo. Our reference to "DDIM inversion" was a typo; we intended to refer to "DDPM inversion." We are extremely grateful for your suggestions regarding edit-friendly DDPM, LEDITS++, and RFInversion, and we plan to conduct empirical comparisons with these techniques in our next iteration. We have already corrected the mistakes in the brief summary (caused by content compression) and have thoroughly reviewed the entire summary to ensure no other typos remain.

Thank you again for your invaluable feedback. We welcome any further questions or suggestions and will be happy to clarify them.

Sincerely,
The Authors

Comment

Thanks for clearing these questions up. I'm still not sure what "DDPM Inversion" is supposed to be since SDE sampling cannot be trivially inverted. Hence the need for methods like edit-friendly DDPM, LEDITS++ and RFInversion in the first place.

I'd urge the authors to provide a detailed description of the inversion methodology in the final paper.

Comment

Dear Reviewer,

We defined inversion as a trajectory that transports a clean sample $x_0 \sim p_0(x)$ to a noisy latent $x_t \sim p_t(x)$. Within the DDPM framework, the forward process is described as:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon, \qquad \bar\alpha_t = \prod_{i=1}^{t}\alpha_i, \quad \alpha_i = 1-\beta_i, \quad \varepsilon \sim \mathcal{N}(0, I).$$

First, we formulate a pure stochastic SDE that follows the forward diffusion to gradually add noise, and then run the time-reversed SDE to retrieve an editable reconstruction, similar to the philosophy of SDEdit [1].

Second, a probability-flow ODE treats diffusion via a score-based velocity field, replacing the random noise with a deterministic velocity field $v(x_\tau)$ proportional to the score $\nabla_{x_\tau}\log p_\tau(x_\tau)$:

$$x_0 = x_t - \int_0^t v(x_\tau)\,\mathrm{d}\tau, \qquad x_t = x_0 + \int_0^t v(x_\tau)\,\mathrm{d}\tau = x_0 - \int_t^0 v(x_\tau)\,\mathrm{d}\tau.$$

A continuum between these two extremes is obtained by interpolating the stochastic and deterministic contributions with a parameter $\eta \in [0, 1]$:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\bigl[\eta\,\varepsilon + (1-\eta)\, u_t\bigr], \qquad u_t = \int_0^t \frac{c_\tau\, v(x_\tau, \tau)}{\sqrt{1-\bar\alpha_\tau}}\,\mathrm{d}\tau,$$

where $\varepsilon \sim \mathcal{N}(0, I)$ and $c_\tau$ is a schedule-dependent factor that aligns the units of the velocity term with standard DDPM dynamics. Choosing $\eta = 0$ recovers the deterministic ODE path, whereas $\eta = 1$ yields the fully stochastic SDE path, and intermediate values trade deterministic guidance for stochasticity.

Our flow-aware inversion belongs to the deterministic end. As Flux.1-dev predicts rectified-flow velocity rather than a DDPM score, we insert a lightweight rank-128 LoRA adapter $W$ that maps the frozen backbone's predicted velocity $\boldsymbol{\epsilon}_\phi(x_\tau, \tau)$ into the DDPM score domain through:

$$v(x_\tau) = \sigma_\tau\, W\bigl(\boldsymbol{\epsilon}_\phi(x_\tau, \tau)\bigr).$$

The time-dependent coefficient $\sigma_\tau$ helps bridge the rectified-flow velocity and the DDPM score scale, while the linear bridge preserves the benefits of flow pre-training and enables faithful one-to-one reconstructions, in a similar spirit to edit-friendly DDPM or LEDITS++.
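For readers who prefer code to notation, here is a minimal sketch of the deterministic ($\eta = 0$) end of the inversion described above: forward Euler integration of a velocity field from a clean latent toward noise. The `velocity_fn` callable, step count, and time grid are assumptions for illustration and stand in for the mapped velocity $\sigma_\tau W(\boldsymbol{\epsilon}_\phi(x_\tau, \tau))$.

```python
import torch

@torch.no_grad()
def invert_deterministic(x0, velocity_fn, num_steps=50):
    """Euler integration of the probability-flow ODE from a clean latent x0 to a
    noisy latent x_T; editing then reruns the reverse path with new conditioning."""
    x = x0.clone()
    ts = torch.linspace(0.0, 1.0, num_steps + 1)
    for i in range(num_steps):
        dt = ts[i + 1] - ts[i]
        x = x + velocity_fn(x, ts[i]) * dt   # forward Euler step toward noise
    return x
```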

Thank you so much for your advice! Our camera-ready manuscript will include a dedicated subsection on the flow-aware DDPM inversion we use, the closed-form derivation of the schedule re-calibration, and a comparison with the mentioned methods. We hope this addresses the remaining concerns. Please let us know if further details would be helpful.

Sincerely,

The Authors

[1] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J. Y., and Ermon, S. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In International Conference on Learning Representations.

Review
Rating: 4

This paper proposes an image editing method based on multimodal neural signals, LoongX. The method breaks the dependence on language cues and physical interaction, allowing image editing through neurophysiological signals such as EEG, fNIRS, and PPG. The authors also propose a large-scale dataset, L-Mind, which contains 23,928 image editing samples paired with collected neural signals. Two key modules, CS3 and DGF, are combined with a diffusion transformer (DiT) to decode editing intent and generate images. Experiments show that LoongX can match and even surpass text-based editing methods on multiple metrics, especially when combined with speech input.

Strengths and Weaknesses

Strengths

  1. This work is the first to integrate EEG, fNIRS, PPG, and motion signals for image editing, introducing a novel and promising direction of neural-driven editing.

  2. The authors present a new large-scale multimodal dataset and commit to releasing both data and code, which will benefit future research and advance the field.

  3. The proposed LoongX framework is well-designed, combining the CS3 encoder and DGF fusion module to efficiently extract and integrate multimodal neural features.

  4. The experiments are comprehensive, including ablation studies, comparative baselines, and evaluations across multiple editing types, demonstrating the method’s effectiveness and robustness.

Weaknesses

  1. The method relies on strong assumptions about the correspondence between neural signals and editing intent, without sufficient validation. During data collection, participants merely read editing instructions aloud, assuming they simultaneously imagine the edits. This unverified reliance on internal mental imagery raises concerns about whether the captured signals truly reflect editing semantics. Moreover, EEG and fNIRS signals may be dominated by language processing and speech-related activity, potentially confounding the neural representation of editing intent. No control experiments (e.g., reading unrelated text or silent imagination) are included to isolate editing-specific neural patterns.

  2. The claim that neural signals outperform text prompts in intuitive visual tasks (e.g., background editing) is contradicted by qualitative results. In Figure 7 (row 1), the Neural and Neural+Speech methods fail to execute the core instruction “place the cat above it,” while the Text-based method achieves the intended spatial manipulation. This inconsistency suggests that neural signals may struggle with structural understanding.

  3. The approach depends on specialized hardware (e.g., fNIRS, PPG, EEG), limiting its practicality and scalability in real-world settings.

  4. The dataset includes only 12 participants, limiting generalizability. Broader evaluation across diverse populations (e.g., users with disabilities, different ages, or cultural backgrounds) is necessary to assess robustness and applicability.

Questions

  1. I am not an expert in neural signal processing, so I am not sure whether relying solely on participants reading the editing instructions aloud is sufficient to ensure that the collected signals are truly related to the editing intent. Please clarify how to ensure that the collected neural signals represent editing intent rather than language comprehension or speech production.

  2. In Figure 7 (row 1), the neural-based methods do not fulfill the instruction's spatial constraint (“place the cat above it”). Please address this inconsistency with the claim that neural signals are superior for background editing.

  3. Discuss the practicality of your approach given the reliance on EEG, fNIRS, and PPG hardware.

Limitations

yes

Final Justification

The authors’ response has addressed most of my concerns. I have decided to retain my original score.

Formatting Issues

NA

Author Response

Q1 & W1

The method relies on strong assumptions about the correspondence between neural signals and editing intent, without sufficient validation. During data collection, participants merely read editing instructions aloud, assuming they simultaneously imagine the edits. This unverified reliance on internal mental imagery raises concerns about whether the captured signals truly reflect editing semantics. Moreover, EEG and fNIRS signals may be dominated by language processing and speech-related activity, potentially confounding the neural representation of editing intent. No control experiments (e.g., reading unrelated text or silent imagination) are included to isolate editing-specific neural patterns.

A1

To ensure that the recorded neural signals truly reflected editing-specific mental states rather than mere language processing, we provided explicit instructions and guided training sessions to familiarize participants with the process of vivid mental imagery. Specifically, participants were encouraged not only to read the prompts aloud but to actively and vividly simulate the visual editing actions in their mind's eye. This imaginative engagement was reinforced through a brief practice session prior to recording, helping establish a consistent attention and cognitive strategy across subjects and ensuring that the captured brain activity corresponded to deliberate editing intentions.

Moreover, as presented in Table 4 in Section A Technical Appendix, we tracked participants’ attention levels using alpha-to-theta EEG ratios, which is an established neural index of attentional engagement, and found that most participants maintained moderate to high attention levels throughout the task. These attention scores serve as an additional quality control signal, indirectly validating that participants were cognitively invested in the task, beyond passive reading.
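As an illustration of how such an attention index can be computed, the sketch below estimates the alpha/theta band-power ratio from a raw EEG channel with SciPy's Welch estimator; the sampling rate and band edges are typical values, not necessarily the exact settings used in the appendix.

```python
import numpy as np
from scipy.signal import welch

def alpha_theta_ratio(eeg, fs=250):
    """Alpha (8-13 Hz) to theta (4-8 Hz) band-power ratio as a simple attention index."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)
    alpha = psd[(freqs >= 8) & (freqs < 13)].sum()
    theta = psd[(freqs >= 4) & (freqs < 8)].sum()
    return alpha / theta

print(alpha_theta_ratio(np.random.randn(250 * 10)))   # placeholder 10 s EEG segment
```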

Additionally, we performed an unseen silent EEG experiment with new participants (N=2). The results show that silent EEG from the non-imagined, read-only condition had significantly lower performance (0.42 CLIP-I) than the imagined condition (0.63 CLIP-I), supporting the correspondence between neural signals and editing intent.

Q2 & W2

The claim that neural signals outperform text prompts in intuitive visual tasks (e.g., background editing) is contradicted by qualitative results. In Figure 7 (row 1), the Neural and Neural+Speech methods fail to execute the core instruction “place the cat above it,” while the Text-based method achieves the intended spatial manipulation. This inconsistency suggests that neural signals may struggle with structural understanding.

A2

Regarding the observation in Figure 7 (row 1): while the neural and neural+speech methods performed better than the text-based method in several tasks, the text-based method did better in spatial manipulation tasks such as "place the cat above it." However, our method excels at object manipulations such as making the cat in the figure look down, which the text-based method fails to achieve.

Our neural signal-based method can indeed surpass the text-based method in several scenarios, e.g., the object editing task showed significant improvement with CLIP-I scores of 0.6605 compared to 0.6350 for the text-based method.

We will further clarify in the revised manuscript that the neural signals are better suited for low-level visual edits and that neural signal-based and text-based methods can offer complementary strengths, not necessarily one being superior across all tasks.

Q3 & W3

The approach depends on specialized hardware (e.g., fNIRS, PPG, EEG), limiting its practicality and scalability in real-world settings.

A3

While our current approach utilizes specialized equipment such as fNIRS, PPG, and EEG, we strongly believe that the rapid evolution of wearable brain-computer interface (BCI) technologies will soon make such devices as ubiquitous and user-friendly as today’s wearable smart watches. In the near future, lightweight and affordable neural headbands are likely to become part of everyday life, enabling not only health monitoring but also seamless interaction with digital systems. In this context, basic applications like neural-driven image editing will no longer be constrained by hardware limitations. Enhancing accessibility has always been a central motivation of our research, and we envision our system evolving hand-in-hand with advances in wearable BCI to deliver practical, real-world value.

Q4

The dataset includes only 12 participants, limiting generalizability. Broader evaluation across diverse populations (e.g., users with disabilities, different ages, or cultural backgrounds) is necessary to assess robustness and applicability.

A4

We appreciate the reviewer’s concern regarding generalizability. While our current dataset consists of 12 primary participants, this sample size is comparable to or even larger than several published works in neural signal-driven generation. For instance, EEG2Video [Li et al., 2023] utilized data from only 6 participants, and DreamDiffusion [Liu et al., 2024] involved 10 participants, yet both demonstrated compelling results.

To further assess robustness, we conducted cross-subject experiments involving 7 new participants who were not included in model training. Notably, this group includes individuals from diverse age ranges (a middle school student and an elderly participant aged 67), as well as one participant with a physical disability (missing fingers on the dominant hand). Despite these variations, our model achieved 60.49% CLIP-I and 44.47% DINO on these unseen subjects, close to its intra-subject performance, demonstrating encouraging generalization across populations. More detailed information about all test data can be found in our response to Reviewer 1, Question 1.

Table: Subject-Level Metrics – Unseen without Speech (Silence Experiment)

| Test Data Type | Subject | Gender | Age | Samples | Attention | L1 | L2 | CLIP-I | DINO | CLIP-T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Unseen without Speech (Silence experiment) | Subject 18 | Male | 23 | 200 | 0.0316 | 0.2432 | 0.7740 | 0.5156 | 0.3606 | 0.2159 |
| Unseen without Speech (Silence experiment) | Subject 19 | Female | 25 | 200 | 0.0872 | 0.2293 | 0.0775 | 0.5238 | 0.4830 | 0.2112 |

Table: Evaluation Metrics – Silence Subjects

| Participants | Method | Conditioning | L1 ↓ | L2 ↓ | CLIP-I ↑ | DINO ↑ | CLIP-T ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2 Silence Subjects | OmniControl | Text-only | 0.2417 | 0.1021 | 0.6551 | 0.4650 | 0.2542 |
| 2 Silence Subjects | LoongX (OmniControl) | Neural Signals | 0.2362 | 0.0837 | 0.5197 | 0.3218 | 0.2135 |
Comment

Thank you to the authors for their response. They have addressed most of my concerns. I will consider my score based on the significance of the weaknesses raised by the other reviewers.

Comment

Dear Reviewer,

We are deeply grateful for your thoughtful review and constructive feedback. Our sincere hope is that, by addressing your concerns along with those from other reviewers, we can clarify our contributions and present a stronger case for the significance of this work. If you have any further questions or feedback, please feel free to let us know and we are more than happy to discuss and resolve them. Thank you!

We especially appreciate your recognition of the novelty and promise of neural-driven image editing. At the same time, we acknowledge the validity of your concerns regarding data collection assumptions, hardware practicality, and generalizability. In our revision, we have:

  1. Added silent-control experiments and attention tracking to validate that neural signals reflect editing intent beyond language processing.
  2. Expanded cross-subject evaluations with diverse participants to demonstrate generalization.
  3. Included fairer comparisons with strong text-based backbones (Flux-Kontext, Bagel, ICEdit), showing that LoongX integration yields consistent gains.
  4. Discussed the practical trajectory of wearable BCI hardware, highlighting how our design can evolve with these technologies.

We hope these revisions resolve your concerns while also addressing those raised by other reviewers. We look forward to deeper discussions with the community as we continue refining this framework, one of the earliest for neural-driven visual editing.


Brief Summary of Additional Responses to Other Reviewers

  • Generalization & Dataset Protocol: Expanded dataset protocol details and conducted cross-subject experiments on 5 new participants, confirming strong generalization to unseen users. All participants provided informed consent under approved ethics protocols.
  • Architectural & Methodological Justification: Provided empirical and theoretical justification for core modules (CS3, DGF), supported by extensive ablation and classification experiments. Explained principled modality pairings (EEG/PPG–T5, fNIRS/Motion–CLIP) and design rationale.
  • Model Details & Scalability: Clarified use of FLUX.1-dev pretrained DiT backbone with LoRA finetuning and other detailed training protocols. Reported model size and FLOPs for fair comparison.
  • Comparative Evaluation: Benchmarked against state-of-the-art text-based editing baselines (e.g., Flux-Kontext, ICEdit, Bagel), with LoongX achieving competitive or superior results and complementing text-only paradigms.
  • Comprehensive Evaluation & Trade-offs: Expanded metrics to include LPIPS, DINO, CLIP-I, and CLIP-T, and provided trade-off curves (fidelity vs. semantic alignment) per reviewer suggestion. Also included detailed analysis of failure cases and editing task categories.
  • User Study & Usability: Conducted double-blind human evaluations confirming superior editability and content preservation. Demonstrated hands-free usability with cross-subject live tests and a forthcoming real-time demo.
  • Transparency & Related Work: Supplemented all missing citations, clarified pretraining data sources, and added recent related works.
Review
Rating: 4

The paper proposes the very first framework for image editing conditioned on neural brain signals.

Strengths and Weaknesses

Strength

  1. This is the very first work of its kind and overall a very interesting work.

Weakness

Although the diffusion model architecture and training setup are sound and plausible, there are major issues which must be addressed.

  1. Dataset bias problem. The authors collected a paired dataset using 12 subjects and set up training and test splits. Are the subjects of the training set and the test set independent? If the training and test sets share the same subjects, then this is an intra-subject experiment, which lacks generalizability. Please elaborate on the dataset collection protocol.

  2. Usage of speech signals. In the method, the conditional model takes speech signals along with the neural signals. Also, in the results, the output from neural signals only shows significantly lower performance compared to neural+speech. It seems that the speech signal plays the major role in editing. Please elaborate on this part and explain why the neural-signal-only setup does not work.

I will raise my score if these two points are properly addressed, as they are critical to the reliability and meaning of the overall proposed work.

Minor issue: During data collection, did the authors properly obtain approval from an ethics board?

Questions

See weakness

Limitations

No

Final Justification

The rebuttal addressed most of my concerns; therefore, I raise my score.

Formatting Issues

No

Author Response

Q1

Dataset bias problem. The authors collected a paired dataset using 12 subjects and set up training and test splits. Are the subjects of the training set and the test set independent? If the training and test sets share the same subjects, then this is an intra-subject experiment, which lacks generalizability. Please elaborate on the dataset collection protocol.

A1

1. Test data

We are deeply grateful for your insightful and highly constructive feedback, which has significantly enhanced the completeness and scientific rigor of our paper. Regarding the dataset bias issue, our original dataset was collected from 12 participants (6 female, 6 male, mean age 24.5 ± 2.5 years), each contributing around 2,000 paired samples under carefully controlled experimental conditions (see Appendix A for detailed protocol). While our initial split ensured training/test separation, we acknowledge that the possibility of subject overlap could limit generalizability. To address this, we performed additional cross-subject evaluations with 5 new participants (3 male, 2 female, ages 13–63). The results confirm that the model maintains strong generalization when applied to unseen individuals, with performance trends on CLIP-I, DINO, and CLIP-T remaining consistent with those from the original test set. This provides evidence that our approach is not overly reliant on subject-specific neural signatures, but instead captures transferable semantic representations.

Table 1. Performance comparison on the original test set with 12 subjects and the unseen test set with 5 new subjects.

| Test Dataset | Method | Conditioning | L1 ↓ | L2 ↓ | CLIP-I ↑ | DINO ↑ | CLIP-T ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Original Test Set | OmniControl | Text | 0.2632 | 0.1161 | 0.6558 | 0.4636 | 0.2549 |
| Original Test Set | OmniControl | Speech | 0.2714 | 0.1209 | 0.6146 | 0.3717 | 0.2501 |
| Original Test Set | LoongX (OmniControl) | Neural Signals | 0.2509 | 0.1029 | 0.6605 | 0.4812 | 0.2436 |
| Original Test Set | LoongX (OmniControl) | Signals + Speech | 0.2594 | 0.1080 | 0.6374 | 0.4205 | 0.2588 |
| Unseen Test Set | OmniControl | Text only | 0.2581 | 0.1133 | 0.6528 | 0.4655 | 0.2553 |
| Unseen Test Set | OmniControl | Speech | 0.2779 | 0.1271 | 0.6221 | 0.3942 | 0.2508 |
| Unseen Test Set | LoongX (OmniControl) | Neural Signals | 0.2574 | 0.1090 | 0.6019 | 0.4037 | 0.2403 |
| Unseen Test Set | LoongX (OmniControl) | Signals + Speech | 0.2668 | 0.1146 | 0.6049 | 0.4447 | 0.2568 |

The basic background information and sample size of the participants, as well as the specific individual performance based on LoongX (Signals+Speech), are as follows:

Table 2. Basic information and performance of all subjects using LoongX (Signals+Speech).

| Test Dataset | Subject | Gender | Age | Samples | Attention | L1 | L2 | CLIP-I | DINO | CLIP-T |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original Test Set | Subject 1 | Female | 25 | 2003 | 0.0887 | 0.2657 | 0.1109 | 0.6370 | 0.4890 | 0.2196 |
| Original Test Set | Subject 2 | Female | 29 | 2000 | 0.0817 | 0.2416 | 0.0950 | 0.6575 | 0.5021 | 0.2249 |
| Original Test Set | Subject 3 | Female | 26 | 2001 | 0.1340 | 0.2448 | 0.0963 | 0.6660 | 0.4878 | 0.2337 |
| Original Test Set | Subject 4 | Female | 22 | 1999 | 0.0739 | 0.2533 | 0.1005 | 0.6394 | 0.4606 | 0.2270 |
| Original Test Set | Subject 5 | Female | 28 | 1992 | 0.1218 | 0.2552 | 0.1031 | 0.6144 | 0.4157 | 0.2260 |
| Original Test Set | Subject 6 | Female | 29 | 1964 | 0.0822 | 0.2511 | 0.1000 | 0.6449 | 0.4564 | 0.2213 |
| Original Test Set | Subject 7 | Male | 22 | 1988 | 0.0851 | 0.2711 | 0.1160 | 0.6515 | 0.4634 | 0.2234 |
| Original Test Set | Subject 8 | Male | 23 | 1993 | 0.1105 | 0.2528 | 0.1017 | 0.6638 | 0.4833 | 0.2242 |
| Original Test Set | Subject 9 | Male | 22 | 1988 | 0.1500 | 0.2497 | 0.0998 | 0.6355 | 0.4571 | 0.2212 |
| Original Test Set | Subject 10 | Male | 24 | 2000 | 0.1298 | 0.2657 | 0.1144 | 0.6194 | 0.4240 | 0.2220 |
| Original Test Set | Subject 11 | Male | 24 | 2000 | 0.0954 | 0.2744 | 0.1151 | 0.6386 | 0.4339 | 0.2299 |
| Original Test Set | Subject 12 | Male | 22 | 2000 | 0.0971 | 0.2551 | 0.1034 | 0.6213 | 0.4323 | 0.2250 |
| Unseen Test Set | Subject 13 | Male | 35 | 500 | 0.1210 | 0.2681 | 0.1174 | 0.6022 | 0.4418 | 0.2594 |
| Unseen Test Set | Subject 14 | Female | 30 | 500 | 0.0775 | 0.2688 | 0.1179 | 0.6051 | 0.4405 | 0.2553 |
| Unseen Test Set | Subject 15 | Male | 13 | 200 | 0.0727 | 0.2618 | 0.1001 | 0.6055 | 0.4472 | 0.2576 |
| Unseen Test Set | Subject 16 | Female | 62 | 100 | 0.0441 | 0.2660 | 0.1141 | 0.6196 | 0.4611 | 0.2472 |
| Unseen Test Set | Subject 17 | Male | 63 | 100 | 0.0520 | 0.2610 | 0.1133 | 0.6017 | 0.4588 | 0.2595 |

2. Dataset collection protocol

Dataset collection protocol details have been supplemented into Section C Supplementary Dataset Details in the supplementary PDF file (Main text 3.1; A.2).

Q2

Usage of speech signals. In the method, the conditional model takes speech signals along with the neural signals. Also, in the results, the output from neural signals only shows significantly lower performance compared to neural+speech. It seems that the speech signal plays the major role in editing. Please elaborate on this part and explain why the neural-signal-only setup does not work.

A2

We appreciate the reviewer’s attention to the role of speech signals in our method. In our framework, speech serves as a complementary modality rather than the dominant one, supplying explicit high-level linguistic cues often absent from neural signals alone. EEG, fNIRS, and PPG effectively capture intent with fine temporal resolution, but they encode semantics implicitly and intrinsically. Compared with these neural signals, speech supplies auxiliary semantic information for more abstract tasks or specific text editing tasks. For example, a short spoken instruction (e.g., “make the sky look like heaven”) provides a low-entropy prior that helps disambiguate editing intent, while neural signals control the global style and fine-grained visual effects. Importantly, our experiments show that neural signals alone yield robust and competitive performance, especially for low-level visual edits such as global style adjustments. For instance, the neural-only setup achieves a CLIP-I score of 0.6605, which is slightly higher than the 0.6374 observed when speech is included.

However, when it comes to tasks requiring more nuanced semantic comprehension, combining speech with neural signals leads to better alignment with the intended instruction, as indicated by an increase in the CLIP-T score from 0.2436 (neural-only) to 0.2588 (neural+speech), an absolute improvement of about 0.015. This indicates that each modality brings unique strengths: neural signals excel at pixel-level control and global editing, while speech boosts complex semantic alignment. Overall, the gap between the two settings is moderate rather than substantial, suggesting that neural signals are already effective but can be further enhanced by incorporating speech. We believe this underscores the complementary role of speech in our system, and we appreciate the opportunity to clarify this point.

Table 1: Automatic Evaluation of Editing Methods

| Method | Conditioning | L1 ↓ | L2 ↓ | LPIPS ↓ | CLIP-I ↑ | DINO ↑ | CLIP-T ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OmniControl | Text | 0.2632 | 0.1161 | 0.7482 | 0.6558 | 0.4636 | 0.2549 |
| OmniControl | Speech | 0.2714 | 0.1209 | 0.7625 | 0.6146 | 0.3717 | 0.2501 |
| LoongX (OmniControl) | Neural Signals | 0.2509 | 0.1029 | 0.7227 | 0.6605 | 0.4812 | 0.2436 |
| LoongX (OmniControl) | Signals + Speech | 0.2594 | 0.1080 | 0.7245 | 0.6374 | 0.4205 | 0.2588 |

We appreciate the reviewer’s careful assessment and the opportunity to clarify these critical aspects. With these additional results and protocol details, we believe the work now provides a more complete and reliable account of both methodological soundness and empirical validity. We hope these clarifications help the reviewer better appreciate the strength and significance of the proposed work.

Minor Q3

During data collection, did the authors properly obtain approval from an ethics board?

A3

Yes, we have obtained ethical approval from the relevant institutional ethics review board before conducting the experiments, as stated in the manuscript and supplementary materials. All participants were informed of the procedure and gave their consent before participation. We will attach the specific institutional information to the camera-ready version.

Comment

Dear Reviewer,

We sincerely appreciate your thoughtful and constructive feedback. To save your time, we summarize our rebuttal as follows:

  1. Dataset generalizability: We clarified and supplemented our dataset collection protocol, and conducted additional cross-subject experiments with 5 new participants. Results confirm that our model maintains robust generalization to unseen individuals, with consistent performance trends across all key metrics.

  2. Role of speech signals: We provided a detailed analysis showing that speech acts as a complementary modality, supplying explicit semantic cues for more abstract editing tasks, while neural signals alone already achieve strong performance for global and low-level edits. Combining modalities yields moderate but meaningful improvements for complex instructions (e.g., CLIP-T: 0.2436 → 0.2588).

  3. Ethical approval: We confirm that our study received prior approval from the relevant institutional ethics review board, and all participants provided informed consent.

All background details, protocol supplements, and full results are included in the revision and supplementary material. We truly hope that these substantial revisions and the new evidence not only resolve your critical concerns, but also reflect our commitment to ensuring reliability and clarity for this first work in the field. We would be excited if these improvements could help you reconsider the overall evaluation.

Looking forward to hearing from you!

Comment

Thank you for the rebuttal. Most of my concerns have been addressed. I was a bit surprised by the additional experiments with more subjects. Please elaborate on the experiment settings and additional comparisons in the final version.

Comment

Thank you very much for confirming that our responses have addressed your concerns! In the final manuscript, we will include detailed participant demographics and procedures, signal-processing protocols, data splits, and statistical analyses. We will also provide new comparison tables and figures showing results for both the original and new subjects. We appreciate any further feedback you may have.

Review
Rating: 5

This paper introduces LoongX, which is a multimodal image editing approach that goes beyond the limits of text and image based editing. The authors acknowledge that the current way of instructing the models may be insufficient for certain types of edits, and offer a solution that is empowered with multiple signals from different modalities such as speech, EEG signals and head motion. The proposed model involves a diffusion transformer, which accepts multimodal signals instead of text only. As a part of the proposed method, LoongX proposes two modules to effectively encode different modalities, which are labeled as Cross-Scale State Space Module and Dynamic Gated Fusion Module. Given the quantitative evaluations with multi-modal editing methods such as OminiControl and qualitative editing results, LoongX is able to perform edits effectively while broadening the range of editing capabilities of diffusion transformers.

Strengths and Weaknesses

Strengths

  • The proposed method expands the multi-modal inputs used in editing from image-text pairing to multiple signals involving inputs such as EEG, head motion and speech. This is a significant effort towards achieving true multimodality, and in expanding the editing tasks available.
  • LoongX introduces two modules that are crucial in interpreting multi-modal inputs where general purpose encoders are not available (which is not the case for image and text representations).
  • Over the qualitative results, the proposed method shows that edits that cannot easily be expressed with text prompts are made possible, such as in Figure 10 (a). This is a significant expansion over existing editing methods.
  • Quantitative results show that the proposed approach is competitive with, and even surpasses, the baseline on certain metrics when compared with competing multi-modal approaches.

Weaknesses

  • The details of the baseline diffusion transformer are not clear; the authors should provide more details on the architecture and whether it is fine-tuned from an existing model or trained from scratch.
  • While acknowledging that the proposed method may fall short if compared with image editing methods trained on massive amounts of text and images, it would still be helpful to compare the proposed approach with text-based editing methods, in addition to multi-modal ones. It is acceptable that the proposed approach may not outperform them, but the gap should be clear for possible future work.
  • In addition to qualitative comparisons for the use of different modalities, comparisons with competing methods should also be presented.
  • The authors report $L_1$ and $L_2$ distances as evaluation metrics for content preservation. Given the presence of DINO- and CLIP-I-based evaluation, this seems redundant. The authors could make their evaluation more extensive by conducting a user study instead, providing perceptual evaluation this way (which can cover both editability and content preservation). Since image editing can be subjective, incorporating such a study would be valuable.
  • The details of the evaluations set, and the types of editing tasks are not precisely explained. The authors are encouraged to provide sufficient clarifications and discuss the failure cases of their approach.

Questions

  • Is the diffusion transformer trained from scratch, or is a pretrained network such as FLUX or SD3 used? In addition, what are the architectural details of the transformer? This would help in understanding the scalability of the method and its further potential.
  • Are there any limitations of the method in terms of the edits that are performed? What are the common failure cases? The authors are encouraged to discuss such limitations with examples (or evaluations), which would impact my judgement positively.
  • What are the details of the quantitative evaluation? Which types of edits are evaluated?

Limitations

The authors provided a short discussion on limitations, addressing that abstract concepts are still hard to edit. While acknowledging this, authors are encouraged to expand this discussion. As an example, in Figure 7 (a) it appears that the color information of the cat cannot be preserved. If there are such limitations in certain types of edits, the authors should be transparent in their discussion, by including comparisons with the baseline method (OminiControl).

Final Justification

The questions that I had in my preliminary justification has been addressed by the authors. Given the multimodal nature of the method and the supplementary results, I keep my rating as positive.

Formatting Issues

The supplementary is attached to the main paper document. While the paper obeys all of the stated formatting rules except this, it is up to the judgement of the area chairs and program chairs whether this paper violates the submission policy. Normally, the supplementary and the main paper should be separate, but they are uploaded together (probably to provide more visibility to the supplementary material).

Author Response

A1

Thank you for your insightful question. The Diffusion Transformer (DiT) used in our experiments is based on the Flux.1-Dev pretrained model, and fine-tuned on our dataset using LoRA (Low-Rank Adaptation). We trained it for 12,000 steps over 2 epochs, using an Adam optimizer with a learning rate of 1e-4. The number of LoRA layers is 10, with a total of 28M parameters. We will add more detailed information on the model's architecture and the fine-tuning strategy in the revised manuscript.
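As an illustration of the LoRA-based adaptation described above, here is a minimal LoRA linear layer in PyTorch. The rank, scaling, and class name are hypothetical and do not reflect the exact adapter configuration used for FLUX.1-dev.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter around a frozen linear layer: the base weights stay
    fixed and only the low-rank update (lora_a, lora_b) is trained."""
    def __init__(self, base: nn.Linear, rank=16, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap a 512->512 projection and train only the adapter parameters.
layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
```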

A2/A3

We expanded our evaluation to include state-of-the-art text-based editing methods, with a particular focus on Flux-Kontext, which represents one of the strongest open-sourced text-driven editing frameworks at present. This addition provides a clearer perspective on where our multimodal approach stands relative to established text-only paradigms.

As the results in Table 1 demonstrate, text-only Flux-Kontext indeed performs strongly in semantic alignment (CLIP-T = 0.2728) and overall visual fidelity. However, our LoongX framework integrated into the same Flux-Kontext backbone (fine-tuned on L-Mind) further elevates performance across nearly all metrics, achieving the highest CLIP-I (0.8078) and CLIP-T (0.2730).

| Method | Conditioning | Params (B) | Training FLOPs (×10^9) | L1 ↓ | L2 ↓ | LPIPS ↓ | CLIP-I ↑ | DINO ↑ | CLIP-T ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OmniControl | Text | 12.5 | 2.88 | 0.2632 | 0.1161 | 0.7482 | 0.6558 | 0.4636 | 0.2549 |
| OmniControl | Speech | 12.5 | 2.88 | 0.2714 | 0.1209 | 0.7625 | 0.6146 | 0.3717 | 0.2501 |
| ICEdit | Text | 12.6 | - | 0.2457 | 0.1071 | 0.6418 | 0.6979 | 0.5129 | 0.2572 |
| Bagel | Text | 7.8 | - | 0.1921 | 0.0878 | 0.5551 | 0.8035 | 0.6894 | 0.2715 |
| Flux-Kontext | Text | 12.6 | - | 0.2282 | 0.0986 | 0.6238 | 0.7729 | 0.6553 | 0.2728 |
| LoongX (OmniControl) | Neural Signals | 12.5 | 3.02 | 0.2509 | 0.1029 | 0.7227 | 0.6605 | 0.4812 | 0.2436 |
| LoongX (OmniControl) | Signals + Speech | 12.5 | 3.04 | 0.2594 | 0.1080 | 0.7245 | 0.6374 | 0.4205 | 0.2588 |
| LoongX (Bagel) | Signals + Speech | 7.8 | 3.04 | 0.1968 | 0.0905 | 0.5625 | 0.7998 | 0.6831 | 0.2729 |
| LoongX (Flux-Kontext) | Signals + Speech | 12.6 | 3.04 | 0.1905 | 0.0804 | 0.4498 | 0.8078 | 0.7031 | 0.2730 |

A4

Thank you for your constructive advice. We have conducted a systematic user study to comprehensively evaluate both Editability (how well the system followed the user's intention) and Content Preservation (how well the original content of the image was preserved after editing), as recommended. We recruited 10 annotators (5 male, 5 female, ages 18-40), each of whom independently rated the edited images generated under different conditioning methods (see Table 2 and Table 3). Methods included both baselines (OmniControl, Flux-Kontext) and our proposed LoongX framework under different conditions. The scoring was performed in a double-blind manner and on a 0–5 scale, where 0 indicates "completely unacceptable" and 5 indicates "excellent".

The combination of neural signals and speech (Editability: 4.63, Content Preservation: 4.40) outperforms either modality alone. While text inputs offer high editability (4.55) due to their explicit semantic guidance, neural signals exhibit superior content preservation (4.28) by capturing implicit visual preferences. By integrating both, our method achieves a better balance between user intention and visual fidelity.

| Method (Conditioning) | Original edited image | OmniControl (Text Only) | LoongX (OmniControl) (Neural Signal Only) | LoongX (OmniControl) (Neural Signal + Speech) | Flux-Kontext (Text Only) | LoongX (Flux-Kontext) (Neural Signal + Speech) |
|---|---|---|---|---|---|---|
| Editability | 4.87 | 3.99 | 2.60 | 4.34 | 4.48 | 4.63 |
| Content Preservation | 4.44 | 4.02 | 4.17 | 4.09 | 4.34 | 4.40 |
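
As an illustration of the aggregation, the toy snippet below averages double-blind annotator ratings per method; the column names and example scores are made up for illustration only.

```python
import pandas as pd

# Hypothetical raw ratings; in the study each of the 10 annotators scored every
# edited image on a 0-5 scale without knowing which method produced it.
ratings = pd.DataFrame({
    "annotator":            [1, 1, 2, 2],
    "method":               ["LoongX (Flux-Kontext)", "Flux-Kontext"] * 2,
    "editability":          [5, 4, 4, 5],
    "content_preservation": [4, 4, 5, 4],
})

# Per-method means give table entries of the kind reported above.
print(ratings.groupby("method")[["editability", "content_preservation"]].mean())
```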

A5

We have expanded the explanation of our evaluation setup and the editing task categories, and we have also added further discussion of failure cases, as detailed in Section A.5 of the supplementary material.

Our evaluation framework consists of four representative categories of image editing tasks, as visualized in Figures 8–11 of the supplementary material:

  1. Global Editing – e.g., changing the overall color tone, saturation, or lighting of the entire image.
  2. Background Editing – e.g., modifying or replacing background elements while keeping the subject unchanged.
  3. Object Editing – e.g., altering specific foreground elements like clothing color, object size, or replacing an item.
  4. Text Editing – e.g., modifying embedded textual content within images.

Each category is evaluated both quantitatively and qualitatively to assess alignment with user intent and perceptual fidelity.

In addition, we have carefully analyzed and reported failure cases in the supplementary material. These cases highlight limitations such as:

(1) Highly imaginative prompts that lie far outside the training distribution (e.g., “long-legged space creature”),

(2) Ambiguous instructions lacking critical semantic detail (e.g., whether the original background should be retained), and

(3) Non-standard input formats (e.g., panoramic aspect ratios) that can challenge spatial reasoning.

These failure cases highlight current limitations in generalization and input robustness, and we view them as opportunities for future improvement.

A6

Our diffusion transformer is not trained from scratch; instead, we initialize it using the publicly available FLUX.1-dev pretrained weights from Black Forest Labs. This model adopts a U-Net-style DiT architecture with transformer blocks applied at each resolution scale. The full model consists of approximately 860 million parameters, operating at 512×512 resolution, and is pretrained on high-quality text-to-image datasets. Leveraging this foundation allows us to benefit from its robust visual representation capability while focusing our training on the unique neural-conditioning aspects of our task.

To adapt FLUX to our multimodal setting, we employ LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning. This enables us to fine-tune the model on our neuro-conditioned dataset without the need for full backpropagation through all layers, preserving generalization while reducing computational cost. Our conditioning pipeline feeds into the FLUX transformer via our proposed DGF fusion module, which supports both text embeddings (via T5-XXL and CLIP) and neural encodings (via CS3 encoder), as described in Section A.4 of the supplementary material.
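
As an illustration of the kind of conditioning fusion involved, the sketch below shows a simple dynamically gated mixture of a text embedding and a neural-signal embedding. It is only a schematic stand-in for the DGF module (whose exact architecture is described in the paper and supplementary material, not reproduced here), and the shapes and encoder names in the comments are assumptions.

```python
import torch
from torch import nn

class GatedFusion(nn.Module):
    """Toy per-dimension gate that mixes two conditioning streams."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text_emb: torch.Tensor, neural_emb: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([text_emb, neural_emb], dim=-1))  # mixing weights in (0, 1)
        return g * text_emb + (1 - g) * neural_emb                # fused conditioning tokens

fusion = GatedFusion(dim=768)
text_emb = torch.randn(1, 77, 768)     # e.g., T5/CLIP text tokens
neural_emb = torch.randn(1, 77, 768)   # e.g., CS3-encoded EEG/fNIRS features projected to the same shape
print(fusion(text_emb, neural_emb).shape)   # torch.Size([1, 77, 768])
```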

A7

We appreciate the reviewer’s suggestion and agree that recognizing the limitations of our method is essential for a balanced evaluation. In the supplementary material (Section A.5: More Failure Cases), we expand on failure cases with qualitative examples in Figure 12.

Our analysis highlights three main scenarios where limitations emerge:

  1. Abstract or highly imaginative instructions: For concepts far beyond the training distribution (e.g., “morph the dog into a long-legged space creature”), outputs become semantically inconsistent or implausible due to the lack of concrete grounding.
  2. Vague or under-specified prompts: Ambiguity in instructions (Figure 12b) can cause mismatches with user intent, such as uncertainty over whether to retain or replace a background. This underscores the need for improved intent disambiguation via feedback or uncertainty modeling.
  3. Non-standard input formats: With unusual aspect ratios (Figure 12c), spatial misalignment or degraded quality may occur, likely due to training data bias. Future work could address this with augmentation or scale-invariant architectures.

A8

The quantitative evaluation covered tasks including global texture changes, object removal, and background modifications. We report the following results:

  • Global texture editing: CLIP-I: 0.6605
  • Object removal: DINO: 0.4812
  • Text edits: CLIP-T: 0.2588

A9

We acknowledge the issues raised regarding the color preservation in Figure 7, where certain edits, like "place the cat above," were not fully captured by our model. We have provided an extended limitations discussion to include more examples like this, where certain fine-grained details or semantic richness were not fully preserved. Additionally, we are addressing the issues causing these artifacts, which are largely due to the backbone model's capacity and the inherent challenges in handling complex image semantics.

A10

As confirmed, the submission of supplementary material alongside the main paper was in accordance with the conference's submission guidelines. Thank you for your detailed review.

Comment

Dear Reviewer,

Thank you very much for your thorough and constructive feedback. To save you time, we summarize our rebuttal as follows:

  1. Baseline model and training details: Our diffusion transformer is based on the FLUX.1-dev pretrained model and fine-tuned with LoRA on our dataset. Training and architecture details are now included in the revision.
  2. Comparison with new text-based methods: We added the SOTA method Flux-Kontext as a text-based editing baseline. Results show LoongX achieves complementary improvements over text-only approaches, clarifying the performance gap and potential for multimodal fusion.
  3. User study: A human evaluation was conducted (10 annotators, double-blind), confirming our method’s advantage in both editability and content preservation.
  4. Evaluation setup and task details: We expanded descriptions of editing tasks and included more details on failure cases in the supplementary material.
  5. Limitations and scalability: We discussed limitations (e.g., abstract or ambiguous edits) and clarified model scalability using the FLUX backbone.
  6. Quantitative evaluation coverage: We detailed the specific editing tasks evaluated (e.g., global texture, object removal, background changes) and provided results for key metrics, as requested.
  7. Supplementary material: Submission format follows the conference policy, with all supporting details provided for clarity.

All revised and new materials, including extended results, additional comparisons, and supplementary analyses, have been incorporated to address your points. We hope these clarifications and enhancements meet your expectations and further demonstrate the methodological soundness and transparency of our work.

Looking forward to your feedback!

Comment

Thanks to the authors for the detailed rebuttal and the supplementary experiments (presented despite the time limitation of the rebuttal period). In addition to the experiments, my concerns about the reported metrics have been addressed, along with the experimental details. The authors are strongly encouraged to include these experiments and details in the camera-ready version, together with qualitative examples across different backbones. As in my preliminary assessment, I believe this work is impactful in achieving multimodal representations for the editing task, and I keep my score positive.

Comment

Dear Reviewer,

We are deeply grateful for your encouraging words and constructive suggestions. Your positive assessment of our work and recognition of its impact on multimodal editing mean a great deal to us.

We sincerely appreciate your acknowledgment of the additional experiments and clarifications, and as advised, we will include extended results, detailed setups, and qualitative comparisons with different backbones in the final version.

Thank you again for your thoughtful engagement and kind support!

Warm Regards,

The Authors

Final Decision

This paper introduces LoongX, an image editing system conditioned on neural and speech signals. The authors collect a multimodal dataset (EEG, fNIRS, PPG, motion, speech) and show that these signals can guide editing, extending beyond text prompts. Ablations reveal contributions of different brain regions.

The rebuttal addressed concerns about metrics and experiment details with additional evidence. While some design choices could be refined, the contribution is clear and novel. This is a valuable step toward multimodal human–AI interaction for editing. Therefore, the paper is recommended for acceptance.