PaperHub
Score: 6.4/10
Decision: Spotlight · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 2.8
Novelty: 3.0 · Quality: 2.5 · Clarity: 2.8 · Significance: 2.3
NeurIPS 2025

Quantum Doubly Stochastic Transformers

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

A hybrid quantum-classical Transformer model with quantum-induced doubly-stochastic attention that stabilizes and improves small-scale vision transformers

Abstract

Keywords
Quantum Machine Learning, Transformers, Self-Attention, Computer Vision

Reviews and Discussion

Review
Rating: 4

This paper presents a novel way of modifying the Attention Mechanism in Transformers by replacing the softmax operator with a quantum algorithm for generating a doubly-stochastic matrix. The quantum method is a new finding in the theory of optimal transport and it is used here in the context of Transformers to achieve more stable training and increased expressiveness. Numerous experiments studying the effect of the method are performed showcasing its promise and limitations.

Strengths and Weaknesses

Strengths

  1. The paper addresses an important and impactful question in modern AI - the quest for a more expressive and easily trainable Transformer.
  2. The paper's application of a DSM generating method is novel in this space and showcases the potential of using quantum algorithms in machine learning.
  3. The paper thoroughly tests the proposed ideas experimentally, exploring different connections and caveats carefully, even in the absence of well-developed theory.
  4. I liked the inclusion of a counting argument for DSMs in the quantized space. It is valuable to have theoretical insights, despite the lack of a complete understanding of the DSM generation problem either in the classical or quantum space.
  5. The experimental results seem promising - this method could suggest that striving for a DSM transformation of the attention score matrix is an important future area of research.

Weaknesses

  1. The paper is written in a slightly quantum-agnostic manner, which makes it a bit hard to read for a non-expert in quantum computing. This primarily shows up when the QontOT method is explained. I think a primer of the method and a quick explanation of how it works would be a good addition to the paper to improve clarity. Beyond affecting the flow of the paper though, it is also a matter of soundness - I would have liked to see a formal statement (maybe in the form of a lemma) of what the QontOT method guarantees in terms of input-output relationship and why it is applicable to use it with attention scores.
  2. Practically, the method has a few weaknesses, which, while acknowledged by the authors, make it difficult to tell how far the method is from actually being applicable, even in the presence of quantum-capable hardware.
    • First, the error rate for each quantum computation is really high. The authors do mention that around $\Omega(n^2 / \varepsilon^2)$ samples need to be taken for reducing the error, which ultimately forces them to consider really small values of the input length $n$.
    • On that same note, findings like Figure A1 worry me a bit: is it suggesting that the error does not diminish even when the number of shots is increased exponentially?
    • Because the experiments are performed with really small values of $n$, I worry about the statistical significance when compared with softmax. It seems like in many experiments the method is only marginally better than softmax, which may not be the case when $n$ is larger. This could also be a positive for the method, but it means that more experiments will somehow have to be run with larger values of $n$. Exploring algorithmic ways to reduce the effect of noise and maintain desirable qualities of the output matrices seems like an important question to suggest.
  3. I am still a bit unsure about the value of having doubly-stochastic matrices over row-stochastic matrices only in attention. The authors cite the Sinkformer paper as a justification of this, but I'm wondering if double stochasticity is widely verified as a desirable property. What does it gain us? Is it expressiveness? Stability? Of course, studying this might fall beyond the scope of the paper, but it bears directly on the paper's claim that "double-stochasticity is better than row-stochasticity". Is this actually the case? (A toy contrast of the two normalizations is sketched after this list.)
  4. QontOT is a parametric method for double-"stochastifying" a matrix. But how does the result relate to the input? As far as I understand, the method starts with the attention score matrix $M = QK^\top$ and then applies QontOT to the exponentiated version $e^M$. How is the result interpretable in terms of $M$? For softmax the intuition is that each row is a categorical distribution over the similarity scores. Now, since QontOT is a black box in terms of presentation, it's hard to tell what the output actually is.
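To make the row- vs. doubly-stochastic distinction in point 3 concrete, here is a minimal NumPy sketch (a generic illustration, not the authors' implementation) contrasting the softmax, which normalizes each row of $e^M$, with a plain Sinkhorn iteration, which alternately normalizes rows and columns of $e^M$ until the result is approximately doubly stochastic.

```python
import numpy as np

def softmax_rows(M):
    # Row-stochastic: each row of exp(M) is normalized to sum to 1.
    E = np.exp(M - M.max(axis=1, keepdims=True))  # max-shift for numerical stability
    return E / E.sum(axis=1, keepdims=True)

def sinkhorn(M, n_iter=50):
    # Approximately doubly stochastic: alternate row/column normalization of exp(M).
    D = np.exp(M - M.max())
    for _ in range(n_iter):
        D /= D.sum(axis=1, keepdims=True)  # rows sum to 1
        D /= D.sum(axis=0, keepdims=True)  # columns sum to 1
    return D

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))      # toy attention-score matrix standing in for QK^T
A_soft = softmax_rows(M)         # row sums are 1, column sums generally are not
A_sink = sinkhorn(M)             # both row and column sums are ~1
print(A_soft.sum(axis=0), A_sink.sum(axis=0))
```

The column sums of the softmax output are generally not 1; closing exactly that gap is what the doubly stochastic variants discussed in this review aim for.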

Questions

  1. How does QontOT exactly transform the input to the output?
  2. Are there issues of numerical stability by applying QontOT to the exponentiated output? Otherwise, what other methods of ensuring positive matrices would you consider?
  3. How do you explain that the Frobenius distance in Figure A1 does not decrease with the number of shots?

Limitations

Yes

Final Justification

This paper finds an interesting connection between quantum computing and Transformer architectures, presenting a novel technique for having a doubly-stochastic matrix in the attention module. Though I was initially skeptical about the value of such a transformation, the rebuttal discussion allowed me to dive into some concurrent and recent works that further support the claim that doubly-stochastic attention matrices are desirable. Also, the paper does a good job of exploring their proposed method via a large number of empirical experiments and ablation studies.

At the same time, however, I remain slightly unsatisfied with the clarity of presentation of the paper. Furthermore, I remain unsure about the scalability of this approach (or approaches similar to it) to large or even moderately large context lengths, given the known susceptibility of quantum hardware to errors. In its current state, the method still appears impractical to me.

Formatting Issues

N/A

Author Response

Thank you for the feedback, please find our responses below. We're glad to discuss further during the discussion period if there are other questions!

  • (Q1): The paper is written in a slightly quantum-agnostic manner...

    • Answer: Thank you for this feedback. It is difficult to hit the right tone addressing both quantum and classical AI researchers but we will revisit the QontOT section to make it more clear.
  • (Q2): First, the error rate for each quantum computation...

    • Answer: Note that these $\Omega(\cdot)$-type lower bounds are generally pessimistic, as they aim to capture the worst-case instances, which are rarely met in practice. Recall that the classical complexity of attention is also $\Omega(n^2)$ (see e.g., Alman and Song, NeurIPS 2023), which matches the quantum lower bound with respect to $n$. However, these lower bounds are not very informative, since there exist many efficient approximation algorithms and heuristics that surpass the quadratic dependence. Our quantum hardware experiment points in the same direction. Instead of the theoretical lower bound of 640K shots, we obtained high information preservation with as little as 10K shots, a reduction by a factor of 64.
  • (Q3): On that same note, findings like Figure A1 worry me a bit...

    • Answer: The quantum hardware result in Figure A1 suggests two things. First, the Frobenius distance between the exact DSM and the one obtained from a noisy quantum computer does not improve much beyond a shot count of 10K. This might be due to a multitude of factors but is most likely explained by high noise in existing quantum hardware. While this is indeed not ideal, we argue that the Frobenius distance is not the most meaningful measure to compare exact and noisy attention matrices. Attention is not about absolute values, but about relative values: does token A or token B get more attention from the source token? Panel B of Appendix Figure A1 shows that the ordering of the attention scores is preserved extremely well. With as little as 10K shots, we obtain a Spearman rank correlation above 0.9 (a toy version of this comparison is sketched at the end of this response block). In practice, this means that the QDSFormer preserves peak attention scores but alters their absolute value by adding additional entropy. This entropy might indeed help to mitigate known issues with softmax like attention entropy collapse (Zhai et al., ICML 2023) and thus act as regularization.
  • (Q4): Because the experiments are performed with really small values of $n$ ...

    • Answer: This is a great point and we agree that more experiments are always better. We take this as an opportunity to invite you to revisit our experiment on the Eureka dataset (Figure 4). This figure clearly shows that a plain ViT obtains an accuracy of around 55% while the QDSFormer achieves above 80% in almost all settings. This is a large step forward on this challenging compositional image classification task. Moreover, regarding small gains and statistical significance: the performance gain in Table 1 is well above the standard deviation in almost all cases. Indeed, we ran a two-sided Wilcoxon signed-rank test between all ViT and QDSFormer results in Table 1 ($n=40$) and obtained a clearly significant improvement ($p<0.017$). While our improvements in Table 1 are often below 1%, the absolute numbers are not very meaningful for accuracies above 90%; indeed, the gain translates to a reduction in error rate of around 10%, which is substantial.
  • (Q5): I am still a bit unsure about the value of having doubly-stochastic matrices...

    • Answer: There is a huge body of literature discussing limitations of attention in Transformers, many of which tend to be naturally fixable with doubly stochastic attention (e.g., entropy collapse [22] or rank collapse [23]). Since the Sinkformer, many papers have refined the idea of doubly stochastic attention, e.g., the concurrent work of the ESPFormer at ICML this year. Ultimately, time will show how practically useful this stream of research will be.
  • (Q6): How does QontOT exactly transform the input to the output? How is the result interpretable in terms of $M$?

    • Answer: QontOT is a variational quantum circuit that performs a unitary transformation as described in Section 2.3.3. The actual transformation is obtained by the ansatz realization through various two-qubit gates (mostly ECR gates) that provably yields a DSM. Due to the difficulty of interpreting this, Section 3 provides extensive empirical experiments. We find that Sinkhorn often produces collisions (i.e., different attention score matrices $M$ yield the same DSM $D$) whereas QontOT remains injective. Moreover, the Frobenius distance between $M$ and $D$ is lower for QontOT than for Sinkhorn attention and the entropy is slightly higher, which is generally regarded as beneficial (see e.g., ref [22]).
  • (Q8): Are there issues of numerical stability by applying QontOT to the exponentiated output? Otherwise, what other methods of ensuring positive matrices would you consider?

    • Answer: That is a beautiful question. One advantage of QontOT over Sinkhorn attention is that it does not require exponentiation of its inputs. As we found in a new experiment for Reviewer 1 (gGX6), Sinkhorn's algorithm produces collisions whenever $M$ is low-rank and has essentially constant rows. For example, Sinkhorn fails to differentiate the attention score matrices $\mathbf{e}_2 \mathbf{1}^\top$ and $\mathbf{e}_4 \mathbf{1}^\top$ (where $\mathbf{e}_2$ and $\mathbf{e}_4$ are the second and fourth columns of the identity); it maps both to the center of the Birkhoff polytope. Instead, the QDSFormer avoids collisions, which is critical because it implies that it does not confuse cases where attention matrices are row-wise constant but each row has a unique value.
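The collision described in this answer is easy to reproduce with a generic Sinkhorn routine (a sketch assuming standard exponentiation followed by alternating row/column normalization, not the authors' exact code): both $\mathbf{e}_2 \mathbf{1}^\top$ and $\mathbf{e}_4 \mathbf{1}^\top$ are mapped to the uniform matrix $\tfrac{1}{4}\mathbf{1}\mathbf{1}^\top$, the center of the Birkhoff polytope.

```python
import numpy as np

def sinkhorn(M, n_iter=100):
    # Generic Sinkhorn: exponentiate, then alternately normalize rows and columns.
    D = np.exp(M)
    for _ in range(n_iter):
        D /= D.sum(axis=1, keepdims=True)
        D /= D.sum(axis=0, keepdims=True)
    return D

e = np.eye(4)
M2 = np.outer(e[:, 1], np.ones(4))   # e_2 1^T: row 2 is all ones, other rows are zero
M4 = np.outer(e[:, 3], np.ones(4))   # e_4 1^T: row 4 is all ones, other rows are zero
D2, D4 = sinkhorn(M2), sinkhorn(M4)
print(np.allclose(D2, D4), np.allclose(D2, np.full((4, 4), 0.25)))  # True True
```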
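Regarding the rank-correlation argument in the answer to (Q3) above, the Frobenius-vs-Spearman comparison can be sketched as follows. This is a toy illustration only: the synthetic Gaussian perturbation stands in for hardware noise, and the 0.9 correlation quoted above comes from the paper's hardware run, not from this snippet.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
D_exact = rng.dirichlet(np.ones(8), size=8)                      # stand-in for an exact attention matrix
D_noisy = D_exact + rng.normal(scale=0.02, size=D_exact.shape)   # synthetic "hardware" noise

frob = np.linalg.norm(D_exact - D_noisy)                         # compares absolute values
rho, _ = spearmanr(D_exact.ravel(), D_noisy.ravel())             # compares only the ordering
print(f"Frobenius distance: {frob:.3f}, Spearman rho: {rho:.3f}")
```

Even when the Frobenius distance is non-negligible, the ordering of attention scores (Spearman) can remain close to 1, which is the point made in the answer above.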
Comment

Thank you for responding to my comments and clarifying my questions.

  • A more involved description of QontOT in the appendix, perhaps clarifying the formal guarantees of the method a little more, would help make the exposition a bit more approachable.

  • Also in terms of clarity, I think my main confusion while reading this paper stemmed from the fact that two separate "themes", both important, are explored:

    • Using QontOT as a transformation of the $QK^\top$ matrix. This is a novel idea and improves the state of the art (with caveats, of course, in terms of error and scalability - but still promising).
    • Understanding how QontOT compares to other techniques (eg. Sinkhorn) in terms of the "quality" and "quantity" of DSMs it can produce.

    Both of these questions are explored adequately, but for me the first one deserves a bit more of a spotlight in the paper. A section, or paragraph, for instance, where the method is more explicitly outlined. Right now this is mainly done through Figure 1. Perhaps separating these two "themes" more explicitly could improve clarity.

  • Thank you for re-emphasizing the amount of existing work on DSM Attention. It is definitely an important direction, and this paper definitely contributes positively towards improving the state-of-the-art.

Overall, I think the paper contains a valuable contribution. My main concerns are ultimately about clarity, and they seem to be addressed / addressable in a revision. Therefore, I will be raising my score.

Comment

Many thanks for your thoughtful response (which really helps to improve our paper) and for your willingness to adjust your initial score.

We agree that providing formal guarantees about the behavior of QontOT is important. We refrained from this because some guarantees (e.g., that the output is always a DSM) were worked out already by Mariella et al., (ICML 2024). We will update the description of QontOT giving better intuition in the main text and list formal guarantees in the appendix.

Thanks for the advice on the two "themes". We tried to split these two themes between Section 3 and Section 5, but it is true that the Introduction could be clearer regarding our contributions on these two themes. We will clarify that in the final version!

Review
Rating: 4

This paper proposes the Quantum Doubly Stochastic Transformer (QDSFormer), which replaces the standard softmax in self-attention with a variational quantum circuit (QontOT) that directly outputs doubly stochastic matrices (DSMs).

Unlike Sinkhorn-based methods, which are non-parametric and approximate, QontOT is parametric and theoretically guarantees DSM outputs. The authors demonstrate improved expressivity (more diverse DSMs) and higher entropy in attention, which stabilizes training. Empirical results on small-scale vision tasks (e.g., MNIST, FashionMNIST, MedMNIST) show QDSFormer consistently outperforms standard Vision Transformers (ViT) and Sinkformers, particularly improving training stability on difficult tasks like the Eureka dataset.

Strengths and Weaknesses

Strengths

  • Novel parametric DSM layer. Prior work enforced bistochasticity with iterative, non‑learnable operators such as Sinkhorn in Sinkformers or entropic OT; the proposed quantum circuit is the first trainable alternative and therefore enlarges the design space of attention normalisers.

  • Theoretical grounding: Clear analysis of expressivity advantages over Sinkhorn and QR methods (e.g., Figure 2).

  • Training-stability benefit. The proposed model avoids "entropy collapse" and triggers Eureka moments significantly earlier than softmax or NormSoftmax baselines.
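Since "entropy collapse" is the failure mode referenced here, a minimal sketch of the quantity involved may help: the mean Shannon entropy of the attention rows, shown below as a generic definition rather than the paper's exact diagnostic.

```python
import numpy as np

def mean_row_entropy(A, eps=1e-12):
    # Mean Shannon entropy of the rows of a (row-)stochastic attention matrix A.
    return float(-(A * np.log(A + eps)).sum(axis=1).mean())

n = 8
uniform = np.full((n, n), 1.0 / n)   # maximal entropy: log(n) ~= 2.08 for n = 8
collapsed = np.eye(n)                # near-zero entropy: each row attends to a single token
print(mean_row_entropy(uniform), mean_row_entropy(collapsed))
```

Entropy collapse refers to attention matrices drifting toward the second, near-zero-entropy regime during training; higher-entropy (e.g., doubly stochastic) attention counteracts this.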

Weaknesses

  • Scale & domain scope. All positive results are on small images or 1‑D spectra; no evidence is given for ImageNet‑level problems where ViTs excel. It would be very interesting to study the proposed method on a slightly larger scale setting such as ImageNet-1k (the model size could be small scale).

  • Ablation. No experiment shows that trainability of the quantum operator, as opposed to simply using a richer classical layer (e.g., Mixture‑of‑Softmaxes), is essential for the observed gains.

Questions

NA

Limitations

NA

Final Justification

I would like to thank the authors for their rebuttal, and I will keep my original score.

Formatting Issues

NA

Author Response

Thank you for the positive feedback, below we elaborate on the raised concerns.

  • (Q1): Scale and domain scope. All positive results are on small images, ImageNet-1K...

    • Answer: Thanks for this suggestion. While larger scale would indeed always be desirable, we emphasize that among the reported experiments we already included the infrared-spectral dataset of Alberts et al., (NeurIPS 2024) which includes almost one million samples. Having covered object recognition (images) as well as multi-label time-series classification from small-scale (MNIST) up to 1M samples, we hope to have shown sufficient evidence on versatility and scalability.
  • (Q2): Ablation. No experiment shows that trainability of the quantum operator, as opposed to simply using a richer classical layer (e.g., Mixture‑of‑Softmaxes), is essential for the observed gains.

    • Answer: Indeed, end-to-end training of the quantum circuit parameters jointly with the ViT revealed no significant benefit over a more rigid scheme (see Appendix Figure A8). We believe that this is due to the presence of barren plateaus, a ubiquitous challenge in quantum ML (for a review see Larocca et al., Nature Reviews Physics 2025). Due to (1) the empirical superiority of doubly stochastic attention as reported in the Sinkformer (AISTATS 2022) and the ESPFormer (ICML 2025) and (2) the richness of the literature on the softmax in general, we have limited our comparisons largely to flavors of doubly stochastic attention. Among those, the QDSFormer consistently performs better than competing methods like the Sinkformer or the QR decomposition.
Review
Rating: 4

This paper proposes a hybrid classical-quantum doubly stochastic Transformer, which modifies the attention layer. Then, the authors study the expressive power and show that the proposed model can preserve more information than classical operators. This model can also match or surpass ViT on some real-world datasets. I am not an expert in this area. I will consider the review from other reviewers and the AC to make a final decision.

Strengths and Weaknesses

Strengths:

  1. The proposed method looks interesting and novel.
  2. The paper is overall well-written and easy to follow.

Weaknesses:

  1. There are no experiments in the specific settings of quantum computing. The experiments are limited to small-scale datasets.

  2. The improvement in the performance of the proposed model is not significant. Specifically, the improvement in Table 1 is generally below 1%. In Table 2, the performance gain is clearly beyond the error bar on only three out of seven datasets.

  3. The theoretical analysis in Section 4 is unclear. Although the discussion there is about expressive power, there is no complete theorem presented in this section.

Questions

Can you theoretically compare the expressive power of the proposed model and any other baseline method or models in Table 1? Can you empirically provide a comparison of the computation efficiency between the proposed method and baselines in Table 1?

Limitations

NA

Final Justification

The responses addressed my questions well. I have increased the score to 4.

Formatting Issues

None

Author Response

Thank you for your positive comments about novelty and writing. Below we provide responses for specific questions.

  • (Q1): There are no experiments in the specific settings of quantum computing. The experiments are limited to small-scale datasets.

    • Answer: These are two separate points. First, we do provide experiments on quantum hardware. This is rare for quantum ML papers as there are only around a hundred functional quantum computers globally. Our 14-qubit experiment is described in Section 5 (L 322ff, p. 9) and demonstrates that even on noisy hardware, the ordering of the attention scores is preserved, implying that the same tokens would receive high attention as in a noise-free simulation. Secondly, we have provided experiments on datasets with almost one million samples which is not considered "small-scale" for quantum circuits (indeed it is pretty large-scale). Of course, there are scaling limitations in existing quantum hardware, but we unfortunately have no control over this. If there are specific follow-up questions regarding our paper we are happy to discuss further.
  • (Q2): The improvement in the performance of the proposed model is not significant. Specifically,...

    • Answer: Regarding Table 1, the performance gain is well above the standard deviation in almost all cases. It would be helpful if you could clarify which statistical measure you refer to when you say "not significant". We ran a two-sided Wilcoxon signed-rank test between all ViT and QDSFormer results in Table 1 ($n=40$) and obtained a clearly significant result ($p<0.017$; a sketch of this test is given after this response block). Our accuracy improvements in Table 1 are often below 1%, but absolute numbers are not very meaningful for accuracies above 90%; indeed, the gain translates to a reduction in error rate of around 10%, which is substantial. Regarding Table 2, it is clearly mentioned in the paper that our performance gain is above the standard deviation for 5 out of 7 datasets with an average improvement of 1.3%. Also, please do not forget our experiment on the Eureka dataset shown in Figure 4. This figure clearly shows that a plain ViT obtains an accuracy of around 55% while the QDSFormer achieves above 80% in almost all settings. This is a large step forward on this challenging compositional image classification task.
  • (Q3): The theoretical analysis in Section 4 is unclear. Although...

    • Answer: As noted in Section 4, the expressivity is related to counting the number of DSMs, which is an open problem in mathematics (see e.g., ref [47]). A full theorem is beyond the scope of this work, as solving this would be worth a standalone paper in a prestigious mathematics journal (we highlight this already in Section 4). Therefore a fully theoretical comparison of the operators is infeasible. Note that we obtained a partial result on the discrete size of the Birkhoff polytope. We also provide a thorough empirical comparison based on existing results and techniques in Section 3.
  • (Q4): Can you empirically provide a comparison of the computation efficiency between the proposed method and baselines in Table 1?

    • Answer: The speed of QontOT is determined entirely by the underlying quantum hardware. Since the clock time of quantum computers is in the kHz range, unlike the GHz range of even personal silicon devices, a runtime comparison will always discourage the integration of any quantum technique. Appendix Figure A1 shows the performance of QontOT as a function of circuit execution time, which reveals that good performance can be obtained with fewer than 10K shots (i.e., several seconds for one pass). In simulation, the QDSFormer runs much faster, at a speed roughly 5x that of a conventional ViT. Again, we emphasize that quantum ML methods are generally not compared to classical ML in terms of runtime due to the enormous differences in hardware maturity.
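The significance test mentioned in the answer to (Q2) can be reproduced from per-run accuracies with a few lines. The sketch below uses synthetic placeholder arrays (not the paper's results) purely to show the call pattern for 40 paired per-setting accuracies.

```python
import numpy as np
from scipy.stats import wilcoxon

# Synthetic placeholder data (NOT the paper's results): 40 paired per-setting accuracies.
rng = np.random.default_rng(0)
vit_acc = rng.uniform(0.90, 0.99, size=40)
qds_acc = vit_acc + rng.normal(loc=0.005, scale=0.004, size=40)  # hypothetical small gain

stat, p = wilcoxon(vit_acc, qds_acc, alternative="two-sided")
print(f"Wilcoxon statistic: {stat:.1f}, p-value: {p:.4f}")
```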
Comment

The responses addressed my questions well. I have increased my score to 4.

Comment

Thank you for carefully reading the rebuttal and being open-minded to adjust your score. We will make sure to include the content from this rebuttal in the final paper.

Review
Rating: 4

The paper is motivated by the observation, made in prior works, that doubly-stochastic attention is often better than row-stochastic attention. As there are many different ways to obtain an attention matrix, there is a question of which one is best, which is the topic of study of the paper. A few variants are described and benchmarked: Sinkhorn approximation, projection onto the Birkhoff polytope, QontOT, and QR decomposition, where the metrics are expressivity and performance on downstream machine learning tasks.

Strengths and Weaknesses

Significance

  1. The authors listed several motivations regarding why Sinkhorn's algorithm is not sufficient, but I do not agree with the arguments.
  • Line 42 states that "It [Sinkhorn's algorithm] can guarantee to find a DSM only if the input matrix is non-negative, which is generally not the case within a Transformer". However, Sinkhorn's algorithm often includes a step that replaces each entry of the input matrix with its exponential so that we have a non-negative matrix. Notably, this is already part of the softmax operation in Transformers.
  • Lines 43 to 44 state that "It [Sinkhorn's algorithm] is non-parametric. Thus, in contrast to e.g., a NN layer, it cannot be optimized regarding which DSM should be returned." Isn't it standard to apply some learnable parametric operations to the input matrix before passing it to Sinkhorn's algorithm?
    • For one example, in Transformers, the matrix $X W_Q W_K^\top X^\top$ already has learnable parameters $W_Q, W_K$.
    • As another example, given a matrix $Y$, one can compute $W \odot Y$ or $W_1 Y W_2$ where the $W$'s are learnable, and pass the output to Sinkhorn's algorithm.
  2. Experiments in section 3.1: I am not sure Figure 3 is conclusive.
  • Figure 3 left: In particular, different operators do not follow the same mathematical definition, so taking the input to be the same discretized n-dimensional hypercube may not be optimal and fair to all the operators.
  • Have you tried taking discretized points from the unit sphere $\{x: \|x\|_2 = 1\}$?
  • What if you scale the hypercube/sphere by some constant, say 2, 5, 10?
  • Figure 3 right: it does not make sense to me that the Birkhoff projection by definition minimizes the Frobenius distance between the output doubly stochastic matrix and the input matrix, yet it gets the largest distance.
  3. Experiments in section 5:
  • Figure 3: it appears that the QDSFormer uses multiple quantum circuit layers, whereas for ViT only one (Figure 3a) or two (Figure 3b) layers are used. Is that what the figure is showing? If so, why don't we vary the layers of ViT as well?
  • Tables 1 and 2: If QontOT can have parameters, what if we also give Sinkhorn's algorithm parameters as mentioned above?

Novelty

  1. There are many works related to the current paper. Clarifying connections with them or giving links to related ones would help readers better appreciate the paper.
    1. On obtaining a doubly stochastic matrix by projecting a matrix onto the Birkhoff polytope, there are existing papers on understanding the solution of such a projection as well as efficient sparse-support solvers; see [1] and references therein.
    2. Regarding parameterizing a doubly stochastic matrix via the element-wise squares of a unitary matrix, there are works characterizing the geometry and optimality conditions in the vector case [2].

Clarity

  1. I do not quite understand the QontOT approach in the following sense.
    1. I could not find the definitions of some notations. Perhaps the paper has them but somehow I missed them. For example, on line 123, does $\odot$ mean element-wise multiplication? Does the bar denote the complex conjugate?
    2. Assuming the answers are yes, I can agree that $U \odot \bar{U}$ is doubly stochastic when $U$ is unitary. What is $U(p; \theta)$ and how is it computed? Since QontOT is one of the main things that the paper advocates, it would be great if the paper could be self-contained by having basic descriptions of the architecture, efficiency, and benefits.
    3. Minor typo on line 115.

[1] Understanding doubly stochastic clustering. ICML, 2022.

[2] From the simplex to the sphere: faster constrained optimization using the Hadamard parametrization. Information and Inference, 2023.

Questions

The reviewer would love to see the concerns above addressed and would then revise the rating.

Limitations

Yes

Final Justification

On one hand, multiple revisions have been made as a result of the rebuttal, which is productive. On the other hand, there are many corrections: for example, the correction on experiments (Q6) and on mathematical claims about the algorithms in the paper (Q5, Q1). As a result, my confidence that the authors understand what they claim decreases (especially when it comes to baselines). Therefore, I maintain my initial rating.

Formatting Issues

N/A

Author Response

We thank the reviewer for the very solid feedback and good questions. Below we provide responses point by point.

  • (Q1): Line 42 states that "It [Sinkhorn's algorithm, SA] can guarantee to find a DSM only if...

    • Answer: QontOT produces a DSM given an arbitrary matrix, while Sinkhorn's algorithm (SA) requires a non-negative input. While exponentiation is indeed used as a trick to enable the application of SA to negative matrices, SA itself remains inapplicable to negative matrices. This has important practical implications: a large subset of DSMs cannot be produced, namely the faces of the Birkhoff polytope, because in practice no entry of the exponentiated input can ever be 0. QontOT avoids this limitation.
  • (Q2): Line 43 to 44 states that “It [Sinkhorn’s algorithm] is non-parametric. Thus,...

    • Answer: This is a great and subtle point. The major difference is that QontOT is naturally a parametric model which aids its expressivity, while, for Sinkhorn, even if the input is learnable, the actual algorithm is always fixed.
  • (Q3): Figure 3 left: In particular, different operators do not follow the same mathematical definition, so taking the input to be the same discretized n-dimensional hypercube may not be optimal and fair to all the operators.

    • Answer: We agree. But it is not trivial to find a more fair way to compare all algorithms than exhaustively iterating over an isotropic and equidistantly-spaced grid of all possible matrices (ideally, as fine-grained as computationally possible).
  • (Q4): Have you tried taking discretized points from the unit sphere?

    • Answer: Yes. Sinkhorn produces collisions, while QontOT remains injective (there is no proof that the map is injective, but it appears to be experimentally). In detail, with a discretization of $m=3$ (i.e., $a_{ij} \in \{0, 0.5, 1\}$), our sphere discretization contains 625 $4 \times 4$ matrices whose columns all have unit length. Sinkhorn yields collisions by mapping all rank-1 matrices with constant rows to the center of the Birkhoff polytope: e.g., it fails to differentiate the matrices $\mathbf{e}_2 \mathbf{1}^\top$ and $\mathbf{e}_4 \mathbf{1}^\top$, where $\mathbf{e}_2$ and $\mathbf{e}_4$ are the second and fourth columns of the identity, and thus produces only 621 unique DSMs whereas QontOT yields 625 DSMs (QR: 381). This is critical because it implies that Sinkhorn confuses cases where attention matrices are row-wise constant but each row has a unique value. We will highlight this in the final manuscript.
  • (Q5): What if you scale the hypercube/sphere by some constant, say 2, 5, 10?

    • Answer: All algorithms are scale invariant. Thus we refrained from analysing vectors above unit length.
  • (Q6): Figure 3 right: it does not make sense to me that the Birkhoff projection by definition minimizes the Frobenius distance between the output doubly stochastic matrix and the input matrix, yet it gets the largest distance.

    • Answer: Thanks a lot, good observation; indeed you are correct and we identified an incorrect rescaling before calculating the Frobenius distance. We repeated the analysis. The updated Frobenius distances reflect a slightly new trend: among all operators, the Birkhoff projection now correctly shows by far the lowest Frobenius distance (avg. 1010) between the unnormalized QK matrix and the produced DSM ($\|D - QK\|_F$). Sinkhorn and QontOT show roughly the same avg. Frobenius distance (1050). Interestingly, a vanilla softmax, which is not constrained to produce a DSM, also obtains an avg. Frobenius distance of ca. 1050, indicating that the QontOT transformation does not largely alter the numerical values, yet yields a doubly stochastic attention matrix with higher entropy. We will include the new plot in the final manuscript (note that, at this year's NeurIPS, unlike in the past, authors are unfortunately not allowed to provide links to artifacts/images during the rebuttal period). Thank you for your understanding!
  • (Q7): Figure 3: it appears that the QDS former uses multiple quantum circuit layers, whereas for ViT only one (Figure 3a) or two (Figure 3b) layers are used. Is that what the figure is showing? If so, why don’t we vary the layers of ViT as well?

    • Answer: We did this already. Indeed, Figure 3 shows how increasing the complexity of QontOT (by adding circuit layers) yields better DSMs while keeping the number of ViT layers fixed. Instead, Table 1 shows extensive ablations varying the ViT layers. More results on ViT layers are in Appendix Table A3 and Figures A5 and A6.
  • (Q8): Tables 1 and 2: If QontOT can have parameters, what if we also give Sinkhorn’s algorithm parameters as mentioned above?

    • Answer: Covered in (Q2) above.
  • (Q9): There are many works related to the current paper...

    • Answer: Thank you for pointing out this related literature. It is challenging to keep track of all related work. The work by [1] on spectral clustering builds upon Zass and Shashua (NeurIPS 2006) who proposed Frobenius distance to project onto the Birkhoff polytope, just like we do. The work by [2] on Hadamard parametrization can enforce right-stochasticity without a projection or a Sinkhorn step which might help to mitigate the issues of Sinkhorn with low rank matrices that we reported above. Yet, it is not immediately obvious how such Hadamard parametrization could give rise to a parametric doubly stochastic operator. We will highlight this related work in the final manuscript.
  • (Q10): I could not find the definitions of some notations. ...

    • Answer: Yes, in L123 $\odot$ refers to the Hadamard product. Yes, $\overline{U} = \left(U^\dagger\right)^\top$ denotes the complex conjugate. Sorry for the brevity.
  • (Q11): Assuming the answers are yes, I can agree that $U \odot \bar{U}$ is doubly stochastic when $U$ is unitary. What is $U(p; \theta)$ and how is it computed? ...

    • Answer: We agree that it would be great to include more details on QontOT. We opted for a high-level description and pointed in many places to the QontOT paper by Mariella et al. (ICML 2024), since its quantum theory goes well beyond the scope of our paper. In essence, QontOT leverages the operator Schmidt decomposition and the unitarity principle ($U \odot \bar{U}$ is a DSM) to propose an ansatz $U(p; \theta)$ for amortized optimization of conditional optimal transport. The ansatz is realized by a variational quantum circuit which provably yields a DSM through a checkerboard structure with various two-qubit gates (mostly ECR gates). Since QontOT is the first quantum circuit for optimal transport, it is impossible to compare to previous work in quantum ML (see Mariella et al., ICML 2024).
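The "unitarity principle" invoked here is easy to check numerically: for any unitary $U$, the matrix $U \odot \bar{U}$ (entrywise $|U_{ij}|^2$) has rows and columns that each sum to 1. A minimal sketch with a random unitary (generic NumPy, independent of the QontOT ansatz):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
Z = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
U, _ = np.linalg.qr(Z)                 # QR of a complex Gaussian matrix gives a unitary U

D = (U * np.conj(U)).real              # U ⊙ conj(U), i.e., |U_ij|^2 entrywise
print(np.allclose(D.sum(axis=0), 1.0), np.allclose(D.sum(axis=1), 1.0))  # True True
print(np.allclose(U.conj().T @ U, np.eye(n)))                            # unitarity check
```

The parametric part of QontOT lies in how the unitary $U(p; \theta)$ is built from the variational circuit; the doubly stochastic property itself follows from unitarity alone, as the check above illustrates.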
Comment

Q1: Your explanation makes sense, but it is different from what the original sentence in the paper says: "It [Sinkhorn's algorithm] can guarantee to find a DSM only if the input matrix is non-negative". The sentence needs to be revised for precision.

Q5: "All algorithms are scale invariant": the $\ell_2$ projection onto the Birkhoff polytope is not scale invariant. For example, take $X = [.8, .2; .2, .8]$, so $P(X) = X$; but $2X = [1.6, 0.4; 0.4, 1.6]$ and $P(2X) = I_2$.
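A quick way to verify this example: the $2 \times 2$ Birkhoff polytope is exactly the set of matrices $D(t) = [t, 1-t; 1-t, t]$ with $t \in [0,1]$. Minimizing $\|2X - D(t)\|_F^2 = 2(1.6 - t)^2 + 2(t - 0.6)^2$ over $t$ gives the unconstrained minimizer $t = 1.1$, which clips to $t = 1$ under the constraint, i.e. $P(2X) = I_2$; $X$ itself is already doubly stochastic, so $P(X) = X$.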

Q6: I am glad to see that the comment helped and the issue is resolved!

Q9: "It is challenging to keep track of all related work" - agreed! I am simply pointing out a few. Regarding "Frobenius distance to project onto the Birkhoff polytope, just like we do…": this is why [1] sounds related; it complements Zass and Shashua with theoretical grounds.

Comment

Many thanks for your response, our discussion continues to improve our paper!

Q9: Absolutely, we will better highlight the previous utilization of the Frobenius distance for projection onto the Birkhoff polytope, including also the theoretical work in [1].

Q1: This is indeed a linguistic subtlety about what we consider as "input". Many implementations of Sinkhorn's algorithm (e.g., the one used in the Sinkformer (AISTATS 2022)) use an internal exponentiation to ensure that the conditions for applying the algorithm are met. Other implementations, like the one in the Python package ipfn, require the user to take care of this and simply don't converge otherwise. In the final version we will clarify the difference between the algorithm per se and the implementations used in practice.

Q5: We apologize for the misunderstanding. You are right that projecting onto the Birkhoff polytope while minimizing the Frobenius distance is not scale-invariant. What we meant to say is that we limited the expressivity analysis in Section 3 to vectors with unit length because all classical algorithms that we later use in Section 5 as doubly stochastic Transformers are indeed scale-invariant. We don't use the Birkhoff projection inside a doubly stochastic Transformer because it would break differentiability. We will use the extra page in the final paper to better document invariances and equivariances, specifically:

  • Sinkhorn's algorithm is rotation-equivariant, permutation-equivariant and scale-invariant
  • QR decomposition is scale-invariant
  • Birkhoff projection is rotation-equivariant and permutation-equivariant

Thanks again for engaging so much and for your positive evaluation of our work.

Comment

Dear reviewers,

Thank you for your efforts. The authors have provided a detailed response. Please check if any concerns remain and engage in discussion with authors.

Best, AC

Final Decision

The paper is motivated by the destabilizing effect of the softmax normalization on the attention matrix and builds on previous works showing that doubly stochastic attention can improve performance. Due to the limitations of classical implementations of doubly stochastic matrices (DSMs) and the possibility of approximating them on a parametric quantum circuit, the paper proposes the Quantum Doubly Stochastic Transformer, a hybrid quantum-classical Transformer with a variational quantum circuit for the attention layers, and empirically shows that the doubly stochastic attention improves performance. The experiments are of good scale with respect to the known quantum limitations, along with a hardware implementation.

All concerns were addressed during the rebuttal and none of the reviewers recommended rejection. We ask the authors to improve the clarity of the paper based on the rebuttal and the reviewers' suggestions.