PaperHub
ICLR 2024 · Poster · 5 reviewers
Average rating: 6.6 / 10 (scores 8, 5, 6, 6, 8; min 5, max 8, std 1.2)
Average confidence: 3.2

Scaling Convex Neural Networks with Burer-Monteiro Factorization

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-03-16
TL;DR

We apply the Burer-Monteiro factorization to two-layer ReLU (fully-connected, convolutional, self-attention) neural networks by leveraging their implicit convexity, and provide insights into stationary points and local optima of these networks.

Abstract

Keywords
burer-monteiro · convex optimization · neural networks · stationary points · global optima · relu activation

Reviews and Discussion

Official Review (Rating: 8)

This paper revisits recent developments which reformulate the training of a two-layer neural network as a convex optimization program that is, however, often computationally intractable. This paper proposes to apply the Burer-Monteiro factorization to such formulations in order to make them as tractable as the original non-convex formulation. In some cases this recovers the original non-convex formulation, and in other cases (such as for ReLU MLPs) it gives a novel formulation. The analysis includes linear but also, for the first time in the literature, non-linear ReLU networks, and it tackles MLPs, ConvNets, and self-attention networks. The overall result is that the Burer-Monteiro factorization yields algorithms for training two-layer networks that are as efficient as their original non-convex formulations, but which can have guaranteed theoretical properties such as a bound on the relative sub-optimality gap. Finally, these developments are plugged into a (tractable) layer-wise training procedure for CNNs deeper than two layers, which allows them to inherit the theoretical guarantees developed in the paper (and empirically, such CNNs are comparable to state-of-the-art training of CNNs in terms of test error).

Strengths

Originality

I believe this paper is original, since to my knowledge it is the first to consider the Burer-Monteiro factorization of the convex formulation of non-linear ReLU two-layer neural networks; additionally, in this case the BM factorization differs from the original non-convex formulation, which makes it an original object to study in itself.

Quality

I believe the work is of quality, with theorems and their assumptions clearly stated, and their detailed proofs provided in appendix. Additionally, the literature review seems exhaustive, and some care was given to the experiments, with the code provided in the supplemental.

Clarity

I think the paper is clear, with the motivations and main results clearly highlighted.

Significance

I believe this work is very significant, since ReLU networks (MLPs, CNNs, and self-attention), contrary to linear networks, are actually widely used in the machine learning community. Therefore, I believe the BM formulation obtained in the paper, as well as the theoretical guarantees that follow from it, will be of interest to the community. Additionally, the empirical result at the end regarding the training of CNNs deeper than two layers is encouraging and hints at the applicability of the paper's results even to networks deeper than two layers. Finally, I think that the discussions around the convergence of (S)GD, stationary points, global optima, and the new theoretical developments in the paper related to those issues will be interesting to the community.

Weaknesses

I just have a question regarding the paper (see below).

Questions

It is still a bit unclear to me how the training of the BM factorization for ReLU networks (19) is done in practice: since this is a constrained optimization problem, I guess its training should depart from simple vanilla, unconstrained GD? From the experimental section, and also after briefly looking at the corresponding code, unless I am mistaken, it seems that training of pure ReLU networks has not been investigated (I think only gated ReLU and linear networks were trained in practice?). Since analyzing (non-gated) ReLU networks is one of the main results of the paper, I think it could have been interesting (perhaps in an Appendix) to compare in more detail, even just on the toy spiral dataset and for two-layer ReLU MLPs, the training of the BM factorization vs. the original MLP training, since from eq. (19) the two formulations differ, and since BM has some theoretical guarantees which the original non-convex formulation does not have.

Comment

We would like to thank the reviewer for the feedback and comments. Please see our responses below, which provide additional points of clarification on the reviewer's open questions.

  1. It is still a bit unclear to me how the training of the BM factorization for ReLU networks (19) is done in practice

(19) can be solved in practice using a variety of methods, the simplest of which is projected gradient descent. We now include a high-level description of the algorithm in Appendix A.1 to explain how this would be solved, along with the time complexity of the solution.
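For intuition, here is a minimal, illustrative sketch of the projected-gradient-descent idea. For simplicity it uses a squared loss with Frobenius-norm regularization on the factors (the linear case), and the `project` callback is a placeholder for the constraint set of (19), whose exact form we do not reproduce here; this is a sketch of the general approach, not the algorithm described in Appendix A.1.

```python
# Illustrative PGD sketch for a Burer-Monteiro-factorized two-layer objective:
#   min_{U,V}  0.5 * ||X U V^T - Y||_F^2  +  (beta/2) * (||U||_F^2 + ||V||_F^2)
# The `project` argument stands in for the constraints of the ReLU program (19).
import numpy as np

def pgd_bm(X, Y, m, beta=1e-3, lr=1e-3, n_iters=2000, project=None, seed=0):
    n, d = X.shape
    c = Y.shape[1]
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((d, m))    # first-layer factor
    V = 0.1 * rng.standard_normal((c, m))    # second-layer factor
    for _ in range(n_iters):
        R = X @ U @ V.T - Y                  # residual of the factorized model
        grad_U = X.T @ R @ V + beta * U
        grad_V = R.T @ X @ U + beta * V
        U -= lr * grad_U                     # gradient step on both factors
        V -= lr * grad_V
        if project is not None:              # projection onto the feasible set
            U, V = project(U, V)
    return U, V

# Tiny usage example on random data (project=None, so PGD reduces to plain GD).
X = np.random.randn(50, 10)
Y = np.random.randn(50, 3)
U, V = pgd_bm(X, Y, m=20)
print(np.linalg.norm(X @ U @ V.T - Y))
```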

Comment

I would like to thank the Authors for their efforts during the rebuttal: I have read the revision and the additional section regarding Projected Gradient Descent and its complexity, and this has answered my question. I have also read the other reviews and responses. As a consequence, I have raised my confidence score.

Official Review (Rating: 5)

The paper considers two-layer neural networks formulated as convex programs, which involve minimizing a loss term regularized by some (quasi-)nuclear-norm term.

The paper applies the Burer-Monteiro factorization to these convex neural networks, and derives several theorems characterizing local minimizers (Theorem 3.3), the duality gap of stationary points (Theorem 3.4), the global optimality of stationary points (Theorem 3.5), and similar results for CNNs (Section 3.2) and multi-head self-attention (Section 3.3).

Strengths

Omitted.

Weaknesses

  • First of all, there appears to be a mismatch between the title and the actual contents of the paper. The title is "Scaling convex neural networks with BM factorization". However, I found no experimental evidence showing that applying BM factorization to convex neural networks leads to more scalable implementations. In particular, I think the title would only be meaningful if the paper compared against the baseline that optimizes the two-layer networks by gradient descent and presented improvements in efficiency. This mismatch also comes with several claims that seem inaccurate to me or lack sufficient justification. For example,

    • The BM factorization is essential for convex neural networks to match deep ReLU networks. The paper is limited to 2-layer networks; I am not sure whether it is suitable to generalize to deep networks. Even in the 2-layer case, the BM factorization is not shown to be essential or promising compared to the above baseline.
    • Without the BM factorization, the induced regularizer of convex CNNs is intractable to compute, and the latent representation used for layerwise learning is prohibitively large. However, what remains to be justified is this: why is layerwise training necessary, and why can one not just train all layers together (as the above baseline)?
  • Second, the theorems seem to be straightforward extensions of prior works (e.g., as the paper mentions, Bach et al. (2008) and Haeffele et al. (2014)). Please compare with prior works at the technical level. What is the novelty of the paper at the technical level? Please justify. Maybe it makes sense to make a table for comparison.

Overall, I found that the paper focuses too much on writing and rewriting the problems in different ways, without making a special effort to actually solve the problems. Having presented many reformulations, the paper leaves the readers wondering whether BM factorization works or not. In particular, the theorems are weaker than their convex counterparts (the latter guarantee global optimality). If the paper is unable to show that BM factorization leads to improvements in scalability or efficiency, the value of the proposed approach would be greatly limited.

Minor:

  • In Eq. (1), maybe it would be great to have parentheses for the summation term.
  • Sentence below Eq. (5): two "thereby", thereby not reading well.
  • The paragraph after Eq. (19): "While the original convex program is NP-hard". It is unclear to me why convex programs are NP-hard.

Questions

See "Weakness".

Comment

We would like to thank the reviewer for the feedback and comments. We hope that you would consider increasing your score if your concerns are adequately addressed. Please see our responses below.

  1. Showing that applying BM factorization to convex neural networks leads to more scalable implementations.

To further demonstrate that the BM factorization leads to more scalable implementations, we have added an additional section in Appendix A.1 which demonstrates the superior time complexity of the BM formulation compared to the original convex objective.

In addition, the reviewer has some questions around the need for layerwise training and how it relates to this paper. Below we explain in more depth the rationale behind these experiments.

While training deep neural networks directly is possible using SGD heuristically applied to poorly understood non-convex objectives, we seek approaches that can optimize architectures equivalent to deep neural networks while retaining convex guarantees. While most of the theory around convex neural networks allows us to re-state two-layer neural networks as convex programs, we wish to extend the capabilities of this approach to deeper networks. Thus, layerwise learning is a natural solution -- each two-layer sub-component of a deeper neural network can be trained to a global optimum using convex optimization. Layerwise learning has already shown great success in the non-convex training domain (Belilovsky et al., 2019). However, naively using convex optimization to train layerwise networks does not scale computationally, for the reasons discussed in Section 4.2.

Therefore, we show how the BM factorization applied to these convex networks (1) scales in a way that the original convex program does not, allowing us to train these architectures in a layerwise fashion that matches the performance of end-to-end trained deep neural networks; and (2) preserves convex guarantees around optimality and has an easily computable bound on the relative optimality gap. We hope this makes the motivation for the layerwise training experiments clearer to the reviewer.

  2. What's the novelty of the paper at the technical level?

The primary technical novelty relative to other works such as Bach et al. (2008) and Haeffele et al. (2014) is in our characterization of stationary points. In Theorems 3.3 and 3.4, we characterize the relative optimality gap for stationary points of the BM factorization, which is not shown in any other work. Most other work on the BM factorization focuses on local minima only. This is especially significant because for most NN objectives there is no guarantee that BM will actually find local minima, as we discuss in Section 2.2. We note that there are other aspects of novelty in our work, such as the results on neural networks, though we believe the reviewer does not dispute this and is more curious about the technical innovations in how we study the BM factorization.

Thank you for the minor comments as well -- these have been revised in the updated version (see changes in the revised PDF in blue).

Comment

Dear Reviewer xgT4,

We believe that we have addressed your concerns in our responses. Since the deadline is approaching, we would like to hear your feedback so that we can respond to that before the discussion period ends. Please feel free to raise questions if you have other concerns.

Best regards,

Authors

Official Review (Rating: 6)

This paper proposes to solve convex formulations of nonlinear two-layer neural networks with Burer-Monteiro factorization, which is known to be computationally tractable for solving convex programs with nuclear norm constraints. It provides theoretical optimality guarantees for two-layer MLPs, CNNs, and self-attention networks. Experiments on FashionMNIST and CIFAR-10 are provided to validate the feasibility and effectiveness of the proposed method.

Strengths

  1. The topic studied, i.e., scaling recently proposed convex learning methods, is interesting and technically challenging.

  2. The proposed method is applicable to different convex neural network structures.

  3. The proposed method seems theoretically sound.

Weaknesses

  1. I would bring to the authors' attention that Zhang et al. proposed a convex version of MLPs [1] and of CNNs [2]. In particular, for [2], the framework also imposes a low-rank nuclear-norm constraint. In the arXiv version of [2] (https://arxiv.org/pdf/1609.01000.pdf), they also provided experiments at the scale of the CIFAR-10 dataset, where they applied low-rank kernel matrix factorization techniques such as the Nyström approximation, random feature approximation, the Hadamard transform, etc. A discussion of how the proposed method compares to [1] and [2] should be included, at least theoretically, if not experimentally.

[1] Yuchen Zhang, Jason D. Lee, Michael I. Jordan. ℓ1-regularized Neural Networks are Improperly Learnable in Polynomial Time. ICML 2016.
[2] Yuchen Zhang, Percy Liang, Martin J. Wainwright. Convexified Convolutional Neural Networks. ICML 2017.

  2. The authors could include a big O notation analysis on the computational cost to better illustrate exactly how much the BM factorization helps with scaling the method.

Questions

N/A

Comment

We would like to thank the reviewer for the feedback and comments. We have added some recommended clarifications mentioned by the reviewer. Please see our responses below.

  1. The discussion on how the proposed method is compared to [1] and [2] should be included at least theoretically, if not experimentally.

We thank the reviewer for the notes on [1] and [2] -- these are interesting works that propose convex alternatives to neural networks. We should note that these are slightly different from the line of work we focus on in our paper, because they do not exactly match the objectives and architectures of typical non-convex neural networks, whereas in this paper we focus on exact convex formulations of an identical neural network objective (i.e., a solution to (3) maps one-to-one to a solution of (1)). We have included a brief mention of [1] and [2] in our related work section.

  2. The authors could include a big O notation analysis on the computational cost to better illustrate exactly how much the BM factorization helps with scaling the method.

We have now included this big O notation analysis to illustrate the per-iteration complexity of solving the BM factorization compared to the original convex formulation in Appendix A.1.

Comment

Thanks for the response. It addresses my concerns. I'd like to keep my score and lean towards acceptance.

Comment

Dear Reviewer 8nq3,

We believe that we have addressed your concerns in our responses. Since the deadline is approaching, we would like to hear your feedback so that we can respond to that before the discussion period ends. Please feel free to raise questions if you have other concerns. Thank you very much for your support, we really appreciate that!

Best regards,

Authors

Official Review (Rating: 6)

This paper proposes a Burer-Monteiro (BM) formulation for convex two-layer neural network training problems. In particular, the BM formulations for MLPs (with linear, ReLU, and gated ReLU activations), CNNs (with ReLU activations), and self-attention networks (with linear activations) are included.

The convex neural network training problem for linear and gated-ReLU-activated MLPs can be directly solved using polynomial-time algorithms such as interior-point methods. However, the problem becomes NP-hard for ReLU-activated MLPs, CNNs, and self-attention networks due to the quasi-nuclear-norm regularizer, which is computationally intractable in practice.

The proposed nonconvex BM formulation turns the quasi-nuclear-norm regularizer in the convex training problems into a constrained Frobenius-norm penalty, which is computationally tractable. Despite the nonconvexity, the authors develop an optimality bound for stationary points of the BM formulation, which upper-bounds the optimality gap between the original convex problem and its BM formulation and, in some cases, is able to certify the global optimality of a given stationary point.
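For reference, the variational identity underlying this substitution (stated here for the plain nuclear norm; the paper's quasi-nuclear norms add constraints on the factors) is:

```latex
% Standard variational form of the nuclear norm used by Burer-Monteiro
% factorizations (plain nuclear norm shown; not the paper's exact regularizer).
\|Z\|_{*}
  \;=\; \min_{\substack{U \in \mathbb{R}^{d \times m},\; V \in \mathbb{R}^{c \times m} \\ Z = U V^{\top}}}
        \tfrac{1}{2}\bigl(\|U\|_{F}^{2} + \|V\|_{F}^{2}\bigr),
  \qquad \text{attained whenever } m \ge \operatorname{rank}(Z).
```

So replacing $Z$ by $U V^{\top}$ and the nuclear-norm penalty by the Frobenius terms yields an equivalent but nonconvex program in the factors.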

Strengths

  1. The proposed BM formulation provides a computationally tractable way to handle convex neural network training problems that involve quasi-nuclear norm regularizers.
  2. This paper is the first to combine BM with convex neural network training, which is an interesting direction.

Weaknesses

  1. The proposed BM formulation doesn't seem to provide any advantages other than dealing with quasi-nuclear norms. As BM requires $m \geq d+c$ to ensure no spurious minima, it increases the number of variables from $dc$ to $m(d+c)$ (for an MLP). For linear and gated-ReLU-activated MLPs, it seems it would be a lot more efficient to directly solve the original convex problem.
  2. The proposed BM formulation for CNNs is still not practical, as the memory requirement is too high. In fact, the authors adopt layerwise training for solving the BM formulation, which has no guarantee of converging to the global optimum.
  3. Continuing from 1 and 2: the paper is titled “Scaling Convex Neural Networks”, but it seems the proposed BM formulation only addresses the intractability of the quasi-nuclear norm; it does not reduce the time complexity or the memory requirement for solving the convex neural network training problem. It is questionable whether the proposed BM formulation would be scalable to large two-layer neural networks for the above reasons.
  4. The novelty of this paper is really limited. In particular, this paper just applies Burer-Monteiro to the existing two-layer convex neural networks, which would be too incremental to be published at ICLR.

Questions

  1. Could the authors elaborate on the architecture of the CNN used in Section 4.2? Specifically, exactly which “architecture of (Belilovsky et al., 2019)” is used?
  2. What is the rate of convergence for solving the proposed BM formulation using GD? Does GD exhibit sublinear convergence when it is close to the optimum, due to the ill-conditioning of $U$ and $V$ (using the MLP for example, $U$ and $V$ must be rank deficient at optimality due to $m \geq d+c$)?
  3. Could the authors provide experimental results demonstrating the scalability of the BM formulation? For example, the time and memory complexity (per iteration) vs. the number of neurons.
  4. Could the authors provide more insight on why it is important to find the global optimum in neural network training? Though the global optimum gives the best training error, it is not guaranteed to give the best generalization error. It is entirely possible that a local minimum would have better generalization error than the global minimum. In addition, finding a local minimum is a lot cheaper than solving the BM formulation proposed in this paper.
Comment

We would like to thank the reviewer for the feedback and comments. We hope that you would consider increasing your score if your concerns are adequately addressed. At a high level, we have added some additional details to the paper in Appendix A.1 to demonstrate why the Burer-Monteiro factorization is more computationally efficient than solving the original convex neural network, as well as some experimental results in Appendix C.3 that also show this result. Please see our detailed responses below.

Responses to Weaknesses

  1. Proposed BM formulation doesn't provide any advantages other than removing quasi-nuclear norm. It would be more efficient to solve the original convex problem.

We have provided additional explanations of the per-iteration complexity of the Burer-Monteiro factorization compared to the original convex problem.

While the Burer-Monteiro (BM) factorization adds additional variables (the original convex problem has $Pdc$ variables, while the BM-factorized optimization problem has $Pm(d+c)$ variables), the per-iteration complexity is greatly reduced in the BM case. In the ReLU activation case, we can compare applying the Frank-Wolfe algorithm presented in (Sahiner et al., 2021b) to the convex problem (3) with projected gradient descent applied to (19).

In (Sahiner et al., 2021b), Appendix A.4.2, it is shown that a single Frank-Wolfe iterate on (3) costs $\mathcal{O}(Pn^r)$, where $r := \mathrm{rank}(X)$. In contrast, a single iteration of projected gradient descent on (19) naively takes $\mathcal{O}(P(n^3 + n^2(d+c+dm) + dcm))$ time. It is clear that unless $m$ is chosen to be exponential in $n$ (which would never occur in practice), the per-iteration complexity of solving (3) is orders of magnitude larger than that of (19), due to the exponential dependence on $r$. Therefore, using the BM factorization greatly reduces the per-iteration complexity of solving the neural network optimization problem.

We have added an additional section in the Appendix (Appendix A.1) to walk through the details of this result.
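As a purely illustrative plug-in of these two bounds (the dimensions below are our own example values, not the paper's experimental settings), take $n = 10^3$, $r = 10$, $d = 10^2$, $c = 10$, and $m = 10^3$:

```latex
% Illustrative per-iteration cost comparison with example dimensions
% (n = 10^3, r = 10, d = 10^2, c = 10, m = 10^3 -- our own choices).
\mathcal{O}(P n^{r}) = P \cdot (10^{3})^{10} = P \cdot 10^{30}
\qquad \text{vs.} \qquad
\mathcal{O}\bigl(P(n^{3} + n^{2}(d + c + dm) + dcm)\bigr) \approx P \cdot 10^{11},
```

i.e., roughly nineteen orders of magnitude fewer operations per iteration for the factorized problem in this example.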

  2. The proposed BM formulation for CNN is still not practical as the memory requirement is too high.

The proposed BM formulation for two-layer CNNs is practical -- this is why we use it in Section 4.2. The purpose of layerwise training is to extend these results to deeper neural networks, rather than to overcome shortcomings in the computational complexity of the BM factorization for CNNs.

  3. It does not reduce the time complexity and the memory requirement for solving the convex neural network training problem.

See above and Appendices A.1 and C.3 for an explanation of how we reduce the time complexity and memory requirements for solving the convex neural network problem.

  4. Limited novelty

Aside from this being the first time the BM factorization has been applied to neural network optimization, we present novel results in a few ways:

  1. We derive a novel bound on the relative optimality of the stationary points of the BM factorization for neural networks. This is the first time stationary points of the BM factorization have been characterized in any work -- other works have focused solely on local minima, which are not guaranteed to be found with, e.g., SGD.
  2. Accordingly, we are the first to develop conditions to check if stationary points of BM factorizations have achieved their global optimum.
  3. We also use this BM theory to uncover insights about novel neural network architectures, e.g. that linear self-attention networks with sufficiently many attention heads have no spurious local minima.
Comment

Responses to Questions

  1. Elaborate on the architecture in Section 4.2.

See Appendix C.2 for a description of the experimental setup we use in this experiment, based on Belilovsky et al., 2019. Namely, we train two-layer CNNs (consisting of a convolutional layer with 256 filters + gated ReLU + average pooling to size $2 \times 2$ + a fully connected layer) for multiple stages. After training each stage, the fully connected layer is discarded, and the convolutional filters of the current stage are frozen and used to generate the input for the next stage.
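To make the stage structure concrete, here is a minimal PyTorch-style sketch of one such stage. The kernel size, padding, the fixed random gates used to implement the gated ReLU, and the class/argument names are our own illustrative choices, not the exact configuration described in Appendix C.2.

```python
# Illustrative sketch of one layerwise stage: conv (256 filters) + gated ReLU
# + 2x2 average pooling + fully connected head. Kernel size, padding, and the
# fixed random gates are assumptions made for this sketch only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedReLUConvStage(nn.Module):
    def __init__(self, in_channels, num_filters=256, num_classes=10, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_filters, kernel_size, padding=kernel_size // 2)
        # Fixed (untrained) gate filters: their signs define the gated-ReLU pattern.
        self.gate = nn.Conv2d(in_channels, num_filters, kernel_size,
                              padding=kernel_size // 2, bias=False)
        for p in self.gate.parameters():
            p.requires_grad_(False)
        self.fc = nn.Linear(num_filters * 2 * 2, num_classes)

    def features(self, x):
        # Gated ReLU: activation pattern from the fixed gates, values from the trained filters.
        return (self.gate(x) >= 0).float() * self.conv(x)

    def forward(self, x):
        h = F.adaptive_avg_pool2d(self.features(x), (2, 2))   # average pool to 2 x 2
        return self.fc(h.flatten(1))                          # head, discarded after the stage

# After a stage is trained, its convolutional filters are frozen and `features(x)`
# generates the input for the next stage.
```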

  2. What is the rate of convergence for solving the proposed BM formulation using GD?

For this, we can refer to literature [1] showing the linear convergence of preconditioned GD applied to BM-factorized problems with convex objectives, though, as pointed out by the reviewer, naive GD may indeed suffer from sublinear convergence if $m$ exceeds the dimension of the optimal solution, i.e., $U_j$ and $V_j$ are rank deficient. We have noted this in Appendix A.1.

[1] Gavin Zhang, Salar Fattahi, and Richard Y. Zhang. Preconditioned gradient descent for overparameterized nonconvex Burer-Monteiro factorization with global optimality certification. JMLR, 2023.

  3. Could the authors provide experimental results demonstrating the scalability of the BM formulation? For example, the time and memory complexity (per iteration) vs. the number of neurons.

We have provided an additional experiment of this nature in Appendix C.3. We demonstrate that while the convex neural network is much less efficient than the BM factorization when training a single two-layer CNN, this is only exacerbated in the layerwise learning setting.

  4. Could the authors provide more insight on why it is important to find the global optimum in neural network training?

While we agree that better objective function performance does not necessitate better generalization in general, practically it has been shown that convex formulations of NNs that find global optima also generalize better than SGD applied to the non-convex formulations (see e.g. Ergen et al., 2021; Ergen & Pilanci, 2020). This empirical trend suggests that solving NN problems to their global optimum will typically improve generalization performance.

Furthermore, we argue that, as a broader principle, a much better paradigm than solving poorly understood non-convex problems heuristically and hoping that stationary points generalize is to have well-understood problems with convergence guarantees. If it turns out that such problems do not generalize well, there are many well-understood methods to address this issue, such as adding additional convex regularizers (e.g., $\ell_1$, ...) that induce good generalization properties. This is a much more principled approach to machine learning that can provide robustness and guarantees.

Comment

Dear Reviewer 4XdN,

We believe that we have addressed your concerns in our responses. Since the deadline is approaching, we would like to hear your feedback so that we can respond to that before the discussion period ends. Please feel free to raise questions if you have other concerns.

Best regards,

Authors

Comment

I want to thank the authors for providing the additional experimental results. My main concern was whether the time and memory complexity of the proposed BM formulation is practical, and I think the authors' response and the additional experimental results have addressed my concerns. It is now clear that solving the proposed BM formulation with naive PGD has a lower per-iteration cost than solving its convex formulation using Frank-Wolfe; however, naive PGD might suffer from sublinear convergence. It would be interesting to see if it is possible to extend [1], which studies preconditioned GD for unconstrained BM optimization problems, to derive a preconditioned PGD for solving the constrained BM optimization problems proposed in this paper (for example, equations (19) and (24)).

I think with the additional results, the paper now has decent novelty as it is the first paper to propose the BM formulation for convex NN training that guarantees global optimality. Most importantly, the proposed BM formulation is indeed practical and does mitigate the scalability issue faced in convex NN training. I have adjusted my score.

[1] Gavin Zhang, Salar Fattahi, and Richard Y. Zhang. Preconditioned gradient descent for overparameterized nonconvex Burer-Monteiro factorization with global optimality certification. JMLR, 2023.

Official Review (Rating: 8)

The paper adds to the growing literature on viewing two-layer nonlinear neural networks (MLPs, CNNs, self-attention) as convex programs (sometimes with large parameters) in order to study stationary points of these programs via tools from convex and SDP optimization. In particular, using the Burer-Monteiro (BM) factorization, the authors give a different parameterization of these convex programs and demonstrate empirically that the new parameterization allows for faster computation. They also provide theoretical guarantees that local minima of BM programs are global minima of the original program, and bound the relative optimality gap of BM programs.

Strengths

I believe the paper is highly interesting, novelly applying existing techniques to a new problem and giving ample theoretical and empirical justification for their adoption.

  1. BM theory is shown to be flexible for many different architectures, allowing the authors to show no spurious local minima for the linear self-attention layer. Although I am not an expert in the theory of transformers, I believe this claim is significant if correct.
  2. BM factorization provides computable objectives for convex two-layer ReLU networks and is shown empirically to scale on widely used, simple datasets.
  3. If a local minimum is obtained, low-rank-ness is sufficient for global optimality (the classical form of this certificate is recalled below).
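For context, the classical form of this low-rank certificate from the Burer-Monteiro literature reads as follows (stated for a smooth convex loss with plain nuclear-norm regularization; this is the standard result, not the paper's exact theorem):

```latex
% Classical rank-deficiency certificate (standard form, not the paper's theorem):
% if (U, V) is a local minimizer of the factorized problem
\min_{U \in \mathbb{R}^{d \times m},\, V \in \mathbb{R}^{c \times m}}
    f(U V^{\top}) + \tfrac{\beta}{2}\bigl(\|U\|_{F}^{2} + \|V\|_{F}^{2}\bigr)
% with rank-deficient factors (rank < m), then Z = U V^{\top} is a global
% minimizer of the convex problem
\min_{Z \in \mathbb{R}^{d \times c}} f(Z) + \beta \|Z\|_{*}.
```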

Weaknesses

  1. The proposed BM factorization for nonlinear MLPs is nonconvex, and it is not certain how to optimize it to obtain local minima (although the authors do provide methods to check optimality and to bound the optimality gap to the original formulation).
  2. It appears that BM factorization corresponds to the "conventional wisdom" in practical machine learning of "adding another layer" to the model and enjoying a nicer optimization landscape. The resulting program, then, is nonconvex (as also pointed out by the authors). In light of this, although the authors provide a small paragraph on page 5 comparing the BM MLP to the original nonconvex problem, it seems that the formulation in the paper is much closer to this original nonconvex problem (and thus needs further comparison) than to the convex formulation in which the framework is developed.

Questions

  1. How feasible is it to generalize this framework to more than one hidden layer?
  2. It is hard to keep track of the different norm notations at times, so the authors may consider defining all of them at once, at least in the appendix. For instance, in Lemma 3.1, when R and C are introduced, it is not clear whether those are just L^p norms with p = R and C, or something else. It is also not clear whether ||.||_F corresponds to the Frobenius norm or to some special norm that comes from the objective function F.
  3. Is there a feasible framework to analyze saddle points of the BM formulation, and not just local minima?
Comment

We would like to thank the reviewer for the feedback and comments. Please see our responses below, which provide additional points of clarification on the reviewer's open questions.

  1. How feasible is it to generalize this framework to more than 1 hidden layer?

Unfortunately, there have not been many satisfactory convex formulations of deeper neural networks that are amenable to this kind of analysis. Hence our emphasis on layerwise training in Section 4.2.

  1. Hard to keep track of all the norms at once.

We have updated the "notations" section in Section 2 to make these clearer.

  1. Is there a feasible framework to analyze saddle points of the BM formulation and not just local minima?

Theorems 3.3 and 3.4 do exactly this -- they show results for the stationary points (which include saddle points) of the BM formulation, rather than just for local minima, which have been the focus of much of the existing BM literature.

Comment

I'd like to thank the authors for their comments. I am keeping my score.

AC Meta-Review

This paper proposes a novel formulation of convex neural networks using the BM factorization. Convex neural networks are convex optimization problems that correspond to the training of two-layer NNs. While the new optimization problem obtained is non-convex, the authors manage to prove that it has no spurious local minima.

The authors developed a sound theory and applied it to 'simple' real-world datasets (Fashion-MNIST and CIFAR-10).

I recommend this work for acceptance.

Why not a higher score

The results are sound and seem to work in practice. However, I do not think they are groundbreaking (they are restricted to two layers and do not scale extremely well), so I do not recommend this paper for an oral.

Why not a lower score

The paper is solid. There seems to be a consensus among reviewers. The only reviewer who gave a lower grade (5) did not engage in the discussion. Their arguments were not the strongest:

  • The work is limited to two layers. It would be a breakthrough in the field to extend this beyond two layers.
  • Straightforward extension of prior work. I do not think this is the case, as acknowledged by the other reviewers.
Final Decision

Accept (poster)