PaperHub
Rating: 4.8 / 10
Poster · 3 reviewers (ratings 3, 2, 3; min 2, max 3, std dev 0.5)
ICML 2025

Active Learning of Deep Neural Networks via Gradient-Free Cutting Planes

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
convex optimization, cutting plane method, active learning

Reviews and Discussion

Review (Rating: 3)

This paper introduces a novel theoretical framework that extends cutting-plane optimization methods to active learning for deep neural networks. The authors bridge two previously separate domains: deep neural network training and cutting-plane optimization techniques. The primary contribution is showing that cutting-plane algorithms, traditionally limited to linear models, can effectively train neural networks despite their non-convexity and nonlinear decision boundaries. A key theoretical result is the geometric contraction rate guarantee of the feasible set, providing strong convergence properties. Through experiments on both synthetic data and standard benchmarks, the authors demonstrate that their approach achieves competitive performance against established active learning baselines while maintaining theoretical guarantees that most current methods lack. This represents an important step toward principled active learning methods for deep neural networks with provable properties.
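For readers new to cutting-plane active learning, here is a minimal, hypothetical sketch of the generic loop the summary describes, written for a plain linear classifier (the classical setting the paper extends to ReLU networks). The Chebyshev-center surrogate, the query rule, and all names are illustrative assumptions, not the authors' Algorithms 1–2.

```python
# Hypothetical sketch of generic cutting-plane active learning for a LINEAR
# classifier; illustrative only, not the paper's algorithm.
import numpy as np
from scipy.optimize import linprog


def chebyshev_center(A, b):
    """Center of the largest ball inside {theta : A @ theta <= b} -- a cheap
    stand-in for the analytic/volumetric centers used by cutting-plane methods."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    c = np.zeros(A.shape[1] + 1)
    c[-1] = -1.0  # variables are [theta, r]; maximize the radius r
    res = linprog(c, A_ub=np.hstack([A, norms]), b_ub=b,
                  bounds=[(None, None)] * A.shape[1] + [(0, None)])
    return res.x[:-1]


def cutting_plane_active_learning(X_pool, oracle, n_queries=20):
    n, d = X_pool.shape
    # initial feasible parameter set: the box [-1, 1]^d
    A = np.vstack([np.eye(d), -np.eye(d)])
    b = np.ones(2 * d)
    for _ in range(n_queries):
        theta = chebyshev_center(A, b)
        scores = X_pool @ theta
        i = int(np.argmax(np.abs(scores)))   # query the most "confident" point
        y = oracle(i)                        # ask the labeling oracle
        if y * scores[i] < 0:                # misclassified -> add a cut
            A = np.vstack([A, -y * X_pool[i]])
            b = np.append(b, 0.0)            # new constraint: y * <x_i, theta> >= 0
    return chebyshev_center(A, b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_true = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(200, 3))
    theta_hat = cutting_plane_active_learning(
        X, oracle=lambda i: float(np.sign(X[i] @ theta_true)))
    print("recovered direction:", theta_hat / np.linalg.norm(theta_hat))
```

Each misclassified query adds one halfspace (a cut) to the feasible parameter set, which is exactly the mechanism whose volume the paper's contraction guarantee controls.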

Questions for the Authors

  1. What factors limit the scale of data selection with your method, and what is the maximum scale achievable? Understanding these constraints would help clarify the method's practical applicability.

  2. What is the computational cost when applying your method to deeper networks, and approximately how many neural network layers can your approach effectively handle while maintaining its theoretical guarantees? This would help readers understand the scalability of your approach.

  3. Current active learning approaches often leverage pretrained models (e.g., ActiveFT). Could your method be extended to work with pretrained models, perhaps by training only a linear classifier layer or a shallow two-layer network with ReLU? Some simple experimental validation would be valuable.

  4. If samples selected by your method were used to train deeper neural networks, would they be more effective than samples selected by existing methods? This comparison would help establish broader applicability beyond the theoretical context.

I should note that while I'm not specialized in theoretical machine learning research, the paper appears well-executed and makes a valuable contribution. I have a positive impression overall, though I believe addressing the real-world applications and scalability questions would further strengthen the work.

Claims and Evidence

The claims in this paper are well-supported through rigorous theoretical analysis and empirical demonstrations. The authors clearly establish their theoretical framework and provide thorough proofs for their convergence guarantees, which is particularly valuable in the typically heuristic-driven field of deep active learning.

Methods and Evaluation Criteria

The proposed method represents a creative and theoretically sound approach to the active learning problem. By adapting classical cutting-plane methods to the neural network setting, the authors provide a fresh perspective that addresses limitations of previous approaches.

The evaluation methodology is appropriate and well-executed, using standard benchmark datasets and metrics. The authors effectively demonstrate their method's performance through clear visualizations and comparisons against relevant baselines. Their approach of evaluating on both synthetic data (to verify theoretical properties) and standard benchmarks (to demonstrate practical utility) provides a comprehensive assessment of the method's capabilities.

Theoretical Claims

No problems.

Experimental Design and Analyses

The experimental design is sound and effectively demonstrates the method's practical utility. The visualizations are particularly strong, providing clear intuition about how the algorithm operates. The experiments validate the theoretical properties while showing competitive performance on standard tasks.

While the baseline comparisons are somewhat limited and focused on simpler approaches, they adequately demonstrate the method's effectiveness relative to established techniques. Given the theoretical focus of the paper, the experimental validation strikes an appropriate balance between demonstrating theoretical properties and practical utility.

Supplementary Material

I reviewed portions of the supplementary material, particularly focusing on the extended proofs. The proofs appear sound and provide the necessary mathematical details to support the claims in the main paper.

Relation to Broader Literature

This work makes a valuable contribution by bringing together two previously separate research areas: cutting-plane optimization and deep active learning. It builds upon classical optimization techniques while addressing modern deep learning challenges.

The approach offers promising avenues for extension to other sample selection problems beyond active learning, potentially impacting other areas where principled sample selection is critical. The theoretical guarantees provided by this approach distinguish it from many existing methods in the field.

Essential References Not Discussed

None

Other Strengths and Weaknesses

Strengths:

  • The paper makes a novel connection between classical optimization techniques and modern deep learning
  • The theoretical analysis is rigorous and provides valuable insights into the method's performance
  • The visualizations are clear and intuitive, effectively communicating the algorithm's operation

Weaknesses:

  • The experimental evaluation, while sufficient, could include comparisons to more recent active learning methods
  • The practical implementation details for large-scale applications could be clarified further
  • The limitations regarding very large networks and datasets are not fully addressed

Other Comments or Suggestions

See Questions.

Author Response

The experimental evaluation, while sufficient, could include comparisons to more recent active learning methods

Thank you for the suggestion. We've compared against 8–10 standard baselines from scikit-activeml and DeepAL. Since our method builds on a cutting-plane training scheme, we adapted the baselines—sometimes with custom implementations—to ensure fair comparison (see Appendix H.3 for details). Because our method introduces a fundamentally novel setup unlike existing AL approaches, an exhaustive comparison is challenging given the vast literature. However, we will consider adding more recent deep AL methods to the baseline set for the final version.

The practical implementation details for large-scale applications could be clarified further

We've detailed our implementation of the sentiment classification task in Appendix H.1. We are happy to provide more details if anything is unclear.

The limitations regarding very large networks and datasets are not fully addressed

Our proposed algorithm, still in its early stages, does face scalability challenges with the largest NN models. However, it evolves alongside advances in cutting-plane methods and LP solvers—both active research areas. Recent progress, such as improved cutting-plane techniques [8] and GPU-accelerated LP solvers [9, 10], can be directly applied to ReLU NN training based on the equivalence established in this work.

More importantly, our contribution goes beyond empirical gains. The theoretical insights enabled by this LP-NN connection are significant, especially given the much deeper understanding of LP systems compared to neural networks.

[8] An Asynchronous Proximal Bundle Method: https://link.springer.com/article/10.1007/s10107-024-02088-x; [9] CuClarabel: GPU Acceleration for a Conic Optimization Solver: https://arxiv.org/abs/2412.19027; [10] MPAX: Mathematical Programming in JAX: https://arxiv.org/abs/2412.09734.

What factors limit the scale of data selection with your method, and what is the maximum scale achievable?

The main limitation in data selection is that more training data leads to more constraints in the LP formulation, increasing the problem size. While we explore pruning schemes to discard redundant constraints (Appendix F.4), scalability remains a challenge. We haven't tested the maximum scale—current experiments use only tens of points, as larger selections require longer solver runtimes. That said, the method remains executable, albeit slower with more data.
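As a hedged illustration of why the LP grows with the selected data (our notation, not necessarily the paper's exact formulation): in the classification setting, each queried pair $(x_i, y_i)$ contributes one linear feasibility constraint, i.e. one cut, so after $t$ queries the feasible parameter set is

```latex
% Illustrative only: a generic cutting-plane feasible set for classification.
% \theta collects the (reparametrized) network weights and \phi(\cdot) is the
% map -- e.g. induced by fixed ReLU activation patterns -- that makes the
% prediction linear in \theta.
\Theta_t \;=\; \bigl\{\, \theta \in \Theta_0 \;:\;
    y_i \,\langle \phi(x_i), \theta \rangle \,\ge\, 0, \quad i = 1, \dots, t \,\bigr\}
```

so the number of LP constraints grows linearly with the number of selected samples, which is the scaling bottleneck described above.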

What is the computational cost when applying your method to deeper networks

The main computational bottleneck for deeper NNs still comes from solving a large LP, since deeper networks introduce more variables into the LP. Ideally, with a sufficiently efficient LP solver, there is no inherent limit on depth for our method; as shown in Theorem 4.2, our theory extends to arbitrarily deep NNs. Fortunately, we have seen recent efforts in GPU-accelerated LP solvers, such as [9] and [10] mentioned above, which could empower our method for large-scale applications.

Could your method be extended to work with pretrained models?

We have indeed run experiments with pretrained LLMs, exactly along the lines the reviewer suggests. Figure 4 shows the result of training a two-layer ReLU classifier on top of Phi-2's embeddings for a sentiment classification task; Phi-2 is a 3B-parameter pretrained LLM. Our results show that combining Phi-2's embeddings with our active learning scheme yields higher prediction accuracy and better query efficiency than the other baselines.
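To make the setup concrete, below is a hedged sketch of the "two-layer ReLU head on frozen LLM embeddings" architecture described here. The embeddings are stood in by random tensors; extracting real Phi-2 embeddings and the cutting-plane training itself are not shown, and the hidden width is an arbitrary choice of ours.

```python
# Hypothetical sketch of a two-layer ReLU classification head over frozen
# 2560-d Phi-2 embeddings; illustrative only (random stand-in embeddings).
import torch
import torch.nn as nn

EMB_DIM = 2560   # Phi-2 embedding size mentioned in the response
HIDDEN = 128     # illustrative width, not the paper's setting

head = nn.Sequential(
    nn.Linear(EMB_DIM, HIDDEN),
    nn.ReLU(),
    nn.Linear(HIDDEN, 1),   # binary sentiment logit
)

def predict(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (batch, EMB_DIM) precomputed, frozen upstream features."""
    return head(embeddings).squeeze(-1)

fake_emb = torch.randn(4, EMB_DIM)   # stand-in for real Phi-2 embeddings
print(predict(fake_emb).shape)       # torch.Size([4])
```

Because only this small head is trained, the approach is agnostic to the size of the upstream language model.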

If samples selected by your method were used to train deeper neural networks, would they be more effective than samples selected by existing methods?

Thank you for the question. We base our theoretical results on cutting-plane training as it offers a cleaner, more analyzable framework. In contrast, gradient-based methods rely heavily on heuristics, and understanding their dynamics (e.g., SGD with varying batch sizes or step sizes) remains limited, making rigorous analysis difficult.

In this revision, we strengthen our theory by showing that the learned model converges not only volumetrically in parameter space, but also in norm to the optimal decision boundary (see proof here). This supports the quality of our selected samples for downstream training. That said, empirically, we tested our selected samples on a deeper network for a quadratic regression task and found our method remained competitive with standard deep active learning baselines, even if the performance gap was smaller due to task simplicity (link). We plan to add further experiments on deeper models and more complex tasks in the final version.

Reviewer Comment

Thank you for your reply. I don’t have any further questions, and I’ll keep my rating. Wishing you the best of luck!

Review (Rating: 2)

This paper proposes a novel method for training ReLU deep neural networks and selecting data queries within an active learning framework. Extending previous work on cutting plane algorithms to multi-layer ReLU networks, the authors formulate network training as a linear programming problem, decomposing the task of data fitting into a series of linear programs that can be solved efficiently. Additionally, the paper introduces a new active learning strategy that prioritizes querying the most confident samples to effectively identify misclassified instances and substantially reduce the parameter space. The proposed framework is validated through experiments on multiple synthetic datasets, demonstrating its efficacy.

Questions for the Authors

See "Other Comments or Suggestions".

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes. The paper introduces a novel approach to neural network training from cutting-plane algorithms.

Theoretical Claims

Theorem 6.3, the main theoretical result in this work, is quite similar to the convergence analysis in (Louche & Ralaivola, 2015). It would be appropriate to mention this prior work in the theory section.

Experimental Design and Analyses

Much of today's Deep Active Learning research focuses on large-scale datasets using batch-mode querying, aspects not addressed by the experiments in this paper. Instead, the authors evaluate their method primarily on simple tasks like a simple 2D spiral binary classification and a regression problem. Consequently, this empirical evaluation does not convincingly demonstrate how the proposed method addresses the scalability challenges typically targeted by active learning.

Supplementary Material

I checked the proof of the theoretical results and the experimental setups.

Relation to Broader Literature

The problem of applying convex optimization methods for neural network training is relevant, and the theoretical contributions appear sound. The proposed method effectively addresses common challenges inherent in gradient-based methods, such as hyperparameter sensitivity and slow convergence, while achieving competitive performance.

The experimental results demonstrate promising outcomes in both classification and regression tasks on smaller-scale datasets, suggesting potential suitability for specific practical scenarios.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

  1. I have a concern that the method may not work well with larger-scale neural networks. The size of the activation pattern set $D$ can be exponential in $n$.

  2. The proposed method has significant limitations in terms of applicability, as it exclusively supports neural networks composed of linear layers with ReLU activation functions. This constraint renders it incompatible with modern architectures, including transformers, convolutional neural networks, and models utilizing alternative activation functions. Consequently, this severely restricts the practical impact and broader relevance of the method within the current deep learning landscape.

Other Comments or Suggestions

  1. The paper does not address the volumetric stopping criterion in detail, particularly regarding its dependence on the dimensionality of the hypothesis space. It remains unclear how a user would practically specify or adjust this criterion.
  2. One potential advantage of the proposed cutting-plane method could be reduced computational cost compared to gradient-based methods. However, this aspect hasn't been sufficiently discussed or analyzed. Providing empirical runtime comparisons between the cutting-plane approach and gradient-based algorithms on the tested datasets would help to quantitatively highlight any computational benefits. Such an analysis would significantly strengthen the evaluation by clarifying whether the proposed approach offers practical efficiency improvements.

Author Response

Theorem 6.3, the main theoretical result in this work, is quite similar to the convergence analysis in (Louche & Ralaivola, 2015)

Dear reviewer, we first cite L&R’s work in Section 2 under “Cutting-Plane-Based Active Learning with Linear Models.” Our contribution goes well beyond theirs, which applies only to linear models. Theorem 6.3 resembles classic cutting-plane convergence results—standard in the literature (e.g., see pp. 9–10 in these notes)—and is not specific to L&R. Our key novelty is extending this analysis to the nonlinear two-layer network $f^{\text{two-layer}}$, which had not been obtained before.

Empirical evaluation does not convincingly address scalability

Our goal is to show the feasibility of training ReLU NNs via LPs (see abstract, Sections 1 & 3), enabling theoretical insights not possible with traditional methods. While not yet faster than gradient-based training, our method offers stronger query efficiency, consistently outperforming baselines with fewer queries (Figures 2–4). This makes it well-suited to query-limited settings.

Scalability challenges stem from:

  • Constraint growth: We mitigate this via subsampling and iterative pruning (Appendix F.4).
  • Solver overhead: Recent GPU-based LP solvers like CuClarabel [3] and MPAX [4] offer promising future improvements.

[3] CuClarabel: GPU Acceleration for a Conic Optimization Solver. https://arxiv.org/abs/2412.19027
[4] MPAX: Mathematical Programming in JAX. https://arxiv.org/abs/2412.09734

Concern over exponential growth in activation patterns

Thanks for raising this. The number of ReLU activation patterns does not scale exponentially in $n$; rather, it scales with the rank of the data (see Section 3 in [5]). We further reduce complexity using pruning strategies (Appendix F.4). Recent work [6] also proposes geometric-algebra-inspired sampling, although it doesn’t yet connect to LP-based training.

[5] Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-layer Networks: https://arxiv.org/pdf/2002.10553; [6] Randomized Geometric Algebra Methods for Convex Neural Networks: https://arxiv.org/pdf/2406.02806.
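As a hedged sketch of the kind of activation-pattern subsampling referenced above (a standard trick in the convex-NN literature; the function and variable names here are ours, not the paper's):

```python
# Hypothetical sketch: sample ReLU activation patterns by drawing random
# directions u and recording the sign pattern of X @ u. Illustrative only,
# not the paper's exact Appendix F.4 scheme.
import numpy as np

def sample_activation_patterns(X: np.ndarray, n_samples: int = 500, seed: int = 0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    patterns = set()
    for _ in range(n_samples):
        u = rng.normal(size=d)
        patterns.add(tuple((X @ u >= 0).astype(np.int8)))  # one pattern D_j
    # the number of *distinct* patterns is governed by the rank of X rather
    # than by n alone, which is why a few hundred samples often suffice
    return [np.diag(np.array(p)) for p in patterns]

X = np.random.default_rng(1).normal(size=(50, 5))
D_list = sample_activation_patterns(X, n_samples=200)
print(len(D_list), "distinct activation patterns sampled")
```

Each sampled pattern fixes which ReLUs are active, and only constraints for the sampled patterns enter the LP, which keeps the problem size manageable.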

Applicability limited to ReLU networks

We've addressed a similar issue in our response to Point 2, but we’d like to re-emphasize the broader potential of connecting NN training to classical LP solving—both for advancing theory and enabling new algorithms. Our current work offers a preliminary demonstration of this:

  • Theoretical advancement: We provide a convergence result that was previously unattainable due to the nonconvexity of NN training. Under our framework, the prediction function converges in norm to the optimal decision function (see our added proof here [anonymous link]). In contrast, LP systems are well understood thanks to decades of study. Framing ReLU training as LP solving opens a promising new lens for analyzing deep NN properties.
  • Algorithmic novelty: We show that cutting-plane methods can be applied to ReLU NN training—an approach not explored before. While we use a basic variant, there’s a rich and growing body of work on advanced cutting-plane methods (e.g., [7], published in Mathematical Programming, Jan 2025), which could be leveraged in future extensions.
  • Solver development: GPU-supported LP solvers have only recently emerged, and represent an exciting and active area of ongoing research.

[7] An Asynchronous Proximal Bundle Method: https://link.springer.com/article/10.1007/s10107-024-02088-x

The paper does not address the volumetric stopping criterion in detail

Algorithm 1 gives the general cutting-plane training workflow. On its own, it lacks convergence guarantees, as random training samples may offer negligible volume reduction. The "volumetric stopping criterion" applies specifically to our active learning scheme, where shrinkage can be precisely measured. For general use, we rely on standard stopping rules like max iterations, data budget, or validation error.
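One way to read such a criterion, in our own notation (an illustrative reading, not the paper's exact definition): with $\Theta_t$ the feasible parameter set after $t$ queries and $\delta$ a user-chosen tolerance,

```latex
% Stop once the cuts have shrunk the feasible set to a prescribed fraction
% \delta of its initial volume.
\text{stop at the first } t \ \text{such that} \quad
\frac{\mathrm{vol}(\Theta_t)}{\mathrm{vol}(\Theta_0)} \;\le\; \delta .
```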

Potential advantage over gradient-based methods

Our method, still in early stages, is slower than gradient-based training, so we focus on query efficiency—where we show stronger convergence guarantees and consistently outperform baselines with fewer queries (see Figures 2–4). This makes our approach ideal under a query budget. On runtime and scalability, our method evolves alongside advances in cutting-plane methods and LP solvers—both active research areas. As new breakthroughs emerge, they can be directly integrated to improve our training algorithm. Moreover, theoretical progress in these fields—arguably more likely than in learning theory—would naturally carry over to ReLU NN training through our established equivalence.

Review (Rating: 3)

This paper provides very interesting results for training ReLU neural networks. The authors show that training a binary classifier with ReLU neural networks is essentially solving a linear program (LP), and therefore, in the context of active learning, adding a new data point to the training set is equivalent to adding new linear constraints to the proposed LP. The authors also show that they can add linear constraints such that the volume of the feasible space decreases by a factor of $1-\frac{1}{e}$ with each query, and they compare test accuracy on several datasets against linear models and ReLU networks trained by SGD.
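A short worked consequence of that contraction rate (a standard geometric-series calculation added here for illustration, not an extra claim from the paper): if each query shrinks the feasible set by a factor of at most $1-\frac{1}{e}$, then

```latex
\mathrm{vol}(\Theta_T) \;\le\; \Bigl(1 - \tfrac{1}{e}\Bigr)^{T} \mathrm{vol}(\Theta_0),
\qquad\text{so}\qquad
\frac{\mathrm{vol}(\Theta_T)}{\mathrm{vol}(\Theta_0)} \;\le\; \varepsilon
\ \text{ whenever }\
T \;\ge\; \frac{\ln(1/\varepsilon)}{\ln\!\bigl(\tfrac{1}{1-1/e}\bigr)} \;\approx\; 2.2\,\ln(1/\varepsilon).
```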

Questions for the Authors

  1. I found that Algorithm 2 in the paper could be confusing. Could the authors clarify what those variables mean?
  2. Could the authors explain how to interpret the convergence results? For example, if the true decision boundary is given by $f(x)$, can the LP-based NN training models approximate such a function?
  3. This is connected to the problem above. If the results are unknown for a general $f(x)$, is it possible to show that the decision boundary could be learned for certain classes of functions?
  4. Could the authors comment on the computational aspects of their proposed method? For example, on the runtime, the maximum number of neurons allowed, and the average number of queries needed?

Claims and Evidence

I am good with the theoretical claims. However, given the numerical tests conducted, I am not able to draw a solid conclusion that this method will surpass current SGD-based training of neural networks.

Methods and Evaluation Criteria

I think the benchmark datasets are wide enough, from my perspective. However, I am not sure about an evaluation criterion that relies solely on test accuracy versus SGD. Since the number of variables grows quickly as the number of neurons increases, it seems like the runtime for the LP would increase by a lot. Another issue is that momentum-based methods seem to be more popular, which might provide better performance.

Theoretical Claims

No, I did not check the correctness of the proofs.

Experimental Design and Analyses

Please see my comments in the section "Methods and Evaluation Criteria".

Supplementary Material

Yes, but I did not run the code on my computer.

Relation to Broader Literature

I believe that this paper provides a very interesting perspective that could potentially connect the field of (mixed-integer) linear optimization with the well-developed field of neural networks. For a long time, people have believed that only continuous optimization techniques are suitable for NN training, but it may well be the case that MILP has a role to play as well.

Essential References Not Discussed

I did not find any essential references not discussed.

Other Strengths and Weaknesses

Strengths:

  1. The authors developed the connection of modeling the training of NN with LPs, which is a novel topic in the field of Machine Learning.
  2. The authors further provide theoretical results on the convergence of their proposed LP-based method for binary classification in active learning.
  3. The paper is written in a very clear way.

Weaknesses:

  1. Although convergence results are given, I am not sure how to interpret them: the volume decreasing exponentially might not imply that the algorithm captures the correct hidden function, and therefore might not imply increased test accuracy.
  2. The authors did not discuss the computational efficiency of their proposed method, which might be a big problem when it comes to large-scale NN models.

Other Comments or Suggestions

I think the paper would be greatly improved if the authors could show results regarding learning the correct function, which should not be too hard given the rich literature regarding MLPs. For example, the well-known Universal approximation theorem.

Moreover, I think an interesting question is whether this method could reduce the number of queries needed for active learning - it would be very exciting to learn that this is true.

Author Response

I am not able to draw solid conclusion that this method will surpass current SGD-based training of neural networks.

Thank you for raising this point. We do not claim that our method currently outperforms gradient-based training for deep NNs, especially as it remains in an early stage compared to mature gradient-based approaches.

Our key contribution is linking ReLU NNs to LPs, enabling convergence proofs previously thought infeasible due to non-convexity. Moreover, emerging GPU-based LP solvers like CuClarabel [1] and MPAX [2] offer promising avenues for scaling our approach.

[1] CuClarabel: GPU Acceleration for a Conic Optimization Solver: https://arxiv.org/abs/2412.19027; [2] MPAX: Mathematical Programming in JAX: https://arxiv.org/abs/2412.09734.

However, I am not sure about the evaluation criteria that solely relies on test accuracy versus SGD.

As noted, our primary goal is to demonstrate the feasibility of framing deep ReLU NN training as LP solving, which opens the door to theoretical insights previously out of reach—e.g., our convergence result—by leveraging decades of progress in LP theory. That said, we also address empirical concerns.

The main computational burden stems from (1) the growing number of LP constraints, and (2) the cost of solving each LP:

  • To manage constraint growth, we analyze LP structure evolution and propose an activation pattern subsampling and iterative filtering scheme (Appendix F.4) to prune redundant constraints.
  • For solving LPs, recent GPU-enabled solvers like CuClarabel and MPAX offer promising directions to improve runtime and scalability.

Another issue is that momentum-based methods seem to be more popular, which might provide better performance.

Thank you for bringing up this point. We do not include different momentum-based methods in the comparison because our main experiments evaluate our active learning scheme, not our training scheme. In Figures 2 and 3 we mainly compare against different active learning (i.e., query selection) baselines, not optimization algorithms. The only place we compare our training scheme to a gradient-based method is Figure 4 (the rightmost panel), where we take SGD as a representative.

The key point is that we acknowledge our method, in its current stage, cannot beat SGD in deep NN training, but our theoretically supported query acquisition strategy is already superior in terms of query efficiency. That is why most of our experiments focus on active learning. We are nonetheless optimistic that, with the development of more advanced LP solvers, our method will become increasingly practical.

The authors did not discuss the computational efficiency of their proposed method

We've discussed this in our reply to point 2.

Although convergence results are given, I am not sure how to understand the results.

Thank you for raising this point. In fact, a consequence of Theorem 6.3 is that the prediction function converges in norm to the optimal decision function. Please see our added proof here (anonymous link).

I think the paper would be greatly improved if the authors could show results regarding learning the correct function.

Thanks for the insightful suggestion! Yes, we are able to show this; see above.

I think an interesting question is that whether this method could reduce the number of queries needed for active learning.

Yes. Beyond the connection between training deep ReLU NNs and solving LPs, another contribution is our convergence result for query efficiency. Specifically, we theoretically prove that our proposed active learning strategy exponentially shrinks the parameter search space; to our knowledge, this is the first theoretical guarantee of query efficiency among active learning methods. Empirically, Figures 2, 3, and 4 all demonstrate that we achieve better performance with fewer queried points.

Could the authors clarify Algorithm 2?

We apologize for the confusion. Due to the word limit, we include a detailed description here.

Could the authors comment on the computational aspects of their proposed method?

Our cutting-plane training scheme theoretically extends to arbitrarily deep neural networks with no upper limit on model size (Theorem 4.2). The active learning method exponentially shrinks the parameter search space, enabling analytical bounds on query complexity (Theorem 6.3). The main remaining concern is runtime, as noted in our responses to points 2 and 3, though promising progress is underway to address this.

Reviewer Comment

I would love to thank the authors for their detailed explanation. I have no concerns understanding Thm 6.3 given the additional details, but I am still interested in seeing how large the neural networks trained with the current method can be.

Sorry for the last-minute question, but could the authors show me some numbers in terms of the scale of the NN trained? I am happy to increase my score if the scale is moderately large for tasks like digit recognition.

Author Comment

Thank you for your follow-up and for the thoughtful engagement with our work. We’re very happy to clarify the scalability of our current method.

In our current experiments in the submitted paper, we train both two-layer and three-layer ReLU networks using our cutting-plane framework:

  • For the two-layer network, we use 623 neurons (Figure 2, spiral task).
  • For the three-layer model, we used 57 and 34 neurons in the two hidden layers, respectively, while still achieving competitive performance on the same task.

While our method is not currently suited to large-scale pretraining, due to the scalability challenges of solving an LP system that we have noted (though we expect continued progress as research in LP optimization advances), our framework is, crucially, flexible and scalable when used with pretrained large models. For instance, in our IMDB sentiment analysis task (Figure 4), we use 2560-dimensional embeddings from the 3B-parameter Phi-2 LLM and train a two-layer ReLU model with our method using 500 sampled activation patterns. Since we operate only on the embeddings, our approach is agnostic to the size of the upstream model and can scale seamlessly with larger backbones (e.g., GPT, LLaMA).

As noted earlier, our current scalability is limited by two factors (though remedies for both are already present):

  • (1) The number of constraints (due to activation pattern enumeration): Our paper already addresses this via activation pattern subsampling and iterative filtering (Appendix F.4), which greatly reduce the number of constraints in practice while preserving model expressivity. This enables training of moderately sized NNs without exhaustively computing all patterns.
  • (2) Use of a CPU-based general-purpose solver (CVXPY) — which we plan to replace with faster, structure-aware solvers. In particular, the method of Mishkin et al. (2022) (arXiv:2202.01331) offers accelerated convex optimization for two-layer ReLU networks and scales to image tasks like MNIST and CIFAR-10. These solvers provide an easy drop-in replacement for CVXPY in our framework and can significantly accelerate training and enable larger models. In parallel, GPU-based LP solvers such as CuClarabel and MPAX offer complementary performance gains by leveraging hardware acceleration.

In short, our current method is not yet compatible with large-scale training, but the framework is modular and extensible, and we believe the recent developments in solver efficiency and activation filtering strongly support the path toward broader applicability.

Please don't hesitate to let us know if you have further questions / need clarifications!

Final Decision

This paper proposes a novel active learning approach for ReLU neural networks that leverages cutting plane methods and operates in a gradient-free manner. The method is enabled by reformulating ReLU training as a linear program, which offers a new perspective on ReLU active learning. The authors also develop convergence guarantees for the proposed active learning approach. Most reviewers acknowledged the novelty of the algorithm and the significance of the theoretical results.

The main concern raised by reviewers is the scalability of the proposed method, due to the number of constraints in its LP formulation. The authors acknowledge this issue and develop pruning techniques to remove redundant constraints. Scalability issues could also be partially mitigated through the parallel development of GPU-accelerated LP solvers. While the scalability issue should not be overlooked, it is important to recognize that this submission presents the first algorithm to connect ReLU active learning with linear programming, opening avenues for future work in this direction.