PaperHub
7.0/10
Poster · 3 reviewers (scores: 4, 3, 4; lowest 3, highest 4, std. dev. 0.5)
ICML 2025

DTZO: Distributed Trilevel Zeroth Order Learning with Provable Non-Asymptotic Convergence

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Trilevel learning (TLL) with zeroth order constraints is a fundamental problem in machine learning, arising in scenarios where gradient information is inaccessible due to data privacy or model opacity, such as in federated learning, healthcare, and financial systems. These problems are notoriously difficult to solve due to their inherent complexity and the lack of first order information. Moreover, in many practical scenarios, data may be distributed across various nodes, necessitating strategies to address trilevel learning problems without centralizing data on servers to uphold data privacy. To this end, an effective distributed trilevel zeroth order learning framework DTZO is proposed in this work to address the trilevel learning problems with level-wise zeroth order constraints in a distributed manner. The proposed DTZO is versatile and can be adapted to a wide range of (grey-box) trilevel learning problems with partial zeroth order constraints. In DTZO, the cascaded polynomial approximation can be constructed without relying on gradients or sub-gradients, leveraging a novel cut, i.e., zeroth order cut. Furthermore, we theoretically carry out the non-asymptotic convergence rate analysis for the proposed DTZO in achieving the $\epsilon$-stationary point. Extensive experiments have been conducted to demonstrate and validate the superior performance of the proposed DTZO.
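To make the zeroth order setting in the abstract concrete: only function evaluations, not gradients, are assumed to be available. Below is a minimal sketch of the kind of two-point, function-value-only gradient estimate that zeroth order methods build on. This is a generic illustration, not the paper's actual DTZO update; the function name, step size, and sample count are placeholder choices.

```python
import numpy as np

def zo_gradient_estimate(f, x, mu=1e-3, num_samples=20, rng=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Only function evaluations are used, so this applies when f is a
    black box (e.g., an opaque model or a privacy-restricted objective).
    """
    rng = np.random.default_rng(rng)
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape[0])  # random probing direction
        grad += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return grad / num_samples

# Gradient-free descent on a toy quadratic "black box".
f = lambda x: float(np.sum((x - 1.0) ** 2))
x = np.zeros(5)
for _ in range(200):
    x -= 0.05 * zo_gradient_estimate(f, x)
print(x)  # approaches the all-ones minimizer
```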
Keywords
Distributed Algorithm · Trilevel Optimization · Zeroth-Order Optimization

Reviews and Discussion

Review (Rating: 4)

The paper introduces DTZO (Distributed Trilevel Zeroth Order Learning), a novel framework for solving distributed trilevel learning problems with missing gradient information. This is achieved by constructing a cascaded polynomial approximation without relying on gradients or sub-gradients, leveraging zeroth-order cuts for the inner and outer layers and dropping inactive zeroth-order cuts. The authors also carry out a non-asymptotic convergence rate analysis for the proposed framework. Finally, the paper demonstrates its effectiveness through experiments on black-box trilevel learning and robust hyperparameter optimization.

Update after rebuttal

Thanks to the authors for the comprehensive rebuttal. The added experimental results strengthen the findings of the paper and addressed my concerns. I have changed my score accordingly.

Questions for Authors

N/A

Claims and Evidence

Yes, the claims in the submission are supported by clear and convincing evidence. Specifically, the authors provide theoretical analysis of their polynomial approximation and zeroth-order cut methods, present the distributed zeroth-order algorithm, and report experimental results that validate the performance of the proposed algorithm.

Methods and Evaluation Criteria

Yes, the authors compared their method with FedRZO on black-box trilevel learning in LLMs. They used Qwen-1.8B-Chat as the black-box LLM and the GLUE benchmark for evaluation, comparing the ASR and ACC of the two algorithms. The paper also compared the method with FedZOO and FedRZO on hyperparameter optimization, using a ReLU neural network and benchmarking on datasets including MNIST, QMNIST, F-MNIST, and USPS, demonstrating the effectiveness of the algorithm.

Theoretical Claims

Yes, the paper provides rigorous theoretical analysis for the proposed methods. It makes claims about the stationarity, boundedness, smoothness, and complexity of the method, with details provided in the appendix.

Experimental Design and Analysis

In general, the experimental design and analysis are sound, as partially discussed in the Methods and Evaluation Criteria section. However, the dataset choice in hyperparameter optimization lacks diversity, since the datasets are largely MNIST-related, and the baselines are limited: the paper only compares against two methods, FedZOO and FedRZO.

Supplementary Material

Yes, I went through the proofs in the supplementary material, but due to the length of the appendix, not all mathematical details were carefully checked.

Relation to Broader Scientific Literature

The paper relates to the broader scientific literature on distributed trilevel learning, distributed zeroth-order optimization, and cutting-plane methods for bilevel/trilevel optimization.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

The paper has good novelty and solid theoretical foundations, but the experimental study is limited, as mentioned above. In general, it should be above the borderline for acceptance.

Other Comments or Suggestions

In Assumption 4.4, there is a typo: the word "Following" is mistakenly included in the citation hyperlink.

Author Response

We truly appreciate your insightful suggestions. We have provided a point-by-point reply to all your questions and hope we have successfully addressed all your concerns.

(Q1) In general, the experimental design and analysis are sound, as partially discussed in the Methods and Evaluation Criteria section. However, the dataset choice in hyperparameter optimization lacks diversity, since the datasets are largely MNIST-related, and the baselines are limited: the paper only compares against two methods, FedZOO and FedRZO.

(R1) Thanks for your helpful suggestions. Addressing the trilevel zeroth order optimization problem in a distributed manner is significantly challenging, as even *finding a feasible solution* to a linear trilevel optimization problem is NP-hard. This is the first work to explore solving trilevel zeroth-order optimization problems in a distributed manner while providing theoretical guarantees. We have conducted experiments on several more complex datasets, including CIFAR-10 and CIFAR-100 (using CNN models), as well as time series datasets such as MelbournePedestrian, Crop, and UWaveGestureLibraryAll from the UCR Archive [1], as suggested. Additionally, we have incorporated more state-of-the-art distributed single-level, bilevel, and trilevel optimization methods with zeroth-order estimators [2], including FedAvg+ZO [3], FEDNEST+ZO [4], ADBO+ZO [5], and AFTO+ZO [6], as baseline methods. The experimental results are presented in Table 3.1 below. Please note that combining distributed nested optimization methods with zeroth order estimators does not provide any theoretical guarantees; these methods are included only for comparative evaluation. As shown in Table 3.1, the proposed DTZO exhibits superior performance. This can be attributed to two key factors: (1) Compared to existing methods, the proposed DTZO is capable of effectively addressing higher-nested zeroth order optimization problems with non-asymptotic convergence guarantees. (2) The proposed nonlinear zeroth order cuts facilitate the development of a more refined cascaded polynomial relaxation.

Table 3.1 Experimental results on robust hyperparameter optimization.

| Datasets | FedAvg+ZO | FedZOO | FEDNEST+ZO | ADBO+ZO | FedRZObl | AFTO+ZO | DTZO |
|---|---|---|---|---|---|---|---|
| MNIST | 0.5196 | 0.5289 | 0.5503 | 0.5341 | 0.5405 | 0.7501 | 0.7927 |
| QMNIST | 0.5204 | 0.5245 | 0.5398 | 0.5487 | 0.5467 | 0.7389 | 0.7804 |
| F-MNIST | 0.4786 | 0.4874 | 0.5065 | 0.5102 | 0.5023 | 0.6448 | 0.7007 |
| USPS | 0.7211 | 0.7277 | 0.7354 | 0.7323 | 0.7379 | 0.7987 | 0.8513 |
| CIFAR-10 | 0.3731 | 0.3829 | 0.4034 | 0.3987 | 0.4079 | 0.4692 | 0.5147 |
| CIFAR-100 | 0.1967 | 0.2102 | 0.2243 | 0.2354 | 0.2321 | 0.2774 | 0.3023 |
| MelbournePedestrian | 0.6214 | 0.6295 | 0.6454 | 0.6412 | 0.6487 | 0.6924 | 0.7250 |
| Crop | 0.5379 | 0.5468 | 0.5607 | 0.5681 | 0.5645 | 0.6016 | 0.6351 |
| UWaveGestureLibraryAll | 0.6652 | 0.6714 | 0.6924 | 0.6983 | 0.7002 | 0.7689 | 0.8243 |

Furthermore, we have conducted additional experiments to evaluate the proposed DTZO in three aspects:

    1. Experimental results on another domain, i.e., continual learning: please see the results in (R1) to Reviewer 1qiE.
    2. Ablation experiments: please refer to (R3) to Reviewer LX43 for details.
    3. Experimental results when using larger LLMs, i.e., Qwen-2-7B and Llama-3.1-8B: please see (R2) to Reviewer LX43 for results.

(Q2) The paper has good novelty and solid theoretical foundations, but the experimental study is limited, as mentioned above. In general, it should be above the borderline for acceptance.

(R2) We appreciate your comments and support. Please refer to our modifications regarding the experimental study in (R1). We hope that our revisions have adequately addressed your concerns.

(Q3) In Assumption 4.4, there is a typo: the word "Following" is mistakenly included in the citation hyperlink.

(R3) We sincerely appreciate your careful review and thank you for pointing it out. We have corrected it as suggested.

References

[1] The UCR Time Series Classification Archive, 2018.

[2] A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications, IEEE Signal Processing Magazine, 2020.

[3] Communication-efficient learning of deep networks from decentralized data, AISTATS 2017.

[4] FEDNEST: Federated bilevel, minimax, and compositional optimization, ICML 2022.

[5] Asynchronous Distributed Bilevel Optimization, ICLR 2023.

[6] Provably Convergent Federated Trilevel Learning, AAAI 2024.

Review (Rating: 3)

This paper proposes a zeroth-order constrained trilevel learning optimizer that is versatile and can be adapted to a wide range of TLL problems. The authors provide both convergence analysis and experimental validation for the proposed DTZO framework. The performance improvement shown in the experiments is very significant.

Questions for Authors

  1. The performance of the proposed DTZO method in the experiments is significantly better than the baselines, which is excellent. However, I remain unclear about the source of this improvement. Could the authors design ablation experiments to verify the source of this performance boost? For example, they could identify which techniques were added during the transition from the baseline to the DTZO method and validate the importance of each technique individually.

  2. What is the core theoretical contribution of this paper compared to existing work? Does this result represent a theoretical improvement, or can the outstanding performance of the DTZO algorithm be reflected through theoretical analysis?

  3. Can the experiments be conducted on larger models, such as foundational models like LLAMA-7B? How do their results perform?

  4. Could the authors provide more detailed experimental records? For example, the communication bits for each method, the actual training time (in seconds), and other related details. This would help me better understand the performance of the DTZO method.

Claims and Evidence

The structure of the article is generally complete, and the viewpoints are reasonable.

Methods and Evaluation Criteria

The experiments look excellent. I have listed a few questions regarding the experiments above, and I hope the authors can provide detailed answers.

Theoretical Claims

The paper analyzes the convergence of the DTZO framework, but the introduction is not detailed enough. It would be beneficial for the authors to provide a table that compares the latest progress with the related analysis, presenting the detailed assumptions and conclusions to highlight the advantages of the current analysis. The current version does not make this clear enough.

Experimental Design and Analysis

The results demonstrated by the experiments are impressive, but the data sample size is insufficient. The authors should expand the testing to include more models and datasets.

Supplementary Material

No supplementary materials.

Relation to Broader Scientific Literature

It is also beneficial for other ZO methods.

Essential References Not Discussed

Enough discussions.

Other Strengths and Weaknesses

see above (each block)

Other Comments or Suggestions

see above (each block)

Author Response

We sincerely appreciate your insightful and valuable suggestions. We have provided a detailed point-by-point response to your questions below.

(Q1) A table that compares the latest progress with the related analysis.

(R1) This work is the first to explore solving the distributed TLL without relying on first-order information while providing theoretical guarantees. Per your suggestion, we have included a table comparing the non-asymptotic convergence results of the proposed DTZO with SOTA distributed nested optimization methods, under the assumption/setting where first-order information is either available (FO) or unavailable (ZO). Please note that ZO is more general and challenging than FO.

Table 2.1

| Methods | Bilevel (FO) | Bilevel (ZO) | Trilevel (FO) | Trilevel (ZO) |
|---|---|---|---|---|
| FEDNEST | $\mathcal{O}(1/\epsilon^2)$ | - | - | - |
| ADBO | $\mathcal{O}(1/\epsilon^2)$ | - | - | - |
| MDBO | $\mathcal{O}(1/\epsilon^2)$ | - | - | - |
| FedBiOAcc | $\mathcal{O}(1/\epsilon^{1.5})$ | - | - | - |
| FedRZObl | - | $\mathcal{O}(1/\epsilon^2)$ | - | - |
| AFTO | - | - | $\mathcal{O}(1/\epsilon^2)$ | - |
| DTZO | - | - | - | $\mathcal{O}(1/\epsilon^2)$ |
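For concreteness, an $\mathcal{O}(1/\epsilon^2)$ entry in Table 2.1 is read in the usual non-asymptotic sense sketched below, assuming a generic stationarity-gap criterion (the paper's exact definition of the $\epsilon$-stationary point may differ in its details):

$$
\min_{t \in \{1,\dots,T(\epsilon)\}} \mathbb{E}\big[\mathcal{G}(x^{t})\big] \le \epsilon
\qquad \text{whenever} \qquad
T(\epsilon) \ge C\,\epsilon^{-2},
$$

where $\mathcal{G}(\cdot)$ denotes the stationarity gap and $C$ is a problem-dependent constant; halving the target accuracy $\epsilon$ therefore requires roughly four times as many iterations.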

(Q2) More models and datasets, e.g., LLAMA-7B.

(R2) Additional experiments are conducted on (1) more datasets and models in robust hyperparameter optimization, and (2) larger LLMs in black-box TLL. Specifically,

(1) Due to character limitations, please refer to (R1) to Reviewer urFe.

(2) Experiments are conducted using larger LLMs, i.e., Qwen-2-7B and LLAMA-3.1-8B, and the results are shown in Table 2.2.

Table 2.2

| Methods | SST-2 (ASR) | SST-2 (ACC) | COLA (ASR) | COLA (ACC) | MRPC (ASR) | MRPC (ACC) |
|---|---|---|---|---|---|---|
| FedRZObl (Qwen) | 0.9147 | 0.6253 | 0.8774 | 0.7094 | 0.9384 | 0.7142 |
| DTZO (Qwen) | 0.9647 | 0.7053 | 0.9545 | 0.7653 | 0.9571 | 0.7492 |
| FedRZObl (LLAMA) | 0.9854 | 0.8591 | 0.8623 | 0.6556 | 0.9710 | 0.7454 |
| DTZO (LLAMA) | 1.0 | 0.8968 | 0.9065 | 0.6891 | 0.9910 | 0.7866 |

(Q3) Ablation experiments.

(R3) Ablation Study. To analyze DTZO’s performance improvements, we conduct an ablation study comparing DTZO against its variants: DTZO(-) and DBZO. DTZO(-) replaces the proposed nonlinear cuts in DTZO with linear cuts, while DBZO removes the cascaded polynomial approximation, using only a single-layer polynomial approximation. It is seen from Table 2.3 that DTZO outperforms all variants, demonstrating the benefits of cascaded polynomial approximation and nonlinear ZO cuts. In addition, we also compare DTZO with a variant of DTZO that does not remove inactive cuts. It is seen from Figure 3 on page 33 that removing inactive cuts greatly enhances DTZO’s efficiency, underscoring the importance of this step.

Table 2.3

| Methods | F-MNIST | USPS | UWaveGestureLibraryAll | MelbournePedestrian |
|---|---|---|---|---|
| DBZO | 0.5343 | 0.7492 | 0.7143 | 0.6536 |
| DTZO(-) | 0.6685 | 0.8212 | 0.7921 | 0.7013 |
| DTZO | 0.7007 | 0.8513 | 0.8243 | 0.7250 |

(Q4) Core theoretical contribution.

(R4) Existing works focus on single-level and bilevel ZO, while tackling trilevel ZO is under-explored and significantly more challenging (even finding a feasible solution in TLL is NP-hard). This work marks an initial step toward solving TLL without first-order information. A key theoretical contribution is the first trilevel ZO framework and the first non-asymptotic convergence guarantee (e.g., iteration and communication complexity) for trilevel ZO, which aligns with DTZO’s superior performance in experiments, i.e., DTZO outperforms SOTA methods by handling higher-nested ZO problems with theoretical guarantees. Another key theoretical contribution is the construction of the first theoretically guaranteed ZO cuts in nested optimization, enabling cascaded polynomial relaxation without gradients. The proposed ZO cut is also the first nonlinear cut in nested optimization, offering better approximations of complex functions than linear cuts. This advantage can also be observed in the ablation study, where DTZO outperforms its variant without nonlinear ZO cuts.

(Q5) More detailed records (communication bits, training time).

(R5) Compared to single-level and bilevel optimization, solving TLL inherently demands higher complexity. This is because 1) TLL has a higher-nested structure: finding a feasible solution in TLL requires solving a bilevel optimization problem; and 2) it involves more optimization variables than bilevel and single-level optimization. Per your suggestions, we provide a comparison of communication complexity and training time between DTZO and the SOTA distributed TLL method. As shown in Table 2.4, DTZO achieves superior performance due to its non-asymptotic convergence guarantees and simplified optimization procedure.

Table 2.4

| Methods | Communication complexity (bits) | Crop (training time, s) | UWaveGestureLibraryAll (training time, s) |
|---|---|---|---|
| AFTO+ZO | N/A$^1$ | 652.3 | 375.2 |
| DTZO | $32T(\epsilon)(2d_1+3d_2+3d_3)N+64N\lfloor \frac{T_1}{\mathcal{T}} \rfloor \mathcal{T}(d_2+d_3)$ | 485.5 | 244.6 |

$^1$ Not available, because this method lacks a non-asymptotic convergence guarantee.
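As a rough illustration of how the communication cost reported for DTZO in Table 2.4 scales with the problem dimensions, the snippet below simply evaluates the stated expression; all numerical values are hypothetical placeholders, not measurements from the paper.

```python
import math

def dtzo_comm_bits(T_eps, d1, d2, d3, N, T1, T_cal):
    """Evaluate the communication-complexity expression from Table 2.4:
    32*T(eps)*(2*d1 + 3*d2 + 3*d3)*N + 64*N*floor(T1/T_cal)*T_cal*(d2 + d3).
    """
    return (32 * T_eps * (2 * d1 + 3 * d2 + 3 * d3) * N
            + 64 * N * math.floor(T1 / T_cal) * T_cal * (d2 + d3))

# Hypothetical sizes, for scaling intuition only.
print(dtzo_comm_bits(T_eps=1000, d1=100, d2=50, d3=50, N=10, T1=20, T_cal=5))
```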

Review (Rating: 4)

This paper introduces DTZO, a framework for solving trilevel learning problems where gradient information is unavailable at all levels (black-box) or only at some levels (partial zeroth order constraints), with analysis of the convergence rate and communication complexity.

Questions for Authors

It seems the authors only discussed the trade-off between performance and complexity, but not the additional computational trade-off compared to other methods. Also see the previous sections.

Claims and Evidence

The paper provides an extensive related work section discussing bilevel and single-level zeroth order methods, and proposes DTZO to address trilevel zeroth order learning problems.

DTZO achieves up to a 40% improvement in performance over state-of-the-art methods from empirical results.

The authors also provide a non-asymptotic convergence guarantee.

While the framework is theoretically flexible, no experiments explore its adaptability across domains beyond LLM prompt learning and hyperparameter optimization, and there is no ablation study exploring the contribution of the different optimization structures.

Methods and Evaluation Criteria

The benchmarking tasks (LLMs and hyperparameter optimization) are relevant applications for zeroth order learning.

Performance metrics such as accuracy, robustness, convergence speed, and computational efficiency are reasonable.

However, the dataset selection is limited to small or simple ones (which might also not be suitable for LLM benchmarking). Other bilevel methods or other black-box optimization techniques should also be tested as references, given the motivation here.

Theoretical Claims

The theoretical contribution looks correct to me on initial examination.

Experimental Design and Analysis

The stationarity gap and $\epsilon$-stationary point definitions are well established in optimization. The use of penalty-based relaxation aligns with existing literature.

However, $T_1$ is not bounded; more explanation of this is desired here.

Supplementary Material

I checked Notations and Experiment Settings.

Relation to Broader Scientific Literature

The paper builds on existing work in bilevel optimization and extends it to trilevel settings with broader connection to black-box optimization, federated learning, and hierarchical machine learning.

Essential References Not Discussed

Some other zeroth order optimization methods are missing.

Other Strengths and Weaknesses

Pros:

This paper is well written, with a solid theoretical contribution and well-structured proofs.

Cons:

No discussion of computational limitations. Dataset and benchmark size is limited.

Other Comments or Suggestions

A comparison with gradient based or partial gradient based methods would help quantify the trade-off between gradient-free vs. gradient-based approaches in cases where some gradient information is available.

Author Response

We truly appreciate your insightful and valuable suggestions. We have provided a detailed response addressing each of the questions you raised.

(Q1) Experiments on another domain and ablation study.

(R1) This work is the first to investigate solving the distributed TLL problem without relying on first-order (FO) information while providing theoretical guarantees. The proposed DTZO exhibits high adaptability and can be applied to a broad range of TLL problems; please refer to (R5) for details. To address your concerns, we have incorporated additional experimental results in another domain, i.e., continual learning, along with an ablation study, as detailed below.

  • Continual Learning. In the experiment, the adversarial architecture-based continual learning task is considered, which can be formulated as the following TLL problem:

$$
\begin{aligned}
\min \quad & \sum_j\sum_{(x,y)\in D_j^B}\frac{L([\theta_1,\theta_2],x,y)}{|D_j^B|}\\
\text{s.t.} \quad & \theta_1=\mathop{\arg\min}\limits_{\theta_1'}\sum_j\sum_{(x,y)\in D_j^A}\frac{L(\theta_1',x,y,P_1)}{|D_j^A|}\\
& \quad \text{s.t.} \quad P_1=\mathop{\arg\max}\limits_{P_1'}\sum_j\sum_{(x,y)\in D_j^A}\frac{L(\theta_1',x,y,P_1')}{|D_j^A|}
\end{aligned}
$$

The results are reported in Table 1.1. It is seen that the proposed DTZO outperforms SOTA methods because: 1) DTZO is capable of solving higher-nested zeroth order (ZO) problems; 2) the proposed nonlinear ZO cut has superior approximation performance.

  • Ablation study. Due to character limitations, please refer to (R3) to Reviewer LX43.

Table 1.1

| Method | MNIST | USPS |
|---|---|---|
| FedZOO | 0.6103 | 0.6652 |
| FedRZObl | 0.6967 | 0.7254 |
| FEDNEST | 0.7014 | 0.7279 |
| ADBO | 0.6942 | 0.7316 |
| AFTO | 0.9275 | 0.9423 |
| DTZO | 0.9536 | 0.9654 |

(Q2) Additional datasets and baseline methods.

(R2) Due to character limitations, please see (R1) to Reviewer urFe.

(Q3) Explanation of $T_1$.

(R3) In the proposed framework, $T_1<\infty$ is a constant that can be flexibly adjusted based on specific requirements. Specifically, if the distributed system has sufficient computational and communication resources, a relatively larger $T_1$ (but still finite) can be chosen to achieve a better cascaded polynomial relaxation. Conversely, if the resources are limited, a smaller $T_1$ can be set to reduce the iteration and communication complexity.

(Q4) No discussion of computational limitations. Dataset and benchmark size is limited.

(R4) Solving TLL is highly challenging, as even *finding a feasible solution* for a linear TLL problem is NP-hard. The proposed DTZO is a distributed optimization method, making it well-suited for large-scale learning tasks due to the scalability of distributed algorithms. It decomposes large tasks into subproblems assigned to individual workers, enabling efficient problem-solving. Moreover, even in scenarios with limited computational resources, DTZO remains effective due to its flexibility: by setting a smaller $T_1$, the communication and iteration complexities are reduced. Experimental results on more datasets and larger models are reported in (R2) to Reviewer LX43 and (R1) to Reviewer urFe.

(Q5) Trade-off between gradient-free vs. gradient-based method.

(R5) The proposed framework is highly adaptable, accommodating both fully gradient-unavailable TLL and cases with partial gradient access with minimal modifications. We further discuss and compare the trade-off below.

  1. Gradients at 1-level are available. In this case, Eq. (16–18) can be replaced with gradient descent steps in DTZO, which introduce less noise per iteration and improve the convergence rate ($\mathcal{O}(1/\epsilon)$). However, Eq. (16–18) do not rely on gradients, making them more applicable to scenarios where gradients are unavailable. This represents a trade-off between convergence efficiency and applicability, which can be flexibly adjusted within DTZO (a minimal sketch of this FO-vs-ZO choice is given after this list).

  2. Gradients at 2-level are available. In this case, the outer layer ZO cut can be replaced by the FO cut: $\nabla\phi_{out}(\{x_{2,j}^t\},\{x_{3,j}^t\},\{z_i^t\})^{\top}[\{x_{2,j}-x_{2,j}^t\};\{x_{3,j}-x_{3,j}^t\};\{z_i-z_i^t\}]+\phi_{out}(\{x_{2,j}^t\},\{x_{3,j}^t\},\{z_i^t\})\le \frac{L}{2}\big(\sum_{i=2}^3\sum_j\|x_{i,j}-x_{i,j}^t\|^2+\sum_i\|z_i-z_i^t\|^2\big)+\varepsilon_{out}$. Since the FO cut is generated based on gradients, it introduces less noise in the generation process and can yield a superior polynomial relaxation. In contrast, the ZO cut exhibits broader applicability, as its generation does not rely on gradients. This represents a trade-off between polynomial relaxation quality and applicability, which can be effectively controlled in DTZO.

  3. Gradients at 3-level are available. Similar to 2), there exists a trade-off between inner layer polynomial relaxation and applicability.
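To make the trade-off in point 1 concrete, below is a minimal sketch of a single update step that uses the true gradient when it is available and otherwise falls back to a two-point zeroth order estimate. This is an illustration of the FO-vs-ZO choice only, under placeholder objectives and step sizes, not the actual DTZO update rule.

```python
import numpy as np

def zo_gradient_estimate(f, x, mu=1e-3, num_samples=20, rng=None):
    """Two-point zeroth-order gradient estimate (function values only)."""
    rng = np.random.default_rng(rng)
    g = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape[0])
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_samples

def update_step(x, f, grad_f=None, lr=0.05):
    """One descent step: use the first-order gradient when available
    (less noise per iteration), otherwise a zeroth-order estimate
    (broader applicability when f is a black box)."""
    g = grad_f(x) if grad_f is not None else zo_gradient_estimate(f, x)
    return x - lr * g

# FO step when gradients are accessible; ZO step for a black-box objective.
f = lambda x: float(np.sum((x - 1.0) ** 2))
grad_f = lambda x: 2.0 * (x - 1.0)
x = np.zeros(3)
x = update_step(x, f, grad_f)   # first-order step
x = update_step(x, f)           # zeroth-order step
```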

(Q6) Some ZO references are missing.

(R6) More discussion on ZO will be added as Appendix J.3. We are happy to include any references the reviewer may suggest.

Final Decision

This paper introduces DTZO, a framework for solving trilevel optimization problems in which some parameters are themselves optimal solutions of lower-level optimization problems. This is achieved by constructing a cascaded polynomial approximation without relying on gradients or sub-gradients, leveraging zeroth-order cuts for the inner and outer layers and dropping inactive zeroth-order cuts. The reviewers found the theoretical part of the paper to be clear, and in experiments the paper significantly improves upon the performance of existing solutions.