/10

Poster3 位审稿人

最低3最高4标准差0.5

ICML 2025

Wait-Less Offline Tuning and Re-solving for Online Decision Making

Jingruo Sun,Wenzhi Gao,Ellen Vitercik,Yinyu Ye

提交: 2025-01-23更新: 2025-07-24

摘要

Online linear programming (OLP) has found broad applications in revenue management and resource allocation. State-of-the-art OLP algorithms achieve low regret by repeatedly solving linear programming (LP) subproblems that incorporate updated resource information. However, LP-based methods are computationally expensive and often inefficient for large-scale applications. By contrast, recent first-order OLP algorithms are more computationally efficient but typically suffer from weaker regret guarantees. To address these shortcomings, we propose a new algorithm that combines the strengths of LP-based and first-order OLP algorithms. Our algorithm re-solves the LP subproblems periodically at a predefined frequency $f$ and uses the latest dual prices to guide online decision-making. In parallel, a first-order method runs during each interval between LP re-solves and smooths resource consumption. Our algorithm achieves $\mathcal{O}(\log (T/f) + \sqrt{f})$ regret and delivers a "wait-less" online decision-making process that balances computational efficiency and regret guarantees. Extensive experiments demonstrate at least $10$-fold improvements in regret over first-order methods and $100$-fold improvements in runtime over LP-based methods.

关键词

Online Linear ProgrammingSequential Decision MakingResource Allocation

评审与讨论

审稿意见

评分: 42025-03-09

This paper introduces a hybrid algorithm for Online Linear Programming (OLP) that combines LP-based and first-order methods. By periodically re-solving LPs at frequency $f$ and using first-order updates in between, the method achieves a regret bound of $\mathcal{O}(\log(T/f) + \sqrt{f})$ , balancing computational efficiency and decision quality. Experiments demonstrate significant improvements: 20x lower regret than pure first-order methods and 100x faster runtime than pure LP-based methods. The theoretical analysis is rigorous, and the parallel framework addresses key challenges in unifying LP-based and first-order regret formulations.

给作者的问题

(a) Can the proposed algorithm and analysis be extended to the hard-stop setting, where the decision-making process terminates as soon as any one of the resources is exhausted? Additionally, how can the approach be adapted to handle scenarios where the time horizon is unknown?

(b) It appears that in certain cases, LP-based methods can achieve $\mathcal{O}(1)$ regret. Could the proposed algorithm attain this bound as well? If not, what are the key limitations preventing it from doing so?

论据与证据

Supported Claims:

"Balanced Regret and Runtime": The claim of achieving $\mathcal{O}(\log(T/f) + \sqrt{f})$ $O (lo g (T / f) + f)$ regret with reduced computational cost is well-supported by:
- Theoretical proofs (Theorems 3.1–3.2, Appendix B–C).
- Empirical results (Tables 2–3, Figures 2–3) showing 20x lower regret than first-order methods and 100x faster runtime than LP-based methods.
"Wait-Less Decision-Making": The absence of decision delays is validated by comparisons with literature in Table 1 and Section 4.2.

Potentially Problematic Claims:

"Minimal Assumptions": The paper states that Assumptions 2.1–2.2 are "minimal," but these include strict non-degeneracy assumption (Assumption 2.2(b)) . These may not hold in real-world scenarios.

方法与评估标准

The performance metrics based on regret and constraint violations are commonly used in the literature. Empirical comparisons against LP-based (Algorithm 3) and first-order (Algorithm 4) baselines are also appropriate.

理论论述

I have not fully checked the proof details. The decomposition into dual convergence, constraint violation, and leftover resources is logically sound. The final bound $\mathcal{O}(\log(T/f) + \sqrt{f})$ relies on Lemma B.5 (dynamics of resource usage) and Lemma B.10 (warm-start regret). These assume non-degeneracy, but no critical gaps were found.

实验设计与分析

It is nice that results are averaged over 100 trials (Section 4.1), which reduces variance. However, the experiments were conducted solely on uniform and normal distributions, which may limit the general validity of the results.

补充材料

I have only reviewed Appendix A, as well as Sections B.1 and B.2.

与现有文献的关系

The paper extends two key lines of work:

(a) LP-Based OLP. Builds on Agrawal et al. (2014) and Li & Ye (2022) by reducing computation via periodic re-solving.

(b) First-Order OLP. Improves upon Gao et al. (2023) and Balseiro et al. (2022a) by integrating LP-guided dual prices for lower regret.

The "wait-less" framework addresses limitations in Xu et al. (2024), which requires batch-level delays.

遗漏的重要参考文献

I did not realize that there was an omission of closely relevant literature.

其他优缺点

Strengths

(a) The parallel framework effectively leverages the strengths of LP-based and first-order methods, addressing a notable gap in existing OLP literature.

(b) The regret decomposition and "spectrum theorem" provide a unified analysis of LP-based and first-order methods, overcoming incompatibilities in prior work. The derived regret bound interpolates between $\mathcal{O}(\log T)$ (pure LP) and $\mathcal{O}(\sqrt{T})$ (pure first-order), allowing practitioners to tune $f$ based on computational budgets.

(c) The experiments validate the framework’s superiority, with orders-of-magnitude improvements in both runtime and regret across synthetic datasets.

Weaknesses

(a) The analysis relies heavily on non-degeneracy (Assumption 2.2). The impact of relaxing these assumptions (e.g., degenerate) is unclear.

(b) The experimental evaluation lacks a direct comparison with existing literature. For instance, [Li et al. 2024] also studies a similar problem, and it would be beneficial to include a comparative analysis to better highlight the advantages of the proposed approach.

其他意见或建议

N/A

作者回复

2025-03-31

Response to Reviewer QGM5

We appreciate your valuable feedback.

Claims And Evidence

Claims on the assumptions (non-degeneracy)

Thank you for pointing this out. We will add a more detailed discussion to clarify the assumptions in the paper. To achieve $\mathcal{O}(\sqrt{T})$ regret, we agree that these assumptions are not minimal. But we also want to note that without non-degeneracy (or its variant), [3] has shown that $\Omega(\sqrt{T})$ is a lower bound for OLP, and most of the literature in this line assumes non-degeneracy (or similar assumptions).

Experimental Designs Or Analyses

Experiments are only on uniform and normal distributions

We have added additional experiments on more problem distributions and tested performance under different re-solving frequencies. Our algorithms continue to demonstrate strong performance, consistent with the result in Theorem 3.2. We kindly refer the reviewer to the anonymous link https://anonymous.4open.science/r/icml_2025_olp-6C17/ for more details.

Other Strengths And Weaknesses

Requirement on non-degeneracy

Please see our response to Claims And Evidence.
Lack of direct comparison with existing literature

Thank you for pointing this out. [1] studies a similar problem under the finite support setting. We adapted our algorithms to finite support and added experiments to compare with [1] in https://anonymous.4open.science/r/icml_2025_olp-6C17/. In the finite-support setting, we find that our algorithm achieves lower regret while [1] is often faster. We will add additional experiments and discussions in the revision.

Questions

Stopping time analysis

We believe that a stopping time analysis is possible. This paper adapts the analysis of the LP-based method to the metric of first-order methods, and we can do the other way around by properly defining a stopping time for first-order methods [3]. Another simple way is to "subtract" $O(f + \log T)$ resource from the initial inventory and run OLP on this new problem, and in expectation, the resource violation from the original problem can be removed.
Unknown time horizon

To our knowledge, in the OLP setting, not knowing the horizon $T$ makes the problem harder. In [1], it is shown that only a competitive ratio result is achievable when $T$ has uncertainty. The most challenging part is that the average resource assumption A2.1 (c) is no longer well-defined. Since both LP-based and first-order methods require some estimate of average resources to work, a lack of this knowledge makes the problem more difficult.

Although the problem is theoretically challenging, in practice, it is often possible to make some prior prediction using data-driven approaches or based on the resources available at hand. For example, in online advertising, the number of customers can be predicted from historical statistics. Besides, in practice, the selling horizon is sometimes decided by some upper-level decision-makers. e.g., a company selling products before festival events. In this case, it is available to the decision-maker as a problem parameter.
$O(1)$ regret bound under finite support setting

This is also a very good question. To our knowledge, $O(1)$ regret is achievable in the finite-support setting with the same LP-based algorithm and a non-degeneracy assumption [4]. Since our analysis interpolates between the regret guarantee of first-order and LP-based methods, we believe the analysis in [4] can be adapted to give a similar result, which we leave to future work.

Thank you for your time in the review process!

References

[1] Li, G., Wang, Z., & Zhang, J. (2024). Infrequent resolving algorithm for online linear programming. arXiv preprint arXiv:2408.00465.

[2] Balseiro, S., Kroer, C., & Kumar, R. (2023, June). Online resource allocation under horizon uncertainty. In Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (pp. 63-64).

[3] Balseiro, S., Lu, H., & Mirrokni, V. (2020, November). Dual mirror descent for online allocation problems. In International Conference on Machine Learning (pp. 613-628). PMLR.

[4] Chen, G., Li, X., & Ye, Y. (2024). An improved analysis of LP-based control for revenue management. Operations Research, 72(3), 1124-1138.

审稿人评论

2025-04-05

Thanks for your efforts in conducting new experiments and providing a detailed response. This is a great paper, and I have learned a lot from it.

审稿意见

评分: 42025-03-13

This paper studies online linear programming (OLP) under stochastic inputs. Algorithmic approaches to this problem can be broadly categorized into two types: (i) the LP-based approach, which repeatedly solves an LP using the entire history of observations and decisions, and (ii) the first-order approach, which employs a general dual mirror descent to update dual prices and makes online decisions based on the updated prices. It is known that the LP-based approach can achieve $O(log(T))$ (or even $O(1)$ in the case with discrete support) regret but suffers from high computational complexity. but suffers from high computational complexity. In contrast, the first-order approach can only achieve $O(\sqrt{T})$ regret but but is computationally efficient.

This paper proposes an algorithm that integrates the key ideas of both existing approaches. Specifically, the algorithm solves the LP at a frequency $f$ (i.e., for each batch of arrivals of size $T/f$ ) and applies the first-order approach only to the first and last batches. The final regret is $O(\log(T/f) + \sqrt{f})$ , which smoothly interpolates between the performances of the two existing approaches. To achieve this, the paper unifies the regret analysis of the two approaches by decomposing the regret into three components.

The paper also presents an enhanced version of the algorithm that applies the first-order approach to each batch while using the dual price from the LP-based approach as the initial point. Although this enhanced algorithm shows performance advantages over the original algorithm, no performance guarantees are provided.

给作者的问题

Can you give any insights on the performance of the enhanced algorithm (Algorithm 2)?
In Section 4.2, why is the re-solving frequency fixed to $T^{1/3}$ ? Can we use the optimal frequency suggested in Section 3.4? What are the pros and cons?

论据与证据

Yes.

方法与评估标准

Yes.

理论论述

Yes, I checked the proofs of the main Theorems 3.1 and 3.2. To the best of my knowledge, the proofs are correct, and the unified regret analysis is interesting and could potentially be applied to more general settings.

实验设计与分析

Yes, I reviewed the numerics in Section 4. Although all experiments are conducted using synthetic data, they effectively demonstrate the regret of the proposed algorithms under varying frequencies. However, there are two issues with the experiments: (i) although the paper discusses computational efficiency, no detailed experiments are provided in this section; (ii) there is a lack of discussion on parameter settings. First, in Section 4.2, the frequency is fixed to be $T^{1/3}$ without explanation. Does this relate to Proposition 3.5, which computes the optimal re-solving frequency? Second, the learning rates used in the experiments are not specified for either Algorithm 1 or Algorithm 2.

补充材料

Yes, I reviewed the appendix B and C related to proofs of Theorem 3.1 and 3.2.

与现有文献的关系

The combined algorithms for OLP in this paper bridge the algorithms and results from LP-based approaches and first-order methods. They elegantly demonstrate the trade-off between performance and computational complexity in online decision-making literature.

遗漏的重要参考文献

I am not aware of any missing references.

其他优缺点

There are still no performance statements on the enhanced algorithm (Algorithm 2). Some additional discussion is needed to clarify the insights and limitations of this algorithm to help better understand it. For example, the learning rates for the first-order updates in Algorithm 1 and Algorithm 2 are set differently. What is the reason for this setup, and is there any potential for improvement?
It is not very clear why the LP approach is re-solved at very $T/f$ intervals. The re-solving idea was proposed in the very first OLP paper [Agrawal2014], where the LP is solved at geometric time intervals. In fact, it might make more sense to reduce the frequency of re-solving the LPs as time goes on under stochastic inputs (since the dual prices are gradually learned). There is a lack of discussion on the choice of re-solving frequency at the beginning.

其他意见或建议

N/A

作者回复

2025-03-31

Response to Reviewer GkFi

Thank you for your positive evaluation of our paper!

Experimental Designs Or Analyses

Computational efficiency.

We kindly refer the reviewer to Table 3 (Page 7 top) for a running time comparison between different methods, and in the revision we'll add a more comprehensive experiment to compare the computational aspects of different algorithms.
Parameter setting.

Thank you for your comments. In Algorithm 1, we take the learning rate as $\mathcal{O}(1/f^{1/2})$ when $t \leq f$ and $\mathcal{O}(1/f^{2/3})$ when $t \geq kf$ . In Algorithm 2, we take the learning rate as $\mathcal{O}(1/t)$ . We'll add a more detailed experiment setup section in the revision to cover the parameter settings (frequency/learning rate) of different algorithms.
$T^{1/3}$ solving frequency.

Thank you for pointing this out. We have experiments across different re-solving frequencies $f \in \\{ T^{1/3}, T^{1/2}, T^{2/3} \\}$ in Section 4.1 to evaluate the performance of our main algorithms. A smaller $f$ leads to lower regret but higher computational cost, while a larger $f$ improves the decision efficiency at the expense of decision optimality. Section 4.2 is designed to compare the performances among different algorithms, and the choice of $T^{1/3}$ for Section 4.2 is simply one of the candidate frequencies.

Other Strengths And Weaknesses

Insights of Algorithm 2.

Thank you for pointing this out. Algorithm 1 has a solid theoretical guarantee, but it does not update the dual price between two consecutive LP resolves. Algorithm 2 is proposed to alleviate this issue: between two LP resolves, the dual price is also updated using the first-order method. We adopt a smaller step size for first-order updates between LP resolves in Algorithm 2 to avoid deviating too far from the LP-guided solutions. We leave a more rigorous analysis of Algorithm 2 to future work.
Discussion on the frequency of resolving and comparison with [1].

Thank you for the comment. Indeed, in [1], the LPs are solved at geometric intervals. However, this type of resolving only guarantees regret of order $O(\sqrt{T})$ , and to achieve $o(\sqrt{T})$ regret under our setting, the only two ways we are aware of are
- solving the LPs at every time interval with an action-history dependent algorithm (adjusting $d_t$ ). This achieves $O(\log T)$ regret [2].
- using an $o(\sqrt{T})$ stepsize for the first-order method starting from a good initial guess of $p^\star$ . This achieves $O(T^{1/3})$ regret [3].
Our algorithm essentially interpolates between these two $o(\sqrt{T})$ regret bounds, and our LP-based algorithm also requires periodic re-solving. Otherwise, only $O(\sqrt{T})$ regret can be guaranteed.

Questions For Authors

Insights on the performance enhancement from Algorithm 2.

Please see our response to "Other Strengths And Weaknesses".
Frequency $T^{1/3}$ .

Please see our response to Experiment Designs or Analyses.

Thank you again for your efforts in the review process!

References

[1] Agrawal, S., Wang, Z., & Ye, Y. (2014). A dynamic near-optimal algorithm for online linear programming. Operations Research, 62(4), 876-890.

[2] Li, X., & Ye, Y. (2022). Online linear programming: Dual convergence, new algorithms, and regret bounds. Operations Research, 70(5), 2948-2966.

[3] Gao, W., Sun, C., Xue, C., Ge, D., & Ye, Y. (2024). Decoupling Learning and Decision-Making: Breaking the $\mathcal {O}(\sqrt {T})$ Barrier in Online Resource Allocation with First-Order Methods. arXiv preprint arXiv:2402.07108.

审稿意见

评分: 32025-03-17

This paper studies online linear programming and proposes an algorithm based on switching between a first-order method and a linear programming method proposed in previous literature to achieve both better computational efficiency and smaller regret by properly choosing the switching frequency f. There are also simulation results provided to demonstrate the improvements in computation time and regret performance.

给作者的问题

Please provide more technical discussions on the technical novelty to highlight the challenges and novelty compared to (Li et al. 2020) and (Li Ye 2022).

论据与证据

Most of the technical claims are clearly supported by convincing evidence.

方法与评估标准

Yes. The proposed method is evaluated by both regret and constraint violation, which makes sense for the OLP problem and its relevant applications.

理论论述

The theoretical claims seem correct after a quick read. More comments on the technical results are listed in later boxes on strengths and weaknesses.

实验设计与分析

Yes.

补充材料

I reviewed the relevant appendices on the first order method and the LP method to better understand the algorithm. I also glanced through the proof of the main theorem.

与现有文献的关系

This paper is related to online decision making under constraints.

遗漏的重要参考文献

I didn't find any.

其他优缺点

Strengths: This paper proposes a very interesting question in online decision making: how to achieve both efficiency and low regret at the same time. The paper designs a novel algorithm based on switching between two previously designed algorithms to achieve the best of both worlds. By tuning the switching frequency, the proposed algorithm include LP and the first-order method as special cases.

The illustration figure in Fig 1 is very clear and helpful for understanding the algorithm.

However, I have some questions.

1.This algorithm relies on switching back to the first-order method in the last f steps. But what if T is unknown? How to determine when to switch?

Relying on a pre-determined f can be conservative. Is there any idea to determine the switching time adaptively based on the remaining demand?
The algorithm heavily relies on two algorithms previously developed. The paper could benefit from more discussion on the technical novelty, otherwise the paper may seem to have marginal contribution compared to the previous literature, e.g. (Li etal 2020) and (Li Ye 2022)

其他意见或建议

作者回复

2025-03-31

Response to Reviewer LeBH

Thank you for the valuable feedback on our paper.

Other Strengths And Weaknesses

Unknown horizon $T$

This is a very good point. To our knowledge, in the OLP setting, an unknown horizon length $T$ can make the problem much more complicated: it is shown in [1] that it may not be possible to achieve sublinear regret with horizon uncertainty. In particular, the average resource assumption A2.1 (c) widely used in OLP literature is no longer well-defined, and an algorithm would also have to adapt to horizon uncertainty. Addressing this challenge would likely require first adapting LP-based and first-order OLP methods to uncertain horizons before tackling a hybrid approach.

Despite the theoretical challenge, in practice, it is often possible to make some prior prediction of $T$ using data-driven approaches or based on the resources available in hand. For example, in online advertising, the number of customers can be predicted from historical statistics. Besides, in practice, the selling horizon is sometimes decided by upper-level decision-makers. e.g., a company selling products before festival events. In this case, $T$ is available to the decision-maker as a problem parameter.
Determine $f$ based on the remaining resource

Thank you for the insightful comment. Since $f$ determines both the LP resolving frequency and the switching time to the first-order method, the analysis would be challenging if the LP resolving frequency is dynamically adjusted.

However, we believe it's a good idea to determine the switching time based on the remaining resources:
- One way is to define a new stopping time based on some pre-specified resource level (e.g., $f \cdot d$ when there's only one resource), after which the algorithm switches to the first-order method. The regret accumulated by the first-order method would be of order $O(\sqrt{\tau_f})$ . A rough estimate of $\tau_f =O(\max\\{\log f + \log T, f\\})$ can be obtained from the analysis of the constraint consumption process [1], where $\log f + \log T$ arises from the deviation of $d_t$ (average remaining resource at time $t$ ) from $d$ .
- The other way (more heuristic) is to determine the switching frequency based on the stability of the dual price $\{p_t\}$ obtained by LP. Suppose $\{p_t\}$ remains stable near the end of the horizon, then it suggests the resource consumption is smooth, and solving LP frequently is less necessary--switching to first-order methods would be ideal in this case.
We are glad to add a discussion in the revision but will leave a more rigorous regret analysis for future work.
More discussions on technical novelty

Thanks for the comment. To our knowledge, our paper is the first result combining LP-based and first-order OLP algorithms, achieving the best-of-both-worlds guarantee: achieving efficient decision-making with regret that interpolates the performance of two algorithms. Technically, our contributions include
- We develop a regret analysis framework that unifies both LP-based and first-order methods. In the current OLP literature, there are two types of analyses, one based on stopping time [2] and the other based on a joint metric of regret and resource violation [3]. Although intuitively, these two analyses should yield similar guarantees, it is non-trivial to combine them when two algorithms need to switch to each other. In particular, we need to carefully handle the evolution of the dual price $p_t$ when switching between two algorithms. Since first-order methods are known to be less stable, it is important to also take care of the stepsize of the algorithm (see our discussion in B.2 and B.9).
- In addition, our work addresses a technical limitation of a previous work [4], where the authors also consider solving LPs periodically but require customers to wait at the beginning/end. Our analysis removes the need of waiting.
From a high-level point of view, our method resembles the pre-training (LP) and fine-tuning (first-order methods) of large-language models. "Pre-training" provides baseline regret guarantee $\log (T/f)$ and "fine-tuning" performs minor refinement with improved efficiency.

We will add more discussion about the technical intuitions in the paper in the revision.

Questions

Discussion of technical novelty.

Please see our response to "Other Strengths And Weaknesses".

We hope our response addresses your concerns and thank you again for your efforts in the review process!

References

[1] Balseiro, Kroer, & Kumar (2023). Online resource allocation under uncertain horizons. ACM SIGMETRICS Abstract Proceedings.

[2] Li & Ye (2022). Online linear programming with new algorithms and regret bounds. Operations Research, 70(5), 2948–2966.

[3] Li, Sun, & Ye (2020). Fast algorithm for binary integer and online LP. NeurIPS, 33, 9412–9421.

[4] Xu, Glynn, & Ye (2024). Online LP with batching. arXiv:2408.00310.

最终决定Accept (poster)

2025-05-01

The paper considers the problem of online resource allocation over a horizon of T steps under an unknown iid arrival model. Existing results show that resolving an planning problem at each time t gives a log T regret. While attractive in terms of regret, this requires too much computational effort. The authors show that why carefully solving the planning optimization problem every f time steps, and using SGD for learning the duals in the first f and the last f steps, they can get a regret of O( log(T/f) + sqrt(f)) offering a tradeoff between computational cost and regret. Once caveat is that authors study a setting where they are allowed to violate the resource constraints, and therefore the objective is to maximize regret minus a penalty for resource overconsumption (which simplifies the analysis since the authors just need to study in some sense the dual function).

The reviews are mixed, with some concerns about technical novelty and comparison with prior work. Reviewer GkFi asks for comparison with Agrawal et al (2014) [who do not assume resource capacity scales with T, and also do not make the dual non-degeneracy assumption made in this paper]. I also found the description of Li et al. (2024) in Table 1 a bit misleading since the authors call their decision making delayed and that the paper required non-degeneracy -- my interpretation is that the decision is made at the time of arrival of a customer, so it is not delayed, and further from the abstract of tha paper "the proposed algorithm achieves a constant regret (even for the hard “degenerate” case) while solving LPs only O(log log T) times." The algorithm in Li et al. (2024) is primal based while in the current paper it is dual based (and requires dual to be non-degenerate), so there is sufficient gap in the methodology.

The review team is leaning towards weak accept decision as a result.