PaperHub
Overall rating: 4.8/10 — Rejected (4 reviewers)
Individual ratings: 5, 3, 3, 8 (min 3, max 8, std 2.0)
Confidence: 3.3
ICLR 2024

Communication Bounds for the Distributed Experts Problem

Submitted: 2023-09-23 · Updated: 2024-02-11
TL;DR

Towards optimal communication bounds for the distributed expert problem under various scenarios

Abstract

Keywords

Online Learning, Communication Bounds, Distributed Expert Problem, Bandit Learning

Reviews and Discussion

Review (Rating: 5)

This paper studies communication-efficient algorithms for a distributed experts problem. In this problem, the cost function associated with each expert is split among multiple servers, and the goal is to find the most communication-efficient learning strategy, where the cost of fetching information from the servers counts as communication cost. The paper considers two types of cost functions (sum and max) and two communication models (broadcast and message passing) and proposes multiple algorithms to tackle this problem.

Strengths

++ The studied problem is interesting, timely, practically relevant, and intellectually challenging.

++ The authors consider multiple settings of the problem and propose algorithms for each of these diverse settings.

Weaknesses

-- Overall, the writing of this paper needs substantial effort to be ready for publication. The current format does not provide enough details and insights into the algorithms and the theoretical and numerical results. For example, in the current presentation of the introduction, there is no direct mapping between the theoretical results summarized in Tables 1 and 2 and the descriptive statements in the paragraphs at the top of page 3 of the paper.

-- Also, earlier in the introduction, the authors try to distinguish their work from streaming algorithms, claiming that in their setting the coordinator does not have any memory constraints. However, later in the introduction, they talk about lower bounds in the case of limited memory.

-- I am not sure how much sense the assumption $T = O(\log(ns))$ makes, since typically, in an online learning setting, the time horizon is assumed to be sufficiently large so that sublinear regret in $T$ is meaningful.

-- The description of the algorithms is very brief and technical and does not provide any insight into the algorithmic steps and ideas. This reduces the paper's readability for a broader audience. In addition, the paper's organization in terms of theoretical results is not clear: a lot of theorems are presented inside the algorithm section, while there is a separate section for formal guarantees.

-- The numerical evaluation is very brief and does not provide any comparison between the proposed communication-efficient algorithms and other possible algorithms that try to be aware of the communication costs. Also, the benchmark is neither introduced nor cited in the paper. The experimental setup is not clearly defined, and the impact of parameters is not investigated.


Questions

See weaknesses section.

Comment

We thank the reviewer for their review and address their main concerns below.

Q1. Writing

We thank the reviewer for pointing out the deficiencies in our presentation. In our revised version, we have unified the presentation of our communication bounds in the tables, as well as in our contributions section at the top of page 3.

Q2. Memory Bounds

For our memory bound condition in the proof of our lower bound, we clarify that we only impose a memory bound on the downstream servers with no memory assumption on the central coordinator. We believe this assumption is both practical and provides a novel solution for bounding the communication of a server with itself across multiple days. A lower bound proof without a memory bound being imposed on downstream servers is indeed more challenging, which we leave as a very intriguing open question.

Q3. Horizon Assumption

The $T \in O(\log(ns))$ assumption is only required for our lower bound argument in the message-passing model with the maximum aggregation function. The point here is to show that the best we can do in the message-passing model with a maximum aggregation function is the naive EWA, which is the reason why we can only propose an upper bound in the blackboard model when the maximum aggregation function is used.

Q4. Evaluation

As far as we know, our generic distributed experts setup has not been studied before. [1] only studies the setting where the cost is distributed to a single server rather than arbitrarily aggregated across multiple servers. Also, the memory-efficient streaming algorithms in [2, 3] assume a memory bound on the coordinator, which we don't have. Thus, given our generic setup, we can think of two baselines: EWA, which is known to achieve optimal regret, and EXP3, which uses less communication and can achieve near-optimal regret. Regarding HPO-B, we have cited the paper in the third paragraph of the introduction. For further clarification, we also cite this benchmark in the evaluation section of the revised version of our paper. We have further added more description of our experimental setup and an ablation study on the hyperparameter $b_e$ in the appendix of our revised paper, which demonstrates the communication-regret tradeoff of our methods.
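To make the comparison of the two baselines concrete, below is a minimal sketch (our own illustration, not the protocol or parameters from the paper) of how EWA and EXP3 consume cost information: EWA needs the full aggregated cost vector every day, while EXP3 only needs the cost of the expert it actually played. Here `costs` is a T×n array of already-aggregated per-expert costs, and the learning rates and the per-day communication estimates in the comments are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ewa(costs, eta=0.1):
    """EWA / Hedge baseline: requires the full aggregated cost vector each day,
    i.e., every server reports every expert's cost (roughly n*s communication
    per day in the message-passing model)."""
    T, n = costs.shape
    w = np.ones(n)
    total_cost = 0.0
    for t in range(T):
        p = w / w.sum()
        i = rng.choice(n, p=p)
        total_cost += costs[t, i]
        w *= np.exp(-eta * costs[t])          # multiplicative-weights update on all experts
    return total_cost

def exp3(costs, eta=0.1, gamma=0.05):
    """EXP3 baseline: only the played expert's aggregated cost is observed,
    so only that expert's cost needs to be fetched (roughly s communication
    per day), at the price of a worse regret dependence on n."""
    T, n = costs.shape
    w = np.ones(n)
    total_cost = 0.0
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / n   # mix in uniform exploration
        i = rng.choice(n, p=p)
        total_cost += costs[t, i]
        est = np.zeros(n)
        est[i] = costs[t, i] / p[i]           # importance-weighted cost estimate
        w *= np.exp(-eta * est)
    return total_cost
```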

References:

  1. Kanade, Varun, Zhenming Liu, and Bozidar Radunovic. "Distributed non-stochastic experts." Advances in Neural Information Processing Systems 25 (2012).
  2. Srinivas, Vaidehi, et al. "Memory bounds for the experts problem." Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing. 2022.
  3. Peng, Binghui, and Fred Zhang. "Online prediction in sub-linear space." Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). Society for Industrial and Applied Mathematics, 2023.
Comment

Thank you for the response. I read them, and my assessment will remain the same.

Review (Rating: 3)

The article focuses on the experts problem in the distributed setting, where an expert's cost needs to be aggregated across multiple servers. Each server is faced with a different instance of the experts problem. In this work, the authors consider various communication models, such as the message-passing model and the broadcast model, along with multiple aggregation functions, such as summing and taking the maximum of an expert's cost across servers. This article presents the first communication-efficient protocols that guarantee near-optimal regret in these settings, even against a strong adversary who can choose the inputs adaptively.

Strengths

In this article, communication versus regret tradeoffs in various scenarios are considered, which are of great interest to the distributed online learning field. The proposed algorithms (DEWA and its variants) achieve near-optimal regret using much less communication than the baseline EWA algorithm. Moreover, lower bounds are provided to show that either the regret or the communication bound of the algorithms is optimal.

Weaknesses

My major concerns lie in the motivation of this work. The problem considered in this paper should be better motivated with convincing application examples. Admittedly, the expert problem has achieved great success and is broadly employed in many practical application scenarios, but regarding the distributed setting, it is hard for me to map it to any specific application scenario. To make the contribution clear, please provide some examples and discussion of this in the introduction. In addition, both of the communication protocols mentioned in this paper rely on a coordinator. Why is this a reasonable assumption? Does it make any difference to the results if we change the model to a fully distributed one?

Also, all the servers are forced to commit to the same expert according to the depicted problem setting. Please provide reasons for making such an assumption.

The presentation of this work is poor and should be substantially revised.

The simulation results are too simple, without any support from real-world data traces, and lack comparisons to other related works.

Questions

See the weaknesses section.

Comment

We thank the reviewer for their review and address their main concerns below.

Q1. Problem Definition and Practical Scenarios

We have highlighted the motivating examples in our introduction and describe them here. For distributed online learning problems (e.g., online federated learning), one can try different sets of hyperparameters that correspond to different experts in the distributed experts problem. The goal is to adaptively choose the optimal hyperparameters on each day. A central coordinator is commonly used in such scenarios; it receives information from users along the edges and updates its model locally. If the system is fully decentralized and has point-to-point communication, our communication upper bounds will still hold up to a multiplicative constant factor of 2 (and an additive log(# servers) term), since the coordinator can forward a message from one server to any other server. Otherwise, our communication bound depends on the specific communication topology, which is not the focus of our paper. In addition, for distributed online learning or federated learning, it is common for the coordinator to commit to the same model due to fairness and privacy issues, as well as the memory overhead of storing multiple models for each downstream server.

Q2. Presentation

We thank the reviewer for pointing out the deficiencies in our presentation. We have updated the writing accordingly in the revised version of our paper.

Q3. Evaluation

We have evaluated our methods against a real-world data trace (HPO-B) in Section 6 and provided simulation evaluations in the appendix to demonstrate that our upper bound methods are robust even against extreme cost distributions (i.e., very sparsely distributed costs), which do not have corresponding real-world data traces.

Comment

Thanks very much for your response. I keep my score, since my concerns on motivation and contribution remain.

Review (Rating: 3)

This paper proposes the experts problem in the distributed setting. The idea is that each expert $i$ incurs a loss on each server $j$ at each time step $t$; thus, the losses are $l_{i,j}^t \in [0,1]$. Looking at (generally a subset of) the historical data, the algorithm needs to pick an expert at each time step $t$, so as to minimize regret (as compared to choosing the best possible fixed expert in hindsight). The key distributed twist to the problem is that communication cost is considered alongside regret, so obtaining historical data from the servers has a cost; in this context, two cost models are considered: message-passing and broadcast. Furthermore, multiple aggregation methods are considered for the total loss: in summation aggregation, the cost of an expert is the sum of that expert's costs across servers; in maximization aggregation, the cost of an expert is the maximum of that expert's costs across servers.
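For concreteness, the objective implied by this summary can be written as follows (this is our paraphrase in generic notation, not necessarily the paper's exact formulation): with $n$ experts, $s$ servers, horizon $T$, and aggregation function $g$ (sum or max), a protocol choosing expert $i_t$ on day $t$ aims to keep

```latex
\mathrm{Regret}_T \;=\; \sum_{t=1}^{T} g\bigl(l^t_{i_t,1},\dots,l^t_{i_t,s}\bigr)
\;-\; \min_{i \in [n]} \sum_{t=1}^{T} g\bigl(l^t_{i,1},\dots,l^t_{i,s}\bigr),
\qquad
g(x_1,\dots,x_s) \in \Bigl\{\textstyle\sum_{j=1}^{s} x_j,\;\; \max_{1 \le j \le s} x_j\Bigr\},
```

small while also keeping the total amount of communication between the coordinator and the servers small.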

The paper gives algorithms and lower bounds, and discusses trade-offs between regret minimization and communication minimization. The upper bounds are in the strong (adaptive) adversary model, where the adversary gets to observe the full history realization before choosing the costs at the next time step. Lower bounds are in the weaker, oblivious adversary model; this makes the lower bounds more powerful, since they apply even if the adversary is weak. However, the lower bounds require memory-bounding assumptions; this makes the lower bounds less powerful, since they only apply when the memory restrictions exist.

The paper also provides experimental data comparing its algorithms with distributed adaptations of classical multiplicative weights.

Strengths

Originality: To the best of my knowledge, this is the first paper on Distributed Experts. I believe the problem is interesting, non-trivial, and has the potential to kick off a series of papers on the topic. There is also scope for expansion of the problem, by adding in extra factors, such as privacy and asynchrony.

Result Quality: None of the theorems are proved in the body of the paper. I don’t anticipate that their proofs involve substantial novelty. I think the more substantial contribution of this paper is its model which expands the scope of the experts problem to the distributed setting. That, I feel, is a good quality contribution.

Writing Quality / Clarity: The paper was readable with effort (this is not true of all papers).

Significance: I think the topic of study is novel, and interesting, especially in a world where distributed systems, distributed machine learning, and distributed inference are increasingly relevant.

Weaknesses

Originality: The algorithms and analyses are not particularly novel, but they do need to adapt ideas from the non-distributed world appropriately, and I wouldn’t be surprised if they took some effort to adapt correctly.

Result Quality: The results likely involve limited technical ingenuity.

Writing Quality / Clarity: The paper could use improvement in the writing. In particular, the first paragraph should describe the problem in more detail: when it says the experts make predictions, a reader would assume the predictions matter; however, in the model of the paper, the predictions themselves don't matter, only the costs do. It is also relevant that the adversary must pick the costs in a bounded interval, in this case $[0,1]$. Without explaining these types of details clearly, the substance would be confusing to readers who are not familiar with the experts problem, or even to those who are familiar with it but have seen other versions of the problem.

Significance: Like most papers at ICLR, this paper is probably of interest to a small subcommunity at the conference. Once again, I believe this is true of most papers, so it is not a negative comment towards a paper.

Questions

My assessment of the paper is that the major contribution of this paper is in the problem definition, lifting the experts problem to the distributed setting. However, I’m of the impression that the algorithms and analyses are largely straightforward, given what is known about the experts problem in the sequential setting, and given the statement of the distributed setting—which is a novel contribution of the paper. Is this assessment correct in the authors’ opinion? If not, what are the most interesting technical contributions of this paper in terms of algorithmic ingenuity or analysis?

Comment

We thank the reviewer for their review and address their main concerns below.

Q1. Major Contribution

We thank the reviewer for acknowledging the definition of the distributed experts problem as interesting and novel.

Regarding our technical contribution, we agree with the reviewer that our constant-probability upper bound with a summation aggregation function (i.e., DEWA-S) is intrinsically an unbiased sampling method, and the proof is mostly similar to that in the non-distributed setting. However, in order to achieve a high-probability guarantee, we propose a novel hierarchical EWA algorithm, where each i.i.d. subroutine (DEWA-S or DEWA-M) can be regarded as a meta-expert. Thus, by running a conventional EWA on top of these meta-experts, we obtain optimal regret with a high-probability guarantee (i.e., DEWA-S-P and DEWA-M-P). We note that we also consider the maximum aggregation function in the distributed setting, and here one cannot obtain an unbiased estimator. Instead, we propose a random-walk-based communication protocol that achieves near-optimal regret with constant probability (DEWA-M) or with high probability (DEWA-M-P). To summarize, our algorithms go significantly beyond simple unbiased estimation: we (1) propose a hierarchical EWA algorithm to achieve a high-probability guarantee, as well as (2) a random-walk-based communication protocol when the maximum aggregation function is used.
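A schematic sketch of the hierarchical boosting idea described above, in our own notation: `run_dewa_round` is a hypothetical placeholder for one round of a DEWA-S/DEWA-M subroutine (it is not the paper's actual interface), and the learning rate is an illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)

def hierarchical_ewa(T, k, run_dewa_round, eta=0.1):
    """Run k independent (i.i.d.) DEWA subroutines in parallel and treat each
    copy as a meta-expert; a conventional EWA over the k meta-experts then
    tracks the best copy.  If each copy achieves near-optimal regret with
    constant probability, the chance that all copies fail decays exponentially
    in k, which is how a constant-probability guarantee is boosted to a
    high-probability one."""
    meta_w = np.ones(k)            # EWA weights over the meta-experts
    states = [None] * k            # opaque per-copy state, managed by run_dewa_round
    for t in range(T):
        proposals, day_costs = [], []
        for m in range(k):
            expert_m, cost_m, states[m] = run_dewa_round(states[m], t)
            proposals.append(expert_m)   # expert that copy m would commit to today
            day_costs.append(cost_m)     # aggregated cost that copy m incurred today
        p = meta_w / meta_w.sum()
        m_star = rng.choice(k, p=p)      # follow one meta-expert on this day
        meta_w *= np.exp(-eta * np.asarray(day_costs))
        yield t, proposals[m_star]
```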

Q2. Writing

We thank the reviewer for pointing out the issues in our presentation. We have improved our writing accordingly in our revised submission.

Comment

Thank you to the authors for responding to my comments and questions.

Thanks for clarifying and explaining your algorithmic contributions. Also, I'm glad that you have taken the comments from me and other reviewers about the writing of the paper to heart: good writing can make a world of difference in how well a reader can understand your contributions, assess them, and use them in further work. Incidentally, I recommend writing a paragraph conveying your ideas in your response to "Q1.Major Contribution" in the introduction of any future revision of the paper.

(I tried to find your revisions, so I could comment on them; but I am unable to find them on my end of the interface.)

Overall, I believe that the main idea of your paper is nice. More motivating applications would, of course, be nice (as other reviewers observed), but I feel that the theoretical question is compelling in itself. Furthermore, in your response, you noted some interesting algorithmic work (perhaps the analysis is also interesting?). However, the paper can do a substantially better job of explaining these results.

I am not changing my score, since the rating is reflective not only of the quality of the results, but also of how those results are ultimately presented as a research paper; but I imagine that the results, algorithms, and analyses can be recast as a compelling research paper in the future.

Review (Rating: 8)

This paper presents a distributed variant of the experts problem. In this problem, there is a set of servers, and there is an instance of each expert on each server. The authors consider two objectives: one objective is to minimize the sum of costs across the servers, while another objective is to minimize the maximum cost achieved across the servers. The authors also consider two message-passing protocols, namely a protocol in which the central coordinator communicates directly with one server at a time, and a protocol in which the central coordinator can broadcast a message to all servers. The authors develop several algorithms with associated regret and communication costs, and also provide a lower bound on the communication cost for this problem. Some computational experiments are provided, applying these algorithms to a benchmark related to hyperparameter optimization.

Strengths

The problem considered is interesting, and the theoretic results are reasonably strong.

Weaknesses

Some critical assumptions made by the authors seem to be going unstated. Most critically, the authors don't state any assumptions about the costs $l_i^t$ other than that they are in $[0,1]$. It seems that the authors are assuming that $\{l_i^t\}_{t=1}^{\infty}$ are i.i.d. or something similar, but this is not stated anywhere.

Overall, I thought that the problem could be motivated better. It would help to provide more details about the HPO-B benchmark, as well as providing other instances where the distributed experts problem could apply.

The computational experiments are pretty sparse. I'm not sure why Exp3 is only compared against the authors' algorithms with $b_e = 1$ and EWA is only compared against the authors' algorithms with $b_e = n$. It would also be useful to show experiments with $b_e$ taking a wider range of values, to understand how this value affects the performance of the algorithm.

Questions

What are the assumptions placed on $l_i^t$?

Comment

We thank the reviewer for their review and address their main concerns below.

Q1. Assumptions on the range and distribution of the cost

Indeed, the range of the $l_i^t$ can be within $[0, \rho]$ for any $\rho > 0$, and all the regret terms (both upper and lower bounds) will be multiplied by a factor of $\rho$ accordingly. Thus, the upper-bound methods we propose are still nearly optimal. The only reason we make the assumption that $l_i^t \in [0, 1]$ is to simplify the notation in the proof, where we can cancel out the $\rho$ factor. We have updated this description in the revised version of our paper to clarify this confusion (note that changes in the revised version are in blue). Regarding the cost distribution, our upper bound methods are derived against strong adversaries, where the adversary can define the cost vector arbitrarily, while our lower bound proofs hold against oblivious adversaries, where the adversary can choose the cost functions arbitrarily but without interacting with the algorithm.
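Spelled out for completeness (a standard rescaling argument in our notation, not a verbatim statement from the paper): if $l_i^t \in [0, \rho]$, one can run any algorithm designed for $[0,1]$ costs on the normalized costs $\tilde{l}_i^t = l_i^t / \rho$, and since regret is linear in the costs, every guarantee transfers with a factor of $\rho$:

```latex
\mathrm{Regret}_T(l)
\;=\; \sum_{t=1}^{T} l^t_{i_t} \;-\; \min_{i}\sum_{t=1}^{T} l^t_{i}
\;=\; \rho\Bigl(\sum_{t=1}^{T} \tilde{l}^t_{i_t} \;-\; \min_{i}\sum_{t=1}^{T} \tilde{l}^t_{i}\Bigr)
\;=\; \rho \cdot \mathrm{Regret}_T(\tilde{l}).
```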

Comment

Thank you for the clarifications.

Comment

We thank the reviewers for all their constructive comments. Based on the comments (please refer to our individual replies for details), we have prepared a revised submission, where we focus on clarifying the most important points of confusion we identified and providing additional results as much as possible. All revised text is in blue. The additional experiment is included in Appendix C.4 of our revised submission.

Summary:

  1. Improved the writing and presentation in the revised paper.
  2. Added additional ablation results on the hyperparameter $b_e$, which controls the communication-regret tradeoff, in Appendix C.4 of our revised paper.
AC Meta-Review

This paper considers the distributed experts problem, where the costs are aggregated across servers. The reviews point to potentially interesting results, but suggest that (a) the contributions should be made clearer and (b) the presentation should be improved. While the authors have addressed some of these concerns, properly verifying this would require another round of review.

Why not a higher score

Requires revision.

Why not a lower score

N/A

Final Decision

Reject