Thank you for your feedback! We address the points raised in the review below.

Bandit feedback: One possibility for bandit feedback is when the agent does not observe the contract but only observes the payoff induced by this contract for the action that was played. A different scenario is when the agent may observe the contract but does not know the outcome distributions or what they imply for the actions that were not played, which again requires some additional exploration. Our analysis, however, holds for full feedback or any of these partial feedback scenarios.

Theorem 3.2: The theorem shows that there is a positive measure of contract games where learning by the agent and an optimal dynamic contract by the principal lead to a strict Pareto improvement (win-win) outcome. This is, however, not true for all contract games; there are games where learning reduces a player’s utility compared to a myopic best response (for instance, the example in the introduction). The construction in Theorem 3.2 shows that there are cases where using either a best-response strategy or, alternatively, smarter learning (no-swap-regret) leads to a loss compared to using simpler learning (mean-based learning; see our remark at the end of Section 3). So, whether it is beneficial to use learning, and which type of learning to use, depends on the game.

Theorem 4.3: Whether or not is feasible just depends on whether is less than the ratio between the optimal (known-time horizon) dynamic contract and the optimal static contract; for any problem there is some sufficiently large to make infeasible. The theorem statement is perhaps easier to conceptualize by considering its contrapositive: if there is an such that is feasible, then for any there exists an such that is feasible as well. We will provide additional context for this theorem to better clarify the situation.