PaperHub
Score: 6.0/10 (Poster; 4 reviewers; min 3, max 4, std 0.4)
Individual ratings: 4, 4, 3, 4
Confidence: 3.0
Novelty: 2.5 | Quality: 2.5 | Clarity: 2.3 | Significance: 2.0
NeurIPS 2025

Prompt Tuning Decision Transformers with Structured and Scalable Bandits

OpenReview | PDF
Submitted: 2025-05-09 | Updated: 2025-10-29
TL;DR

We provide a scalable bandit architecture for prompt tuning of decision transformers for increased downstream performance.

Abstract

Keywords
Reinforcement Learning, Decision Transformer, Prompt tuning, Bandits, Multi-task transfer

Reviews and Discussion

Official Review (Rating: 4)

This paper presents a bandit-based prompt-tuning method for the Prompting Decision Transformer (PDT) to enhance task generalisation in offline Reinforcement Learning, particularly in multi-task and few-shot settings. PDT, which utilises trajectory prompts from expert demonstrations, has limitations due to its uniform sampling of prompts, thereby ignoring their varying informativeness. The proposed method optimises prompt selection using a structured bandit architecture, enabling efficient reward modelling and linear scaling concerning prompt size. The method outperforms existing baselines across tasks, high-dimensional environments, and out-of-distribution scenarios without requiring costly weight updates to the underlying Transformer backbone.

Strengths and Weaknesses

Strengths:
This work introduces a scalable and efficient bandit-based prompt-tuning method for PDT, optimizing prompt selection to improve task performance without requiring costly weight updates. It shows consistent performance gains across various tasks and environments.

Weaknesses:
The core limitations stem from its oversimplified assumption of inter-segment dependency and an insufficient credit assignment mechanism. While it demonstrates better scalability than naive bandit approaches, its application in large-scale multi-task environments remains constrained by sample efficiency issues, computational overhead, and lack of cross-task knowledge transfer.

Questions

In Theorem 4.1, the assumption is made that the error of the interaction term between segments, $h(\tilde{\tau}_1, \dots, \tilde{\tau}_J)$, is bounded ($|h(\cdot)| \leq \varepsilon$), and it claims that this assumption holds due to PDT's "modular structure." However:

  • The core of the Transformer architecture is the self-attention mechanism, which inherently models cross-segment dependencies. How can we prove that these global dependencies do not significantly increase $\varepsilon$?
  • If $\varepsilon$ is indeed large, the $K\varepsilon$ term in equation (8) would cause regret to grow linearly (especially when $K$ is large). Does this fundamentally undermine the theoretical advantages of the method?
  • Are there any experimental quantifications of the actual $\varepsilon$ value? For example, by comparing the independent segment reward predictions with the actual return of the complete prompt?

Section 4.4 suggests using the pre-trained PDT model as a feature extractor $\Psi$, which maps original segments to embeddings:

  • The core objective of this method is to "quickly tune at inference time," but obtaining embeddings for each candidate segment $\tilde{\tau}$ requires a forward pass through a computationally intensive Transformer. When $|\mathcal{P}_i| = M \times (L - H + 1)$ becomes large (e.g., $M = 50, L = 100, H = 5 \rightarrow 4800$ segments), does this embedding computation become a new bottleneck?
  • On the other hand, in high-dimensional state spaces (e.g., pixel-based observations), would the $d$-dimensional embeddings need to increase significantly to preserve useful information? Could this cause the $O(\sqrt{K \log |\mathcal{P}|})$ term of the bandit method to deteriorate?

Limitations

The limitations are discussed in the conclusion.

Final Justification

I appreciate the explanations regarding the segment independence assumption, computational overhead, and the empirical quantification of the interaction term. The additional experiments further strengthen the justification for the independence assumption and demonstrate the robustness of the proposed method. Therefore, I have raised my rating to 4.

Formatting Issues

No major formatting issues.

Author Response

We thank the reviewer for their thorough and thoughtful review, as well as for raising important points about the core aspects of our method. Below, we address the concerns and suggestions in detail, and outline the modifications we plan to incorporate in the revised manuscript.

W1: On the segment independence assumption

We elaborate further in our detailed response below, but in brief: we do not believe the segment independence assumption is oversimplified. Many MDPs can be effectively identified by their optimal state-action marginals, meaning that the inclusion of a few informative $(\hat{r}, \mathbf{s}, \mathbf{a})$ pairs in the prompt ought to be sufficient for downstream adaptation. In such settings, inter-segment interactions are minimal. Empirically, our method consistently outperforms strong baselines across environments (Tables 1-3), indicating that this assumption holds well in practice. In Section 5.3, we further quantify the interaction term $h(\cdot)$ and find that it remains small, supporting the theoretical justification of our bandit decomposition.

W2: On the computational overhead

When using Transformer-derived prompt embeddings, segment features are computed once per segment and cached, removing the need for repeated forward passes during tuning. The bandit updates themselves involve only small MLPs and incur negligible overhead compared to full-model updates. As shown in the results, our method delivers better performance than full-model fine-tuning, despite being significantly more efficient in wall-clock time.

W3: Cross-task knowledge transfer

Our setting follows the standard PDT prompt-tuning framework, where a multi-task pre-trained model is adapted to a single target task via prompt selection. In this formulation, knowledge transfer occurs from pretraining to online inference, and our method directly builds on this assumption. While extending the framework to continual or multi-task online adaptation is an exciting direction, it goes beyond the scope of our current study and the literature we aim to advance.

Q1: How can we prove that global prompt dependencies don't increase $\varepsilon$?

Thank you for raising this insightful question. While PDT does attend to the full input sequence, including all prompt segments and the current history (as described in Equation 3), the pretraining procedure samples each prompt segment independently from multiple trajectories in the demonstration pool $\mathcal{P}$. That is, prompts are constructed without enforcing order or co-occurrence constraints between segments.

As a result, the model is not trained to rely on segment interactions or their specific arrangement. Instead, it learns to attend to individual informative $(\hat{r}, \mathbf{s}, \mathbf{a})$ pairs, which are sufficient for identifying the task in many offline RL settings. This training setup implicitly encourages invariance to inter-segment dependencies, and thus limits the influence of global prompt structure on the model’s behavior.

Under this assumption, the interaction term $\varepsilon$, which captures residual dependencies between segments, is expected to remain small. This is supported by our empirical analysis in Section 5.3 and Figure 4, where we quantify the deviation and show that the independence-based approximation holds well in practice.

To further support this interpretation, we conducted an additional study in the 2D environment:

  • For each task, we constructed prompts containing one highly informative segment and $J - 1$ less informative ones, then permuted their order across rollouts.
  • PDT performance remained stable, with a return of $6.27 \pm 0.34$, indicating permutation invariance.
  • Analyzing attention weights, we found that in 99% of rollouts, PDT focused on the informative segment, regardless of position.
  • When we masked attention to prompt segments, performance dropped sharply if the informative segment was excluded, but remained near-optimal when only it was retained:
| Attention Scope | Return |
| --- | --- |
| Full prompt | 6.27 ± 0.34 |
| Only informative segment | 5.88 ± 0.23 |
| Only uninformative segments | -34.06 ± 4.49 |

These results suggest that PDT behavior is largely driven by the most informative segment, and that inter-segment interactions are negligible in practice. We will also include a visualization of attention weights in the appendix of the final paper, but we are not allowed to link additional image material during the rebuttal.

With that being said, it is true that if an MDP is characterized by cross-segment interactions (e.g., temporal ordering across segments), the independence assumption may not hold, and $\varepsilon$ could be larger. This would weaken the regret bound. However, such tasks are not aligned with PDT's training procedure, which uses uniform, position-independent sampling of prompt segments. As a result, PDT is not trained to model or exploit inter-segment dependencies, and its performance is inherently limited in those settings. We will clarify this limitation in the revised manuscript.

Lastly, when using PDT to encode prompt segments, we ensure that each segment is encoded independently: we pad the segment into a fixed-length sequence and use the representation of the last token, which only attends to tokens within the same segment. This prevents leakage of information from other segments and preserves the independence assumption in the segment embeddings used by the bandit.
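A minimal sketch of this independent encoding step is shown below; it assumes a frozen, causally masked encoder exposed as `pdt_encoder(tokens, attention_mask=...)`, and the name and interface are illustrative rather than taken from the actual codebase:

```python
import torch

def encode_segment(pdt_encoder, segment_tokens, max_len):
    """Embed one prompt segment independently of all others.

    segment_tokens: (T, d_token) tensor of tokenized (return-to-go, state, action) triples.
    The segment is padded to max_len; with causal attention, the last real token only
    attends to tokens of this segment, so its hidden state is used as the embedding.
    """
    T, d_token = segment_tokens.shape
    padded = torch.zeros(max_len, d_token)
    padded[:T] = segment_tokens
    mask = torch.zeros(max_len, dtype=torch.bool)
    mask[:T] = True                                   # mark real (non-pad) positions
    with torch.no_grad():                             # the PDT stays frozen
        hidden = pdt_encoder(padded.unsqueeze(0), attention_mask=mask.unsqueeze(0))
    return hidden[0, T - 1]                           # last-token representation
```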

Q2 & Q3: If epsilon is indeed large, the term in equation (8) would cause regret to grow linearly (especially when K is large). Are there any experimental quantifications of the actual value epsilon?

We agree that a large $\varepsilon$ would lead to linear regret growth and undermine the benefit of our structured bandit formulation. To assess this directly, we quantify the interaction term $\varepsilon$ empirically in Section 5.3 using a synthetic prompt-tuning task designed to introduce segment dependencies.

In this setup, the bandit must identify a correct integer sequence of length $J$ that maximizes reward, but it receives feedback only for the full sequence, while modeling each position with an independent reward model -- exactly mirroring the structure of our method. We compute the interaction term as:

$$h(\mathbf{x}_1, \dots, \mathbf{x}_J) = \left| R(\mathbf{x}) - \frac{1}{J} \sum_j \phi_j(\mathbf{x}_j) \right|$$

as suggested by the reviewer.

Even in this adversarial setting, the measured interaction term remains small and increases slowly with $J$, as shown in the right-most plot of Figure 4. This suggests that the independence assumption introduces only a minor approximation error in practice.
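For concreteness, a self-contained sketch of how such an interaction term can be measured on synthetic data is given below; the reward function `R`, the coupling strength, and the per-position models `phi` are illustrative placeholders, not the exact setup used in Section 5.3:

```python
import numpy as np

rng = np.random.default_rng(0)
J, n_values = 4, 10                          # sequence length and per-position alphabet size

# Synthetic full-sequence reward: an additive part plus a small pairwise coupling.
w = rng.normal(size=(J, n_values))

def R(x):                                    # x: integer sequence of length J
    additive = np.mean([w[j, x[j]] for j in range(J)])
    coupling = 0.05 * w[0, x[0]] * w[1, x[1]]
    return additive + coupling

def phi(j, x_j):                             # independent per-position reward estimate
    return w[j, x_j]

# Empirical interaction term h(x) = |R(x) - (1/J) * sum_j phi_j(x_j)|.
samples = rng.integers(0, n_values, size=(1000, J))
h = [abs(R(x) - np.mean([phi(j, x[j]) for j in range(J)])) for x in samples]
print(f"mean |h|: {np.mean(h):.4f}, max |h|: {np.max(h):.4f}")
```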

We will clarify this point in the revised version to make the connection to the regret bound more explicit.

Q3: Do prompt segment embeddings introduce a computational bottleneck?

While it is true that encoding each prompt segment with the PDT requires a forward pass through the Transformer, this cost is incurred only once per segment. All segment embeddings are pre-computed and cached prior to bandit tuning, so no further Transformer forward passes are required during prompt selection or reward modeling.
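Concretely, the caching step can be as simple as the following sketch, where `encode_fn` stands for any per-segment encoder (for example, the wrapper sketched in our Q1 reply); the names are illustrative:

```python
def build_embedding_cache(encode_fn, candidate_segments):
    """One-time pre-computation of segment embeddings before bandit tuning.

    candidate_segments: dict mapping segment_id -> raw segment (e.g. a token tensor).
    Returns segment_id -> embedding; afterwards, prompt selection and reward-model
    updates only read from this cache and never call the Transformer again.
    """
    return {seg_id: encode_fn(segment) for seg_id, segment in candidate_segments.items()}
```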

In our experiments, the number of candidate segments remains modest, typically on the order of a few hundred (with $M \approx 10$ expert trajectories). As a result, the one-time embedding cost does not constitute a bottleneck in practice.

We also note that while PDT-encoded segments (denoted $\Psi$) do not always outperform raw segments in terms of return (see updated results for KYsS), they offer a practical advantage: they decouple the input dimensionality of the reward models from the environment’s state and action spaces. Without embeddings, the input size would scale with $(|\mathcal{S}| + |\mathcal{A}| + 1) \times H$; with embeddings, it remains constant. This leads to significant efficiency benefits in high-dimensional domains. As shown below, average rollout times in the MuJoCo Ant environment are notably lower when using cached embeddings:

| Method | Ant (J=1, H=5) | Ant (J=2, H=20) |
| --- | --- | --- |
| TS^Ψ | ~28m | ~34m |
| TS | ~33m | ~172m |

We believe this confirms that using PDT-derived segment embeddings offers a practical and scalable solution without introducing a computational bottleneck.

Q4: Does $O(\sqrt{K \log |\mathcal{P}|})$ deteriorate in high-dimensional states (and high $d$)?

Thank you for the question. While high-dimensional inputs (e.g., raw pixels) can be challenging, we rely on pre-trained PDT embeddings to compress each segment into a fixed-length representation. This keeps the input to the reward models constant, regardless of the original state/action dimensions. Our experiments already involve high-dimensional settings, e.g., Ant ($(27 + 8 + 1) \times H$), Meta-World ($(39 + 3 + 1) \times H$), Cheetah ($(20 + 7 + 1) \times H$), and show that PDT embeddings perform comparably to raw inputs for prompt tuning.

Regarding the regret term $O(\sqrt{K \log |\mathcal{P}|})$, we clarify that the term $|\mathcal{P}|$ refers to the size of the prompt dataset, not the dimensionality of the prompt or segment embeddings. Therefore, the regret is neither directly influenced by the embedding dimension $d$ nor by the dimensionality of unencoded modalities. There is, however, an implicit dependence on the PDT's capacity: if the PDT cannot solve a downstream task well regardless of the prompt, then the maximum achievable reward by the bandit is capped. In such cases, observed regret may be high, but this reflects a limitation of the underlying model, not of the bandit optimization or its theoretical guarantees.

We will clarify this interpretation in the revised manuscript.

Comment

Thanks for the authors' response. I will increase my rating to 4.

Comment

Thank you!

We're glad the rebuttal helped clarify our contributions and addressed your concerns. We welcome any further questions or suggestions you may have and appreciate the time and feedback.

Official Review (Rating: 4)

This work proposes a method for tuning prompts for Decision Transformers. The idea is to outsource the selection of sub-trajectories to be used as prompts to a multi-arm contextual bandit. Crucially, the method maintains an independent reward model for each arm, thus alleviating the combinatorial growth of the optimization space. Furthermore, the authors suggest that arm features can be extracted from the embedding function already existing within the Decision Transformer. An experimental evaluation is provided, spanning from a simple 2D integrator to standard MuJoCo tasks (Ant, HalfCheetah, Meta-World). The method is compared to perturbation-based methods, and demonstrates better performance. Several exploration algorithms for the bandit are compared. Finally, an empirical regret evaluation is carried out to motivate the structured nature of the bandit.

Strengths and Weaknesses

Strengths:

  • The submission is very well written and motivated.
  • The application of bandits for optimizing prompts of decision transformers is, to the best of my knowledge, novel.
  • The experiments are well designed and motivated. In particular, Figure 3 nicely illustrates the method at work.

Weaknesses:

  • The experimental section has contradictory or confusing results:
    • Baselines appear to be rather weak: ZORankSGD appears to underperform with respect to a naive Hill-climbing strategy in Table 1. Why is this the case? Furthermore, baseline (ii) (line 246) seems to be missing from the table, but I would expect it to represent a bottom line.
    • If I understand the design of the current hill-climbing baseline correctly, an interesting baseline would be hill-climbing over the (discrete) space of prompts ($\mathcal{P}_i$). Can the authors comment on whether this would be a reasonable inclusion? Essentially, it would ablate whether the bandit is better at optimizing than a simple hill-climbing routine over a similar space.
    • The adoption of transformer-encoded trajectories is stated to be beneficial (262). However, average performance seems to decrease in Table 1. Can the authors comment on this?
    • Results in Table 1 have very large standard deviations, to the point that it’s hard to claim significant improvement. How are the standard deviations computed?
  • The experimental selection of 4 relatively simple tasks could be expanded.

Questions

Questions:

On top of the ones listed above:

  • Authors assume that a simulator of the environment is available for online evaluation of the policy, but not for data collection. Why is this a reasonable assumption? Trajectories observed during policy evaluation could easily be used for policy improvement.

Others:

  • equations 2 and 3 should be moved below (between lines 72 and 73), as the notation is not yet introduced.

Limitations

yes

Final Justification

My issues were partially resolved during the rebuttal. The authors explained the baselines in more detail, and promised to correct an inconsistency between what is claimed in the submission (line 262) and their experimental results. I am not entirely convinced by the way in which performance is aggregated over environments, as the standard deviations capture cross-task variability rather than actual confidence intervals on, say, mean performance. A more informative aggregation, and an evaluation on more environments, would make this submission more convincing.

Nevertheless, this paper remains an interesting submission and I would still vote for acceptance.

Formatting Issues

none

Author Response

We thank the reviewer for the detailed and constructive feedback. We appreciate the recognition of our paper’s motivation, clarity, and the novelty of applying structured bandit-based prompt tuning to PDT. Below, we respond to the specific concerns raised, and outline how we plan to clarify or extend the paper in the final version.

W1: ZORankSGD appears to underperform.

We appreciate the reviewer pointing this out. Indeed, we observe that ZORankSGD performs similarly to or slightly below hill-climbing across several environments in Table 1.

Both methods apply black-box optimization over the prompt space using perturbed prompt variants, but differ in how they process feedback:

  • Hill-climbing updates prompts based directly on absolute return values, preserving the full, informative feedback signal.
  • ZORankSGD, by contrast, relies on rank-based comparisons among perturbations for gradient estimation, which discards magnitude information in favor of relative ordering.

This design may make ZORankSGD more robust in noisy settings, but also less sample-efficient, and explains why it often performs similarly to, or slightly worse than, the simpler hill-climbing baseline in our experiments.

W2: Hill-climbing over the (discrete) space of prompts?

Hill-climbing over the discrete prompt space is an interesting idea, but there are practical limitations that make it challenging and potentially less effective than our proposed approach.

The space of possible prompts is discrete and combinatorial, growing rapidly with the number of prompt segments $J$ and the size of the prompt dataset $|P|$. Standard hill-climbing requires defining a notion of local neighborhood (e.g., single-segment substitutions), but this is difficult to implement in a high-dimensional, unordered prompt space.

Our method avoids the combinatorial growth by structuring the exploration space: we model each prompt slot independently, and leverage contextual bandits to generalize across similar segments. This enables scalable and sample-efficient search, while exploiting structure in the prompt space.

That said, we appreciate the suggestion. We will clarify this distinction in the revision.

W3: The adoption of transformer-encoded trajectories is stated to be beneficial, but performance seems to decrease in Table 1.

We appreciate the opportunity to clarify. First, we would like to note that we have updated Tables 1 and 2 after identifying an issue in the PDT reference codebase that affected pretraining on the ML1 Pick-Place environment. All relevant experiments have been rerun, and the updated results are included in the revision:

| Method / Env | ID Pick-Place (J=1, H=5) | ID Pick-Place (J=2, H=2) | OOD Pick-Place (J=1, H=5) | OOD Pick-Place (J=2, H=2) |
| --- | --- | --- | --- | --- |
| CQL oracle | 535.84 ± 31.02 | 535.84 ± 31.02 | 525.07 ± 60.15 | 525.07 ± 60.15 |
| PDT, no tuning | 551.58 ± 26.09 | 535.52 ± 24.86 | 502.8 ± 63.98 | 524.37 ± 39.56 |
| PDT, finetuned | N/A | N/A | 495.37 ± 57.87 | 488.17 ± 50.15 |
| Hill-climbing | 555.79 ± 22.72 | 540.15 ± 23.49 | 560.92 ± 27.04 | 544.19 ± 28.83 |
| ZORankSGD | 554.26 ± 23.00 | 537.20 ± 25.5 | 503.56 ± 66.16 | 538.05 ± 31.46 |
| ε-greedy^Ψ | 555.35 ± 24.15 | 541.32 ± 22.91 | 531.49 ± 49.86 | 552.11 ± 23.83 |
| TS^Ψ | 556.87 ± 24.11 | 540.82 ± 22.83 | 549.38 ± 36.12 | 553.12 ± 20.81 |
| UCB^Ψ | 552.68 ± 24.77 | 538.84 ± 23.14 | 506.43 ± 65.38 | 539.42 ± 31.12 |
| ε-greedy | 556.22 ± 25.16 | 541.80 ± 23.76 | 530.26 ± 51.61 | 550.06 ± 22.38 |
| TS | 556.11 ± 24.56 | 541.33 ± 22.79 | 551.33 ± 34.52 | 553.34 ± 18.82 |
| UCB | 554.92 ± 26.06 | 538.50 ± 23.64 | 512.25 ± 69.62 | 534.58 ± 35.35 |

The results are consistent with the trends observed in other environments. The relatively modest gains on ML1 Pick-Place in the ID setting are due to characteristics of the prompt dataset: prompt trajectories are short, homogeneous, and already near-optimal. This limits the available headroom for further improvement through prompt tuning.

To support our claim that the pre-trained PDT provides useful embeddings for bandit-based prompt-tuning, we report relative performance differences when using Transformer-encoded prompt segments (denoted as $\Psi$) and unencoded prompts:

| ID tasks | Half Cheetah (J=1, H=5) | Half Cheetah (J=2, H=20) | Ant (J=1, H=5) | Ant (J=2, H=20) | Pick-Place (J=1, H=5) | Pick-Place (J=2, H=2) |
| --- | --- | --- | --- | --- | --- | --- |
| Best bandit (no encoding) | -33.56 ± 13.48 | -26.28 ± 10.14 | 835.38 ± 171.25 | 816.77 ± 268.42 | 556.22 ± 25.16 | 541.80 ± 23.76 |
| Best bandit (Ψ encoding) | -36.55 ± 11.93 (-8.9%) | -26.87 ± 14.10 (-2.2%) | 815.17 ± 182.28 (-2.4%) | 819.52 ± 191.39 (+0.3%) | 556.87 ± 24.11 (+0.1%) | 541.32 ± 22.91 (-0.1%) |

| OOD tasks | Half Cheetah (J=1, H=5) | Half Cheetah (J=2, H=20) | Ant (J=1, H=5) | Ant (J=2, H=20) | Pick-Place (J=1, H=5) | Pick-Place (J=2, H=2) |
| --- | --- | --- | --- | --- | --- | --- |
| Best bandit (no encoding) | -29.26 ± 21.25 | -25.18 ± 22.17 | 480.85 ± 84.76 | 438.82 ± 125.40 | 551.33 ± 34.52 | 553.34 ± 18.82 |
| Best bandit (Ψ encoding) | -32.61 ± 19.85 (-11.4%) | -23.93 ± 14.14 (+5.0%) | 477.24 ± 84.64 (-0.8%) | 441.44 ± 80.25 (+0.6%) | 549.38 ± 36.12 (-0.4%) | 553.12 ± 20.81 (-0.0%) |

We see this as a clear positive result: it shows that PDT’s latent representations provide a compact and task-relevant embedding of trajectory segments, enabling the reward models to scale more effectively in high-dimensional domains. This is especially valuable because, without using $\Psi$, the input dimensionality of each reward model would scale as $(|\mathcal{A}| + |\mathcal{S}| + 1) \times H$. Using Transformer-derived embeddings allows us to decouple model input size from the raw environment dimensionality while preserving informative structure. This results in near-constant inference time independent of prompt size, as shown in the following table of average rollout times for representative configurations in the MuJoCo Ant environment:

| Method | Ant (J=1, H=5) | Ant (J=2, H=20) |
| --- | --- | --- |
| TS^Ψ | ~28m | ~34m |
| TS | ~33m | ~172m |

W4: Large standard deviations in Table 1.

The standard deviations in Table 1 reflect variation across many downstream tasks and random seeds, following the evaluation setup. Specifically, we report the mean and standard deviation of the final 10 rollout returns, averaged over three seeds and all tasks.
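One way to read this aggregation is as pooling the final rollout returns over tasks, seeds, and rollouts; a minimal sketch with placeholder numbers (not our actual data) is:

```python
import numpy as np

# returns[t, s, r]: final 10 rollout returns for task t and seed s (placeholder data).
returns = np.random.default_rng(0).normal(loc=550.0, scale=25.0, size=(12, 3, 10))

pooled = returns.reshape(-1)                 # pool tasks, seeds, and final rollouts
print(f"{pooled.mean():.2f} +/- {pooled.std():.2f}")
```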

We acknowledge that the earlier results for ML1 Pick-Place had particularly high variance. As noted above, we identified and fixed an issue in the reference PDT codebase affecting that environment, and reran the relevant experiments. The updated results exhibit more stable behavior and continue to support the performance trends reported.

We hope this helps clarify the source of variability and increases confidence in the robustness of our method.

Q1: Why assume simulator for policy evaluation but not improvement?

Thank you for this very insightful question. We agree -- in principle, online evaluation rollouts could be repurposed for policy improvement. In fact, this corresponds to freezing the prompt and fine-tuning the full PDT using the observed trajectories, which we include as a baseline in Table 2.

Compared to fine-tuning, bandit-based prompt selection is significantly more sample-efficient: in practice, optimal prompts are often found within the first 10–20 rollouts (see Appendix D), whereas full-model fine-tuning on the same small dataset leads to degraded performance, as shown in our results.

This is consistent with observations from the offline-to-online RL literature, where limited online data, distributional shift, and noisy gradients often hinder fine-tuning. Prompt-tuning, in contrast, leverages a strong pre-trained policy and requires only selection, rather than learning, at inference time.

We believe this makes the simulator-for-evaluation assumption reasonable in settings where rapid, low-cost adaptation is required.

Q2: Equations 2 and 3 should be moved.

Thank you for spotting this! We've updated the manuscript to move them after the relevant notation is introduced.

Comment

Thank you for your answer.

  • W3: I understand the argument on inference time, but judging from the tables you presented, I do not see improvements in performance due to encoding. Isn't there a large drop, for instance, in Half Cheetah (J = 1, H = 5)?
  • W4: I would suggest reporting standard deviations over seeds alone.
Comment

Thank you for the response!

W3: Regarding the use of Transformer-encoded prompt segments

Indeed, there is a performance drop of roughly 10% on Half-Cheetah with $J=1, H=5$ when using transformer-encoded segments, compared to using unencoded ones. This is the most noticeable drop across all experiments. In the other environments and configurations, the performance difference between encoded and unencoded prompts is consistently negligible, often within $\pm 1\%$. Thus, while Transformer embeddings do not consistently improve performance, they achieve parity in most cases, which we see as a positive result.

Our motivation for including this finding is not to claim universal performance improvements when using segment embeddings, but to illustrate the practical utility of using the pre-trained PDT as a feature extractor for downstream prompt-tuning tasks. We note that no per-environment tuning of the transformer architecture was performed, meaning that all encodings were obtained from PDTs pre-trained with default hyperparameters.

So all in all, we agree with the reviewer that transformer-encoded prompts are not always superior in terms of task return. However, we believe that their runtime and scalability benefits, combined with competitive performance, make the contribution noteworthy and valuable to the community. We will clarify our claim and emphasize the runtime benefits (as opposed to return gains) of using PDT segment embeddings in the final version of the paper.

W4: On the computation of standard deviations

Thank you for the suggestion. Computing standard deviations across seeds is indeed common practice in RL research. In our setup, we chose to report standard deviations over both seeds and tasks to reflect variability across different environment configurations, not just seed sensitivity. We acknowledge that this choice may obscure per-seed stability of the overall PDT + bandit system. However, while RL algorithms are often highly sensitive to seed variation and initialization, the PDT is trained on fixed offline data, and the bandits are relatively robust to initialization. As a result, we found that reporting standard deviations over all seeds and tasks was more descriptive of general model behavior.

We again thank the reviewer for the helpful questions and suggestions.

Comment

Thank you for your further answers. I would suggest toning down the claims on line 262 accordingly. Despite this inconsistency and the restricted selection of tasks, I still (weakly) recommend acceptance.

Comment

Thank you!

We're glad the rebuttal and discussion clarified our contributions and claims, we will revise the final version accordingly. We welcome any further questions or suggestions and appreciate the time and valuable feedback.

Official Review (Rating: 3)

This paper introduces a prompt tuning method for Prompt Decision Transformers to improve performance in solving multitask and meta-reinforcement learning (meta-RL) problems. Compared with previous methods, it proposes leveraging a Multi-Armed Bandit algorithm to select the best prompt for Prompt Decision Transformers. The paper provides a theoretical analysis of regret and presents empirical results on the HalfCheetah, Ant, Meta-World, and Sparse 2D Point environments.

Strengths and Weaknesses

Strengths:

Quality: This paper proposes to model the prompt selection as a contextual bandit problem and uses bandit algorithms to select the best prompt for the decision transformers, achieving good performance. Claims are supported by the experiments compared with baseline methods.

Clarity: The paper is well-organized in the methodology and experiments.

Significance: The paper provides a novel prompt tuning method for the prompt decision transformer in solving the RL tasks.

Originality: This paper demonstrates that prompt tuning based on the bandit algorithm can also improve the performance of prompt decision transformers.

Weaknesses:

Quality: Although the paper provides a theoretical analysis of the regret for prompt tuning, it assumes that the reward approximation error is bounded by $\epsilon$, which seems to be a strong assumption. The empirical results show that the proposed method outperforms the baselines. However, they also show that not all bandit tuning methods improve performance: the best results are scattered across different bandit algorithms in Tables 1 and 2. This observation raises the question of whether the bandit algorithm can truly improve performance. I doubt it is worth spending significant inference time for only a marginal performance gain.

Clarity: This paper spends a lot of space introducing the previous methods. It should elaborate more on the Scalable and Sample-Efficient Bandit Architecture section.

Significance: The paper admits that combinatorial explosion with increasing prompt dataset size is a limitation, which impedes practical deployment of the method.

Originality: This paper leverages the bandit-based prompt tuning method in the context of prompt decision transformers. The method is a combination of the existing techniques and does not provide any novel perspective or insights on the prompt tuning or decision transformers.

Questions

  1. Can you elaborate more on the Scalable and Sample-Efficient Bandit Architecture section? How is the contextual multi-armed bandit modeled, and how does the bandit algorithm work in the prompt tuning?
  2. Can you elaborate more on the assumption of Section 4.3? Why do you suppose the reward approximation error is bounded by $\epsilon$?
  3. Can you explain why you use so many bandit algorithms in the experiments? The empirical results show that not all the bandit algorithms can improve the performance.
  4. Can you provide the inference time of the proposed method compared with the baseline methods?
  5. Why use multi-armed bandit algorithms for prompt tuning? I think best-arm identification is more appropriate for prompt tuning. Can you explain more about this?
  6. Can you provide more baseline results, such as the Meta-DT [1]?

[1] Meta-DT: Offline Meta-RL as Conditional Sequence Modeling with World Model Disentanglement. 2024

Limitations

Yes

Final Justification

I read the rebuttal and discussion. While some clarifications were helpful, two major concerns remain: (1) missing baselines and (2) an error in the ML1 pick-place environments that affects evaluation validity. Given their impact on significance and reproducibility, I maintain my score and confidence.

Formatting Issues

No formatting concerns.

Author Response

We thank the reviewer for their thoughtful review and for raising important points about the scope, assumptions, and empirical evaluation of our method. We address each concern below and outline planned additions and clarifications to improve the manuscript accordingly.

Please note that we have updated results for the ML1-pick-place environment due to an issue in the reference implementation. The updated results can be seen in our replies to Reviewers 1Mcn and KYsS, and Tables 1 & 2 have been revised accordingly.

On the performance gain

We appreciate the opportunity to clarify both the magnitude and practical significance of our contribution. We emphasize that the performance gains from our bandit-based method are both consistent and substantial across a diverse and challenging set of environments:

  • 33/36 ID tasks and 35/36 OOD tasks (updated Table 2) show improvement over standard PDT with uniform prompt sampling.
  • All 18 cases in the 2D environment (Table and Figure 3) show consistent improvements.
  • Our method outperforms ZORankSGD, the strongest inference-time tuning baseline to date, across all environments.

To illustrate this, we include a summary of representative settings:

| Method | HC (J=1, H=5) | HC (J=2, H=20) | Ant (J=1, H=5) | Ant (J=2, H=20) | Pick-Place (J=1) | Pick-Place (J=2) | 2D (J=1) | 2D (J=2) | 2D (J=5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PDT | -64.8 ± 36.9 | -40.9 ± 43.2 | 363.9 ± 105.4 | 360.1 ± 72.4 | 502.8 ± 64.0 | 524.4 ± 39.6 | 0.0 ± 2.1 | 6.3 ± 0.8 | 8.3 ± 0.6 |
| ZORankSGD | -59.9 (+7.6%) | -36.6 (+10.6%) | 383.6 (+5.4%) | 340.7 (-5.4%) | 503.6 (+0.2%) | 538.1 (+2.6%) | -0.6 | 4.4 (-30.2%) | 3.1 (-62.7%) |
| Bandit (best) | -29.3 (+54.8%) | -23.9 (+41.6%) | 480.8 (+32.1%) | 441.4 (+22.6%) | 551.3 (+9.7%) | 553.3 (+5.5%) | 9.9 | 9.9 (+57.1%) | 9.8 (+18.1%) |

In many cases, our method lifts performance from suboptimal to near-oracle levels, especially in high-variance tasks. Furthermore, Figure 3 demonstrates the qualitative gap: while uniform sampling fails to identify informative demonstrations, our method rapidly concentrates on high-quality prompts, significantly improving downstream return.

Regarding efficiency, our method:

  • Requires no weight updates to the PDT backbone.
  • Performs no backpropagation.
  • Operates entirely in inference mode, with optional embedding caching.

This makes the runtime burden low. Full rollout time comparisons are included in Q4. In summary, our approach offers strong empirical gains at minimal additional cost, making it a practical solution for high-stakes or resource-constrained deployment.

Clarity of the presentation

We appreciate the suggestion. In the revision, we will enhance Section 4.2 with additional commentary and examples to make the bandit architecture and prompt selection process more transparent. In particular, we will emphasize the full algorithm pseudocode in Appendix A and, if necessary, reallocate space from the Related Work section to maintain clarity and make the algorithm more accessible.

On the practicality in real-world deployment

Indeed, the combinatorial growth of the prompt space poses a challenge for naïve prompt selection strategies. However, our method is explicitly designed to address this issue. By decomposing prompt selection into slot-wise decisions and modeling the reward of each segment independently, we reduce the problem from combinatorial to linear in the number of prompt slots $J$.

We validate this scalability across high-dimensional environments such as MuJoCo Ant (27 states, 8 actions), HalfCheetah (20 states, 7 actions), and Meta-World ML1 (39 states, 3 actions), where our method consistently improves performance over PDT and prompt-tuning baselines. We further demonstrate that our bandit scales efficiently to prompt spaces with up to $10^5$ candidates, far exceeding the size of typical PDT prompt datasets.

We believe these results show that our method not only addresses the combinatorial issue in theory but also scales in practice, enabling deployment in realistic offline RL settings.

On the novelty

While grounded in existing components, our method introduces several distinct and novel contributions:

  • We are, to our knowledge, the first to frame inference-time prompt selection in PDT as a structured contextual bandit problem.
  • Our architecture uses independent reward models per prompt position, avoiding combinatorial explosion and enabling parallel, modular learning.
  • It leverages PDT-derived embeddings to enable fast learning even in high-dimensional or pixel-based settings.

These components form a new class of inference-time adaptive prompt tuning, distinct from prior work in finetuning or gradient-based selection.

We will revise both the Introduction and Related Work sections to highlight these conceptual distinctions and novel contributions more clearly.

Q1: Method elaboration

Our method maintains $J$ independent reward models $\phi_1, \dots, \phi_J$, where each model scores prompt segments for a specific slot position. This is motivated by the observation that prompts in PDT are structured concatenations of $J$ segments.

As shown in Figure 1 and described in Section 4.2:

  • Each $\phi_j$ is trained to predict returns conditioned on the segment occupying slot $j$.
  • This decomposition ensures modularity and linear scaling, reducing the complexity of learning over full prompts.
  • During inference, a candidate prompt $\rho_k$ is selected via exploration or exploitation and passed to the frozen PDT for rollout.
  • The rollout return $G(\rho_k)$ is then used as feedback, and each $\phi_j$ is updated with the segment it contributed.

Full procedural details are available in Appendix A, Algorithms 1-3.
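For illustration, a condensed sketch of this selection-and-update loop is given below. It is a minimal stand-in for Algorithms 1-3, not the exact implementation: it assumes an $\epsilon$-greedy strategy, linear reward models in place of the small MLPs, cached segment embeddings, and a placeholder `rollout_return` function that rolls out the frozen PDT with the chosen prompt.

```python
import numpy as np

def prompt_tuning_bandit(segment_embeddings, J, rollout_return,
                         n_rounds=50, epsilon=0.1, lr=0.05, seed=0):
    """Slot-wise epsilon-greedy prompt tuning with one linear reward model per slot.

    segment_embeddings: (N, d) array of cached candidate-segment embeddings.
    rollout_return: callable taking a list of J segment indices (the prompt) and
                    returning the scalar return of the frozen PDT rolled out with it.
    """
    rng = np.random.default_rng(seed)
    N, d = segment_embeddings.shape
    weights = np.zeros((J, d))                        # one reward model phi_j per slot

    for _ in range(n_rounds):
        chosen = []
        for j in range(J):                            # pick one segment per slot
            if rng.random() < epsilon:
                chosen.append(int(rng.integers(N)))   # explore
            else:
                scores = segment_embeddings @ weights[j]
                chosen.append(int(np.argmax(scores))) # exploit
        G = rollout_return(chosen)                    # single scalar feedback for the prompt
        for j, idx in enumerate(chosen):              # credit the return to every slot
            x = segment_embeddings[idx]
            weights[j] += lr * (G - x @ weights[j]) * x   # SGD step on squared error

    # return the greedy prompt under the learned per-slot reward models
    return [int(np.argmax(segment_embeddings @ weights[j])) for j in range(J)]
```

Thompson Sampling or UCB would replace the $\epsilon$-greedy branch with posterior sampling or confidence bonuses over the same per-slot models.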

Q2: On the bounded $\varepsilon$ assumption

The interaction term $h(\tilde{\tau}_1, \dots, \tilde{\tau}_J)$, introduced in Equation 5, captures residual dependencies between prompt segments that are not modeled by the additive architecture.

While exact theoretical bounds are intractable for real PDTs, we offer empirical support in Section 5.3 using a synthetic benchmark. There we find:

  • The residual term is small and stable across a range of segment pool sizes and prompt lengths.
  • The regret bound in our theorem holds empirically.

Furthermore, we argue that prompt segment informativeness is additive in practice. This is due to:

  • The ability to infer task identity from a few key $(\mathbf{s}, \mathbf{a})$ pairs.
  • The random and unordered segment sampling during PDT pre-training, which precludes reliance on inter-segment dependencies.

These points support the assumption that $|h(\cdot)| \le \varepsilon$ holds in both theory and practice. We will add a discussion of this reasoning in the revision of Section 4.3.

Q3: On the bandit exploration strategies

We included multiple exploration strategies (UCB, Thompson Sampling, and $\epsilon$-greedy) to demonstrate that the architecture is robust across exploration mechanisms and that the gains are not due to a specific heuristic.

Results:

  • Across 90 total experiments, our method improves over PDT in 86 cases.
  • Even in the few settings where a strategy underperforms, the others compensate, which shows architectural robustness.
  • Thompson Sampling delivers the most consistent gains and is our default choice.

This broad-based evaluation allows users to choose strategies that best match environment properties (e.g. reward sparsity, task diversity).

Q4: On the inference time

Our method is computationally lightweight by design:

  • It operates entirely at inference time.
  • Requires no gradient updates or fine-tuning of PDT.
  • Trains only small MLP-based reward models.
  • When using PDT embeddings, segment encodings can be cached and reused.

Below is the wall-clock time for representative Ant configurations (equal rollout iteration counts):

| Method | Ant (J=1, H=5) | Ant (J=2, H=20) |
| --- | --- | --- |
| PDT | ~28m | ~38m |
| HillClimb | ~16m | ~24m |
| ZORankSGD | ~52m | ~120m |
| TS^Ψ | ~28m | ~34m |
| TS | ~33m | ~172m |

These numbers confirm that our approach is either competitive or significantly faster than prior baselines while achieving stronger performance. We will include these comparisons in the updated paper to help contextualize our method’s cost-effectiveness.

Q5: On CMAB vs. best-arm identification

Best-arm identification (BAI) assumes a fixed reward distribution per arm, which is unsuitable for prompt selection:

  • Prompt quality varies dynamically with the rollout context.
  • Rewards are stochastic and contextual, influenced by both task identity and trajectory history.
  • CMABs enable conditioning on rich segment embeddings and incorporate contextual similarity, allowing effective generalization across prompts.
  • Our structured CMAB architecture ensures scalability via linear slot-based decomposition.

We will discuss these differences explicitly and clarify our methodological choice.

Q6: On the Meta-DT baseline

We appreciate the reviewer highlighting Meta-DT as a relevant baseline. Due to its architectural and training complexity, we could not incorporate it within the rebuttal timeframe. However, we see Meta-DT as highly complementary: it augments PDT with a contextualized world model, and we believe our bandit could be applied on top of it for additional gains. We plan to explore this integration in future work.

Comment

Thank you for the detailed response. It addresses some of my concerns. I still have some questions.


General: What does this ‘Please note that we have updated results for the ML1-pick-place environment due to an issue in the reference implementation.’ mean? Why is the score in the main paper around 2000, and the score 500 in the updated tables?

Inference Time: Can you elaborate on why this method can improve the inference time compared to the PDT?

Performance: Since this manuscript misses some important baselines, the performance improvement is hard to evaluate comprehensively.

Comment

We thank the reviewer for the time allocated to our rebuttal and for engaging in the discussion. We are happy that our rebuttal has answered some of the questions and appreciate the opportunity to elaborate further:

On the updated ML1-pick-place results

The initial results were obtained using the ML1 pick-place environment and evaluation code from the reference PDT codebase, where the maximum episode length was set to 500. Upon closer inspection, we found that the offline dataset had a maximum episode length of 100, resulting in a mismatch with the online environment configuration. This discrepancy allowed the agent to repeatedly achieve the success condition within a single rollout, introducing high return variance. After correcting the episode length to 100, the results more accurately reflect single-attempt performance. Under these conditions, our method still matches or exceeds both PDT and the single-task oracle (CQL baseline), demonstrating strong performance on this environment.

On the inference time

In the inference time table posted above in response to Q4, PDT refers to full model fine-tuning, which is computationally intensive due to the need to update a large number of parameters. In contrast, our method updates only the lightweight bandit MLP reward models, leaving the PDT model weights untouched. While the bandit component introduces a small computational overhead, it is negligible relative to updating the full backbone and is well justified by the consistent performance gains observed across all benchmarks.

On the used baselines

Our experiments aim to provide a thorough comparison against the most relevant and well-established baselines:

  • Conservative Q-Learning [1] serves as an offline RL single-task expert oracle, and our method frequently matches or exceeds its performance;
  • PDT [3] without tuning is a critical baseline, as our method directly addresses its main limitation: uninformed prompt selection. We show that our approach substantially and consistently improves PDT's performance;
  • ZORankSGD [2], the most widely used prompt-tuning baseline, is also included. We show that our method not only outperforms it across multiple benchmarks, but also offers additional advantages in sample efficiency, applicability across both discrete and continuous domains, and is supported by theoretical guarantees;
  • We additionally include prompt-space hill climbing as an informative baseline for ablation.

Furthermore, we evaluate our method on diverse and established multi-task offline RL benchmarks [4-6] and demonstrate consistent improvements across various tasks and settings. We also performed a qualitative analysis to attribute performance gains to effective prompt-space exploration. While we believe the current set of baselines is comprehensive, we remain open to exploring additional baselines that may highlight underexplored or complementary dimensions of performance.

  1. Kumar et al., Conservative q-learning for offline reinforcement learning
  2. Hu et al., Prompt-Tuning Decision Transformer with Preference Ranking
  3. Xu et al., Prompting Decision Transformer for Few-Shot Policy Generalization
  4. Yu et al., Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
  5. Fu et al., D4RL: Datasets for Deep Data-Driven Reinforcement Learning
  6. Finn et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Official Review (Rating: 4)

In this paper, the main goal is to improve prompt tuning for Decision Transformers (DTs) in offline reinforcement learning. The authors propose a bandit-based prompt-tuning method. This new approach can select optimal trajectory prompts at inference time using a structured bandit framework and scales linearly with prompt size. The empirical evaluations are conducted over benchmarks and demonstrate the applicability of the model in real-world scenarios.

Strengths and Weaknesses

Strengths:

  1. The paper is well-organized, and the illustrative example and figures are easy to understand
  2. The paper addresses an important research problem in the field of offline RL via DT.
  3. The use of the bandit algorithm in the prompt-tuning procedure is an interesting idea.
  4. The authors conduct a comprehensive set of experiments to evaluate the performance.

Weaknesses:

  1. The motivation for using DT in offline RL problems and the introduction to the underlying DT framework need to be further explained.

  2. The selection of the prompt is heavily dependent on the online return. How can the quality of the estimated reward and its causal correlation with the bandit selection be ensured?

  3. In general, does the independence of the segments negatively impact the prompt reward?

  4. In the case of large $J$, how do you address scalability and the weak signal in the bandit arm-selection procedure?

  5. The regret analysis is only for the bandit algorithm, but not for the optimal generalized policy.

  6. As the authors discussed OOD, what are the technical insights of the proposed model that perform well in this case?

Questions

see strengths and weaknesses

Limitations

see strengths and weaknesses

Final Justification

I am grateful for the authors' detailed responses.

Formatting Issues

N/A

Author Response

We thank the reviewer for their thoughtful comments and constructive feedback. We are particularly grateful for their recognition of the significance of our method and the value of the empirical insights presented. In the following, we address each of the reviewer’s concerns and questions point by point:

W1. Clarify motivation.

We appreciate the feedback and will revise the Introduction and Background sections to improve clarity and motivation. We will highlight why PDT is effective for offline, multi-task pre-training without relying on policy iteration or value functions. We will also emphasize how prompting supports fast, reliable adaptation to downstream tasks compared to full-model fine-tuning, which is more sensitive to distribution shift.

W2. Causal relation between bandit and PDT.

We thank the reviewer for the comment and would like to clarify that the reward feedback for the bandit is the actual return obtained by the pre-trained PDT when rolled out with a given prompt. Since this is the performance metric we aim to maximize, it provides a direct and causally grounded reward signal for the bandit.

The goal of the bandit is to identify the prompt that elicits the highest return as efficiently as possible, minimizing regret over rollout budget. The feedback signal (i.e., return) is thus fully aligned with the optimization objective, and does not rely on proxy metrics or estimated values. As such, we believe the return is both reliable and causally valid for guiding prompt selection.

We hope this addresses the concern, and we welcome further clarification if we’ve misunderstood the intent of the question.

W3. Does the assumption of independent segments negatively impact the prompt reward?

Thank you for raising this important point. Our method assumes that prompt segments contribute independently to the return -- an assumption that could, in theory, lead to performance degradation if inter-segment interactions were critical. However, both the design of PDT and our empirical analysis support the validity of this assumption in practice.

First, as described in the Preliminaries, PDT is pre-trained using independently sampled prompt segments drawn from expert trajectories. No ordering or structural constraints are enforced across segments. As a result, the model is unlikely to rely on cross-segment dependencies and instead learns to attend to individually informative $(\mathbf{s}, \mathbf{a})$ pairs.

To test this empirically, we conducted an additional study in the 2D environment:

  • For each task, we constructed prompts containing one highly informative segment and $J - 1$ less informative ones, then permuted their order across rollouts.
  • PDT performance remained stable, with a return of $6.27 \pm 0.34$, indicating permutation invariance.
  • Analyzing attention weights, we found that in 99% of rollouts, PDT focused on the informative segment, regardless of position.
  • When we masked attention to prompt segments, performance dropped sharply if the informative segment was excluded, but remained near-optimal when less informative segments were masked out:
| Attention Scope | Return |
| --- | --- |
| Full prompt | 6.27 ± 0.34 |
| Only informative segment | 5.88 ± 0.23 |
| Only uninformative segments | -34.06 ± 4.49 |

These results suggest that PDT behavior is largely driven by the most informative segment, and that inter-segment interactions are negligible in practice.

Finally, in Section 5.3, we explicitly measure the interaction term $h(\cdot)$, which quantifies the deviation from full segment independence, and find that it remains small across problem sizes. Combined with strong empirical performance across tasks, we believe this offers strong support for the independence assumption underlying our bandit design.

W4. How to address the bandit scalability with large $J$?

Scalability with respect to the number of prompt segments $J$ is a central consideration in our design, since the total prompt space grows combinatorially with $J$. Our method addresses this through two key architectural choices:

  1. Structured decomposition: We maintain $J$ independent reward models, one per prompt slot. This reduces the effective search space from combinatorial in $J$ to linear, making the exploration problem significantly more tractable.
  2. Contextual modeling: Our use of a contextual multi-armed bandit allows the model to exploit feature-based similarities between prompt segments. This shared representation improves generalization across arms, which is particularly important in settings with large $J$ or sparse downstream rewards.

We further demonstrate in Section 5.3 that our architecture remains effective at scale. In synthetic experiments, our method handles prompt spaces with up to $|P|^J = 10^5$ combinations and maintains sublinear regret, validating both the scalability and sample efficiency of the approach.

In summary, while larger $J$ naturally increases the search space, our structured and contextual bandit design is specifically built to address this challenge.

W5. No regret analysis for the optimal policy.

Indeed, the formal regret bound we present is defined with respect to the bandit's ability to identify high-return prompts. However, because the bandit selects prompts that are passed directly to the frozen PDT, the regret incurred by the bandit translates directly into regret in the downstream return of the policy.

Thus, our regret analysis characterizes how efficiently the combined system of bandit + fixed PDT converges to optimal behavior in the prompt-tuning setting, which is the focus of our contribution.

W6. Why does the proposed method perform well OOD?

Our main insight is that bandit-based prompt-tuning identifies optimal trajectory prompts reliably and efficiently, without requiring prior knowledge of task structure, reward function, or dynamics. This is especially useful OOD, where knowledge about the task is often limited.

Empirically, we find that our bandit-based prompt-tuning consistently and considerably outperforms standard PDT and ZORankSGD prompt-tuning in OOD settings, as shown in the following table:

| Method | Half Cheetah (J=1, H=5) | Half Cheetah (J=2, H=20) | Ant (J=1, H=5) | Ant (J=2, H=20) | Pick-Place (J=1, H=5) | Pick-Place (J=2, H=2) |
| --- | --- | --- | --- | --- | --- | --- |
| PDT | -64.78 ± 36.91 | -40.95 ± 43.19 | 363.90 ± 105.42 | 360.07 ± 72.36 | 502.80 ± 63.98 | 524.37 ± 39.56 |
| ZORankSGD | -59.85 ± 32.37 (+7.6%) | -36.60 ± 19.45 (+10.6%) | 383.57 ± 193.35 (+5.4%) | 340.68 ± 44.60 (-5.4%) | 503.56 ± 66.16 (+0.2%) | 538.05 ± 31.46 (+2.6%) |
| Bandit (best) | -29.26 ± 21.25 (+54.8%) | -23.93 ± 14.14 (+41.6%) | 480.85 ± 84.76 (+32.1%) | 441.44 ± 80.25 (+22.6%) | 551.33 ± 34.52 (+9.7%) | 553.34 ± 18.82 (+5.5%) |

Note that we have updated the results for the ML1 Pick-Place environment after identifying an issue in the official PDT codebase that affected PDT pretraining on this benchmark. The corrected results are now included in the revised tables and are fully consistent with the trends observed on other environments:

| Method / Env | ID Pick-Place (J=1, H=5) | ID Pick-Place (J=2, H=2) | OOD Pick-Place (J=1, H=5) | OOD Pick-Place (J=2, H=2) |
| --- | --- | --- | --- | --- |
| CQL oracle | 535.84 ± 31.02 | 535.84 ± 31.02 | 525.07 ± 60.15 | 525.07 ± 60.15 |
| PDT, no tuning | 551.58 ± 26.09 | 535.52 ± 24.86 | 502.8 ± 63.98 | 524.37 ± 39.56 |
| PDT, finetuned | N/A | N/A | 495.37 ± 57.87 | 488.17 ± 50.15 |
| Hill-climbing | 555.79 ± 22.72 | 540.15 ± 23.49 | 560.92 ± 27.04 | 544.19 ± 28.83 |
| ZORankSGD | 554.26 ± 23.00 | 537.20 ± 25.5 | 503.56 ± 66.16 | 538.05 ± 31.46 |
| ε-greedy^Ψ | 555.35 ± 24.15 | 541.32 ± 22.91 | 531.49 ± 49.86 | 552.11 ± 23.83 |
| TS^Ψ | 556.87 ± 24.11 | 540.82 ± 22.83 | 549.38 ± 36.12 | 553.12 ± 20.81 |
| UCB^Ψ | 552.68 ± 24.77 | 538.84 ± 23.14 | 506.43 ± 65.38 | 539.42 ± 31.12 |
| ε-greedy | 556.22 ± 25.16 | 541.80 ± 23.76 | 530.26 ± 51.61 | 550.06 ± 22.38 |
| TS | 556.11 ± 24.56 | 541.33 ± 22.79 | 551.33 ± 34.52 | 553.34 ± 18.82 |
| UCB | 554.92 ± 26.06 | 538.50 ± 23.64 | 512.25 ± 69.62 | 534.58 ± 35.35 |

While all prompt-tuning methods show only modest performance gains on the in-distribution tasks for ML1 Pick-Place, we attribute this to the nature of the dataset: the publicly available prompt trajectories are short, homogeneous, and densely clustered, leaving limited headroom for prompt optimization.

We believe that the updated results strengthen the overall conclusions and confirm that our method remains effective and competitive across diverse settings, both on ID and OOD tasks.

Comment

With the discussion period ending soon, we would like to thank the reviewer again for their time and effort.

We believe to have addressed all concerns raised in the initial review, and hope the rebuttal has clarified our contributions and resolved the reviewer's questions. If there are any further questions, we would be happy to discuss them before the deadline. We value the reviewer's feedback and will incorporate it into the final manuscript as discussed above.

Final Decision

This paper introduces a bandit-based prompt-tuning framework for Decision Transformers (DTs) in offline Reinforcement Learning, addressing the limitations of uniform prompt sampling in the Prompting Decision Transformer (PDT). The method learns to construct informative trajectory prompts at inference time, using a structured bandit architecture that scales linearly with prompt size rather than combinatorially. Leveraging the pre-trained PDT as a feature extractor, the approach enables efficient reward modeling and decision-making across diverse environments. The authors provide theoretical regret guarantees and empirically demonstrate significant performance gains over existing prompt-tuning baselines across multi-task, high-dimensional, and out-of-distribution RL settings.

During the rebuttal phase, the authors successfully addressed most of the reviewers’ concerns. While one reviewer noted the absence of a comparison with the Meta-DT baseline, the authors explained that this baseline could not be reproduced within the rebuttal period. Since the paper already compares against other important baselines and demonstrates effectiveness, we strongly encourage the authors to include Meta-DT in the final version. In addition, one reviewer pointed out that aggregating results across environments using standard deviation captures cross-task variability but does not reflect confidence intervals on mean performance. We therefore encourage the authors to adopt more informative aggregation methods and evaluate on additional environments in the final version.

Despite these limitations, the work demonstrates novelty and makes valuable contributions to the community. Taken together, we have decided to accept this paper.