We greatly appreciate the reviewers for their constructive comments and perceptive suggestions. We will further explain the meaning of the reward shaping design in Figure 12 on page 23:

For Frozen Lake, the original reward schedule specifies that reaching the goal gives +1, while reaching a hole or frozen area gives 0. We consider that falling into a hole, as a failure scenario, should be penalized, so we adjusted the reward for falling into a hole to -1. This is equivalent to giving the model a prompt: reach the goal under the safest conditions. Additionally, in the non-slippery setting, Frozen Lake becomes a navigation problem where the optimal strategy is to reach the goal as quickly as possible without falling into holes; thus, we added a step-wise cost, meaning that reaching a frozen area yields a penalty of -0.05. At this point, the prompt becomes: reach the goal as quickly as possible.
For Cliff Walking, the original reward is such that each time step yields -1, unless the player steps onto the cliff, in which case it yields -100. Since the rewards in our training data range from -1 to 1, a reward of -100 is excessively large and thus requires normalization—this is also common in RL algorithms. For example, in Stable Baselines3, the VecNormalize tool can be used for such numerical processing.
For Pendulum, . First, we performed value normalization, and then we clipped the excessively low rewards. The purpose of the Pendulum environment is to upright the pole, which requires first swinging left and right at the bottom to accelerate and gain kinetic energy—this means it must first go through a period where the reward gradually decreases. To encourage the algorithm to explore, we set a lower bound for the reward, and this is the reason for the clipping operation.
For Switch, the original reward scheme is as follows: each agent receives a reward of +5 for reaching its home cell. The episode terminates when both agents have reached their home cells or upon hitting a maximum of 100 steps in the environment. In addition to value normalization, we added rewards to each agent: +0.08 for approaching the target, -0.12 for moving away from the target, and -0.04 for remaining stationary. We then combined the rewards of all agents and provided them to the agents. The purpose of this design is to clarify that there is a reward for approaching the target while introducing a short-term reward bias—if both agents act greedily, neither will reach the goal. Surprisingly, we found that one of the agents was able to use the most greedy approach to reach the goal in the minimum number of steps, while the other agent learned to give way first, adopting a strategy of sacrificing itself in the short term for the greater good.