PolyVoice: Language Models for Speech to Speech Translation
In this study, we introduce PolyVoice, a language model-based framework designed for S2ST systems.
Abstract
Reviews and Discussion
This paper introduces a language model approach to speech-to-speech translation. The approach uses three models: one to map source semantic units to target semantic units, one to predict the duration of target semantic units, and one to predict target acoustic units given source acoustic units and target semantic units with duration information. The semantic units are derived using HuBERT (Hsu et al., 2021), while the acoustic units are derived using SoundStream (Zeghidour et al., 2021). Unlike VALL-E X (Zhang et al., 2023b), which uses phonetic units, this paper uses semantic units, enabling the approach to be extended to unwritten languages. The key novelty of this work is its use of semantic units instead of phonetic units, and its use of a decoder-only architecture that enables prompting. On the EMIME Chinese-English and CVSS English-Spanish speech-to-speech benchmarks, the proposed approach achieves similar translation results to VALL-E X, but with a large improvement in speech naturalness. When provided with ground-truth target texts, the proposed approach performs clearly worse in translation quality than VALL-E X, but still better in naturalness. The paper also reports a result where the model is used in an unwritten-language scenario for English-to-Spanish translation.
Strengths
- Proposes a language model approach for speech-to-speech translation that uses semantic units instead of phonetic units, making it usable for unwritten languages.
- Makes use of decoder-only architectures which enable effective prompting.
- Reports ablation studies showing advantages of decoder-only over encoder-decoder architecture when using the same training data.
- Demonstrates advantages of training the model on data from multiple tasks including ASR and MT.
- Shows improvements in naturalness over VALL-E X in a zero-shot setting.
- Performs ablation studies to show the utility of each model component.
Weaknesses
- It would have been preferable to report performance on a low-resource target language in an unwritten scenario. Such a setup might reveal additional challenges which are not present in a high-resource language such as Spanish.
Questions
- 4.1.1: For the Chinese->English task, what is the size of the semantic unit inventory for Chinese and English?
- Can the semantic unit inventory be shared between the source and target sides?
We sincerely appreciate your positive review and valuable feedback. We hope our response completely addresses any concerns you may have.
Feedback on the weakness:
- [About the unwritten language scenario]
As data for unwritten languages is still very limited in the literature, we follow the practice of previous works, e.g.
Lee A, Gong H, et al. Textless speech-to-speech translation on real data. In Proc. NAACL 2022,
Lee A, Chen P J, et al. Direct speech-to-speech translation with discrete units. In Proc. ACL 2022,
Rongjie Huang, Jinglin Liu, et al. TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation. In Proc. ICLR 2023,
which use a written language with transcriptions discarded to simulate an unwritten scenario.
Feedback on the questions:
- [About the size of the semantic unit inventory]
In our core experiments, we use separate unit tokenizers for Chinese and English. In this case, their semantic units are not shared. The size of the semantic unit set is 500 for Chinese and 1000 for English. For more details, please refer to Section 4.1.1.
- [Can the semantic unit inventory be shared]
Yes, if we use the same tokenizer for both languages. In our ablation study (Section 5.3), we train a shared tokenizer for Chinese and English (mHuBERT_zh_en). In this case, the semantic unit set is shared between the two languages.
The proposed speech-to-speech (S2S) translation system consists of three models: a translation model, a duration model, and a unit-to-speech model. The novelty of this work lies in making all of these models decoder-only (whereas some prior work preferred encoder-decoder architectures) and in combining the three models to perform unit-based S2S translation.
Strengths
The system design (using three decoder-only models) seems sound and worth investigating, although I'm not 100% on board with motivating it with the rise of GPT - there is more to the success of LLMs than being decoder-only models. Anyways, the main results (Table 2) look solid. The system description seems clear superficially, but there are some core open questions (see weaknesses).
Weaknesses
I couldn't get a good sense of the training data - specifically how the different prompts from Table 1 are used to synthesize data, and the size of the synthesized dataset: is it the 44M sentences from Table 7 in the appendix, or more because multiple prompts are used? How does the training data compare to the baselines?
My main concern would be that the ablation studies are not effective for disentangling the many design choices and the many moving parts of the whole architecture. The encoder-decoder vs decoder-only comparison is a good start, but I still don't have a good sense about how well the synthetic data generation works, and how well each of the three models do in isolation. Possible interesting ablations would be removing the duration model, replacing u-xlm with an out-of-the-box asr/mt cascade, removing source speech dependency from u-slm, passing through n-bests between the models, etc. I'm not requesting that all possible ablations should be included, but giving a little bit more color to the paper story would make it much stronger.
Questions
See weaknesses:
- Could you give more details on how the prompts are used
- Could you give more details on the synthetic training data
- Have you considered some of the ablation studies mentioned above?
- [More ablation studies]
We thank the reviewer for the constructive suggestions. In fact, we have already conducted some of the suggested ablation studies. In this work, we performed the following analyses to evaluate the performance of individual modules.
For the evaluation of the U-XLM module, we fixed the U-SLM and tested different translation modules, as shown in Table 4, including a comparison with S2UT, which uses an encoder-decoder structure.
For the evaluation of the U-SLM module, we fixed the U-XLM and tested different synthesis modules, e.g. the comparison between U2S and the Unit-based Vocoder. The results are reported in Table 2.
In Table 6, we have analyzed the impact of removing the duration module on the performance of speech synthesis.
As for the suggestion of replacing U-XLM with an out-of-the-box ASR/MT cascade, since the existing ASR->MT cascade systems do not support speech unit prediction, it is not easy to perform end-to-end evaluation of the cascade system with U-SLM as the back-end synthesis module. We will try our best to implement a unit-based cascade system and report the results.
We thank the reviewer for the constructive and valuable feedback, and we hope our response fully resolves your concern.
- [How the prompts are used]
We apologize for the unclear presentation of how the prompts are used to synthesize data. Here is our explanation; we will revise the paper accordingly.
Table 1 shows the prompt templates for different data types. In fact, each data type has its own instruction set from which prompts are drawn. Here we list some examples of our synthetic training data:
Sample1: Translate Chinese text " 顾客:打扰一下,是在这里排队吗? " to English text: Customer: Excuse me, do I queue here?
Sample2: Chinese text " 顾客:打扰一下,是在这里排队吗? " in English text: Customer: Excuse me, do I queue here?
Sample3: Translate the following sentence, "顾客:打扰一下,是在这里排队吗?" from Chinese to English: Customer: Excuse me, do I queue here?
Sample4: Translate Chinese unit "<zh_21><zh_161><zh_155>...<zh_266><zh_199><zh_16>" to Chinese text: 你带着现金离开,减去3%左右的费用。
Sample5: Translate English text "douglas mcgray is going to be our guide you walk through the door, you see the red carpeting, you see someone in a suit. they may be greeting you." to English unit: <293><63><662>...<6><407><334>
Sample6: Translate Chinese unit "<zh_16><zh_37><zh_111>...<zh_266><zh_199><zh_16>" to English unit: <499><334><226>...<544><991><39>
By utilizing diverse construction templates or prompts, the amount of training data can be significantly increased. This is also a mainstream practice in training large language models (LLMs), where more diverse instructions improve model robustness. “44M sentences” refers to the count of the original machine translation (MT) parallel sentences; these are then multiplied by different instruction prompts selected from the instruction set.
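To make the expansion concrete, here is a minimal sketch (in Python) of how a single MT sentence pair could be multiplied by sampling from an instruction set. The template strings, function name, and sampling scheme are illustrative assumptions, not the exact ones used in the paper:

```python
import random

# Hypothetical instruction set for the MT data type; the real templates follow Table 1.
MT_TEMPLATES = [
    'Translate {src_lang} text "{src}" to {tgt_lang} text: {tgt}',
    '{src_lang} text "{src}" in {tgt_lang} text: {tgt}',
    'Translate the following sentence, "{src}" from {src_lang} to {tgt_lang}: {tgt}',
]

def expand_mt_pair(src: str, tgt: str, src_lang: str = "Chinese",
                   tgt_lang: str = "English", n: int = 3) -> list:
    """Turn one parallel sentence pair into up to n prompted training samples."""
    templates = random.sample(MT_TEMPLATES, k=min(n, len(MT_TEMPLATES)))
    return [t.format(src=src, tgt=tgt, src_lang=src_lang, tgt_lang=tgt_lang)
            for t in templates]

# Each of the 44M MT sentence pairs is expanded with prompts drawn in this fashion.
samples = expand_mt_pair("顾客:打扰一下,是在这里排队吗?",
                         "Customer: Excuse me, do I queue here?")
```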
- [Training data comparison with the baseline]
In our paper, we use VALL-E X as our baseline as it is the work closest to ours. Regarding the comparison of training data, please refer to the following table, where WenetST denotes the end-to-end speech translation training data expanded from WenetSpeech by the authors of VALL-E X.
| System | VALL-E X Trans (SpeechUT + VALL-E X) | PolyVoice |
| --- | --- | --- |
| ASR data | LibriLight (60k hrs), WenetSpeech (10k hrs) | LibriLight (60k hrs), In-house (60k hrs) |
| MT data | 73M sentences | 44M sentences |
| ST/S2S data | WenetST (10k hrs), GigaST (10k hrs) | WenetS2S (10k hrs), GigaS2S (10k hrs) |
We have incorporated an additional in-house ASR dataset, specifically a Chinese ASR dataset, into the training of our U-XLM module. The primary objective behind using this dataset is to enhance the performance of our translation module. As a result of this integration, we have observed a notable improvement of 1-2 points in BLEU scores for the entire system. This increase offsets the performance loss incurred by employing unsupervised discretized units instead of phonemes. Although we do not release this particular ASR dataset, we believe a dataset with similar data volume can reproduce our results.
PolyVoice is a framework for building speech-to-speech translation systems with a language modeling (decoder-only) approach as an alternative to the sequence-to-sequence (encoder-decoder) architecture. The authors show that this is indeed possible given a combination of such LMs, i.e., a translation LM, a duration LM, and a speech synthesis LM. Each of the three models uses unsupervised semantic and acoustic units (a minimal sketch of how they chain together follows the list below).
- Translation LM - Uses source semantic units derived from HuBERT to predict target semantic units
- Duration LM - Uses source and target semantic units with source durations to predict target unit durations
- Speech Synthesis LM - Uses source and target semantic units with source acoustic units to predict target acoustic units
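As a rough sketch of how these pieces connect at inference time (Python, with illustrative function signatures that are assumptions rather than the authors' actual API):

```python
from typing import Callable, List, Sequence

def s2st_pipeline(
    src_semantic_units: Sequence[int],
    src_durations: Sequence[int],
    src_acoustic_codes: Sequence[int],
    translation_lm: Callable[..., List[int]],   # U-XLM
    duration_lm: Callable[..., List[int]],      # duration LM
    synthesis_lm: Callable[..., List[int]],     # U-SLM
    codec_decoder: Callable[[Sequence[int]], bytes],
) -> bytes:
    # 1) Translation LM: source semantic units -> target semantic units
    tgt_units = translation_lm(src_semantic_units)
    # 2) Duration LM: one duration per target semantic unit, conditioned on
    #    source/target units and source durations
    tgt_durations = duration_lm(src_semantic_units, tgt_units, src_durations)
    # 3) Expand target units by their predicted durations
    expanded_units = [u for u, d in zip(tgt_units, tgt_durations) for _ in range(d)]
    # 4) Speech synthesis LM: predict target acoustic units, prompted with the
    #    source acoustic units so the source speaker's voice carries over
    tgt_acoustic = synthesis_lm(src_semantic_units, src_acoustic_codes, expanded_units)
    # 5) A codec decoder / vocoder turns the acoustic units back into a waveform
    return codec_decoder(tgt_acoustic)
```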
The authors show competitive performance on EMIME (Chinese-English) and CVSS (English-Spanish) compared to the methods proposed in VALL-E X. They also compare their work to the current SoTA seq2seq approach (Lee et al.) and show zero-shot performance on the dev-clean set of LibriSpeech.
Overall, the paper's main contribution is demonstrating that a decoder-only architecture is sufficient for building a speech-to-speech translation system with unsupervised semantic and acoustic units.
Strengths
- The proposed framework is novel in its approach towards speech-to-speech translation in that it uses decoder-only models.
- The decoder-only framework simplifies the model architecture and hence makes the implementation of the translation system straightforward.
- The proposed method is based on unsupervised semantic and acoustic units, making it possible to build systems for unwritten languages.
- Performance on the datasets shown is quite competitive and the ablation studies further highlight the importance of the 3 component models of the framework.
Weaknesses
- The duration and speech synthesis models depend on the translation model. Hence the training of two models depends on one upstream model, which can make experimentation slow. At the least, the duration model could be folded into the translation model, as shown in the paper Text-Free Prosody-Aware Generative Spoken Language Modeling (Kharitonov et al.).
- Since the authors use CVSS it would be desirable to show the performance on other language pairs from the dataset to make the evaluation more robust.
- The paper states that "in-house ASR datasets" were used. It is not clear how much these in-house datasets contribute to the method's effectiveness. This makes reproducing the paper very difficult if these "in-house datasets" are not released.
Questions
- HuBERT is trained on an English-only corpus. Did you simply apply the model to discretize Chinese speech, or did you have to adapt it?
- Why were the specific language pairs that have been evaluated chosen?
- You have not cited On Generative Spoken Language Modeling from Raw Audio (Lakhotia et al.) and Text-Free Prosody-Aware Generative Spoken Language Modeling (Kharitonov et al.), which showed that unsupervised discrete units can be used for speech synthesis and that duration is indeed crucial in improving the prosodic characteristics of the produced speech.
- Spelling error: "marked" instead of "maked" in section 4.1.1
We sincerely appreciate your positive review and valuable feedback. We hope our response completely addresses any concerns you may have.
Feedback on the weaknesses:
- [Slow training time]
The translation front-end model and the speech synthesis back-end model can be trained separately. During training, the target-language speech is converted into discrete units, which serve both as the training target of the translation module and as the input of the speech synthesis module. Thus, the two models can be trained in parallel and do not slow down training.
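A minimal sketch of this data flow (the function names are hypothetical placeholders for the HuBERT and SoundStream tokenizers, not the authors' actual code):

```python
def build_examples(src_speech, tgt_speech, hubert_tokenize, soundstream_encode):
    """Target-side semantic units are extracted once, offline, and then feed
    both modules, so the two models can be trained independently."""
    src_units = hubert_tokenize(src_speech)
    tgt_units = hubert_tokenize(tgt_speech)       # the shared artifact
    tgt_codes = soundstream_encode(tgt_speech)

    # Translation module (U-XLM): source units -> target units
    translation_example = {"input": src_units, "target": tgt_units}
    # Synthesis module (U-SLM): target units -> target acoustic codes
    # (in practice the U-SLM prompt may also include source units/codes)
    synthesis_example = {"input": tgt_units, "target": tgt_codes}
    return translation_example, synthesis_example
```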
- [Folding the duration model]
Thank you for pointing this out. We may attempt this in the future.
- [Performance on other language pairs]
We didn’t run experiments on other language pairs as our tokenizer didn’t support other languages. We will consider extending our work into other language pairs in the future.
- [Using in-house data]
We have incorporated an additional in-house ASR dataset, specifically a Chinese ASR dataset, into the training of our U-XLM module. The primary objective behind using this dataset is to enhance the performance of our translation module. As a result of this integration, we have observed a notable improvement of 1-2 points in BLEU scores for the entire system. This increase offsets the performance loss incurred by employing unsupervised discretized units instead of phonemes. Although we do not release this particular ASR dataset, we believe a dataset with similar data volume can reproduce our results.
Feedback on the questions:
- [HuBERT]
In this paper, we employed three variations of HuBERT. In the main and analysis experiments (Table 2-5), separate training was conducted for the Chinese and English versions of HuBERT, as described in detail in Section 4.1.1. In the analysis experiment (Table 6), we trained a bilingual HuBERT model that combines both Chinese and English data, as explained in Section 5.3.
- [Why the language pairs are chosen]
The Chinese->English pair is chosen because both are high-resource languages, and this makes it easier to compare our results with other systems (VALL-E X).
The English->Spanish pair is chosen because we want to use Spanish to simulate an unwritten-language scenario where the amount of available data is relatively limited. Additionally, we directly utilized an open-source multilingual HuBERT model ( https://github.com/facebookresearch/fairseq/blob/main/examples/speech_to_speech/docs/textless_s2st_real_data.md , en-es-fr), which also supports Spanish.
- [Citation and typos]
Thank you very much for pointing these out. We will cite the mentioned paper accordingly and fix the spelling error in the revised version.
This paper proposes a decoder-only system for S2ST. The system includes three decoder-only LMs: a translation LM to translate source-language semantic units to target-language semantic units, a duration LM to predict target-language semantic unit durations and extend the sequence, and a speech synthesis LM to predict the target-language acoustic units, which are converted to waveforms by a unit vocoder. Both semantic and acoustic units are learned in a self-supervised manner, hence this framework can be applied to unwritten languages.
Strengths
(1) The use of decoder-only LMs via different prompting strategies and discrete semantic and acoustic units for all components (translation LM, duration LM, and speech synthesis LM) could benefit S2ST from competitive pre-trained text decoder-only LLMs.
(2) Empirical evaluations show that PolyVoice is comparable to VALL-E X, very slightly better on ASV, worse on ASR-BLEU, and better on naturalness. Ablation studies show the contribution of the designed duration LM which uses a LM to predict durations of semantic units and extend the sequence. The duration LM significantly helps WER, with a very slight help on ASV and slight help on naturalness.
Weaknesses
(1) The innovations of this work need to be more clearly explained. This work bears strong similarity to VALL-E X, and it is important to clarify the differences, but the paper does not clearly point them out to highlight the innovations of the proposed PolyVoice. Both works concatenate the source and target semantic units and the source acoustic units to create the prompt for the LM. In PolyVoice, this prompt is created for the duration LM and the speech synthesis LM, respectively. The SoundStream codec is reimplemented, but the impact of the reimplementation is not clear.
(2) Some of the key technical presentations lack clarity.
a. Section 3.1, when describing the unit-based cross-lingual language model (U-XLM), the paper shows the prompt for an encoder-decoder architecture. The paper should also clarify the prompt for the decoder-only LM.
b. Section 3.1, when describing training, Table 1 shows how ASR, MT, and TTS supervised data are used for training. The paper also mentions that "This approach also enables the direct utilization of unlabeled text and speech data.", yet how unlabeled text and speech data are used for training is not explained here. Instead, based on Section 4.1.1, it seems that one approach is to apply in-house MT and TTS systems on ASR data to create pseudo S2S data. This part needs to be clarified.
(3) More complete and also deeper discussions are desired for empirical validations:
a. Table 2, the paper compares to cascaded (ASR-MT-TTS) and VALL-E X. It is not clear whether all well-established competitive baselines are included in this comparison.
b. Table 2, the ASV metric evaluates the capability of preserving the voice of the source speaker. However, without ground-truth target information, PolyVoice achieves 0.38 and 0.38 for hyp vs. src and hyp vs. tgt, while VALL-E X achieves 0.37 and 0.37, respectively. These are very small gains, yet the paper does not discuss this point.
c. Section 3.1, the paper mentions that CoT could be applied for source-to-target unit translation, yet prior works (Peng et al., 2023, Towards making the most of ChatGPT for machine translation) find that when CoT is applied, the model tends to conduct word-by-word translation, hence degrading the translation performance. Table 2 shows that applying CoT improves ASR-BLEU, but this result is neither analyzed nor discussed.
d. Section 4.3.2, the evaluation of PolyVoice for unwritten language is quite brief. It is only evaluated for the target language treated as an unwritten language. It would be useful to also extend the evaluation to source or both languages as unwritten language.
e. Section 5.1, again, the discussions are very brief. More analyses and insights are expected to explain the better ASR-BLEU from decoder-only over encoder-decoder.
f. Section 5.2, for the other tasks (ASR, ST, MT, TTS) in Table 5, since no baseline results are provided, it is not clear how these performances compare to baseline results on this dataset from prior works.
g. Section 5.3, when using an mHuBERT model trained with more parameters and more data, the WER decreases, which is explained, but ASV and naturalness degrade. Insights are expected to explain these results.
h. The proposed system uses many in-house data and in-house systems. The amount of data and model sizes need to be clearly compared as well when comparing to baselines.
Questions
Please check comments and concerns listed under Weaknesses.
g. Explanation of mHuBERT's performance.
Our explanation is that a larger HuBERT trained with more data is stronger at extracting semantic information. The units extracted from this model contain more semantic information and, as a result, less prosody and speaker information; thus ASV and naturalness degrade.
h. Comparison of the amount of data and model sizes.
Regarding the model size, the parameters of U-SLM are consistent with VALL-E X, and the training data for U-SLM and VALL-E X are essentially the same. Therefore, the comparison of the synthesis module is fair.
In the speech-to-speech translation task, VALL-E X utilizes SpeechUT [4] as the ASR and ST model, which predicts source phoneme sequences for recognition and target phoneme sequences for translation, respectively. SpeechUT follows an encoder-decoder architecture, whereas our U-XLM is a decoder-only structure. A direct comparison of parameter counts between these fundamentally different frameworks may not be very informative.
Regarding the amount of training data, please refer to the following table, where WenetST denotes the end-to-end speech translation training data expanded from WenetSpeech by the authors of VALL-E X.
| System | VALL-E X Trans (SpeechUT + VALL-E X) | PolyVoice |
| --- | --- | --- |
| ASR data | LibriLight (60k hrs), WenetSpeech (10k hrs) | LibriLight (60k hrs), In-house (60k hrs) |
| MT data | 73M sentences | 44M sentences |
| ST/S2S data | WenetST (10k hrs), GigaST (10k hrs) | WenetS2S (10k hrs), GigaS2S (10k hrs) |
We have incorporated an additional in-house ASR dataset, specifically a Chinese ASR dataset, into the training of our U-XLM module. The primary objective behind using this dataset is to enhance the performance of our translation module. As a result of this integration, we have observed a notable improvement of 1-2 points in BLEU scores for the entire system. This increase offsets the performance loss incurred by employing unsupervised discretized units instead of phonemes. Although we do not release this particular ASR dataset, we believe a dataset with similar data volume can reproduce our results.
[1] Radford, A., Kim, J. W., Xu, T., et al. Robust Speech Recognition via Large-Scale Weak Supervision. In Proc. ICML, 2023: 28492-28518.
[2] Casanova, E., Weber, J., Shulby, C. D., et al. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone. In Proc. ICML, 2022: 2709-2720.
[3] Costa-jussà, M. R., Cross, J., Çelebi, O., et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv:2207.04672, 2022.
[4] Zhang, Z., Zhou, L., Ao, J., et al. SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training. arXiv:2210.03730, 2022.
- [More complete and deeper discussion]
Thank you for pointing out that more complete and deeper discussions are desired for our work. Your comments help us improve the paper; our responses follow.
a. Comparison with other well-established baselines.
We recognize that VALL-E X is the most relevant prior work to compare against, given that they have demonstrated strong performance on speech translation tasks. Since our experiments focus on Chinese-English S2ST, where there are limited existing results available, we benchmark our model mainly against the performance of VALL-E X as the current baseline.
b. Explanation of the ASV metric.
We acknowledge the reviewer's observation that our framework achieves only slight gains over VALL-E X in voice cloning quality. This is expected given the similar implementations for the speech synthesis component. However, by modeling on semantic units rather than phonemes, our model retains more acoustic information that can lead to higher naturalness and better synthesis effects. While not yet reflected in voice cloning metrics, our experiments provide a path to improve acoustic modeling by using unit representations.
c. Discussion on CoT decoding strategy.
Based on our understanding, the paper titled "Towards Making the Most of ChatGPT for Machine Translation" primarily focuses on text-to-text translation, so its conclusions may not directly apply to speech-to-speech translation scenarios. In our specific case, we have found that the Chain of Thought (CoT) technique proves valuable when generating intermediate results of both the source and target text during unit-to-unit translation (an illustrative layout is sketched below). Our findings align with a recent study conducted by Google, titled "AudioPaLM: A Large Language Model That Can Speak and Listen".
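For illustration only (the exact CoT template is not given in this response and may differ), the decoded sequence could look roughly like the following, with the source transcript and the target translation emitted as intermediate steps before the target units:

```python
# Hypothetical CoT-style output layout for unit-to-unit translation;
# the placeholder units and template wording are assumptions.
cot_example = (
    'Translate Chinese unit "<zh_21><zh_161>...<zh_16>" '
    "to Chinese text: 顾客:打扰一下,是在这里排队吗? "            # intermediate source transcript
    "to English text: Customer: Excuse me, do I queue here? "    # intermediate translation
    "to English unit: <499><334>...<39>"                         # final target units
)
```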
d. Evaluation of unwritten language.
Thanks for the valuable advice to extend the evaluation to encompass unwritten languages, either as the source language or as both languages. We will incorporate the updated experimental results in the upcoming version. To clarify, the primary objective of experimenting on unwritten language scenarios in this paper is to demonstrate the feasibility of extending PolyVoice to unwritten languages.
e. Better ASR-BLEU from decoder-only over encoder-decoder.
The comparison between the decoder-only and encoder-decoder models was conducted with the same training data and a similar number of parameters. We hypothesize that the higher ASR-BLEU score achieved by the decoder-only model can be attributed to two factors. First, the decoder-only framework demonstrates superior capability in fitting large-scale training data. Second, the output generated by the decoder-only framework exhibits higher fluency, which positively impacts the ASR-BLEU evaluation.
f. Performance of other tasks in Table 5.
Given the involvement of multi-task learning (MTL), the primary objective of Table 5 is to demonstrate the overall S2ST gain achieved through MTL; beating other systems on individual tasks is not our objective. We apologize for any confusion and will improve the presentation. Still, as suggested by the reviewer, we would like to complement Table 5 with the following table:
| Task | ASR↓ | ST↑ | MT↑ | TTS↓ |
| --- | --- | --- | --- | --- |
| PolyVoice (w/ MTL) | 4.46 | 30.8 | 33.81 | 6.99 |
| Whisper [1] | 2.28 | 18.2 | - | - |
| YourTTS [2] | - | - | - | 3.03 |
| NLLB [3] | - | - | 25.2 | - |
The evaluation above was performed using state-of-the-art open-sourced models, all evaluated on the same test set. As expected, Whisper, trained on substantially larger datasets, outperforms our model in terms of ASR performance. However, our model demonstrates superior results in both ST (speech-to-text translation) and MT (machine translation), showcasing its translation quality. It is worth noting that our model's TTS (text-to-speech) results are not as good as those of YourTTS. This disparity can be attributed to the use of a weaker unit vocoder in this ablation study.
We thank the reviewer for the constructive and valuable feedback, and we hope our response addresses your concerns comprehensively.
- [The difference between our proposed approach and VALL-E X]
We appreciate the reviewer's comments highlighting similarities between aspects of our work and VALL-E X. The core of our pipeline consists of a speech-to-unit translation (S2UT) component followed by a unit-to-speech (U2S) synthesis component. The unit-to-speech language model (U-SLM) in our U2S module shares conceptual similarities with VALL-E X, while other aspects of our overall model design differ. In response, we clarify key differences in motivation and approach:
- Our work is primarily motivated by exploring fully language model-based architectures for speech-to-speech translation, as a way to extend recent progress in large language models to this task. To our knowledge, we are the first to propose and evaluate such a fully decoder-only pipeline, demonstrating feasibility and encouraging results.
- A key difference is that VALL-E X operates on phonemes, while we aim for a universal framework using discrete units that can be learned from unlabeled speech. This provides greater flexibility for low-resource and unwritten languages. While our current unit extraction quality limits gains over VALL-E X, our approach enables potential optimization with improved units and direct application to any language.
- There are also significant differences in the modeling approach between VALL-E X and PolyVoice for speech-to-speech translation. VALL-E X uses a traditional pipeline that relies on external ASR (automatic speech recognition) and ST (speech translation) modules to predict phonemes in the source and target languages, respectively. PolyVoice instead employs the U-XLM module, which performs the translation directly without separate ASR or ST modules: it converts speech units in the source language into speech units in the target language, enabling end-to-end speech-to-speech translation.
Overall, we believe our work makes meaningful contributions in exploring language model-based S2ST, despite some similarities to VALL-E X in the speech synthesis component.
- [Re-implementation of soundstream codec]
The main reason we built our own codec model is that Google has not open-sourced their SoundStream model. The only publicly available audio codec model was Encodec, released by Meta, but its performance is worse than we anticipated. We therefore trained our own version of SoundStream to obtain a better audio codec, and the codecs were compared using a subjective metric very similar to MUSHRA.
- [Key technical presentations]
a. Prompt for model training.
To clarify, the prompts shown in Table 1 are used to format the training data for our decoder-only language model, not an encoder-decoder architecture. Specifically, we construct varied training sequences using different prompts to cover diverse data types like ASR, MT, and S2ST. This allows us to train our language model in a decoder-only manner by modeling the training sequences in an auto-regressive way.
We do not actually utilize any encoder-decoder architectures in our proposed framework. We compare with encoder-decoder models in Section 5.1, where for those baselines we construct the training data in the same way as previous work (https://github.com/facebookresearch/fairseq/blob/main/examples/speech_to_speech/docs/direct_s2st_discrete_units.md).
b. Data utilization.
Like text-based language models, our framework allows unlabeled speech and text data to be directly leveraged for training via language modeling objectives. In our current implementation, we construct training data by formatting labeled ASR, MT, and pseudo-S2ST data with specific prompts, as detailed in Table 1. In addition, the parallel training data is also used separately in monolingual form; this type of training sample does not require construction through prompts.
To clarify, our proposed framework inherently supports training on unlabeled speech and text, similar to standard language model pretraining (see the sketch below). We believe that scaling up the training data by incorporating large amounts of unlabeled speech and text would enable more powerful models.
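A minimal sketch of how both prompted paired data and plain monolingual (unlabeled) data can be cast into a single token stream for the same autoregressive objective; the field names and helper signature are assumptions for illustration, not the paper's actual data schema:

```python
from typing import Callable, Dict, List

def to_lm_sequence(sample: Dict[str, str],
                   tokenize: Callable[[str], List[int]],
                   eos_id: int) -> List[int]:
    """Every sample becomes one sequence trained with next-token prediction."""
    if "prompt" in sample:
        # Paired ASR / MT / S2ST data formatted with an instruction template (Table 1)
        text = sample["prompt"] + " " + sample["target"]
    else:
        # Unlabeled text or speech units, trained in plain monolingual form
        text = sample["text"]
    return tokenize(text) + [eos_id]
```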
The paper introduces PolyVoice, a novel framework for speech-to-speech translation (S2ST) that deviates from the traditional sequence-to-sequence architecture and instead employs an LM (decoder-only) approach. It integrates three key models: a Translation LM (using HuBERT-derived semantic units for translating source to target semantic units), a Duration LM (utilizing source and target semantic units with source duration for target duration prediction), and a Speech Synthesis LM (predicting target acoustic units from source/target semantic and source acoustic units).
The study demonstrates the viability of this approach, showcasing competitive performance in translation tasks (EMIME and CVSS datasets) compared to the VALL-E X method. A key innovation is the use of semantic units (as opposed to phonetic units in VALL-E X), enabling application to unwritten languages.
The use of unspecified in-house datasets could impede the reproducibility of the results, but this is not a major issue.
Why not a higher score
While the paper presents a novel approach to S2ST, it is important to note that the integration of semantic and acoustic units, akin to AudioLM, is a prevalent practice in the speech community. However, this exact method has not been previously applied in S2ST contexts, so the paper makes a meaningful contribution to the field, but not a groundbreaking one.
Why not a lower score
The paper introduces PolyVoice, a novel framework that significantly deviates from the conventional sequence-to-sequence architecture in speech-to-speech translation. The framework exhibits competitive performance on well-known translation tasks (EMIME and CVSS datasets), especially when compared to established methods like VALL-E X. There is no reason to reject the paper.
Accept (poster)