Clarity Issues:

For Equations (2) (3) (4), several notations are left unexplained. For example, what is ? What does mean? What is . The authors provide these formulations but I don't feel like they are well defined. The authors should either remove those formulations or provide more explanation on them.
The methods in Table 2 are not well-explained. There is nowhere in the main text that explicitly explain what are "GT. + Codec" and "Acoustic LM + GT Style". While I can infer from the text that "Acoustic LM + GT Style" is passing the ground-truth style tokens and use the Acoustic LM to infer the codec tokens, I have no idea what is "GT. + Codec" though.
Figure 3 is super-hard to read. It is too small. Please consider enlarge it.

I think most of my comments about the paper come from the experiments. Experiments:

In Section 4.3.1, you are comparing existing TTS systems (YourTTS, XTTS-V2) that conditioned only on phonemes, with your system that conditioned on phonemes and ground truth style tokens. This is misleading as your system has an inherent advantage to access some ground truth attributes of the target speech. Furthermore, this is the only experiment you compare with the existing systems. My suggestion is to run experiments comparing your method with the whole two-stage inference and the existing TTS systems. Ideally, you should also compare with these methods trained with the similar datasets and number of parameters, where in this paper, the information is not given.
Also, in Section 4.3.1, you mentioned that "when we swap the phonemes and style tokens from different samples, the model fails to generate meaningful speech. This demonstrates that our style representation does not result in significant content information leakage." But to me, this sounds like a content leakage issue. As the syntheses of the swapped version should still be producing the same content as the given phoneme sequence.
You claimed that "training the first-stage model on extensive datasets enhances the robustness of the two-stage model in terms of quality and content accuracy". I don't think I saw experiments that support this. I thought that you will compare a model that the first-stage is trained on extensive datasets v.s. a ablated model that the first-stage is not trained on extensive datasets.
See more in the questions...