Dear Reviewer CUJA,

Thank you very much for the timely feedback and the follow-up question! We are glad to address your concern in more detail.

What troubled me is that I am still not convinced what you said that: "DiffMD is not explicitly a generative model on geometric trajectories, while to the best of our knowledge our GeoTDM is the first. ". I believe that both of your methods are a generative model, but each of you approached in some detailed difference, so I need you to clarify this in more details.

Yes, we agree that both DiffMD and GeoTDM are generative models. The core difference lies in whether the method models the distribution of geometric trajectories, each of which is a sequence/collection of frames, or the distribution of a single frame. In particular, our GeoTDM explicitly models the distribution of geometric trajectories (see lines 116-122), while DiffMD is designed to model the distribution of the next frame (see the Model Overview section on page 3 of DiffMD [1]), conditioned where applicable on context such as the previous frames.

This difference leads to several distinctions in model design, inference, and empirical performance, which we discuss below.

Model design. To jointly model all frames of a trajectory, the transition kernel must additionally handle the temporal dimension. This motivates our equivariant temporal attention layer (Eq. 5-7), which is absent in DiffMD because DiffMD does not need to explicitly model temporal correlation. Our framework also supports conditioning on a trajectory with multiple frames, whereas DiffMD relies on a Markovian assumption and always consumes only the single previous frame to generate the next frame.
Inference. The benefit of directly modeling the distribution of geometric trajectories rather than single frames also appears at inference time. Within one diffusion loop, our GeoTDM generates an entire trajectory with all of its frames, while DiffMD requires an additional outer loop that sweeps through the frames to achieve the same effect.
Performance. The benefit also shows in empirical performance. Since DiffMD does not model the correspondence between multiple frames, it is more vulnerable to error accumulation when generating a long trajectory through iterative rollout. The additional experiment provided in our rebuttal supports this point, with GeoTDM outperforming DiffMD by a remarkable margin.
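
To make the inference point concrete, the two sampling loop structures can be sketched as follows. This is only a toy illustration: the update rules, the constants, and every function name are hypothetical placeholders and are not taken from the GeoTDM or DiffMD code. It shows only where the loops sit, not how either model denoises.

```python
import random

# All constants below are assumptions chosen for illustration only.
NUM_DIFFUSION_STEPS = 4  # length of one reverse-diffusion loop
NUM_FRAMES = 3           # number of frames in the trajectory
FRAME_DIM = 2            # size of one flattened frame


def sample_noise(rng):
    """Draw one frame of Gaussian noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(FRAME_DIM)]


def sample_trajectory_jointly(rng):
    """Trajectory-level sampling: a single diffusion loop updates all frames."""
    traj = [sample_noise(rng) for _ in range(NUM_FRAMES)]
    diffusion_loops = 1
    for _ in range(NUM_DIFFUSION_STEPS):
        # Placeholder update: every frame is refined in the same reverse step.
        traj = [[0.5 * v for v in frame] for frame in traj]
    return traj, diffusion_loops


def sample_trajectory_by_rollout(rng, first_frame):
    """Next-frame sampling: an outer loop over frames, one diffusion loop each."""
    traj, prev, diffusion_loops = [], first_frame, 0
    for _ in range(NUM_FRAMES):
        frame = sample_noise(rng)
        diffusion_loops += 1
        for _ in range(NUM_DIFFUSION_STEPS):
            # Placeholder update conditioned only on the single previous frame.
            frame = [0.5 * f + 0.5 * p for f, p in zip(frame, prev)]
        traj.append(frame)
        prev = frame  # errors in this frame feed into the next one
    return traj, diffusion_loops
```

In this sketch the joint sampler runs one diffusion loop for the whole trajectory, while the rollout sampler runs one diffusion loop per generated frame and passes each generated frame on as the condition for the next.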

While you claimed that your "modeling the joint distribution of all frames within the geometric trajectory", what I saw is you still look at the individual conditional probability like DiffMD, see line 193. It seems like you take into data sampling of time into consideration, to me, I think this is considered as generalize of DiffMD.

Thank you for the question. There may be a misunderstanding here, which we would like to respectfully clarify.

Specifically, line 193 depicts the transition kernel of the diffusion process. The step index in that kernel refers to the diffusion step, not the frame index on the geometric trajectory. The superscript indicates that the latent variable is a geometric trajectory rather than a single frame. Therefore, the individual conditional probability is imposed across diffusion steps, which follows from the Markovian assumption of the diffusion process. This is common practice for diffusion models and is shared by both GeoTDM and DiffMD; we are not claiming any difference on this point.

However, we do not enforce conditional independence along the actual temporal dimension of the geometric trajectory, whereas DiffMD makes a Markovian assumption on the molecular dynamics and only models the distribution of a single frame. This is what we mean by "modeling the joint distribution of all frames within the geometric trajectory". This point is the core difference between the two approaches, as we have discussed.
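
The two kinds of factorization can be written side by side as a rough sketch. The notation below is assumed for this illustration only and is not copied from either paper: x^{(t)} is frame t, x^{[T]} is the whole trajectory of T frames, s is the diffusion step out of S, and c is optional conditioning.

```latex
% Assumed notation, for illustration only (not taken from either paper).
% Trajectory-level diffusion: the Markov chain runs over diffusion steps s,
% and each latent variable is an entire trajectory x^{[T]}.
p_\theta\big(x^{[T]}_{0:S} \mid c\big)
  = p\big(x^{[T]}_{S}\big)
    \prod_{s=1}^{S} p_\theta\big(x^{[T]}_{s-1} \mid x^{[T]}_{s},\, c\big)

% Next-frame modeling with a Markovian assumption over frames t:
% each factor is the distribution of a single frame given the previous one.
p_\theta\big(x^{(2)}, \dots, x^{(T)} \mid x^{(1)}\big)
  = \prod_{t=1}^{T-1} p_\theta\big(x^{(t+1)} \mid x^{(t)}\big)
```

In the first expression the product runs over diffusion steps only, so no independence is imposed between frames; in the second the product runs over frames, which is where the Markovian assumption on the dynamics enters.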

For the above reason, I believe that is why you both used Equivariant Graph Convolution Layer.

Both methods use EGCL to process spatial information on the geometric structure. However, since we model the whole trajectory, we additionally introduce the equivariant temporal attention layer to process temporal correspondence, which is not present in DiffMD.

Thank you again for the follow-up discussion! We promise to include this discussion and the distinctions between the two methods in the manuscript. Please let us know whether this addresses your concern; we are happy to clarify further if you have any questions.

Best,

Authors