There are several weaknesses in the paper in its current form:

For most of the ICLR audience, WiFi signals are not everyday subject matter. The paper lacks an intuitive explanation of what information CSI does and does not contain. It discusses tensor dimensions at length without clearly justifying or explaining them, and the inputs and outputs of each model component remain unclear.
There are not enough details about the hardware topology and the bandwidth used. The only statement is “…a bandwidth of 20/40MHz…” in Section B, and you do not say which of the two bandwidths is actually used. A detailed discussion of why a radio system with only 20/40 MHz of bandwidth can achieve such high accuracy is also missing. Even active WiFi-based localization systems, which also work with OFDM signals, do not achieve such localization results. Why should your passive approach outperform them?
We also miss a comparison with other algorithms on your dataset, as well as an evaluation of your method on publicly available datasets. The results in Table 1 cannot be used to compare your method with the state of the art; such a comparison is unfair because accuracy is highly correlated with the quality of the dataset. We acknowledge the difficulty of comparing across datasets, but WiTR could still be compared with one of the single-person methods on the dataset that the paper proposes.
The paper aims to provide novelty in mesh reconstruction. At the algorithmic level, however, there is not much contribution.
It is unclear how such a system works outside the lab. I am less concerned about real-time processing than about the CSI data itself. How well does it generalize to unseen trajectories, movements, etc.? It is entirely unclear how the training and test data are organized.
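
To make the bandwidth concern concrete, here is a rough back-of-the-envelope sketch. It assumes only the standard relation that two propagation paths can be separated in the delay domain when their lengths differ by about c / B; the function name is hypothetical and nothing here is taken from the paper.

```python
# Back-of-the-envelope sketch (not the paper's method): delay-domain
# path-length resolution c / B for the two bandwidths quoted in Section B.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def path_resolution_m(bandwidth_hz: float) -> float:
    """Approximate smallest resolvable path-length difference, c / B, in metres."""
    return SPEED_OF_LIGHT_M_PER_S / bandwidth_hz


for bandwidth_mhz in (20, 40):
    resolution = path_resolution_m(bandwidth_mhz * 1e6)
    print(f"{bandwidth_mhz} MHz -> {resolution:.1f} m")
# prints:
# 20 MHz -> 15.0 m
# 40 MHz -> 7.5 m
```

A resolution on the order of metres does not by itself rule out finer estimates, since phase, multiple antennas, and learned priors can all contribute, but it is why the paper needs to explain where the reported accuracy comes from.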

Some minor comments:

“…which typically consists of only about 10,000 elements if…” - This depends heavily on the radio system, the bandwidth employed, and the topology. There is no “typical” CSI size.
“...meshes (with a size of N × 6890 × 3) representing…” – What is “N”?
“…which is at the same level as previous image-based and radar-based methods” - I cannot find any comparison with such methods in your work. Accuracy depends heavily on the environment and scenario, so a direct comparison of accuracy figures is not meaningful.
“…random phase drift and flip.” - Please explain what a phase flip is.
“…linear transformation to denoise the phase signal following…” - How did you deal with the phase shift and drift among the four network cards?
“…CSI signals Z ∈ C^{3×3×20×30} are complex signals.” - What do the individual dimensions stand for?
“We propose a new way of understanding the CSI signals, where…” - Your understanding of the CSI is not in line with the communication-theoretic definition.
“…collects CSI tensors ∈ C^{3×3×30×300} per second.” - What is the last dimension? Why do you use only 20 of the 300 samples, and which ones? You must explain the recorded data in detail and justify why the dimension can simply be cropped.
How big is the label noise resulting from the employed camera-based mesh regression?
What is G in Fig. 2? It seems to be the number of coarse poses but how is it defined?
While Geng et al. 2022 differs from the proposed work in general (i.e., it does not work on mesh models), it would be beneficial to differentiate the paper from it explicitly.
Intro: “We present a[-n-] fully end-to-end…”
Sec. 2: “body mesh.[ ]One”
Sec. 2: What is a top-down method?
Sec. 3: “sensing with WiFi ”
Sec. 3: “coarse decoder[ ](i.e.,…”
Sec. 3: “identity token.[ ]We”
Sec. 3: “[w]here” after Equation 2
Table 1: It is not immediately clear what unit the reported results are in.
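
To illustrate the question about phase handling across the four network cards, the sketch below shows one commonly used linear phase sanitization from the CSI literature: subtract a line fitted through the first and last subcarrier, plus the mean phase. Whether the paper uses this exact form is an assumption on my part, and the function and argument names are hypothetical.

```python
# Sketch of a common linear CSI phase sanitization (assumed form, not
# confirmed to be the paper's): remove slope and mean offset per antenna link.
def sanitize_phase(phases, subcarrier_idx):
    """Return unwrapped phases with the linear trend and mean offset removed.

    phases:         unwrapped phase per subcarrier (radians)
    subcarrier_idx: subcarrier indices, same length as phases
    """
    slope = (phases[-1] - phases[0]) / (subcarrier_idx[-1] - subcarrier_idx[0])
    offset = sum(phases) / len(phases)
    return [p - slope * m - offset for p, m in zip(phases, subcarrier_idx)]


# A purely linear phase error over symmetric indices is removed completely.
indices = [-2, -1, 0, 1, 2]
measured = [0.3 * m + 1.2 for m in indices]
cleaned = sanitize_phase(measured, indices)
```

Note that such a correction is applied independently per link. It says nothing about the relative phase offset and drift between separate network cards, which is the point the paper should address.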

Appendix A

How did you separate training and test data? Are they different recordings made on different days, or just random splits? A random split is not a valid evaluation. You need to share more details about how the datasets were generated in order to substantiate the claimed generalization ability and high localization accuracy.
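
A minimal sketch of the kind of split I have in mind, assuming each sample carries a recording-session label (the `session_id` field, the function name, and the example data are all hypothetical): whole sessions are held out, so no frames from a test recording leak into training.

```python
# Sketch of a session-level split (hypothetical data layout): entire
# recording sessions are held out instead of randomly chosen frames.
def split_by_session(samples, test_sessions):
    """Split (session_id, item) pairs so that test sessions never appear in train."""
    test_sessions = set(test_sessions)
    train, test = [], []
    for session_id, item in samples:
        if session_id in test_sessions:
            test.append(item)
        else:
            train.append(item)
    return train, test


recordings = [
    ("day1", "frame_a"), ("day1", "frame_b"),
    ("day2", "frame_c"), ("day3", "frame_d"),
]
train_items, test_items = split_by_session(recordings, test_sessions={"day3"})
```

With a random frame-level split, neighbouring frames of the same recording would end up on both sides, and the reported accuracy would mostly measure interpolation within a recording.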