Weaknesses

[W1. LACK OF CLEAR HYPOTHESES AND REASONING.] Sorry for the confusion; you are right that clear hypotheses and reasoning are important. However, data-driven training of AI models now allows researchers to conceptualize and design algorithms at a higher level (the functional and motivational level). This means the classification interface (the modeling function of the network) is no longer expressed explicitly by formulas, but is learned implicitly in the network parameters. This addresses the problem that the classification interface cannot be written down with existing expressions, and it has allowed algorithm performance and generalization to improve further. The writing of TARSS-Net follows exactly this theoretical system of a priori interpretability. Of course, TARSS-Net essentially involves a series of theories and derivations from feature engineering, metric learning, differentiable layer design and so on, with radar signals as input. Due to space limitations, and for ease of understanding by readers with a general AI research background, the paper starts from the high-level design and implementation, while the necessary derivations in key parts are preserved.

In addition, the paper explains in detail why TARSS-Net is effective for RSS. We comprehensively discuss existing time-series modeling paradigms and analyze their drawbacks and the factors that make them unsuitable for RSS (see Sec. 2). Based on this analysis, we present the design motivations of TARSS-Net for the RSS task one by one (see L144-L155), with the detailed implementation in Sec. 3. We also verify the superiority of TARSS-Net over existing methods that consider temporal relation information.

We believe that the confusion you mentioned can be eliminated after readers carefully read the full paper, the Appendix and the code.

[W2. INSUFFICIENT EMPHASIS ON THE DESIGN MOTIVATION.] In Sec. 2, we comprehensively discuss existing temporal modeling paradigms and, based on that analysis, present the design motivations of TARSS-Net for RSS one by one, including the design motivation of the TH-TRE module. We believe that revisiting the first two sections of the paper will answer your questions.

[W3. ISSUES WITH PAPER FORMATTING AND VISUALIZATION.] Sorry for the reading difficulty caused by the formatting and visualization. Due to the space limit of the submission, we had to make some typographic compromises that may make the paper uncomfortable to read. These will be corrected in the next version of the manuscript, including the order of Fig. 3 and Fig. 4, more rigorous drawing of the elements in the figures, etc.

[W4. POTENTIAL ISSUES IN THE EXPERIMENTAL SECTION.] Due to space limitations, we show the experimental results that best help verify the performance of TARSS-Net in the most concise way. Because radar targets are sparse, the dense computation of SA inevitably introduces redundant computation on irrelevant information, thus degrading RSS performance. The performance of the ViT model is provided in Table 1 of the PDF attached to this rebuttal. We also promise to add it to Table S2 in the Appendix of the revised manuscript.

Questions

[Q1. THE CORE INNOVATION OF THIS PAPER.] The core innovation is an effective temporal modeling method specific to RSS tasks, i.e., the plug-and-play TRAM, which combines causality, end-to-end learnability, a constant number of model parameters for inputs of arbitrary length, and computational complexity that grows linearly with the sequence length. These advantages cannot all be satisfied at the same time by other existing temporal modeling methods, including Transformers, 3DConv, RNNs and HMMs. For its significance, innovation and advantages, please see the first two sections of the paper for details. As for causal dilated convolution (CDC), it definitely offers parallel computation, a larger RF with fixed-size kernels and a causal computing mechanism; however, the dilation rate must be pre-defined, i.e., if the input length changes, the dilation-rate hyper-parameter has to be changed accordingly before training. TRAM, in contrast, does not require any adjustment when handling inputs of different lengths. Moreover, as far as we know, CDC is not available in a 3D form, which limits its ability to handle spatio-temporal data such as radar RAD sequences. Hence, instead of CDC, we focus on 3DConvs, which are preferred by researchers in the RSS field.
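To illustrate the point about CDC, the following is a minimal sketch of how the receptive field of a stack of causal dilated 1D convolutions depends on the chosen dilation rates. It uses the standard relation RF = 1 + (k - 1) * sum(dilations) for stride-1 layers with kernel size k. The doubling schedule (1, 2, 4, ...) and the helper names are assumptions made for illustration only; this is not the TRAM implementation, nor any specific CDC baseline from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked stride-1 causal dilated convs."""
    return 1 + (kernel_size - 1) * sum(dilations)


def doubling_schedule_for(seq_len, kernel_size=2):
    """Hypothetical helper: shortest 1, 2, 4, ... dilation schedule whose
    receptive field covers a sequence of seq_len frames."""
    dilations = []
    d = 1
    while receptive_field(kernel_size, dilations) < seq_len:
        dilations.append(d)
        d *= 2
    return dilations


if __name__ == "__main__":
    # The schedule that covers the full input changes with the input length,
    # so it has to be re-chosen before training when the length changes.
    for seq_len in (4, 8, 16, 32):
        schedule = doubling_schedule_for(seq_len)
        print(seq_len, schedule, receptive_field(2, schedule))
```

Under this assumed schedule, covering a longer input requires adding layers or enlarging the dilation rates, which is the pre-definition issue noted above.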

[Q2. IS IT MEANINGFUL TO DISCUSS REAL-TIME PERFORMANCE FOR RSS?] Yes, it is very important to discuss the real-time capability of RSS. As a remote sensing device, radar is applied in many fields, such as autonomous driving, security warning and so on. Taking a Ku-band drone surveillance radar as an example, the PRT (pulse repetition time) is around 80 us and one CPI (one Range-Doppler frame) contains 128 coherent pulses, so one frame spans about 128 × 80 us = 10.24 ms and the data rate for detection is roughly 98 frames per second. This requires the subsequent signal processing and detection/segmentation algorithms to match this data rate as closely as possible. Hence, in order to accurately detect and stably track the target, real-time performance is one of the important indicators of the RSS task, and it has practical significance at the application level. Taking autonomous driving as another example, a moving car needs real-time feedback of detection results from the surrounding environment; otherwise, unexpected consequences may follow. Therefore, RSS needs to balance accuracy and efficiency.
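The Ku-band figures above can be turned into a per-frame time budget. The sketch below assumes that CPIs are produced back to back, with no overlap and no dead time between them, so that one Range-Doppler frame arrives every PRT × (pulses per CPI). The function names and the inference times passed to `keeps_up` are hypothetical values chosen for illustration, not measurements from the paper.

```python
def cpi_duration_s(prt_s, pulses_per_cpi):
    """Duration of one CPI (one Range-Doppler frame) in seconds."""
    return prt_s * pulses_per_cpi


def frame_rate_hz(prt_s, pulses_per_cpi):
    """Frames per second, assuming back-to-back CPIs (an assumption)."""
    return 1.0 / cpi_duration_s(prt_s, pulses_per_cpi)


def keeps_up(inference_time_s, prt_s, pulses_per_cpi):
    """True if one inference finishes within one frame period."""
    return inference_time_s <= cpi_duration_s(prt_s, pulses_per_cpi)


if __name__ == "__main__":
    prt = 80e-6   # 80 us pulse repetition time, from the example above
    pulses = 128  # coherent pulses per CPI, from the example above
    print(f"frame period: {cpi_duration_s(prt, pulses) * 1e3:.2f} ms")
    print(f"frame rate:   {frame_rate_hz(prt, pulses):.1f} frames/s")
    # Hypothetical inference times, for illustration only.
    for t in (0.008, 0.020):
        print(f"{t * 1e3:.0f} ms per frame keeps up: {keeps_up(t, prt, pulses)}")
```

Under these assumptions, a segmentation model has to process each frame in roughly 10 ms to match the radar's output rate.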