Entropy-based Activation Function Optimization: A Method on Searching Better Activation Functions
We theoretically prove the existence of the worst activation function, propose a theoretical methodology to guide the optimization and design of activation functions, and use it to derive a novel high-performance activation function.
Abstract
Reviews and Discussion
The paper presents a systematic approach to address the problem of activation function optimization in artificial neural networks (ANNs). By leveraging information entropy theory, the authors theoretically demonstrate the existence of the worst activation function under boundary conditions (WAFBC). They then propose the Entropy-based Activation Function Optimization (EAFO) methodology, which provides a framework for designing better activation functions. Utilizing this methodology, the authors derive a novel activation function called Correction Regularized ReLU (CRReLU) from the conventional ReLU. Extensive experiments on vision transformer variants and large language model (LLM) fine-tuning tasks demonstrate the superior performance of CRReLU over existing ReLU variants.
Strengths
- Theoretical Rigor: The paper provides a solid theoretical foundation for activation function optimization by introducing the concept of WAFBC and the EAFO methodology. This approach is novel and offers a fresh perspective on designing activation functions.
- Practical Application: The derived CRReLU activation function shows significant improvements in performance across various tasks, including image classification and LLM fine-tuning, demonstrating the practical applicability of the proposed methodology.
- Comprehensive Experiments: The authors conduct extensive experiments on multiple datasets and architectures, validating the effectiveness of CRReLU and providing a thorough evaluation of the proposed method.
Weaknesses
- Limited Generalizability: The paper primarily focuses on ReLU and its variants. It would be valuable to explore the applicability of the theoretical framework to activation functions without an inverse function, such as Swish or Mish.
- Computational Complexity: The dynamic optimization during iterative training introduces significant computational complexity, which the paper does not address. The authors should discuss potential approaches or algorithms, such as gradient-based optimization or stochastic methods, that might mitigate these computational complexity issues, or provide a more detailed analysis of the trade-offs between performance gains and computational costs.
- Assumption of Gaussian Distribution: The assumption that data follows a Gaussian distribution simplifies the derivation of CRReLU but may not hold in all real-world scenarios. The authors should provide empirical evidence or theoretical analysis of CRReLU's performance under non-Gaussian data distributions, such as heavy-tailed or multimodal distributions, to address concerns about the robustness of the method.
- Lack of Diverse Experiments: While the experiments are comprehensive, they are limited to specific datasets and architectures. Additional experiments on diverse datasets, such as medical imaging (e.g., MICCAI) or remote sensing data (e.g., EuroSAT), and architectures like convolutional neural networks (e.g., ResNet) or graph neural networks, would strengthen the generalizability claims.
Questions
- How does the EAFO methodology perform when applied to other activation functions, especially those without an inverse function, such as Swish or Mish?
- Can the authors provide empirical evidence or theoretical analysis of CRReLU's performance under non-Gaussian data distributions, such as heavy-tailed or multimodal distributions, to address concerns about the robustness of the method?
- What potential approaches or algorithms, such as gradient-based optimization or stochastic methods, can be explored to mitigate the computational complexity introduced by dynamic optimization during iterative training?
- Would the authors consider conducting additional experiments on diverse datasets, such as medical imaging (e.g., MICCAI) or remote sensing data (e.g., EuroSAT), and architectures like convolutional neural networks (e.g., ResNet) or graph neural networks, to further validate the generalizability of CRReLU?
Details of Ethics Concerns
None.
W2 and Q3
Thank you once more for your insightful comments and constructive suggestions. We fully agree that dynamic optimization during iterative training may introduce substantial computational complexity, and addressing it will likely require more efficient optimization algorithms (or optimizers). At present we have not obtained an algorithm efficient enough for large-scale activation optimization, but we would like to offer some insights. First, we suggest performing activation optimization at the "batch level" (network parameters are typically updated at the mini-batch level): this both stabilizes the training process and reduces the computational cost of dynamic optimization. That is, we can update the network parameters per mini-batch while updating the activation per batch. Second, we recommend momentum-style techniques in the optimizer design, so that the model retains information about the past speed of gradient descent, thereby accelerating convergence and reducing overall computing cost. Finally, we are also considering adaptive activation learning, similar to Adam, adjusting the activation learning rates through first- and second-moment estimates of the gradients.
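The batch-level and Adam-style ideas above can be sketched as follows. This is only a minimal illustration, not the paper's implementation; the function names and hyper-parameters (`lr`, `betas`, `delta`) are hypothetical:

```python
def update_activation_param(eps, state, grad, t, lr=1e-3, betas=(0.9, 0.999), delta=1e-8):
    """One Adam-style step for a scalar activation parameter (hypothetical sketch)."""
    m, v = state
    m = betas[0] * m + (1 - betas[0]) * grad          # first-moment estimate
    v = betas[1] * v + (1 - betas[1]) * grad * grad   # second-moment estimate
    m_hat = m / (1 - betas[0] ** t)                   # bias correction at step t
    v_hat = v / (1 - betas[1] ** t)
    return eps - lr * m_hat / (v_hat ** 0.5 + delta), (m, v)

def batch_level_update(eps, state, minibatch_grads, t, lr=1e-3):
    """Accumulate mini-batch gradients and update the activation once per batch."""
    g = sum(minibatch_grads) / len(minibatch_grads)
    return update_activation_param(eps, state, g, t, lr=lr)
```

Accumulating the activation gradient over mini-batches and applying a single moment-corrected step per batch is what keeps the per-step overhead of the dynamic optimization small.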
Finally, we would like to thank you once again for your insightful comments, your constructive suggestions, your thorough and comprehensive summary of the strengths and weaknesses, and your great efforts on our work.
Warm regards,
Authors of submission 6110
[1] Helber P, Bischke B, Dengel A, et al. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, 12(7): 2217-2226.
[2] Liu Z, Mao H, Wu C Y, et al. A convnet for the 2020s[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 11976-11986.
Dear Reviewer m9pv:
Thank you for your great efforts on our work, for your comprehensive summary of the strengths and weaknesses, for your insightful comments, and for your constructive suggestions.
W1 and Q1
We wholeheartedly concur with your insightful comments, and thank you once more for them. As you note, the current EAFO methodology cannot be applied to activation functions without an inverse function. This is discussed as the first point under the "Limitations" section, and we intend to leave this part for future work. Our current approach to this problem involves adopting the Lebesgue integral in place of the Riemann integral used in the entropy calculation, but the specific implementation and rigorous theoretical derivation are still ongoing.
W3 and Q2
Thank you once more for your insightful comments and for your constructive suggestions. In this work, we assume a Gaussian distribution of the data, which has previously been validated theoretically as a reasonable assumption in MLPs and CNNs. However, in more modern architectures, particularly transformers, this assumption might not hold true due to their self-attention mechanisms. We therefore opt to conduct experiments on transformers, with the primary aim of providing empirical evidence of CRReLU's performance under non-Gaussian data distributions. We have also considered theoretically analyzing the performance of CRReLU under heavy-tailed or multimodal distributions, but found that these distributions lack a typical distributional form for further analysis. Furthermore, we enhance our experimental evaluation. First, we repeat the initial experiments in the paper multiple times and report the mean and standard deviation to better understand the statistical characteristics of the results (please refer to Response to Reviewer YJtL). Then, following your suggestions, we further evaluate CRReLU's generalizability across network structures by validating ConvNeXt-tiny [2] (one of the latest CNNs) on CIFAR10, CIFAR100, and ImageNet1K. Experiments on CIFAR10 and CIFAR100 are conducted on 4 RTX3090 GPUs; those on ImageNet1K are conducted on 4 NVIDIA L20 GPUs. We perform three runs, reporting both the mean and standard deviation.
Table1: Test accuracy of experiments conducted on ConvNeXt-tiny for 100 epochs with error bar.
| Dataset | GELU | ELU | PReLU | CELU | SiLU | Mish | CRReLU |
|---|---|---|---|---|---|---|---|
| CIFAR10 | 0.649±0.004 | 0.598±0.005 | 0.646±0.014 | 0.598±0.005 | 0.606±0.002 | 0.614±0.004 | 0.706±0.011 |
| CIFAR100 | 0.366±0.003 | 0.303±0.004 | 0.352±0.005 | 0.305±0.002 | 0.350±0.009 | 0.353±0.007 | 0.421±0.007 |
| ImageNet1K | 0.729±0.003 | 0.717±0.005 | 0.729±0.005 | 0.718±0.009 | 0.723±0.007 | 0.728±0.006 | 0.732±0.002 |
We hope these additional results could alleviate your concerns to some extent.
W4 and Q4
Thank you once more for your insightful comments and for your constructive suggestions, which significantly strengthen the paper. Accordingly, we conduct additional experiments on diverse datasets and architectures. First, we verify the performance of CRReLU on ConvNeXt (one of the latest CNNs); the results can be found in the Global Response. Furthermore, based on your valuable suggestions, we conduct experiments on EuroSAT [1] with ConvNeXt-tiny [2]. All experiments are performed three times, and we report the mean and standard deviation. This part of the experiments is conducted on a single RTX3090 for 25 epochs using the AdamW optimizer, a learning rate of 0.0001, the cross-entropy loss function, and a batch size of 256. The results are presented in the following table.
Table2: Test accuracy of experiments conducted with ConvNeXt-tiny on EuroSAT
| Dataset | GELU | ELU | PReLU | CELU | SiLU | Mish | CRReLU |
|---|---|---|---|---|---|---|---|
| EuroSAT | 83.09±1.06 | 81.21±0.37 | 81.33±0.94 | 81.12±0.27 | 81.85±1.01 | 82.23±0.08 | 83.26±0.52 |
Dear Reviewer m9pv:
We hope this message finds you well. As the discussion period draws to a close, we are reaching out to solicit your thoughts on the rebuttal responses and the revised manuscript, inspired by your valuable insights. We have provided additional supportive experiments and conducted further discussions in the rebuttal responses and the revised manuscript.
We would like to briefly summarize the changes we made to the manuscript for your easier navigation. On the additional supportive experiments, specifically, we focus on the following aspects: enhancing all experimental results with three runs, additional architecture (Appendix F.1), additional dataset (Appendix F.2), additional initialization (Appendix F.3), entropy calculation after activation (Appendix F.4) and mixed activation function (Appendix F.5). On the additional discussions, we focus on Lipschitz continuity analysis (Appendix G), initialization and training stability (Appendix H), lower entropy indicates better classification (Appendix I) and dynamic optimization (Appendix J).
Your expertise in this domain has been a guiding light in these improvements, and we deeply appreciate your constructive and insightful comments. If there are any remaining questions or concerns, we would be more than happy to discuss further. Could you kindly let us know if the points we addressed resolve your concerns, and if you would consider revisiting your evaluation score based on the additional contents?
Thank you once again for your thoughtful feedback and engagement, as it has greatly contributed to improving the quality of our work.
Warm regards,
Authors of Submission 6110
Dear Reviewer m9pv:
We hope this message finds you well. As the discussion period draws to a close in 20 hours, we are reaching out to solicit your thoughts on the rebuttal responses and the revised manuscript, inspired by your valuable insights. We have provided additional supportive experiments and conducted further discussions in the rebuttal responses and the revised manuscript.
Your feedback is invaluable, and we deeply appreciate your time and effort. If there are any remaining questions or concerns, we would be more than happy to clarify further. Could you kindly let us know if the points we addressed resolve your concerns, and if you would consider revisiting your evaluation score based on the additional evidence?
Best regards,
Authors of Submission 6110
Dear Reviewer m9pv:
We hope this message finds you well. As the discussion period draws to a close in less than 8 hours, we are reaching out to solicit your thoughts on the rebuttal responses, the revised manuscript and the latest version of the paper (Please download it at the following anonymous link https://anonymous.4open.science/r/Revised_Paper-ICLR2025_6110_submission/ICLR_2025_6110_submission.pdf ). In the latest version, we have:
- incorporated all the additional experiments and discussions from the rebuttal.
- transitioned to a research question-oriented presentation style, significantly improving both content organization and presentation clarity.
- balanced the length of the main text and the appendix.
The following is a more specific elaboration of these modifications.
In the main text:
- In the Introduction, we present the three questions on which the content of this paper is based, and in the summary part of the Introduction, we show the work we have done to answer these three questions.
- In Section 4.2, we give the answer to Question 1.
- In Section 4.3, we give the answer to Question 2, and present the EAFO methodology outline in a more aesthetically pleasing form.
- In Section 4.4, we give the answer to Question 3, and add the main conclusions of the Lipschitz Continuity Analysis.
- In Section 5.1, we add the main results of the Entropy Analysis across Network Layers, and briefly mention the Additional Experiments on Architecture and Dataset.
- In the Discussion, we provide further discussion of potential applications.
- Throughout the writing, we cite all appendix content in order to facilitate a good correspondence for readers.
In the appendix:
- We provide a detailed proof of the Lipschitz Continuity Analysis in Appendix E.
- We provide additional experiments on architecture and dataset in Appendix F.
- We provide additional experiments on mixed activation functions in Appendix G.
- We provide further discussion on initialization and training stability in Appendix H.
- We provide further discussion on why lower entropy indicates better classification in Appendix I.
- We provide further discussion on the bias towards activation functions in pre-trained models in Appendix J.
- We provide further discussion on dynamic optimization in Appendix K.
- We provide further discussion on activation function ranking in Appendix L.
- We provide further discussion on the LLM inference task in Appendix M.
Your feedback is invaluable, and we deeply appreciate your time and effort. If there are any remaining questions or concerns, we would be more than happy to clarify further. Could you kindly let us know if the points we addressed resolve your concerns, and if you would consider revisiting your evaluation score based on the additional evidence?
Best regards,
Authors of Submission 6110
This paper targets the fundamental challenge of activation function design in deep neural networks, which has relied heavily on empirical knowledge rather than a systematic understanding and theoretical foundations. The authors thus propose a new theoretical framework connecting information entropy to activation function performance, which verifies the existence of a worst-case activation function (WAFBC) and thereby develops an entropy-based optimization method (EAFO). The key theoretical contribution of this work is establishing that moving away from WAFBC can consistently improve the model’s performance, leading to a systematic approach for activation function optimization. Built upon this, the authors present Correction Regularized ReLU (CRReLU), demonstrating its great performance across vision transformers and language models. The experiments are comprehensive, covering both image classification (CIFAR-10/100, ImageNet-1K) and language model fine-tuning tasks, with thorough ablation studies and theoretical guarantees.
Strengths
(S1) Theoretical Foundation: This paper establishes a solid mathematical framework connecting information entropy to activation function performance. The derivation begins with principles of information theory and extends through functional analysis to establish clear relationships between data distributions and activation behavior. Specifically, the proof of the Worst Activation Function with Boundary Conditions (WAFBC) existence is clear, utilizing variational calculus and the Euler-Lagrange equation to demonstrate global maximality. As such, it not only provides insights into why certain activation functions perform better than others but also explains long-observed empirical phenomena, such as the superior performance of unbounded activation functions (like ReLU) compared to bounded ones (such as sigmoid and tanh), which offers both theoretical guarantees and practical optimization guidance.
(S2) Technical Originality and Soundness: The proposed EAFO method represents a significant advancement in activation function design. Unlike previous ones that largely relied on empirical knowledge, EAFO provides a principled and systematic framework. The derivation of correction terms through analysis of the information entropy functional's Taylor expansion is insightful, enabling both static design and potential dynamic optimization. The introduction of learnable parameters in CRReLU demonstrates a thoughtful balance between theoretical purity and practical adaptability. Moreover, its potential extension to dynamic optimization during training seems to open new research directions, while maintaining backward compatibility with existing architectures and optimization techniques.
(S3) Thorough Experiments: Experiments in this work are comprehensive and well-designed, covering multiple network architectures and task domains. Extensive ablation studies and sensitivity analyses are also conducted to show the methods’ effectiveness. Concretely, the evaluation across vision transformers (ViT, DeiT, TNT) and LLMs (GPT-2) shows broad applicability, while the performance improvements on classical computer vision benchmarks (like CIFAR-10/100 and ImageNet-1K) provide strong practical validation. The large-scale experiments on language model fine-tuning using Direct Preference Optimization (DPO) provide valuable insights into the method's scalability and generalization capabilities. Moreover, the computational efficiency analysis is particularly useful, showing minimal overhead despite the addition of learnable parameters.
(S4) Presentation Clarity: This manuscript exhibits great clarity in presenting mathematical concepts and empirical results. The progression from theoretical foundations through practical implementation is logical and well-structured, making the work accessible to a broader audience while maintaining technical insights. The mathematical derivations are with appropriate detail and clear step-by-step explanations, facilitating reproducibility and future extensions. In addition, the thorough implementation details, including pseudo-code and network architecture considerations, ensure the practical applicability of this work.
Weaknesses
(W1) Theoretical Limitations: The authors make the assumption of Gaussian distribution for the input data distributions in this paper. While it is mathematically convenient, it requires more rigorous justification. While the authors cite the Central Limit Theorem and previous works supporting this assumption in deep neural networks, modern architectures like transformers with complex operators like self-attention mechanisms may exhibit significantly different distribution patterns. This work would benefit from a more detailed analysis of how distribution deviations affect the theoretical guarantees. In addition, the convergence properties during the training process, particularly the interaction between the learnable parameter and standard network weights, lack thorough theoretical treatment.
(W2) Experimental Concerns: The experimental results, while generally strong, reveal several areas requiring deeper investigation. The performance compared to GELU in DeiT experiments raises important questions about the interaction between CRReLU and knowledge distillation processes. It deserves a more thorough analysis, potentially exploring alternative distillation strategies that are more compatible with the CRReLU's properties. Besides, the initialization strategy for the learnable parameter appears somewhat arbitrary (set to 0.01). Moreover, the absence of experiments on classical CNN architectures leaves a significant gap in demonstrating the method's generality, particularly given the widespread use of CNN-based network architectures.
(W3) Dynamic Optimization Challenges: This work employs dynamic optimization during training, which potentially faces several practical challenges. For example, the computational complexity analysis of dynamic optimization is insufficient, particularly for large-scale networks where activation function optimization could introduce substantial overhead. The interaction between dynamic activation optimization and common training techniques (batch normalization, residual connections, dropout) also requires more detailed analysis. I recommend the authors conduct more experimental validation and analysis to address these issues.
(W4) Implementation and Scalability Considerations: The practical implementation of EAFO and CRReLU requires more detailed treatment, particularly regarding numerical stability and computational efficiency at scale. Discussion of potential gradient flow issues when the learnable parameter ε takes extreme values, and the mitigation strategies are all not provided. Additionally, the paper would benefit from analysis of how the method performs under resource-constrained conditions, such as mobile devices or edge computing scenarios. All these could provide more insights to the researchers and practitioners in the community, and thus propel further research.
Questions
(Q1) Dynamic Optimization Implementation: While the authors suggest the potential for dynamic optimization of activation functions during training, the practical implementation remains relatively unclear. Could the authors elaborate on:
- Concrete strategies for making dynamic optimization computationally tractable in large networks?
- Specific approaches to balance the frequency of activation function updates with computational overhead?
- Empirical evidence or theoretical bounds on the expected performance gains from dynamic optimization? Understanding these aspects would help assess the practical value of the dynamic optimization extension.
(Q2) Initialization and Training Stability: The choice of ε=0.01 as initialization appears somewhat arbitrary. Could the authors provide:
- Analysis of how different initialization values affect training dynamics and final performance?
- Guidelines for selecting optimal ε values based on network architecture or task requirements?
- Can we investigate potential instabilities or failure cases under different initialization schemes? This information would be crucial for practitioners implementing CRReLU in their own networks.
Additional Comment:
I hope my review helps to further strengthen this paper and helps the authors, fellow reviewers, and Area Chairs understand the basis of my recommendation. I also look forward to the rebuttal feedback and further discussions, and would be glad to raise my rating if thoughtful responses and improvements are provided.
-------------------- Post-Rebuttal Summary --------------------
The additional experiments, discussions, and revised manuscript provided by the authors have significantly strengthened the work and addressed most of my concerns. I suppose this work can provide knowledge advancement to the field, and I look forward to the final revised manuscript, incorporating the additional information presented in the rebuttal stage.
Dear Reviewer LUV6:
Thank you for your great efforts on our work, for your thorough and comprehensive summary of the strengths and weaknesses, for your insightful comments, and for your constructive suggestions.
W1
We wholeheartedly concur with your insightful comments, and thank you once more for them. In this work, we assume a Gaussian distribution of the data, a hypothesis that has been reasonably validated in previous studies on MLPs and CNNs; thus, our approach for MLPs and CNNs is grounded in theoretical validation. However, as you noted, the self-attention mechanisms in transformers may exhibit significantly different distribution patterns, and no studies have yet elucidated the precise form of these distributions. Consequently, we employ transformers in our experiments, relying on empirical validation. We believe that current research on the distribution patterns of modern architectures is still limited and insufficient to support the theoretical guarantees regarding how distribution deviations impact this work. Therefore, we intend to leave this issue for future exploration.
Regarding the convergence properties, we would like to offer some additional insight by analyzing the Lipschitz continuity of the activation functions involved. Prior work [1][2][3][4] shows that Lipschitz continuity exerts a significant influence on convergence, and in work [5] the authors demonstrate the Lipschitz continuity of GELU and compute its Lipschitz constant.
Definition A function f is said to be Lipschitz continuous if there exists a constant L such that for all x, y ∈ ℝ, the following inequality holds:
$
|f(x)-f(y)| \leq L |x-y|
$
Moreover, a smaller Lipschitz constant indicates a higher degree of Lipschitz continuity.
In work [5], the authors compute the Lipschitz constant by finding the maximum absolute value of the derivative of the GELU function; it is computed to be 1.084. Following the same approach, we first compute the Lipschitz constants of SiLU and Mish.
Insight 1 The Lipschitz constant of SiLU is 1.0998.
Proof sketch:
$
SiLU(x)=\frac{x}{1+e^{-x}}
$
$
\frac{d\,SiLU(x)}{dx}=\frac{(x+1)e^{-x}+1}{(1+e^{-x})^2}
$
$
\frac{d^{2}SiLU(x)}{dx^2}=\frac{e^x(-e^x(x-2)+x+2)}{(1+e^{x})^3}
$
Upon further calculation, we ascertain that the first derivative is bounded in the range [-0.0998, 1.0998].
Insight 2 The Lipschitz constant of Mish is 1.0885.
Proof sketch:
$
Mish(x)=x\,\frac{e^{2x}+2e^x}{e^{2x}+2e^x+2}
$
$
\frac{d\,Mish(x)}{dx}=\frac{e^x[4(x+1)+4e^{2x}+e^{3x}+e^x(4x+6)]}{(e^{2x}+2e^x+2)^2}
$
$
\frac{d^{2}Mish(x)}{dx^2}=\frac{4e^x(3e^{2x}(x-2)+2e^{3x}(x-1)-2(x+2)-2e^x(x+4))}{(e^{2x}+2e^x+2)^3}
$
Upon further calculation, we ascertain that the first derivative is bounded in the range [-0.112526, 1.0885].
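As an independent sanity check on the constants stated in Insights 1 and 2, the derivative bounds can be estimated numerically on a dense grid. This is only a rough sketch: the SiLU derivative uses the closed form σ(x)(1 + x(1 − σ(x))), and the Mish derivative is approximated by central differences rather than the closed form above.

```python
import math

def silu_deriv(x):
    """Closed-form SiLU derivative: sigma(x) * (1 + x * (1 - sigma(x)))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 + x * (1.0 - s))

def mish(x):
    """Mish(x) = x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))

def mish_deriv(x, h=1e-5):
    """Central-difference approximation of the Mish derivative."""
    return (mish(x + h) - mish(x - h)) / (2.0 * h)

# Dense grid on [-20, 20]; the extrema of both derivatives lie well inside it.
grid = [i / 1000.0 for i in range(-20000, 20001)]
L_silu = max(abs(silu_deriv(x)) for x in grid)
L_mish = max(abs(mish_deriv(x)) for x in grid)
```

Both grid maxima agree with the analytically derived constants (≈1.0998 for SiLU, ≈1.0885 for Mish) to three decimal places.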
Insight 3 The Lipschitz constant of CRReLU is max(1+ε, 1-0.446ε).
Proof sketch:
$
CRReLU(x)=x+\epsilon x e^{-\frac{x^2}{2}} \ (x > 0) \quad \text{and} \quad CRReLU(x)=\epsilon x e^{-\frac{x^2}{2}} \ (x \leq 0)
$
Under mild assumptions, we consider the derivative of CRReLU piecewise:
$
\frac{d\,CRReLU(x)}{dx}=1+\epsilon(1-x^2)e^{-\frac{x^2}{2}} \ (x > 0) \quad \text{and} \quad \frac{d\,CRReLU(x)}{dx}= \epsilon(1-x^2)e^{-\frac{x^2}{2}} \ (x < 0)
$
Setting its second derivative to zero (temporarily disregarding the potential non-differentiability at x = 0):
$
\frac{d^2\,CRReLU(x)}{dx^2} = \epsilon e^{-\frac{x^2}{2}} (x^3-3x)=0
$
Then we have x = 0 or x = ±√3.
Taking x = √3 in the x > 0 branch gives 1+ε(1-3)e^{-3/2} = 1-0.446ε; taking x = -√3 in the x < 0 branch gives -0.446ε. Since we need an upper bound for the derivative, we also take into account both one-sided values at x = 0, namely 1+ε (from the right) and ε (from the left). Hence, under mild assumptions,
$
L=\max(|1+\epsilon|, |\epsilon|, |1-0.446\epsilon|, |0.446\epsilon|)
$
Considering this further, if ε > 0 we have L = 1+ε; and if ε < 0 we have L = 1-0.446ε. Hence, we can express the Lipschitz constant of CRReLU as max(1+ε, 1-0.446ε).
Insight 4 In order for the Lipschitz constant of CRReLU to remain lower than that of GELU, ε must lie in the range [-0.188, 0.084]. We recommend setting the initial value of ε within this range.
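Insight 3 can be checked numerically as well. The sketch below assumes the form CRReLU(x) = max(x, 0) + ε·x·exp(−x²/2), i.e. the antiderivative consistent with the piecewise derivative in the proof sketch; the function names are ours:

```python
import math

def crrelu_deriv(x, eps):
    """Piecewise derivative of CRReLU(x) = max(x, 0) + eps * x * exp(-x^2 / 2)."""
    corr = eps * (1.0 - x * x) * math.exp(-x * x / 2.0)
    return (1.0 if x > 0 else 0.0) + corr

def lipschitz_constant(eps, step=1e-3, span=10.0):
    """Grid estimate of sup |CRReLU'(x)| over [-span, span]."""
    n = int(span / step)
    return max(abs(crrelu_deriv(i * step, eps)) for i in range(-n, n + 1))

# Insight 3 predicts L = max(1 + eps, 1 - 0.446 * eps):
# e.g. eps = 0.05 should give L ~ 1.05, and eps = -0.1 should give L ~ 1.0446.
```

For ε > 0 the grid maximum is attained just right of zero (the 1+ε branch), and for ε < 0 at x = √3 (the 1−0.446ε branch), matching the case analysis above.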
W2
Thank you once more for your insightful comment and for your constructive suggestions. We further conduct multiple runs of all experiments, reporting error bars to elucidate the statistical properties of the results. Furthermore, we conduct experiments on ConvNeXt (one of the latest CNN architectures) on EuroSAT, CIFAR10, CIFAR100 and ImageNet1K; please refer to the results in Response to Reviewer m9pv. In Response to Reviewer YJtL, we provide entropy calculations across all 12 layers of ViT as well as experimental results with 6 layers using GELU and 6 layers using CRReLU. We hope these additional experimental enhancements could alleviate your concerns to some extent. Your point about the interaction between CRReLU and the knowledge distillation process is quite insightful. We observe that this issue is not limited to CRReLU: comparing GELU and PReLU, GELU is 3% lower than PReLU in the ViT model, whereas in the DeiT model GELU is 0.7% higher than PReLU. We believe such issues relate more to the bias of the teacher model towards activation functions (the currently used open-source implementation employs GELU), so the point you raise applies to all activation functions except GELU; in other words, in this context the comparison of other activation functions against GELU is not a fair one. Regarding the initialization strategy, we present further insights in the response to Q2.
Q2
Thank you once more for your insightful comments and for your constructive suggestions. As previously mentioned, we believe it is better to choose ε within the range [-0.188, 0.084], which makes CRReLU's Lipschitz continuity better than GELU's. Furthermore, with computation, if ε is in [0.084, 0.0885] or [-0.198, -0.188], CRReLU's Lipschitz continuity is worse than GELU's but better than Mish's; and if ε is in [0.0885, 0.0998] or [-0.2238, -0.198], it is worse than Mish's but better than SiLU's. Following your suggestions, we further conduct experiments with ViT-tiny on CIFAR10 and CIFAR100, setting ε to different initial values: -0.5, -0.2, -0.1, -0.05, -0.02, -0.01, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, and 10. We conduct three runs under each condition and report the mean and standard deviation.
Table1: Test accuracy of experiments conducted with ViT for 100 epochs under different initializations with error bar
| Init. ε | 0.01 | 0.02 | 0.05 | 0.1 | 0.2 | 0.5 | 1 | 10 |
|---|---|---|---|---|---|---|---|---|
| CIFAR10 | 0.807±0.004 | 0.801±0.001 | 0.797±0.003 | 0.796±0.002 | 0.781±0.007 | 0.741±0.010 | 0.687±0.003 | 0.603±0.004 |
| CIFAR100 | 0.466±0.006 | 0.460±0.003 | 0.456±0.004 | 0.449±0.003 | 0.436±0.004 | 0.364±0.009 | 0.299±0.008 | 0.227±0.006 |

| Init. ε | -0.01 | -0.02 | -0.05 | -0.1 | -0.2 | -0.5 |
|---|---|---|---|---|---|---|
| CIFAR10 | 0.801±0.004 | 0.800±0.003 | 0.801±0.002 | 0.806±0.001 | 0.805±0.003 | 0.804±0.001 |
| CIFAR100 | 0.459±0.003 | 0.460±0.006 | 0.461±0.006 | 0.461±0.003 | 0.460±0.001 | 0.458±0.005 |
From the above experiments, we can see that the initialization strategy does have an impact on the final result. When the initial values differ significantly from the values derived earlier (such as 0.5, 1, 10), training performance degrades severely; especially at 1 and 10, the training process becomes extremely unstable. On guidelines for selecting the optimal value of ε, we would like to provide some insights. First, we suggest searching within the aforementioned range [-0.188, 0.084]. Furthermore, we suggest testing multiple values within this range using k-fold cross-validation: train and evaluate the model on different folds to find the optimal value. In addition, if prior knowledge of the dataset and network structure is available, it can be fully exploited; for example, if the network tends to produce negative outputs, increasing the value of ε can be considered so that the model better captures the features of the samples.
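The k-fold selection suggestion can be sketched as follows. This is a generic illustration, not the paper's procedure; `train_eval` is a hypothetical callback that trains the network with the given initial ε on one fold's training indices and returns validation accuracy:

```python
def select_epsilon(candidates, folds, train_eval):
    """Pick the initial epsilon with the highest mean k-fold validation accuracy.

    candidates: initial epsilon values to try (e.g. within [-0.188, 0.084]);
    folds: list of (train_idx, val_idx) pairs;
    train_eval: callback (eps, train_idx, val_idx) -> validation accuracy.
    """
    best_eps, best_score = None, float("-inf")
    for eps in candidates:
        scores = [train_eval(eps, train_idx, val_idx) for train_idx, val_idx in folds]
        mean_acc = sum(scores) / len(scores)  # average accuracy across folds
        if mean_acc > best_score:
            best_eps, best_score = eps, mean_acc
    return best_eps, best_score
```

Averaging over folds reduces the variance of the comparison between candidate initializations, which matters given the run-to-run standard deviations reported in the tables above.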
W3 and Q1
Thank you once more for your insightful comments and for your constructive suggestions. We fully agree with the point you raised that extensively applying dynamic activation optimization in large-scale neural networks would likely incur enormous computational costs. We discuss this point under the second limitation, and it is still an issue that we are actively researching. In this paper, we resort to learnable parameters as a proxy for such dynamic optimization; as of now, we do not have an algorithm that can effectively perform dynamic optimization of activations, and it seems that no such algorithm has been developed within the community either. We intend to leave this challenging problem for future work. In addition, we are considering designing algorithms for network structures that inherently focus on the optimization of activation functions, such as KANs. Regarding the second point you mentioned, we plan to set the frequency of activation function updates at the batch level, which not only reduces the execution cost but also improves training stability.
W4
Thank you once more for your insightful comments and for your constructive suggestions. Regarding the issue of numerical stability, as shown in the aforementioned table, when ε takes on an extreme value (such as an initialization of 10), there is a dramatic decrease in performance and instability during training; therefore, your viewpoint is completely correct. Furthermore, we believe that an appropriate initialization can mitigate this issue: when ε is initialized within the recommended range (here, 0.01), we observe that it remains between -0.2 and 0.02 throughout the entire training process.
Finally, we would like to thank you once again for your insightful comments, your constructive suggestions, your thorough and comprehensive summary of the strengths and weaknesses, and your great efforts on our work.
Warm regards,
Authors of submission 6110
[1] Gouk H, Frank E, Pfahringer B, et al. Regularisation of neural networks by enforcing lipschitz continuity[J]. Machine Learning, 2021, 110: 393-416.
[2] Khromov G, Singh S P. Some Fundamental Aspects about Lipschitz Continuity of Neural Networks[C]//The Twelfth International Conference on Learning Representations. 2024.
[3] Xu Y, Zhang H. Uniform Convergence of Deep Neural Networks with Lipschitz Continuous Activation Functions and Variable Widths[J]. IEEE Transactions on Information Theory, 2024.
[4] Béthune L. Deep learning with Lipschitz constraints[D]. Université de Toulouse, 2024.
[5] Lee M. Gelu activation function in deep learning: a comprehensive mathematical analysis and performance[J]. arXiv preprint arXiv:2305.12073, 2023.
[6] Liu Z, Mao H, Wu C Y, et al. A convnet for the 2020s[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 11976-11986.
Dear Reviewer LUV6:
We hope this message finds you well. As the discussion period draws to a close, we are reaching out to solicit your thoughts on the rebuttal responses and the revised manuscript, inspired by your valuable insights. We have provided additional supportive experiments and conducted further discussions in the rebuttal responses and the revised manuscript.
We would like to briefly summarize the changes we made to the manuscript for your easier navigation. On the additional supportive experiments, specifically, we focus on the following aspects: enhancing all experimental results with three runs, additional architecture (Appendix F.1), additional dataset (Appendix F.2), additional initialization (Appendix F.3), entropy calculation after activation (Appendix F.4) and mixed activation function (Appendix F.5). On the additional discussions, we focus on Lipschitz continuity analysis (Appendix G), initialization and training stability (Appendix H), lower entropy indicates better classification (Appendix I) and dynamic optimization (Appendix J).
Your expertise in this domain has been a guiding light in these improvements, and we deeply appreciate your constructive and insightful comments. If there are any remaining questions or concerns, we would be more than happy to discuss further. Could you kindly let us know if the points we addressed resolve your concerns, and if you would consider revisiting your evaluation score based on the additional contents?
Thank you once again for your thoughtful feedback and engagement, as it has greatly contributed to improving the quality of our work.
Warm regards,
Authors of Submission 6110
Dear Authors of Submission 6110,
I have thoroughly reviewed the authors' responses and carefully examined all the additional experimental results provided in the rebuttal. After careful consideration of both the rebuttal and the revised manuscript, I find that the authors have made substantial improvements that address the key weaknesses identified in my original comments.
The theoretical foundation has been significantly strengthened through the addition of rigorous Lipschitz continuity analysis. The authors have meticulously derived and compared the Lipschitz constants for multiple activation functions, including GELU (1.084), Mish (1.089), SiLU (1.09984), and CRReLU (max(1+ε, 1-0.446ε)). This analysis led to a well-justified recommendation for the initialization range of ε ∈ [-0.188, 0.084], providing crucial practical guidance for implementation.
The authors have also provided thoughtful responses regarding the dynamic optimization challenges, suggesting practical approaches like batch-level updates and momentum-based techniques. While full dynamic optimization remains an open challenge, the proposed strategies offer viable paths forward.
Regarding the Gaussian distribution assumption, the authors acknowledge its limitations while providing empirical evidence of CRReLU's effectiveness even in scenarios where this assumption may not hold, particularly in Transformer-based network architectures.
Given these substantial improvements and clarifications, I am revising my rating from 5 to 6, as the work now presents a more complete contribution to the field. I strongly recommend the author incorporate all the additional experiments and discussions in the rebuttal to the revised manuscript to enhance its soundness. I look forward to further discussions with the authors.
Best regards,
Reviewer LUV6
Dear Reviewer LUV6:
We would like to thank you once more for your great efforts and time on our work, for your thorough and comprehensive summary on the strengths and weaknesses, for your insightful comments and for your constructive suggestions. We are carefully working on summarizing all the additional experiments and discussions in the rebuttal and will incorporate all of them in a more coherent and logical framework in the next version. Concurrently, we will carefully consider the balance between the primary text and the appendix to ensure that the work remains comprehensible to a wider audience without compromising on technical profundity.
We will incorporate all of your precious suggestions into the next version of the paper; they are indeed profoundly beneficial. Your expertise in this domain is highly admirable, and we also look forward to further discussions with you.
Finally, we would like to thank you once again for your invaluable feedback and precious suggestions on the manuscript.
Best regards,
Authors of Submission 6110
Dear Authors,
Thank you for your kind acknowledgment. I appreciate that my suggestions can help strengthen this work further. In particular, I am pleased with your commitment to balancing technical depth with broad accessibility - this is crucial for maximizing the paper's impact on the field.
Building upon this, from my perspective, I would further recommend reframing the manuscript to more explicitly emphasize the fundamental challenges that CRReLU addresses, particularly regarding the optimization and design of activation functions in deep neural networks. Transitioning the paper's presentation style from a largely method-oriented way to a research question-oriented narrative would better highlight the significant contributions of this entropy-based activation framework to the broader research community. This reframing would more effectively communicate the key advances of this work, which ultimately provides enduring value for both researchers and practitioners in the community.
I look forward to seeing the next version of the revised manuscript and am confident that these additions will significantly enhance its value, whether for the current phase or future submissions. I remain actively engaged in this review process and encourage you to reach out if you need any clarification regarding my previous suggestions.
Best regards,
Reviewer LUV6
Dear Reviewer LUV6:
We would like to thank you once more for your invaluable feedback and precious suggestions on the manuscript. Based on your suggestions, we have completed the next version of the paper. Please download the latest version at https://anonymous.4open.science/r/Revised_Paper-ICLR2025_6110_submission/ICLR_2025_6110_submission.pdf. In this version, we make some changes to the paper's presentation style and adjust the text colors to some extent, so we do not use color to mark the modified content, in order to avoid confusion. In this version of the paper, we:
- incorporate all the additional experiments and discussions in the rebuttal
- transition the presentation style from a largely method-oriented way to a research question-oriented narrative
- balance the length of the main text and the appendix
The following is a more specific elaboration of these modifications.
In the main text:
- In the Introduction, we present the three questions on which the content of this paper is based. And in the summary part of Introduction, we show the work we have done to answer these three questions.
- In Section 4.2, we give the answer to Question 1.
- In Section 4.3, we give the answer to Question 2. We change the presentation form of EAFO methodology outline in a more aesthetically pleasing manner.
- In Section 4.4, we give the answer to Question 3. We add the main conclusions of Lipschitz Continuity Analysis.
- In Section 5.1, we add the main results of Entropy Analysis across Network Layers, and briefly mention the additional experiments on architecture and dataset.
- In the Discussion, we provide further discussion on potential applications.
- Throughout the writing process, we cite all appendix content in the main text in order to provide clear correspondence for readers.
In the appendix:
- We provide detailed proof of Lipschitz Continuity Analysis in Appendix E.
- We provide additional experiments on architecture and dataset in Appendix F.
- We provide additional experiments on mixed activation function in Appendix G.
- We provide further discussion on initialization and training stability in Appendix H.
- We provide further discussion on lower entropy indicates better classification in Appendix I.
- We provide further discussion on bias towards activation function in pre-trained models in Appendix J.
- We provide further discussion on dynamic optimization in Appendix K.
- We provide further discussion on activation function ranking in Appendix L.
- We provide further discussion on LLM inference task in Appendix M.
In the next version, we will continue to strengthen the connections between sections and polish the language of our paper.
Finally, we would like to thank you once again for your invaluable feedback and precious suggestions on the manuscript.
Best regards,
Authors of Submission 6110
Dear Authors,
I have thoroughly reviewed your latest response and the revised manuscript with the changes you've outlined. I am pleased to see the comprehensiveness of the revisions, particularly the transition to a research question-oriented presentation style, which demonstrates a significant improvement in both content organization and presentation clarity. The enhanced logical flow effectively conveys the key research questions and contributions of this work to the field.
The restructuring of the paper around fundamental questions in the Introduction, with corresponding answers developed through Sections 4.2-4.4, significantly enhances the paper's logical flow and accessibility. The addition of Lipschitz Continuity Analysis in Section 4.4 and the Entropy Analysis across Network Layers in Section 5.1 addresses key concerns I had raised previously, providing robust foundations for the methodology.
I am particularly impressed with the revised appendix, which now provides detailed supporting evidence while maintaining excellent readability in the main text. The additional discussions on initialization, training stability, and practical applications demonstrate both the theoretical rigor and the practical applicability of the proposed method.
In light of these improvements, I am revising my rating from 6 to 8, as I suppose this may better reflect the current overall quality of this paper. I remain available for further discussions.
Best regards,
Reviewer LUV6
The paper introduces a theoretical framework to learn a high-performance activation function. It theoretically shows that a worst activation function exists and empirically shows that the proposed framework learns significantly improved activation functions compared to SoTA activation functions.
Strengths
The proposed framework to learn activation is shown to perform significantly better than the SoTA activation function theoretically as well as empirically.
While different networks and tasks require different activation functions for the best performance, the proposed framework simplifies this design choice by transferring the activation choice to automated learning during the optimization stage.
Weaknesses
Although the paper demonstrates substantial empirical improvements, the reported results fall considerably short of state-of-the-art (SoTA) baseline accuracies. For instance, CNNs using ReLU activation commonly achieve test scores above 0.9.
Additionally, there is no direct comparison between SoTA neural network architectures (such as ViT and CNN) using their standard activation functions and those with the proposed activation function. This makes it unclear how much the new activation function improves upon SoTA.
In the LLM fine-tuning task, the improvement over GeLU activation is minimal.
Questions
Why is the baseline accuracy on ImageNet and CIFAR-10 so low? State-of-the-art networks typically achieve test scores over 0.9 on CIFAR-10 and above 0.8 on ImageNet-1K.
In LLM fine-tuning tasks, the paper reports marginal improvements over GELU. Could the authors provide further insight into the specific benefits of CRReLU in this context, beyond numerical accuracy improvements?
How would CRReLU perform if evaluated on more diverse NLP tasks or models with larger parameters, and would any tuning adjustments be needed?
Dear Reviewer 7aze:
Thank you for your efforts on our work, for your insightful comments and for your constructive suggestions. Based on your comments, we summarize the strength of our work as follows:
- It performs significantly better than SOTAs.
- It exhibits good generalization towards different networks and tasks.
And the weaknesses as follows:
- The reported results fall considerably short of state-of-the-art (SoTA) baseline accuracies.
- For LLM fine-tuning task, the improvement over GELU activation is minimal.
W1 and Q1
Thank you once more for your insightful comment and for your constructive suggestions. The primary reason for the low baseline accuracy on ImageNet and CIFAR-10 lies in the initialization method employed. In the reported results, we employ the 'trunc-normal' initialization for the model weights (such as lines 216~221 and 263~278 in the code ".\EAFO-code\EAFO-Image_classification\reconstruction\models\vit.py").
In reporting SOTA achievements, researchers often employ initialization through pre-training on larger datasets (for instance, ImageNet1K models are initialized using weights pre-trained on ImageNet22K). In the process of executing such initialization, we discover that the released pre-trained models exhibit an intrinsic bias towards the activation functions they were trained with (in other words, a model pre-trained with GELU seems to consistently perform better with GELU in downstream tasks). Thus, to facilitate a fair comparison of the activation functions, we abandon this initialization method and utilize the 'trunc-normal' initialization, which does not introduce any bias towards the activation functions. If we were to further compare these activation functions with the reported state-of-the-art results, it would be essential to pre-train with each activation function individually, which incurs prohibitively high costs. Our work primarily focuses on comparing the empirical performance of different activation functions; thus, we opted to forego the pre-training initialization method. Furthermore, we enhanced the experimental results presented in the paper by conducting multiple runs (please refer to our Response to Reviewer YJtL), and report the results of ConvNeXt on CIFAR and ImageNet1K as well as the results on the EuroSAT dataset with ConvNeXt (please refer to our Response to Reviewer m9pv). We hope these additional experimental results can help mitigate your concerns.
W2, Q2 and Q3

Thank you once more for your insightful comment and for your constructive suggestions. In the LLM fine-tuning tasks, the initial model we utilized is the publicly released GPT-2, which was pre-trained with the GELU activation function. Based on our previous observations, the model exhibits a bias towards GELU; however, the final results indicate that CRReLU still surpasses GELU, albeit to a lesser extent. This also demonstrates, to some extent, the superiority of CRReLU when confronted with larger parameter counts.
Furthermore, CRReLU could potentially achieve a better balance between inference speed and generation diversity in LLM inference tasks. In the work [1], the authors show that leveraging the activation sparsity of ReLU yields a significant reduction in inference FLOPs. However, it is also noteworthy that contemporary open-source LLMs increasingly favor the use of GELU and SiLU, likely driven by considerations surrounding the diversity of model generation: excessive activation sparsity might diminish the generative diversity of the model, thereby reducing user engagement. The authors further illustrate in Figure 2(c) that as the parameter beta increases, activation sparsity improves. Such an observation is closely related to the Lipschitz continuity of the activation function [2] (the last paragraph of Section 3.1 claims that bounded inputs make dot-product self-attention Lipschitz). In our response to Reviewer LUV6, we present a detailed examination of the Lipschitz continuity of GELU, SiLU, Mish and CRReLU. In summary, we obtain a Lipschitz constant of 1.084 for GELU, 1.089 for Mish, and 1.09984 for SiLU. For CRReLU to attain a Lipschitz continuity superior to GELU's, we derive that ε should lie within [-0.188, 0.084]. We recommend that when applying CRReLU, the initialization parameter be set within this range. As ε approaches zero within this range, CRReLU converges more closely to ReLU; according to [1], activation sparsity then improves, while it may also potentially diminish the diversity of the generated outputs. Conversely, as ε moves away from zero within this range, CRReLU may exhibit less activation sparsity, yet simultaneously has the potential to enhance the diversity of the generated outputs.
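The sparsity trade-off can be illustrated with a toy sketch (an illustration under the assumption of standard-normal pre-activations, not tied to any particular LLM): ReLU produces exact zeros for about half of such inputs, which is the property sparsity-aware inference kernels exploit, whereas exact GELU outputs are almost never exactly zero.

```python
import math
import random

def zero_fraction(act, n=100_000, seed=1):
    # Fraction of exactly-zero outputs under standard-normal inputs;
    # exact zeros are what sparsity-aware inference kernels can skip.
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if act(rng.gauss(0.0, 1.0)) == 0.0) / n

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

print(zero_fraction(relu))  # ~0.5
print(zero_fraction(gelu))  # ~0.0
```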
Finally, we would like to thank you once again for the constructive suggestion and insightful comments on our work.
Warm regards,
Authors of submission 6110
[1] Mirzadeh S I, Alizadeh-Vahid K, Mehta S, et al. ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models[C]//The Twelfth International Conference on Learning Representations.
[2] Kim H, Papamakarios G, Mnih A. The lipschitz constant of self-attention[C]//International Conference on Machine Learning. PMLR, 2021: 5562-5571.
Dear Reviewer 7aze:
We hope this message finds you well. As the discussion period draws to a close, we are reaching out to solicit your thoughts on the rebuttal responses and the revised manuscript, inspired by your valuable insights. We have provided additional supportive experiments and conducted further discussions in the rebuttal responses and the revised manuscript.
We would like to briefly summarize the changes we made to the manuscript for your easier navigation. On the additional supportive experiments, specifically, we focus on the following aspects: enhancing all experimental results with three runs, additional architecture (Appendix F.1), additional dataset (Appendix F.2), additional initialization (Appendix F.3), entropy calculation after activation (Appendix F.4) and mixed activation function (Appendix F.5). On the additional discussions, we focus on Lipschitz continuity analysis (Appendix G), initialization and training stability (Appendix H), lower entropy indicates better classification (Appendix I) and dynamic optimization (Appendix J).
Your expertise in this domain has been a guiding light in these improvements, and we deeply appreciate your constructive and insightful comments. If there are any remaining questions or concerns, we would be more than happy to discuss further. Could you kindly let us know if the points we addressed resolve your concerns, and if you would consider revisiting your evaluation score based on the additional contents?
Thank you once again for your thoughtful feedback and engagement, as it has greatly contributed to improving the quality of our work.
Warm regards,
Authors of Submission 6110
Dear Reviewer 7aze:
We hope this message finds you well. As the discussion period draws to a close in 20 hours, we are reaching out to solicit your thoughts on the rebuttal responses and the revised manuscript, inspired by your valuable insights. We have provided additional supportive experiments and conducted further discussions in the rebuttal responses and the revised manuscript.
Your feedback is invaluable, and we deeply appreciate your time and effort. If there are any remaining questions or concerns, we would be more than happy to clarify further. Could you kindly let us know if the points we addressed resolve your concerns, and if you would consider revisiting your evaluation score based on the additional evidence?
Best regards,
Authors of Submission 6110
Dear Authors
Thanks for your detailed answers. I am satisfied with the answers and am willing to increase my score as well.
In this work, the authors propose a theoretical framework for defining the optimality of an activation function (without the optimization considerations). Using Taylor expansion, the authors extend their framework to search for better activation functions (EAFO - Entropy-based Activation Function Optimization) and later also define the worst activation function under boundary conditions. Using the EAFO framework and starting from ReLU, the authors derive a better and novel activation function, CRReLU (Correction Regularized ReLU). The authors then demonstrate on three datasets (CIFAR10, CIFAR100 and ImageNet-1k) that the newly found activation function outperforms ReLU on classification performance. Lastly, the authors also show improved performance on LLM fine-tuning tasks when CRReLU was swapped in for the ReLU activation function.
Strengths
- The paper is written clearly and concisely, and is easy to read.
- An information-theoretic framework for defining the optimality of activation functions for classification tasks is a great approach to searching for activation functions and could potentially generate insights. The authors indicate several properties of worst activation functions, e.g., being bounded; this might require more careful analysis but serves as a good stepping stone for future follow-up works.
Weaknesses
- The premise of EAFO is that extrema in the entropy space after transformation with the activation function correspond to better separability of features in the resulting space, but that does not necessarily mean better classification performance. Moreover, unlike in the discrete case, the entropy of continuous random variables also changes with scale, which might not have any impact on classification performance. Why do the authors believe this is the right measure of how good an activation function is?
- Can the authors rank different activation functions based on the EAFO framework? For example, a comparison of ReLU and PReLU should point to PReLU being better. Since there is already experimental evidence that PReLU is better, if EAFO could confirm it, that would be a great contribution. Similarly, please consider ranking 3-4 activation functions to justify the utility of this framework.
- For the experiments, what are the error bars? How many training runs per result? This is important to understand the statistical significance of the results.
Questions
- My main concern regarding the manuscript is that entropy as an indicator of better classification seems like a very strong statement. One of the key reasons why Sigmoid is not preferred over ReLU is its optimization properties (vanishing gradients). Since the EAFO framework is completely agnostic to that, the contribution of this framework becomes significantly weaker. If the authors could empirically show how EAFO could be used in practice or justify the choice of entropy as an indicator of activation function optimality, that could help address my concerns.
- Another suggestion is to actually compare the entropy post training of neural networks trained with different activations, not just at the end, but also in the intermediate layers. Since the activation function is being used throughout the network, does lower entropy also help there? If not, should only be the last few layers be equipped with CRReLU?
Dear Reviewer YJtL:
Thank you for your efforts on our work, for your insightful comments and for your constructive suggestions. Based on your comments, we summarize the strength of our work as follows:
- The paper is written clearly and concisely, and is easy to read.
- The proposed framework is a great approach to search for activation functions and could potentially generate insights. Furthermore, the proposed WAFBC serves as a good stepping stone for future follow up works.
And the weaknesses as follows:
- Further discussion on why a lower entropy indicates a better classification.
- Rank several activation functions using the EAFO framework, with the statement of lower entropy indicating better classification performance.
- Add error bars for the experiments in order to better understand the statistical significance of the results.
W1 and Q1:
Thank you once more for your insightful comment and for your constructive suggestions. We would like to respond to this question intuitively, empirically, and theoretically, that is, to explain why a lower entropy indicates a better classification. From the intuitive perspective, lower entropy indicates less uncertainty in the feature representation, which usually means more information is captured in fewer features. In other words, lower entropy can suggest that features are more discriminative and better able to distinguish different categories or patterns. From the empirical perspective, early work [1] experimentally showed that minimizing Shannon's entropy of the gap between the output and the desired target could achieve better performance compared to MSE and CE. In early work [2], the authors experimentally illustrated that minimizing the entropy of the error between outputs and desired targets yields exceptionally satisfactory classification performance. From the theoretical perspective, work [3] proved that training a DNN classifier essentially learns the conditional entropy of the underlying data distribution of the dataset (the information or uncertainty remaining in the labels after revealing the input) and derived bounds on the mutual information between the corresponding feature and the label for a classification data model (Section 7). Hence, the conditional entropy H(output|input) decreases over the course of training. In the work [4], the authors derived upper bounds on the generalization error in terms of the mutual information between a learning algorithm's input and output. According to [4], a smaller mutual information means a smaller upper bound on the generalization error, which in turn suggests better classification performance. We have I(input; output) = H(output) - H(output|input). Over the course of training, H(output|input) decreases; hence, in order to make the mutual information I(input; output) as small as possible, we should minimize H(output).
Therefore, we consider that a lower entropy signifies better classification performance.
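The identity I(input; output) = H(output) - H(output|input) used in this argument can be verified on a tiny discrete example (the joint distribution below is purely illustrative):

```python
import math
from collections import defaultdict

def H(probs):
    # Shannon entropy in bits.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Purely illustrative joint distribution p(x, y).
joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.40, ("b", 1): 0.10}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

# H(Y|X) = sum_x p(x) * H(Y | X = x)
h_y_given_x = 0.0
for xv, px in p_x.items():
    cond = [p / px for (x, y), p in joint.items() if x == xv]
    h_y_given_x += px * H(cond)

# Identity: I(X; Y) = H(Y) - H(Y|X)
mi_identity = H(p_y.values()) - h_y_given_x

# Direct definition: I(X; Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )
mi_direct = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items())

print(round(mi_identity, 4), round(mi_direct, 4))  # both ≈ 0.0731
```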
Furthermore, you note that the prevalent belief is that ReLU outperforms Sigmoid due to its immunity to the vanishing gradient issue, which is indeed accurate. In our research, we merely consider this matter from a different perspective. In our work, the discussion of these functions is delineated within the WAFBC part (Section 4.2, lines 229-232), rather than the EAFO part (Section 4.3).
[1] Silva L M, de Sá J M, Alexandre L A. Neural network classification using Shannon's entropy[C]//ESANN. 2005: 217-222.
[2] Santos J M, Alexandre L A, de Sá J M. The error entropy minimization algorithm for neural network classification[C]//Int. Conf. on Recent Advances in Soft Computing. 2004: 92-97.
[3]Yi J, Zhang Q, Chen Z, et al. Mutual information learned classifiers: An information-theoretic viewpoint of training deep learning classification systems[J]. arXiv preprint arXiv:2209.10058, 2022.
[4]Xu A, Raginsky M. Information-theoretic analysis of generalization capability of learning algorithms[J]. Advances in neural information processing systems, 2017, 30.
W2:
Thank you once more for your insightful comment and for your constructive suggestions. EAFO (Entropy-based Activation Function Optimization) actually aims at optimization rather than comparison. Nevertheless, based on your suggestion, we would like to provide a little insight through a comparison of entropies.
The information entropy takes the form (line 161):
$
H(y(x))=-\int p(y(x))y'(x) \log (p(y(x))y'(x)) dx
$
where $y(x)$ is the inverse function of the activation function.
Insight 1: Under mild assumptions, PReLU with a tunable parameter should outperform Leaky-ReLU with a fixed parameter.
Proof sketch:
Both activation functions take the form $f(x)=x$ for $x\geq 0$ and $f(x)=\alpha x$ for $x<0$: for PReLU, $\alpha$ is tunable, while for Leaky-ReLU, $\alpha$ is fixed. The inverse function takes the form $y(x)=x$ for $x\geq 0$ and $y(x)=x/\alpha$ for $x<0$.
We segregate the positive and negative components of the entropy function:
$
\begin{split}
H(y(x))&=-\int_{-\infty}^{0} p(y(x))y'(x) \log (p(y(x))y'(x)) dx -\int_{0}^{+\infty} p(y(x))y'(x) \log (p(y(x))y'(x)) dx \\
&=-\int_{-\infty}^{0} p(x/\alpha)/\alpha \cdot \log (p(x/\alpha)/\alpha) dx -\int_{0}^{+\infty} p(x)\log (p(x)) dx
\end{split}
$
Hence,
$
H(\text{PReLU})-H(\text{Leaky-ReLU}) = -\int_{-\infty}^{0} \left[ p(x/\alpha_1)/\alpha_1 \cdot \log (p(x/\alpha_1)/\alpha_1) - p(x/\alpha_2)/\alpha_2 \cdot \log (p(x/\alpha_2)/\alpha_2) \right] dx
$
where $\alpha_1$ represents the tunable parameter of PReLU and $\alpha_2$ represents the fixed parameter of Leaky-ReLU. Moreover, from the formula, owing to PReLU's ability to dynamically adjust its parameter according to the data distribution $p(x)$, the resulting entropy, and hence the mutual information, will be lower than that of Leaky-ReLU with a fixed parameter, resulting in better classification performance.
However, it is crucial to recognize that such a statement is not strictly accurate: the alteration of the parameter in response to the data distribution will undoubtedly vary across different network architectures. Moreover, ranking different activation functions under general conditions appears to be a rather challenging task (such a ranking requires many strong assumptions); for different network architectures, initializations and sources of stochasticity, the theoretical understanding of this issue still requires a considerable amount of discussion and comprehension.
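The change-of-variables step above can also be sanity-checked numerically. For a bijective piecewise-linear activation $f$, $H(f(X)) = H(X) + E[\log |f'(X)|]$; for Leaky-ReLU with negative slope $\alpha$ and $X \sim N(0, 1)$, this shift is simply $0.5 \log \alpha$. The sketch below (assuming standard-normal pre-activations) estimates the expectation by Monte Carlo:

```python
import math
import random

def entropy_shift_leaky_relu(alpha, n=200_000, seed=0):
    # H(f(X)) - H(X) = E[log |f'(X)|] for a piecewise-linear bijection f.
    # Leaky-ReLU has slope 1 for x >= 0 and slope alpha for x < 0.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        if rng.gauss(0.0, 1.0) < 0.0:
            total += math.log(alpha)
    return total / n

alpha = 0.25
mc = entropy_shift_leaky_relu(alpha)
analytic = 0.5 * math.log(alpha)  # P(X < 0) = 1/2 for a zero-mean Gaussian
print(mc, analytic)
```

A smaller $\alpha$ yields a larger negative shift, i.e. lower output entropy; for PReLU the slope, and hence the shift, adapts during training, which is the mechanism the argument above relies on.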
W3
We sincerely appreciate the constructive suggestion, as it has greatly helped us refine the paper and further strengthen the results. The results in the original article were from a single training run. Following your suggestion, we re-ran all experiments three times to assess statistical significance, reporting both the mean and standard deviation. Experiments on CIFAR10 and CIFAR100 were conducted on 4 RTX 3090 GPUs, and additional experiments on ImageNet1K on 4 NVIDIA L20 GPUs.
Table 1: Test accuracy of experiments conducted on CIFAR10 for 100 epochs, with error bars.
| | GELU | ELU | PReLU | CELU | SiLU | Mish | CRReLU |
|---|---|---|---|---|---|---|---|
| ViT | 0.704±0.002 | 0.664±0.005 | 0.780±0.006 | 0.665±0.006 | 0.686±0.003 | 0.687±0.003 | 0.807±0.003 |
| DeiT | 0.724±0.007 | 0.676±0.006 | 0.754±0.001 | 0.677±0.008 | 0.699±0.005 | 0.702±0.006 | 0.770±0.003 |
| TNT | 0.737±0.005 | 0.695±0.006 | 0.758±0.003 | 0.687±0.002 | 0.711±0.007 | 0.716±0.008 | 0.769±0.005 |
Table 2: Test accuracy of experiments conducted on CIFAR100 for 100 epochs, with error bars.
| | GELU | ELU | PReLU | CELU | SiLU | Mish | CRReLU |
|---|---|---|---|---|---|---|---|
| ViT | 0.326±0.008 | 0.289±0.001 | 0.432±0.010 | 0.289±0.002 | 0.312±0.006 | 0.306±0.008 | 0.466±0.006 |
| DeiT | 0.466±0.009 | 0.405±0.005 | 0.500±0.005 | 0.405±0.005 | 0.435±0.006 | 0.438±0.010 | 0.507±0.001 |
| TNT | 0.475±0.008 | 0.436±0.003 | 0.490±0.007 | 0.430±0.005 | 0.450±0.009 | 0.455±0.008 | 0.509±0.004 |
Table 3: Test accuracy of experiments conducted on ImageNet1K for 100 epochs, with error bars.
| | GELU | ELU | PReLU | CELU | SiLU | Mish | CRReLU |
|---|---|---|---|---|---|---|---|
| ViT | 0.539±0.003 | 0.372±0.006 | 0.568±0.004 | 0.376±0.005 | 0.461±0.007 | 0.469±0.011 | 0.575±0.004 |
| DeiT | 0.617±0.004 | 0.491±0.007 | 0.608±0.004 | 0.489±0.008 | 0.585±0.007 | 0.589±0.003 | 0.616±0.002 |
Q3
Thank you once more for your insightful comment and constructive suggestions; we consider this suggestion exceedingly valuable. Following it, we applied the post-trained ViT-Tiny with CRReLU and GELU on ImageNet1K. Randomly selecting the same ten batches of images from ImageNet, we computed the information entropy after each of the 12 layers. We present the mean and standard deviation of these values as follows:
Table 4: Entropy after activation (GELU and CRReLU) across the 12 layers of the trained ViT on ImageNet1K.
| Layer | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| CRReLU | 7.594±0.007 | 7.598±0.003 | 7.599±0.003 | 7.595±0.003 | 7.592±0.003 | 7.584±0.004 |
| GELU | 7.536±0.046 | 7.541±0.019 | 7.561±0.011 | 7.573±0.006 | 7.580±0.005 | 7.583±0.004 |
| Layer | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|
| CRReLU | 7.572±0.005 | 7.557±0.005 | 7.540±0.005 | 7.523±0.007 | 7.498±0.008 | 7.461±0.008 |
| GELU | 7.585±0.004 | 7.585±0.004 | 7.583±0.004 | 7.580±0.004 | 7.577±0.004 | 7.560±0.004 |
From the results presented above, it is evident that for GELU the entropy across the 12 layers exhibits an overall increasing trend, whereas CRReLU shows a general declining trend. Furthermore, we note that the reduction in entropy for CRReLU between layers 1 and 6 is not significant, whereas a marked decline is observed from layers 7 to 12. In light of your suggestion, we employ GELU for layers 1 to 6 and CRReLU for layers 7 to 12, denoting this as "6GELU+6CRReLU". We conducted three runs on CIFAR10, CIFAR100, and ImageNet1K, presenting the mean and standard deviation of the results as follows. Experiments on CIFAR10 and CIFAR100 were conducted on 4 RTX 3090 GPUs, and those on ImageNet on 4 NVIDIA L20 GPUs.
Table 5: Test accuracy of experiments conducted with ViT (12GELU, 6GELU+6CRReLU, 12CRReLU) for 100 epochs, with error bars.
| | 12GELU | 6GELU+6CRReLU | 12CRReLU |
|---|---|---|---|
| CIFAR10 | 0.704±0.002 | 0.755±0.008 | 0.807±0.003 |
| CIFAR100 | 0.326±0.008 | 0.399±0.004 | 0.466±0.006 |
| ImageNet1K | 0.539±0.003 | 0.512±0.001 | 0.575±0.004 |
From the results, equipping only the last few layers with CRReLU is not as effective as using CRReLU throughout the entire network. In particular, on ImageNet1K, 6GELU+6CRReLU is significantly and consistently worse than both all-GELU and all-CRReLU, which is quite surprising to us. We believe this may be because, although the reduction in entropy is not apparent in the earlier layers, CRReLU's drive toward lower entropy still facilitates superior feature extraction there. It seems that when using GELU in the earlier layers and CRReLU in the later layers, on small-scale datasets one can still benefit from the CRReLU mechanism in the later layers (the features learned in the earlier layers are not yet good enough); on large-scale datasets, however, the features learned in the earlier layers may even have a negative effect.
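The layer-wise entropy measurement behind Table 4 could be sketched as follows. Note this is our own illustration: the 256-bin histogram estimator (in bits) and the random stand-in data are assumptions, since the exact estimator is not restated here.

```python
import numpy as np

def histogram_entropy(activations, bins=256):
    """Shannon entropy (bits) of a histogram estimate of the distribution of
    post-activation values; the binning scheme is an assumption."""
    counts, _ = np.histogram(np.asarray(activations).ravel(), bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins so log2 is well defined
    return float(-(p * np.log2(p)).sum())

# Hypothetical usage: in the real measurement, `acts` would hold the output of
# one transformer layer's activation (CRReLU or GELU) for a batch of images;
# random data serves as a stand-in here.
rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 197, 192))  # stand-in shape for ViT-Tiny tokens
print(f"entropy ≈ {histogram_entropy(acts):.3f} bits")
```

Comparing such per-layer values across activation functions, as in Table 4, then shows whether entropy rises or falls with depth.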
Finally, we would like to thank you once again for the constructive suggestions and insightful comments on our work.
Warm regards,
Authors of submission 6110
Dear Reviewer YJtL:
We hope this message finds you well. As the discussion period draws to a close, we are reaching out to solicit your thoughts on the rebuttal responses and the revised manuscript, inspired by your valuable insights. We have provided additional supportive experiments and conducted further discussions in the rebuttal responses and the revised manuscript.
We would like to briefly summarize the changes we made to the manuscript for your easier navigation. On the additional supportive experiments, specifically, we focus on the following aspects: enhancing all experimental results with three runs, additional architecture (Appendix F.1), additional dataset (Appendix F.2), additional initialization (Appendix F.3), entropy calculation after activation (Appendix F.4) and mixed activation function (Appendix F.5). On the additional discussions, we focus on Lipschitz continuity analysis (Appendix G), initialization and training stability (Appendix H), lower entropy indicates better classification (Appendix I) and dynamic optimization (Appendix J).
Your expertise in this domain has been a guiding light in these improvements, and we deeply appreciate your constructive and insightful comments. If there are any remaining questions or concerns, we would be more than happy to discuss further. Could you kindly let us know if the points we addressed resolve your concerns, and if you would consider revisiting your evaluation score based on the additional contents?
Thank you once again for your thoughtful feedback and engagement, as it has greatly contributed to improving the quality of our work.
Warm regards,
Authors of Submission 6110
Dear Reviewer YJtL:
We hope this message finds you well. As the discussion period draws to a close in 20 hours, we are reaching out to solicit your thoughts on the rebuttal responses and the revised manuscript, inspired by your valuable insights. We have provided additional supportive experiments and conducted further discussions in the rebuttal responses and the revised manuscript.
Your feedback is invaluable, and we deeply appreciate your time and effort. If there are any remaining questions or concerns, we would be more than happy to clarify further. Could you kindly let us know if the points we addressed resolve your concerns, and if you would consider revisiting your evaluation score based on the additional evidence?
Best regards,
Authors of Submission 6110
Dear Reviewer YJtL:
We hope this message finds you well. As the discussion period draws to a close in less than 8 hours, we are reaching out to solicit your thoughts on the rebuttal responses, the revised manuscript and the latest version of the paper (Please download it at the following anonymous link https://anonymous.4open.science/r/Revised_Paper-ICLR2025_6110_submission/ICLR_2025_6110_submission.pdf ). In the latest version, we have:
- incorporated all the additional experiments and discussions from the rebuttal.
- transitioned to a research question-oriented presentation style, significantly improving both content organization and presentation clarity.
- balanced the length of the main text and the appendix.
The following is a more specific elaboration of these modifications.
In the main text:
- In the Introduction, we present the three questions on which the content of this paper is based, and in its summary we show the work we have done to answer them.
- In Section 4.2, we give the answer to Question 1.
- In Section 4.3, we give the answer to Question 2. We present the EAFO methodology outline in a more aesthetically pleasing manner.
- In Section 4.4, we give the answer to Question 3. We add the main conclusions of the Lipschitz Continuity Analysis.
- In Section 5.1, we add the main results of the Entropy Analysis across Network Layers and briefly mention the additional experiments on architectures and datasets.
- In the Discussion, we provide further discussion of potential applications.
- Throughout the writing process, we reference all appendix content in the main text to facilitate good correspondence for readers.
In the appendix:
- We provide a detailed proof for the Lipschitz Continuity Analysis in Appendix E.
- We provide additional experiments on architectures and datasets in Appendix F.
- We provide additional experiments on mixed activation functions in Appendix G.
- We provide further discussion on initialization and training stability in Appendix H.
- We provide further discussion on why lower entropy indicates better classification in Appendix I.
- We provide further discussion on bias towards activation functions in pre-trained models in Appendix J.
- We provide further discussion on dynamic optimization in Appendix K.
- We provide further discussion on activation function ranking in Appendix L.
- We provide further discussion on the LLM inference task in Appendix M.
Your feedback is invaluable, and we deeply appreciate your time and effort. If there are any remaining questions or concerns, we would be more than happy to clarify further. Could you kindly let us know if the points we addressed resolve your concerns, and if you would consider revisiting your evaluation score based on the additional evidence?
Best regards,
Authors of Submission 6110
Thank you, authors, for a detailed response. In light of the new experimental results and new discussion, I am bumping my rating from 3 to 5. I think all the discussion the authors added to the paper makes it a much stronger submission than before. However, the wider utility of the proposed framework is the reason for not further increasing my rating.
Dear Reviewer YJtL:
We are more than happy to receive your invaluable feedback. Based on your feedback, we understand that you believe our paper is now a much stronger submission than before. Therefore, we are confident that we have addressed all of (or at least a majority of) your concerns.
Concurrently, we also noticed that you still have doubts about the wider utility of the proposed framework, and we are pleased to provide further clarification. The Entropy-based Activation Function Optimization (EAFO) framework is primarily aimed at optimization. Firstly, the framework is not limited to invertible activation functions: a potential approach to handling non-invertible activation functions is to use the Lebesgue integral in place of the original Riemann integral in the entropy calculation. Secondly, the framework shows potential for iterative activation function optimization during neural network training and lays a solid foundation for the community to address this open challenge in the future. Moreover, the framework also provides insights for novel networks focused on activation optimization, such as KANs. Finally, we believe it can still provide insights on the "Activation Function Ranking" you mentioned, as shown in the rebuttal response. We suspect you raised this question in connection with your own research and have a deep understanding of it. Although we are not yet able to solve this problem perfectly in such a short time, we believe further discussions between us will provide more insight, and we look forward to them.
Finally, we would like to thank you once more for your invaluable feedback.
Best regards,
Authors of Submission 6110
Dear reviewers:
We would like to thank you once again for your great efforts on our work, and for your insightful comments and constructive suggestions. As less than 72 hours remain in the discussion period, we are reaching out to solicit your thoughts on the rebuttal responses. We have gone through your points one by one and tried to address them carefully. We eagerly anticipate your feedback and hope our efforts align with the needs of the community and the rigour of the conference.
Warm regards,
Authors of submission 6110
Dear reviewers:
We would like to thank you once again for your great efforts and time on our work, and for your insightful comments and constructive suggestions. We were glad to see that the reviewers have highlighted the advantages of this paper: the proposed methodology and activation function are novel (Reviewer LUV6, Reviewer m9pv) and serve as a good stepping stone for future follow-up work (Reviewer YJtL); the theoretical foundation is solid and rigorous (Reviewer LUV6, Reviewer m9pv); the empirical evaluation is comprehensive (Reviewer m9pv, Reviewer LUV6) and thorough (Reviewer LUV6); the performance is significantly better than the SOTA (Reviewer 7aze); and the presentation of the paper is clear and concise (Reviewer YJtL, Reviewer LUV6).
We have carefully considered your precious feedback and have revised the manuscript to address your concerns. All additions are colored for your easier review. Below, we outline the key changes implemented.
Additional Experiments
- To strengthen the experimental results, we ran all experiments three times (Tables 1–4).
- Experiments on generalization to network architecture (Appendix F.1).
- Experiments of performance on additional dataset (Appendix F.2).
- Experiments exploring the impact of different initial values of the learnable parameter, as well as potential instabilities or failure cases under different initialization schemes (Appendix F.3).
- Entropy of post-trained neural networks’ comparison (Appendix F.4).
- Experiments of performance comparison with mixed activation function (Appendix F.5).
Additional Discussion
- Lipschitz continuity analysis (Appendix G). Lipschitz continuity constitutes a stronger form of continuity, which imposes an upper bound on the rate of variation of a function. We calculate the Lipschitz constants for GELU, SiLU, Mish, and CRReLU; furthermore, we derive the recommended initialization range.
- Further discussion on initialization and training stability (Appendix H).
- Further discussion on lower entropy indicates better classification (Appendix I).
- Further discussion on dynamic optimization (Appendix J).
We hope these manuscript revisions could address your concerns effectively. Your precious feedback has been instrumental in improving our work, and we thank you again for your constructive input.
Best regards,
Authors of Submission 6110
Dear Authors,
I have thoroughly reviewed the authors' responses and my fellow reviewers' comments. Please refer to my detailed response for specifics. Herein, I strongly recommend the authors to incorporate these valuable discussions and experimental results into the revised manuscript, as they significantly strengthen both the theoretical foundations and empirical validations of this work. To further enhance the manuscript's impact, I suggest:
- Consider adding a concise summary of the Lipschitz continuity analysis in the main text, as this provides crucial theoretical grounding for the initialization strategy.
- Include key findings from the entropy analysis across network layers in the primary results section, as this offers important insights into the method's behavior.
- Consider expanding the discussion on potential applications and limitations in more complex network architectures.
I hope my suggestions help to further strengthen this paper. I also look forward to further discussions.
Best regards,
Reviewer LUV6
Dear Reviewer LUV6:
We would like to thank you once again for your invaluable feedback and precious suggestions on the manuscript. We will incorporate all the additional experiments and discussions in the rebuttal in a more coherent and logical framework in the next version. And we will incorporate all of your precious suggestions into the next version of the paper; they are indeed profoundly beneficial.
Best regards,
Authors of Submission 6110
Dear Authors,
Thank you for your kind acknowledgment. I appreciate that my suggestions can help strengthen this work further. In particular, I am pleased with your commitment to balancing technical depth with broad accessibility - this is crucial for maximizing the paper's impact on the field. Note that I have outlined several additional recommendations in my latest response that I believe could enhance the manuscript's clarity. Please check it out. I remain actively engaged in this review process and encourage you to reach out if you would like any clarification regarding my previous suggestions.
Additionally, I hope these comments help my fellow reviewers and Area Chairs better understand the basis of my recommendation.
Best regards,
Reviewer LUV6
Dear Reviewer LUV6:
We would like to thank you once more for your invaluable feedback and precious suggestions on the manuscript. Based on your suggestions, we have completed the next version of the paper. Please download the latest version at https://anonymous.4open.science/r/Revised_Paper-ICLR2025_6110_submission/ICLR_2025_6110_submission.pdf. In this version, we make some changes to the paper's presentation style and adjust the text colors to some extent, so we do not use color to indicate modified content, in order to avoid color confusion. In this version of the paper, we:
- incorporate all the additional experiments and discussions in the rebuttal
- transition presentation style from a largely method-oriented way to a research question-oriented
- balance the length of the main text and the appendix
The following is a more specific elaboration of these modifications.
In the main text:
- In the Introduction, we present the three questions on which the content of this paper is based, and in its summary we show the work we have done to answer them.
- In Section 4.2, we give the answer to Question 1.
- In Section 4.3, we give the answer to Question 2. We change the presentation form of EAFO methodology outline in a more aesthetically pleasing manner.
- In Section 4.4, we give the answer to Question 3. We add the main conclusions of Lipschitz Continuity Analysis.
- In Section 5.1, we add main results of Entropy Analysis across Network Layers. We briefly mentioned Additional Experiments on Architecture and Dataset.
- In the Discussion, we provide further discussion on potential applications.
- Throughout the writing process, we reference all appendix content in the main text to facilitate good correspondence for readers.
In the appendix:
- We provide detailed proof of Lipschitz Continuity Analysis in Appendix E.
- We provide additional experiments on architecture and dataset in Appendix F.
- We provide additional experiments on mixed activation function in Appendix G.
- We provide further discussion on initialization and training stability in Appendix H.
- We provide further discussion on lower entropy indicates better classification in Appendix I.
- We provide further discussion on bias towards activation function in pre-trained models in Appendix J.
- We provide further discussion on dynamic optimization in Appendix K.
- We provide further discussion on activation function ranking in Appendix L.
- We provide further discussion on LLM inference task in Appendix M.
In the next version, we will continue to strengthen the connections between sections and polish the language of our paper.
Finally, we would like to thank you once again for your invaluable feedback and precious suggestions on the manuscript.
Best regards,
Authors of Submission 6110
Dear Authors,
I have thoroughly reviewed your latest response and the revised manuscript with the changes you've outlined. I am pleased to see the comprehensiveness of the revisions, particularly the transition to a research question-oriented presentation style, which demonstrates a significant improvement in both content organization and presentation clarity. The enhanced logical flow effectively conveys the key research questions and contributions of this work to the field.
The restructuring of the paper around fundamental questions in the Introduction, with corresponding answers developed through Sections 4.2-4.4, significantly enhances the paper's logical flow and accessibility. The addition of Lipschitz Continuity Analysis in Section 4.4 and the Entropy Analysis across Network Layers in Section 5.1 addresses key concerns I had raised previously, providing robust foundations for the methodology.
I am particularly impressed with the revised appendix, which now provides detailed supporting evidence while maintaining excellent readability in the main text. The additional discussions on initialization, training stability, and practical applications demonstrate both the theoretical rigor and practical applicability of the proposed method.
Given these improvements, I am revising my rating from 6 to 8, as I suppose this may better reflect the current overall quality of this paper. I also encourage my fellow reviewers to re-examine the latest revised manuscript through the anonymous link. I remain available for further discussions that may benefit either the authors or other reviewers.
Best regards,
Reviewer LUV6
Dear Area Chair and reviewers:
We would like to thank you once again for your great efforts and time on our work, for your insightful comments and for your constructive suggestions. Below is a concise summary of the rebuttal and the discussion for ease of reference.
Reviewer Highlights in the Original Review
The paper has been recognized for its solid and rigorous theoretical foundation, comprehensive and thorough empirical evaluation, clear and concise presentation. Key highlights include:
- The proposed methodology and activation function is novel (Reviewer LUV6, Reviewer m9pv) and serves as a good stepping stone for future follow up works (Reviewer YJtL) .
- The theoretical foundation is solid and rigorous (Reviewer LUV6, Reviewer m9pv).
- The empirical evaluation is comprehensive (Reviewer m9pv, Reviewer LUV6) and thorough (Reviewer LUV6).
- The performance of the novel activation function is significantly better than the SOTA (Reviewer 7aze).
- The presentation of the paper is clear and concise (Reviewer YJtL, Reviewer LUV6).
Weaknesses that have been Addressed
Reviewers raised concerns regarding:
- More comprehensive experimental verification (multiple experimental runs; more architecture and dataset; more initial values; entropy analysis across network layers; and mixed activation function verification)
- Clarification on existing experimental results (why paper results lower than SOTAs' paper reports; knowledge distillation bias toward activation functions)
- Framework clarification (why lower entropy indicates better classification)
- Further analysis on the novel activation function (convergence properties; recommended range of the initialization value; training stability; insights within other NLP tasks)
- Potential applications (dynamic optimization during iterative training, activation function ranking)
In the Rebuttal responses, we have gone through the points one-by-one and addressed them carefully. Based on the reviewers' feedback, we believe that we have thoroughly addressed their concerns.
Reviewers’ Feedback about the Rebuttal and the Initial Revised Manuscript
In the discussion period, Reviewer LUV6 offered perceptive and profound additional insights into the improvements incorporated in the rebuttal and the presentation of the paper. We sincerely appreciate the reviewer's precious suggestions and comprehensive participation during the discussion phase. From the feedback of Reviewer LUV6, it is considered that:
- we have made substantial improvements that address the key weaknesses.
- theoretical foundation has been significantly strengthened.
- we have provided thoughtful responses regarding the dynamic optimization challenges, and they offer viable paths forward.
- the work now presents a more complete contribution to the field.
Furthermore, we were more than happy to hear the feedback from Reviewer 7aze and Reviewer YJtL on the last day of the discussion phase. It shows that Reviewer 7aze is satisfied with the answers and that Reviewer YJtL considers the paper a much stronger submission than before. We appreciate their feedback.
Latest Version of the Paper
In order to present the next version of the paper more clearly, we provide it via an anonymous link. Please download it at https://anonymous.4open.science/r/Revised_Paper-ICLR2025_6110_submission/ICLR_2025_6110_submission.pdf (we will keep this link active until Jan 23, 2025 AOE and will not update it after Dec 2, 2024 AOE). The following is a more specific elaboration of our modifications.
In the main text:
- In the Introduction, we present the three questions on which the content of this paper is based, and in its summary we show the work we have done to answer them.
- In Section 4.2, we give the answer to Question 1.
- In Section 4.3, we give the answer to Question 2. We change the presentation form of EAFO methodology outline in a more aesthetically pleasing manner.
- In Section 4.4, we give the answer to Question 3. We add the main conclusions of Lipschitz Continuity Analysis.
- In Section 5.1, we add main results of Entropy Analysis across Network Layers. We briefly mentioned Additional Experiments on Architecture and Dataset.
- In the Discussion, we provide further discussion on potential applications.
- Throughout the writing process, we reference all appendix content in the main text to facilitate good correspondence for readers.
In the appendix:
- We provide detailed proof of Lipschitz Continuity Analysis in Appendix E.
- We provide additional experiments on architecture and dataset in Appendix F.
- We provide additional experiments on mixed activation function in Appendix G.
- We provide further discussion on initialization and training stability in Appendix H.
- We provide further discussion on lower entropy indicates better classification in Appendix I.
- We provide further discussion on bias towards activation function in pre-trained models in Appendix J.
- We provide further discussion on dynamic optimization in Appendix K.
- We provide further discussion on activation function ranking in Appendix L.
- We provide further discussion on LLM inference task in Appendix M.
In the next version, we will continue to strengthen the connections between sections, polish the language of the paper, and ensure its format complies with the requirements of ICLR 2025 (especially the 10-page limit for the main text).
Reviewer’s Feedback on the Latest Version of the Paper
Reviewer LUV6 has thoroughly reviewed our latest version of the paper. From the feedback, it is considered that:
- the revisions present comprehensiveness; the transition to a research question-oriented presentation style demonstrating a significant improvement in both content organization and presentation clarity.
- the restructuring of the paper significantly enhances the paper's logical flow and accessibility.
- the addition of Lipschitz Continuity Analysis and the Entropy Analysis across Network Layers provides robust foundations for the methodology.
- the revised appendix provides detailed supporting evidence while maintaining excellent readability in the main text.
- the additional discussions demonstrate both theoretical rigor and practical applicability of the proposed method.
- overall, the work can provide knowledge advancement to the field.
Finally, we would like to thank you once more for your great efforts and time on our work, for your insightful comments and for your constructive suggestions.
Best regards,
Authors of Submission 6110
This paper proposes an entropy-based activation function optimization method from the perspective of information entropy, and derives a new activation function called Correction Regularized ReLU (CRReLU). The presentation of the paper is clear, the proposed method is novel, and its performance significantly outperforms the state-of-the-art (SOTA). Some reviewers raised concerns about the experimental results and the analysis of the new activation function. The authors addressed most of these issues during the rebuttal and provided detailed analysis. Overall, this is a paper with a rich theoretical foundation and comprehensive experimental validation, so the AC recommends acceptance.
Additional Comments from Reviewer Discussion
The main concerns raised by the reviewers were the lack of more comprehensive experimental validation and the need for analysis and clarification of the existing experimental results. The authors added many experiments and carefully analyzed them, addressing most of the reviewers' questions.
Accept (Poster)