Thank you for your insightful review. We sincerely hope that the main concerns raised in the review can be clarified by the following responses.

On the Difference Between Optimization and Optimization

First, we are pleased to discuss the differences between optimization and optimization. Indeed, these two methods have better applications in different fields. It is undeniable that optimization typically achieves better accuracy, which has made it the focus of more widespread research. However, for many scenarios with strict sparsity requirements, optimization remains an important algorithm. Compared to optimization, optimization naturally has lower computational costs, making -based algorithms faster in general. Additionally, in scenarios requiring strict sparsity, LASSO often struggles because it is difficult to directly specify the sparsity level. A typical example is sparse adversarial attacks, where the number of noise points must be limited, but LASSO cannot precisely control the sparsity. Similar issues arise in sparse feature selection tasks.
In addition, in our sparse feature selection experiment, we can see that compared to optimization, optimization is similar in accuracy but has smaller sparsity and faster computation time.

On Numerical Experiments

Thank you for your comments. We have included LASSO-SAGA in the experiments for first-order algorithms. Additionally, we have incorporated the adversarial attack experiment and the sparse feature selection experiment into the main text.

On Presentation Style

We have made the necessary adjustments to the presentation style. Thank you for pointing this out.

It is a bit unclear to me why to use regularization together with ridge regression. One argument in the introduction is that regularization avoids hyperparameters, but in the example you re-introduce hyperparameters. How does the algorithm perform in estimation with this constraint?

First, we would like to clarify that optimization is not solely about avoiding hyperparameters but about avoiding the computation of hyperparameters (compared to ISTA algorithms). When combining -norm with -norm, the model’s robustness to noise can be significantly enhanced. This is because sparse optimization, due to the omission of some features, is often susceptible to noise. In fact, using -norm in sparse optimization problems is a common practice, not only in optimization but also in optimization, as seen in approaches like Elastic Net.

What are the asymptotic bounds on the computational complexity of iteration of the algorithm with hard thresholding?

We note that if has a -sparse unconstrained minimizer, which could happen in sparse reconstruction, or with overparameterized deep networks (see for instance [1] Assumption (2)]), then we would have , and hence that part of the system error would vanish.*

[1] Alexandra Peste, Eugenia Iofinova, Adrian Vladu, and Dan Alistarh. Ac/dc: Alternating compressed/decompressed training of deep neural networks. Advances in Neural Information Processing Systems, 34, 2021.