Making Hard Problems Easier with Custom Data Distributions and Loss Regularization: A Case Study in Modular Arithmetic
We introduce two techniques, varying the diversity of training data and introducing a regularized loss function, to improve transformer learning on modular arithmetic, cryptanalysis, and other synthetic tasks.
Abstract
Reviews and Discussion
This paper proposes a new training strategy and loss function for successfully learning modular addition and other operations. The critical observation is the utility of sparse samples: models learn modular addition better on sparse samples, and if sparse and dense samples are mixed, they learn from the sparse samples and then generalize to the dense ones. Previous studies sample training data uniformly, which concentrates the sampling on non-sparse elements. The authors propose sampling numbers from a distribution in which large numbers are sampled less often (and thus 0 is sampled more often). Further, the new loss function includes an additional term that encourages the model's output to lie on the unit circle; the idea is based on the observation that in hard settings the model's output collapses to the origin. In the experiments, both the proposed sparsity-aware sampling and the custom loss are demonstrated to improve accuracy significantly.
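For concreteness, the loss idea summarized above can be pictured as a standard regression loss on an angular target plus a penalty that keeps the predicted 2-D output near the unit circle instead of the origin. The sketch below is an illustrative reconstruction, not the paper's exact formulation; the regularization weight `lam` and the precise form of the penalty are assumptions.

```python
import torch

def angular_target(y, q):
    """Map an integer label y (mod q) to a point on the unit circle."""
    theta = 2 * torch.pi * y.float() / q
    return torch.stack([torch.cos(theta), torch.sin(theta)], dim=-1)

def regularized_loss(pred, y, q, lam=0.1):
    """MSE to the angular target plus a penalty that discourages the 2-D
    prediction from collapsing toward the origin (its norm is pushed to 1).
    `lam` is a hypothetical weighting; the paper's exact form may differ."""
    target = angular_target(y, q)
    mse = ((pred - target) ** 2).sum(dim=-1).mean()
    norm_penalty = ((pred.norm(dim=-1) - 1.0) ** 2).mean()
    return mse + lam * norm_penalty
```

The penalty term is the piece that counteracts the collapse-to-origin failure mode described above.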
Update after rebuttal:
I appreciate the authors' thorough elaboration and answers to my concerns, which greatly deepened my understanding of their work. The explanation and additional results address my concerns, so I believe this work is worth presenting at the main conference.
Questions for the Authors
Please refer to the other cells. Particularly, I want a clear explanation of the proposed method beyond addition.
Claims and Evidence
The claims and evidence are generally reasonable and convincing, but I have a concern that the explanation and the method sometimes assume modular addition, and the authors should include more general discussions.
For example, [l.113] says
[l.113, left] Because 0 and q-1 are "close" in a modular field, ... but this needs several remarks. First, a modular field is not equipped with any distance, as it violates the triangle inequality; thus, being "close" is not precise even with quotation marks. The two numbers 0 and q-1 might appear "close" because of the unit 1 under addition, but the field is also equipped with multiplication. Similarly, including many 0s makes problems sparser/easier for the addition task, but not necessarily for others; for the multiplication task, 1 should be the number that makes the task "sparser."
Methods and Evaluation Criteria
The proposed method and evaluation criteria are reasonable. One concern is that the explanation of the proposed method is biased toward modular addition and not general enough. The experiments also put a great focus on modular addition, and the extended task (Table 8) also takes an additive form. It would be better for the authors to make the explanation more general and to verify it in the experiments.
Theoretical Claims
No theoretical claims are made.
Experimental Design and Analysis
The experiments carefully examine the claims with variations in the number of terms and modulus. The recent repetition technique is also taken into account, and beyond the modular addition task, LWE cryptanalysis and several standard benchmark tasks are included in the experiments. However, as mentioned in #Methods And Evaluation Criteria, the experiments are strongly based on modular addition. The experiments also include other tasks in the end, but Parity is also an additive task, and the other tasks are not very arithmetic. Including more arithmetic tasks and generalizing the proposed method will make this study even more impactful.
Supplementary Material
I skimmed all the sections to see the overview of the proofs and searched for answers to my concerns and questions.
Relation to Prior Literature
This study makes a strong contribution to modular arithmetic learning. It would be interesting to see whether the proposed method works for symbolic computation that involves modular arithmetic, for example symbolic regression over a modular field, or polynomial sum, multiplication, and reduction with finite-field coefficients.
Missing Important References
The following paper can be added to the literature on ML applications to hard math problems in [l.37, right]:
- Learning to compute Gröbner bases, Hiroshi Kera, Yuki Ishihara, Yuta Kambe, Tristan Vaccon, Kazuhiro Yokoyama, NeurIPS'24
This also shows that polynomial computation over finite fields is unsuccessful due to errors in coefficient predictions. There are also several studies handling easier polynomial tasks (not necessarily over finite fields), and, as mentioned above, it would be interesting future work to see whether the proposed method also works successfully on these tasks over finite fields.
- Do Transformers Understand Polynomial Simplification? Vishesh Agarwal, Somak Aditya, Navin Goyal, ICLR'21
Other Strengths and Weaknesses
Important strengths and weaknesses have been mentioned in other cells. This work is solid and makes sufficient contributions to this topic. The proposed methods are simple and easy to adopt.
Other Comments or Suggestions
The paper is generally well written. As I commented above, the explanation should be carefully revised to make clear whether it refers to addition or to more general cases.
We thank you for your thoughtful feedback.
Re: claims: We will include revisions in the final paper to state the claims more generally and not assume addition. Thanks for pointing out the needed clarification on 0 and q-1 in the modular field; we will update this in the final version.
Re: sparsity: Regarding choosing to include 0 as the element for sparsity, we actually found that including a random but fixed number instead of 0 yields similar results. It is not the number itself but the shift in distribution that leads to the performance improvements. The key is changing the KL divergence between the train and test sets, which is irrespective of the element chosen for sparsity.
To show this, we ran an additional experiment where we substitute the 0 with an arbitrary integer k and shift the distribution by sampling more k's compared to all remaining elements in Z_q. We used a random k for each different q to ensure the specific choice of k is not a factor. We can also run with additional values to show that this trend persists. We found similar results in the multiplication case.
| N | q | k | tau=1% acc |
|---|---|---|---|
| 32 | 257 | 160 | 100.0% |
| 32 | 3329 | 3176 | 100.0% |
| 32 | 42899 | 24606 | 100.0% |
| 32 | 974269 | 79062 | 100.0% |
| 64 | 257 | 160 | 99.2% |
| 64 | 3329 | 3176 | 99.3% |
| 64 | 42899 | 24606 | 99.4% |
| 64 | 974269 | 79062 | 99.3% |
| 128 | 257 | 160 | 97.9% |
| 128 | 3329 | 3176 | 98.3% |
| 128 | 42899 | 24606 | 97.9% |
| 128 | 974269 | 79062 | 97.9% |
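To make the shifted sampling in this experiment concrete, here is a minimal sketch; the mixing probability `p_special` is a hypothetical knob, and the paper's actual distribution may be shaped differently.

```python
import random

def sample_instance(N, q, special=0, p_special=0.5):
    """Draw one modular-addition instance of N elements mod q.
    Each position is the `special` element (0, or a fixed random k) with
    probability p_special, otherwise uniform over Z_q; this skews the
    training data toward sparser instances."""
    xs = [special if random.random() < p_special else random.randrange(q)
          for _ in range(N)]
    return xs, sum(xs) % q

# e.g. a sparse-leaning training sample vs. a uniform test sample
train_x, train_y = sample_instance(N=32, q=257, special=160, p_special=0.5)
test_x, test_y = sample_instance(N=32, q=257, p_special=0.0)
```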
Re: other tasks beyond addition: We also conducted additional experiments on many other tasks based on your feedback. We present results on the synthetic tasks in our response to reviewer ZgiU and the rest below.
Modular multiplication and scalar product: The angular embedding is designed for addition, so we use a standard token embedding in the multiplication experiments and compare against the baseline. We test both modular multiplication and the scalar product of two vectors mod q. For both tasks, the model trained with our approach performs well for smaller q but declines for larger q. The scalar product is more difficult because it requires both multiplications and additions. Still, the baseline reaches around 0% accuracy (at tau=1%) for all settings in both tasks, so ours is still an improvement.
Modular multiplication with standard token embedding
| N | q | tau=1% acc |
|---|---|---|
| 4 | 97 | 100% |
| 4 | 257 | 100% |
| 4 | 3329 | 51% |
| 8 | 97 | 100% |
| 8 | 257 | 100% |
| 8 | 3329 | 32% |
| 16 | 97 | 100% |
| 16 | 257 | 98% |
| 16 | 3329 | 25% |
| 32 | 97 | 100% |
| 32 | 257 | 75% |
| 32 | 3329 | 13% |
| 64 | 97 | 100% |
| 64 | 257 | 65% |
| 64 | 3329 | 3% |
Scalar product with standard token embedding
| N | q | tau=1% acc |
|---|---|---|
| 2 | 97 | 100% |
| 2 | 257 | 100% |
| 2 | 3329 | 98% |
| 4 | 97 | 100% |
| 4 | 257 | 92% |
| 4 | 3329 | 83% |
| 8 | 97 | 78% |
| 8 | 257 | 38% |
| 8 | 3329 | 15% |
Polynomial sum: We also run on two symbolic polynomial tasks mod q: (a) summing N polynomials of degree d, and (b) summing 2 polynomials of degree d.
For both tasks, we encode each polynomial as its sequence of coefficients, separate polynomials with a <SEP> token, and ask the model to predict the coefficients of the polynomial sum. We add a RoPE positional embedding to help the model learn how to count. We fix the modulus q. We report results comparing our approach vs. the baseline.
For all experiments, the model is quickly able to predict the degree of the polynomial sum (the number of tokens to output before the <EOS> token), so the real difference between the two strategies is in predicting the right coefficients. We say that the model found the solution if the maximum error across all coefficients is below a small threshold.
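As an illustration of the sequence encoding described above, a minimal sketch is given below; the token names and exact layout are assumptions, not the trained model's actual vocabulary.

```python
def encode_poly_sum(polys, q):
    """Encode a list of polynomials (each a list of coefficients mod q,
    lowest degree first) as one token sequence, and return the target
    coefficients of their sum. <SEP> separates polynomials; the model is
    asked to emit the sum's coefficients followed by <EOS>."""
    tokens = []
    for i, coeffs in enumerate(polys):
        if i > 0:
            tokens.append("<SEP>")
        tokens += [str(c % q) for c in coeffs]
    deg = max(len(p) for p in polys)
    target = [sum(p[d] for p in polys if d < len(p)) % q for d in range(deg)]
    return tokens, [str(c) for c in target] + ["<EOS>"]

# e.g. (1 + 2x) + (3 + 4x + 5x^2) mod 7
tokens, target = encode_poly_sum([[1, 2], [3, 4, 5]], q=7)
```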
Task a. is closely related to our original task (the degree-0 case reduces to it).
| N | d | Correct % (ours) | Correct % (baseline) |
|---|---|---|---|
| 16 | 1 | 99.5% | 0% |
| 16 | 4 | 99.2% | 0% |
| 16 | 16 | 99.3% | 0% |
| 64 | 1 | 98.8% | 0% |
| 64 | 4 | 98.7% | 0% |
| 64 | 16 | 98.5% | 0% |
Task b.
| d | Correct % (ours) | Correct % (baseline) |
|---|---|---|
| 64 | 99.0% | 94.3% |
| 128 | 99.1% | 92.1% |
This task is a bit easier because it can be decomposed into finding the right index to attend to (which RoPE handles well) plus summing two numbers mod q (which can be completely memorized even without the angular embedding).
Re: references: Thanks for sharing the references; we will update the related work section to include them. We agree that investigating finite-field polynomial tasks is an interesting avenue for future work.
Thank you for your answers and for providing additional results. These address my concerns well and I'll retain my positive score on this work.
Thank you for your helpful feedback and response, we appreciate it!
The paper addresses the challenge machine learning models face in learning modular arithmetic, specifically in the context of the Learning with Errors (LWE) problem. It proposes two techniques: (i) using a designed data distribution that mixes sparse and dense modular arithmetic instances, and (ii) introducing a custom loss function with an angular embedding and regularization to discourage convergence to trivial local minima. The experimental evaluation demonstrates an increase in the performance of ML models on modular arithmetic tasks.
Questions for the Authors
I do not have any specific questions for the authors.
Claims and Evidence
Overall, the paper presents convincing empirical evidence supporting the claimed improvements. Experiments clearly demonstrate that custom data distributions and the novel loss function substantially enhance accuracy across various problem complexities. However, the claim regarding generalization to a broader set of arithmetic and synthetic tasks, while promising, could benefit from more detailed experiments and additional baseline comparisons (for example, checking other N values).
Methods and Evaluation Criteria
The proposed methods (custom data distribution sampling and the custom regularization of the loss function) are well-motivated by observations from prior literature and empirical insights. The evaluation criteria, including mean squared error (MSE) and accuracy metrics, are standard for the problem context.
Theoretical Claims
The paper does not propose new theoretical claims requiring formal proofs and builds on theoretical insights from previous works.
Experimental Design and Analysis
The experimental evaluation is sound, and the experiments are comprehensive (different q, number of terms N). However, there is limited hyperparameter sensitivity analysis. Specifically, a more detailed and explicit sensitivity analysis of the improvements over CL (curriculum learning) is needed.
Supplementary Material
No supplementary material provided.
Relation to Prior Literature
I think the paper has sufficient novelty. The authors make two contributions: 1) augmenting training data with sparse vectors, and 2) loss regularization to avoid model collapse. Both are new contributions relative to prior work.
Missing Important References
Mostly discussed; however, this is not my primary area, so I may have missed some references.
Other Strengths and Weaknesses
Strengths:
- The paper is well written and structured
- The experimental evaluation is sound.
- Demonstrated significant improvements over SOTA works.
Weaknesses:
- Lack of detailed theoretical explanation for observed improvements limits understanding of the underlying mechanisms.
- Experiments do not fully address the practical cryptographic setting (i.e., noisy LWE instances).
- Generalization experiments, while promising, are limited and do not fully substantiate claims of broader applicability.
Other Comments or Suggestions
- It would be beneficial to discuss the computational overhead introduced by these modifications compared to the baseline approach.
We thank you for your thoughtful feedback.
Re: limited generalization experiments: Per your suggestion, we conducted additional experiments with more N and q values. These results are presented in the response to reviewer Qesb (“additional experiments on challenging settings”).
We also conducted experiments on additional arithmetic tasks (product of n numbers, scalar product of two vectors, polynomial sum) and synthetic tasks (multi hop question answering and selective copy). We present the results for the synthetic tasks below and present the other results in our response to reviewer Eby5 (“other tasks beyond addition”).
Multi-hop Question Answering: We synthetically implement associative recall with multiple hops. Given pairs (i, σ(i)) that represent a random permutation σ over a set of n elements, the model must find the 2-hop successor of a query element i, i.e., σ(σ(i)).
| max_length | layers | baseline | ours (setting 1) | ours (setting 2) |
|---|---|---|---|---|
| 16 | 4 | 7% | 100% | 100% |
| 32 | 8 | 3% | 96% | 99% |
| 64 | 12 | 2% | 93% | 94% |
| 128 | 12 | 1% | 91% | 90% |
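To illustrate, a 2-hop instance of this task can be generated roughly as follows; this is a sketch under our own naming assumptions, not the exact prompt format used in training.

```python
import random

def make_two_hop_instance(n):
    """Build a random permutation sigma over range(n), present it as
    (i, sigma(i)) pairs in shuffled order, and ask for the 2-hop successor
    of a query element: answer = sigma(sigma(query))."""
    targets = list(range(n))
    random.shuffle(targets)                # sigma(i) = targets[i]
    pairs = list(enumerate(targets))       # the (i, sigma(i)) facts
    random.shuffle(pairs)                  # facts appear in random order
    query = random.randrange(n)
    answer = targets[targets[query]]
    return pairs, query, answer
```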
Selective copy: Given a vector of a given length (max_length in the table below) where each element is sampled from a vocabulary, output a selective copy of the vector (all tokens different from the <JUNK> token). This task was introduced in the Mamba paper.
| max_length | baseline | ours (setting 1) | ours (setting 2) |
|---|---|---|---|
| 32 | 100% | 100% | 100% |
| 64 | 100% | 100% | 100% |
| 128 | 83% | 100% | 100% |
| 256 | 57% | 100% | 99% |
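Likewise, a selective-copy instance can be sketched as below; `vocab_size` and `p_junk` are hypothetical parameters, not the exact settings used in these runs.

```python
import random

def make_selective_copy_instance(length, vocab_size=16, p_junk=0.5):
    """Build an input of `length` tokens where each position is either
    <JUNK> or a random vocabulary token; the target is the subsequence
    of non-junk tokens, in order."""
    seq = ["<JUNK>" if random.random() < p_junk
           else f"tok{random.randrange(vocab_size)}"
           for _ in range(length)]
    return seq, [t for t in seq if t != "<JUNK>"]
```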
We find that our method carries over well to these additional tasks. Specifically, we vary the distribution of the input length using our custom distributions, and we find that this still leads to performance improvements on these additional tasks. We are also happy to experiment with any other tasks the reviewers suggest.
We will include these additional results and analysis in the revised version.
Re: hyperparameter sensitivity: We will include more details on the hyperparameter sensitivity analysis in the revised version. While modifying certain parameters in the curriculum can slightly improve performance, the CL approach is much more involved and requires more tuning to the specific task. Our approach is simpler and provides consistent improvement over sampling from the default distribution. We provide the specific parameters here for your reference.
We define the sparse set as samples in which at least half of the elements are zeros, and the dense set as samples in which at most half of the elements are zeros.
When we ran the CL baselines, we modified three things:
- Data mix:
  i) using the sparse set up to a switching point, then the dense set until the end;
  ii) using the sparse set up to a switching point, then the union of the two sets until the end.
- Thresholds (the switching point):
  i) after either 1%, 3%, or 10% of the training;
  ii) once the training loss on the sparse set drops below eps, where we chose eps = {1e-2, 1e-3}.
- lr and weight decay:
  i) we experimented with 3 choices of lr (1e-5, 3e-5, 1e-4) and 3 choices of weight decay (0.03, 0.1, 0.3).
In Table 4 of the paper, we reported the best choice among these.
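For reference, the two switching rules listed above can be sketched as follows; the function and argument names are illustrative, and the defaults shown are just two points from the grids above.

```python
def curriculum_phase(step, total_steps, sparse_train_loss,
                     mode="fraction", frac=0.03, eps=1e-3):
    """Decide which set to sample from in the CL baseline.
    mode="fraction": train on the sparse set for the first `frac` of steps,
    then switch; mode="loss": switch once the training loss on the sparse
    set drops below eps."""
    if mode == "fraction":
        switched = step >= frac * total_steps
    else:
        switched = sparse_train_loss < eps
    return "dense (or union)" if switched else "sparse"
```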
Re: weaknesses
- We are happy to include more explanation of the observed improvements. We did investigate why our method succeeds and found that our sampling technique allows for a linear sample complexity, while the baseline needs an exponential sample complexity to tackle the problem. This helps to explain why our proposed sampling strategy is so effective.
Below, we measure the number of samples needed to get <0.005 loss and 90% test accuracy.
| N | default | CL (with best setting) | ours (with our best setting) |
|---|---|---|---|
| 6 | 4.5M | 4.1M | 0.6M |
| 9 | 7.1M | 1.9M | 0.45M |
| 12 | 12.85M | 2.6M | 0.95M |
| 15 | 51.1M | 8.15M | 1.3M |
| 18 | Never | 9.35M | 1.75M |
- Experimenting on the full practical cryptographic setting is an important area for future work, and this work provides the foundation for improving performance in practical settings.
- See above for our response to the generalization experiment limitations.
Re: computational overhead: The computational overhead of our approach is minimal compared to the baseline, as we still generate the same number of data samples and simply modify the distribution used for generation. Similarly, the loss regularization term does not introduce any additional cost (besides the negligible cost of calculating the term itself) as we have a standard training loop. We timed the difference between the custom loss and standard loss experiments and saw that there was a 0.04% difference. We will include this explanation in the revised version of the paper.
Thank you for your response; it answers the points raised in the review, and I am increasing my score.
Thank you for your helpful feedback and response, we appreciate it!
The paper improves machine learning attack baselines on LWE by training models to do modular arithmetic better. It uses custom training data and a special loss function, allowing the model to sum up to 128 elements modulo q ≤ 974269. It also shows improvements on other tasks like copy, associative recall, and parity.
Questions for the Authors
I am confused about how the encoded embedding can map to multiple coefficients. I think the embedding space would be too constrained to decode a large number of terms, due to limited expressiveness?
Claims and Evidence
The paper claims that its methods enable ML models to perform modular addition for up to 128 elements modulo q ≤ 974269—far exceeding prior limits of N ≤ 6 and q ≤ 1000. While the evidence presented supports these claims, it lacks comparisons with other released methods such as the lattice-estimator baseline (https://github.com/malb/lattice-estimator). Without such baselines, it is hard to fully assess the real gains and stability improvements.
Methods and Evaluation Criteria
The methods generally make sense, using a data distribution aligned with difficulty and better use of domain knowledge.
Theoretical Claims
N/A
Experimental Design and Analysis
The experimental design is generally sound, showcasing improved performance on LWE and other tasks. However, the training process consumes a large amount of data and computation, and a relative comparison with existing approaches is missing. It would strengthen the paper if the authors compared the amortized cost and practical efficiency against established methods. Moreover, while some settings are shown to be easier, it remains unclear whether the method has been evaluated under truly hard conditions. More experiments on challenging settings are needed to validate the robustness of the approach.
Supplementary Material
No.
Relation to Prior Literature
The technique could contribute a way of attacking LWE.
Missing Important References
Unsure
Other Strengths and Weaknesses
The work is overall promising and sheds light on how modular arithmetic tasks can probe the stability of ML models. Addressing the missing baselines and providing deeper comparisons will further solidify the paper’s contributions.
Other Comments or Suggestions
N/A
We thank you for your thoughtful feedback.
Re: comparisons: In the paper, we compare our methods to the standard training approach with regular loss and default data distribution for the arithmetic, synthetic and the cryptography tasks (see Tables 3, 6, 7, and 9). We also provide a comparison to curriculum learning for modular addition (Table 4), where we show that our method is more consistent at learning the task compared to the curriculum learning approach. The lattice estimator provides theoretical cost estimates (in bit-operations) for attacking certain instantiations of LWE-based cryptosystems. Since the proof-of-concept we present in this work does not attack actual LWE-based cryptosystems, comparing against the lattice estimator would not be a fair comparison. Future work expanding on this should implement a true LWE setup and can then compare against other LWE attack methods. We welcome any additional suggestions of comparisons we could make in this work.
Re: cost/efficiency comparison: We will include an analysis of the cost/efficiency of our method compared to the baselines in the revised version. The computational overhead of our approach is minimal compared to the baseline, as we still generate the same number of data samples and simply modify the distribution used for generation. Similarly, the loss regularization term does not introduce any additional cost (besides the negligible cost of calculating the term itself) as we have a standard training loop. We timed the difference between the custom loss and standard loss experiments and saw that there was a 0.04% difference.
Re: additional experiments on challenging settings: As mentioned in our response to reviewers ZgiU and Eby5, we conducted additional experiments on more challenging settings and other tasks, which we will include in the revised version.
We conducted additional experiments on more values of N and q, including non-powers of 2 and non-primes, for modular addition. We find that our method is robust to larger, non-prime q's and to non-power-of-2 N's.
Results for Different Values of N and q
| N | q | tau=1% acc |
|---|---|---|
| 16 | 1728 | 100.0% |
| 16 | 100000 | 100.0% |
| 16 | 1048576 | 100.0% |
| 16 | 10000001 | 100.0% |
| 32 | 1728 | 100.0% |
| 32 | 100000 | 100.0% |
| 32 | 1048576 | 100.0% |
| 32 | 10000001 | 100.0% |
| 64 | 1728 | 99.5% |
| 64 | 100000 | 99.3% |
| 64 | 1048576 | 99.4% |
| 64 | 10000001 | 99.8% |
| 128 | 1728 | 98.0% |
| 128 | 100000 | 98.2% |
| 128 | 1048576 | 98.1% |
| 128 | 10000001 | 98.8% |
Results for Different Values of N (non-powers of 2) and q
| N | q | tau=1% acc |
|---|---|---|
| 20 | 257 | 100.0% |
| 20 | 3329 | 100.0% |
| 20 | 42899 | 100.0% |
| 20 | 974269 | 100.0% |
| 49 | 257 | 99.7% |
| 49 | 3329 | 99.6% |
| 49 | 42899 | 99.7% |
| 49 | 974269 | 99.6% |
| 101 | 257 | 98.6% |
| 101 | 3329 | 98.8% |
| 101 | 42899 | 98.9% |
| 101 | 974269 | 98.5% |
We also see that our method is not fully robust to high N, but perhaps a longer training time or a larger model is needed for higher N.
| N | q | MSE loss | tau=1% acc |
|---|---|---|---|
| 256 | 257 | 0.15 | 90.4% |
| 256 | 3329 | 0.14 | 92.7% |
| 256 | 42899 | 0.18 | 91.2% |
| 256 | 974269 | 0.17 | 90.6% |
1. What happens if you train 4x longer?
| N | q | MSE loss | tau=1% acc |
|---|---|---|---|
| 256 | 257 | 0.08 | 94.8% |
| 256 | 3329 | 0.08 | 95.1% |
| 256 | 42899 | 0.09 | 95.0% |
| 256 | 974269 | 0.10 | 94.5% |
2. What happens if your model is 4x larger (embed dim goes from 256 to 512)?
| N | q | MSE loss | tau=1% acc |
|---|---|---|---|
| 256 | 257 | 0.07 | 96.2% |
| 256 | 3329 | 0.07 | 96.7% |
| 256 | 42899 | 0.08 | 96.1% |
| 256 | 974269 | 0.09 | 95.8% |
We also conducted experiments on additional arithmetic tasks (product of n numbers and scalar product of two vectors) and synthetic tasks (multi-hop question answering and selective copy). These results are presented in our response to reviewer Eby5 (“other tasks beyond addition”). We find that our method carries over well to these additional tasks. Specifically, we vary the distribution of the input length using our custom distributions, and we find that this still leads to performance improvements on these additional tasks. We are also happy to experiment with any other tasks the reviewers suggest.
Re: embedding question: Thanks for the question. We use the angular embedding from Stevens et al. (2024) in our work. Could you please clarify your question on the embedding? We want to address your question appropriately but didn’t quite understand it well enough to answer.
This work develops machine learning techniques that achieve enhanced performance for modular arithmetic. A key application is developing stronger ML attacks against the Learning with Errors (LWE) problem. The methods are evaluated experimentally and exhibit increased performance over prior work. Overall, the reviewers agreed that this is a solid contribution.