Neural Rule Lists: Learning Discretizations, Rules, and Order in One Go
We propose a fully differentiable architecture for learning an interpretable rule list classifier.
Abstract
Reviews and Discussion
The paper proposes NEURULES, an end-to-end differentiable approach for learning rule lists, jointly handling discretization, rule composition, and ordering. Experiments demonstrate consistent improvements over existing methods across various datasets.
Strengths and Weaknesses
Strengths:
1. Technically solid contributions with clear and effective innovations (differentiable predicates, relaxed conjunction, and learned rule ordering).
2. The writing is clear and well-organized, providing intuitive motivation and explanation of model architecture.
3. Extensive experiments on diverse real-world datasets and multiple baselines demonstrate the effectiveness convincingly.
Weaknesses:
1. Limited experiments on datasets with high feature dimensionality; the scalability to high-dimensional scenarios therefore needs further validation.
2. The method requires pre-specifying the number of rules, which may cause inconvenience in practical applications.
Questions
Although the overall framework is elegant and reasonable, based on my experience with differentiable rule learning methods, I still have concerns regarding the scalability of the proposed approach to high-dimensional datasets (e.g., feature dimension >100 or even >1000). The current experiments include only one dataset with dimension above 100. In scenarios of extremely high-dimensional data, classical feature preprocessing methods might show relative advantages. I’d appreciate it if the authors could provide additional experimental results or discussions on datasets with higher dimensionality to further validate scalability.
Limitations
yes
Justification for Final Rating
Thank you for your reply. I have no other questions and will keep my recommendation.
Formatting Concerns
None
Dear Reviewer, thank you for your in-depth, constructive feedback.
We want to comment further on Neurules' scalability and relative performance compared to other methods. In our experiments, the Juvenile dataset with 286 features has the highest dimensionality, and on it Neurules is the best-performing method in terms of F1 score.
Furthermore, we supplement results on two additional datasets from the UCI repository that have hundreds of features: “Drug-induced autoimmunity” (477 samples × 195 features) and “TUNADROMD” (4465 samples × 242 features), which deals with Android malware. We use the same hyperparameters as in the main experiments and a fixed rule list size. We provide the results for each in the two tables at the bottom, excluding CORELS and SBRL due to their inability to scale to higher dimensions.
On TUNADROMD, we achieve an F1 of 0.97 compared to a best of 0.98 (Classy, RLNet). For Drug-induced autoimmunity, Neurules is the best-performing method with a substantial gap between it (0.77) and the next best (RLNet, 0.73). Drug-induced autoimmunity consists mainly of continuous features, where Neurules' threshold learning ability comes in handy, while TUNADROMD contains only binary features.
In general, we agree that scalability is an important aspect for the real-world deployment of any rule list model. The current results indicate that Neurules is on par with the best methods on large-scale datasets with discrete features, and superior on those with mainly continuous features. Neurules' runtimes scale favorably due to the gradient-based optimization. We will add these results to the updated manuscript and expand the discussion on scalability.
Results on Drug-induced autoimmunity
| Method | F1 | F1_std | Accuracy | Accuracy_std | Runtime | Runtime_std |
|---|---|---|---|---|---|---|
| Neurules | 0.77 | 0.02 | 0.79 | 0.01 | 29.35 | 2.08 |
| RLNet | 0.73 | 0.03 | 0.76 | 0.02 | 209.39 | 24.12 |
| Classy | 0.70 | 0.04 | 0.76 | 0.02 | 12.93 | 4.41 |
| Greedy | 0.67 | 0.05 | 0.73 | 0.04 | 0.02 | 0.01 |
| Ripper | 0.69 | 0.02 | 0.72 | 0.03 | 1.35 | 0.28 |
Results on TUNADROMD
| Method | F1 | F1_std | Accuracy | Accuracy_std | Runtime | Runtime_std |
|---|---|---|---|---|---|---|
| Neurules | 0.97 | 0.01 | 0.97 | 0.01 | 125.60 | 24.53 |
| RLNet | 0.98 | 0.01 | 0.98 | 0.01 | 248.77 | 31.77 |
| Classy | 0.98 | 0.00 | 0.98 | 0.00 | 3.38 | 0.21 |
| Greedy | 0.92 | 0.01 | 0.93 | 0.01 | 0.03 | 0.01 |
| Ripper | 0.92 | 0.01 | 0.93 | 0.01 | 8.03 | 0.46 |
This paper presents a differentiable relaxation of the problem of learning rule lists on tabular data. It jointly learns predicates (threshold conditions on individual features), conjunctions of those predicates to form rules, and the order of those conjunctions. The number of rules is held fixed. The differentiable relaxation of predicates yields soft predicates, which converge in the low-temperature limit to strict predicates -- this is exploited in the optimization algorithm by temperature annealing as training progresses. To learn conjunctions of predicates, the authors use a generalization of the weighted harmonic mean that includes a slack term, which simultaneously eliminates vanishing gradients and encourages sparse rules. Finally, to learn rule list order, the authors propose to learn a priority vector for the position of each rule in the list. The final form of a rule list is a sum over all classes of an indicator over the highest-priority rule; however, this indicator is relaxed into a Gumbel-softmax approximation of the argmax, which allows gradients to flow beyond the single selected rule. This paper uses another form of temperature annealing to collapse this approximation later in training. The authors also propose a regularizer based on the minimum and maximum coverage of rules. Experiments comparing this method to other neural methods, as well as to combinatorial discrete optimization methods, are presented.
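For concreteness, the following is a minimal sketch of the pipeline summarized above: soft interval predicates, a soft conjunction with slack, and a Gumbel-softmax selection over rule priorities. All functional forms, names, and constants here are illustrative assumptions and not the paper's exact Eqs. (4), (7), and (12).

```python
# Minimal, illustrative sketch of a differentiable rule-list forward pass.
# The sigmoid predicate, harmonic-mean-with-slack conjunction, and the way
# priorities are combined with activations are assumptions, not the paper's equations.
import numpy as np

rng = np.random.default_rng(0)

def soft_predicate(x, lower, upper, temp):
    """Soft version of lower <= x <= upper; approaches a hard check as temp -> 0."""
    s = lambda z: 1.0 / (1.0 + np.exp(-z / temp))
    return s(x - lower) * s(upper - x)

def soft_conjunction(preds, weights, slack):
    """Weighted-harmonic-mean style AND with slack; low-weight literals drop out."""
    w = np.clip(weights, 0.0, None)
    return (w.sum() + slack) / ((w / (preds + 1e-8)).sum() + slack)

def gumbel_softmax(logits, temp, rng):
    """Differentiable approximation of an argmax over rule priorities."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    z = (logits + g) / temp
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def rule_list_forward(x, params, temp, rng):
    """Class probabilities for one sample x (shape: [n_features])."""
    n_rules = len(params["lower"])
    act = np.empty(n_rules)
    for r in range(n_rules):
        preds = soft_predicate(x, params["lower"][r], params["upper"][r], temp)
        act[r] = soft_conjunction(preds, params["weights"][r], slack=0.1)
    # Soft selection of the governing rule: combine priorities with activations.
    select = gumbel_softmax(params["priority"] + np.log(act + 1e-8), temp, rng)
    logits = select @ params["consequent"]              # per-class logits
    return np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()

# Toy usage: 3 rules over 4 features, 2 classes.
d, k, c = 4, 3, 2
params = {
    "lower": rng.normal(size=(k, d)), "upper": rng.normal(size=(k, d)) + 1.0,
    "weights": rng.uniform(size=(k, d)), "priority": rng.normal(size=k),
    "consequent": rng.normal(size=(k, c)),
}
print(rule_list_forward(rng.normal(size=d), params, temp=0.5, rng=rng))
```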
Strengths and Weaknesses
Strengths
-
The differentiable relaxations presented for each of the three layers of rule list learning are principled, justified by both intuition and thorough arguments for convergence and gradient stability, and clearly advance prior work on neural rule lists.
-
The paper is written in a clear and concise manner. The issues with existing approaches, along with the authors' proposed solutions, are well motivated and clearly reasoned.
-
The experiments effectively demonstrate that this approach is better on many datasets compared to previous neural methods.
-
The easy extension to multi-class classification problems is a nice benefit of the proposed architecture.
Weaknesses
Major Weaknesses
-
The comparisons to combinatorial methods, especially CORELs, are weak and do not present a fair comparison. The choice to only use 5 evenly spaced cut points on continuous features undercuts CORELs' ability to fit the data well -- in fact, due to the use of objective-based bounds, the use of fewer cut points may contribute to the surprising number of timeouts of CORELs in the experiments (even on small datasets). CORELs provides a feature selection method in both the paper and the code based on frequent itemset mining. In fact, existing prior work shows that, at least on the FICO dataset, CORELs outperforms the best metric presented in Table 1 (He et al., 2023; JAIR: Visualizing the Implicit Model Selection Tradeoff); FICO is listed as timing out in this paper.
-
Similarly, the choice to perform hyperparameter optimization only on several held-out datasets, and not on each dataset individually, is a weak point of the evaluation in this paper. Especially for the discrete methods, the regularization is directly traded off against misclassification error and is sensitive to the class balance and particulars of each dataset. At least the regularization parameter should have been tuned for each dataset individually, ideally with nested cross-validation (see Scikit-learn's "Nested vs. non-nested cross-validation" example).
Minor Weaknesses
-
There is little attention given to the actual complexity of the rule lists identified by this method. The authors demonstrate that this approach can find good rule lists which don't have many rules, but it is not clear how many conjunctions these rules have. A rough metric to compare to sparse methods would be the total number of predicates used in the rule list, rather than the length of the list.
-
The maximum cardinality of 5 for CORELs is too high - a typical value for this is 2 or 3. Allowing length-5 predicates is probably another contributor to the timeouts observed in the results tables. Also, the max cardinality of 5 presented in this paper is not one of the options in the grid search (2, 3, 4) - I think this may be a typo on line 534.
Conclusion
This paper is technically solid, with reasonable and well-justified methods to make the rule list learning problem differentiable. However, the evaluations, for all methods in terms of hyperparameter optimization, and for CORELs in terms of continuous feature binning and hyperparameter choices, were incomplete and did not present a fair comparison.
Questions
-
It appears that the rule ordering optimization method introduces discontinuities into the loss landscape of your neural architecture. When one priority value overtakes another, this results in an immediate swap of the rules and a corresponding jump in the loss. I have a similar concern about the process to learn rules -- it seems that there are discontinuities introduced whenever a predicate achieves non-zero weight. My question is: do these discontinuities matter for your optimization process? Are they solved by the relaxations to achieve differentiability?
-
How do the performance and sparsity (# of rules, and # of predicates per rule) of your method compare to CORELs when CORELs is given the same predicates to use as your method has learned? That is: how do the rule and order learning stages of your method compare to an optimal approach, conditioned on pre-selected predicates? If it performs comparably in both performance and sparsity across a variety of datasets, I will consider raising my score.
-
I have several related questions relating to the implicit regularization in the rule construction process. How strong is the implicit regularization towards sparser rules in section 3.2? Is the strength of regularization tunable, for example by adjusting the slack magnitude? Does the regularizer encourage rules to be sparser, even when some terms have useful signal, or does it only eliminate useless terms?
Limitations
yes
Justification for Final Rating
After extensive discussion with the authors, I have raised my score from a borderline reject to an accept. My issues with the paper, and their resolutions, are as follows:
Issue: The evaluation of discrete rule list methods, like CORELS, were not run with enough cutpoints on continuous features.
Resolution: The authors pointed me to a figure in their appendix, which showed that increasing the number of cutpoints did not improve the performance (as measured by f1 score) of CORELs on two different datasets. More cutpoints led to more features to choose from, and also overfitting and timeouts.
Issue: The choice to do hyperparameter tuning on only held out datasets, then to apply those hyperparameters uniformly to all evaluation datasets, is biased against methods which are sensitive to particular datasets.
Resolution: The authors provided evidence, in the form of additional experiments, showing that the choice of regularization did not affect at least CORELS' results on six of the evaluation datasets. In this light, I find the paper's experiments to be more trustworthy than my first reading and review. I still encourage the authors to perform limited per-dataset hyperparameter evaluation so that doubt cannot be raised in this direction.
Issue: On average, the rules found by NeuRules are too complicated to be called interpretable.
Resolution: Although the average rule length is quite long, the median is small, and there are several very large rules contributing to the high average. These large rules may be either spurious or meaningful, but the authors argued that they are easily pruned if they are deemed spurious using held out data.
Formatting Concerns
N/A
Dear Reviewer,
Thank you for your in-depth, constructive feedback.
Below we address your concerns:
-
Comparisons: We respectfully disagree with the suggestion that our setup for CORELS is not fair or misrepresents its capabilities. Due to the delicate balance between performance and runtime when choosing the parameters, we performed a series of experiments for CORELS where we systematically varied the number of cut points. In Appendix G, Figure 7, we report runtime and performance for CORELS with an increasing number of cutpoints. We observe that, contrary to intuition, an increase in cutpoints has a slight negative impact on F1 score. On the other hand, increasing the number of cut points leads to an exponential growth in the number of frequent itemsets considered during CORELS' preprocessing, resulting in an exponential runtime increase (a rough, illustrative calculation of this growth is sketched after this list). In theory, increasing the number of cut points should expand the search space and allow CORELS to discover better classifiers. However, in our experiments, this increase in resolution often leads to overfitting and timeouts, not improved generalization.
With respect to the FICO dataset and the comparison to He et al. (2023), we note that the reported performance of CORELs is achieved under a very different experimental setup, including an extensive and carefully crafted discretization scheme and what appears to be significantly relaxed runtime constraints. Under such idealized and hand-tuned conditions, CORELs can outperform other models. But this underscores - rather than undermines - our central contribution: our method learns the discretization, rule structure, and ordering jointly in a single, scalable pipeline, requiring no manual pre-processing, and delivers competitive (often superior) performance in a fraction of the runtime.
-
Hyperparameters: Our evaluation avoids fine-tuning hyperparameters per dataset to ensure fairness and realism across a broad range of methods. In particular, neural approaches like ours often involve many hyperparameters that, if tuned per dataset, could lead to overly optimistic results and poor generalization. Moreover, tuning regularization parameters per dataset for combinatorial methods such as CORELs is especially problematic due to their high computational cost. As noted in the paper, many of these methods already approach or exceed runtime limits, and performing nested cross-validation across all datasets would break our computational budget. Instead, we determine the best overall hyperparameter configuration on a few holdout datasets. With the best overall parameters, the subsequent evaluation is conducted across all datasets to better reflect each method's robustness and out-of-the-box performance.
-
Rule complexity: The sparsity of the discovered rules is of interest because it correlates with a rule's interpretability. In Figure 4b, we provide the distribution of rule lengths, i.e., the number of predicates per rule, over all datasets. Below, we provide a table with statistics for all methods, which we will add to the Appendix:
| Method | median | mean | std |
|:---|---:|---:|---:|
| Neurules | 4.00 | 7.85 | 13.45 |
| RLNet | 6.00 | 9.65 | 15.16 |
| DRNet | 21.00 | 53.96 | 97.02 |
| CORELS | 2.00 | 1.83 | 0.86 |
| CLASSY | 3.00 | 2.89 | 0.90 |
| RIPPER | 2.00 | 1.94 | 1.37 |
| SBRL | 2.00 | 1.91 | 0.71 |

Neurules learns more succinct rules than other neural approaches (RLNet, DRNet), showing the efficacy of our gradient shaping in that regard. Compared to combinatorial methods that impose a limit on maximum cardinality, the rules we find are longer. We don't think that this is necessarily a weakness; there may exist datasets where rules with longer clauses make sense. In general, we observe a power law-like trend of rule lengths learned by Neurules, where the majority of rules have four or fewer clauses.
-
Cardinality: We think that there are many datasets for which rules with five or more conditions are appropriate. Increasing the maximum cardinality does lead to a combinatorial explosion of possible rules and hence termination issues for CORELS. However, as a configuration with cardinality five achieved the best accuracy, it was selected in the hyperparameter optimization. Thank you for pointing out the typo in the Appendix. We will update the text.
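Returning to the growth argument in the Comparisons point above, a rough and purely illustrative upper bound on the candidate-rule space shows why runtimes explode with finer discretizations. The counts below are not CORELS' actual frequent itemsets, which are further pruned by the minimum-support threshold, but the trend is the same.

```python
# Rough upper bound on the candidate conjunctions when binarizing d continuous
# features with c cutpoints and allowing conjunctions up to cardinality k.
# Purely illustrative; real itemset miners prune by support.
from math import comb

def candidate_conjunctions(n_features, n_cutpoints, max_cardinality):
    n_predicates = 2 * n_features * n_cutpoints   # one-sided <= / > predicates
    return sum(comb(n_predicates, j) for j in range(1, max_cardinality + 1))

for c in (3, 5, 10, 20):
    print(c, candidate_conjunctions(n_features=20, n_cutpoints=c, max_cardinality=3))
```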
Regarding your questions:
-
Loss discontinuities: You're right in observing that, in principle, discontinuities could arise from rule reordering or predicate activation. The idea of Neurules is to overcome the obstacle of discontinuity in a rule list using smooth, differentiable approximations. For rule ordering, we apply a Gumbel-Softmax relaxation, which smooths the argmax operation and allows gradients to flow even when priority values are near a crossover point via the additional Gumbel term in Eq. (12). Similarly, predicate inclusion is handled through soft conjunctions with slack, so the model can adjust predicate weights without abrupt changes in the loss.
That said, as the temperature approaches zero, the approximations can become steep. In particular, if an input value lies exactly at the transition, the soft predicate is effectively discontinuous (cf. Fig. 3a). In very rare cases, this can lead to very large gradients. In practice, this has not posed a significant problem for us because we control the temperature annealing schedule and avoid hard transitions. While we did not find it necessary in our experiments, given that we know precisely when this happens, one practical safeguard would be to monitor this operation and stop gradient flow in this situation.
-
Comparison to CORELS: Although CORELS offers formal optimality guarantees, these apply only to the regularized training objective and only within the fixed rule space defined by mined itemsets. Although the method and paper are impressively rigorous, the authors do not provide a tight generalization bound. One could derive such bounds using standard tools like Hoeffding’s inequality, since the hypothesis space is finite, but they degrade rapidly as the rule space grows. Yet this bound would only hold for the regularized objective and does not allow us to make any statements about the performance on the test set.
One can, as you suggest, use Neurules to extract predicates for CORELS. To test this idea, we take the FICO dataset and extract all boolean predicates that Neurules learns, excluding those with negligible weight and those that evaluate to a constant 0 or 1. We use these predicates to binarize the train and test sets for each cross-validation split. On this data, we re-run CORELS and obtain an F1 score below that of Neurules, with an average runtime of 5 minutes per fold. We hypothesize that the difference stems from the maximum cardinality constraint placed by CORELS, whereas Neurules predicates were learned in the context of longer, unrestricted rules. So while it is possible to combine Neurules predicate learning with CORELS' search method, there are no immediate benefits in terms of performance, but potentially in sparsifying the learned rule list.
-
Sparsity: The sparsity pressure in our method arises from the slack term in Eq. (7). This slack appears in the denominator of the relaxed conjunction and controls how sharply low-activation predicates are penalized. When the slack is close to zero, a low predicate activation makes the corresponding denominator term large, which results in strong negative gradients on that literal's weight, pushing it toward zero. Conversely, a high activation contributes positively to the gradient, reinforcing the literal. The slack thus controls the strength of this push-pull: smaller values lead to stronger sparsity pressure. As the slack approaches zero, this pressure increases, but at exactly zero the gradients can vanish or tend to infinity for inactive predicates. In practice, we keep the slack small but strictly positive to ensure effective pruning without numerical issues. This mechanism eliminates literals with minimal contribution while retaining informative ones, and provides a smooth, tunable control over the regularization strength.
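To make this push-pull concrete, below is a small finite-difference illustration under an assumed harmonic-mean-with-slack form of the conjunction. The exact parameterization of Eq. (7) may differ; only the qualitative sign pattern (negative gradient for a nearly inactive literal, positive for an informative one) is meant to carry over.

```python
# Finite-difference illustration of the push-pull effect on literal weights,
# under an ASSUMED harmonic-mean-with-slack conjunction (not the paper's exact Eq. (7)).
import numpy as np

def soft_and(preds, w, slack):
    # weighted harmonic mean of predicate activations, with an additive slack term
    return (w.sum() + slack) / ((w / preds).sum() + slack)

def grad_wrt_weight(preds, w, slack, i, eps=1e-6):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    return (soft_and(preds, w_plus, slack) - soft_and(preds, w_minus, slack)) / (2 * eps)

preds = np.array([0.95, 0.90, 0.05])  # two informative literals, one nearly inactive
w = np.ones(3)
slack = 0.1
print("gradient for the inactive literal:", grad_wrt_weight(preds, w, slack, i=2))  # negative -> pushed out
print("gradient for an active literal:  ", grad_wrt_weight(preds, w, slack, i=0))  # positive -> reinforced
```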
Thank you for the very thorough response -- I will consider each point in turn.
- Comparisons: This has highlighted an interesting discrepancy between my past experience using CORELs and the results in this paper. In my own experience, itemset mining with ~20-100 cutpoints results in the best generalization performance. I find it surprising that you've made the opposite observation -- that increasing the number of cutpoints seems to hinder, rather than help, generalization. That does not mean you're wrong, so I will incorporate Figure 7, as well as your thorough response, into my final review.
It would be interesting to combine the methods introduced by McTavish et al. (2022) for choosing thresholds for the GOSDT decision tree algorithm, with the rule list optimization framework. This is, however, outside of the scope of this paper.
-
Hyperparameters: In this, I disagree. I do not think that it is realistic to only tune hyperparameters for a small selection of datasets, then use those exclusively, especially for small and interpretable models. Generalization performance is part of the selection criteria, so I am not sure what you mean when you say that per-dataset hyperparameter tuning could lead to overly optimistic results. I concede that nested CV is too big an ask, but I contest the claim that per-dataset hyperparameter tuning is too much. I would strongly suggest to limit the size of rules for the non-neural solvers to 3 or at most 4, and to at least tune regularization per-dataset for these methods. The best choice of regularization is highly dependent on the class balance and the dataset size.
-
Rule Complexity: I think that this is a significant weakness. Rules with an average length of 4 (and, it appears, occasional rules that are much larger) are significantly harder to interpret. The possible existence of datasets where the additional rule length is truly necessary to achieve good performance is hard to refute, but not substantive enough to serve as justification for the very complex rules in comparison to optimal methods.
-
Cardinality: Although this configuration achieved the best accuracy, it clearly did so in conflict with CORELs' ability to finish computation in the defined time period. I think that this points to the fact that 5 was indeed too large.
Questions
-
Loss discontinuities: Thank you for the response, this is very interesting and only further highlights the strong contribution of the proposed methods.
-
Comparison to CORELs: This response is very reasonable criticism of CORELs, and the provided experiment is very useful in understanding the utility of longer rules. If I may ask -- what hyperparameters did you use for CORELs in this experiment, and did you tune the regularization parameter for f1-score? I am not asking this critically -- I think that the experiment is very interesting and I believe the results.
-
Sparsity: Thank you for the response -- I feel that I understand the slack term's sparsity pressure better.
Overall, I am convinced that the experiments were performed with awareness of the issues I brought up, and the authors included experiments in the Appendix which I reviewed but did not fully appreciate in my first review. I think that the paper would be improved by more thorough discussion of the chosen experimental set-up, and I maintain my assertion that hyperparameter tuning per-dataset (perhaps with fewer options for scalability, or a non-grid-search optimization method) would significantly strengthen the paper.
I will await further discussion to revise and finalize my review, but I am inclined to raise my score in light of the initial rebuttal.
Thank you for your engagement and the interesting points you raise. We address the remaining open points below.
-
Comparison to CORELS: We think that a discretization which is too fine may lead to overfitted rules and adversely impact performance. Alternatively, one can use all intervals defined by the cutpoints, but that leads to a combinatorial blow-up in the number of intervals as the number of cutpoints grows. In the experiment on FICO, we used the same hyperparameter configuration as in the main paper, including the regularization parameter.
-
Hyperparameters: We understand where you are coming from. Proper and fair evaluation has been a main headache of this work; especially the exact methods can show wildly varying performance. We experimented with a setting like you suggest, but as you also point out, we had to reduce the number of hyperparameters that we test to stay within a reasonable compute budget. This, however, would then invite the argument that we did not tune the competing methods well enough. Thus, we decided to go an alternate route: rely on hold-out datasets and tune all methods exhaustively there. Given the available resources, this setup allowed a better balance between methods terminating in reasonable time (within 3 days per hyperparameter configuration) and proper presentation of competitors' performance.
-
Rule Complexity: Learning shorter rules is one of the main challenges of differentiable rule discovery, and among these methods, thanks to our gradient shaping, we learn the most succinct rules. In terms of rule length, we think that shorter is not automatically better. Depending on the use case, say predicting health-related outcomes, having a more complex classifier at the expense of ease of comprehension may be reasonable. One recent study indicated that, conversely, users are more likely to accept and trust longer rules [1]. In particular, the classifier is still interpretable; interpreting it just takes more effort. From that perspective, the trade-off is explanation effort versus accuracy.
Another reason for longer rules is the one-hot encoding of categorical features. We often observe that our method expresses the presence of a feature value as the absence of all other values, meaning that we sometimes get longer rules even though there is a simpler description. In ongoing work we are attempting to mitigate this.
[1] Fürnkranz, Johannes, Tomáš Kliegr, and Heiko Paulheim. "On cognitive preferences and the plausibility of rule-based models." Machine Learning 109.4 (2020): 853-898.
Thank you for the interesting and insightful discussion. We will incorporate what we discussed into the manuscript. We are also eager to explore the mentioned open research questions that are beyond the scope of the current paper, as we firmly believe in the necessity for highly performant, robust, and interpretable ML models. It goes without saying that we would appreciate it if you would increase your score.
Thank you for your continued engagement.
-
Comparison to CORELS: I am not sure where I stand on this issue. Alternatively, you can imagine your method, when flexibly choosing cutpoints, to be choosing from a maximally-granular set of potential cutpoints, but I'm not concerned about your method overfitting the cutpoints. I think that the issue arises when there are too many long rules to choose from, some of which end up being spuriously predictive of the label in the training set. In this sense, a good rule selection mechanism (like fpgrowth, for example) may allow for good generalization, even with a very fine discretization. As a note -- I am used to thinking about rules as being one-sided inequalities rather than intervals. I agree that having intervals, with 2^n cutpoints, could result in a huge explosion of the number of rules to consider.
-
Hyperparameters: I commiserate with this difficulty, but I maintain that tuning at least the regularization on every dataset individually is strictly necessary for me to trust the comparison. I think that the setup you settled on is a reasonable compromise, except for this point, where the right regularization is dependent on specific properties of each dataset. Another point, which this conversation brings to mind, is that the sparsity comparison between different methods depends on the regularization parameters chosen for each method in hyperparameter optimization.
-
Rule Complexity: With identical performance, I think that shorter rules are clearly better than longer rules. It is very important to note that complexity does not always trade off with accuracy, especially on problems with noisy labels (which are very common in the sorts of domains where interpretability is valued) [1, 2, 3]. My understanding of that study is not that longer rules are preferable, but rather that people have a cognitive bias towards longer rules. It may be that this results in more trust of models with longer rules, but it remains open whether this would translate into better performance or decision making on a real world task.
For clarity: I am inclined to raise my score to a 4 on the grounds that I am confident that your method is an advancement on other neural methods for learning rule lists. I remain skeptical of the comparison to discrete solvers for rule lists, namely CORELs, due to the lack of regularization tuning per-dataset and the restrictive discretization procedure. However, I acknowledge the difficulty in designing a perfect experiment that fits in runtime constraints.
[1] Semenova et al., 2019: On the Existence of Simpler Machine Learning Models
[2] Semenova et al., 2023: A Path to Simpler Models Starts With Noise
[3] Boner et al., 2024: Using Noise to Infer Aspects of Simplicity Without Learning
Extra typo that I found: line 527, the values of the "bs" parameters are missing comma separators.
Thank you also for engaging in this productive discussion.
-
Comparison to CORELS: I agree with all points raised, and I think that there could be interesting future work in leveraging your method to find good rules, potentially along with an optimal solver to maximally exploit them. That is outside of the scope of this work though, and for now I am satisfied with this conclusion.
-
Hyperparameters: Thank you for running this comparison, this is a very useful point of reference. I presume that CORELS' lackluster performance on f1 score arises mostly from a discrepancy in the optimization metric from the evaluation metric, and secondarily from the issues of feature selection that we discussed in regards to the CORELS comparison? Perhaps CORELS is less sensitive to regularization than I thought -- I will do my own benchmarking to check this for myself outside of the review process.
-
Rule Complexity: This is an interesting point, that I had not completely considered. I generally assume that a very long rule will capture such a small portion of the training data that it is indistinguishable from a spurious rule using the training data. But doing a postprocessing check with held out data to check whether a long rule is spurious is an interesting idea, and reminiscent of the model editing framework for generalized additive models proposed in [1].
Another benchmark, which is outside the scope of this paper, but which you may find interesting, is the STreeD algorithm (proposed in [2]) for decision tree optimization. This could be trivially restricted to optimize rule lists instead (go down only one side of a tree), and has built in optimization techniques for f1 score and other metrics usually secondarily optimized.
In light of the additional experiments showing that hyperparameter tuning is likely less of an issue than I initially believed, I am inclined to raise my score to an accept. I encourage the authors to consider including results with tuned regularization, at least, because although it does not appear to make a significant impact on the results, I think it improves the trustworthiness of the paper for somebody (like me) who is skeptical of hyperparameter tuning on held out datasets.
[1] Wang et al., 2021: GAM Changer: Editing Generalized Additive Models with Interactive Visualization.
[2] van der Linden et al., 2024: Necessary and Sufficient Conditions for Optimal Decision Trees using Dynamic Programming
Dear Reviewer,
Thank you for engaging so much in this discussion; we really appreciate it.
-
Comparison to CORELS: That is an interesting observation. Indeed, our approach can be viewed as modeling from a maximally granular set of potential cutpoints.
As for pattern mining-based approaches, such as those based on fpgrowth, these are indeed able to find general patterns, yet come with their own set of challenges. As they return every itemset that passes the minimum support threshold, the result set tends to be very large due to the downward-closure property, highly redundant because, due to noise, it includes many variants of the same ground-truth rules, and it contains many "spurious" rules that consist of combinations of true rules or of frequent but independent items. Determining which of these are the most informative is an active field of study, but as of yet remains a challenging task. This means spurious rules can still emerge even with sophisticated mining approaches; while the likelihood may be reduced, the fundamental issue remains.
-
Hyperparameters: To address your concern about regularization tuning, we conducted additional experiments by re-running CORELS on 6 datasets with regularization parameter values in {0, 0.001, 0.005, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1}. Due to time constraints in the discussion period, we reduced the number of cutpoints from 5 to 3, though this modification yields nearly identical results on this subset of datasets. The results are presented in the table below.
| Dataset | NeuRules | 0 | 0.001 | 0.005 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Android Malware | 0.93 | 0.33 | 0.33 | 0.33 | 0.33 | 0.33 | 0.33 | 0.33 | 0.33 | 0.33 |
| FICO | 0.7 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.68 |
| Heart Disease | 0.8 | 0.68 | 0.68 | 0.68 | 0.68 | 0.68 | 0.68 | 0.68 | 0.68 | 0.07 |
| Hepatitis | 0.81 | 0.76 | 0.76 | 0.76 | 0.76 | 0.76 | 0.76 | 0.76 | 0.48 | 0.63 |
| Rin | 0.92 | 0.63 | 0.63 | 0.63 | 0.63 | 0.63 | 0.63 | 0.63 | 0.63 | 0.54 |
| Titanic | 0.75 | 0.68 | 0.66 | 0.66 | 0.66 | 0.66 | 0.62 | 0.54 | 0.54 | 0.54 |
| Avg. F1 | 0.82 | 0.63 | 0.62 | 0.62 | 0.62 | 0.62 | 0.62 | 0.61 | 0.56 | 0.47 |

Our analysis reveals that CORELS' performance remains stable for regularization values up to 0.04, with F1 scores showing minimal variation. Performance deteriorates only at the higher regularization values 0.05 and 0.1. This consistency across datasets supports our experimental design choice of using a fixed regularization parameter rather than dataset-specific tuning.
Notably, CORELS performs best with no regularization on these datasets, yet still shows a performance gap compared to NeuRules. This gap persists consistently across all regularization settings and aligns with the results reported in our main paper.
We want to emphasize that our rule grammar employs interval constraints rather than one-sided inequalities, which may differ from your previous experience with CORELS. Regarding sparsity comparisons, we systematically optimized all available hyperparameters (including sparsity regularizers where applicable) to maximize performance. This approach ensures that our sparsity comparisons reflect each method's best achievable performance rather than arbitrary parameter choices.
-
Rule Complexity: We understand each other regarding rule lengths. We are not arguing in favour of rules that include spurious terms. If a short rule is as good as a longer one, we also prefer the short one. However, it is important to distinguish that long(er) rules may be specialized rather than spurious. Short rules can typically describe common patterns (the norm) very well, whereas longer rules may be necessary to capture important long-tail behavior.
Our analysis of NeuRules shows that the median rule length is indeed smaller than the mean – i.e. most rules are short – while some very long rules serve specialized functions. Upon closer inspection we find that most rules capture substantial parts of both the training and the test data, but there are also a few rules that capture only small portions of the training set and fire very infrequently on test data. Whether these are generally spurious (bad) or rare (good) needs further investigation, and it will be interesting to develop more advanced regularization or post-hoc pruning techniques to prune the bad apples.
Last, but not least, we do argue that having spurious rules in rule-based models differs fundamentally from spurious correlations in other machine learning approaches, such as linear regression. In rule lists, spurious rules typically fail to fire on new data and cause minimal direct harm beyond occupying a slot in the rule list, which, while suboptimal, is less problematic than spurious parameters in continuous models. While it is not immediately clear how to do so during training, it is straightforward to prune unused or rarely used rules from the list during post-processing using a hold-out set.
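As a sketch of the post-processing idea mentioned above, one could drop rules that are (almost) never the first to fire on a hold-out set. The rule representation and the coverage threshold are illustrative assumptions, not the paper's procedure.

```python
# Illustrative post-hoc pruning of rarely firing rules using a hold-out set.
# A rule is kept only if it is the first rule to fire on at least a
# `min_coverage` fraction of hold-out samples (threshold is an assumption).
import numpy as np

def first_firing_rule(x, rules):
    """rules: callables x -> bool, in list order; returns index or -1 (default rule)."""
    for i, rule in enumerate(rules):
        if rule(x):
            return i
    return -1

def prune_rule_list(rules, X_holdout, min_coverage=0.01):
    hits = np.zeros(len(rules))
    for x in X_holdout:
        i = first_firing_rule(x, rules)
        if i >= 0:
            hits[i] += 1
    keep = hits / len(X_holdout) >= min_coverage
    return [r for r, k in zip(rules, keep) if k]

# Toy usage with two threshold rules on 2 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
rules = [
    lambda x: x[0] > 0.0,                   # fires often, kept
    lambda x: x[0] > 3.5 and x[1] > 3.5,    # (almost) never fires first, pruned
]
print(len(prune_rule_list(rules, X)))       # typically prints 1
```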
Rule learning methods are essential for interpretable AI and for application domains requiring high trustworthiness. Current rule-learning methods need pre-discretized features, and some rule-learning techniques are limited in scalability. The paper proposes a novel framework for learning rules from data without feature pre-discretization. The method constructs predicates (Boolean conditions), rules, and the rule list, ordered by the importance of rules, in a fully differentiable manner. The experimental results indicate that the proposed method obtains state-of-the-art performance.
Strengths and Weaknesses
Strengths:
The paper is well-structured and presents its ideas using clear and rigorous mathematical formulations. The authors design a unified method to simultaneously learn Boolean conditions (denoted as predicates in the paper), conjunctions, and the order of the rule list.
Weaknesses:
The paper omits preliminary explanations of fundamental concepts, such as the notion of a predicate, or misses the proper citations for these notions. Additionally, it lacks important details regarding the experimental setup, including the specific optimization methods employed. Moreover, the related work omits discussion of rule learning based on column generation and of first-order rule learning methods.
Questions
-
The notion of an atom in first-order logic (FOL) typically includes both terms and predicates. However, the concept of a “predicate” as described in your paper appears to differ slightly from its traditional definition in FOL. Could you please clarify the relationship between the predicate used in your paper and the standard predicate in FOL?
-
Additionally, in the rule learning community, the term “soft predicate” (as described in [1]) carries a different meaning from the one presented in your work. Could you also elaborate on the connection or distinction between the soft predicate defined in your paper and that in [1]?
-
Regarding Equation (4), could you explain how the parameters are chosen? For example, what is the rationale behind the formulation and the exponential term?
-
In the ablation studies, the paper does not appear to provide the specific values used for the temperature parameter. Does its value have a significant impact on the experimental results?
-
Can you describe the connections, advantages, and disadvantages of the proposed rule learning method relative to rule learning methods based on the column generation algorithm, such as [2]?
[1] Wang, W. Y., Mazaitis, K., & Cohen, W. W. (2015, July). A Soft Version of Predicate Invention Based on Structured Sparsity. In IJCAI (pp. 3918-3924).
[2] Sanjeeb Dash, Oktay Günlük, Dennis Wei: Boolean Decision Rules via Column Generation. NeurIPS 2018: 4660-4670
Limitations
yes
Justification for Final Rating
The authors have provided a proper rebuttal. I recognize that the paper has good technical contributions, but there are still some issues on the presentation side. I have already given a rather positive score, which is not very strong but I think is appropriate.
Formatting Concerns
No paper formatting concerns.
Dear Reviewer, thank you for your in-depth, constructive feedback.
Below we address your questions and concerns:
-
Predicate FOL: Thank you for your comment. Although our predicates are parameterized and grounded in feature space, they serve a role analogous to predicates in FOL: they act as the basic Boolean conditions used to construct a conjunctive rule, similar to how predicates in FOL form the structure of atomic formulas.
In our work, we restrict these atomic conditions to threshold-based conditions over a single feature. In principle, for every individual feature there exist infinitely many predicates given by different combinations of thresholds. Thus, in subgroup discovery/rule mining, the first step is usually to generate an initial set of predicates using a fixed discretization. Instead, we propose a method that explicitly optimizes these predicates rather than pre-defining them a priori. After training, these learned predicates become fixed, binary-valued conditions that can be interpreted as atomic logical components. Both concepts are very similar in spirit but differ in the formalities. We will clarify this connection in the text.
-
Soft predicate: In [1], soft predicates refer to latent logical abstractions that emerge through structured sparsity over clause weights. These predicates are not explicitly defined; rather, they represent shared parameter patterns across clauses, serving as a soft form of predicate invention. In contrast, our soft predicates are explicit, parameterized threshold functions over real-valued features. Each outputs a value in [0,1], approximating binary inclusion and enabling differentiable rule learning.
Although our formulation resembles fuzzy predicates in that both yield values in [0,1], the underlying motivation differs. In fuzzy logic, the fuzziness reflects graded truth, such as "x is partially tall." In our case, the soft predicate is a training-time relaxation, used to enable end-to-end optimization via gradient descent. The goal of this relaxation is to learn the threshold by which we determine "x is tall", by means of minimizing the predictive error. The learned predicates are ultimately interpreted as binary threshold conditions; although the intermediate softness could carry similar semantics (albeit with a different functional form than, say, the weak conjunction of Łukasiewicz logic), its primary purpose is to facilitate learning. As temperature annealing progresses, these soft predicates converge to interpretable, discrete conditions similar to classical logical atoms. We will clarify these distinctions in the revised manuscript to avoid confusion and better position our work in relation to both prior neural-symbolic methods and fuzzy logic approaches.
-
Threshold parameters: The lower and upper bound parameters are learned via gradient descent. The formulation in Eq. (4) allows us to take gradients with respect to the bounds. In particular, in Appendix A1 we show that Eq. (4) is equivalent to Eq. (21). In the limit of vanishing temperature, the predicate checks whether the input lies above the lower bound and below the upper bound, i.e., it behaves like a strict thresholding function. We will clarify the intuition behind this formulation in the updated text (a small illustrative sketch follows at the end of this response).
-
Temperature: In our evaluation, we use a fixed starting temperature that is gradually annealed to a small final value (Appendix E.2). Additionally, in the hyperparameter sensitivity section (Appendix E.1), we re-run our experiments with different starting temperatures. The results displayed in the upper left corner of Figure 5 show that, under reasonable temperature values, Neurules' performance is stable.
-
Connection to column generation: Both [2] and Neurules seek to find accurate and interpretable sets of rules, but they differ in the kind of rule ensemble that is learned and in the scope of application. The column generation method focuses on learning CNF (AND-of-ORs) and DNF (OR-of-ANDs) rule sets on binary data. Neurules, on the other hand, learns ordered rule lists (IF-THEN; ELSE-IF) on continuous data.
Column generation casts rule learning as a mixed-integer program with provable guarantees, but to that end requires pre-mined candidate rules. In contrast, our method learns thresholds, rules, and their ordering jointly in a fully differentiable framework, without pre-discretization or candidate enumeration. While our approach does not offer global optimality guarantees, our evaluation shows that Neurules outperforms combinatorial methods for rule list learning that first mine candidate rules (SBRL, CORELS), which in the case of CORELS come with optimality guarantees over that search space. We will include this distinction in the revised text.
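To tie the "Threshold parameters" and "Temperature" points above together, below is a small sketch of a sigmoid-based interval predicate that hardens into a strict check as the temperature is annealed. The sigmoid form is an assumption standing in for the exact Eq. (4)/(21).

```python
# Soft interval predicate with learnable lower/upper bounds: as the temperature
# is annealed toward zero, the output approaches the hard check lower <= x <= upper.
# The sigmoid form is an illustrative assumption, not the paper's exact Eq. (4)/(21).
import numpy as np

def soft_predicate(x, lower, upper, temp):
    s = lambda z: 1.0 / (1.0 + np.exp(-z / temp))
    return s(x - lower) * s(upper - x)

x = np.array([-1.0, 0.2, 0.5, 0.9, 2.0])
lower, upper = 0.0, 1.0
for temp in (1.0, 0.3, 0.05):
    print(temp, np.round(soft_predicate(x, lower, upper, temp), 3))
print("hard:", ((x >= lower) & (x <= upper)).astype(float))
```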
Thank you for your engagement and the interesting points you raise. We think the most fitting view is that we focus on a particular syntax for defining predicates of decision rules, namely thresholds on continuous features. Note that we are by no means unable to deal with binary variables; our rules frequently place threshold conditions on binary features as well. We will clarify this constraint regarding the kind of predicate in the manuscript.
DNFs are a close cousin of rule lists, as it is possible to construct a DNF from any rule list. The key difference, however, is that rule lists have an inherent hierarchy over the rules as imposed by the if-then-else-if structure, which defines a sequential order in which the rules are applied. DNFs, on the other hand, come with no natural order between the rules, which affects their interpretation. Another difference is the ability to predict multiple classes in the consequent of each rule, e.g., if rule 1 then class 1, else if rule 2 then class 2, and so on. The close connection between DNFs and rule lists is certainly very relevant, and we will discuss the work you referenced when introducing rule lists.
Thank you for the further explanations on the difference between predicates in decision rules and those in FOL, as well as the difference between rule lists and DNF formulas. The latter difference is an interesting point to discuss, and reminds me of the difference between Prolog and ASP, where Prolog cares about the ordering of rules but ASP is more declarative. From the perspective of Knowledge Representation, one prefers a declarative language for reasoning. Perhaps the authors could adapt their method to learn DNF expressions too, and then compare the method with many other existing tools.
On the Term "Head": I now noticed that the authors use the term head for the conditional part of a rule. This is the opposite of how each side of a rule is named in logic programming, where the conclusion part of a rule is called the head (and the condition part is called the body). I wonder whether the authors' use of "head" actually exists in the literature; if not, it should be corrected.
In this paper, we focus exclusively on rule lists and hence compare our method exclusively to those specific to rule list learning. Whether the techniques introduced in Neurules are transferable to differentiable DNF/CNF solving approaches is an interesting idea, and we will definitely explore this in our future research.
Regarding the term rule head, works from association rule mining use the terms head and tail for antecedent and consequent, respectively [1]. We acknowledge the potential confusion with Prolog terminology, where the head describes the consequent, and we are looking into resolving the ambiguity.
[1] Fischer, Jonas, and Jilles Vreeken. "Sets of robust rules, and how to find them." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Cham: Springer International Publishing, 2019.
- Thank you for the rebuttals. Regarding the first question, I understand the authors’ response, but I still feel that using first-order logic (FOL) to describe decision rules may not be entirely appropriate. In this context, the predicates are limited to cases involving two numeric thresholds. It would be helpful to clarify this limitation and align the formulation with existing work in decision rule learning.
- The second issue is closely related to the use of FOL. Again, connecting it more clearly with prior research on decision rule learning is important. The learned rules in the paper can be seen as FOL expressions under a language bias, where predicates represent numeric ranges.
- From what I’ve observed, the multiple DNF rules in [1] can also be interpreted in an “IF–THEN–ELSE IF” structure. Therefore, I suggest the authors consider providing a more specific reference for the definition of rules in Section 2.
- Based on the above comments, I would prefer to keep my current score.
[1] Sanjeeb Dash, Oktay Günlük, Dennis Wei: Boolean Decision Rules via Column Generation. NeurIPS 2018: 4660-4670
This paper introduces a method called NeuRules, which is a differentiable method to obtain a rule list, an interpretable and easy-to-implement kind of decision tree. The authors' method takes data comprising numeric (continuous or encoded as numeric) features and predicts one of the discrete output classes. Rather than applying combinatorial methods with discrete choices, they train such a rule list using a differentiable relaxation, designing, implementing, and running the relaxation. Their particular choices in this relaxation lead to rule lists that promote sparsity and choose the order of the rules (with a temperature annealing that tightens during training). They run their method on a variety of tabular datasets and compare NeuRules with other existing methods, mainly using F1 scores. Although NeuRules doesn't score better on every dataset with this metric, on average it scores the best.
Strengths and Weaknesses
The combination of various differentiable relaxations is able to work in harmony to produce rule lists, which are very discrete objects, and to order them organically. Even though rule lists are not as prevalent as, say, decision trees, they are still important in areas which require high interpretability, so this work is significant in that it allows for flexible creation of rule lists and makes the modern machine learning practitioner feel at home with the use of gradients in the learning process. Thus I can also imagine this being integrated into other machine learning pipelines as an interpretable module (with possibly non-interpretable features).
As for weaknesses on originality and significance: the differentiable relaxations work (according to the experiments that were done), but there is no formal reason or heuristic argument that they are the best according to some metric.
Questions
Q. How does one work with categorical features? How one converts such features to continuous ones should be mentioned, or the limitations section should mention that the whole setup requires continuous features.
Q. In lines 284-289 in the ablation studies for rule ordering, the wording does not make clear whether NeuRules is being compared to RLNet or to a version of NeuRules without ordering (such as deactivating the priority variables). Which one is it? In the former case we cannot really call it an ablation study, and Table 6 seems to indicate that that was not what was done. But in any case the wording in Section 5.3 doesn't make it clear exactly what manipulation was done. Could this be made clearer?
Q. Gradient Shaping (GS) is not defined explicitly as a method. Does "without GS" in Figure 4b mean setting the slack term to zero? If yes, then this should be stated.
Q. What does color represent in figure 4(c)?
Q. Things are not clear in Equation (10): are the logit coordinates learnable parameters? Also, are the consequents integers (which is what would come out of the argmax), or something else (which is what is written in Equation (11), for example, but doesn't make sense if the consequent is just a number)? I think there is a notational confusion here. What is the notation?
Q (cont). The confusion continues in Equation (13), where the relaxed consequent is introduced (for the first time, if I'm not mistaken). What exactly are the differentiable relaxations of these consequents? Even if they are made numeric (rather than remaining a symbolic "yes" or "no") by one-hot encoding, what exactly does their differentiable relaxation mean? Why would we treat them as learnable? I think I understand the overall method, but the notation here is confusing, if not an outright mistake.
The items below are not really questions but typos that should be fixed or certain suggestions for better presentation.
-
Typo on page 1, line 28 in "i.e. the set of condtions that specify when a rule triggers." condtions--> conditions. Also better to say ``when a rule is triggered''.
-
Line 81, page 3, .
-
Table 1 should be in section 5, or better yet in section 5.1 if possible. And Table 1 would do well to include the lengths of the produced rule lists for methods which are rule lists.
-
In Appendix A1, the proof's first step is to divide the numerator and the denominator in Equation (18) by a common factor, so as to bring us to (21). Which begs the question: why was the soft predicate (16) given as such, and not directly as (21), where its form and the purpose of each term would be clearer? If it is a numerical stability issue, then this should be mentioned.
Limitations
Yes
Justification for Final Rating
After reading through other referees' comments and the authors' responses, I have decided to maintain my score.
Formatting Concerns
No major formatting issues.
Dear Reviewer,
Thank you for your in-depth, constructive feedback. Below we address your questions:
- Categorical features: We use one-hot encoding for categorical features and treat them as continuous features. We will make this clear in the updated manuscript.
- Ablation rule ordering: In the ablation, we compare to NeuRules with a fixed, predefined rule order, where the priority vector is not learned during training. This is similar to using a fixed order as in RLNet, but with our predicate and rule learning components. We will clarify this.
- Gradient shaping: Yes, we disable the gradient shaping mechanism by setting the slack to zero.
- Fig. 4 colors: The red color marks the area where the gradient shaping leads to a performance loss. The blue signals where an improvement is achieved. The color of the markers is scaled according to the performance of the ablated model.
- Notation consequent: Each rule has a consequent that is a learnable parameter vector representing logits, which are turned into class probabilities using the softmax function. We acknowledge that the current notation for the consequent of each rule and for the individual class probabilities is confusing, and we will adapt it: one symbol will denote the logit parameter vector and a separate symbol the probability distribution over classes induced by the learned logits.
- Predicate: We agree that Equation (21) makes the predicate easier to understand, so we will substitute it in the main text.
- Typos: We will incorporate your feedback. Thank you for your thorough review.
Lastly, we would like to provide further context to the rationale behind our differentiable relaxations. We only use relaxations that converge to their strict counterparts. For rule lists, the methodology of learning rule ordering and predicates is the first of its kind. For the logical conjunction, we introduce a novel relaxation approach with the slack and derive its gradient shaping (GS) properties. We evaluate GS in an ablation, showing significant improvements in accuracy and rule sparsity. We empirically find that Neurules as a whole performs better compared to other neural approaches, which employ different relaxations for logical conjunctions (RLNet, DRNet, RRL).
I appreciate the authors’ responses. They have clarified several aspects. I keep my original positive score.
This paper introduces NeuRules, an end-to-end differentiable framework for learning rule lists. The method jointly learns discretization, predicate formation, conjunction, and ordering, using principled relaxations such as soft predicates, slack-based conjunctions, and Gumbel-softmax ordering. The framework aims to combine the interpretability of rule lists with the scalability and flexibility of gradient-based optimization.
Reviewers generally agreed that the paper is technically solid and makes a clear contribution to interpretable machine learning. The differentiable relaxations are well-motivated, carefully designed, and advance prior neural rule-learning approaches. The paper is clearly written, with thorough mathematical exposition, and is supported by extensive experiments across diverse tabular datasets. Results show that NeuRules achieves strong average performance compared to both neural and combinatorial baselines, while maintaining interpretability. Multiple reviewers highlighted the potential impact of NeuRules as a bridge between modern differentiable learning and interpretable symbolic models.
Therefore, my recommendation is Accept.