PaperHub
Overall score: 7.8/10 · Poster · 3 reviewers (ratings: 5, 4, 3; min 3, max 5, std dev 0.8)
ICML 2025

DOLPHIN: A Programmable Framework for Scalable Neurosymbolic Learning

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
neurosymbolic learning, scalability, vectorization, differentiable reasoning

Reviews and Discussion

Review (Rating: 5)

This paper introduces Dolphin, a novel neurosymbolic learning framework. Dolphin provides three key abstractions: symbolic objects, tags (which associate symbols with tensors), and distributions (which map symbols to probabilities). Leveraging this abstraction, Dolphin decouples symbolic and probabilistic computations, enabling vectorized probabilistic computation on GPUs. This allows Dolphin to efficiently handle large-scale batched data. To ensure end-to-end differentiability, Dolphin employs vectorized provenance semirings, enabling parallel gradient computation on GPUs. It also provides two customizable provenance mechanisms, DAMP and DTKP, allowing users to fine-tune symbolic differentiation. Dolphin introduces five core operations that facilitate batched data processing, complex control flows, and recursion. Furthermore, its seamless integration with Python and PyTorch allows users to easily develop and deploy complex neurosymbolic programs with flexibility. Experimental results across 13 neurosymbolic tasks demonstrate that Dolphin significantly improves computational efficiency while maintaining state-of-the-art accuracy, outperforming existing methods in scalability and performance.
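To make these abstractions concrete, here is a minimal sketch of what such a program might look like for MNIST digit addition, based on the Distribution and apply primitives that appear in the author responses below (my illustration, not verbatim Dolphin API; cnn, img1, and img2 are hypothetical):

# Hedged sketch, not the actual Dolphin API: MNIST digit addition.
logits1 = cnn(img1)   # (batch, 10) digit logits from an assumed CNN
logits2 = cnn(img2)
d1 = Distribution(list(range(10)), logits1)   # symbols 0..9 tagged with probabilities
d2 = Distribution(list(range(10)), logits2)
sum_distr = apply(d1, d2, lambda a, b: a + b) # Distribution over sums 0..18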

Questions for Authors

  1. The limitations section notes that Dolphin can only deal with discriminative models. However, there are many non-standard discriminative models with multiple heads and complicated outputs, such as object detectors, which are common in application domains. It would be good if the authors could provide a discussion or case studies on how Dolphin can be used with popular discriminative models in typical application domains. This would be very useful for practitioners who want to use Dolphin in real-world scenarios.

Claims and Evidence

Yes, the claims are well supported.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

There are no theoretical claims in this paper.

Experimental Design and Analyses

Yes.

Supplementary Material

I read the appendix.

Relation to Broader Scientific Literature

This paper is related to neurosymbolic programming research. The batched processing of tags is closely related to MapReduce. The authors already provide a detailed discussion of these backgrounds in the paper.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths

  1. Dolphin provides a practical and user-friendly solution for neurosymbolic programming. The GPU acceleration addresses a key limitation of previous methods like Scallop, offering a computationally efficient approach with just five core operations. Its seamless integration with Python and PyTorch significantly enhances usability and flexibility, making it more accessible for practitioners. Additionally, the provenance semiring framework provides a strong and elegant theoretical foundation for the proposed method.
  2. The paper is well-written, easy to follow, and presents a thorough background analysis. It clearly explains the limitations of previous approaches and justifies the need for Dolphin, making it an engaging and informative read.

Weaknesses

  1. The symbolic operations are restricted to semiring structures over tags, which may limit the expressiveness of symbolic programs. While this abstraction enables efficient GPU computation, it could potentially restrict the range of symbolic reasoning tasks that can be expressed within the framework.
  2. The benchmarks primarily focus on relatively simple tasks and do not demonstrate Dolphin’s applicability to real-world, complex scenarios. The paper does not explore how Dolphin could be scaled to more sophisticated domains, such as robotics or autonomous driving, which may involve complex logical reasoning and control modules, as well as unstructured outputs from deep models (e.g., object detectors with noisy or non-standard outputs); this aspect is not thoroughly discussed.

Overall, this is an excellent paper that presents an elegant solution to the key challenges of previous methods, particularly in efficiency and complexity. It brings neurosymbolic programming much closer to practical usability, a promising paradigm with the potential for significant impact on machine learning. Its seamless integration with Python and PyTorch further enhances its compatibility with existing deep learning pipelines. Thus, it would be exciting if the authors could further illustrate its potential for real large-scale applications.

Other Comments or Suggestions

  1. It would be good to provide a simple example and more details about how the data are changing for each of the 5 operations. For example, in Union, what is the exact format of the returned data? Does it produce a list of tuples with tagged tensors? In Filter, what happens to the symbols that are filtered out? Do they become empty placeholders, removed entirely, or assigned a default value?
Author Response

We thank the reviewer for their suggestions and will add the discussions to the revised paper.

Discussion on other uses of Dolphin

Dolphin can be used for any task where the output of a model can be cast as a distribution over symbols with associated probabilities, including discriminative models. Consider an example of autonomous driving, where an object detector is used to detect obstacles on the road. Commonly used models like Faster R-CNN (cite) output bounding boxes, class probabilities, and confidence scores for multiple objects in an image. One can create a custom Python class that associates the coordinates of each bounding box with a Distribution object over the classes and their probabilities output by the model:


CLASSES = ['car', 'person', 'shirt', ...]

class DetectedObject:
    def __init__(self, coords, score, class_logits):
        self.coords = coords    # bounding box coordinates
        self.score = score      # detection confidence score
        self.distr = Distribution(CLASSES, class_logits)  # distribution over classes

One can then derive the probability that a given pair of objects represents a person inside a car:

# function to check if one set of coordinates is inside another
def is_inside(coord_a, coord_b):
    ...

person_inside_car = apply(o1.distr, o2.distr,
    lambda c1, c2: c1 == "person" and c2 == "car" and is_inside(o1.coords, o2.coords))

Here, person_inside_car will be a distribution over “True” and “False”, giving the probability that detected objects o1 and o2 represent a person inside a car.
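One could, hypothetically, extend this to scan all detected-object pairs and collect their per-pair Distributions for a downstream loss (objects is an assumed list of DetectedObject instances; this extension is not from the paper):

from itertools import combinations

# Hypothetical extension: per-pair "person inside car" distributions.
pair_distrs = [
    apply(o1.distr, o2.distr,
          lambda c1, c2: c1 == "person" and c2 == "car"
                         and is_inside(o1.coords, o2.coords))
    for o1, o2 in combinations(objects, 2)
]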

Explanation of Operations

We describe the Filter and Union operations in more detail:

Filter. In the Filter operation, the symbols that do not satisfy the filtering condition are completely removed from the returned distribution. Consider a Distribution $D :: \{0 \rightarrow t_1, 1 \rightarrow t_2, 2 \rightarrow t_3\}$ with a filtering condition lambda x: x % 2 == 0, which removes odd numbers from the distribution. The returned Distribution will be $D_\text{filtered} :: \{0 \rightarrow t_1, 2 \rightarrow t_3\}$.

Union. In the Union operation, a new Distribution is returned that contains the symbols from both input Distributions. A Union can occur over any pair of Distributions regardless of the types of symbols. For instance, consider as inputs the Distributions $D :: \{0 \rightarrow t_1, 1 \rightarrow t_2, 2 \rightarrow t_3\}$ and $D' :: \{1 \rightarrow t'_1, 4 \rightarrow t'_2\}$. The output Distribution will be $D_\text{union} :: \{0 \rightarrow t_1, 1 \rightarrow t_2 \oplus t'_1, 2 \rightarrow t_3, 4 \rightarrow t'_2\}$. Note that the tags for common symbols are disjuncted.
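As a framework-free illustration of these semantics (a toy sketch only, not the Dolphin API; it assumes tags are scalar probability tensors and models the disjunction ⊕ as clamped addition in the spirit of an add-mult provenance):

import torch

def filter_distribution(d, pred):
    # Symbols failing the predicate are removed entirely, tags and all.
    return {s: t for s, t in d.items() if pred(s)}

def union_distribution(d1, d2):
    # Keep symbols from both inputs; disjunct tags of shared symbols.
    out = dict(d1)
    for s, t in d2.items():
        out[s] = torch.clamp(out[s] + t, max=1.0) if s in out else t
    return out

D  = {0: torch.tensor(0.5), 1: torch.tensor(0.3), 2: torch.tensor(0.2)}
Dp = {1: torch.tensor(0.6), 4: torch.tensor(0.4)}
filter_distribution(D, lambda x: x % 2 == 0)  # {0: 0.5, 2: 0.2}
union_distribution(D, Dp)                     # {0: 0.5, 1: 0.9, 2: 0.2, 4: 0.4}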

Restricted to Semiring Structures Over Tags

While symbolic operations in Dolphin are indeed defined using semiring structures over tags, similar to existing frameworks such as Scallop, this abstraction is primarily introduced to facilitate efficient computation, including GPU acceleration, and it has proven flexible enough to solve a wide range of neurosymbolic tasks.
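Conceptually, a pluggable provenance only needs to supply the semiring operations over batched tag tensors, so all tag algebra vectorizes. A rough sketch (the names are hypothetical, not Dolphin's actual interface, and the clamping is an assumption):

import torch

class DAMPProvenance:
    zero, one = 0.0, 1.0
    @staticmethod
    def disj(a, b):                        # ⊕: probability sum, clamped (assumption)
        return torch.clamp(a + b, max=1.0)
    @staticmethod
    def conj(a, b):                        # ⊗: elementwise product
        return a * b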

Review (Rating: 4)

This paper brings enhanced scalability to neurosymbolic learning. The authors introduce a framework that allows symbolic computation to be conducted on the CPU and vectorized computation of probabilities on the GPU. The authors introduce several Pythonic programming primitives to facilitate writing neurosymbolic programs. The tool allows pluggable support for different vectorized provenances to compute symbolic gradients. The empirical results demonstrate that, in general, DOLPHIN is much more efficient than SoTA tools, and achieves comparable, if not better, accuracy on various neurosymbolic programming tasks.

Questions for Authors

I have no major questions that will affect my review. I have a few minor questions, which I wrote in "Other Comments Or Suggestions".

Claims and Evidence

The claims are well-supported by the methodology and the experiments.

Methods and Evaluation Criteria

The proposed method and evaluation make sense for the task at hand.

Theoretical Claims

N/A – There were no proofs to check.

Experimental Design and Analyses

I did not see any issues with the soundness of the experimental designs or analyses.

Supplementary Material

I reviewed Appendix A and D. I briefly looked through the other appendices.

Relation to Broader Scientific Literature

The authors present a solution to a known issue (the scalability issue) in neurosymbolic programming, demonstrating superior performance to SoTA techniques in the literature.

Essential References Not Discussed

Not that I am aware of.

Other Strengths and Weaknesses

Strengths:

  • The paper addresses an important problem – making neurosymbolic learning more scalable. The framework is elegant and clean, and is based on effective and impactful programming paradigms.
  • The experimental results are thorough, and demonstrate superior performance of DOLPHIN over SoTA work.
  • The paper provides fertile ground for new work and acknowledges certain limitations of their framework.
  • The paper is very well-written. Even as someone who is not as familiar with neurosymbolic programming, I was able to easily grasp the concepts presented in the paper. The appendices provide useful background information for the reader.

Weaknesses:

  • I do not see any major weaknesses in the paper within the scope of the work presented.

Other Comments or Suggestions

Minor Comments/Questions

  1. Table 2 – it would be helpful to add the unit of time in the caption, or in the main body of the text.

  2. Section 4.4 – The authors mention they “use” a certain provenance depending on the benchmark, but given the preceding sentence, does this mean that they report the corresponding provenance’s result for each dataset?

  3. Do the authors have any conjectures about why DOLPHIN’s accuracy is slightly lower than Scallop in the CLUTTR benchmark?

  4. Grammatical error (lines 107-108) – “enable to efficiently compute symbolic gradients” → “enable the efficient computation of symbolic gradients”.

Author Response

We thank the reviewer for their suggestions. We will add the unit of time in the caption of Table 2 and fix the grammatical errors they pointed out.

We indicate the provenance used for each benchmark in Section 4.4 (RQ2: Accuracy) and compare both provenances in Section 4.5 (RQ3: Provenance Comparisons).

The Dolphin CLUTRR program is designed for batched computations, unlike Scallop, which evaluates each sample independently. Moreover, the Scallop programs for CLUTRR-3 and CLUTRR-4 exhibit higher accuracy variance, indicating possible minor numerical issues or nondeterminism. We believe these factors account for the small (~2% pts) accuracy difference.

Review (Rating: 3)

This work presents DOLPHIN, a Python library that allows for efficient training of traditional neurosymbolic methods on CPUs and GPUs. The key idea is to accelerate probabilistic symbolic manipulations on GPUs, where other solutions (e.g., Scallop) rely on manipulations on CPUs.

In addition to the efforts in defining a Python library with abstractions and operations, DOLPHIN’s main conceptual contribution is a new provenance semiring, Differentiable Top-k Proofs with Add-Mult (DTKP-AM), which is a GPU-friendly vectorized approximation of the weighted model counting (WMC) version of DTKP.
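A rough illustrative sketch of the add-mult idea as described (my reading, not the authors' actual formulation): a DTKP tag stores up to k proofs as rows of literal probabilities, and DTKP-AM scores a tag by multiplying within proofs and summing across them, instead of performing exact WMC.

import torch

def dtkp_am_score(tag):                 # tag: (k, num_literals) proof probabilities
    proof_probs = tag.prod(dim=-1)      # multiply within each proof
    # sum across proofs; clamping to a valid probability is an assumption
    return proof_probs.sum(dim=-1).clamp(max=1.0)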

Experimental results confirm that performing probabilistic symbolic manipulations on GPUs is indeed beneficial. Moreover, the new DTKP-AM provenance can improve the accuracy on one benchmark (HWF), while it performs on par (Path, CLUTRR, and Mugen) or worse (SumN) than a baseline provenance (DAMP).

Questions for Authors

More experimental results without strict timeout constraints for a fair accuracy comparison would be appreciated (see comments on “Claims and Evidence” regarding Experimental design with tight timeout)

How does the accuracy and timing change for DTKP-AM when increasing k?

It seems that a fair share of the compute time is still spent on the CPU to perform symbolic manipulations (once for each batch). Wouldn’t it be possible to precompile the symbolic program and run everything on the GPU?

The results of the Scallop baseline on Mugen reported in Figure 5 (and Figures 8&9 in the appendix) are significantly worse than what the original work (Li et al. 2023) show in their paper. How serious were the efforts in reproducing that work? I would appreciate comments about the reason for this large discrepancy in the reproduction, beyond what is mentioned in Appendix D.7.

Claims and Evidence

Experimental design with tight timeout: I am skeptical concerning the comparison of accuracy and training time (until convergence) for different methods, given the very tight training time budget constraint (10 hours) on a consumer-grade GPU (NVIDIA GeForce RTX 2080 Ti). The accuracy comparison in Figure 5 can be seen as unfair since some of the methods have not been trained until convergence (e.g., Scallop on HWF-15/19). Disentangling training time from accuracy would give better insights, e.g., by comparing the time per epoch, the time to convergence without a hard timeout constraint, and the accuracy at convergence.

No clear global benefit for DTKP-AM: One of the main contributions of this paper, DTKP-AM, does not show a clear benefit across the different benchmarks. Indeed, the results presented in the main paper primarily use another provenance semiring (DAMP). DTKP-AM is only beneficial on the HWF benchmark (see Figure 10). This could be due to the very low top-k used (k=1 for all benchmarks); such a low k might have been chosen to achieve competitive timing results compared to DAMP.

Methods and Evaluation Criteria

See comments on “Claims and Evidence” regarding Experimental design with tight timeout.

Theoretical Claims

This paper does not contain any proofs or theoretical claims.

Experimental Design and Analyses

See comments on “Claims and Evidence” regarding Experimental design with tight timeout

Supplementary Material

I read Appendices A, B, C, D, E, and G. Moreover, I had a brief look at the code. It would be good if the code could be open-sourced.

Relation to Broader Scientific Literature

Accelerating probabilistic manipulations in neurosymbolic models is an important problem to make these methods usable in practice. At least the introduction of DOLPHIN as a Python framework for performing probabilistic manipulations on GPUs is of value. However, the second contribution (DTKP-AM) does not have a clear benefit in practice (see my comments on “Claims and Evidence”).

Essential References Not Discussed

The following work accelerates arithmetic circuits (i.e., WMC) on GPUs. Hence, it should be discussed and possibly compared in this paper:

Jaron Maene and Vincent Derkinderen and Pedro Zuidberg Dos Martires, “KLay: Accelerating Arithmetic Circuits for Neurosymbolic AI,” ICLR, 2025.

Moreover, there are other works that introduce approximations for probabilistic manipulations:

Jaron Maene and Luc De Raedt, “Soft-Unification in Deep Probabilistic Logic,” NeurIPS, 2023.

Emile van Krieken et al., “A-NESI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference,” NeurIPS, 2023.

Other Strengths and Weaknesses

The introduction of the general “apply” function is appreciated, as it goes beyond the simple additive relations usually used in neurosymbolic benchmarking.
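For instance (a hedged illustration, assuming apply generalizes to three input Distributions, as HWF-style formula evaluation would require; the variable names are hypothetical):

# Hypothetical: symbolically evaluating a two-operand handwritten formula.
result = apply(digit1_distr, op_distr, digit2_distr,
               lambda a, o, b: a + b if o == "+" else a * b)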

Other Comments or Suggestions

The date is missing in the reference «Kambhampati, S., Valmeekam, K., Guan, L., Verma, M., Stechly, K., Bhambri, S., Saldyt, L. P., and Murthy, A. B. Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks. In Forty-first International Conference on Machine Learning.»

It would be good to add a reference when introducing Differentiable Add-Mult Probabilities.

The caption of Table 2 should mention that the presented numbers are in seconds.

Please provide more details on the chosen networks in Appendix D. E.g., the exact CNN architecture is not mentioned in D.2. Also, does the work use a pretrained network? Moreover, it is not clear what is trainable in D.5 regarding Roberta-base: do you train only the classification head? Do you use a pretrained network?

Author Response

We thank the reviewer for their suggestions on experiment design and additional literature. We will add the results of the additional experiments discussed below to the paper and expand the related work section.

Scallop’s Performance on Mugen

Scallop’s Mugen results were obtained from their PLDI’23 artifact. Despite significant efforts, we were unable to reproduce those results; thus, we reported the numbers we observed. We reached out to the Scallop authors, who provided us with a different symbolic program that reproduced their results using DTKP-WMC (K=5). We also redesigned the Dolphin program for Mugen to be more batchable across samples, which we run with DTKP-AM (K=5):

Version | Framework | VT R  | TV R  | Total Time | Time per Epoch
1k      | Dolphin   | 89.43 | 91.83 | 2.39e3     | 92.3
1k      | Scallop   | 86.26 | 89.57 | 6.71e3     | 314.68
5k      | Dolphin   | 95.03 | 95.36 | 1.15e4     | 470.05
5k      | Scallop   | 91.26 | 94.03 | 3.59e4     | 1.58e3

While Scallop now converges, Dolphin still achieves higher accuracies and trains ~3.3x faster than Scallop.

Experiment Results without a Strict Timeout

We ran the experiments until Scallop converged and report its accuracies. We will do the same for the other baselines in the revised version.

Benchmark | Dolphin Total Time | Dolphin Accuracy | Scallop Total Time | Scallop Accuracy
HWF-15    | 9.78e3             | 92.13            | 1.66e5             | 39.58
HWF-19    | 1.63e4             | 86.18            | 1.82e5             | 7.02
Path-256  | 1.94e4             | 82.62            | 1.14e5             | 83.01
Path-128  | 1.78e4             | 84.03            | 4.17e4             | 84.90

For Path-128 and 256, Scallop converges to Dolphin’s accuracies in ~11.6 and ~31.7 hours (~2.4x and ~6x slower), respectively. For HWF-15, Scallop reaches 100% on only 2 seeds out of 6, but stays below 12.5% on the rest even after ~46 hours of training on average. For HWF-19, none of the 6 runs were able to converge after ~50 hours of training on average.

DTKP-AM with Different Values of K

We report preliminary results of this experiment. As a general trend, with the exception of HWF-19, the value of K doesn’t have much impact on the final accuracy (Acc). Increasing K also does not significantly impact the per epoch time (T/ep) due to DTKP-AM's vectorizations.

Benchmark | K=1 Acc     | K=1 T/ep | K=3 Acc     | K=3 T/ep | K=5 Acc     | K=5 T/ep | K=7 Acc     | K=7 T/ep
Sum-15    | 9.61        | 137.21   | 10.8        | 147.70   | 10.5        | 153.52   | 10.2        | 158.54
HWF-19    | 8.94        | 1.21e3   | 99.15       | 1.40e3   | 96.89       | 1.33e3   | 95.75       | 1.46e3
Path-256  | 81.39       | 1.97e3   | 82.14       | 2.34e3   | 80.86       | 2.10e3   | 82.38       | 2.12e3
CLUTRR-4  | 53.62       | 240.50   | 48.52       | 257.89   | 50.35       | 261.31   | 48.17       | 257.99
Mugen-5K  | (94.1/95.7) | 460.38   | (95.4/95.7) | 464.68   | (95.3/95.4) | 470.05   | (95.4/95.4) | 465.10

Benefits of DTKP-AM

Empirically, DTKP-AM offers an advantage over DAMP of ~2% pts on average across the complex versions of all tasks. It significantly outperforms DAMP by ~50% pts on HWF-15 and ~75% pts on HWF-19, while also yielding up to ~5% pts higher accuracy on Mugen and ~3% pts improvement on PathFinder.

Model Details

For HWF/MNIST, we use the same CNN architecture as Scallop (will be added in Appendix D). For CLUTRR, we use Scallop’s Roberta configuration: a pretrained model (roberta-base) finetuned while training the classification head.

Symbolic Computations on the GPU

Precompiling symbolic computations on the GPU poses a few challenges. First, it restricts symbolic programs to PyTorch tensor operations, a small subset of the Python functions Dolphin currently supports. Second, since we perform one set of CPU computations per batch (~16% of the train time per batch), there is limited scope for improving the training time. The parallelism will only occur across combinations of symbols, not over samples in a batch. Third, enumerating all possible combinations of symbols on the GPU could pose memory consumption issues on complex tasks, as seen with LTN.

One could potentially compile symbolic computations on the GPU by representing symbols as GPU tensors and implementing user-defined functions as tensor operations since Dolphin supports arbitrary Python objects. We tested this strategy on Sum-5, but it did not show any improvements in training time, and was 2 seconds slower to train per epoch. We consider developing efficient strategies for compiling symbolic computations directly onto GPUs an exciting direction for future research.
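For concreteness, a hedged sketch of this strategy for a Sum-style task (illustrative only, not Dolphin's implementation): symbols become index tensors and the user function is precompiled into a scatter-add, keeping both symbol and tag computation on the GPU.

import torch

def apply_sum_on_gpu(p1, p2):
    # p1, p2: (batch, 10) digit probabilities; the outer product plays the role of ⊗
    batch = p1.shape[0]
    joint = (p1.unsqueeze(2) * p2.unsqueeze(1)).reshape(batch, 100)
    # the user function (a + b) precompiled into an index tensor of pair sums
    sums = (torch.arange(10, device=p1.device).view(10, 1)
            + torch.arange(10, device=p1.device).view(1, 10)).reshape(1, 100)
    out = torch.zeros(batch, 19, device=p1.device)
    out.scatter_add_(1, sums.expand(batch, -1), joint)  # ⊕ across symbol pairs
    return out  # (batch, 19): distribution over sums 0..18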

Related Work

Thank you for highlighting these related papers. KLay (ICLR’25) lacks released code, preventing a direct comparison. It compiles symbolic and probabilistic computations into GPU operations by transforming arithmetic circuits from Boolean logic into layered tensor structures. In contrast, Dolphin separates CPU-based symbolic reasoning from GPU-based probabilistic computations. The KLay paper also omits comparisons against LTN or Scallop on Dolphin’s benchmarks. DeepSoftLog improves NTP by using probabilistic semantics instead of fuzzy semantics, enabling non-redundant proofs, well-defined proof scores, and non-sparse gradients. A-NESI uses learned neural models to approximate the exact probabilistic semantics of WMC, boosting scalability. We will include these works in the related work section.

Final Decision

The paper introduces Dolphin, a programmable framework designed to enhance the scalability of neurosymbolic learning by integrating symbolic reasoning with deep learning models. Dolphin addresses the limitations of existing frameworks by enabling neurosymbolic programs to be written in Python and executed efficiently using PyTorch. It achieves this by mapping both forward symbolic reasoning and backward gradient propagation to vectorized computations, allowing for end-to-end differentiable symbolic programs that can leverage GPU acceleration. All reviewers agree that this is interesting and valuable work. I fully agree. Since the paper is about scalability and Scallop is one of the baselines, the authors should also discuss

Arseny Skryagin, Daniel Ochs, Devendra Singh Dhami, Kristian Kersting. Scalable Neural-Probabilistic Answer Set Programming. Journal of Artificial Intelligence Research (JAIR) 78:579–617, 2023

in the camera-ready version.