PaperHub
Overall score: 7.0/10 (Poster; 3 reviewers; min 4, max 5, std dev 0.5)
Individual ratings: 4, 4, 5
Confidence: 2.7
Novelty: 3.0 · Quality: 3.3 · Clarity: 2.7 · Significance: 3.0
NeurIPS 2025

A Pre-training Framework for Relational Data with Information-theoretic Principles

Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

A predictive task-aware pre-training framework for Relational Deep Learning.

Abstract

Keywords
graph learning, tabular learning, relational database

Reviews and Discussion

Review (Rating: 4)

The authors introduce Task Vector Estimation (TVE), a novel pre-training framework that explicitly models task heterogeneity and temporal patterns by aggregating over schema traversal graphs and predicting next-window relational behavior. Information-theoretic analysis shows that, by incorporating side-channel task vectors, TVE retains more predictive, task-specific information than standard self-supervised methods, which rely only on input data and risk losing relevant signals due to task diversity and noise. The experimental results also demonstrate that pre-training with TVE yields better downstream task performance across a variety of tasks and models.

Strengths and Weaknesses

Quality (4/5): The paper demonstrates good research quality with well-designed experiments that directly address the stated research questions. The empirical results are comprehensive and robust, covering both data-limited and data-rich scenarios to show the method’s effectiveness under different conditions. Additionally, the inclusion of theoretical proofs adds rigor and strengthens the work’s technical foundation.

Clarity (2.5/5): The overall structure of the paper is clear and logically organized. The mathematical notations are understandable, but some parts, especially Equation 3, could be simplified to improve readability for a broader audience. Figure 1 could be clearer if it focused more narrowly on the core pipeline and provided additional details specific to the proposed framework. Figure 2 explains how the task vector is constructed but does not clarify whether the traversal produces variable-length vectors; this point would benefit from further elaboration. Also, Figure 3 feels disconnected from the main text; it would help to explain its context and relevance more explicitly.

Significance (3.5/5): Relational databases remain critical data sources, especially for managing sensitive internal data within organizations. While large language model pre-training has shown great success, directly applying it to private relational data may not be the best approach. This work addresses that gap by exploring an alternative strategy tailored to relational databases, which could have practical impact.

Originality (4/5): The paper offers a novel contribution by effectively bridging ideas from graph neural networks for encoding relational structures with model pre-training for downstream tasks. This integration of relational reasoning into pre-training design is both original and promising.

Questions

  1. How do you ensure that the task vector maintains a consistent size across different traversals?
  2. The theoretical section discusses SQL aggregation. In practice, SQL queries may cover many more operators. How does this information loss affect learning effectiveness?
  3. How complex are the expected SQL queries in practice? How many join hops and aggregations does the dataset typically involve?

Limitations

Yes

Final Rating Justification

The authors have addressed my concerns well.

Formatting Issues

None

Author Response

We appreciate the reviewers’ thoughtful feedback and constructive suggestions. We addressed each point below.

[W1] The mathematical notations are understandable, but some parts, especially Equation 3, could be simplified to improve readability for a broader audience. Figure 1 could be clearer if it focused more narrowly on the core pipeline and provided additional details specific to the proposed framework. Figure 2 explains how the task vector is constructed but does not clarify whether the traversal produces variable-length vectors; this point would benefit from further elaboration. Also, Figure 3 feels disconnected from the main text, it would help to explain its context and relevance more explicitly.

We agree that improving the clarity of our notation and figures will make the core ideas of TVE more accessible. Below is how we will revise the manuscript to address each point:

  • We will refactor Eq. 3 by introducing a compact notation for the set-extraction step as follows:

    $$\mathbf{X}_v^{(t+1)} = \left\{\, f_{\mathcal{V}}(u) \mid u \in \mathcal{N}_k^{(t+1)}(v) \,\right\}, \qquad y_v^{(t)} = l\!\left(\mathbf{X}_v^{(t+1)}\right)$$

    so that the reader sees at a glance “apply the labeling function $l$ to the set of next-window $k$-hop neighbors,” where we already assume that these values come from a specific relation $R$. We also simplify $\Delta t$ and denote the next-window variables via the superscript $(t+1)$. We’ll move the full expansion to an appendix remark for readers who want the complete notation.

  • Regarding Figure 1, our intent is to provide a brief overview of different pre-training paradigms. We agree that Figure 1 could be clearer with more details describing the figure, so that readers can understand the general intuition of our work. We will add more descriptions of the inputs and outputs for each pre-training paradigm in Section 3.

  • Regarding Figure 2, we will add a clarification on the variable-length vectors, as we answer below (cf. [Q1]).

  • Figure 3 illustrates the Markov chain that drives the label generation process. Specifically, for any set $\mathbf{Y}$ generated by the parameter $\mathbf{\Theta}$, the downstream task is typically a labeling set function $l(\mathbf{Y})$. Our pre-training framework attempts to capture a sufficient statistic $T(\mathbf{Y})$ of $\mathbf{Y}$ for $\mathbf{\Theta}$, which ultimately leads to task-aware latent representations. We agree that Figure 3 should be elaborated further, and we will add more details regarding Figure 3 in the camera-ready version.

[Q1] How do you ensure that the task vector maintains a consistent size across different traversals?

Our proposed task vector construction operates similarly to Breadth-First Search, and the tasks extracted from a schema path are concatenated into the final task vector (cf. Figure 2). While it is true that different paths lead to task vectors of different lengths, the final objective is an estimation of the final task vector. We can enforce consistent length across different traversal paths by leveraging data compression techniques such as PCA or MCA, as outlined in Section 4.2.
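
For illustration, a minimal sketch of this length normalization follows (our own example with hypothetical function and variable names, not the authors' implementation; it assumes per-path statistics have already been computed as arrays):

```python
import numpy as np
from sklearn.decomposition import PCA

def build_fixed_length_task_vectors(stats_by_path, target_dim=16):
    """Map variable-length per-path statistics to a common dimension and
    concatenate them into one fixed-size task vector per entity.

    stats_by_path: dict mapping each schema-traversal path (str) to an
    (n_entities, path_dim) array of aggregated statistics; path_dim varies
    across paths, which is what makes the raw task vector variable-length.
    """
    blocks = []
    for path, stats in stats_by_path.items():
        if stats.shape[1] > target_dim:
            # Compress long paths with PCA (assumes n_entities >= target_dim).
            block = PCA(n_components=target_dim).fit_transform(stats)
        else:
            # Pad short paths with zeros so every path contributes target_dim dims.
            pad = np.zeros((stats.shape[0], target_dim - stats.shape[1]))
            block = np.hstack([stats, pad])
        blocks.append(block)
    # Final task vector: (n_entities, n_paths * target_dim), identical for all entities.
    return np.hstack(blocks)
```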

[Q2] The theoretical section discusses SQL aggregation. In practice, SQL queries may cover many more operators. How does this information loss affect learning effectiveness?

Thank you for raising this important point. We’d like to clarify complementary arguments showing why—and under what conditions—simple statistics can suffice, and how we mitigate the information loss when compared with sufficient statistics in practice.

Statistical sufficiency in common distributions.

In many distributions, a small collection of aggregators is provably sufficient. For example:

  • The sum of samples is sufficient for exponential, Poisson, and Bernoulli distributions.
  • The max is sufficient for a uniform distribution over a bounded interval.
  • The mean is sufficient for a normal distribution with known variance.

Whenever the underlying dynamics match one of these canonical families, no information is lost by reducing the full set of observations to the corresponding statistic.
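
As a concrete instance of the first bullet (a standard textbook derivation, not taken from the paper), the Fisher–Neyman factorization shows why the SUM aggregator is lossless for a Poisson rate $\lambda$:

$$p(x_1,\dots,x_n \mid \lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \underbrace{e^{-n\lambda}\,\lambda^{\sum_i x_i}}_{g\left(\sum_i x_i,\;\lambda\right)} \cdot \underbrace{\frac{1}{\prod_i x_i!}}_{h(x_1,\dots,x_n)},$$

so the likelihood depends on the observations only through $\sum_i x_i$; replacing the full multiset with its sum therefore discards no information about $\lambda$.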

Enriching the task vector to minimize loss.

Real-world label-derivation queries often deviate from textbook distributions. To guard against this, we augment our task vectors with a broad set of simple statistics, including MEAN, MIN, MAX, SUM, COUNT, and STDDEV for numerical values, and MODE, COUNT, and COUNT DISTINCT for categorical values. This “overcomplete” summary ensures that, even if some statistics are redundant, we capture a broader range of moment and order information, thereby shrinking the worst-case information gap.
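
For illustration, this overcomplete summary amounts to one grouped aggregation per entity over the next-window rows. The sketch below uses pandas with hypothetical table and column names (it is not the paper's code):

```python
import pandas as pd

# Hypothetical next-window fact table: one row per transaction.
tx = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "price":    [5.0, 12.0, 3.0, 8.0, 8.0],
    "category": ["a", "b", "a", "c", "c"],
})

# Overcomplete numerical summary per entity.
numeric_stats = tx.groupby("user_id")["price"].agg(
    ["mean", "min", "max", "sum", "count", "std"])

# Overcomplete categorical summary per entity.
categorical_stats = tx.groupby("user_id")["category"].agg(
    mode=lambda s: s.mode().iloc[0],
    count="count",
    count_distinct="nunique",
)

# Concatenate into the per-entity task-vector block for this path.
task_vector_block = numeric_stats.join(categorical_stats, lsuffix="_num")
```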

Set‐theoretic recovery and downstream expressivity.

From a set‐theoretic standpoint, certain combinations of simple statistics can perfectly reconstruct small multisets. For instance, if exactly three purchases occur in the prediction window, then the triplet { MIN, MAX, SUM } uniquely determines the multiset. More generally, accurate pre‐training on these statistics helps the model learn an implicit approximation of the original value‐set, which in turn improves its ability to approximate any downstream labeling function—since most such functions operate on those same sets (cf. Eq. (3)).
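
A quick worked instance (numbers are illustrative): if exactly three purchase amounts fall in the window and the statistics report $\min = 2$, $\max = 9$, and $\mathrm{SUM} = 15$, the remaining amount must be $15 - 2 - 9 = 4$, so the multiset $\{2, 4, 9\}$ is recovered exactly from those three aggregates.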

Empirical validation in tabular domains.

Prior work on Deep Feature Synthesis (DFS) [1] demonstrates that stacking joins with simple aggregations often matches more expressive graph-based architectures on many benchmarks [2, 3]. While DFS features incur theoretical information loss relative to a full relational graph encoding, their competitive accuracy shows that richly engineered statistics can suffice in practice. Our investigation is orthogonal: we systematically characterize all node-level SQL-generated labels as set functions and show how a task-vector approach can act as a surrogate representation for next-window sets. Additionally, our task vector includes aggregators similar to those of DFS.

Together, these points explain why simple statistics are not only theoretically grounded but also practically powerful—and how we effectively minimize any information loss.

[Q3] How complex are the expected SQL queries in practice? How many join hops and aggregations does the dataset typically involve?

Before deriving the computational complexity, assume the following:

  • Number of rows per table is $O(N)$.
  • Number of timestamps is $T$.
  • Branching factor of the relational entity graph (the input graph for learning) is $b$.
  • Number of aggregation functions is $r$.
  • Branching factor of the schema graph (the blueprint of the relational database) is $p$.
  • Number of hops for generating the task vector is $k$.

We would like to compute the computational cost of each step for constructing the task vector. First, we compute the cost for a single path from the schema graph perspective (Steps 1–3), and then sum up to get the final cost (Step 4). Remember that for every schema graph path, there are many input-graph paths that fall under it, since the schema graph is just a blueprint of the input graph.

  1. Joining with the time table, which is a necessary step regardless of pre-training paradigm, costs $O(NT)$.

  2. Next, every join causes each row to fan out by a factor of $O(b)$, so the cost of joining up to $u$ successive times is:

    $$O(NTb + NTb^2 + \dots + NTb^u) = O\!\left(NT\sum_{i=1}^{u} b^i\right).$$

    For $b > 1$, the geometric sum scales on the order of $b^u$:

    $$\sum_{i=1}^{u} b^i = \frac{b^{u+1} - b}{b - 1} \approx \frac{b^{u+1}}{b - 1} = \frac{b}{b - 1}\,b^{u} = \Theta(b^u).$$

    Therefore, the total cost of joining up to length $u$ is $O(NTb^u)$.

  3. One aggregation of $r$ functions over the joined rows costs $O(rNTb^u)$, so every path (from the schema graph perspective) costs $O(NTb^u + rNTb^u) = O(rNTb^u)$.

  4. There are $O(p^u)$ paths of length $u$ from the schema graph perspective; summing the cost:

$$\sum_{u=1}^{k} \left[p^u \times O(rNTb^u)\right] = O\!\left(rNT\sum_{u=1}^{k} (pb)^u\right) = O\!\left(rNTp^k b^k\right)$$

Therefore, the cost blows up exponentially in $k$ through both $p^k$ and $b^k$, with linear dependence on $r$, $N$, and $T$. In practice, the branching factor of the schema graph $p$ is typically small, and the schema graph itself is small, leading to a small $k$. Furthermore, a path is only meaningful if it contains at least one fact table (a table containing timestamps), so that we can construct the predictive pre-training objective. Finally, this operation is performed only once and is not involved in pre-training optimization.
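
To make the scaling concrete (the branching factors here are illustrative assumptions, not measurements from the paper): taking $k = 2$ hops and $r = 9$ aggregation functions (the numerical and categorical aggregators listed below), a schema branching factor of $p = 2$, and an entity-graph fan-out of $b = 3$, the bound evaluates to

$$O(rNTp^k b^k) = O(9 \cdot 2^2 \cdot 3^2 \cdot NT) = O(324\,NT),$$

i.e., a constant factor on top of the unavoidable $O(NT)$ time-table join for such small $p$, $b$, and $k$.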

To demonstrate our point, we would like to report the time and peak memory usage for constructing the task vector across datasets. As the table below shows, this operation is feasible even with large-scale databases, where rel-amazon and rel-hm contain approximately 24M and 33M rows, respectively. For our experiments, the maximum hop is 2, while the possible aggregation functions include MEAN, MIN, MAX, SUM, COUNT, and STDDEV for numerical values, and MODE, COUNT, and COUNT DISTINCT for categorical values. In industrial scenarios, this cost is acceptable.

| Model | rel-amazon/item | rel-amazon/user | rel-hm/item | rel-hm/user |
| --- | --- | --- | --- | --- |
| TVE-1 | 64.59 seconds / 39.98 GB | 90.17 seconds / 40.07 GB | 19.38 seconds / 4.62 GB | 11.74 seconds / 19.85 GB |
| TVE-2 | 382.75 seconds / 40.03 GB | 652.85 seconds / 52.67 GB | 98.03 seconds / 16.15 GB | 871.26 seconds / 148.19 GB |

[1] Deep feature synthesis: Towards automating data science endeavors.

[2] 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs.

[3] Griffin: Towards a Graph-Centric Relational Database Foundation Model.

Comment

The authors have addressed my concerns well. I have raised my score accordingly.

Comment

Thank you for your thoughtful re-evaluation and for raising your score. We’re delighted to hear that our clarifications addressed your concerns. We appreciate your time and consideration in reviewing our work.

Comment

Dear Reviewer r84r,

We hope you’ve had a chance to read our rebuttal. We haven’t yet seen any follow-up questions or feedback from you in the discussion phase, so we wanted to check whether there are any points that remain unclear. We’d be grateful for any questions or thoughts you have—clarifying those will help us improve the paper.

Thank you again for your time,

The authors.

Review (Rating: 4)

This paper introduces Task Vector Estimation (TVE), a pre-training framework for relational databases that constructs predictive signals by summarizing next-window temporal dynamics through schema graph traversals and SQL aggregations. Unlike traditional self-supervised methods that can capture spurious patterns, TVE explicitly models the underlying data generation process by creating task vectors that represent statistical summaries of future values relevant to downstream prediction tasks. The authors provide theoretical justification showing that representations incorporating both input features and temporal side-channel information retain more task-relevant signals than input-only approaches. Experimental results on RelBench demonstrate that TVE consistently outperforms traditional baselines, particularly in data-limited scenarios, while showing greater stability across hyperparameters and producing more transferable representations.

Strengths and Weaknesses

Strengths

  • The paper provides formal information-theoretic justification showing that incorporating side-channel information (task vectors) preserves more task-relevant signals than input-only methods
  • Experiments on the RelBench benchmark show that TVE consistently outperforms traditional pre-training baselines.

Weaknesses

  • The authors don't talk about the computational complexity. Please discuss the complexity of task vector estimation
  • The authors should show results on some other benchmark other than Relbench. Please see the 4DBInfer benchmark.
  • The authors must include more recent baselines. One baseline is for 2023, and the other is for 2020. Please include recent models.
  • The authors should talk about the memory requirements while calculating task vector estimation.
  • Authors should show the results on cross-domain transfer. As SSL has shown its strength in cross-domain transfer.

Questions

  • The authors don't talk about the computational complexity. Please discuss the complexity of task vector estimation
  • The authors should show results on some other benchmark other than Relbench. Please see the 4DBInfer benchmark.
  • The authors must include more recent baselines. One baseline is for 2023, and the other is for 2020. Please include recent models.
  • The authors should talk about the memory requirements while calculating task vector estimation.
  • Authors should show the results on cross-domain transfer. As SSL has shown its strength in cross-domain transfer.
  • Expressions like "infinitely many possible downstream tasks" are imprecise. Please improve
  • The paper is very difficult to read. The paper introduces mathematical notation without sufficient explanation.

Limitations

yes

Formatting Issues

No

Author Response

We are grateful to the reviewer for their feedback, and we address each point below.

[Q1] The authors don't talk about the computational complexity...memory requirements…

Table 7 (Appendix B.6) shows each method’s pre-training time in seconds, demonstrating that TVE is relatively fast compared to others during pre-training. Next, we address computational complexity and memory requirements.

Before deriving the computational complexity, assume the following:

  • Number of rows per table is $O(N)$.
  • Number of timestamps is $T$.
  • Branching factor of the relational entity graph (the input graph for learning) is $b$.
  • Number of aggregation functions is $r$.
  • Branching factor of the schema graph (the blueprint of the relational database) is $p$.
  • Number of hops for generating the task vector is $k$.

We derive the per-path cost (Steps 1–3) and then sum over all paths (Step 4). Remember that for every schema graph path, there are many paths that fall under the same schema graph path since a schema graph is just a blueprint of the input graph.

  1. Joining with the time table, which is a necessary step regardless of pre-training paradigm, costs $O(NT)$.

  2. Next, every join causes each row to fan out by a factor of $O(b)$, so the cost of joining up to $u$ successive times is:

    $$O(NTb + NTb^2 + \dots + NTb^u) = O\!\left(NT\sum_{i=1}^{u} b^i\right) = O(NTb^u)$$

  3. Aggregating $r$ functions adds $O(rNTb^u)$ per path.

  4. There are $O(p^u)$ paths of length $u$ from the schema graph perspective; summing the cost:

$$\sum_{u=1}^{k} \left[p^u \times O(rNTb^u)\right] = O\!\left(rNT\sum_{u=1}^{k} (pb)^u\right) = O\!\left(rNTp^k b^k\right)$$

Therefore, the cost blows up exponentially in $k$ through both $p^k$ and $b^k$, with linear dependence on $r$, $N$, and $T$. Since $p$ and $k$ are small in real schemas, the exponential blow-up is limited. Furthermore, a path is only meaningful if it contains at least one fact table (the table containing timestamps), so that we can construct the predictive pre-training objective. Finally, this operation is performed only once and is not involved in pre-training optimization.

To demonstrate our point, we report the time and peak memory usage for constructing task vectors. As the table below shows, this operation is feasible even with large-scale databases, where rel-amazon and rel-hm contain approximately 24M and 33M rows, respectively.

| Model | rel-amazon/item | rel-amazon/user | rel-hm/item | rel-hm/user |
| --- | --- | --- | --- | --- |
| TVE-1 | 64.59 seconds / 39.98 GB | 90.17 seconds / 40.07 GB | 19.38 seconds / 4.62 GB | 11.74 seconds / 19.85 GB |
| TVE-2 | 382.75 seconds / 40.03 GB | 652.85 seconds / 52.67 GB | 98.03 seconds / 16.15 GB | 871.26 seconds / 148.19 GB |

[Q2] The authors should show results on some other benchmark…

First, we would like to clarify the core difference between RelBench [1] and 4DBInfer [2]. RelBench focuses on inferring an SQL-generated label attached to rows in a dimension table (a static table that does not contain timestamps), while 4DBInfer focuses on cell-level prediction.

We would like to show that our proposed framework is performant regardless of the evaluation method. We pre-trained TVE on seznam and evaluated downstream performance on an entity attribute classification task. We report four metrics—Accuracy, Macro F1, Micro F1, and Mean Reciprocal Rank (MRR)—on both validation and test splits. We include two separate tables, one for linear probing and one for full fine-tuning.

seznam/charge (linear probing)

| Model | V/Accuracy | V/Macro F1 | V/Micro F1 | V/MRR | T/Accuracy | T/Macro F1 | T/Micro F1 | T/MRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 60.27 ± 1.22 | 28.89 ± 2.09 | 60.27 ± 1.22 | 76.96 ± 0.70 | 59.36 ± 1.00 | 27.99 ± 1.86 | 59.36 ± 1.00 | 76.48 ± 0.50 |
| MAE-0.25 | 71.03 ± 0.04 | 44.50 ± 1.24 | 71.03 ± 0.04 | 84.00 ± 0.03 | 70.52 ± 0.05 | 43.32 ± 1.74 | 70.52 ± 0.05 | 83.76 ± 0.04 |
| CTR-0.25 | 42.88 ± 0.09 | 12.74 ± 0.81 | 42.88 ± 0.09 | 64.56 ± 0.10 | 41.11 ± 0.32 | 10.01 ± 0.48 | 41.11 ± 0.32 | 62.79 ± 0.21 |
| TVE-1 | 69.98 ± 0.21 | 44.20 ± 1.00 | 69.98 ± 0.21 | 83.44 ± 0.09 | 69.90 ± 0.16 | 43.39 ± 0.64 | 69.90 ± 0.16 | 83.55 ± 0.08 |

seznam/charge (full fine-tuning)

| Model | V/Accuracy | V/Macro F1 | V/Micro F1 | V/MRR | T/Accuracy | T/Macro F1 | T/Micro F1 | T/MRR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 81.13 ± 0.21 | 62.02 ± 0.76 | 81.13 ± 0.21 | 89.96 ± 0.12 | 80.97 ± 0.50 | 63.43 ± 0.83 | 80.97 ± 0.50 | 89.91 ± 0.28 |
| MAE-0.25 | 81.18 ± 0.15 | 62.00 ± 2.37 | 81.18 ± 0.15 | 89.99 ± 0.09 | 81.08 ± 0.32 | 63.31 ± 1.00 | 81.08 ± 0.32 | 89.97 ± 0.18 |
| CTR-0.25 | 81.21 ± 0.17 | 63.28 ± 1.05 | 81.21 ± 0.17 | 90.01 ± 0.10 | 81.31 ± 0.41 | 63.69 ± 0.92 | 81.31 ± 0.41 | 90.10 ± 0.22 |
| TVE-1 | 81.39 ± 0.22 | 63.92 ± 1.17 | 81.39 ± 0.22 | 90.11 ± 0.12 | 81.26 ± 0.29 | 63.82 ± 0.75 | 81.26 ± 0.29 | 90.07 ± 0.16 |

We can observe that the performance of MAE and CTR is not consistent across datasets, while TVE consistently performs well and exhibits competitive performance (matching MAE for linear probing and matching CTR for full fine-tuning). This is because both MAE and CTR have limitations: for example, CTR doesn’t offer a separable representation after pre-training (low performance during linear probing), while MAE isn’t as performant as CTR during full fine-tuning. Moreover, these gains come despite a conceptual mismatch—TVE was optimized for next-state prediction on dimension tables, whereas the downstream tasks operate on fact-table rows. Regardless of the evaluation setting, TVE consistently performs well across the two benchmarks, RelBench and 4DBInfer.

[Q3] The authors must include more recent baselines…

Relational‐database–specific pre‐training remains an emerging area, and to our knowledge no paradigms beyond generic adaptations have been proposed. The only relevant study is [3]. Because of the substantial engineering required to adapt SSL methods for relational data contexts, we restrict our comparison to two representative branches of SSL, including generative and contrastive pre-training (cf. Appendix B.2 for detailed changes for adaptation). Through our study, we demonstrate that these SSL methods aren’t sufficient because they do not explicitly model next-window dynamics. Meanwhile, Task-Vector Estimation (TVE) explicitly models the schema-driven factors that govern downstream label generation. As the field matures, we look forward to new pre-training methods and will include them in our future evaluations.

[Q4] Authors should show the results on cross-domain transfer...

Unfortunately, cross-domain transfer is beyond the scope of our study; we focus on theoretical understanding of underlying factors driving downstream label generation process, and how to derive a task-aware pre-training objective. We adopted the baseline proposed in [1] for our study, which leads to different encoders for different relational databases.

Meanwhile, we believe that cross-domain transfer in the same database poses another interesting question. To better understand the separability of the pre-trained latent representation, we further conduct a linear probing experiment. TVE-1 (item) refers to the pre-trained model on the item's task vector, while TVE-1 (user) refers to the pre-trained model on the user's task vector.

Linear probing experiment:

| Model | item-churn-least100-spending | user-churn-top100-spending |
| --- | --- | --- |
| Baseline | 63.01 ± 4.98 | 78.36 ± 1.72 |
| MAE-0.25 | 52.92 ± 0.79 | 83.86 ± 1.04 |
| CTR-0.25 | 56.96 ± 2.07 | 81.63 ± 0.46 |
| TVE-1 (item) | 78.06 ± 0.65 | 83.28 ± 0.63 |
| TVE-1 (user) | 65.88 ± 2.25 | 86.56 ± 0.36 |

Surprisingly, even when the task vector isn’t matched with downstream tasks’ entity type, it can still yield better performance than other generic SSL methods. For example, TVE-1 (user) is only worse than TVE-1 (item) on item-churn-least100-spending, and TVE-1 (item) is as strong as MAE-0.25 on user-churn-top100-spending.

Consequently, this experiment highlights the necessity to create an entity type-specific pre-training objective to maximize downstream performance gains.

[Q5] Expressions like "infinitely many possible downstream tasks" are imprecise…

We agree that “infinitely many possible downstream tasks” is imprecise. We will rephrase it as “diverse downstream tasks.”

[Q6] The paper is very difficult to read...

Below is how we will revise the manuscript to further enhance clarity:

  • We will refactor Eq. 3 by introducing a compact notation for the set-extraction step as follows:

    $$\mathbf{X}_v^{(t+1)} = \left\{\, f_{\mathcal{V}}(u) \mid u \in \mathcal{N}_k^{(t+1)}(v) \,\right\}, \qquad y_v^{(t)} = l\!\left(\mathbf{X}_v^{(t+1)}\right)$$

    where we already assume that these values come from a specific relation $R$. We’ll move the full expansion to an appendix remark for readers who want the complete notation.

  • We agree that Figure 3 should be elaborated more, and we will add more details regarding Figure 3 in the camera-ready version.

We welcome any further clarity suggestions and thank the reviewers for their time.

[1] RelBench: A Benchmark for Deep Learning on Relational Databases.

[2] 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs.

[3] Flaky Performances when Pretraining on Relational Databases.

Comment

Dear Reviewer yYsy,

Thank you very much for acknowledging our rebuttal. We noticed that you didn’t have any follow-up questions or comments in the discussion phase. We would appreciate knowing whether our clarifications have sufficiently addressed your concerns. If anything remains unclear or you’d like further details, we’d be happy to elaborate.

Best regards,

The authors.

Comment

Dear Reviewer yYsy,

With the discussion period ending soon, I wanted to reach out and see if you have any outstanding questions or concerns about our paper and rebuttal. We’re happy to offer any clarification or information to ensure all your points are fully addressed.

Thank you again for your thoughtful review and time!

Review (Rating: 5)

The paper introduced Task Vector Estimation (TVE), a novel pre-training framework for relational databases (RDBs) that explicitly models task heterogeneity and temporal dynamics. Different from the traditional self-supervised learning methods that focus on input features, TVE incorporates knowledge about the underlying distribution driving label generation by constructing predictive supervisory signals through set-based aggregation over schema traversal graphs. The paper formalized their approach through information theory, demonstrating that task-informed representations retain more relevant signals for downstream tasks. Extensive experiments on the RelBench benchmark demonstrated that TVE consistently outperforms traditional pre-training baselines, especially in low-data regimes.

Strengths and Weaknesses

Strengths:

S1. The paper identified a key limitation in current pre-training strategies for relational data which overlook task heterogeneity and temporal dynamics. The paper argues that different tasks defined on RDBs depend on schema graphs, temporal dependencies, and SQL-defined label logics.

S2. The information-theoretic analysis provides mathematical justification for why task-aware representations should outperform traditional self-supervised methods. Theorem 4.1 proves that representations incorporating side-channel information share at least as much information with downstream tasks as those without.

S3. The paper included detailed ablation studies, sensitivity analyses, and linear probing experiments that provided insights into the stability and transferability of the learned representations.

Weaknesses

W1. While the paper claimed that TVE addresses task heterogeneity, the experiments mainly focus on performance on individual tasks rather than demonstrating transfer across multiple related tasks within the same database.

W2. It would be nice if the paper could provide more discussion about how to design better approximations of sufficient statistics.

W3. The paper does not analyze how well the chosen SQL aggregations approximate the ideal sufficient statistics of the underlying distribution.

Questions

  1. Please respond to W2.
  2. Please respond to W3.
  3. How sensitive is TVE to the quality of the schema graph representation? Would errors in schema understanding significantly impact performance?

Limitations

No

Final Rating Justification

The rebuttal has strengthened my confidence in both the theoretical foundations and practical robustness of the proposed approach.

Formatting Issues

No

Author Response

We would like to express our appreciation to the reviewer for the thorough evaluation and helpful recommendations. We address the key concerns below.

[W1] While the paper claimed that TVE addresses task heterogeneity, the experiments mainly focus on performance on individual tasks rather than demonstrating transfer across multiple related tasks within the same database.

We would like to clarify that the goal of our work is to create a pre-training objective that can transfer across multiple downstream tasks, much like traditional pre-training in other fields such as language and vision. In contrast to those fields, Relational Deep Learning offers a flexible framework in which one can create novel tasks based on temporality, structure, and SQL logic. Our goal is to systematically show that SQL-generated node-level labels depend on sets of next-window values, and to retain information about these sets during pre-training. In other words, we establish a universal mechanism for capturing the supervisory signal of all downstream tasks in a single framework. As such, explicitly evaluating “bidirectional transfer” between two specific tasks lies beyond our core contribution, which is to prove and realize a unified pre-training paradigm that inherently supports transfer across the entire, potentially unbounded family of relational tasks.

[W2] It will be nice if the paper can provide more discussion about how to design better approximations of sufficient statistics.

Thank you for this perceptive suggestion. Designing richer approximations of sufficient statistics is indeed a vital and challenging direction, and we would like to engage in the discussion regarding possibilities of better representation for next-window sets.

Beyond fixed aggregators like task vectors, we can employ a small neural “set encoder” (e.g., DeepSets or transformer-based pooling) that learns to compress each next-window multiset into a vector. This approach can, in principle, capture all information in the set without hand-crafting statistics. However, training two networks—one to encode the set and one to predict the label—introduces instability. For instance, a GAN-style formulation (where the encoder acts as a generator and the predictor as a discriminator) risks mode collapse, mapping diverse inputs to the same latent code. Therefore, designing a sophisticated loss function to avoid mode collapse is a challenge for this strategy.
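
To make this alternative concrete, a minimal DeepSets-style set encoder might look as follows (an illustrative sketch in PyTorch, assuming padded next-window sets with a validity mask; it is not part of the paper):

```python
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """DeepSets-style encoder: phi embeds each element, a permutation-invariant
    sum pools the set, and rho maps the pooled vector to a fixed-size summary
    of the next-window multiset."""

    def __init__(self, in_dim, hidden=64, out_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x, mask):
        # x: (batch, max_set_size, in_dim); mask: (batch, max_set_size),
        # 1 for real elements, 0 for padding.
        pooled = (self.phi(x) * mask.unsqueeze(-1)).sum(dim=1)
        return self.rho(pooled)
```

Jointly training such an encoder with a label predictor is exactly where the stability issues discussed above (e.g., collapse to a single latent code) would have to be managed.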

Our current task-vector objective optimizes only one encoder for the input graph, which makes training stable and efficient. Despite being simpler, these vectors capture key moments of the distribution and are not susceptible to GAN-style collapse. They provide a reliable surrogate that practitioners can use immediately, while leaving more complex, learnable summaries to future work.

[W3] The paper does not analyze how well the chosen SQL aggregations approximate the ideal sufficient statistics of the underlying distribution.

Thank you very much for your thoughtful question. First, we would like to highlight that computing minimal sufficient representations has long been a grand challenge in both the Machine Learning and Information Theory communities. Because direct evaluation of mutual information is often intractable, Alemi et al. [1] instead optimize a tractable lower bound, while Belghazi et al. [2] turn to neural estimators to approximate mutual information directly.

Next, we would like to clarify complementary arguments showing why—and under what conditions—simple statistics can suffice, and how we mitigate the information loss when compared with sufficient statistics in practice.

Statistical sufficiency in common distributions.

In many distributions, a small collection of aggregators is provably sufficient. For example:

  • The sum of samples is sufficient for exponential, Poisson, and Bernoulli distributions.
  • The max is sufficient for a uniform distribution over a bounded interval.
  • The mean is sufficient for a normal distribution with known variance.

Whenever the underlying dynamics match one of these canonical families, no information is lost by reducing the full set of observations to the corresponding statistic.

Enriching the task vector to minimize loss.

Real-world label-derivation queries often deviate from textbook distributions. To guard against this, we augment our task vectors with a broad set of simple statistics, including MEAN, MIN, MAX, SUM, COUNT, and STDDEV for numerical values, and MODE, COUNT, and COUNT DISTINCT for categorical values. This “overcomplete” summary ensures that, even if some statistics are redundant, we capture a broader range of moment and order information, thereby shrinking the worst-case information gap.

Set‐theoretic recovery and downstream expressivity.

From a set‐theoretic standpoint, certain combinations of simple statistics can perfectly reconstruct small multisets. For instance, if exactly three purchases occur in the prediction window, then the triplet { MIN, MAX, SUM } uniquely determines the multiset. More generally, accurate pre‐training on these statistics helps the model learn an implicit approximation of the original value‐set, which in turn improves its ability to approximate any downstream labeling function—since most such functions operate on those same sets (cf. Eq. (3)).

Empirical validation in tabular domains.

Prior work on Deep Feature Synthesis (DFS) [3] demonstrates that stacking joins with simple aggregations often matches more expressive graph-based architectures on many benchmarks [4, 5]. While DFS features incur theoretical information loss relative to a full relational graph encoding, their competitive accuracy shows that richly engineered statistics can suffice in practice. Our investigation is orthogonal: we systematically characterize all node-level SQL-generated labels as set functions and show how a task-vector approach can act as a surrogate representation for next-window sets. Additionally, our task vector includes aggregators similar to those of DFS.

Together, these points explain why simple statistics are not only theoretically grounded but also practically powerful—and how we effectively minimize any information loss.

[Q3] How sensitive is TVE to the quality of the schema graph representation? Would errors in schema understanding significantly impact performance?

Thank you for this excellent question. To quantify TVE’s sensitivity to schema‐graph quality, we ran controlled “noise” experiments on the RelBench rel-amazon database:

  1. We randomly dropped 20% and 40% of the rows in the review table to corrupt the relational entity graph (a sketch of this corruption step is shown after this list). The corrupted graph is treated as the clean signal, and downstream tasks are defined on this newly created graph.
  2. We treat the unaltered database as the noisy database; thus, the downstream tasks defined in step 1 no longer fully depend on the original schema graph. Therefore, a pre-training paradigm that overfits to the noisy relational entity graph won’t generalize well to the tasks in step 1. This simulates the case where tasks are not based on the schema graph due to errors in schema understanding.
  3. We compared three pre-training paradigms—MAE, CTR, and our TVE—on two tasks, user-churn-top100-spending and user-churn. We then (a) pre-trained our models on the noisy graph (including the task vector constructed from this noisy graph), and (b) fine-tuned (and evaluated) on the same noisy graph using labels generated from the clean database.
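
A minimal sketch of the corruption step referenced in step 1 (the helper name is hypothetical and pandas-based; it is not the exact experiment script):

```python
import pandas as pd

def corrupt_fact_table(reviews: pd.DataFrame, drop_frac: float, seed: int = 0) -> pd.DataFrame:
    """Randomly drop a fraction of rows from the review (fact) table.

    The corrupted table defines the "clean" relational entity graph on which
    downstream labels are generated, while the unaltered table plays the role
    of the noisy database seen during pre-training.
    """
    return reviews.sample(frac=1.0 - drop_frac, random_state=seed)

# The two noise levels used in the experiments:
# reviews_drop20 = corrupt_fact_table(reviews, drop_frac=0.20)
# reviews_drop40 = corrupt_fact_table(reviews, drop_frac=0.40)
```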

rel-amazon/user-churn-top100-spending (linear probing)

| Noise level | Baseline | MAE-0.25 | CTR-0.25 | TVE-1 |
| --- | --- | --- | --- | --- |
| Drop 20% | 75.42 ± 3.10 | 82.26 ± 0.27 | 78.98 ± 1.17 | 86.79 ± 0.26 |
| Drop 40% | 73.95 ± 1.76 | 78.16 ± 0.46 | 76.38 ± 1.80 | 81.44 ± 0.17 |

rel-amazon/user-churn-top100-spending (full fine-tuning)

| Noise level | Baseline | MAE-0.25 | CTR-0.25 | TVE-1 |
| --- | --- | --- | --- | --- |
| Drop 20% | 80.72 ± 1.14 | 82.83 ± 0.41 | 81.46 ± 2.31 | 85.04 ± 0.92 |
| Drop 40% | 76.39 ± 1.16 | 76.97 ± 1.43 | 75.72 ± 1.26 | 78.29 ± 1.79 |

rel-amazon/user-churn (linear probing)

| Noise level | Baseline | MAE-0.25 | CTR-0.25 | TVE-1 |
| --- | --- | --- | --- | --- |
| Drop 20% | 60.92 ± 0.61 | 65.62 ± 0.04 | 64.59 ± 0.01 | 70.17 ± 0.03 |
| Drop 40% | 61.95 ± 0.67 | 66.96 ± 0.03 | 66.14 ± 0.04 | 71.31 ± 0.01 |

rel-amazon/user-churn (full fine-tuning)

| Noise level | Baseline | MAE-0.25 | CTR-0.25 | TVE-1 |
| --- | --- | --- | --- | --- |
| Drop 20% | 70.09 ± 0.15 | 69.74 ± 0.21 | 69.98 ± 0.13 | 70.26 ± 0.10 |
| Drop 40% | 71.35 ± 0.06 | 70.61 ± 0.24 | 71.20 ± 0.17 | 71.37 ± 0.20 |

Even when the downstream task loosely depends on schema graph understanding, and when TVE is pre-trained on a noisy relational database, TVE’s task‐aware pre-training consistently outperforms standard SSL objectives. This demonstrates that our framework is robust to errors in schema understanding and retains its effectiveness across a range of noise levels.

[1] Deep Variational Information Bottleneck.

[2] Mutual Information Neural Estimation.

[3] Deep feature synthesis: Towards automating data science endeavors.

[4] 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs.

[5] Griffin: Towards a Graph-Centric Relational Database Foundation Model.

Comment

Thank you for the detailed rebuttal. I appreciate the comprehensive responses that effectively addressed my main concerns:

Regarding sufficient statistics approximation (W2/W3), you provided a thorough theoretical justification for your choice of statistics, including both mathematical foundations and practical considerations for stability. The discussion of statistical sufficiency in common distributions and your approach to enriching task vectors was particularly illuminating.

Your experimental results on schema sensitivity demonstrated robust performance under different levels of noise/corruption, convincingly addressing my concerns about TVE's dependence on schema understanding quality.

You also clarified your position on task heterogeneity, helping me better understand your goal of creating a universal pre-training framework rather than focusing on specific task-to-task transfer.

Based on these responses, I am raising my scores for:

  • Quality: from 3 to 4
  • Overall Rating: from 4 to 5

The rebuttal has strengthened my confidence in both the theoretical foundations and practical robustness of your approach.

Comment

Dear Reviewer 5cD2,

Thank you for your thoughtful evaluation and for taking the time to raise both your quality and overall ratings. We're glad that our rebuttal resolved your concerns, and we deeply appreciate your support for our paper.

Comment

Dear Reviewer 5cD2,

Thank you very much for acknowledging our rebuttal. We noticed that you didn’t have any follow-up questions or comments in the discussion phase. We would appreciate knowing whether our clarifications have sufficiently addressed your concerns. If anything remains unclear or you’d like further details, we’d be happy to elaborate.

Best regards,

The authors.

Comment

Dear Reviewer 5cD2,

As the discussion period wraps up, I wanted to check in and see if you have any remaining questions or concerns about our paper and rebuttal. We’d be happy to clarify or expand on any points to ensure everything is addressed.

Thank you again for your time and feedback!

Final Decision

This work introduces Task Vector Estimation (TVE), a novel pre-training framework for relational databases (RDBs) that explicitly models task heterogeneity and temporal dynamics. The approach is formalized via information theory, and the authors prove that task-informed representations retain at least as much relevant signal for downstream tasks. Experimental evaluation on the RelBench benchmark demonstrates consistent performance improvements over traditional pre-training baselines.

Strengths:

  1. The paper identifies a key limitation of current pre-training strategies for relational data which overlook task heterogeneity and temporal dynamics.
  2. The paper provides formal information-theoretic justification showing that representations incorporating side-channel information share at least as much information with downstream tasks as those without

Weaknesses:

  1. Clarity could be improved in some parts (see r84r).
  2. Consider including additional baselines (authors explain their decisions well in the rebuttal).

Overall: The reviewers were positive about this work both before and after the rebuttal, with only minor suggestions for improvement.