PaperHub
Overall score: 7.3/10 · Poster · 4 reviewers
Ratings: 4, 4, 5, 5 (min 4, max 5, std 0.5)
Confidence: 2.8 · Novelty: 2.5 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

Uncertainty-Informed Meta Pseudo Labeling for Surrogate Modeling with Limited Labeled Data

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29
TL;DR

We propose a model-agnostic UMPL framework that leverages teacher-student uncertainty to refine pseudo labels and improve generalization with unlabeled data.

Abstract

Keywords
Surrogate modeling · Semi-supervised regression · Meta pseudo label

Reviews and Discussion

Review (Rating: 4)

This paper proposes Uncertainty-Informed Meta Pseudo Labeling, a semi-supervised learning framework tailored for surrogate modeling of physical systems governed by PDEs under limited labeled data. The framework builds upon a teacher-student architecture, where the teacher estimates epistemic uncertainty via EMC Dropout to generate pseudo-labels, and the student learns from these pseudo-labels while modeling aleatoric uncertainty via heteroscedastic regression. A key contribution is the use of uncertainty-informed feedback: the student’s uncertainty guides refinement of the teacher’s pseudo-labels, enabling a closed-loop meta-learning process. The method improves generalization under distribution shifts across multiple tasks.

Strengths and Weaknesses

Strengths:

  1. Integration of dual uncertainty modeling in a meta-learning loop.
  2. Efficient and theoretically grounded uncertainty estimation.
  3. Strong empirical results on PDEs and real-world datasets.

Weaknesses:

  1. The key issue here is the contribution. Although this work proposes multiple ways to improve uncertainty estimation in terms of reliability and efficiency, the core of the work is the SSL, which merely uses uncertainty as a tool. I think the justification of the contribution should focus on the SSL rather than on accurate uncertainty estimation.
  2. It is good to achieve better uncertainty estimation, but why not just use the most up-to-date uncertainty estimation or calibration methods that achieve better performance on universally accepted benchmarks?
  3. The proposed uncertainty estimation method is not well evaluated on widely accepted benchmarks. If the authors believe the proposed method is better than other uncertainty estimation methods, they could include more experiments to justify it.
  4. The computational overhead analysis is lacking.
  5. How sensitive is the method to the quality of the aleatoric uncertainty estimates from the student? The robustness of the method to different PUF thresholds and uncertainty injection strengths needs to be better discussed.

Questions

Listed in the weaknesses above.

Limitations

Yes. The paper includes a discussion on limitations in the appendix and acknowledges challenges such as computational cost and the assumption of reliable uncertainty estimation.

Final Justification

The authors have addressed my concern, which was the performance of the uncertainty estimation module.

Formatting Issues

None

Author Response

Thank you for your constructive comments. We address each point below.

Q1: The core part of this work is the SSL, which just uses the uncertainty as a tool.

A1: We thank the reviewer for the valuable comment. We agree that our work falls within the scope of semi-supervised learning (SSL), where uncertainty is employed to enhance pseudo-label quality. The central objective of UMPL is to improve model generalization to unseen data by introducing a role-aware dual uncertainty framework. This framework leverages epistemic uncertainty for teacher selection and aleatoric uncertainty for student feedback, forming a closed-loop refinement mechanism. Through this design, UMPL effectively exploits unlabeled data to enhance prediction accuracy in low-label scientific modeling settings. We will revise the manuscript to better highlight this focus.
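
To make the closed loop concrete, the following is a minimal, runnable sketch of one UMPL-style round on toy 1-D data. All names and hyperparameters here (the 10-pass MC-Dropout estimate, the 90% filter, the simple MLPs) are illustrative assumptions for exposition, not the paper's actual EMC Dropout or feedback implementation.

```python
# Minimal sketch of one UMPL-style round (illustrative, not the paper's API).
import torch
import torch.nn as nn

torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 1))
student = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))  # (mu, log_var)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
x_unlab = torch.rand(256, 1)

# 1) Teacher pseudo-labels with epistemic uncertainty via MC-Dropout passes.
teacher.train()  # keep dropout active at "inference" time
with torch.no_grad():
    samples = torch.stack([teacher(x_unlab) for _ in range(10)])
pseudo_y, epistemic_var = samples.mean(0), samples.var(0)

# 2) Uncertainty filtering: discard the most uncertain pseudo-labels.
keep = (epistemic_var <= epistemic_var.quantile(0.9)).squeeze(-1)

# 3) Student update with a heteroscedastic (aleatoric) NLL on kept pseudo-labels.
mu, log_var = student(x_unlab[keep]).chunk(2, dim=-1)
nll = (0.5 * ((pseudo_y[keep] - mu) ** 2 / log_var.exp() + log_var)).mean()
opt_s.zero_grad(); nll.backward(); opt_s.step()

# 4) The teacher would then be refined from the student's aleatoric feedback via
#    a REINFORCE-style update (see the separate sketch later in this thread).
```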

Q2: Why not just use the most up-to-date uncertainty estimation or calibration methods?

A2: We appreciate the reviewer’s suggestion. Our main goal is to improve model accuracy in low-label settings by leveraging unlabeled data. While more advanced uncertainty estimators exist, they offer limited accuracy gains at significantly higher computational cost. UMPL adopts a balanced choice based on this trade-off, and already outperforms baselines (shown in Q3&A3). Moreover, UMPL is modular and can flexibly integrate more advanced teacher/student models if needed. We will clarify this design choice and comparative results in the revised version.

Q3: The proposed uncertainty estimation method is not well evaluated on widely-accepted benchmarks.

A3: We thank the reviewer for the helpful suggestion. Systematic comparisons of uncertainty estimation in PDE-based modeling are scarce—to the best of our knowledge, [1] is the only work that benchmarks multiple methods on the LE-PDE network. To validate our approach, we conducted additional experiments on the public Darcy Flow dataset using AeroGTO, comparing UMPL with several uncertainty modeling baselines.

| Method | MA | RMSCE | L2 | Training Time (h) | Memory (M) |
|---|---|---|---|---|---|
| EMC Dropout (Teacher) | 0.2123 | 0.2492 | 0.0388 | 0.52 | 2302 |
| MC Dropout | 0.1781 | 0.2333 | 0.0376 | 3.31 | 16249 |
| SVGD [2] | 0.1736 | 0.1954 | 0.0390 | 9.98 | 2253 |
| feature-WGD [3] | 0.1654 | 0.1880 | 0.0322 | 3.50 | 2567 |
| Latent-uq (Student) | 0.1561 | 0.1778 | 0.0382 | 0.96 | 3243 |
| Ensemble [4] | 0.1847 | 0.2044 | 0.0319 | 9.68 | 2395 |
| Ensemble latent-uq [1] | 0.1297 | 0.1455 | 0.0294 | 13.39 | 3887 |
| UMPL | 0.0891 | 0.1011 | 0.0256 | 2.21 | 3257 |

We use miscalibration area (MA) and root mean squared calibration error (RMSCE) as metrics to evaluate uncertainty quality. Results show that UMPL consistently achieves better uncertainty quality and predictive accuracy. We will include these results and clarify benchmarking choices in the revision. In addition, we replaced UMPL’s uncertainty components with more advanced estimators. While these alternatives achieved better calibration, they brought only marginal improvements in predictive accuracy and incurred significantly higher computational cost.

| Method | MA | RMSCE | L2 | Training Time (h) | Memory (M) |
|---|---|---|---|---|---|
| T(EMC Dropout)+S(Latent-uq) | 0.0891 | 0.1011 | 0.0256 | 2.21 | 3257 |
| T(SVGD)+S(Latent-uq) | 0.0873 | 0.0983 | 0.0248 | 20.42 | 3677 |
| T(feature-WGD)+S(Latent-uq) | 0.0842 | 0.0951 | 0.0243 | 6.53 | 5597 |
| T(feature-WGD)+S(Ensemble latent-uq) | 0.0773 | 0.0877 | 0.0239 | 28.43 | 6273 |
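
For reference, the miscalibration area (MA) and root mean squared calibration error (RMSCE) reported in the tables above can be computed as in the following hedged sketch, which follows the common Gaussian prediction-interval convention (e.g., the Uncertainty Toolbox library); the authors' exact implementation is not shown in the rebuttal.

```python
# Sketch of MA and RMSCE under the Gaussian-interval convention (assumed, not
# necessarily identical to the authors' implementation).
import numpy as np
from scipy import stats

def calibration_errors(mu, sigma, y, n_levels=100):
    """mu, sigma, y: 1-D arrays of predictive means, std devs, and targets."""
    expected = np.linspace(0.01, 0.99, n_levels)  # nominal coverage levels
    z = np.abs((y - mu) / sigma)
    # Observed coverage of each centered Gaussian prediction interval.
    observed = np.array([(z <= stats.norm.ppf(0.5 + p / 2)).mean() for p in expected])
    gap = observed - expected
    ma = np.trapz(np.abs(gap), expected)   # miscalibration area
    rmsce = np.sqrt(np.mean(gap ** 2))     # root mean squared calibration error
    return ma, rmsce
```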

Q4: The computational overhead analysis is lacking.

A4: Thanks for the question. All experiments were conducted on a single NVIDIA RTX 4090 GPU. We report results for the AeroGTO model across three representative tasks—2D steady-state, 2D transient, and 3D simulation—as summarized below.

| Method | Darcy (bs=4) Memory (M) | Darcy Time (h) | NSM2d (bs=4) Memory (M) | NSM2d Time (h) | Ahmed (bs=1) Memory (M) | Ahmed Time (h) |
|---|---|---|---|---|---|---|
| Supervised | 2297 | 0.45 | 9727 | 3.75 | 9375 | 3.53 |
| Pseudo Label | 3843 | 1.08 | 16567 | 9.04 | 15947 | 8.73 |
| Mean Teacher | 3843 | 1.21 | 16567 | 10.17 | 15947 | 9.57 |
| Noisy TS | 2304 | 1.77 | 9813 | 13.21 | 9477 | 12.48 |
| Supervised+uq | 3243 | 0.96 | 13578 | 7.69 | 12986 | 7.36 |
| UMPL | 3257 | 2.21 | 13792 | 16.48 | 13290 | 15.56 |

Here, Supervised+uq refers to the student model used in UMPL. As shown, UMPL introduces almost no additional memory overhead, thanks to the REINFORCE-style update that avoids the costly backpropagation chain from the student to the teacher. This significantly reduces computational and memory complexity, at the cost of increased training time.

Q5: The sensitivity and robustness analysis.

A5: Thank you for the helpful suggestion. We evaluated sensitivity to aleatoric uncertainty on the OOD Stationary Lid-Driven Cavity dataset using AeroGTO as the backbone. Gaussian noise of varying magnitudes was added to the student’s predicted uncertainty during training.

| Noise magnitude (σ) | MA | RMSCE | L2 |
|---|---|---|---|
| 0.05 | 0.0703 | 0.0828 | 0.0634 |
| 0.1 | 0.0717 | 0.0879 | 0.0634 |
| 0.2 | 0.0779 | 0.0942 | 0.0636 |
| 0.5 | 0.0908 | 0.1053 | 0.0663 |
| 1.0 | 0.2156 | 0.2655 | 0.1499 |

The results show that while uncertainty calibration degrades steadily with noise, predictive accuracy remains stable under small perturbations. However, large noise leads to sharp performance drops, indicating that UMPL is robust to moderate noise but sensitive to severe distortion.
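
The exact injection scheme is not specified in the rebuttal; one plausible form of the perturbation, shown purely for illustration, is:

```python
# Hypothetical noise injection on the student's predicted std devs; the actual
# scheme used in the experiment above may differ.
import torch

def perturb_uncertainty(sigma, magnitude):
    eps = torch.randn_like(sigma) * magnitude     # zero-mean Gaussian noise
    return (sigma * (1.0 + eps)).clamp_min(1e-6)  # keep std devs positive
```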

We also evaluated the sensitivity to the PUF threshold and feedback strength under the same setting:

| PUF threshold | MA | RMSCE | L2 |
|---|---|---|---|
| 100% | 0.0679 | 0.0802 | 0.0633 |
| 95% | 0.0661 | 0.0783 | 0.0632 |
| 90% | 0.0707 | 0.0809 | 0.0638 |

| Uncertainty strength | MA | RMSCE | L2 |
|---|---|---|---|
| ×0.1 | 0.0711 | 0.0838 | 0.0752 |
| ×1 | 0.0679 | 0.0802 | 0.0633 |
| ×5 | 0.0825 | 0.0974 | 0.0882 |
| ×10 | 0.1273 | 0.1503 | 0.1367 |

A lower PUF threshold helps filter noisy labels on harder datasets, though its impact is mild on this dataset. For feedback strength, values that are too small lead to no pseudo-label correction (reducing UMPL to plain PL), while overly large values inject noise, degrading label quality and student learning.

References

[1] Wu, Tailin, et al. "Uncertainty quantification for forward and inverse problems of PDEs via latent global evolution." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 1. 2024.

[2] Liu, Q. and Wang, D., 2016. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29.

[3] Yashima, Shingo, et al. "Feature space particle inference for neural network ensembles." International Conference on Machine Learning. PMLR, 2022.

[4] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in Neural Information Processing Systems 30 (2017).

Comment

We sincerely appreciate your constructive comments, which help us improve our manuscript. Thanks for your time and effort!

Comment

Thanks for the authors' reply. The authors have addressed my concern, which was the performance of the uncertainty estimation module. So I have decided to rate the paper at 4.

Review (Rating: 4)

This paper tackles the challenge of building robust surrogate models for complex physical systems, especially under limited labeled data, distribution shifts, or OOD scenarios. Traditional DNNs and neural operators require large labeled datasets, which are often costly or impractical to obtain in scientific domains; and although semi-supervised learning offers a way to leverage abundant unlabeled data, it suffers from noisy pseudo-labels that can degrade model performance.

To tackle this, the authors propose Uncertainty-Informed Meta Pseudo Labeling (UMPL), a semi-supervised learning framework that improves pseudo-label quality by incorporating uncertainty estimates into a teacher-student meta-learning framework where the two models have uncertainty-based roles. Specifically, the teacher model generates pseudo-labels along with epistemic uncertainty, which reflects the model's confidence and helps identify uncertain predictions under distribution shifts. The student model, in turn, learns from these pseudo-labels while estimating aleatoric uncertainty, capturing inherent data noise and label uncertainty. This separation of uncertainty types aligns with the distinct roles of teacher and student and leads to more robust and interpretable uncertainty estimates. The student provides feedback to the teacher based on aleatoric uncertainty, guiding the teacher to refine pseudo-labels iteratively. This meta-learning loop enhances pseudo-label quality and model generalization, especially in uncertain or underrepresented regions of the data.

Strengths and Weaknesses

The paper is well written and the motivation is clear. Although I am not that familiar with the practical problems and frameworks that solve (or at least tackle how to solve) problems like pseudo-labeling of high-dimensional PDE systems, it seems that the problem is well formulated and the authors motivate it well.

Regarding the contribution of the paper: in their second bullet point the authors claim that the role-aware design leads to more robust and interpretable uncertainty estimates, but without backing this up with any evidence. It would be good if the authors could point out or summarize the sections and results (figures/tables) that back this claim up; at the moment it seems a bit unjustified. The same comment stands for the third bullet point, in which the authors claim that the targeted feedback significantly improves the quality of pseudo labels across iterations without explaining why that is the case.

As I am unfamiliar with the setting of modeling high-dimensional PDE systems, I struggle to find the motivation for the choice of relative L2 error compared to just the standard L2 loss. Can the authors explain the motivation for the loss choice in (1)? Also, why have the authors chosen an encoder embedding? Is this a popular model choice in the student-teacher literature? Is there a reason to expect it to work better for modeling PDE systems? Both choices feel a bit unmotivated as presented.

The same comment stands for the choice of MC Dropout. Although it is the simplest to implement and the most straightforward to train in my experience with BNNs, the authors should argue more for why they have chosen it rather than, for example, SVGD [1], fSVGD [2], or any other Bayesian NN variant, and I think they should perform benchmarks/baselines to corroborate their choice. In Appendix C the authors give a nice Bayesian interpretation of MC Dropout, which I am thankful for, but I was a bit disappointed that they did not elaborate on why it was chosen specifically over other methods.

Regarding Figure 2, it is not really clear what the point of the figure is: the errors of the proposed method and Noisy TS are differently distributed and have different magnitudes, and it is not clear to me from the plot which one is better.

Regarding the use of REINFORCE to minimize the loss in (8) and derive its gradient, it would be nice if the authors added a few comments on the REINFORCE rule: are there any downsides to using it, is it standard in related literature, what are its benefits, and so on. I am not well versed in this literature, and although the authors made all these design choices in proposing the UMPL framework, they fall a bit flat to me. In general, is there any reason to expect each of them to work better separately and together, how do they interplay, and can the authors show that other choices for the building blocks of UMPL would have resulted in inferior performance? The ablation study in Table 3 was indeed nice and provided one small step towards this, but including at least one or two examples of what happens with other choices of uncertainty estimation (e.g., Bayesian NNs) would have been welcome.

One argument towards this could be the superior results the authors observe in Tables 1 and 2, but these also fall a bit short, as I do not know whether they are statistically significant. In their answer to question 6 of the Q&A the authors mention that they introduce experimental settings and details, but I could not find how long it took to run the experiments or whether they were run on GPUs or CPUs. Furthermore, it is mentioned that statistical significance is not reported because, in fluid dynamics modeling, using a fixed random seed shows consistent performance. I believe this claim needs to be backed up and at least commented on in the main paper or the appendix. If the experiments were computationally expensive and you have limited resources, that is fine, but then this should be commented on too. If you did mention this somewhere in the appendix I would be happy if you could point me to it, but if your experiments did not require extensive computational resources I think statistical significance needs to be added to Tables 1 and 2 as well as to the L2 error in Table 3.

[1] Liu, Q. and Wang, D., 2016. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29.

[2] Wang, Z., Ren, T., Zhu, J. and Zhang, B., 2019. Function space particle optimization for Bayesian neural networks. arXiv preprint arXiv:1902.09754.

Questions

As mentioned above:

As I am unfamiliar with the setting of modeling high-dimensional PDE systems, I struggle to find the motivation for the choice of relative L2 error compared to just the standard L2 loss. Can the authors explain this motivation?

How long did it take to run the experiments, and did you use GPUs or CPUs? Could you please elaborate on why you do not report statistical significance? Can you please include significance for at least a couple of experiments in Table 1 or 2?

Typo in Appendix H title: "experiential settings"

Limitations

Yes

Final Justification

I am still hesitant and reluctant due to the lack of statistical significance of the results. The method seems to be considerably more computationally heavy than the other methods, both in terms of memory and training time. I am especially reluctant because the authors said they did not include statistical significance due to space constraints, which seems odd.

However, as I am not well versed in the field, and after reading the other reviewers' comments, I have chosen to update my score from 3 to 4. However, I still have some concerns and worries, so I will not give a larger score.

Formatting Issues

No formatting concerns

Author Response

Thank you for your careful reading of the paper and your constructive comments. We address each point below.

Q1: Request to clarify and support the claimed benefits of role-aware uncertainty modeling and targeted feedback.

A1: In response to the request for evidence supporting the claims of “role-aware design” and “targeted feedback,” we provide the following clarifications:

Regarding the second bullet point: “Role-aware design leads to more robust and interpretable uncertainty estimates” — our framework uses role-specific uncertainty: the teacher models epistemic uncertainty for distribution shifts, while the student captures aleatoric uncertainty for input-dependent noise, forming a complementary structure. This design is supported by:

  • Theoretical justification: Section 3.1 provides the theoretical foundation, where Theorem 3.1 introduces EMC Dropout for efficient epistemic uncertainty estimation on the teacher side, Equation (5) formulates the aleatoric uncertainty optimization for the student (a generic form is sketched after this list), and Theorem 3.2 establishes an upper bound on the student's error under noisy pseudo labels.

  • Empirical evidence: Figure 2 shows that the student's feedback loss aligns well with the true error, demonstrating strong spatial adaptivity. Figure 3(a) demonstrates that our method, applied to the Ahmed dataset, provides more accurate uncertainty estimates compared to the approach that uses only the student model trained on labeled data.
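
For readers unfamiliar with heteroscedastic regression, the student-side objective referenced above as Eq. (5) is presumably of the standard Gaussian negative log-likelihood form shown below; this is a generic sketch, and the paper's exact formulation may add weighting or regularization terms.

$$\mathcal{L}_{\text{student}}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N}\left[\frac{\bigl(y_i-\mu_\theta(x_i)\bigr)^2}{2\,\sigma_\theta^2(x_i)}+\frac{1}{2}\log\sigma_\theta^2(x_i)\right]$$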

Regarding the third bullet point: "Targeted feedback significantly improves the quality of pseudo labels" — this is supported both theoretically and empirically:

  • Theoretical justification: Section 3.2 introduces a point-wise feedback mechanism guided by the student’s aleatoric uncertainty. The pseudo labels are perturbed (Eq. 6), and the teacher is updated using a first-order gradient approximation (Eq. 10–12), which avoids expensive meta-gradient computation.

  • Empirical evidence: Figure 2 shows that feedback signals correlate strongly with true label error, guiding the teacher to focus on high-error regions and improving pseudo-label quality over time. Tables 1 and 2 demonstrate that UMPL consistently outperforms methods without targeted feedback (e.g., Pseudo Label, Mean Teacher), and Table 3 shows a 42.6% performance drop when the feedback module is removed, confirming its critical role.

We acknowledge that the current version lacks direct quantitative analysis of pseudo-label quality progression. In the revised version, we will include comparisons of pseudo-label error curves over iterations to more explicitly demonstrate how targeted feedback improves pseudo-label quality.

Q2: Why choose relative L2 error?

A2: Thank you for the question. We adopt relative L2 error due to its clear advantages in scientific modeling, particularly in the following aspects:

  • It is well-suited for multi-scale systems or quantities with heterogeneous units, where standard L2 loss tends to overweight large-magnitude regions and ignore small but important ones.

  • In multi-physics problems (e.g., Plasma ICP), where predicted variables differ by orders of magnitude, standard L2 loss leads to gradient imbalance. Relative L2 provides scale-invariant optimization that treats all fields fairly.

  • This choice is widely adopted in neural operator literature (e.g., FNO, Transolver, AeroGTO), both as a training loss and evaluation metric, due to its physical interpretability and robustness.

Therefore, relative L2 error not only serves as an effective evaluation metric but also as a more appropriate training objective.
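
For concreteness, the relative L2 error discussed above has the following standard form in the neural-operator literature; this sketch is assumed to match the paper's Eq. (1) up to batching and normalization details.

```python
# Relative L2 error as commonly used in neural-operator work (assumed form).
import torch

def relative_l2(pred, target, eps=1e-8):
    """Per-sample ||pred - target||_2 / ||target||_2, averaged over the batch."""
    diff = (pred - target).flatten(1)
    num = torch.linalg.vector_norm(diff, dim=1)
    den = torch.linalg.vector_norm(target.flatten(1), dim=1).clamp_min(eps)
    return (num / den).mean()
```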

Q3: Why choose an encoder embedding? Is this a popular choice in the student-teacher literature? Does it work better for modeling PDE systems?

A3: Thank you for the question. The use of encoder embeddings is motivated by the following:

  • Unified representation of heterogeneous PDE inputs: PDE problems involve geometry, physical parameters, and boundary conditions. The encoder maps these multimodal inputs into a common latent space, enhancing generalization. For instance, in the Plasma ICP case, the encoder helps integrate irregular mesh geometry and spatial parameters into a consistent representation.

  • Widely used across domains: Teacher–student frameworks commonly use encoder embeddings in tasks such as image classification[1] and defect detection[2], enabling stable pseudo-labeling and alignment between models.

  • Adaptation to PDE-specific challenges: For datasets with non-uniform meshes or complex geometries (e.g., Darcy Flow, NSM2d), the encoder learns geometry-invariant representations to improve spatial robustness. This strategy is also used in operator models like DeepONet and AeroGTO to align heterogeneous inputs into a unified latent space.

We will clarify this design choice in the revised version.

Q4: Figure 2 does not make clear which method is better.

A4: Thank you for the valuable feedback. In the revised version, we will replace the last subplot in Figure 2 with a comparison of Kendall correlation between feedback loss and true pseudo-label error for UMPL and Noisy TS. On 100 unlabeled samples, UMPL achieves much stronger correlation: 59 samples exceed 0.8 (vs. 5 for Noisy TS), and only 2 fall below 0.6 (vs. 11 for Noisy TS). This highlights that UMPL's aleatoric uncertainty-guided feedback more reliably captures pseudo-label error, leading to more stable and robust training.
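
The per-sample Kendall correlation described above can be computed with SciPy; `feedback_loss` and `true_error` below are assumed to be point-wise arrays for one unlabeled sample.

```python
# Sketch of the per-sample rank-correlation analysis described above.
from scipy.stats import kendalltau

def kendall_per_sample(feedback_loss, true_error):
    tau, _ = kendalltau(feedback_loss, true_error)
    return tau

# E.g., counting samples with correlation above 0.8, as in the reported analysis:
# taus = [kendall_per_sample(f, e) for f, e in zip(feedback_losses, true_errors)]
# strong = sum(t > 0.8 for t in taus)
```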

Q5: Comments regarding the REINFORCE rule.

A5: Thank you for the question. We adopt REINFORCE to estimate the gradient in Eq.(8), primarily to avoid the costly backpropagation path from student to teacher, thus significantly reducing memory and computational overhead. REINFORCE is a standard technique and has been widely used in meta pseudo-labeling methods in computer vision. Its main limitation is increased time cost due to repeated forward and backward passes. We will clarify this design choice and trade-off in the revised version.
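
For context, the generic score-function (REINFORCE) estimator looks as follows; the names and the Gaussian sampling are illustrative, and the exact reward and sampling scheme of Eq. (8) are not reproduced here. The key point is that no gradient flows from the student into the teacher, which is what saves memory.

```python
# Generic REINFORCE surrogate: grad E_{y~p_theta}[R(y)] = E[R(y) * grad log p_theta(y)].
import torch

def reinforce_teacher_loss(teacher_mu, teacher_sigma, student_reward):
    dist = torch.distributions.Normal(teacher_mu, teacher_sigma)
    y = dist.sample()                 # sampled pseudo-labels; no gradient path
    log_prob = dist.log_prob(y)
    # Minimizing this surrogate ascends the expected reward; the reward is
    # detached, so no backpropagation chain runs from student to teacher.
    return -(student_reward.detach() * log_prob).mean()
```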

Q6: Why not choose other UQ methods?

A6: Thank you for the insightful question. As noted, we chose EMC Dropout mainly for its computational efficiency. While more accurate UQ methods exist, EMC offers a good balance between cost and performance. Our main goal is to use unlabeled data to boost prediction accuracy on unseen data. We have added experiments comparing different UQ methods on AeroGTO for the Darcy Flow dataset in terms of both accuracy and efficiency, using miscalibration area (MA) and root mean squared calibration error (RMSCE) as evaluation metrics:

| Method | MA | RMSCE | L2 | Training Time (h) | Memory (M) |
|---|---|---|---|---|---|
| EMC Dropout (Teacher) | 0.2123 | 0.2492 | 0.0388 | 0.52 | 2302 |
| MC Dropout | 0.1781 | 0.2333 | 0.0376 | 3.31 | 16249 |
| SVGD | 0.1736 | 0.1954 | 0.0390 | 9.98 | 2253 |
| feature-WGD [3] | 0.1654 | 0.1880 | 0.0322 | 3.50 | 2567 |
| Latent-uq (Student) | 0.1561 | 0.1778 | 0.0382 | 0.96 | 3243 |
| Ensemble [4] | 0.1847 | 0.2044 | 0.0319 | 9.68 | 2395 |
| Ensemble latent-uq [5] | 0.1297 | 0.1455 | 0.0294 | 13.39 | 3887 |
| UMPL | 0.0891 | 0.1011 | 0.0256 | 2.21 | 3257 |

As shown in the results, the teacher performs worse than advanced Bayesian methods (e.g., SVGD, feature-WGD), and the student underperforms compared to Ensemble latent-uq; however, both are more efficient. As discussed in Q7 & A7, advanced replacements improve uncertainty estimation with only minimal accuracy gains, at a much higher computational cost. Our framework remains flexible, enabling such substitutions depending on practical trade-offs.

Q7: Other choices for the building blocks of UMPL?

A7: We replaced the student and teacher components using the same setup as in Q6 & A6. The performance comparison is shown below:

| Method | MA | RMSCE | L2 | Training Time (h) | Memory (M) |
|---|---|---|---|---|---|
| T(EMC Dropout)+S(Latent-uq) | 0.0891 | 0.1011 | 0.0256 | 2.21 | 3257 |
| T(SVGD)+S(Latent-uq) | 0.0873 | 0.0983 | 0.0248 | 20.42 | 3677 |
| T(feature-WGD)+S(Latent-uq) | 0.0842 | 0.0951 | 0.0243 | 6.53 | 5597 |
| T(feature-WGD)+S(Ensemble latent-uq) | 0.0773 | 0.0877 | 0.0239 | 28.43 | 6273 |

These results confirm that replacing components leads to better performance, at the cost of increased computational overhead.

Q8: Computational resources & statistical significance?

A8: Thanks for the question. All experiments were conducted on a single NVIDIA RTX 4090 GPU. We report results for the AeroGTO model across three representative tasks—2D steady-state, 2D transient, and 3D simulation—as summarized below.

| Method | Darcy (bs=4) Memory (M) | Darcy Time (h) | NSM2d (bs=4) Memory (M) | NSM2d Time (h) | Ahmed (bs=1) Memory (M) | Ahmed Time (h) |
|---|---|---|---|---|---|---|
| Supervised | 2297 | 0.45 | 9727 | 3.75 | 9375 | 3.53 |
| Pseudo Label | 3843 | 1.08 | 16567 | 9.04 | 15947 | 8.73 |
| Mean Teacher | 3843 | 1.21 | 16567 | 10.17 | 15947 | 9.57 |
| Noisy TS | 2304 | 1.77 | 9813 | 13.21 | 9477 | 12.48 |
| Supervised+uq | 3243 | 0.96 | 13578 | 7.69 | 12986 | 7.36 |
| UMPL | 3257 | 2.21 | 13792 | 16.48 | 13290 | 15.56 |

Due to main text space constraints, we reported only the mean over 5 trials with different seeds. Statistical significance results will be included in the appendix of the revised version.

Q9: Typo in Appendix H title.

A9: Fixed it.

References

[1] Wang, Zhenbin, et al. "Metateacher: Coordinating multi-model domain adaptation for medical image classification." Advances in Neural Information Processing Systems 35 (2022): 20823-20837.

[2] Zhao, Sinong, et al. "Meta pseudo labels for anomaly detection via partially observed anomalies." Engineering Applications of Artificial Intelligence 126 (2023): 106955.

[3] Yashima, Shingo, et al. "Feature space particle inference for neural network ensembles." International Conference on Machine Learning. PMLR, 2022.

[4] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in Neural Information Processing Systems 30 (2017).

[5] Wu, Tailin, et al. "Uncertainty quantification for forward and inverse problems of PDEs via latent global evolution." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 1. 2024.

Comment

I thank the authors for the additional experiments and explanations. I hope they incorporate everything they said into the paper; if they do, I think the paper will be stronger and will have sufficient weight to be a conference paper.

Comment

We sincerely appreciate your helpful suggestions! Thanks for your time and effort!

Review (Rating: 5)

The paper proposes a semi-supervised teacher-student framework for emulators with limited simulation data. The teacher generates pseudo-labels using MC-Dropout to predict epistemic uncertainty, while the student learns from these labels while predicting aleatoric uncertainty via variance prediction. The student's uncertainty provides feedback to refine the teacher's pseudo-labels in a meta-learning loop and adds spatial coherence. The system is evaluated on scientific tasks and under distribution shifts.

Strengths and Weaknesses

Strengths

  • Original closed feedback loop between teacher (epistemic) and student (aleatoric) uncertainty
  • The mesh-aware spatial coherence for the pseudo-labels
  • Extensive experimental section on various architectures

Weaknesses

  • Questionable uncertainty method: the paper relies on MC-Dropout to estimate the teacher-epistemic, known to struggle when the true uncertainty isn't nicely behaved and bell-shaped (Gaussian) – common in messy real-world physics like shocks or abrupt changes.
  • Reliable uncertainties not shown: there should be some experimental proof that the uncertainty predictions are trustworthy, using checks like calibration metrics or reliability diagrams. Don't the errors remain high in complex tasks like NSM2d?
  • Potentially overly easy experiments: the benchmarks use clean, smooth simulation data without the real-world noise or sharp changes one could encounter. This likely explains why even the basic "Pseudo Label" baseline beats supervised learning – a result that often does not hold up with noisier or more complex data. I am not convinced that the method would hold up against the hard distribution shifts or more complex noise common in real data.

Questions

  1. Both epistemic (MC-Dropout) and aleatoric (heteroscedastic variance) uncertainty predictors assume symmetric Gaussian distributions. How does UMPL handle physical systems with asymmetric or heavy-tailed uncertainties (e.g., shock waves, bifurcation points) where these assumptions fail? Did you evaluate on such systems?
  2. Why are standard uncertainty calibration metrics absent? Can you show quantitative proof that the predicted uncertainty reliably correlates with error magnitude (not just direction) in OOD regions?
  3. Could you report computing resources?

Limitations

Yes

Final Justification

The authors have addressed most of my questions. I am increasing my score to 5.

Formatting Issues

None

Author Response

Thank you for the positive feedback.

Q1: Quantitative proof that the predicted uncertainty reliably correlates with error magnitude.

A1: Thank you for the insightful question. We have added experiments comparing our method with other uncertainty estimation approaches on AeroGTO for the Darcy Flow dataset in terms of both accuracy and efficiency, using miscalibration area (MA) and root mean squared calibration error (RMSCE) as evaluation metrics:

| Method | MA | RMSCE | L2 | Training Time (h) | Memory (M) |
|---|---|---|---|---|---|
| EMC Dropout (Teacher) | 0.2123 | 0.2492 | 0.0388 | 0.52 | 2302 |
| MC Dropout | 0.1781 | 0.2333 | 0.0376 | 3.31 | 16249 |
| SVGD [1] | 0.1736 | 0.1954 | 0.0390 | 9.98 | 2253 |
| feature-WGD [2] | 0.1654 | 0.1880 | 0.0322 | 3.50 | 2567 |
| Latent-uq (Student) | 0.1561 | 0.1778 | 0.0382 | 0.96 | 3243 |
| Ensemble [3] | 0.1847 | 0.2044 | 0.0319 | 9.68 | 2395 |
| Ensemble latent-uq [4] | 0.1297 | 0.1455 | 0.0294 | 13.39 | 3887 |
| UMPL | 0.0891 | 0.1011 | 0.0256 | 2.21 | 3257 |

Additionally, we conducted a comparison on the NSM2d experiment:

| Method | MA | RMSCE | L2 |
|---|---|---|---|
| Latent-uq (Student) | 0.1959 | 0.2404 | 0.1728 |
| MC Dropout | 0.2841 | 0.3183 | 0.1734 |
| UMPL | 0.1127 | 0.1393 | 0.1213 |

These comparisons highlight UMPL's strong performance over the baselines, with a good balance between accuracy and efficiency. The teacher and student components can be flexibly replaced with more advanced methods for further gains, albeit at higher computational cost.

Q2: The method may not hold up with noise or more complex real-world data.

A2: Thank you for the thoughtful comment. We have considered challenging scenarios in our experiments. For example, the Black Sea dataset is based on real ocean measurements with strong seasonal patterns and complex temporal evolution. It poses a difficult temporal OOD generalization problem, yet UMPL still achieves performance gains (albeit smaller than on other datasets). The Plasma ICP dataset involves distribution shift from low- to high-fidelity simulations. UMPL outperforms baselines using only 10% of high-fidelity data, showing strong data efficiency under realistic conditions.

We acknowledge that our benchmarks do not include noisy data. The current teacher model focuses on epistemic uncertainty and does not explicitly model data noise. To address this, we conducted an additional experiment on the Darcy Flow dataset using the AeroGTO backbone, introducing varying levels of input-dependent noise.

| Noise | MA | RMSCE | L2 |
|---|---|---|---|
| 5% | 0.0996 | 0.1117 | 0.0257 |
| 10% | 0.1076 | 0.1207 | 0.0264 |
| 20% | 0.1286 | 0.1443 | 0.0302 |
| 40% | 0.1755 | 0.1975 | 0.0386 |
| 40%, Improved UMPL | 0.1335 | 0.1498 | 0.0348 |

Results show that model performance degrades with added noise. To improve robustness to noise, we incorporated both epistemic and aleatoric uncertainty in the teacher following [5], and added an ensemble student model. This enhanced predictive performance under noisy conditions, which we will detail further in the revised version.
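
For reference, the combined uncertainty of [5] sums the variance of the MC-Dropout means (epistemic) and the average predicted variance (aleatoric); a minimal sketch follows, assuming a hypothetical `model` that returns `(mu, log_var)`.

```python
# Sketch of Kendall & Gal-style combined uncertainty [5] (assumed interface).
import torch

def combined_uncertainty(model, x, n_samples=10):
    model.train()  # keep dropout active for MC sampling
    with torch.no_grad():
        outs = [model(x) for _ in range(n_samples)]
    mus = torch.stack([mu for mu, _ in outs])
    ale = torch.stack([log_var.exp() for _, log_var in outs])
    epistemic = mus.var(dim=0)   # spread of MC means
    aleatoric = ale.mean(dim=0)  # average predicted data noise
    return mus.mean(dim=0), epistemic + aleatoric
```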

Q3: How does UMPL handle physical systems with asymmetric or heavy-tailed uncertainties?

A3: Thank you for the insightful question. Our teacher and student models indeed assume Gaussian distributions, which is suitable for most standard cases. However, we acknowledge that performance may degrade under shocks or discontinuities. To evaluate this, we designed an experiment using a 1D Burgers-type PDE with a small diffusion coefficient:

$$\frac{\partial u}{\partial t}-\mu \frac{\partial^2 u}{\partial x^2}+u \frac{\partial u}{\partial x}+\epsilon u^3=0,\qquad u_0(x)=ax^3-(a+1)x,\quad x\in[-1,1],\qquad \left.-\mathbf{n}\cdot\frac{\partial u}{\partial x}\right|_{x=-1,1}=0$$

Using DeepONet as the backbone, the results show that the original model struggles in such non-Gaussian settings:

| Method | MA | RMSCE | L2 | Training Time (h) | Memory (M) |
|---|---|---|---|---|---|
| Supervised | / | / | 0.0499 | 1.42 | 1261 |
| UMPL | 0.3004 | 0.3448 | 0.0528 | 1.62 | 1439 |
| Improved UMPL | 0.1408 | 0.1590 | 0.0432 | 2.61 | 1789 |

We improved performance by replacing the teacher's epistemic model with feature-WGD and the student's aleatoric model with an MDN [6], which better capture heavy-tailed distributions. This comes at higher computational cost. Our framework thus offers a flexible trade-off between efficiency and accuracy, with modular components adaptable to specific scenarios. We will include these findings in the revised appendix.
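
For completeness, the kind of mixture-density head used as the replacement student above is trained with a mixture negative log-likelihood; a minimal sketch assuming a Gaussian mixture with K components (the MDN in [6] may differ in parameterization):

```python
# Sketch of a Gaussian mixture-density-network NLL for heavy-tailed targets.
import torch

def mdn_nll(pi_logits, mu, sigma, y):
    """pi_logits, mu, sigma: (batch, K) mixture parameters; y: (batch,) targets."""
    log_pi = torch.log_softmax(pi_logits, dim=-1)
    comp = torch.distributions.Normal(mu, sigma)
    log_prob = comp.log_prob(y.unsqueeze(-1))  # (batch, K) per-component log-density
    return -torch.logsumexp(log_pi + log_prob, dim=-1).mean()
```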

Q4: Could you report computing resources?

A4: Thanks for the question. All experiments were conducted on a single NVIDIA RTX 4090 GPU. We report results for the AeroGTO model across three representative tasks—2D steady-state, 2D transient, and 3D simulation—as summarized below.

| Method | Darcy (bs=4) Memory (M) | Darcy Time (h) | NSM2d (bs=4) Memory (M) | NSM2d Time (h) | Ahmed (bs=1) Memory (M) | Ahmed Time (h) |
|---|---|---|---|---|---|---|
| Supervised | 2297 | 0.45 | 9727 | 3.75 | 9375 | 3.53 |
| Pseudo Label | 3843 | 1.08 | 16567 | 9.04 | 15947 | 8.73 |
| Mean Teacher | 3843 | 1.21 | 16567 | 10.17 | 15947 | 9.57 |
| Noisy TS | 2304 | 1.77 | 9813 | 13.21 | 9477 | 12.48 |
| Supervised+uq | 3243 | 0.96 | 13578 | 7.69 | 12986 | 7.36 |
| UMPL | 3257 | 2.21 | 13792 | 16.48 | 13290 | 15.56 |

Here, Supervised+uq refers to the student model used in UMPL. As shown, UMPL introduces almost no additional memory overhead, thanks to the REINFORCE-style update that avoids the costly backpropagation chain from the student to the teacher. This significantly reduces computational and memory complexity, at the cost of increased training time.

References

[1] Liu, Q. and Wang, D., 2016. Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29.

[2] Yashima, Shingo, et al. "Feature space particle inference for neural network ensembles." International Conference on Machine Learning. PMLR, 2022.

[3] Lakshminarayanan, Balaji, Alexander Pritzel, and Charles Blundell. "Simple and scalable predictive uncertainty estimation using deep ensembles." Advances in Neural Information Processing Systems 30 (2017).

[4] Wu, Tailin, et al. "Uncertainty quantification for forward and inverse problems of PDEs via latent global evolution." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 1. 2024.

[5] Kendall, Alex, and Yarin Gal. "What uncertainties do we need in Bayesian deep learning for computer vision?" Advances in Neural Information Processing Systems 30 (2017).

[6] Thakur, Akshay, and Souvik Chakraborty. "MD-NOMAD: Mixture density nonlinear manifold decoder for emulating stochastic differential equations and uncertainty propagation." arXiv preprint arXiv:2404.15731 (2024).

Comment

The authors provided new experiments showing how UMPL performs in terms of calibration and under different noise distributions. This is appreciated and will make the paper more impactful. While the rebuttal's comparison with a set of uncertainty quantification methods is perhaps missing a few commonly adopted ones (such as deep evidential regression, batch ensembles, or conformal prediction-based methods), the improvement is quite clear.

Comment

Thank you for the insightful feedback. Here, we clarify the reasons for not including these commonly adopted methods. For high-dimensional PDE prediction tasks, a variety of model architectures may be used, each with differing levels of complexity. For instance, the AeroGTO model employs a graph Transformer architecture, which presents significant challenges for implementing certain model-dependent uncertainty estimation techniques such as deep evidential regression. To ensure broad applicability and methodological consistency, we intentionally adopted model-agnostic uncertainty estimation methods and used identical data splits across all models for fair comparison. Based on these considerations, these methods were not included in the evaluation.

We agree that this point merits clarification and will add a detailed explanation in the revised experimental section. We sincerely appreciate your time and effort in helping us improve our manuscript!

Comment

Dear Reviewer 9hkT,

We have provided new experiments and discussions in line with your valuable suggestions, which have been incorporated into the revised manuscript. We hope the revised manuscript is stronger thanks to your suggestions.

As the rebuttal deadline is approaching, we sincerely look forward to your reply. Thanks so much for your time and effort!

Best Regards,

Authors of Submission 10749

Review (Rating: 5)

This paper tackles the challenge of building surrogate models for physics-governed systems when only a small, costly set of simulation labels is available alongside abundant unlabeled data. It proposes a semi-supervised teacher-student framework in which a lightweight Exponential-Moving Monte-Carlo Dropout (EMC-Dropout) teacher predicts system states and provides epistemic uncertainty, while a student network learns from those pseudo-labels, estimates aleatoric uncertainty, and feeds it back to the teacher. Training is further stabilized by Progressive Uncertainty Filtering, which gradually discards the noisiest pseudo-labeled regions, and by Supervised Anchoring, which periodically re-grounds the model on true labels. Evaluated by plugging into four distinct neural operators—DeepONet, MeshGraphNet, Transolver, and AeroGTO—across seven CFD and physics benchmarks spanning partial differential equations, real-world datasets, and large-scale 3-D simulations, the method achieves roughly a 14% error reduction with only 10% labeled data and markedly improves out-of-distribution and distribution-shift generalization.

Strengths and Weaknesses

Strengths

  1. Novel teacher–student strategy – The framework introduces an innovative teacher–student loop that produces more reliable pseudo-labels while simultaneously training the surrogate model.

  2. Architecture-agnostic design – Because the method is model-agnostic, it can be plugged into virtually any neural-operator or surrogate-model architecture with minimal modification.

  3. Thorough empirical validation – The authors support their claims with extensive experiments across multiple datasets and architectures, demonstrating consistently strong performance.

Weaknesses

  1. Potential computational burden – Running two networks, multiple Monte-Carlo dropout passes, and iterative feedback may be resource-intensive. A dedicated section detailing the hardware used, run-times, and memory footprint would clarify the true computational cost.

  2. Clarity and presentation – Several aspects of the manuscript hinder readability:

    a. Key variables and symbols should be defined in the main text rather than deferred to the appendix.

    b. Algorithm 1 should be included in the main body so that readers can better understand the whole procedure.

    c. Although Figure 1 is helpful in principle, its current form does not clearly depict the full workflow and should be redesigned for greater clarity.

Questions

  1. Could you elaborate on the computational resources your method requires—e.g., hardware, run-time, and memory overhead—so readers can better gauge its practical cost?

  2. Within the teacher–student framework, is there a risk of model collapse, where the teacher and student converge to similar outputs on unseen data yet still generate poor labels relative to the true target distribution? How can one protect against this?

Limitations

Yes — the limitations are discussed in Appendix Section K.

Final Justification

The authors have addressed most of my questions. I will maintain my score of 5.

Formatting Issues

There are no significant issues with the paper’s formatting.

Author Response

Thank you for your positive feedback and for recognizing the significance and potential of our method.

Q1: Clarity and presentation – Several aspects of the manuscript hinder readability.

A1: We thank the reviewer for the helpful suggestions. For a and b, we will include key variable definitions and move Algorithm 1 to the main text in the revised version. For c, we will revise Figure 1 in the revised version to better reflect the complete UMPL workflow. The updated figure will explicitly show the teacher–student interaction, the three-stage training loop (pseudo-label generation, student update, teacher update), and the roles of Progressive Uncertainty Filtering (PUF) and Supervised Anchoring (SA), thereby enhancing the clarity and interpretability of the framework.

Q2: Could you elaborate on the computational resources your method requires?

A2: Thanks for the question. All experiments were conducted on a single NVIDIA RTX 4090 GPU. We report results for the AeroGTO model across three representative tasks—2D steady-state, 2D transient, and 3D simulation—as summarized below.

| Method | Darcy (bs=4) Memory (M) | Darcy Time (h) | NSM2d (bs=4) Memory (M) | NSM2d Time (h) | Ahmed (bs=1) Memory (M) | Ahmed Time (h) |
|---|---|---|---|---|---|---|
| Supervised | 2297 | 0.45 | 9727 | 3.75 | 9375 | 3.53 |
| Pseudo Label | 3843 | 1.08 | 16567 | 9.04 | 15947 | 8.73 |
| Mean Teacher | 3843 | 1.21 | 16567 | 10.17 | 15947 | 9.57 |
| Noisy TS | 2304 | 1.77 | 9813 | 13.21 | 9477 | 12.48 |
| Supervised+uq | 3243 | 0.96 | 13578 | 7.69 | 12986 | 7.36 |
| UMPL | 3257 | 2.21 | 13792 | 16.48 | 13290 | 15.56 |

Here, Supervised+uq refers to the student model used in UMPL. As shown, UMPL introduces almost no additional memory overhead, thanks to the REINFORCE-style update that avoids the costly backpropagation chain from the student to the teacher. This significantly reduces computational and memory complexity, at the cost of increased training time.

Q3: Within the teacher–student framework, is there a risk of model collapse? How to protect against this?

A3: We thank the reviewer for the insightful question. We agree that model collapse is a potential risk in teacher–student frameworks. To mitigate this, our design incorporates several mechanisms:

  • Divergent Initialization Strategy: The teacher and student models are initialized with different parameters, and during each training loop, they are exposed to independently sampled labeled data. This helps reduce the risk of early-stage over-alignment due to shared errors.

  • Uncertainty-Guided Feedback Loop: The student provides point-wise feedback based on aleatoric uncertainty, which captures spatial error sensitivity and directs the teacher’s updates toward high-error regions. This breaks blind imitation and reduces the chance of reinforcing incorrect signals.

  • Uncertainty-Aware Pseudo Labels: Pseudo labels are perturbed using the teacher’s epistemic uncertainty, replacing hard targets with uncertainty-aware guidance. This alleviates label bias and stabilizes convergence.

  • Progressive Uncertainty Filtering (PUF): Before being used for training, the teacher's pseudo labels are filtered by uncertainty; low-confidence samples are discarded, improving pseudo-label quality and enhancing training stability (a minimal sketch of this filtering step follows below).

We will include a detailed discussion of these safeguards in the revised version to better address this potential risk.
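
A minimal sketch of the PUF step mentioned in the last bullet above, assuming epistemic variance per pseudo-labeled point and an illustrative schedule (the paper's actual threshold schedule may differ):

```python
# Sketch of Progressive Uncertainty Filtering (illustrative schedule).
import torch

def puf_mask(epistemic_var, keep_frac):
    """Boolean mask keeping the `keep_frac` least-uncertain pseudo-labeled points."""
    threshold = torch.quantile(epistemic_var, keep_frac)
    return epistemic_var <= threshold

# Example schedule: tighten the filter from 100% to 90% over training.
# keep_frac = max(0.90, 1.0 - 0.10 * epoch / total_epochs)
```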

Comment

Thanks for answering my questions and providing a detailed response. I will maintain my score.

Comment

We truly appreciate your valuable feedback, which has helped us strengthen our work. Thank you for your time and effort!

Comment

Dear Reviewer fwWx,

We have provided new discussions in line with your valuable suggestions, which have been incorporated into the revised manuscript. We hope the revised manuscript is stronger thanks to your suggestions.

As the rebuttal deadline is approaching, we sincerely look forward to your reply. Thanks so much for your time and effort!

Best Regards,

Authors of Submission 10749

Comment

Dear Reviewers,

The author-reviewer discussion period is now open and will continue until August 6. Please review the authors' rebuttal to determine whether it adequately addresses your concerns. If you have further questions or comments, engage with the authors by acknowledging that you have read their response and providing additional feedback as needed.

Sincerely,

Your AC

Comment

Dear Reviewers, ACs, SACs, and PCs,

We would like to express our heartfelt gratitude for your invaluable time and meticulous attention in reviewing our manuscript. We appreciate that all four reviewers gave us positive feedback with insightful suggestions that can help us improve the quality and rigor of our work.

We have provided detailed responses to each reviewer’s concerns:

  • Reviewer fwWx raised concerns on computational cost, clarity of presentation, and risk of model collapse in the teacher–student framework.

  • Reviewer VXJf questioned the justification of key design choices (loss, embedding, MC Dropout, REINFORCE), clarity of figures, missing computational cost details, and the lack of comparison with other uncertainty estimation methods and statistical significance.

  • Reviewer 9hkT raised concerns about MC-Dropout under non-Gaussian uncertainties, lack of calibration validation, overly clean benchmarks, and missing resource reporting.

  • Reviewer FbuM focused on unclear SSL vs. uncertainty contribution, limited benchmark validation, missing overhead analysis, and insufficient robustness discussion.

For each, we have diligently addressed all concerns by conducting extensive additional experiments to support our responses. The main manuscript has been revised with improved descriptions and updates to figures. A detailed list of all changes is provided in the rebuttal, and additional experimental results are included in the appendix due to space constraints. Below, we summarize the key modifications made in both the rebuttal and the revised manuscript:

  1. Extended baselines: Compared with seven model-agnostic uncertainty methods, our approach showed superior accuracy and uncertainty quality on unseen data.

  2. Extended experiments: Replacing teacher–student modules with alternative uncertainty methods improved performance at higher cost, demonstrating the framework’s flexibility, while module refinement maintained strong results on non-Gaussian data.

  3. Further analyses: Sensitivity tests on aleatoric quality, noise robustness, PUF threshold, and uncertainty injection, plus efficiency evaluation showing no extra GPU memory use but added feedback time.

  4. Manuscript revisions: Updated Figures 1–2, clarified the core objective, refined formulations, added explanations of REINFORCE and encoder embedding, reported significance in the appendix, and expanded analysis on model collapse.

Overall, UMPL introduces (1) the first integration of pseudo-labeling with high-dimensional PDE modeling on non-uniform meshes for improved OOD generalization, (2) a role-aware teacher–student framework for robust and interpretable dual uncertainty estimation, and (3) an uncertainty-informed feedback mechanism that iteratively enhances pseudo-label quality in high-uncertainty regimes.

We kindly request that you carefully consider the contributions of our paper.

Best Regards,

Authors of Submission 10749

Final Decision

This paper introduces an uncertainty-informed method for constructing robust surrogate models of PDE-governed physical systems under limited labeled data. Specifically, the authors propose a student–teacher model with an uncertainty-informed feedback loop designed to improve both uncertainty estimates and the quality of pseudo-labels. Extensive evaluations across seven tasks demonstrate that their method consistently outperforms the best existing semi-supervised regression approaches, even with limited labeled data. The paper received four reviews: two accepts and two borderline accepts. All reviewers recognize the novelty of using an uncertainty-informed feedback mechanism in a teacher–student framework to simultaneously refine pseudo-labels and uncertainty estimates. They also highlight the strong empirical results demonstrated across multiple simulated and real-world datasets using different architectures. Nonetheless, reviewers raise several concerns. These include the choice of UQ method for the teacher model, the computational complexity of the proposed approach, performance under noisy real-world data, questions about the statistical significance of the reported results, and the adequacy of the uncertainty evaluation. The authors' rebuttal is extensive and detailed, providing new experimental results that largely address these concerns.

This paper introduces an uncertainty-informed method for constructing robust surrogate models of PDE-governed physical systems under limited labeled data. Specifically, the authors propose a student–teacher model with an uncertainty-informed feedback loop designed to improve both uncertainty estimates and the quality of pseudo-labels. Extensive evaluations across seven tasks demonstrate that their method consistently outperforms the best existing semi-supervised regression approaches, even with limited labeled data. The paper received four reviews: two accepts and two borderline accepts. All reviewers recognize the novelty of using an uncertainty-informed feedback mechanism in a teacher–student framework to simultaneously refine pseudo-labels and uncertainty estimates. They also highlight the strong empirical results demonstrated across multiple simulated and real-world datasets using different architectures. Nonetheless, reviewers raise several concerns. These include the choice of UQ method for the teacher model, the computational complexity of the proposed approach, performance under noisy real-world data, questions about the statistical significance of the reported results, and the adequacy of the uncertainty evaluation. The authors’ rebuttal is extensive and detailed, providing new experimental results that largely address these concerns.