PaperHub
5.7 / 10 · Poster · 3 reviewers (ratings 6, 6, 5; min 5, max 6, std 0.5)
Confidence: 2.7
Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.7
NeurIPS 2024

No Free Delivery Service: Epistemic limits of passive data collection in complex social systems

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-16
TL;DR

Formal impossibility results for model validation in key AI tasks such as recommender systems and LLM reasoning if they require passive data collection from complex social systems.

Abstract

Keywords
validity, model validation, complex systems, epistemology, scaling, large language models, recommender systems

Reviews and Discussion

Review (Rating: 6)

This work shows that the classic paradigm of training and testing in ML has validity flaws that make it unreasonable to generalize about real-world performance from test-set performance. They use the no free lunch theorems to demonstrate this theoretically, and then provide an empirical lens via the MovieLens benchmark dataset for recommender systems. They show that multiple models are able to fit the observed data while still performing arbitrarily on unobserved data.
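
To make this underdetermination concrete, here is a minimal sketch (toy data and hand-picked hyperparameters, not taken from the paper): two rank-limited factorizations fitted from different random seeds both reproduce the observed ratings almost exactly, yet typically disagree on the unobserved ones.

```python
import numpy as np

# Toy user-item rating matrix; NaN marks unobserved entries.
R = np.array([
    [5.0, np.nan, np.nan, 1.0],
    [np.nan, 4.0, np.nan, 2.0],
    [3.0, np.nan, 4.0, np.nan],
    [np.nan, 2.0, np.nan, 5.0],
])
observed = ~np.isnan(R)

def fit_mf(seed, rank=2, steps=20000, lr=0.01):
    """Fit a rank-limited factorization R ~= U @ V.T on the observed entries only."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.5, size=(R.shape[0], rank))
    V = rng.normal(scale=0.5, size=(R.shape[1], rank))
    for _ in range(steps):
        E = np.where(observed, U @ V.T - R, 0.0)  # residual on observed entries only
        U, V = U - lr * E @ V, V - lr * E.T @ U   # simultaneous gradient step
    return U @ V.T

pred_a, pred_b = fit_mf(seed=1), fit_mf(seed=2)
print("max |residual| on observed entries:",
      np.abs(pred_a - R)[observed].max(), np.abs(pred_b - R)[observed].max())
print("max disagreement on unobserved entries:",
      np.abs(pred_a - pred_b)[~observed].max())
```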

Strengths

The paper is very well-written overall. It tackles a very important question that has many implications given how popular the train-test paradigm is.

Weaknesses

I’m ultimately not very sure what the prescriptive argument of this work is. I understand that there is value in work that points out a problem, but the issues with generalizing performance of ML systems into the real world are widely known and empirically shown (as demonstrated by the presence of the WILDS dataset [https://wilds.stanford.edu/datasets/]). This work provides a theoretical grounding for this, but from the abstract’s final sentence, I expected more implications rather than just gestures to participation and open science. I would have liked to see implications which demonstrate the strengths of approaching this problem from a theoretical perspective, as opposed to just the empirical evidence we have of it.

There is also increasingly a line of work on the limitations of using observational data (as opposed to data from, e.g., randomized controlled trials), as well as on performative prediction, which I felt this work could connect to more, since that seems to be part of the premise of the issue with passive data collection.

A smaller point of confusion: I was sometimes unclear whether you were referring to “benchmark” only as the evaluation dataset or also the training dataset.

Questions

Above, in weaknesses.

Limitations

Addressed

Author Response

Thank you for your thoughtful feedback. I agree that it is important to clarify the prescriptive argument and the role of participatory methods. I hope that the response to all authors as well as the below additional clarifications can alleviate your concerns. Since everything is addressable via minor clarification updates, I hope this will also allow you to raise your score and confidence.

I’m ultimately not very sure what the prescriptive argument of this work is. I understand that there is value in work that points out a problem, but the issues with generalizing performance of ML systems into the real world are widely known and empirically shown (as demonstrated by the presence of the WILDS dataset [https://wilds.stanford.edu/datasets/]). This work provides a theoretical grounding for this, but from the abstract’s final sentence, I expected more implications rather than just gestures to participation and open science. I would have liked to see implications which demonstrate the strengths of approaching this problem from a theoretical perspective, as opposed to just the empirical evidence we have of it.

  • Thank you for highlighting this issue; I agree that this is an important aspect. There are three aspects in your question that I would like to separate:
    • Prescriptive argument: The paper's theoretical results do indeed provide actionable insights to improve data collection via the k-core condition for validity. In short, we would want to collect data that either increases the k-connectivity or the size of the rank(f)-subgraph. Importantly, we can compute from a given sample graph where to collect data points. Please see the response to all authors for a detailed discussion of this.
    • Lack of justification versus impossibility results: If I understand your comment correctly, the key difference between the known results and issues that you list and this work is that, to the best of my knowledge, the former are based on a "lack of justification", i.e., we know that our usual guarantees do not apply. In contrast, this paper provides much stronger insights in the form of rigorous impossibility results. The insights that can be gained from such results are much more substantial and provide a path forward to improving the situation, as in the k-core condition above (while merely noting a lack of justification cannot). Please see the response to all authors for a detailed discussion of this (including the different contributions of sufficient and necessary conditions).
    • Participatory methods: I agree that the current discussion of this is insufficient; thank you for pointing me to it. The argument for participatory methods stems from the sheer amount of data that would need to be collected in a targeted way according to the k-core scheme above. While we are currently preparing a paper for submission that introduces a novel method for this purpose, I believe that it is beyond the scope of this paper to also cover this aspect in detail (e.g., it would need to introduce additional concepts related to mechanism design, game design, economics, and the efficient computation of the k-core objectives). However, I will add a further discussion of why participatory data collection is favorable in the setting of large amounts of data plus targeted collection.

There is also increasingly a line of work on the limitations of using observational data (as opposed to data from, e.g., randomized controlled trials), as well as on performative prediction, which I felt this work could connect to more, since that seems to be part of the premise of the issue with passive data collection.

  • I agree that this is an important aspect. However, I would point out that the paper actually discusses connections to work that goes beyond observational data, i.e., to counterfactual and causal estimators (e.g., see lines 261-265 in the submission). To the best of my knowledge, this is also the first impossibility result for such estimators in settings such as those considered in this paper. As such, the theoretical results provide, in my opinion, non-trivial insights into the limitations of these popular approaches. Thank you for pointing me to this. I believe it is an important contribution of the paper and will discuss it more clearly in the updated version.

A smaller point of confusion: I was sometimes unclear whether you were referring to “benchmark” only as the evaluation dataset or also the training dataset.

  • I agree that this is not clear from the current presentation. Part of the issue comes from the fact that in (k-fold) cross-validation, parts of the training set act as the validation set. I will clarify this (maybe in the appendix due to space constraints).
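
A minimal sketch of this terminology point, using standard scikit-learn k-fold splitting on toy data (not from the paper): each fold of the training data takes a turn as the validation set, which is why "benchmark" can refer to either role.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # toy dataset of 10 examples

# In k-fold cross-validation, every example is used both for training and for validation.
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(X)):
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
```
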
Comment

Thank you for clarifying the prescriptive component, that is helpful! My review remains the same, a borderline accept.

Comment

Thank you very much for your response, I really appreciate it. I'm glad that clarifying the prescriptive component was helpful, especially since it was my impression that this was your main concern and that you are otherwise very supportive of the contributions and potential impact of the paper. Please let me know if I can help to clarify any further questions or concerns.

Review (Rating: 6)

Building advanced and very large general-purpose ML models using extremely large datasets from sources such as the Internet has gained a lot of recent attention. In particular, for the setting where the training set is sampled from a distribution S while the target distribution is T, the paper introduces the notions of (ε, α)-validity and (ε, α)-invalidity for inference validity and defines test validity accordingly. The paper shows that there is "no free delivery service" of data that allows inference/test validity on a global scale for complex social systems.

More importantly, the paper provides the metrics and necessary conditions that limit the scope of AI in complex social systems.

Strengths

S1. The paper studies a problem of immense importance: with the rapid growth of advanced ML models such as LLMs and their versatile usage across a wide range of tasks that impact societies and human lives, establishing the necessary conditions and checks and balances to limit their scope is vital. Formally introducing such metrics and conditions, this paper is an interesting step in that direction.

S2. The paper follows a formal and careful writing scheme, which makes it easy to follow and fun to read.

S3. The proposed validity framework and the inference and test validity concepts provide metrics for evaluating whether the validity of a model trained on a distribution S extends to T.

Weaknesses

W1. While I enjoyed reading the formality and the definitions in section 2, I did not find the so-called no free delivery service (NFDS) of data surprising. As the author also stated, based on the theory of ML and the NFL theorem, when there is a sampling bias and the source and target distributions are different, the expected performance guarantee of the model does not carry over. This has also been extensively discussed under the "lack of generalizability" concept.

W2. It is not particularly a weakness, but if I am not mistaken, the NFDS observation for complex social systems naturally follows from the previously known heavy-tailed distribution property of these systems.

Questions

No specific questions

Limitations

Please see the weaknesses section.

Author Response

Thank you for your thoughtful feedback. I agree that W1 is important to clarify and hope that the response to all authors as well as the below additional clarifications can alleviate your concerns. Since everything is addressable via minor clarification updates, I hope this will also allow you to raise your score.

W1. While I enjoyed reading the formality and the definitions in section 2, I did not find the so-called no free delivery service (NFDS) of data surprising. As the author also stated, based on the theory of ML and the NFL theorem, when there is a sampling bias and the source and target distributions are different, the expected performance guarantee of the model does not carry over. This has also been extensively discussed under the "lack of generalizability" concept.

  • Thank you for raising this important question. The key difference between the known results and issues that you list and this work is that, to the best of my knowledge, the former are based on a "lack of justification" argument, i.e., we know that our usual guarantees do not apply. In contrast, this paper provides much stronger insights in the form of rigorous impossibility results. The insights that can be gained from such results are much more substantial and provide a path forward to improving the situation (while a lack of justification cannot). Please see the response to all authors for a detailed discussion of this (also in terms of sufficient versus necessary conditions and their different contributions).
  • I also want to highlight that the results in this paper do point to an actionable path forward via targeted data collection using the k-core condition. This is also discussed in detail in the response to all authors. Again, this is possible because of the theoretical insights in this paper.
  • On "lack of generalizability": as I am familiar with this concept in the context of validity theory, it mostly refers to an empirical observation, e.g., that studies do not generalize beyond their very specific context. The results in this paper are again stronger, in the sense that the benchmark itself is invalid (not only that it does not generalize to new settings).
  • The practical importance of the results is also highlighted by the widespread usage of benchmarks like MovieLens in settings where they cannot be valid, even in SOTA LLM benchmarks such as BigBench (https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/movie_recommendation/README.md)

W2. It is not particularly a weakness, but if I am not mistaken, the NFDS observation for complex social systems naturally follows from the previously known heavy-tailed distribution property of these systems.

  • The NFDS results in this paper are two-fold:
    • They first establish necessary conditions that need to hold for model validation to be valid,
    • and then show that these conditions are violated when sampling from complex social systems.
  • As such, the necessary conditions hold independently of the concrete application in social systems.
  • While heavy-tailed distributions in complex systems have indeed been established prior to this work, one contribution of this paper is to combine this knowledge with the newly established necessary conditions to gain non-trivial insights into the current practice of AI.
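
As a toy illustration of how the two pieces interact (arbitrary exponent and sizes, not the paper's MovieLens analysis): under a heavy-tailed degree distribution, most nodes have very small degree, and a node with degree below k can never belong to a k-core, so the necessary condition is quickly violated as k grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Node degrees drawn from a heavy-tailed (Zipf-like) distribution, as typically
# observed for users/items in complex social systems.
degrees = rng.zipf(a=2.0, size=100_000)

for k in (1, 2, 5, 10, 50):
    # Nodes with degree < k cannot be in the k-core, so this fraction upper-bounds
    # the relative size of any k-core.
    print(f"fraction of nodes with degree >= {k}: {(degrees >= k).mean():.3f}")
```
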
Review (Rating: 5)

The paper addresses the critical issue of model validation in AI systems, especially those deployed in complex social environments. It argues that the prevalent train-test paradigm, commonly used for model validation, is often invalid in these settings due to the inherent assumptions it violates. The paper presents formal impossibility results, demonstrating that for many AI tasks involving complex social systems, the train-test paradigm cannot ensure valid model performance assessments. The study uses the MOVIELENS benchmark to illustrate these issues and suggests remedies like participatory data curation and open science to address the epistemic limitations identified.

Strengths

  1. The paper provides novel insights into the limitations of the train-test paradigm, especially in the context of AI systems interacting with complex social systems.
  2. It offers formal proofs of the epistemic limitations, providing a robust theoretical foundation for its claims.
  3. Relevance: the study addresses a highly relevant issue in modern AI, given the increasing deployment of AI systems in socially impactful contexts.
  4. Using the MOVIELENS benchmark to illustrate the theoretical points adds practical relevance and makes the arguments more tangible.

Weaknesses

  1. The formal proofs and theoretical discussions might be too complex for practitioners without a strong background in the relevant mathematical and statistical concepts.
  2. While the paper suggests remedies, it does not delve deeply into how these can be practically implemented on a large scale, which could limit their immediate applicability.
  3. The results are heavily dependent on the specific characteristics of the social systems and data collection methods considered, which might limit the generalizability of the findings to other contexts or systems.

Questions

  1. Could you elaborate on alternative model validation paradigms that could be more suitable for complex social systems?
  2. Could you provide more examples or case studies beyond the MOVIELENS benchmark to illustrate the validity issues in different types of AI systems?
  3. Could you discuss any potential limitations of your approach and how they might be addressed in future research?

Limitations

  1. To make the paper more accessible, consider simplifying some of the theoretical discussions or providing more intuitive explanations alongside the formal proofs.
  2. Include more detailed discussions on how the suggested remedies, such as participatory data curation, can be practically implemented in real-world scenarios.
Author Response

Thank you for your thoughtful feedback. I agree that your comments address important questions and hope that the response to all authors as well as the below additional clarifications can alleviate your concerns. Since everything is addressable via minor clarification updates, I hope this will also allow you to raise your score.

The formal proofs and theoretical discussions might be too complex for practitioners without a strong background in the relevant mathematical and statistical concepts.

  • Thank you for the suggestion. My aim is certainly to make this paper as accessible as possible, and I tried to do so via informal statements of the main results and by providing extensive context. Unfortunately, the page constraints limit how much this is possible. However, I will follow your suggestion and add more intuitive examples where possible.
  • While I acknowledge the issue, I'd also point out that the impression might not be fully uniform, e.g., feedback from other reviewers states that the paper is "easy to follow and fun to read", "very well-written".

Could you elaborate on alternative model validation paradigms that could be more suitable for complex social systems?

  • Thank you for raising this question. With my current knowledge, I would not so much point to alternative model validation paradigms but rather to improved data collection for model validation via the paper's k-core condition. For details on this, please see the response to all authors.

Could you provide more examples or case studies beyond the MOVIELENS benchmark to illustrate the validity issues in different types of AI systems?

  • Thank you for this question. Indeed, these results apply to a wide range of settings. In the PDF attached to the response to all authors, I illustrate this using widely used benchmarks for three different settings:
    • Reasoning - FB15K-237 (https://paperswithcode.com/dataset/fb15k-237): Here the goal is to infer the truth value of (subject, predicate, object) triples using logical reasoning. It is a widely used benchmark for reasoning and, as can be seen from the PDF, has the same structural properties as MovieLens. As such, the results of this paper apply directly (also discussed in Section 3 and Appendix D.3). The social system that generates the biased observations is both the production of knowledge and its recording in Freebase.
    • Link Prediction in Graphs - CORA (https://paperswithcode.com/sota/link-prediction-on-cora): Here, the goal is to predict links/edges in a citation graph based on observed edges. It is a widely used benchmark in graph learning and again shows the same structural properties as MovieLens. The social system that generates the biased observations is the citation practice in science (incl. popularity bias etc.).
    • Extreme Classification - Wiki10-31k (http://manikvarma.org/downloads/XC/XMLRepository.html): Here, the goal is to predict the correct labels for Wikipedia entries from a large number of user-provided labels (hence extreme classification). The social system that generates the biased observations is the authors that contribute Wikipedia entries and their labeling practice.
  • In addition to these datasets, I also want to highlight that MovieLens is THE recommender systems benchmark (comparable to MNIST for vision) and is still widely used, even in SOTA LLM benchmarks such as BigBench (https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/movie_recommendation/README.md)

Could you discuss any potential limitations of your approach and how they might be addressed in future research?

  • Please see the limitations section in the appendix of the paper.

The results are heavily dependent on the specific characteristics of the social systems and data collection methods considered, which might limit the generalizability of the findings to other contexts or systems.

  • The NFDS results in this paper are actually two-fold:
      1. They first establish necessary conditions that need to hold for any model validation to be valid,
      2. and then show that these conditions are violated when sampling from complex social systems.
  • As such, the necessary conditions in 1) hold independently of the concrete application in social systems and provide general insights into the validity of model validation.
  • That being said, I again want to highlight the urgent need to also understand the interaction with social systems, since this affects much of our practice. Hence, even if the results were confined to this setting, I would not consider it a limitation.
Author Response

Thanks to all reviewers for their insightful feedback; it will help me to clarify important aspects of the paper and improve its impact. Before addressing the reviewers' concerns in detail, I am happy to acknowledge the overall very positive feedback from all reviewers on soundness, contribution, and presentation, e.g.,

  • "very well-written", "tackles a very important question that has many implications" (qRpv).
  • "a problem of immense importance", "formal and careful writing scheme, which makes it easy to follow and fun to read" (9VqZ)
  • "novel insights" into a "highly relevant issue in modern AI" (RViy)

In the following, I will address two questions that came up in different forms across reviewers. Please see the individual responses for further discussions related to points that are specific to a single review. I hope that the following discussion can alleviate the reviewers' concerns and allow them to raise their scores, since all points can be addressed easily via minor clarifications.

Practical implications / prescriptive argument

Reviewers raised questions with regard to the practical implications of the theoretical results, i.e., are they useful to improve our data collection efforts?

I agree that this is an important question and, indeed, the theoretical results of this paper provide direct insights into how to improve data collection for model validation via its k-core condition. In particular, Lemma 2 and Corollary 3 imply two clear objectives for targeted data collection:

  • a) collecting data points that increase the k-connectivity of the sample graph. This would increase the complexity of the world that can be assumed such that model validation is still valid for the entire sample graph.
  • b) collecting data points that increase the size of the rank(f)-core of the sample graph, where rank(f) is the complexity of the world that we want to assume. This would increase the size of the subgraph for which a rank(f) = k assumption would still yield valid model validation.

Hence, both objectives are based on the k-core condition and attack it from different angles: increasing the minimal complexity that we can assume for the entire graph or increasing the size of the valid subgraph for a given complexity. Moreover, both objectives can be computed from the known sample graph (doing this efficiently is non-trivial though).
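
As a hypothetical sketch of how such targets could be read off a sample graph (toy graph, node names, and k chosen only for illustration; this is not the paper's algorithm), one could compute core numbers with networkx and flag nodes just below the desired threshold:

```python
import networkx as nx

# Hypothetical bipartite user-item sample graph; edges are observed interactions.
G = nx.Graph([
    ("u1", "i1"), ("u1", "i2"), ("u2", "i1"), ("u2", "i2"),
    ("u3", "i2"), ("u3", "i3"), ("u4", "i3"),
])

k = 2  # stands in for the assumed world complexity rank(f)

core_numbers = nx.core_number(G)   # largest c such that the node lies in the c-core
valid_subgraph = nx.k_core(G, k=k)
print("nodes already inside the k-core:", sorted(valid_subgraph.nodes))

# Objective b): nodes just below the threshold are natural targets for additional
# data collection, since a few extra observed interactions could pull them into the k-core.
candidates = sorted(n for n, c in core_numbers.items() if c == k - 1)
print("candidates for targeted collection:", candidates)
```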

I thank the reviewers for highlighting the issue and agree that it is not very clear from the current write-up. Unfortunately, when fitting the content within the page limit, I missed that the discussion of this aspect had suffered. I will include the above discussion in improved form in the updated manuscript.

In this context, the argument for participatory methods stems from the sheer amount of data that would need to be collected in a targeted way according to the scheme above. While we are currently preparing a paper for submission that introduces a novel method for this purpose, I believe that it is beyond the scope of this paper to also cover this aspect in detail (e.g., it would need to introduce additional concepts related to mechanism design, game design, economics, and the efficient computation of the k-core objectives). However, I will add a further discussion of why participatory data collection is favorable for large amounts of data plus targeted collection.

Lack of justification versus impossibility results

Reviewers also raised questions with regard to the significance of the theoretical results relative to known issues and results (e.g., known lack of performance guarantees, known limitations of observational data).

The key difference between these known issues/results and this work is the difference between a lack of justification (known) and rigorous impossibility results (the novel contribution of this work). For instance, while it is clear that standard learning theory does not apply under OOD settings, sampling bias, etc., it is not clear that specific methods and practices are not valid. It just means we have no justification for them.

In contrast, the impossibility results in this work are much stronger. They show that there cannot be any method that leads to valid results in this setting, even for methods where this is not obvious at all such as counterfactual estimators. These results are also especially important in the context of scaling, which in most cases is exactly passive data collection from complex social systems and which is the dominant approach today. The results of this paper establish rigorous limits of this approach and show that alternative approaches are needed (such as the k-core approach discussed above).

Another way to look at it is in terms of necessary versus sufficient conditions. Sufficient conditions, which standard learning theory is often based on, provide insights into a specific, narrow case. They are important to motivate the validity of a specific method but do not say much when they are violated. On the other hand, this work establishes necessary conditions, which have to hold in every case. Since they exclude a large set of hypotheses that would otherwise have to be explored, they provide important insights when the path forward is not entirely clear. This is exactly the case for evaluation in modern AI and why I believe the results of this paper are badly needed.

Again, I would like to thank the reviewers for highlighting this question. I agree that the discussion of this aspect is currently not optimal and will improve it along the above lines in the updated paper.

Additional results

Per request of RViy, I have also attached a PDF with experimental results for additional settings on widely used datasets, relating the paper's results to reasoning (FB15k-237), graph link prediction (Cora) and extreme classification (Wiki10-31k). Please see also the response to RViy for further details.

Comment

Dear Reviewers,

I sincerely appreciate the time and effort you've dedicated to reviewing and responding, as well as the overall positive feedback that the submission has received. Your insightful questions, comments, and suggestions have been highly valuable for improving this work.

As the rebuttal period draws to a close, I wanted to check in and ensure that my responses have effectively addressed your concerns. I have made every effort to thoroughly respond to your comments, providing clarifications, detailed discussions, and additional experimental results as outlined in the general and individual responses.

If my responses have satisfactorily addressed your concerns, I would greatly appreciate your consideration of an increased score, especially since the additional clarifications can easily be incorporated. If you have any further questions or require further clarifications, please don't hesitate to reach out. I look forward to continuing the discussion with you.

Thank you once again for your time and consideration.

Final Decision

The review team agrees that the paper has a clear and rigorous message through an impossibility result: for key tasks in modern AI, we cannot know whether models are valid under current data collection practices. While this conclusion might seem less surprising in hindsight, a theoretical formalization still has a lot of value. Although the focus is mainly theoretical, the reviewers appreciated that the paper used the MovieLens dataset to illustrate the theoretical points.

As several reviewers suggested, the paper would further benefit from better connecting its insights to practice, e.g., making the presentation of the paper more accessible to practitioners and discussing in greater depth how the suggested remedies can be practically implemented in real-world scenarios. Adding more case studies beyond the MovieLens dataset would also help better illustrate the main message across different AI systems/models.

Overall, the review team, including myself, views the theoretical contribution of the paper positively. I believe the above concerns are minor and can be sufficiently addressed in a revision.