PaperHub

On convex decision regions in deep network representations

ICLR 2024 · Decision: Rejected · 4 reviewers
Ratings: 8, 6, 5, 5 (mean 6.0/10 · min 5 · max 8 · std dev 1.2)
Mean confidence: 3.0
Submitted: 2023-09-22 · Updated: 2024-02-11
OpenReview · PDF

Abstract

Keywords

explainability, Transformers, convexity

Reviews and Discussion

Review (Rating: 8)

This paper investigates the hypothesis that the complexity of a concept is linked to its generalizability in machine learning models. This is significant for bridging the worlds of machine learning and human psychology since it is hypothesized (Gardenfors 2014) that human concepts are geometrically convex. The paper defines a generalized notion of convexity for non-Euclidean spaces using graph geodesics, and investigates the convexity of various categories within different modalities of data, both before and after fine-tuning on human annotations. Results suggest that convexity of human concepts does indeed increase after fine-tuning and that convexity may be linked to predictive performance.

Note: I previously reviewed this manuscript, and I checked the pdf to see whether the authors made any revisions since then.

Strengths

The motivation of the paper is relatively well-explained given the novelty of the question. The measure of convexity appears to be novel, and a variety of experiments are conducted. The analysis of the experiments appears sound. The variety of modalities investigated (images, audio, text, motion data) is impressive. I did not check the proofs in detail, but the overall proof strategy makes sense to me.

Weaknesses

While the concept of graph convexity is novel and interesting, it is not clear how much of a contribution it provides to the field, since there may be existing concepts (such as linear separability) which may or may not achieve the same goal in terms of characterizing representations.

EDIT: A remaining weakness, as pointed out by other reviewers, is the lack of practical applications to improve the performance of neural networks.

Questions

How does the measure of graph convexity give more insight in practical cases than an existing method for measuring linear separability (https://arxiv.org/pdf/2307.13962.pdf)?

Comment

Dear reviewer,

Thank you for your review efforts and for the kind comments on the novelty of the research question and the variety of experiments. We appreciate the comments on possible improvements and the questions raised.

This version of the paper is quite different from the one you reviewed last time, but the review does not seem to reflect this: for example, your summary is exactly the same even though the content of the paper has changed. One of the changes is the inclusion of Euclidean convexity and a comparison of its results with the geodesic one.

We discuss the relation of our approach to linear probing in the second half of page 4. Linear decision regions are Euclidean convex but the reverse is not true. Euclidean convexity is in many cases different from graph convexity (we present some of the examples in Figure 2). Therefore, by studying both the graph and Euclidean convexities, we get a more complete picture of the representations.
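For concreteness, the geodesic notion discussed here can be sketched in code: build a k-nearest-neighbour graph over sampled representations and measure how often shortest paths between same-class points stay inside the class. This is only our reading of the graph-convexity score; the function name `graph_convexity` and the estimator details are our assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def graph_convexity(X, labels, target, k=5):
    """Fraction of interior nodes, on shortest k-NN-graph paths between
    same-class pairs, that also belong to the class (sketch; the paper's
    estimator may differ in details)."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d[i])[1:k + 1]:      # k nearest neighbours of i
            W[i, j] = W[j, i] = d[i, j]          # symmetrised k-NN graph
    dist, pred = dijkstra(csr_matrix(W), directed=False,
                          return_predecessors=True)
    idx = np.flatnonzero(labels == target)
    inside = total = 0
    for a in idx:
        for b in idx:
            if a >= b or np.isinf(dist[a, b]):
                continue
            node = pred[a, b]                    # walk the path back from b
            while node != a and node >= 0:
                inside += labels[node] == target
                total += 1
                node = pred[a, node]
    return inside / total if total else 1.0
```

On two well-separated clusters the score is 1.0, since geodesics never leave the class; lower values indicate paths detouring through other classes.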

Comment

Thanks for pointing out the new additions to the paper, which I missed in my quick scan. I find the discussion of linear probing on page 4 and Figure 2, showing the difference between Euclidean and graph convexity, quite helpful. Thus, I am happy to raise my score from 5 to 7.

Comment

I have raised my score to 8. However, since I do not think the paper is free of weaknesses, I have added another one. Reading the other reviews, I tend to agree that the paper would be stronger if it could suggest a few ways to use the insights provided by this tool to improve the performance of neural networks in practical applications.

Review (Rating: 6)

The authors delve into the exploration of the convexity of machine learning latent spaces, developing tools to measure the convexity in sampled data. The paper is motivated by the observation that representational spaces in the human brain frequently form in convex regions. The authors aim to showcase that state-of-the-art deep learning models also exhibit convexity in decision regions. Experiments spanning multiple domains such as images, human activity, audio, text, and medical imaging are conducted. The findings suggest that neural networks with higher performance tend to exhibit greater convexity in their decision regions.

Strengths

  • The paper is clear and easy to understand.
  • The idea of linking human thinking and computer learning is interesting and could be important for understanding the mind and contributing to cognitive science.
  • The experimental analysis in the paper is comprehensive and varied.

Weaknesses

  • The concept of measuring convexity in NNs appears to be somewhat circular. It doesn’t seem like the convexity in NNs spontaneously emerges from the data or is an unexpected outcome; rather, it seems intentionally designed. Take the "graph convexity" prominently featured in the paper, for example. If we view the final softmax layer as the classifier (setting the decision boundary) and the feature vector inputted into this final layer as a data point on a manifold, it's almost always a given for a 100% accurate NN to possess a graph-convex decision boundary, owing to the intrinsic convexity of the softmax function. Essentially, the core function of the feature extractor is to map the data onto a (potentially) high-dimensional space where a simple linear or convex layer like the softmax can efficiently segregate data with varied labels, a strategy evidently seen in methods like SVM. Hence, the design of convexity seems more of an intentional architectural choice rather than a novel, data-driven emergence.

  • The methodology used in constructing graph estimates, a critical element in the paper, seems somewhat imprecise. In Defn. 3 on page 3, graphs are formulated based on Euclidean nearest neighbors, referencing the "local Euclidean"-ness of the manifold. This representation appears somewhat misconstrued. Manifolds are locally Euclidean within the subspace topology, not the standard topology in R^n. More explicitly, the "locally Euclidean" nature implies the presence of a continuous map between the manifold's local subspace and R^n, but it does not inherently suggest that a graph formed by Euclidean nearest neighbors will replicate a structure "similar" to the manifold's, since this neighbor determination occurs in the full space R^n rather than in a subspace. The proposed graph construction method is vulnerable; for example, it will transform an open ring with closely positioned endpoints into a closed ring. I believe this is also the major reason why kNN performs better than epsilon-neighbors, as the former is more robust in handling these tricky topology cases. The paper could benefit from refinement in these foundational aspects to ensure mathematical precision and conceptual clarity.

  • The paper, dense with detailed proofs and extensive explanations, feels somewhat overloaded. Elements like Theorems 2 and 3 seem somewhat simple, potentially even to readers with an early graduate-level understanding. A more streamlined approach, possibly referencing standard convex optimization textbooks for such foundational concepts, could enhance the paper's readability and focus.

  • The connection between human representation and NN convexity isn’t entirely lucid. While there’s a notable reference to numerous papers on human representation in the introduction, the subsequent sections don’t seem to elucidate a substantial link beyond a generalized notion of convexity. It leaves room for doubt regarding whether the human-related assertions are primarily theoretical embellishments or if they hold significant, intrinsic connections and inspirations.

  • The experiments predominantly focus on the evaluations of convexity. An expansion of the experimental design to include algorithmic innovations that utilize these analyzed insights to enhance training and learning processes could be quite valuable. Questions like whether explicitly incorporating convex boundaries during NN training could expedite convergence remain unexplored. Without such practical explorations and demonstrations, the applicability and impact of these observations seem somewhat limited.

Questions

See the weaknesses.

Comment

Thank you for the review and useful comments!

Thank you for the comment:

"The concept of measuring convexity in NNs appears to be somewhat circular. It doesn’t seem like the convexity in NNs spontaneously emerges from the data or is an unexpected outcome; rather, it seems intentionally designed."

Author reply: Convexity is not built into NNs by design! As we note in the paper, the last layer is convex by the convexity of softmax. We also argue that the internal attention/softmax layers can induce convexity, and so can ReLU layers. However, they can also lead to much more complex decision regions, so by design there is a possibility of convexity, not a necessity. Testing the hypothesis, we find experimentally that convexity is quite pervasive in the networks and that it increases with label-based fine-tuning. This is highly non-trivial and aligned with findings in human learning.

We are grateful for your comment:

"The methodology used in constructing graph estimates, a critical element in the paper, seems somewhat imprecise."

Author reply: We introduce new tools for analyzing NNs in the paper, and surely these methods are only a first step. We think the Euclidean convexity is straightforward using the definition, but the actual estimators may later be refined. For the graph convexity, we were very much inspired by the workflow developed originally for the ISOMAP algorithm. This entails an assumption that the geometry can be reconstructed from a graph based on Euclidean neighbors. We will change the wording and delete the "intuition" about local flatness. We are happy to include relevant references suggested by the reviewer (beyond the ISOMAP technical note (Bernstein et al., 2000) we currently cite).
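The reviewer's open-ring failure case can be reproduced in a few lines: when the endpoints of an open curve sit closer in R^2 than consecutive samples along the curve, any epsilon large enough to connect the curve also bridges the gap. The construction below is our own illustration, not an example from the paper.

```python
import numpy as np

# An open ring: 60 samples on a circle with a small gap near angle 0.
# The two endpoints are ~0.0999 apart across the gap, while consecutive
# samples along the curve are ~0.1048 apart, so the gap is the shortest link.
t = np.linspace(0.05, 2 * np.pi - 0.05, 60)
X = np.c_[np.cos(t), np.sin(t)]

d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
step = d[0, 1]        # consecutive-sample distance along the curve
gap = d[0, 59]        # distance across the opening
assert gap < step     # the opening is shorter than the sampling step

# The smallest epsilon that connects the curve also adds the shortcut edge:
eps = step * 1.01
edges = {(i, j) for i in range(60) for j in range(i + 1, 60) if d[i, j] <= eps}
print((0, 59) in edges)   # → True: the open ring becomes a closed ring
```

When the endpoints are literally each other's nearest neighbours, as here, any Euclidean-neighbor graph bridges the gap, which is exactly the topological fragility the reviewer points out.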

We appreciate your comment:

"The paper, dense with detailed proofs and extensive explanations, feels somewhat overloaded."

Author reply: Please note that all proofs were relegated to the appendix. We agree that the paper introduces several new concepts, as is needed when proposing a conceptually new way of analysing NNs.

Thank you for the comment:

"The connection between human representation and NN convexity isn’t entirely lucid."

Author reply: The cognitive psychology literature suggests that convexity enables generalization as we detail in the introduction. In our work and conclusion, we find a similar result for NNs. We claim this is highly non-trivial and hope that you agree this is a worthy ICLR contribution.

We appreciate your comment:

"The experiments predominantly focus on the evaluations of convexity. An expansion of the experimental design to include algorithmic innovations that utilize these analysed insights to enhance training and learning processes could be quite valuable. Questions like whether explicitly incorporating convex boundaries during NN training could expedite convergence remain unexplored. Without such practical explorations and demonstrations, the applicability and impact of these observations seem somewhat limited."

Author reply: Our contribution is mainly conceptual, and we maintain that ICLR can present work that is not the engineering of new architectures and algorithms. We ask a fundamental scientific question ("Are NN decision regions convex?") and we find significant evidence in favor of the hypothesis. Also, we find that convexity is linked to generalization. We hope you will consider raising your grade to reflect the value of these findings.

Comment

Thanks for the authors' reply and clarification. Regarding convexity, I am convinced that it does not emerge from manual design, but I would suggest the authors add more content (specifically the replies to the review above) for clarity.

Regarding bullet points 2-3, I suggest the authors revise the manuscript to make the paper precise.

I still think the lack of new algorithms inspired by convexity is a weakness of the paper. Even a simple experiment, such as adding a convexity term to the loss function and showing that a model trained with it outperforms one trained without it, could demonstrate the practicality of the discovery.

Based on the above, I raised my score to 5.

Review (Rating: 5)

Motivated by insights from geometric psychology and neuroscience, this paper proposes measuring the generalization ability of models based on the convexity of decision regions. To achieve this, the paper investigates the learned latent space of various models and measures their convexity using Euclidean convexity and graph convexity. Experimental results demonstrate that the convexity of different models generally increases layer by layer, suggesting that deeper features exhibit better generalization ability. Additionally, the paper establishes a strong correlation between the convexity of models and their fine-tuned accuracy.

Strengths

The investigation of model convexity is intriguing.

Weaknesses

  1. The organization of the paper's content is unreasonable. For instance, the introduction section dedicates substantial space to discussing the relationship between convexity and generalization, while neglecting to mention the specific objectives of the paper. This section should be reorganized by moving background information to the related work or appendix while focusing on introducing the main content and contribution of the paper.

  2. The experimental setup in the paper has significant issues. To demonstrate the claimed relationship between convexity and generalization, the paper should provide comparisons of convexity and performance among multiple different models within the same domain, thus proving that convexity indeed facilitates generalization. However, the paper only conducts experiments with a single model for each modality, which fails to reveal the impact of convexity on generalization.

  3. This paper fails to provide any meaningful conclusions. Despite emphasizing the relationship between convexity and generalization, as mentioned above, the existing experiments fail to demonstrate this point or draw any valuable conclusions, rendering the contributions of the paper trivial.

  4. Some experiments raise questions. For example, Figure 3 employs t-SNE to visualize the feature space, but due to the inherent instability of t-SNE, this experiment lacks persuasiveness. In Figure 5, why are there multiple points for each modality? And why is the vertical axis using recall rate instead of accuracy?

  5. Although the authors mention that one of their contributions is proving the stability of convexity to relevant latent space re-parametrization and provide a formal proof in Appendix A, I was unable to find it. Appendix A only mentions the robustness to affine transformations, which is different from re-parametrization. Furthermore, in 2.1.3, there is an anonymous citation. From the context, it seems that the authors may have inadvertently revealed personal information in this citation.

Questions

Please refer to the weaknesses.

Comment

Dear reviewer Rwrg,

Thank you for taking the time to review our paper. We are glad to hear that you find the investigation of model convexity intriguing. We would like to address your questions and comments regarding our paper:

  1. We believe that an integral part of the paper is bridging the gap between the notion of convexity in human and machine-learned representations - a topic that, to our knowledge, has not been investigated before. The motivation for our research is grounded in extensive research within the field of cognitive science, and we believe it necessary to introduce the new vocabulary to properly motivate our research question, namely “Are generalizable, grounded decision regions implemented as convex regions in machine-learned representations?” Our contributions are stated on page 2 at the end of the introduction.

  2. This is indeed a good point and something we will definitely investigate in future work. We have prioritised demonstration of our method across multiple different domains to ensure maximal usability of our work across different subfields of machine learning.

  3. Our most important conclusion is that “we find evidence that the higher convexity of a class decision region after pretraining is associated with the higher level of recognition of the given class after fine-tuning in line with the observations made in cognitive systems, that convexity supports few-shot learning.” Additionally, we hope that this new approach will inspire future investigations of machine learned representations rooted in properties known to be found in human learned representation spaces.

  4. Experiments

    a. t-SNE is indeed an unstable method for dimensionality reduction. We would like to emphasise that the method is deployed only for visualisation purposes, to give an intuition into the organisation of the latent space. No conclusions are drawn based on the t-SNE plots.

    b. We apologise for the lack of clarity in the description of the figure. We will update the description to make it more clear. In the figure, we are looking at the recall rate for each individual class in the fine-tuned model and plotting it vs. the convexity of this class in the pretrained model. The number of points for each model is therefore equal to the number of classes in the domain under investigation for said model.

  5. Regarding the proof in Appendix A on the stability of convexity to relevant latent space re-parametrization: we mean to say that the convexity measures are invariant to relevant re-parametrizations of the latent space induced by weight initialisations and other sources of randomness in training. We provide proof that the Euclidean convexity measure is invariant to affine transformations and that the graph convexity measure is invariant to isometries and scaling. We will update the text to “relevant latent space transformations”.

We hope these clarifications have shed some light on our motivations and approach behind the paper and that you will consider raising your score.
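The graph-measure invariance claimed above can also be checked numerically: a k-NN graph (and hence any score computed from it) is unchanged by an isometry followed by uniform scaling, since all pairwise distances are multiplied by the same constant. A minimal sketch (our construction, not the appendix proof):

```python
import numpy as np

def knn_sets(X, k=3):
    """Neighbour sets of a k-NN graph; any graph-based convexity score is a
    function of these sets (plus edge-length ratios, also scale-invariant)."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    return [frozenset(np.argsort(d[i])[1:k + 1].tolist())
            for i in range(len(X))]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix
Y = 2.5 * X @ Q                                # isometry + uniform scaling

assert knn_sets(X) == knn_sets(Y)              # graph, hence score, unchanged
```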

Comment

I have read the author's response and the comments of the other reviewers. Unfortunately, the authors did not add the multiple models experiments that I mentioned and the paper is poorly presented (e.g., the contributions listed in the introduction are in lowercase, and there is overlapping text in the paragraph above the image domain on page 6 of the article), so I'm still inclined to reject the paper. I think this paper needs more preparation to meet ICLR publication standards.

Comment

We very much appreciate the input and concerns of the reviewer:

First, we apologize for the typographical mistakes, they have been corrected!

Second, following your suggestion we have performed a set of new experiments. In particular, we have added results for two more models for the human activity domain and for the text domain. The findings are consistent with our earlier results: Higher pre-trained convexity is associated with improved downstream accuracy. The human activity data shows a major effect, while the text data models show more modest differences (see new appendix E).

We hope you find these results as interesting as we do and that you will consider raising your score.

Review (Rating: 5)

The paper presents a property of recent neural networks that the authors found: convexity emerges everywhere in the layers of neural networks. The paper develops a tool for this purpose based on graph and Euclidean methods, and studies models pre-trained on multiple domains.

Strengths

In recent years, large language models have been demonstrating huge potential to realize AI while humans do not clearly understand how they work, which makes papers that aim to interpret deep neural networks all the more important. I like the paper's idea of demystifying the learned representations in terms of convexity.

The paper also derives a set of tools from convexity theory that make it possible to analyze the convexity properties learned inside neural networks.

Additionally, the paper analyzes neural networks in different domains, which makes its conclusions about neural networks more convincing.

Weaknesses

I have concerns about the paper. Before I lay out the list of weaknesses, I would like to mention that I am not an expert in this domain, so my suggestions and comments are possibly incorrect.

Is the conclusion of the paper about neural networks new? Scientists developing artificial neural networks have been trying to follow conclusions, such as convexity, reached by neuroscientists. I think we, as developers of artificial neural networks, would naturally consider this property by design, and the conclusion is thus obvious. So when the paper claims that pre-trained neural networks have emergent convexity, I am not very surprised. I think the paper might want to clarify whether the conclusion is different from previous ones; in other words, I suggest the paper justify that it contributes some new, paper-worthy observations/conclusions.

How can this conclusion be used to improve neural networks? It is interesting to know the convexity property of learned representations, but, as a researcher in computer vision, I am not sure how this can be used --- e.g., to improve the model. I would suggest the authors find a proper application. For example, the work could inject the conclusion into some other neural network to improve its explainability or generalizability, etc.

The paper does not analyze the convexity of neural networks that might be sub-optimal. It might also be interesting to analyze sub-optimal (shallow/smaller) networks. From my perspective, this could give an idea of how much convexity is enforced in different neural networks.

In short, while the paper presents a tool to analyze the convexity of learned representations, it is hard to know whether the conclusions can inspire future work: is this a new observation for neural networks, or can the conclusion be applied to improve neural networks? Therefore, I tend to reject this paper.

But again, I am not an expert in this domain, so I would be happy to hear back from the authors during the rebuttal in case I misunderstand anything.

Questions

Please address the questions that I raised above.

Comment

Dear reviewer,

Thank you for taking the time to read our paper. We appreciate that you find the idea of analyzing representations of neural networks using convexity measures interesting.

To the best of our knowledge, no previous work showed both graph and Euclidean convexity of learned representations, as well as their relation to generalization. Euclidean convexity is only enforced in the last layer due to the softmax, but neither graph convexity nor Euclidean convexity in earlier layers is enforced by design (see section 1.3, bottom of page 4, for a detailed discussion of “linear regions”). Previous work has investigated the properties of linear decision regions in latent spaces, which is related to our Euclidean convexity measure; however, graph convexity and its relation to the generalization of the neural network are novel.

Regarding the improvement of neural networks, we would like to highlight here that this analysis is scientific, not engineering. Nevertheless, we believe our work has various practical applications in the field of XAI. For example, it might be used to evaluate the pretraining process of a neural network and to give an estimate of its generalizability. Furthermore, the properties of the convex regions formed by the network might give valuable insights into its decision-making.

Analyzing smaller networks is an interesting idea. However, these are usually not trained with self-supervised learning to learn representations of the data, so they would not fit in our framework of comparing pretrained and fine-tuned models. Future work could focus on exploring the convexity in various architectures and comparing what role the pretraining plays in the classifiers (compared to only supervised training).

Comment

I appreciate the authors' detailed feedback, but I am not sure I am fully convinced, due to the missing practical use of the theoretical analysis.

Comment

We very much appreciate the input and concerns of the reviewer: You asked in your review “How can this conclusion be used to improve neural networks?”. In our revised draft we have included a preliminary experiment: we have fine-tuned one of our networks using a convexity-promoting strategy. The experiments show that Euclidean convexity is indeed increased and that this has a positive effect on performance (see the new appendix D). We believe that this opens interesting new avenues for future constructive research. In a second set of new experiments, we have added results for two additional models for the human activity and text domains. The findings are consistent with our earlier results: higher pre-trained convexity is associated with improved downstream accuracy. The human activity data shows a major effect, while the text data models show more modest differences (see the new appendix E).

We hope you find these results as interesting as we do and that you will consider raising your score.

Comment

We thank all our reviewers f5N3, NRVi, nfkX and Rwrg for suggesting that the work will gain importance if "convexity promoting" strategies can help engineer better neural networks.

While the conceptual dimension of the paper is our main message, we were intrigued by the challenge proposed by the reviewers. Thus, we present an early result for one such scheme, namely a data augmentation strategy inspired by MixUp [1]. Under this scheme, we fine-tune with convexity-promoting data augmentation that creates augmented data on the line connecting any two points of the same class, within a given latent embedding layer. Points and layers are randomly sampled. Since we use actual labels to define the augmented data, in its present form the scheme applies to the fine-tuning type of experiment (here, as an example, we choose the human activity domain). Our proposed method increases the convexity of the decision regions as well as the recall for three out of four classes.
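A minimal sketch of such a same-class interpolation scheme might look as follows; the function name and sampling details are our assumptions (the scheme described above also samples random layers), not the exact implementation in appendix D.

```python
import numpy as np

def convexity_augment(Z, y, n_aug, rng):
    """Sample augmented latent points on segments between random same-class
    pairs, labelled with the shared class (sketch of a MixUp-inspired,
    same-class-only augmentation; details are our assumption)."""
    classes = np.unique(y)
    Z_aug, y_aug = [], []
    for _ in range(n_aug):
        c = rng.choice(classes)                  # pick a class
        idx = np.flatnonzero(y == c)
        a, b = rng.choice(idx, size=2)           # pick two of its points
        lam = rng.uniform()                      # position along the segment
        Z_aug.append(lam * Z[a] + (1.0 - lam) * Z[b])
        y_aug.append(c)
    return np.stack(Z_aug), np.array(y_aug)
```

Because every augmented point is a convex combination of two same-class embeddings, training on these points directly penalises non-convex (same-class) decision regions in the chosen layer.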

The experiment and the early supporting results are presented in the new appendix D. Furthermore, we add a “perspectives paragraph” to the Conclusion section, stating the results of fine-tuning with convexity-promoting data augmentation. Finally, we have updated relevant parts of the paper to include clarifications to questions raised by the reviewers.

[1] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). Mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

Comment

We thank the reviewers again for their efforts and discussions so far!

We have uploaded a new pdf with clearly marked changes following your feedback (replacing the previous version as the diff module seemed unstable).

We are really grateful for the suggestion to improve the paper by introducing "convexity promoting" strategies and help engineer better networks. The experiment shows that Euclidean convexity is indeed increased and this has a positive effect on performance (see the new appendix D). The new results based on data augmentation are clearly just an indication. Yet, the potential is there and we will look for new tools to promote convexity moving forward.

Comment

Thanks for the authors' rebuttal, the new experiment looks great to me. I have updated my score to 6.

Comment

The experiments with encouraging convexity are interesting.

AC Meta-Review

The submission explores the relationship between convexity in a latent space and the accuracy of a model using that latent space. The main conclusion is that increased convexity is correlated with improved accuracy. The submission received mixed reviews, with two reviewers being less enthusiastic and two recommending acceptance. Concretely, on the positive side, the reviewers were generally intrigued by the potential relationship between convexity and generalization. On the negative side, concerns were raised about the logical inference of a direct relationship between convexity and generalization, and about whether convexity, as opposed to linear separability more generally (which implies convexity), is the reason for improved performance. The conclusions may also be related to information bottleneck theory, which has prominently been studied at ICLR but is not mentioned in this submission. The conclusions are essentially empirically motivated, with the primary technical content of the submission relating to definitions of convexity and their generalizations to non-Euclidean notions. The "theory" appendix A is appealed to in various places to give intuitions, but it does not include a formal proof of a relationship between convexity and classification, although convexity (implied by linear separability) is required at the last layer to achieve zero error.

Why not a higher rating

The submission is interesting but seemingly incomplete. The experiments are not sufficiently large-scale, across multiple models, to make a general claim, and no theoretical generalization bounds are shown.

Why not a lower rating

N/A

Final Decision

Reject