Polynomial Width is Sufficient for Set Representation with High-dimensional Features
We prove that a polynomial dimension of the embedding space is sufficient for DeepSets-like architectures to represent any continuous permutation-invariant function of high-dimensional features.
Abstract
Reviews and Discussion
The paper considers the representational power of the permutation-invariant DeepSets architecture, which maps each input element to a feature vector via $\phi$, computes a sum of these features, and then computes an output with $\rho$. Past results have shown that the DeepSets architecture can represent any permutation-invariant continuous function in the scalar case $D = 1$ as long as the embedding dimension is at least $N$, and that exponential width is necessary under certain assumptions requiring $\phi$ and $\rho$ to be implementable with analytic activation functions. This work shows that, without these activation restrictions, permutation-invariant continuous functions can be exactly represented using an embedding dimension polynomial in $N$ and $D$.
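For concreteness, here is a minimal numerical sketch of the sum-pooled form described above; the particular $\phi$ and $\rho$ are arbitrary placeholders rather than the paper's constructions, and the snippet only demonstrates the permutation invariance of the architecture:

```python
import numpy as np

# Illustrative DeepSets form rho(sum_n phi(x_n)); phi and rho are random
# placeholder maps, not the constructions from the paper.
rng = np.random.default_rng(0)
N, D, L = 6, 3, 10
W1 = rng.standard_normal((D, L))
w2 = rng.standard_normal(L)

def phi(x):                      # per-element feature map, R^D -> R^L
    return np.tanh(x @ W1)

def rho(z):                      # readout on the pooled feature, R^L -> R
    return float(z @ w2)

def deepsets(X):                 # X has shape (N, D); its rows form the input set
    return rho(phi(X).sum(axis=0))

X = rng.standard_normal((N, D))
perm = rng.permutation(N)
print(np.isclose(deepsets(X), deepsets(X[perm])))   # True: row order is irrelevant
```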
They do so by introducing a pair of positive results in Theorem 3.1, one using a power-mapping activation (where each input is mapped to its powers up to degree $N$) and another using an exponential activation. The constructions use a similar proof structure: both involve showing that the sum-pooled feature map $X \mapsto \sum_n \phi(x^{(n)})$ is continuous and invertible (up to permutation), which makes it possible to invert this mapping inside $\rho$ and then apply the permutation-invariant function directly. (Note that this means that, while $\phi$ is a simple construction that can be made explicit, the inversion used in $\rho$ is an existential result that may correspond to a highly non-smooth function.) The feature mapping $\phi$ is constructed carefully to ensure that inputs that are channel-wise permutation-equivalent but not row-wise permutation-equivalent map to different pooled features; the authors do so by using anchor vectors to distinguish each element. The remainder of the construction is dedicated to ensuring that some anchor exists for every possible input and that the desired constraints are enforced.
They contextualize their results by including several negative results to illuminate their design decisions, including Lemma 3.4 (why $\phi$ and $\rho$ must be continuous) and Theorem 4.5 (why a certain minimum number of candidate anchor vectors is required for every input to have an anchor). They run brief experiments in Appendix K that train DeepSets models to compute lexicographical medians, and find that the necessary width grows polynomially in $N$ and $D$.
Strengths
The contrast to the Zweig and Bruna lower bound is of theoretical interest, since it shows that requiring analytic activation functions changes the minimum width cost from polynomial in $N$ and $D$ to exponential. The work, coupled with its experimental results, suggests that an impressive amount of information can be encoded in sums of relatively low-dimensional feature embeddings, and that the functions that cannot be represented by standard DeepSets architectures are likely highly pathological or discontinuous.
The results are technically impressive, and I found no major errors in the proofs. In particular, the LP and LE constructions were clever, meticulous, and well-visualized by figures.
Weaknesses
While I don't expect this paper to solve the problem, the non-explicitness of the construction means that $\rho$ is likely to be a highly non-smooth function that is difficult to compactly approximate using a neural network architecture. In future work, I'd be interested in understanding whether feature dimensions that are larger polynomials in $N$ and $D$ make it possible to yield explicit (and ideally smoother) mappings. Perhaps the work could discuss possible approaches to bounding the smoothness of inversion functions? Or perhaps the authors can discuss which kinds of permutation-invariant functions are expected to have smooth features?
Minor issues
- I was momentarily confused by the conditions on $L$ in Theorem 3.1. At first, I thought $L$ could not be larger than the stated quantity, rather than understanding that quantity as an upper bound on the smallest sufficient $L$. Perhaps the lower bounds could be mentioned separately, for the sake of clarity?
- Page 17 says "pigeon-hold" instead of "pigeon-hole."
- Lemma E.1 is much easier to parse after noting Remark E.2. Perhaps the remark could be included before the proof, or maybe the lemma could just define the values explicitly as prime numbers, since full generality in their choice is not necessary for any of the proofs?
Questions
N/A
We thank Reviewer oY1k for the detailed summary of our results and for acknowledging their theoretical significance. Regarding your questions, please see our response below:
1. Brief discussion on $\rho$.
Fine-grained analysis of $\rho$ remains a widely open question in this area. We agree with Reviewer oY1k that $\rho$ is highly non-smooth and hard to learn [1]. Nevertheless, we hypothesize that modern ML algorithms can approximate the non-smooth inversion of the sum-pooling well. This argument is also supported by our experiments in Appendix K. We suspect that the non-smooth points of a generic instance are often located in regimes of no practical interest, meaning the worst case may not occur in practical scenarios. We also agree that a rigorous study demystifying the mapping $\rho$ is needed, and the suggested direction of bounding the smoothness of the inversion function is of high interest to explore in the future. Our results can be easily extended to equivariant functions, as a foundation to form multi-layer set functions. Under this scenario, $\rho$ may be simplified.
[1] Murphy et al., Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs
2. Presentation issues.
We have revised our introduction to clarify that the bound on $L$ is an upper bound on the smallest sufficient $L$, and fixed all the typos pointed out.
3. Questions on Lemma E.1.
It is a good point that the values in Lemma E.1 can be chosen as prime numbers. However, we would remark that non-divisibility is the crucial property that makes Lemma E.1 hold, which allows any other choice of values satisfying the condition in Lemma E.1.
I thank the authors for their detailed responses. I maintain my score and believe that the contributions are substantive and technically interesting.
This work concerns the expressive power of the DeepSets architecture [Zaheer et al., 2017], which is designed to model permutation invariant functions taking as input sets whose elements are real vectors. The basic parameters relevant for the model are N and D; N denotes the size of the input set and D denotes the dimension of its elements. The novelty of this work is in proving a poly(N,D) upper bound on the latent dimension sufficient for DeepSets to express any permutation-invariant function. Previously known bounds were exponential in the relevant parameters N and D.
To elaborate, the DeepSets architecture is of the form $f(X) = \rho\big(\sum_{n=1}^{N}\phi(x^{(n)})\big)$, where $\phi$ is a feature map for the set elements $x^{(n)}$ and $\rho$ maps the pooled feature to the output; both are chosen by the model designer. Writing $\Phi(X) = \sum_{n=1}^{N}\phi(x^{(n)})$, we can view $\Phi$ as a "sum-pooled" feature map for the input X. A key question for the practitioner is "How large should I set the latent dimension L to model any permutation-invariant target function?" This work shows that L = poly(N, D) is both sufficient and necessary (the bounds are not tight, however). Previously known upper bounds on L were exponential in N and D; thus this work represents a significant improvement over the previous state of the art in theory.
Strengths
The paper presents novel ideas in designing the feature map $\phi$ to overcome limitations of previous work, which did not generalize beyond the scalar-elements case (i.e., D=1). The combinatorial argument presented in the proof of Lemma 4.4 is particularly nice, though it is somewhat difficult to follow in the current exposition. Overall, I find the proofs insightful and quite surprising (I will elaborate on this below). Hence, I am inclined to accept this paper.
The proof of the main result can be understood in stages. Let $X = [x^{(1)}, \dots, x^{(N)}]^\top$, where the $x^{(n)}$ denote the $D$-dimensional vectors which are elements of the given "set" X. The ultimate goal is to design $\phi$ such that $\sum_n \phi(x^{(n)}) = \sum_n \phi(x'^{(n)})$ implies $X \sim X'$, where $\sim$ means that the two matrices are equivalent up to row permutations (the opposite implication is obvious). The key difficulty is in "surviving" the sum-pooling operation.
Ideas from previous work, such as the degree-$N$ polynomial mapping $\psi_N(z) = (z, z^2, \dots, z^N)$, can be applied entrywise to ensure that equality of the pooled features implies that the channels of X and X' are equivalent individually (each up to its own permutation), which is a weaker implication than $X \sim X'$. This works only in the D=1 case and not for D > 1, since $X \sim X'$ requires the coordinates of the rows to be jointly aligned. To overcome this limitation, the authors propose novel ideas in the design of $\phi$ so that alignment between the coordinates is ensured as well. Personally, I found this issue of coordinate-wise alignment quite challenging and was pleased to see its resolution here.
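A tiny numerical illustration of this alignment issue (the matrices are arbitrary examples, not taken from the paper): two inputs can agree channel-by-channel as multisets while no single row permutation maps one onto the other, so channel-wise injectivity alone is insufficient.

```python
import numpy as np
from itertools import permutations

X = np.array([[1, 2],
              [3, 4]])
Y = np.array([[1, 4],
              [3, 2]])   # each column of Y is a permutation of the matching column of X

cols_match = all(sorted(X[:, j]) == sorted(Y[:, j]) for j in range(X.shape[1]))
rows_match = any(np.array_equal(X[list(p)], Y) for p in permutations(range(X.shape[0])))
print(cols_match, rows_match)   # True False: per-channel equality does not give row alignment
```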
Weaknesses
One shortcoming of this paper is that the exposition is quite hard to follow. It would help this paper reach a wider audience if the authors improved their exposition. I would suggest using less confusing notation and providing a more structured and detailed proof overview. Specific examples include:
- The proof of their main theorem can be presented in stages as follows: "Suppose $\sum_n \phi(x^{(n)}) = \sum_n \phi(x'^{(n)})$. The polynomial mapping ensures that the channels of X and X', when viewed as multi-sets, are equal. The main technical challenge is to ensure that the coordinates of the rows are aligned as well. This step combines two ideas: the 'anchors' and the set of additional linear mappings that ensure pairwise alignment …"
- The linear projections that form the coordinates of $\phi$ are all represented as $w$'s. It would be more informative if different symbols were used for the different "types" of these mappings (standard basis, anchors, $\gamma$'s, …).
- On p. 7, the superscript notation for $w$ is confusing. For the samples, the superscript "(n)" was used to index elements of the set; here it is used to denote different "types" of $w$. As mentioned before, simply using different symbols for the different groups would avoid the use of unnecessary superscripts.
Questions
- One of the key technical proof ideas is showing that there exists some set of real numbers $\lambda_1, \dots, \lambda_K$ such that for any $a, a', b, b' \in \mathbb{R}^N$, if $a$ is a permutation of $a'$, $b$ is a permutation of $b'$, and $a + \lambda_k b$ is a permutation of $a' + \lambda_k b'$ for all $k$, then $[a\ b] \sim [a'\ b']$ (here, we view $[a\ b]$ as "row $a$ stacked on top of row $b$"). Is the idea of using such linear transformations (of the form $a + \lambda b$) to ensure alignment between the coordinates new? Or is this "linear coupling" a well-known fact in mathematics?
We sincerely appreciate Reviewer jrmV's constructive advice to improve our paper's clarity. We have carefully revised the text and adopted a more informative notation system. Please check our updated version.
1. The exposition can be improved.
We have revised our high-level description of the proof sketch in Sec. 4 to enhance clarity. We also improved our notation by using distinct symbols for the canonical basis vectors, the anchor construction, and the linear coupling coefficients, rather than denoting them all by $w$.
2. Question regarding linear coupling.
Sorry for the confusion. $a$ and $b$ here are both column vectors, and we stack them to form an $N \times 2$ matrix. The form of linear coupling may have appeared in other proof techniques, but to the best of our knowledge, this is the first time it is used to constrain the permutation orbits of matrices to ensure alignment.
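To illustrate numerically what such a coupled channel buys (a toy example with arbitrary vectors and a single arbitrary coefficient, not the paper's construction): two pairs of channels that match individually as multisets are exposed as misaligned by the coupled channel $a + \lambda b$.

```python
import numpy as np

def same_multiset(u, v):
    return np.allclose(np.sort(u), np.sort(v))

a,  b  = np.array([1.0, 2.0]), np.array([3.0, 4.0])
a2, b2 = np.array([1.0, 2.0]), np.array([4.0, 3.0])   # channels match as multisets, rows do not

print(same_multiset(a, a2), same_multiset(b, b2))      # True True
lam = 1.0                                              # one arbitrary coupling coefficient
print(same_multiset(a + lam * b, a2 + lam * b2))       # False: the coupled channel detects misalignment
```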
The authors propose a theoretical bound on the size of the embedding dimension of permutation-invariant set functions that is not restricted to a single feature dimension, unlike some previous works.
Strengths
- The problem is very relevant, as many works which utilize permutation invariant functions do not consider the importance of the embedding dimension.
- The derivation in the text appears thorough and rigorous
Weaknesses
- It took me a while to grasp the concept of anchors. In lemma 4.2 I think it needs to be stressed that the same anchor is being applied over all the channels. Although the notation states this, it would be good to state it in plain text as well.
- Under Lemma 4.2 there is a sentence which says "then once each coupled pairs are aligned, two data matrices are globally aligned." Can you elaborate on the precise meaning of this sentence? What does "aligned" and "globally aligned" signify?
- Why is Lemma 4.4 necessary? I do not see the reason for this, and I do not think it is explained well in the text either. At the beginning of page 8, it is stated that "Eq. 5 implies Eq. 6", but neither of these seems to depend on Lemma 4.4 or "Construction point #3", so I am not sure why it is necessary. I think this needs to be explained better in the text.
Questions
- Right before section 2.2, it is stated: "The obtained results for D′ = 1 can also be easily extended to D′ > 1 as otherwise f can be written as [f1 · · · fD′ ]⊤ and each fi has single output feature channel." I assume this is referring to the previous sentence and means "the results obtained for invariance can be extended to equivariance, as.." Is this correct?
I would be curious to hear the authors opinion on the following:
According to this and prior theoretical works, a very large embedding dimension is needed to maintain the universal function approximation ability of permutation invariant functions, however, many practical works which utilize permutation invariant set functions do not use such a large embedding dimension. Therefore, what is the practical takeaway for an implementation which wishes to utilize a permutation invariant function? There seems to be quite a large disconnect between theory and practice on this topic.
Overall, my biggest conflict with this work is that there is no empirical experiment to corroborate the theoretical findings presented in the paper. While I cannot find issue with any of the claims made in the paper, if there is no way to empirically verify the given bounds, then it is quite difficult to understand their practical significance. Therefore, if the authors could provide an experiment or further discussion which illuminates this topic, I would be happy to revisit my current score.
We thank Reviewer a75R for acknowledging our mathematical rigor and the high relevance of our work to the ML community. For your questions, please see our response below:
1. Confusion by Lemma 4.2 and its implication.
Thanks to Reviewer a75R's suggestion, we have revised the manuscript with an emphasis on "the same anchor is being applied over all the channels". The terminology "alignment" between two channels refers to one being a permutation of the other. Hence, our argument "once each coupled pair is aligned, the two data matrices are globally aligned" can be rephrased in a more rigorous way: "once the permutation orbits of each coupled pair intersect, the permutation orbits of the two data matrices also intersect", which corresponds to our mathematical statement that $X \sim X'$.
2. Why is Lemma 4.4 necessary?
The proof in Sec. 4.1.2 proceeds in the following way:
- Equality of the sum-pooled features implies, by injectivity of the power mapping, that the two data matrices are aligned column-wise (each column of one is a permutation of the corresponding column of the other).
- By Lemma 4.4, column-wise alignment of the mixed channels implies joint alignment of each coupled pair of anchor and feature channels.
- By union alignment (Lemma 4.2), pairwise alignment of every feature channel with the anchor implies global row alignment of the two data matrices, which is essentially Eq. 5 $\Rightarrow$ Eq. 6.
Lemma 4.4 plays a critical role in our proof, as the construction therein mixes anchor channels with the original feature channels, and this construction guarantees that alignment between the mixed channels induces alignment between the tuples of anchor and feature channels.
3. Question regarding "The obtained results for D′ = 1 can also be easily extended to D′ > 1 as otherwise f can be written as [f1 · · · fD′]⊤ and each fi has single output feature channel".
We apologize for the confusion caused. The original definition in Def. 2.2 states that a general permutation-invariant function can admit vector outputs of dimension D′. Note that D′ here denotes the number of output channels (e.g., classification logits), not the number of output set elements. We used the aforementioned argument to clarify that our investigation starting from Sec. 2.2 focuses on scalar outputs without loss of generality. Our permutation-equivariant results are not obtained by merely extending the output dimension; they require additional theoretical development (Lemma G.1) to extend the proof of Theorem 3.1.
4. Practical significance and empirical verification.
In many practical applications of DeepSets-like architectures, even though the embedding dimension is not carefully chosen, it is observed that a reasonably small latent dimension is sufficient for these networks to perform well. Such empirical observations appear to contradict existing theoretical studies, in which exponentially many neurons are often needed. In this work, we provide a more efficient construction of DeepSets embedding layers, which results in polynomially many hidden neurons for the first time, successfully explaining the empirical promise of DeepSets-like architectures. Designing experiments to directly verify our theoretical bound is inherently challenging. We provide a preliminary experiment in Appendix K, in which we show that the minimal hidden dimension required to achieve a small training error scales with the sequence length and feature dimension at a polynomial rate.
Thank you for the response. I will raise my score. I would strongly recommend to add a section about practical takeaways and to move the existing empirical experiment to the main text if possible.
High level summary
This paper delves into the question of representing sets as continuous functions on vector spaces. The work builds on the ideas proposed in DeepSets, which shows that any permutation-invariant continuous function can be restated by mapping each element to a vector, summing those vectors, and then mapping the sum to a scalar. This paper addresses the unresolved question of whether sets of high-dimensional vectors can be represented efficiently, i.e., with a latent dimension polynomial in the set size and feature dimension.
More formally, $X \in \mathbb{R}^{N \times D}$ represents $N$ feature vectors of dimension $D$, and a function $f$ is defined to be permutation-invariant if for any permutation matrix $P$ it satisfies $f(PX) = f(X)$. The main contributions of the paper can be enumerated as:
- Thm 3.1 asserts that there exist $\phi$ and $\rho$ such that $f(X)$ can be re-written as $\rho\big(\sum_{n=1}^{N}\phi(x^{(n)})\big)$, where the latent dimension $L$ is at most polynomial with respect to $N$ and $D$, with two different constructions that the authors refer to as "power mapping" and "exponential activation". The theorem further asserts that $L$ must also be lower-bounded by a quantity polynomial in $N$ and $D$.
- Thm 5.1 further shows an analogous result for permutation-equivariant functions, i.e., functions $f$ such that $f(PX) = Pf(X)$ for any permutation matrix $P$: there exist $\phi$ and $\rho$ expressing any such continuous function with a latent dimension that is again polynomial in $N$ and $D$.
Technical summary
Set representation for one-dimensional elements. The previous result proves that when $D = 1$ there are $\phi$ and $\rho$ such that $f(X) = \rho\big(\sum_{n=1}^{N}\phi(x^{(n)})\big)$. This result mostly hinges on a particular function, referred to as the power mapping $\psi_N$, defined as $\psi_N(z) = (z, z^2, \dots, z^N)$. The paper goes on to explain that the sum-pooled power mapping is "bijective" in a particular sense that deviates from the standard definition, in that $\sum_n \psi_N(x^{(n)}) = \sum_n \psi_N(x'^{(n)})$ implies the rows of $X$ are a permutation of the rows of $X'$. This allows us to introduce the inverse of this mapping, compose it with $f$, and conclude that $f$ can be written in the desired form with $\rho$ being $f$ composed with that inverse.
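As a numerical aside on the D = 1 mechanism described above, the sketch below (an illustration under the assumption that the power sums are computed exactly; the helper names are ours, not the paper's) recovers a multiset of scalars from its pooled power mapping via Newton's identities and root-finding:

```python
import numpy as np

def power_sums(x, N):
    # Sum-pooled power mapping: p_k = sum_i x_i^k for k = 1..N
    return np.array([np.sum(x ** k) for k in range(1, N + 1)])

def recover_multiset(p):
    # Newton's identities turn power sums p_1..p_N into elementary symmetric
    # polynomials e_1..e_N; the multiset is then the root set of the monic
    # polynomial they define.
    N = len(p)
    e = np.zeros(N + 1)
    e[0] = 1.0
    for k in range(1, N + 1):
        e[k] = sum((-1) ** (i - 1) * e[k - i] * p[i - 1] for i in range(1, k + 1)) / k
    coeffs = [(-1) ** k * e[k] for k in range(N + 1)]   # t^N - e1 t^{N-1} + e2 t^{N-2} - ...
    return np.sort(np.roots(coeffs).real)

x = np.array([0.3, -1.2, 2.0, 0.7])
p = power_sums(x, len(x))
print(np.allclose(recover_multiset(p), np.sort(x)))     # True (up to numerical error)
```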
The channel alignment problem for $D > 1$. The first insight is that the straightforward approach of representing each channel (dimension) of the high-dimensional features separately will not work. This is because the mapping is permutation-invariant: extending it entrywise to high dimensions preserves permutation invariance within each channel, but these permutations are not necessarily the same, an issue referred to as alignment. In other words, while the naive approach ensures that each channel is preserved up to a permutation, these permutations are not necessarily identical, and thus the vector structure, where the indices within each vector do matter, may be lost.
Linearly lifting the dimension to resolve alignment. The main idea behind the proposed theory, as far as I can understand, is to prevent this alignment problem by linearly lifting the $D$-dimensional elements to a much higher dimension with lots of redundancy. Crucially, this lifting has the property that if the channels of two different matrices can, after lifting, each be aligned independently, then one matrix is a row permutation of the other.
Here I will summarise the more detailed theoretical insights I can draw from various parts of the paper
- Anchor. The main idea for the entire theoretical construction in fact develops on top of this alignment problem, suggesting a way to "anchor" the different channels. Formally, an anchor $a$ for data $X$ preserves the equality structure of the rows of $X$: $a_i = a_j$ exactly when $x^{(i)} = x^{(j)}$.
- Coupling of alignment with the anchor. If $a$ is an anchor of $X$, $a'$ is a permutation of $a$, and the same permutation can be applied to every channel of $X$ to transform it into the equivalent channel of $X'$, then the rows of $X'$ are a permutation of the rows of $X$. Thus, if we can couple the alignment of the various channels to the alignment of the anchor, we can easily conclude that $X \sim X'$.
- Linear probes for anchors. We can pick linear probes $w_1, \dots, w_K$ such that for any matrix $X$, at least one of the $Xw_k$ must be an anchor of $X$. This is achieved by picking a large enough $K$ and $w_k$'s such that every small enough subset of them is linearly independent, i.e., they are in general position, which is achievable by simply drawing them from a multivariate Gaussian distribution. The claim follows from a cute pigeonhole argument (a small numerical sketch of this general-position property is given after this list). Now, if we build the set representation from all the original channels and these probe channels $Xw_k$, we can ensure that each of them is represented injectively, and that at least one of the $Xw_k$ is an anchor, but we still haven't coupled the alignment of the anchor to the rest of the channels.
- Coupling channel alignments to the anchor. Next, the authors prove the existence of coefficients $\lambda_1, \dots, \lambda_K$ such that if $a$ is a permutation of $a'$, $b$ is a permutation of $b'$, and $a + \lambda_k b$ is a permutation of $a' + \lambda_k b'$ for every $k$, then the permutations aligning $a$ with $a'$ and $b$ with $b'$ can be chosen to coincide.
- Lifting. Here, using the linear probes and the coefficients $\lambda_k$ from the previous steps, we can define lifting columns, for a polynomial range of indices, that combine the standard basis directions, the probes, and the coupling coefficients, and put all of them as columns of a lifting linear map $W$. Then each channel (column) of $XW$ is either an original channel, a probe channel, or a coupled combination of the two. Therefore, by the previous properties, if every channel (column) of $XW$ can be aligned to (is a permutation of) the corresponding channel of $X'W$, then the rows of $X$ are a permutation of the rows of $X'$. The set representation of each channel then follows naturally from this step.
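A small numerical sketch of the general-position/anchor property mentioned in the list above; the data matrix, probe count, and tolerance are arbitrary illustrative choices rather than anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_anchor(w, X, tol=1e-9):
    # w is an anchor direction for X if equal entries of a = X @ w occur
    # only where the corresponding rows of X are themselves equal.
    a = X @ w
    for i in range(len(a)):
        for j in range(len(a)):
            if abs(a[i] - a[j]) < tol and not np.allclose(X[i], X[j]):
                return False
    return True

X = np.array([[1.0, 2.0],
              [1.0, 2.0],     # duplicate row: the anchor must keep only these two equal
              [3.0, -1.0],
              [0.5, 0.5]])
probes = rng.standard_normal((5, 2))
print([is_anchor(w, X) for w in probes])   # Gaussian probes are anchors almost surely
```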
Strengths
Here are the main strengths I find in the paper:
- The theory of the paper, to the best of my understanding, is sound and accurate. The statements of the theorems and lemmas, and the proofs, as far as I delved into them, are sound and clearly stated.
- The main problem that this paper focuses on, is a natural abstraction of real-world problems.
- There are many clever and intuitive proof techniques used in this work, which may be interesting for the reader
Weaknesses
Main issues:
- While the paper presents a mathematically intriguing case, I am not quite sure what to draw from it from a machine learning perspective. While theoretical contributions should certainly be welcome in the ML field, I think theoretical works should take a few steps toward demonstrating the relevance of their results for the broader community. For example, are there any concrete examples or use cases for which these theoretical findings would be relevant?
- Following up on the previous point, despite the embedding-sum decomposition, the construction of the function $\rho$ is still a complete black box. Again, these non-constructive arguments do not seem to add up to any "practical insight." While this is not a critical point, if the authors can think of ways to make it easier for the reader to imagine applications, they would broaden their audience and enhance the impact of the paper.
- While the paper is mathematically sound, it could benefit from more high-level idea development both before and after the theorems. There are several places where a geometric interpretation is available but not discussed in the main text. For example, in the proof of Lemma 4.2 it becomes clear that, for a system of $w$'s in general position, if there are "enough" of them, at least one will not lie in the hyperplane orthogonal to the difference between any two columns of the data matrix; perhaps this can even be visualised for some toy example with small $N$ and $D$. With the intuition that the authors have built upon this case, if they are able to come up with more intuitive/visual depictions of these concepts, it would dramatically improve the paper's readability and accessibility.
Minor issues:
- The notion of injectivity for set functions (Def 2.7), namely that equality of the pooled embeddings implies the rows of $X$ are a permutation of the rows of $X'$, slightly deviates from the standard notion that assumes "a function f that maps distinct elements of its domain to distinct elements." It took me quite a few iterations to notice this slight difference. Perhaps the authors can caution the reader before or afterwards. Most readers will be inclined to assume the "default" notion, which could lead to some confusion later on.
- The stacking notation $[a\ b]$ (for example, as used in Lemma 4.2) is somewhat ambiguous. Initially, I interpreted it as concatenating the two vectors along their main axis, which leads to lots of contradictions, while the authors intend stacking them along a new dimension, which makes sense. While this might be consistent with the paper's notation elsewhere, it would still be good and helpful to highlight for the readers that this means stacking and not concatenation.
- "Comparison with Prior Arts"? I'm guessing authors mean "articles" here? Is this a common shorthand for articles? While I don't want to sound authoritative, this seems like a rather informal writing style.
- (page 17) "pigeon-hold principle" I'm guessing authors refer to "pigeonhole principle" :)
Questions
- The main results of this paper are obtained for a notion of permutation invariance of $f$ that is defined, somewhat strictly, over the entire input space. In real-world applications, the feature vectors often represent learned or encoded features of a particular dataset. So in these cases, the "user" will be interested in "relaxed set-preserving properties", in the sense that permutation invariance only needs to hold over a subset (possibly even a countable subset) of the input space. Can the authors think of any interesting relaxations of the input domain and extend their theory to those? (assuming it has certain properties or its feature vectors are chosen from a smaller/finite set)
- Upon reading the paper, the architecture that the authors propose resembles a network of depth 2 (apart, perhaps, from the mysteriously encoded $\rho$, which in reality could be a highly complex function). In certain scenarios, there could be hierarchical set representations, e.g., we may want to embed sets of sets of sets (or deeper). Can the authors comment on the extendability of their theory, or share any general comments they may have on such multilevel/hierarchical sets?
We genuinely thank Reviewer XKbR for acknowledging our theoretical soundness and providing a detailed summary of our technical insights. Regarding your questions, please see our response below:
1. Practical implications and relevance to the ML community.
When the input data is a set or a graph, machine learning models often require permutation invariance or equivariance. DeepSets or sum-pooling-based architectures are ubiquitous in these models. Representative examples encompass GNNs and PointNet. GNNs learn to pass information along graph topology via a neighborhood aggregation operation at each layer of computation, essentially corresponding to a set function [1]. PointNet [2] processes point clouds by representing points as an unordered set and follows a DeepSets-like architecture. However, the analysis of expressive power for most of these usages does not take the embedding dimension into consideration, as also acknowledged by Reviewer a75R. Our work provides a rigorous justification that moderately many neurons are sufficient to represent a high-dimensional set with DeepSets. This affirms the feasibility of the DeepSets architecture given high-dimensional features and explains why DeepSet-like operations can be effectively adopted in GNNs and PointNet.
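As a purely illustrative sketch of the point that neighborhood aggregation is a sum-pooled set function (the toy graph, features, and linear map below are hypothetical placeholders, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)   # toy undirected graph (triangle)
H = rng.standard_normal((3, 4))          # node features, D = 4
W = rng.standard_normal((4, 8))          # per-element feature map as a linear layer

H_agg = A @ (H @ W)                      # each node sum-pools the features of its neighbour set

perm = np.array([2, 0, 1])               # relabel the nodes
H_agg_perm = A[perm][:, perm] @ (H[perm] @ W)
print(np.allclose(H_agg[perm], H_agg_perm))   # True: aggregation is permutation-equivariant
```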
[1] Xu et al., How Powerful are Graph Neural Networks?
[2] Qi et al., Pointnet: Deep learning on point sets for 3d classification and segmentation
2. Construction of $\rho$ remains a black box.
Among most works analyzing the expressiveness of DeepSets architectures [1,2], identifying bijectivity (also known as orbit separation) of the sum-pooling layer is of most theoretical interest. We note that the resulting $\rho$ in our construction is not a "complete" black box. Our construction guarantees continuity of $\rho$, a crucial property which allows it to be universally approximated by commonly used neural networks. We leave a more fine-grained discussion of $\rho$ to future work; a potential direction is analyzing the learnability of $\rho$ depending on the smoothness of the embedding layer $\phi$.
[1] Zaheer et al., Deep Sets
[2] Wagstaff et al., On the Limitations of Representing Functions on Sets
3. More intuitive interpretation of our proof.
We appreciate the reviewer's insightful interpretation of Lemma 4.3 and have included it in our latest revision. We have also added some high-level proof ideas following Reviewer jrmV's suggestion, and would like to direct Reviewer XKbR to Fig. 2, an illustration of our proof, which Reviewer oY1k found helpful. Although the majority of our proofs are algebraic, a simple geometric interpretation could be that we identify a non-trivial subspace embedded in a higher-dimensional ambient space such that, for any two points in this subspace, the intersection of their orbits generated by column-wise permutations implies the intersection of their orbits under global row permutations.
4. Typos and confused notation.
Thanks for pointing out the typos. We have fixed all of them and added additional remarks for the confusing notations in our updated manuscript. The notion of injectivity we adopted here also follows the standard definition, as "equal sum-pooled embeddings imply permutation-equivalent inputs" is equivalent, by contraposition, to "inputs that are not permutation-equivalent map to distinct sum-pooled embeddings". The term "prior arts" is usually used interchangeably with "prior results", referring to previous works showcasing state-of-the-art results, as exemplified in [1]. To avoid further confusion, we have changed it to "prior results".
[1] Tsakiris & Peng, Homomorphic sensing
5. Relaxation of the input space.
Most discussion of permutation-invariant models focuses on invariance defined over the entire ambient space. Meanwhile, it is common in network expressiveness analysis to consider general inputs (e.g., arbitrary open/closed/compact sets). We appreciate the reviewer's thoughtful suggestion to consider a constrained input space. If we know that the order of some elements of the input is irrelevant while the order of the other elements matters, one can apply the idea of DeepSets to the first subset of elements, concatenate the output with the representations of the remaining elements, and finally adopt standard fully connected networks to process the concatenation. This still gives universal approximation, and our theory still applies to the DeepSets part. Other relaxations can also be explored in follow-up work. For instance, the input can be assumed to lie on a low-dimensional manifold and further obey permutation invariance/equivariance. However, such discussion is beyond the scope of this work. Nevertheless, our results, which consider the most general inputs, already cover many practical scenarios such as GNNs and PointNet.
6. Solution to hierarchical set.
Extending our results to hierarchical sets is beyond the scope of this paper. A straightforward solution based on our current results would be to hierarchically apply the sum-pooling and append a final output layer $\rho$. For instance, consider a set of sets of depth 2, $\mathcal{X} = \{X_1, \dots, X_M\}$ with each $X_m$ itself a set of $N$ vectors in $\mathbb{R}^D$; one can provably realize a function respecting this nested invariant structure by:
$$f(\mathcal{X}) = \rho\Big(\sum_{m=1}^{M} \phi_{\mathrm{out}}\Big(\sum_{n=1}^{N} \phi_{\mathrm{in}}\big(x_m^{(n)}\big)\Big)\Big),$$ where $\phi_{\mathrm{in}}$ and $\phi_{\mathrm{out}}$ are the inner and outer embedding layers. If the LP embedding layer is adopted, the minimal latent dimension needed for the innermost sum-pooling is upper bounded by a polynomial in $N$ and $D$, and that for the outer sum-pooling is upper bounded by a polynomial in $M$ and the inner latent dimension.
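A hedged code sketch of this nested construction (the inner/outer embedding maps below are random placeholders, not the paper's LP/LE layers; it only demonstrates invariance at both levels of the hierarchy):

```python
import numpy as np

rng = np.random.default_rng(0)
D, L1, L2 = 5, 16, 32
W_in  = rng.standard_normal((D, L1))     # placeholder inner embedding phi_in
W_out = rng.standard_normal((L1, L2))    # placeholder outer embedding phi_out
w_rho = rng.standard_normal(L2)          # placeholder readout rho

def nested_deepsets(set_of_sets):
    # Inner sum-pooling embeds each inner set; outer sum-pooling aggregates the embeddings.
    inner = [np.tanh(S @ W_in).sum(axis=0) for S in set_of_sets]
    pooled = np.sum([np.tanh(z @ W_out) for z in inner], axis=0)
    return float(pooled @ w_rho)

S1, S2 = rng.standard_normal((3, D)), rng.standard_normal((4, D))
out1 = nested_deepsets([S1, S2])
out2 = nested_deepsets([S2, S1[::-1]])   # reorder the inner sets and shuffle one of them
print(np.isclose(out1, out2))            # True: invariant at both levels
```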
Summary: The article investigates the impact of embedding dimension on the expressive power of DeepSets.
Strengths: Referees found that the paper presents novel ideas and substantive, technically interesting contributions, appreciated the sound and accurate theory, found the problem relevant, and considered the proof techniques thorough and rigorous.
Weaknesses: On the critical side, a referee expressed concern about unclear implications from a machine learning perspective and a lack of practical insight, although they also pointed out that this was not a critical point. They also suggested providing more high-level ideas. During the discussion period some of these concerns were clarified, prompting some referees to raise their recommendations.
In view of the positive reception, I am recommending accept. The authors are strongly encouraged to work on making the intuitions as clear as possible as well as the practical takeaways in the final version of the manuscript, as this was a general point in the reviews.
Why not a higher score
At the end of the discussion period, referees still strongly recommended adding a section about practical takeaways.
Why not a lower score
The strengths clearly outweigh the weaknesses; the article received a generally positive evaluation from all referees.
Accept (poster)