Monotone and Separable Set Functions: Characterizations and Neural Models
We build a multi-set function that is monotone and separable
Abstract
Reviews and Discussion
This paper discusses modelling functions on multisets for applications such as Set Containment, and proposes a definition of Monotone-and-Separable functions on such sets. The authors prove under which conditions such functions can exist, and generalize the definition to the case of models with learnable weights. They propose a class of activation functions called 'hat functions' which enforce separability, and define a model architecture which obeys the MAS conditions. They test this architecture on a number of applications related to set containment and demonstrate slightly improved performance compared to existing methods.
Strengths and Weaknesses
Strengths:
- Paper is clearly written and easy to follow (at least for the first half)
- Theoretical contributions are very good
Weaknesses:
- Many editing/formatting issues - the definition of weak separability on page 5 is missing, Appendix links are broken, and the tables on pages 8 and 9 are small, hard to read, and interfere with the text. The paper is generally quite cramped near the end, and many later sections are overly short and use frequent abbreviations. The paper should likely be reorganized to allow more room for these final sections, rather than trying to cram them into a small amount of space.
- Experiments are limited in scope and results do not show significant improvement over existing baselines. Some of this is fine for a theoretical work, but the results show such little improvement over Set Transformers/Deepsets that it makes me wonder if the strict MAS condition is even necessary in practice.
- Architecture of MASNet seems incremental, with only very minor differences from DeepSets.
Questions
- Is there a reason to expect that existing set models fail to obey the separability condition in enough cases to make a model like this necessary? The incremental improvement over simple set baselines makes me suspect that these methods already at least approximately obey the separability condition most of the time in practice.
- Why is the condition of monotonicity necessary? The set containment problem only requires separability does it not?
- Do you observe significant difficulties in training networks with the 'hat' activation functions? Intuitively I would expect these activation functions to perform very poorly and have significant issues with gradient flow, particularly in deeper networks. It doesn't seem overly important since you show that ReLU-based networks with multiple layers still satisfy separability, but I would expect those activation functions to be almost entirely a non-starter for less trivial cases.
- Am I correct in concluding that existing DeepSet models are monotone if they simply enforce non-negativity in their parameters, and separable as long as they have at least 2 layers? This seems like a very minor/incremental change from existing models.
- Are Set Transformers with ReLU activations and >2 layers separable? The analysis was largely focused on Deep Sets, which are a much more limited architecture.
Limitations
yes
Final Justification
The authors' rebuttal has clarified for me that the poor experimental performance is largely due to highly unbalanced classes, and that with balanced classes the proposed model does show much more significant gains. In addition, they have provided some compelling evidence for the strengths of their model compared to the Set Transformer architecture by showing that this architecture cannot satisfy the criterion of monotonicity.
I do still have some remaining concerns about the fact that a basic Deep Set model with only very slight modifications essentially satisfies all of the proposed criteria, making the novelty of the MASNet architecture limited. That being said, the authors do provide a great deal of additional analysis about alternate architecture choices such as "hat" activation functions and neural integrals that can be used to design more sophisticated MASNet variants, so I think it is fine to move forward. I have changed my score accordingly.
Formatting Issues
Style issues related to size and placement of tables on page 9
We thank the reviewer for their comments, which will improve our paper.
Why is the condition of monotonicity necessary? The set containment problem only requires separability does it not?
Separability alone does not suffice for set containment. In a neural network setting, such problems require us to design a function $F$ such that the order between $F(S)$ and $F(T)$ can serve as a boolean test for whether $S$ is a subset of $T$. For this, $F$ needs to satisfy two conditions:
(I) $F(S) \le F(T)$ implies $S \subseteq T$; (II) $S \subseteq T$ implies $F(S) \le F(T)$.
We would like to emphasize that for high accuracy, $F$ needs to satisfy both conditions (I) and (II).
Separability alone satisfies condition (I). Monotonicity implies $S \subseteq T \Rightarrow F(S) \le F(T)$, which is equivalent to (II). Hence, without monotonicity, condition (II) may not be satisfied: we may have $F(S) \not\le F(T)$ (thus predicting $S \not\subseteq T$) even though, in reality, $S$ is a subset of $T$. Hence, non-monotonicity results in large numbers of false negatives.
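For concreteness, here is a minimal sketch (our illustration, not code from the paper) of the decision rule implied by conditions (I) and (II), applied to hypothetical embedding vectors produced by any set encoder:

```python
import numpy as np

def predict_subset(F_S: np.ndarray, F_T: np.ndarray) -> bool:
    """Predict S ⊆ T iff F(S) <= F(T) coordinate-wise.

    (II) monotonicity  => every true subset pair passes the test (no false negatives);
    (I)  separability  => every non-subset pair fails it (no false positives).
    """
    return bool(np.all(F_S <= F_T))

# Toy usage with hypothetical 3-dimensional embeddings.
print(predict_subset(np.array([0.2, 1.0, 0.5]), np.array([0.4, 1.3, 0.5])))  # True
```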
Am I correct in concluding that existing DeepSet models are monotone if they simply enforce non-negativity in their parameters, and separable as long as they have at least 2 layers?
Indeed, MASNET-ReLU can be seen as a special case of the DeepSets model, as it is of the form $F(S) = \rho\left(\sum_{x \in S} \phi(x)\right)$.
But it is not entirely correct that, to attain monotonicity, all the parameters of MASNET need to be non-negative. We choose the inner MLP $\phi$ to end with a non-negative activation such as ReLU, but its weights can be negative as well. For the outer transformation $\rho$, we have chosen a non-negative function such as a neural network with non-negative weights and ReLU activations, or simply the identity. To attain weak separability, we require that $\phi$ has at least two layers.
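As an illustration of this structure, here is a minimal sketch (our own, with illustrative layer sizes and an illustrative way of enforcing non-negative decoder weights; not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class MASNetReLUSketch(nn.Module):
    """Sketch of a sum-decomposable, monotone set encoder in the spirit described above.

    phi: element MLP with >= 2 layers whose final activation is ReLU,
         so element embeddings are non-negative (the weights may be negative).
    rho: monotone decoder; here a linear map with non-negative weights
         (obtained by squaring an unconstrained parameter) followed by ReLU.
    """
    def __init__(self, d_in: int = 16, d_hid: int = 64, d_out: int = 32):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(d_in, d_hid), nn.ReLU(),
            nn.Linear(d_hid, d_out), nn.ReLU(),   # final ReLU => phi(x) >= 0
        )
        self.rho_weight = nn.Parameter(0.1 * torch.randn(d_out, d_out))

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (set_size, d_in), one multiset of element features.
        pooled = self.phi(X).sum(dim=0)           # adding elements only adds >= 0 terms
        W = self.rho_weight ** 2                  # non-negative decoder weights
        return torch.relu(pooled @ W.T)           # monotone decoder rho
```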
The difference in architecture appears incremental. However, it required a very detailed analysis, which we built up gradually starting from Section 2. Our key contributions include: (1) a characterization of the existence of MAS functions; (2) a characterization of weakly MAS functions, stability, and more.
It is useful for the community to know that set containment problems can be better addressed by implementing DeepSets in the correct way. Apart from DeepSets, we also provided more expressive neural formulations, which learn the activation functions automatically via a trainable neural integrator (we mention them in the appendix; if the reviewer thinks it would highlight the contribution better, we will move them to the main text).
results do not show significant improvement over existing baselines
The incremental improvement over the baselines might be misleading. One of the key reasons why DeepSets and Set Transformer do well for the text and point cloud datasets lies in the distribution of classes (subset vs. non-subset) in our datasets. The non-subset to subset ratio in the datasets for our experiments with text and point clouds was taken to be 9:1. Due to this data imbalance, DeepSets and Set Transformers predict almost all pairs as non-subsets, resulting in a high false negative rate.
This becomes more clear when we look at the False Negative rates of the models:
| Model | Bedding | Feeding | MSWEB | MSNBC |
|---|---|---|---|---|
| DeepSets | 1.00 | 1.00 | 0.26 | 0.08 |
| Set Transformer | 1.00 | 1.00 | 0.67 | 0.90 |
| MASNET-ReLU | 0.00 | 0.00 | 0.00 | 0.00 |
| MASNET-Hat | 0.00 | 0.00 | 0.00 | 0.00 |
The high false negative rates of the baselines are due to a lack of inductive bias, as they do not have monotonicity built into their design. Since MASNET is equipped with the inductive bias of monotonicity, it predicts the true subsets correctly most of the time (respectively, all of the time) in the case of noisy (respectively, exact) set containment tasks. By design, $S \subseteq T \Rightarrow F(S) \le F(T)$: hence, the true subsets are never misclassified and thus the false negative rate is zero.
Here, we present the results on synthetic data with a balanced class ratio (1:1 ratio of subsets to non-subsets). We observe that we outperform them by a large margin.
| $S$-size | $T$-size | DeepSets | Set Transformer | MASNET |
|---|---|---|---|---|
| 1 | 10 | 0.59 | 0.59 | |
| 10 | 20 | 0.89 | 0.80 | |
| 10 | 100 | 0.54 | 0.57 |
For the text datasets as well, we repeated the experiments during the rebuttal, taking a 1:1 ratio of subsets to non-subsets and re-training the baselines and MASNET on this data. We present the results in the following table to demonstrate our point.
| Model | Bedding | Feeding | MSWEB | MSNBC |
|---|---|---|---|---|
| DeepSets | 0.62 | 0.68 | 0.84 | 0.81 |
| Set Transformer | 0.72 | 0.70 | 0.87 | 0.89 |
| MASNET | 0.96 |
We observe that we outperform the baselines by a large margin.
Are Set Transformers with ReLU activations and >2 layers separable? The analysis was largely focused on Deep Sets, which are a much more limited architecture
The exact characterization of the conditions for separability of the Set Transformer is extremely challenging; but we can prove that it is not monotone. Therefore, it cannot be effectively used for set containment, as it will show a high false negative rate (as shown in the preceding discussion as well).
We consider a Set Transformer model with sum pooling, namely $F(S) = \sum_{i} \sum_{j} w_{ij}\, V x_j$. Explicitly, this is defined for a given multiset $S = \{x_1, \ldots, x_n\}$ by first defining a query $Q x_i$, key $K x_i$, and value $V x_i$ per point, and then taking a weighted average defined by the attention weights $w_{ij} = \frac{\exp(\langle Q x_i, K x_j\rangle)}{\sum_{l} \exp(\langle Q x_i, K x_l\rangle)}$, to obtain a permutation-invariant function.
Lemma: Let $d \ge 1$, and let $F$ be a Set Transformer defined as above; then $F$ is not a pointwise monotone function.
Proof. We need to show that there exist some parameters for which the function is not monotone. Choose any value matrix $V$ which has no kernel, and any non-zero vector $x$; then $V x$ is non-zero. By swapping $x$ with $-x$ if necessary, we can choose some index $t$ such that the $t$-th coordinate of $V x$ is negative. By assumption, $x$ is a non-zero vector, and hence it has a positive norm.
Now, let $S$ be the singleton set $\{0_d\}$, which is a subset of $T = \{0_d, x\}$, where $0_d$ denotes the all-zero vector in $\mathbb{R}^d$. We note that $F(S) = V 0_d = 0$, since the only value vector is zero. On the other hand, since the attention weights in $T$ are positive and the value of $0_d$ is zero, $F(T)$ is a positive multiple of $V x$. Note that $(V x)_t < 0$, and it follows that $F(T)_t < F(S)_t$, so $F(S) \not\le F(T)$ even though $S \subseteq T$, which shows that this function is not monotone. This concludes the proof.
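As a quick sanity check of this argument, here is a small numeric experiment (our own toy construction, assuming a standard single-head softmax attention with sum pooling; the exact Set Transformer variant in the lemma may differ in details):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Q, K, V = rng.normal(size=(3, d, d))       # random parameter matrices

def attn_sum_pool(X: np.ndarray) -> np.ndarray:
    """Single-head softmax attention followed by sum pooling over the queries."""
    q, k, v = X @ Q.T, X @ K.T, X @ V.T
    logits = q @ k.T
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)       # row-wise softmax
    return (w @ v).sum(axis=0)

x = rng.normal(size=d)
if np.all(V @ x >= 0):                      # ensure Vx has a negative coordinate,
    x = -x                                  # swapping x with -x as in the proof

S = np.zeros((1, d))                        # S = {0_d}
T = np.vstack([np.zeros(d), x])             # T = {0_d, x}, so S ⊆ T
F_S, F_T = attn_sum_pool(S), attn_sum_pool(T)
print(F_S)                                  # the zero vector
print(F_T)                                  # a positive multiple of Vx, with a negative coordinate
print(bool(np.all(F_S <= F_T)))             # False => not pointwise monotone
```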
It is plausible that the set transformer model could be modified in some way, so that it is monotone. We believe this is a good direction for future work, and we thank the reviewer for pointing it out.
Do you observe significant difficulties in training networks with the 'hat' activation functions?
Training MASNET-Hat was not very difficult. For example, we refer to eq. (8) of the paper, where we used a parametric version of the Hat function, given by
$$\mathrm{Hat}_{\alpha,\beta,\gamma}(x) = \mathrm{ReLU}\left( \frac{x-\alpha}{\gamma \cdot \beta}\right) + \mathrm{ReLU}\left( \frac{x-(\alpha+\beta)}{(1-\gamma) \cdot \beta}\right) - \mathrm{ReLU}\left( \frac{x-(\alpha+\gamma \cdot \beta)}{\gamma \cdot (1-\gamma) \cdot \beta}\right)$$

With appropriate parametrizations of $\beta > 0$, such as $\beta = \mathrm{ELU}(\hat{\beta}, \nu) + \nu$ (ELU is the Exponential Linear Unit function, $\hat{\beta}$ is a learnable parameter and $\nu$ is a hyperparameter), and of $\gamma \in (0,1)$ as $\gamma = \mathrm{Sigmoid}(\hat{\gamma} \cdot \tau)$ (where $\tau$ is a hyperparameter and $\hat{\gamma}$ is learnable), we were able to have a large enough support for our hat activations and mitigate the difficulties in training (due to poor gradient flow) to a good extent. We discuss this in lines 967-984 (page 27) of our full paper (attached in the supplement), which we could elaborate on further in the final version in case of acceptance.

> *Editing/formatting issues*

We sincerely apologize for the inadvertent errors in the current version of the manuscript. In our internal repository, the manuscript was in order; however, during the final stages of submission to OpenReview, a few hasty edits introduced these typos. In fact, during the original submission, we uploaded a version in the supplementary material which has all broken Appendix references fixed. We would like to emphasize that these issues have already been resolved in our internal version. Unfortunately, OpenReview does not allow us to update the current submission version, but we will ensure that all such errors are fully corrected in the camera-ready version, if the paper is accepted.

The definition of weak separability, which was accidentally omitted, is as follows: "The parametric set function $F: P_{\leq \infty}(V) \times \mathcal{W} \to \mathbb{R}^m$ is weakly separable if, given any $S, T \subseteq V$ such that $S \not\subseteq T$, there exists a parameter $w \in \mathcal{W}$ such that $F(S; w) \not\le F(T; w)$". This essentially means that at least some point in the parameter space separates two non-subsets.

Dear Reviewer Mcwl,
During the course of the rebuttal and discussion with other reviewers, we've performed some more experiments which we think are also relevant to address some of the questions you asked.
Am I correct in concluding that existing DeepSet models are monotone if they simply enforce non-negativity in their parameters, and separable as long as they have at least 2 layers? This seems like a very minor/incremental change from existing models.
While we've theoretically proved and experimentally demonstrated that MASNET-ReLU is separable, the key idea in the proof is that a 2-layer ReLU network can simulate a 1-layer Hat-function-based network, which happens to be Hölder separable as per Theorem 11 of our paper. Thus, a ReLU network has to effectively learn the weak separability. In our experiments, to accommodate this, we took ReLU networks with a larger number of parameters than the Hat network. We wanted to see what happens when both have a similar number of parameters (2-layer ReLU vs. 2-layer Hat), and we clearly observed that, for a comparable number of parameters, Hat-function-based models are superior. We performed experiments for exact set containment on the Amazon baby registry dataset and experiments for inexact set containment (based on provided relevance labels) on the BEIR datasets. In both cases, a 1:1 ratio of subsets to non-subsets was taken. The accuracy numbers are presented as follows:
| Model | Bedding | Feeding | NFCorpus | SciFact |
|---|---|---|---|---|
| MASNET-ReLU | 0.95 | 0.92 | 0.71 | 0.70 |
| MASNET-Hat | 0.97 | 0.96 | 0.73 | 0.75 |
This demonstrates the novelty and practical applicability of Hat functions, which offer a wider class of possible neural architectures than DeepSets with ReLU layers.
- Are Set Transformers with ReLU activations and > 2 layers separable?
We have already proved theoretically that Set Transformers are not monotone in our previous response. We follow this up with an experiment to demonstrate that they are not reasonably separable either. Recall that separability means $S \not\subseteq T \Rightarrow F(S) \not\le F(T)$. Thus, for set embeddings that are not separable, we may have $F(S) \le F(T)$ for sets $S \not\subseteq T$. In such cases, the number of false positives would be high. We give the false positive rates of Set Transformers and MASNET on the Amazon and BEIR datasets with a 1:1 class ratio of positives to negatives.
False Positive Table
| Model | Bedding | Feeding | NFCorpus | SciFact |
|---|---|---|---|---|
| Set Transformer | 0.26 | 0.35 | 0.48 | 0.40 |
| MASNET | 0.06 | 0.08 | 0.39 | 0.35 |
Thus, we see that Set Transformers show a significantly higher degree of False positives than MASNET for both exact and inexact set containment tasks. This demonstrates that when the class ratio in the data is balanced, Set Transformers fail to learn monotonicity and separability at the same time.
Since the discussion period is going to end soon, we would kindly request that you let us know if our response addresses your rebuttal questions.
Dear Reviewer McwL,
We realized that the first para in our response to the following question may be confusing due to a typo. We clarify here for better understanding.
Are Set Transformers with ReLU activations and >2 layers separable? The analysis was largely focused on Deep Sets, which are a much more limited architecture
The exact characterisation of the conditions for separability of the Set Transformer is extremely challenging; but we rigorously prove that it is not monotone (we presented a Lemma with proof in the rebuttal itself). Therefore, it cannot be effectively used for set containment, as it will show a high false negative rate (as shown in the presented experimental results and preceding discussions).
Dear Reviewer Mcwl,
Could we request you to let us know if our response addresses your rebuttal-questions. We would be more than happy to provide any additional results/analysis, if you deem suitable.
Dear Reviewer McwL,
We further conducted experiments on MNIST. The results are as follows, and show that, under similar model capacity, MASNET-Hat is better than MASNET-ReLU. Therefore, we observe the utility of Hat functions across a span of applications, from the Amazon datasets to text retrieval and image classification.
Accuracy numbers are in the following table:
| Model | Accuracy |
|---|---|
| Set Transformer | 0.71 |
| MASNET-ReLU | 0.81 |
| MASNET-Hat | 0.85 |
Since the discussion period will conclude in less than 12 hours, we would be indeed very grateful if you could let us know of any remaining concerns. We would be happy to address them promptly.
Thank you for your reply. Our replies are inline.
I understand that the proposed architecture can also be used with the "hat" activation functions, but I remain somewhat suspicious of the usefulness of those activation functions in more general settings in practice.
Our hat functions are more effective in practice than other activation functions like ReLU. We already experimented with this during the discussion period and presented the results to other reviewers. Specifically, we provide a more effective comparison of MASNET-Hat vs. MASNET-ReLU by ensuring an approximately equal number of parameters on a large class of setups, i.e., a synthetic dataset (1:1 ratio of subset to non-subset pairs) and also a wide class of real datasets: MSWEB and MSNBC; Bedding and Feeding from the Amazon datasets; NFCorpus and SciFact from the BEIR datasets.
Acc on synthetic dataset
| Model | Accuracy |
|---|---|
| MASNET-ReLU | 0.916 |
| MASNET-HAT | 0.974 |
Acc on real dataset
| Model | MSWEB | MSNBC | Bedding | Feeding | NFCorpus | SciFacts |
|---|---|---|---|---|---|---|
| MASNET-ReLU | 0.92 | 0.94 | 0.95 | 0.92 | 0.71 | 0.70 |
| MASNET-Hat | 0.98 | 0.97 | 0.97 | 0.96 | 0.73 | 0.75 |
These two results demonstrate that the HAT activation is effective and outperforms ReLU when evaluated under comparable model capacity. Since both models are monotone, the improved performance can be attributed to the superior separability of MASNET-HAT compared to MASNET-ReLU under similar model capacity.
Moreover, in our proof of MASNET-ReLU's separability, we crucially relied on the fact that a single-layer network with the HAT activation function is separable. In this sense, the HAT function serves as a foundational stepping stone in establishing the separability of MASNET-ReLU. Without first introducing and analyzing the HAT activation, it would not have been possible to rigorously justify the separability of MASNET-ReLU, despite its seemingly simpler architecture.
We would also like to emphasize that, as a class of functions, HAT activations are significantly richer than ReLU. Unlike ReLU, which is inherently piecewise linear, HAT activations are not restricted to piecewise linearity and can capture more complex functional forms.
The RELU form of MasNET seems even less distinct from a baseline Deep Set architecture than I thought, given your comment. Is there any difference at all between MasNET-RELU and a Deep Set with ReLU activations and multi-layer feedforward encoders and decoders?
We are sorry for the confusion. MASNET-ReLU is not the same as DeepSets with ReLU activations. DeepSets have the form $F(S) = \rho\left(\sum_{x \in S} \phi(x)\right)$, where $\phi$ maps elements to vectors and $\rho$ maps vectors to vectors. For such a DeepSet to be a MASNET-ReLU, we need $\phi$ to be a non-negative vector-valued function, which in our case is a neural network with a ReLU activation in the last layer, and $\rho$ to be a vector-to-vector monotone increasing function. We can provide monotonicity in two ways:
- Using a neural network with non-negative weights and a hardcoded monotonically increasing activation function.
- Using a neural-integral-based technique inspired by [1], where we model $\rho$ to be a coordinate-wise monotone function. We achieve this by modelling the derivative of $\rho$ using a non-negative neural network $g$ and using a neural integral to compute $\rho$, as follows: $\rho(z) = \rho(0) + \int_0^{z} g(t)\, dt$ (applied coordinate-wise), where $g$ is a non-negative neural network.
This makes our architecture significantly different from "a Deep Set with ReLU activations and multi-layer feedforward encoders and decoders", because we need to constrain the outer decoder to preserve the partial order of vector monotonicity, which we can achieve by the two methods prescribed above.
[1] Unconstrained Monotonic Neural Networks by Wehenkel and Louppe
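To make the second option concrete, here is a minimal sketch (our own illustration; it uses a simple fixed-grid trapezoidal quadrature rather than the integration scheme of [1], and shares a single integrand network across coordinates):

```python
import torch
import torch.nn as nn

class NeuralIntegralMonotone(nn.Module):
    """Coordinate-wise monotone map rho(z)_i = b + integral_0^{z_i} g(t) dt,
    where g is a scalar network constrained to be non-negative via Softplus.
    Monotone up to quadrature error of the fixed-grid trapezoidal rule.
    """
    def __init__(self, hidden: int = 32, n_steps: int = 50):
        super().__init__()
        self.g = nn.Sequential(              # non-negative integrand
            nn.Linear(1, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Softplus(),
        )
        self.bias = nn.Parameter(torch.zeros(1))
        self.n_steps = n_steps

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (..., m); integrate each coordinate independently from 0 to z_i.
        s = torch.linspace(0.0, 1.0, self.n_steps, device=z.device)
        t = z.unsqueeze(-1) * s                        # (..., m, n_steps) grid points
        g_vals = self.g(t.unsqueeze(-1)).squeeze(-1)   # g evaluated on the grid
        integral = torch.trapezoid(g_vals, t, dim=-1)  # approx. of the integral
        return integral + self.bias
```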
We are open to hearing more from you. If you have questions, please do not hesitate to write to us. The neural integral approach was discussed in the Appendix in a general context. If the reviewer suggests, we can bring it into the main text if the paper gets accepted.
I thank the authors for their very thorough responses, and apologize for my lateness in replying in kind. The experiments with balanced class distributions do show a stronger empirical argument for MASNet. I also appreciate the detailed analysis of the Set Transformer case - this is a compelling argument that should make it into the paper.
My concerns about the incremental nature of the architecture remain, however. I understand that the proposed architecture can also be used with the "hat" activation functions, but I remain somewhat suspicious of the usefulness of those activation functions in more general settings in practice. The RELU form of MasNET seems even less distinct from a baseline Deep Set architecture than I thought, given your comment. Is there any difference at all between MasNET-RELU and a Deep Set with ReLU activations and multi-layer feedforward encoders and decoders?
This paper considers the problem of the existence of mapping f between multisets and vector spaces such that S is contained in T (in the multiset sense) if and only if f(S) \leq f(T) pointwise. The authors consider what is the dimension needed for the existence of such a mapping, consider neural architectures of implementing these functions and test these in multiple experiments on real and synthetic datasets.
Strengths and Weaknesses
The paper suggests a new notion of MAS set functions that leads to interesting mathematical analysis. It also provides interesting experiments. One issue is the lack of compelling motivation for MAS as opposed to monotone functions. The authors need to articulate in more detail the motivation for their new notion. I also find the abundance of notions and results a disadvantage, making it harder for a reader to take away the important findings and use this paper in theory and practice. The authors may want to focus first on their most important 2-3 findings as opposed to talking about stability, infinite sets (less relevant to ML) and new activation functions. Furthermore, the paper is peppered with inaccuracies (line 376, an inaccurate reference to the appendix, and more). My feeling is that this paper is not yet ready for publication in a prime conference such as NeurIPS.
Questions
- In line 31, [15] is cited as relevant to the paper. I do not see why it is relevant. Perhaps the authors can explain?
- Why introduce a new activation function if the embedding can be implemented by a simple architecture with two layers?
- The introduction of parameters in the denominators in (8) could lead to numerical instability while training. Can it be avoided?
Limitations
- Lack of compelling motivation for the notion of MAS.
- Numerous new notions (MAS, stability, new activation function and more).
Final Justification
The authors addressed some of my concerns. Accordingly, I am raising my rating to borderline accept.
Formatting Issues
None.
We thank the reviewer for their comments, which we will incorporate in the revised version, if accepted.
One issue is the lack of compelling motivation for MAS as opposed to monotone functions.
We begin with two applications which are based on the set containment task:
(1) In recommendation system design, one problem is to recommend items having a particular set of features $S$. Here, we represent each item as a set $T$ of its features, and the problem is to find all sets $T$ such that, approximately, $S \subseteq T$. (2) Text entailment, where we are given a small query sentence $q$ and the goal is to find the corpus items $c$ from a large corpus $C$ where $c$ contains or entails $q$. Here, $q$ and $c$ are typically represented as sets of contextual embeddings ($S$ for $q$ and $T$ for $c$), and the entailment problem can be cast as the problem of checking if $S \subseteq T$.
In a neural network setting, such problems require us to design a function $F$ such that the order between $F(S)$ and $F(T)$ can serve as a boolean test for whether $S$ is a subset of $T$. For this, $F$ needs to satisfy two conditions:
(I) $F(S) \le F(T)$ implies $S \subseteq T$; (II) $S \subseteq T$ implies $F(S) \le F(T)$.
We would like to emphasise that for high accuracy, $F$ needs to satisfy both conditions (I) and (II). A monotone function ensures that if $S \subseteq T$, then we have $F(S) \le F(T)$, or equivalently, that $F(S) \not\le F(T)$ implies $S \not\subseteq T$ (i.e., condition (II) above). However, monotonicity alone does not ensure condition (I). As a result, if we predict $S \subseteq T$ based on $F(S) \le F(T)$ when $F$ is monotone but not separable, we will get a large number of false positives. Therefore, we need the notion of separable functions to ensure that our function satisfies condition (I).
Here, we ran some additional experiments for the text entailment application on the BEIR datasets NFCorpus and SciFact. The following table reveals that MAS functions perform better than all baselines (including four new baselines suggested by Reviewer b3Mt), as well as a monotone (but not MAS) function. As per Prop. 6 of our paper, DeepSets with a single-layer ReLU network as the embedding model is not separable (even weakly), which is what we chose as the monotone (but not MAS) set model for our experiment. Accuracy is reported as follows:
| Model | NFCorpus | SciFact |
|---|---|---|
| DeepSet | 0.61 | 0.69 |
| SetTransformer | 0.68 | 0.72 |
| FSPool | 0.65 | 0.70 |
| HORSE: Hierarchical Representation | 0.60 | 0.65 |
| Slot Set Encoder | 0.62 | 0.67 |
| UMBC Set Encoder | 0.66 | 0.66 |
| Monotone Only (shallow ReLU) | 0.59 | 0.65 |
| MASNET | 0.72 | 0.75 |
This strengthens our argument that monotonicity alone is not sufficient for set containment.
The authors may want to focus first on their most important 2-3 findings as opposed to talking about stability, infinite sets (less relevant to ML) and new activation functions.
We did ponder on this, but realized that there are several cases where infinite sets and stability are crucial. Note that by infinite sets, we mean the ground set or the universal set is infinite. Such cases are applicable when dealing with continuous features or representations. For example, in case of point clouds and images, the features come from a continuous distribution; in the context of health records, some features (e.g., blood sugar) may be continuous. Even in text applications, where the vocabulary consists of discrete words, the contextual embeddings of a word in a sentence (e.g., BERT embeddings or LLM encoder output) are effectively continuous vectors. Since the possible occurrences of a word across sentences can be really diverse, the set of token embeddings (with added positional encoding) can be roughly considered as an infinite set in practice, due to the very small resolution.
Stability, or Hölder separability, also has practical implications, especially for infinite sets. Typically, in applications, we encounter approximate set containment rather than exact set containment, labelled by human annotators. For example, if two images consist of A = ("leopard") and B = ("tiger", "tree"), then a human annotator is likely to assign $A \subseteq B$. Hölder separability can better account for this than exact set containment.
paper is peppered with inaccuracies
We sincerely apologise for the inadvertent errors in the current version of the manuscript. In our internal repository, the manuscript was in order; however, during the final stages of submission to OpenReview, a few hasty edits introduced these typos. In fact, during the original submission, we uploaded a version in supplementary material, which has all broken Appendix references fixed.
We would like to emphasise that these issues have already been resolved in our internal version. Unfortunately, OpenReview does not allow us to update the current submission version, but we will ensure that all such errors are fully corrected in the camera-ready version, if the paper is accepted.
We hope that these typos—already fixed on our end—do not detract from the technical strengths of the work, which you also recognised.
in line 31, [15] is cited as relevant to the paper.
The paper considers a reduction to the set splitting problem to prove the hardness result. This is the task of partitioning a ground set $U$ into two parts $U_1, U_2$ such that, for a given collection of subsets $S_1, \ldots, S_k \subseteq U$, we must have $S_i \not\subseteq U_1$ and $S_i \not\subseteq U_2$ for all $i$. This is basically preventing set containment for a subset family. For some context, we intended to mention some of the relevance of set monotonicity to learning theory. But we understand the connection may appear quite remote, and we will clarify this more in the final version, if our paper gets accepted.
Why introduce a new activation function if embedding can implemented by a simple architecture with two layers?
When we fine-tune a large $N$-layer model for different tasks, a common practice is to freeze the first $N-1$ layers and only fine-tune the last layer. Thus, here we fine-tune only one layer rather than two layers. In such cases, the variant of our model with exactly one layer and the new activation function can be more easily plugged in. Moreover, for such large models, which require large memory to fit, adding more layers will make training difficult with the available resources.
To elaborate more, we consider image retrieval applications, where we are given a segment of an image as the query, and the goal is to retrieve all images in the database which contain the object(s) in the query segment. In this task, we only fine-tune the last layer of a CNN with our new activation function. Experiments were performed on the MNIST dataset, and the goal was to retrieve all the images with the same digit as the query image segment. The performance is measured in accuracy, where each image in the corpus was assigned a boolean label of 0/1 depending on whether the digit in the image is the same as that of the query segment.
| Model | Accuracy |
|---|---|
| DeepSet | 0.62 |
| SetTransformer | 0.71 |
| FSPool | 0.84 |
| UMBC Encoder | 0.82 |
| Slot Set Encoder | 0.79 |
| MASNET | 0.85 |
Apart from its practical usage, the hat activation was first introduced as a tool to develop the TRI function, which was a key step in theoretically justifying that two-layer neural nets can be used to develop a MAS function. Hence, to prove that a simple architecture with two layers can serve as a MAS function, we had to develop this new activation function.
The introduction of parameters in the denominators in (8) could lead to numerical instability while training. Can it be avoided?
We did not incur too much instability during training of MASNET-Hat, where the Hat function admits the particular form in (8). We have two learnable parameters in the denominators: the support width $\beta$ (where $\beta > 0$) and the peak parameter $\gamma$ (where $\gamma \in (0,1)$). The way we tackled this was through the particular form of the parametrizations of $\beta$ and $\gamma$ during training. This was done as follows:
- For ensuring $\beta > 0$, we introduced an unconstrained learnable parameter $\hat{\beta}$ and then tried several different parametrizations of $\beta$, as follows: (i) $\beta = |\hat{\beta}|$, (ii) $\beta = \hat{\beta}^2$, (iii) $\beta = \mathrm{ELU}(\hat{\beta}, \nu) + \nu$, where $\nu$ is a hyper-parameter and $\hat{\beta}$ is learnable. Amongst these choices, the third one worked best. An intuitive explanation is that training instability comes into the picture when $\beta \approx 0$. In the third parametrization, $\beta \ge \nu$ whenever $\hat{\beta} \ge 0$, so for most values of $\hat{\beta}$ we have $\beta$ bounded away from $0$. The choice of the hyperparameter $\nu$ slightly affected the accuracy, but with no significant impact.
- For ensuring $\gamma \in (0,1)$, we again chose a learnable unconstrained parameter $\hat{\gamma}$, a hyper-parameter $\tau$, and used the tempered-sigmoid based parametrization $\gamma = \mathrm{Sigmoid}(\hat{\gamma} \cdot \tau)$. Like before, we had stable training, since for most values of $\hat{\gamma}$ the quantity $\gamma \cdot (1-\gamma)$ stays away from $0$. As observed in our experiments, the temperature hyperparameter $\tau$ did not affect the accuracy a lot.
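For concreteness, here is a small sketch of the two parametrizations above and how they feed into the hat activation (our own illustration; the hyperparameter values are arbitrary, and the hat expression follows the reconstruction of eq. (8) given earlier in this thread):

```python
import torch
import torch.nn.functional as F

nu, tau = 0.5, 0.1                      # illustrative hyperparameter values
beta_hat = torch.randn(())              # unconstrained learnable parameter
gamma_hat = torch.randn(())             # unconstrained learnable parameter
alpha = torch.randn(())                 # left end of the support

beta = F.elu(beta_hat, alpha=nu) + nu   # beta > 0, and beta >= nu whenever beta_hat >= 0
gamma = torch.sigmoid(gamma_hat * tau)  # gamma in (0, 1), close to 1/2 for small tau

def hat(x):
    """Parametric hat supported on [alpha, alpha + beta], peaking at alpha + gamma*beta."""
    r = torch.relu
    return (r((x - alpha) / (gamma * beta))
            + r((x - (alpha + beta)) / ((1 - gamma) * beta))
            - r((x - (alpha + gamma * beta)) / (gamma * (1 - gamma) * beta)))
```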
Here, we give a table of the average and standard deviation (across seeds) of accuracy for the different parametrizations of $\beta$.
| Parametrization of $\beta$ | Synthetic | MSWEB | MSNBC |
|---|---|---|---|
| Abs value | 0.97() | 0.95() | 0.96() |
| Square | 0.96() | 0.94() | 0.94() |
| ELU | 0.99( 0.01) | 0.95() | 0.97() |
Thanks for the rebuttal that clarified some of my questions. I am not convinced by the rationale of introducing new activation function. In any event the author may want to correct the sentence "Apart from the practical usage, the hat activation was introduced was firstly as a tool to develop the TRI function..."
Dear Reviewer 7bQs,
Sorry to bother you again. Could you kindly check our follow-up responses (Re. rationale behind introducing a new activation function, and More re. motivations behind Hat activations) and let us know if we’ve correctly understood and addressed your concern? We now realize that our initial explanation was unconvincing due to a misunderstanding on our part.
We’d be very grateful if you could take a quick look and let us know.
Best,
Authors
Dear Reviewer 7bQs,
In an attempt to answer your questions about justifying the new class of Hat activation functions, we've followed up our experiments on synthetic data with some additional experiments on real-world datasets, namely MSWEB and MSNBC. We solved the task of exact set containment, where the ratio of subsets to non-subsets was taken to be 1:1. We considered two MASNET models, where MASNET-Hat is equipped with a one-layer neural network as the embedding model and MASNET-ReLU has a two-layer architecture with a comparable number of parameters (1.1x the number of parameters of MASNET-Hat). The accuracy numbers are as follows:
| Model | MSWEB | MSNBC |
|---|---|---|
| MASNET-ReLU | 0.92 | 0.94 |
| MASNET-Hat | 0.98 | 0.97 |
This backs up our intuition that Hat activation shows better separability than ReLU with a comparable number of parameters.
Could you please let us know if our answers addressed your concerns?
Dear Reviewer 7bQs,
We would really like to thank you for engaging with us.
We acknowledge that we misunderstood the intent of your question in the original rebuttal. Our response focused on the utility of a one-layer network, whereas your question was more concerned with the rationale behind introducing the new activation function. We provide a clarification below. We kindly request the reviewer to look into it and please get back to us once more.
In our initial experiments, we compared a single-layer MASNET-HAT (HAT is the new activation function) with a two-layer MASNET-ReLU (the two layer network which the reviewer has pointed out). The small difference between performances of MASNET-ReLU and MASNET-HAT in this setting should not be taken as evidence against the utility of the HAT function. Instead, the observed performance gap is primarily due to the increased number of parameters in MASNET-ReLU, resulting from its deeper architecture.
To more directly evaluate the effectiveness of the HAT function, we conducted experiments where we approximately equalized the number of parameters between MASNET-HAT and MASNET-ReLU.
First, we control the hidden dimension of a two-layer MASNET-ReLU, so that the parameter count of MASNET-ReLU becomes close to (1.1x) that of MASNET-HAT with a single layer, on our synthetic dataset with a 1:1 subset-pair to non-subset-pair ratio. The following results show that MASNET-HAT performs better.
| Model | Accuracy |
|---|---|
| MASNET-ReLU | 0.916 |
| MASNET-HAT | 0.974 |
Next, we compare a two-layer MASNET-HAT with a two-layer MASNET-ReLU on the same dataset. The following results show that MASNET-HAT performs better.
| Model | Accuracy |
|---|---|
| MASNET-ReLU | 0.971 |
| MASNET-HAT | 0.998 |
These two results demonstrate that the HAT activation is effective and outperforms ReLU when evaluated under comparable model capacity. Since both models are monotone, the improved performance can be attributed to the superior separability of MASNET-HAT compared to MASNET-ReLU under similar model capacity.
Moreover, in our proof of MASNET-ReLU's separability, we crucially relied on the fact that a single-layer network with the HAT activation function is separable. In this sense, the HAT function serves as a foundational stepping stone in establishing the separability of MASNET-ReLU. Without first introducing and analyzing the HAT activation, it would not have been possible to rigorously justify the separability of MASNET-ReLU, despite its seemingly simpler architecture.
We would also like to emphasize that, as a class of functions, HAT activations are significantly richer than ReLU. Unlike ReLU, which is inherently piecewise linear, HAT activations are not restricted to piecewise linearity and can capture more complex functional forms.
Thank you for pointing out the issue with the sentence construction (we cannot change the sentence anymore, unfortunately, since the edit button is frozen). We will definitely ensure a more polished and formal discussion in the final version, should the paper be accepted.
Dear Reviewer 7bQs,
We wanted to follow up on whether your concerns about the motivation and justification for introducing the class of hat activations have been addressed by our experiments and intuitions that we provided. Since the author-reviewer discussion period has been extended by a couple of days, we were wondering if we could provide you with some deeper insights into it or other aspects of our methods/theory.
We would appreciate it if you kindly let us know what you think. We will be more than happy to provide any further clarifications.
Best,
Authors
Dear Reviewer 7bQs,
In an attempt to further strengthen our claim on why we need a new Hat activation, we have performed more experiments on Amazon Baby Registry datasets (Bedding, Feeding) and BEIR datasets (NFCorpus, SciFact). We have kept the number of parameters of two-layer MASNET-ReLU and two-layer MASNET-Hat to be similar. The accuracy numbers are given as follows:
| Model | Bedding | Feeding | NFCorpus | SciFact |
|---|---|---|---|---|
| MASNET-ReLU | 0.95 | 0.92 | 0.71 | 0.70 |
| MASNET-Hat | 0.97 | 0.96 | 0.73 | 0.75 |
Here, we observe that, under similar parameter scale, using Hat function shows advantage over ReLU, following our theoretical insight that ReLU networks learn to simulate Hat functions.
Hope this strengthens our point on why Hat functions are useful even outside theoretical analysis.
Since the discussion period is going to end soon, we would kindly request you to please let us know if our response addresses your rebuttal questions.
Thank you. I have no further questions.
The authors completely characterize the dimensions needed for a “monotone and separating” embeddings (for sets/multisets) in finite settings and show they are impossible in infinite ones, then introduce a relaxed “weakly” version that admits practical neural implementations. They propose MASNET, a simple sum decomposable architecture, based on Deepsets, with specially chosen activations that provably enforces monotonicity and, with high probability, separability. Finally, experiments on synthetic, text, and point cloud tasks demonstrate that MASNET outperforms existing permutation-invariant models on set containment and monotone function approximation.
Strengths and Weaknesses
Strengths
- The paper precisely establishes when exact monotone and separating embeddings exist for sets in the finite setting and shows their non-existence in infinite ones.
- Weakly MAS functions are introduced to strike a practical balance between theoretical rigor and real-world feasibility.
- MASNET combines simple sum-decomposable functions with tailored activations to guarantee monotonicity and separability (where possible), and the paper also introduces the notion of Hölder stability.
- MAS embeddings are shown to approximate any monotone set function, with exponential-decay bounds on separation failure.
- MASNET consistently outperforms standard permutation-invariant models on containment and monotone function tasks.
Weaknesses
- A large portion of the set literature seems to be ignored in this work (see References).
- The set sizes in the experiments are smaller than what is normally used in the set literature.
[1] Fspool: Learning set representations with featurewise sort pooling
[2] Scalable set encoding with universal mini-batch consistency and unbiased full set gradient approximation
[3] Enhancing neural subset selection: Integrating background information into set representations
[4] Mini-batch consistent slot set encoder for scalable set encoding
[5] HORSE: Hierarchical Representation for Large-Scale Neural Subset Selection
Questions
- Sum-decomposable functions like DeepSets have been shown to have limited expressivity compared to attention-based methods like the Set Transformer. Given that the proposed architecture is based on DeepSets, what would be required to propose an architecture based on a transformer, like the Set Transformer, that is monotone and separable? A discussion is enough.
- How are MAS functions related to the MBC methods [2,4]? From my understanding, the restriction they impose on embeddings of subsets, together with the aggregation functions they utilize (similar to DeepSets), has some connection with the problem tackled in this work.
- What happens when the set sizes grow to millions? While from Table 6 it seems performance increases with increasing |S|, what happens when |T| is very large?
Limitations
Yes.
Final Justification
Most of the queries/concerns I had have been addressed by the authors.
Formatting Issues
Appendix references missing.
We thank the reviewer for their suggestions, which will improve our paper.
A large portion of the set literature seem to be ignored in this work (see References).
Many thanks, we will incorporate and discuss them in the main paper if our paper gets accepted.
FSPool used a sort-based pooling strategy and introduced the responsibility problem. Both HORSE and INSET are designed specifically for subset selection. Specifically, HORSE is built on attention and aims to retain information from both the input ground set and the optimal subset. In contrast, our method focuses on the characterization of and neural design for monotone and separable set functions, specifically for the set containment task.
How are MAS functions related to the MBC methods... have some connection with the problem tackled in this work.
Relation to MBC works: The goal of the MBC works is to design a set function $F$ which, if applied to any subsets from a partition of a set and pooled through an aggregation function $g$, should return the embedding of the entire set. Thus, at a high level, MBC functions analyse and design functions for the following batch consistency condition: $g\big(F(S_1), \ldots, F(S_k)\big) = F(S)$ for any partition $S = S_1 \cup \cdots \cup S_k$. MBC methods achieve this by imposing certain restrictions on attention-based set architectures. Similar to this, our work also puts restrictions on the inner embedding transformation and the outer vector-to-vector transformation. However, the nature of the restrictions in our method is very different: our method offers neural restrictions to ensure monotonicity and weak separability, which would translate the partial order of set containment to the partial order of coordinate-wise vector dominance.
MASNET can be made Mini-Batch Consistent: The reviewer's question prompted us to consider whether MASNET can be made mini-batch consistent. Since set containment is the major theme of our work, with potential applications in retrieval and recommendation systems, processing huge sets at once can be a key bottleneck, and it would be nice to have MASNET process mini-batches and then design an aggregation mechanism that makes the aggregated embedding equal to the embedding of the entire set. If we consider a partition $T = S_1 \cup \cdots \cup S_k$ of a given set $T$, where the $S_i$'s are all disjoint, then $F(S_i) \le F(T)$ for every $i$ by monotonicity. But we can try to find an appropriate pooling mechanism over the subset embeddings to make the aggregated embedding the same as that of the entire set.
For example, if the outer transformation $\rho$ of MASNET, as in eq. (7) of the main text, were invertible, then we may define the permutation-invariant pooling function $g(z_1, \ldots, z_k) = \rho\left(\sum_{i=1}^{k} \rho^{-1}(z_i)\right)$. Then we have that $g\big(F(S_1), \ldots, F(S_k)\big) = \rho\left(\sum_{i=1}^{k} \sum_{x \in S_i} \phi(x)\right) = \rho\left(\sum_{x \in T} \phi(x)\right) = F(T)$. This holds irrespective of how we partition $T$ into $S_1, \ldots, S_k$, and it would enable us to process the subsets independently and aggregate them accordingly.
For our purposes, we needed the outer transformation $\rho$ to be a vector-to-vector monotone function, but that does not guarantee injectivity or invertibility. Thus, we need to restrict the choice of $\rho$ to be both monotone and invertible. One possible choice is to use ideas from relevant works like [A], [B] to design monotone and invertible neural networks, and analyse the expressivity properties of the resulting set model.
[A] Unconstrained Monotonic Neural Networks by Wehenkel and Louppe
[B] Analyzing Inverse Problems with Invertible Neural Networks by Ardizzone et. al.
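As an illustration of this aggregation, here is a toy sketch under our own simplifying assumptions: a sum-decomposable $F(S) = \rho(\sum_{x \in S} \phi(x))$ with $\phi$ the coordinate-wise absolute value and an invertible monotone $\rho = \exp$ (stand-ins, not the paper's networks):

```python
import numpy as np

def phi(X: np.ndarray) -> np.ndarray:
    """Toy non-negative element embedding (stand-in for the inner MLP)."""
    return np.abs(X)

def rho(z: np.ndarray) -> np.ndarray:
    """Toy invertible monotone decoder; exp is monotone with inverse log."""
    return np.exp(z)

def rho_inv(z: np.ndarray) -> np.ndarray:
    return np.log(z)

def F(X: np.ndarray) -> np.ndarray:
    return rho(phi(X).sum(axis=0))

def mbc_pool(chunk_embeddings) -> np.ndarray:
    """Aggregate F(S_1), ..., F(S_k) of a disjoint partition back into F(T)."""
    return rho(sum(rho_inv(e) for e in chunk_embeddings))

rng = np.random.default_rng(1)
T = rng.normal(size=(10, 3))
parts = np.split(T, [4, 7])                                 # disjoint partition of T
print(np.allclose(F(T), mbc_pool([F(P) for P in parts])))   # True
```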
Experimental comparison with MASNET: We compare the suggested methods against our method on BEIR datasets for text retrieval applications. The following table shows the results, which show that we perform better.
| Model | NFCorpus | SciFact |
|---|---|---|
| DeepSet | 0.61 | 0.69 |
| SetTransformer | 0.68 | 0.72 |
| FSPool | 0.65 | 0.70 |
| HORSE: Hierarchical Representation | 0.60 | 0.65 |
| Slot Set Encoder | 0.62 | 0.67 |
| UMBC Set Encoder | 0.66 | 0.66 |
| MASNET | 0.72 | 0.75 |
The set sizes are smaller... What happens when the set sizes grow to millions?
In the following experiments on synthetic data (with a 1:1 ratio of subsets to non-subsets), we keep $|S|$ fixed and increase $|T|$.
Following are the results.
| $S$-size | $T$-size | DeepSet | Set Transformer | MASNET |
|---|---|---|---|---|
| 1000 | 3000 | 0.68 | 0.73 | 0.99 |
| 1000 | 10000 | 0.66 | 0.69 | 0.97 |
| 1000 | 50000 | 0.58 | 0.68 | 0.95 |
| 1000 | 100000 | 0.61 | 0.63 | 0.92 |
| 1000 | 500000 | 0.55 | 0.59 | 0.90 |
| 1000 | 1000000 | 0.51 | 0.54 | 0.87 |
Table: Accuracy for set containment as $|T|$ grows large.
We observe that (1) our method outperforms the baselines more and more significantly as $|T|$ grows large; (2) the accuracy of all methods worsens as $|T|$ grows large, but the degradation in performance of the baselines is very large, whereas the degradation of our method is quite modest. Intuitively, the chances of wrongly predicting $S \subseteq T$ increase as $|T|$ grows, but MAS functions still do better as they are equipped with provable separability.
Sum-decomposable functions like DeepSets have been shown to have limited expressivity compared to attention-based methods. What would be required to propose an architecture based on the transformer .. that is monotone and separable? A discussion is enough.
Set Transformers are not monotone. Attention makes monotonicity challenging due to the normalising factor in the denominator. We give a constructive proof of the fact that, for a large class of weights of the Set Transformer, we can find sets $S \subseteq T$ such that $F(S) \not\le F(T)$. Thus, it is not pointwise monotone, unlike MASNET, as shown below:
Lemma: Let $d \ge 1$, and let $F$ be a Set Transformer with sum pooling, defined by $F(S) = \sum_{i} \sum_{j} w_{ij}\, V x_j$ with attention weights $w_{ij} = \frac{\exp(\langle Q x_i, K x_j\rangle)}{\sum_{l} \exp(\langle Q x_i, K x_l\rangle)}$; then $F$ is not a pointwise monotone function.
Proof. We need to show that there exist some parameters for which the function is not monotone. Choose any value matrix $V$ which has no kernel, and any non-zero vector $x$; then $V x$ is non-zero. By swapping $x$ with $-x$ if necessary, we can choose some index $t$ such that the $t$-th coordinate of $V x$ is negative. By assumption, $x$ is a non-zero vector, and hence it has a positive norm.
Now, let $S$ be the singleton set $\{0_d\}$, which is a subset of $T = \{0_d, x\}$, where $0_d$ denotes the all-zero vector in $\mathbb{R}^d$. We note that $F(S) = V 0_d = 0$, since the only value vector is zero. On the other hand, since the attention weights in $T$ are positive and the value of $0_d$ is zero, $F(T)$ is a positive multiple of $V x$. Note that $(V x)_t < 0$, and it follows that $F(T)_t < F(S)_t$, so $F(S) \not\le F(T)$ even though $S \subseteq T$, which shows that this function is not monotone. This concludes the proof.
To achieve monotonicity in interaction-based set models, one could extend the notion of attention without the normalization factor, i.e., use functions of the form $F(S) = \sum_{x, y \in S} k(x, y)\, v(y)$, where $k$ is a non-negative function modelling the interaction between elements and $v$ is a vector-valued function with non-negative entries. Such set functions are monotone by design. Obtaining separability results in such cases could be quite challenging, and this seems like an interesting direction for future work.
For separability, we first point to Corollary 4 (line 173) of the main text, where we show that one cannot achieve both monotonicity and separability for infinite ground sets. Thus, for a "monotone version" of a set transformer, achieving full separability is impossible, and we thus fall back on the notion of weak separability, which requires every pair of non-subsets to be separated by at least one set of parameters. The exact characterization of the parameters that would make this "monotone version" of the transformer separable, too, is challenging.
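A toy sketch of the unnormalized-attention form described above (our own illustration; the exponential kernel and absolute-value map are example choices for the non-negative $k$ and $v$):

```python
import numpy as np

def monotone_attention(X: np.ndarray) -> np.ndarray:
    """Unnormalized attention F(S) = sum_{x,y in S} k(x, y) * v(y).

    k >= 0 (exponential kernel) and v >= 0 (absolute value), so enlarging
    the multiset only adds non-negative terms: S ⊆ T implies F(S) <= F(T).
    """
    k = np.exp(X @ X.T)          # non-negative pairwise interaction k(x, y)
    v = np.abs(X)                # non-negative, vector-valued v(y)
    return (k @ v).sum(axis=0)   # double sum over element pairs

rng = np.random.default_rng(0)
T = rng.normal(size=(5, 3))
S = T[:3]                        # S ⊆ T as multisets
print(bool(np.all(monotone_attention(S) <= monotone_attention(T))))  # True
```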
Experimental evidence that the Set Transformer is not MAS: Monotonicity means that $S \subseteq T \Rightarrow F(S) \le F(T)$, or in other words, that in the set containment problem there can be no false negatives. Hence, non-monotonicity results in large numbers of false negatives. Similarly, the absence of separability results in a large false positive rate. To evaluate this experimentally, we perform experiments on synthetic data with a 1:1 ratio between the number of subset pairs and non-subset pairs. Results are as follows:
| Model | False neg rate | False Pos rate |
|---|---|---|
| Set Transformer | 0.13 | 0.25 |
| MASNET-ReLU | 0.0 | 0.02 |
| MASNET-Hat | 0.0 | 0.03 |
Table: FNR and FPR for synthetic data with a 1:1 class ratio
We see that MASNET has no false negatives due to monotonicity, and a low false positive rate due to separability, as we expected. In contrast, the present implementation of Set Transformers has high rates of both false positives and false negatives.
Appendix link missing
We sincerely apologize for the inadvertent errors in the current version of the manuscript. In our internal repository, the manuscript was in order; however, during the final stages of submission to OpenReview, a few hasty edits introduced these typos. In fact, during the original submission, we uploaded a version of the full paper in supplementary material, which has all broken Appendix references fixed.
In any case, all these errors have been fixed in our current version of our manuscript, and all will be in order for the final version of the paper.
I would like to thank the authors for the comprehensive response to my queries. Overall, most of my concerns have been discussed/addressed hence I'm inclined to raise my score. Additionally, I will monitor the discussions with the other reviewers and re-evaluate at the end of the rebuttal period. Thanks again, especially for the detailed discussions on the transformer based models as well as the possible relation/extension to MBC models.
The paper discusses embedding (finite) multisets into vectors that preserves monotonicity -- i.e. the embedding of a larger multiset should be a (coordinate-wise) larger vector and conversely that two multisets–each one not a subset of the other–should be mapped to incomparable vectors. The latter direction is termed "separability" in the paper.
The paper explores the minimal vector dimension of such embeddings. It is shown that the minimal dimension is the size of the ground set (the set from which the elements of the multiset are taken). When restricting the function to work on multisets of bounded-cardinality, the minimal dimension may be smaller, and upper and lower bounds for it are derived. It’s also shown that no such embedding exists when the ground set is infinite.
To allow building such embeddings over infinite ground sets, the paper proposes 2 relaxations of the separability condition: "weakly separable" and "lower Hölder separable". It proves that the embeddings proposed in the DeepSets paper are in general not weakly separable, but claims that using a different activation function it terms a "hat function", or using more than one layer in the element embedding part, would make the final embeddings weakly separable and lower Hölder separable.
Finally, the paper evaluates the benefit of the proposed embeddings to the task of determining set containment and approximating a monotonic set-to-scalar function. It compares with SetTransformer and DeepSet embeddings as the baselines and shows that the embedding proposed perform better.
Strengths and Weaknesses
I think the inductive bias that the paper proposes for "good" multiset embeddings is useful. Containment is a major and general property of set objects, and good embeddings should preserve it. To my knowledge it's novel. I find that the paper provides a good theoretical characterization of the embeddings and works around the limitations with reasonable weakening of the requirements. The presentation is clear and motivations are provided throughout the paper on the various assumptions being made.
Weaknesses:
- I’m not an expert on partial-ordered sets or lattices, but I’d imagine that the results of Theorem 1 follow pretty easily from standard elementary results in those fields. It’s nice that the proofs are self-contained but it would be good to mention any standard results this can be derived from if any.
- I would have liked to see more end-to-end experiments of using MAS models to solve realistic modelling problems such as the examples in “Deep Sets”: point-cloud classification, Text Concept Set retrieval, Sum of Digits, etc. The paper only provides experiments that predict containment which is a much easier task for these embeddings given their inductive bias.
Questions
- It looks like Prop 8 contradicts Prop 6: As far as I can tell, a hat function can be monotone and non-negative, in which case by Prop 6, there exists an S not a subset of T for which F(S,(A,b)) <= F(T,(A,b)) for all A, b, so F is not weakly MAS. Am I missing something?
- In Theorem 13, isn't it true that for all injective functions F: P_{<=k}(V) → R^m there is an M such that F(S) = M(F(S))? Also, you're using 'F' to denote both the MAS function and the desired function F. Can you change the notation for one of them to make the theorem statement clearer?
Limitations
Yes.
Final Justification
The authors have addressed my concerns. It's a solid paper suitable for the conference.
Formatting Issues
- The actual definition of “weak separability” is missing due to a typo in the paper.
- In Theorem 12, do you mean m rows instead of columns?
- What does ‘m’ refer to in line 331?
We thank the reviewer for their comments, which we will incorporate in the revised version, if accepted.
I’m not an expert on partial-ordered sets or lattices, but I’d imagine that the results of Theorem 1 follow pretty easily from standard elementary results in those fields. It’s nice that the proofs are self-contained but it would be good to mention any standard results this can be derived from if any.
We agree that this problem of designing set embeddings that capture the notion of set containment is a very fundamental question, and many of our proof techniques, especially those related to the existential characterization of MAS functions, boil down to combinatorial arguments. However, we did an extensive literature survey and were unable to find significant existing work that derives results related to set embeddings that preserve the partial order of set containment.
We note that for some of the theorems we did employ standard techniques: in Theorem 2 we leveraged probabilistic-method-based techniques similar to Alon and Spencer's text on the probabilistic method [1], and for Theorem 3 we leveraged standard combinatorial results like the Erdős–Szekeres theorem [2]. We referenced [2] in the main text but accidentally omitted [1]; we will fix this in the final version.
[1] The Probabilistic Method by Noga Alon and Joel Spencer
[2] Erdős, P.; Szekeres, G. (1961), "On some extremum problems in elementary geometry", Ann. Univ. Sci. Budapest. Eötvös Sect. Math., 3–4: 53–62
I would have liked to see more end-to-end experiments of using MAS models to solve realistic modelling problems such as the examples in “Deep Sets”: point-cloud classification, Text Concept Set retrieval, Sum of Digits, etc. The paper only provides experiments that predict containment which is a much easier task for these embeddings given their inductive bias.
During the rebuttal, we performed several experiments on realistic examples as per your suggestion. Here, we present results on image classification on the MNIST dataset from a given segment of an image, and on text retrieval from the BEIR datasets (NFCorpus and SciFact). For both of these realistic tasks, we intuitively think that having set containment as an inductive bias, along with the power of weak separability, might be helpful, which is reflected in our experimental results as well.
Specifically, for image classification, we used an N-layer CNN, where we froze the first N-1 layers and modified the last layer to ensure that the resulting set function is a MAS function. Thereafter, we predict the boolean 0/1 label indicating which images of the MNIST dataset have the same digit (object) as the candidate segment image.
For the BEIR datasets, we used BERT embeddings as input to a MAS function. The following table shows the results, which show that we perform better than standard set-based models at a similar scale of parameters.
For both the experiments, we report the accuracy numbers in the following table:
| Model | MNIST | NFCorpus | SciFact |
|---|---|---|---|
| DeepSet | 0.62 | 0.61 | 0.69 |
| SetTransformer | 0.71 | 0.68 | 0.72 |
| MASNET | 0.85 | 0.72 | 0.75 |
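To make the MNIST setup concrete, the following is a minimal sketch of the kind of model described above: a frozen CNN feature extractor followed by a sum-pooled head with non-negative weights and a monotone activation, so that adding images to a set can only increase each embedding coordinate. The CNN architecture, layer sizes, activation, and the `MASHead` module are illustrative assumptions on our part, not the exact MASNET used in the experiments.

```python
import torch
import torch.nn as nn

class MASHead(nn.Module):
    """Sum-pooled head with non-negative weights and a monotone activation.

    Per-element outputs are non-negative and pooling is a sum, so adding
    elements to the set can only increase each embedding coordinate
    (monotonicity under set inclusion). Illustrative sketch only; not the
    exact MASNET layer from the paper.
    """
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(torch.rand(out_dim, in_dim))
        self.b = nn.Parameter(torch.zeros(out_dim))

    def forward(self, feats):                       # feats: (set_size, in_dim)
        per_elem = torch.relu(feats @ self.W.abs().t() + self.b)
        return per_elem.sum(dim=0)                  # one embedding for the whole set

# Frozen feature extractor (placeholder; the rebuttal does not specify the CNN).
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
)
for p in cnn.parameters():
    p.requires_grad = False

head = MASHead(in_dim=8 * 4 * 4, out_dim=16)
segments = torch.rand(5, 1, 28, 28)                 # a "set" of 5 image segments
set_embedding = head(cnn(segments))                 # shape: (16,)
```

A containment-style decision (e.g., checking whether a candidate segment's embedding is dominated coordinatewise by the set's embedding, up to a learned threshold) can then be made on top of such embeddings; this, too, is one natural option rather than the exact scoring rule used in the paper.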
It looks like Prop 8 contradicts Prop 6: As far as I can tell, a hat function can be monotone and non-negative, in which case, by Prop 6, there exists an S not a subset of T for which F(S, (A,b)) <= F(T, (A,b)) for all A, b, so F is not weakly MAS. Am I missing something?
Thanks for raising this point; we would like to point out that a hat function cannot be monotone.
According to Definition 7, a hat activation is a real-valued function σ that is (a) non-negative, (b) compactly supported, (c) not identically zero and (d) continuous. Hence, if σ is a hat activation, there exists M > 0 such that σ(x) = 0 for all |x| > M. Thus, if σ were monotone (say, non-decreasing), then 0 ≤ σ(x) ≤ σ(M + 1) = 0 for every x, so σ would be identically zero, which is in contradiction with (c); the non-increasing case is symmetric.
Since a hat function can never be monotone, we cannot apply Prop 6 to hat activations (it applies only to monotone activation functions such as Sigmoid or ReLU), and thus Prop 6 does not contradict Prop 8.
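For intuition, here is a tiny numerical check with a triangular hat (one possible function satisfying (a)-(d); the exact parameterization used in the paper may differ), confirming that it rises and then falls and is therefore not monotone:

```python
import torch

def hat(x, center=0.0, width=1.0):
    # Triangular hat: non-negative, continuous, compactly supported on
    # [center - width, center + width], and not identically zero.
    return torch.clamp(1.0 - (x - center).abs() / width, min=0.0)

x = torch.linspace(-2.0, 2.0, steps=9)
y = hat(x)
print(y)                                           # rises on [-1, 0], falls on [0, 1]
non_decreasing = bool((y[1:] >= y[:-1]).all())
non_increasing = bool((y[1:] <= y[:-1]).all())
print(non_decreasing, non_increasing)              # False False -> not monotone
```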
In Theorem 13, isn’t it true that for all injective functions F: P_{<=k}(V) → R^m there is an M such that F(S) = M(F(S))? Also, you’re using ‘F’ to denote both the MAS and the desired function F. Can you change the notation for one of them to make the theorem statement clearer?
We agree with the reviewer that the notation for one of the functions should be changed; below we rename the second function to G. Theorem 13 should read as follows:
Let V be a finite ground set, and let F: P_{<=k}(V) → R^m be a MAS function. Then for every monotone multiset-to-vector function G: P_{<=k}(V) → R^{m'}, there exists a monotone vector-to-vector function M: R^m → R^{m'} such that G(S) = M(F(S)) for every S in P_{<=k}(V).
With this correction, the theorem statement is now clearer and implies a non-trivial result.
The actual definition of “weak separability” is missing due to a typo in the paper.
Sorry about this; we accidentally commented out the definition while making some edits just before submission. The definition of weak separability is as follows:
“The parametric set function F(·; θ) is weakly separable if, given multisets S, T such that S is not a subset of T, there exists a parameter θ such that F(S; θ) > F(T; θ).”
This essentially means that at least some point in the parameter space separates any given pair of non-subsets.
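Purely as an illustration of how this definition can be probed, the sketch below samples parameters θ = (A, b) at random and looks for one that separates a given non-subset pair. The specific parametric function F, the parameter shapes, and the sampling distribution are our own assumptions for this sketch, not the construction from the paper:

```python
import torch

def F(S, A, b):
    # One possible parametric set function: sum-pool a hat activation of an
    # affine map, then sum coordinates (illustrative, not the paper's MASNET).
    return torch.clamp(1.0 - (S @ A.t() + b).abs(), min=0.0).sum()

def finds_separating_param(S, T, n_samples=1000, hidden=8):
    # Search for some theta = (A, b) with F(S; theta) > F(T; theta).
    # Finding one certifies separation of this particular pair; not finding
    # one is inconclusive.
    d = S.shape[1]
    for _ in range(n_samples):
        A, b = torch.randn(hidden, d), torch.randn(hidden)
        if F(S, A, b) > F(T, A, b):
            return True
    return False

S = torch.tensor([[2.0, 0.0]])                  # S is not a subset of T
T = torch.tensor([[0.0, 1.0], [1.0, 1.0]])
print(finds_separating_param(S, T))
```

Note that weak separability quantifies over all non-subset pairs, so a sampling check like this can only give evidence for a fixed pair; it does not replace the proof.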
In Theorem 12, do you mean m rows instead of columns?
Yes, we meant rows: the matrix A in Theorem 12 has m rows, and these rows are sampled independently.
What does “m” refer to in line 331?
Apologies for the confusion: “m” is a typo in line 331. It should instead refer to the quantity defined in line 330, which is an interior point of the support of the hat activation.
Thanks for pointing out all these typos; we have fixed them in our current version of the manuscript, so the final version of the paper will be corrected.
I thank the author for addressing my concerns. With the new results and fixes I'm changing my rating to 5.
Dear Reviewer z6Li,
I thank the author for addressing my concerns. With the new results and fixes I'm changing my rating to 5.
We are happy to know that our response has addressed your concerns. Regarding the rating, we still see the old rating on our side; note that an updated rating becomes invisible to us (and only becomes visible at the final decision stage). Could you please check once?
Dear Reviewer z6Li,
We would like to thank you for your positive feedback and comments, which we will surely include in the final version of the paper, if accepted. Since the discussion period will be over soon, we would like to ask whether you have any remaining concerns that you would like us to clarify.
Thanks
Authors
Dear Reviewers,
We would like to thank you for your thoughtful reviews. We attempted to address all your suggestions in our detailed rebuttal. Key points are as follows:
- We showed theoretically that Set Transformers are not monotone. We also showed experimentally that Set Transformers cannot learn to be both monotone and separable to a good extent through training alone (a toy illustration follows this list).
- We showed the utility of hat functions and explained, both empirically and theoretically, why they are better than ReLU. We also demonstrated the utility of hat functions across several applications: text retrieval, a recommendation dataset, and an image-based application.
- We discussed and compared our method with MBC (Mini-Batch Consistent) methods, and showed that our method can easily be adapted to the MBC setting.
- We showed experimentally that our model scales much better than DeepSets and Set Transformers when the size of the sets grows very large.
- We discussed, with experimental justification, how to mitigate challenges related to training deep networks with hat activations.
- We explained why both monotonicity and separability are necessary for set-containment tasks: they tackle false negatives and false positives, respectively.
- We have addressed some writing issues, including typos in the paper.
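Regarding the first bullet, the core reason attention-based pooling breaks monotonicity can be seen in a stripped-down toy: with softmax attention, adding an element to a set can decrease the pooled representation coordinatewise, whereas a sum of non-negative per-element terms cannot. The snippet below uses bare attention pooling as a stand-in; it is not the full Set Transformer architecture nor our formal proof:

```python
import torch

def attn_pool(X, q):
    # Softmax-attention pooling over set elements (a stripped-down stand-in
    # for Set-Transformer-style pooling; not the full PMA/MAB blocks).
    w = torch.softmax(X @ q, dim=0)              # attention weights, sum to 1
    return (w.unsqueeze(1) * X).sum(dim=0)       # attention-weighted average

q = torch.tensor([1.0, 0.0])
S = torch.tensor([[1.0, 1.0]])
T = torch.cat([S, torch.tensor([[0.0, 0.0]])])   # T strictly contains S

print(attn_pool(S, q))   # tensor([1., 1.])
print(attn_pool(T, q))   # ~tensor([0.73, 0.73]): smaller in every coordinate
```

By contrast, MASNET-style sum pooling of non-negative per-element outputs can only grow as elements are added.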
We earnestly request the reviewers to let us know whether our responses have satisfactorily addressed your queries, or whether there are areas where further elaboration could be helpful. Though it is the last hour of the discussion, we would still be happy to provide further clarifications or justifications of our methods.
This work focuses on developing neural network models suited for multiset containment tasks. The paper first defines a class of set functions that the authors refer to as 'monotone and separable' (MAS). The paper then provides bounds on the embedding dimensions required for MAS functions, as a function of the cardinality of the multisets and the underlying ground set. No such functions exist in the case of an infinite ground set. The paper then proposes MASNET, a model which satisfies a weak MAS property and is stable. Experiments demonstrate that the MASNET model outperforms standard baseline models on set containment tasks.
The reviewers agreed that the paper's main strength is its theoretical contributions. In particular, the theoretical results presented in the paper appear to be non-trivial and potentially impactful. Incorporating set containment as an inductive bias can be useful for various set learning tasks. The reviewers raised concerns about the empirical evaluation of the proposed models, which focuses only on multiset containment tasks and ignores other common tasks such as point cloud classification and text concept set retrieval; about the marginal improvements offered by the proposed models; about the lack of motivation for MAS; and about missing related works. The reviewers also complained about various formatting issues. Moreover, one reviewer raised concerns about the novelty of the proposed architecture, since a MASNET-ReLU model can be seen as a special case of DeepSets.
The authors claimed in the response that the formatting issues have already been resolved in their internal version of the manuscript. They also argued that it is useful to know how to implement DeepSets in the context of set containment problems. In addition, the authors attributed the marginal improvements to the distribution of classes in the considered datasets. Moreover, during the rebuttal, the authors conducted additional experiments on image classification and text retrieval. The reported results seem promising. The authors also experimented with larger set sizes and conducted experiments to highlight the need for Hat activation functions.
It is also my view that the theoretical contributions presented in the paper are original and solid. However, I also believe that the paper would benefit from a more extensive empirical evaluation of the proposed method. The additional experiments conducted during the rebuttal go in the right direction and should be incorporated into the manuscript. I also suggest the authors provide explanations regarding the marginal improvements and discuss whether the MAS condition is even necessary in practical applications. To summarize, even though the empirical results are somewhat limited, the theoretical contributions are sufficient to recommend acceptance. I strongly encourage the authors to revise the paper and improve the writing taking into account the detailed comments in the reviews.