PaperHub
Overall rating: 4.7/10 · Decision: Rejected · 3 reviewers (ratings: 6, 5, 3; min 3, max 6, std 1.2)
Confidence: 2.3 · Correctness: 2.7 · Contribution: 2.7 · Presentation: 2.3
ICLR 2025

Sparling: Learning Latent Representations with Extremely Sparse Activations

OpenReview · PDF
Submitted: 2024-09-25 · Updated: 2025-02-05
TL;DR

We prove it is possible to identify an extremely sparse intermediate latent variable with only end-to-end supervision, and introduce Sparling, an extreme activation sparsity layer and optimization algorithm that can learn such a latent variable.

Keywords
machine learning, sparsity, interpretability, optimization, identifiability

Reviews and Discussion

Official Review
Rating: 6

This paper addresses problems in representation learning and interpretability, toward understanding when and how black box models can correctly learn sparse intermediate activations, or "motifs". The central modeling assumption is that many real-world problems can be modeled by a generative process whereby the data come from first identifying a sparse set of higher-level somatic motifs, which is then used to create higher resolution raw data. This paper makes theoretical progress through its Motif Identifiability Theorem.

Strengths

The paper identifies an interesting setting for making progress on the interpretability of end-to-end neural networks. While the tasks in the paper are synthetic, they were designed with a long-term research objective in mind, such as being able to identify motifs in difficult problems like RNA splicing. The main strengths of the paper are in the originality of this setting and in having theoretical guarantees that the problem can be addressed. The paper is overall well-written and clear, even to a reader who is not deeply familiar with the sub-area.

Weaknesses

The main weaknesses of the paper are that it is the first step on a longer-term research problem, and the particular tasks that are able to be addressed now are highly synthetic and unrealistic.

Questions

Would the method work with more complicated model structures such as with more layers? How would one know which layer should be the one to extract specific latent motifs?

Comment

We thank the reviewer for their helpful feedback. Below are responses to specific points raised in the review.

"The main weaknesses of the paper are that it is the first step on a longer-term research problem, and the particular tasks that are able to be addressed now are highly synthetic and unrealistic." We agree that this paper is a first step in this space and that there is a lot of work left to be done in the future, and we hope that this has been adequately communicated in the text of the paper.

"Would the method work with more complicated model structures such as with more layers? How would one know which layer should be the one to extract specific latent motifs?" This technique is almost entirely agnostic to the structure (dataflow and architecture) of your model after the latent layer (the h model as described in the paper). It is also mostly agnostic to the architecture of your model before the latent variable (the g model). However, the locality constraint, which is a constraint on the dataflow of your g model, means that it generally makes sense to place the latent layer near the beginning of the model's dataflow. That said, you can add as many 1x1 convolutions as you like and perform heavy computation at this stage.
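For concreteness, a minimal sketch of what such a placement could look like (an illustration only, not the paper's actual architecture: the quantile-threshold SparsityLayer, its density value, and all layer sizes below are placeholder assumptions):

```python
import torch
import torch.nn as nn

class SparsityLayer(nn.Module):
    """Hypothetical stand-in for an extreme activation-sparsity layer: it zeroes
    every activation below a per-batch quantile. The actual Sparling layer and
    its density-annealing schedule may differ from this sketch."""
    def __init__(self, density=0.01):
        super().__init__()
        self.density = density

    def forward(self, x):
        # Keep roughly `density` of the activations per batch, zero the rest.
        threshold = torch.quantile(x.detach().flatten(), 1.0 - self.density)
        return torch.relu(x - threshold)

class MotifModel(nn.Module):
    """g (local convolutions + 1x1 convolutions) -> sparse latent map -> h (arbitrary)."""
    def __init__(self, in_channels=1, n_motifs=10, width=64, n_outputs=10):
        super().__init__()
        # g: a small receptive field keeps the latent layer local; the 1x1
        # convolutions add capacity without enlarging the footprint.
        self.g = nn.Sequential(
            nn.Conv2d(in_channels, width, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=1), nn.ReLU(),
            nn.Conv2d(width, n_motifs, kernel_size=1),
            SparsityLayer(density=0.01),
        )
        # h: free to be essentially anything, since the theory is agnostic to it.
        self.h = nn.Sequential(
            nn.Conv2d(n_motifs, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(width, n_outputs),
        )

    def forward(self, x):
        latent = self.g(x)   # extremely sparse motif map
        return self.h(latent)
```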

Comment

Thank you for the responses. I will keep my score.

Official Review
Rating: 5

The paper studies the extent to which sparsity alone can be used to identify latent variables determining an input-output mapping. The authors provide a theoretical result on sufficient conditions for recovery of the latents, based on extreme sparsity, low end-to-end error, and properties of the data distribution. To enable training models with extreme sparsity, the SPARLING algorithm is presented, which is shown to recover latent variables on several synthetic tasks.

Strengths

  • The motivation, goals and contributions of the work are clearly stated and are easy to follow.

  • Studying theoretical aspects of recovering latent variables, and more generally interpretability, is important and less explored. The work clearly demonstrates a scenario where interpretable latents can be recovered via a sparsity assumption, and further provides evidence of where the assumptions break and sparsity is insufficient (LaTeX-OCR).

  • The suggested algorithm to induce sparsity seems to be efficient, simple, and applicable in a more general setting than suggested in the present work.

Weaknesses

  • Some of the notation and formal statements are unclear and hard to follow:

    • In Section 2, paragraph 2, square brackets ([d]) denote a set but are then also used to denote dimensions of tensors in the domain of X - why is a tensor not denoted in standard notation, X \in \mathbb{R}^{N_1 \times N_2 \times d}?
    • The definition of locality, although made intuitive via the description with graph convolutions, is hard to understand, partially due to mixing spatial and channel indices. Additionally, the notation for the footprint function and motif cell is abused, with p_i introduced before.
    • The definitions in Section 3.2 are unclear: v_{\hat{g}(x)} is interchanged with v_{\hat{n}}, and the summation is over i', which does not appear in the summands - can you please clarify?
  • The main results, in Figure 3, are displayed without any reference baseline. If a simple baseline from any one of the relevant methods in the related work section could be added, it would highlight the significance of the method.

Questions

  • The SPARLING algorithm is presented as a contribution in the main text, but then in Appendix A it is said to exist in prior work [1] - can the authors please clarify? Do the two statements refer to the same algorithm, what are the differences between them, and what is the contribution of this work? Statements in the main text and the appendix should be consistent.

  • I found the definitions of error metrics, Section 3.2, and of \alpha-MOTIF-IMPORTANCE to be confusing - do you think an intuitive explanation through one of the synthetic tasks can be helpful?

[1] Improved modeling of RNA-binding protein motifs in an interpretable neural model of RNA splicing, Gupta et al.

Comment

We thank the reviewer for their helpful feedback. Below are responses to specific points raised in the review. Part 1 of 2.

"Square brackets ([d]) denote a set but are then also used to denote dimensions of tensors in the domain of X - why is a tensor not denoted in standard notation?" Since we don't make any assumptions about the number of dimensions of X and M except that all but their last dimensions are the same size, we wish to avoid having ellipses everywhere when describing the shapes of objects. As such, we refer to indexing by I, a set defined as a Cartesian product. As a result, we use a set for [d] and [n] as well. We could alternatively assume that they are always 2-dimensional, but that would be inapplicable to the AudioMNIST and Splicing domains, which are 1-dimensional.
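For what it's worth, one way this convention might be written out (a reconstruction from the description above, so the paper's exact notation may differ):

```latex
% A single spatial index set I, written as a Cartesian product, lets X and M
% have any number of spatial dimensions while differing only in their channel
% dimension ([d] vs. [n]).
\[
  I = [N_1] \times \cdots \times [N_k], \qquad
  X \in \mathbb{R}^{I \times [d]}, \qquad
  M \in \mathbb{R}^{I \times [n]}.
\]
% For the image domains, k = 2 and X \in \mathbb{R}^{N_1 \times N_2 \times d};
% for AudioMNIST and Splicing, k = 1.
```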

"The definition of locality, although made intuitive via the description with graph convolutions, is hard to understand, partially due to mixing spatial and channel indices." We agree that the definition is hard to understand for that reason, and on reflection, that slight added generality does not contribute anything to the paper. It is not relevant in our examples and probably would not be in nearly any situation. We have simplified the definition by removing this mixing.

"Additionally, the notation for the footprint function and motif cell is abused, with p_i introduced before." Thank you for the catch! I have replaced the previous dimensional measurements p_i with D_i.

"The definitions in Section 3.2 are unclear…" The interchange was meant to show an example of the function being called and then its definition using a variable; however, I now realize this was unclear. It has been modified to refer to \hat m in both cases. The summand was meant to include \hat m[i'] rather than \hat m[i]; thank you for catching the typo.

"The main results, in Figure 3, are displayed without any reference baseline. If a simple baseline from any one of the relevant methods in the related work section could be added, it would highlight the significance of the method." The existing baselines are not attempting to extract spatially local latent variables in this context, so they do not provide a relevant comparison. We do have an implicit comparison to a trained non-sparse baseline in Figure 4: in effect, Figure 4 demonstrates that a trained non-sparse network does not produce features corresponding to the motifs. We have made this more explicit in the caption of Figure 3.

"The SPARLING algorithm is presented as a contribution in the main text, but then in Appendix A it is said to exist in prior work [1] - can the authors please clarify? Do the two statements refer to the same algorithm, what are the differences between them, and what is the contribution of this work? Statements in the main text and the appendix should be consistent." Reference [1] does contain Sparling; however, it does not introduce it. It cites an earlier preprint of this paper as the source of the algorithm. Note: the preprint is not blinded, as a warning in case you look up the reference.

Comment

Part 2 of 2.

"I found the definitions of error metrics, Section 3.2, and of \alpha-MOTIF-IMPORTANCE to be confusing - do you think an intuitive explanation through one of the synthetic tasks can be helpful?" Thanks for your suggestion! We have considered this but have never been able to find a succinct and useful explanation of the assumptions involving examples. Here is a sketch of an attempt; let us know if you think it might be helpful, or if there are any ways to improve it.

“”” To take an example, we can look at AudioMNISTSequence, a dataset where a number of digits in [5, 10] is selected at random, and then that many digits are selected and placed randomly in non-overlapping positions. Here, we want to form same-probability pairs of sequences and sequences with a deleted digit, with the total paired probability summing to \alpha, because that enables us to say that a model that cannot detect the deletion \kappa of the time will be wrong \alpha\kappa/2 of the time.

However, of course, this runs into the problem that sequences with, e.g., 7 symbols have far more degrees of freedom, and thus lower probability, than those with 6 symbols. We thus want to ensure that we can pair multiple 7-symbol sequences with a single 6-symbol sequence. To do this, we use the \psi() probability function. At this point, it becomes important that we be allowed to “not use all the probability” of the 7-digit sequences: since not all 6-digit sequences are capable of accommodating a 7th digit at every position, and all 6- and 7-digit sequences have equal probability, P(a 6-digit sequence) < P(the 7-digit sequences that extend it, combined).

Thus, we can define our \psi as follows: if there are exactly 5 digits present, return \bot. Otherwise, with probability 50%, return \bot. Otherwise, delete one of the motifs at random. This function clearly has a non-negligible \alpha (since the number of digits is uniform on {5, …, 10}, it is 50% of 5/6, or 5/12). We can then establish that it does not put too much probability on any one sequence via a combinatorial argument establishing that the probability of any particular post-deletion target is at least half the combined probability of all the sequences that lead to it. “””
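A literal toy implementation of that \psi, just to make the branching explicit (our own illustration; representing a sequence as a list of motifs is a simplification, not the paper's data format):

```python
import random

def psi(motifs):
    """motifs: the list of digit motifs (with their positions) in one sequence;
    the number of motifs is uniform on {5, ..., 10}. Returns None (standing in
    for the 'bottom' symbol) or the sequence with one motif deleted."""
    if len(motifs) == 5:
        return None                        # deleting would leave the data's support
    if random.random() < 0.5:
        return None                        # deliberately "not using all the probability"
    idx = random.randrange(len(motifs))    # delete one motif uniformly at random
    return motifs[:idx] + motifs[idx + 1:]

# P(psi != None) = P(at least 6 digits) * 1/2 = (5/6) * (1/2) = 5/12, the alpha quoted above.
```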

Comment

It is still very hard to follow; it seems like you are attempting to explain the \alpha-MOTIF-IMPORTANCE. Since this requires a distribution of samples, could you perhaps try a simpler version of AudioMNIST for the explanation, e.g., where a number of digits in [1, 2] or [1, 3] can be seen in a sequence? I think it would be helpful to highlight what the ground-truth motifs are, an example of non-optimal predicted motifs and the error they induce, and how that error changes when changing the \alpha-MOTIF-IMPORTANCE. If I'm misinterpreting some of the definitions, please let me know.

Comment

I hope the following answers your question; if it does not, please let me know and I will try to further clarify. I specifically do not understand what you mean by "non-optimal predicted motifs". Would you like an explanation of how, if \alpha-Motif-Importance holds, it implies that there is error proportional to \alpha? I first go over a simple domain and how to set up a \psi that follows our rules, then go over why the rules are relevant, and then give a quick sketch of how this establishes the error bound.

I think the simplest example for \alpha-Motif-Importance is a variant of AudioMNISTSequence where there are either 1 or 2 true motifs, where motifs have audio length 3, and where the sequence has audio length 12. We specify that P(1 true motif) = P(2 true motifs) = 50%.

In the 1-motif case, the motif can be centered at any position from 1 through 10 (0 does not work as there is no space to the left, and 11 does not work as there is no space to the right). Each of these possibilities has probability 1/20. In the 2-motif case, some examples of valid configurations include {1, 4}, {1, 5}, {1, 6}, etc., but not {1, 2} or {1, 3}, as those would involve an overlap. Enumerating the possibilities shows that there are 28 in total:

{1, 4}, {1, 5}, {1, 6}, {1, 7}, {1, 8}, {1, 9}, {1, 10}, {2, 5}, {2, 6}, {2, 7}, {2, 8}, {2, 9}, {2, 10}, {3, 6}, {3, 7}, {3, 8}, {3, 9}, {3, 10}, {4, 7}, {4, 8}, {4, 9}, {4, 10}, {5, 8}, {5, 9}, {5, 10}, {6, 9}, {6, 10}, {7, 10}

Each with probability 1/56.

In \alpha-Motif-Importance, we need to find some subset of the data with probability \alpha that, when perturbed by a deletion, leads to a paired state with a different outcome. In this example, the "different outcome" is guaranteed, as all combinations of motifs lead to different output predictions. However, for anything in the 1-motif case, there is no way to perturb by deletion and still have a valid sequence. We thus set \psi(\bot | 1-motif sequence) = 1, that is, we do not pair 1-motif sequences with anything.

We could try to pair, e.g., {1, 4}, {1, 5}, {1, 6}, {1, 7}, {1, 8}, {1, 9}, {1, 10} all with {1}, perturbing by deleting the later motif. However, this runs into the issue that the total probability of the {1, x} configurations is 7 * 1/56 = 1/8, whereas the probability of {1} is 1/20. Even non-deterministically pairing these half with {1} and half with {x} is insufficient, as this leads to 1/16 vs. 1/20. The discrepancy occurs because {1} is uniquely well suited to having additional motifs added: it is at the edge, so it creates fewer potential conflicts. To resolve this problem, we instead pair each {x, y} with {x} with probability 2/5, with {y} with probability 2/5, and with \bot with probability 1/5. This balances everything, with {1, x} -> {1} now receiving 1/8 * 2/5 = 1/20. A quick check shows that no other target receives more probability mass than it has, either.

Finally, we can compute \alpha, which comes out to 2/5: P(not \bot) = P(2 motifs) * P(not \bot | 2 motifs) = 1/2 * (4/5) = 2/5.

So why did we have to do all this? Well, if we had allowed unbalanced pairing in order to get \alpha = 1/2, we would end up needing to produce paired sets of states, of 1/2 cumulative probability each, that are off by a single perturbation and lead to different outcomes, and that simply does not exist for this problem. The somewhat complicated constraint is necessary to ensure that our pairings are valid and can be used in the later proofs.

The later proof goes something like this. We know that with some probability \kappa, a motif will be missed (predicted as not existing). Now, consider our \alpha = 2/5 of the input space as described before, and take one of these cases, M1 = {1, 4}, which contributes probability mass \alpha_i to the pairing, with paired perturbation M2 = {1}. In this case, with probability \kappa, our model will be unable to distinguish M1 from M2. Since both P(M1) and P(M2) > \alpha_i = 2/5 * 1/56 = 1/140, our model must have end-to-end error of at least \kappa \alpha_i / 2 = \kappa/280 on {1, 4}, since it must "pick" between interpreting the input as {1, 4} or {1} with no ability to distinguish them, and these lead to different outputs (either a length-1 or length-2 string). So we can establish that error(M1) >= \kappa \alpha_i / 2, and aggregating, we get error >= \kappa \alpha / 2.
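To make the arithmetic above easy to check, here is a short illustrative script (not from the paper; it simply re-enumerates the toy domain described in this thread and verifies the 28 configurations, the 2/5 pairing balance, and \alpha = 2/5):

```python
from fractions import Fraction
from itertools import combinations

positions = range(1, 11)                            # valid motif centers: 1..10
pairs = [p for p in combinations(positions, 2) if p[1] - p[0] >= 3]
assert len(pairs) == 28                             # the 28 two-motif configurations

p_single = Fraction(1, 2) * Fraction(1, 10)         # each 1-motif sequence: 1/20
p_pair = Fraction(1, 2) * Fraction(1, len(pairs))   # each 2-motif sequence: 1/56

# Pair each {x, y} with {x} w.p. 2/5, {y} w.p. 2/5, and \bot w.p. 1/5,
# and track how much probability mass flows onto each 1-motif target.
mass = {pos: Fraction(0) for pos in positions}
for x, y in pairs:
    mass[x] += p_pair * Fraction(2, 5)
    mass[y] += p_pair * Fraction(2, 5)

# No target receives more mass than it has itself; the edge positions are tight.
assert all(m <= p_single for m in mass.values())
assert mass[1] == mass[10] == p_single

alpha = Fraction(1, 2) * Fraction(4, 5)             # P(not \bot) = P(2 motifs) * 4/5
assert alpha == Fraction(2, 5)
print("all checks pass; alpha =", alpha)
```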

Comment

Thank you for the detailed explanation of \alpha-Motif-Importance; I do find it helpful. I still think the presentation of the formal definitions (e.g., locality, metrics) in the manuscript needs to be improved, so I am keeping my score.

Comment

Thank you for the detailed response, but I could not see any of the changes mentioned here in the PDF or any revisions in the history. I tried looking for the revisions you listed for other reviewers as well (e.g., changes to the caption of Figure 1 in response to reviewer odWd) but did not see any.

Comment

Sorry about that; I seem to have forgotten to upload the revised PDF. I have uploaded it now.

Official Review
Rating: 3

This paper theoretically proves that under certain assumptions, the motifs (or latent variables) of a task can be accurately identified by a neural network that is trained end-to-end. More specifically, it provides an upper bound for the motif error when the end-to-end error is low. The paper shows that this kind of motif identifiability can be achieved by enforcing extremely sparse activations in the training process, and validates it on several example tasks.

Strengths

  1. The paper seems to focus on an interesting problem, i.e., can a neural network automatically identify motifs (latent variables) of a task under the end-to-end training scheme.

  2. The paper theoretically formalizes several sufficient conditions (assumptions) for motif identifiability, and quantifies the motif error in terms of end-to-end error. I really appreciate this kind of theoretical effort.

Weaknesses

  1. The presentation of the paper needs to be improved. It is not clear from the beginning what a “motif” is. I didn't see a concrete description of the motifs until the experiments section on Page 7, e.g., the latent motifs layer m^* is the position of each digit. I strongly recommend taking one of the tasks in the experiments as an illustrative example at the beginning of the paper. Try to use figures to explain what are the ground-truth motifs the model is expected to learn, and what is the intermediate output that is considered as the model’s encoding of a motif (Is it a tensor? If so, plot the shape of the tensor). This illustrative example can also help readers better understand the physical meaning of the assumptions in Section 3.3.

  2. The claim that the motif error is small according to Figure 3 is not very convincing, because the experiment lacks comparison with properly designed baselines. There should be some comparisons against a certain baseline, e.g., when \hat{g} is a random mapping, which is a very weak baseline. I encourage the authors to come up with stronger baselines to make the result more convincing.

  3. The practical implication of the paper is still limited. The current paper mainly focuses on special tasks that exhibit certain structures (e.g., input noise, sparse input signals). Is it possible to validate the assumptions and the theory on more common tasks such as image classification, e.g., the digit classification on MNIST? If not, I also encourage the authors to discuss why it is the case in a Limitations section.

  4. Minor.

(a) Could you add the equation numbers and the line numbers, so that we can refer to specific contents more easily?

(b) In the equation v_{\hat{m}}(i) = \sum_{i' \in p_2(i)} \boldsymbol{1}(\hat{m}[i] \neq 0), should the \hat{m}[i] on the right-hand side be \hat{m}[i']?

Questions

  1. Could you provide a figure describing the network architecture used in Section 5.1? This would significantly help readers understand the network architecture. Besides, many parts of the architecture follow Deng et al. (2016). Could you explain why these specific design choices are made?

  2. There are some works studying the decision sparsity or sparsity of concepts encoded by neural networks [c1, c2, c3], without requiring the network itself (e.g., parameters or activations) to be sparse. They have not been discussed in the related work, and I wonder whether/how these works are related to and different from the sparsity in this paper.

  3. Most tasks described in this paper have noise in the input, and it seems that the model needs to perform denoising in order to output the correct result. Is the input noise a necessary setting for motif identification?

[c1] Enouen and Liu. Sparse Interaction Additive Networks via Feature Interaction Detection and Sparse Selection. NeurIPS 2022.

[c2] Sun et al. Sparse and Faithful Explanations Without Sparse Models. AISTATS 2024.

[c3] Ren et al. Where We Have Arrived in Proving the Emergence of Sparse Interaction Primitives in DNNs. ICLR 2024.

Details of Ethics Concerns

NA

Comment

We thank the reviewer for their helpful feedback. Below are responses to specific points raised in the review.

Weakness 1: “Try to use figures to explain what are the ground-truth motifs the model is expected to learn, and what is the intermediate output that is considered as the model’s encoding of a motif”. Thank you for your feedback. As a first attempt at addressing this, we have improved the image and caption of Figure 1. In the image, we added the tensor dimensions for X and M, as well as the motif locations as dots. We have also updated the caption to explain the tensor dimensions. We also moved Figure 2, which shows more examples, including the model’s predictions, up to an earlier page.

Weakness 2: We have an implicit comparison to a trained non-sparse baseline in Figure 4. In effect, what Figure 4 demonstrates is that a trained network that is non-sparse is not producing features corresponding to the motifs. We have made this more explicit in the caption to Figure 3.

Weakness 3: The reason that this does not apply to, e.g., MNIST is that MNIST digits do not have their information concentrated in small and independent parts of the input. As such, our assumptions would not apply to this domain. We have added a Limitations section discussing this. We do not believe that noise is a significant limitation; see our later answer to your Question 3.

Weakness 4a: We have added line numbers; for some reason they were not in the template we used. Thank you for the reminder. Weakness 4b: Thank you for the correction; this has been fixed in the text.
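For reference, the corrected equation would presumably read as follows (the reviewer's formula with only the summand index changed; the surrounding notation may differ slightly in the revised text):

```latex
\[
  v_{\hat{m}}(i) \;=\; \sum_{i' \in p_2(i)} \boldsymbol{1}\bigl(\hat{m}[i'] \neq 0\bigr)
\]
```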

Question 1: We did not focus on optimizing the architecture; our improvements are orthogonal to architectural choices past the latent layer. Before the latent layer, we used a simple residual network architecture and performed some experiments varying it, to little effect (positive or negative), as discussed briefly in Section 5.1, line 261. If you like, we can add a diagram of the model architecture to an appendix.

Question 2: [c1] is more akin to causal sparsity than to activation sparsity; specifically, it selects among feature interactions (relationships in a causal graph) rather than features. Additionally, the post-feature-extraction layer is then combined linearly rather than, as in our model, via an arbitrary function. [c3] describes a kind of sparse causal graph, defining sparsity in terms of a small number of interactions between variables. We have added citations to [c1] and [c3] in our list of works on causal sparsity and sparse causal graphs.

[c2] is a technique for measuring the sparsity of a model that does not have an input-independent sparse structure. This has some applicability to our work (since we also do not have an input-independent sparse structure), though we provide a much more straightforward definition of sparsity that applies to our network. We have added a reference to this work in our discussion of Neural Input Attribution in Appendix A.

Question 3: None of the assumptions in our proof require the presence of noise in the input; thus, if you can train an \hat h \circ \hat g that works well end-to-end at high sparsity, you can recover an accurate \hat g. However, in practice, we have found that the algorithm in Section 4 does not work in the complete absence of noise, because of tiebreaking (i.e., if 95% of your image is identical white pixels, you cannot have a deterministic local function that is 90% sparse). In practice, this is not a significant problem, as you can inject some very negligible amount of noise to address this issue. Additionally, most realistic domains will not have a perfectly uniform background (e.g., in Splicing we do not inject noise, since RNA varies from location to location for reasons unrelated to the g function).
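A minimal sketch of this kind of negligible noise injection (the noise scale is an arbitrary placeholder, not a value from the paper):

```python
import torch

def add_tiebreak_noise(x, scale=1e-4):
    # With a perfectly uniform background (e.g., 95% identical white pixels),
    # many activations tie, and a deterministic local g cannot zero exactly 90%
    # of them; a tiny perturbation breaks the ties without affecting the task.
    return x + scale * torch.randn_like(x)
```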

Comment

I would like to thank the authors for their efforts in the response. Some of my questions are addressed, e.g., the comparison to another line of work on sparsity and the implicit assumptions about input noise. However, I believe that the overall presentation still needs to be improved; e.g., the revised Figure 1 still does not clearly explain what a motif is. I will keep my original score, and I think the paper might gain more impact by improving its presentation and clarity.

One kind reminder (not related to my rating of this paper): in the updated manuscript, you could use a different color to highlight what you have revised, which provides some visual guidance for the reviewers in addition to the response.

AC Meta-Review

This paper investigates the identifiability of sparse and local latent variables, referred to as motifs, that represent intermediate states in real-world processes. The paper presents the Motif Identifiability Theorem, proving that, under certain conditions, these motifs can be precisely identified by reducing end-to-end error. The Sparling algorithm is then introduced, which enforces activation sparsity through a novel informational bottleneck, achieving over 90% accuracy in localizing intermediate states on synthetic and real-world domains like DigitCircle, LaTeXOCR, and AudioMNISTSequence.

The paper addresses an important problem and formalizes several sufficient conditions for motif identifiability. The tasks in the experiments are synthetic. As all the reviewers appear to agree, the overall presentation leaves a lot of room for improvement, for example in the formal definitions, so another round of revision seems necessary.

Additional Comments on Reviewer Discussion

The rebuttal has been noted by the reviewers and has been taken into account by the AC in the recommendation of acceptance/rejection.

Final Decision

Reject