Average rating: 5.7/10 (Rejected; 3 reviewers, min 5, max 6, std 0.5)
Individual ratings: 6, 6, 5
Confidence: 3.0, Correctness: 3.0, Contribution: 2.3, Presentation: 3.0
ICLR 2025

Incidental Polysemanticity: A New Obstacle for Mechanistic Interpretability

Submitted: 2024-09-28, Updated: 2025-02-05
TL;DR

Polysemanticity in deep neural networks might be attributable to causes other than optimizing for the task, i.e., it may be incidental.


Keywords

polysemanticity, mechanistic interpretability, AI safety, deep learning, science of deep learning, neural computation, interpretability

Reviews and Discussion

Official Review
Rating: 6

Polysemantic neurons that are activated by different natural concepts have recently attracted researchers' attention due to their influence on feature interpretability. In this paper, the authors introduce a new scenario where polysemanticity might be caused by non-task factors in the training process. In their toy models, they consider two conditions, i.e., $l_1$ regularization during training and hidden-layer noise. With theoretical and empirical evidence, they reproduce polysemanticity in the toy models and analyze the learning dynamics of the training process.

Strengths

  1. The theoretical analysis of the training dynamics in toy models is solid and agrees well with the empirical evidence.
  2. The paper is well-organized and easy to follow. The main insight is clear and the two conditions in the toy models are explained well.
  3. Analyzing the mechanism of polysemanticity is important and this paper proposes an insightful perspective.

Weaknesses

  1. This paper focuses on polysemanticity that is not related to the task. It would be better to add more discussion of the differences between incidental polysemanticity and the original task-related polysemanticity. For example, since this polysemanticity is not related to the task, can we get rid of the performance-interpretability trade-off? Besides, in real neural networks, how can we distinguish incidental polysemanticity?
  2. The analysis in this paper is conducted on toy models. I understand this is for ease of theoretical analysis. However, it would be better to provide some insights into polysemanticity in neural networks trained on real-world data and add some additional experiments or discussion.
  3. The explanation and motivation of Figure 6 are a little confusing, and it is perhaps a digression from the main topic of the paper. It would be better to provide more discussion of the new insights.
  4. In this paper, the authors mainly focus on how to obtain polysemantic neurons. However, on real-world data, the main challenge lies in obtaining monosemantic neurons. Consequently, is it possible to provide some insights about how to attain monosemanticity based on the analysis in this paper?

Questions

See weaknesses.

Comment

The explanation and motivation of Figure 6 are a little confusing, and it is perhaps a digression from the main topic of the paper. It would be better to provide more discussion of the new insights.

Thank you for raising this point. Figure 6 actually reveals an unexpected mechanism for polysemanticity: while most encodings reach almost perfect sparsity (indicated by $\|W_i\|_4^4 = 1$), the orange-curve encoding gets stuck below $\|W_i\|_4^4 = 0.2$. Note that this isn't just noise or training instability; it represents a stable 'compromise' where multiple features share a representation due to interference patterns. From the final weight matrix, we can clearly see that sparsity pushes $W_2$ to "pick" one of the 7 coordinates and increase its absolute value, but interference patterns force $W_2$ to compromise across all 7 coordinates.

We believe this phenomenon is particularly important because it demonstrates how polysemanticity can happen for a wide variety of sometimes surprisingly hard-to-predict incidental reasons (e.g., in this case, non-task factors), and this is why eliminating incidental polysemanticity is particularly challenging.
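To make the fourth-norm sparsity metric used above concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the `l4_sparsity` helper, the row-normalization convention, and the dimension are assumptions). A one-hot (monosemantic) encoding yields 1, a random direction in $\mathbb{R}^m$ sits near the $3/m$ baseline, and an encoding spread evenly over all coordinates yields $1/m$.

```python
import numpy as np

def l4_sparsity(W):
    """Per-row sparsity proxy ||W_i||_4^4, computed after L2-normalizing each row.

    For a unit-norm row this lies between 1/m (weight spread evenly over all
    m coordinates) and 1 (a one-hot, monosemantic encoding); a random
    direction in R^m sits near the 3/m baseline for large m.
    """
    W = np.asarray(W, dtype=float)
    rows = W / np.linalg.norm(W, axis=1, keepdims=True)
    return np.sum(rows ** 4, axis=1)

# Hypothetical illustration (m chosen arbitrarily, not the paper's setting):
m = 512
rng = np.random.default_rng(0)
one_hot = np.eye(m)[0]                 # monosemantic encoding -> 1.0
random_dir = rng.standard_normal(m)    # random direction      -> roughly 3/m
uniform = np.ones(m)                   # fully shared encoding -> 1/m
print(l4_sparsity(np.stack([one_hot, random_dir, uniform])))
```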

In this paper, the authors mainly focus on how to obtain polysemantic neurons. However, in real-world data, the main challenge lies in obtaining the monosemantic neurons. Consequently, is it possible to provide some insights about how to attain monosemanticity based on the analysis in this paper?

Thank you, this is an excellent question. We have added to the paper to clarify this. Here is one possible way one might get rid of incidental polysemanticity in a neuron that currently represents two features i and j: Duplicate that neuron, divide its outgoing weights by 2 (so that this doesn't affect downstream layers), add a small amount of noise to the incoming weights of each copy, then run gradient descent for a few more steps. One might hope that this will cause the copies to diverge away from each other, with one of the copies eventually taking full ownership of feature i while the other copy takes full ownership of feature j.
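For concreteness, here is a hypothetical NumPy sketch of one reading of that splitting recipe (not code from the paper; the weight-matrix shapes and the `split_polysemantic_neuron` helper are our assumptions). The subsequent gradient-descent steps would use whatever training loop is already in place.

```python
import numpy as np

def split_polysemantic_neuron(W_in, W_out, k, noise_scale=1e-3, rng=None):
    """Duplicate hidden neuron k so gradient descent can disentangle its features.

    W_in  : (m, n) incoming weights (one row per hidden neuron)
    W_out : (n, m) outgoing weights (one column per hidden neuron)
    The outgoing weights of both copies are halved so the network's output is
    unchanged; small independent noise on the incoming weights breaks the
    symmetry so that each copy can later specialize on one feature.
    """
    rng = np.random.default_rng() if rng is None else rng
    w_k = W_in[k].copy()
    W_in_new = np.vstack([W_in, w_k + noise_scale * rng.standard_normal(w_k.shape)])
    W_in_new[k] = w_k + noise_scale * rng.standard_normal(w_k.shape)

    v_half = W_out[:, k] / 2.0
    W_out_new = np.hstack([W_out, v_half[:, None]])
    W_out_new[:, k] = v_half
    return W_in_new, W_out_new

# After splitting, run a few more gradient-descent steps with the usual loss;
# the hope is that the two copies diverge, each taking ownership of one feature.
```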

Of course, as we noted in the paper, we can also carefully control the noise distributions. One might say that the paper should explore further initialization strategies that encourage initial feature-neuron correlations to develop toward monosemantic representations. However, if the auto-encoder were part of a larger network, then in most cases you don't know in advance what the features will be (and they're not going to be aligned with the basis vectors just by random chance), so it's unclear how you'd guarantee that your initialization is monosemantic with respect to the features that eventually get learned.

Comment

We thank reviewer 8rnd for the thoughtful suggestions, especially regarding the needed clarifications on Figure 6!

For example, as polysemanticity is not related to the tasks, can we get rid of the performance-interpretability trade-off?

Regarding the performance-interpretability trade-off: unlike necessary polysemanticity (as described in Elhage et al.), which arises from capacity constraints, incidental polysemanticity occurs even when the network has sufficient capacity to represent features monosemantically. Our toy model demonstrates this cleanly, since we can guarantee the network has more than enough capacity to represent all features without polysemanticity while still achieving perfect task performance. This suggests that, at least for incidental polysemanticity, there may not be a fundamental performance-interpretability trade-off, though more research is needed to verify this in more complex model settings.

Besides, in the neural networks, how can we distinguish the incidental polysemanticity?

Thank you for this important clarifying question. Whether we can distinguish neurons based only on the final, trained state of the model or if we need to know more about the training process is a complex question. While our current work focuses on establishing the existence of incidental polysemanticity, we agree that future work exploring these questions would be very valuable. We are currently unaware of any work that investigates this property in detail in any network.

However, in our “Discussion and Future Work section”, we did outline some concrete approaches for potentially reducing incidental polysemanticity. As a starting point, here is one possible way one might get rid of incidental polysemanticity in a neuron that currently represents two features i and j: Duplicate that neuron, divide its outgoing weights by 2 (so that this doesn't affect downstream layers), add a small amount of neural noise to the incoming weights of each copy, then run gradient descent for a few more steps. One might hope that this will cause the copies to diverge away from each other, with one of the copies eventually taking full ownership of feature i while the other copy takes full ownership of feature j.

The analysis in this paper is conducted on the toy models. I understand it is for the ease of theoretical analysis. However, it would be better to provide some insights into the polysmenaticity in the neural networks trained on real-world data and add some additional experiments or discussions.

We appreciate your concern about the limited evaluation using only a toy model and dataset. We would like to provide some additional context for why we believe toy models are actually a strength of our work given the specific research questions and claims we aim to address.

The key point we want to demonstrate is that polysemanticity can arise incidentally from factors other than limited capacity, even in networks that achieve perfect performance on the task. A toy model is particularly well-suited to make this point cleanly and definitively. We can guarantee the network has more than enough capacity to represent the task features without polysemanticity, and that it does indeed reach 100% performance. With real-world datasets and networks, it would be much harder to disentangle polysemanticity driven by incidental factors like initialization and regularization vs. insufficient capacity. The toy setting allows us to isolate the phenomenon of interest.

Toy models have a long and impactful history in machine learning research as a tool for building insight and precisely characterizing important phenomena. Polysemanticity and superposition, double descent, scaling laws, grokking and adversarial robustness have all benefited from analysis in simplified models that abstract away confounding factors. The closely related work of Allen-Zhu & Li 2020 that you reference itself uses a toy network and synthetic data. Our use of toy models thus follows a well-established approach in the field for deriving clean insights about interesting behaviors.

Cleanly defining and identifying "features" in real networks remains an open problem for mechanistic interpretability. Our work intentionally sidesteps this challenge to focus on a separate important question about the origins of polysemanticity. Toy models allow us to make progress on this question even without solving feature identification in complex settings. We view our contributions as complementary to ongoing work on scalable feature definition.

Official Review
Rating: 6

Polysemanticity describes the property of an internal feature/neuron in which the neuron has a high activation for multiple features, indicating that the neuron is associated with more than one feature. The prevailing theory on polysemanticity is that it is necessary: when the feature space is smaller than the number of features, neurons must represent multiple features. This paper explores the possibility that polysemanticity may not occur out of necessity, but as an artifact of training. In particular, the paper looks at sparsity enforcement in the hidden layer and observes unnecessary and random, or incidental, polysemanticity.

Strengths

  • The toy model is illustrative and easy to apprehend.
  • In-depth and accessible discussion in Section 2.
  • Favorable results matching theory in Section 2.
  • Well written and easy to follow.

Weaknesses

The examples of incidental polysemanticity are explored with L1 regularization of the internal representation or with noise injections, and when the feature dimension is larger than necessary. The main question is why polysemanticity occurs in typical settings, and it is unclear how these results apply to those more typical scenarios. This lessens the significance of the paper's contribution. This paper could be improved by elucidating how the content applies to more typical scenarios or by providing content more in alignment with those scenarios.

Section 3/4 discusses how noise is tied to sparsity, but does not verify that noise is tied to polysemanticity. The paper could be improved by providing experiments showing how noise causes incidental polysemanticity.

Questions

I don't see any baseline polysemantic neuron count for your model, with no regularization or noise. Is this because there is no way to define them, as there is no winner-takes-all enforcement? It would be beneficial to establish some sort of baseline.

Section 4, on noise injections, explains the relationship between noise injections and sparsity, but I fail to see a discussion on what this has to do with polysemantic neurons. Are there experiments on the number of incidentally polysemantic neurons with noise injections? This would be beneficial.

Is there an example of incidental polysemantic neurons when m < n, i.e., when the feature space is smaller than the required number of features? This would help move the scenario of the experiment closer to what is typical.

Can you explain how incidental polysemanticity with L1 regularization relates to the base case? Is it that the scenario where the feature space dimension is much smaller than the number of features is similar to a scenario with feature sparsity? Making this more clear would help readers understand your contribution.

Can you explain how noise injections relate to explaining the cause of polysemanticity in the typical case? Perhaps I am unfamiliar with how common these types of noise injections are.

An interesting experiment (if it has not been done) is to train an autoencoder in a more normal scenario: without regularization, without noise, and with m < n. Then check if the number of polysemantic neurons is greater than n - m. If so, this could be evidence of incidental polysemanticity in a typical scenario.

Minor Issues:

  • Line 051: "In that case"
  • Line 135: By $W_i = f_i$, do you mean $W_i = e_i$?
  • Figure 3: the graph is not displaying range bars for the last two points.

Comment

Can you explain how noise injections relate to explaining the cause of polysemanticity in the typical case? Perhaps I am unfamiliar with how common these types of noise injections are.

We selected the noise setup for three main reasons.

  • Showing that multiple factors besides L1 regularization can cause incidental polysemanticity strengthens our main claim that such polysemanticity can arise from diverse causes unrelated to task performance.
  • Noise is a practically relevant factor to study, since widely used techniques like dropout induce noise in real networks.
  • Studying the impact of noise on representations has precedent in related works like Bricken et al. 2023 and Allen-Zhu & Li 2020.

So we showed that certain types of noise can induce sparsity, which then leads to polysemanticity through mechanisms similar to those we demonstrated with L1 regularization. As in Figure 4, bipolar noise (which has negative excess kurtosis) pushes encodings towards sparsity, while normal noise does not. This sparsification then creates the conditions for incidental polysemanticity through the same winner-takes-all dynamics we observe with L1 regularization. The key insight is that noise with certain statistical properties can implicitly create regularization-like effects.
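To make the kurtosis contrast tangible, here is a small, self-contained Python check (our own illustration; the sample size and σ are arbitrary, and `excess_kurtosis` is a hypothetical helper, not the paper's code). Bipolar (±σ) noise has excess kurtosis of -2, whereas Gaussian noise has excess kurtosis of 0.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: E[(x - mean)^4] / Var(x)^2 - 3."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

rng = np.random.default_rng(0)
sigma, n = 0.1, 1_000_000                           # arbitrary illustration values
bipolar = sigma * rng.choice([-1.0, 1.0], size=n)   # takes values +/- sigma
gaussian = sigma * rng.standard_normal(n)

print(excess_kurtosis(bipolar))    # about -2: negative excess kurtosis (sparsifying)
print(excess_kurtosis(gaussian))   # about  0: no sparsity pressure
```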

  1. In that case

Thank you for pointing that out! We have fixed the spelling issue.

  1. By $W_i = f_i$, do you mean $W_i = e_i$?

In line 132, we define $f_k \in \mathbb{R}^{m}$ as the $k^{\text{th}}$ basis vector in $\mathbb{R}^{m}$, where $m$ is the hidden-layer dimension (as opposed to the input/output dimension).

Figure 3, graph not displaying range bars for last two points.

Thank you for pointing that out. We have rerun the experiments and updated the plot to display the error bars for each point. The updated plot can be viewed here: https://drive.google.com/file/d/1Dp9SSH0JYrpKI0kSE_GBltpbPJ7FKRK7/view?usp=sharing.

Comment

Is there an example of incidental polysemantic neurons when m < n, i.e., when the feature space is smaller than the required number of features? This would help move the scenario of the experiment closer to what is typical.

Thank you for this insightful question about the contributions of sparsity, noise, and gradient descent to polysemanticity in under-capacity regimes (where $m < n$). Our current work focuses on the setting where networks have excess capacity, but we agree that studying the "incidental" nature of polysemanticity in under-capacity regimes is a fascinating direction for future work, and we will be sure to acknowledge this in the paper.

We would also like to note that in under-capacity regimes, the distinction between “incidental” and “necessary” polysemanticity is not as crisp, since all optimal solutions would have some amount of polysemanticity. But we can still talk about whether gradient descent will find a solution of minimum polysemanticity among all optimal solutions (we can define the “amount of polysemanticity” as the number of pairs of features that share at least one neuron). Our intuition is that as the number of neurons falls below the number of features, neurons would start to be shared by larger and larger groups of features that get packed together, striking a compromise between sparsity and expressivity. But this compromise would not be struck optimally. In particular, we expect that, until the number of neurons is much smaller than the number of features, we would continue to see that a significant fraction of neurons are left unutilized at the end of training, mirroring what we saw in the experiments we ran with an equal number of neurons and features. This would be a clear case of incidental polysemanticity.

Can you explain how incidental polysemanticity with L1 regularization relates to the base case?

The key insight here is that the regularization induces a "winner-take-all" dynamic through the sparsity force (as defined in Section 1). With the feature benefit pushing weights to have unit length and L1 regularization pushing down on weights with constant force, we get a dynamic where weights whose absolute value is above some threshold will grow, while those below the threshold will shrink. As we showed in the paper, the threshold is given by $(1 - \|W_i\|^2)\, W_{ik} = \lambda \, \operatorname{sign}(W_{ik}) \iff |W_{ik}| = \frac{\lambda}{1 - \|W_i\|^2}$.
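Displayed once more with the two competing forces labeled (a restatement of the expression just above; we read the sign as acting coordinate-wise):

```latex
\[
  \underbrace{(1 - \|W_i\|^2)\, W_{ik}}_{\text{feature benefit}}
  \;=\;
  \underbrace{\lambda \, \operatorname{sign}(W_{ik})}_{\text{sparsity force}}
  \quad\Longleftrightarrow\quad
  |W_{ik}| \;=\; \frac{\lambda}{1 - \|W_i\|^2}
\]
```

Coordinates whose magnitude exceeds this threshold keep growing, while those below it decay toward zero, which is the winner-take-all dynamic referred to above.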

Is it that the scenario where the feature space dimension is much smaller than the number of features is similar to a scenario with feature sparsity? Making this more clear would help readers understand your contribution.

Yes, the scenario where the feature space dimension is smaller than the number of features is a more central example of a setting where recovery of sparse features has been explored. However, even in our setting where the feature space dimension is larger than the number of features, the features are maximally sparse (only one of them is ever on). The difference is that it would no longer be surprising that they can be recovered losslessly (since one could assign one neuron to each feature), and the surprising thing is that in fact when L1 regularization is introduced, some polysemanticity still occurs.

Comment

Section 3/4 discusses how noise is tied to sparsity, but does not verify that noise is tied to polysemanticity. The paper could be improved by providing experiments showing how noise causes incidental polysemanticity.

For experiments on how noise causes incidental polysemanticity, we refer you to Section 4 of the paper, where our experiments demonstrate a clear causal chain from noise to sparsity to polysemanticity. In Figure 4, we show that bipolar noise (with negative excess kurtosis) pushes encodings towards sparsity, while normal noise does not. Specifically, for bipolar noise with higher standard deviations (e.g. $\sigma = 0.1, 0.15$), we observe the average fourth norm increasing dramatically above the $3/m$ baseline, indicating strong sparsification. This sparsification directly leads to polysemanticity through the "winner-take-all" dynamics that we have presented in the paper.

Further, Figure 6 provides direct evidence of how noise leads to polysemanticity through the interplay between interference and the push for sparsity. In that figure we observed cases where encodings are forced to represent multiple features due to noise-induced dynamics, even when this isn't optimal for task performance. Specifically, we show how the second encoding ($W_2$) ends up with significant weights across multiple coordinates due to noise-induced interference patterns.

I don't see any baseline polysemantic neurons count for your model, with no regularization or noise. Is this because there is no way to define them, as there is no winner-takes-all enforcement? It would be beneficial to establish some sort of baseline.

Thank you for raising that interesting point. In our base model with neither regularization nor noise, feature representations tend to remain distributed across multiple neurons rather than concentrating in a single neuron, resulting in fourth norms close to the random baseline of $3/m$ shown in Figures 4 and 5. This baseline behavior is visible in the early stages of training in Figure 4, where, before the sparsifying effects take hold, both bipolar and normal noise conditions show fourth norms hovering around the $3/m$ reference line. This represents the natural state of the network before the sparsity-inducing forces come into play.

We can precisely measure polysemanticity once the winner-takes-all dynamics have emerged, because we can identify when multiple features are primarily represented by a single neuron. Before the dynamics have played out, we expect the features to be distributed across neurons in a way that makes counting "polysemantic neurons" less well-defined. Hence, our analysis focuses on how regularization and noise create conditions where we can unambiguously measure the phenomenon.
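For readers wondering where the $3/m$ figure comes from, it follows from a standard Gaussian-moment calculation (our restatement, assuming the entries of an encoding are roughly i.i.d. $\mathcal{N}(0, 1/m)$ at initialization, so that $\mathbb{E}\|W_i\|^2 = 1$):

```latex
\[
  \mathbb{E}\big[W_{ik}^4\big] = 3\left(\tfrac{1}{m}\right)^{2}
  \quad\Longrightarrow\quad
  \mathbb{E}\big[\|W_i\|_4^4\big]
  = \sum_{k=1}^{m} \mathbb{E}\big[W_{ik}^4\big]
  = m \cdot \frac{3}{m^{2}}
  = \frac{3}{m}.
\]
```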

Comment

We thank reviewer WVCj for the thoughtful comments, especially for pointing out wording issues, asking for clarifications on the hidden-layer noise setup, and requesting a much-needed full error-bar display, which we have now implemented!

The main question is why polysemanticity occurs in typical settings, and it is unclear how these results apply to those more typical scenarios. This lessens the significance of the paper's contribution. This paper could be improved by elucidating how the content applies to more typical scenarios or by providing content more in alignment with those scenarios.

We appreciate your concern about the limited evaluation using only a toy model and dataset. We would like to provide some additional context for why we believe toy models are actually a strength of our work given the specific research questions and claims we aim to address.

The key point we want to demonstrate is that polysemanticity can arise incidentally from factors other than limited capacity, even in networks that achieve perfect performance on the task. A toy model is particularly well-suited to make this point cleanly and definitively. We can guarantee the network has more than enough capacity to represent the task features without polysemanticity, and that it does indeed reach 100% performance. With real-world datasets and networks, it would be much harder to disentangle polysemanticity driven by incidental factors like initialization and regularization vs. insufficient capacity. The toy setting allows us to isolate the phenomenon of interest.

Toy models have a long and impactful history in machine learning research as a tool for building insight and precisely characterizing important phenomena. Polysemanticity and superposition, double descent, scaling laws, grokking and adversarial robustness have all benefited from analysis in simplified models that abstract away confounding factors. The closely related work of Allen-Zhu & Li 2020 that you reference itself uses a toy network and synthetic data. Our use of toy models thus follows a well-established approach in the field for deriving clean insights about interesting behaviors.

Cleanly defining and identifying "features" in real networks remains an open problem for mechanistic interpretability. Our work intentionally sidesteps this challenge to focus on a separate important question about the origins of polysemanticity. Toy models allow us to make progress on this question even without solving feature identification in complex settings. We view our contributions as complementary to ongoing work on scalable feature definition.

Further, while our current experiments focus on establishing the existence of incidental polysemanticity in a setting where it doesn't affect task performance, we believe studying polysemanticity is still important because the type of representation learned can have implications beyond just raw task performance. For example, polysemantic representations may be less interpretable (as each neuron's activity is no longer clearly associated with a single input feature), more or less robust to noise or distribution shift, or more amenable to compression.

Comment

Dear Reviewer WVCj, we would greatly value your review of our rebuttals to consider whether you would like to update your ratings. Given your thoughtful comments about generalizability, we believe further discussion could be valuable. Similar theoretical work using toy models was awarded Best Paper runner-up at ICLR 2023.

The deadline is approaching quickly, and we would greatly appreciate a response.

Comment

Good rebuttal regarding toy models, point well made. I recommend that you highlight that you have not definitively shown what the cause of polysemanticity is in the case of lower dimension, but have definitively demonstrated viable alternatives to the going theory.

I suggest that the connection between noise and polysemanticity be made a little more clear. I believe that the L4 norm is basically being used as a proxy for polysemanticity? So noise causes an L4-norm increase, which implies polysemanticity? Spelling this out more would be helpful.

431 - "we consider a the training"

The score has been changed.

Official Review
Rating: 5

The paper investigates the emergence of polysemanticity in neural networks, demonstrating that it arises not solely from bottleneck dimensions (extending Elhage et al.'s work). The authors propose that incidental polysemanticity occurs during random initialization and early training, providing mathematical derivations to support this claim.

Strengths

  • The paper focuses on the origins of polysemanticity, a central and important problem in mechanistic interpretability
  • The paper provides thorough walkthroughs of mathematical derivations and explains individual resulting terms (feature benefit, interference, regularization)

Weaknesses

In general, I am unsure about the weight of novel contributions in this paper relative to the bar of an ICLR acceptance. I will defer to the AC in this regard.

Critiques on Section 1, L1 Regularization

I am not surprised that random initialization determines whether true features are encoded monosemantic or binary. The authors clearly show how random initialization arguments are reflected in experiments on training dynamics and scaling the number of hidden neurons. However, I am unsure how these findings translate into language models, which are not explicitly trained with an L1 regularization.

Critiques on Section 3, Neural Noise

  • Without context from Elhage et al., it is unclear why kurtosis is the main metric for tracking sparsity in this context. An accompanying background section motivating kurtosis would be useful
  • Section 3.2 quickly summarizes the mathematical analysis of this section. A motivation for why bipolar noise is highlighted would be useful (Gaussian noise is a natural choice)

Questions

--

Comment

We thank reviewer RE6X for the thoughtful comments! Your suggestions have enabled us to significantly improve the content and exposition.

In general, I am unsure about the weight of novel contributions in this paper relative to the bar of an ICLR acceptance. I will defer to the AC in this regard.

Could you help us understand which aspects of the contributions you find uncertain? Our paper makes several novel contributions that we believe are significant for the field, including:

  • We demonstrate a fundamentally new origin story for polysemanticity that challenges the prevailing view that it only arises from capacity constraints.
  • We provide rigorous mathematical analyses showing how different factors (L1 regularization, noise) can cause polysemanticity (including studying their effects throughout training and measuring the effects concretely as we increase L1 regularization strength or standard deviation of the noise) even when unnecessary for task performance.
  • We precisely characterize the number of incidentally polysemantic neurons we expect to see, and verify this experimentally.

Please note that ICLR has a strong tradition of publishing theoretical work using toy models to illuminate fundamental properties of neural networks. For instance, the ICLR 2023 Best Paper runner-up "Disentanglement with Biological Constraints" (Whittington et al.) used a similar approach of mathematical analysis on toy models to understand representation learning. Their work studied mixed selectivity, closely related to polysemanticity.

However, I am unsure how these findings translate into language models, which are not explicitly trained with an L1 regularization.

Our primary aim is to demonstrate that $l_1$ regularization can amplify random correlations between features and neurons, causing incidental polysemanticity. While language models are not explicitly trained with $l_1$ regularization, regularization effects arise implicitly through initialization, dropout, weight decay, the choice of optimization algorithm, and other techniques. The objective of the paper is not language models – our goal was to cleanly demonstrate this phenomenon of polysemanticity occurring incidentally in a simplified setting where we can precisely analyze it, so as to provide a foundation for future work investigating how these dynamics manifest in other model settings. As also approached by Elhage et al., we successfully demonstrated these dynamics in full detail first in this toy model setting, which was intentionally chosen to allow for a clean mathematical analysis and to definitively show that polysemanticity can arise even when the network has sufficient capacity to learn monosemantic representations.

Without context from Elhage et al. it is unclear, why kurtosis is the main metric tracking sparsity in this context. An accompanying background section motivating kurtosis would be useful.

Elhage et al. does not use kurtosis to track sparsity. However, as we note in the paper, previous related works such as Bricken et al. 2023 and Allen-Zhu & Li 2020 study the kurtosis of the noise distribution. Higher kurtosis indicates more concentration of mass in the tails, which corresponds to sparse activation patterns where neurons are either strongly active or inactive. Thus, we feel that it is a natural and well-used metric to track in our toy model. We've also added citations to related work in sparse coding, ICA and compressed sensing in our related work section, where kurtosis plays a similar role.

Section 3.2 quickly summarizes the mathematical analysis of this section. A motivation of why highlighting bipolar noise would be useful.

Thank you for the thoughtful question. In Appendix C, we compute in full detail the excess kurtosis of different noise distributions (including bipolar and Gaussian) and the effect of these computations on incentivizing sparsity. We chose to study noise in the hidden layer for three main reasons:

  • Showing that multiple factors besides L1 regularization can cause incidental polysemanticity strengthens our main claim that such polysemanticity can arise from diverse causes unrelated to task performance.
  • Noise is a practically relevant factor to study, since widely used techniques like dropout induce noise in real networks.
  • Studying the impact of noise on representations has precedent in related works like Bricken et al. 2023 and Allen-Zhu & Li 2020.

Thus, noise seemed like a natural choice given our goals and the broader context of the field. As we discuss in Section 3.2, the expected loss at the (t+1)-th step of training involves the excess kurtosis of the noise distribution (we refer you to Appendix C for the full computation). Thus, noise distributions with contrasting kurtosis aptly demonstrate the contrasting effects that adding different types of noise to the network can have, and the downstream effects on incentivizing sparsity.

Comment

Dear authors, thank you for your helpful clarifications. I am increasing my score to 5, but am still doubtful about the applicability of the findings to more natural neural networks. The work verifies that

  • l1 regularization
  • bipolar noise (but not gaussian noise)

can cause an overparameterized toy model to map independent input features to the same hidden latent, a phenomenon the authors term incidental polysemanticity. While theoretical and experimental results of this work motivate incidental polysemanticity well in this setting, I am doubtful about the extent to which incidental polysemanticity occurs in non-overparameterized neural networks.

The only direct advice the authors give on mitigating incidental polysemanticity is this:

As a starting point, here is one possible way one might get rid of incidental polysemanticity in a neuron that currently represents two features i and j: Duplicate that neuron, divide its outgoing weights by 2 (so that this doesn't affect downstream layers), add a small amount of neural noise to the incoming weights of each copy, then run gradient descent for a few more steps. One might hope that this will cause the copies to diverge away from each other, with one of the copies eventually taking full ownership of feature i while the other copy takes full ownership of feature j.

I agree with the authors that a concrete measure for incidental polysemanticity in neural networks (especially without overparameterization) is a valuable direction for future research.

Comment

Dear Reviewer RE6X, we note that the ICLR discussion period has been extended. Could you please review our rebuttals to consider whether you would like to update your ratings? We would greatly appreciate your consideration and response.

AC Meta-Review

This paper investigates the phenomenon of "incidental polysemanticity" in neural networks, arguing that polysemantic neurons can arise from non-task-related factors such as random initialization, regularization, or noise, rather than solely from capacity constraints. While the work provides novel insights supported by theoretical derivations and toy model experiments, its reliance on simplified settings limits its generalizability to real-world scenarios. The paper falls short in addressing practical strategies for mitigating incidental polysemanticity or achieving monosemantic representations in standard neural networks. Despite its theoretical contributions, the lack of real-world validation and actionable insights diminishes its broader impact. Therefore, rejection is recommended.

Additional Comments on Reviewer Discussion

During the rebuttal period, several critical points were raised by reviewers regarding the paper's scope, generalizability, and the robustness of its conclusions. Reviewer RE6X expressed concerns about the applicability of the findings beyond toy models, questioning whether the insights on incidental polysemanticity could translate to real-world neural networks, such as language models, which are not explicitly trained with regularization. Reviewer WVCj raised similar issues, highlighting that while the toy model experiments were illustrative, their relevance to typical scenarios remained unclear. Additionally, WVCj questioned the experimental setup, particularly the lack of a baseline comparison for polysemantic neurons without regularization or noise. Reviewer 8RNd focused on the distinction between incidental and task-induced polysemanticity, noting that the paper lacked sufficient discussion of the performance-interpretability trade-off and practical implications for achieving monosemanticity in real-world networks.

In response, the authors provided detailed rebuttals aimed at addressing these concerns. To clarify the relevance of toy models, the authors argued that simplified settings allow for precise analysis of polysemanticity's origins, isolating incidental factors like random initialization and regularization from capacity constraints. They cited precedents in machine learning research where toy models have yielded valuable insights, emphasizing that their contributions were complementary to ongoing work in larger-scale settings. To address the lack of baseline experiments, the authors provided additional context, explaining how sparsity and noise-induced dynamics were measured and their effects on polysemanticity quantified. They further elaborated on how noise distributions, such as bipolar noise, implicitly create regularization-like effects, leading to sparsity and eventual polysemanticity.

Regarding the performance-interpretability trade-off, the authors argued that incidental polysemanticity does not necessarily compromise task performance, as demonstrated in their toy models where networks achieved perfect accuracy despite incidental polysemanticity. They acknowledged the need for further research to verify these findings in more complex networks. On the issue of practical implications, the authors proposed speculative strategies to mitigate incidental polysemanticity, such as duplicating neurons and introducing small perturbations to weights, but admitted that these approaches require further validation. Finally, the authors agreed with reviewers that extending their work to real-world settings, including networks trained on natural datasets, was a valuable direction for future research and promised to highlight these limitations in the final version of the paper.

While the authors' responses clarified several technical details and improved the paper’s positioning, the concerns about generalizability and impact persisted. The reliance on toy models, while useful for theoretical exploration, limited the practical relevance of the findings. The rebuttals also did not fully address how the insights could be applied to typical neural networks or training regimes. The speculative nature of the proposed mitigation strategies and the lack of additional experiments to bridge the gap between toy models and real-world scenarios further weakened the paper’s contributions.

In weighing these points, the decision was influenced by the balance between theoretical novelty and practical applicability. While the authors provided a strong theoretical framework for understanding incidental polysemanticity, the lack of validation in realistic settings and actionable solutions reduced the paper’s potential impact on the broader machine learning community. The reviewers’ consistent concerns about generalization, combined with the authors’ inability to address these concerns with concrete evidence, led to the final recommendation of rejection. This decision acknowledges the paper's merits in advancing theoretical understanding but recognizes that its current form does not meet the threshold for acceptance at a venue focused on impactful and broadly relevant research.

Final Decision

Reject