Discovering a Zero (Zero-Vector Class of Machine Learning)
Classes are defined as vectors in a vector space, where addition corresponds to the union of classes and scalar multiplication resembles the set complement of classes. The Zero-Vector in that vector space has many useful applications.
Abstract
Reviews and Discussion
The authors propose a mathematical framework for representing data classes as vectors in a vector space. The goal is to enable operations such as addition (class union) and scalar multiplication (class complement) to improve machine learning classification. The main contributions are: the introduction of the Zero-Vector Class, a novel approach to represent class boundaries, the application of this framework to enhance neural network learning, and improvements in classification accuracy and continual learning. The study shows that the Zero-Vector Class allows networks to learn the true data manifold, leading to clearer decision boundaries and better performance on classification tasks.
Questions for Authors
- How does the Zero-Vector Class perform on high-dimensional datasets, and have you tested it beyond MNIST and CIFAR-10?
- What are the computational costs of integrating the Zero-Vector Class, and how does it compare in efficiency to standard classification methods?
Claims and Evidence
The authors claim that the Zero-Vector Class improves classification by enabling neural networks to learn the true data manifold. The evidence includes mathematical proofs establishing a vector space for class representation, experimental results on MNIST and CIFAR-10, and comparisons between models trained with and without the Zero-Vector Class. The main claims are: the Zero-Vector Class refines decision boundaries, it enables unary class learning, and it facilitates continual learning. The experiments support these claims by demonstrating reduced misclassification in empty feature space, improved performance in single-class training, and effective knowledge transfer in continual learning. However, the scalability claim is less supported, as results on CIFAR-10 indicate challenges in higher-dimensional spaces, suggesting the need for further validation on more complex datasets.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria are well-suited for the problem of class representation in a vector space. The introduction of the Zero-Vector Class and the use of set operations on classes align with the goal of improving classification and continual learning. The evaluation criteria are also strong.
Theoretical Claims
No major flaws were found in the provided proofs, but additional empirical verification on complex datasets would strengthen the theoretical claims.
Experimental Design and Analyses
From my point of view, the experimental design appears sound.
- The paper evaluates the Zero-Vector Class using well-known datasets (including MNIST and CIFAR-10) to test its effectiveness in classification.
- The authors compare models trained with and without the Zero-Vector Class to evaluate its impact on decision boundaries and classification performance.
- The introduction of the Occupancy Factor and Purity Factor provides a structured way to evaluate the clarity and accuracy of class boundaries.
- The continual learning experiments demonstrate the potential of the Zero-Vector Class for modular and scalable learning without retraining entire models.
Supplementary Material
I reviewed the supplementary material, focusing on the additional theoretical derivations, experimental details, and evaluation metrics.
Relation to Broader Scientific Literature
The paper's key contributions align with and extend existing concepts in machine learning and embedded representations. Representing data classes as vectors in a vector space is reminiscent of embedding techniques, which map high-dimensional data into lower-dimensional spaces while preserving meaningful semantic relationships. This approach facilitates operations like addition and scalar multiplication, akin to how embeddings capture semantic similarities in natural language processing.
Essential References Not Discussed
The paper's key contributions are grounded in established concepts within machine learning, particularly vector space models and support vector machines (SVMs). Could you reference foundational works that have significantly influenced this domain, such as word2vec (https://arxiv.org/abs/1301.3781)?
Other Strengths and Weaknesses
Strengths:
- The paper exhibits notable strengths in its originality and potential significance.
- The authors introduce a novel mathematical framework that represents data classes as vectors in a vector space.
- This offers a fresh perspective on class representation in machine learning.
- The Zero-Vector Class allows neural networks to learn the true data manifold rather than just decision boundaries.
Weaknesses:
- The scalability of the Zero-Vector Class to high-dimensional datasets is not thoroughly explored.
- Limited discussion on computational complexity and efficiency of the proposed method.
Other Comments or Suggestions
- Overall the paper is well-written.
- While the method is evaluated on MNIST and CIFAR-10, a discussion on its applicability to higher-dimensional datasets would enhance its impact.
We sincerely appreciate your careful and constructive review, which motivated us to perform additional experiments on ImageNet-1K embeddings and to analyze computational complexity explicitly. Your feedback has significantly strengthened our paper.
Response on Scalability and Computational Efficiency
We first address computational complexity to support the discussion on scalability.
Q2. Computational Cost
To analyze this clearly, we divide the computational requirements for any $c$-class classifier into two distinct parts:
- Pre-softmax logit computation
- Softmax computation
Let a Zero-Exclusive Network be a classifier that does not use the zero-vector class during training. Suppose there are $n$ training points, and let the computation needed to compute the pre-softmax logits for one example be $L$; for instance, $L$ might represent millions of operations. Computing the softmax function then takes $S(c)$ computations, where $S(c)$ represents the complexity of softmax over $c$ classes.
Now consider modifying the Zero-Exclusive Network by adding an extra node at the output to classify the zero-vector class; we call this the Zero-Inclusive Network. In this scenario, we incorporate the zero-vector class in training. The addition of the new node introduces extra calculations dependent on the number of nodes in the preceding layer. Letting $m$ denote the number of nodes in the layer preceding this new node, the additional computation required per example is $O(m)$.
The computational complexities during training and inference are summarized in the following table:
| Classifier | Training Cost | Inference Cost |
|---|---|---|
| Zero-Exclusive Network (base classifier) | $n\,(L + S(c))$ | $L + S(c)$ |
| Zero-Inclusive Network ($n_0$ zero-vector data points in training) | $(n + n_0)\,(L + m + S(c+1))$ | $L + m + S(c+1)$ |
Typically, $L$ dominates, since $m$ corresponds only to the final internal layer. There is a trade-off between computational overhead and purity improvement, and this trade-off depends significantly on the intended applications leveraging the full potential of the Zero-Vector framework.
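As a rough back-of-envelope illustration (the layer width $m$ and class count $c$ below are assumed example values, not the paper's configuration), the per-example overhead of the extra output node is tiny compared with the backbone cost $L$:

```python
# Back-of-envelope overhead of one extra zero-vector output node.
# m = width of the layer feeding the output node, c = number of original classes.
# Both values are illustrative assumptions.
def extra_ops_per_example(m: int, c: int) -> int:
    extra_logit = m + 1   # multiply-adds plus bias for the single new output node
    extra_softmax = 1     # softmax now normalizes over c + 1 scores instead of c
    return extra_logit + extra_softmax

print(extra_ops_per_example(m=512, c=10))  # 514 extra operations vs. millions (L) for the backbone
```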
We will include this complexity analysis in the additional page for the final submission.
Q1. Scalability
We acknowledge scalability as a critical concern. Our motivation for exploring high-dimensional embeddings comes from the success of embedding-based models like VQ-VAE, Stable Diffusion, and MAE, where internal dimensions (512–1024) capture complex distributions effectively.
To test this, we conducted experiments on embeddings (dim: 768) from a pretrained MAE on ImageNet-1K (a minimal illustrative sketch follows the list below):
- Selected 10 random ImageNet classes and obtained embeddings using MAE
- Generated corresponding Zero-Vector Class data within this embedding space.
- Trained two classifiers:
- Zero-Inclusive Network (trained with Zero-Vector Class data)
- Zero-Exclusive Network (trained without Zero-Vector Class data)
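The sketch below illustrates this setup under stated assumptions: embeddings are taken as already extracted (random tensors stand in for them here), zero-vector data is drawn uniformly from the bounding box of the embeddings, and the layer sizes and sample counts are placeholders rather than our exact experimental configuration.

```python
# Hypothetical sketch: zero-vector data sampled uniformly inside the bounding box
# of 768-dimensional embeddings, plus two classifiers (with/without the extra class).
import torch
import torch.nn as nn

emb = torch.randn(5000, 768)              # stand-in for precomputed MAE embeddings
labels = torch.randint(0, 10, (5000,))    # 10 randomly chosen ImageNet classes

lo, hi = emb.min(dim=0).values, emb.max(dim=0).values
zero_data = lo + (hi - lo) * torch.rand(5000, 768)        # uniform within the embedding box
zero_labels = torch.full((5000,), 10, dtype=torch.long)   # index 10 = zero-vector class

def make_classifier(n_out: int) -> nn.Module:
    return nn.Sequential(nn.Linear(768, 512), nn.ReLU(), nn.Linear(512, n_out))

zero_exclusive = make_classifier(10)      # trained on (emb, labels) only
zero_inclusive = make_classifier(11)      # trained on the augmented set below
x_inc = torch.cat([emb, zero_data])
y_inc = torch.cat([labels, zero_labels])
# Both models are then trained with the same cross-entropy / SGD loop.
```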
Results:
- Purity Factor:
- Zero-Exclusive Network purity fluctuated significantly (0.1–0.9), indicating substantial misclassification of empty regions in the feature space.
- Zero-Inclusive Network purity remained consistently high and stable (0.7–1.0), indicating effective suppression of misclassification in empty regions.
- Accuracy:
- Both networks reached similar test accuracy (~82%), showing no degradation from the Zero-Vector Class.
These experiments demonstrate the practical scalability of our framework to high-dimensional feature spaces, directly addressing the reviewer’s concern. Click here to view additional results on ImageNet-1K embeddings.
Similarly, in NLP Transformer models, next-token prediction is a classification over a fixed vocabulary, naturally allowing inclusion of the Zero-Vector Class as an extra token. While NLP experiments are left for future work, typical embedding sizes (~512) are smaller than those already tested (e.g., 784 for MNIST), supporting the method’s practical feasibility in NLP tasks.
Regarding References
This work explores a novel direction, and we found it challenging to identify direct precursors. Thank you for suggesting foundational works like word2vec and SVMs—we agree these are relevant in spirit and will incorporate such references in the final version. We welcome any further suggestions as well.
We believe these clarifications resolve the concerns raised. If you find these points convincing, we would appreciate a reconsideration of your Overall Recommendation. Thank you again for your valuable review and time.
This paper proposes a mathematical framework for handling classes in datasets as vectors in a vector space, including a Zero-Vector Class which can be regarded as the absence of a class. They introduce all theoretical foundations and discuss two applications of their framework, namely "clear learning" and "unary class learning". The approach is validated using MNIST and CIFAR-10 datasets.
Questions for Authors
- What does the training procedure for clear learning look like?
Claims and Evidence
The main content of this work is the theoretical foundation of the Zero-Vector Class. The authors provide well-supported evidence for their claims in the form of experiments and visualizations of their theoretical concept. However, there are some assumptions that lack evidence, e.g. the assumption that the Zero-Vector Class acts like a uniform distribution and an additive identity. Nevertheless, the empirical experiments show that the learning approaches work.
Methods and Evaluation Criteria
The proposed methods make sense for the problem at hand. They demonstrate the theoretical assumptions, but also reveal limitations in scalability, which is left for future work.
Theoretical Claims
See above.
Experimental Design and Analyses
The experiments and analyses are sound. However, the training procedure for the individual neural networks is not described in much detail.
Supplementary Material
The code for the models is provided as supplementary material, which I checked sparingly.
Relation to Broader Scientific Literature
The authors contribute to a broad range of literature on the theoretical foundations of learning algorithms. In the future more experiments should be conducted for more diverse benchmarks beyond MNIST and CIFAR-10.
Essential References Not Discussed
There are no essential references not discussed that I know of.
Other Strengths and Weaknesses
The idea is very interesting and a great contribution. The theoretical assumptions and limitations concerning scalability should be investigated more in the future.
Other Comments or Suggestions
Small errors:
- line 043: "many techniques (...) that allow" instead of allows
- the alternation between upper and lower case is irritating (e.g. line 044: "the Neural Network", in the next column: "If a neural network exhibits..", "Logit" vs. "logit").
- line 037 (right column): "of the combined class's" should be "classes'" I think
- line 139: "need not be an.." should be "does not need to be.."
- line 132 (right column): "in the left-hand side" should be "on the left-hand side"
- Beginning of Section 3: there is an article missing "set of Valid Logit.."
- When using equations the punctuation is missing
- On the right column of page 4 the equations are not numbered
- Instead of "refer the subsection" it should be "see subsection .."
- line 234 (right column): "the PDF of [0] is considered" instead of "consider"
- line 310: "the learning does not make much sense"
We sincerely appreciate your careful and meticulous review, especially your identification of small errors with exact line numbers; this level of detail is invaluable for improving the manuscript's clarity and quality.
Clarification on Training Procedure for Clear Learning
Thank you for raising this important point. We acknowledge the training procedure was not fully detailed in the original submission and appreciate the opportunity to clarify.
Clear Learning follows a standard supervised classification setup using Cross-Entropy loss and SGD optimization. The only change is the inclusion of the Zero-Vector Class. This is done by uniformly sampling points from the input space, labeling them as the Zero-Vector Class, and adding them to the training data. The classifier’s output layer includes one extra node to represent this class.
All other components—architecture, loss function, optimizer, and hyperparameters—remain identical to baseline training. This makes the method easy to integrate into existing pipelines with minimal changes.
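For concreteness, the following minimal sketch shows the shape of this procedure; the input range, architecture, and hyperparameters are illustrative assumptions rather than our exact training code (which is provided in the supplementary material).

```python
# Sketch of "Clear Learning": standard classification plus a Zero-Vector Class
# formed by uniform samples over the input space. Values are illustrative assumptions.
import torch
import torch.nn as nn

def add_zero_vector_class(x, y, n_zero, n_classes, low=0.0, high=1.0):
    """Append uniformly sampled inputs labeled with the extra zero-vector class."""
    d = x.shape[1]
    x_zero = torch.empty(n_zero, d).uniform_(low, high)           # uniform over the input space
    y_zero = torch.full((n_zero,), n_classes, dtype=torch.long)   # extra label index = n_classes
    return torch.cat([x, x_zero]), torch.cat([y, y_zero])

n_classes, d = 10, 784                      # e.g., flattened MNIST-like inputs
x = torch.rand(1024, d)                     # toy stand-in for real training data
y = torch.randint(0, n_classes, (1024,))
x_aug, y_aug = add_zero_vector_class(x, y, n_zero=1024, n_classes=n_classes)

# Zero-Inclusive classifier: one extra output node for the zero-vector class.
model = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, n_classes + 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for _ in range(10):                         # ordinary supervised loop, otherwise unchanged
    opt.zero_grad()
    loss = loss_fn(model(x_aug), y_aug)
    loss.backward()
    opt.step()
```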
We used standard network architectures appropriate for each task:
- For MNIST and the 2D datasets (e.g., Figure 6), we used fully connected networks.
- For CIFAR-10, a conventional convolutional neural network.
- For the equation discovery task, a multidimensional Taylor-series-based classifier.
Given the range of tasks, we’ve included implementation code in the supplementary material and will release the full codebase upon acceptance. Training details will also be clarified in the final revision.
Assumptions about Zero-Vector Class (Mathematical Rigor & Clarity)
We appreciate your comment regarding the assumption that the Zero-Vector Class behaves like a uniform distribution and an additive identity. While our main goal was to convey the intuition leading to the framework, we agree that formalization is valuable. Your feedback helped us clarify this aspect more rigorously.
As described in Section 2 (Characterization of Logits) of the main paper, if the data PDF is $f$, a predefined threshold $t$ can determine class membership. Thus the tuple $(f, t)$ defines that class, say class-$A$. More generally, let $T$ be a black-box operator that takes a PDF as input and returns a threshold function suited to that PDF, so that the tuple $(f, T(f))$ defines the class. For example, $(kf, T(kf))$ defines the same class-$A$ for any real-valued constant $k > 0$. All such tuples preserve the decision boundary and thus represent the same underlying class.
Let $\mathcal{M}$ be the set of monotonically non-decreasing functions. Then each tuple in the set $\{\,(mf, T(mf)) : m \in \mathcal{M}\,\}$ also defines the same class-$A$. Here, $mf$ denotes $m(f)$, written this way to reduce bracket clutter.
The set $V$ of vectors consists of all such equivalence classes of tuples, with addition and scalar multiplication defined as in the main paper.
The remaining vector space properties follow as in the main paper. Here, we focus on the additive identity (the zero-vector).
Let $z$ be the additive identity of this vector space. Then, for any vector $v \in V$, we must have $v + z = v$, which implies that the equivalence set of tuples defining $v + z$ coincides with the equivalence set defining $v$.
If the PDF representing $z$ belongs to $\mathcal{M}$, then the above holds trivially. Hence, the additive identity must correspond to a monotonically non-decreasing function. Since every vector in $V$ corresponds to a class (i.e., a PDF), there must exist a PDF representing the zero-vector class that is monotonically non-decreasing. We now analyze its form:
Case 1: PDF of the Zero-Vector is Constant
A constant PDF corresponds to a uniform distribution. Hence, the zero-vector class corresponds to a uniform distribution over its support.
Case 2: PDF is Strictly Monotonically Increasing
Let $\Omega$ be the volume spanned by a PDF that is strictly monotonically increasing and has zero probability outside $\Omega$. Since a PDF must integrate to one over its support, as $|\Omega| \to \infty$ the PDF appears approximately constant within any finite, localized region. Thus, samples will appear uniform within any localized region.
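As a simple one-dimensional illustration of Case 2 (our own worked example, not taken from the main paper), consider the strictly increasing PDF $f(x) = 2x/R^2$ on the support $[0, R]$. A sample $X$ drawn from $f$ is typically of order $R$ (for instance, $\mathbb{E}[X] = 2R/3$), so over any fixed window $[X, X + w]$ the density varies only by the factor $f(X + w)/f(X) = 1 + w/X \to 1$ as $R \to \infty$. Hence, in the regions where samples actually land, the density is locally nearly constant, i.e., the samples look locally uniform.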
From both Case 1 and Case 2, it follows that data sampled from the zero-vector class appears uniform within any finite local region. Hence, the Zero-Vector Class effectively behaves as a uniform distribution.
We invite you to look at the computational cost and scalability discussed in response to reviewer gWUe. If our clarifications have addressed your concerns, we would be grateful if you would consider updating your overall recommendation. Thank you again for your time and thoughtful review.
Thank you very much for the further clarification and considering my comments. I will revise my score.
We deeply appreciate your acceptance of our work and your thoughtful suggestions, which have significantly improved the clarity and rigor of our manuscript—particularly around the training procedure and formalization.
Given the foundational nature of this concept, we feel a strong responsibility to ensure it reaches and informs the broader ML community, helping shape future research around clearer, more interpretable representations.
As we’ve fully addressed your thoughtful feedback, we kindly invite you to consider increasing your recommendation to further support this direction and its broader impact.
Thank you again for your time, insight, and constructive review.
This paper introduced a novel technique to regularize neural networks' decision boundaries by introducing the so-called "zero-vector class". The paper established some mathematical properties of the zero-vector class concept and derived a simple method to improve neural networks on different tasks.
Questions for Authors
In the paper, the authors assumed that we could sample the "zero class" simply by drawing from a uniform distribution. However, it is quite common that the data distribution we care about lies on a low-dimensional manifold (or a neighborhood of that manifold), which we can hardly sample from efficiently. For example, we cannot easily sample random images other than Gaussian noise. I am curious how the authors would consider sampling from the "zero class" in those cases.
Claims and Evidence
The paper claims to have invented the notion of a "zero-vector class" and applied it to classification problems. The concept is useful in many situations, such as single-class learning and continual learning, and improving the decision boundary. The paper studied the empirical benefit in the Appendix, which I have not been able to check.
Methods and Evaluation Criteria
The main part of the paper consists of building the formal framework of Valid Logit Functions and the vector space they constitute. There are a few empirical claims discussed in the paper, but their results are presented in the Appendix, which I have not checked in full.
Theoretical Claims
The theoretical claims are that the valid logit functions can be partitioned into equivalence classes and that these classes form a vector space under certain conditions. This is proved in the main paper and I believe it is correct.
Experimental Design and Analyses
This paper has some experimental results showing the empirical properties of adding a "zero class" in the classification task. Even though it is not an overwhelmingly better method compared to not adding the class, it did change the neural network's decision boundary in some ways and improved on the purity factor defined in the paper. However, it is unclear whether this approach can scale when the domain of tasks changes.
Supplementary Material
see above
Relation to Broader Scientific Literature
n/a
Essential References Not Discussed
n/a
Other Strengths and Weaknesses
see above
Other Comments or Suggestions
n/a
Thank you for your thorough review and for positively highlighting the theoretical contributions of our work.
Q. Regarding Sampling from Low-Dimensional Manifolds
Your concern about sampling from low-dimensional manifolds is valid. Directly sampling from such manifolds is generally impractical. Crucially, however, our method explicitly does not require manifold sampling. Instead, we uniformly sample from the entire input space—a process that is both simple and ensures maximal entropy.
Why is this effective? Uniform sampling guarantees maximal entropy and thus explicitly represents regions of the input space associated with high uncertainty. These samples intentionally represent areas the classifier should not confidently assign to known classes. By introducing the Zero-Vector Class, we encourage the classifier to place decision boundaries precisely around true data manifolds, effectively learning the manifold structure indirectly. This is clearly demonstrated in Figure 6 (bottom row): without the Zero-Vector Class (bottom left), the classifier's boundaries incorrectly extend into empty regions. In contrast, with the Zero-Vector Class (bottom right), decision boundaries closely follow the true data distribution. This result validates our simple yet powerful sampling approach.
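To make this concrete, the hypothetical probe below (our own illustrative code; the padding, distance threshold, and confidence level are assumptions) estimates how often a trained classifier assigns a confident real-class label to points far from any training data. A Zero-Inclusive Network should drive this fraction down, since such probes get routed to the zero-vector class.

```python
# Illustrative probe: fraction of "empty-space" points a classifier still assigns
# to a real class with high confidence. Thresholds are assumptions, not the paper's.
import torch
import torch.nn as nn

def confident_in_empty_space(model, data, n_real, n_probe=10000,
                             margin=2.0, min_dist=1.0, conf=0.9):
    lo = data.min(dim=0).values - margin
    hi = data.max(dim=0).values + margin
    probes = lo + (hi - lo) * torch.rand(n_probe, data.shape[1])  # uniform over a padded box
    dists = torch.cdist(probes, data).min(dim=1).values           # distance to nearest datum
    empty = probes[dists > min_dist]                               # heuristically "empty" region
    probs = torch.softmax(model(empty), dim=1)
    real_conf = probs[:, :n_real].max(dim=1).values                # confidence over real classes only
    return (real_conf > conf).float().mean().item()

# Toy usage with a 2-D model having 3 real classes plus one zero-vector output.
model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 4))
data = torch.randn(500, 2)
print(confident_in_empty_space(model, data, n_real=3))
```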
Regarding the Strength of Empirical Results
We respectfully disagree with the assertion that improvements from our method are marginal. The gains are substantial, measurable, and practically significant. Figure 6 visually demonstrates how the Zero-Vector Class dramatically improves decision boundary alignment with the true manifold. Moreover, quantitative evidence strongly supports our method:
- In Figure 18 (MNIST), the purity factor for classes 3 and 8 drops below 5% without the Zero-Vector Class, meaning the classifier confidently misclassifies 95% of the region. With the Zero-Vector Class, purity exceeds 95%, meaning only 5% of the region is misclassified.
- A similar pattern occurs in Figure 29 (CIFAR-10).
High purity explicitly indicates decision regions closely align with true data manifolds. Achieving this high purity directly enables a variety of applications (e.g., single-class learning, continual learning, equation discovery), which are otherwise infeasible with traditional classification methods.
Scalability to High-Dimensional Data
We appreciate your question about scalability, and explicitly evaluated it through additional experiments inspired by the practical success of embedding-based models (e.g., VQ-VAE, Stable Diffusion, and Masked Autoencoder [MAE]). Specifically, we:
- Selected embeddings (768-dimensional) from 10 random classes in ImageNet-1K using a pretrained MAE model.
- Generated corresponding Zero-Vector Class data uniformly in this embedding space.
- Trained two classifiers:
- Zero-Inclusive Network (with Zero-Vector data)
- Zero-Exclusive Network (without Zero-Vector data)
Results demonstrate clear scalability and effectiveness:
- Purity Factor:
- Zero-Exclusive Network purity fluctuates dramatically (0.1–0.9), indicating misclassification of empty space.
- Zero-Inclusive Network purity remains consistently high and stable (0.7–1.0).
- Accuracy: Both classifiers achieve comparable accuracy (~82%), confirming no accuracy penalty from the Zero-Vector Class.
These experiments clearly validate our method’s scalability to high-dimensional embedding spaces used by modern neural networks. Click here to view additional results on ImageNet-1K embeddings.
For NLP applications, Transformer models typically operate in embedding spaces with dimensions (~512) lower than those we have already successfully tested (e.g., MNIST: 784). Since next-token prediction in NLP is formulated as a vocabulary classification task, incorporating the Zero-Vector Class as an additional vocabulary token is straightforward and promising. Explicit NLP experiments will be pursued in future work.
We believe these clarifications resolve the concerns raised. If you find these points convincing, we would appreciate a reconsideration of your Overall Recommendation. Thank you again for your valuable review and time, and the insightful question.
Thank you for providing the additional results on Image-1K embeddings. After reviewing the updated findings, I have revised my score to reflect the improvements. However, as someone who is not very familiar with this specific line of work, I advise readers to interpret my evaluation with appropriate caution.
Thank you for your thoughtful reconsideration and for updating your evaluation based on the additional ImageNet-1K experiments. We greatly appreciate your engagement with our clarifications and results, which helped us strengthen the manuscript significantly.
We view the review and rebuttal process as a core part of scientific progress, and we’re confident that readers will interpret all evaluations—including yours—as essential contributions to that process.
We believe this work presents a foundational perspective on class representation in machine learning, with broad relevance to tasks such as anomaly detection, continual learning, and the interpretability of decision boundaries. Given its conceptual importance, we feel a responsibility to ensure this idea is made accessible to the wider ML community—especially at a venue like ICML, where emerging directions often shape future research.
If you find that the revised results and responses have addressed your original concerns, we would sincerely appreciate your further endorsement. Your support could meaningfully increase the visibility of this work and encourage broader exploration of its ideas.
Thank you again for your time and thoughtful feedback.
The paper introduces a novel mathematical framework for understanding class representations in machine learning by defining classes as vectors in a vector space. The core idea revolves around the concept of a Zero-Vector Class, which corresponds to a data class with a uniform distribution. This conceptualization enables various applications, including clear learning, unary class learning, set operations on classes, continual learning, data generation, and equation discovery. The framework is demonstrated through experiments on the MNIST and CIFAR-10 datasets.
Questions for Authors
- Stability of Zero-Vector Class Training: Given that Zero-Vector Class data is sampled from a uniform distribution, does this introduce instability or difficulties during training? Did you observe any vanishing/exploding gradients or convergence issues?
- The paper draws parallels between Zero-Vector Class training and Energy-Based Models (EBMs). How does the proposed method differ in terms of learned representations, training stability, and generalization?
Claims and Evidence
The claims made in the submission are supported by clear and convincing evidence
Methods and Evaluation Criteria
The proposed methods and/or evaluation criteria (e.g., benchmark datasets) make sense for the problem or application at hand
Theoretical Claims
I did not check the correctness of any proofs for theoretical claims
Experimental Design and Analyses
I checked the soundness/validity of the experimental designs and analyses.
Supplementary Material
I reviewed the supplementary material.
Relation to Broader Scientific Literature
The key contributions of the paper are not related to the broader scientific literature
Essential References Not Discussed
I am unfamiliar with this area and am not sure whether there are essential references that should be discussed.
Other Strengths and Weaknesses
Strengths:
- Novel Theoretical Contribution: The paper presents an innovative way of treating classes as vectors, enabling set operations on class regions in a mathematically rigorous manner. The introduction of the Zero-Vector Class provides a fresh perspective on classification and decision boundaries.
- Improved Decision Boundaries (Clear Learning): The paper argues that incorporating the Zero-Vector Class leads to clearer and more realistic decision regions in neural networks, avoiding misclassification in empty feature space regions.
- Facilitates Unary Class Learning: The proposed approach allows neural networks to learn a single class in isolation, which can be useful in anomaly detection and other real-world applications where only positive examples are available.
- Continual Learning Without Retraining: The Zero-Vector Class allows for incremental learning without catastrophic forgetting, as new classes can be integrated without redefining previous decision boundaries.
- Potential for Data Generation and Equation Discovery: The method can be used to generate synthetic data by following gradients of logit functions. It can also help discover mathematical equations representing classes through Taylor-series-based learning.
Weaknesses:
- Mathematical Rigor & Clarity: While the paper provides proofs and derivations, some sections could be presented in a more structured and rigorous manner. Certain notations and definitions (e.g., the validity of the equivalence sets and operations on them) could benefit from clearer formalization.
- Scalability to High-Dimensional Data: The framework's applicability to more complex datasets (e.g., ImageNet or NLP tasks) remains unclear. The Zero-Vector Class is primarily tested in lower-dimensional spaces; its efficiency in high-dimensional feature spaces needs further exploration.
- Computational Efficiency: The paper does not discuss the computational overhead of incorporating Zero-Vector Classes in training. The potential trade-off between training time and classification accuracy should be analyzed.
Other Comments or Suggestions
See weaknesses
We sincerely thank the Reviewer for their detailed and constructive review. The questions raised are insightful and have helped us further refine and improve our work.
Q1. Stability of Zero-Vector Class Training
We observed no instability during training with the Zero-Vector Class. The Zero-Vector Class uses identical network architecture and loss function as standard training. Empirically, we find training stability improved when incorporating the Zero-Vector Class:
- Purity Factor plots (Figure 18 for MNIST, Figure 29 for CIFAR-10) are more stable during training when the Zero-Vector Class is included.
- Test Accuracy plots (Figure 18 for MNIST, Figure 29 for CIFAR-10) are stable irrespective of whether the zero-vector class is used in training.
- No vanishing/exploding gradients or convergence issues were observed, as training metrics either improved or remained stable.
Q2. Differences from Energy-Based Models (EBMs)
We clearly distinguish our method from EBMs as follows:
- Representation: EBMs learn a single global energy function representing data likelihood. Our method explicitly learns separate class logits, one per class, via a standard classification network, resulting in distinct, interpretable per-class distributions.
- Training Stability: EBMs typically require specialized training procedures (e.g., Langevin dynamics sampling), leading to potential instability or complex hyperparameter tuning. Our method uses standard supervised classification (cross-entropy loss) with the Zero-Vector Class, providing inherent training stability without extra sampling or specialized optimization.
- Generalization: EBMs are susceptible to mode collapse, potentially failing to capture the complete data distribution. Our method explicitly learns the underlying data distribution through class logits (Equation 27), converging to the true distribution as the number of data points increases. This ensures comprehensive generalization to the true data manifold, thus avoiding mode collapse.
Scalability and Computational Efficiency
Computational Efficiency
We analyze computational complexity in two parts:
- Pre-softmax logits: cost $L$ per example, over $n$ training data points.
- Softmax calculation: $S(c)$ for $c$ classes.
Adding the Zero-Vector Class introduces one extra output node, adding overhead $O(m)$ per example, where $m$ is the size of the preceding layer.
Complexities summarized:
| Classifier | Training Cost | Inference Cost |
|---|---|---|
| Zero-Exclusive | $n\,(L + S(c))$ | $L + S(c)$ |
| Zero-Inclusive ($n_0$ extra zero-vector points) | $(n + n_0)\,(L + m + S(c+1))$ | $L + m + S(c+1)$ |

Typically, $L$ dominates, making the $O(m)$ overhead manageable, especially given the improvements in purity.
Scalability
Inspired by successful embedding-based models (VQ-VAE, Stable Diffusion, MAE; typical dimensions: 512–1024), we explicitly tested scalability on embeddings (dimension: 768) from a Masked Autoencoder (MAE) pretrained on ImageNet-1K:
- Selected 10 random ImageNet classes, generated embeddings and corresponding Zero-Vector data.
- Trained Zero-Inclusive (with Zero-Vector data) and Zero-Exclusive classifiers.
Results confirm scalability:
- Purity Factors: 0.7–1.0 for Zero-Inclusive Network vs. 0.1–0.9 for Zero-Exclusive Network, demonstrating improved boundaries.
- Accuracy: Both networks achieved similar accuracy (~82%), confirming no accuracy degradation.
Click here to view additional results on ImageNet-1K embeddings.
In NLP Transformers, next-token prediction tasks classify over a fixed vocabulary, naturally permitting Zero-Vector Class integration. Typical NLP embedding dimensions (~512) are lower than those already tested (e.g., MNIST: 784), confirming practical feasibility. Explicit NLP experiments remain future work.
Response on Mathematical Rigor & Clarity:
Our goal was to clearly convey the core intuition behind our framework and support it empirically. We agree formalization is valuable and have provided rigorous mathematical explanations in response to Reviewer dsGt, which we encourage you to read for more details.
Regarding Essential References:
Given the novelty of this approach, identifying directly related prior work was challenging for us as well. We hope this contribution opens a new line of exploration.
We sincerely appreciate your detailed review and hope our responses clarified key concerns. Feedback from other reviewers was positive; if our clarifications resolved your concerns, we would appreciate your reconsideration of the Overall Recommendation. Thank you for your valuable evaluation and time.
Thank you for all your detailed responses. I will revise my score.
We sincerely appreciate your reconsideration and the increased evaluation score. Your insightful comments have meaningfully shaped and improved the clarity and depth of our manuscript.
We genuinely believe this work introduces a foundational concept with far-reaching implications for the ML community. Given its importance, we feel a strong responsibility to ensure the idea is clearly communicated and widely understood, helping guide future research along scientifically grounded and impactful directions.
In direct response to your earlier concerns, we conducted additional experiments and provided detailed clarifications—addressing high-dimensional scalability (via ImageNet embeddings), computational complexity, training stability, and the distinctions from Energy-Based Models (EBMs).
If you find that these clarifications and results fully resolve your earlier concerns, we kindly ask you to consider revising your recommendation to a clear Accept. Your support would play a key role in helping this contribution reach the community it is intended to serve.
Thank you again for your thoughtful and constructive feedback.
This paper introduces an innovative framework that models classes as vectors, enabling mathematically rigorous set operations on class regions. Building on this formulation, the authors propose the concept of a Zero-Vector Class, which offers a novel perspective on classification and decision boundaries. The incorporation of this zero class results in more compact and representative class acceptance regions and facilitates learning from a single class, making the approach well-suited for anomaly detection. Additionally, the Zero-Vector Class supports incremental learning without catastrophic forgetting, as new classes can be integrated without redefining existing decision boundaries. The method also shows promise for synthetic data generation by following gradients of logit functions and for discovering analytical class representations through Taylor-series-based learning. All of these contributions are substantiated by the experimental results. I strongly recommend the acceptance of this paper.