PaperHub
Overall rating: 6.7 / 10 (Poster; 3 reviewers; min 6, max 7, std 0.5)
Individual ratings: 6, 7, 7
Confidence: 2.7 · Soundness: 3.0 · Contribution: 2.7 · Presentation: 3.3
NeurIPS 2024

Navigating Extremes: Dynamic Sparsity in Large Output Spaces

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2025-01-16
TL;DR

Investigates Dynamic Sparse Training for large output spaces. Leveraging semi-structured sparsity, intermediate layers, and auxiliary loss, it enables end-to-end training with millions of labels on commodity hardware with near-dense performance.

Keywords

Dynamic sparse training, extreme classification, memory efficient training, large output spaces, scalable machine learning

Reviews and Discussion

Review
Rating: 6

The paper proposes the application of Dynamic Sparse Training (DST) to Extreme Multi-label Classification (XMC) by employing an intermediate layer and adding an auxiliary training objective to enable end-to-end training of XMC problems with millions of labels on commodity hardware.

Strengths

  1. The paper is well-written and easy to follow.
  2. The paper solves an important problem of democratizing access to the state-of-the-art XMC model training.
  3. The paper provides a thorough analysis of challenges and solutions in applying DST to XMC.
  4. The conclusion is supported by experimental results on 4+1 datasets.

Weaknesses

  1. The paper addresses memory efficiency; however, training time still remains a concern, specifically for time-sensitive real-world application datasets.
  2. The paper does not present results on datasets with label features except one (LF-AmazonTitles-131K). Many modular techniques such as NGAME, DEXA, etc. perform much better on these datasets and are memory efficient by using negative sampling and optimizers such as SparseAdam.
  3. The introduction of new hyperparameters, such as sparsity levels, intermediate layer sizes, and auxiliary objectives, adds to the complexity of the model. The sensitivity of the model's performance to these hyperparameters is not thoroughly explored.

Questions

  1. It would be good to add training time estimates in addition to memory usage.
  2. In the introduction, the authors refer to [32] by Schultheis and Babbar which assumes the existence of pre-trained embeddings and claim that DST can give substantial memory savings. What if DST is applied to fixed embeddings generated from NGAME, DEXA etc.? Why is there a need to train these models end-to-end?
  3. It would be good to add a couple more datasets with LF features.
  4. It would be good to compare to Renee with smaller sized encoders.
  5. DEXML (Dual-Encoders for Extreme Multi-Label Classification) is a new encoder-based approach that seems to perform on par with XMC models. Please compare the proposed approach to DEXML on a couple of datasets with label features.
  6. It would be good to add sensitivity of the model's performance to the new hyperparameters.

Limitations

  1. The authors have discussed some limitations.
  2. It would be good to comment on the scalability of the proposed approach, given that real-world datasets can have far more than 3M labels.
Author Response

It would be good to add training time estimates in addition to memory usage.

The training times of the proposed dynamic sparse approach are close to those of LightXML, and about 1.5x those of a dense last layer (corresponding to the Renee architecture). To ensure a fair timing comparison, all these experiments were run on an A100 GPU, even though our method also works on cheaper hardware.

Training time in hours:

| Dataset | Attn. | Light. | Casc. | Dense | Dense-BN | RiGL | Ours |
|---|---|---|---|---|---|---|---|
| Wiki10-31K | - | 6.7 | 0.3 | 0.75 | 0.95 | 1.2 | 1 |
| Wiki-500K | 37.3 | 34.3 | 22.0 | 21.6 | 35 | 47 | 36 |
| Amazon-670K | 26.1 | 28.8 | 7.5 | 18 | 27.5 | 44 | 30 |
| Amazon-3M | 59.5 | - | - | 52 | 56 | OOM | 72 |

What if DST is applied to fixed embeddings generated from NGAME, DEXA etc.? Why is there a need to train these models end-to-end?

Further results with fixed embeddings obtained from CascadeXML features:

Wiki-500K:

| Embeddings | P@1 | P@3 | P@5 |
|---|---|---|---|
| Fixed | 73.6 | 54.8 | 42.1 |
| End-to-End | 76.7 | 57.8 | 44.5 |

Amazon-670K:

| Embeddings | P@1 | P@3 | P@5 |
|---|---|---|---|
| Fixed | 42.6 | 37.1 | 33.1 |
| End-to-End | 47.1 | 41.8 | 38.0 |

The results demonstrate that training the model end-to-end is beneficial in terms of prediction performance as it provides noticeable improvements over fixed embeddings. Furthermore, the embeddings obtained from other pipelines are dataset specific, and in a realistic application scenario, if the dataset is not derived from any of the pre-existing ones from the Extreme Classification Repository, the only option is to learn the model from scratch in an end-to-end manner.


DEXML(Dual-Encoders for Extreme Multi-Label Classification) is a new encoder-based approach that seems to perform at par with XMC models. Please compare the proposed approach to DEXML on a couple of datasets with label features.

Below, we compare our method with DEXML for the LF-AmazonTitles-131K dataset. Please refer to the subsequent answer for the comparison with the LF-WikiSeeAlso-320K dataset.

| LF-AmazonTitles-131K | P@1 | P@3 | P@5 | Memory (GB) | Time (h) |
|---|---|---|---|---|---|
| DEXML | 42.52 | - | 20.64 | 30.2 | 30 |
| Ours (Fan-in=128) | 44.5 | 29.8 | 21.3 | 2.2 | 8 |

It would be good to add a couple more datasets with LF features.

We have added below the comparison on another dataset with label features (LF-WikiSeeAlso-320K). The results demonstrate that performance on par with the state-of-the-art methods DEXML and NGAME can be achieved at a fraction of the GPU memory consumption and training time.

| LF-WikiSeeAlso-320K | P@1 | P@3 | P@5 | Memory (GB) | Time (h) |
|---|---|---|---|---|---|
| DEXML | 46.1 | 29.9 | 22.3 | 56.1 | 23.2 |
| NGAME | 47.7 | 31.6 | 23.6 | 18.6 | 75.4 |
| Ours (Fan-in=128) | 44.2 | 27.9 | 20.1 | 2.31 | 30 |
| Ours (Fan-in=256) | 46.0 | 29.92 | 22.1 | 2.98 | 30 |

Please note that the peak memory and training time comparisons are conducted with the same batch size. For the newly added LF-WikiSeeAlso-320K dataset, we used a batch size of 64. The DEXML model with the all-negatives setting on LF-WikiSeeAlso-320K required more than 160GB of memory (causing OOM on our 4xA100 node). We therefore used the 10-hard-negatives setting described in Table 3 of the DEXML paper.


It would be good to compare to Renee with smaller sized encoders.

Thank you for highlighting this point. We realize that our explanation on line 234 could benefit from additional clarity. Our dense baseline is similar to Renee but with a smaller encoder. The key differences are: (i) We use squared hinge loss instead of BCE, (ii) We employ Adam for the extreme layer, whereas Renee uses SGD. These modifications result in a slight performance improvement over Renee. We included this enhanced dense baseline for comparison in the manuscript. For your reference, we provide the results of Renee with a smaller (base) encoder in the table below.

| Dataset | P@1 | P@3 | P@5 |
|---|---|---|---|
| Wiki-500K | 78.4 | 60.0 | 46.6 |
| Amazon-670K | 49.5 | 44.1 | 39.7 |
| Amazon-3M | 50.0 | 47.4 | 45.3 |

Note that for Wiki-500K, the encoder architecture remains the same as in the Renee paper, with the sequence length reduced to 128 for a fair comparison.
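As an aside on the loss function mentioned above, here is a minimal sketch of a squared hinge loss for multi-label classification (the objective we use in place of BCE); the function name and the mean reduction are illustrative assumptions rather than the exact implementation in the paper.

```python
import torch

def squared_hinge_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Squared hinge loss over all labels (sketch).

    logits:  (batch, num_labels) raw classifier scores
    targets: (batch, num_labels) binary relevance indicators in {0, 1}
    """
    signs = 2.0 * targets - 1.0                         # map {0, 1} -> {-1, +1}
    margins = torch.clamp(1.0 - signs * logits, min=0.0)
    return margins.pow(2).mean()
```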


It would be good to add sensitivity of the model's performance to the new hyperparameters.

Please refer to Table 1 in the Author Rebuttal for the impact of different sparsity levels in conjunction with the use of the auxiliary loss, and to Table 2 for a detailed ablation study of hyperparameters related to the auxiliary loss. The ablation study concerning the intermediate layer size is presented in Appendix D (line 568). Also, please note that while the method introduces hyperparameters such as sparsity levels, it does not require any hyperparameters related to hard-negative mining.


It would be good to comment on the scalability of the proposed approach given that the real-world datasets have much more than 3M labels

To the best of our knowledge, Amazon-3M is the largest dataset that is publicly available; if you know of a dataset we could use, we'd be happy to run these additional experiments.

Comment

I thank the authors for their detailed responses to my questions. The rebuttal has addressed most of my concerns, specifically those related to hyperparameter sensitivity and the comparisons with Renee and DEXML. After carefully reading the responses, I would like to keep my score.

Review
Rating: 7

[Edited: My overall score has increased from 6 to 7 (Accept) after the authors' rebuttal.]

Dynamic Sparse Training (DST) holds the promise of more efficient training and model robustness, but remains largely impractical due to the lack of inherent structure and severe amplification of training steps needed to converge to high quality. The authors apply DST to extreme multi-label classification (XMC) problems and build on past work to avoid losing important gradient signals and achieve a practical speedup. They use a structured fixed fan-in sparsity at the final, extremely large, classifier layer to provide benefit in practice, and an auxiliary classifier to enhance gradient signals to the layer before the semi-structured sparse classifier. This classifier is designed to provide useful signals early in training, but not become so important that it cannot be removed after that. The proposed modifications to DST in XMC are evaluated across a variety of data sets and against a seemingly comprehensive set of baseline techniques, and the results suggest that the method reduces memory consumption compared to most baselines and gives a model with comparable quality to the best-performing baseline. The authors also show that the auxiliary objective not only helps with initial training, but also with high sparsity in the final classifier.

Strengths

Originality There seem to be two novel aspects: applying a past semi-structured fixed fan-in sparsity to the final classifier, and moving the low-rank intermediate layer to become an auxiliary classifier, solving a standard label shortlisting task, that can be removed by gradually decaying its importance over time. The authors also provide new insight into the importance of the commonly-used meta-classifier and demonstrate their approach is effectively robust to label imbalances.

Quality I believe the claims have been supported by experimental results in all cases. The ablation study in Figure 3 makes the benefit of the auxiliary classifier clear. The authors also provided a sweep of rewiring intervals to support their experimental settings, and the baselines seem fair to me; kudos to the authors for thinking of including a "smaller, dense" version of the fixed fan-in sparse classifier with the same number of parameters.

Clarity In general, I found the paper easy to read with a fine layout, and I appreciate the effort the authors spent in explaining the XMC problem for readers, like me, who are not familiar with its unique aspects.

Significance Though I'm not familiar with XMC, if it is as widespread as some other tasks, successfully training a sparse model from scratch will be a huge advancement.

Weaknesses

Originality It really seems like [32] used the same type of semi-structured sparsity on their classifier. What is the difference? Section 2.2 makes it appear as though this were your contribution.

Quality The authors claim, in line 142, that 2:4 structured sparsity results in deteriorated model accuracy, but there was no reference (other than the 2:4 whitepaper showing only high model qualities), and I couldn't find evidence in the submission that 2:4 sparsity behaved poorly for XMC tasks.

Clarity Though the authors provided context for XMC, it still took me a couple reads to really understand the task, but once I did, the difficulty of the problem the authors set out to solve was clear.

I'm confused about the authors' intent to claim novelty for applying semi-structured fixed fan-in sparsity to the final classifier.

Significance This is my first time hearing about the XMC problem, but I've heard of many other tasks and problems that I do not work on directly. This makes me think that perhaps the applicability of the work may be limited, but I'll be happy to be wrong.

The improvement over past work is generally only in memory savings; there are often past techniques that give a higher quality model (though with higher memory requirements). It'd be more compelling if the method matched or exceeded the accuracy of past work with a smaller memory requirement.

It's hard to judge the benefit of one of your primary contributions, the python bindings for CUDA kernels, without seeing them.

Questions

  • What are the concrete novel aspects of your work?
  • Can you give me some idea of the importance of XMC as a task?

Limitations

Yes, the authors have discussed the limitations and potential societal impact of their work.

Author Response

The authors claim, in line 142, that 2:4 structured sparsity results in deteriorated model accuracy, but there was no reference (other than the 2:4 whitepaper showing only high model qualities), and I couldn't find evidence in the submission that 2:4 sparsity behaved poorly for XMC tasks.

We should have been more precise here. The argument should be three-fold:

  • First, the more constraints are put on the sparsity pattern, the less representational capacity a model has; 2:4 sparse models are a tiny subset of all possible sparsity patterns with 50% sparsity.
  • Second, despite having been introduced with Ampere, 2:4 sparsity has so far not found widespread use in model training. Given the difficulty of publishing negative results, it is hard to pinpoint the exact reasons. Note, however, that a recent blog post by the PyTorch team introducing 2:4 sparse training support shows degraded performance of the 2:4-trained model, which they need to fix with a short period of fully dense training at the end.
  • Third, for this paper's goal of memory reduction, the 50% sparsity induced by 2:4 is not sufficient.

Further, 2:4 sparsity has also proven to be challenging to learn in the DST framework. Leading SOTA methods for learning 2:4 sparsity such as SR-STE [1] require dense gradients every optimization step, thereby eliminating any potential memory savings such as those we achieve in this work. Further, [1] also acknowledged that for N:M sparsity, generalization performance increases as the size of M increases. Fixed fan-in sparsity can be viewed as a particular type of N:M sparsity where M is simply the dense fan-in of each neuron.
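To make the fixed fan-in structure concrete, below is a minimal PyTorch sketch of a fixed fan-in classifier layer; the class name, the random initial connectivity, and the gather-based forward pass are illustrative choices on our part (a practical implementation relies on dedicated sparse kernels, and the periodic DST rewiring step is omitted here).

```python
import torch

class FixedFanInLinear(torch.nn.Module):
    """Sketch of a fixed fan-in sparse layer: each output neuron keeps exactly
    `fan_in` incoming weights, stored as an (out_features, fan_in) value matrix
    plus an (out_features, fan_in) index matrix, i.e. memory proportional to
    out_features * fan_in instead of out_features * in_features."""

    def __init__(self, in_features: int, out_features: int, fan_in: int):
        super().__init__()
        self.weight = torch.nn.Parameter(0.01 * torch.randn(out_features, fan_in))
        # Random initial connectivity; a DST method would periodically prune and
        # regrow (rewire) these indices during training.
        indices = torch.stack(
            [torch.randperm(in_features)[:fan_in] for _ in range(out_features)]
        )
        self.register_buffer("indices", indices)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_features). Gather only the connected inputs per neuron.
        # This dense gather materializes a (batch, out_features, fan_in) tensor
        # and is for illustration only; custom kernels avoid this blow-up.
        gathered = x[:, self.indices]              # (batch, out_features, fan_in)
        return (gathered * self.weight).sum(dim=-1)
```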

The improvement over past work is generally only in memory savings; there are often past techniques that give a higher quality model (though with higher memory requirements). It'd be more compelling if the method matched or exceeded the accuracy of past work with a smaller memory requirement.

In XMC setups, there is always a trade-off between memory/compute and accuracy. Previous methods mainly address the memory-consumption problem by projecting to smaller embeddings before the classification layer: e.g., LightXML projects BERT's 768-dimensional embeddings down to 300 dimensions, and Renee, in its experiments with 100M labels, uses only 64 dimensions. In our experiments, we found that DST consistently outperformed bottleneck approaches with matching memory consumption.
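As a rough back-of-the-envelope illustration of this trade-off (the Amazon-670K label count is approximate, and the bottleneck width of 300 and fan-in of 128 are the example values mentioned above and in our tables):

```python
# Rough last-layer weight counts on Amazon-670K (illustrative numbers only).
num_labels = 670_091   # approximate number of labels in Amazon-670K
encoder_dim = 768      # BERT embedding dimension
bottleneck_dim = 300   # LightXML-style projection
fan_in = 128           # fixed fan-in of the sparse classifier

dense_classifier = encoder_dim * num_labels                                   # ~515M weights
bottleneck_path = encoder_dim * bottleneck_dim + bottleneck_dim * num_labels  # ~201M weights
sparse_classifier = fan_in * num_labels                                       # ~86M weights (plus indices)

print(f"dense: {dense_classifier/1e6:.0f}M, bottleneck: {bottleneck_path/1e6:.0f}M, "
      f"fixed fan-in {fan_in}: {sparse_classifier/1e6:.0f}M")
```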

What are the concrete novel aspects of your work?

The reviewer is correct to observe that semi-structured sparsity in the last layer is not a novel contribution, and section 2.2 could be moved to become part of 2.1. The main contributions of our work are (i) identifying the need for an auxiliary objective to stabilize gradient flow with a semi-structured sparse layer, (ii) the key insight on the role of label clusters (random or k-means) beyond their conventional role in XMC as a negative-sampling technique for speeding up training, (iii) armed with this insight, using the meta-branch to achieve gradient stabilization, and (iv) further tuning of the network to achieve training with the largest publicly available dataset on a single commodity GPU with close to state-of-the-art results, which otherwise requires multiple A100s.

Please note that, as indicated in Table 1 of the common response, we initially observed a very low base performance of about 5% (P@1) on the Amazon-670K dataset at 96% sparsity. However, by employing the auxiliary branch, the performance increased significantly to 38.4%.
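As a minimal sketch of how such an auxiliary meta-classifier loss can be combined with the main objective, consider the following; the linear decay schedule, the default cut-off of 15 epochs, and all names are illustrative assumptions, not the exact formulation used in the paper.

```python
def auxiliary_weight(epoch: int, init_weight: float = 1.0, cutoff_epoch: int = 15) -> float:
    """Linearly decay the auxiliary-loss weight, reaching zero at the cut-off epoch."""
    if epoch >= cutoff_epoch:
        return 0.0
    return init_weight * (1.0 - epoch / cutoff_epoch)

def training_loss(extreme_loss, meta_loss, epoch):
    # extreme_loss: loss of the fixed fan-in sparse classifier over all labels.
    # meta_loss: loss of the auxiliary meta-classifier over label clusters,
    #            which supplies stable gradients to the encoder early in training.
    return extreme_loss + auxiliary_weight(epoch) * meta_loss
```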

Can you give me some idea of the importance of XMC as a task?

XMC refers to any (deep) learning setting in which the size of the output layer is extremely large. This could be an instance of text classification [2] or multi-modal learning [3]. Apart from applications in health [4], recent works from Amazon and Microsoft show that other tasks can be reduced to XMC, such as product/item recommendation [5], prediction of related searches [6], and dynamic search advertising [7].

In addition to the current application areas, we think that XMC techniques may become relevant for resource-efficient LLM settings. Recent models have shown an increase in vocabulary size, from 32k in Llama 2 to 128k in Llama 3, and even 256k in Gemma 2. That means that for Gemma 2 2B with an embedding dimension of 2304, about 590M weights, roughly a quarter of the total, are used just for token embeddings.
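For reference, the 590M figure follows directly from the quoted vocabulary size and embedding dimension (figures for Gemma 2 2B as cited above):

```python
vocab_size = 256_000   # Gemma 2 vocabulary size
embed_dim = 2_304      # Gemma 2 2B embedding dimension
print(vocab_size * embed_dim)  # 589_824_000 -> about 590M embedding weights
```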

[1] A. Zhou et al., "Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch," arXiv:2102.04010, 2021. doi: 10.48550/arXiv.2102.04010.

[2] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal and M. Varma. DeepXML: A deep extreme multi-Label learning framework applied to short text documents. In WSDM 2021

[3] A. Mittal, K. Dahiya, S. Malani, J. Ramaswamy, S. Kuruvilla, J. Ajmera, K. Chang, S. Agrawal, P. Kar and M. Varma. Multimodal extreme classification. CVPR 2022.

[4] Mario Almagro, Raquel Martínez Unanue, Víctor Fresno, and Soto Montalvo. ICD-10 coding of Spanish electronic discharge summaries: An extreme classification problem. IEEE Access 8 (2020), 100073–100083.

[5] H. Jain, V. Balasubramanian, B. Chunduri and M. Varma, Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches, in WSDM 2019.

[6] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal and M. Varma, Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising in WWW, 2018.

[7] Chang, W. C., Jiang, D., Yu, H. F., Teo, C. H., Zhang, J., Zhong, K., ... & Dhillon, I. S. (2021, August). Extreme multi-label learning for semantic matching in product search. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining (pp. 2643-2651).

Comment

Thank you, authors, for your thoughtful and clear responses to my concerns. Assuming you'll add the clarifications provided here, and the results provided to other reviewers, into a revised version, I've increased my rating to Accept.

Review
Rating: 7

This paper investigates the application of Dynamic Sparse Training (DST) methods to the domain of extreme multi-label classification (XMC), where the label space can be very large, on the order of millions. The authors propose several enhancements to standard DST approaches to address the challenges posed by the highly skewed label distributions and scarcity of training data typical in XMC datasets. These include using semi-structured sparsity with fixed fan-in connections, adding an intermediate layer between the encoder and sparse classifier, and incorporating an auxiliary training objective to improve gradient flow, especially in the early phases of training. Empirical results on various large-scale datasets demonstrate that the proposed approach can significantly reduce GPU memory requirements while maintaining competitive performance compared to dense models and specialized XMC methods.

Strengths

  • The paper is well-organized and clearly written, providing a comprehensive overview of the challenges in XMC and the proposed modifications to DST to address them. The technical contributions are well-motivated and supported by empirical results.
  • The paper tackles an important problem of scaling DST to the domain of XMC, characterized by enormous label spaces, label imbalance, and label noise. It demonstrates DST's applicability beyond the typical benchmark datasets used in sparsity research.
  • The authors present a well-motivated set of modifications to standard DST algorithms to handle XMC-specific issues. The semi-structured sparsity, intermediate layer, and auxiliary loss are supported by empirical analysis showing their impact on performance.
  • The results show substantial memory savings on large-scale datasets with minimal loss in predictive performance. This enables end-to-end training of XMC models on commodity hardware.

Weaknesses

  • Limited novelty in core techniques: The main components (DST, semi-structured sparsity, auxiliary objectives) are existing methods, though their combination and application to XMC is novel. More discussion and intuition on how these components interact and complement each other in an overview would be beneficial.
  • The paper is primarily empirical and lacks rigorous theoretical justification for why the proposed modifications work. Some insights into the underlying mechanisms that make the proposed approach effective would be valuable.

Questions

  • How sensitive are the results to the various architectural hyperparameters like fan-in degree, decay schedule for auxiliary loss? Some ablation experiments exploring these choices besides the intermediate layer size would be informative.

Limitations

Yes.

Author Response

Limited novelty in core techniques: The main components (DST, semi-structured sparsity, auxiliary objectives) are existing methods, though their combination and application to XMC is novel. More discussion and intuition on how these components interact and complement each other in an overview would be beneficial.

We agree that each of the above-noted components is not novel in its own right; however, integrating these components is a non-trivial contribution. The application of DST to the XMC task is of significant interest since the memory overhead of XMC tasks remains a challenging constraint. The application of DST methods to this setting showcases that the theoretical justifications used to motivate much of the DST literature do in fact result in real-world improvements when memory overhead is a concern. Further, the effect of the auxiliary objective on the convergence of higher-sparsity DST experiments may, in turn, spur the broader DST community to incorporate similar mechanisms to improve the generalization performance of DST in general. In our opinion, the integration of these various components and our detailed empirical results represent a significant contribution to both the XMC and DST literature.

The paper is primarily empirical and lacks rigorous theoretical justification for why the proposed modifications work. Some insights into the underlying mechanisms that make the proposed approach effective would be valuable.

While our results are primarily derived from our empirical observations, we believe their publication will lead to further interest in the application of DST methods to the XMC setting. In future work, we hope to further examine our findings and develop a theoretical framework to help describe why this approach appears to be so effective.

How sensitive are the results to the various architectural hyperparameters like fan-in degree, decay schedule for auxiliary loss? Some ablation experiments exploring these choices besides the intermediate layer size would be informative.

We have provided additional hyperparameter ablations in the common response of the rebuttal.

Comment

Thank you for the response. I agree that the integration and application of these techniques to XMC is a non-trivial contribution. The hyperparameter ablations address my concern about the robustness of the approach. On the lack of theoretical justification, empirical results do often precede theoretical frameworks and these empirical findings you have presented could spur more theoretical research in this direction, so I will keep the current score.

Author Response

We express our gratitude to all reviewers for their insights and will endeavor to address each comment separately.

We performed rigorous ablation studies on key hyperparameters: fan-in (sparsity level) and auxiliary loss. The results for the Amazon-670K dataset are presented in the tables below.

Table 1: Impact of varying sparsity levels: The table below illustrates the impact of varying sparsity levels (ranging from 50% to 96%) in conjunction with the use of auxiliary loss. As sparsity levels increase, there are benefits in memory usage, training time, and inference time; however, performance metrics simultaneously decline. Additionally, the importance of auxiliary loss becomes particularly significant at higher sparsity levels.

| Fan-in (sparsity) | Aux | P@1 | P@3 | P@5 | Peak Mem. (GiB) | Epoch Time (mm:ss) | Inf. Time (ms) |
|---|---|---|---|---|---|---|---|
| 384 (50%) | No | 49.0 | 43.5 | 39.5 | 7.03 | 18:13 | 10.2 |
| 384 (50%) | Yes | 49.2 | 43.7 | 39.6 | 7.13 | 18:19 | 10.2 |
| 256 (67%) | No | 47.0 | 41.6 | 37.7 | 5.27 | 15:45 | 9.18 |
| 256 (67%) | Yes | 47.6 | 42.3 | 38.4 | 5.36 | 15:50 | 9.18 |
| 128 (83%) | No | 45.6 | 40.2 | 36.3 | 3.60 | 13:20 | 8.54 |
| 128 (83%) | Yes | 47.1 | 41.8 | 38.0 | 3.70 | 13:23 | 8.54 |
| 64 (92%) | No | 30.7 | 26.5 | 23.6 | 2.88 | 12:12 | 8.14 |
| 64 (92%) | Yes | 42.3 | 37.2 | 33.3 | 2.97 | 12:13 | 8.14 |
| 32 (96%) | No | 5.5 | 5.0 | 4.6 | 2.52 | 11:36 | 7.94 |
| 32 (96%) | Yes | 38.4 | 33.8 | 30.4 | 2.61 | 11:38 | 7.94 |

Table 2: Sensitivity to Auxiliary Loss cut-off epochs: We employ the auxiliary loss with an initial scalar weight that decays until a specified cut-off epoch. The table below illustrates the model's final performance at various cut-off epochs for two sparsity levels. A value of 0 (No aux) indicates the absence of the auxiliary loss, while 'No cut-off' signifies its application throughout training.

| Fan-in (Sparsity) | Auxiliary cut-off epoch | P@1 | P@3 | P@5 |
|---|---|---|---|---|
| 128 (83%) | 0 (No aux) | 45.6 | 40.2 | 36.3 |
| 128 (83%) | 15 | 47.1 | 41.8 | 38.0 |
| 128 (83%) | 30 | 46.6 | 41.1 | 37.1 |
| 128 (83%) | 60 | 45.9 | 40.6 | 36.6 |
| 128 (83%) | 90 | 44.6 | 39.7 | 35.9 |
| 128 (83%) | No cut-off (full training) | 42.1 | 37.4 | 33.7 |
| 64 (92%) | 0 (No aux) | 30.7 | 26.5 | 23.6 |
| 64 (92%) | 15 | 42.3 | 37.2 | 33.3 |
| 64 (92%) | 30 | 32.8 | 28.3 | 25.2 |
| 64 (92%) | 60 | 32.5 | 28.0 | 24.9 |
| 64 (92%) | 90 | 31.3 | 27.0 | 23.9 |
| 64 (92%) | No cut-off (full training) | 22.7 | 17.7 | 14.4 |

Our analysis reveals that prolonging the auxiliary loss beyond optimal cut-off epochs adversely affects performance for both sparsity levels, as discussed in section 2.3 (lines 196-208) of the paper. Notably, maintaining the auxiliary loss throughout training leads to performance deterioration, resulting in scores lower than those achieved without its use.

Through our detailed ablation studies and tuning, we identified improved hyperparameter settings for the Amazon-3M dataset, resulting in enhanced performance (refer to Table 2 in the main paper). The updated results are shown below in comparison with the previous ones.

Table 3: Improved Results for the Amazon-3M Dataset:

| | Sparsity (%) | P@1 | P@3 | P@5 | Mtr (GiB) |
|---|---|---|---|---|---|
| Previous | 67 | 49.4 | 46.7 | 44.7 | 19.7 |
| Latest | 83 | 50.2 | 47.1 | 44.8 | 13.5 |

This improvement is attributed to the optimal hyperparameter settings related to the learning rate of the encoder, final layer, and primarily the auxiliary cut-off epoch.

Final Decision

The paper addresses extreme multi-label classification (XMC) problems by applying semi-structured sparsity in the final layer. The authors introduce several modifications to enhance Dynamic Sparse Training, including an intermediate layer and an auxiliary objective to stabilize gradient flow. Furthermore, they optimize hardware utilization, enabling training on the largest publicly available dataset using a single commodity GPU. All reviewers found that the paper makes significant contributions and agreed that it is above the acceptance threshold.