Contrastive Learning with Simplicial Convolutional Networks for Short-Text Classification
Abstract
Reviews and Discussion
This work proposes C-SCN, which combines contrastive learning with simplicial complexes in convolutional networks to capture higher-order interactions and improve short-text classification performance. Experimental results demonstrate its superiority over existing methods.
Questions for Authors
Can simplicial complexes be used in Transformers?
In the era of LLMs, models like ChatGPT, DeepSeek, and LLaMA have already achieved strong performance in short-text classification. For closed-source LLMs, we can use APIs and Apps, while for open-source LLMs, we can fine-tune them on local machines. Why not use these LLMs for short-text classification?
In the loss function in Eq. 12, both the classification loss and contrastive loss use features generated from the BERT model. It is difficult to justify whether the proposed SCN encoder contributes, and to what degree, to the performance.
Claims and Evidence
The work identifies three issues with current methods:
- The augmentation step may fail to generate positive and negative samples that are semantically similar and dissimilar to the anchor, respectively.
- External auxiliary information may introduce noise to the sparse text data.
- Limited ability to capture higher-order information, such as group-wise interactions.
However, it remains unclear how the proposed methods specifically address these issues.
SCN is used to capture higher-order information, but there are many alternatives, such as self-attention and jump self-attention. What are the reasons for adopting simplicial complexes over these other options?
The paper puts effort into introducing the message-passing mechanism. How is it integrated into convolutional networks, and what is the structure of the SCN? A figure may be a helpful illustration.
Methods and Evaluation Criteria
The proposed method is not well justified for the problem. The chosen benchmark datasets are appropriate for the task.
Theoretical Claims
Theoretical claims seem fine.
Experimental Design and Analysis
The experiment designs are good.
Supplementary Material
I reviewed the supplements.
Relation to Existing Literature
The idea of incorporating higher-order information has been highlighted in existing works.
Missing Important References
What is the TDL mentioned in the introduction section? Please give a reference.
Other Strengths and Weaknesses
The topic is important in the field, and the idea of incorporating higher-order information into the model design is inspiring.
Other Comments or Suggestions
No other comments.
(W1) The work identifies three issues with current methods. However, it remains unclear how the proposed methods specifically address these issues.
Due to the word limit constraints, we refer to the response to Reviewer jGxX (W3) for similar concerns.
(W2) SCN is used to capture higher-order information, but there are many alternatives, such as self-attention and jump self-attention. What are the reasons for adopting simplicial complexes over these other options?
As mentioned in Section 3, our model adopts a self-attention mechanism as the READOUT function. It summarises the words (0-simplexes, nodes) and the connections between words (1-simplexes, edges).
Self-attention (Vaswani et al., 2017) and jump self-attention (Zhou et al., 2022) process an input sequence by allowing every element to attend to every other element, regardless of their distance in the sequence, so long-range information is captured. However, the attention scores calculated between two words are still pairwise interactions, and higher-order information is only accounted for when self-attention layers are stacked. Although the jump self-attention algorithm introduces a mix-hop mechanism, the number of hops participating in this mechanism requires additional hyperparameter tuning. In contrast, C-SCN involves a single layer of convolutional networks and incorporates multiple higher-order objects, including 0-simplexes, 1-simplexes, and 2-simplexes, without requiring additional hyperparameters.
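To make the single-layer design concrete, below is a minimal, hedged sketch of one simplicial convolution layer operating on 0-, 1-, and 2-simplex features. The class name `SimplicialConvLayer`, the signed boundary matrices `b1` (nodes × edges) and `b2` (edges × triangles), and the simple sum aggregation are illustrative assumptions, not the exact layer used in C-SCN.

```python
# Hedged sketch of a single simplicial convolution layer (not the authors' exact layer).
# Features live on 0-simplexes (words), 1-simplexes (edges), and 2-simplexes (triangles);
# b1 and b2 are boundary matrices relating adjacent simplex orders.
import torch
import torch.nn as nn


class SimplicialConvLayer(nn.Module):
    def __init__(self, dim_in: int, dim_out: int):
        super().__init__()
        self.w_node = nn.Linear(dim_in, dim_out)
        self.w_edge = nn.Linear(dim_in, dim_out)
        self.w_tri = nn.Linear(dim_in, dim_out)

    def forward(self, x0, x1, x2, b1, b2):
        # x0: [n_nodes, d], x1: [n_edges, d], x2: [n_triangles, d]
        # b1: [n_nodes, n_edges], b2: [n_edges, n_triangles]
        h0 = self.w_node(x0 + b1 @ x1)                # nodes receive messages from incident edges
        h1 = self.w_edge(x1 + b1.t() @ x0 + b2 @ x2)  # edges receive from boundary nodes and co-boundary triangles
        h2 = self.w_tri(x2 + b2.t() @ x1)             # triangles receive from their boundary edges
        return torch.relu(h0), torch.relu(h1), torch.relu(h2)
```

With sparse boundary matrices, a single such layer already mixes information across simplex orders, which is why no extra hop hyperparameter is needed.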
Vaswani et al. (2017). Attention is all you need. In NIPS.
Zhou et al. (2022). Jump Self-Attention: Capturing High-Order Statistics in Transformers. In NIPS.
(W3) The paper puts effort into introducing the message-passing mechanism. How is it integrated into convolutional networks, and what is the structure of the SCN? A figure may be a helpful illustration.
We have included the detailed message-passing mechanism in Section 3, and Figure 3 demonstrates the model architecture.
(W4) What is the TDL mentioned in the introduction section? Please give a reference.
Topological Deep Learning (TDL) bridges the gap between topology and machine learning, offering powerful tools to model complex data structures and relationships. By incorporating topological insights, it enhances the interpretability, robustness, and performance of deep learning models across a wide range of applications. Clough et al. (2022) adopted TDL to improve image classification, object detection, and segmentation by capturing the geometric and topological structure of images and utilising persistent homology to extract shape-based features that complement traditional pixel-based methods. Shen et al. (2024) incorporated both high-order interactions and multiscale properties into a TDL architecture, specifically simplicial neural networks, where polymer molecules are represented as a series of simplicial complexes at different scales to enhance the accuracy of polymer property prediction.
Clough et al. (2022). A Topological Loss Function for Deep-Learning Based Image Segmentation Using Persistent Homology. In TPAMI.
Shen et al. (2024). Molecular Topological Deep Learning for Polymer Property Prediction. In arXiv.
(W5) Can simplicial complexes be used in Transformers?
Yes, simplicial complexes can be used in Transformers to incorporate higher-order relationships and geometric structures into the model. While traditional Transformers operate on sequences or graphs, simplicial complexes extend these structures by modelling higher-order interactions (e.g., triangles, tetrahedra) rather than just pairwise relationships, which can enhance the model's ability to capture complex dependencies in data. For example, the Cellular Transformer (CT) (Ballester et al., 2024) leverages algebraic topology to form cell complexes from graph data and feeds them to a transformer model, improving graph classification accuracy.
Ballester et al. (2024). Attending to Topological Spaces: The Cellular Transformer. In arXiv.
(W6) In the era of LLMs, models like ChatGPT, DeepSeek, and LLaMA have already achieved strong performance in short-text classification. For closed-source LLMs, we can use APIs and Apps, while for open-source LLMs, we can fine-tune them on local machines. Why not use these LLMs for short-text classification?
Due to the word limit constraints, we refer to the response to Reviewer Ntsf (W1) for similar concerns.
(W7) In the loss function in Eq. 12, both the classification loss and contrastive loss use features generated from the BERT model. It is difficult to justify whether the proposed SCN encoder contributes, and to what degree, to the performance.
We have included a comparison with and without the contrastive loss in the Appendix, where the loss functions are applied to SCN and BERT separately and the contribution of SCN is highlighted.
The paper proposes Contrastive Learning with Simplicial Convolutional Networks (C-SCN) for short-text classification. The method constructs document simplicial complexes to capture higher-order interactions beyond simple pairwise relationships and integrates a contrastive learning framework that leverages both structural representations from the simplicial convolutional network and sequential representations from transformer models. The authors demonstrate improvements on several benchmark datasets in few-shot learning settings.
Questions for Authors
NA
Claims and Evidence
The authors claim that their approach can better capture long-range and higher-order interactions in short texts compared to traditional graph-based models, leading to enhanced classification performance. However, the evidence provided is not fully convincing. In particular, the paper does not adequately justify the necessity of a complex graph-based model when simple large language models (LLMs) could potentially address these issues. Additionally, the baselines used are outdated, and the absence of comparisons with current LLM-based methods weakens the evidence supporting the proposed claims.
Methods and Evaluation Criteria
While the construction of document simplicial complexes and the integration of contrastive learning are interesting, the methods lack a compelling motivation. The paper does not sufficiently explain why a graph-based model is preferred over simpler LLMs (e.g. llama 3), especially when techniques like skip connections or graph rewiring can mitigate issues related to high-order dependencies in GNNs. Moreover, the evaluation criteria are standard for short-text classification, but the outdated baseline comparisons reduce the impact of the reported improvements.
Theoretical Claims
NA.
Experimental Design and Analysis
As discussed in the previous section, LLM baselines are needed.
Supplementary Material
NA
Relation to Existing Literature
NA
Missing Important References
LLM-related papers.
Other Strengths and Weaknesses
The presentation quality can be improved.
Other Comments or Suggestions
NA
(W1) The paper does not adequately justify the necessity of a complex graph-based model when simple large language models (LLMs) could potentially address these issues.
We would like to highlight the novelty of our work in pioneering the use of higher-order simplicial complexes for short-text classification, offering a new geometric perspective on text representation and classification and advancing the theoretical understanding of higher-order simplicial complexes in text classification. This lays the groundwork for future research in geometric machine learning. Unlike black-box LLMs, our approach leverages higher-order simplicial complexes to pave the way towards interpretable insights into the relationships between short texts through the higher-order simplexes. In addition, with a significantly smaller set of trainable parameters, our method is particularly effective in resource-constrained environments, providing a lightweight alternative to computationally intensive large language models (LLMs).
Due to time constraints, we obtained results from Llama3.1-8B on two datasets (Twitter and MR), as shown in the table below.
| Model | Twitter (F1) | Twitter (Acc) | MR (F1) | MR (Acc) |
|---|---|---|---|---|
| BERT | 54.92 | 51.16 | 51.69 | 50.65 |
| GPT2 | 67.41 | 67.76 | 50.77 | 53.18 |
| RoBERTa | 56.02 | 52.29 | 52.55 | 51.30 |
| Llama3.1-8B | 34.06 | 50.30 | 70.13 | 71.67 |
| C-SCN | 75.61 | 76.09 | 69.46 | 69.87 |
We can see that Llama3, with extensive pre-training and a large parameter count of 8B, handles longer texts with proper English in movie reviews more effectively, as shown by its stronger performance on the MR dataset. On the other hand, on the Twitter dataset, which features shorter texts and non-standard English words, Llama3 does not perform well in the few-shot setting. For example, we found that Llama3 treats both ":(" and ":)" as negative sentiments and classifies non-standard English words, such as "thankyou" and "followback", into the negative category, resulting in worse performance. In contrast, C-SCN leverages both structural and contextual information to prevent the pre-trained embeddings from dominating the sentence representations, and it handles these cases better with the same number of examples and far fewer parameters. This demonstrates particular strength in domains where short-text data exhibits complex relational structures, which are naturally modelled by higher-order simplicial complexes. While LLMs achieve higher overall performance on one dataset, our approach introduces a novel framework that opens new avenues for exploring geometric representations in text classification.
We will consider the impact of text lengths and grammatical correctness in future work, as short texts from online resources, including tweets and reviews, may pose challenges to short text classification tasks.
(W2) The baselines used are outdated, and the absence of comparisons with current LLM-based methods weakens the evidence supporting the proposed claims.
The baseline models we selected also include those published after 2020, namely DADGNN (2021), SHINE (2021), and NC-HGAT (2022), besides GIFT (2024).
(W3) The paper does not sufficiently explain why a graph-based model is preferred over simpler LLMs (e.g., Llama 3), especially when techniques like skip connections or graph rewiring can mitigate issues related to high-order dependencies in GNNs.
In addition to the comparison with LLMs in the previous response, skip connections (Xu et al., 2021) and graph rewiring (Topping et al., 2022) were introduced to mitigate the over-squashing problem of GNNs and to incorporate higher-order dependencies. However, skip connections combine graph features at different scales and introduce complex interactions among layers, which may diminish the effectiveness of shallow network structures and add architectural complexity. Furthermore, graph rewiring may compromise the sparsity of graphs (Barbero et al., 2024). In contrast, C-SCN maintains the original graph structure with a shallow layer and incorporates higher-order dependencies with contrastive learning in a single training pipeline, demonstrating its effectiveness and efficiency in the few-shot learning setting.
Xu, K., Zhang, M., Jegelka, S., & Kawaguchi, K. (2021). Optimization of Graph Neural Networks: Implicit Acceleration via Skip Connections and Increased Depth. CoRR, abs/2105.04550.
Topping, J., Di Giovanni, F., Chamberlain, B. P., Dong, X., & Bronstein, M. M. (2022). Understanding Oversquashing and Bottlenecks on Graphs via Curvature. In International Conference on Learning Representations.
Barbero, F., Velingker, A., Saberi, A., Bronstein, M. M., & Di Giovanni, F. (2024). Locality-aware graph rewiring in GNNs. In The Twelfth International Conference on Learning Representations.
Due to the word limit, we would like to clarify more during the comment period.
Thanks for the response and for providing the additional results.
However, my concerns are not fully addressed:
- My major concern still holds. I still believe that the proposed problem, "leveraging higher-order structure", which explores the interactions across distant words, is already an explored problem in the NLP community. Both BERT-style and GPT-style language models consider this kind of interaction.
- Skip connections and rewiring were not proposed only to solve the over-squashing problem. In fact, skip connections were also introduced to handle high-order information before over-squashing was investigated; please check JKNet [Xu et al., 2018] for details.
- Although I appreciate the effort of adding LLM results, the LLM compared is not large enough. The results are not convincing when GPT2 is better than the LLM on Twitter.
[Xu et al., 2018] Representation Learning on Graphs with Jumping Knowledge Networks. In ICML 2018.
Since my major concerns still hold, I would like to keep my score.
Due to limited labels and the sparsity of words and semantics, short-text classification has attracted much attention. Most current models adopt self-supervised contrastive learning across different representations, but generated samples and external auxiliary information cannot guarantee effectiveness, and these models also cannot extract higher-order information. The authors propose a novel document simplicial complex construction for a higher-order message-passing mechanism. By contrasting the structural representation with the sequential representation generated by the transformer mechanism for improved outcomes and mitigated issues, the C-SCN model outperforms existing models on four benchmark datasets.
Questions for Authors
It is an interesting topic. However, please clarify whether the proposed model has completely solved the challenges discussed at the beginning.
Claims and Evidence
It sounds convincing; however, the authors do not provide solid proof.
Methods and Evaluation Criteria
From the results on the four benchmark datasets, it looks convincing. However, more theoretical proof is needed.
Theoretical Claims
This paper only provides definitions and methodology; there is no theoretical proof.
Experimental Design and Analysis
The four selected benchmark datasets are commonly used, and the results show the model's effectiveness. However, the authors did not provide the source code.
Supplementary Material
I have not found the supplementary material.
Relation to Existing Literature
This paper has compared the proposed model with several baselines.
Missing Important References
Boosting Short Text Classification with Multi-Source Information Exploration and Dual-Level Contrastive Learning. arXiv preprint arXiv:2501.09214, 2025.
Other Strengths and Weaknesses
Strengths
- This work is well written and easy to follow.
- The authors propose a novel document simplicial complex construction for a higher-order message-passing mechanism. The C-SCN model outperforms existing models on four benchmark datasets.
- The results on the evaluation datasets demonstrate a remarkable performance improvement achieved by the proposed method.
Weaknesses
- This paper has compared the proposed model with several baselines and provides definitions and methodology. However, more theoretical proof is needed.
- The authors did not provide the source code for the model.
- We would like the authors to clarify whether the proposed model has solved the challenges discussed at the beginning.
Other Comments or Suggestions
Please see the weaknesses.
(W1) This paper has compared the proposed model with several baselines and provides definitions and methodology. However, more theoretical proof is needed.
We would like to refer to the following sources for theoretical support. The Weisfeiler-Lehman graph isomorphism test (WL test) (Weisfeiler & Lehman, 1968) is commonly used to compare the expressiveness of graph neural networks with that of architectures involving higher-order objects, such as simplicial complexes.
Bodnar et al. (2021) extended the WL test to simplicial complexes, known as Simplicial WL (SWL), by involving boundary and co-boundary relations. The following theorem supports the algorithm's expressiveness.
Theorem 1. SWL is strictly more powerful than WL at distinguishing non-isomorphic graphs.
Due to the word limit, we would like to include detailed proof during the discussion phase if needed. The following theorem explains our motivation to adopt the framework in our model architecture.
Theorem 2. With sufficient layers and injective aggregators, Message Passing Simplicial Networks (MPSN) is as powerful as SWL.
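For completeness, the generic MPSN update from Bodnar et al. (2021) can be written schematically as below; the exact message and update functions are design choices, and this notation is a paraphrase rather than the paper's verbatim equations.

```latex
% Schematic MPSN update for a simplex \sigma (paraphrasing Bodnar et al., 2021).
% \mathcal{B}, \mathcal{C}, \mathcal{N}_\downarrow, \mathcal{N}_\uparrow denote boundary,
% co-boundary, lower-adjacent, and upper-adjacent simplexes of \sigma.
h_{\sigma}^{(t+1)} = U\Big(
    h_{\sigma}^{(t)},\,
    \mathrm{AGG}_{\tau \in \mathcal{B}(\sigma)}            M_{\mathcal{B}}\big(h_{\sigma}^{(t)}, h_{\tau}^{(t)}\big),\,
    \mathrm{AGG}_{\tau \in \mathcal{C}(\sigma)}            M_{\mathcal{C}}\big(h_{\sigma}^{(t)}, h_{\tau}^{(t)}\big),\,
    \mathrm{AGG}_{\tau \in \mathcal{N}_\downarrow(\sigma)} M_{\downarrow}\big(h_{\sigma}^{(t)}, h_{\tau}^{(t)}\big),\,
    \mathrm{AGG}_{\tau \in \mathcal{N}_\uparrow(\sigma)}   M_{\uparrow}\big(h_{\sigma}^{(t)}, h_{\tau}^{(t)}\big)
\Big)
```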
Boris Weisfeiler and Andrei Leman. The reduction of a graph to canonical form and the algebra which appears therein. Nauchno-Technicheskaya Informatsia, 2(9):12–16, 1968.
Cristian Bodnar, Fabrizio Frasca, Yuguang Wang, Nina Otter, Guido F. Montúfar, Pietro Liò, and Michael Bronstein. Weisfeiler and Lehman go topological: Message passing simplicial networks. In Proceedings of the 38th International Conference on Machine Learning.
(W2) The authors did not provide the source code for the model.
To enhance the reproducibility of our work, we have included the pseudo-code in the Appendix section. We would like to include the code as supplementary material when the edit function is enabled.
(W3) We would like the authors to clarify whether the proposed model has solved the challenges discussed at the beginning.
Our work identifies three key challenges at the beginning, and the proposed model architecture addresses each of them as follows.
Challenge 1: The data augmentation step and the negative sampling step of contrastive learning may distort the semantic meaning and introduce unnecessary noise.
Removing graph components is a common data augmentation strategy; however, it may disrupt the original meaning of the text. For instance, for the Movie Review (MR) sentence "There's not enough to sustain the comedy", removing the word "not" reverses the meaning of this short sentence.
In C-SCN, we do not create the positive and negative labels used in pre-training for the text data. Instead, we provide augmented views of each text by applying the structural SCN and the sequential language model. This circumvents the risk of distorting semantic meaning when generating positive and negative samples; a rough sketch of the resulting cross-view contrastive objective is given below.
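As an illustration of how the two views can be contrasted (not necessarily identical to Eq. 12 in the paper), an InfoNCE-style objective treats the SCN representation $z^{s}_i$ and the BERT representation $z^{b}_i$ of the same document $i$ as a positive pair, with the other documents in the batch acting as negatives:

```latex
% Hedged sketch of a cross-view contrastive term; sim is a similarity function
% (e.g., cosine), \tau a temperature, N the batch size, and \lambda a weighting
% hyperparameter. The exact form of Eq. 12 may differ.
\mathcal{L}_{\mathrm{con}} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp\!\big(\mathrm{sim}(z^{s}_i, z^{b}_i)/\tau\big)}
            {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(z^{s}_i, z^{b}_j)/\tau\big)},
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda \, \mathcal{L}_{\mathrm{con}}
```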
Challenge 2: Auxiliary information, such as entities, latent topics, and part-of-speech (POS) tags (e.g., nouns and verbs), may be added to graph models for language understanding and enriching the limited available local context. However, this step might introduce misinformation, such as pulling documents that express opposite semantics but share similar topics.
Assigning entities, such as a film name, might pull texts that praise the film and texts that criticise it into a closer neighbourhood. Under the homophily assumption, which states that nodes with similar characteristics or labels tend to be connected, the model then learns that these two texts are more similar to each other than to texts that do not mention the entity name but share similar semantics.
In C-SCN, we do not include any auxiliary information that might mislead the model into learning neighbourhood information tied to entities, latent topics, or POS tags. Instead, we leverage the pre-trained information from the sequential language model BERT in the augmentation step to enhance contextual understanding when contrasting with SCN.
Challenge 3: Graph models are mathematically limited in modelling higher-order features, such as group-wise interactions among a few nodes and edges expressed in terms of phrases.
The short sentence "It is what it is" uses repetition to emphasise acceptance of the status quo. Graph models with only nodes and edges learn pairwise interactions and need more layers for words to incorporate the meaning of words further apart. The group-wise phrase "it is" needs to be linked with "what" to model such repetition.
In C-SCN, we adopt the simplicial complex to incorporate higher-order relations, which are propagated through the message-passing mechanism. This enables the grouped phrase "it is", represented as an edge, to be connected to the third node "what" through the 2-simplex (filled triangle) that is formed; a hedged construction sketch is given below.
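The sketch below illustrates one way such a document simplicial complex can be constructed: words are 0-simplexes, co-occurring word pairs within a sliding window are 1-simplexes, and closed word triangles are promoted to 2-simplexes. The window size and the triangle rule are illustrative assumptions rather than the paper's exact construction.

```python
# Hedged sketch: constructing a document simplicial complex from word co-occurrence.
# The sliding-window co-occurrence rule and triangle promotion are assumptions for
# illustration, not necessarily the construction used in C-SCN.
from itertools import combinations


def build_simplicial_complex(tokens, window=3):
    nodes = sorted(set(tokens))
    edges = set()
    # 1-simplexes: distinct word pairs co-occurring within the sliding window.
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                edges.add(tuple(sorted((tokens[i], tokens[j]))))
    # 2-simplexes: word triples whose three pairwise edges all exist (closed triangles).
    triangles = {
        tri for tri in combinations(nodes, 3)
        if all(tuple(sorted(pair)) in edges for pair in combinations(tri, 2))
    }
    return nodes, edges, triangles


# Example: "it is what it is" yields the 2-simplex ("is", "it", "what"),
# linking the phrase "it is" (an edge) with the third word "what".
print(build_simplicial_complex("it is what it is".split()))
```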
Due to the word limit, we would like to clarify more during the comment period.
This paper introduces C-SCN, a contrastive learning framework using simplicial complexes to capture higher-order structural information in short text classification. While the novelty is somewhat limited compared to recent advances in LLMs, the idea of leveraging topological structures for interpretability and lightweight modeling is interesting and experimentally validated. I recommend a weak accept given the unique perspective and solid empirical results, though comparisons to stronger LLM baselines would strengthen the work.