PaperHub
6.1 / 10
ICML 2025 · Poster · 4 reviewers
Ratings: 2, 3, 4, 4 (min 2, max 4, std 0.8)

SNS-Bench: Defining, Building, and Assessing Capabilities of Large Language Models in Social Networking Services

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24

Keywords: Benchmark, Social Network Services, Large Language Models

Reviews and Discussion

Review (Rating: 2)

This paper introduces SNS-Bench, a benchmark dataset for evaluating LLMs' capabilities in social networking services. The dataset consists of 8 NLP tasks centering on user postings compiled from the REDnote social networking platform. The authors evaluate 25+ closed- and open-source LLMs on SNS-Bench and show that their performance generally adheres to the scaling law.

Update after rebuttal

The authors addressed my concerns regarding using OCR to convert post images to texts and the performance without translating Chinese texts into English. However, my main concern about the work remains: the proposed datasets and tasks lack the sense of "personalized recommendation" and "social networking". While the authors did try to demonstrate the "personalization" concept in the Note-Hashtag task, the provided example only shows how they select post tags to "characterize" the post author based on that single post, without utilizing any personal attributes or behavioral histories. Furthermore, the "social networks" of the users (i.e., the connections and interactions of the user in question) are ignored in these datasets and tasks, making the claim of benchmarking for "Social Networking Services" questionable. As such, I am maintaining my overall rating.

My apologies that I replied to the authors' rebuttal as an official comment, which is not visible to the authors. My reply is pasted below: """ Thank the authors for demonstrating the "personalization" concept in the Note-Hashtag task. The provided example, however, appears to select post tags to "characterize" the post author based on that single post. I think we need a clear definition of "personalized recommendation" as the ground for this rebuttal. To my knowledge, the core of "personalized recommendation" is to tailor recommendations for each individual based on their personal attributes and behavioral histories, and I do not see such characteristics in the proposed tasks; neither do I see how "social networks" (i.e., the connections and interactions of the user in question) play a part in these tasks.

Admittedly, LLMs have deeply reshaped the landscape of many research fields, and my opinion about "personalized recommendation" and "social networks" tasks may be too shallow and conservative. I would be glad to learn how the authors perceive these concepts with the application of LLMs. """

Questions for Authors

  • As the REDnote platform is mostly a Chinese community, did you translate the notes into English, or did you only select notes in English? If you performed translation, is there any performance difference before and after the translation?
  • Could you explain where the concepts of "social networking" and "personalization" may apply in the benchmark?

Claims and Evidence

  • The paper claims that SNS-Bench is designed to assess LLMs' capabilities in social networking services. However, the eight benchmark tasks primarily align with conventional NLP tasks centered around user notes (postings), without explicitly modeling the broader social networking context, such as user interactions, network structures, or temporal dynamics.
  • Section 3.1 defines Personalized Recommendation as a task category, stating that models should deliver tailored content based on user interests, behavior, and past interactions. However, the proposed tasks, Note-Hashtag and Note-QueryGen, do not incorporate any personalization mechanism. They focus on general content tagging and query generation rather than adapting recommendations based on user-specific preferences or engagement history.

Methods and Evaluation Criteria

No, the proposed benchmark does not consider how a user's social network may affect their engagement with the media generated by their social connections.

Theoretical Claims

N.A., this is a benchmark paper with no theoretical claims.

Experimental Design and Analysis

Other than the missing social networking context for users, the experimental design and analysis are solid and comprehensive.

Supplementary Material

Reviewed A.2 (Prompt Templates for Instructions); looks good to me.

Relation to Prior Literature

The proposed benchmark dataset relates to prior work on applying LLMs to social networking tasks.

Missing Important References

None that I am aware of.

Other Strengths and Weaknesses

Strengths:

  • The paper evaluates a diverse set of both open-source and closed-source LLMs on the proposed benchmark, providing a broad comparative analysis.
  • SNS-Bench provides a framework to evaluate LLMs' capabilities in social-media-related tasks.

Weaknesses:

  • The benchmark tasks are designed around the note format of the REDnote app, which may limit their generalizability to other social networking platforms with different content structures and interaction dynamics.
  • The conversion of user note images to text using Optical Character Recognition (OCR) may miss critical visual elements that contribute to the meaning of a note, potentially affecting the fidelity of content understanding.

Other Comments or Suggestions

Typo: "Large Language Models (LLMs) play an importent role in SNS..." --> "important"

Suggestion: Improve justification for task selection. Some tasks, like Note-Gender, seem less well-defined in terms of real-world SNS applications.

Author Response

R4.Q1: Broader Social Networking Context

Thank you for the insightful observation. Our benchmark construction does incorporate key social interactions:

  1. Data Collection Pipeline (Section 3.2)

    • User engagement metrics reflecting community response
    • Note categories and tags chosen by creators based on audience reception
    • Comment sections capturing direct user interactions
  2. Example: Note-Hashtag

    • Candidate tags include the original creator’s selections (personal preference)
    • Community-generated popular tags for similar content (crowd preference)

R4.Q2: Personalization

We give an example of personalization (a minimal scoring sketch for this instance format follows the list below):

{"content":"Note Title: 108 Exercise + Reading Classics\nNote Content: Since discovering the Seed Law in October 2023, I’ve practiced 108 exercises and recited classics for over 7 months. My health has improved—better complexion, fewer wrinkles, and improved sleep. I've also experienced increased energy and financial gains.\nAs Seed Dad says, only action brings results.\nI hope everyone persists—each breakthrough leads to a better self!\nEvery drop of sweat is proof of our transformation!","candidates":"Enhance Energy,Grapefruit Jasmine,Must Visit Xinjiang,Cat Internal Deworming,Physics Tutoring for International Students,Stele Forest Museum,Sydney Spa,Buccellati Gardenia,108 Exercise,Tongue Piercing,Seed Law,Sharing What I Find Interesting,Women's Growth,Guojin Building","answer":"Seed Law,108 Exercise,Enhance Energy,Women's Growth"}
  1. The tags ("Seed Law", "108 Exercise") are derived from:
    • The author's consistent personal practice (7 months)
    • Tangible health benefits (specific improvements listed)
    • Community validation (implied by their selection)
  2. The selected tags represent:
    • A personal transformation journey ("Women's Growth")
    • Proven techniques with real impact ("108 Exercise")
    • A philosophical alignment ("Seed Law")
  3. This demonstrates how we capture personalization:
    • Creator-defined preferences (original tags)
    • Community-endorsed interests (popular tags)
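
As a minimal sketch of how an instance in this format could be parsed and scored (assuming the Note-Hashtag task is graded as set-overlap F1 between a model's predicted tags and the gold "answer" field; the helper function and the prediction below are illustrative, not our exact evaluation code):

```python
import json

# One benchmark instance per line (JSONL), as in the example above;
# "content" and "candidates" are elided here for brevity.
raw_line = """{"answer": "Seed Law,108 Exercise,Enhance Energy,Women's Growth"}"""

def tag_f1(predicted: str, gold: str) -> float:
    """Set-overlap F1 between two comma-separated tag lists
    (illustrative; not necessarily the paper's exact metric)."""
    pred = {t.strip() for t in predicted.split(",") if t.strip()}
    ref = {t.strip() for t in gold.split(",") if t.strip()}
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

instance = json.loads(raw_line)
model_output = "Seed Law,108 Exercise,Enhance Energy"  # hypothetical model prediction
print(round(tag_f1(model_output, instance["answer"]), 3))  # 0.857
```

With this hypothetical prediction, three of the four gold tags are recovered, giving F1 ≈ 0.857.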

R4.Q3: Platform limitations. We appreciate the concern about REDnote's note format. In our ongoing work (SNS-Bench-V2), we are addressing this through:

  1. Multi-platform expansion:
  • Collecting data from Twitter/X and Instagram.
  • Designing unified task formulations that work across platforms.
  2. Beyond note content:
  • Incorporating threaded conversations.
  • User history.
  • Cross-post interactions.

R4.Q4: Visual loss in OCR. We take the visual element concern seriously and have implemented the following (a sketch of the confidence filter appears after the list):

  1. Strict quality control:
  • Human verification for all OCR conversions (Appendix B)
  • Automatic filtering of low-confidence OCR results
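
As an illustration of the second bullet, here is a minimal sketch of confidence-based OCR filtering (the 0.85 threshold and the record layout are assumptions for this example, not our exact pipeline; filtered notes then go through the human verification described in Appendix B):

```python
# Illustrative sketch of low-confidence OCR filtering; the threshold and
# record layout are assumptions for this example, not the exact pipeline.
MIN_CONFIDENCE = 0.85

def filter_ocr_lines(ocr_lines):
    """Keep OCR lines whose recognition confidence meets the threshold;
    each line is assumed to be a dict {"text": ..., "confidence": ...}."""
    kept = [line for line in ocr_lines if line["confidence"] >= MIN_CONFIDENCE]
    return kept, len(ocr_lines) - len(kept)

ocr_lines = [
    {"text": "108 Exercise + Reading Classics", "confidence": 0.97},
    {"text": "S33d L@w", "confidence": 0.41},  # garbled region, filtered out
]
kept, n_dropped = filter_ocr_lines(ocr_lines)
print([line["text"] for line in kept], n_dropped)  # ['108 Exercise + Reading Classics'] 1
```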

R4.Q5: Typo and Tasks. We will fix the typos and rewrite the detailed definitions of the tasks.

R4.Q6: Results on Chinese Version. Most cases are in English, with Chinese cases translated into English (GLM4 + manual review). To provide more results, we have translated all English cases into Chinese (GLM4 + manual review). We will release the Chinese data. The average results on the Chinese version are:

| Model | Note-Taxonomy | Note-Hashtag | Note-QueryCorr | Note-MRC | Note-NER | Note-Gender | Note-CHLW | Note-QueryGen | SNS-Bench |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3.2-3B-Instruct | 10.50 | 6.48 | 23.25 | 13.22 | 4.05 | 64.25 | 24.81 | 32.84 | 22.42 |
| Qwen2.5-1.5B-Instruct | 12.09 | 34.06 | 42.42 | 17.61 | 42.55 | 68.91 | 50.69 | 32.43 | 37.59 |
| Phi-3.5-mini-instruct (3.82B) | 32.84 | 62.09 | 40.91 | 25.16 | 38.30 | 54.40 | 26.47 | 33.21 | 39.17 |
| Phi-4-14B | 57.16 | 63.60 | 44.74 | 41.17 | 46.18 | 89.64 | 27.10 | 31.43 | 50.13 |
| Glm-4-9b-chat | 53.09 | 80.21 | 42.11 | 27.75 | 56.69 | 88.08 | 37.76 | 37.51 | 52.90 |
| Qwen2.5-7B-Instruct | 47.90 | 77.54 | 43.52 | 54.37 | 62.51 | 89.12 | 36.20 | 37.24 | 56.05 |
| Qwen2.5-32B-Instruct | 66.65 | 84.08 | 51.38 | 64.87 | 65.59 | 90.67 | 39.00 | 35.78 | 62.25 |
| Qwen2.5-72B-Instruct | 68.29 | 87.83 | 55.90 | 59.66 | 65.75 | 92.23 | 49.01 | 39.65 | 64.79 |
| Deepseek-v3 | 74.26 | 91.39 | 57.22 | 62.83 | 74.20 | 93.26 | 40.93 | 35.07 | 66.14 |
| GLM-4-Plus | 71.42 | 89.71 | 52.67 | 61.49 | 68.88 | 93.26 | 32.88 | 36.03 | 63.29 |
| Gemini-1.5-pro | 70.27 | 87.88 | 48.39 | 60.42 | 70.18 | 90.16 | 34.26 | 37.36 | 62.36 |
| GPT-4o-2024-05-13 | 69.18 | 80.28 | 55.02 | 65.11 | 70.52 | 91.19 | 48.04 | 39.56 | 64.86 |

(Note: Llama-3.2-3B-Instruct and Phi-4-14B show a performance drop on the Chinese version.)
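
For reference, the SNS-Bench column appears to be the unweighted mean of the eight per-task scores; a quick check on the Llama-3.2-3B-Instruct row:

```python
# The SNS-Bench column appears to be the unweighted mean of the eight
# per-task scores; checked here against the Llama-3.2-3B-Instruct row.
scores = [10.50, 6.48, 23.25, 13.22, 4.05, 64.25, 24.81, 32.84]
print(sum(scores) / len(scores))  # ≈ 22.425, consistent with the reported 22.42
```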

R4.Q7: Social Networking & Personalization. SNS-Bench captures social networking dynamics through:

  1. Interaction Signals: Tasks use data with implicit social context—comments, user-generated tags, and replies. For example, Note-CHLW identifies highlight words from actual discussions, mirroring how platforms detect trending topics.
  2. Personalization: Most tasks reflect personalized SNS behaviors, for example:
  • Note-QueryGen: Generated queries mimic how users personalize searches based on interests (e.g., converting a skincare note into "best vitamin C serums for sensitive skin").

We kindly invite you to review our responses and reconsider your assessments. Thank you for your time and consideration!

Review (Rating: 3)

The paper introduces SNS-BENCH, a comprehensive benchmark for evaluating large language models in social networking service tasks. It covers eight diverse tasks—from note taxonomy and sentiment analysis to query generation and entity recognition—using a dataset of 6,658 questions sourced from a major social platform. The study presents detailed experimental results across 25+ LLMs, highlighting performance variations and a scaling law that informs both the strengths and limitations of current models.

Questions for Authors

Please check above sections.

Claims and Evidence

The paper claims that SNS-BENCH offers a systematic and robust framework to assess LLMs' capabilities in SNS contexts and that model performance improves with scale, with closed-source models generally outperforming open-source counterparts. Extensive experimental results, quantitative metrics (accuracy, F1, etc.), and detailed analyses across eight tasks support these claims. The evidence appears convincing, though some claims would benefit from further discussion of dataset representativeness.

Methods and Evaluation Criteria

The paper employs a multi-step data collection and annotation process, ensuring diversity and quality through de-identification, manual review, and expert validation. Tailored evaluation metrics are defined for each of the eight tasks, including accuracy, F1 score, and so on. These methods and criteria are well-aligned with the goals of assessing LLM performance in complex, real-world SNS scenarios.

Theoretical Claims

NA

Experimental Design and Analysis

The experimental design benchmarks over 25 LLMs using standardized prompts and metrics across eight distinct SNS tasks on a large-scale dataset. The analyses compare performance variations based on model size and type, providing clear insights into task-specific challenges and strengths.

Overall, the design is methodologically sound and the analyses are thorough.

Supplementary Material

NA

Relation to Prior Literature

NA

Missing Important References

NA

Other Strengths and Weaknesses

Strengths include the paper’s comprehensive and well-structured benchmark, detailed experimental analysis, and clear evaluation criteria that address diverse SNS tasks.

A notable weakness is that the reliance on text-based interactions might overlook multimodal aspects inherent to modern social networking.

Another minor concern is the lack of insights or detailed analysis; the current analysis reads more like a summary of the results.

Other Comments or Suggestions

Expanding the discussion on future extensions—such as incorporating multimodal data—could further enhance the paper. Overall, the work is solid and offers valuable insights into the evaluation of LLMs in social networking contexts.

Author Response

R3.Q1: Multimodal limitation. We sincerely appreciate the reviewer’s constructive feedback. The reviewer is absolutely right to highlight the importance of multimodal interactions in modern SNS platforms. In fact, we are already working on SNS-Bench-V2, which will incorporate image-text pairs (e.g., OCR-extracted text from note images, user-generated captions) and multimodal tasks (e.g., visual hashtag recommendation, sentiment analysis). This extension will align with real-world SNS content while maintaining our focus on nuanced social understanding. We will explicitly discuss this direction in the revised manuscript’s Broader Impact section.

R3.Q2: Deeper analysis and insights. Thank you for prompting us to clarify our findings. Beyond aggregated results (Table 2), our analysis reveals critical task-specific and generalizable insights about LLMs’ social capabilities:

1. General Insights:

Even state-of-the-art models struggle with SNS-unique demands:

  • Creativity: Models underperform in Note-QueryGen (lexical diversity) and Note-CHLW (novelty detection) due to rigid training objectives.
  • Depth: Models struggle with Note-MRC (Complex) (evidence extraction) and Note-QueryCorr (Topic) (intent granularity), revealing gaps in social context modeling (Section 5.1).

2. More Task-Specific Insights:

  • Note-Taxonomy: Small models (e.g., Qwen-1.5B) favor single-choice tasks (accuracy: 27.5%), while larger models (e.g., Llama3-70B) excel in multi-hop reasoning (65.12%), proving scale aids hierarchical reasoning (Table 2).
  • Note-Hashtag: Most models (32B+) perform better on single-choice (Qwen-72B: 86.25%) than multi-choice (84.60%), except Claude-3.5, which leads in multi-choice (88.58%), suggesting superior multi-label adaptability (Figure 9).
  • Note-MRC (Simple): Gemini/Qwen excel at binary relevance judgment (F1: ~90%) but fail to precisely extract answers (BLEU: ~55%), whereas DeepSeek-V3 achieves the best content extraction (ROUGE-L: 74.81%), highlighting a trade-off between relevance judgment and granular retrieval (Table 5).

These observations underscore that SNS challenges require both social and technical innovation—a theme we will emphasize in Section 6. We will also add a new subsection (5.3 Model Behavior Analysis) to consolidate these insights with supporting visualizations (e.g., confusion matrices for Hashtag tasks).

Review (Rating: 4)

This paper aims to advance LLMs for Social Networking Services (SNS) by introducing a comprehensive benchmark, SNS-BENCH, derived from a social media platform, addressing the limitation of studying SNS through isolated tasks in prior work. The benchmark includes eight distinct tasks, such as note classification, sentiment analysis, and personalized recommendation, providing evaluation across various realistic dimensions. The authors evaluate over 25 LLMs on SNS-BENCH, providing insights into model performance across different categories. One main result is that closed-source models generally outperform open-source ones, but with a relatively small margin, and that tasks involving complex emotions and long-text understanding remain challenging for current LLMs.

Questions for Authors

Have you considered evaluating models under adversarial settings, simulating malicious users in online communities? The dataset may have biases due to its reliance on a single social media platform, so I’m curious about further analysis on this aspect.

Claims and Evidence

Claims made in this work are overall well-supported. The comprehensiveness of the SNS-BENCH benchmark is well illustrated by its diverse sources and the different SNS tasks it covers. The dataset itself is well-structured, paired with standard metrics for each subtask. Experiments are thorough, including a variety of popular LLMs from both the open-source and closed-source communities. However, one caveat is that the authors motivate the idea of shifting away from isolated SNS tasks, yet the proposed SNS-BENCH benchmark still evaluates each SNS task in isolation after the breakdown, albeit with realistic and situated source data. This makes some claims about generalizing the conclusions beyond individual tasks less convincing.

Methods and Evaluation Criteria

The benchmark includes a rigorous annotation process with both automated and human verification, ensuring high data quality. Evaluations on each separate task are fair and standard, including widely-used metrics from the QA and text generation literature (F1, BLEU, etc.).

Theoretical Claims

N/A

Experimental Design and Analysis

Experimental setup and analysis are well-documented, with clear descriptions of the evaluated models, the computational resources, and the experimental protocol. This work thoroughly compares the performance of different models and analyzes task-specific challenges. However, the discussion reads a bit brief, likely due to limited space; moving some graphs to the appendix would help. This work could be further strengthened by analysis of the social and online dimensions of SNS tasks, such as how different models cope with user diversity and temporal dynamics. These dimensions are naturally introduced into the benchmark by the source data, but the evaluation and analysis do not explicitly address these challenging aspects.

Supplementary Material

No

Relation to Prior Literature

The study emphasizes the necessity of a comprehensive SNS-specific benchmark, and highlights gaps in existing LLM capabilities regarding social interaction in the online culture.

Missing Important References

This paper is well-referenced.

Other Strengths and Weaknesses

One of the key weaknesses of this submission is its lack of creativity in designing a novel evaluation framework. The approach primarily derives a benchmark from real-world user data and breaks it down into individual tasks with standard evaluation metrics. While this is a reasonable and common methodology, it does not push the boundaries of evaluation in the Social Networking Services (SNS) domain. The field would greatly benefit from more innovative evaluation paradigms that account for the complex nature of real-world SNS challenges, particularly those involving user interaction and diversity.

Other Comments or Suggestions

022: “in SNS remains challenging (Bandura).” → include the year for this citation. Figures 4–9: the words in the graphs are a bit too small to read.

Author Response

Response to Review 2: Thank you for your insightful feedback.

R2.Q1: Isolated evaluation of SNS tasks. We acknowledge the concern regarding task isolation in SNS-Bench and appreciate the opportunity to clarify our design rationale.

  1. Task-Specific Evaluation Necessity: Social networking services (SNS) involve diverse interactions (e.g., content comprehension, sentiment analysis, recommendation), each requiring distinct capabilities. Isolated evaluation allows us to:
  • Pinpoint strengths/weaknesses: For instance, models may excel at hashtag selection (structured tasks) but struggle with complex reasoning in Note-MRC (Figure 7).
  • Guide targeted improvements: Task-specific metrics (e.g., F1 for Note-NER, ANLS for Note-QueryGen) reveal granular performance gaps (Table 5).
  2. Holistic Insights via Aggregation: While tasks are evaluated independently, our analysis synthesizes cross-task patterns (Section 4.3). For example:
  • Closed-source models consistently outperform open-source ones (Table 2), suggesting general superiority in SNS contexts.
  • Tasks requiring emotional/cultural understanding (Note-Gender) exhibit higher variance, highlighting challenges in social nuance (Section 5.1).
  3. Real-World Generalization: The benchmark’s diversity—8 tasks spanning 6,658 questions with varied formats (Table 1)—ensures broad coverage of SNS scenarios. By testing isolated but representative tasks, we simulate the multifaceted demands of real-world platforms (e.g., handling both classification and generation).

We agree that future work could explore interdependencies between tasks (e.g., joint training). However, our current design prioritizes diagnostic clarity, enabling actionable insights for model development. Thank you again for your valuable advice. We will emphasize this rationale in the revised manuscript.

R2.Q2: Lack of creativity in evaluation framework.

We appreciate the insightful feedback. While our benchmark adopts standard evaluation metrics for individual tasks, its core value lies in the authenticity and diversity of real-world SNS scenarios. Unlike synthetic or simplified datasets, SNS-Bench captures nuanced user behaviors (e.g., informal language, cultural references) and task interdependencies (e.g., sentiment influencing recommendations) from actual SNS platforms. This fidelity enables a more grounded assessment of LLMs’ practical utility in social contexts. Future work will build on this foundation to incorporate interactive and adversarial dynamics.

R2.Q3: Citation and figure readability issues.

Thank you for your careful review. We will address both points in the revised manuscript:

  1. Citation update: The reference to Bandura will be updated to include the publication year (e.g., "Bandura, 2001").
  2. Figure improvements: Figures 4–9 will be resized to ensure all text (axis labels, legends, annotations) is clearly legible in the final version.

Review (Rating: 4)

This paper presents SNS-Bench to assess LLMs on different social networking service tasks. It includes 8 tasks, such as query-content relevance, evaluates 25+ LLMs, and provides further insights.

Update after rebuttal

I maintain my score in support of the work after rebuttal.

Questions for Authors

Please see above comments.

Claims and Evidence

The central claim is that current LLMs are not performing ideally on SNS tasks. Table 2 supports this claim well.

Methods and Evaluation Criteria

Yes, it evaluates 25+ leading LLMs, including Claude, GPT, Llama, Qwen, etc.

Theoretical Claims

There are no theoretical claims, if the reviewer understands correctly.

Experimental Design and Analysis

Yes, the experiment design is sound (this is a benchmark paper, with a reasonable and comprehensive choice of LLMs).

Supplementary Material

Yes, the reviewer mainly reviewed Appendix A, which provides the extensive prompt templates used.

Relation to Prior Literature

Related benchmarks either do not cover complex scenarios with multiple network tools or rely heavily on GPT models. The paper addresses these two problems well.

Missing Important References

The reviewer believes the paper provides a good list of references.

Other Strengths and Weaknesses

The paper addresses a good problem: evaluating LLMs on SNS tasks.

Other Comments or Suggestions

The font of Table 2 is too small. Please consider using more than one row for one task.

Author Response

Response to Review 1:

R1.Q1: Font size in Table 2. Thank you for your helpful suggestion. We will adjust the font size and optimize the layout of Table 2 (e.g., using multiple rows for tasks if needed) to improve readability in the revised version.

Reviewer Comment

Thank you for the reply. I maintain my score and support the work to be accepted!

Final Decision

The authors present SNS-BENCH, a benchmark designed to evaluate LLMs across a diverse set of social networking service (SNS) tasks. The benchmark covers eight task types derived from a real-world platform and is used to evaluate over 25 LLMs. The authors also provide a detailed performance analysis, identifying specific challenges such as emotion understanding and multi-hop reasoning.

The motivation is timely and well-supported, and the breadth of tasks and models included makes the benchmark a useful contribution to the evaluation landscape for LLMs in social contexts. The authors’ responses clarified key design decisions, including the rationale for evaluating tasks independently and how implicit social and personalized signals were incorporated.

That said, one reviewer raised valid concerns about the lack of explicit modeling of social networks and user histories, and about the narrow definition of personalization in the current setup. These concerns remain partially unresolved and point to directions for future work.

After reviewing the paper, author response, and discussion, I believe that the benchmark provides a valuable foundation for assessing LLMs in SNS-like scenarios, even if its scope is not yet comprehensive. I encourage the authors to further clarify their definitions and continue extending the dataset to better capture network dynamics and personalization in future iterations.