PaperHub
Overall rating: 6.8 / 10 (Poster, 4 reviewers; min 6, max 7, std 0.4)
Individual ratings: 6, 7, 7, 7
Average confidence: 3.0
COLM 2025

G1yphD3c0de: Towards Safer Language Models on Visually Perturbed Texts

OpenReview · PDF
Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

Towards Safer Language Models on Visually Perturbed Texts

Abstract

Keywords
safety, societal implications, multimodality

Reviews and Discussion

Official Review
Rating: 6

This paper proposes GLYPHDECODE, a framework to defend against visually perturbed text intended to bypass content moderation systems. The framework includes: GLYPHPERTURBER, which generates visually perturbed training samples using OCR-informed glyph similarity; GLYPHRESTORER, a lightweight multimodal transformer that restores text from its perturbed visual form using character-level image and text embeddings. The authors also introduce GLYPHSYNTH, a benchmark dataset with both word-level and multi-line inputs. Experiments show that GLYPHDECODE significantly improves safety classification and restoration performance across various multimodal LLMs (closed and open-source).

Reasons to Accept

  • The paper studies a timely and important problem in online safety and adversarial robustness.
  • The proposed method is simple, modular, and practically applicable, with strong empirical results across models and tasks.
  • The authors construct a well-designed benchmark with multi-line, category-specific, and visually perturbed content.

Reasons to Reject

  • While the framework achieves strong results, the paper lacks ablation studies analyzing the contribution of each component (e.g., visual embeddings, fusion mechanism, OCR backbone). Without such analysis, it is difficult to assess the necessity and individual impact of different modules in GLYPHDECODE.
  • The paper does not compare against simpler baselines, such as training a safety classifier directly on visually perturbed data via data augmentation. It remains unclear whether restoration is necessary for improving safety classification, or if comparable performance can be achieved through less complex strategies.
Comment

We appreciate the reviewer’s acknowledgment of the relevance of our problem setting, the practicality of our modular approach, and the value of the benchmark we introduce. Below, we address the points raised by the reviewer:

Point 1. Ablation studies
| While the framework achieves strong results, the paper lacks ablation studies analyzing the contribution of each component (e.g., visual embeddings, fusion mechanism, OCR backbone). Without such analysis, it is difficult to assess the necessity and individual impact of different modules in GLYPHDECODE.

Thank you for the valuable feedback. We have extended our experiments with a detailed ablation study.

To isolate the contribution of each core module in GLYPHDECODE, we systematically ablated the following components: (1) without the visual embeddings, (2) without the character embeddings, (3) an alternative fusion mechanism (replacing our modality-adaptive gated fusion with cross-attention), and (4) an alternative OCR backbone (replacing ours with the pretrained transformer-based TrOCR [1]).
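
For clarity, the sketch below illustrates the kind of fusion these ablations target. It is a minimal, self-contained PyTorch approximation with illustrative module names and dimensions, not the exact GlyphRestorer implementation; the comments indicate how each ablation variant maps onto it.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal sketch of a modality-adaptive gated fusion layer.

    Per character position, a learned gate decides how much to trust the
    visual embedding versus the character (text) embedding. Names and
    dimensions are illustrative, not the paper's exact implementation.
    """
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),  # per-dimension weight in [0, 1]
        )

    def forward(self, vis: torch.Tensor, char: torch.Tensor) -> torch.Tensor:
        # vis, char: (batch, seq_len, d_model)
        g = self.gate(torch.cat([vis, char], dim=-1))
        return g * vis + (1.0 - g) * char  # modality-adaptive mix


# Ablation variants roughly correspond to zeroing one stream or swapping the fusion:
#   w/o visual embedding   -> fuse(torch.zeros_like(vis), char)
#   w/o char embedding     -> fuse(vis, torch.zeros_like(char))
#   cross-attention fusion -> nn.MultiheadAttention(d_model, num_heads=4,
#                             batch_first=True)(char, vis, vis)[0]
fuse = GatedFusion(d_model=256)
vis = torch.randn(2, 32, 256)   # toy visual features for 32 character positions
char = torch.randn(2, 32, 256)  # toy character embeddings
fused = fuse(vis, char)         # (2, 32, 256)
```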

We observed a substantial performance collapse across all metrics when visual embeddings were removed, with an average absolute drop of 69.4% across all three visual perturbation mechanisms, confirming that visual features are essential for restoration.

Ablating character embeddings also led to an average drop of 25.1%, indicating that the model relies not only on visual signals but also on the underlying textual structure to recover content effectively.

Our proposed fusion mechanism showed slightly better performance than a standard cross-attention alternative, with an average gain of 2.6%, suggesting that modality-adaptive weighting provides marginal but consistent benefit in this task setting.

Finally, replacing our OCR backbone with TrOCR led to a 24.8% average performance drop, especially under PaddleOCR and VIPER perturbations. As shown in Section 4.3 and Figure 6, these involve more complex distortions, to which TrOCR is less robust. In contrast, our backbone better handles such challenging visual noise.

We will include these ablation results and accompanying analysis in the revised version.

| Model | EOCR (p=0.5) | EOCR (p=1.0) | POCR (p=0.5) | POCR (p=1.0) | VIPER (p=0.5) | VIPER (p=1.0) |
| --- | --- | --- | --- | --- | --- | --- |
| GlyphRestorer (w/o visual embedding) | 16.89% | 19.16% | 15.54% | 14.99% | 15.22% | 13.04% |
| GlyphRestorer (w/o char embedding) | 64.32% | 55.88% | 67.29% | 65.94% | 62.60% | 56.66% |
| GlyphRestorer (w/ cross-attention) | 98.06% | 97.14% | 86.34% | 68.25% | 74.61% | 51.11% |
| GlyphRestorer (w/ TrOCR backbone) | 90.57% | 77.92% | 68.61% | 34.89% | 41.67% | 29.97% |
| GlyphRestorer (full model) | 98.90% | 98.26% | 87.00% | 76.16% | 77.83% | 57.12% |

References
[1] TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

Comment

Point 2. Comparison against simpler baselines
| The paper does not compare against simpler baselines, such as training a safety classifier directly on visually perturbed data via data augmentation. It remains unclear whether restoration is necessary for improving safety classification, or if comparable performance can be achieved through less complex strategies.

Thank you for this important suggestion. To evaluate whether our restoration pipeline is truly necessary, we incorporated a safety classifier inspired by prior work on robustness to visual perturbations [1]. Specifically, we adopted Detoxify [2], a widely used safety classifier, and fine-tuned its pretrained model on our visually perturbed text data from the GlyphSynth dataset.

As shown in the table below, this simpler baseline (Detoxify) achieves moderately improved performance over zero-shot models. However, it still significantly underperforms compared to our GlyphRestorer pipeline across all models and evaluation metrics. This suggests that simply exposing classifiers to noisy inputs is insufficient, and that explicit restoration is critical for recovering semantic integrity under perturbation.
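
For reference, a minimal sketch of this data-augmentation baseline is shown below. It fine-tunes Detoxify's underlying unitary/toxic-bert checkpoint on perturbed strings with Hugging Face Transformers; the toy data, binary label scheme, and hyperparameters are illustrative assumptions rather than our exact training setup.

```python
# pip install transformers datasets
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Detoxify's underlying checkpoint; a binary safe/unsafe head is re-initialised.
ckpt = "unitary/toxic-bert"
tok = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(
    ckpt, num_labels=2, ignore_mismatched_sizes=True)

# Visually perturbed strings paired with toxicity labels (toy examples only).
train = Dataset.from_dict({
    "text": ["f℮ntanyl for sale", "have a nice day"],
    "label": [1, 0],
})
train = train.map(
    lambda ex: tok(ex["text"], truncation=True, padding="max_length", max_length=64),
    batched=True)

args = TrainingArguments(output_dir="detoxify-perturbed", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train).train()
```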

Multi-line

| Model | EOCR (p=0.5 / p=1.0) | POCR (p=0.5 / p=1.0) | VIPER (p=0.5 / p=1.0) |
| --- | --- | --- | --- |
| Detoxify | 76.20% / 49.14% | 71.97% / 46.26% | 77.37% / 48.39% |
| Gemini | 85.95% / 74.25% | 84.19% / 66.32% | 74.37% / 43.31% |
| Qwen2-VL-72B | 79.00% / 59.17% | 77.39% / 61.47% | 64.14% / 22.45% |
| GlyphRestorer (Gemini) | 93.08% / 92.99% | 93.93% / 93.82% | 91.16% / 89.70% |
| GlyphRestorer (Qwen2-VL-72B) | 90.13% / 89.78% | 90.48% / 90.71% | 86.68% / 85.62% |

Single-word

| Model | EOCR (p=0.5 / p=1.0) | POCR (p=0.5 / p=1.0) | VIPER (p=0.5 / p=1.0) |
| --- | --- | --- | --- |
| Detoxify | 61.27% / 44.42% | 63.40% / 59.48% | 62.87% / 49.88% |
| Gemini | 83.72% / 79.86% | 84.03% / 81.00% | 75.10% / 62.27% |
| Qwen2-VL-72B | 76.32% / 54.75% | 75.59% / 69.68% | 53.15% / 14.02% |
| GlyphRestorer (Gemini) | 90.79% / 89.95% | 91.91% / 91.62% | 86.97% / 83.99% |
| GlyphRestorer (Qwen2-VL-72B) | 88.69% / 88.35% | 90.55% / 90.67% | 83.68% / 80.24% |

We will include these baseline results in the revised version.

References
[1] Learning the Legibility of Visual Text Perturbations
[2] https://github.com/unitaryai/detoxify

Comment

Thank you for the detailed response. While your clarifications addressed many of my concerns, considering the overall quality, scope, and the positive assessments from other reviewers, I will keep my original score, which was already positive. Wishing the authors the best of luck.

Official Review
Rating: 7

This paper approaches the problem of decoding text that may have been visually perturbed. The authors propose a method for perturbing text (GlyphPerturber) which is used to generate a dataset (GlyphSynth) used to train models of a novel architecture (GlyphDecode). The trained models outperform closed-weight VLMs not trained for this task.

Reasons to Accept

  • The combination of a method, dataset, and model are a lot for one paper but make sense to me as a coherent single contribution.
  • The results are good: GlyphDecode beats both open and closed-weight LLMs when used out of the box.
  • It is nice that GlyphDecode can operate on a pre-trained LLM (e.g. gpt-4o) and does not require weight access.

Reasons to Reject

  • The GlyphRestorer module detailed in Figure 3 includes many seemingly-random architectural decisions. Ablations of different components would help know why it's beating the baselines in Table 2 (or is it mostly just the unique training data?).
  • Both datasets used for evaluation (Table 2 / Figure 5) are I think proposed in this paper. It would be better to have evaluation on at least one task that's also used in prior work.
  • No mention of CAPTCHA solving which seems to be the real-world problem the authors are solving.

Questions for the Authors

  • Is this not just a model that can solve CAPTCHAs?
  • Are there no other OCR models designed specifically for visually perturbed text?
  • It could be helpful for readers if you included more examples from the GlyphSynth dataset in the appendix.

Typos:

  • Abstract: "we introduce GLYPHSYNTH publicly available"
  • Figure 5 should say Table 5

Details of Ethics Concerns

I think the authors may have built a model that is perfect for solving CAPTCHAs. I don't think this should qualify as a barrier to acceptance – this is a good paper! – but I think it should be mentioned in the Ethics statement.

Comment

We appreciate the reviewer’s recognition of the coherence of our combined contributions, the effectiveness of GlyphDecode across model types, and its ability to operate without access to model weights. Below, we address the points raised by the reviewer:

Point 1. Ablation studies
| The GlyphRestorer module detailed in Figure 3 includes many seemingly-random architectural decisions. Ablations of different components would help know why it's beating the baselines in Table 2 (or is it mostly just the unique training data?).

Thank you for highlighting the importance of ablation analysis to better understand the source of performance gains. In response, we have extended our experiments with a systematic ablation study to assess the individual contribution of core components in GlyphRestorer.

Specifically, we evaluated four key ablations: (1) removing visual embeddings, (2) removing character (textual) embeddings, (3) replacing our modality-adaptive gated fusion with standard cross-attention, and (4) replacing our OCR backbone with a transformer-based model (TrOCR [1]).

We observed a substantial performance collapse across all metrics when visual embeddings were removed, with an average absolute drop of 69.4% across all three visual perturbation mechanisms, confirming that visual features are essential for restoration.

Ablating character embeddings also led to an average drop of 25.1%, indicating that the model relies not only on visual signals but also on the underlying textual structure to recover content effectively.

Our proposed fusion mechanism showed slightly better performance than a standard cross-attention alternative, with an average gain of 2.6%, suggesting that modality-adaptive weighting provides marginal but consistent benefit in this task setting.

Finally, replacing our OCR backbone with TrOCR led to a 24.8% average performance drop, especially under PaddleOCR and VIPER perturbations. As shown in Section 4.3 and Figure 6, these involve more complex distortions, to which TrOCR is less robust. In contrast, our backbone better handles such challenging visual noise.

We will include these ablation results and accompanying analysis in the revised version.

| Model | EOCR (p=0.5) | EOCR (p=1.0) | POCR (p=0.5) | POCR (p=1.0) | VIPER (p=0.5) | VIPER (p=1.0) |
| --- | --- | --- | --- | --- | --- | --- |
| GlyphRestorer (w/o visual embedding) | 16.89% | 19.16% | 15.54% | 14.99% | 15.22% | 13.04% |
| GlyphRestorer (w/o char embedding) | 64.32% | 55.88% | 67.29% | 65.94% | 62.60% | 56.66% |
| GlyphRestorer (w/ cross-attention) | 98.06% | 97.14% | 86.34% | 68.25% | 74.61% | 51.11% |
| GlyphRestorer (w/ TrOCR backbone) | 90.57% | 77.92% | 68.61% | 34.89% | 41.67% | 29.97% |
| GlyphRestorer (full model) | 98.90% | 98.26% | 87.00% | 76.16% | 77.83% | 57.12% |

References
[1] TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

Comment

Point 2. Evaluation introduced in prior work
| Both datasets used for evaluation (Table 2 / Figure 5) are I think proposed in this paper. It would be better to have evaluation on at least one task that's also used in prior work.

Thank you for the valuable suggestion. In response, we extended our evaluation to a task introduced in prior work, EvilText, which was originally proposed to assess the robustness of NLP systems against character-level visual perturbations [1]. EvilText adopts a CNN + autoencoder architecture to generate perturbed text and provides an open-source implementation [2], making it a suitable and reproducible choice for evaluation.

To ensure a fair comparison, we followed the original EvilText protocol to construct a train/test split using toxic vocabulary, and report numerical results in the table below.

In the restoration setting (measured by Normalized Edit Distance), our GlyphRestorer—trained solely on the GlyphSynth dataset (E+P+V)—achieved strong results even on EvilText perturbations, despite no access to its training distribution. To further test this, we also trained GlyphRestorer directly on EvilText's training set. As expected, performance improved further, demonstrating our model's transferability and adaptability to new distortion patterns.
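
For completeness, the sketch below shows how a restoration score based on normalized edit distance can be computed. It assumes the reported numbers correspond to 1 minus the normalized Levenshtein distance, expressed in percent (higher is better); the exact convention in the paper may differ slightly.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def restoration_score(pred: str, gold: str) -> float:
    """1 - normalized edit distance, in percent (100 = perfect restoration)."""
    if not gold and not pred:
        return 100.0
    ned = levenshtein(pred, gold) / max(len(pred), len(gold))
    return 100.0 * (1.0 - ned)

print(restoration_score("fentanyl", "fentanyl"))  # 100.0
print(restoration_score("f3ntanyl", "fentanyl"))  # 87.5
```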

We also observed consistent improvements in the safety classification task, where visually perturbed inputs led to severe performance degradation in base models. Applying GlyphRestorer significantly restored performance across all models and perturbation settings.

We believe this prior work evaluation complements our newly proposed dataset, and we agree that incorporating additional external tasks remains a valuable direction for future expansion.

EVILTEXT (Restoration)

| Model | p=0.5 | p=1.0 |
| --- | --- | --- |
| OpenAI/gpt-4o | 45.07% | 18.58% |
| Anthropic/Claude | 50.54% | 35.63% |
| Google/Gemini | 52.39% | 36.71% |
| Intern2.5-VL-38B | 39.90% | 20.16% |
| Qwen2-VL-7B | 50.27% | 24.78% |
| Intern2.5-VL-78B | 31.24% | 16.31% |
| Qwen2-VL-72B | 53.42% | 30.53% |
| TesseractOCR | 59.65% | 24.84% |
| PaddleOCR | 62.60% | 30.80% |
| EasyOCR | 72.17% | 47.46% |
| GlyphRestorer (EOCR) | 77.82% | 60.31% |
| GlyphRestorer (POCR) | 80.07% | 62.01% |
| GlyphRestorer (VIPER) | 70.38% | 42.47% |
| GlyphRestorer (E+P+V) | 81.28% | 63.12% |
| GlyphRestorer (EVILTEXT) | 92.83% | 91.46% |

EVILTEXT (Safety Classification)

| Model | Multi-line (p=0.5 / p=1.0) | Single-word (p=0.5 / p=1.0) |
| --- | --- | --- |
| OpenAI/gpt-4o | 63.10% / 40.60% | 75.60% / 63.50% |
| Anthropic/Claude | 62.20% / 47.50% | 33.80% / 14.50% |
| Google/Gemini | 71.00% / 52.70% | 68.20% / 56.00% |
| Intern2.5-VL-38B | 53.90% / 29.80% | 13.60% / 1.70% |
| Qwen2-VL-7B | 71.30% / 45.40% | 66.30% / 44.40% |
| Intern2.5-VL-78B | 36.40% / 16.00% | 13.50% / 1.50% |
| Qwen2-VL-72B | 74.00% / 58.30% | 72.00% / 56.80% |
| GlyphDecode (OpenAI/gpt-4o) | 74.20% / 70.00% | 82.30% / 78.30% |
| GlyphDecode (Anthropic/Claude) | 88.90% / 86.70% | 77.50% / 74.60% |
| GlyphDecode (Google/Gemini) | 90.20% / 87.00% | 84.00% / 79.00% |
| GlyphDecode (Intern2.5-VL-38B) | 76.00% / 74.00% | 66.20% / 64.50% |
| GlyphDecode (Qwen2-VL-7B) | 77.20% / 75.20% | 66.30% / 62.00% |
| GlyphDecode (Intern2.5-VL-78B) | 76.80% / 77.60% | 68.30% / 66.00% |
| GlyphDecode (Qwen2-VL-72B) | 85.30% / 84.50% | 81.00% / 77.60% |

References
[1] Unicode Evil: Evading NLP Systems Using Visual Similarities of Text Characters
[2] https://bitbucket.org/srecgrp/eviltext/src/master/

Comment

Thanks for the response. In light of these improvements, especially the promised improved ethics statement, I am raising my score to a 7.

Comment

Point 3. Clarifying the connection to CAPTCHA solving
| No mention of CAPTCHA solving which seems to be the real-world problem the authors are solving.

Thank you for the insightful comment and for acknowledging the contributions of our work. While our model restores visually distorted text—superficially resembling CAPTCHA solving—its design and objective are fundamentally different.

Traditional CAPTCHA systems are based on random, meaningless text distorted to prevent automated recognition. In contrast, our model is explicitly optimized to restore semantically meaningful and toxic phrases that are intentionally perturbed to evade content moderation while remaining human-readable.

Importantly, as shown in Figure 3(b), our model architecture is designed to leverage semantic context during decoding. It processes both the perturbed text and its visual representation, and generates predictions by attending to surrounding characters—effectively modeling linguistic continuity that does not exist in traditional CAPTCHA settings.

We acknowledge, however, that some semantic CAPTCHAs exist. In light of this, we will revise the Ethics Statement to explicitly address this potential for misuse. We will clarify that our model is designed solely for improving online safety and not for defeating human-verification mechanisms, and we will encourage responsible use aligned with best practices in adversarial ML research.

Point 4. Minor typos and figure/table reference fixes
| Typos: Abstract: "we introduce GLYPHSYNTH publicly available", Figure 5 should say Table 5

We appreciate the reviewer’s careful reading. We will correct the typo in the abstract (“we introduce GLYPHSYNTH publicly available”) and fix the incorrect reference to Figure 5, which should in fact refer to Table 5. These issues will be addressed in the revised version to improve clarity and accuracy.

Question 1
| Is this not just a model that can solve CAPTCHAs?

Thank you for the thoughtful questions and helpful suggestion. This concern is addressed in detail in Point 3, where we clarify the conceptual and practical differences between our setting and traditional CAPTCHA solving.

Question 2
| Are there no other OCR models designed specifically for visually perturbed text?

While some OCR systems ([1, 2]) are designed to be robust against naturally occurring noise or degradation (e.g., blur, low resolution), they are typically not built to handle visually deceptive perturbations that are crafted to evade automated systems while remaining legible to humans.

Such perturbations often require the system to interpret not only the visual form but also the underlying semantic intent of the text. This shifts the task away from conventional character recognition toward semantic-aware restoration—a capability that lies beyond the standard design goals of traditional OCR models.

To our knowledge, there are no existing OCR systems that directly target this class of adversarial, meaning-preserving perturbations in an end-to-end restoration framework.

References
[1] https://github.com/JaidedAI/EasyOCR
[2] https://github.com/tesseract-ocr/tesseract

Question 3
| It could be helpful for readers if you included more examples from the GlyphSynth dataset in the appendix.

In response, we have added representative examples from the GlyphSynth dataset. These include a variety of visually perturbed instances across all categories (e.g., drug, hate, insult) and perturbation rates (e.g., p = 0.5, p = 1.0). These examples illustrate the visual complexity and diversity of the dataset, highlighting its relevance for realistic safety evaluation.
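
To give a concrete sense of what such perturbed instances look like, below is a minimal sketch of homoglyph-style substitution at a given perturbation rate p. The hand-written look-alike table and uniform random sampling are simplifications for illustration only; GlyphPerturber instead selects candidates via OCR-informed glyph similarity.

```python
import random

# Tiny illustrative homoglyph table; GlyphPerturber derives candidates from
# OCR-based glyph similarity rather than a fixed hand-written mapping.
HOMOGLYPHS = {
    "a": ["а", "@", "4"],   # Cyrillic a, at sign, digit
    "e": ["е", "℮", "3"],   # Cyrillic e, estimated sign, digit
    "i": ["і", "1", "!"],
    "o": ["о", "0", "ο"],   # Cyrillic o, zero, Greek omicron
    "s": ["ѕ", "5", "$"],
}

def perturb(text: str, p: float = 0.5, seed: int = 0) -> str:
    """Replace each mappable character with a look-alike with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        cands = HOMOGLYPHS.get(ch.lower())
        if cands and rng.random() < p:
            out.append(rng.choice(cands))
        else:
            out.append(ch)
    return "".join(out)

print(perturb("fentanyl", p=0.5))  # a perturbed variant such as "f℮ntаnyl"
print(perturb("fentanyl", p=1.0))  # every mappable character substituted
```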

Official Review
Rating: 7

The paper proposes GLYPHDECODE, a novel framework to combat content moderation evasion via visually perturbed text, where toxic words are disguised using visually similar Unicode characters (e.g., “f℮ntanyl” instead of “fentanyl”). It introduces two key components: GLYPHPERTURBER, which generates realistic visually perturbed text using OCR-based glyph similarity, and GLYPHRESTORER, a lightweight multimodal transformer that restores distorted text by fusing visual and textual features. The authors also release GLYPHSYNTH, a large-scale dataset containing perturbed toxic phrases across multiple categories (e.g., drug, hate, insult) in both word and sentence formats. Extensive experiments show that GLYPHDECODE significantly improves the ability of both closed and open-source Vision-Language Models (VLMs) to detect and decode harmful content that would otherwise bypass filters, outperforming baselines in both safety classification and restoration accuracy under adversarial scenarios.

Reasons to Accept

1. The paper addresses an underexplored yet critical aspect of content moderation: not just detecting but restoring visually perturbed text. This distinguishes it from most prior work, which focuses solely on detection or classification.

2. The paper evaluates GLYPHDECODE across multiple closed-source (GPT-4o, Claude, Gemini) and open-source (Qwen2-VL, Intern2.5-VL) models, showing consistent performance.

3. The authors release a new benchmark dataset (GLYPHSYNTH) with visually perturbed toxic content across five categories (hate, crime, insult, drug, sexual). The dataset includes both single words and multi-line phrases, providing realistic and diverse test conditions.

Reasons to Reject

The paper does not provide direct comparisons with existing jailbreaking baselines. This makes the evaluation less comprehensive.

Comment

We appreciate the reviewer’s acknowledgment of the importance of addressing restoration in content moderation, noting it as an understudied aspect. We also value the recognition of our multi-model evaluation and the contribution of the GLYPHSYNTH dataset. Below, we address the points raised by the reviewer:

Point 1. Additional jailbreaking baseline
| The paper does not provide direct comparisons with existing jailbreaking baselines. This makes the evaluation less comprehensive.

Thank you for the valuable suggestion. In response, we extended our evaluation to include a jailbreaking baseline, EvilText, introduced in prior work on evading NLP systems via character-level visual perturbations [1]. EvilText adopts a CNN + autoencoder architecture to generate perturbed text and provides an open-source implementation [2], making it a suitable and reproducible choice for evaluation.

To ensure a fair comparison, we followed the original EvilText protocol to construct a train/test split using toxic vocabulary, and report numerical results in the table below.

In the restoration setting (measured by Normalized Edit Distance), our GlyphRestorer—trained solely on the GlyphSynth dataset (E+P+V)—achieved strong results even on EvilText perturbations, despite no access to its training distribution. To further test this, we also trained GlyphRestorer directly on EvilText's training set. As expected, performance improved further, demonstrating our model's transferability and adaptability to new distortion patterns.

We also observed consistent improvements in the safety classification task, where visually perturbed inputs led to severe performance degradation in base models. Applying GlyphRestorer significantly restored performance across all models and perturbation settings.

While EvilText represents just one type of jailbreaking strategy, we selected it due to its availability, its direct manipulation of character-level features, and its relevance to our focus on visually grounded attacks. We agree that exploring additional attack types is a valuable future direction.

EVILTEXT (Restoration)

| Model | p=0.5 | p=1.0 |
| --- | --- | --- |
| OpenAI/gpt-4o | 45.07% | 18.58% |
| Anthropic/Claude | 50.54% | 35.63% |
| Google/Gemini | 52.39% | 36.71% |
| Intern2.5-VL-38B | 39.90% | 20.16% |
| Qwen2-VL-7B | 50.27% | 24.78% |
| Intern2.5-VL-78B | 31.24% | 16.31% |
| Qwen2-VL-72B | 53.42% | 30.53% |
| TesseractOCR | 59.65% | 24.84% |
| PaddleOCR | 62.60% | 30.80% |
| EasyOCR | 72.17% | 47.46% |
| GlyphRestorer (EOCR) | 77.82% | 60.31% |
| GlyphRestorer (POCR) | 80.07% | 62.01% |
| GlyphRestorer (VIPER) | 70.38% | 42.47% |
| GlyphRestorer (E+P+V) | 81.28% | 63.12% |
| GlyphRestorer (EVILTEXT) | 92.83% | 91.46% |

EVILTEXT (Safety Classification)

| Model | Multi-line (p=0.5 / p=1.0) | Single-word (p=0.5 / p=1.0) |
| --- | --- | --- |
| OpenAI/gpt-4o | 63.10% / 40.60% | 75.60% / 63.50% |
| Anthropic/Claude | 62.20% / 47.50% | 33.80% / 14.50% |
| Google/Gemini | 71.00% / 52.70% | 68.20% / 56.00% |
| Intern2.5-VL-38B | 53.90% / 29.80% | 13.60% / 1.70% |
| Qwen2-VL-7B | 71.30% / 45.40% | 66.30% / 44.40% |
| Intern2.5-VL-78B | 36.40% / 16.00% | 13.50% / 1.50% |
| Qwen2-VL-72B | 74.00% / 58.30% | 72.00% / 56.80% |
| GlyphDecode (OpenAI/gpt-4o) | 74.20% / 70.00% | 82.30% / 78.30% |
| GlyphDecode (Anthropic/Claude) | 88.90% / 86.70% | 77.50% / 74.60% |
| GlyphDecode (Google/Gemini) | 90.20% / 87.00% | 84.00% / 79.00% |
| GlyphDecode (Intern2.5-VL-38B) | 76.00% / 74.00% | 66.20% / 64.50% |
| GlyphDecode (Qwen2-VL-7B) | 77.20% / 75.20% | 66.30% / 62.00% |
| GlyphDecode (Intern2.5-VL-78B) | 76.80% / 77.60% | 68.30% / 66.00% |
| GlyphDecode (Qwen2-VL-72B) | 85.30% / 84.50% | 81.00% / 77.60% |

References
[1] Unicode Evil: Evading NLP Systems Using Visual Similarities of Text Characters
[2] https://bitbucket.org/srecgrp/eviltext/src/master/

Comment

Thanks for the rebuttal. The authors have resolved my concerns, so I have increased my score to 7.

Official Review
Rating: 7

This paper addresses the problem of visual content moderation where inputs have been corrupted via visual manipulations. It presents GLYPHDECODE, a framework designed to restore visually perturbed text to its original form. The approach operates in two steps: first, generation of visually perturbed text images for training; then, recovery of the original text via a multimodal transformer architecture. It has been evaluated on a new publicly available dataset and compared to state-of-the-art baselines in text restoration.

Reasons to Accept

-- a new dataset for evaluating attacks on LLMs via input perturbation

-- a clear positioning with respect to related work and a clear description of the methodology

-- A detailed presentation of prompts (in the Appendix)

Reasons to Reject

-- Qualitative analysis should be enhanced with representative examples.

-- Related work should also better highlight major differences with the new dataset, for example with Lee et al. (2025).

-- The paper should be proofread (many typos; missing references: Figures 1 and 2 are not referenced in the main text; Figure 1 should be moved from the title page; etc.).

Comment

We appreciate the reviewer’s recognition of our dataset for evaluating LLM vulnerabilities, the clarity of our methodological presentation, and the detailed prompt descriptions. Below, we address the points raised by the reviewer:

Point 1. Qualitative analysis
| Qualitative analysis should be enhanced with representative examples.

Thank you for the helpful suggestion. In response, we have expanded the qualitative analysis in two directions:

First, in addition to the perturbation examples already provided for GlyphPerturber, we now include a detailed qualitative analysis of GlyphRestorer. We present side-by-side examples of perturbed input and restored output, covering both successful and failure cases. This helps illustrate when restoration works well and where it still struggles, providing insight into model behavior beyond quantitative metrics.

Second, we also added representative examples from the GlyphSynth dataset. These include a variety of visually perturbed instances across all categories (e.g., drug, hate, insult) and perturbation rates (e.g., p=0.5, p=1.0). These examples demonstrate the visual complexity and diversity of the dataset, highlighting its relevance for realistic safety evaluation.

Due to OpenReview’s file restrictions, we provide these qualitative analysis figures via an anonymized external link:

We will include these visual examples in the revised version.

Point 2. Related work comparison
| Related work should also better highlight major differences with the new dataset, for example with Lee et al. (2025).

Thank you for pointing out the need to more clearly highlight the differences between our proposed dataset and prior work, including Lee et al. (2025). We agree that a clearer comparison can strengthen the motivation and novelty of our contribution.

To address this, we will revise the Related Work section to more clearly delineate how GLYPHSYNTH differs from existing datasets in scope, structure, and applicability:

  • Lee et al. (2025) [1] focuses on phishing-specific content (especially Bitcoin), whereas GLYPHSYNTH covers a broader range of socially harmful content, including hate, sexual, drug-related, and criminal material. This allows for evaluating safety classification in more diverse and realistic scenarios.

  • Lee et al. (2025) [1] and Wang et al. (2023) [2] provide text-only datasets without corresponding rendered images. However, visually perturbed text often includes complex combinations of Unicode characters, which cannot be faithfully rendered using standard text renderers. In contrast, GLYPHSYNTH uses a custom GlyphPerturber engine to generate text images with high visual fidelity, supporting mixed scripts, multi-line layout, and typographic realism (see the rendering sketch after this list). This enables robust evaluation under visually grounded attacks.

  • Seth et al. (2023) [3] is limited to single-word legibility evaluation on clean vocabulary, and does not target toxic or unsafe content. GLYPHSYNTH includes both single-word and multi-line examples, allowing for a broader set of tasks including restoration, safety classification, and contextual reasoning.
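
As referenced above, the sketch below shows one minimal way to rasterize perturbed Unicode text into an image with Pillow. The font path, sizes, and layout handling are placeholder assumptions for illustration, not GlyphPerturber's actual rendering engine.

```python
# pip install pillow
from PIL import Image, ImageDraw, ImageFont

def render_text_image(text: str, font_path: str = "NotoSans-Regular.ttf",
                      size: int = 32, pad: int = 8) -> Image.Image:
    """Render (possibly multi-line, mixed-script) text onto a white canvas.

    The font path is a placeholder; rendering perturbed Unicode faithfully
    requires a font with wide glyph coverage (e.g. a Noto family font).
    """
    font = ImageFont.truetype(font_path, size)
    probe = ImageDraw.Draw(Image.new("RGB", (1, 1)))
    # multiline_textbbox returns (left, top, right, bottom) of the rendered text
    l, t, r, b = probe.multiline_textbbox((0, 0), text, font=font)
    img = Image.new("RGB", (r - l + 2 * pad, b - t + 2 * pad), "white")
    ImageDraw.Draw(img).multiline_text((pad - l, pad - t), text,
                                       font=font, fill="black")
    return img

render_text_image("f℮ntаnyl fоr ѕale\nmeet me tonight").save("perturbed.png")
```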

We will incorporate this comparative discussion into the revised related work section.

References
[1] Lee et al, BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks
[2] Wang et al, Mttm: Metamorphic testing for textual content moderation software
[3] Seth et al, Learning the legibility of visual text perturbations

Point 3. Writing and formatting
| The paper should be proofread (many typos; missing references: Figures 1 and 2 are not referenced in the main text; Figure 1 should be moved from the title page; etc.).

We sincerely apologize for the overlooked typos and missing figure references. In the revised version, we will carefully proofread the paper, ensure that all figures (including Figures 1 and 2) are properly referenced in the main text, and relocate Figure 1 from the title page to a more appropriate position. We appreciate the reviewer’s attention to detail and will make all necessary corrections to improve the clarity and presentation of the paper.

Comment

Thanks for the detailed responses to my comments. Incorporating all these elements will strengthen the paper. I maintain my score.

Final Decision

Summary: The submission introduces the GlyphDecode framework, a two-step system to detect text perturbations intended to avoid content moderation systems by using special Unicode characters. The method relies on turning the text into an image (a visual representation of the characters) and having a multimodal model, GlyphRestorer, recover the original "hidden" text from the image. The method, trained on a synthetic dataset, is evaluated on a real-world corpus proposed by the authors, and is shown to be quite performant at the task of detecting harmful text that went undetected by the prior state of the art.

Reviewers' sentiment was generally fairly positive. There was a good amount of exchange during the discussion period, which led several reviewers to increase their scores based on in-depth responses (with additional analyses) from the authors.

Meta Review Recommendation: I recommend acceptance of the work, as it was clearly of interest to the reviewers. I do, however, strongly recommend that the authors include in their updated draft the experiments described during the discussion, including:

  1. The experiments involving EvilText and Detoxify (additional external baselines that confirm the strong performance of GlyphDecode)
  2. The results from the ablation study of the method.
  3. Additionally, as you move to a camera-ready, it is important to correct major presentation issues pointed out by all reviewers, such as typos in the abstract, Figures that are not referenced, etc.

This paper went through ethics reviewing. Please review the ethics decision and details below.
Decision: All good, nothing to do or only minor recommendations