PaperHub
Overall: 7.0/10 · Poster · 3 reviewers
Ratings: 4, 3, 4 (min 3, max 4, std 0.5)
ICML 2025

Synthesizing Images on Perceptual Boundaries of ANNs for Uncovering and Manipulating Human Perceptual Variability

Submitted: 2025-01-23 · Updated: 2025-08-05
TL;DR

By sampling along neural network perceptual boundaries, we generated images that induce high variability in human decisions and allow us to predict and manipulate individual behavior on these samples.

Abstract

Keywords
Perceptual Variability · Object Recognition · Behavior Manipulation · Behavioral Alignment

Reviews and Discussion

Review (Rating: 4)

This paper studies individual perceptual variability by generating controversial stimuli, i.e., images perceived differently by different individuals. To do so, the authors 1) sample images on the perceptual boundary of ANNs, 2) collect subject-specific labels through psychophysics experiments on the generated images, 3) train individually aligned neural network classifiers, and 4) generate images eliciting divergent responses between individuals. The authors demonstrate the validity of their method with human experiments.

Questions for the Authors

  • How exactly are the uncertainty or controversial guidance signals integrated into the diffusion process?
  • Are you using classifier-free guidance diffusion models, and if so, how are the dual-label conditions handled?
  • Which classifiers (architectures and specifics) were tested for generating controversial stimuli?
  • Could you clarify precisely how fine-tuning is conducted (GroupNet, IndivNet, BaseNet), and what are the exact targets for population-level fine-tuning?
  • What prevents applying the full analysis presented for MNIST to more complex datasets like ImageNet?
  • Can you discuss how the generative model’s biases impact the validity of perceptual variability experiments (versus GAN-based or other type of generative models)?

Claims and Evidence

The authors claim their computational framework can effectively produce controversial stimuli that uncover human perceptual variability. Evidence includes i) human experimental validation, demonstrating that synthesized images indeed elicit varying perceptual judgments, and ii) various quantitative analyses (entropy and accuracy analysis).

Methods and Evaluation Criteria

The methods combine i) perceptual boundary sampling from ANNs, ii) individually aligned classifiers, and iii) generation of images that elicit divergent responses between individuals. While the article is very clear and well written in general, I found Section 3, describing the perceptual sampling method, not clear enough. It would have been useful to detail more fully how the uncertainty and controversial guidance integrate into the diffusion process.

The evaluation process relies on human experiments, which makes sense for this article. In general, I found the method clear (except section 3), and the evaluation criteria are relevant.

Theoretical Claims

This is not a theoretical claim, but the authors assert that their diffusion-based method produces more "natural" controversial stimuli compared to existing approaches. This claim, however, is insufficiently supported, especially given the comparison is largely limited to MNIST rather than photorealistic domains.

Experimental Design and Analysis

Experiments predominantly focus on MNIST. The effectiveness of the method on ImageNet is briefly demonstrated but not thoroughly analyzed. Detailed analyses demonstrating perceptual boundary manipulations on complex datasets would substantially strengthen the experimental claims.

Supplementary Material

Essential information regarding the diffusion model architecture, classifier details, and fine-tuning procedures is insufficiently detailed in the supplementary materials. Important points, such as classifier architecture, pre-training datasets, and exact fine-tuning targets, are either missing or unclear.

Relation to Existing Literature

The paper aligns well with existing literature, addressing perceptual variability and generative modeling approaches.

Missing Important References

I have not noticed any important references that were not discussed.

Other Strengths and Weaknesses

Strengths:

  • Well motivated and clearly written.
  • Successfully demonstrates the possibility of manipulating perceptual variability.

Weaknesses:

  • Insufficient clarity and detail regarding the diffusion guidance mechanism.
  • Limited analysis beyond MNIST dataset.
  • Lack of clarity regarding classifier architecture and fine-tuning specifics.

Other Comments or Suggestions

  • Clearly label axes in Figure 3a to improve readability and comprehension (e.g., akin to Figure 1 from the Golan et al. article).
  • Correct the referencing error: Figure A.2 should be in the supplementary material, not the main article.
Author Response

Response to reviewer Aam6

We sincerely thank the reviewer for the support of our work. Below we address the reviewer’s concerns and questions:

  1. Question:
    How exactly are the uncertainty or controversial guidance signals integrated into the diffusion process?
    Response:
    The uncertainty and controversial guidance signals are integrated into the diffusion process using a classifier guidance method; more details can be found in Appendix A.2 (an illustrative sketch is given after this list).

  2. Question:
    Are you using classifier-free guidance diffusion models, and if so, how are the dual-label conditions handled?
    Response:
    We adopted classifier guidance methods for all of our experiments. Although classifier-free guidance diffusion models may offer interesting insights, they were not used in our work. In our approach, all guidance is introduced by the classifiers, and the dual-label conditions are handled by our proposed controversial guidance and uncertainty guidance.

  3. Question:
    Which classifiers (architectures and specifics) were tested for generating controversial stimuli?
    Response:
    For generating controversial stimuli using the base models (trained from scratch on the MNIST dataset), all five models are paired with each other. The guidance outcomes are presented in Figure A.14.

  4. Question:
    Could you clarify precisely how fine-tuning is conducted (GroupNet, IndivNet, BaseNet), and what are the exact targets for population-level fine-tuning?
    Response:
    We apologize for any confusion. For detailed information on the fine-tuning process, please refer to point 6 of our rebuttal to Reviewer 62LP. The goal of the population-level fine-tuning is to align the models with the population-level characteristics. Starting from GroupNet for the fine-tuning of IndivNet provides the model with a human group-level prior and helps prevent overfitting, particularly since the varMNIST-i dataset contains only a small amount of data.

  5. Question:
    What prevents applying the full analysis presented for MNIST to more complex datasets like ImageNet?
    Response:
    Please refer to point 1 of our rebuttal to Reviewer YcpF.

  6. Question:
    Can you discuss how the generative model’s biases impact the validity of perceptual variability experiments (versus GAN-based or other types of generative models)?
    Response: We chose classifier-guided diffusion models for their superior performance, stability, and flexibility. While we explored alternatives (e.g., VAEs and prior-free guidance [1], as shown in Figure A.3), none achieved the quality or precise individual alignment that our approach provides. We acknowledge that a comprehensive comparison of generative model biases would indeed provide more insight; however, it would require developing new methodologies based on various backbone models, which falls outside the scope of this work.

  7. Question:
    The authors assert that their diffusion-based method produces more "natural" controversial stimuli compared to existing approaches. This claim, however, is insufficiently supported, especially given the comparison is largely limited to MNIST rather than photorealistic domains.
    Response:
    We apologize for any confusion in our writing. By "natural," we mean stimuli that are closer to the original distribution of the dataset. For example, in Figure A.3, stimuli generated by methods without a prior would not be recognized as digits by human participants, thereby classifying them as out-of-distribution or "unnatural" in our terms. We deliberately chose this terminology to distinguish our approach from others, such as the one proposed by Golan [1]. Although Golan’s method can generate an image x that causes classifier f1 to label it as y1 and classifier f2 as y2, it fails to produce images that human participants recognize as digits—a shortcoming that our method effectively overcomes.
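The sketch below illustrates how classifier guidance can carry the two signals discussed in this response: a controversial term that pulls a sample toward the region where two classifiers disagree in a prescribed way, and an uncertainty term that pulls it toward a single classifier's decision boundary. The function names, the `diffusion.p_mean_variance` interface, and all hyperparameters are illustrative assumptions for exposition, not the authors' implementation (see Appendix A.2 for that).

```python
import torch
import torch.nn.functional as F

def controversial_grad(x_t, f1, f2, y1, y2):
    # Gradient of log p_f1(y1 | x_t) + log p_f2(y2 | x_t) w.r.t. x_t:
    # pushes the sample toward images that f1 reads as y1 but f2 reads as y2.
    x = x_t.detach().requires_grad_(True)
    logp = (F.log_softmax(f1(x), dim=-1)[:, y1]
            + F.log_softmax(f2(x), dim=-1)[:, y2])
    return torch.autograd.grad(logp.sum(), x)[0]

def uncertainty_grad(x_t, f):
    # Gradient of the prediction entropy of f w.r.t. x_t:
    # pushes the sample toward f's decision boundary.
    x = x_t.detach().requires_grad_(True)
    p = F.softmax(f(x), dim=-1)
    entropy = -(p * (p + 1e-12).log()).sum(dim=-1)
    return torch.autograd.grad(entropy.sum(), x)[0]

@torch.no_grad()
def guided_reverse_step(diffusion, x_t, t, grad, scale=1.0):
    # Standard classifier-guidance shift of the reverse-step mean,
    # mu <- mu + scale * Sigma * grad (Dhariwal & Nichol, 2021).
    # `diffusion.p_mean_variance` is a placeholder for the model's
    # predicted reverse mean and variance at timestep t.
    mean, var = diffusion.p_mean_variance(x_t, t)
    return mean + scale * var * grad + var.sqrt() * torch.randn_like(x_t)
```

At each timestep the relevant gradient (controversial, uncertainty, or a weighted sum of both) would be computed on the current noisy sample and passed to the reverse step.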

References:

  1. T. Golan, P. C. Raju, and N. Kriegeskorte, "Controversial stimuli: Pitting neural networks against each other as models of human cognition," Proceedings of the National Academy of Sciences, vol. 117, no. 47, pp. 29330–29337, 2020.
Review (Rating: 3)

The functional alignment between artificial neural networks (ANNs) and the human visual system has been a major hot topic in recent years. In this study, the authors generated images that lie on the perceptual boundaries of various ANNs and examined their relationship with individual differences in human perception. Experimental results using MNIST demonstrated that the images generated by this method effectively explain individual variability in human category judgments and can be used to manipulate perceptual variability.

Questions for the Authors

Are there specific reasons for the large error bars on the customized dataset in the guidance success rates shown in Figure 6c?

Claims and Evidence

The study comprehensively presents its claims and supporting experiments. However, the organization of the paper is not well-structured.

In general, it is difficult to derive generalizable conclusions only from the MNIST dataset due to low diversity, and analysis using natural images is necessary. The Introduction and Methods sections describe the paper as if this study only used MNIST. However, in Figure 4b, natural image data is introduced without any prior explanation, and the figure result of perceptual variability suggests that natural images are more appropriate for the analysis. It is only in Section 5.2 that the use of natural images is explicitly mentioned, but the results are only in the Appendix.

Instead, the authors should restructure and reframe the paper around the results obtained from natural images, which are more relevant for discussing perceptual variability.

Methods and Evaluation Criteria

The MNIST dataset is not suitable for examining human perceptual variability, as evident from Figure 4b, where participants' judgments appear to be highly consistent. This suggests that MNIST lacks the complexity necessary to capture meaningful individual differences in perception.

Theoretical Claims

N/A. No theoretical (mathematical) proofs are presented in this manuscript.

Experimental Design and Analysis

In addition to reporting the results, the authors should provide further analysis to explain why fine-tuning performance varies across different model architectures. A deeper investigation into the underlying factors driving these differences would strengthen the study's findings.

Supplementary Material

As mentioned above, the results from natural images should be incorporated into the main text rather than being placed in the supplementary material. These findings should be presented as the core focus of the paper.

Relation to Existing Literature

This study may provide broader insights into understanding personalized models, which take into account individual differences across populations, including individuals with mental disorders or neurodevelopmental conditions. Such models are needed for appropriate diagnosis of these individuals.

Missing Important References

Appropriate references are discussed in this manuscript.

Other Strengths and Weaknesses

Other weaknesses:

The use of technical terms in the paper lacks consistency and deviates from standard terminology in cognitive science. For example, a numerical category classification task is not typically referred to as "decision making."

Additionally, the paper alternates between terms like "decision boundaries of classifiers" and "ANN perception boundary," suggesting that the authors may conflate perception and decision-making processes. The terminology should be clarified and used consistently to avoid confusion.

Other Comments or Suggestions

In Figure 4, the colors in the legend do not match those in the bar graph. The authors should ensure consistency between the legend and the figure for clarity.

Author Response

Response to reviewer YcpF

We sincerely thank the reviewer for the support of our work. Below we address the reviewer’s concerns and questions:

  1. Comment:
    The MNIST dataset is not suitable for examining human perceptual variability, as evident from Figure 4b, where participants' judgments appear to be highly consistent. This suggests that MNIST lacks the complexity necessary to capture meaningful individual differences in perception.
    Response:
    We agree with the reviewer that participants' judgments on the MNIST dataset appear highly consistent and that these stimuli are less complex than natural images. We conducted the major experiments using handwritten digits for two main reasons:

    • Efficiency: The models are much easier to train and require less time, which is very important for a large-scale behavioral experiment like ours.
    • Perceptual Variability: As mentioned by Reviewer 62LP and Reviewer YcpF, perceptual variability is harder to evoke on handwritten digits. If we can evoke perceptual variability even on a highly consistent dataset such as MNIST, then the approach should transfer to more complex datasets where human perceptual variability is more diverse. Based on this assumption, we evaluated our framework on natural images and observed that perceptual variability is indeed larger on natural images than on handwritten digits. Although further experiments on natural images were not conducted, it is reasonable to assume that individual perceptual boundaries are more complex on natural image datasets. This complexity might require additional rounds of experiments or a parameter-efficient fine-tuning method (e.g., LoRA), as the models would become bigger and more complex. We plan to conduct more experiments and analyses on natural images in future work. We hope the reviewer can agree with our considerations.
  2. Comment:
    In addition to reporting the results, the authors should provide further analysis to explain why fine-tuning performance varies across different model architectures. A deeper investigation into the underlying factors driving these differences would strengthen the study's findings.
    Response:
    We appreciate the reviewer's suggestion, as further analysis of the differences in fine-tuning performance across various model architectures could indeed provide valuable insights. However, the primary focus of this paper is to demonstrate the feasibility of aligning humans and AI using our method and to explore the potential for manipulating perceptual variability within our framework. Given the limited space available, we believe that an in-depth analysis of architectural differences would be better suited for a separate, more focused study.

  3. Comment:
    The use of technical terms in the paper lacks consistency and deviates from standard terminology in cognitive science. For example, a numerical category classification task is not typically referred to as "decision making." Additionally, the paper alternates between terms like "decision boundaries of classifiers" and "ANN perception boundary," suggesting that the authors may conflate perception and decision-making processes. The terminology should be clarified and used consistently to avoid confusion.
    Response:
    Thank you for pointing out this inconsistency. We will pay more attention to the use of terms and revise the text to ensure consistency with standard cognitive science terminology.

  4. Comment:
    In Figure 4, the colors in the legend do not match those in the bar graph. The authors should ensure consistency between the legend and the figure for clarity.
    Response:
    We thank the reviewer for highlighting this issue. We will revise the figure to ensure that the legend colors match those used in the bar graph.

  5. Question:
    Are there specific reasons showing the large error bars in the customized dataset for guidance success rates in Figure 6c?
    Response:
    We admit that the customized manipulation experiment can be difficult to conduct, and the effect of manipulation is largely influenced by the individual status of the participants due to the need for precise alignment with each participant. Under these circumstances, large error bars can occur. Despite the large error bars and relatively low improvement, our statistical analysis confirmed that the results are statistically significant (p < 0.001).

Review (Rating: 4)

This paper studies human perceptual judgements by generating controversial stimuli that lie close to the boundary between different classes. The experiments first show that fine-tuning vision networks on data collected from human judgements enables these models to better capture human behavior. They then show that, using these models, they can generate new stimuli that can sometimes specifically bias individuals towards particular choices.

Update after rebuttal

Several issues were clarified. The discussion helped with understanding the practical issues in scaling the approach to natural images. I increased my score by 1 as a result. The authors agreed to make changes to address the issues and ambiguities.

Questions for the Authors

  • Were subjects given extended time or was there a time limit in labeling the images? This is important because, some of the individual differences may be due to arbitrary position of gaze that may have influenced the declared labels.

  • Line 232: the varMNIST-i dataset is not defined. Is that a subset of samples labeled by one individual? If so, was one model trained per individual? The dataset and procedure should be better explained.

  • Line 236: the description of model training is confusing. The first part of the description suggests that the network was jointly trained on two datasets, but the latter part and the term "finetuning" suggest first training on MNIST and then on varMNIST.

  • were there shared images across different individuals? how were the labels for repeated images from different individuals combined?

  • It is unclear whether the networks used in the study (VGG, ViT, CORnet) were trained from scratch or whether pretrained networks were used. The text gives me the impression that they were trained from scratch, but the number of parameters in these networks and the relatively small dataset size make me doubt that.

  • there are no references to the models used in the study

  • figure 6: it's unclear what the "individually customized dataset" is. Not explained.

Claims and Evidence

  • The success rate in selective manipulation of human behavior is relatively low, showing limited success of the proposed approach. This is especially true for handwritten digits, with a ~20% success rate. Surprisingly, while the initial experiment using natural images seems to be much more successful, none of the following experiments were conducted using natural stimuli.

  • Section 5.2: the IndivNet only improves the behavioral prediction accuracy by 5% over GroupNet, yet it is claimed that this approach captures individual differences in perceptual judgments. The results seem to suggest that the model training mostly captures the group effect in this behavior.

Methods and Evaluation Criteria

The methods are appropriate.

Theoretical Claims

No new theory was proposed.

Experimental Design and Analysis

They are generally appropriately designed although I have several questions about the details of how they are conducted.

Supplementary Material

No supplementary material was provided

Relation to Existing Literature

The introduction and discussion sections put the paper into the perspective of the prior literature. Especially the work by Golan and Kriegeskorte is very closely related to the current work.

Missing Important References

All relevant work was cited.

Other Strengths and Weaknesses

Strengths:

  • the idea of capturing individual differences in human subjects by collecting personal data and training neural networks on them is interesting.

  • the paper was well structured, mostly well-written with clear figures

Weaknesses:

  • the dataset is almost too easy and restrictive

  • despite being well written, many details were missing

  • overall, the method does not appear to be very effective, especially in the context of capturing individual differences and successfully biasing judgements in a selective way

Other Comments or Suggestions

Typos: line 161: extra parentheses; line 173: incorrect figure reference; Fig. A11 caption: incorrect subplot indicators.

Ethics Review Concerns

Human behavior experiments require IRB or similar approvals but they were not mentioned in the text.

Author Response

Response to reviewer 62LP

We sincerely thank the reviewer for the positive feedback. Below we address the reviewer’s concerns and questions:

  1. Comment: The success rate in selective manipulation of human behavior is relatively low, showing limited success of the proposed approach. This is especially true for handwritten digits with ~20% success rate. Surprisingly, while the initial experiment using natural images seems to be much more successful, none of the following experiments were conducted using natural stimuli.
    Response: Please refer to point 1 of the rebuttal to Reviewer YcpF.

  2. Comment: Section 5.2: The IndivNet only improves the behavioral prediction accuracy by 5% over GroupNet, yet it is claimed that this approach captures individual differences in perceptual judgments. The result seems to suggest that the model training mostly captures the group effect in this behavior.
    Response: We believe the relatively small improvement is due to the similarity between the individual perceptual boundary and the group-level perceptual boundary. We hope the reviewer agrees that it is reasonable to assume that individuals, in general, are close to the group, resulting in a relatively small improvement.

  3. Response to typos:
    We thank the reviewer for pointing out these mistakes. We will reorganize the appendix and revise the text in a future version.

  4. Question: Were subjects given extended time or was there a time limit in labeling the images?
    Response: In the major experiments, all subjects were given extended time to reduce random effects that could impact the experiment, such as arbitrary gaze positions or unexpected distractions.

  5. Question: (Line 232) The varMNIST-i dataset is not defined. Is that a subset of samples labeled by one individual? If so, was one model trained per individual? The dataset and procedure should be better explained.
    Response: The varMNIST dataset consists of images with multiple labels (labeled by different participants or trials, as described in the paper). Each participant performed around 500 trials on different images. The dataset corresponding to each participant is referred to as varMNIST-i, and one model is trained per individual. We appreciate your valuable suggestions and will clarify the dataset and procedure in future revisions to the main text.

  6. Comment: (Line 236) The description of model training is confusing. The first part suggests that the network was jointly trained on two datasets, but the latter part and the term "finetuning" suggest first training on MNIST and then on varMNIST.
    Response: The training process is conducted as follows (a schematic sketch is given after this list):

    • Base Model: First, we train a model using only the MNIST dataset.
    • Group Model: Next, we finetune the base model to align it with the human group level. The finetuning dataset is constructed by mixing MNIST data into our varMNIST data, a common technique to prevent overfitting and model forgetting. This process yields the group model.
    • Individual Model: Finally, based on the group model, we perform another finetuning at the individual level to align the model with each individual. This finetuning dataset is created by mixing MNIST, varMNIST, and varMNIST-i together. The training and evaluation sets are carefully divided so that even if there is an overlap between varMNIST and varMNIST-i, nothing in the evaluation set appears in the training set. Parameter details can be found in Appendix C.2.
  7. Question: Were there shared images across different individuals? How were the labels for repeated images from different individuals combined?
    Response: Yes, there are shared images across different individuals. Each image-label pair is treated as a trial and used in the finetuning process.

  8. Question: It is unclear whether the networks used in the study (VGG, ViT, CORnet) were trained from scratch or if a pretrained network was used.
    Response: The base models are trained from scratch. The group models are finetuned from the base models, and the individual models are finetuned from the group models, as explained in point 6.

  9. Comment: There are no references to the models used in the study.
    Response: We sincerely appreciate your suggestion and will add references for these models in future revisions.

  10. Comment: Figure 6: It is unclear what the "individually customized dataset" is.
    Response: The individually customized dataset is generated under the guidance of finetuned individual models to better evoke variability and bias participants toward certain directions.

  11. Ethics
    Response: We mentioned the ethics approval in Appendix B.2.2. Regarding human data collection, we can provide ethics documents and participant agreements upon request. We will incorporate these details into the main text in future revisions.
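The following is a minimal sketch of the three-stage fine-tuning pipeline described in point 6 above. `make_classifier`, `mnist`, `var_mnist`, and `var_mnist_i` are placeholders for the actual architectures and dataset splits, and the loop is a generic cross-entropy fine-tuning routine rather than the authors' training code (Appendix C.2 gives the actual parameters).

```python
import copy
import torch
import torch.nn.functional as F
from torch.utils.data import ConcatDataset, DataLoader

def finetune(model, dataset, epochs=5, lr=1e-4, batch_size=64):
    # Generic supervised fine-tuning on (image, label) trials.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Stage 1 -- BaseNet: trained from scratch on MNIST only.
base_net = finetune(make_classifier(), mnist)

# Stage 2 -- GroupNet: fine-tune BaseNet on varMNIST mixed with MNIST;
# mixing in MNIST guards against overfitting and forgetting.
group_net = finetune(copy.deepcopy(base_net), ConcatDataset([mnist, var_mnist]))

# Stage 3 -- IndivNet: fine-tune GroupNet per participant i on
# MNIST + varMNIST + varMNIST-i, with evaluation trials held out.
indiv_net = finetune(copy.deepcopy(group_net),
                     ConcatDataset([mnist, var_mnist, var_mnist_i]))
```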

Reviewer Comment

Thank you for the additional clarifications.

  1. Re comment1: finetuning CNN models should be feasible even in academic settings. As I mentioned in my original comment, the initial experiments are more successful with the natural images, and I expected to see at least some further results on that aspect. I'm unsure whether the "efficiency" argument is justified here. Can you show any indication that you could generalize the approach to that setting?

  2. Re comment2: I agree that subjects would mostly agree in their behavioral judgements. Can you quantify how much of the remaining accuracy gap is due to group-effect vs. individual?

Author Comment

We sincerely thank the reviewer for the thoughtful and constructive feedback.

Response 1

a. We first conducted collection and manipulation experiments on the digit dataset, followed by collection experiments on natural images. Due to time constraints and high experimental costs, we were unable to complete the manipulation experiments on natural images. Although natural images showed a significantly higher success rate (~60%, Fig. 4) compared to digits (~20%) in the collection experiments, fine-tuning revealed that the accuracy gap between IndivNet and GroupNet for natural images was only ~2% (Fig. A.24, compared to ~5% for digits). This suggests that the 400 samples per participant may be insufficient for adequately fine-tuning natural image classification models, and further increasing the number of trials presents practical challenges. Since the manipulation experiment relies on IndivNet accurately capturing individual preferences, we did not proceed with natural image manipulation. Our earlier mention of "efficiency" referred to this issue, though the explanation was not sufficiently clear.

b. As previously noted, while we lack follow-up experimental data, it is reasonable to assume that training IndivNet on natural images would pose greater challenges. This might require specially designed parameter-efficient fine-tuning methods (e.g., LoRA) and additional large-scale behavioral data collection. Addressing this may necessitate a dedicated follow-up study, and we hope the reviewer understands our considerations.

c. The collection experiment results (Fig. 4) show that eliciting perceptual variability is more challenging on the digit dataset, which in turn demonstrates the effectiveness of our perceptual boundary sampling method across different datasets. We also generated some stimuli based on the fine-tuned individual models on ImageNet, available at https://anonymous.4open.science/r/Figures-7CFB/ImageNet_Individual.png. Though we did not conduct further experiments, some of the images do display potential for manipulation applications. Thus, the success of the digit manipulation experiments suggests potential applicability to natural images.

d. We appreciate the reviewer’s feedback and will add these explanations to Section 6, Discussion and the appendix to clarify why natural image manipulation experiments were not pursued.


Response 2

Our dataset was specifically designed to elicit perceptual variability, resulting in many samples that GroupNet cannot reliably predict. To facilitate understanding, accuracy can be decomposed into three types of uncertainty: group epistemic uncertainty, individual epistemic uncertainty, and aleatoric uncertainty. Here, individual epistemic uncertainty reflects variations due to individual differences, while aleatoric uncertainty captures intra-subject variability. Additional experiments show that increasing data volume has diminishing returns on GroupNet accuracy (Table 1), whereas IndivNet accuracy continues to improve significantly (Table 2) using VGG models. The tables are at https://anonymous.4open.science/r/Figures-7CFB/tables.md. We also provide a figure showing relative accuracies (improvement at a reduced data volume / improvement at full data volume) vs. data volume at https://anonymous.4open.science/r/Figures-7CFB/ImprovementVSdataamount.png. However, since repeated presentation of the same stimulus introduces trial-to-trial interference, the impact of aleatoric uncertainty is difficult to quantify. Thus, we estimate that individual epistemic uncertainty accounts for at least a 5% accuracy gap (i.e., IndivNet minus GroupNet), though a more precise estimate remains challenging.
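One way to write the decomposition invoked here (our notation, not the paper's): the error of a model aligned at a given level splits into an irreducible intra-subject term and two epistemic terms, only one of which GroupNet can remove.

```latex
1 - \mathrm{Acc}(\text{model}) \;\approx\;
\underbrace{\varepsilon_{\text{aleatoric}}}_{\text{intra-subject variability}}
\;+\;
\underbrace{\varepsilon^{\text{group}}_{\text{epi}}}_{\text{removable by GroupNet}}
\;+\;
\underbrace{\varepsilon^{\text{indiv}}_{\text{epi}}}_{\text{removable only by IndivNet}}
```

Under this framing, the IndivNet-minus-GroupNet accuracy gap (~5%) is a lower bound on the individual epistemic term, since the aleatoric term is shared by both models and hard to quantify.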

Final Decision

All reviewers initially shared a positive view of this paper. The rebuttal was helpful in resolving some misunderstandings, and the authors are expected to update the paper accordingly.

In particular, the reviewers identified the following strengths:

  • capturing individual differences in human subjects by collecting personal data and training neural networks on them is interesting,
  • paper was well structured, mostly well-written with clear figures,
  • successfully demonstrates the possibility of manipulating perceptual variability,
  • experimental results using MNIST demonstrated that the images generated by this method effectively explain individual variability in human category judgments and can be used to manipulate perceptual variability,
  • this study may provide broader insights into understanding personalized models, which take into account individual differences across populations, including individuals with mental disorders or neurodevelopmental conditions.

The authors should move the statement about the ethical approval of their experiment into the main text. Because they only received approval from their university and do not have an IRB number, they should provide proof that they received this approval and add it to the appendix.