What Makes ImageNet Look Unlike LAION
We recreated ImageNet on the basis of LAION; we explain why it is unlike the original. The answer reveals a profound fact about dataset creation.
Abstract
Reviews and Discussion
This paper investigates the differences between ImageNet and LAIONet, a dataset the authors create from LAION. Through three carefully designed experiments, the authors claim that an information bottleneck explains why ImageNet is less diverse than LAIONet.
Strengths
- The findings about the information bottleneck in the paper are very interesting and insightful for future data curation efforts.
Weaknesses
- The abstract states the "long-held intuition" that ImageNet images are "stereotypical, unnatural, and overly simple representations". I don't find enough references for this in Section 1.2 or any other section.
- One important difference between ImageNet and LAION is their data sources - the former is from Flickr and the latter is from CommonCrawl. The two data sources will certainly exhibit different data distributions and levels of diversity. The reviewer thinks this should also be taken into consideration in the analysis.
- Both DataComp and LAION come from CommonCrawl. In DataComp [1], the images come from a much wider variety of sources than Flickr alone (Fig. 13 in [1]).
[1] Gadre, Samir Yitzhak, et al. "DataComp: In search of the next generation of multimodal datasets." arXiv preprint arXiv:2304.14108 (2023).
Questions
None
Thank you for the review of our work.
Thank you for pointing out the lack of references regarding ImageNet's uniqueness. The seminal work "Unbiased Look at Dataset Bias" (Torralba and Efros) initiated such studies. We will consider providing further references in the final version.
We agree that LAION and ImageNet are created from different data sources. However, as long as images on Common Crawl and Flickr are sufficiently rich, we can assume they have the potential to contribute diverse images for each class. Therefore, the difference in intra-class similarity should be explained by identifying where this diversity is lost in the selection mechanism.
Thank you for pointing us to DataComp. We believe similar studies, including a replication of our work, potentially for other benchmarks, can be conducted on DataComp as well, and we will suggest this in our updated discussion.
Thanks for the replies. I would like to increase my current rating.
- The paper proposes LAIONet, an ImageNet-like dataset created from LAION-400M
- The dataset is created by keeping instances with an image-text CLIP similarity of at least 0.3. Next, the images are selected based on the occurrence of the ImageNet category synset name in the caption plus a high similarity between the caption and the synset definition.
- The paper then analyzes LAIONet and finds that it is distinctly unlike ImageNet -- the intra-class similarity is lower (a sketch of one such similarity measure follows this summary), and the accuracy of ImageNet-trained models drops by 5-12% on LAIONet.
- The paper then shows that the difference arises because ImageNet relied on the image content for the selection process, and that relying on just the text captions creates an information bottleneck which mitigates the selection bias.
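A minimal sketch of one way such an intra-class similarity can be computed, assuming precomputed CLIP image embeddings (the function and variable names here are illustrative, not the paper's code):

```python
import numpy as np

def intra_class_similarity(embeddings: np.ndarray) -> float:
    """Average pairwise cosine similarity of one class's image embeddings.

    `embeddings` is an (n, d) array of image embeddings (e.g., from a CLIP
    image encoder) for a single class; all names are illustrative only.
    """
    # Normalize rows so that dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    n = sim.shape[0]
    # Average over all ordered pairs, excluding the self-similarity diagonal.
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))
```

Under this reading, a systematically lower value on LAIONet than on ImageNet, class by class, is what the summary above describes as LAIONet being more diverse.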
Strengths
- The paper looks into the data creation process and how to mitigate biases in the process, which is important for the community
- The paper is easy to read and understand, and all the experiments are explained very clearly
Weaknesses
- The paper claims in Section 1.1: "Choosing an image reveals nothing more about the image than what can be learned from its textual representation. This powerful conditional independence property limits how much selection can bias the distribution of the image. In contrast, in the case of ImageNet (Figure 2b), there is a link from the image to the selection decision." This isn't accurate -- choosing an image gives more information than the text representation used for LAIONet selection, namely the CLIP image-text similarity with a high threshold of 0.3. LAIONet doesn't remove image content from the selection criteria; it just uses a CLIP model to do the image-text-based selection instead of humans. There is a lot of focus on LAIONet relying only on text to get closer to the "true" distribution and avoid bias, whereas LAIONet is actually getting closer in distribution to a dataset of concepts that CLIP recognizes and is biased towards CLIP's understanding of those concepts. It is possible that CLIP has a different and lower bias than human annotators, but there is no discussion of this.
- The paper claims that ImageNet uses the image content heavily for selection, while for LAIONet there is an information bottleneck. It claims in Section 1.1 that "Selecting on the basis of the text caption, therefore, retains much of the entropy present in the image distribution" -- while this statement would be true theoretically for a noise-free dataset, the paper never touches upon or even considers the fact that LAION is a noisy dataset. Using a noisy dataset will produce a higher-entropy, more diverse dataset on account of mislabeled images as well. The noise also affects the performance of models -- it is not clear how much of the models' performance drop on LAIONet comes simply from mislabeled images. The paper has a fundamental flaw in that it considers only one dimension, diversity, and creates LAIONet to be more diverse, without ever considering the label-noise dimension -- diversity and noise are inversely correlated.
- The paper then mentions in Section 2: "We found CLIP zero-shot top 1 accuracy to only differ by 2% across datasets. Hence, at least from the CLIP view, LAIONet images are not harder to classify. ..." This discussion also has a flaw -- CLIP was used to filter the images to begin with, so there is an inherent bias here, where the test set was created using the same model that is being evaluated.
- The section about "A WEAKER IMAGE-TO-SELECTION LINK MAKES IMAGENET MORE LIKE LAIONET" also completely ignores noise and just mentions "weaker image-to-selection link", wherein lower MTurk selection frequency results in a distribution closer to LAIONet. There is again a confounding factor at play, which is the noise in labels -- if the MTurk selection frequency is lower, it means that the likelihood of mislabeling is higher.
- There is a discussion on figuring out whether images were used for selection for the creation of ImageNet (section 4.2, section 4.3). "These observations reject the hypothesis that the graphs of Figure 2 have the same structure and show a potential leak from the image to the selection." -- a leak suggests this was unintentional, whereas it is known ImageNet was created by looking at the images' content. I am not sure what the point / contribution of this discussion is?
- Also, Section 4.3 creates a subset which is not like ImageNet, but also not like LAIONet; this is a third setting where the image isn't used at all, since this section doesn't use CLIP-based filtering.
Questions
- The paper has two limitations which need to be addressed --
- There is no discussion of noise at all in the datasets; the paper just talks about diversity. At a bare minimum, all analyses should have shown and compared the prevalence of noise in ImageNet and LAIONet. Only then can the conclusions made in the paper regarding the image-to-selection link and/or diversity be drawn.
- The paper ignores the contribution of CLIP thresholding on the creation of LAIONet -- this creates a very strong link to the image content as well in the creation of LAIONet, and also adds a different bias from CLIP. A threshold of 0.3 is very high, and this thresholding is directly connected to the noise and diversity of LAIONet but there isn't any discussion around this either.
- I am not sure what the added value is of testing the hypothesis of whether the ImageNet data collection used image content, when it is already known that it did.
- The paper also mentions that models perform worse on more frequent classes, but the analysis is only shown on LAIONet -- this is a surprising result, given that frequent classes will be seen more often during training, and models are expected to perform worse on infrequent classes. Does this only happen on LAIONet, or does it happen on other datasets as well, specifically ImageNet? It could also be that frequent classes have a different label-noise rate on LAIONet?
Thank you for a helpful review of our work. In the following, we discuss two main concerns raised in the review: 1) The contribution of CLIP thresholding in the creation of LAION, and 2) the possibility of noisy labels. We will also address other raised concerns in the last section of the response.
1. Contribution of CLIP thresholding in the creation of LAION
Thank you for this insightful question. We agree that image content is utilized in the creation of LAION where images with CLIP image-text similarity of 0.3 or higher are selected. However, in this process, the image content is compared to its caption, limiting the leak of image information to the extent that can be conveyed by the caption. Therefore, such image-text matching can only distort the distribution of images to the extent that the text can reveal information about the image. Given that images are a far richer modality than text, the text still serves as an information bottleneck. Compare this to the case when a human labeler directly examines the images, with all the potential biases and shortcomings they may have in recognizing the concept. Or when a search engine directly acts on image embeddings or uses popularity metrics to return the results. In these cases, the information leak can be much greater than the leak through matching with the caption.
In fact, for the specific selection mechanism we employ, the situation is even better. The worst-case scenario when LAION was created would involve a leak of information from the image beyond the class itself, such as the clarity of the image or the difficulty of identifying the object. Even if such information is present in the text and influences the selection of images into LAION, by intentionally avoiding the search for visual descriptions in the caption, we have grounds to believe that our LAIONet selection mechanism is not exploiting this leaked information and is thus less susceptible to selection bias.
Lastly, while it's conceivable that a leak of image information occurs when LAION instances are selected based on the 0.3 threshold, the degree to which this leak influences the final images is questionable. Notably, this leak occurs in the initial stage of obtaining a candidate pool; however, our final selection is predominantly influenced by the stringent requirement on textual similarity while deliberately neglecting visual descriptions. Our strict criteria for this similarity significantly reduce the dataset size from over 12M to less than 1M, serving as the primary requirement for image selection.
In summary, we have never denied the possibility of an information leak from the image through the text. In our depictions of causal graphs in Figure 2, we explicitly illustrate the bidirectional link between the image and the text. For instance, image information naturally leaks into the text when someone writes a caption describing the image, as was also evident when LAION candidates were filtered based on image-text similarity. The essence of our argument is however that this leak and the potential distortion of the final selected images are limited.
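One way to state this limitation formally (our notation for illustration: S the selection indicator, X the image, T the caption): if selection depends on the image only through its caption, then

```latex
S \perp X \mid T
\quad\Longrightarrow\quad
p(x \mid S = 1) \;=\; p(x)\,
\frac{\mathbb{E}\!\left[\Pr(S = 1 \mid T)\,\middle|\,X = x\right]}{\Pr(S = 1)},
```

so selection can reweight the image distribution only through what the caption reveals about the image; any residual leak comes from the image-to-caption link itself.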
2. Potential noisy labels
While we cannot dismiss the existence of noisy labels, we believe that our selection criteria work to minimize their presence. Our experiments further support this claim, as we elaborate in the following.
When creating LAIONet, we approached every decision in the most conservative manner, prioritizing high-quality labels. We ensured the exact name of the class is present in the text, and we selected the largest possible textual-similarity threshold between the class description and the text such that a majority of classes are still covered. Additionally, we addressed a common issue with web-crawled images, where the image is merely an image of text, by employing text detection and recognition tools. In each of these steps, our choices were made in the most stringent manner.
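As a rough sketch of this text-only selection logic (field names, the lemma matching, and the threshold handling are assumptions for illustration, not the exact LAIONet code; the OCR-based filter is only indicated in a comment):

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative sketch of the selection criteria described above; not the paper's code.
# The main pipeline uses the CLIP text encoder; MPNet is the Appendix A variant.
encoder = SentenceTransformer("all-mpnet-base-v2")

def select_for_class(laion_rows, lemmas, definition, sim_threshold):
    """Keep LAION rows whose caption names the class and matches its definition."""
    definition_emb = encoder.encode(definition, convert_to_tensor=True)
    kept = []
    for row in laion_rows:  # each row is assumed to carry a "caption" field
        caption = row["caption"]
        # 1) The exact class name (one of its lemmas) must appear in the caption.
        if not any(lemma.lower() in caption.lower() for lemma in lemmas):
            continue
        # 2) The caption must be highly similar to the class definition.
        caption_emb = encoder.encode(caption, convert_to_tensor=True)
        if util.cos_sim(caption_emb, definition_emb).item() < sim_threshold:
            continue
        # 3) (Omitted) drop images that are merely rendered text, via OCR tools.
        kept.append(row)
    return kept
```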
Although inspecting the selected images and labeling them goes against our intention, examining the images may still provide some insight into the extent of noisy labels. We include random images from the classes where the ViT-base model performs significantly worse on LAIONet in Appendix G. These classes are more likely to contain significant noise. Looking at Figure 20, the difference in recall between LAIONet and ImageNet exhibits a long tail. In the majority of cases, this difference is less than 0.5. Our examples illustrate that such a drop can be attributed to the diversity of LAIONet images and that the labels are accurately assigned. In the rare instances where the difference in recalls exceeds 0.5, it may be because LAIONet has used a broader meaning for the class, or the images have appeared in a different context than in ImageNet. It's questionable whether including these images in LAIONet is desirable, but in any case, these classes constitute a very small portion of all classes and have minimal impact on evaluations.
Noisy labels may arise if textual similarity is not accurately captured or if an incorrect threshold is chosen. If textual similarity were not measured accurately, we would expect to observe a substantial difference when using a more powerful MPNet sentence encoder instead of the CLIP text encoder. However, consistent observations were made when contrasting an MPNet-based LAIONet with ImageNet in Appendix A. If thresholding textual similarity is a source of error in selecting LAIONet instances, we should see a significant change when switching to choosing a fixed number of most similar matches for each class. In Appendix B, we reject this hypothesis by observing consistent results when LAIONet is created from the most similar matches. We also discuss the choice of textual similarity threshold and how it can affect the results in Appendix C by walking through an example, confirming the strictness of our choice. None of these observations reveal a sign of a significant presence of noisy labels due to text-based selection.
3. Addressing other questions
The section about "A WEAKER IMAGE-TO-SELECTION LINK MAKES IMAGENET MORE LIKE LAIONET" also completely ignores noise and just mentions "weaker image-to-selection link", wherein lower MTurk selection frequency results in a distribution closer to LAIONet. There is again a confounding factor at play, which is the noise in labels -- if the MTurk selection frequency is lower, it means that the likelihood of mislabeling is higher.
Regarding Section 4.1, we found all three versions of ImageNetV2 widely accepted and used across the field. Here, noisy labels are less of a concern, and the choice is mainly about the consensus level. It's also worth mentioning that the most popular version of ImageNetV2, around which the original results of Recht et al. are oriented, is version b, which has the lowest average MTurk selection frequency. So, in our analysis, we are utilizing other versions that have a higher MTurk selection frequency than the widely utilized one, and these versions should be less noisy if noise is a concern in this context.
I am not sure what the added value is of testing the hypothesis of whether the ImageNet data collection used image content, when it is already known that it did.
Regarding the contribution of identifying the image-to-selection leak in ImageNet, we would like to emphasize that while the existence of this leak is evident, the significance of this link and the extent to which human annotation is influencing the effect were not clear. Our analysis from Section 4.1 through Section 4.3 demonstrates in multiple ways that this link has been significantly at play explaining the difference between ImageNet and LAIONet.
The paper also mentions that models perform worse on more frequent classes, but the analysis is only shown on LAIONet -- this is a surprising result, given that frequent classes will be seen more often during training, and models are expected to perform worse on infrequent classes. Does this only happen on LAIONet, or does it happen on other datasets as well, specifically ImageNet? It could also be that frequent classes have a different label-noise rate on LAIONet?
Regarding the lower performance on more frequent classes, it's important to note that ImageNet models see almost the same number of samples per class during training. Therefore, a lower performance on more frequent classes could be indicative of increased difficulty in identifying objects from those classes, potentially due to the broad definition that class has. Also note that the frequency of a class can only be defined for LAIONet, as ImageNet comes with an almost equal number of examples per class. Therefore, we reported LAION-weighted accuracy and compared it to equally-weighted accuracy only on LAIONet (Figure 5). However, we agree that it would be interesting to use the frequency of the class as observed on LAIONet and calculate LAION-weighted accuracy on ImageNet as well. We conducted this experiment and added it to the revised version under Appendix H (Figure 29). One can see that LAION-weighted accuracy is again lower, consistent with our previous observations. Therefore, this property is more a property of the models and the ImageNet classes than of LAIONet.
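For clarity, the two quantities compared here can be computed along these lines (a sketch; the per-class recalls and LAION class frequencies are assumed to be given as arrays with illustrative names):

```python
import numpy as np

def weighted_accuracy(per_class_recall: np.ndarray, class_weights: np.ndarray) -> float:
    """Class-weighted accuracy from per-class recalls.

    `per_class_recall[c]` is a model's recall on class c (on LAIONet or ImageNet);
    `class_weights[c]` is the weight given to class c. LAION class frequencies give
    the LAION-weighted accuracy; uniform weights give the equally-weighted accuracy.
    """
    w = class_weights / class_weights.sum()
    return float(np.dot(w, per_class_recall))

# laion_weighted = weighted_accuracy(recalls, laion_class_freq)   # hypothetical arrays
# equally_weighted = weighted_accuracy(recalls, np.ones_like(recalls))
```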
I appreciate the response by the authors. I believe the paper has value, but through a very different lens. The paper makes the point that an information bottleneck is added wherein only the text is used, whereas in reality what is happening is that a human annotator is replaced with an automated CLIP annotator. I think the paper's point should be more along the lines of "contrastive models trained on large-scale web datasets serve as better annotators than humans for diversity". There is a huge focus on reducing "information leakage", but in reality full information is still being used; what changes is a (reduction in) bias in the annotations. The paper's focus is in a very different direction, one which I believe has limitations and misrepresents the situation (as shared next), which is why I cannot recommend acceptance. I will bump my rating, but again contingent on the fact that, if accepted, the paper should at least dedicate a section to discussing this alternative viewpoint.
- I still strongly contest the claim that there is an information bottleneck. This claim does a disservice to the role of the CLIP model in filtering the data -- a reader might not pay careful attention to this and draw the wrong conclusions. Using image content is still vital to creating a new dataset. I don't understand what the paper refers to as an "information leak" -- given that the rebuttal still mentions this, it is worth clarifying -- I believe what is being reduced is not "information leakage" but bias. Both humans and CLIP use the full image content to decide whether an image is aligned with a particular class or not (matching happens with different text strings in the two cases) -- i.e., there is full information "leakage" -- although the term "leak" implies this is unintentional, which it is not! What is improved here is not a reduction in leakage, but an improvement in the quality of the annotators -- CLIP might be a stronger annotator with fewer biases. For instance, CLIP might not care about image quality, whereas humans might reject low-quality images, or images where the objects are too small, or obscure labels which humans might mislabel but CLIP can easily label correctly (like an exotic dog breed).
- I do not agree with the statement that the follow-up stage (where the texts are matched with class-name-based texts, which restricts the images from 12M to 1M) "serves as the primary requirement for image selection." The final images are still based on the 0.3 threshold; there are just more filters applied. In fact, I would wager that you can change the second stage and still produce a reasonable dataset, but without the CLIP thresholding it would not be possible to create a high-quality dataset.
- Re: Section 4.1, I agree that ImageNetV2 is valuable, but just because it is valuable / popular doesn't mean it doesn't have more label noise than ImageNet, and assuming so without data, treating lower inter-annotator agreement as producing a weaker link without increasing label noise, is not warranted. Thus, the paper / response doesn't show that "A WEAKER IMAGE-TO-SELECTION LINK MAKES IMAGENET MORE LIKE LAIONET". "Image-to-selection link" is an unclear concept used ambiguously in the paper. In this section, lower inter-annotator agreement is considered a weaker image-to-selection link, and earlier, using a CLIP model instead is used to represent the "image-to-selection link". Why does CLIP have a weaker image-to-selection link and less information leakage? For instance, if I share a picture of a dog and ask an annotator whether it is a dog, versus using CLIP to match the image with "an image of a dog", how is CLIP using less information? In both cases, the systems look at the image and try to see if the image "matches" the textual representation (with different definitions of "matching" in the two situations).
Thanks for getting back again. We see where you're coming from and appreciate the different perspective.
Your point is that in the creation of LAION, images were considered without any information bottleneck. So, the selection "Internet -> LAION" has no information bottleneck. We agree!
What we aimed to highlight in the paper is that the selection "LAION -> LAIONet" only uses text captions.
We think of the first selection step as creating a large and diverse candidate set of images. The second selection is where we go from a vast image database, LAION, to a relatively small dataset, LAIONet. Our point is that this second selection step uses relatively little information about the image. Hence, whatever entropy is left in LAION, we don't reduce much of it when we go from LAION to LAIONet.
This claim, btw, can be made formal. If the LAION distribution has n bits of entropy and we select based on k bits of information, the resulting conditional distribution must have at least n-k bits of information left.
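For the record, the bound referred to is the standard one, written here with X for the image and S for the selection signal derived from at most k bits of the caption (entropies averaged over S):

```latex
H(X \mid S) \;=\; H(X) - I(X;S) \;\ge\; H(X) - H(S) \;\ge\; n - k .
```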
You make a good point that for the composed selection step "Internet -> LAIONet" there is no information bottleneck, but rather a bias reduction.
We feel confident that this can be clarified with a simple update to our paper. We're happy to incorporate your perspective.
This paper conducts a comparative analysis between the predominant ImageNet dataset in the computer vision field and the recently widely-used LAION dataset. By analyzing their data collection processes, the intrinsic differences between ImageNet and LAION datasets are highlighted. Heuristically, guidelines for selecting data instances based on information bottlenecks are provided.
Strengths
- Analyzing mainstream datasets helps deepen researchers' understanding of the data. At the same time, it aids the community in designing future datasets with minimal human-induced bias, which in turn helps enhance the generalization performance of models.
- This paper is logically structured, and the conclusions regarding the differences between the ImageNet and LAION datasets are comprehensive. Starting from the differing dataset filtering processes, it further analyzes the differences in intra-class similarity between the two. This leads to the conclusion that the image diversity in the two datasets differs.
- This paper offers a wealth of visual analysis, which is very helpful in understanding the main conclusions.
Weaknesses
- This paper still lacks a central objective. Although a series of analyses point out the differences between ImageNet and LAIONet, both Figure 1 and Figure 5 seem to indicate that model performance on ImageNet and LAIONet is positively correlated. This suggests that LAIONet doesn't offer additional indicative value for model performance analysis, which is typically what matters most for classification datasets.
- Additionally, the ImageNet dataset and the LAION dataset were created at different times and for different purposes. The former emerged before deep learning became mainstream, aiming to provide a broad object-centric benchmark. In contrast, the latter was prepared for the pre-training of current large-scale models. Given that the paper suggests it can provide guidance for the construction of new datasets, and considering that the current processing methods for the LAION dataset (as well as similar datasets like COYO, mC4, etc.) are already being adopted, what specific new recommendations are included?
- Considering the different collection times of the two datasets as mentioned above, is the gap in intra-class similarity related to the distributional shift of internet data? Also, given that ImageNet-1K was derived from ImageNet-22K, would an analysis of ImageNet-22K be more meaningful?
Questions
Please refer to the weaknesses.
Thank you for your helpful review of our work. We give a detailed answer in the following.
This paper still lacks a central objective.
Concerning the objective of our paper, we pose the intriguing question of whether LAION offers images that ImageNet was not capable of and, if so, why. Answering this question comes with new insights about both ImageNet and LAION with potential takeaways for future dataset creation. In summary, our study retrospectively examines ImageNet, pinpointing a selection bias in its data-generating process. This insight was made possible through experimentation with LAION. Additionally, we demystify the distributional advantage of LAION and illustrate how one can leverage an information bottleneck to obtain more diverse representations of a concept.
... both Figure 1 and Figure 5 seem to indicate that model performance on ImageNet and LAIONet is positively correlated. This suggests that LAIONet doesn't offer additional indicative value for model performance analysis, which is typically what matters most for classification datasets.
Regarding the consistency of performance rankings on ImageNet and LAIONet, we do not believe such consistency undermines the value of evaluation on LAIONet. For example, the accuracy of ImageNet-trained models assessed on LAIONet provides insights into the extent of generalization lost due to reduced diversity. Additionally, beyond accuracy, the comparison of ImageNet with LAIONet enabled the identification of lower within-class diversity. Therefore, evaluating on datasets like LAIONet, while it may not significantly alter model rankings, can still offer valuable insights into model performance in a new domain and illuminate dataset intricacies. It's also worth noting that this consistency is not surprising, as the widely used ImageNetV2 also showed a consistent ranking of the models, but this did not undermine the new insights it offered.
... the ImageNet dataset and the LAION dataset were created at different times and for different purposes.
Concerning different purposes in the creation of ImageNet and LAION, we agree that these datasets are obtained at different times for different learning objectives. However, this comparison has been insightful about both datasets: First, using LAION as an image search engine enabled us to recreate ImageNet in a controlled way and get new insights about ImageNet by contrasting it with the new dataset. This is how we were able to identify a selection bias. Second, large-scale language image models trained on LAION demonstrate unprecedented robustness in image classification. This raises the question of what makes these models so robust. Previous works have conjectured that the LAION image distribution is the primary cause [Fang et al.]. Our work provides explicit evidence that even for the same task and classes, LAION can provide more diverse images than those selected into the classic ImageNet. Thus, such a comparison has been insightful in understanding LAION's capabilities as well.
... what specific new recommendations are included?
Regarding recommendations for future data collection, we propose that selection based on an information bottleneck, such as concise text, is a promising method to avoid selection bias. Whenever diversity is desired, this mechanism can be employed to acquire diverse instances from the target class. This class may not necessarily belong to ImageNet classes or other well-known benchmarks. As a concrete example, based on the feedback received for our work, a group of industry-based researchers found our selection mechanism promising for obtaining diverse images for a new concept, enabling them to calculate a more representative image embedding for that concept to be later used as part of a search engine. We also received feedback like "our team spent a long time curating a dataset and then we realized the improvement on it does not translate to our real application. Now I can clearly see what goes wrong." These examples have been heartwarming for us.
Considering the different collection times of the two datasets as mentioned above, is the gap in intra-class similarity related to the distributional shift of internet data? Also, given that ImageNet-1K was derived from ImageNet-22K, would an analysis of ImageNet-22K be more meaningful?
Concerning different collection times, we agree this can explain a small part of the significant difference observed between the two datasets. This concern may be more pronounced for technology-related concepts and when images are compared across datasets. However, this is not the case for the majority of ImageNet classes, and we compare image similarity within each dataset rather than across datasets, so this should be less of a concern.
We also agree that a similar experiment can be done on the larger ImageNet, and possibly on the larger LAION-5B version; however, we do not expect this to change the conclusions.
This paper analyzes the difference between ImageNet and a version of the LAION dataset recreated with ImageNet classes. The main finding is that the image selection in the creation process of ImageNet depended partially on the images themselves, in addition to the text descriptions, leading to smaller intra-class variance and an easier task.
Strengths
- The viewpoint of connecting and comparing older and newer datasets is interesting.
- The writing is generally clear and easy to follow.
Weaknesses
- The only conclusion of this paper is that ImageNet is more of an easy dataset than LAION because the images are curated based on image similarities, which makes the images of each class less diverse, with smaller intra-class variance. This conclusion is unsurprising, since ImageNet is curated very carefully to exclude outlier examples.
- I do not see much value in the findings. Visual datasets should not be curated only using text descriptions, which leads to a higher probability of getting wrong images inside the dataset. Thus the findings do not reveal a drawback of the ImageNet curation process. On the other hand, datasets nowadays, like LAION, are mostly not curated using the names of classes, while the conclusion of this paper only supports curation using the names of classes, and thus has limited value.
- This paper does not reveal anything related to the different curation processes of ImageNet and LAION, one for image classification and the other for vision-text pretraining, but instead creates another ImageNet-like dataset from LAION. Thus the title of this paper is inappropriate.
Questions
- Why not pretrain models on both datasets and compare the differences to support your conclusion?
Thank you for the review of our work. In the following, we provide a detailed response to the raised concerns.
The only conclusion of this paper is that ImageNet is more of an easy dataset than LAION ... This conclusion is unsurprising since ImageNet is curated very carefully to exclude outlier examples.
Our study stems from the intriguing question of whether for the same task, the recently popular LAION can provide images that ImageNet could not. Indeed, one major finding here is that the ImageNet-like dataset created from LAION, termed LAIONet, demonstrates higher intra-class variability, posing a challenge for ImageNet models to generalize.
We find this observation to be particularly surprising, considering that we created LAIONet in the same manner as ImageNet, using LAION instead of Flickr as the image search engine. If we assume that LAION and Flickr images closely resemble random images from the web, one might expect the resulting dataset, for appropriately chosen thresholds, to resemble the original ImageNet. However, we show that this is far from the case.
As you pointed out, we diagnose the cause of this contrast and provide an explanation in terms of a difference in data-generating processes. More precisely, we identify a selection bias resulting from information leakage from the image at the time of selection. Although the existence of selection bias might be trivial, our work shows this mechanism has been significantly at play at the time of ImageNet creation.
We want to emphasize that the LAIONet-ImageNet difference is not caused by noisy labels or outliers within LAIONet. In fact, our conservative approach to creating LAIONet, along with extensive experiments, supports the quality of LAIONet labeling. We have elaborated on this in our response to reviewer fkPT. The difference is precisely due to the selection bias in ImageNet, which has led to a reduction in the intra-class variance of the images.
I do not see much value in the findings.
Concerning the potential implications of our findings, our text-based selection exemplifies how a modality with less information can act as an information bottleneck for selecting instances from a richer modality. Our analysis of LAIONet reveals that such a selection mechanism can offer more diverse examples for a given concept. As a concrete example, based on the feedback received for our work, a group of industry-based researchers found our selection mechanism promising for obtaining diverse images for a new concept, enabling them to calculate a more representative image embedding for that concept to be later used as part of a search engine.
Our study also complements previous empirical observations [Fang et al.] that the robustness of contrastive language-image models is primarily caused by a diverse training distribution. Our work demystifies this distributional advantage and explains what was missing from a conventional dataset.
Visual datasets should not be curated only using text descriptions, which leads to a higher probability of getting wrong images inside the dataset.
It's essential to implement safety measures based on specific downstream tasks. In fact, in applications that require applying safety filters based on full image content, as long as the filter is accurate, the distribution of the safe images will remain mainly unaffected and our conclusions remain valid.
... the conclusion of this paper only supports curation using the names of classes, and thus has limited value.
As a minor clarification, our text-based selection relies on the unique description of the class as well; the name of the class is not the only identifier.
Why not pretrain models on both datasets and compare the differences to support your conclusion?
While training models on LAIONet is feasible, it would not substantially contribute to our conclusion beyond the current results indicating that LAIONet images are more diverse. Consequently, we deemed such an experiment unnecessary and instead directed our focus towards understanding the underlying cause of this difference.
As we are reaching the end of the discussion period, should the reviewers have any follow-ups, we will try our best to address them in a timely manner. If satisfied, we would greatly appreciate the reviewers updating the reviews/acknowledging our responses. We sincerely thank the reviewers again for the efforts devoted to the review process.
The paper presents an analysis comparing the ImageNet dataset with a version of the LAION dataset recreated with ImageNet classes, referred to as LAIONet. The main focus of the paper is on the differences in image selection processes between these datasets and the resulting impact on dataset diversity.
The reviewers' ratings are three borderline rejects and one borderline accept. While there is recognition of the paper's interesting approach and clear writing style, the reviewers express major concerns about the novelty and depth of its findings, methodological issues, and the lack of a central objective. The AC checked all the materials and agrees with the reviewers to reject the paper.
Why Not a Higher Score
The reviewers' ratings are three borderline rejects and one borderline accept. While there is recognition of the paper's interesting approach and clear writing style, the reviewers express major concerns about the novelty and depth of its findings, methodological issues, and the lack of a central objective.
Why Not a Lower Score
N/A
Reject