PooDLe🐩: Pooled and dense self-supervised learning from naturalistic videos
We uncover challenges of applying self-supervised learning to naturalistic, dense video data and propose a unified dense and pooled objective alongside architectural improvements to learn an effective visual representation.
Abstract
Reviews and Discussion
Self-supervised learning for naturalistic videos is challenging due to the presence of independent objects of various sizes, imbalanced class distributions, etc. The paper aims to tackle these problems by combining global and dense supervision, as well as a spatial decoder module (SDM). The dense objective could help capture different object instances from the videos, while the global objective could help learn the semantics of the objects using contextualized signals. The proposed method, PooDLe, was evaluated on representative benchmarks, e.g. BDD100K, Walking Tours, ADE20K, Cityscapes, demonstrating superior performance on dense prediction tasks, i.e. semantic segmentation and object detection.
Strengths
i) Method. The proposed method, PooDLe, is technically sound.
ii) Performance. Compared to prior global or dense-only works, PooDLe demonstrated superior performance on dense prediction tasks like semantic segmentation and object detection.
iii) Ablation. The paper further verified that PooDLe improved the performance of small and rare objects, justifying the advantage of combining global and local supervisions, as well as the spatial decoder modules (SDM). It also provided ablations on the pretraining datasets, different components, and augmentation strategies, e.g. crop area, resolutions, etc.
Weaknesses
The reviewer has several concerns and questions about the evaluation, ordered by their impact on the final score.
Concerns:
i) About SDM. SDM introduces extra operations on higher-resolution feature maps or images. From L306-307, SDM is also used in the inference stage for semantic segmentation. In this case, the inference overhead is determined by both the backbone architecture (e.g. ViT-S, ResNet50) and the UNet-based SDM. The reviewer wonders whether the baseline approaches also have SDMs during inference. Were they of the same architecture and complexity (e.g. FLOPs)? If not, could the authors provide the FLOPs of each baseline and of PooDLe? This would show whether the comparisons of different methods on semantic segmentation are apples-to-apples, especially for inference.
ii) Some comparisons are apples-to-oranges. At L358-L360, it was claimed that PooDLe outperformed DoRA across different settings. If the reviewer understood correctly, the comparisons there were between PooDLe_ResNet50_epoch20 and DoRA_ViT-S_epoch100, which were not fully convincing. The authors are encouraged to present apples-to-apples comparisons under the same setting, e.g. backbone, #epoch, or remove the claims otherwise.
Questions
iii) Tasks beyond dense prediction. It was shown that augmenting dense supervisions with global objectives could help improve dense prediction tasks. The reader may also be curious if augmenting global supervisions with dense objectives could help improve global prediction tasks, e.g. action recognition, etc.
iv) About the "relationship between global crop area and input resolution". It was claimed in Contribution 3 (L101-102). However, there seems to be no explicit analysis of the "relationship" between "crop area" and "resolution", or the relationship could not be easily interpreted from Figure 5. To reveal the relationship between "crop area" and "resolution", the authors could consider analyzing extra metrics derived from them, one example could be the pixel-scale, i.e. crop_area/input_resolution, defined in [1].
v) What patch size was used for ViT-S? This determines the FLOPs of some baseline approaches.
vi) The videos of Walking Tours (WT) are at 4K+60fps, instead of 720p+30fps (L248-249)?
[1] Effective Self-supervised Pre-training on Low-compute Networks without Distillation. ICLR 2023.
Questions
Please refer to the weaknesses.
Thank you for taking the time to review our work and providing valuable feedback. We appreciate the positive comments on our method’s technical soundness, demonstrated superior performance on dense prediction tasks, and justifying ablations on PooDLe’s design components and on pretraining data attributes. We would like to address your concerns and questions about the evaluation below.
i) About SDM
We acknowledge that the SDM adds parameters and computation to a ResNet-50. In comparison, PixPro [1] utilizes a similar FPN module, which we use in our evaluations. Furthermore, for other ResNet-50 baselines, we follow DeepLab [2] for evaluation, which uses dilated convolutions on the last 2 ResNet stages to increase the resolution of feature maps – this setup improves performance. FlowE also uses the same dilated setup for pretraining. While dilations add no parameters, they significantly increase FLOPs. The SDM is an efficient replacement for dilations in both performance and compute cost; we provide the FLOPs of different inference architectures for a single 512x1024 image below.
Also note: the SDM (frozen weights) and dilations are kept during linear readout semseg, but are removed in UperNet semseg and object detection, where PooDLe also achieves the best results. This also suggests that training with the SDM learns a stronger ResNet50 representation than dilations, even when it is removed at inference.
Table A. Inference overhead of different model architectures
| Architecture | Associated Methods | GFLOPs |
|---|---|---|
| ResNet-50 | | 43.3 |
| ResNet-50 + SDM | PooDLe | 60.5 |
| ResNet-50 + FPN decoder | PixPro | 124.4 |
| ResNet-50 + dilated convolutions | FlowE, DINO, DenseCL | 200.7 |
| ViT-S/16 | DINO, iBOT, DoRA, MAE | 82.9 |
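As a minimal sketch of how such GFLOP counts could be reproduced, one can profile a standard ResNet-50 against a DeepLab-style dilated variant; the profiler choice (fvcore) and model construction below are illustrative assumptions, not the exact setup used for Table A, and counting conventions may cause small differences in the reported numbers.

```python
import torch
import torchvision
from fvcore.nn import FlopCountAnalysis

x = torch.randn(1, 3, 512, 1024)  # single 512x1024 image, as in Table A

# Standard ResNet-50 (output stride 32) vs. a DeepLab-style dilated ResNet-50,
# where the last two stages use dilation instead of striding (output stride 8).
models = {
    "resnet50": torchvision.models.resnet50(),
    "resnet50-dilated": torchvision.models.resnet50(
        replace_stride_with_dilation=[False, True, True]
    ),
}

for name, model in models.items():
    gflops = FlopCountAnalysis(model.eval(), x).total() / 1e9
    print(f"{name}: {gflops:.1f} GFLOPs")
```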
ii) Some comparisons are apples-to-oranges
You are correct – we will revise our claims to reflect the differences between the methods, as we did not include ViT experiments. However, we would like to point out that even with these differences, in L270-282, PooDLe shows a notable improvement over DoRA on BDD and Cityscapes semantic segmentation, e.g., +6.5 mIoU on BDD UperNet semantic segmentation.*
* For completeness, on BDD, DoRA trained with 100 epochs attains 40.8 mIoU on BDD UperNet semantic segmentation versus 43.3 mIoU for 200 epochs of training.
iii) Tasks beyond dense prediction
This question brings up an interesting point. iBOT [3] demonstrated that augmenting global supervision with a dense objective improves image classification, but this work only trained on iconic ImageNet data. We find that iBOT and other global methods do not transfer well to naturalistic data composed of dense scenes, varying objects and imbalanced classes. However, our work is towards learning better representations from this naturalistic data; first for dense prediction tasks, but eventually for semantic global tasks as well. We also note that most naturalistic datasets, including BDD and Walking Tours, do not have global prediction tasks.
iv) About the "relationship between global crop area and input resolution"
Thank you for pointing this out – we do not directly explain the relationship between these two factors. To clarify, while developing PooDLe, we noticed that performance for higher resolutions would peak at larger global crop areas, i.e., larger fields of view. This is shown in Figure 5 where performance for 512x1024 is maximized at 0.275 mean crop area while 256x512 is maximized at 0.125. Thus, PooDLe seems to perform best at a specific pixel-scale, or a pixel density per ‘feature’ in the dense feature map.
We will revise Contribution 3 to better reflect what is shown in Figure 5 and clarify the message of the accompanying paragraph in Section 4.4.
v) Patch size
We used a patch size of 16 for the ViT-S baselines.
vi) Walking Tours videos
Yes, the original videos of Walking Tours are at 4K and 60 fps. However, we note that the dataset’s download code only supports 720p and DoRA’s implementation details stated that 30 fps was used [4]. We can revise and mention these details in our paper.
[1] Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning. CVPR 2021.
[2] DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI 2017.
[3] iBOT: Image BERT Pre-Training with Online Tokenizer. ICLR 2022.
[4] Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video. ICLR 2024.
Dear reviewer, thanks again for taking the time to review our work and provide valuable feedback. Since the final days of the discussion phase are approaching, we wanted to reach out and ask if there are any further questions from your end. We hope we were able to address your concerns about the SDM and the comparison settings in our work. Please let us know if you have any further comments.
The reviewer appreciates the rebuttal that addressed most of the concerns and questions. The reviewer would like to keep the original positive score (i.e. 6).
The paper introduces PooDLe, a self-supervised learning (SSL) method tailored for complex, naturalistic video data with densely packed scenes and diverse object scales. Unlike standard SSL methods trained on iconic images, PooDLe addresses the spatial imbalance in cluttered video data by combining a dense SSL objective that maintains equivariance through optical flow warping with a pooled objective that captures semantic information from smaller, aligned subcrops. This dual-objective approach, augmented with a spatial decoder module (SDM), is shown to effectively represent both large and small objects.
PooDLe’s performance is validated on driving and first-person video datasets, achieving state-of-the-art results in semantic segmentation and object detection tasks, particularly excelling with small object recognition. Additionally, the paper introduces the Walking Tours Semantic (WT-Sem) segmentation task, presents a detailed analysis of class size and frequency on BDD100K, and studies the impact of global crop area, input resolution, and temporal stride on learning efficacy, highlighting key factors for SSL with naturalistic data.
Strengths
- The paper is well written, with a clear structure that demonstrates a logical progression of motivation and ideas. The authors use precise and well-chosen words that enhance readability and engagement.
- The motivation to combine pooled representations and dense local object representations is well justified. The proposed method aligns well with the motivation.
- The paper provides comprehensive and detailed analysis, testing the proposed method across various scenarios and datasets. This thorough experimental validation successfully demonstrates the method’s effectiveness on small and rare objects.
Weaknesses
- The main weakness of this work is the lack of sufficient ablations for the proposed method. It is not clear (with the shown ablations) if local crops really are useful. Considering this is the main contribution in terms of novel ideas, it should have been made clear. Results with just local crops might help, since adding these local crops on top of dense does not improve, but rather hurts, smaller objects. It will be good to see accuracy on this and on datasets other than BDD as well.
- Although there are several experiments presented, the technical contributions seem a little on the weaker side: the method is built on top of FlowE, the local crops are novel (but their contribution is not clear), and the improvement on dense consistency is expected due to the use of features from earlier layers.
- The proposed method shows some improvement over existing methods, but looking at a comparison with the baseline (FlowE), it is mostly incremental (less than 1%) if we look at accuracy. That is why it is important to look at this metric in the ablations. Also, this baseline is not shown for the other dataset (Walking Tours), where the proposed method does not outperform other methods across the board.
Questions
- The explanation for why combining SDM with Pool helps is not clear. Also, it is not clear why just Pool should not help; it is even hurting the small objects. SDM has additional parameters, can that be a reason (of course the fine-grained features will help too)? SDM is not even connected with Pool.
- In Figure 4, it shows finer boundaries for smaller objects, but boundaries for bigger objects, such as the car, are getting worse in comparison to other methods; also, performance seems similar to the baseline approach FlowE.
- L075: "These subcrops serve as pseudo-iconic views of foreground objects, functionally increasing the prevalence of smaller objects". It is not clear why taking random crops will increase the prevalence of smaller objects; parts of bigger objects or backgrounds will have a higher probability among these crops.
- Ablations are not comprehensive; only one dataset, one metric, and one task are shown. It will be good to see the accuracy metric, and also how these components perform with any other dataset. BDD is street-view; maybe Ego or some other natural videos, CCTV, etc. It will be good to see the ablations on the Walking Tours dataset along with BDD. It seems to be working fine with driving videos, since the scene will not change a lot, but what about other videos? Like ego-centric? Or static camera?
- What about performance with just Pool? It will be good to see if just learning from the crops is enough. SDM improves, but it comes with additional parameters too.
- Since this work is built on top of FlowE, there should be results for FlowE on other datasets too (Table 2).
- It is not clear why results were shown only with ResNet; why not ViT?
- Will this method work on other simple tasks like classification?
- It will be good to see accuracy in Tables 3 and 4.
- Using additional crops for training adds training overhead; there is no analysis shown on this.
- In Figure 9, it will be good to see t between 0 and 15.
- The role of the predictor and projector is not clearly explained; the projector is not even shown. It is not clear.
- L205: why are crop features average-pooled over the spatial dimension? Shouldn't a fine-grained constraint be better?
Thank you for taking the time to review our work and provide useful feedback. We appreciate the positive comments on our paper’s clear writing, the motivation for our method, and the thoroughness of our experiments for demonstrating PooDLe’s effectiveness across different settings.
We would like to address your comments and questions below.
1. Relationship between SDM and Pooled objective + ablation experiments (weakness #1)
Adding the pooled objective alone does not improve performance because the same representation has to satisfy two different objectives. The pooled objective trains the encoder features to capture semantic information over small local crops. The dense objective focuses on scene-level understanding and without the SDM, it largely ignores small objects: 8.7 => 12.8 mIoU, row 1 to row 5 in Table 4. However, having the SDM enables the model to propagate the high-level semantic information learned by the pooled objective and upsample it to higher-resolution features used in the dense objective.
The extra parameters added by the SDM are top-down ResNet blocks, which are shown in row 3 of Table 4 and do not provide improvement on small objects. Only when the lateral connections (single, linear projections) to low-level features are added do we see some improvement, as shown from row 1 to row 5: 8.7 => 12.8 mIoU. Finally, when we combine the SDM with the pooled objective, we observe further improvement from row 5 to row 7: 12.8 => 15.0 mIoU.
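To make the described structure concrete, here is a minimal sketch of a top-down decoder with linear lateral projections to earlier, higher-resolution features; the module names, channel sizes, and block design are illustrative assumptions rather than the exact SDM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDMSketch(nn.Module):
    """Top-down decoder with lateral connections (illustrative, not the paper's code)."""

    def __init__(self, dims=(2048, 1024, 512), out_dim=512):
        super().__init__()
        # Linear (1x1) lateral projections from earlier ResNet stages
        self.laterals = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in dims[1:])
        # Top-down blocks that refine the upsampled features
        self.topdown = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dims[0] if i == 0 else out_dim, out_dim, 3, padding=1),
                nn.BatchNorm2d(out_dim),
                nn.ReLU(inplace=True),
            )
            for i in range(len(dims) - 1)
        )

    def forward(self, feats):
        # feats: [stage5, stage4, stage3] features, ordered coarse to fine
        x = feats[0]
        for block, lateral, skip in zip(self.topdown, self.laterals, feats[1:]):
            x = F.interpolate(block(x), scale_factor=2, mode="nearest")
            x = x + lateral(skip)  # merge higher-resolution, low-level features
        return x  # 4x the spatial resolution of the stage-5 feature map

# Shape check with dummy ResNet-50 stage outputs for a 512x1024 input
feats = [torch.randn(1, 2048, 16, 32),
         torch.randn(1, 1024, 32, 64),
         torch.randn(1, 512, 64, 128)]
print(SDMSketch()(feats).shape)  # torch.Size([1, 512, 64, 128])
```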
Note on technical contributions (weakness #2)
The technical contributions of our work include the integration of dense and pooled learning objectives alongside the SDM. As discussed above, our ablations show that only when we combine all three components do we see the most improvement on learning from dense scenes and overcoming spatial imbalance in naturalistic data.
2. Figure 4 tradeoff between small and large objects
The goal of Figure 4 is to show that FlowE misses small objects like the people on the far sidewalk, and that baselines which use pooled, iconic-based objectives produce noisy object boundaries. While PooDLe produces a noisier outline of the front car compared to FlowE, it also predicts a cleaner boundary for the buildings and the car on the right. Overall, in our experiments, we find that PooDLe still outperforms FlowE on large objects by 51.4 to 49.3 mIoU (Table 3).
3. Improved small object capture with random subcrops
This is a great question. While the random crop sampling doesn’t increase the probability of viewing any one pixel of a small object, it does increase the probability that some sizable part of the object is seen. Coupled with the reduced likelihood of capturing multiple objects in each subcrop, we can apply our pooled objective to learn stronger object semantics by using each subcrop as a ‘pseudo-iconic’ example.
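To make this intuition concrete, a toy Monte Carlo estimate (with an assumed frame size, object size, and crop size chosen purely for illustration) compares the share of pixels a small object occupies in the full frame versus in a random subcrop that happens to overlap it; the crop share is substantially higher.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 512, 1024   # frame size (assumed)
obj = 40           # small square object, 40x40 pixels (assumed)
crop = 192         # subcrop side length (assumed)

full_frame_share = obj**2 / (H * W)

crop_shares = []
for _ in range(100_000):
    oy, ox = rng.integers(0, H - obj), rng.integers(0, W - obj)   # object position
    cy, cx = rng.integers(0, H - crop), rng.integers(0, W - crop)  # crop position
    # overlap between the object box and the crop box
    dy = min(oy + obj, cy + crop) - max(oy, cy)
    dx = min(ox + obj, cx + crop) - max(ox, cx)
    if dy > 0 and dx > 0:
        crop_shares.append(dy * dx / crop**2)

print(f"object share of full frame:          {full_frame_share:.4f}")
print(f"object share of a crop that hits it: {np.mean(crop_shares):.4f}")
```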
10. Training overhead of additional subcrops
Thank you for noting this! The subcrops are at very small resolutions compared to our global crops and do not contribute much to training cost. We compare the training overhead of PooDLe with 6 subcrop pairs and without subcrops on an RTX A6000 with batch_size=4.
Table D. Training overhead of subcrops and pooled objective for PooDLe
| Method | Time per iteration (sec) | VRAM (GB) |
|---|---|---|
| PooDLe | 1.21 | 40.4 |
| PooDLe with no subcrops / pooled | 1.10 | 39.8 |
11. Additional delta_t values for Figure 9
We provide the BDD linear semseg results for delta_t=8 and the results from Figure 9 for reference. Delta_t=8 also performs well, but delta_t=15 appears to be the sweet spot of sufficient, yet not overly noisy, video motion for learning.
Table E. BDD linear readout semantic segmentation for varying delta_t
| delta_t | mIoU |
|---|---|
| 0 | 27.69 |
| 8 | 33.92 |
| 15 | 34.24 |
| 30 | 33.85 |
| 45 | 33.66 |
12. Predictor and projector
We refer to BYOL [1] for the introduction and analysis around the use of projector and predictor modules, as they are first introduced in that work. We omitted them in our method figure for clarity, as they are often omitted in other method figures, e.g. DINO [2].
13. The use of global pooling on subcrops
Our work aims to mitigate the spatial imbalance problem of fine-grained objectives which prioritize larger background classes by introducing a pooled learning objective over smaller subcrops. If we do not pool the subcrop features, then the subcrop objective (L205) would suffer from the same spatial imbalance problem as the dense objective. Furthermore, the subcrops would likely not perform well for the dense objective because they are designed to capture different views of single objects.
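As a rough sketch of this design choice (a BYOL-style negative-cosine loss is assumed here for concreteness and is not necessarily our exact formulation), the pooled objective collapses each subcrop's feature map to a single vector before the invariance loss is applied, so no single pixel region can dominate the loss.

```python
import torch
import torch.nn.functional as F

def pooled_loss(feat_a, feat_b):
    """feat_a, feat_b: [B, C, h, w] encoder features of two aligned subcrops."""
    # Average-pool over the spatial dimensions -> one vector per subcrop, so the
    # loss is not dominated by whichever class covers the most pixels.
    z_a = feat_a.mean(dim=(-2, -1))
    z_b = feat_b.mean(dim=(-2, -1))
    # Invariance term: negative cosine similarity between the two pooled views.
    return -F.cosine_similarity(z_a, z_b, dim=-1).mean()

# Usage with random tensors standing in for encoder outputs
loss = pooled_loss(torch.randn(4, 512, 12, 12), torch.randn(4, 512, 12, 12))
print(loss)
```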
[1] Bootstrap your own latent: A new approach to self-supervised Learning. NeurIPS 2020.
[2] Emerging Properties in Self-Supervised Vision Transformers. ICCV 2021.
Dear reviewer, thanks again for taking the time to review our work and provide helpful feedback. Since the final days of the discussion phase are approaching, we wanted to reach out and ask if there are any further questions from your end. We hope we were able to answer your questions and address your concerns regarding our method’s ablations and improvements. Please let us know if there is anything else we can provide.
Thank you for the detailed response. It answers most of my questions.
I have just one more question for the authors: why are the mIoU and accuracy scores not aligned in Table B above? How do we decide which is more important, since the different variations are not consistent across these two metrics?
mIoU and accuracy scores do not always align because mIoU weighs each class equally, whereas accuracy is computed over every image pixel. For instance, a model variation (and representation) that prioritizes larger background classes will have higher accuracy and lower mIoU given the long-tailed class distribution of naturalistic data. We favor mIoU as a better measure of semantic understanding because it does not ignore small foreground classes, which are also more reflective of downstream importance – e.g. pedestrians, which occupy only 0.25% of pixels in BDD100K.
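To illustrate the divergence with a toy two-class confusion matrix (the numbers below are assumed for illustration, not taken from our experiments):

```python
import numpy as np

def pixel_accuracy(conf):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    return np.trace(conf) / conf.sum()

def mean_iou(conf):
    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    return np.mean(tp / union)

# Toy confusion matrix: class 0 = road (abundant), class 1 = pedestrian (rare).
conf = np.array([[9_500,   100],
                 [  350,    50]], dtype=float)

print(pixel_accuracy(conf))  # ~0.955: high, driven by the abundant class
print(mean_iou(conf))        # ~0.53: pulled down by the poorly segmented rare class
```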
Thank you for the clarification. I do not have any further questions for the authors. I do have concerns about the novel contributions in this work, and since now we have more time for a discussion, maybe we can clarify this aspect in more detail. I understand that "integration of dense and pooled learning objectives alongside the SDM" is the main novelty (more of a technical contribution combining existing techniques) in this work. Please correct me if I misunderstood this. I understand the contributions, but if I am missing any other novel ideas, it will be good to know.
Thank you for your response, we are more than happy to discuss the novelty of our work given the discussion extension.
First, we would like to highlight that identification and analysis of the spatial imbalance phenomenon is a key and novel contribution of our work. Other related methods that train on dense, naturalistic data and evaluate on dense prediction tasks (DenseCL, PixPro, FlowE, DoRA) fail to recognize the performance gap between small and large object classes. Demonstrating these limitations is a significant contribution of our work, and addressing this problem serves as a key motivation for our method.
Next, yes, our proposed integration of dense and pooled objectives alongside the SDM is our second main novelty. While it does outperform existing methods on dense prediction tasks when trained on naturalistic video, it is more important that only this specific combination of components effectively addresses the spatial imbalance problem.
We put forth that dense, naturalistic video is a plentiful data source for learning visual understanding. If we are to train on this type of data, it is essential to capture information about objects of all available scales in the learned representation.
Thank you for the clarification, I appreciate the authors detailed response on this.
Analyzing imbalance between object sizes for SSL might be novel, but it is a known phenomenon in the community and has been well studied for traditional methods (Focal loss, Dice loss, etc., a recent summary here: https://arxiv.org/html/2403.07113v1).
The other issues are minor (missing ablation on other datasets, it is important due to incremental improvements; FlowE performance on other datasets; misaligned accuracy vs mIoU ablations, etc.) and not a deciding factor.
I will discuss with other reviewers/AC on the significance of the contributions (the novelty in analyzing the imbalance and other technical parts integrating those components). For now, I will raise my score to 6 for all the efforts from authors.
We appreciate your engagement with our comments and your adjusted assessment of our work!
As you mentioned, imbalance is already well-studied for supervised methods, where class labels are used to prioritize underrepresented classes. We would just like to note how our setting is different from prior work. In SSL, methods like oversampling and loss reweighting aren’t viable as there are no class labels. Some prior work (https://arxiv.org/abs/2110.05025) has explored pooled SSL methods on imbalanced datasets of iconic images, where methods can leverage the implicit supervision of single object, size-normalized images. However, in our setting, the learned features can ignore small objects entirely as they are underrepresented and not preserved in their own iconic examples.
4. Comprehensiveness of ablation experiments
We are happy to provide the accuracy metric for our ablation table – we had omitted it for brevity and in favor of the Small/Large, Rare/Common mIoU breakdown. Note that the small/large and rare/common accuracy breakdowns are averaged over the class subsets, while the ‘All’ category is over pixels. Here are the accuracy values for our ablations in Table 3 and Table 4, which we will also add to the Appendix:
Table A. Accuracy values for Table 3 with breakdown by class grouping.
| Method | Pretrain | All | Small | Large | Rare | Common |
|---|---|---|---|---|---|---|
| DINO | BDD | 86.8 | 12.3 | 88.3 | 2.3 | 87.8 |
| DenseCL | BDD | 84.9 | 2.0 | 86.6 | 0 | 86.0 |
| DoRA | BDD | 88.1 | 19.3 | 89.5 | 7.2 | 89.1 |
| FlowE | BDD | 88.5 | 18.2 | 89.9 | 32.0 | 89.2 |
| PooDLe | BDD | 89.2 | 33.6 | 90.3 | 34.2 | 89.9 |
| Supervised | IN1K | 84.7 | 36.9 | 85.3 | 23.8 | 85.1 |
| PooDLe | BDD* | 90.7 | 35.6 | 91.2 | 46.9 | 91.2 |
*Pretrained on BDD, initialized with supervised IN1K weights.
Table B. Accuracy values for ablation Table 4 with class grouping.
| Row | Model | All | Small | Large | Rare | Common |
|---|---|---|---|---|---|---|
| 1 | FlowE | 85.0 | 22.8 | 86.3 | 6.1 | 86.0 |
| 2 | | 86.2 | 14.2 | 87.6 | 6.3 | 87.1 |
| 3 | | 86.8 | 11.9 | 87.7 | 13.6 | 87.7 |
| 4 | | 86.6 | 22.1 | 87.9 | 16.9 | 87.5 |
| 5 | | 84.2 | 25.5 | 85.3 | 28.2 | 84.9 |
| 6 | PooDLe† | 86.0 | 26.4 | 87.2 | 29.6 | 86.7 |
| 7 | PooDLe | 86.5 | 26.6 | 87.7 | 28.5 | 87.1 |
†Flow model trained without supervised labels.
We also believe it is standard practice to run ablations on only one dataset to demonstrate the effect of our method’s contributions: the dense and pooled losses and the SDM. BDD is a large and diverse dataset, featuring different cities, times of day, and rates of ego-motion, as the car can be moving or standing still. We show the effectiveness of our method on other datasets by training and evaluating on the egocentric Walking Tours videos.
5. Learning with only the pooled loss
We actually did run this ablation, which is equivalent to applying the pooled objective to many small crops obtained throughout the frame. However, it performed very poorly, with 26.59% mIoU at the ablation scale – far worse than other methods.
6. FlowE results on Walking Tours
We were not able to run FlowE in the full setting on Walking Tours due to computational constraints, but we are able to provide results in an ablation scale setting, using 192x384 resolution global crops and 20 epochs of training. The table below shows that PooDLe outperforms FlowE on ADE20K linear semseg.
Table C. ADE20K linear readout semantic segmentation, pre-training on Walking Tours
| Method | mIoU | Acc |
|---|---|---|
| PooDLe | 10.61 | 56.94 |
| FlowE | 7.53 | 53.34 |
7. ViT for PooDLe
We demonstrated the contributions of our method using ResNet-50, and we generally expect them to transfer to ViTs. However, some further work is likely necessary as there is no resolution hierarchy in ViTs which would affect the SDM’s design and the flow equivariance (dense) objective has not been used with ViTs previously. We leave these extensions for future work.
8. Would this method work for classification?
We believe PooDLe representations would perform reasonably well on classification, as our models learn semantic information via the pooled objective and are capable of differentiating between classes for semantic segmentation, i.e., pixel-level classification. Nonetheless, there are no global classification benchmarks associated with BDD or Walking Tours. However, please note that the goal of our work is to improve representation learning on naturalistic, uncurated data. This is challenging in comparison to learning from global classification datasets like ImageNet that are curated to be iconic and class-balanced.
9. Accuracy values on Table 3 & 4
We provide accuracy values for these tables in #4 above.
Note on accuracy improvement (weakness #3)
PooDLe’s accuracy improvement from 88.5 to 89.2 over FlowE is significant at such high accuracies and represents a 6% reduction in error rate. In addition, as the classes are highly imbalanced, incremental improvements are very difficult. Conversely, mIoU weighs each class equally and is usually the primary metric reported in prior works. Additionally, at a high level, it is more important to capture all classes well than to focus on per-pixel accuracy.
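As a quick check of that arithmetic (accuracies taken from Table 3):

$$
\text{relative error reduction} = \frac{(100 - 88.5) - (100 - 89.2)}{100 - 88.5} = \frac{11.5 - 10.8}{11.5} \approx 6.1\%
$$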
This paper proposes an approach for self-supervision from video. The authors highlight the discrepancy between image datasets which typically contain a single central object and more challenging video data which is composed of complex scenes filled with many objects of differing sizes. The simpler image setting lends itself well to global embedding-based self-supervised methods, but such methods might not be ideal for representation of larger multi-object scenes.
The key idea of the paper is to mix dense and global objectives. For dense feature supervision, the authors follow FlowE (Xiong et al 2021) with one key difference being the use of a U-net style architecture to upsample and connect higher resolution features from earlier layers. The more "global" pooled-objective is applied to subcrops. Similar to how FlowE will match optical flow-warped dense features, subcrops are adjusted across frames by predicted optical flow to have a higher chance of containing the same subject. The motivation for learning on subcrops is for smaller, salient objects to take up a larger fraction of pixels.
The model is pretrained on BDD and Walking Tours and then segmentation and detection heads are trained on top of the frozen backbone. The resulting model outperforms other self-supervised approaches as well as ImageNet pretraining.
Strengths
This is a solid paper, well motivated introduction, straightforward idea, outperforming other methods with the appropriate ablations to break down different design decisions. Each of the key ideas behind the approach are sensible, I am surprised it is not already common to perform dense feature self-supervision on U-net/FPN style features.
- I appreciate the level of specific, little details throughout the paper to understand exactly how things were implemented
- it's interesting to see that both losses only start to play a role after a clearer distinction has been made between the pooled "global" and upsampled dense features (Table 4)
- also cool to see that the method performs well with imperfect self-supervised flow
Weaknesses
This is minor, but I am a bit confused about the subcrops. When reading I was expecting some mechanism for smarter choice of crops such that there's a higher probability of including an object of interest. For example, some sampling based on saliency or flow that would ensure more crops included people/bikes/etc, but it doesn't seem like that's the case? Instead, crops are chosen randomly and uniformly across the image so they are just as likely to sample a patch from the sky as from the street. The key difference as far as I understand is that by virtue of doing any cropping at all an object of interest will now take up a greater fraction of pixels, and using flow ensures the object will likely be included in both crops. Feels somewhat at odds with how the paper has been written and motivated which emphasizes putting extra attention on smaller objects as opposed to background context.
Questions
Overall the paper is solid so this is again fairly minor, but I have a bunch of cropping related questions as that's the one area I had the most confusion:
- Where exactly does the global cropping fit into all of this, what's the point of defining a global crop at all?
- In Figure 7, I'm not sure what it means to change the number of subcrops, is it changing the proportion of correlated samples in a given minibatch? What is the method with 0 subcrops - is it using the global crop as input into the encoder?
- Maybe I missed it, but is there ever an ablation on the importance of the subcrops being flow matched? Intuitively seems like a reasonable thing to do but does it matter much?
- In Table 4, is the FlowE result trained with the proposed cropping approach? How does this compare to FlowE without using subcrops?
- Is the training done on 192x192 crops but testing at 512x512, is there any issue with this discrepancy or does the model generalize well enough to the larger test time resolution?
Details of Ethics Concerns
n/a
Thank you for spending the time to review our work and provide insightful comments. We are encouraged that you found our problem area to be well-motivated and our approach to be straightforward and sensible. We would like to address your comments and questions on cropping below.
First, we would like to make a general clarification on how we apply the two objectives to the image crops. For each pair of video frames, we crop a pair of larger global crops and then K pairs of smaller local subcrops from the global crop pair. We compute the dense loss from representations of the global crops, as they have a larger field of view with many objects. We compute the pooled loss from the subcrops as they are much more likely to contain a single object, allowing pooling and invariance-based learning.
Smarter choice of subcrops
This is a good observation. We are actually currently experimenting with saliency methods like ContrastiveCrop [1] and outlier methods like SSD [2] to improve crop selection. We had also considered using flow to find fast or independently moving objects, or flow clustering [3]. However, we did not continue this as objects of interest are not always moving, and camera ego-motion may make things appear to be moving when they are not. For instance, a bike may be locked to a fence and stationary, while cars are large, common moving objects that are already easy to capture. Additionally, these are observations specific to driving data like BDD and may not apply if we are in another setting with mostly static objects (such as a museum). Nonetheless, random small subcrops still improve foreground object hit-rate, as you mentioned and as analyzed in Section 4.5 of our paper.
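For reference, a minimal sketch of the flow-informed placement idea: sample a small crop in frame t, then shift its location in frame t+k by the mean optical flow inside the crop (plus jitter) so both views are likely to contain the same subject. The crop size, jitter range, and use of the mean flow are illustrative assumptions, not our exact procedure.

```python
import torch

def flow_matched_boxes(flow, crop=192, jitter=16):
    """flow: [2, H, W] forward optical flow from frame t to frame t+k.
    Assumes channel 0 = horizontal (x) flow, channel 1 = vertical (y) flow."""
    _, H, W = flow.shape
    y1 = torch.randint(0, H - crop, (1,)).item()
    x1 = torch.randint(0, W - crop, (1,)).item()
    # Mean displacement of the pixels inside the first crop
    dx, dy = flow[:, y1:y1 + crop, x1:x1 + crop].mean(dim=(1, 2)).tolist()
    # Shift by the mean flow plus a small random jitter, clamped to the frame
    jy, jx = torch.randint(-jitter, jitter + 1, (2,)).tolist()
    y2 = int(min(max(y1 + dy + jy, 0), H - crop))
    x2 = int(min(max(x1 + dx + jx, 0), W - crop))
    return (y1, x1), (y2, x2)

# Usage with a random flow field standing in for an estimated one
boxes = flow_matched_boxes(torch.randn(2, 512, 1024) * 5)
print(boxes)
```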
Changing the number of subcrops in Figure 7
Following our clarification above, it is varying K, the number of paired subcrops generated from the pair of global crops. This increases the number of terms averaged for the loss in Eq. 2. For 0 subcrops, we do not use the pooled objective at all.
Flow-matching for subcrop selection
We are running two ablations to measure the importance of flow-matching for subcrops: (1) selecting 2 random, independent subcrops and (2) selecting 2 subcrops at the same location from the pair of global crops. It’s expected that (2) will perform better, but will be affected by the data and the amount of motion between frames. To quantify this effect, we are also running (2) at the default delta_t=15 as well as delta_t=45. We will provide an update when we have these results.
Table 4 and subcrop resolution
We’ll answer these questions together as they’re related to our general clarification above. FlowE is only trained with a dense objective and thus only on the global crop pair. Subcrops are used for the pooled objective in Table 4. During training for the main results, the same model backbone encodes subcrops at 192x192 resolution and the global crop at 512x1024 resolution. Ablation experiments use 192x192 subcrops and 256x512 global crops. We evaluate at 512x1024 for BDD100K and 512x512 for ADE20K tasks for both the main and ablation experiments, indicating that the model can generalize to different test time resolutions.
[1] Crafting Better Contrastive Views for Siamese Representation Learning. CVPR 2022.
[2] SSD: A Unified Framework for Self-Supervised Outlier Detection. ICLR 2021.
[3] Guess What Moves: Unsupervised Video and Image Segmentation by Anticipating Motion. BMVC 2022.
Below are the results for the flow-matching ablations. We find that selecting subcrops at the same location in the image with jitter (method 2) slightly outperforms flow-matching (method 1) for BDD pretraining at delta_t=15, but degrades quickly as delta_t increases to 45, obtaining the same mIoU and lower accuracy (-0.4%). Unsurprisingly, independent random crops perform very poorly, as the subcrops are very unlikely to contain the same subject.
This is a matter of hyperparameter selection, as the benefit of flow-matching for subcrop selection is dependent on object motion and object size. In particular, flow-matching would be more helpful for large, faster moving objects and should also benefit from more targeted subcrop strategies.
Table A: Comparison of different subcrop selection methods at dt=15,30,45
| Row | delta_t | Subcrop method | mIoU | Acc |
|---|---|---|---|---|
| 1 | 15 | Flow-matching + Jitter | 34.2 | 86.5 |
| 2 | 15 | No flow + Jitter | 34.9 | 86.6 |
| 3 | 15 | Random crop | 30.6 | 85.9 |
| 4 | 30 | Flow-matching + Jitter | 33.9 | 86.4 |
| 5 | 30 | No flow + Jitter | 34.2 | 86.4 |
| 6 | 45 | Flow-matching + Jitter | 33.7 | 86.5 |
| 7 | 45 | No flow + Jitter | 33.7 | 86.1 |
Dear reviewer, thanks again for taking the time to review our work and provide valuable feedback. Since the final days of the discussion phase are approaching, we wanted to reach out and ask if there are any further questions from your end. We hope we were able to answer your questions about subcrop selection and resolutions, and welcome any additional input.
Thanks for the thorough response! I realize the disconnect I had regarding global vs subcrops and where the dense and global losses are applied. All makes sense to me now!
Regarding the additional experiment, it does not affect my score as I still lean positive on the paper, but want to confirm I understand correctly. The best results reported here are those without the flow matching? It is a modest difference, but I guess maybe not too surprising.
Yes, the best results reported here are at delta_t=15 without flow matching for subcrops. It is likely that selecting subcrops at the same location is good enough for capturing the same object between frames, while providing slightly more crop diversity. However, this selection method may be less robust to larger amounts of motion, as alluded to by our delta_t=45 experiments.
This paper introduces PooDLe (Pooled and Dense Learning), a self-supervised learning method for learning visual representations from naturalistic videos. PooDLe is a joint framework combining dense flow-equivariance learning with a pooled objective using pseudo-iconic subcrops. It shows state-of-the-art performance on multiple benchmarks while demonstrating strong performance on small objects and rare classes.
Strengths
- The authors propose a novel approach to handling spatial imbalance in dense scenes through unified pooled and dense objectives. The creative use of flow-informed cropping to generate meaningful subcrops is insightful.
- The paper provides comprehensive experimental validation across multiple benchmarks, and the method demonstrates obvious improvements on challenging cases (small objects, rare classes).
Weaknesses
- Optical flow computation is typically computationally intensive, requiring expensive pixel-wise correspondence matching. It is important to discuss the computational costs of flow estimation; there is limited discussion of real-time performance implications.
- Limited analysis of inference time requirements, e.g. how inference time scales with input resolution.
- Missing discussion of potential memory bottlenecks from the joint objectives, e.g. GPU memory requirements for different input sizes and comparison with baselines.
Questions
- What is the computational overhead of the SDM compared to simpler upsampling approaches?
- How sensitive is the method to video frame rate or motion blur?
Thank you for spending the time to review our work. We appreciate the positive feedback, and suggested points of clarification on computation cost and sensitivity to video properties.
We will address your comments and questions below.
Optical flow estimation cost
First, we note that optical flow estimation is not used in real-time inference – it is only used in pre-training. It takes 0.069 seconds, 18.9% of a training iteration, to estimate optical flow using RAFT (24 iterations) for a pair of 512x1024 images. This is further diminished during training due to PyTorch optimizations. Additionally, our method should perform comparably with fewer flow iterations and the optical flow results can also be computed once and stored on disk.
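As an example, flow could be precomputed offline with torchvision's RAFT along the lines below; the normalization, iteration count, and caching format are illustrative assumptions rather than our exact pipeline.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
model = raft_large(weights=Raft_Large_Weights.DEFAULT).eval().to(device)

@torch.no_grad()
def estimate_flow(img1, img2, iters=24):
    """img1, img2: [B, 3, H, W] float tensors in [0, 1]."""
    img1, img2 = img1 * 2 - 1, img2 * 2 - 1  # RAFT expects inputs in [-1, 1]
    flows = model(img1.to(device), img2.to(device), num_flow_updates=iters)
    return flows[-1].cpu()  # final refined flow field, [B, 2, H, W]

# Flow can be computed once per frame pair and cached to disk, e.g.:
# torch.save(estimate_flow(frame_t, frame_tk), "flows/clip0001_t0015.pt")
```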
Computational overhead of the SDM compared to simpler upsampling approaches
Without the SDM, prior methods use dilated convolutions for upsampling features [1, 2, 3]. We compare the FLOPs of a ResNet-50 + SDM vs ResNet50 + dilated convolutions for 4x upsampling on one 512x1024 image. Our SDM is more efficient by FLOPs. The SDM also improves memory usage in training, as seen in the next table below.
Table A. Computational overhead of upsampling approaches
| Model Architecture | GFLOPs (inference) |
|---|---|
| ResNet50 with SDM | 60.5 |
| ResNet50 with dilated convolutions | 200.7 |
Analysis of inference time and GPU memory requirements
We use a standard ResNet-50 as our encoder and ResNet blocks in our SDM, so we expect our model’s inference time scaling to follow most vision encoders.
GPU memory requirements compared to baselines
Here is a comparison of the training VRAM requirements of PooDLe compared to representative baselines with one sample. For PooDLe, this count includes the joint pooled and dense objectives. In comparison to the SDM, the dilated convolutions in FlowE quadratically increase the activation count in the 3rd and 4th ResNet stages, leading to greater VRAM usage.
Table B. GPU memory requirements of different methods
| Method | Resolution | Architecture | VRAM (training, GBs) |
|---|---|---|---|
| PooDLe | 512x1024 | ResNet-50 | 9.72 |
| FlowE | 512x1024 | ResNet-50 | 12.43 |
| DenseCL | 512x1024 | ResNet-50 | 3.24 |
| DINO | 512x1024 | ResNet-50 | 3.89 |
| DoRA | 224x224 | ViT-S/16 | 4.95 |
Sensitivity to video frame rate or motion blur
In Figure 9, we demonstrate that our method maintains strong performance across different temporal sampling strides, which is analogous to varying the video frame rate. Motion blur is difficult to directly quantify, but we note that our method is able to learn useful representations despite the wide range of motion levels in BDD100K and Walking Tours.
[1] DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI 2017.
[2] Pyramid Scene Parsing Network. CVPR 2017.
[3] Self-Supervised Representation Learning from Flow Equivariance. ICCV 2021.
Dear reviewer, thanks again for taking the time to review our work and provide helpful feedback. Since the final days of the discussion phase are approaching, we wanted to reach out and ask if there are any further questions from your end. We hope that we were able to address your concerns about computational costs and video properties, and welcome any additional input.
Thank you to the reviewers for taking the time to review our work and provide valuable feedback. We appreciate that the reviewers found that our work is novel for tackling spatial imbalance (Reviewer NLbx), is clearly written and well-motivated (Reviewers HmA1, TTAd), justifies its design choices through a comprehensive set of ablations (Reviewer HmA1, jDP3), and demonstrates performance improvements over prior methods across various datasets (Reviewers NLbx, HmA1, TTAd, jDP3). We address each reviewer’s comments and questions below and will incorporate their suggestions in our revised paper.
We would like to express our appreciation for the reviewers for taking the time and effort to review our work, provide helpful feedback, and engage in insightful discussions. We are glad that through the discussion period, we were able to do the following:
- Demonstrate that the SDM’s inference overhead is minor compared to other methods, alleviating concerns about evaluation (Reviewers NLbx, jDP3)
- Provide additional ablations to highlight the importance of our method’s integrated technical components (Reviewer TTAd)
- Explore and discuss ideas for ongoing / future work on improving subcrop selection (Reviewer HmA1)
- Improve our paper’s presentation by clarifying wording regarding evaluation comparisons and analysis of data augmentation parameters (Reviewer jDP3)
We will incorporate these discussions and improvements into our revised paper.
This paper proposes a self-supervised learning method for video with dense scenes. It combines an invariance-based objective on pooled representations and a dense SSL objective that enforces equivariance to optical flow warping. This dual-objective approach, augmented with a spatial decoder module (SDM), is shown to effectively represent both large and small objects. The experiments on first-person video datasets have shown better results than other SSL methods, particularly on small object recognition.
All reviewers are unanimously positive on this submission and appreciate the contributions, including: 1) the approach is somewhat novel and well motivated; 2) the experiments are extensive and support the proposed claims; 3) the paper is well written with sufficient detail and precision. The common concerns from the reviewers are 1) complexity overhead; 2) missing further ablations/comparisons/discussion; 3) the technical contributions are somewhat incremental given previous works. However, most of the concerns were well addressed after the rebuttal. The AC agrees with the reviewers' unanimous decision to accept.
Additional Comments on Reviewer Discussion
The common concerns from the reviewers are 1) complexity overhead; 2) missing further ablation/comparison/discussion; 3) technical contributions are kind of incremental given the previous works. However, most of the concerns were well addressed after the rebuttal. Reviewer NLbx didn't respond, but the authors have provided sufficient responses.
Accept (Poster)