Looping LOCI: Developing Object Permanence from Videos
Loci-Looped advances unsupervised object identification and tracking by integrating an internal loop through which it learns physical concepts such as object permanence in a fully emergent manner. It outperforms state-of-the-art models in occlusion scenarios.
Abstract
Reviews and Discussion
The paper focuses on the problem of compositional scene representation learning from videos. Specifically, the authors propose an extension of Loci (Traub et al., ICLR 2023) with an additional module that decides whether to leverage the sensory input to determine the object state or to rely purely on previous object states. In this way, the proposed method is able to handle object occlusions and sensory interruptions (masking random frames to black). The experiments show that the model obtains better object tracking performance on the ADEPT dataset compared with baselines, and better robustness to sensory interruptions on the CLEVRER dataset.
Strengths
- The paper proposes an interesting extension of Loci to handle object occlusion and sensory interruptions.
- The proposed method exhibits strong object tracking performance on the ADEPT dataset compared with previous methods like Loci and SAVi.
Weaknesses
- My major concern about the paper lies in the limited contribution and generalizability of the proposed method. The design of the model is a bit complex and ad hoc. The core contribution is the introduction of an inner loop that enables the model to imagine object dynamics without sensory inputs. To be more specific, the proposed model is encouraged to ignore the sensory input when the object is being occluded. This does not make sense when the camera is moving, as the camera motion (which needs to be inferred from video) should also be considered when inferring the view-centered object motion. Failing to account for that makes the model rather limited. I am interested in the authors' opinions about how to extend the current model to support this. It would be interesting to see the current model's performance on the LA-CATER Moving dataset proposed in [1], which contains camera movements.
- On the other hand, the evaluation of the paper is rather limited. The main experiments regarding object tracking only consider ADEPT, a synthetic dataset with a simple background and at most 3 moving objects. It is unknown how the method will perform on more complex and realistic datasets. I understand this is a common concern for a lot of compositional scene representation learning works, but given the limited technical contribution of the paper and the complexity of the proposed method, I believe a more comprehensive evaluation would make the paper stronger. For example, why do the authors not evaluate on CLEVRER and Aquarium as in the original Loci paper?
[1] Object Permanence Emerges in a Random Walk along Memory. Pavel Tokmakov, et al. ICML 2022
Questions
Apart from the questions I mention in the weaknesses section, I have a few more questions:
- In Figure 4, at t=42, why does the green object cease to exist in the imagination of Loci-Looped?
- Can this inner loop mechanism be leveraged in other compositional scene representation learning methods? Demonstrating the effectiveness of this mechanism on other models will also make the paper stronger.
- Can authors provide more qualitative results (e.g. videos) about the tracking performance of the proposed method?
Thank you for acknowledging that Loci-Looped exhibits strong object tracking performance. Please see the general response, which addresses most of your comments.
Indeed, in Figure 4 the model narrows down the hidden object as it approaches the edge of the occluder. This interesting behavior is fully emergent, as the model never encountered vanishing objects in the training dataset. The model does not completely let the object vanish, which becomes apparent from the surprise signal.
In general, this inner loop should be applicable to other compositional scene representation learning methods if they use a latent prediction module. For example, this is the case in SAVi, where a percept gate could be incorporated after the Correction module. In contrast, this would not work for STEVE, as this model does not predict the next object states. However, we hypothesize that the success of the inner loop in Loci-Looped also depends on its inductive bias to maintain stable object representations, which is enforced by the GateL0RD architecture as well as by the regularization terms.
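To illustrate how such a gate could be wired in, here is a minimal PyTorch sketch of a slot-wise percept gate inserted after a correction step. All names and the controller architecture here are illustrative assumptions, not the exact Loci-Looped implementation:

```python
import torch
import torch.nn as nn

class PerceptGate(nn.Module):
    """Hypothetical slot-wise percept gate: fuses an internally predicted
    slot state with the sensory-driven (corrected) slot state."""

    def __init__(self, slot_dim: int):
        super().__init__()
        # Maps the concatenated [predicted, corrected] pair to a
        # fusion weight alpha in (0, 1) per slot.
        self.controller = nn.Sequential(
            nn.Linear(2 * slot_dim, slot_dim),
            nn.Tanh(),
            nn.Linear(slot_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, predicted: torch.Tensor, corrected: torch.Tensor):
        # predicted, corrected: (batch, num_slots, slot_dim)
        alpha = self.controller(torch.cat([predicted, corrected], dim=-1))
        # alpha -> 1: trust the sensory update (outer loop);
        # alpha -> 0: keep imagining (inner loop).
        fused = alpha * corrected + (1.0 - alpha) * predicted
        return fused, alpha
```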
Thanks authors for the response. I believe the experiments of the paper still fail to demonstrate the generalizability of the method. I will keep my rating.
Dear reviewer - thank you for your time.
We should have re-mentioned this in our direct reply to you; in our general reply we state:
- We emphasize that Loci-Looped has the potential to scale to more real-world objects, more complex backgrounds, and moving cameras, which is confirmed by a recently published related work on the MOVi-* datasets and another state-of-the-art benchmark suite.
- Note that this work mostly beats the state of the art on MOVi-E and the other benchmark suite (comparing to the NeurIPS 2022 SAVi++ paper [1] and relating to the ICLR 2023 DINOSAUR paper [2]).
Thus, while not shown in this paper, it has been demonstrated that this method generalizes.
This paper is about learning object permanence and learning to adaptively fuse inner and outer information - a challenge that has been around and has been tackled for decades (cf. [3]). We solve this challenge for the first time without the provision of any mask/object information whatsoever.
We would highly appreciate it if the reviewer could find the time to consider these critical aspects.
[1] Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C. Mozer, and Thomas Kipf. SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos, December 2022. URL http://arxiv.org/abs/2206.07764. arXiv:2206.07764
[2] Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, and Francesco Locatello. Bridging the Gap to Real-World Object-Centric Learning, March 2023. URL http://arxiv.org/abs/2209.14860. arXiv:2209.14860.
[3] Yuko Munakata, James McClelland, Mark Johnson, and Robert Siegler. Rethinking infant knowledge: Toward an adaptive process account of successes and failures in object permanence tasks. Psychological Review, 104(4):686–713, 1997. doi: 10.1037/0033-295x.104.4.686.
The authors propose a follow-up architecture with a different set of regularizers based on prior slot attention next-frame prediction work. The authors claim that the system is able to "form concepts of object permanence and inertia from scratch in a fully self-supervised manner." The authors evaluate the system on two datasets.
优点
- S1 The study could offer understanding of the advantages and limitations of current machine learning systems for modeling object permanence.
- S2 The approach description is written clearly. The authors provide detailed descriptions of their proposed approach.
Weaknesses
W1 - A critical aspect of the results - visualization of the slot attention decomposition - is missing.
The results section does not contain any illustrations of slot decomposition and roll-out results across time. Please see Figure 6 in [1] and Figure 5 in [2]. It is critical to visualize slot decompositions, especially given how strongly the authors are attempting to make the "object permanence" claim.
W2 - More ablation experiments are needed to justify the robustness of the system.
- For instance, for Figure 2, what would happen if the authors changed the camera pose, such as following a spherical trajectory, while rolling out the model? The authors could visualize slot decompositions while varying camera poses.
- How important is each of the regularizers proposed in Table 1? Many changes are made going from Loci-v1 to Loci-looped.
W3 - More comparisons with baselines are needed. The authors did not compare against many powerful frameworks, such as [1] and [2]. It would be very valuable to know where the proposed approach stands.
W4 - More discussions and analysis are needed to justify strong claims such as "forming the concept of object permanence and inertia from scratch."
- How is the inertia property tested in ADEPT's vanish scenario?
- Can the proposed system estimate the unknown inertial parameters of rigid bodies in the physical system given videos?
W5 - What are the limitations of this work? There does not seem to be any discussion. For example, how would the system perform if non-rigid materials were present in the scene? Suppose a cup of water is being poured into a basket that is occluded by a board, a similar setup to your current experiments. Could the system still infer that the water is permanent across time?
[1] Elsayed et al., SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos, NeurIPS 2022.
[2] Wu et al., SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models, ICLR 2023.
Questions
Please see the weaknesses section above. Thanks.
Details of Ethics Concerns
N/A
Thank you very much for the thoughtful comments.
Thank you for the helpful comment pointing out that the slot decomposition is missing in the illustrations. We now provide illustrations of the slot decomposition in the appendix. We hope this makes clear that the objects persist in the slot representation when occluded. Moreover, we now provide videos in the supplementary material.
Concerning the importance of the different regularization terms: the Gestalt Change, Position Change, and Object Permanence regularizations are ablated in Traub et al. (2023), suggesting that the Gestalt Change and Position Change regularizations are important for learning. We are currently running ablation studies on the Input-Frame Reconstruction Loss and the Gate Opening Regularization and will add them to the appendix.
We now clarify in the methods section that we define inertia as the continuation of motion unless acted on by an external force. We test this property by letting the model predict how occluded objects continue to move while hidden or during a blackout.
Indeed, non-rigid materials present a challenge to the current model (and very likely to most others). We are happy to include this aspect among the other limitations in the discussion section.
In this manuscript the authors describe an extension of their Loci method for tracking objects in video. They add an internal connection that propagates latent predictions directly, to a model that previously had only a loop through pixel space. This addition substantially improves tracking of fully occluded objects in simplified rendered scenes.
Strengths
Overall this approach is interesting as an interpretable model of object tracking and the authors present some observations showing that their method tracks fully occluded objects now. The model still learns in an unsupervised way from video data, is evaluated on a few different tasks and does pass basic object constancy tests as used for children.
Weaknesses
The tests are all performed in extremely reduced situations where simple objects move along fully predictable straight trajectories, which raises concerns about the scaling and generality of the approach. Additionally, the work is clearly incremental, as it extends a highly similar method from last year. Thus, I am not convinced this manuscript warrants another publication yet.
For the evaluations, I believe a more natural dataset with higher variability in the trajectories and object motion and/or in the properties of objects would be desirable. And even for the simple situations covered by the manuscript, more comparison models are necessary. At the very least, the PLATO and ADEPT models mentioned in the manuscript should really be tested on the same data for comparison. Going further, I am also not convinced that other models without explicit object representations categorically cannot represent objects through occlusions. While I share the intuition that they should be worse, I think this should be shown properly by evaluating such more general models on the same test data.
And on the model side, it is an observation that this added loop improves the Loci model, but this seems to me like an incremental improvement to this specific model. To be convincing as an insight about models in general, I would require an application to multiple models showing that this is indeed a productive direction for object slot models. As it stands now, I don't see any clear insight to gain from this manuscript beyond the authors presenting a revised version of their model.
Questions
I don’t have questions for the authors.
Thank you very much for acknowledging that the addition of the inner loop substantially improves tracking for occluded objects.
Thank you for the comment. Indeed, the tested occlusion scenarios are rather simple. It is all the more surprising that these tests cannot be solved by current state-of-the-art models like Loci-v1 or SAVi. Although our introduction of the internal information fusion process is incremental, our work clearly shows that this addition has a major effect on the model's ability to deal with occlusion scenarios. To the best of our knowledge, this is the first time such a process has been introduced for compositional scene representation models.
In general, the inner loop can be incorporated into all models that make use of a latent prediction model. For compositional scene representation models, this is however not very common, as many models do not ensure slot consistency over time using recurrent next-step predictions. Incorporating the percept gate into other compositional scene representation models is an interesting avenue for future experiments. The slot-wise information fusion process may also be an interesting candidate for other object-centric transition models in different model domains, for example in the Dreamer (Hafner et al., 2019) architecture.
I appreciate the authors' response and agree with them that the loop they introduce here could still be an interesting addition for more complex situations or other models. However, I maintain that this would be necessary to complete this project and make it convincing enough. Thus, I did not change my ratings.
While we highly appreciate your time, with all due respect, you appear to have overlooked our comment in our general reply (and the edits in the text):
- We emphasize that Loci-Looped has the potential to scale to more real-world objects, more complex backgrounds, and moving cameras, which is confirmed by a recently published related work on the MOVi-* datasets and another state-of-the-art benchmark suite.
- Note that this work mostly beats the state of the art on MOVi-E and the other benchmark suite (comparing to the NeurIPS 2022 SAVi++ paper [1] and relating to the ICLR 2023 DINOSAUR paper [2]).
Thus, it has already been shown (albeit not in this paper) that the introduced loop scales to more complex situations.
This paper is about learning object permanence and learning to adaptively fuse inner and outer information - a challenge that has been around and has been tackled for decades (cf. [3]). We solve this challenge for the first time without the provision of any mask/object information whatsoever.
We would highly appreciate it if the reviewer could find the time to consider these critical aspects.
[1] Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C. Mozer, and Thomas Kipf. SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos, December 2022. URL http://arxiv.org/abs/2206.07764. arXiv:2206.07764
[2] Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, and Francesco Locatello. Bridging the Gap to Real-World Object-Centric Learning, March 2023. URL http://arxiv.org/abs/2209.14860. arXiv:2209.14860.
[3] Yuko Munakata, James McClelland, Mark Johnson, and Robert Siegler. Rethinking infant knowledge: Toward an adaptive process account of successes and failures in object permanence tasks. Psychological Review, 104(4):686–713, 1997. doi: 10.1037/0033-295x.104.4.686.
The paper provides an improved version of the recently proposed Loci (location and identity tracking) model by introducing an internal loop of prediction and updating. The major advantage of the proposed Loci-Looped model is that the network is able to track objects even when they are occluded or during blackouts. The paper claims that the model shows a surprise signal when an object violates object permanence, reflecting that it has learned the rules of object permanence and inertia.
Strengths
Infants' learning of object properties is holistic: they not only learn to segment objects but indeed learn certain properties of objects (such as object permanence) without supervision. It is nice to have a model that simultaneously learns both without supervision. As pointed out by the authors, although many models show good performance on some datasets, they use supervision or conditioning signals that are not available to the human brain and thus do not provide insight into how these abilities can be jointly learned without supervision, as infants do.
The idea of separately generating an object mask and a visibility mask appears to be novel; this separation is required for demonstrating object permanence.
Weaknesses
I believe that the ability to correctly segment objects in the newly proposed framework largely comes from the information bottleneck. Although it works well with the chosen datasets, which have almost pure colors in each object, I doubt it will work well in environments with more complex texture as in nature, or in datasets such as MOVi-C and beyond (https://github.com/google-research/kubric/tree/main/challenges/movi). Although I guess segmentation is not treated as a major contribution, I worry about how well the proposed principle can scale up and generalize.
If we are restricted to a simple environment, one can imagine that another model that simply clusters pixels based on colors and estimates the center of mass of the clustered pixels could likely segment and localize objects correctly in the CLEVRER dataset, without using a neural network. Then an RNN that learns to predict the center of mass based on the previous trajectory extracted with the above approach, and that weights its loss function based on the number of pixels in the correct color corresponding to that object, could also shut down the gradient when an object is occluded and then receive a teaching signal once the object reappears. This way, the RNN may also be able to learn to predict a linear moving trajectory. Now perhaps what I describe here is essentially similar to the idea of the percept gate, and perhaps the advantage of the current model is that the gate is learned rather than being hard-coded as decided by pixel counts. What I am trying to say is that the environment may not pose enough challenge for the task that the model aims to solve (all of segmentation, localization, and tracking). If an environment allows defining the gate by a pre-defined rule, then learning it seems trivial. I am not against using such a dataset for proof of principle, but I think this limitation should not be ignored.
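For concreteness, the hand-coded segment-and-localize step I have in mind could be as simple as the following sketch (purely illustrative; it assumes RGB frames in [0, 1] and a known reference color per object):

```python
import numpy as np
from scipy.ndimage import center_of_mass

def segment_and_localize(frame: np.ndarray, object_color: np.ndarray, tol: float = 0.05):
    """Cluster pixels by color and return the object's center of mass.

    frame: (H, W, 3) RGB image in [0, 1]; object_color: (3,) reference color.
    The pixel count doubles as a hand-coded 'gate': zero pixels means the
    object is occluded and no teaching signal should be generated.
    """
    mask = np.all(np.abs(frame - object_color) < tol, axis=-1)
    n_pixels = int(mask.sum())
    if n_pixels == 0:
        return None, 0  # occluded: the downstream RNN runs on its own prediction
    return center_of_mass(mask), n_pixels
```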
I hope that the writing can be made slightly clearer. For example, the meaning of "Gestalt" has expanded to capture the principle of organizing parts into an object, due to the Gestalt school of psychology. If you follow the simple description of the Gestalt code as "mainly representing shape and surface pattern" from the Loci-v1 paper, it would be a good idea to define it here as well.
There are other unclear parts. Please see my questions.
The slot error is claimed to serve as a surprise signal, but some details of its pattern appear strange to me. In Figure 3, after frame 30, the slot error is of similar magnitude for both the reappearing and the vanishing object. If, as the paper claims, the model learns to imagine objects behind the plate during occlusion, then I would expect it to predict the appearance and location of the object more or less correctly when the object should reappear. In other words, the slot error should be smaller in the reappearing case than in the vanishing case (which violates object permanence and should not be predictable at all). The indifference here seems to indicate either that the prediction of the occluded object is quite wrong upon reappearing, or that the model somehow learns to predict the vanishing object somewhat correctly.
Questions
In equation 2, there seems to be no restriction mentioned that the visibility mask should be smaller than the object mask (or even contained within it); I assume it is possible for the numerator to become larger than the denominator and for the occlusion state to become negative. How do you prevent this?
There is a sentence between Figure 2 and 3.3: "By adding Gaussian noise with a fixed standard deviation ..., learning is biased to move further into plateaus away from ridges where possible." I am sorry that I did not quite get what plateaus and ridges refer to here. Something about loss landscape as a function of all network weights?
Below that, it is stated that L0 regularization is imposed on the gate opening, but the L0 loss has no gradient. If I understand correctly, equation 10 indicates that you instead use 1 as the gradient. To me it seems that you are actually imposing an L1 loss for positive values instead of an L0 loss.
In the results of 4.2, next to Figure 3, I did not really understand the "significant correlation between the slot error of vanished objects and the visibility of reappearing objects". If these two quantities belong to two different objects, why should we expect them to be correlated? I also don't understand what correlation is referred to by "likewise, we find the same pattern for the size of the visibility mask".
In the experiment of 4.2, does the occluding plate also get assigned a slot, or is it treated as part of the background by the network? What does the prediction for the plate look like when it falls?
We deeply value the reviewer's recognition of our model's ability to emulate infants' holistic learning without supervision -- a pivotal aspect in understanding how these innate abilities can be jointly learned.
We appreciate the suggestion of a rule-based baseline. We have added a Loci-Visibility baseline, whose percept gate is controlled solely by the perceived occlusion state, i.e., Loci switches to the inner loop when objects become occluded. We find that it performs superior to Loci-Unlooped, highlighting the importance of the inner loop. However, it performs inferior to Loci-Looped, demonstrating the importance of learning an adaptive gate control function that flexibly balances the inner and the outer loop, rather than approximating a simple rule.
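For clarity, the rule implemented by this baseline can be summarized in a few lines (a sketch only; the threshold value is an illustrative placeholder, not the exact value used):

```python
import torch

def visibility_gate(occlusion_state: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Hard-coded percept gate of the Loci-Visibility baseline (sketch).

    occlusion_state: per-slot value in [0, 1], where 1 means fully occluded.
    Returns 1 (use sensory input, outer loop) while the object is visible
    and 0 (rely on the latent prediction, inner loop) once it is occluded.
    """
    return (occlusion_state < threshold).float()
```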
We also recognize the necessity of testing the model on more complex occlusion scenarios, such as non-linear object movements or containment situations. This limitation has been now duly addressed in our discussion, emphasizing the need for datasets that pose greater challenges in terms of segmentation, localization, and tracking to further evaluate and enhance the model's abilities.
Thank you for pointing out that Figure 3 was a bit confusing in that respect. Due to averaging effects, the greater surprise in the vanishing trials was not explicitly illustrated in the plot. We have added an additional box plot (Figure 3a) showing that the maximum slot error is significantly larger when hidden objects fail to reappear after occlusion, and after the occluder rotates to the ground, compared to trials in which objects do reappear. We have also added a video to the supplementary material, which shows that our system expects the reappearance and then "parks" the object behind the occluder until the occluder falls over. Please note that this latter behavior is fully emergent, as our system was not trained on vanishing trials.
Answering the questions: the requirement that the visibility mask must be contained within the object mask stems from equation 1. This constraint emerges from our approach to computing the object mask, where we consider only slot-object k within the scene, disregarding other slots. Consequently, slot k only competes with the background for visibility, yielding the object mask. Conversely, when deriving the visibility mask, slot k competes not just with the background but also with all other slots present. If the other slots do not intersect with slot-object k, the object mask aligns with the visibility mask. However, if there is an overlap between the remaining slots and slot-object k, the visibility mask becomes a subset of the object mask. This implies that the visibility mask can never exceed the object mask (paragraph 3.2.1).
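To make the competition logic concrete, here is a small sketch (shapes and function names are illustrative assumptions; equations 1 and 2 in the paper are the authoritative definitions):

```python
import torch

def masks_and_occlusion(slot_logits: torch.Tensor, bg_logits: torch.Tensor, k: int):
    """Sketch of the mask logic described above.

    slot_logits: (num_slots, H, W) per-slot mask logits
    bg_logits:   (H, W) background mask logits
    """
    # Object mask: slot k competes with the background only, so it also
    # covers the parts of the object hidden behind other slots.
    obj = torch.softmax(torch.stack([slot_logits[k], bg_logits]), dim=0)[0]
    # Visibility mask: slot k competes with all slots and the background,
    # so overlapping slots carve away the occluded regions.
    vis = torch.softmax(torch.cat([slot_logits, bg_logits[None]]), dim=0)[k]
    # Because vis results from a competition with strictly more rivals,
    # vis <= obj holds pixel-wise and the occlusion state stays in [0, 1].
    occlusion = 1.0 - vis.sum() / obj.sum().clamp_min(1e-8)
    return obj, vis, occlusion
```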
Indeed, the sentence "By adding Gaussian noise with a fixed standard deviation ..., learning is biased to move further into plateaus away from ridges where possible", refers to the landscape of the rectified tanh activation function. We rephrased this sentence to "... the gates tend to either close or open, rather than remaining partially open", to improve comprehension.
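As an illustration of this intuition, a sketch of such a noisy, rectified-tanh gate activation (the noise level sigma is an assumed placeholder):

```python
import torch

def noisy_gate_activation(x: torch.Tensor, sigma: float = 0.1, training: bool = True) -> torch.Tensor:
    """Rectified tanh with additive Gaussian noise during training.

    Near x = 0 (the 'ridge'), the noise makes the gate output jitter between
    open and closed, incurring loss; gradients therefore push x onto the flat
    plateaus (clearly negative: closed, or saturated positive: open).
    """
    if training:
        x = x + sigma * torch.randn_like(x)
    return torch.relu(torch.tanh(x))
```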
The distinction between L0 and L1 regularization lies in the computation of the backpropagated error. In the scenario of the L1 loss, the backpropagated error equates to the value of alpha (the gate opening; alpha * 1). In our case, however, the backpropagated error is either 1 (when alpha > 0) or 0 (when alpha = 0), expressed as Heaviside(alpha) * 1.
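In code, this distinction could look like the following straight-through sketch (our illustrative reconstruction, not the exact implementation):

```python
import torch

class L0GatePenalty(torch.autograd.Function):
    """L0-style gate penalty with a Heaviside surrogate gradient."""

    @staticmethod
    def forward(ctx, alpha: torch.Tensor) -> torch.Tensor:
        ctx.save_for_backward(alpha)
        # Forward pass: count open gates (the true, non-differentiable L0 value).
        return (alpha > 0).float().sum()

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor) -> torch.Tensor:
        (alpha,) = ctx.saved_tensors
        # Backward pass: Heaviside(alpha) -- gradient 1 wherever a gate is
        # open, 0 where it is closed, independent of the gate's magnitude.
        return grad_output * (alpha > 0).float()

# Usage: loss = reconstruction_loss + beta * L0GatePenalty.apply(alphas)
```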
Thank you for pointing out that the correlation needs more explanation. The concept here revolves around the similar trajectories of reappearing and vanishing objects. The visibility spike of reappearing objects (observed after frame 30 in Figure 3) signifies the expected moment when both vanishing and reappearing objects should become visible again. Therefore, the visibility of reappearing objects serves as an indicator of when vanished objects should become visible. Our observation reveals that both the slot error and the visibility mask for vanished objects rise within this specific time interval. This suggests that the model anticipates the vanished objects becoming visible again during this interval, effectively predicting their reappearance. We have revised the paragraph to convey this argument more clearly and moved it to the appendix.
Indeed, the occluding plate is also assigned a slot. It is represented as one object in the model. We have added illustrations to the appendix showing the model's prediction for the falling plate. Moreover, we have added video material to the supplementary material.
Thank you for your consideration and time. We first provide a brief summary of our answers and modifications for your convenience. We then provide more detailed answers to common concerns, followed by reviewer-specific answers. The revised paper version shows novel text passages in red. Minor textual changes and deletions are not visualized.
Here are the highlights of our answers and revision:
- We confirm system surprise in the vanishing object scenarios (by plotting maxima, which were averaged-out previously). We furthermore show clear system surprise when the occluder falls over. Note that this is fully emergent behavior.
- We add video material and illustrations in the appendix showing the individual slots over time, selective information request for Gestalt- and Position-Code-Input -- as well as further details on how Loci-Looped deals with the situation when an object does not re-appear.
- We clarify that the novel internal loop yields the first fully unsupervised learning model that learns about object permanence and inertia from video without any prior information about objects.
- We emphasize that Loci-Looped has the potential to scale to more real-world objects, more complex backgrounds, and moving cameras, which is confirmed by a recently published related work on the MOVi-* datasets and another state-of-the-art benchmark suite.
- We clarify that the comparison with SAVi also implies superior performance relative to SAVi++, whose internal loop relies on SAVi's slot assignment logic.
- We clarify that comparisons with PLATO are not meaningful, because PLATO is provided with perfectly slotted information.
- We clarify that comparisons with the ADEPT model are not meaningful because this model is trained in a fully supervised manner and is informed by the physics simulator.
- We add a Loci-Visibility baseline, which yields performance between Loci and Loci-Looped, showing that the internal loop with adaptive fusion mechanism is superior to a mere occlusion indicator.
First of all, we want to express our appreciation to the reviewers' and area chair's detailed, excitingly positive, and constructive feedback and the time they have spent to improve our work. We worked hard to increase the conciseness and comprehensibility of our manuscript. In the following, we respond to the general points that were addressed by multiple reviewers. Please find our answers to specific questions in the responses to the individual reviews.
A recurring concern raised by several reviewers centered on the scalability and generalizability of our model to more complex datasets. We acknowledge this concern as an important step for future research. However, our work is about developing object permanence from video in a fully unsupervised fashion and without any prior information about objects whatsoever. None of the other current state-of-the-art systems (including those mentioned by the reviewers) are able to accomplish this. Besides, Traub et al. (2023) showed that Loci is able to handle more complex objects with diverse texture patterns. Within the short rebuttal period it is unfortunately not possible for us to create a suitable dataset and probe this ability in this paper. However, we refer to a recent related publication on Loci, which demonstrates largely superior scene segmentation performance on the MOVi-E dataset.
Another shared concern was the model's inability to handle moving cameras, an aspect not explicitly addressed in our current architecture, which primarily focuses on static camera poses. We acknowledge this as an intriguing extension, which we have highlighted in our discussion section. We anticipate that the perceptual gating mechanism could seamlessly align with a model accommodating camera motion. Utilizing motion signals, derived from optical flow or provided as command signals, offers a promising approach to dynamically update predictions in line with camera movements. This adaptation aligns predictions with updated sensations, allowing the percept gate to function as intended in ensuring correspondence between predictions and sensory input. Moreover, note that a recent related paper reports that Loci yields largely superior performance on the MOVi-E benchmark, which also includes non-stationary camera poses.
Lastly, the reviewers criticized missing baselines, for example lacking comparisons to SAVi++ and SlotFormer, beyond our comparison with SAVi (which shows that Loci-Looped clearly outperforms SAVi). In our understanding these comparisons are of limited use. SAVi++'s main extension is its improved performance on real-world datasets, incorporating camera motion and explicitly exploiting ground-truth depth information in training. Neither of these characteristics applies to our study of object permanence and our datasets. Moreover, there is no architectural improvement from SAVi to SAVi++ that would address the problem of maintaining stable slot representations of temporarily hidden objects, suggesting that the performance of SAVi is a good indicator of how SAVi++ would perform on our tests. SlotFormer, on the other hand, is not a compositional scene representation model but a slot-based video prediction model that trains on and relies on pre-computed slot representations, for example computed using SAVi or STEVE.
Concerning the intuitive physics models: we were not able to train PLATO on the ADEPT vanish scenario (as also stated in Piloto et al. (2022)), because the model expects aligned input masks that need to be provided consistently. In addition, PLATO requires a very coarse temporal resolution (15 frames for one video), simulating only short occlusions, whereas Loci-Looped and SAVi can be trained on fine temporal resolutions (41 frames), simulating longer occlusions. We did not include the ADEPT model as a baseline, as it would be a skewed comparison in our opinion: the model depends on supervised information to train its encoder, its decoder, and its particle filter, and it uses an out-of-the-box physics engine. We did not include baselines without explicit object representations, as numerous related works suggest that object-agnostic models perform inferior (Piloto et al. (2022), Smith et al. (2019), Wu et al. (2023), Villar-Corrales et al. (2023)). We now give a detailed explanation of our choice of baseline models in the appendix.
This paper extends an unsupervised object-centric video model with a latent dynamics model (which can run in the absence of new visual information) to address the problem of unsupervised object tracking and video prediction under occlusion. The authors link this to the concept of object permanence in cognitive science.
The method is novel, interesting, and (at least on one dataset/setting) convincingly demonstrates that the model tracks objects through multiple time steps of occlusion without direct supervision. The reviewers agree that this, overall, is an interesting submission of relevance to the ICLR audience. Especially the novelty of the method and the positive results (albeit in a limited setting) prompted one reviewer (out of 4) to recommend acceptance of the paper.
During the rebuttal, the authors have added several clarifications, visualizations, and video results (one video per model, per dataset).
Regarding weaknesses, the reviewers (post-rebuttal) primarily highlight the limited experimental validation of the method. I agree with this concern and second the recommendations made by the reviewers: I recommend that the authors extend the experimental validation of the method and clearly investigate the limits of the approach. For example, does the method work when there is camera motion present? Does it work in combination with more recent architectural innovations that make it scale to more complex data (e.g. textured scenes). The authors say it “has the potential” to do so, but given the current state of the field and availability of these models (incl. open-source code) I would highly recommend validating this intuition.
Furthermore, comparisons to baselines (incl. model ablations) are indeed a bit sparse: e.g. the suggestion of running a comparison against SlotFormer is interesting even though it is a two-stage model (and should not be dismissed solely based on this detail); it is unclear whether an attention-based dynamics module (w/ access to past time steps) as in SlotFormer would already “solve” object permanence on the benchmark tasks discussed in this paper (e.g. via attending to the correct object slot prior to occlusion).
Several questions by the reviewers were left unanswered and I highly recommend studying these in detail and addressing them in a future version of the paper.
While I would love to see this kind of paper presented at a conference like ICLR, I think that in its current state the paper does not quite meet the bar for acceptance. I recommend that the authors expand the experimental validation of their approach and I am sure that reviewers at a future venue will be more receptive to this paper.
Why Not a Higher Score
Insufficient experimental validation (see main meta review).
Why Not a Lower Score
N/A
Reject