PaperHub
NeurIPS 2024 · Poster · 4 reviewers
Overall rating: 6.3/10 (individual ratings: 6, 7, 6, 6; min 6, max 7, std 0.4)
Confidence: 2.8 · Correctness: 3.3 · Contribution: 2.8 · Presentation: 2.8

Inference via Interpolation: Contrastive Representations Provably Enable Planning and Inference

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-12-24
TL;DR

While prediction and planning over time series data is challenging, these problems have closed form solutions in terms of temporal contrastive representations

Abstract

Keywords

contrastive learning, prediction, planning, inference, time-series

Reviews and Discussion

Official Review (Rating: 6)

This paper applies a variant of contrastive learning to time series data. Contrastive learning has been used in many areas, but the insight of this paper is that the joint distribution of the learned representations is also Gaussian. In addition, the experimental results show that the theory can be applied to tasks of up to 46 dimensions.

Strengths

  1. The problem studied in this paper is fundamental.
  2. This paper provides a strong theoretical analysis.
  3. The writing of this paper is clear. It is easy for me to follow the main idea of this paper.

Weaknesses

  1. Using contrastive learning to address various problems is not novel. However, this paper provides some new insights, such as that representations learned via temporal contrastive learning follow a Gauss-Markov chain.
  2. The evaluation section is too simple, and it is possible to add more evaluation. Since this is an ML conference, more experimental results would make the paper more solid.

Questions

  1. As for section 5 Numerical Simulation, is it possible for the authors to use real datasets? For example, the authors can collect some datasets from the Traffic and Stock areas and evaluate the solution mentioned in this paper over them.
  2. In addition, I think efficiency is also important for high-dimensional tasks. Could the authors show the running time of each experiment?

Minor:

  1. As for citations, the NeurIPS template uses [10, 12] instead of (10;12). I am not sure whether you have changed the template, but it would be great if you could fix it.

Limitations

I think the authors have adequately addressed the limitations.

Author Response

Dear Reviewer,

Thank you for the detailed review and for the suggestions for improving the work. We have run an additional experiment on a stock price dataset, and revised the paper to include the running time for each experiment. Together with the responses below, does this fully address the reviewer's concerns about the paper?

Kind regards,

The authors.

is it possible for the authors to use real datasets? For example, the authors can collect some datasets from the Traffic and Stock areas and evaluate the solution mentioned in this paper over them.

As suggested by the reviewer, we ran an additional experiment on a stock dataset. The data are the opening prices for the 500 stocks in the S&P 500 over a four-year window. We remove 30 stocks that are missing data, so the resulting data is 470-dimensional. For evaluation, we choose a 100-day window from a validation set and use Lemma 3 to perform "inpainting", predicting the intermediate stock prices given the first and last stock price (see Figure R2 in the rebuttal PDF). While we do not claim that this is a state-of-the-art model for stock prediction, this experiment demonstrates another potential application of our theoretical results.
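[Editor's note] A minimal sketch of this kind of representation-space inpainting, under the simplifying assumption that the learned transition matrix is close to the identity (so the posterior mean reduces to linear interpolation between the endpoint representations, cf. Lemma 3). Function and variable names are illustrative, not the authors' code:

```python
import numpy as np

def inpaint_representations(psi_start, psi_end, num_steps):
    """Sketch: interpolate between the representations of the first and last
    observation; each row is a predicted intermediate representation."""
    alphas = np.linspace(0.0, 1.0, num_steps)[:, None]                 # (T, 1)
    return (1.0 - alphas) * psi_start[None] + alphas * psi_end[None]   # (T, d)

# Usage sketch: encode the first and last observation, interpolate, then map
# the interpolated representations back to observations (e.g., by nearest
# neighbor in a bank of encoded observations).
```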

Could the authors show the running time of each experiment?

For the datasets considered in this paper, the training time for temporal contrastive learning was less than 5 minutes per dataset. We will add the exact runtime for each experiment to the final paper.

Using contrastive learning to address various problems is not novel.

We agree that prior work has used contrastive learning to address a wide range of problems. Our paper uses theoretical analysis to unlock a new use cases (e.g., the planning capability demonstrated in Fig 5). We will revise the paper to note that prior work has used contrastive learning to address several other downstream problems.

As for citation, the template of NeurIPS is [10, 12] instead of (10;12).

We have fixed this. Thanks for catching this!

Official Review (Rating: 7)

This paper shows that learning contrastive representations with the infoNCE objective, combined with an L2 constraint on these representations, yields Gaussian distributions over the representations. The authors then show that, for time-series data, when a specific Markov chain assumption (a kind of linear state-space model) is made, the distribution over learned representations takes a convenient Gaussian form for all future or past representations. As a consequence, planning can be performed through simple linear interpolation within the learned space.
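[Editor's note] A small sketch of the kind of linear state-space (Gauss-Markov) structure described above; the matrix A and noise scale are illustrative choices, not values from the paper:

```python
import numpy as np

def simulate_gauss_markov(A, z0, T, noise_scale=1.0, rng=None):
    """Simulate a generic Gauss-Markov chain z_{t+1} = A z_t + w_t,
    w_t ~ N(0, noise_scale^2 I). Because the chain is jointly Gaussian,
    all marginals and conditionals over future/past states are Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    zs = [np.asarray(z0, dtype=float)]
    for _ in range(T):
        zs.append(A @ zs[-1] + noise_scale * rng.normal(size=zs[-1].shape))
    return np.stack(zs)  # shape (T + 1, d)
```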

The paper briefly validates the method in a didactic experimental setup.

Strengths

This paper is very well written; the presentation of the theoretical results, their intuition, and how they are used is for the most part excellent.

The experiments nicely support the main point of the method: within a contrastively learned space, planning can be done in effectively closed form rather than with an expensive dynamic programming method. The visualizations nicely show how the authors' method differs from the baselines.

Weaknesses

Summary

I find the discussion and narrative of this paper too weak to confirm the novelty and contributions of the method. The authors make assumptions on their model such that it satisfies a nice practical property: planning through interpolation. However, the theoretical results that are presented should be well known consequences of Gaussian assumptions for linear state space models. The presentation makes this insight almost seem profound, but the weak discussion of related work makes me skeptical of the contribution.

Although this paper presents a nice observation, taking all listed points of improvement into account, I don't think this paper is sufficient yet for an accept. But I'm open to improve the score if the authors address my concerns.

Major comments

  • Lemma 2 is stated as the main result. Although this way of defining waypoint conditionals is an interesting perspective on planning, the result should be a well-known consequence of using Gaussians (The same goes for Lemma 3). This makes it unclear what the novelty/ contribution of this result is. The authors need to improve this by discussing prior work/ results after introducing their model.
  • The title of the paper: inference via interpolation, is too hidden within the main text (at the end of 4.4 in Eq. 7). I would strongly suggest turning this result into a corollary (indeed with proof referring to Newman and Todd 1958 Pg. 471), then moving this result all the way to the top of section 4, and restructuring the text to move towards this result.
  • The interpolation in Eq. 7) only presents the mean-representation. There is no such discussion on the covariance.
  • The experiment of figure 5 is not well-designed. The introduced method makes use of strong assumptions that allow it to use linear interpolation for planning towards future or past states. Still, the authors compare this to baselines for which (also without discussion) we suddenly assume the same functionality; why is this a reasonable assumption? As a result, a baseline without planning outperforms the baselines with planning.
  • Discussion Line 347 (last sentence) strongly oversells the presentation of the paper. The interpolation property is a nice consequence of the method's assumptions (Eq 3. and As.1, 2) and the contrastive learning. However, there is no discussion on how and why other methods cannot achieve similar things. My suggestion is to drop the 'when' comment, and only keep the 'how', because that's what the paper ultimately shows.

Technical comments

  • Eq. 2 is confusing since it is first written that $(x, x^+) \sim p(x, x)$ but in the equation it appears as $p(x)$. The term $p(x, x)$ is wrong; write this as $p(x, x^+)$ and keep its use consistent. Perhaps use $(x^+, x) \sim p(x^+ \mid x)p(x)$ and $x^- \sim p(x)$?
  • Assumption 2: phrasing this result as an assumption weakens the story of the authors, especially since this is the equation that the authors rely on for later analysis. It's usual to make assumptions or simplifications of the real world, so the fact that approximation error or model misspecification might not satisfy the result should not take away from it. I suggest making this a proposition and writing an assumption around this that discusses approximation error; please directly refer to the proof afterwards (citation + where to find it).
  • Lemma 3: $\eta$ isn't clearly shown to have zero entries apart from the first and last index. Why not use the same definition as in Appendix A.4?
  • Missing proof of Lemma 3, no reference to appendix.

Minor comments

  • Line 166 typo: 'that that'
  • The two representations are $\phi$ and $\psi$ composed with $A$. Calling these two separate representations is confusing; there is only one learned representation, and $A$ is used to learn the transport from embedding $\psi(a)$ to $\psi(b)$. Maybe state this upfront, i.e., in the first sentence (line 187) say $\psi(x)$ and $\phi(x) = A\psi(x)$; that helps when reading chronologically.
  • Line 215, refer back to Eq. 3 for the definition of $c$
  • Line 231, proof of lemma 2 is in appendix A.3, not A.2.
  • The three examples in 4.3 for lemma 2 are nice to have for an unfamiliar audience, but I don't see how this contributes to the result since $A$ is learned, and will likely never satisfy trivial solutions.
  • Line 268, the interpolation result is nice, but the linear interpolation insight is not particularly profound (the text makes it seem as if it were so).
  • The start and end symbols in Figure 4 are too small to see and overlap with the black dots too much, even at 500% zoom.
  • Figures 5, 6 are not chronologically ordered/ numbered (6 appears before 5).
  • The choice of baselines and experiments are somewhat limited. I would have liked a comparison to a decoding method which the authors criticize in the introduction paragraph 2 (e.g., using the DreamerV2/ PlaNet RSSM model).
  • There is still space left on page 9, the authors could have spent more space on discussing the limitations of their method, which right now is rather shallow.

Questions

.

Limitations

  1. the paper concludes that it provides a kind of 'recipe' for how to get models that have a linear interpolation property for planning, but aside from the model that the authors designed in a very specific way this is not discussed.

  2. the experiments are somewhat trivial since the authors expected baselines that were not designed for a particular feature (planning through linear interp.) to suddenly have this feature.

Author Response

Dear Reviewer,

Thanks for the detailed comments and suggestions for improving the paper. It's clear that a large amount of time was spent on this review – we appreciate it, and it will help improve the paper. We have added a new autoencoder baseline. To address the reviewer's concern about the novelty of the results, we propose a revision (see below) that would clearly delineate known results (joint Gaussianity) and new results (about temporal contrastive learning). Below, we also discuss several other revisions to the paper (e.g., clarifying the aim of the experiments), as well as answering questions. Do these revisions and discussion fully address the reviewer's concerns about the paper?

I would have liked a comparison to a decoding method

As suggested by the reviewer, we have added an additional autoencoder baseline to the experiments from Figures 5 and 6, with the results shown in the rebuttal PDF (Fig. R2). As one might expect, this autoencoder baseline does not have the interpolation property.

Lemma 2 is a well-known consequence of using Gaussians

The key novelty in Lemma 2 is that we are analyzing the representations produced from temporal contrastive learning. We agree that it is well known that, for a jointly Gaussian distribution, any marginal and posterior distribution is also Gaussian. The novel part of Lemma 2 is saying that the joint distribution over contrastive representations obeys a Gaussian distribution.
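[Editor's note] For concreteness, the known ingredient referred to here is the standard conditioning identity for jointly Gaussian variables (generic symbols $z_1, z_2$, not the paper's notation):

$$
\begin{pmatrix} z_1 \\ z_2 \end{pmatrix}
\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\right)
\;\Longrightarrow\;
z_1 \mid z_2 \sim \mathcal{N}\!\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(z_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).
$$

The new part, as the authors state, is showing that the temporal contrastive representations are jointly Gaussian in the first place, so that this identity applies to waypoint representations.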

Would it be more clear if we separated Lemma 2 into two results?

  • [not novel] the known result about jointly Gaussian distributions and their marginals/posteriors (citing prior work).
  • [novel] our new result about temporal contrastive representations being jointly Gaussian.

If the reviewer thinks it would clarify the writing, we can make a similar revision to Lemma 3.

The authors compare this to baselines we assume have the same functionality

Thanks for raising this concern. We believe that the experiment does give an important bit of information, but agree that the framing was confusing. We agree that it is unclear whether prior methods have the same functionality (though there is some prior work that uses PCA for interpolation (Oliveira, 2006) and the VIP representations for generating subgoals (Zhang et al., 2023)). At the same time, before working on this paper, we also believed that temporal contrastive methods didn't have this functionality. The aim of this experiment is to show that temporal contrastive methods do have this functionality, and that this functionality isn't caused by some artifact of the dataset. We will revise the paper to clarify.

My suggestion is to drop the 'when' comment, and only keep the 'how', because that's what the paper ultimately shows.

Thanks for the suggestion—we have made this revision.

The title of the paper: inference via interpolation, is too hidden within the main text (at the end of 4.4 in Eq. 7). I would strongly suggest turning this result into a corollary (indeed with proof referring to Newman and Todd 1958 Pg. 471), then moving this result all the way to the top of section 4, and restructuring the text to move towards this result.

Thanks for the suggestion. We have done this.

The interpolation in Eq. 7 only presents the mean-representation. There is no such discussion on the covariance.

We have revised the paper to include the covariance.

Notation for $p(x, x^+)$ in Eq. 2

Thanks for the suggestions here. We have revised the paper based on them.

I suggest making Assumption 2 a proposition and writing an assumption around this that discusses approximation error

Thanks for the suggestion – we have done this.

Clarify which entries are zero in Lemma 3

Thanks for the suggestion – we have done this.

Missing proof of Lemma 3, no reference to appendix.

The proof of Lemma 3 is in Appendix A.4. We have added a link to this Appendix section right after the statement of Lemma 3.

Line 166 typo: 'that that'

We have fixed this.

in the first sentence line 187 say $\psi(x)$ and $\phi(x) = A \psi(x)$

Thanks for the suggestion – we have done this.

Line 215, refer back to Eq. 3 for the definition of $c$

We have done this.

Line 231, proof of lemma 2 is in appendix A.3, not A.2.

We have fixed this.

The three examples in 4.3 for lemma 2 are nice to have for an unfamiliar audience, but I don't see how this contributes to the result since $A$ is learned, and will likely never satisfy trivial solutions.

When the matrix $A$ is learned on the datasets used in the paper, we empirically observe that it is often close to a rotation matrix (the magnitudes of all eigenvalues are close to one). Thus, we believe that the assumption that $A$ is a rotation matrix (Examples 2 and 3) closely resembles practical settings. One can prove that $A$ will be an identity matrix if and only if the dynamics are reversible (i.e., $p(x^+ \mid x) = p(x \mid x^+)$); thus, Example 1 is applicable when users work with reversible dynamics (e.g., a diffusion process with no drift).
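[Editor's note] A minimal sketch of how one might check whether a learned matrix is close to a rotation, along the lines of the empirical observation above; this is illustrative code, not the authors' evaluation script:

```python
import numpy as np

def rotation_diagnostics(A: np.ndarray):
    """Diagnostics for how close a square matrix A is to a rotation:
    a rotation satisfies A.T @ A = I and all eigenvalues have magnitude 1."""
    d = A.shape[0]
    orth_err = np.linalg.norm(A.T @ A - np.eye(d), ord="fro")  # 0 for a perfect rotation
    eig_mags = np.abs(np.linalg.eigvals(A))                    # all 1 for a perfect rotation
    return orth_err, eig_mags.min(), eig_mags.max()

# Example usage with a randomly perturbed orthogonal matrix:
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # exactly orthogonal
A = Q + 0.01 * rng.normal(size=(8, 8))         # small perturbation
print(rotation_diagnostics(A))
```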

Line 268, the interpolation result is nice, but the linear interpolation insight is not particularly profound

We have revised this sentence to de-emphasize this result.

The start and end symbols in Figure 4 are too small to see and overlap with the black dots too much, even at 500% zoom.

We have increased the size of these symbols.

Figures 5, 6 are not chronologically ordered/ numbered (6 appears before 5).

We have fixed this.

the authors could have spent more space on discussing the limitations of their method

We will do this.

References

Oliveira, P. (2006). Interpolation of signals with missing data using PCA.

Zhang, et al. (2023). Universal visual decomposer: Long-horizon manipulation made easy

Comment

I want to thank the authors for addressing most of my comments and the additional experiment in the global response.


I'm happy that the authors aim to improve upon so many points; all of my concerns are answered. To respond to a few answers in particular:

Would it be more clear if we separated Lemma 2 into two results?

The main problem I had was that I couldn't exactly deduce what the contribution was, so this concern is about presentation. The sentence leading up to Lemma 2 also read: "Our main result is that the posterior distribution over waypoint representations has a closed form solution ...", which added to this confusion.

I think a minor rephrasing and 1-2 additional sentences that: a) discuss that the result in lemma 2 is a known consequence of the Gaussians, b) cite prior work, and c) that your contribution is that you tie these puzzle pieces together so that you can use this for intermediate state-planning, would clear this concern. This also goes for lemma 3.

When the matrix $A$ is learned on the datasets used in the paper, we empirically observe that it is often close to a rotation matrix

This is an interesting insight of the method, could the authors add an explicit part to the paper (either main or appendix) that discusses this?


I will raise my score from 4 to 7 and contribution from 2 to 3. This is a good paper with an interesting message that I think should be explored more in reinforcement learning literature. I think these types of model-features could potentially improve the scalability of planning algorithms when used in combination with neural networks (e.g., as in EfficientZero).

Comment

Dear Reviewer,

This is a good paper with an interesting message that I think should be explored more

Thanks for the kind words :)

I think a minor rephrasing and 1-2 additional sentences that ... add an explicit part to the paper that discusses the rotation matrix ...

We will make these changes.

Official Review (Rating: 6)

This paper proposes to use contrastive learning techniques for time-series feature learning, which helps improve downstream inference performance.

Strengths

The method proposed here is simple but effective, as shown in the numerical experiments. Employing contrastive learning in time-series control tasks has potential as a complementary technique to the generative modeling methods that are currently popular.

Weaknesses

Seeing the analysis of the learned features unrolling as a Gauss-Markov process reminds me of the classic Kalman filter. In fact, although the Kalman filter is not designed to be learned in the discriminative way that this paper proposes, the equations in this paper share a fair bit of similarity with those of the Kalman filter. I would like to see some discussion of the differences that separate this work from prior time-series modeling with similar formulations.

Questions

Does such a contrastive learning method for time series require sampling from the environments repeatedly from any given starting state? If so, how applicable is this method to environments that are only sampled once?

Limitations

N/A

Author Response

Dear Reviewer,

Thanks for the detailed review, and for the insightful comment about the connection with Kalman filters. As noted by the reviewer, both Kalman filters and the representations that we analyze correspond to a probabilistic model of time series data. The key difference (again, noted by the reviewer) is that Kalman filters are fit using generative methods, while we propose a method for fitting a similar generative model using discriminative methods. This finding is useful because discriminative methods avoid the need for a decoder, thus significantly reducing the number of parameters (the decoder typically has many more parameters than the encoder) and the compute required for training (the decoder typically requires more time for a forward pass than the encoder). We will add this discussion to the paper. Together with the answers below, does this address any concerns the reviewer might have about the paper?

Kind regards,

The authors.

Does such a contrastive learning method for time series require sampling from the environments repeatedly from any given starting state?

No – this contrastive learning method does not need to be initialized in a single fixed state, nor does it need to be initialized in every possible state. We expect that Assumption 2 will hold when the training data covers a range of different states; as with other machine learning applications, we expect that the learned representations will generalize if given sufficient data (so we do not expect that every state needs to be visited to ensure Assumption 2).

Official Review (Rating: 6)

This paper addresses the challenges of probabilistic inference with high-dimensional time series data. It provides compact, closed form solutions in terms of learned representations by extending a prior work to show that the learned representations can be modeled as a Gauss-Markov chain. It demonstrates the effectiveness of this theoretical result via numerical experiments on 39-dimensional and 46-dimensional tasks.

Strengths

This paper clearly states the background and motivation. It builds upon prior work to derive a theoretical result that is useful for managing high-dimensional data in practical applications. It enhances reader understanding through examples that build intuition and illustrations that visualize the methods. It also has comprehensive numerical experiments that not only validate the theoretical results but also demonstrate applicability to real-world applications.

Weaknesses

The writing of this paper can be improved, particularly in the preliminary section, which is hard to follow due to a lack of self-contained explanations for some notations and definitions. For example, why use the product of the marginal distributions to sample negative pairs, and what does ≈ mean for similarity? What is the infoNCE objective without resubstitution, and in particular what are f and B in (2)? Why does the subscript of the expectation in (2) seem not to match the sampling rule (x, x+) ∼ p(x, x) mentioned earlier? What does TSNE in the experiment section stand for?

Questions

The usefulness of the geometric properties remains somewhat abstract. While the special case provides some insights, it is not fully clear what it actually means for practical applications. Could you elaborate on how the geometric properties are used in practical scenarios and whether they have been used in your numerical experiments? Why can linear interpolation be used for your high-dimensional tasks while nonlinear interpolation is required for your synthetic task?

I suggest separating Assumption 2 into the assumption inherited from the prior work and a statement of its results. Additionally, the term "prior work" is used throughout the paper, creating some ambiguity. It would enhance clarity to specify which particular prior work is the one that shows "the representations learned by contrastive learning encode a probability ratio".

Addressing the above two concerns can help understand the contributions better.

Limitations

Yes, the authors have adequately addressed the limitations.

Author Response

Dear Reviewer,

Thanks for the detailed comments and suggestions for improving the paper. It is clear that a lot of time went into carefully reading the paper! It seems like the main suggestion was to clarify a few sections of the text, which we have done. Per the reviewer's suggestion, we have also clarified the implications of our work on practical applications. Do these revisions, together with the answers below, fully address the reviewer's concerns?

Kind regards,

The Authors

it is not fully clear what it actually means for practical applications. Could you elaborate on how the geometric properties are used in practical scenarios?

Our theoretical analysis is applicable to several practical settings. As noted by Reviewer 4BoG, our analysis suggests that a version of temporal contrastive learning can be used anywhere Kalman filters and similar probabilistic time series models are used today. Our analysis removes a key barrier to using these sorts of probabilistic models: whereas fitting these models to data typically requires learning a generative model (which needs many parameters and much compute), our analysis shows how discriminative methods can be used to fit similar models (and the computer vision and language processing communities have already developed highly efficient algorithms and implementations of these discriminative methods). We will clarify this in the paper.

how are the geometric properties used in your numerical experiments?

We use the geometric properties of these representations in all of our experiments. In Fig. 4, we use these geometric properties to solve a retrieval problem: which states are likely to be visited in between two other states? In Fig. 5, we use these geometric properties to solve a planning problem: inferring a sequence of waypoints through a maze, and then using these waypoints to control an agent interacting with the maze.
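[Editor's note] A rough sketch of the retrieval use case described above: rank candidate states by how close their representations are to the interpolated representation of the two endpoints. All names here are hypothetical and the scoring is a simplification of the posterior described in the paper:

```python
import numpy as np

def retrieve_intermediate_states(psi_bank, states, psi_start, psi_end, k=5):
    """Return the k states whose representations are closest to the midpoint
    of the two endpoint representations (a stand-in for the posterior mean)."""
    midpoint = 0.5 * (psi_start + psi_end)
    dists = np.linalg.norm(psi_bank - midpoint[None], axis=1)  # (N,)
    idx = np.argsort(dists)[:k]
    return [states[i] for i in idx]
```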

Why linear interpolation can be used for your high-dimensional tasks while nonlinear interpolation is required for your synthetic task?

We use the math in Lemma 3 to perform the interpolation for all our tasks. This interpolation is linear with respect to the representations, but nonlinear with respect to the observations. We will modify the text to clarify. [Let us know if there was something else that was confusing here!]

Why use product of the marginal distributions to sample negative pairs?

Almost all contrastive learning papers sample negative examples in this way (Chen et al., 2020; He et al., 2020; Oord et al., 2019; Radford et al., 2021; Wu et al., 2018). Not only is this choice easy to implement (just sample pairs of states independently from the dataset/replay buffer), but prior work also shows that it is highly effective.
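[Editor's note] A sketch of this sampling scheme for temporal data; names and data layout are assumptions, not the authors' implementation:

```python
import numpy as np

def sample_pairs(trajectories, rng):
    """Positives (x, x+) come from the joint: two observations from the same
    trajectory separated in time. Negatives come from the product of the
    marginals: x- is drawn independently of x (e.g., from another trajectory)."""
    traj = trajectories[rng.integers(len(trajectories))]
    t = rng.integers(len(traj) - 1)
    dt = rng.integers(1, len(traj) - t)
    x, x_pos = traj[t], traj[t + dt]            # (x, x+) ~ p(x, x+)
    other = trajectories[rng.integers(len(trajectories))]
    x_neg = other[rng.integers(len(other))]     # x- ~ p(x), independent of x
    return x, x_pos, x_neg

# Usage sketch:
# rng = np.random.default_rng(0)
# x, x_pos, x_neg = sample_pairs(list_of_trajectory_arrays, rng)
```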

what does ≈ mean for similarity?

We use "$\approx$" when providing intuition about the three worked examples in Sec. 4.3. Let's look at Example 1, where we assume $A = I$ and $c$ is large. Plugging these example values into Lemma 1, we get $\Sigma^{-1} = \frac{c}{c+1}I + \frac{c+1}{c}I$. As $c \rightarrow \infty$, we have $\Sigma^{-1} \rightarrow 2I$ (and $\Sigma \rightarrow \frac{1}{2} I$). Thus, as $c \rightarrow \infty$, we have $\mu \rightarrow \frac{1}{2} (\psi_0 + \psi_{t+})$.

NB: There was a typo in the original version of Lemma 1: the mean should have been $\mu = \Sigma (A^T \psi_{t+} + A \psi_0)$ rather than $\mu = \Sigma^{-1} (A^T \psi_{t+} + A \psi_0)$. We have corrected this in our local copy of the paper, and use the correct version in the answer above.

What is the infoNCE objective without resubstitution

The infoNCE objective is a commonly used contrastive learning objective, often attributed to Oord et al. (2019), though it also appears in earlier work (Sohn, 2016). We use "without resubstitution" to mean that the positive example $x^+$ doesn't appear in the denominator: while early work on contrastive learning typically included the positive term in the denominator, subsequent work found that excluding it can boost performance (Yeh et al., 2022). We have clarified these details in the paper.

particularly what is f and B in (2)?

B is the batch size; we have added this detail to the paper. We have also revised Eq. 2 to remove $f(\cdot, \cdot)$ and replace it with $\frac{1}{2}\|\phi(x) - \psi_s^+\|_2^2$.
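[Editor's note] A minimal sketch of an infoNCE-style loss with the squared-distance critic described above and with the positive excluded from the denominator ("without resubstitution"); this is an illustration under those assumptions, not the paper's training code:

```python
import numpy as np

def info_nce_without_resubstitution(phi_x, psi_pos):
    """phi_x:   (B, d) representations of anchors x
    psi_pos: (B, d) representations of the corresponding positives x+
    Negatives for row i are the other rows j != i (product of marginals)."""
    # Critic: f(x, x+) = -0.5 * ||phi(x) - psi(x+)||^2
    sq_dists = ((phi_x[:, None, :] - psi_pos[None, :, :]) ** 2).sum(-1)  # (B, B)
    logits = -0.5 * sq_dists
    B = logits.shape[0]
    pos = np.diag(logits)
    neg_mask = ~np.eye(B, dtype=bool)
    # log-sum-exp over negatives only (positive excluded from denominator)
    neg_logits = np.where(neg_mask, logits, -np.inf)
    lse_neg = np.log(np.exp(neg_logits).sum(axis=1))
    return np.mean(lse_neg - pos)
```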

Why the subscript of expectation in (2) seems to not match with the sampling rule (x, x+) ∼ p(x, x) mentioned earlier?

Thanks for catching this typo, which we have fixed.

What does TSNE in the experiment section stand for?

t-Distributed Stochastic Neighbor Embedding (van der Maaten et al., 2008). We have added a citation.

"prior work" is used throughout the paper, creating some ambiguity... Which particular prior work shows "the representations learned by contrastive learning encode a probability ratio"

We have addressed this concern by clarifying the particular prior works that these statements refer to. E.g., we provide a citation to (Ma & Collins, 2018; Poole et al., 2019) for the statement that "the representations learned by contrastive learning encode a probability ratio".

References

Karamcheti, et al. (2023). Language-Driven Representation Learning for Robotics.

Ma, et al. (2023). VIP: Towards universal visual reward and representation via value-implicit pre-training.

Mikolov, et al. (2013). Distributed Representations of Words and Phrases and their Compositionality.

Oliveira, P. (2006). Interpolation of signals with missing data using PCA.

Oord, et al. (2019). Representation Learning with Contrastive Predictive Coding.

Park et al. (2021). Finetuning pretrained transformers into variational autoencoders.

Sermanet, et al. (2017). Time-Contrastive Networks: Self-Supervised Learning from Multi-view Observation.

van der Maaten, et al. (2008). Visualizing Data Using t-SNE.

Yeh, et al. (2022). Decoupled contrastive learning.

Zhang, et al. (2023). Universal visual decomposer: Long-horizon manipulation made easy.

Comment

Thanks for the authors' clarifications and I have no more concerns. This is a solid paper overall, though there is room to enhance the clarity of the writing. Based on these considerations, I have increased my score.

Author Response

Dear reviewers,

The attached rebuttal PDF contains the results of new experiments, which demonstrate stock prediction as a downstream application (Fig. R1) and compare to a new autoencoder baseline (Fig. R2). Responses to each reviewer can be found in separate comments below.

Kind regards,

The authors.

Final Decision

While the paper's scores alone are borderline, the consensus I gather from reading the reviews is that the paper should be accepted, with reviewers praising the paper's clear writing and various technical strengths, particularly after the clarifications given during the rebuttal.