PaperHub
7.3 / 10
Rejected · 4 reviewers
Ratings: 5, 4, 5, 4 (min 4, max 5, std 0.5)
Confidence: 3.8
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.3
NeurIPS 2025

DiffSDA: A Framework for Unsupervised Diffusion Sequential Disentanglement Across Modalities

OpenReview · PDF
Submitted: 2025-05-07 · Updated: 2025-10-29

Abstract

Keywords
Sequential Disentanglement · Diffusion Models

Reviews and Discussion

Review (Rating: 5)

This work deals with the problem of unsupervised representation learning for sequential data. More specifically, the problem statement is related to finding disentangled representations of sequential data, by partitioning static from dynamic latent factors. This separation enables a number of application use cases, such as “video swapping”, zero-shot disentanglement, speaker identification, and time-series predictions.

The proposed method features a simple design that builds on established approaches to learning disentangled representations, together with diffusion-based generative models conditioned on latent factors of variation. The approach described by the authors is general, in that it requires minimal architecture changes to accommodate different data modalities. In a nutshell, the idea proposed by the authors is as follows.

A semantic encoder, based on the disentangled sequential autoencoder (reference [11] in the paper), takes the clean data sequence $x_0^{1:V}$ as input and produces the disentangled latent factors: the static factor $s_0$ and the sequence of dynamic factors $d_0^{1:V}$. This encoder is implemented using a U-Net and linear modules per frame, followed by LSTMs that summarize the sequence to obtain $s_0$ and $d_0^{1:V}$. These extracted latent factors serve as the ground-truth "clean" data for training a separate DDIM model. The DDIM model, which is distinct from the one that generates sequential data, estimates the distribution of the disentangled latent factors, $p_{T_0}(s_0, d_0^{1:V} \mid s_T, d_T^{1:V})$. It is trained using the outputs from the semantic encoder as its "data". Once trained, this model allows for the generation of new, diverse disentangled latent factor sequences by sampling from Gaussian noise and simulating the reverse (denoising) process. This provides the ability to create novel static and dynamic factor combinations independently. The conditional diffusion model is responsible for generating sequential data, conditioned on latent factors of variation. What the authors refer to as the "stochastic encoder" corresponds to the forward diffusion process. The so-called stochastic decoder uses a learned neural network that estimates the score function, which is used to iteratively denoise a noise sample into sequential data.
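For concreteness, a minimal sketch of this encoder structure (module sizes and the per-frame feature extractor are illustrative placeholders, not the authors' exact architecture):

```python
import torch.nn as nn

class SemanticEncoder(nn.Module):
    """Sketch: per-frame features -> LSTMs -> static s_0 and dynamic d_0^{1:V}."""
    def __init__(self, feat_dim=256, static_dim=128, dynamic_dim=64):
        super().__init__()
        # Stand-in for the per-frame U-Net + linear modules described above.
        self.frame_net = nn.Sequential(nn.LazyLinear(feat_dim), nn.SiLU())
        self.static_lstm = nn.LSTM(feat_dim, static_dim, batch_first=True)
        self.dynamic_lstm = nn.LSTM(feat_dim, dynamic_dim, batch_first=True)

    def forward(self, x):                      # x: (B, V, C*H*W) flattened frames
        h = self.frame_net(x)                  # (B, V, feat_dim)
        _, (s_last, _) = self.static_lstm(h)   # final hidden state summarizes the clip
        s0 = s_last[-1]                        # (B, static_dim): one static code per clip
        d0, _ = self.dynamic_lstm(h)           # (B, V, dynamic_dim): one code per frame
        return s0, d0
```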

The authors performed a very large experimental campaign on video, audio, and time-series data. They defined appropriate performance metrics for the use cases they considered in their experiments, and provided several additional results in the appendices. Overall, the proposed method performs very well, and the experiments support the claim of a versatile approach to disentangled representation learning that works across various modalities.

Strengths and Weaknesses

  • Strengths:

    • By far, the biggest strength of this article, in my opinion, is the breadth of the experimental campaign conceived by the authors. I think this is a convincing way of supporting the claims of this paper.
    • The proposed method is simple and relies on established building blocks. One could see this as a lack of originality, but from the practical point of view, it is not trivial to combine existing methods into a full-fledged system that actually works.
    • The article is mostly clear and well written; the experiments are polished and convincing.
  • Weaknesses:

    • In my opinion, the main weakness of this paper is the presentation of the proposed method. The authors focus on claiming a single loss function, with few hyper-parameters, and present the overall “system” as a standalone model that can be easily trained in one go, using a simple and well-known “denoising loss”. In reality (see my paper summary above), the method requires several separate training stages, and the auto-encoding architecture is (in my opinion) a bit of a stretch. For example, the forward diffusion of the latent variables is presented as an encoder, which fits the narrative (and in some prior work on diffusion models it has been justified), but that (again, in my opinion) hinders a proper understanding of the proposed method.
    • It is hard for me to find stark differences between the proposed semantic encoder, and the related work it builds upon (reference [11] in the paper). The claim of designing a novel sequential encoder should be better reflected in the text.
    • Older works on disentanglement of the latent space of a variational auto-encoder are rich in (theoretical) results on a more fine-grained interpretation of the latent space, as well as impossibility results ([1], [2], [3] for example). It appears to me that the main focus of this work is on generative quality (while respecting the “semantic constraints” imposed by the simple latent factors).

[1] Locatello et al., “Challenging common assumptions in the unsupervised learning of disentangled representations”, ICML 2019

[2] Shu et al., “Weakly supervised disentanglement with guarantees”, ICLR 2020

[3] Khemakhem et al., “Variational autoencoders and nonlinear ICA: A unifying framework”, AISTATS 2020

Questions

  • Can you please spell out the key differences between the proposed semantic encoder and what exists in the literature? Please make explicit reference to the work [11] that inspired your design.

  • You present qualitative results for the zero-shot experiments. While compelling visually, can you define a quantitative metric in this specific case?

  • Similarly, for the multi-factor disentanglement experiments, we are presented with qualitative results, but I am wondering which quantitative metric could be defined in this case. Furthermore, you explore the latent space using PCA, which establishes linear directions of variation. This requires an assumption of linearity of the disentangled factors, especially for the static component in your experiments. Is this a reasonable assumption, and does it apply to the dynamic part of the latent factors of variation?

Limitations

Yes, to some extent, in the conclusion.

Final Justification

I will keep my positive score, as I think this is a good paper whose strength is in its empirical validation.

Formatting Issues

I did not spot issues on paper formatting.

Author Response

We thank Reviewer Dbeh for the detailed and thoughtful review, particularly for highlighting the strengths of our experimental design and the versatility of our approach. Below, we respond to each concern and suggestion, and we would be happy to incorporate the reviewer’s feedback into the final version of the paper.

W1:

In my opinion, the main weakness of this paper is the presentation of the proposed method. The authors focus on claiming a single loss function, with few hyper-parameters, and present the overall “system” as a standalone model that can be easily trained in one go, using a simple and well-known “denoising loss”. In reality (see my paper summary above), the method requires several separate training stages, and the auto-encoding architecture is (in my opinion) a bit of a stretch. For example, the forward diffusion of the latent variables is presented as an encoder, which fits the narrative (and in some prior work on diffusion models it has been justified), but that (again, in my opinion) hinders a proper understanding of the proposed method.

Response:

We thank the reviewer for raising these thoughtful concerns, and we appreciate the opportunity to clarify the presentation of our method.

On the Training Procedure and Model Structure:

Our method involves a single training stage designed to disentangle the latent space into static and dynamic components. This training is performed using the loss in Eq. 5 and consists of three modules:

(1) a semantic encoder (extracting latent features),

(2) a forward stochastic process (diffusing latent representations), and

(3) a denoiser (decoder).

This unified training pipeline allows us to learn disentangled latent representations in an end-to-end manner. Importantly, we do not rely on explicit prior assumptions over the latent variables (e.g., Gaussian priors as in [11]).

We acknowledge that our method includes an additional stage only when we wish to generate new samples. Since the training process does not assume priors over the latent space, sampling from it is not immediately feasible post-training. To enable generation, we introduce a second stage where we learn to model the distribution of static and dynamic features; this step is isolated from the disentanglement training itself (that is, it involves no disentanglement objective). Importantly, our disentanglement results do not rely on this generative stage and are also validated through swapping and other non-generative tasks that do not require DDIM sampling.

We appreciate the reviewer’s comment regarding the terminology of referring to the forward diffusion process as an "encoder." While this usage is consistent with prior literature on diffusion models, we agree that it may be misleading in this context. We will revise the manuscript to use the more precise term "forward diffusion process" and clarify its role within the model to avoid confusion.

In summary, the core of our method is trained in a single end-to-end stage using a standard denoising loss. The additional generative stage is optional and only used for sampling, not for learning disentangled representations. We will revise the manuscript to improve clarity on this point and refine our terminology as suggested.

W2:

It is hard for me to find stark differences between the proposed semantic encoder, and the related work it builds upon (reference [11] in the paper). The claim of designing a novel sequential encoder should be better reflected in the text.

Response:

The key difference between our proposed semantic encoder and the encoder design in [11] lies in the assumptions made about the latent representations of the static and dynamic factors, $s_0$ and $d_0^\tau$. In [11], as well as in many prior works based on variational encoders, the latent space is constrained by a Gaussian prior. This assumption simplifies computing the KL divergence term in the variational framework but can restrict the expressiveness of the learned representations. In contrast, our approach does not impose any prior distribution over $s_0$ and $d_0^\tau$. By removing this constraint, we allow the model to learn more flexible and expressive representations that are better suited for capturing the complex structure of static and dynamic factors, particularly in real-world, unlabeled data. We believe this design choice contributes to the improved generalization and disentanglement performance observed in our experiments, especially in settings where variational assumptions may be too limiting.

W3:

Older works on disentanglement of the latent space of a variational auto-encoder are rich in (theoretical) results on a more fine-grained interpretation of the latent space, as well as impossibility results ([1], [2], [3] for example). It appears to me that the main focus of this work is on generative quality (while respecting the “semantic constraints” imposed by the simple latent factors).

Response:

Indeed, earlier works on disentanglement provide rich theoretical insights into the structure and limitations of latent representations, especially within the context of variational autoencoders and related frameworks. We agree that a deeper theoretical understanding of probabilistic modeling is valuable.

Our primary theoretical contribution lies in the probabilistic formulation of disentanglement within the diffusion framework. However, we acknowledge that the main focus of our work is empirical: we aim to demonstrate the practical effectiveness of this formulation in real-world settings, particularly in scenarios where previous approaches often struggle, as demonstrated in our experiments.

We view this work as a stepping stone, and we hope it motivates further investigation into the theoretical foundations of disentanglement in diffusion-based models. In particular, we believe that bridging the gap between empirical performance and theoretical guarantees is a promising direction for future research.

Q1:

Can you please spell out the key differences between the proposed semantic encoder and what exists in the literature? Please make explicit reference to the work [11] that inspired your design.

Response:

Please see our response for W2.

Q2:

You present qualitative results for the zero-shot experiments. While compelling visually, can you define a quantitative metric in this specific case?

Response: Yes, we can apply a similar evaluation protocol to the one used for disentanglement in our setup. Specifically, we utilize the AED and AKD metrics introduced for sequential disentanglement to evaluate the zero-shot swapping task described in Section 4.1. While the detailed results are omitted here for clarity, a few key trends emerged. As expected, models performed best when trained and evaluated on the same dataset. However, we found that models pre-trained on larger datasets, such as VoxCeleb, transferred more effectively to smaller ones like MUG, particularly in preserving dynamic content (AKD). In contrast, static content (AED) was generally more sensitive to domain shifts and performed best in-domain. These results highlight the benefit of large-scale pre-training for generalization in low-resource, zero-shot settings.

Q3:

Similarly, for the multi-factor disentanglement experiments, we are presented with qualitative results, but I am wondering which quantitative metric could be defined in this case. Furthermore, you explore the latent space using PCA, which establishes linear directions of variation. This requires an assumption of linearity of the disentangled factors, especially for the static component in your experiments. Is this a reasonable assumption, and does it apply to the dynamic part of the latent factors of variation?

Response:

This is a great question. We believe that if ground-truth labels were available, a swap-based metric could be extended to multiple factors, similar to existing evaluations for two factors, by isolating and manipulating each factor independently. For example, one could modify a person’s hair color while expecting attributes like eye color or facial expression to remain unchanged.

Regarding the PCA latent space, while we do not have a theoretical guarantee of linearity, it is nonetheless intriguing to observe some degree of linear structure. Enforcing linearity explicitly could potentially improve factor decomposition and presents an interesting direction for future research.

Finally, yes: our approach can also be applied to dynamic factors. We demonstrate such disentanglement in the appendix. Specifically, Figures 36–50 illustrate traversals along various dynamic axes such as facial expressions, head rotation, viewing angles, and more.

Comment

Dear authors, thank you for the rebuttal, your answers and discussions.

I will keep my positive score.

Review (Rating: 4)

This paper proposes a probabilistic framework using diffusion models for unsupervised disentanglement of sequential data into static and dynamic factors of variation, expanding on the previous approach DiffVAE. The approach is designed to reduce the reliance on extensive hyperparameter tuning by using a single score-matching objective. The proposed framework is designed to learn disentangled representations, and the use of PCA further seems to separate static and dynamic information into multiple interpretable factors. The effectiveness of the proposed method is demonstrated through competitive qualitative and quantitative evaluations across different data domains, including the introduction of a new evaluation protocol for video disentanglement.

Strengths and Weaknesses

Quantitative results show good reconstruction and semantic editing capabilities over an extensive range of datasets, with strong empirical results on the chosen benchmark metrics. Disentanglement seems to be facilitated through structural inductive biases placed on the model rather than auxiliary losses with additional hyperparameters, fulfilling one of the key objectives of the proposed work.

There are a number of weaknesses in this paper:

  1. Experiments are mostly on short-sequence datasets. The specific advantages of the proposed framework over the DiffAE methods, when applied on a per-frame basis, are also unclear.
  2. The probabilistic formulation needs to be further elucidated. As it stands, it is unclear how the proposed probabilistic framework leads to the specific training objective.
  3. It is unclear how the proposed evaluation method is novel and what makes it well suited for measuring disentanglement.

Questions

  1. The inductive biases which facilitate disentanglement should be made explicitly clear, especially for unsupervised learning with a single, simple loss objective.
  2. The probabilistic formulation of Eq. (1) should show the correct factorization of the joint distribution in terms of the conditioning variables ($s_T$, $d_T^{1:V}$, $x_t^{1:V}$ only appear as conditioning variables on the RHS, which should also be reflected in the LHS). Also, Equation (2) of the posterior seems to imply some independence of the static and dynamic components.
  3. Since one of the main contributions of the paper is a probabilistic formulation of the diffusion model for use in sequential disentanglement, the evidence lower-bound could be made explicit or a derivation should be included which makes the dependence of the latent variables on the loss function explicit.
  4. If the proposed method claims to be a method specifically for sequential disentanglement, quantitative results should be reported and compared with on datasets specifically designed to measure sequential disentanglement capabilities.

Limitations

Yes

Final Justification

Overall this is a reasonable paper but, as I mentioned in my reply to the rebuttal, the scope of the work is relatively narrow and I am still not confident about the evaluation process.

Formatting Issues

None

Author Response

We thank Reviewer DNDW for the detailed and constructive feedback, which helped us clarify key aspects of our probabilistic formulation, inductive biases, and evaluation design. Below, we respond to each point, and we would welcome the opportunity to incorporate the reviewer’s suggestions in the final version of the paper.

W1:

Experiments are mostly on short sequence ...

Response:

Our experiments deliberately target short video clips lasting up to 10 seconds and sampled at one frame per second (or at a higher rate for shorter videos), yielding a total of 10 frames. Such clip lengths are standard in modern video-representation learning (for example, VideoMAE V2, InternVideo2, and LanguageBind operate on 8 or 16 frames).

DiffAE is not designed for sequential or unsupervised disentanglement, making it unclear how to adapt it for extracting static and dynamic features across frames. Applying its semantic encoder frame by frame yields individual embeddings per frame, but these do not naturally separate static and dynamic components. This is expected, as DiffAE’s encoder is intended to summarize a single image to aid the denoising process, not to model temporal consistency or factor disentanglement.

W2:

The probabilistic formulation needs to be further elucidated....

Response:

Our framework is grounded in the continuous-time score-based generative modeling paradigm introduced by Karras et al. [1], and thus does not rely on variational inference or an ELBO-based training objective. Instead, we optimize a denoising score matching loss that learns the conditional score function:

$\nabla_{x} \log p_{0t}(x_t^\tau \mid x_0^\tau, z_0^\tau),$

which characterizes the reverse-time stochastic process $x_t^\tau \to x_0^\tau$, conditioned on the latent factors $z_0^\tau$, which are provided as inputs to the denoising model $D_\theta$. The joint generative distribution is defined in Eq. (1), while the posterior in Eq. (2) specifies how we infer these latent variables from the observed sequence during training.

The training objective in Eq.(3) arises as a natural consequence of denoising score matching:

$\mathbb{E}_{x_0^\tau \sim p_0,\; x_t^\tau \sim p_{0t},\; z_0^\tau \sim p(\cdot \mid x_0^{1:V})} \left[ \left\| D_\theta(x_t^\tau, t, z_0^\tau) - \nabla_{x} \log p_{0t}(x_t^\tau \mid x_0^\tau, z_0^\tau) \right\|^2 \right],$

which corresponds to learning the score of a reverse-time SDE conditioned on the latent variables. This formulation allows the model to generate samples conditioned on latent structure inferred from the input sequence. To summarize: (1) the prior defines the generative trajectory from noise to data, conditioned on static and dynamic factors; (2) the posterior specifies how these factors are inferred during training; (3) the training objective directly approximates the conditional score function via posterior-sampled latent variables. Given the opportunity, we will revise our manuscript to emphasize the above discussion.
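For readers who prefer code, here is a schematic rendering of this objective; the network signature and noise handling are simplified assumptions, and the paper's EDM-style preconditioning and loss weighting are omitted:

```python
import torch

def dsm_loss(denoiser, x0, z0, sigma):
    """Denoising score matching for one frame x0, conditioned on latents z0.

    With a Gaussian perturbation x_t = x_0 + sigma * eps, the conditional score
    grad_x log p_0t(x_t | x_0, z_0) = -(x_t - x_0) / sigma**2 is known in closed
    form, so the objective reduces to matching that quantity.
    """
    eps = torch.randn_like(x0)
    xt = x0 + sigma * eps                   # forward diffusion ("stochastic encoder")
    target = -(xt - x0) / sigma**2          # closed-form conditional score
    pred = denoiser(xt, sigma, z0)          # network output interpreted as a score
    return ((pred - target) ** 2).mean()
```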

W3:

It is unclear on how the proposed evaluation method is novel...

Response:

The novelty and suitability of our proposed evaluation method for measuring sequential disentanglement can be understood from two perspectives: its design to work without labeled data and its alignment with meaningful factors of variation in real-world videos.

Previous approaches to evaluating disentanglement in sequential data, such as those applied to the MUG dataset, relied heavily on labeled attributes. For instance, identity (static factor) and expressions (dynamic factors) were manually annotated. In contrast, our method is designed to work with real-world, unlabeled video data, which is far more representative of practical applications.

To achieve this, we adapt evaluation metrics traditionally used in animation and visual applications, specifically AKD and AED, and reformulate them for the sequential disentanglement setting. Additionally, we standardize the evaluation by precomputing a list of input pairs, ensuring fair comparison across different methods under identical conditions.

Our method is well-suited for evaluating disentanglement in videos for three key reasons:

  1. Dynamic Feature Evaluation: AKD captures motion-related differences (e.g., expressions) by comparing keypoints across predefined video pairings. Larger AKD scores indicate better disentanglement of dynamic features.
  2. Static Feature Evaluation: AED assesses identity preservation by comparing feature embeddings before and after manipulation. Low AED scores signal that static attributes like identity are well preserved.
  3. Empirical Validation: On the MUG dataset, our metrics closely match trends observed using label-based evaluations, confirming their effectiveness despite being label-free.

We also provide an in-depth analysis of MUG’s traditional evaluation protocol in App. F, highlighting its reliance on predefined attributes and human judgment. Our AKD and AED metrics (detailed in App. E) address these issues and generalize to both labeled and unlabeled datasets.
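As an illustration of how such pairwise metrics can be computed, here is a sketch assuming pretrained `keypoint_net` and `identity_net` extractors (hypothetical names standing in for the off-the-shelf detectors used in the animation literature):

```python
import torch

def akd(keypoint_net, clip_a, clip_b):
    """Average Keypoint Distance: mean L2 gap between per-frame keypoints."""
    kp_a = keypoint_net(clip_a)             # (V, K, 2) keypoints per frame
    kp_b = keypoint_net(clip_b)
    return (kp_a - kp_b).norm(dim=-1).mean()

def aed(identity_net, clip_a, clip_b):
    """Average Euclidean Distance between clip-level identity embeddings."""
    e_a = identity_net(clip_a).mean(dim=0)  # average embedding over frames
    e_b = identity_net(clip_b).mean(dim=0)
    return (e_a - e_b).norm()
```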

Q1:

The inductive biases which facilitate disentanglement....

Response:

We are happy to clarify how disentanglement is facilitated in our method, particularly in the context of unsupervised learning with a single, simple loss objective.

As noted in lines 183-187, our framework does not rely on auxiliary losses or supervision. Instead, disentanglement emerges from two key inductive biases built into the model design:

(1) Sharing the static vector across all time steps inherently prevents it from capturing temporal variation. Since each frame is reconstructed using the same static vector, any changes over time must be modeled by the dynamic vector. If the static vector encoded temporal changes, it would produce identical outputs across frames, collapsing the temporal dimension and harming both disentanglement and reconstruction. This design bias encourages a clean separation between static and dynamic factors.

(2) The dynamic factors are intentionally constrained to be low-dimensional, limiting their capacity to store detailed or identity-specific static information. This dimensional bottleneck promotes the disentanglement of motion (or other dynamic content) from identity.

We further elaborate on this in App. G.2, where we provide empirical validation of these assumptions. Specifically, we conduct ablation studies showing that when the static component is not shared across time steps, disentanglement quality significantly degrades. We also show that when the dimensionality of the dynamic vector is increased, it begins to capture static features, weakening the separation between static and dynamic information.
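The two biases amount to a broadcasting choice plus a width choice. A minimal sketch of how the per-frame conditioning could be assembled (dimensions illustrative, not the paper's exact values):

```python
import torch

def build_conditioning(s0, d0):
    """Assemble per-frame latents z_0^tau = [s_0, d_0^tau].

    s0: (B, S) one static code, shared across all V frames (bias 1).
    d0: (B, V, D) per-frame dynamic codes with small D (bias 2, the bottleneck).
    """
    B, V, _ = d0.shape
    s_rep = s0.unsqueeze(1).expand(B, V, -1)  # same static vector at every time step
    return torch.cat([s_rep, d0], dim=-1)     # (B, V, S + D) conditioning per frame
```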

Q2:

The probabilistic formulation of Eq. (1) should show the ...

Response:

The formulation in Eq. (1) reflects a modeling assumption about how we structure the generative process. While certain variables appear as conditioning terms on the right-hand side, we did not explicitly mark them as such on the left-hand side. This can understandably create confusion regarding the probabilistic semantics. To clarify, the equation is intended to define a conditional joint distribution, where the conditioning variables are fixed and inform the generation of other variables. In the final version, we will revise the notation to make this interpretation clearer and fully consistent with standard probabilistic conventions.

Regarding Eq. (2), we do assume independence between static and dynamic components in the posterior, as this supports our goal of learning a disentangled representation from the input sequence. Since the input already contains specific static and dynamic traits, modeling them as independent given the data encourages their separation during inference. In contrast, for the prior, where no input is available, we allow dependence between static and dynamic factors to enable the generation of more coherent and realistic samples.
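Concretely, the independence assumption described above corresponds to a posterior that factorizes over the static and dynamic parts; schematically (our rendering of the structure, not a verbatim copy of Eq. (2)):

$q(s_0, d_0^{1:V} \mid x_0^{1:V}) = q(s_0 \mid x_0^{1:V}) \, q(d_0^{1:V} \mid x_0^{1:V}),$

so that $s_0$ and $d_0^{1:V}$ are conditionally independent given the observed sequence, while the prior is left free to couple them.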

Q3:

Since one of the main contributions of the paper is a probabilistic formulation of the diffusion model for use in sequential disentanglement...

Response:

Our framework is not based on an ELBO, as it follows the score-based diffusion formulation of Karras et al. [1], where the training objective is derived from denoising score matching, not variational inference. Importantly, while the forward noising distribution is independent of the latent variables, we condition the score model on latent factors to enable expressive, disentangled generation in the reverse process. This is consistent with conditional score-based models, where the learned score is conditional even if the forward process is not. See also the discussion above on Weakness 2.

Q4:

If the proposed method claims to be a method specifically for sequential disentanglement...

Response:

We do not overlook traditional quantitative metrics; in fact, we actively incorporate them across audio and video domains. For audio, we evaluate on TIMIT and the large-scale LibriSpeech dataset using standard metrics such as Static EER, Dynamic EER, the Disentanglement Gap (as in C-DSVAE), and time-series metrics from SPYL. For video, we benchmark on the labeled MUG dataset using widely adopted judge-based metrics (see Table 9). Acknowledging the limitations of these protocols for real-world, unlabeled data, we also introduce a novel evaluation method that enables fair and consistent assessment without requiring attribute labels or manual judgments.

[1] Elucidating the Design Space of Diffusion-Based Generative Models, Karras et al., NeurIPS 2022.

Comment

I appreciate the responses provided in the rebuttal but I am still concerned about the constrained view of sequential disentanglement. I may update my score during the final reviewers discussions.

Comment

We thank Reviewer DNDW for their continued engagement with our work. We note your remark about our “constrained view of sequential disentanglement” and would appreciate clarification on which aspects of our work you find limiting, so we can address them concretely. Your thoughtful feedback is invaluable and will help us strengthen the paper. If there are any additional points you would like us to clarify or address, please let us know.

Comment

I understand your sequential disentanglement generative framework and the fact that there is no requirement that the static and dynamic components each involve independent factors of variation. The PCA will generate a few uncorrelated principal components, which is definitely helpful. But I understand where you are coming from.

Review (Rating: 5)

This paper proposes a framework called DiffSDA for unsupervised sequential disentanglement across modalities. By combining diffusion models and autoencoders, the framework effectively decomposes sequential data (e.g., videos, audio, and time series) into static and dynamic latent factors. Experimental results demonstrate that DiffSDA outperforms existing state-of-the-art methods on multiple real-world datasets.

Strengths and Weaknesses

Strengths:

  1. The paper is well-structured and logically coherent, with a solid theoretical framework and comprehensive experimental design. The evaluation is thorough and convincing.
  2. The proposed DiffSDA framework has broad applicability in unsupervised sequential disentanglement tasks. Its application of diffusion models in this field is innovative.
  3. The work identifies limitations in existing evaluation protocols and introduces a new evaluation method, providing a standardized tool for future research.
  4. Both qualitative case studies and quantitative metrics demonstrate that DiffSDA achieves strong performance.

Questions

  1. Typo/Inconsistency:

    In Figure 2, the order of the first two rows in the left panel appears inconsistent with Figures 3 and 4. It would be helpful to standardize the presentation and update the corresponding descriptions in Section 4.1 accordingly.

  2. Interpretability of Latent Factors:

    The paper mentions that PCA can further disentangle static and dynamic factors, but are there more intuitive ways to visualize or explain the semantic meaning of these factors? Additionally, in Section 4.3, the meaning of the variable "b" is not entirely clear. Could the authors clarify this?

  3. Generalization Mechanism in Zero-Shot Tasks:

    The model performs well in zero-shot tasks, but the underlying mechanism for cross-modal generalization is not thoroughly discussed. I guess the generalization ability might partly stem from the pre-trained VQ-VAE autoencoder, which is inherently capable of reconstructing visually plausible images from diverse latent representations. What are the authors’ thoughts on this?

Limitations

Yes.

Final Justification

I don't have further questions and will keep my positive score.

Formatting Issues

N/A.

Author Response

We thank Reviewer z7mY for the thoughtful and constructive feedback. Below, we address each point and would be happy to incorporate the reviewer’s suggestions into the final version.

Q1:

Typo/Inconsistency: In Figure 2, the order of the first two rows in the left panel appears inconsistent with Figures 3 and 4. It would be helpful to standardize the presentation and update the corresponding descriptions in Section 4.1 accordingly.

Response: We acknowledge the inconsistency in the order of the first two rows in the left panel of Figure 2 when compared to Figures 3 and 4. We will rectify these inconsistencies across Figures 2, 3, and 4, and update the corresponding descriptions in Section 4.1 accordingly in the revised version of our paper.

Q2:

Interpretability of Latent Factors: The paper mentions that PCA can further disentangle static and dynamic factors, but are there more intuitive ways to visualize or explain the semantic meaning of these factors? Additionally, in Section 4.3, the meaning of the variable “b” is not entirely clear. Could the authors clarify this?

Response:

We agree that providing clearer and more intuitive interpretations of the latent static and dynamic factors can enhance the paper’s accessibility. In the paper, we aimed to demonstrate interpretability through both textual explanation and illustrative figures. We employed PCA to disentangle the static (s) and dynamic (d) factors without relying on any labels, similar in spirit to the supervised linear projection used in DiffAE; the latter, however, requires labels, in contrast to our unsupervised approach. For example, Figure 2 (right) shows how, given a video, we can isolate a latent direction within s that corresponds to a spectrum between masculine and feminine character appearances.

Due to space constraints in the main paper, we included extended qualitative results in the appendix. Notably, Appendix H.5 presents latent directions associated with darker hair color changes (Fig. 25) and brighter hair color (Fig. 27), whereas Fig. 28 shows a direction controlling perceived age (older vs. younger) semantics. For dynamic features, we also observe interpretable vectors within d. For instance, Fig. 39 demonstrates a vector controlling facial expressions, while Fig. 42 shows one responsible for head rotations.
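As a sketch of the unsupervised PCA-based traversal being described, assuming a matrix `S` of static codes collected from many clips (shapes and scales are illustrative):

```python
import numpy as np

def pca_traverse(S, s_query, component=0, scale=3.0, steps=5):
    """Move one static code along a principal direction of the static latent space.

    S: (N, dim) static codes from many clips; s_query: (dim,) the code to edit.
    Returns `steps` edited codes spanning [-scale, +scale] along one PC; each can
    then be decoded with the conditional diffusion decoder to render the edit.
    """
    mean = S.mean(axis=0)
    _, _, Vt = np.linalg.svd(S - mean, full_matrices=False)
    direction = Vt[component]                        # unit-norm principal axis
    alphas = np.linspace(-scale, scale, steps)
    return np.stack([s_query + a * direction for a in alphas])
```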

We will clarify these insights further in the revision, including more direct references.

Clarification on Variable b: The variable b refers to the batch size, and the subscript j indexes the j-th sample within a batch. We will make this explicit in the revised version of the manuscript to avoid ambiguity.

Q3:

Generalization Mechanism in Zero-Shot Tasks: The model performs well in zero-shot tasks, but the underlying mechanism for cross-modal generalization is not thoroughly discussed. I guess the generalization ability might partly stem from the pre-trained VQ-VAE autoencoder, which is inherently capable of reconstructing visually plausible images from diverse latent representations. What are the authors’ thoughts on this?

Response:

We appreciate the reviewer’s thoughtful observation regarding the potential role of the pre-trained VQ-VAE in enabling cross-modal generalization. Indeed, we concur that the strong zero-shot performance is partly attributable to the VQ-VAE, which was trained on a diverse corpus of images and provides a unified latent representation space. This shared space facilitates the model’s ability to generalize across datasets, even when they differ in visual characteristics.

To further investigate this hypothesis, we conducted an additional zero-shot experiment. Specifically, we trained DiffSDA on a downsampled (64×64) version of the VoxCeleb dataset and tested its transferability to the MUG dataset. Importantly, in this setting, we did not use the VQ-VAE. Preliminary results from this setup indicate that zero-shot generalization degrades compared to the original pipeline that includes the VQ-VAE. These findings support our belief that the VQ-VAE contributes to the model’s ability to generalize by encoding diverse images into a coherent and transferable latent space.

Comment

Thank you for the response! I'll keep my score.

Review (Rating: 4)

This paper proposes a diffusion-model-based method to achieve latent space disentanglement in sequential data. The method is applicable across a wide range of modalities. The experiments are mostly on video and audio signals.

Strengths and Weaknesses

Strengths:

  • The proposed method tries to address an open challenge
  • The manuscript is comprehensive and well written, with a fairly large number of experiments

Weaknesses:

  • Lack of a clear demonstration that good disentanglement is achieved
  • The comparison to SOTA can be improved
  • Broader impact should be discussed

Questions

  1. The comparison to SOTA is very limited. The comparison is limited to [14] and [15], which are from the same research group. A Google search returned the following two existing works from the past 5 years: S3VAE: Self‑Supervised Sequential VAE (Zhu et al., 2020) and C‑DSVAE (Contrastively Disentangled Sequential VAE, Bai et al., 2021). They should be included in the SOTA comparison.

  2. The metrics used in experiments are all high level, and complex benchmarks are used. At least in the classic disentanglement papers, simple benchmarks (e.g., dSprites) have been used. Their simplicity facilitates the understanding of the effectiveness of the disentanglement. The author(s) should consider including such simple benchmarks (e.g., by constructing sequential data from static dSprites) to demonstrate the effectiveness.

  3. The formalization of sequential disentanglement as in Eqn. (10) around line 582 needs more elaboration. A fully disentangled latent space can be interpreted as latent variables being mutually independent (or orthogonal from a Banach space perspective). Please elaborate on the statement.

  4. Related to 3), around line 184, the proposed method promotes disentanglement. Without theoretical analysis, the empirical demonstration in G.2 is not sufficient. More comprehensive experimentation is needed.

  5. I have strong reservations about suggesting a new evaluation metric (AED and AKD) in the same paper which proposes a new method. I can understand the limitations of the current metrics and the need to propose new ones. At a minimum, the proposed metrics (i.e., AED and AKD) should be used on existing methods to demonstrate their consistency. It's OK to list them in the appendix.

  6. There are obvious broad societal impacts of the proposed method, which should be mentioned. The answer of NA on line 1099 is not accurate.

Limitations

The potential broad societal impact should be mentioned.

Final Justification

The authors have agreed to revise the manuscript if it's accepted, which addresses one of my big concerns.

Formatting Issues

No.

Author Response

We sincerely thank Reviewer ntGx for the thoughtful and constructive feedback, which helped us strengthen the paper both empirically and conceptually. Below, we address the reviewer’s concerns in detail, and given the opportunity, we would be happy to incorporate these suggestions into the final revision.

Q1:

The comparison to SOTA is very limited. The comparison is limited to [14] and [15], which are from the same research group. A Google search returned the following two existing works from the past 5 years: S3VAE: Self‑Supervised Sequential VAE (Zhu et al., 2020) and C‑DSVAE (Contrastively Disentangled Sequential VAE, Bai et al., 2021). They should be included in the SOTA comparison.

Response: Our comparisons already include SPYL and DBSE, which are among the most recent and competitive state-of-the-art methods for sequential disentanglement. Regarding S3VAE, we considered it during our survey; however, we were unable to find publicly available code, and the method involves auxiliary tasks with potentially proprietary components, making fair reproduction difficult.

In response to the reviewer’s concern, we have now added C-DSVAE to our comparisons. As shown in the table below, our method consistently outperforms C-DSVAE, SPYL, and DBSE across multiple datasets and metrics, further validating the effectiveness and generality of our approach.

| Dataset / Task | Metric | C-DSVAE | SPYL | DBSE | Ours |
|---|---|---|---|---|---|
| MUG (video) | AED ↓ | 0.56 | 0.49 | 0.49 | 0.11 |
| MUG (video) | AKD ↓ | 0.50 | 0.47 | 0.48 | 0.06 |
| MUG (video) | MSE ↓ | 0.001 | 0.001 | 0.001 | 3e-7 |
| PhysioNet (TS - pred.) | AUPRC ↑ | 0.158 | 0.367 | 0.473 | 0.50 |
| PhysioNet (TS - pred.) | AUROC ↑ | 0.565 | 0.764 | 0.858 | 0.87 |
| PhysioNet (TS - pred.) | MAE ↓ | 12.53 | 12.22 | 11.21 | 9.89 |
| PhysioNet (TS - class.) | Accuracy ↑ | 32.54 | 46.98 | 56.87 | 64.6 |
| Air Quality (TS - class.) | Accuracy ↑ | 47.07 | 57.93 | 65.87 | 69.2 |
| TIMIT (audio) | Static EER ↓ | 4.03% | 3.41% | 3.50% | 4.43% |
| TIMIT (audio) | Dynamic EER ↑ | 31.81% | 33.22% | 34.62% | 46.72% |
| TIMIT (audio) | Dis. Gap ↑ | 27.78% | 29.81% | 31.11% | 42.29% |

Summary of results across datasets and metrics for C-DSVAE, SPYL, DBSE, and Ours.

Q2:

The metrics used in experiments are all high level, and complex benchmarks are used. At least in the classic disentanglement papers, simple benchmarks (e.g., dSprites) have been used. Their simplicity facilitates the understanding of the effectiveness of the disentanglement. The author(s) should consider including such simple benchmarks (e.g., by constructing sequential data from static dSprites) to demonstrate the effectiveness.

Response: We thank the reviewer for the suggestion. To address this, we conducted an additional experiment on a sequentially adapted version of the dSprites dataset. Our model achieved near-perfect reconstruction and strong disentanglement performance. Specifically, we observed high accuracy in recovering key factors: 89.8% for color, 95.9% for shape, 99.7% for PosX, and 99.9% for PosY. These results confirm that our method generalizes well even on simple, controlled benchmarks. We will include a summary of this experiment in the revised manuscript and provide full details in the appendix.
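For concreteness, one simple way to build such a benchmark (our illustrative construction; the exact protocol may differ) is to hold shape and color fixed as static factors while position traces a short trajectory as the dynamic factor:

```python
import numpy as np

def make_sequence(render_frame, n_frames=8, rng=None):
    """Build one sequential-dSprites clip: fixed (shape, color), moving (posX, posY).

    `render_frame(shape, color, x, y)` is a hypothetical helper returning one image.
    """
    rng = rng or np.random.default_rng()
    shape, color = rng.integers(3), rng.integers(3)  # static factors, fixed per clip
    x, y = rng.random(2)                             # initial position
    vx, vy = rng.uniform(-0.1, 0.1, size=2)          # per-frame velocity (dynamics)
    frames = []
    for _ in range(n_frames):
        frames.append(render_frame(shape, color, x, y))
        x, y = np.clip(x + vx, 0.0, 1.0), np.clip(y + vy, 0.0, 1.0)
    return np.stack(frames)
```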

Q3:

The formalization of sequential disentanglement as in Eqn (10) around line 582 needs more elaboration. A fully disentangled latent space can be interpreted as latent variables being mutually independent (or orthogonal from the Banach space perspective). Please elaborate on the statement.

Response: We respectfully disagree with the reviewer’s interpretation of Equation (10) and the associated problem formulation. The equation in question defines a general generative model over sequences and serves as the foundation for our problem formulation. Our goal is to model the latent space $z^{1:V}$ such that it disentangles into static and dynamic components (i.e., $z = [s, d^{1:V}]$), which is a central challenge in unsupervised sequential disentanglement.

The reviewer’s suggestion that disentanglement should correspond to mutually independent or orthogonal latent variables represents a particular solution or modeling assumption rather than a general problem formulation. Indeed, enforcing independence in the latent space appears in our posterior probability. However, we deliberately avoid making such assumptions at the problem definition stage in order to remain broadly applicable to various modeling choices and disentanglement criteria. We therefore emphasize that our formulation is intentionally general, aiming to provide a foundation upon which different disentanglement strategies (including those based on independence or orthogonality) can be built and compared.

Q4:

Related to 3), around line 184, the proposed method promotes disentanglement. Without theoretical analysis, the empirical demonstration in G.2 is not sufficient. More comprehensive experimentation is needed.

Response: Thank you for raising this point. To directly address your concern, we expanded the empirical evidence and conducted an additional experiment on the MUG dataset, where we have access to both static and dynamic labels. Specifically, we aim to investigate: (1) the effect of sharing the static representation across time, and (2) whether using a small dynamic representation serves as a useful inductive bias for sequential disentanglement.

Below, we report results on the swap task using the MUG dataset, following a similar setup to Appendix G.2. This time, we evaluate performance using the accuracy of a classifier applied to the swapped samples, leveraging the availability of ground-truth labels. The results reinforce the insights reported in G.2. First, when the static representation s is not shared across time, all information collapses into it, thereby eliminating the ability to achieve sequential disentanglement. Second, we observe that the disentanglement gap, defined as the difference in swap accuracy between the static and dynamic tests, is more pronounced when using a smaller dynamic latent size (64 vs. 256): 95.59% − 80.08% versus 92.12% − 81.28%. A larger disentanglement gap reflects better separation of dynamic and static factors.

We note that while the gap is relatively small in this dataset, the experiment in G.2 shows a much larger difference, suggesting that the effectiveness of the bottleneck inductive bias may vary across datasets.

| $d_0^\tau$ size | $s$ shared? | Verification ACC ↑ (static frozen) | Action ACC ↑ (dynamics frozen) |
|---|---|---|---|
| 64 | yes | 95.59% | 80.08% |
| 256 | yes | 92.12% | 81.28% |
| 64 | no | 99.69% | 16.71% |
| 256 | no | 99.83% | 18.18% |

Disentanglement ablation on the MUG dataset.
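A sketch of the swap-accuracy protocol behind the table above, assuming pretrained `identity_clf` and `action_clf` judge classifiers and a hypothetical `encode`/`decode` interface to the model:

```python
import torch

@torch.no_grad()
def swap_accuracy(model, identity_clf, action_clf, x_a, x_b, y_id_a, y_act_b):
    """Swap test: static from clip batch A, dynamics from clip batch B.

    High identity accuracy (w.r.t. A) together with high action accuracy
    (w.r.t. B) on the generated clips indicates clean factor separation.
    """
    s_a, _ = model.encode(x_a)             # static codes from A
    _, d_b = model.encode(x_b)             # dynamic codes from B
    x_swap = model.decode(s_a, d_b)        # generate swapped clips
    id_acc = (identity_clf(x_swap).argmax(-1) == y_id_a).float().mean()
    act_acc = (action_clf(x_swap).argmax(-1) == y_act_b).float().mean()
    return id_acc.item(), act_acc.item()
```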

Q5:

I have strong reservations about suggesting a new evaluation metric (AED and AKD) in the same paper which proposes a new method. I can understand the limitation of the current metric and the need to propose new ones. At a minimum, the proposed metric (i.e., AED and AKD) should be used on existing methods to demonstrate their consistency. It’s OK to list them in the appendix.

Response: To clarify, we do not rely solely on AED and AKD for evaluation. In fact, we report standard sequential disentanglement metrics across all modalities - video (Table 9), audio (Table 4), and time series (Table 5). Our proposed metrics are intended to complement existing ones, especially in settings where traditional label-based evaluations are not feasible. Importantly, our results show strong consistency between the rankings and trends produced by AED/AKD and those from established metrics, supporting their validity. We also include detailed comparisons in the appendix to further illustrate this alignment.

Q6:

There is obvious broad societal impact for the proposed method, which should be mentioned. The answer of NA on line 1099 is not accurate.

Response: DiffSDA advances unsupervised sequential disentanglement across video, audio, and time‑series data, promising gains in interpretability, fairness, and domain reach. For example, engineers could discover whether a fault‑detection model secretly relies on a person’s accent or gender, while a language‑learning app might overlay only a speaker’s mouth movements onto a tutor video so that deaf or hard‑of‑hearing students can lip‑read in their own language.

At the same time, its power to manipulate high‑fidelity sequences heightens risks, especially in persuasive media, privacy, and clinical decision‑making. Editors can now swap a smile or a voice into a video with near‑perfect realism, which is excellent for film dubbing or creating sign‑language avatars but just as effective for deepfakes that spread disinformation. Likewise, fraudsters could overlay a convincing heartbeat pattern onto bogus insurance claims.

Comment

I appreciate the authors' effort in providing a detailed rebuttal. I do not completely agree with the authors' interpretation of sequential disentanglement and its relationship to the related concept in modeling, but I can see the authors' assumption is relevant and effective for the generative tasks. Hence I can accept it.

My remaining concern is related to Q6. It's obvious to me that the technique can be misused, and may have negative societal impact. According to the NeurIPS checklist, these issues should be identified and discussed in the manuscript. It's unclear to me if the authors plan to include the discussions in the revision. Please clarify.

Comment

We thank the reviewer for the follow-up and for acknowledging our detailed rebuttal. Regarding your remaining concern about Q6, we agree that discussing the broader societal impacts is essential. We will add a Broader Impacts section to the revised manuscript, in the main paper, based on the text provided in our rebuttal. We appreciate your constructive feedback and will ensure this addition is clearly incorporated into the final version.

Comment

Acknowledged. I will upgrade my rating.

Comment

Dear authors,

You mention that your method's loss has fewer terms than previous methods. I am not certain whether that means your method requires less tuning or effort to make it work; after all, your likelihood modeling/architecture is more complex as far as I can see.

More importantly, the extra terms of previous methods (e.g., low mutual information between static and dynamic) introduce inductive biases not covered by likelihood modeling alone. So, given that your method does not have these extra terms, what is the main source of disentanglement?

Best, AC

Comment

You mention that your method's loss has fewer terms than previous methods. I am not certain whether that means your method requires less tuning or effort to make it work; after all, your likelihood modeling/architecture is more complex as far as I can see.

Thank you for the thoughtful question. While our approach introduces some additional modeling components, we have taken care to keep both the design and tuning complexity as streamlined as possible.

In terms of disentanglement, our method relies on tuning just two main hyperparameters—the dimensions of the static and dynamic vectors (s and d)—whereas prior work often involves balancing multiple loss terms with carefully chosen weights, which can increase both the tuning effort and sensitivity. Architecturally, although our model might seem more elaborate, in practice we found that adjusting just two hidden layer sizes was sufficient for stable and competitive performance. This limited search space kept the tuning process straightforward. Regarding likelihood modeling, we recognize that diffusion models bring additional modeling complexity. However, we see this as a necessary and practical choice for handling real-world datasets such as CelebV-HQ, VoxCeleb, and TaiChi-HD. Importantly, we build on EDM [1], a highly robust and stable diffusion framework, and adopted most of its default hyperparameters. This led to a smooth training process with minimal increase in tuning time.

Overall, our goal was to balance modeling power with simplicity and robustness. In practice, we found our tuning demands to be comparable to or lighter than those of prior methods.

More importantly, the extra terms of previous methods (e.g., low mutual information between static and dynamic) introduce inductive biases not covered by likelihood modeling alone. So, given that your method does not have these extra terms, what is the main source of disentanglement?

We appreciate the opportunity to clarify how unsupervised disentanglement is achieved in our method by learning with a single, simple loss objective.

As outlined in lines 183–187, disentanglement emerges from two key inductive biases built into our model:

  1. Shared static vector across time: Using the same static vector for all time steps inherently prevents it from modeling temporal variation. As a result, temporal changes must be captured by the dynamic vector. If the static vector encoded time-varying features, it would produce identical outputs across frames, collapsing the temporal dimension and impairing both reconstruction and disentanglement.

  2. Low-dimensional dynamic vector: By constraining the dynamic vector’s dimensionality, we limit its capacity to store detailed or identity-specific information. This bottleneck encourages separation between dynamic content (e.g., motion) and static identity features.

We provide empirical validation of these assumptions in App. G.2. Ablation studies show that removing the shared static vector significantly degrades disentanglement. Similarly, increasing the dimensionality of the dynamic vector leads it to capture static features, weakening the separation between the two factors.

Finally, we acknowledge the importance of additional loss terms and novel inductive biases, such as those introduced in prior work, which can further enhance disentanglement. Investigating how such mechanisms can be integrated into our framework is a promising direction for future work.

[1] "Elucidating the Design Space of Diffusion-Based Generative Models", Tero Karras et al. (NeurIPS 2022)

Final Decision

This paper proposes a framework called DiffSDA for unsupervised sequential disentanglement across modalities. By combining diffusion models and autoencoders, the framework effectively decomposes sequential data (e.g., videos, audio, and time series) into static and dynamic latent factors (a popular task with quite some prior work). The paper models time dependency in the latent space, and conditioned on the latent representations, high-fidelity generation is achieved with a conditional diffusion model applied independently at each time step. To my knowledge, this is the first work to use a diffusion model for this specific sequence disentanglement task. With a reasonable architecture design, the model is trained with a single diffusion loss, and the authors claim this to be a major advantage (of course, the method still requires tuning given the complexity of the architecture).

Pros: The reviewers and I are impressed with the diversity of datasets and the experimental results. The generated image sequences are of high quality (resolution), which is hard to achieve without diffusion models.

Cons: On the other hand, the reviewers and I have concerns about the novelty of the proposed method, which is essentially applying diffusion-model-based likelihood modeling with encoder feature guidance to sequence data. Moreover, given that the method does not contain other regularization terms to encourage disentanglement, at this point it requires tuning of representation capacity via dimensionality. I suggest the authors perform comparisons with such regularizations (which may in fact further improve performance), as well as a careful analysis of the source of disentanglement, and submit the improved work to a future venue.