Generalizable Implicit Motion Modeling for Video Frame Interpolation
Abstract
Reviews and Discussion
The authors propose GIMM to effectively model intermediate motions. The three core designs are: normalization of the initial bidirectional flows, motion encoding (spatiotemporal motion latent extraction from flows), and an adaptive coordinate-based INR. The framework first extracts bidirectional flows of the input frames via off-the-shelf optical flow models (e.g., RAFT) and normalizes them. Next, a Motion Encoder extracts motion features from the normalized flows. The motion feature maps are forward-warped with respect to the target timesteps. A Latent Refiner refines the warped motion features, and using the refined motion features, a coordinate-based MLP network predicts the normalized flow maps at the target timestep, which can be reversed to bidirectional flows at the original scales for frame synthesis. This implicit motion modeling framework is first trained individually with a reconstruction loss on the predicted flow maps. Once it is pre-trained, a frame synthesis module is attached at the end of GIMM, and the whole model is jointly trained end-to-end with a frame reconstruction loss.
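In pseudocode, my understanding of the data flow is roughly the following. Every module below is a stand-in placeholder of my own (not the authors' code); the sketch only makes the ordering and tensor shapes concrete:

```python
import torch
import torch.nn as nn

# Shape-level sketch of the pipeline as I read it; all modules are placeholders.
B, H, W, D = 1, 64, 64, 32
frame0, frame1, t = torch.rand(B, 3, H, W), torch.rand(B, 3, H, W), 0.3

flow_net   = lambda a, b: torch.randn(B, 2, H, W)   # off-the-shelf estimator (e.g., RAFT)
motion_enc = nn.Conv2d(4, D, 3, padding=1)          # Motion Encoder (placeholder)
refiner    = nn.Conv2d(D, D, 3, padding=1)          # Latent Refiner (placeholder)
inr        = nn.Linear(D + 3, 4)                    # coordinate-based MLP (a SIREN in the paper)
synthesis  = lambda *args: torch.rand(B, 3, H, W)   # frame synthesis module (placeholder)

f01, f10 = flow_net(frame0, frame1), flow_net(frame1, frame0)  # 1) bidirectional flows
n01, n10 = f01, f10                                            # 2) normalization (scheme omitted here)
feats    = motion_enc(torch.cat([n01, n10], 1))                # 3) motion features, [B, D, H, W]
latent_t = refiner(feats)                                      # 4) forward-warp to t + refine (warp omitted)

ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
coords = torch.stack([xs, ys, torch.full_like(xs, t)]).expand(B, 3, H, W)  # 5) per-pixel (x, y, t)
norm_flows = inr(torch.cat([latent_t, coords], 1).permute(0, 2, 3, 1))     # normalized F_{t->0}, F_{t->1}
frame_t = synthesis(frame0, frame1, norm_flows, t)                         # 6) un-normalize + synthesize
```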
Strengths
- Strong results: the video in the supplementary material and the visualized motion fields look great in quality.
- To my knowledge, the use of INRs for motion modeling in VFI without test-time training is a novel approach.
Weaknesses
- Lack of Experiments
1.1 Benchmarks
- The authors use the test set from X4K1000FPS [42] and SNU-FILM-arb for evaluation, where SNU-FILM-arb is a dataset introduced by the authors themselves. Although there are diverse public datasets available for evaluation, they do not report experimental results on the public datasets prevalently used in the field. Namely, Vimeo90k [49] is the most commonly used dataset, which they used for training and for evaluation of the modeled flows. I wonder about the frame interpolation performance on Vimeo90k, as I believe there is no reason not to report it. In addition, I think there could have been other options for arbitrary-timestep frame interpolation benchmarks, e.g., Adobe240fps [52]. The proposed SNU-FILM-arb could be a useful benchmark for future studies, but I think the authors should have shown the validity of their method on public benchmarks before that.
1.2 Ablation studies
I find the ablation studies of their method a bit limited. The main points of their arguments are not experimentally verified in the ablation studies. I find the ablation studies in Table 2 to be a little off-topic, not focused on the main arguments.
- Normalization: although they claim normalization of flows over scales and directions is one of the key designs of the framework in the abstract and introduction, the experimental results on its effectiveness cannot be found in Section 4.2 and Table 2.
- Motion encoding: similar to normalization, the experimental studies on the effectiveness of motion encoding cannot be found in the paper. They only show how the Latent Refiner affects the performance, but not the motion encoding process, which is one of the contributions that the authors claim.
- Generalizability: The title and the abstract emphasize the 'generalizability' of the method, but no experiments on generalizability are shown. Although they argue that their method can be smoothly integrated with existing flow-based VFI works, there is no experiment on this. This part especially makes me consider the paper to be a bit over-claimed.
- Weak Analysis / Explanations
This aligns with my concern raised above in 1.2 Ablation studies. The authors do not give a good explanation of the results or attempt deeper analysis of their main arguments.
- Normalization: In the abstract and introduction, the authors claim that normalization of flows is one of the key designs, but they neither experimentally show nor intuitively explain its importance / role. I think there should have been at least an intuitive explanation of its necessity, along with an experimental result supporting it. The authors mention that they perform normalization following IFE [13], but if it simply follows prior work, it cannot be a contribution of the paper. If the authors claim this as a contribution, it needs further analysis.
- INR: to my understanding, this part is the biggest contribution of the paper. However, I am not very convinced by the use of INRs. To my understanding, the INR takes the 3D coordinate as the input along with the motion latent codes at each pixel. I wonder why the spatial coordinates are necessary, as the network will always predict the flow of its own spatial position. Furthermore, in that case, the target timestep t is the only part that would make a difference, which makes me curious why we have to adopt the INR form rather than another form of conditioning mechanism. For instance, I think a conditional U-Net used in the diffusion literature could suffice. The authors do not give a solid explanation of why the INR form is necessary, and I wish to hear a clearer explanation.
- Presentation
I feel that some important details, especially the core designs, are not described sufficiently. For instance, although the INR part is the largest contribution of the method, it is explained with only a couple of lines in the main manuscript, and the supplementary materials do not provide sufficient explanations. My main questions (#2 above & #1 in Questions) arise from the ambiguous description of the INR part: what is the input / output of the INR model, how are the input / output tensors shaped, etc. I had to make assumptions on this part, which made me fundamentally wonder why the INR form is necessary.
[52] Su, Shuochen, et al. "Deep video deblurring for hand-held cameras." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
Questions
- What is the input / output of coordinate-based MLP (INR), and why are the spatial coordinates necessary?
- I had a hard time trying to understand this. To my understanding, the motion latent code L_t from the Latent Refiner would have a shape of H x W x D where H, W, D denote height, width and dimension, respectively. According to the main paper and the supplementary, the latent L_t and 3D coordinates x, y, t are concatenated and fed to the SIREN network. In that case, the concatenated representation would have a preserved spatial size of H x W, with each coordinate having a different latent code L_t. If that is the case, as mentioned in the weakness section above, I believe there is not much reason to include the spatial (x, y) coordinates as the input of the INR network, as the network predicts the latent flow of its own spatial coordinate, and I even wonder why an INR form is necessary. For example, can't it simply be in the form of a conditional U-Net, with the target timestep given as in the diffusion literature? The experiments in Table 2 show that the use of spatial coordinates does affect the performance, but I failed to understand the reason for this. Could there be an explanation of the reason for the increase? There could possibly be some misunderstanding on my part, and I wish for some clarification, as it is a very important component of the paper.
- What is the input / output of the Latent Refiner?
- According to Fig. 2 of the main paper, the output of the Motion Encoder, K_i, seems to be fed to the Latent Refiner. However, in the supplementary, the Latent Refiner does not seem to take K_i as input, but only the warped motion features. I wonder which visualization is correct.
- In connection with question #1 above, I understood the output of the Latent Refiner to be of shape H x W x D. I feel that this should be correct, but I am uncertain since there is not much description of it.
- How is the normalization of flows done? With maximum scalar values? With a large constant value? Or by log-scaling? I wonder how it is done, for the paper to be self-contained.
- How is the forward-warping method evaluated for motion modeling in Table 1? To my understanding, the backward flows from the target timestep, F_{t->0} and F_{t->1}, are used for evaluation. However, forward warping could only provide F_{0->t} and F_{1->t}, which does not sound like a fair comparison. Did the authors use techniques such as flow reversal [48] or complementary flow reversal [42]?
- I wonder about the computational costs of the framework, i.e., the number of parameters and runtime, in comparison to state-of-the-art methods. The proposed framework seems to consist of many modules, which could require substantial computational cost.
Limitations
Yes, the authors have addressed their limitations.
Thank you for the constructive comments. Please find the following for our response.
Q1: The authors use the test set ... I wonder about the frame interpolation performance on Vimeo90k, ...
A1: Due to the word limit, please refer to the global response and A5-2 in our response to Reviewer rbyn.
Q2: ... other options for arbitrary-timestep frame interpolation benchmarks, e.g., Adobe240fps [52].
Thanks for your suggestion. We calculate PSNR on Adobe240fps following the evaluation settings of IFRNet [24] and use the test split defined by VideoINR [7]. The results are listed below:
| Method | Adobe240fps (PSNR) |
|---|---|
| IFRNet | 31.08 |
| AMT | 30.70 |
| UPR-Net | 32.01 |
| CURE | 31.64 |
| EMA-VFI | 31.26 |
| GIMM-VFI-R | 32.31 |
| GIMM-VFI-F | 32.33 |
Our method achieves stronger performance. This further illustrates the effectiveness of our method for the arbitrary-timestep VFI task.
Q3: What is the input / output of the coordinate-based MLP (INR)?
A3: We concatenate the motion latent L_t and the spatiotemporal coordinates (x, y, t) as the input to the INR. The INR outputs the corresponding normalized flow.
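In shape terms, the operation is simply the following (illustrative code, not our implementation; the linear layer stands in for the SIREN MLP, and the 4-channel output layout is shown only for illustration):

```python
import torch

# Per-pixel concatenation of the motion latent L_t with the (x, y, t) coordinates,
# mapped to the normalized flow. Shapes are the point; weights are untrained.
B, H, W, D = 1, 32, 32, 16
latent_t = torch.rand(B, H, W, D)                  # refined motion latent L_t
coords   = torch.rand(B, H, W, 3)                  # (x, y, t), with t shared across pixels
siren    = torch.nn.Linear(D + 3, 4)               # stands in for the SIREN MLP
out      = siren(torch.cat([latent_t, coords], dim=-1))   # [B, H, W, 4]: normalized flow
print(out.shape)
```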
Q4: ... why an INR form is necessary ... can't it simply be in the form of a conditional U-Net, with the target timestep given as in the diffusion literature?
A4: As described in Section 2, the INRs have effective modeling ability for complex, high-dimensional data and can learn continuous mapping from a set of coordinates to a specific signal. Thus, we are motivated to leverage the INR for continuous motion modeling at arbitrary timesteps. We experiment by replacing the INR with a timestep-conditioned U-Net as in diffusion literature. The results on VTF, VSF and model parameters are listed below:
| Method | VTF(PSNR) | VTF(EPE) | VSF(PSNR) | VSF(EPE) | Params |
|---|---|---|---|---|---|
| GIMM (U-Net) | 36.96 | 0.39 | 29.96 | 2.90 | 4.27M |
| GIMM (INR) | 37.56 | 0.34 | 30.45 | 2.68 | 0.25M |
Replacing the INR with a U-Net results in worse performance, especially on the 6X motion modeling benchmark VSF. This demonstrates the strong continuous modeling ability of the INR. Besides, GIMM with the INR has a much lighter architecture, which improves efficiency. Therefore, it is necessary and appropriate to use the INR for continuous motion modeling.
Q5: ... the use of spatial coordinates does affect the performance, but I failed to understand the reason for this ...
A5: As explained in A4, we are motivated to use the INR for its effective modeling ability, learning a continuous mapping from a set of coordinates to a specific signal. The coordinates provide positional information for INRs and help to learn a continuous mapping since the coordinates are inherently continuous. In our motion modeling, we aim to model continuous motion between timesteps. The motion refers to dense optical flows, which consist of spatiotemporal changes. Therefore, it is necessary to use spatial coordinates for effective motion modeling. The experimental results in Table 2 demonstrate this necessity in practice.
Q6: why the spatial coordinates are necessary, as it will always predict the flow of its own spatial position.
A6: In addition to A5, we would like to further clarify that motion latent code serves as a function space for the implicit representations of different instances. The latent code is used for making INRs generalizable which means no need for test-time optimization. The spatiotemporal coordinates are still required for continuous modeling of spatiotemporal changes, such as dense optical flow in our case. Similar insights are shared in the related literature [5].
Q7: Generalizability:...
A7: As described in Section 2 (page 3) and Section 3.2 (page 4), “generalizability” means the generalizable modeling ability of the generalizable INRs (GINRs) across different instances. Unlike per-instance modeling INRs, GINRs do not require test-time learning. Our GIMM takes the motion latent code as an additional input to the INR and achieves generalizable motion modeling without the need for test-time optimization. A similar definition for generalizability can be found in the referenced paper [23].
Q8: How is the forward-warping method evaluated for motion modeling in Table 1?
A8: The forward-warping method produces 'backward' flows F_{t->0} and F_{t->1} from the estimated flows F_{0->1} and F_{1->0} without flow reversal techniques. For example, F_{t->0} can be easily obtained from W(-t * F_{0->1}, t * F_{0->1}), where W(X, V) indicates forward warping the objective X with the referenced motion vector V.
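As a toy illustration of this construction (the nearest-neighbour splat below is a simplified stand-in, not the exact splatting used in practice):

```python
import torch

def forward_splat_nearest(values, flow):
    # Minimal nearest-neighbour forward warping (splatting): each source pixel's value is
    # added at the location it moves to under `flow`. Real implementations use softmax /
    # average splatting and handle collisions and holes more carefully.
    B, C, H, W = values.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    x_dst = (xs + flow[:, 0]).round().long().clamp(0, W - 1)   # [B, H, W]
    y_dst = (ys + flow[:, 1]).round().long().clamp(0, H - 1)
    out = torch.zeros_like(values)
    idx = (y_dst * W + x_dst).view(B, 1, -1).expand(B, C, H * W)
    out.view(B, C, -1).scatter_add_(2, idx, values.view(B, C, -1))
    return out

# Approximate F_{t->0} from F_{0->1}: splat the time-scaled reverse displacement
# -t * F_{0->1} to the positions it occupies at time t (motion vector t * F_{0->1}).
B, H, W, t = 1, 48, 48, 0.5
f01 = torch.randn(B, 2, H, W)
f_t0 = forward_splat_nearest(-t * f01, t * f01)
```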
Q9: I wonder about the computational costs of the framework.
A9: Due to the word limit, please kindly refer to the global response.
Q10: How is the normalization of flows done?
A10: The normalization process follows IFE[13]. We agree with the reviewer and will delete this part from our key designs in our revised manuscript.
Q11: Motion encoding: … the experimental studies on the effectiveness of motion encoding cannot be found in the paper…
A11: Thanks for your suggestion. Due to the word limit, please refer to the global response.
Q12: What is the input / output of the Latent Refiner?
A12: The Latent Refiner takes both the motion features and the coarse motion latent (the warped motion features) as input and outputs a residual for the coarse motion latent. We will update the figure of the Latent Refiner's architecture in the supplementary of our revised manuscript.
Q13: Presentation. I feel that some important details, ..., What is the input / output of the INR model, how are the input / output tensors shaped, etc….
A13: Thanks for your suggestion. We will specify more details about the INR in the revised manuscript.
Thank you for the detailed response.
Q1. Vimeo90k
My concerns about Vimeo90k have been partially addressed. To make myself clear, my initial concern about Vimeo90k was raised because I could not understand the main reason for using VTF and VSF as main benchmarks, as they are on flows, which is not the ultimate goal of VFI.
Although it could be a benchmark to show that the proposed framework models motion fields properly, I believe it is a limited benchmark that can only serve as an aid to frame reconstruction, useful for deeper analysis of the method, and cannot be a main benchmark for comparing validity as a VFI method.
Yet, the authors seem to use VTF and VSF as main benchmarks, considering the responses in the rebuttal. I wonder about the reason for this, as better flow reconstruction performance does not necessarily mean better frame interpolation results, although the two are correlated.
If the reconstruction performance of flows and frames correlated perfectly, I think it would be reasonable to report the frame reconstruction performance on the original benchmarks, rather than reporting the flow reconstruction performance on a modified benchmark.
Could the authors provide further explanations on this matter?
Q3 - Q6. Necessity of spatial coordinates
Thank you for the response.
However, I still struggle to understand and agree with the use of spatial coordinates. The authors cite [5] as a work that shares the same insight, but to my understanding, there is a crucial difference from [5]. In [5], the model takes the relative spatial coordinates between the query coordinate and the reference latent code as the coordinate input.
Let x_1, ..., x_N be the spatial coordinates of an image. The work of [5] is on image super-resolution, and they use the latent codes at coordinates x_i to predict the RGB at a query coordinate x_q, where x_q is generally not one of the x_i. Their INR model uses the latent codes z_i at x_i along with their corresponding relative coordinates to the query coordinate, x_q - x_i, respectively, as inputs. Formally, it would be as follows: s(x_q) = f(z_i, x_q - x_i), where f is the INR function.
I believe this is meaningful, as the goal of the INR model is to predict the RGB at a coordinate different from that of the input latent code.
However, according to GIMM's formula, the INR model predicts the feature at the same position as the motion latent code, which I think is not necessary. Putting the GIMM formula in the form of the equation above, it would be something like this: F_t(x) = g(L_t(x), (x, t)). The latent code at x predicts the feature at the same location, different from the formula of LIIF [5]. Here, I do not think the coordinate information x is necessary to the function g.
For further explanation, Eq. 2 of [5] takes x_q - x_i as the coordinate input of the INR model, whereas I believe the GIMM formula takes x itself as the coordinate input. x_q denotes the query coordinate and x_i denotes the coordinate of the reference (key) latent code. This is very different.
The authors appeal to continuous mapping to justify the spatial coordinate inputs, but it still fails to convince me, as this task does not require a continuous representation along the spatial dimension, with the input and output sharing the same spatial size.
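To summarize the contrast formally (this is my reading and may be mistaken; f, g, z_i, and L_t are my own notation):

```latex
% LIIF [5], Eq. 2: the signal at a query coordinate x_q is decoded from a nearby latent
% code z_i together with the relative offset between the query and that code's coordinate x_i.
s(x_q) = f\big(z_i,\; x_q - x_i\big)

% GIMM, as I understand it: the latent code at (x, y) predicts the flow at the same (x, y),
% which is why the spatial part of the coordinate input seems redundant to me.
F_t(x, y) = g\big(L_t(x, y),\; (x, y, t)\big)
```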
Q7. Generalizability
Thank you for the clear response. It has been well clarified that the term 'generalizability' is used differently from my understanding, and in that case, the use of the term is understandable. However, the main part that caused the confusion and still concerns me is "Our GIMM can be smoothly integrated with existing flow-based VFI works without further modifications", which is claimed in the abstract and conclusion. This part has not been shown adequately, as there are no experiments on it, and I still feel that it is overclaimed.
Q8. Forward warping
Thank you for the clarification. I think it would be better if described in the paper.
Q2, Q9-Q13.
Thank you for the clarification.
Although some of my questions have been clarified, the clarified parts are mostly on further description of the method. Yet, I still believe that the current version of the paper contains several overclaims (e.g., necessity of spatial coordinates, generalizability to existing VFI works).
Dear Reviewer ZRiZ,
Thank you for the feedback. Regarding your concerns, we would like to clarify further. Please find the following for our response.
New-Q1: Q3 - Q6. Necessity of spatial coordinates
New-A1: We would like to clarify that it is necessary for GIMM to include spatial coordinates within its input.
To make it clearer, we first summarize three important explanations that we made in the rebuttal, numbered 1), 2) and 3) as below:
1)(A4) INRs can effectively learn continuous mappings from a set of coordinates to a specific signal.
2)(A5) Our INR-based method GIMM should include spatial coordinates in its input, since GIMM models dense optical flows between timesteps which contain spatiotemporal changes.
3)(A6) We cite [5] (Please refer to the “Learning implicit function space.” paragraph in Section 2 of [5]) for the shared insight that latent code serves as a function space for the implicit representations of different instances. The latent code is used for making INRs generalizable — free of test-time optimization. The coordinates are the key to learning continuous mapping since they are continuous.
According to your feedback, we would like to clarify two points.
i) We cite [5] (Please refer to the “Learning implicit function space.” paragraph in Section 2 of [5]) to explain the usage and effects of the latent code in INR. The latent code works as a function space to make the INR generalizable across instances. The latent code alone cannot achieve the best continuous modeling performance for the INRs, as can be observed in Table 2 (page 8). We didn’t cite [5] to discuss the necessity of spatial coordinates.
ii) We integrate spatial coordinates with temporal coordinates in our GIMM's inputs since GIMM is designed to continuously model the spatiotemporal changes of motion. Specifically, GIMM models dense optical flows, which vary in spatial distribution as the timestep changes. For instance, the optical flow associated with a moving ball will occupy different spatial positions at different timesteps. Consequently, we use spatiotemporal coordinates in our GIMM's INR to enable continuous modeling of spatiotemporal changes. If we exclude spatial coordinates from our coordinates (x, y, t), the learned continuous mapping won't be spatiotemporal, and there is likely to be spatial noise in the predicted flows, especially around occluded regions. According to our Table 2 (page 8), spatial coordinates are proven to be necessary for GIMM's motion modeling.
New-Q2: Q7. Generalizability and Plug-in ability of our method
New-A2: First of all, to avoid confusion in the response, we would like to clarify that “Generalizable/Generalizability” means the generalizable modeling ability of the generalizable INRs (GINRs) across different instances without requirements on test-time learning. As described in A7 and according to your feedback, your concern about the “Generalizable/Generalizability” term is well-addressed.
Regarding your current concern about our model’s plug-in ability on existing flow-based VFI methods, we have conducted experiments during the rebuttal and demonstrated the effectiveness of our GIMM when GIMM is plugged into other VFI methods, such as IFRNet and TTVFI. Please refer to A5-1 in our response to Reviewer rbyn for more details.
New-Q3: Vimeo90k. … I could not understand the main reason for using VTF and VSF as a main benchmark….
New-A3: We would like to clarify that our main benchmarks are arbitrary-timestep frame interpolation benchmarks, i.e., XTest and SNU-FILM-arb. We perform evaluations on these benchmarks, comparing our GIMM-VFI with VFI methods of different motion modeling strategies (Table 1, page 7) and with state-of-the-art methods (Table 2, page 8).
Our core contribution is the GIMM module which performs motion modeling. To assess the modeled motion quality, we use the VTF and VSF as benchmarks. The results of these benchmarks are utilized to demonstrate modules’ motion modeling capabilities and to perform an ablation study on the GIMM module design. Therefore, for questions and justifications on the designs of GIMM, we evaluate the designs on VTF and VSF.
I am deeply grateful for the authors' efforts to clarify.
Necessity of spatial coordinates
Unfortunately, I am not convinced with the explanations provided by the authors. First, I agree with the authors on the usage of latent codes working as a function space in INRs for generalizability, and that continuous modeling in terms of spatial axis is necessary for continuous modeling.
Yet, the authors emphasize "continuous mappings" and "spatiotemporal changes", which does not resolve my main concern. The explanations they provided with the example of a moving ball, to my understanding, could make sense in a forward-warping based approach, but not in a backward-warping based approach. According to their method, the latent codes at each spatial position would contain the history of pixels passing through the corresponding locations over time, and would already contain the information necessary for their own location, which would only necessitate temporal timestep conditioning. If the model does not predict the flow at a position different from that of the latent code, the spatial coordinate input for the INR still seems redundant. Although the authors claim that its effectiveness has been proven experimentally, the insufficient explanation makes me feel uncertain about the results.
Plug-in ability
In response A5-1 to the reviewer rbyn, the authors said “Notably, plugging in a better continuous modeling module doesn’t guarantee better model fitting since model fitting requires more on the model’s learning strategies and its overall design.” I think this statement contradicts the claim of their paper, where they say that their method “can be smoothly integrated with existing flow-based VFI works without further modifications”, although they somehow succeeded in integration with existing VFI works.
The integration with, for instance, IFRNet could have been possible, but not smoothly, as several components of IFRNet do not align well theoretically with the proposed method. IFRNet requires 1) coarse-to-fine prediction of flows with a hierarchical structure, and 2) feature refinement (prediction) jointly with flow predictions. However, the proposed method uses neither 1) coarse-to-fine (hierarchical) prediction, nor 2) joint prediction of flows and features. Although the integration could be made in some form, it needs quite some modifications, which makes me feel that the paper is over-claimed.
Thanks for your feedback on our response. We would like to clarify your current concerns further. Please find the following for our response.
Necessity of spatial coordinates
First of all, we would like to highlight our explanations of using spatial coordinates, as described in our previous responses.
- Our GIMM aims to perform continuous motion modeling. The modeled motion consists of dense optical flows, which involve spatiotemporal changes. Therefore, GIMM aims to continuously model spatiotemporal changes.
- We use an INR to achieve continuous modeling since an INR learns a continuous mapping. The key fact that allows an INR to learn a continuous mapping is that its input coordinates are continuous.
- The latent code serves as a function space for the implicit representations of different instances [5]. The latent code is not the key to learning a continuous mapping.
According to the above three points, we clarify that using spatiotemporal coordinates is necessary for our GIMM to continuously model motion with spatiotemporal changes. We also make some specific clarifications below according to the current feedback.
Q: The explanations they provided with an example of a moving ball, to my understanding, could make sense in a forward-warping based approach, but not in a backward-warping based approach.
A: We use the moving ball example to support clarification 1) above: dense optical flows involve spatiotemporal changes. Spatiotemporal changes exist in all optical flows, and all flows face the problem of occlusion. This point is independent of the warping approach.
Q: According to their method, the latent codes of each spatial position would contain the history of pixels going through the corresponding locations with time, and would already contain the information necessary for their own location, which would only necessitate temporal timestep conditioning.
A: As described in explanations 2) and 3) above, the latent code is used to make the INR generalizable. The key to achieving continuous modeling is the coordinates rather than the latent code. The latent code alone does not ensure continuous modeling.
Notably, the goal of our method is to continuously model the motion of spatiotemporal changes. Therefore, it is necessary to include spatiotemporal coordinates as the inputs to our INR in GIMM.
Q: If the model does not predict the flow at a position different from that of the latent code, the spatial coordinate input for the INR still seems redundant.
A: The goal of the INR is to continuously model its target signal, i.e., dense optical flows in our case. Simply using the latent code as input will not achieve continuous modeling. Since the flows involve spatiotemporal changes, spatiotemporal coordinates are necessary for continuous modeling.
In fact, the usage of spatiotemporal coordinates in INR can also be found in related literature [41].
Plug-in ability
Q: In response A5-1 to the reviewer rbyn, the authors said “Notably, plugging in a better continuous modeling module doesn’t guarantee better model fitting since model fitting requires more on the model’s learning strategies and its overall design.” I think this statement contradicts the claim of their paper, where they say that their method “can be smoothly integrated with existing flow-based VFI works without further modifications”, although they somehow succeeded in integration with existing VFI works.
A: Please consider the full context of the quoted sentence from our response in A5-1. Here is the complete description with the context:
“We would like to clarify that our method GIMM focuses on continuous motion modeling, which further enables frame interpolation at arbitrary timesteps. Arbitrary-timestep interpolation relies more on continuous modeling while fixed-timestep interpolation relies more on model fitting at the specific timestep of 0.5. Notably, plugging in a better continuous modeling module doesn’t guarantee better model fitting since model fitting requires more on the model’s learning strategies and its overall design.”
We would like to summarize and further clarify the following:
- Our method aims to perform continuous motion modeling. The continuous modeling ability further enables frame interpolation at arbitrary timesteps. Since the task we focus on is interpolation at arbitrary timesteps, the plug-in ability we referred to is for enhancing the arbitrary-timestep interpolation ability of the existing flow-based VFI works.
- Arbitrary-timestep interpolation relies more on continuous modeling while fixed-timestep interpolation relies more on model fitting at the specific timestep of 0.5. The quoted sentence is used to explain that a module designed for the arbitrary-timestep interpolation task doesn’t guarantee its performance on the fixed-timestep interpolation task.
Therefore, the quoted sentence neither contradicts our claim of plug-in ability nor holds relevance to it.
Q: The integration, for instance, to IFRNet, could have been possible, but not smoothly, as several components of IFRNet are not theoretically aligned well with the proposed method.
A: By integrating GIMM into existing flow-based VFI methods, we mean to use the flows from GIMM to replace the original flows in the VFI methods. For instance, regarding the IFRNet, we simply replace all the flows of IFRNet with the flows predicted by GIMM. We keep the architecture of the IFRNet exactly the same. We believe our operation of simply replacing the flows with ours (from GIMM) can be described as ‘smooth’.
The same plug-in operation can also be found in Section 3.3 of the reference [13].
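As an illustration, the plug-in operation amounts to the following (the callables below are illustrative stand-ins with made-up interfaces, not the released code of RAFT, GIMM, or any host VFI network):

```python
import torch

def interpolate_with_plugged_gimm(frame0, frame1, t, flow_net, gimm, host_vfi):
    f01 = flow_net(frame0, frame1)       # bidirectional flows from a pretrained estimator
    f10 = flow_net(frame1, frame0)
    f_t0, f_t1 = gimm(f01, f10, t)       # GIMM predicts the intermediate flows F_{t->0}, F_{t->1}
    # The host VFI method keeps its architecture; only its own flow estimates are replaced.
    return host_vfi(frame0, frame1, t, (f_t0, f_t1))

B, H, W = 1, 64, 64
flow_net = lambda a, b: torch.randn(B, 2, H, W)
gimm     = lambda f01, f10, t: (t * f10, (1 - t) * f01)        # placeholder, not the real GIMM
host_vfi = lambda i0, i1, t, flows: (1 - t) * i0 + t * i1      # placeholder synthesis
out = interpolate_with_plugged_gimm(torch.rand(B, 3, H, W), torch.rand(B, 3, H, W), 0.5,
                                    flow_net, gimm, host_vfi)
```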
Plug-in ability
- In response A5-1 to the reviewer rbyn~
Thank you for the clarifications. I have now understood that the main argument of the sentence was about continuous / arbitrary-timestep interpolation. This part has been clarified.
- Integration method
However, on the integration with existing VFI works, I cannot agree with the authors. Replacing all the flows of IFRNet with flows from GIMM cannot be said to be smooth integration with existing flow-based VFI methods, in the general sense. In that form of integration, it can no longer be said to be IFRNet. The integration method described by the authors neglects important components of IFRNet, such as the hierarchical coarse-to-fine estimation of flows and the joint prediction of flows and features based on estimations from the previous stage. The IFRNet model simply becomes a decoder for frame synthesis, given flows from GIMM. It loses many important characteristics of the IFRNet model. Although I cannot determine the precise number, the number of parameters involved would also differ greatly, I assume. IFRNet has 5M params, while GIMM-RAFT has 19.8M params. In short, as the integration method the authors explained requires important changes to the original model, I do not think the proposed method can be considered to have a 'smooth integration' / 'plug-in' ability.
If a simple replacement of the estimated flows in other flow-based VFI methods can be considered 'smooth integration', then the majority of flow-based VFI methods could also claim the same thing by replacing each other's estimated flows. In that case, it is no longer a novelty / contribution of the method.
The authors cite [13], but the work of [13] does not mention / claim their method to have a great plug-in ability. As part of their framework, they simply use an existing flow-based VFI work for blending / frame synthesis.
Given the authors’ response, my concern that this is an overclaim is firm.
Thanks for your timely feedback on our response. We are glad that our previous responses have addressed some of your concerns.
For the Integration method of our plug-in ability, we would like to clarify the following.
However, on the integration with existing VFI works, I cannot agree with the authors. Replacing all the flows of IFRNet with flows by GIMM cannot be said to be smooth integration with existing flow-based VFI methods, in general sense … The IFRNet model simply becomes a decoder for frame synthesis, given flows from GIMM. It loses many important characteristics of the IFRNet model.
As described in our previous response, we can simply use the flows from GIMM to replace the original flows in the existing VFI methods to achieve integration. We keep the architecture of the VFI method exactly the same and only the flows are replaced by ours. Taking IFRNet as an example, IFRNet keeps all of its structures when our GIMM is integrated into it. Naming the model after integration "IFRNet+GIMM", it not only extracts image features from the input images but also predicts masks for warping and synthesizing the interpolated frames. Both the hierarchical encoding and decoding parts of IFRNet are well leveraged in "IFRNet+GIMM". Therefore, we believe that the important attributes of IFRNet are kept and the integration / module plug-in is smooth.
The experiments for the plug-in ability have been conducted in A5-1 of our response to the reviewer rbyn. We provide results with IFRNet as below:
| Method | SNU-FILM-arb-4X | SNU-FILM-arb-8X | SNU-FILM-arb-16X |
|---|---|---|---|
| IFRNet | 34.88 | 31.15 | 26.32 |
| IFRNet+GIMM | 36.46 (+1.58dB) | 32.20 (+1.05dB) | 27.73 (+1.41dB) |
We believe that the plug-in ability of our GIMM is proven to be effective and easy to realize.
Although I cannot figure the precise number, the number of parameters involved would also greatly differ, to my assumption. IFRNet has 5M params, while GIMM-RAFT has 19.8M params.
It should be noted that 19.8M is the number of parameters for GIMM-VFI-R, our complete interpolation method implemented with RAFT. GIMM-VFI-R contains the RAFT flow estimator, GIMM and the frame synthesis module.
When plugging into existing VFI methods, we simply integrate GIMM and optionally a flow estimator if the VFI method does not already use one. In the case of IFRNet, we integrate GIMM and the flow estimator RAFT. The number of parameters of the integrated module is 5.05M (RAFT with 4.8M plus GIMM with 0.25M).
Besides, it should also be noted that when comparing the performance of an interpolation method, the best-performing variant is used. Therefore, IFRNet here refers to its best variant IFRNet-L, which has 19.7M parameters rather than 5M.
Due to the limited discussion period, I first respond to the points that can be replied to quickly, and may not be able to reply to all matters.
Continuous modeling
The authors emphasize the continuous modeling ability. Although their proposed design of using INRs for continuous modeling is novel, there are numerous methods capable of continuous modeling of flows. Many recent works either scale the bidirectional flows using the target timestep, i.e., in forms such as t * F_{0->1}, or use a conditioning method with the target timestep, and in this way, continuous modeling of flows is in fact possible [1, 16, 18, 20, 21, 24, 32, 33, 42].
It is acceptable if the authors claim that their continuous modeling is more precise, but the claim that it is impossible for these existing methods to provide continuous motions in a plug-in form is not true. All of the methods cited above are capable of continuous estimation of flow maps at arbitrary timesteps, and their intermediate flows could also be plugged into other frameworks, in the sense the authors use. Yet, none of these works claims to have a plug-in ability.
Thus it is hard to say that their method has a special plug-in ability.
IFE [13]
The same plug-in operation can also be found in Section 3.3 of the reference [13].
I believe this statement of the authors implies that [13] uses a plug-in operation, and I think the authors tried to use [13] to support the validity of their plug-in strategy. To my understanding, the authors' intention is that [13] does not have a plug-in ability, as its flow modeling method is "not generalizable" and is trained per instance. However, I do not find this part important from the 'plug-in' perspective. If the authors claim that 'generalizability' is the important part for plug-in ability, then, as mentioned above, numerous existing methods could also be said to have a 'plug-in' ability.
IFRNet
The experiments with IFRNet still do not sound like a 'smooth integration'. For a method to be easily used in a 'plug-and-play' form, the framework needs to be capable of disentanglement. For instance, in flow-based VFI, a method should have explicitly separated flow estimation and frame synthesis parts, so that the estimated flows can be smoothly replaced. However, IFRNet's contribution is the 'joint' prediction of flows and features, which means that the prediction of flows and features could be entangled, and thus I find it awkward to replace the flows with predictions from GIMM.
Thanks for the timely feedback. We would like to further clarify the following.
Continuous modeling
… Many recent works either scale the bidirectional flows using the target timestep, i.e., in forms such as t * F_{0->1} …
We would like to clarify that many recent VFI methods [18, 24, 27, 50] that perform backward warping operations, e.g., IFRNet, require the bilateral flows F_{t->0} and F_{t->1}. We believe there is no reason that a scaled forward flow such as t * F_{0->1} should be used in such scenarios.
…, but the claim that it is impossible for these existing methods to provide continuous motions in a plug-in form is not true. All of the methods cited above are capable of continuous estimation of flow maps at arbitrary timesteps, and their intermediate flows could also be plugged into other frameworks, in the sense of the authors use…
Please quote our complete response. The original sentence is that:
“it is quite impossible or much harder for these existing VFI methods to provide proper continuous motion to achieve the plug-in ability claimed by us.”
We claim that some methods are impossible to plug in (as described above, t * F_{0->1} is not suitable for flow-based VFI methods based on backward warping) and some methods are much harder to plug in.
GIMM focuses on motion: it can take the flows between the input frames and predict intermediate flows at any given timestep. The input flows can be easily obtained from pretrained flow estimators. In contrast, the existing VFI methods take the input images as input and predict flows that may not be appropriate to use in other VFI methods.
Therefore, we believe that GIMM has an effective plug-in ability to enhance the existing VFI methods’ ability on arbitrary-timestep interpolation. Notably, this is proven by our experiments in A5-1 of our response to the reviewer rbyn.
IFE [13]
We would like to clarify that our GIMM can be plugged into the existing flow-based VFI methods. The "existing flow-based VFI methods" refer to methods that are able to perform interpolation across instances. Therefore, we believe this is the reason that IFE [13] does not claim a plug-in ability, as it requires per-instance optimization.
Once more, as described in our previous response, the specific operation of plugging in is to use the flows from GIMM to replace the original flows in the VFI methods. We cite IFE[13] because the same operation has been used in it.
IFRNet
We would like to highlight that when integrating GIMM with IFRNet, we keep the original structure of IFRNet and simply replace its flows with ours. In our experiments, IFRNet+GIMM achieves a performance gain of over 1dB across all the subsets of the SNU-FILM-arb benchmark.
Given the description above, GIMM makes IFRNet achieve significant improvements on arbitrary-timestep interpolation with the simple operation of replacing the flows. We believe this is evidence that our method can be integrated into existing flow-based VFI methods smoothly.
bilateral flows
We believe there is no reason that a scaled forward flow such as t * F_{0->1} should be used in such scenarios.
t * F_{0->1} is an example showing that the existing methods cited above are also capable of continuous modeling, with plug-in ability across each other's methods. In addition, using flow reversal techniques [42, 48], scaled forward flows could be used in backward warping-based methods, so it is not impossible. Backward warping-based works such as [1, 18] also choose to use scaling with timesteps.
Once again, if simply using an intermediate flow estimation from another model could be declared 'plug-in' ability, most of the flow-based methods would be capable of 'plug-in'.
If a simple replacement of the estimated flows in other flow-based VFI methods can be considered 'smooth integration', then the majority of flow-based VFI methods could also claim the same thing by replacing each other's estimated flows. In that case, it is no longer a novelty / contribution of the method.
We would like to clarify that our core module GIMM aims to perform continuous motion modeling.
GIMM can take the flows between the input frames and predict intermediate flows at any given timestep. Furthermore, GIMM can also achieve good performance with different flow estimators, which makes it quite flexible. Notably, in Table 3 (page 8, GIMM-VFI-F vs. GIMM-VFI-R), we show that integrating a better flow estimator (FlowFormer [17]) can enhance model performance.
To the best of our knowledge, no other existing VFI method contains a similar module that i) is designed for continuous motion modeling with flows as input and ii) can be augmented flexibly with different flow estimators. Therefore, it is quite impossible or much harder for these existing VFI methods to provide proper continuous motion to achieve the plug-in ability claimed by us.
By plugging in GIMM, existing VFI methods can achieve better arbitrary-timestep interpolation performance as proven in A5-1 of our response to the reviewer rbyn.
Therefore, we believe that our GIMM is novel and has an effective plug-in ability.
The authors cite [13], but the work of [13] does not mention / claim their method to have a great plug-in ability. As part of their framework, they simply use an existing flow-based VFI work for blending / frame synthesis.
As described in our previous response, the specific operation of plugging in is to use the flows from GIMM to replace the original flows in the VFI methods. We cite IFE[13] because the same operation has been used in it.
As described in the “Implicit neural representations.” paragraph in Section 2 (page 3), IFE [13] also considers implicit flow modeling but it focuses on per-instance modeling and it is NOT generalizable. Therefore, a plug-in ability to existing VFI methods across various instances is not claimed in their paper.
We believe that IFE[13] does not contradict our claim.
This paper proposes a plug-and-play Generalizable Implicit Motion Modeling module to refine the optical flow in the task of video frame interpolation. Specifically, this module combines several core components of normalization, motion encoder, latent refiner, and coordinate-based network to achieve implicit motion modeling. Experimental results demonstrate the superiority of the optical flow obtained by the proposed GIMM approach.
Strengths
The accuracy of optical flow is an important issue that affects the effectiveness of video frame interpolation. It is valuable for the authors to model more accurate motion through implicit motion modeling.
Weaknesses
1. The K, F, and Z symbols in Equations 6-7 should be labeled in Figure 2. In addition, the description in L150-151 for K is confusing to read, especially since F is expressed as the difference between two coordinates.
2. How does Equation 11 obtain the ground-truth optical flow? As far as I know, none of the existing VFI datasets has corresponding optical flow.
3. Table 1 should include a comparison with some methods that model motion more complex than linear motion, such as QVI, ABME, and BMBC.
4. Table 3 should include parameter and runtime comparisons to demonstrate the effectiveness of GIMM.
- As GIMM is a plug-and-play motion modeling module, in Table 3 the authors should add GIMM to other existing flow-based VFI methods for a fairer comparison, e.g., ABME, TTVFI, RIFE, and IFRNet, because the designs of the optical flow estimation network and synthesis network differ across methods, and using a higher-performance optical flow estimation network also brings a significant performance improvement. In addition, validation on Vimeo90K and UCF101 is necessary.
Questions
I'm most concerned about the efficiency of GIMM and the improvement it brings when plugged into other flow-based VFI methods.
Limitations
The authors have discussed the limitations.
Thank you for the constructive comments. Please find the following for our response.
Q1: The K, F, and Z symbols in Equations 6-7 should be labeled in Figure 2. In addition, the description of L150-151 for K is confusing to read, especially F is expressed as the difference between two coordinates.
A1: Thanks for your suggestion. We will add the K, F, and Z symbols to Figure 2 and reformulate the description of L150-151 in our revised manuscript.
Q2: How does Equation 11 get the optical flow of GT? As far as I know, none of the existing VFI datasets have a corresponding optical flow.
A2: As described in Section 4 (page 6), we utilize FlowFormer [17] to produce pseudo-ground-truth optical flow for training and evaluation purposes.
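For illustration, the supervision can be sketched as follows (`pretrained_flow` stands in for FlowFormer [17] and is not its real interface; the loss at the end is only an example of how such pseudo targets supervise the predicted flows):

```python
import torch
import torch.nn.functional as F

def pseudo_gt_flows(frame0, frame_t, frame1, pretrained_flow):
    # Run a frozen pretrained estimator to obtain flow targets from the intermediate frame.
    with torch.no_grad():
        gt_t0 = pretrained_flow(frame_t, frame0)   # pseudo-GT F_{t->0}
        gt_t1 = pretrained_flow(frame_t, frame1)   # pseudo-GT F_{t->1}
    return gt_t0, gt_t1

B, H, W = 1, 64, 64
pretrained_flow = lambda a, b: torch.randn(B, 2, H, W)                 # placeholder estimator
pred_t0, pred_t1 = torch.zeros(B, 2, H, W), torch.zeros(B, 2, H, W)    # dummy model predictions
gt_t0, gt_t1 = pseudo_gt_flows(torch.rand(B, 3, H, W), torch.rand(B, 3, H, W),
                               torch.rand(B, 3, H, W), pretrained_flow)
loss = F.l1_loss(pred_t0, gt_t0) + F.l1_loss(pred_t1, gt_t1)           # example reconstruction loss
```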
Q3: Table 1 allows for a comparison of some of the methods that model more complex motion than just modeling linear motion, such as QVI, ABME, BMBC.
A3: Thanks for your suggestion. We would like to clarify that our GIMM focuses on continuous motion modeling for arbitrary-timestep interpolation with 2 frames as input. While QVI takes 4 frames as input and ABME can only predict at the timestep of 0.5, BMBC is the only method that allows for continuous motion modeling and interpolation. Therefore, we extend Table 1 with experiments on BMBC. The results of BMBC's motion modeling and interpolation compared with GIMM are listed below.
| Method | VTF (PSNR) | VTF (EPE) | VSF (PSNR) | VSF (EPE) | SNU-FILM-arb-Hard |
|---|---|---|---|---|---|
| BMBC | 28.89 | 0.95 | 23.19 | 8.23 | 28.51 |
| GIMM (-VFI-R) | 37.56 | 0.34 | 30.45 | 2.68 | 32.62 |
Our GIMM outperforms BMBC concerning both motion modeling and interpolation. This further demonstrates the effectiveness of our method. We will add this to Table 1 (page 7), in our revised manuscript.
Q4: Table 3 should include parameter and runtime comparisons to demonstrate the effectiveness of the GIMM.
A4: Please kindly refer to the "Parameters and Runtime" section in the global response.
Q5: As GIMM is a plug-and-play motion modeling module, in Table 3 the authors should add GIMM to other existing flow-based VFI methods for a fairer comparison, e.g., ABME, TTVFI, RIFE, and IFRNet, because the designs of the optical flow estimation network and synthesis network differ across methods, and using a higher-performance optical flow estimation network also brings a significant performance improvement. In addition, validation on Vimeo90K and UCF101 is necessary.
A5-1: We would like to clarify that our method GIMM focuses on continuous motion modeling, which further enables frame interpolation at arbitrary timesteps. Arbitrary-timestep interpolation relies more on continuous modeling while fixed-timestep interpolation relies more on model fitting at the specific timestep of 0.5. Notably, plugging in a better continuous modeling module doesn’t guarantee better model fitting since model fitting requires more on the model’s learning strategies and its overall design. Therefore, in terms of continuous modeling, we evaluate the plug-in ability of our proposed GIMM on the arbitrary-timestep interpolation benchmark SNU-FILM-arb rather than on fixed-timestep interpolation benchmarks, such as Vimeo90K and UCF101. Since there is a time limit for the rebuttal, we plug the GIMM module into two representative existing flow-based VFI methods, TTVFI and IFRNet, for the experiment. In particular, we plug GIMM together with the pretrained flow estimator RAFT into IFRNet, since there is no existing flow estimator in that model. The PSNR results are listed below.
| Method | SNU-FILM-arb-4X | SNU-FILM-arb-8X | SNU-FILM-arb-16X |
|---|---|---|---|
| TTVFI | 34.48 | 30.39 | 26.24 |
| TTVFI+GIMM | 35.55 (+1.07dB) | 31.60 (+1.21dB) | 27.40 (+1.16dB) |
| IFRNet | 34.88 | 31.15 | 26.32 |
| IFRNet+GIMM | 36.46 (+1.58dB) | 32.20 (+1.05dB) | 27.73 (+1.41dB) |
Plugging in the GIMM module results in significant improvements for arbitrary-timestep interpolation. This demonstrates the effectiveness of GIMM for continuous modeling when integrated with existing flow-based VFI works. We will add this experiment to the Supplementary of our revised manuscript.
A5-2: To demonstrate the effectiveness of GIMM-VFI's overall design for interpolation, we further provide evaluations on Vimeo90K and UCF101. Following EMA-VFI [50], we train GIMM-VFI for fixed-timestep interpolation. For evaluation, we calculate PSNRs and compare our method GIMM-VFI with the aforementioned and other state-of-the-art methods. The results are listed below:
| Method | Vimeo90K (PSNR) | UCF101 (PSNR) |
|---|---|---|
| IFRNet | 36.20 | 35.42 |
| AMT | 36.53 | 35.45 |
| UPR-Net | 36.42 | 35.47 |
| ABME | 36.22 | 35.41 |
| TTVFI | 36.54 | 35.51 |
| EMA-VFI | 36.50 | 35.42 |
| CURE | 35.73 | 35.36 |
| GIMM-VFI-R | 36.54 | 35.51 |
| GIMM-VFI-F | 36.67 | 35.54 |
Both variants of GIMM-VFI, with different flow estimators, achieve competitive performance. This further demonstrates our method's strong interpolation ability.
Dear Reviewer rbyn,
We sincerely thank you for the review and comments. We have posted our response to your initial comments, which we believe has covered your concerns. We are looking forward to your feedback on whether our answers have addressed your concerns or if you have further questions.
Thank you!
Authors
Thanks to the authors for providing abundant experiments to address my concerns.
In addition, I read the negative comments from reviewer ZRiZ, and I think that although INR has been explored in other areas, it is very appropriate for modelling continuous motion in video. Motion in video is varied, such as uniform motion, acceleration, rotation, etc., and a more generalized Implicit Motion Modeling approach is worth accepting.
I raise my rating to ACCEPT and I am confident that the authors can improve the problems mentioned by all reviewers in the final version.
The paper proposes a video frame interpolation model that starts from optical flows and encodes them into a spatiotemporal motion latent. The motion prediction model GIMM takes the encoded initial motion latent and produces motions for arbitrary-timestep interpolation. Finally, the motions are used to predict bilateral flows, which are warped with the input frames to recover the interpolation result.
Strengths
- The paper proposes a novel flow estimation model, which can estimate bidirectional flows according to the input timestep and the input frames' flows. The model can be inserted into any flow-based VFI model.
- Quantitative and qualitative results show that the estimated flows are stable and sharp compared with the baseline results, providing better interpolation results.
Weaknesses
- GIMM starts from pretrained flows, which may limit the network performance. In fact, recent VFI networks still tend to produce blurry results due to inaccurate flow estimation in hard cases.
- More perceptual quantitative indicators need to be shown in the paper, e.g., FID or LPIPS.
Questions
- Is the flow normalization necessary? What are the potential temporal inconsistencies mentioned in line 146?
Limitations
Yes
Thank you for the constructive comments. Please find the following for our response.
Q1: The GIMM starts from pretrained flows, which may limit the network performance. In fact, recent VFI networks still tend to estimate blurry results according to inaccurate flow estimating while facing hard cases.
A1: Please kindly refer to the "Pretrained flow estimator" section in the global response.
Q2: More perceptual quantitative indicators need to be shown in the paper, eg. FID or LPIPS.
A2: Thank you for the suggestion. We add the perceptual metrics FID and LPIPS to Table 3 (page 8). We report the best results in boldface and the second best with Italic boldface. The results are listed below:
| Method | XTest-2K (LPIPS/FID) | XTest-4K (LPIPS/FID) | SNU-FILM-arb-4x (LPIPS/FID) | SNU-FILM-arb-8x (LPIPS/FID) | SNU-FILM-arb-16x (LPIPS/FID) |
|---|---|---|---|---|---|
| RIFE | 0.126/11.99 | 0.152/13.52 | 0.038/6.65 | 0.072/11.99 | 0.134/19.82 |
| IFRNet | 0.108/23.93 | 0.164/23.75 | 0.046/9.92 | 0.066/11.65 | 0.115/16.91 |
| M2M | 0.098/9.25 | 0.158/8.67 | 0.036/5.98 | 0.061/10.13 | 0.112/17.37 |
| AMT | 0.153/13.92 | 0.187/13.97 | 0.072/9.25 | 0.089/10.34 | 0.136/14.72 |
| UPR-Net | 0.104/10.75 | 0.154/9.45 | 0.033/6.09 | 0.064/9.93 | 0.111/16.76 |
| EMA-VFI | 0.097 /7.21 | 0.156/8.61 | 0.041/7.07 | 0.074/12.17 | 0.130/19.58 |
| CURE | 0.111/26.42 | OOM | 0.035/6.98 | 0.063/12.72 | 0.114/22.62 |
| GIMMVFI-R | 0.113/6.52 | 0.149/6.49 | 0.033/5.89 | 0.060/9.59 | 0.110/16.45 |
| GIMMVFI-F | 0.103/6.74 | 0.142/6.58 | 0.031/5.86 | 0.059/9.95 | 0.109/16.79 |
Our proposed method GIMM-VFI achieves competitive performance according to these perceptual quantitative indicators.
Q3: Is the flow normalization necessary?
A3: As described in the “Flow normalization” paragraph in Section 3.2 (page 4), the normalization process aligns the temporal direction and scales the values of the input flows. First, without temporal direction alignment of the inputs, it would be hard to define the direction of the output flow; the direction alignment is thus necessary. Second, for the scale operation, we experiment by skipping it and present results on VTF and VSF below:
| Method | Scale Operation | VTF(PSNR) | VTF(EPE) | VSF(PSNR) | VSF(EPE) |
|---|---|---|---|---|---|
| GIMM | No | 32.07 | 1.28 | 22.23 | 10.72 |
| GIMM | Yes | 37.56 | 0.34 | 30.45 | 2.68 |
Skipping the scale operation performs much worse than the full GIMM settings across both benchmarks. It is therefore necessary to perform the complete flow normalization operation.
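For reference, a loose sketch of the two steps (the exact scheme follows IFE [13]; the specific operations below, negating F_{1->0} and dividing by a shared per-sample maximum magnitude, are illustrative assumptions rather than the precise formulation):

```python
import torch

def normalize_flows(f01, f10, eps=1e-6):
    # 1) temporal direction alignment (assumed here as negating F_{1->0})
    f10_aligned = -f10
    # 2) value scaling by a shared per-sample magnitude (assumed scheme, kept so it can be reversed)
    scale = torch.cat([f01, f10_aligned], dim=1).abs().amax(dim=(1, 2, 3), keepdim=True)
    scale = scale.clamp(min=eps)
    return f01 / scale, f10_aligned / scale, scale

f01, f10 = torch.randn(1, 2, 64, 64), torch.randn(1, 2, 64, 64)
n01, n10, scale = normalize_flows(f01, f10)
```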
Q4: What are the potential temporal inconsistencies mentioned in line 146?
A4: The potential temporal inconsistencies refer to possible bias and noise in the estimated flows between the input frames. In order to mitigate their negative impact on motion modeling, we employ a Motion Encoder (ME) to extract motion features. The ablation experiment on the Motion Encoder can be found in the global response.
Dear Reviewer d4Ah,
We sincerely thank you for the review and comments. We have posted our response to your initial comments, which we believe has covered your concerns. We are looking forward to your feedback on whether our answers have addressed your concerns or if you have further questions.
Thank you!
Authors
This paper aims to solve the Video Frame Interpolation task. To improve the capability of effectively modeling spatiotemporal dynamics, the paper proposes a Generalizable Implicit Motion Modeling (GIMM) module to leverage the implicit neural fields to estimate the flow field at an arbitrary time step. GIMM takes the spatial coordinates, time, and the latent motion code as the input of an INR, which makes it generalizable to new scenes.
Strengths
- The idea of using an INR to achieve continuous-time video frame interpolation is interesting but needs to be well investigated.
- The performance of arbitrary-timestep interpolation is good.
Weaknesses
- Novelty: The idea of using a time-dependent generalizable INR is not new and has been explored in human motion synthesis [1]. [1] also uses the concatenation of spatial coordinates, time step, and motion latent as the input to an INR MLP. Moreover, the way of using an INR is very similar to CURE [41] in video interpolation. Can you elaborate in more detail on the difference between the two methods? It seems the performance improvement over CURE might be attributed to the pre-trained flow estimator.
[1]. Wei, Dong, et al. "NeRM: Learning Neural Representations for High-Framerate Human Motion Synthesis." ICLR 2024.
- Performance: In Table 1, GIMM (-VFI-R) only slightly outperforms a simple "Linear" approach on both Vimeo-Septuplet-Flow (VSF) and SNU-FILM-arb-Hard. Also, in Table 2, without spatial coordinates, the performance slightly drops. Do you even conduct ablation on the time variable? Maybe with only motion code as the input, the performance still remains similar?
- The method relies on the pre-trained flow estimators, which may lead to a suboptimal solution.
- Typo: "Latnet Refiner" in Figure 2 caption.
Questions
- The paper is easy to read.
- The paper concatenates the motion latent code and the coordinates as the input to achieve a generalizable INR. It seems the latent code dominates the input as it has a higher dimension. Will it make the INR less spatially sensitive?
- A follow-up question is why not use meta-learning approaches or INR modulation [6]?
Limitations
The paper discusses the limitations.
Thank you for the constructive comments. Please find the following for our response.
Q1: Novelty: The idea of using time-dependent generalizable INR is not new and has been explored in human motion synthesis [1]. [1] also uses the concatenation of spatial coordinates, time step and motion latent as the input for an INR MLP.
A1: We respectfully disagree that the idea of using a time-dependent generalizable INR for motion modeling in Video Frame Interpolation lacks novelty due to the presence of NeRM. Although NeRM and our proposed GIMM both leverage an INR conditioned on coordinates and a latent, they differ in the following main points. 1) GIMM performs general motion modeling in the context of video frame interpolation: unlike NeRM, which focuses on human motion, GIMM models any type of motion that may exist within the input frames. 2) While NeRM synthesizes sparse pose motion at each timestep, GIMM predicts dense motion in the form of optical flow for interpolation. Consequently, GIMM requires spatial coordinates as an additional input to its INR for better modeling, akin to image-based INRs [5]; in contrast, NeRM only takes temporal coordinates. We will include NeRM in our references to make the related work section more comprehensive, and we will add this discussion in the revised manuscript.
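To make the contrast concrete, below is a minimal sketch of a GIMM-style coordinate-conditioned INR; the layer widths, depth, and the way the latent is sampled are illustrative assumptions rather than the exact architecture:

```python
import torch
import torch.nn as nn

class CoordinateFlowINR(nn.Module):
    """Illustrative coordinate-based MLP: spatial coordinates (x, y), a
    timestep t, and a per-pixel motion latent are mapped to a dense 2D
    normalized flow vector. A NeRM-style model would instead condition
    only on the temporal coordinate."""
    def __init__(self, latent_dim: int = 32, hidden: int = 64):
        super().__init__()
        in_dim = 3 + latent_dim  # (x, y, t) + motion latent
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # flow components (u, v) at the queried pixel
        )

    def forward(self, coords: torch.Tensor, latent: torch.Tensor) -> torch.Tensor:
        # coords: [N, 3]; latent: [N, latent_dim], sampled from the refined
        # motion feature map at the same pixel locations.
        return self.mlp(torch.cat([coords, latent], dim=-1))
```

Querying every (x, y) location at a target timestep t then yields a dense flow map; the `latent_dim` argument corresponds to the dimension ablated in A7 below.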
Q2: Moreover, the way of using INR is very similar to CURE[41]... performance improvement over CURE might be attributed to the pre-trained flow estimator.
A2: As described in the “Generalizable INRs.” paragraph in Section 2 (page 3), CURE directly learns generalizable INRs from video, while our method leverages generalizable INRs for motion modeling to improve intermediate frame synthesis in flow-based VFI. Besides, we would like to clarify that although both CURE and our method leverage a pre-trained flow estimator, i.e., RAFT, CURE still performs linear motion modeling with a strong motion-overlap assumption, which leads to suboptimal results. As observed in Table 3, our GIMM-VFI-R consistently outperforms CURE across all benchmarks, achieving PSNR improvements exceeding 1 dB. Additionally, GIMM-VFI-R effectively handles interpolation at 4K resolution, where CURE encounters out-of-memory issues. Furthermore, CURE has a heavier architecture of 51.28M parameters (31.49M more than our GIMM-VFI-R) and requires longer inference time.
Q3: Performance: In Table 1, GIMM (-VFI-R) only slightly outperforms a simple "Linear" approach on both Vimeo-Septuplet-Flow (VSF) and SNU-FILM-arb-Hard.
A3: As observed in Table 1 (page 7), the PSNR of our method is 0.36 dB and 0.20 dB higher than that of the "Linear" approach on the two referred benchmarks, respectively. This improvement is significant, particularly considering that both methods utilize the same pretrained flow estimator, RAFT.
Q4: Also, in Table 2, without spatial coordinates, the performance slightly drops.
A4: As shown in Table 2 (page 8), removing spatial coordinates causes a 0.16 dB drop in PSNR on VTF and a 0.06 increase in End-Point Error on VSF. This demonstrates the crucial role of spatial coordinates.
Q5: Do you even conduct ablation on the time variable? Maybe with only motion code as the input, the performance still remains similar?
A5: As presented in Table 2 (page 8) and analyzed in the “Implicit modeling.” paragraph in Section 4.2 (page 7), directly inputting the motion latent code without any coordinates results in a 0.52 dB decrease in PSNR on VTF and a 0.13 increase in End-Point Error on VSF. This highlights the importance of implicit modeling in GIMM.
Q6: The method relies on the pre-trained flow estimators, which may lead to a suboptimal solution.
A6: Due to the word limit, please refer to the "Pretrained flow estimator" section in the global response.
Q7: The paper concatenates the motion latent code and the coordinates as the input to achieve generalizable INR. It seems the latent code dominates the input as it has a higher dimension. Will it make the INR less spatial-sensitive?
A7: Thank you for the suggestion. We reduce the dimension of the motion latent from 32 to 8; the results on VTF and VSF are listed below:
| Latent Dim. | VTF(PSNR) | VTF(EPE) | VSF(PSNR) | VSF(EPE) |
|---|---|---|---|---|
| 8 | 37.16 | 0.35 | 30.15 | 2.74 |
| 32 | 37.56 | 0.34 | 30.45 | 2.68 |
Reducing the latent dimension leads to worse performance, with reductions of 0.40 dB and 0.30 dB in PSNR on VTF and VSF, respectively. This indicates that a properly chosen higher dimension, e.g., 32, helps the generalizable INR achieve better performance rather than negatively affecting the implicit modeling.
Q8: A follow-up question is why not use meta-learning approaches or INR modulation [6]?
A8: As suggested, we implement the meta-learning approach [23] for modulating the weights of the INR in motion modeling. We compare its performance with our GIMM on VTF and VSF and report the number of parameters. The results are listed below:
| Method | VTF(PSNR) | VTF(EPE) | VSF(PSNR) | VSF(EPE) | Param. |
|---|---|---|---|---|---|
| Meta-learning approach [23] | 30.19 | 0.88 | 24.50 | 6.80 | 43.92M |
| GIMM | 37.56 | 0.34 | 30.45 | 2.68 | 0.25M |
The meta-learning approach performs much worse than GIMM while requiring more than 170× more parameters.
Q9: Typo: "Latnet Refiner" in Figure 2 caption.
A9: We will correct the typo in our revised manuscript.
Dear Reviewer LUi9,
We sincerely thank you for the review and comments. We have posted our response to your initial comments, which we believe has covered your concerns. We are looking forward to your feedback on whether our answers have addressed your concerns or if you have further questions.
Thank you!
Authors
Dear Reviewers,
We would like to thank all reviewers for providing constructive feedback that helped improve the paper. Due to the word limit, we provide explanations and experiments for concerns shared by multiple reviewers in the following.
- Ablation on Motion Encoder (reviewer d4Ah and ZRiZ)
We conducted experiments without the Motion Encoder for a direct comparison. The results of motion modeling are presented below.
| Method | with ME | VTF(PSNR) | VTF(EPE) | VSF(PSNR) | VSF(EPE) |
|---|---|---|---|---|---|
| GIMM | No | 37.05 | 0.42 | 30.26 | 2.85 |
| GIMM | Yes | 37.56 | 0.34 | 30.45 | 2.68 |
GIMM implemented with ME produces higher-quality flows, demonstrating that the Motion Encoder is indeed helpful for our motion modeling. We will add these results to Table 2 (page 8) of the revised manuscript. A rough sketch of the two compared settings is given below.
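The sketch uses a generic convolutional encoder standing in for ME; the real channel counts and depth differ:

```python
import torch
import torch.nn as nn

class TinyMotionEncoder(nn.Module):
    """Stand-in for the Motion Encoder: maps a normalized flow map
    [B, 2, H, W] to a motion feature map [B, C, H, W]."""
    def __init__(self, out_channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, out_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )

    def forward(self, norm_flow: torch.Tensor) -> torch.Tensor:
        return self.net(norm_flow)

# "with ME":    encoded features are forward-warped and refined before the INR.
# "without ME": the normalized flows themselves are warped and used as the
#               motion latent, leaving the biases/noise of the estimated flows
#               unfiltered, which matches the drop reported in the table above.
```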
- Evaluations on Vimeo90K and UCF101 (reviewer rbyn and ZRiZ)
We would like to clarify that our method focuses on continuous motion modeling, specifically for the arbitrary-timestep interpolation task. Therefore, we did not include evaluations on the fixed-timestep interpolation benchmarks Vimeo90K and UCF101 in our submission. However, our proposed method still achieves competitive performance on these benchmarks. We report the best results in boldface and the second best in italic boldface.
| Method | Vimeo90K (PSNR) | UCF101 (PSNR) |
|---|---|---|
| IFRNet | 36.20 | 35.42 |
| AMT | 36.53 | 35.45 |
| UPR-Net | 36.42 | 35.47 |
| ABME | 36.22 | 35.41 |
| TTVFI | ***36.54*** | ***35.51*** |
| EMA-VFI | 36.50 | 35.42 |
| CURE | 35.73 | 35.36 |
| GIMM-VFI-R | ***36.54*** | ***35.51*** |
| GIMM-VFI-F | **36.67** | **35.54** |
Both variants of GIMM-VFI, built on different flow estimators, achieve competitive performance, which further demonstrates our method's strong interpolation ability.
- Pretrained flow estimator (reviewer d4Ah and LUi9)
As shown in Table 1 (page 7) and discussed in the corresponding paragraph (line 225, page 7), many works [20,32] that leverage motion priors from RAFT [45] achieve better performance than methods without motion priors [50]. Notably, our GIMM outperforms existing methods [20,32] that utilize the same motion priors. This demonstrates that VFI methods can significantly benefit from pretrained flow estimators, and that our GIMM offers the most substantial improvements. Furthermore, in Table 3 (page 8, GIMM-VFI-F vs. GIMM-VFI-R), we show that integrating a better flow estimator (FlowFormer [17]) further enhances performance.
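As a concrete illustration of how such motion priors can be obtained, below is a sketch using torchvision's RAFT implementation; the preprocessing is simplified, and the actual checkpoint and resolutions used in the paper may differ:

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

# Bidirectional motion priors from an off-the-shelf RAFT. frame0/frame1 are
# assumed to be [B, 3, H, W] float tensors in [0, 1] with H, W divisible by 8.
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
preprocess = weights.transforms()

frame0 = torch.rand(1, 3, 256, 448)
frame1 = torch.rand(1, 3, 256, 448)
f0, f1 = preprocess(frame0, frame1)

with torch.no_grad():
    flow_0to1 = model(f0, f1)[-1]  # final refinement iteration, [B, 2, H, W]
    flow_1to0 = model(f1, f0)[-1]
```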
- Parameters and Runtime (reviewer rbyn and ZRiZ)
Following RIFE [19], we collect the released models of each paper and test them for 480P frame interpolation on the same NVIDIA V100 GPU. The parameters and runtime of each model are listed below:
| Method | Params. (M) | Runtime (s/f) |
|---|---|---|
| RIFE | 10.71 | 0.01 |
| IFRNet | 19.7 | 0.03 |
| M2M | 7.61 | 0.01 |
| AMT | 2.99 | 0.03 |
| UPR-Net | 1.65 | 0.04 |
| EMA-VFI | 65.66 | 0.08 |
| CURE | 51.28 | 0.98 |
| GIMMVFI-R | 19.79 | 0.25 |
| GIMMVFI-F | 30.59 | 0.29 |
Compared with INR-based interpolation methods, our proposed method performs the fastest interpolation and maintains a relatively light architecture. However, there is still a runtime gap between INR-based and non-INR-based interpolation methods, which we leave for future research. A rough sketch of the timing protocol is given below.
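The sketch below shows one way such per-frame runtimes can be measured; the warm-up count, iteration count, and the placeholder call signature are our assumptions, not the exact RIFE protocol:

```python
import time
import torch

@torch.no_grad()
def benchmark(model, frame0, frame1, t=0.5, warmup=10, iters=100):
    """Average seconds per interpolated frame on a CUDA device.
    `model(frame0, frame1, t)` is a placeholder interface; the compared
    methods expose different call signatures."""
    for _ in range(warmup):        # warm up kernels and the allocator
        model(frame0, frame1, t)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(frame0, frame1, t)
    torch.cuda.synchronize()       # wait for all GPU work before stopping the clock
    return (time.time() - start) / iters

# Example for 480P inputs:
# f0 = torch.rand(1, 3, 480, 854, device="cuda"); f1 = torch.rand_like(f0)
# print(benchmark(model.cuda().eval(), f0, f1))
```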
- Normalization (reviewer d4Ah and ZRiZ)
As described in the “Flow normalization” paragraph in Section 3.2 (page 4), the normalization process aligns the temporal direction and scales the values of the input flows. First, without temporal direction alignment of the inputs, it would be hard to define the direction of the output flow; the direction alignment is thus necessary. Second, for the scale operation, we experiment by skipping it and present the results on VTF and VSF below:
| Method | Scale Operation | VTF(PSNR) | VTF(EPE) | VSF(PSNR) | VSF(EPE) |
|---|---|---|---|---|---|
| GIMM | No | 32.07 | 1.28 | 22.23 | 10.72 |
| GIMM | Yes | 37.56 | 0.34 | 30.45 | 2.68 |
Skipping the scale operation performs much worse than the full GIMM setting on both benchmarks; the complete flow normalization is therefore necessary.
Although normalization is important, the normalization process follows IFE [13]. We agree with reviewer ZRiZ and will remove it from the list of key designs in our revised manuscript.
The paper presents a method for continuous motion modeling for video frame interpolation using implicit neural representations. The role of the pretrained flow estimator is questioned by d4Ah and is sufficiently addressed by the rebuttal. Requested perceptual metrics are evaluated in the rebuttal and should be included in the main paper with further comment or analysis. LUi9 requests a comparative discussion with respect to NeRM and CURE, which is provided in the rebuttal. LUi9 also points out the small PSNR gains of GIMM-VFI-R over the linear method, which the rebuttal argues are significant. All the clarifications sought by rbyn are sufficiently addressed by the rebuttal. ZRiZ questions the use of INRs and the necessity of spatial coordinates, whose utility and correctness for continuous motion modeling are clarified by the authors over extensive post-rebuttal discussions. The use of an INR in favor of a U-Net is also empirically justified in the rebuttal, and additional results are included on other benchmarks. On balance, while the quality improvements for VFI appear small in the supplementary videos, there is technical novelty in the implementation of an INR for continuous motion modeling, and the quantitative results show positive trends for future work, so the paper is recommended for acceptance at NeurIPS.