Doubly Hierarchical Geometric Representations for Strand-based Human Hairstyle Generation
Hierarchical representation and generative model architecture for 3D strand-based human hair geometry with flexible strand count and density, using DCT frequency decomposition and optimal guide sampling.
Abstract
Reviews and Discussion
This paper introduces a method for generating realistic strand hair geometry using a frequency-decomposed representation. The approach constructs a hierarchical generative model for hair strands, leveraging discrete cosine transform (DCT) and k-medoids clustering to create coarse guide curves that effectively distinguish fundamental hair shapes from complex curliness and noise. Additionally, it employs a permutation-equivariant architecture (PVCNN) to support flexible guide curve modeling and facilitate the hierarchical generation of strands in low and high frequencies, transitioning from sparse to densely populated.
Strengths
- The paper proposes a hierarchical approach to hair strand generation, which starts from coarse guide strands and progresses to densely populated strands, ensuring detailed and realistic hair geometry.
- Utilizing DCT for frequency decomposition to separate low-frequency structural curves from high-frequency details is innovative, particularly as it avoids the Gibbs' oscillation issues associated with the standard Fourier transform.
- The use of k-medoids clustering for extracting representative guide curves ensures better retention of hairstyle characteristics compared to traditional UV grid sampling methods.
- The paper proposes a permutation-equivariant architecture for the VAE, allowing flexible modeling of guide strand geometry without being restricted to a fixed grid, enabling the generation of dense strands in any quantity and density.
Weaknesses
- Minor Technical Contributions:
- The primary contribution lies in the utilization of DCT for hair curves and k-medoids clustering to extract guide hair strands. However, as stated in the paper, "DCT is widely used in image, video, and audio signal compression." The novelty of using DCT for hair curves has not been adequately highlighted.
- The performance improvement of k-medoids clustering is not clearly demonstrated. This method should be compared with other clustering techniques (e.g., k-means) instead of grid-sampling, as shown in Fig. 3.
- Insufficient Comparisons with Previous Work:
- The paper lacks sufficient comparisons with previous relevant strand-based hair modeling methods, such as Wang et al. (2009) "Example-based hair geometry synthesis."
- Unfair Comparisons with Other Methods:
- In Fig. 7, the results of the nearest-neighbor upsample are confusing. There should be an explanation of how the nearest-neighbor upsample is performed, at least in the appendix, to ensure the comparison is fair and transparent.
Questions
- For better validation of the method, it would be beneficial to include comparisons with other state-of-the-art clustering techniques and hair modeling methods.
- Further highlighting the novelty and advantages of using DCT specifically for hair curves could strengthen the paper's contributions.
- Ensuring fair and transparent comparisons with detailed explanations of all methods used in the evaluations would improve the credibility and reliability of the results presented.
Limitations
NA
Thank you for your constructive feedback and suggestions, and your appreciation of our innovations.
[W1] Technical contributions
Please refer to the global rebuttal for a detailed discussion of our contributions. We will specify our innovations in the related work, especially which components of our method are new relative to existing strand hair modelling methods, to better show our contributions.
For some of your specific concerns:
[W1.1] novelty of DCT for hair curves not adequately highlighted
Thanks for the suggestion and for acknowledging that utilizing DCT here is "innovative". We will specify in the contributions and related work that we perform frequency decomposition on strand curves and introduce DCT to hair modelling, both for the first time.
[W1.2] k-medoids and comparison with other clustering techniques
First, we clarify that we do not perform clustering; instead, we sample guide strands (see the discussion of Wang et al. (2009) in W2 for more details). It happens that the optimal solution to guide strand sampling is given by the final medoid set of k-medoids clustering. Discovering this previously unknown property of this less famous clustering method in our application, with a proof (Theorem 1), is our theoretical contribution. We compare our solution with grid sampling because the latter is widely applied in recent hair modeling methods, especially the direct competitor HAAR.
k-means itself does not suffice as a sampling method. Unlike k-medoids, the cluster centers of k-means are not necessarily drawn from the original data (existing strands). This can cause problems such as strand roots not lying on the head scalp. To make the k-means centers valid hair strands, we modify the solution with an additional projection step that projects the root of each sampled k-means center to the nearest point on the head scalp, so that it matches a valid strand. We show the results in Table D in the rebuttal PDF as an extension of Table 1 in the manuscript. Since k-means and k-medoids are similar in methodology, the performance gap is not large, while the k-means sampling may lose some accuracy due to the projection step. However, as we showed in theory, k-medoids is guaranteed to be the optimal solution.
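To illustrate the difference concretely, here is a toy k-medoids sketch (ours, purely illustrative, not the paper's implementation; strands are abstracted as flattened coordinate vectors and distances are plain Euclidean rather than a strand-level chamfer metric). The returned guides are always existing strands, so no root projection step is needed, unlike with k-means centroids:

```python
import numpy as np

def k_medoids(X, k, n_iter=50, seed=0):
    """Toy alternating k-medoids (Voronoi iteration) on the rows of X.

    Returns indices of the chosen medoids, which are always actual data
    points -- unlike k-means centroids, which are averages and need not
    correspond to any existing strand.
    """
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # New medoid: the member minimizing total distance to its cluster.
            intra = np.linalg.norm(
                X[members][:, None, :] - X[members][None, :, :], axis=-1
            ).sum(axis=1)
            new_idx[j] = members[intra.argmin()]
        if np.array_equal(new_idx, medoid_idx):
            break
        medoid_idx = new_idx
    return medoid_idx

# Toy "strands": each row is a flattened polyline (10 points x 3 coords).
rng = np.random.default_rng(1)
strands = rng.normal(size=(200, 30))
idx = k_medoids(strands, k=8)
guides = strands[idx]  # every guide is an existing strand, root included
```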
[W2] Comparison with existing works
Please refer to the global rebuttal for evaluation and comparison with the SOTA method HAAR.
Thank you for introducing the relevant work of Wang et al. (2009). We will add a discussion in the related work. However, Wang et al. (2009) is not a generative method and its task setup is different, so we are not able to compare with it directly. Also, as an early method, it has no publicly available code or synthesized results. Wang et al. (2009) is a classic method based on Kwatra et al. (2005) that synthesizes texture variations from a given example image. So Wang et al. (2009) also requires an exemplar strand hairstyle (or a combination of two, if they are compatible) for the global shape, and synthesizes only some local details.
For the hair representation, Wang et al. (2009) use PCA to encode each strand, a classic approach, while recent neural representations are more advanced. Their coarse-to-dense strand hierarchy is modeled by clustering, which suits the spiky hairstyle in their illustration. But in more general cases, the densification of hair strands should involve more than one sparse guide strand. We believe that the relationship between dense strands and sparse guide strands should be modeled by sampling and interpolation rather than clustering; that is, each dense strand is affected not only by a single guide strand from its own cluster, but also by a collection of guide strands in its neighborhood. Our method is conceptually different from (and most likely more advanced than) Wang et al. (2009). Nevertheless, adapting the representation from Wang et al. (2009) to modern neural networks is non-trivial and worth exploring.
For these reasons, we believe that HAAR is a more recent and directly related SOTA method for us to compare with. But we will discuss Wang et al. (2009) in the related work section.
[W3] Details of nearest neighbors upsampling for fair and transparent comparison
For both our learning-based upsampling and nearest neighbor upsampling, we first need to sample a collection of root points for the dense strands on the head scalp UV map. In the usual pipeline of VAE generation and reconstruction, this is achieved by sampling from the learned density map. Specifically in the experimental evaluation here, we use oracle roots from the ground truth to ensure fair and direct comparisons.
For nearest neighbor upsampling, for each sampled root of a dense strand, we identify the guide strand whose root is nearest to the sampled root. Then we make the dense strand at this root identical to that nearest guide strand, i.e., we copy the guide strand and translate it to the sampled root.
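The procedure above can be sketched as follows (our illustrative code, not the evaluation implementation; roots are taken as 3D points and each strand's first point is its root, which are simplifying assumptions):

```python
import numpy as np

def nn_upsample(guide_strands, dense_roots):
    """Copy-and-translate nearest-neighbor upsampling.

    guide_strands: (G, P, 3) polylines with guide_strands[:, 0] as roots.
    dense_roots:   (D, 3) sampled root positions for the dense strands.
    """
    guide_roots = guide_strands[:, 0, :]                          # (G, 3)
    d = np.linalg.norm(dense_roots[:, None] - guide_roots[None], axis=-1)
    nearest = d.argmin(axis=1)                                    # (D,)
    copied = guide_strands[nearest]                               # (D, P, 3)
    # Translate each copied strand so its root sits at the sampled root.
    offset = dense_roots - guide_roots[nearest]                   # (D, 3)
    return copied + offset[:, None, :]

# Toy demo: 3 guide strands of 5 points each, 4 sampled dense roots.
rng = np.random.default_rng(0)
guides = np.cumsum(rng.normal(size=(3, 5, 3)), axis=1)
roots = rng.normal(size=(4, 3))
dense = nn_upsample(guides, roots)
```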
Thanks for your suggestion. We will add the clarification to the appendix.
- Wang et al. (2009) Example-Based Hair Geometry Synthesis.
- Kwatra et al. (2005) Texture optimization for example-based synthesis.
Thanks for the rebuttal and clarification. I will raise my score a bit.
Dear Reviewer Sftt,
We appreciate your insightful comments and suggestions, which helped to enhance our paper for better clarity and evaluation. Thank you very much for recognizing our contribution and raising your score.
Best regards,
Submission 11742 Authors
The authors propose reformulating hair strand generation by considering it in the frequency domain (DCT). This approach allows for decoupling high and low-frequency strands via frequency thresholding (low-pass filtering). The main idea is that we can first generate a set of sparse, low-frequency strands that can then guide the generation of dense, high-frequency details.
The pipeline begins with k-medoids clustering on a dense, low-pass filtered set of hair strands to obtain a sparse set of guide strands. They then train a VAE (Variational Autoencoder) using a dual-branch hybrid point-voxel architecture (PVCNN with an additional decoder). This VAE generates sparse guide curves, which are then used to generate the remaining dense strands (still low frequency) via a densification network. An additional high-frequency model is then used to add a variety of high-frequency details to the low-frequency dense strands generated by the VAE and densification networks.
Strengths
- Separating low and high frequencies is an interesting approach. It makes sense that artists work this way, focusing on low-frequency details first to outline the direction of the curve and then adding high-frequency details.
- Good motivation for using DCT instead of DFT
- The adaptation of PVCNN for hair strands is quite clever and makes sense. Several modifications are made to fit the hair strand problem ('voxelizing' based on root points, which reduces the space to 2D).
- The guide curve encoder, decoder and the densification process are all trained jointly, which helps unify the model toward generating accurate hair strands.
- One benefit of separating low and high frequency is the ability to generate high frequency details for each hairstyle.
Weaknesses
- While separating low and high frequencies makes sense for artists, I'm still not entirely convinced that an automated pipeline must follow the same process to achieve the best results. There are benefits to separating them, such as being able to generate a variety of high-frequency details from the same low-frequency strands, but diversity can also be accomplished as a whole. I think a simple experiment that varies the frequency threshold for low-frequency strands (eventually removing the low-pass filter entirely and using all strands for clustering) might help, so we can see the benefit of separating low and high frequencies.
- As with any PointNet-based model, the runtime can be higher than that of a purely convolutional one, as mentioned in the limitations in the appendix of the paper.
Questions
- why the use of depth-to-space upsampling instead of deconv?
- I'm a bit confused about the density map for sampling root positions, which is necessary for decoding back into the proper position. I would appreciate some additional clarification on how this works
Others:
- Line 203, (T)o enhance …
Limitations
Yes, the authors clearly outline the limitations in the appendix. This includes runtime (compared to pure convolution) and examples of failure cases.
Thank you for your valuable feedback and appreciation of our method and innovations.
[W1] On the effectiveness of DCT frequency decomposition, and separation of low- and high-frequency with varying threshold
We appreciate the constructive suggestion.
The design choice to learn first low- and then high-frequency components adheres to the well-known principle of spectral bias [28] for the learning and generalization of neural models. This principle has also been adopted by some implicit neural representation models with frequency band control during training / optimization, e.g., BACON [A] and SAPE [B]. Another advantage specific to our per-strand representation is that, as we mentioned in Sec 3.1, we downsample the low-pass filtered curve to a resolution of two times the frequency threshold, which, from the Nyquist sampling theorem and empirically from the quantitative evaluation, is accurate enough to represent the low-pass filtered curve. So the low-pass filtering helps with data compression and computational efficiency.
For the reasons above, we expect frequency decomposition to help the representation quality of the neural network model. We show additional ablation experiments with varying frequency thresholds for reconstructing straight and curly hair strands with both low- and high-frequency details in Table C, with hair strands reconstructed by aggregating results from both our low- and high-frequency models. We observe that a frequency threshold in the range from 8 to 12 is optimal, and empirically we use 8, which is more efficient. When the frequency threshold is too low, the low-pass filtered signal does not capture enough information about the principal growing direction. And when the frequency threshold is too high, high-frequency structure cannot be encoded efficiently by DCT coefficients, and the increased computation cost hinders optimization. Our representation makes use of both the spatial and spectral domains with the correct setup of the frequency threshold.
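As a minimal sketch of the strand-level DCT low-pass filtering described above (illustrative only, not our actual pipeline; the function name, toy threshold, and strand resolution are our own choices, and for simplicity we reconstruct at full resolution rather than downsampling to twice the threshold):

```python
import numpy as np
from scipy.fft import dct, idct

def dct_lowpass(strand, f_thresh):
    """strand: (P, 3) polyline. Zero out all but the first f_thresh DCT-II
    coefficients per coordinate and reconstruct, removing curliness/noise
    while keeping the principal growing direction."""
    coeffs = dct(strand, axis=0, norm='ortho')
    coeffs[f_thresh:] = 0.0
    return idct(coeffs, axis=0, norm='ortho')

# Toy strand: a smooth arc plus a high-frequency wiggle.
t = np.linspace(0.0, 1.0, 64)
smooth = np.stack([t, t**2, np.zeros_like(t)], axis=1)
strand = smooth + 0.2 * np.sin(40 * np.pi * t)[:, None]
low = dct_lowpass(strand, f_thresh=8)
assert low.shape == strand.shape
# The filtered curve is much closer to the smooth arc than the input is.
assert np.abs(low - smooth).max() < 0.5 * np.abs(strand - smooth).max()
```

Because DCT implicitly assumes an even extension of the open curve, the truncated reconstruction stays close to the smooth component without the oscillation a DFT-based low-pass would introduce at the endpoints.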
We also clarify that the separation of low- and high-frequency components does not affect clustering (and in fact we do not perform clustering; instead, we sample guide strands. It happens that the optimal solution to guide strand sampling is given by the final medoid set of k-medoids clustering). Using low-frequency strands for sampling is just an implementation choice made for efficiency, because the low-frequency strands can be resampled to a lower resolution of control points according to the Nyquist sampling theorem, which is more efficient to process. And we think the principal growing direction from the low-pass filtered strand is representative enough for sampling the guide strands.
[Q1] Depth-to-space upsampling instead of deconv
In 2D CNNs for image reconstruction and generation, deconvs have well-known drawbacks of checkerboard artifacts [D] which can be resolved by depth-to-space upsampling [E] without losing computation efficiency. So we use depth-to-space upsampling by default.
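A minimal sketch of the depth-to-space alternative (illustrative only; the channel sizes and upscale factor are arbitrary, not our model's configuration): a stride-1 convolution expands the channel dimension by r², and PixelShuffle rearranges those channels into an r-times larger spatial grid, avoiding the checkerboard artifacts of stride-2 deconvolutions.

```python
import torch
import torch.nn as nn

r = 2  # upscale factor
up = nn.Sequential(
    # Expand channels by r*r with a stride-1 conv (no overlap artifacts)...
    nn.Conv2d(64, 32 * r * r, kernel_size=3, padding=1),
    # ...then rearrange channel blocks into an r-times larger spatial grid.
    nn.PixelShuffle(r),
)
x = torch.randn(1, 64, 8, 8)
y = up(x)
assert y.shape == (1, 32, 16, 16)
```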
[Q2] Density map
At the end of the conv branch in the PVCNN decoder, we output a density map matching the spatial resolution of the conv feature map. The density map is trained towards the ground-truth probability of a root point falling in each grid cell, for each training example. In training, we use oracle root points from the encoder and optimize the density map and strands simultaneously. During inference, when the oracle root points from the encoder are unknown, we can sample root points from the probability map before generating strand details. Each sampled root point is assigned a random 2D UV coordinate within the small square grid cell it comes from.
In practice, we output two density maps from the conv branch decoder, one for guide and one for dense strands, as mentioned in the appendix (the loss function part). At the implementation level, the sampling process can easily be implemented with the torch.multinomial() function.
Although the density map sampling does not guarantee exactly the original root positions, empirically, with a reasonable density map resolution, we find that the resulting root positions correctly resemble the distribution of root points, and the reconstruction of the whole hairstyle is still significantly advantageous over grid-based baselines. Our evaluations are all based on set chamfer measurements and thus do not require strand-to-strand correspondence.
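The sampling step above can be sketched as follows (our illustrative code with a toy map resolution, not the trained model's output; the function name and shapes are hypothetical): draw grid cells with probability proportional to the density map via torch.multinomial(), then jitter uniformly inside each cell to obtain a UV coordinate.

```python
import torch

def sample_roots(density_map, n_roots, generator=None):
    """density_map: (H, W) non-negative scores over the scalp UV grid.
    Returns (n_roots, 2) UV coordinates in [0, 1)^2."""
    h, w = density_map.shape
    probs = density_map.flatten()
    probs = probs / probs.sum()
    # Sample grid cells proportionally to the density map.
    cells = torch.multinomial(probs, n_roots, replacement=True,
                              generator=generator)
    rows, cols = cells // w, cells % w
    # Assign a random UV coordinate within each sampled cell.
    jitter = torch.rand(n_roots, 2, generator=generator)
    u = (cols.float() + jitter[:, 0]) / w
    v = (rows.float() + jitter[:, 1]) / h
    return torch.stack([u, v], dim=1)

dm = torch.zeros(16, 16)
dm[4:12, 4:12] = 1.0          # nonzero density only in the central region
uv = sample_roots(dm, 100)
assert uv.shape == (100, 2)
# All samples fall inside the nonzero region of the density map.
assert (uv[:, 0] >= 4 / 16).all() and (uv[:, 0] < 12 / 16).all()
```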
- [A] Lindell et al. (2022) BACON: Band-limited Coordinate Networks for Multiscale Scene Representation
- [B] Hertz et al. (2021) SAPE: Spatially-Adaptive Progressive Encoding for Neural Optimization
- [C] Shen et al. (2023) CT2Hair: High-Fidelity 3D Hair Modeling Using Computed Tomography
- [D] Odena et al. (2016) Deconvolution and Checkerboard Artifacts
- [E] Wojna et al. (2017) The Devil Is in the Decoder: Classification, Regression and GANs
Thank you for the detailed response.
The thresholding frequency experiment shows that separating frequencies is beneficial, as the consistent performance drop at both low and high thresholds indicates that this separation is indeed useful. It's also nice that the authors compared with recent work like HAAR and achieved better (though somewhat marginal) results in the user study.
However, while these aspects do strengthen certain contributions, I understand the other reviewers' concern that the overall approach could be seen as a series of design choices rather than a substantial contribution. But I'd argue that the combination of these elements does create a novel pipeline that hasn't been explored before. Whether this contribution is significant enough for NeurIPS is debatable (it might align more with a computer graphics/vision conference). Regardless, I still believe it offers value.
After further consideration, I still stand by my original rating. I think the rebuttal is compelling enough for me.
Dear Reviewer vBP9,
Thank you very much for your positive feedback, for sharing your valuable insights with your expertise, and for recognizing our contributions of "a novel pipeline that hasn't been explored before".
Regarding the performance gap in the user study vs. HAAR, it is hard to quantify the gap by averaging subjective scores between 1 and 10 and conclude a "marginal" improvement, since many users provide conservative mid-range scores when feeling uncertain. We have observed a clear advantage in the quality of our results over HAAR, as evidenced by our data and experience, with notably fewer instances of failure or unnatural examples. This improvement is attributed to our more sophisticated and flexible hair representation and learning model design. We have highlighted some of HAAR's common issues in Figure B of our rebuttal PDF. Due to space constraints, we will include more qualitative comparisons in the revised manuscript.
We believe that our work will contribute value to the NeurIPS community. Our research addresses a generative learning problem and engages with several prominent ML topics, including geometric deep learning, learning on sets, implicit neural representations, and graph convolution. We hope that our approach may inspire further exploration in these sub-communities, e.g., regarding hierarchical abstraction methods for learning structured non-Euclidean data representations. Thus, we believe that NeurIPS is an appropriate venue to share our approach and exchange thoughts, which could potentially inspire other emerging applications.
Thank you once again for your dedicated review, which helped in guiding improvements to our work, especially with regard to the suggested ablation experiment.
Best regards,
Submission 11742 Authors
This paper proposes a system to generate hair geometries in a coarse-to-fine manner via a VAE. The paper demonstrates the effectiveness of the proposed method against some simple baselines such as grid-based methods. The paper is relatively easy to read, but it presents limited comparisons with existing SOTA methods. The paper also has very limited discussion of related works, which makes its positioning unclear.
Strengths
The proposed method combines simple techniques such as DCT, k-medoids, and VAE, which could potentially be a plus, as these methods are well studied and can be improved further. The results seem to suggest the effectiveness of the proposed components. The coarse-to-fine generation method has the potential to generate fine details with reasonable computational efficiency.
Weaknesses
It concerns me a little whether the paper's technical contribution is sufficient. Most of the introduced techniques, such as applying k-medoids, DCT, and/or VAE, are well established. It's also not clear from the related work section that these techniques are new to hair generation applications. I fail to see a comparison with SOTA methods such as HAAR [41] in the evaluation, which doesn't help with this concern. The evaluation metrics for the ablation seem to focus on reconstruction, while the paper seems to claim generation as the main task.
Questions
L145 - it's a bit unclear to me what this theorem 1 entails in the context of hair generation. Would be nice to clarify.
Limitations
The authors include a discussion of limitations in Section E. An additional limitation is the concern about the generative quality of VAEs.
Thank you for your valuable and constructive feedback.
[S1 and W1] Contributions
Please refer to our global rebuttal.
Although DCT and k-medoids are well-established methods, they have never been used in prior work on hair modelling, because employing DCT and k-medoids for hair modelling is non-trivial and not straightforward. Our novel hierarchical hair representation proposes strand-level frequency decomposition and hairstyle-level optimal representative-subset guide sampling, which enable the use of these techniques. The choices of DCT and k-medoids are well motivated against the more popular variants DFT and k-means. In particular, discovering that k-medoids yields the optimal way of sampling guide strands, and proving it theoretically (Theorem 1), is our theoretical contribution.
We use a VAE as the generative model for our hierarchical strand hairstyle generation, which was also used to learn a strand codec [32]. We do not claim any contribution on novel generative models there.
For technical contributions, apart from the more foundational methodology and theory on the abstraction of the hair strand data hierarchy (Sec 2), we would also like to highlight several novel innovations with advantages in the corresponding neural model design (Sec 3) associated with the hierarchical hair parameterization. Please refer to our global rebuttal.
[W2] Relationship to existing works
Please refer to our global rebuttal. We will specify our innovations in the related work, especially which components of our method are new relative to existing strand hair modelling methods, to better show our contributions.
[W3] Evaluation against HAAR
Please refer to our global rebuttal.
[W4] Absence of evaluation metrics for strand hair generation quality
Indeed, quantitative evaluation of hair generation is hard, because currently there is no metric to evaluate the generation quality of strand hair. Even the existing SOTA work HAAR did not evaluate generation quality.
Evaluation of image generation can use domain-specific PSNR and SSIM measures, as well as FID and LPIPS, which require a pretrained semantic encoder (e.g., VGGNet). Unfortunately, strand hair has neither such domain-specific measures nor a VGGNet-like semantic encoder for strands to enable FID and LPIPS measurements.
We will add the lack of evaluation metrics as a limitation of the whole field of generative hair modelling in the discussion. A possible future direction could be to train a semantic encoder for strand hairstyles with self-supervised learning to enable FID and LPIPS measures, though this requires a large amount of high-quality data. We expect the most promising direction for this would be to apply CT2Hair [A] at large scale for human hair capture and reconstruction, which requires specialized equipment.
Instead, in the VAE generative framework, we quantitatively evaluate the VAE reconstruction as a way to show the quality and advantage of our hierarchical hair representation. The generation quality is shown by qualitative examples and a well-established human assessment study against the SOTA method HAAR in the rebuttal.
[Q1] Theorem 1
Theorem 1 implies that, if you want to sample a number of guide strands from the original dense hair strands, then theoretically the optimal way is to perform k-medoids clustering on the dense strands and take the resulting set of medoids as guide strands. This resulting set of guide strands (called the representative guide curve set, as in Definition 1) has the smallest possible chamfer distance from the original dense strand set among all ways of sampling the same number of strands. (Note that we do not perform clustering; we just use the resulting medoids as sampled guide strands. It happens that the optimal solution to guide strand sampling is given by the final medoid set of k-medoids clustering.)
In this way the extracted representative guide curves ensure the best possible retention of hairstyle characteristics for the modelling of hierarchical hair representation as used for training the neural generative model.
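The key step of the theorem — that for a subset G of S the G-to-S chamfer direction is zero, so the chamfer objective reduces to the k-medoids objective — can be checked empirically by brute force on a toy point set (our illustrative sketch; strands are abstracted as 2D points, and the exhaustive subset search is only feasible at this tiny scale):

```python
import itertools
import numpy as np

def chamfer(A, B):
    """Bidirectional chamfer distance between point sets A and B."""
    d = np.linalg.norm(A[:, None] - B[None, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

rng = np.random.default_rng(0)
S = rng.normal(size=(8, 2))   # toy "strands"
k = 3

# Subset minimizing the full bidirectional chamfer distance to S.
best = min(itertools.combinations(range(len(S)), k),
           key=lambda idx: chamfer(S[list(idx)], S))

# Since G is a subset of S, the G->S direction is exactly zero, so the
# objective reduces to the k-medoids-style nearest-medoid cost.
obj = lambda idx: np.linalg.norm(
    S[:, None] - S[list(idx)][None], axis=-1).min(axis=1).mean()
best_medoid = min(itertools.combinations(range(len(S)), k), key=obj)

assert set(best) == set(best_medoid)  # the two optima coincide
```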
- [A] Shen et al. (2023) CT2Hair: High-Fidelity 3D Hair Modeling Using Computed Tomography.
The paper presents a representation for learning a generative model of hair strains. The suggested representation is hierarchical, going from low-frequency to high-frequency details. In turn, the suggested representation is incorporated into a VAE architecture. The method is evaluated on a dataset of synthetic strand hairstyles.
Strengths
Both quantitative and qualitative results are provided.
I appreciate the effort put into addressing the challenging task of hair strand generative modeling.
Weaknesses
Presentation quality. The paper is difficult to follow. For instance, Section 3 should make a clearer distinction between implementation details and method details. Another example is Figure 4, which is challenging to interpret. The proof details are hard to follow as well.
Contribution. The paper primarily presents itself as a collection of design choices, such as DCT, clustering, VAE, and PVCNN. For example, the contribution list includes "utilize the discrete cosine transform" as a contribution, and clustering is also claimed as a contribution. It is challenging to classify these specific choices as contributions. Instead, a more detailed discussion, perhaps extending from the concrete focus of hair modeling to broader ML topics, would have been more appreciated.
Evaluation The method is evaluated solely on a single dataset consisting of synthetic data. It is anticipated that this method should be applied to real data or in other settings for a more comprehensive evaluation.
Questions
I would appreciate any response regarding the weakness stated above.
Limitations
yes
Thank you for your valuable feedback.
[W1] Presentation quality and clarification
We will improve the proof details for better readability. For Sec 3 and Fig 4, we reviewed them and think the information is technically precise and clear. However, we understand that, since the whole field of generative hair modelling is very new, it may require some effort and time for readers to understand our methodology. So we make the following clarifications.
[W1.1] Sec 3 distinction between implementation/method details
We clarify that Sec 3.1 is the details of the parameterization of hair data for the neural model to process on, based on methods in Sec 2. These parameterization setup details are crucial to understand the following neural model design.
All the information in Sec 3.2 is the necessary method details to understand our method and motivation of neural model design based on the proposed hierarchical hair representation, while all the implementation details of the model are delayed to appendix C.1.
[W1.2] Fig 4
Each of Fig 4 (b)-(d) corresponds to a paragraph in Sec 3.2, which can be recognized by the subfigure and paragraph titles. We will add clearer pointers to connect them. In more detail:
- Fig 4(a) illustrates the setup of guide / non-guide strands.
- Fig 4(b) The guide strand model. The main architecture illustration follows the PVCNN convention with conv and pointnet branches, plus components of 1D strand encoder and decoder and the VAE reparameterization to adapt to the generative hair modelling task.
- Fig 4(c) The densification model, combining bilinearly interpolated features (above) and graph features (below) for decoding the dense hair strands. The graph features aggregate information from neighboring guide strands to sampled query locations on the scalp, as inspired by implicit neural representations (see the text in L224-233 for a detailed explanation), which allows modelling an arbitrary number of dense strands at any density, with end-to-end joint training together with the guides (b).
- Fig 4(d) Adding high frequency, which is similar to the architecture in (b).
[W1.3] Proof of Theorem 1
We update the proof with more explanations for better readability.
Proof: Assume that from the $k$-medoids algorithm, we obtain the set of medoids $G = \{g_1, \dots, g_k\}$, with each $g_j$ drawn from the set of dense hair strands $S$. Then, from Eq. (2), $G$ achieves the minimum sum of cluster element-to-medoid distances $\sum_{j=1}^{k} \sum_{s \in C_j} d(s, g_j)$.
Next, from the algorithm implementation, each element in $S$ is closer (or equally close) to the medoid of its own cluster than to that of any other cluster, so $G$ is the subset of $S$ with cardinality $k$ that achieves the minimum of $\sum_{s \in S} \min_{g \in G} d(s, g)$, which is the sum over each dense strand $s$ of its distance to its nearest medoid. After taking the average (dividing by the constant $|S|$), this is in the form of a unidirectional chamfer distance from $S$ to $G$. So $G$ achieves the minimum unidirectional chamfer distance from $S$, over all possible subsets of $S$ with cardinality $k$.
Then we show that in the reverse direction, the unidirectional chamfer distance from $G$ to $S$, $\frac{1}{|G|} \sum_{g \in G} \min_{s \in S} d(g, s)$, is constantly 0. This is easy to infer because $G$ is a subset of $S$, and each $g \in G$ can find the same element in $S$ that is closest to itself with distance 0. Aggregating both directions, we conclude that $G$, over all possible subsets of $S$ with cardinality $k$, achieves the minimum (bidirectional) chamfer distance between $G$ and $S$; i.e., $G$ is the representative subset of $S$ with cardinality $k$ according to Definition 1.
[W2] Contributions
See our global rebuttal. Some notes specific to your comments:
- Instead of "utilizing the DCT", we claim the strand-level frequency decomposition in our novel hierarchical hair representation, which is used in hair modelling for the first time. Introducing DCT to hair strand modelling, again for the first time, can be a (less significant) contribution but is not our main focus. Applying DCT to hair modelling would be non-trivial without our novel representation design with frequency decomposition.
- Adaptation of geometric deep learning models such as PVCNN is also non-trivial, because hair strands are not point cloud data amenable to direct application. We would refer to S3 in Reviewer vBP9's comments, suggesting that this adaptation and modification in our model design are "quite clever and makes sense".
- We clarify that we do not "cluster" strands but "sample" guide strands. See our reply to Reviewer Sftt [W1.2].
- We mainly focus on the learnable hierarchical generative hair representation and its neural model design. For broader ML topics, our method and neural architecture design are highly related to geometric deep learning, set-based modelling, graph neural networks, implicit neural representations, and hierarchical abstraction of non-Euclidean data. In this way, our work can potentially inspire these communities on method design and extend to other novel applications. Thanks for the comment; we will add this to the discussion.
- Besides the hair representation (Sec 2), we would like to highlight several novel innovations with advantages in the corresponding neural model design (Sec 3) associated with the hierarchical hair parameterization. Please refer to our global rebuttal.
I appreciate the authors’ thorough rebuttal and have no further requests. However, I remain concerned about the contribution of the paper, as noted in my original review. The paper presents itself as a collection of design choices, and I still question whether these specific choices can be classified as significant contributions to hair modeling.
Dear Reviewer mnz6,
Thank you very much for your response. Your comments have been valuable in helping to improve the evaluation and readability of our paper. However, we still stand by and wish to defend the contributions we have made.
- [Regarding classic methods (DCT and k-medoids) for extracting the hierarchical hair representation]: Our major contribution here lies in the high-level hierarchical representation design, without which the direct application of the aforementioned methods would be impossible. For these individual classic methods, besides introducing them to hair modelling for the first time, we contribute the insights and theoretical justifications for why we opt for these classic methods, which are less utilized by recent researchers, instead of more common alternatives such as the DFT and grid-sampling/k-means:
- We use the DCT instead of the more popular DFT, motivated by the insight that the DFT exhibits the oscillation issue when frequency-decomposing strands as open curves.
- We use k-medoids to sample guide strands and show that it is mathematically the optimal way to do so, theoretically more advantageous than the more commonly used grid-sampling / k-means. Moreover, neither the use of k-medoids for sampling nor its theoretical optimality had been observed before.
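As an illustrative aside (our own toy experiment, not from the paper): the endpoint-oscillation issue can be reproduced by low-pass truncating an open, non-periodic curve. The truncated DFT treats the curve as periodic and oscillates around the endpoint discontinuity, while the DCT's implicit even extension has no such jump:

```python
import numpy as np

def dct_basis(n):
    # Orthonormal DCT-II basis matrix (rows = frequencies), built from scratch.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    C[0] /= np.sqrt(2.0)
    return C

n, keep = 64, 8
t = np.linspace(0.0, 1.0, n)
y = t + 0.1 * np.sin(4 * np.pi * t)           # an open curve: endpoints differ

# Low-pass with the DCT: keep the first `keep` coefficients.
C = dct_basis(n)
c = C @ y
c[keep:] = 0.0
y_dct = C.T @ c                               # inverse DCT (orthonormal basis)

# Low-pass with the DFT: keep the `keep` lowest two-sided frequencies.
f = np.fft.fft(y)
freqs = np.fft.fftfreq(n, d=1.0 / n)          # integer frequency indices
f[np.abs(freqs) >= keep] = 0.0
y_dft = np.fft.ifft(f).real

err_dct = np.abs(y - y_dct).max()
err_dft = np.abs(y - y_dft).max()
print(f"max error  DCT: {err_dct:.4f}   DFT: {err_dft:.4f}")
# DCT error is far smaller: its implicit even extension has no endpoint jump.
```

The same effect applies per coordinate of a 3D strand polyline, which is why open curves favour the DCT.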
- [For existing methods (PVCNN) that inspired our neural model architecture design]:
- We are the first to introduce geometric deep learning models to strand hair modelling. Moreover, adapting (rather than directly using) existing geometric deep learning models such as PVCNN to hair is not straightforward, because they were originally designed for point cloud data, which differs greatly from hair strands. In fact, at the implementation level, we could hardly reuse any existing PVCNN code due to this difference; we built the architecture entirely from scratch in the PyTorch framework.
- Beyond the components related to existing methods, we would also highlight several of our innovations in the learning model: handling flexible non-Euclidean parameterization, a resolution-free representation that generates any number of strands at any density, end-to-end training on the high-dimensional strand hair data, etc. We hope the significance of these innovations is acknowledged once a reader follows the full pipeline of our methodology.
- Overall, we fully understand the concern that using existing techniques as components in (part of) the methodology can be considered a less novel contribution. But we would like to offer a different perspective from a well-known researcher in the field: "A new use of an old method can be novel if nobody ever thought to use it this way"; "the novelty arose from the fact that nobody had put these ideas together before" (Black, 2022).
We would be grateful if you could re-evaluate our contributions and innovations, taking into account the full pipeline of our methodology and learning model design, as well as the insights presented in our work. Regardless of your decision, we sincerely appreciate the time and effort you have dedicated to the review. Your comments are invaluable in helping us improve our work.
Best regards,
Submission 11742 Authors
- [Black, 2022]: Michael J Black (2022). Novelty in Science: A guideline for reviewers.
We thank all reviewers for the time and effort they put into reviewing our work. We address the common concerns about unclear contributions and insufficient evaluation as follows, and will update the paper with the evaluation results and other clarifications.
Contributions related to existing work
The most closely related work, HAAR [41] (the only strand hair generation method with code that we can compare against), adopts the hair representation from [32], which maps each strand to a code with a pretrained strand codec VAE and then projects the codes onto a 2D scalp UV map that can be processed with regular 2D CNNs. The same representation is widely used in other recent strand hair modelling methods [40, 46, 36] for different tasks. In contrast, we do not borrow any existing hair representation; we design a brand-new hierarchical hair representation with associated neural models. Our method is more flexible, more sophisticated, and better performing than the grid UV map (2D CNNs) + strand codec. Before [32] (ECCV 2022), earlier strand hair methods [23,33,40,42,44,45] mostly focused on optimizing strand growth or connecting segments, with learning applied to intermediate representations (e.g., orientation fields) rather than directly to strands, and thus did not require a hierarchical strand representation with abstraction; in these methods, the strand representation is simply a polyline (a sequence of points). Some early work also applies PCA and clustering to strands [A].
Our hierarchical hair representation methodology and the associated neural model design are novel, with a number of innovations in strand hair modelling. To compare with existing methods, we structure our contributions as follows:
- Contributions on hair representation and hierarchical abstraction (Sec 2)
- (per-strand level) For the first time in hair modelling, we apply frequency decomposition to strands to facilitate learning, following the spectral bias principle. We introduce the DCT for this purpose, also for the first time in hair modelling, and show that the DCT is better suited than the popular DFT for strands as open curves.
- (collection-of-strands level) For the first time in hair modelling, we introduce the k-medoids clustering algorithm for guide strand sampling. We show with a mathematical proof that this sampling is theoretically optimal (closest to the dense strand set in terms of the chamfer distance), a property not observed before.
- Contributions on neural model design (Sec 3)
- For the first time in hair modelling, we adapt the family of non-Euclidean geometric deep learning models, instead of 2D CNNs, for more flexible learning on hair strands. We opt for PVCNN for its efficiency, with modifications (it was originally designed for point clouds) to fit the hair strand problem.
- We propose a novel neural mechanism for learning strand upsampling / interpolation with graph message passing. Inspired by implicit neural representations, our method can model any amount of dense strands at any sampling density, which has not been seen in deep models that operate directly on strand hairstyles. Another benefit is end-to-end joint training with guide strands, which we have not seen in learning-based strand modelling either.
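For concreteness, the k-medoids guide-strand sampling mentioned above can be sketched as follows. This is a toy, from-scratch alternating (Voronoi-iteration) variant for illustration only; the paper's actual implementation and strand parameterization may differ, and strands are assumed flattened to fixed-length vectors:

```python
import numpy as np

def k_medoids_sample(X, k, iters=20, seed=0):
    """Toy alternating k-medoids. Unlike k-means centroids, medoids are always
    actual dataset elements, so every sampled guide is a real strand."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    med = rng.choice(len(X), size=k, replace=False)
    for _ in range(iters):
        labels = D[:, med].argmin(axis=1)                       # assign to nearest medoid
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:                                    # in-cluster cost minimizer
                med[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
    return med

rng = np.random.default_rng(1)
strands = rng.normal(size=(200, 30))         # 200 "strands", each flattened to 30-D
guides = strands[k_medoids_sample(strands, k=8)]
print(guides.shape)                          # (8, 30); each row is an actual strand
```

Because each guide is an element of the dense set, the reverse chamfer direction is zero by construction, which is the premise of the optimality argument.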
We will specify in the related work which components of our method are novel compared to existing strand hair modelling methods, to better present our contributions.
Contributions related to existing techniques (DCT, k-medoids, PVCNN)
Rather than the use of any individual technique, our major contribution lies in the novel design of the hierarchical hair representation and the associated neural model, i.e., in how we make it possible to use these techniques in hair modelling, which is not straightforward and thus had never been done before. Please refer to our replies to Reviewers mnz6 and VBS2.
Evaluation against SOTA method HAAR
- The code of HAAR was not available before the NeurIPS submission, so comparing with its full pipeline was difficult. We therefore compare with the hair representation it uses, which is exactly the "grid-based + strand codec" baseline in Table 2, as stated in Sec 4, and also in Table B of the rebuttal PDF. We will note the connection to HAAR in the Table 2 caption (as in Table B). The quantitative comparison shows the advantage of our novel hierarchical hair representation.
- HAAR released its code and inference model (but not the artist hairstyle data) after the NeurIPS submission. Due to the lack of established evaluation metrics for hair generation (see our reply to Reviewer VBS2 W4), we conducted a user study with human evaluation. We randomly generated 30 hairstyles with our method without selection and 30 with HAAR, 60 examples in total, randomly shuffled before being presented to the users. Each user gives each hairstyle a score from 1 to 10 on how realistic the generated hairstyle looks. Following the Code of Ethics, we provide examples and instructions on the Google form shown in Figure A of the rebuttal PDF. We collected 54 valid responses. The resulting average scores in Table A suggest the advantage of our generation over HAAR. We also report results per hairstyle category: for short hair, both methods perform well; our method performs significantly better on long hair and especially curly hair, thanks to our sophisticated representation design, e.g., the frequency decomposition, the learned neural interpolation, and the end-to-end training that facilitates optimization. Some qualitative issues with HAAR are shown in Figure B.
Evaluation on real-world data
The CT-scanned human hairstyles in CT2Hair [B] are the highest-quality real-world strand data we know of. We evaluate on them in Table B of the rebuttal PDF, corroborating the advantage of our hair representation.
- [A] Wang et al (2009) Example-Based Hair Geometry Synthesis
- [B] Shen et al (2023) High-Fidelity 3D Hair Modeling using Computed Tomography
Dear reviewers,
We sincerely thank all reviewers for the valuable comments that helped us improve and clarify our paper.
To highlight some of our improvements in the rebuttal and further response to reviewers:
- New evaluation experiments, including (1) comparison with generation results from HAAR (the SOTA method from CVPR 2024 and, for now, the only open-sourced strand hair generation method we can compare with; its code was released after the NeurIPS submission deadline); (2) evaluation on the real-world hairstyle data CT2Hair; and (3) a few more ablation experiments on sampling and frequencies.
- More directly specifying what is new in strand hair modelling. Generative strand hairstyle modelling is a young field, owing to the complex data structures and the challenges of data acquisition, while earlier non-generative strand hair methods focus on optimization rather than hierarchical abstraction for generalizable neural encoding. Our flexible non-Euclidean representation and framework form a novel paradigm that diverges significantly from the existing strand learning pipeline (pre-trained strand codec + 2D CNN on regular UV grids) adopted by HAAR [41] and other recent work [32,40,46,36]. Consequently, most of our detailed design choices are new contributions to strand hair modelling as well, as elaborated in our global rebuttal. Theorem 1, on applying k-medoids to sampling, is also a new observation.
- Contributions beyond a "simple and direct combination of existing methods": we summarize our response to this common concern, which partly stems from an incomplete reading of our full method pipeline. Details can be found in our global rebuttal and in our further response to Reviewer mnz6.
- The more important contribution is the high-level design of our novel hierarchical hair representation and learning model pipeline, which precedes any specific design choice.
- Introducing classic methods (DCT and k-medoids) that are less utilized by recent researchers to the new learning problem of generative strand hair representation should be considered a novelty.
- We present detailed motivational insights and theoretical justifications for why we opt for the DCT and k-medoids over more popular variants. The theoretical justification for k-medoids, in particular, is a new finding.
- The geometric deep learning method (PVCNN) that inspired our model design is not directly applicable to strand hair data, which differs greatly from point clouds. We adapt the idea with several modifications to handle the strand hair representation.
- Several other novel designs in our learning model facilitate generating hair strands in any amount and at any density, end-to-end training of the coarse-to-fine pipeline, disentangling the generation of roots from the rest of the strands via the density map, etc. These innovations are specific to the complex, high-dimensional strand hair data and thus not similar to existing methods.
- Almost everything mentioned here is new to strand hair modelling, as discussed above.
As we approach the deadline for the author-reviewer discussion, we would like to kindly ask whether the reviewers have any further post-rebuttal questions or concerns. We are happy to address any remaining clarifications.
Best regards,
Submission 11742 Authors
The paper received diverse scores, two positive and two negative. The authors provided a very detailed rebuttal, and most of the concerns were addressed. Only one concern remains: the technical part reads like using existing method A for problem a, method B for problem b, and method C for problem c, which indeed makes the paper feel more like an engineering effort. The AC took a very careful look at the paper and found the most interesting parts to be the hierarchical representation and the coarse-to-fine pipeline. The AC is familiar with the hair modeling area and considers the contribution of this submission sufficient. Considering that the two negative reviewers have very low confidence scores, the AC finally recommends acceptance.