Towards A Translative Model of Sperm Whale Vocalization
WhAM: a transformer model unifying generation, acoustic translation and classification of sperm whale vocalizations
Abstract
Reviews and Discussion
This paper presents a model for generating whale codas, fine-tuned from VampNet. The authors fine-tune VampNet on their collected dataset and confirm that the generated codas are similar to real ones and that the learned representations are useful in downstream tasks such as audio translation and classification.
Strengths and Weaknesses
Strengths
- The dataset used for the research is unique, which might be interesting to some people.
- The authors appear to be releasing their models.
Weaknesses
-
This paper focuses on developing a model for whale vocalization. However, the model has limited technical novelty in model architecture and training objectives. The approach seems to consist of tokenization, token modeling, and generation. These modules are borrowed from previous work, VampNet, without modification, and therefore this work has limited technical novelty.
-
It is not very clear if the success in the style transfer is quantitatively verified. In my understanding, FAD measures the distance between two distributions, yet does not measure the semantic similarity between the translated and original ones. Their generation results can be realistic enough to mimic the real codas, but they might not preserve the semantics of the input.
-
The observations from their empirical results are limited. They combined several datasets to fine-tune VampNet and show that their trained model performs well on diverse whale-related downstream tasks. However, it is not clear what the key ingredient was, e.g., dataset combination or selection of VampNet. Table 1 only shows that their model outperforms the previous work. To give better observations and insight for readers, they need to provide more ablations into the combination of training datasets and base models.
-
Anon1 and Anon2 seem to be their developed dataset. However, the explanation of the datasets is abbreviated from the main paper. Also, it seems it is not clear if they can publish the dataset. If they can, the novelty of this work should improve.
Questions
Lack of technical novelty and important observations is the main reason for my rating.
They also do not mention that they will publish the dataset used to train their model. If their dataset is unique, they should focus on the presentation of the dataset.
Limitations
Yes.
Justification for Final Rating
The rebuttal addressed my concerns about the dataset. However, it did not address my concerns about the lack of technical novelty and empirical analysis.
The authors stated that their work is the first to apply an existing model to whale vocalizations, which I do not think does much to support technical novelty unless there was a large barrier to applying it to their task and they solved it. Also, the empirical analysis is limited to domain adaptation and fine-tuning, and the insight from the analysis seems somewhat limited.
Considering that, I would like to keep leaning towards rejecting.
Formatting Concerns
No.
Thank you for your review, which has helped us clarify and strengthen our paper. Most importantly, our work is novel as it is the first application of neural acoustic translation to cetacean vocalizations (only one other paper on neural bioacoustic translation exists, and it concerns birds). We also respectfully disagree with the assessment of limited technical novelty and lack of ablations; we highlight below where those are found in the paper. We address each concern in detail and provide additional context about our technical contributions, evaluation methodology, and the significance of this work for both the machine learning and marine biology communities.
Weaknesses
This paper focuses on developing a model for whale vocalization. However, the model has limited technical novelty in model architecture and training objectives. The approach seems to consist of tokenization, token modeling, and generation. These modules are borrowed from previous work, VampNet, without modification, and therefore this work has limited technical novelty. ... Lack of technical novelty and important observations is the main reason for my rating.
We respectfully disagree with this assessment and are glad to state the study's contributions more specifically. First, there are very few examples of acoustic translation work on animal vocalizations (only birds), and this is the first applied to whale (or any cetacean) vocalizations. Our contributions include:
- Novel evaluation framework combining FAD, expert perceptual studies, and downstream tasks
- Largest curated sperm whale vocalization dataset
- Training strategies bridging music-pretrained models to underwater bioacoustics
- First demonstration that style transfer can preserve bioacoustic structure in cetaceans
It is not very clear if the success in the style transfer is quantitatively verified. In my understanding, FAD measures the distance between two distributions, yet does not measure the semantic similarity between the translated and original ones. Their generation results can be realistic enough to mimic the real codas, but they might not preserve the semantics of the input.
We appreciate this comment/concern and want to clarify a fundamental aspect of our work. As stated in the paper: "We emphasize that translation is in the acoustic sense; semantic translation remains a distinct and more ambitious goal."
For this reason, we use "acoustic translation" rather than "semantic translation." Understanding animal communication semantically would indeed be groundbreaking. For example, the 1973 Nobel Prize was awarded for decoding the semantic content of bee communication ("waggle dances"). No such understanding exists for whale communication.
Our experiments argue that our model preserves acoustic properties and temporal structure, not semantic meaning (which remains unknown to science).
The observations from their empirical results are limited. They combined several datasets to fine-tune VampNet and show that their trained model performs well on diverse whale-related downstream tasks. However, it is not clear what the key ingredient was, e.g., dataset combination or selection of VampNet. Table 1 only shows that their model outperforms the previous work. To give better observations and insight for readers, they need to provide more ablations into the combination of training datasets and base models.
Thank you for this comment. We want to point out that we have indeed provided extensive ablations in our paper. Specifically:
- Section 4, final paragraph describes ablation experiments for domain adaptation and species-specific fine-tuning.
- Appendix D repeats the FAD and downstream classification experiments while ablating model components and dataset composition. Additional ablations have been carried out as a result of discussions with other reviewers and will be added to the paper.
Anon1 and Anon2 seem to be their developed dataset. However, the explanation of the datasets is abbreviated from the main paper. Also, it seems it is not clear if they can publish the dataset. If they can, the novelty of this work should improve. ... They also do not mention that they will publish the dataset used to train their model. If their dataset is unique, they should focus on the presentation of the dataset.
We appreciate your interest in our dataset. If this paper is accepted, we will share the Anon1 dataset. We are in conversation with the data collectors of Anon2 regarding a partial or full release.
The paper focuses on the acoustic translation methodology rather than dataset presentation because: (1) the dataset combines two existing sources with our preprocessing pipeline, and (2) the methodological contributions (translation framework, evaluation suite) have broader applicability beyond this specific dataset.
We will also release our model weights and preprocessing code, enabling researchers to apply our methods to their own bioacoustic data.
Thanks for the rebuttal.
The rebuttal addressed my concerns about the dataset. However, it did not address my concerns about the lack of technical novelty and empirical analysis.
The authors stated that their work is the first to apply an existing model to whale vocalizations, which I do not think does much to support technical novelty unless there was a large barrier to applying it to their task and they solved it. Also, it is true that the empirical analysis is limited to domain adaptation and fine-tuning, and the insight from the analysis seems somewhat limited.
Considering that, I would like to keep leaning towards reject.
The paper proposes a neural Transformer architecture to generate synthetic sperm whale codas. Sperm whales' main communication is based on short sequences of clicks called codas.
The proposed model, WhAM (Whale Acoustics Model), is a fine-tuning of VampNet, a model trained on musical data, on 10k coda recordings collected over the past two decades.
The paper shows that the proposed WhAM architecture generates high-quality synthetic codas while preserving acoustic features. The authors use the Fréchet Audio Distance (FAD) metric to evaluate the proposed model.
Moreover, they examine the learned representations and show that they achieve very good performance on downstream tasks such as rhythm, social unit, and vowel classification.
Strengths and Weaknesses
Strengths:
-
The idea of the proposed WhAM model, using a transformer-based model to generate sperm whale vocalizations, is very novel.
-
The paper uses the well-known Fréchet Audio Distance metric as well as an expert perceptual examination.
-
The proposed method was developed in collaboration with domain experts.
Weaknesses:
-
The MATM network was fine-tuned while the audio codec was kept fixed, which can harm the results of the proposed model.
-
The proposed model produces noise in the generated samples, for example unnatural onset/decay of clicks and spectral inconsistencies.
-
The training dataset may contain echolocation clicks, which can affect the generation results.
Questions
-
What are the limitations of the proposed model when only the MATM is trained? Can you train a codec?
-
Can you elaborate more on the unnatural clicks and spectral anomalies? Can you think of modifications to the network that would fix this issue?
-
Can you fix the echolocation-sequence problem in the dataset? Can you tell how much performance degradation it causes?
Limitations
--
Justification for Final Rating
All of my concerns were addressed appropriately by the authors.
Formatting Concerns
--
We thank the reviewer for their enthusiastic assessment of our work and for recognizing the novelty of applying transformer-based architectures to sperm whale vocalization synthesis. Your technical questions have prompted us to conduct additional analyses that strengthen the paper and clarify important implementation details. Below, we address each of your concerns.
Regarding training a codec: Thank you for this insightful question about the fixed codec limitation. Your comment prompted us to conduct a detailed analysis of the codec's impact on sperm whale vocalizations. We performed systematic encoding/decoding experiments measuring frequency-dependent signal-to-noise ratio (SNR in dB):
- 0.2-3kHz: +0.15 dB
- 3-4kHz: -1.95 dB
- 4-5kHz: +0.15 dB
- 6-7kHz: -1.95 dB
- 7-9kHz: -2.77 dB
While we see some degradation in the 3-6kHz range (important for vowel-like features per Beguš et al.), the codec still preserves the primary coda energy reasonably well. The degradation is not severe enough to prevent our model from generating natural-sounding codas (Section 4.2), nor to prevent its embeddings from "capturing" vowels (Section 4.3).
Developing a specialized bioacoustic codec remains valuable future work, though computationally expensive. For now, these results suggest the fixed codec, while not perfect, is sufficient for demonstrating acoustic translation capabilities. We've added this quantitative codec analysis to our limitations section. Thank you for motivating this important addition.
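For readers who wish to reproduce this kind of analysis, below is a minimal sketch of a per-band SNR measurement over a codec round-trip. It is illustrative only: `codec_roundtrip` is a hypothetical placeholder for encoding and decoding with the fixed codec (not an actual API), and the band edges simply follow the ranges reported above.

```python
# Minimal sketch of a frequency-dependent SNR analysis for a codec round-trip.
# `codec_roundtrip` is a hypothetical placeholder (not the actual codec API);
# band edges follow the ranges reported above.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(x, lo_hz, hi_hz, sr):
    sos = butter(4, [lo_hz, hi_hz], btype="band", fs=sr, output="sos")
    return sosfiltfilt(sos, x)

def band_snr_db(original, reconstructed, sr, bands):
    """SNR (dB) of the reconstruction, restricted to each frequency band."""
    out = {}
    for lo, hi in bands:
        ref = bandpass(original, lo, hi, sr)
        err = ref - bandpass(reconstructed, lo, hi, sr)
        out[(lo, hi)] = 10 * np.log10(np.sum(ref**2) / (np.sum(err**2) + 1e-12))
    return out

# Usage (hypothetical):
# x_hat = codec_roundtrip(x, sr)   # encode + decode with the fixed codec
# bands = [(200, 3000), (3000, 4000), (4000, 5000), (6000, 7000), (7000, 9000)]
# print(band_snr_db(x, x_hat, sr, bands))
```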
Regarding echolocation clicks: You raise a valid concern. While our dataset curators made efforts to filter echolocation clicks from communication codas, some false-positives likely remain given the dataset's scale and technological limitations of click detectors. We will run a more recently-developed coda-detector on our dataset; if there are noticeable differences, we will retrain the model and release this as an additional checkpoint. Unfortunately, we will not be able to repeat the listener perceptual study.
It's worth noting that only one of five expert listeners reported this issue, suggesting it's not a pervasive problem in our generated outputs. The model appears to learn communication coda patterns preferentially despite possibly "contaminated" training data. We acknowledge this limitation in our dataset section.
Regarding fixing unnatural click properties: Thank you for this question. The artifacts mentioned by expert listeners have different solutions:
- Background noise: Standard bioacoustic preprocessing (DC offset removal, noise-gating) would remove this. Indeed, Beguš et al. (2024) apply such denoising to their sperm whale data.
- "Unnatural" clicks: This could be addressed through higher p-norm losses to emphasize acoustic events of varying magnitudes. However, this optimization likely needs to happen at the codec level, which is beyond our current computational resources. More training data may also help.
We deliberately chose to use raw outputs from WhAM without classical post-processing, as this paper focuses on demonstrating our neural model's generative capabilities (and limitations). For practical applications, standard bioacoustic denoising would be applied, but including it here would obscure the model's true performance.
The paper proposes a machine translation model for sperm whales. It can generate sperm whale codas either at random or conditioned on vocalizations of other animals/species. The authors do not claim semantic translation and emphasise a more style-transfer-like effect. The embeddings learnt by their model can also be used for downstream classification tasks. They start with a model pretrained with music data, train on general animal audio for domain adaptation and finally finetune on a relatively large corpus of 10K sperm whale codas. Their evaluation includes 1) perceptual similarity of generated codas and 2) downstream classification performance based on learnt representations.
Strengths and Weaknesses
Strengths
- The paper has good potential impact, both for sperm whale understanding and broader bioacoustics.
- The perceptual study using experts in the field was well conducted. I also liked the qualitative feedback included here, which gives potential directions for future work.
- The new dataset would also be great for the community. I understand the challenges regarding making this data public mentioned in the checklist, but urge the authors to try to release as much data as possible, after publication too.
Major Weaknesses
- For the domain adaptation stage, the authors used animal sounds from Audioset and FSD. I was curious why the authors preferred this over larger bioacoustics datasets like BirdSet [1] and iNatSounds [2]? These have 100s to even 1000s of hours of animal sounds and can be a much better starting point. I understand if the choice was for computational constraints, but it may be worth mentioning in the paper.
- I am a bit skeptical about the metric FAD. In particular, I am concerned about its dependence on the choice of the embedding model. I understand that the authors analysed a few different models and then chose CLAP. However, all of the embedding models were trained for general audio, and may not appropriately capture fine grained differences. I would recommend using a bioacoustic model (eg. [3]), not necessarily trained for whales, but for birds or general species sound datasets.
- In continuation of the previous point, from Figure 3, it seems that before WhAM (i.e., untranslated), the artificial acoustic impulses are more similar to sperm whales than quite a few animal sound sources like B. Seal, B. Whale and L. Whale. This was quite surprising to me, and further added to my skepticism of the FAD metric.
- Regarding the perceptual study, I wanted to clarify whether the accuracy measures how often an expert was able to correctly distinguish a synthetic coda from a natural one. If this is the case, then the lower the accuracy, the better WhAM's performance, right? The quantitative discussion at some points seemed to be written more as if evaluating the experts' performance.
- Based on my understanding, the 2AFC experiment in Fig 4 and Natural Codas in Fig 5 should be measuring very similar things. But, the former is at 80% on average while the latter is close to 60%. I understand that mixing would increase the number of choices and the evaluation is not exactly the same anymore, but is there another reason for this difference?
- I would appreciate it if the authors can include a few qualitative visualizations. Spectrograms of a few original and translated audio examples could help build better intuition. I would recommend including some examples for the qualitative feedback from the expert study, if possible.
- The claim of learnt embeddings being “useful” for downstream classification seems to be relatively weak. I would argue even CLAP embeddings are somewhat useful, and may be better than random performance. I understand that AVES may not be a realistic baseline to beat, but can the authors include baselines with some other audio embedding models like CLAP and [3]? This would better highlight the benefit of WhAM's learnt embeddings.
Minor Weaknesses
- I was a bit confused by Fig 5. I thought NCAIs were what was referred to as synthetic codas. Can the authors please clarify the difference?
- 2AFC vs 2AFC+spec: it seems like 3 experts were able to almost perfectly pick out the synthetic codas with spectrograms, while 2 were even more unsure. I am curious whether these two groups are marine biologists and acoustics specialists.
- It's interesting that although the dataset consists of whales in the same vocal clan, AVES still does quite well (92%) on social unit classification. Can the authors comment on this?
[1] Rauch, Lukas, et al. “BirdSet: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics.” The Thirteenth International Conference on Learning Representations, 2025.
[2] Chasmai, Mustafa, et al. “The iNaturalist sounds dataset.” Advances in Neural Information Processing Systems 37 (2024): 132524-132544.
[3] Ghani, Burooj, et al. “Global birdsong embeddings enable superior transfer learning for bioacoustic classification.” Scientific Reports 13.1 (2023): 22876.
Questions
I would be happy to increase my rating if the authors can include bioacoustics embedding models and pretraining on larger bioacoustics datasets; or defend their choices.
Limitations
Yes.
Justification for Final Rating
The authors have addressed my concerns. I am glad that the bioacoustics models were helpful.
Overall, I am happy with the paper and believe it has good potential in the bioacoustics community. I am in support of it being accepted, and maintain my original rating of 5: Accept.
The worse classification performance (compared to even CLAP) is the main reason I do not increase my rating to Strong Accept.
Formatting Concerns
Formatting looks good, no concerns.
We sincerely thank the reviewer for their thorough and constructive review, and for recognizing the potential impact of our work. We are particularly grateful for specific suggestions regarding bioacoustic embeddings and datasets, which have led to substantial improvements in our evaluation. Below we address each specific concern.
W4: Thank you for this excellent suggestion. We will retrain our model using BirdSet/iNatSounds for the domain adaptation stage. Given the larger data volume (100s-1000s of hours vs our current dataset), this will take considerable time but will be ready for the camera-ready version. We expect this enhanced domain adaptation to improve our results (and will report the results regardless of an improvement).
W5: This suggestion significantly strengthened our evaluation. We have now evaluated FAD using BirdNET [3] embeddings as you recommended. The results strengthen our findings:
| Dataset | Before WhAM | After WhAM |
|---|---|---|
| Risso | 131.1760 | 99.1514 |
| Beluga | 162.1433 | 103.3841 |
| Orca | 204.1088 | 96.2505 |
| L. Seal | 200.5741 | 109.8147 |
| Narwhal | 127.1519 | 95.5090 |
| L. Whale | 173.5660 | 93.1842 |
| B. Seal | 213.2286 | 108.6346 |
| B. Whale | 202.5659 | 111.2529 |
| C. Dolphin | 148.9537 | 94.6659 |
| Ross Seal | 205.2280 | 105.3025 |
| A. Dolphin | 190.3531 | 80.7560 |
| Walrus | 184.3641 | 92.8567 |
| NCAI | 121.8530 | 59.7834 |
| S. Whale | 91.5416 | 90.0007 |
Analysis:
- WhAM dramatically reduces FAD scores across all non-target species (improvements of 50-110 points)
- Sperm whale FAD remains stable (-1.5 point difference is 1-2 orders of magnitude smaller than the rest, and could be due to disjoint test vs. reference sets)
We note that the difference in scale compared to Figure 3 is because we had to reimplement FAD ourselves in order to use these embeddings; the standard library we were using with CLAP was not compatible with BirdNET. Importantly, scale differences are irrelevant to FAD. These bioacoustic embeddings provide additional evidence for our model's effectiveness. Results will be added to the camera-ready version.
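For context, the quantity being computed is the standard Fréchet distance between Gaussian fits of two embedding sets; a minimal sketch of such a reimplementation is below (feature extraction with BirdNET is omitted, and the exact code used in the paper may differ).

```python
# Minimal sketch of Fréchet Audio Distance over precomputed embeddings.
# `emb_ref` and `emb_eval` are (N, D) arrays (e.g., BirdNET embeddings of the
# reference and evaluated audio sets); feature extraction is omitted here.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_ref, emb_eval):
    mu_r, mu_e = emb_ref.mean(axis=0), emb_eval.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_e = np.cov(emb_eval, rowvar=False)
    covmean = sqrtm(cov_r @ cov_e)
    if np.iscomplexobj(covmean):   # discard small imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_e
    return float(diff @ diff + np.trace(cov_r + cov_e - 2 * covmean))
```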
W6: Broadly speaking, the vocalizations of bearded seals (Erignathus barbatus), bowhead whales (Balaena mysticetus) and long-finned pilot whales (Globicephala melas) are "songlike" (trills, moans, groans, etc.). In that regard, acoustic impulses (even digital ones) are more similar to codas than these. If anything, this shows that FAD is not just picking up on background noise—which is relatively similar in these recordings as compared to the perfect silence of digital non-coda acoustic impulses. We will add this discussion to the paper.
We agree that FAD experiments alone would not be sufficient to support our overall claim regarding our model's acoustic translation capabilities. We chose to include them as they use an "objective" metric unlike the downstream classification and listener study experiments, which involve (to varying extents) human annotations.
W7: That is correct, accuracy refers to the success rate of the listeners. We will add arrows (\downarrow) to the y axis to emphasize that lower accuracy means better performance by WhAM.
W8: Your understanding is correct. The key difference is that in the 2AFC task, listeners directly compare a natural recording with the model's output on that exact same audio, allowing immediate A/B comparison. When the same acoustic content is presented in both versions, listeners can detect fine-grained artifacts that become apparent through direct comparison.
In contrast, the batch classification task requires identifying synthetic vs. natural audio without matched pairs. Listeners must rely on detecting absolute qualities that distinguish synthetic from natural audio across varied content, without the benefit of direct comparative cues. This fundamental difference in task difficulty explains the performance drop from ~80% to ~60%.
W9: Thank you for the great suggestion, we will add a sample of spectrograms to the camera-ready version. We will ask our study participants to annotate the spectrogram with their observations. We have already reached out and can confirm at least one participant is available to share their annotations.
W10: We agree that comparing against other embedding models would better demonstrate WhAM's utility. Following your suggestion, we evaluated BirdNET [3] and CLAP embeddings on our downstream tasks:
| Task | WhAM | AVES | BirdNET | CLAP | Random | Majority |
|---|---|---|---|---|---|---|
| Detection | 91.3 | 92.8 | 90.0 | 96.0 | 60.9 | 60.9 |
| Rhythm | 84.4 | 90.4 | 94.8 | 89.0 | 66.3 | 60.9 |
| Social Unit | 70.5 | 92.0 | 93.4 | 89.0 | 42.5 | 35.1 |
| Vowel | 79.6 | 91.8 | 85.0 | 82.0 | 66.3 | 66.3 |
We emphasize that AVES, BirdNET, and CLAP are all encoder-only models designed for classification, whereas WhAM can be used for generation. As expected, embeddings trained on animal datasets (AVES, BirdNET) generally outperform general audio embeddings (CLAP) on more involved bioacoustic tasks.
Thank you for suggesting these additional experiments—they confirm the same trend across different embedding models, and add breadth to our analysis. The complete comparison table will be included in the camera-ready version.
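For readers who want to reproduce this kind of comparison, a minimal sketch of a linear-probe evaluation over frozen embeddings follows. It is illustrative only and may differ from the exact probing protocol used in the paper; the `embeddings` dictionary and `labels` array are hypothetical placeholders.

```python
# Sketch: comparing frozen audio embeddings via a simple linear probe.
# `embeddings` maps a model name to an (N, D) array for N labelled clips and
# `labels` is an (N,) array of task labels; both are hypothetical placeholders.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(X, y, folds=5):
    clf = LogisticRegression(max_iter=2000)
    return cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()

# for name, X in embeddings.items():   # e.g., "WhAM", "AVES", "BirdNET", "CLAP"
#     print(name, probe_accuracy(X, labels))
```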
W11: Apologies for the confusion. NCAIs = training data (synthetic beeps/clicks that sound artificial); synthetic codas = WhAM's output (intended to sound like real whale codas). We'll clarify this terminology throughout the paper.
W12: Interesting observation! Of the (near-)perfect scores in the spectrogram-assisted 2AFC, one was a marine biologist and another was an underwater acoustician. Interestingly, one is a professor with over a decade of experience, while the other is a PhD student. However, with n=5, we cannot draw statistical conclusions about expertise-performance correlation. We can surely suggest this as future work.
W13: In [ANONYMIZED LOCATION], Social Units are stable family groups of sperm whales - essentially extended matrilineal families that travel and socialize together. They represent the main ‘operational’ level of social structure in the whales' lives: mature males live predominantly alone, while all females spend their lives in their natal units. SUs which share the same vocal dialect of codas make up a higher level of social structure called a vocal clan; however, as units move around the ocean there is a need to recognize one from another, so there are some distinct features of their dialects that allow for this recognition [1]. This is likely what provides a cue for the classification task (see also [2]).
AVES's 92% accuracy in distinguishing between SUs could reflect: (1) genuine group-specific vocal characteristics from these stable family units, and/or (2) confounding factors from recording conditions, as different SUs were necessarily recorded on different days and locations. While (1) represents the biological signal we're interested in, (2) is a limitation inherent to field recordings in marine environments.
In conclusion, we believe these additional experiments and clarifications fully address your concerns. The new bioacoustic embedding results particularly strengthen our claims about both the FAD metrics and downstream utility. Thank you again for your constructive feedback, which has substantially improved our work.
—
Citations
[1] S. Gero, H. Whitehead, and L. Rendell (2016) Individual, unit, and vocal clan level identity cues in sperm whale codas. Royal Society Open Science 3: 150372.
[2] P.C. Bermant, M.M. Bronstein, R.J. Wood, S. Gero, and D.F. Gruber (2019) Deep Machine Learning Techniques for Sperm Whale Bioacoustics: Detection and Classification of Echolocation Clicks and Codas. Scientific Reports 9:12588
I appreciate the authors for conducting these experiments; I am glad that the bioacoustics embeddings were helpful. Quick clarification about the embeddings: was the model used BirdNet [1] or Perch [2]?
Regarding the classification performance, I am a bit surprised that even CLAP performs significantly better than WHAM, but understand the authors' point about encoder-only models.
Thank you for the clarifications about NCAI and song-like vocalizations vs codas.
[1] Kahl, Stefan, et al. "BirdNET: A deep learning solution for avian diversity monitoring." Ecological Informatics 61 (2021): 101236.
[2] Ghani, Burooj, et al. “Global birdsong embeddings enable superior transfer learning for bioacoustic classification.” Scientific Reports 13.1 (2023): 22876.
The model used was BirdNET.
The paper presents WhAM, a generative model that theoretically can convert any audio prompt into whale sound. Technically, it adopts a masked language model trained on discrete audio units (DAC) and uses iterative decoding to generate the audio. The model can generate whale sounds reasonably, based on the general FAD measurement as well as the evaluation by the domain experts. The representation of WhAM can be used for whale-oriented understanding tasks.
Strengths and Weaknesses
Strengths: (1) The authors show great expertise in animal/whale studies, which makes it a good interdisciplinary study between machine learning and animal studies. (2) The authors have conducted a good evaluation of their model, including FAD-based evaluation and the human experts' evaluation. The results are convincing. (3) I really appreciate that the authors provided sufficient biological background in their paper to make the paper more understandable to the community.
Weakness: (1) Lack of baseline: the authors mentioned the GAN-based approach in their introduction and related work sections, but didn't compare with these prior works experimentally. (2) Lack of ablation study: The authors used non-whale sounds (e.g., AudioSet) in their first stage of training. I would suggest that the authors justify the necessity of this adaptation.
Questions
I may request more clarification:
(1) Personally, I'm a bit confused about the motivation. Why do we need to generate whale sounds from a random audio prompt, rather than from other, more informative cues (e.g., a given clan, gender, or some text-like information, if it exists)? (2) Experimentally, the authors only used around 20 hours of data but trained the model for 1M steps. Given this small data volume and long training schedule, do the authors have a good reason for this setup? Would it make the model fully memorize every whale sound? Do the authors have a plan to test the model's generalization capability?
Limitations
The paper is solid in general. As the machine learning community may not have sufficient knowledge of this biological domain, I would recommend that the authors make some audio demos so readers can better understand this work.
Justification for Final Rating
The paper provides a good implementation for generating whale sounds, which is an interesting topic in marine biology.
Since I'm mainly from the machine learning community and have little knowledge of this field, I can only review based on my background and am not very confident.
I score the paper 4. It should not be lower, since the work is complete and no technical flaws have been found. It should not be higher, since the method is borrowed from prior works and is not very original, and this work can hardly be generalized beyond whale research.
Formatting Concerns
Formatting is good
We thank the reviewer for their thoughtful and constructive feedback on our work. We appreciate your recognition of our interdisciplinary approach and the thorough evaluation methodology. Your questions have helped us identify areas where we can provide additional clarity and strengthen our empirical analysis. Below, we address each of your concerns in detail, including new experimental results that validate our design choices.
W (1): At the time of writing the paper, we were unable to obtain samples from either of the GAN-based approaches: Kopets et al. did not have any public code or samples, while Beguš et al. had no samples but did release public code; unfortunately, we were unable to run their code (we have reported the issues to the authors). If we cannot resolve the issue by the camera-ready deadline, we will mention this in a footnote.
W (2): We appreciate this suggestion. We note that we already justified the necessity of domain adaptation through downstream classification experiments (mentioned at the end of Section 4 with details in Appendix D.2), which showed that neither domain adaptation nor SSFT alone produces useful embeddings. Following your suggestion for more extensive ablations, we have now repeated this analysis for FAD:
| Model Configuration | Average FAD | Sperm Whale FAD |
|---|---|---|
| Tokenizer only | 0.97 | 0.44 |
| No SSFT (domain adaptation only) | 0.85 | 0.58 |
| No domain adaptation (SSFT only) | 0.55 | 0.63 |
| Full WhAM | 0.72 | 0.29 |
Importantly, when using the "SSFT-only" ablation on sperm whale inputs → sperm whale outputs, FAD increases to 0.63 (worse than average). This is a red flag—the model should preserve sperm whale acoustic properties when the input is already a sperm whale coda. In contrast, full WhAM achieves 0.29 FAD on the same task (which is the lowest FAD except for the reference dataset itself).
This suggests that without domain adaptation, the model cannot properly condition on input audio. Indeed, during development, we found that without domain adaptation, the model fails to generate coherent audio when translating from marine mammal inputs: outputs are so noisy that even non-expert listeners could distinguish them with ~100% accuracy.
Domain adaptation with diverse marine sounds (AudioSet, FSD) teaches the model to condition on input characteristics - a prerequisite for meaningful acoustic translation. Thank you for this suggestion - these FAD ablations revealed nuanced insights about the interaction between domain adaptation and species-specific training. We will add this interesting discussion to the ablation section of the paper alongside our existing downstream task ablations.
Regarding audio demos: We appreciate this suggestion and included audio samples as supplementary material with our submission. As a reminder, our supplementary materials include:
- Input/output pairs: Natural sperm whale codas and their WhAM translations for direct comparison
- Cross-species examples: Synthetic codas generated from both sperm whale and walrus inputs
To better serve readers unfamiliar with marine bioacoustics, we will enhance these materials for the camera-ready version. We will (1) add sample spectrograms directly in the main paper showing input/output pairs, (2) include a "listener's guide" explaining what acoustic features to focus on, and (3) add more diverse examples (different coda types, various acoustic conditions).
Additionally, we intend to release the full WhAM model on HuggingFace, enabling researchers to generate their own synthetic codas and explore the model's capabilities interactively.
Q (1): This is an excellent question that highlights a fundamental challenge in marine bioacoustics. We completely agree that conditioning on semantic or demographic information would be more scientifically valuable. However, the field faces severe data annotation limitations:
- Only about 10% of all recordings of the sperm whale units in our datasets made between 2005 and 2024 have been manually annotated and verified and contain behavioral annotations. We note that recordings vary in duration from ~4 minutes to hours long, and this is the largest curated sperm whale dataset known to us. More recordings have had codas detected and annotated with new algorithms, but for this work, to be conservative, we focused on manually verified annotations only.
- Speaker identity labels are rare as well: of the recordings that have been manually annotated, only ~40% contain codas attributed to identified speakers, which remains a very manual and time-consuming step in the data annotation process.
- The vast majority of recordings in our datasets are from females and immatures of both genders (who sound very much like their mothers' dialect). In particular, they mostly do not contain mature male vocalizations.
Our self-supervised acoustic translation approach was designed specifically to work with this reality—using audio context rather than requiring semantic labels. This establishes the technical foundation that future work can build upon as richer annotations become available.
Regarding clan conditioning: All our experiments use recordings from a single clan. Multi-clan synthesis would require years of additional fieldwork across ocean basins. We note that even aggregating our current single-clan dataset from two existing sources took several months due to format inconsistencies, data loss, and lack of standardization in the marine biology community.
Q (2): Excellent observation about the data-to-steps ratio. Allow us to clarify here; we will add this clarification to the model details in the paper. The key insight is that our training setup generates far more unique training examples than the raw hours suggest. At a high level, this is how tokenization and masking work (a small sketch follows the lists below):
- Each 2-second audio snippet becomes a 14×120 token array
- Columns represent time steps, rows represent acoustic granularity (coarse-to-fine)
- During training, we randomly mask entire columns (time steps)
- With 120 columns, there are 2^120 possible masking patterns per snippet
In our actual training data:
- 20 hours = 36,000 snippets × 2^120 possible maskings ≈ 10^40 possible training examples
- Training examples seen: 1M steps × 24 batch size = 24M
- Since 10^40 >> 24M, we effectively never see the same training example twice
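A small sketch of this arithmetic and of the column-masking idea is below. It is illustrative only: the actual VampNet masking schedule samples masking ratios rather than uniform random patterns, and the codebook size used here is an assumption.

```python
# Sketch of the masking arithmetic above and a random column mask over one
# token array. Illustrative only: the real VampNet schedule samples masking
# ratios rather than uniform random patterns, and the codebook size is assumed.
import numpy as np

ROWS, COLS = 14, 120                  # token array for one 2-second snippet
snippets = int(20 * 3600 / 2)         # 20 hours of 2-second snippets = 36,000
patterns = 2 ** COLS                  # possible column-mask patterns per snippet
seen = 1_000_000 * 24                 # 1M steps x batch size 24 = 24M examples

print(f"snippets: {snippets:,}")
print(f"possible masked examples: ~{snippets * patterns:.1e}")   # ~10^40
print(f"examples seen in training: {seen:,}")

rng = np.random.default_rng(0)
tokens = rng.integers(0, 1024, size=(ROWS, COLS))   # assumed codebook size
mask_cols = rng.random(COLS) < 0.5                  # columns (time steps) to mask
masked = np.where(mask_cols[None, :], -1, tokens)   # -1 marks masked positions
```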
During development, we subjectively evaluated model outputs every few thousand steps and observed a gradual shift from music-like sounds to authentic coda-like vocalizations. The model continued improving throughout the entire 1M steps without plateauing. If the reviewer believes quantifying this progression will significantly improve the paper, we can plot FAD scores versus training steps for the camera-ready version.
I appreciate this detailed reply, which solves my concerns during the first round of review.
To my knowledge, this paper contains no noticeable flaws from the machine learning perspective, which merits acceptance. I really appreciate the contributions the author brings to the community.
I would maintain this score of 4, mainly because the methods used in this paper are mostly from existing literature, and the findings in this paper can hardly be generalized to other fields beyond marine biology research.
best,
We thank the reviewers for their strong engagement and support. Three of four reviewers gave initial positive ratings (5,5,4,2), recognizing the work as "novel" and having "good potential impact for sperm whale understanding and broader bioacoustics." All reviewers provided excellent suggestions that significantly strengthened the paper's technical content, empirical results, and exposition. Key improvements:
- FAD with bioacoustic embeddings (fPAV): Their suggestion to use BirdNET embeddings revealed compelling results, namely, significant FAD improvements across input domains, providing even stronger validation than general audio embeddings.
- Domain adaptation ablations (VhbL): This insightful request uncovered crucial findings---without domain adaptation, sperm whale→sperm whale FAD degrades from 0.29 to 0.63, demonstrating its necessity (lower FAD is better).
- Enhanced baselines (fPAV): Added CLAP and BirdNET comparisons that better contextualize the downstream classification results.
- Scaled dataset training (fPAV): Committed to retraining domain adaptation with larger bioacoustic datasets (BirdSet/iNatSounds) for camera-ready.
- Productive discussion on technical novelty (e5Hs): We explained that WhAM is the first neural acoustic translation of any cetacean (and second of any animal), requiring a wholly novel evaluation framework and years of dataset curation (among other innovations). e5Hs's suggestion about semantic translation helped us clarify scope---semantic animal translation would be Nobel Prize-level work (von Frisch 1973), while our acoustic translation is at the current scientific frontier.
The paper proposes WhAM, a method that re-expresses arbitrary acoustic attributes as sperm whale vocalizations. It is worth noting that modeling whale calls is an established goal in animal communication research, but this work does not attempt to provide a biologically grounded account of meaning or semantics. Rather, the system serves as an experimental probe to study what structure can be recovered and how experts engage with machine-generated calls. Its contributions are thus methodological and exploratory, and offer a new tool (acoustic translation) for analysis, expert feedback, and representation testing.
Three of four reviewers leaned toward acceptance, and the discussion addressed many concerns productively. The main outstanding issues relate to limited methodological novelty and generality at the ML level, though the paper clearly articulates and situates its novelty at the application level. In my view, the conceptual challenges inherent in this type of research, combined with the potential of this work to inspire dialogue and future development, substantially outweigh those concerns. So, I view it as a clear accept.