PaperHub
8.3/10
ICML 2025 Poster · 4 reviewers · ratings: 4, 4, 5, 4 (min 4, max 5, std 0.4)

Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

This work develops foundation models of behavioral signals from 2.5B hours of wearable data, achieving strong performance on 57 health tasks, excelling at behavior-driven predictions, and improving further when combined with sensor data.

Keywords

Wearables, Health applications, Foundation models

Reviews and Discussion

Review 1 (Rating: 4)

This work provides a foundation model for wearable devices built on behaviour data instead of raw sensor signals. The model was trained on a large-scale wearable dataset totalling over 2.5B hours of data from 162K individuals. The paper performs extensive experiments on the choice of tokeniser, model backbone and hyper-parameter optimisation. The best-performing model was evaluated on 57 tasks ranging from disease detection and health state monitoring (pregnancy) to behaviour monitoring (sleep). The model showed superior performance against a baseline trained on summary statistics of the selected behaviours.

Questions for Authors

Will the model weights and codebase be made open source?

Claims and Evidence

  • Strong performance of behaviour data for health detection: not clear. Although the authors show model performance across a rich set of downstream tasks, it is not clear how difficult each downstream task is, given that there is no competitive benchmark; details were also not provided on the case/non-case distributions, which could make the binary classification tasks very hard or very easy.
  • Integrating behavioural and sensor data: yes. When combining the proposed behaviour model with a PPG model, the authors showed that the integration increased performance on the majority of downstream tasks.
  • Developing a foundation model for wearables behaviour data with irregular sampling: yes, this is one of the first foundation models on behavioural time series data at this scale. However, the authors have not discussed sharing the model weights or codebase, which makes it much harder for others to reproduce the work.

Methods and Evaluation Criteria

This paper explored combinations of three tokenisers and three backbone model architectures:

  • Two tokeniser classes were selected: a dense one, and another in tuple form allowing a single input token for each behavioural measurement (see the sketch after this list)
  • The choices of model architecture were well-motivated, including standard transformers and a state-space model, i.e. Mamba-2
  • The choice of pre-training loss has already been shown to achieve good performance in other data modalities, e.g. PPG
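For concreteness, here is a minimal sketch of the two tokeniser styles as I understand them; the weekly 168-hour grid, the (variable, hour, value) triple format and the mean-imputation scheme are my assumptions, not details taken from the paper:

```python
import numpy as np

# Dense-style tokeniser: project every variable onto a fixed hourly grid so that
# each hour becomes one token of dimension n_vars. After imputation, sparsely
# sampled variables end up as near-constant columns (the concern raised below).
def dense_tokenize(measurements, n_hours=168, n_vars=27):
    grid = np.full((n_hours, n_vars), np.nan)
    for var_idx, hour, value in measurements:  # (variable, hour, value) triples
        grid[hour, var_idx] = value
    col_means = np.nan_to_num(np.nanmean(grid, axis=0))  # imputation scheme is a guess
    return np.where(np.isnan(grid), col_means, grid)     # (168, 27): 168 tokens

# Tuple-style tokeniser: one token per observed measurement, no imputation, so
# sparse variables contribute a few tokens rather than constant columns.
def tuple_tokenize(measurements):
    return [(var_idx, hour, value) for (var_idx, hour, value) in measurements]
```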

A key weakness of the method is in incorporating data modalities that are very sparsely sampled, particularly those with samples for ≤10% of the time, such as fall count, body mass index and 6-minute walk distance. Even though these metrics are highly relevant for health, treating them like time series requires strong motivation, since the model almost always sees a constant input, making it difficult to leverage temporal dynamics.

Your results in Table 10 paint the same picture: linear probing on the learnt embeddings yields an R² of 0 for body mass index and 0.096 for number of times fallen. So at least a portion of the input data with low sampling frequency and low variation could be removed to improve computational efficiency.

The evaluation is the area that requires further justification:

  • The disease labels come from self-report, which might under- or over-report different clinical outcomes, introducing differential bias into the ground truth. It would be good to discuss how this can be handled.
  • The case/non-case distributions are not shown for each disease, so it is hard to gauge the difficulty of each task. It would be great if the authors described the case/non-case distributions for each disease and how the non-case subjects were selected.
  • On the ascertainment of sleep ground truth, the authors did not explain what quality control was done on the sleep labels, for example removal of outliers, minimal wear time, etc. Without careful quality control, high measurement error could result, which can potentially explain why the PPG embedding has a low R² (0.1-0.3) against the sleep statistics.

Theoretical Claims

None

Experimental Design and Analysis

I checked the experimental setup for training and evaluation, which is reasonable for obtaining performance metrics with confidence intervals for comparisons between different models.

Supplementary Material

I’ve read most of the supplementary materials as lots of the results are in the supplement.

Relation to Broader Literature

The key contributions of this paper are:

  1. It explores the concept of behaviour-level modelling instead of lower-level sensor representations for health inference, which makes it easier to leverage longitudinal wearable time series. Previous foundation models for wearable signals have mainly targeted low-level sensor representations.
  2. It nicely paves the way for integrating behaviour-level information with additional models such as the PPG encoder, demonstrating how this multi-modal approach can aid health inference.

Essential References Not Discussed

None

Other Strengths and Weaknesses

Strengths:

  • The authors provide extensive and clear documentation of their definitions of the behaviour metrics and of the downstream task creation

Other Comments or Suggestions

  1. What sort of quality control did you apply to the input data for each modality?
  2. Can you provide a model card for your optimal model covering model size, number of layers and training config?
Author Response

Thank you for the positive feedback and helpful comments towards improving our work. We focus on responding to the major themes of your comments:

Measuring Difficulty of Downstream Tasks

We agree that it is important to contextualize the difficulty of the tasks. To address this, we included baseline statistics of the input data and an existing PPG foundation model as a strong baseline to contextualize the performance of WBM. The performance of these models shows that no task is particularly easy or hard, and the strong performance of WBM relative to these baselines helps show the efficacy of the proposed model.

Per your suggestion, we will include the case/non-case distribution for all binary outcome labels in the camera ready version of the paper to improve our evaluation report. However, importantly, prevalence does not immediately define the difficulty of the downstream task.

Using Highly Sparse Input Modalities

This is a key characteristic of an observational study such as AHMS, collected under real-world conditions where these variables are less often logged due to their nature and the behavior of the participants. It is true that, given the unique nature of these variables, it is harder for the model to leverage their temporal dynamics and changes, but our goal here is to model the data as-is while using state-of-the-art architectures such as state space models and Transformers that can leverage any potential temporal dynamics or inter-variable relationships. We experimented with tokenizers such as the "Tuple" tokenizer that do not perform any form of imputation (therefore no constant input for sparse variables); we observed degraded performance with the "Tuple" tokenizer in our hyperparameter search experiments in Appendix Tables 8 & 9.

We do find that the learned embeddings retain information for several variables that are natively sampled at a weekly frequency (e.g. six minute walk distance and walking steadiness score, see Table 10 in the Appendix). However, as you point out, there are some variables that we do not capture well in our embeddings. This could be due to their low prevalence or the use of the contrastive loss for training, which we will discuss further in the camera ready. As you suggest, we could remove such variables for computational efficiency, and we will add a line explaining this to the camera ready version of the paper.

Quality Control of Inputs and Labels

For input behavioral data, we do not do per-modality or per-variable quality control and we use the data as-is. However, when turning these behavioral data into week-level segments for training WBM, we perform the following cleaning steps: 1) z-score each variable, then clip any outliers to [-5, 5]; 2) drop weeks whose number of variables is in the bottom 5th percentile of all weeks; 3) drop weeks that have fewer than 5 days of data; 4) drop subjects with fewer than 5 weeks of usable data, or who were enrolled in the study for less than 90 days.
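As a concrete illustration, here is a minimal pandas sketch of these cleaning steps; the long-format layout and column names are illustrative rather than our actual pipeline, and the 90-day enrollment filter is omitted:

```python
import pandas as pd

def clean_weeks(df: pd.DataFrame) -> pd.DataFrame:
    # Expects long-format rows: subject_id, week, day, variable, value (names illustrative)
    # 1) z-score each variable, then clip outliers to [-5, 5]
    stats = df.groupby("variable")["value"].agg(["mean", "std"])
    df = df.join(stats, on="variable")
    df["value"] = ((df["value"] - df["mean"]) / df["std"]).clip(-5, 5)
    df = df.drop(columns=["mean", "std"])

    # 2) drop weeks whose variable count falls in the bottom 5th percentile
    n_vars = df.groupby(["subject_id", "week"])["variable"].nunique()
    keep = n_vars[n_vars >= n_vars.quantile(0.05)].index
    df = df.set_index(["subject_id", "week"]).loc[keep].reset_index()

    # 3) drop weeks with fewer than 5 days of data
    n_days = df.groupby(["subject_id", "week"])["day"].nunique()
    keep = n_days[n_days >= 5].index
    df = df.set_index(["subject_id", "week"]).loc[keep].reset_index()

    # 4) drop subjects with fewer than 5 usable weeks
    n_weeks = df.groupby("subject_id")["week"].nunique()
    return df[df["subject_id"].isin(n_weeks[n_weeks >= 5].index)]
```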

For segment-level labels of baseline history and medication, you bring up an important point: we are working with self-reported labels. In the camera ready version, we will discuss the caveat of parsing the labels from self-reported surveys (although for some of our evaluations we do use more rigorous lab measurements, e.g., diabetes).

Finally, we clarify the quality control for defining sleep labels. To obtain sleep labels, a participant must wear the watch overnight in order to get sleep metrics. We further limited to weeks where 5/7 days in the week had sleep metrics, meaning the watch was on overnight. This means that the PPG was recorded overnight as well, ensuring a fair comparison between both techniques.

We will be sure to clarify all of these points in the camera ready version.

Making Model Weights Open Source

Unfortunately the model weights cannot be shared due to the specifics of the informed consent for participants in the study. We will provide all necessary details in the camera ready and will put a note for interested parties to reach out to the authors for more details. However, for completeness, we will include a model card with the exact details of model size, number of layers, and other necessary training parameters in the camera ready.

Review 2 (Rating: 4)

This paper develops a foundation model for behavioural data from wearables to improve health predictions. The authors process 2.5 billion hours of wearable data and compare different tokenization strategies and model architectures. They find a Mamba-2 architecture with TST tokenization performs best. The model is tested on 57 health-related tasks including demographic prediction, disease classification and health state detection. Results show that the behavioural foundation model outperforms a statistical baseline on most tasks. The authors also compare WBM to a PPG foundation model, finding WBM performs better on some behavior-driven tasks like sleep prediction, while PPG excels at others. Combining WBM and PPG embeddings yields the best performance across most tasks, indicating complementary information between behavioral data and raw sensor signals.

Update after rebuttal

The authors acknowledged most of my concerns. There were two points I added further clarification on: A) the ways in which the signals differ between PPG and WBM, and B) the label leakage. I am well aware that these are monumentally difficult to correct for, but they nonetheless affect the interpretation of the results. A demographic classification model trained using pure physiological signals does not make the same claim as one using physiological signals plus demographic label conditioning, even if the two models perform at identical levels. However, these limitations are not reasons for rejection, and I have raised my recommendation from 3 to 4. I do urge the authors to discuss these limitations in the paper, as they have done in their most recent response.

Questions for Authors

  • L294 How is the decision to fit to week or participant made?
  • Are some of the features estimated using the individual’s age and biological sex? What implications might that have?
  • How are WBM and PPG combined?
  • What is the computational cost of training and inference for the WBM model compared to traditional approaches?
  • Were any analyses done to model performance across different demographic groups to assess potential biases?
  • Have there been any tests on the model's robustness to different wearing patterns and compliance levels?
  • What privacy-preserving techniques could be implemented alongside this approach for real-world deployment? How might they affect performance?

Claims and Evidence

  • The main claim is substantiated with good results of the behaviour model.
  • The claim of the model learning from behaviour might be overstated (see methods).
  • The smaller claims based on comparison of “sensor data” vs “behaviour data” do not appear to be backed up sufficiently.
  • The paper presents a very optimistic narrative about WBM's performance, but the results reveal that WBM only outperforms the PPG model in 18 out of 47 baseline disease and medication outcomes, and of these, only 4 results are statistically significant. A more accurate framing would acknowledge that behavioral data provides valuable information for specific types of tasks (e.g. sleep and mobility), while low-level sensor data appears more broadly effective across the majority of tasks.
  • Related to the above, while WBM outperforms the simple baseline on a majority of tasks, the performance improvements are quite modest (median AUROC improvement of only 0.017). This modest gain raises questions about whether the additional complexity of foundation models is justified over simpler statistical approaches for many applications, particularly when it comes to interpretability, sensitivity and failure modes. These tradeoffs are not discussed sufficiently.

Methods and Evaluation Criteria

  • The WBM is pre-trained on 27 variables from the wearable device, an Apple Watch, and evaluated on predicting age and biological sex. These variables include estimated active energy (calories burned) and basal energy, which can be problematic. I am fairly certain the Apple Watch uses the individual's age and biological sex directly to estimate these values, causing label leakage. This would mean the model is trained to predict age and biological sex not only from pure behaviour signal, but from values that are conditioned on age and biological sex. This could also be the case for VO2max to a lesser extent, which is typically calculated using the person's estimated max heart rate, in turn a direct function of the person's age. Although label leakage is almost impossible to avoid in real-world healthcare research, I urge the authors to check whether it can be minimised further or, at least, better acknowledged.
  • The approach used for combining WBM and PPG is not discussed.

Theoretical Claims

The theoretical claims about the complementary nature of behaviour and signal data are mostly supported with the combination of the two models. However this theoretical aspect is not a major claim of the paper.

Experimental Design and Analysis

  • The authors make claims about the signal strength of behaviour vs low-level sensor signals based on comparisons of WBM and PPG. However, these approaches differ fundamentally in more than just signal type: the sampling frequency, data processing and model architectures all appear very different. So this seems to be as much a comparison of sampling frequency and architecture as of data signals. A better way to support these claims would have been to ablate the WBM by removing behaviour features and leaving in the "low level" features, e.g. heart rate.
  • There is possible selection bias in the data representativeness, i.e. limited to Apple watch users. This limitation isn’t sufficiently discussed.

Supplementary Material

No separate supplementary material. Appendix includes various implementation details.

Relation to Broader Literature

This is a weak point of the paper as the discussion of the results in broader context is limited. The discussion does not place the findings in the context of other digital health interventions or wearable technologies beyond a narrow set of foundation models. The authors miss opportunities to connect their work to broader healthcare trends, the potential clinical impact of wearable-based health predictions or how these models might integrate with existing healthcare systems.

Essential References Not Discussed

  • There are not many prior works in this field apart from Merrill and Althoff (2023), already cited, and an earlier SSL wearable paper that also uses behaviour signals: Kolbeinsson et al., "Self-supervision of wearable sensors time-series data for influenza detection" (2021).
  • If the combination of WBM and PPG is ensemble-style, then some references to prior work on ensemble models in health would be appropriate.
  • It would also be clearer to move the seminal citations for rotary transformers and mamba to directly after the paragraph headers where they are named.

Other Strengths and Weaknesses

Strengths:

  • The systematic approach to architecture selection is thorough and usually well-justified
  • The dataset size is impressive and allows for robust foundation model development
  • The diversity of downstream tasks provides a comprehensive evaluation framework
  • The behaviour model shows particular promise for sleep and mobility predictions

Weaknesses:

  • Limited discussion of computational requirements and model efficiency
  • Lack of evaluation across different demographic groups to assess fairness
  • The model's interpretability is not discussed, which is important for healthcare applications
  • No discussion of how the model might perform on non-Apple devices with different sensors
  • Limited exploration of more sophisticated fusion techniques when combining WBM and PPG

Other Comments or Suggestions

  • On line 216, "1-layer multi-layer perceptron": does this mean a single-layer perceptron or an MLP with one hidden layer? The current phrasing is a bit clumsy
  • The paper switches between “wearables data” and “wearable data”, the former seems more semantically correct but I do not have a preference as long as it is consistent.
  • L169 “Hourly aggregation ensures consistency across variables” using “ensures” here seems overclaimed, as it is not guaranteed. “Supports” or “promotes” might be more accurate.
  • L295 The split is 80/20 train/test. Should this be 80/10/10 to match with L183?
  • L205 TST is not properly defined and citation could be clearer
  • Caption for table 1 and table 2: WHB → WBM
Author Response

Thank you for your positive feedback and suggestions aimed towards improving our work!

Contextualizing Comparisons between WBM and Baseline/PPG

We appreciate your feedback on tempering our claims.

First, we clarify the WBM vs baseline comparison. The subject-level tasks in Figure 3 are intentionally simple, as they aggregate a subject’s full history. Basic aggregate feature statistics and demographics may perform well, explaining the small median improvement of WBM. However, WBM significantly outperforms the baseline in a few key cases (e.g. smoking status and anti-psychotics usage). Its real strength lies in the more difficult time-varying tasks, where it consistently surpasses the baseline in detecting changes in health state on all tasks.

Next, we emphasize that WBM and PPG are complementary. WBM excels in some tasks (e.g., sleep duration and infection), while PPG is stronger in others (e.g., diabetes). However, combining both achieves the best subject-level performance in 42/47 tasks (with a majority being significant), and in all but 1 of the segment-level tasks (where it’s within margin of error). Behavior data should complement, not replace, sensor data when building prediction models from wearables.

We will edit the language in the camera ready based on your feedback and our response above to better clarify our contributions.

Combining WBM and PPG Representations

We apologize that this was unclear. We will clarify in the camera ready that we combined WBM and PPG embeddings by concatenating the two 256D embedding vectors into one 512D embedding vector. There are many better ways to build multimodal representations using fusion techniques either at the input or representation level that we did not explore. We will add a discussion of these as future work in the camera ready.
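As a minimal sketch of this fusion (array names, file paths, and the use of a scikit-learn logistic probe are illustrative, not our exact pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

wbm_emb = np.load("wbm_embeddings.npy")   # (N, 256) WBM embeddings; paths illustrative
ppg_emb = np.load("ppg_embeddings.npy")   # (N, 256) PPG embeddings
labels = np.load("labels.npy")            # (N,) binary downstream labels

fused = np.concatenate([wbm_emb, ppg_emb], axis=1)            # (N, 512) fused embedding
probe = LogisticRegression(max_iter=1000).fit(fused, labels)  # linear probe on top
```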

Relation to Broader Wearable Community

Thanks for raising this point. We will improve the discussion by connecting our work with the broader space of digital health and wearables, and mention the potential clinical impact such wearables-based health predictions might have in the future if safely deployed at-scale.

Label Leakage in Downstream Tasks

This is a subtle but important point, as label leakage is a major challenge in building foundation models, and you are correct that a small number of our input variables (e.g. basal energy) rely on age and sex as inputs. However, we clarify that age and sex prediction are not meant to showcase the value of our model. We view these tasks as sanity checks that our model is able to encode information that we already expect should be partially available. We will make this caveat more clear in the camera ready. We will also emphasize the importance of the other tasks, especially the segment-level tasks as mentioned above. Label leakage should not be a major concern for the 55 other downstream tasks, as none of those labels are used as part of the input variables.

Computational Cost of Training and Inference

The final WBM was the result of 6 epochs of training which took 16 hours of training time on 8 A100 GPUs. The learned model can quickly perform inference, and embeddings can be used easily across many tasks. We will add these details in the camera ready.

Robustness to non-Apple devices and other wearing patterns

You bring up an important point about generalizing to non-Apple devices and other wearing patterns. During training, we opted to remove weeks with low wear time, so we expect performance will degrade when applied to participant weeks with limited wear time. Evaluations on non-Apple devices are not possible, as much of the data can only be collected on Apple devices (particularly behavioral features derived via proprietary algorithms). However, our training details and insights provide a useful framework for others to train models on other wearables. We will discuss these limitations in the camera ready.

Typos and Writing Suggestions

We appreciate the feedback on typos, writing and citation improvements; we will fix these in the camera ready. We clarify that “1-layer multi-layer perceptron” means there is one hidden layer in addition to the input/output layers.

Miscellaneous Responses

Thank you for bringing up important topics regarding interpretability, selection bias, and evaluation across different demographic groups. Due to space constraints, we point you to responses to R1 and R2 respectively for these topics.

We will also clarify in the camera ready version that the choice to fit models weekly vs. at a participant-level was made based on the task. For time-varying tasks where we make predictions at every week, we fit models at a weekly level. For static targets, we fit models at a participant level.

Reviewer Comment

I thank the authors for their response. I appreciate their willingness to address my concerns and their commitment to improving the paper. The additional details provide important practical context that strengthen the paper.

However, I have two points to reiterate.

A) On the comparison between WBM and PPG, this conflates differences in data signals with differences in sampling frequency, data processing and model architectures. This makes it difficult to isolate whether performance differences truly reflect the relative importance of behavioural versus low-level sensor data or simply approach variations.

B) On label leakage, I agree that this is not a critical issue and primarily affects, and devalues to a small extent, the sanity checks. However, it does also affect the interpretation of the main tasks. With the foundation model conditioned on demographic labels (age, sex), it will be able to learn these more directly than a model that only has access to behavioural data. Although behavioural data is often heavily influenced by demography, the signal will be very different from one provided by direct demographic labels/conditions. This means the model is not trained on behaviour signals, with their natural demographic influences, alone, but also on indirect age and sex metadata.

With the extensive changes the authors have promised, I will not let these two points prevent me from raising my recommendation from 3 → 4. However, I do urge the authors to mention these two limitations in the paper to better place the results in context.

Author Comment

Thank you so much for the thoughtful response, your willingness to engage with us, and for increasing your score! We appreciate that you are helping us produce a much stronger final paper. A few last comments below, and we’ll add a discussion of these limitations to the camera ready.

Re A — one point we would emphasize is that a major part of the difference in the data signals for PPG vs WBM is the difference in the native sampling frequency of these modeled quantities. E.g., PPG is generally observed at 64Hz for 60-second intervals, whereas the health/behavior data for WBM has sampling frequencies that vary from every few minutes (e.g. heart rate, step count, active energy burned) to daily or weekly measurements, which we then project onto a fixed hourly grid. This makes it near impossible to disentangle the effect of differences in the underlying quantities being measured vs differences in the sampling frequencies. Another important difference we will mention is that the quantities modeled by WBM cover most periods of time during the week, whereas PPG is only opportunistically captured a handful of times during the day, depending on how often someone wears a watch and is at rest. We also used different data processing and design decisions for each data type, and in this work we only used a frozen pre-trained PPG encoder and did not explore the same architectures used for WBM (e.g. Mamba-2). We will mention these points in our discussion.

Re B — one point that may help clarify the role that age/sex play in our modeling is to consider, as a thought experiment, what might happen if we had access to gold-standard reference values for some of the health/behavioral quantities that strongly depend on demographics. Take VO2max as an example — it is well known that VO2max declines with age, and tends to be lower for females than for males. The FRIEND study (https://www.mayoclinicproceedings.org/article/S0025-6196(15)00642-4/pdf) provides useful population distributions for VO2max by age/sex subgroups — for instance, the median VO2max for age 20-29 males is 48, whereas for 70-79 females it is 18.3. In fact, the upper 95th percentile for females 70-79 (24.1) is still lower than the lower 5th percentile for 20-29 males (29), so there is near perfect separability for VO2max between older females and younger males. In order to provide as accurate an estimate of VO2max from submaximal exercise data as possible, Apple Watch uses demographics as input to the VO2max algorithm, but even if we were to use gold-standard, invasively collected VO2max values, this strong demographic signal would still exist. In either case, using our wearables-derived and demographics-conditioned estimate of VO2max or the gold-standard value, we would expect learned representations of either data type to be strongly predictive of demographics, although they might have different performances and make different errors. We would also expect to see similar issues for other health/behavioral variables that are estimated using demographics as an explicit input (e.g. basal and active energy). To be clear — we agree that this is a super important point, and we’ll add it to our limitations in the discussion! We wanted to point out that there is no simple solution here, as many different health and behavioral quantities that we can collect via wearables will have strong correlations to underlying demographics; the distinction is that only some of the time are demographics explicitly used as inputs to estimate these quantities in the first place.

Review 3 (Rating: 5)

This manuscript considers the problem of health condition tracking using pretrained foundation models trained on the Apple Heart and Movement Study dataset. In contrast to past work that used raw sensor signals from PPG and ECG, they leverage higher-level 'behavioral' metrics extracted from IMU (e.g. steps), user input (BMI) or intermittent sampling (VO2Max). They survey several architectures, noting special challenges in irregularly sampled data. Following the architecture comparison, they use a dense matrix of features per hour, passed through a bidirectional Mamba-2 and trained with a contrastive loss using pairs from the same user as positive samples. They examine demographic classification tasks like age and biological sex, inter-subject tasks predicting health states, and intra-subject classification tasks, showing impressive performance that is fairly competitive with PPG on most tasks using linear probing. They also examine combinations with PPG, and discuss discrepancies.

Questions for Authors

Minor

  • L365 "PPG will not provide the same holistic view of an individual's week, since it is only captured opportunistically a few times each day": how frequently is PPG captured? And if it is captured this rarely, why is the PPG readout of deep sleep so good while overall sleep is so poor?

  • Statistical significance of the different comparisons, e.g. classifier performance vs PPG, should be listed per comparison.

  • Ablations over feature importance: given the uneven coverage, these would be valuable to identify the simplest set of features to construct, and may obviate the need for masking.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

N/A

Experimental Design and Analysis

Yes

Supplementary Material

Yes, read relevant sections.

Relation to Broader Literature

Appropriate work is cited and related, see below.

Essential References Not Discussed

The work is well contextualized.

Other Strengths and Weaknesses

Strengths:

  • The manuscript is polished and clear, with a full description and documentation of experimental details and methods and clear interpretation of the data
  • There are essential baselines included (including a null baseline and fairly SOTA PPG), multiple interesting tasks considered, and ablations.
  • The approach provides fairly large advantages over a baseline in many tasks.

Weaknesses:

Overall I think the manuscript is a strong accept, but I am offering some directions that would improve its utility to myself and the field.

  • Scaling laws of performance with amount of pretraining data would be helpful.
  • The importance of the 27 features presented is unclear, and they vary quite a bit in their missingness. It would help to delineate which were the most important for the prediction. The R2 from model reconstruction is a start, but not a full interpretability analysis.
  • Clarifying the statistical significance of results. Comparisons in Figure 3 for instance are presented without error bars, and throughout the differences in models are so small that significance should be conveyed. It is also unclear to me how the bootstrap was calculated specifically.
  • For the sleep metrics in particular, my understanding is that these are produced by an algorithm consuming the same information as behavior/PPG, except perhaps a raw version of the IMU. Can you comment on how including raw IMU would impact these results? This sensor stream is conspicuously absent from all of these papers.

Other Comments or Suggestions

  • L172 “Driven by our goal of detecting health states at a temporal resolution of human behavior” unclear what behavior means here. Behavior is overloaded throughout the manuscript.

  • As the authors point out, contrastive pairs from the same user don't necessarily make the most sense for this task, especially for intra-subject tests like sleep staging. Are there other pretraining tasks they can propose that might be useful?

Author Response

Thank you for your positive feedback and useful comments for enhancing our work. We respond to specific suggestions below:

Interpretability of WBM

Interpretability of foundation models is an active area of research that remains extremely important. Unfortunately, it remains non-trivial to understand how input features affect the learned representation in order to ascertain feature importance for any given downstream task. As you suggest, one technique might involve independently perturbing each input sensor to understand its effect on the learned representation. However, understanding the correct way to perturb these irregular data in a meaningful and scalable way remains an important open problem. We will discuss the importance of interpretability and some potential next steps in the camera ready version of the paper.

Sleep Metrics and IMU

When evaluating using sleep metric labels, our goal was to showcase one example where we expect behavior data to be much more predictive than PPG. Sleep metrics are only estimated when a subject wears their watch overnight, ensuring that we have some amount of PPG overnight alongside the other behavior data fields. In general, a passive measurement of PPG is attempted roughly every 2 hours for most subjects, and the measurement is only retained if the subject is sufficiently quiescent ensuring that the PPG data has low noise.

The sleep metrics on the Apple Watch are derived only from a continuous stream of IMU (3-axis accelerometer) during a sleep session, and PPG/behavior is not used. Processing such continuous IMU streams involves the use of complex data pipelines that were not available to us; the volume of such data would make it impossible to scale to using most days and subjects from across the study. Therefore, including such IMU data in our modeling was out of scope for our study. However, given that sleep labels are derived from IMU, we expect markedly stronger predictions if we include IMU in the input of our models.

PPG prediction of deep sleep & "PPG will not provide the same holistic view of an individual’s week" phrasing

It is generally the case that total sleep duration, sleep efficiency, and deep sleep in particular decrease with age. Since PPG (collected roughly every 2 hours - see above) contains strong age-related signals, we would expect that it should be able to leverage such information to make decent predictions about average sleep metrics for an individual. Note that the baseline model (which explicitly includes demographics) also performs better on deep sleep prediction, suggesting that demographics plays a more important role. We will rephrase this to “PPG does not provide as comprehensive a view of an individual’s week, since it is only measured a few times each day”. We will also add additional clarification and caveats around the sleep analyses.

Improvements on Contrastive Learning Framework

We agree that the contrastive framework could be improved upon. We explored the use of a masked auto encoder approach, but found that this resulted in poor performance (see Appendix A.5.3). We hypothesize that this may be due to the high degree of noise and irregularity in the behavior data, making complete reconstruction of the input an overly challenging task that leads to representations that do not generalize well to new tasks. We will expand on this hypothesis further in the camera ready, as well as discuss other techniques that future work could consider adapting to this type of data to improve upon our framework such as joint-embedding predictive architectures (JEPA). Even though our contrastive learning approach is not set up to capture intra-subject changes, empirically based on our results it still has some ability to do so.
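For concreteness, here is a minimal PyTorch sketch of the kind of subject-level contrastive objective used here, where row i of the two batches embeds two different weeks from the same participant; the InfoNCE form shown and the temperature value are illustrative assumptions rather than our exact loss:

```python
import torch
import torch.nn.functional as F

def subject_contrastive_loss(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of paired weeks from the same B participants.
    Row i of z1 and row i of z2 form a positive pair; all other rows in the
    batch act as negatives (temperature value is an assumption)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                   # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)            # pull positive pairs together
```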

Clarifying “Human Behavior” Throughout

Thank you for finding this unclear sentence, we will rephrase in the camera ready. We agree that the use of the term “behavior” may be confusing — we will be sure to carefully go through the manuscript and only use behavior in the intended use-case (i.e., when discussing behavior data) and avoid overloading the term.

Details on Statistical Significance and Bootstrap Performance

Thank you for the great suggestion to include bootstrap CIs and p-values in the manuscript; we will add these in the camera ready. We also clarify how we calculate bootstrap confidence intervals: we resample the test set 1,000 times and recompute performance metrics on each resampled test set for each method. The confidence intervals and p-values are then computed empirically on this bootstrapped set of performance metrics.
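A minimal sketch of that procedure (the AUROC metric and the function signature are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Resample the test set with replacement and recompute the metric each time,
    then take empirical quantiles of the resampled metrics as the CI."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample indices with replacement
        if len(np.unique(y_true[idx])) < 2:         # skip degenerate one-class resamples
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```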

Review 4 (Rating: 4)

This paper proposes WBM, a foundation model trained on wearables dataset to improve health predictions. The paper states that behavioral signals including physical activity and mobility metrics align better with physiologically relevant timescales than raw sensor data. The proposed model is trained on over 2.5 billion hours of data from 162k individuals, and is evaluated across 57 health-related tasks. The results suggest that the proposed model has improved performance on behavior-driven tasks like sleep prediction compared to existing models based on raw sensor data.

Questions for Authors

How does WBM compare against state-of-the-art transformer models? Have you analyzed potential biases in the dataset, particularly in terms of demographic representation? How does the model's performance vary across different demographic subgroups?

Claims and Evidence

The claims made in the submission, such as that behavioral data provides valuable insights into health conditions beyond raw sensor data and that the proposed model outperforms baselines on various health detection tasks, are supported by experimental results. However, the reason Mamba-2 was chosen as the backbone for behavioral data modeling is not stated, nor is it rigorously compared against other deep learning architectures.

Methods and Evaluation Criteria

The methods are well-described in Sections 3-5, including dataset preprocessing, model architecture, and evaluation metrics. Evaluation is performed on a broad set of tasks. Inclusion of more baseline models would improve the rigor of the evaluation.

Theoretical Claims

No theoretical claims in this paper. The parameter details in appendix look solid.

Experimental Design and Analysis

The experiments provide reasonable baselines and comparisons. The evaluation on 57 downstream tasks is impressive, but some analyses lack deeper breakdowns on task-specific performance.

Supplementary Material

I reviewed the appendix. It contains details on dataset, model architecture, and additional results. The pretraining loss details and ablation studies are well-documented.

Relation to Broader Literature

The work contributes to wearable-based health monitoring and references important works in this field.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Major contributions include a systematic evaluation across 57 health tasks with large datasets, and a clear description of the model architecture and experiments. Comparison to SOTA deep learning models beyond Mamba-2 is limited.

Other Comments or Suggestions

The paper is well-written and has clear and meaningful visuals.

Author Response

Thanks for your positive comments and constructive feedback to help us improve this work! We focus on responding to the major themes of your comments:

Choice of Mamba-2 and comparison to SOTA deep learning models:

This is an important point to clarify. As stated in Sections 4.2 and 4.3 and Appendix Section A.5.1, we compared the Mamba-2 model architecture with two alternative architectures: a self-attention Transformer and a Rotary Transformer. These are among the most commonly used state-of-the-art Transformer architectures across domains. In addition, for a fair comparison, we ran a full grid search over these 3 architectures and over 3 different tokenizers, as well as sweeping other hyperparameters. Within our hyperparameter search experiment, we observed that Mamba-2 with the TST tokenizer generally outperformed the alternatives, including the two Transformer architectures. Refer to Table 9 in Appendix Section A.5.2 for a full comparison of how often TST+Mamba-2 achieved the best performance compared to the other 8 architecture+tokenizer combinations.
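Schematically, the search can be pictured as in the following sketch; build_model, train, and evaluate are hypothetical stand-ins (not our actual code), and the hyperparameter values shown are illustrative:

```python
from itertools import product

def build_model(tokenizer, backbone):  # stub standing in for real model construction
    return {"tokenizer": tokenizer, "backbone": backbone}

def train(model, lr):                  # stub: the real step runs contrastive pretraining
    pass

def evaluate(model):                   # stub: the real step scores linear probes
    return 0.0

tokenizers = ["dense", "TST", "tuple"]                       # 3 tokenizers
backbones = ["transformer", "rotary_transformer", "mamba2"]  # 3 architectures
lrs = [1e-4, 3e-4]                                           # stand-in hyperparameter sweep

results = {}
for tok, bb, lr in product(tokenizers, backbones, lrs):
    model = build_model(tokenizer=tok, backbone=bb)
    train(model, lr=lr)
    results[(tok, bb, lr)] = evaluate(model)

best_config = max(results, key=results.get)  # best-scoring combination
```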

Inclusion of more baseline models would improve the rigor of the evaluation:

We have included comparisons of WBM with baseline architectures/tokenizers, and with a competitive PPG baseline. In addition, we have discussed a baseline comparison with respect to other pre-training methods such as masked autoencoding (Narayanswamy 2024) in Appendix A.5.3, as well as a simple baseline that reduces the behavioral data to its mean and standard deviation statistics. However, we agree that including more baselines could improve our evaluations, and we will discuss this as a caveat in the camera ready version. We welcome any feedback on specific baselines that you feel would particularly strengthen our work.

Potential demographic biases and performance in demographic subgroups:

We agree that characterizing demographic biases is essential for health applications. The Apple Heart and Movement Study has its own limitations and biases (Truslow et al. 2024); for example, the dataset skews towards a younger male population. However, given the large scale of this study compared to other studies, our models are still trained and evaluated on a large cohort including participants from diverse demographics. In the camera ready version of the paper, we will include distributions of demographic statistics from our pre-training data. We will also add to the appendix the performance of our models within demographic subgroups on a representative subset of the full set of tasks considered, focusing on tasks where the combination of WBM+PPG performs best. Specifically, we will show demographic subgroup performance on a representative set of targets: the heart failure, active smoker, and calcium-channel blocker baseline tasks, as well as the pregnancy and infection tasks. We will add some discussion of potential fairness concerns and demographic biases to the camera ready version of the paper, noting that a complete fairness investigation into our final models was out of scope for this work.

Final Decision

The paper received ratings of Accept, Strong Accept, Accept, Accept. All reviewers praised the paper and its contributions. While some claims required softening, and limitations around demographic conditioning and generalizability had to be acknowledged, the discussion phase successfully led to a consensus. Some reviewers raised concerns, both in their discussions with the authors and with each other and the AC, regarding the dataset, particularly its potential demographic biases. A key point of concern was the lack of public availability of the dataset and the absence of released code or model weights; open-sourcing the model weights and at least a subset of the dataset is strongly urged. Based on the strengths of the work, I recommend that the paper be accepted, but urge the authors to find a long-term sustainable solution to this problem.