Participants and data collection
The overall methods of this ongoing longitudinal study have been described previously51. In brief, between December 12, 2020, and January 15, 2024, we recruited and followed 164 adults diagnosed with BD type I or II in the outpatient mood clinics of two academic psychiatric hospitals in Canada: the Centre for Addiction and Mental Health (CAMH), Toronto, Ontario, and the Queen Elizabeth II (QEII) Health Sciences Centre, Halifax, Nova Scotia. This analysis focused on identifying specific descriptors that could be used to detect depressive episodes; we therefore excluded 31 participants who entered the study in a manic, hypomanic, or mixed episode.
The study was approved by the Research Ethics Board (REB) at the Centre for Addiction and Mental Health (CAMH), Toronto, Ontario, Canada, and at the QEII Health Sciences Centre, Halifax, Nova Scotia, Canada, in accordance with the Declaration of Helsinki (REB #: 059-2019). All participants signed an informed consent form approved by the local research ethics board before any research procedure was initiated. Primary diagnoses of BD I or II were established with the Structured Clinical Interview for DSM-5 (SCID-553), based on the diagnostic criteria of the Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-552). Participants completed a baseline assessment, including the gathering of their sociodemographic characteristics, clinical history, cardiovascular health, chronotype, and medication regimen. Polarity (euthymia, depression, or (hypo)mania) at baseline was defined based on the Young Mania Rating Scale (YMRS)54 and the Montgomery-Åsberg Depression Rating Scale (MADRS)55: euthymia was defined as both YMRS and MADRS scores < 10, depression as a MADRS score ≥ 10, and (hypo)mania as a YMRS score ≥ 10. All participants were treated by a psychiatrist affiliated with the study or by a community-based psychiatrist.
Participants were instructed to wear an Oura ring (Oura Health Oy, Generation 2, Oulu, Finland) continuously throughout the study duration, including during sleep, and to charge the device during periods of inactivity as needed. No fixed charging schedule or specific charging intervals were mandated, allowing participants to manage device battery life flexibly. The ring produces a variety of data related to activity and sleep56 (Fig. 1). As an incentive, participants received $180 upon completing their first year. For the second year, they could choose between an additional $180 payment or ownership of the Oura ring. Additionally, participants received a daily email with a link to an electronic visual analog scale (e-VAS) to self-rate their mood, energy, and anxiety levels over the previous 24 h. Each e-VAS component was scored in one-point increments ranging from 10 to 90, with participants instructed to consider “50” as their baseline (e.g., their “usual” mood, energy, or anxiety): ratings above 50 indicated levels higher than their usual, and ratings below 50 indicated levels lower than their usual. Reminders were sent to participants when e-VAS ratings were not completed for three consecutive days. Finally, participants also received a weekly email with a link to an electronic version of the 9-item Patient Health Questionnaire (PHQ-9)57 and the Altman Self-Rating Mania scale (ASRM)58 to rate their depressive or manic symptoms, respectively, during the past week. Daily e-VAS ratings and weekly symptom ratings were stored on a REDCap (Research Electronic Data Capture)59 secure database.

a Clinical course of two exemplary participants: one who experienced the onset of a major depressive episode and one who remained in euthymia. This panel illustrates the longitudinal variability in clinical polarity and highlights the differences between participants with and without the onset of a depressive episode. b Subjective and objective variables used to generate features: a priori selection of 3 out of 3 variables provided by participants, 1 out of 16 variables related to activity from the wearable, and 11 out of 23 variables related to sleep from the wearable. c Collection and aggregation of data over various time scales to create variables: subjective data provided by participants consist of three self-reported measures; objective data provided by the wearable consist of densely sampled data on activity and sleep.
To assess participant adherence to device-wear protocols, we computed non-wear time and missing-data ratios across modalities. Non-wear time was provided directly as part of the Oura ring activity data, and we computed the missing-data ratio (MDR) per modality (i.e., activity, sleep, scales) as the ratio of missing observations to the total number of expected observations over a participant’s study course. We also performed a sensitivity analysis comparing the extracted digital biomarkers across two subsamples created by dividing the full sample at the median participation length.
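As a minimal sketch, the MDR for one modality could be computed as follows; the daily-observation granularity, the helper name, and the example dates are illustrative assumptions, not the authors' code:

```python
from datetime import date


def missing_data_ratio(observed_dates, start, end):
    """Missing-data ratio (MDR): 1 - observed / expected observations,
    assuming one expected observation per day between the participant's
    study start and end dates (inclusive)."""
    expected = (end - start).days + 1
    observed = len({d for d in observed_dates if start <= d <= end})
    return 1.0 - observed / expected


# Hypothetical participant enrolled for 10 days with 8 days of data.
dates = [date(2021, 1, d) for d in (1, 2, 3, 5, 6, 7, 9, 10)]
mdr = missing_data_ratio(dates, date(2021, 1, 1), date(2021, 1, 10))  # 0.2
```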
Data generation: creation of variables (Fig. 1)
Figure 1 presents an overview of the clinical course and time scales for data generation. Based on the weekly PHQ-9, we used a 2-week rolling window to identify depressive episodes and the time of their onset: the onset of a depressive episode was defined as the start of at least two consecutive weeks with a PHQ-9 score ≥ 10. Previous studies have shown this approach to be both highly sensitive and specific60,61 for identifying major depressive episodes50 (Fig. 1a).
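Under this definition, episode onsets can be detected directly from the weekly PHQ-9 series. The sketch below is one plausible implementation (the function name and the example scores are hypothetical):

```python
import pandas as pd


def depressive_episode_onsets(phq9_weekly, threshold=10, min_weeks=2):
    """Indices of weeks that start a depressive episode: the first of at
    least `min_weeks` consecutive weekly PHQ-9 scores >= threshold."""
    above = (pd.Series(phq9_weekly) >= threshold).astype(int)
    # Week i starts an episode if weeks i .. i+min_weeks-1 are all above
    # threshold and week i-1 was below threshold (i.e., a fresh run).
    run_ok = above.rolling(min_weeks).min().shift(-(min_weeks - 1)).fillna(0).astype(bool)
    starts = run_ok & (above.shift(1, fill_value=0) == 0)
    return list(starts[starts].index)


# Two episodes: one starting at week 2, one at week 8; the isolated
# above-threshold score at week 6 does not qualify.
onsets = depressive_episode_onsets([4, 6, 12, 15, 11, 5, 13, 9, 14, 16])  # [2, 8]
```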
Among the 39 data streams generated by the Oura Ring, we selected a smaller set of 12 objective measures produced by the ring (1 out of 16 related to activity and 11 out of 23 related to sleep) (see Fig. 1b). For activity, all variables included in the analysis were derived from the 5-minute sampled Metabolic Equivalent of a Task (MET)62 produced by the ring and aggregated over one-hour periods, day times, evening times, night times, one-day periods, one-week periods, and one-month periods. For sleep, all variables included in the analysis were derived from: (i) the 5-minute hypnograms56 produced by the ring and aggregated over one-hour periods; and (ii) a set of 10 daily measures (awake duration, sleep-onset latency, restless sleep duration, light sleep duration, deep sleep duration, REM duration, total sleep duration, waking-up count, getting-up count, and sleep efficiency) aggregated over weeks and months. Similarly, daily e-VAS data were aggregated over weeks and months (Fig. 1c).
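The time-scale aggregation of the 5-minute MET series can be sketched with pandas as follows; the day-time (08:00-19:59) and night-time (00:00-05:59) windows are illustrative assumptions, as the paper does not specify the exact clock-time boundaries:

```python
import numpy as np
import pandas as pd

# Hypothetical 5-minute MET series over two days (values are illustrative).
rng = np.random.default_rng(0)
idx = pd.date_range("2021-01-01", periods=2 * 288, freq="5min")  # 288 samples/day
met = pd.Series(rng.uniform(0.9, 8.0, size=len(idx)), index=idx)

hourly = met.resample("1h").mean()                                  # one-hour periods
daily = met.resample("1D").mean()                                   # one-day periods
daytime = met.between_time("08:00", "19:59").resample("1D").mean()  # day times
night = met.between_time("00:00", "05:59").resample("1D").mean()    # night times
```

Weekly and monthly aggregates would follow the same pattern with `resample("1W")` and `resample("1M")`.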
Overall, considering the objective data produced by the ring, the subjective (e-VAS) data provided by the participants, and the various time periods, we created a set of 49 variables (see Supplementary Table 1).
Data processing: imputation of missing data and calculation of features (Fig. 2a, b)
Given our previous work showing that data in this sample were missing not at random (MNAR)63, we generated ten imputed datasets using the K-Nearest Neighbours (KNN) method64. KNN imputation identifies the k nearest neighbours of each missing data point based on a distance metric and imputes the missing value using the median of these neighbours. Because this method borrows information from similar observations, it is suitable for handling MNAR data. Each of the ten imputed datasets was generated with different yet plausible imputed values (Fig. 2a). We conducted the subsequent analyses within each of the ten imputed datasets and mean-aggregated the results to yield parameter estimates and standard errors.
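A minimal sketch of generating ten KNN-imputed datasets is shown below. Two assumptions deviate from the text: scikit-learn's `KNNImputer` averages neighbours rather than taking their median, and varying the number of neighbours across imputations is one plausible way to obtain distinct datasets (the paper does not state how the ten datasets were differentiated):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Synthetic stand-in for the participant-by-variable matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% missing values

# Ten distinct-but-plausible imputed datasets, obtained here by varying
# the number of neighbours (an illustrative assumption).
imputed_datasets = [KNNImputer(n_neighbors=k).fit_transform(X) for k in range(2, 12)]
```

Downstream analyses would then be run on each dataset and the resulting estimates mean-aggregated.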
a Variable preprocessing, missing data imputation, and feature extraction: This consists of the use of multiple imputations to generate 10 imputed datasets, followed by the extraction of the main statistical characteristics (i.e., features) from each variable, spanning time scales from five minutes to one month. b Variables across various time scales used to generate matrices of features: each variable with a temporal resolution from five minutes to a month is used to generate seven features. All the features of all participants are used to create two matrices, one corresponding to euthymic states and one corresponding to depressive states. These matrices serve as input for the binary classifier. c Classification and generation of performance metrics: Input features and confounding factors used in the ensemble-based binary classifier for distinguishing clinical polarity (euthymia vs. depression). This panel also illustrates the classifier’s performance metrics, including Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves for both the Decision Tree model and the XGBoost model. Performance is quantified by the Area Under the ROC Curve (AU-ROC) and the Area Under the Precision-Recall Curve (AU-PRC). d Feature ranking: SHAP summary plots are generated to show each feature’s average impact on the model’s predictions and class-specific impacts for the features extracted during euthymia and depression. The plots provide insights into the relative importance of each feature in driving the classifier’s outputs, enhancing model interpretability.
For each of the 133 participants, for each of the 49 variables, in each of the ten imputed datasets, we calculated seven features characterizing variability over the entire duration of participation while euthymic (or depressed, if applicable): the mean, coefficient of variation (CV), kurtosis, skewness, mean absolute differences (MAD), absolute sum of consecutive changes (ASCC), and autocorrelation (AC) (Fig. 2b; Supplementary Table 2). We selected these features to assess the statistical distribution, variability, and self-similarity of the variables. We computed the CV to capture relative variability (i.e., normalising variance by the mean) rather than absolute variability (i.e., changes in the measure). We explicitly distinguished between absolute and relative variability to avoid misinterpretation in cases where mean levels differ substantially across mood states. The calculations were performed across all available time periods (from 5 min to 1 month) to analyze both short- and long-term fluctuations. To assess the impact of various temporal resolutions, we extracted within-day activity features, within-night sleep features, and weekly and monthly features for activity, sleep, and all e-VAS domains using identical preprocessing steps, standardized variability descriptors, and the same evaluation pipeline. For example, for the daily e-VAS ratings of a participant who was euthymic for five months (153 days), became depressed for 2 months (61 days), and became euthymic again for six months (182 days), we would calculate the seven features listed above over the 335 days of euthymia (i.e., concatenating the two periods of euthymia) and the 61 days of depression.
For the same participant and the variable corresponding to monthly mood ratings, we would first aggregate the daily e-VAS ratings over the 13 months of participation and then calculate the seven features listed above over the 11 non-overlapping 30-day periods of euthymia (i.e., concatenating the two periods of euthymia) and the two non-overlapping 30-day periods of depression. In total, 343 features (49 variables × 7 features) were calculated as described above for each individual participant. See Supplementary Table 1 for the full list of data, variables, and features, and Supplementary Table 2 for the mathematical description and interpretation of the features.
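The seven features could be computed per variable as in the sketch below; the exact formulas (e.g., lag-1 autocorrelation, sample CV) are plausible reconstructions of Supplementary Table 2, not the authors' verified definitions:

```python
import numpy as np
from scipy import stats


def variability_features(x):
    """Seven features per variable: mean, coefficient of variation (CV),
    kurtosis, skewness, mean absolute differences (MAD), absolute sum of
    consecutive changes (ASCC), and lag-1 autocorrelation (AC)."""
    x = np.asarray(x, dtype=float)
    diffs = np.diff(x)
    return {
        "mean": x.mean(),
        "cv": x.std(ddof=1) / x.mean(),
        "kurtosis": stats.kurtosis(x),
        "skewness": stats.skew(x),
        "mad": np.mean(np.abs(diffs)),          # mean absolute differences
        "ascc": np.sum(np.abs(diffs)),          # absolute sum of consecutive changes
        "ac": np.corrcoef(x[:-1], x[1:])[0, 1],  # lag-1 autocorrelation
    }


# Hypothetical daily e-VAS mood ratings for one euthymic segment.
feats = variability_features([52, 48, 55, 60, 41, 50, 47])
```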
Classification model development and evaluation of model performance
The present analyses were designed to evaluate the association between specific features and concurrently assessed clinical polarities: the ML models were used to distinguish between depressive and euthymic data rather than to forecast future episodes. Accordingly, all reported classification performance metrics (e.g., AU-ROC, AU-PRC) reflect the ability of the features to discriminate between clinical polarities in the same data segment, consistent with an associative rather than temporal predictive framework. We created two matrices based on the features of all participants, the first corresponding to the euthymic states and the second to the depressed states; these served as input for ensemble-based supervised ML classifiers using Random Forest65, XGBoost66 (Extreme Gradient Boosting), Logistic Regression, or Support Vector Machines (SVM). Input data were split into training and test sets via a random shuffle 80/20 split stratified by binary state (euthymic = 0, depressed = 1). We evaluated model performance using a nested cross-validation (CV) framework to ensure unbiased estimation and robust feature selection. The outer CV consisted of 5 stratified folds, where in each fold, the data were split into training and held-out test sets. Within each outer fold, model optimization and feature selection were performed using an inner CV with 5 folds and randomized hyperparameter search (50 iterations). The models considered included Random Forest, Gradient Boosting, Logistic Regression, and Support Vector Machines, each combined with one of three feature selection methods: SelectKBest, Recursive Feature Elimination (RFE), and SelectFromModel; feature selection parameters (e.g., the number of features to select) were tuned within the inner CV.
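The nested cross-validation scheme can be sketched with scikit-learn as follows; a reduced search space and iteration count are used for brevity (the study used 50 randomized-search iterations and additional models and selectors), and the data are synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the feature matrices (euthymia = 0, depression = 1).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Inner CV: randomized search over hyperparameters and feature selection.
inner = RandomizedSearchCV(
    Pipeline([("select", SelectKBest()),
              ("clf", RandomForestClassifier(random_state=0))]),
    param_distributions={
        "select__k": [5, 10, 15],
        "clf__n_estimators": [50, 100],
        "clf__max_depth": [3, None],
    },
    n_iter=5, cv=StratifiedKFold(5), scoring="roc_auc", random_state=0,
)

# Outer CV: performance estimated on the held-out fold of each split only.
outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(5), scoring="roc_auc")
```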
Preprocessing included robust scaling and KNN imputation for missing values. We then handled feature outliers by applying interquartile range-based clipping, and the best model and feature subset from the inner CV were retrained on the entire outer training fold and evaluated on the held-out test fold. We assessed model performance by computing the area under the receiver operating characteristic curve (AUROC).
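One plausible ordering of these preprocessing steps is sketched below; the paper does not specify the exact order of clipping, imputation, and scaling, and the IQR factor of 1.5 is an assumption:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler


def iqr_clip(X, factor=1.5):
    """Clip each column to [Q1 - factor*IQR, Q3 + factor*IQR] (NaNs pass through)."""
    q1, q3 = np.nanpercentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    return np.clip(X, q1 - factor * iqr, q3 + factor * iqr)


# Synthetic feature matrix with an extreme outlier and missing values.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[0, 0] = 50.0                           # extreme outlier
X[rng.random(X.shape) < 0.05] = np.nan   # ~5% missing

# Clip outliers, impute, then robust-scale (one plausible ordering).
X_proc = RobustScaler().fit_transform(
    KNNImputer(n_neighbors=5).fit_transform(iqr_clip(X))
)
```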
To evaluate the stability and significance of our classification models, we performed label permutation testing. This involved randomly shuffling class labels and recalculating model performance (AU-ROC) across 5 cross-validation folds to generate a null distribution. The true-label model performance was then compared against this null distribution to assess whether observed results exceeded chance levels.
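A label-permutation test of this kind might look as follows, with fewer permutations than the study for brevity (scikit-learn's `permutation_test_score` offers a built-in alternative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=150, n_features=10, random_state=0)
cv = StratifiedKFold(5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# True-label performance (mean AU-ROC over 5 CV folds).
true_auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()

# Null distribution from shuffled labels (20 permutations for brevity).
rng = np.random.default_rng(0)
null_aucs = np.array([
    cross_val_score(clf, X, rng.permutation(y), cv=cv, scoring="roc_auc").mean()
    for _ in range(20)
])

# Empirical p-value: fraction of null runs at or above the true performance.
p_value = (np.sum(null_aucs >= true_auc) + 1) / (len(null_aucs) + 1)
```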
Model explainability: Permutation feature importance and robustness
To interpret the contribution of specific features to the classification model performance, we computed the SHapley Additive exPlanations (SHAP)67 values on the held-out test sets to provide model-agnostic, local explanations of feature contributions. SHAP values were aggregated across all outer folds to derive an overall feature importance ranking. This method provided insights into how each feature influenced the model’s classification performance. It helped identify clinically actionable digital biomarkers and uncover potential underlying mechanisms driving the model’s performance. We determined the significance of individual features by calculating their impact score (IS), derived as the average of the absolute SHAP values; the IS is a positive metric that quantifies the overall predictive power of each feature on the classification model’s output. Ideally, a highly impactful feature would exhibit a homogeneous impact distribution, with lower or higher values tilting towards the positive class (i.e., depressive episodes).
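The impact score reduces to a mean of absolute SHAP values per feature. The sketch below illustrates the aggregation on a mock SHAP matrix; in practice, the values would come from a SHAP explainer applied to each outer-fold test set:

```python
import numpy as np

# Mock SHAP value matrix: rows = test-set samples, columns = features.
# (In practice these would be produced by a SHAP explainer per outer fold;
# the per-column scales here are illustrative.)
rng = np.random.default_rng(0)
shap_values = rng.normal(scale=[0.5, 0.1, 0.3], size=(200, 3))

# Impact score (IS): mean absolute SHAP value per feature.
impact_scores = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(impact_scores)[::-1]  # most impactful feature first
```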
Moreover, to assess feature robustness and importance stability, we performed permutation importance analysis on the held-out test sets of each outer fold. For each fold, the best model was refit on 100 bootstrapped samples of the training data. For each bootstrap, permutation importance was computed on the held-out test set using 100 repeats to ensure stability of importance estimates. The stability metrics we calculated per feature included (i) feature permutation importance (ranked within the fixed feature scope selected by the best model), (ii) feature selection frequency (i.e., the proportion of bootstraps in which a feature ranked within the top k features; 5 for activity data and 10 for sleep and e-VAS data), and (iii) feature rank distribution. We aggregated feature ranks and importance statistics across bootstraps and outer folds to quantify feature robustness. This approach allowed us to identify features consistently important across resamples, enhancing confidence in their relevance.
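The bootstrap-based selection-frequency metric could be computed as sketched below; the bootstrap and repeat counts are reduced from the study's 100 for brevity, and the model, top-k value, and data are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

top_k, n_boot = 3, 20  # reduced from the study's 100 bootstraps / 100 repeats
rng = np.random.default_rng(0)
selection_counts = np.zeros(X.shape[1])

for _ in range(n_boot):
    idx = rng.integers(0, len(X_tr), len(X_tr))  # bootstrap resample of training data
    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr[idx], y_tr[idx])
    # Permutation importance computed on the held-out test set.
    imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
    top = np.argsort(imp.importances_mean)[::-1][:top_k]
    selection_counts[top] += 1

# Selection frequency: proportion of bootstraps ranking each feature in the top k.
selection_frequency = selection_counts / n_boot
```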
Statistical analysis
We implemented permutation testing (N = 100) to generate a null distribution of differences between euthymia and depression groups and bootstrapping (N = 1000) to estimate variability in test statistics, addressing the relatively small sample size. This resulted in a total of 100 × 1000 resampled datasets derived from the original imputed dataset. The Mann-Whitney U test was applied to evaluate differences between groups across resampled distributions, providing a robust assessment of statistical significance. To ensure the reliability of our statistical analyses, we required each clinical state to have a minimum of five data points. This threshold was chosen to improve the power for each bootstrap run of the Mann-Whitney U-test; we corrected for multiple testing of all 7 (Table 2) and top 10 (Tables 3 and 4) feature comparisons with a False Discovery Rate (FDR) correction using the Benjamini–Hochberg method68; statistical significance was assumed for FDR-corrected p-value < 0.05. We also computed Cohen’s d effect sizes to quantify differences between the euthymic and depressive states.
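The group comparison with Benjamini-Hochberg correction and Cohen's d can be sketched as follows; the data are synthetic and the helper implementations are plausible reconstructions, not the authors' code:

```python
import numpy as np
from scipy import stats


def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty_like(adjusted)
    out[order] = np.clip(adjusted, 0, 1)
    return out


def cohens_d(a, b):
    """Cohen's d with pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
                     / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled


# Synthetic euthymic vs depressive feature matrices (30 segments x 7 features).
rng = np.random.default_rng(0)
euthymia = rng.normal(0.0, 1.0, size=(30, 7))
depression = rng.normal(0.8, 1.0, size=(30, 7))

pvals = np.array([stats.mannwhitneyu(euthymia[:, j], depression[:, j]).pvalue
                  for j in range(7)])
significant = bh_fdr(pvals) < 0.05
effect_sizes = np.array([cohens_d(euthymia[:, j], depression[:, j]) for j in range(7)])
```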
Classifier performance metrics were evaluated on the held-out test set, indicating the mean (SD) from 10 imputed datasets (where relevant). We controlled for the effect of sociodemographic (e.g., age, sex) and clinical (e.g., pharmacotherapy) characteristics (see Table 1).
To help understand underlying differences driving the variation of features between euthymia and depressive episodes, we interpreted the ranking of the features by permutation importance, additionally contextualizing these metrics with the Mann-Whitney U-test statistic and the Cohen’s d effect sizes.
We also performed a sensitivity analysis by repeating the full analysis with and without imputation of missing data. We focused this sensitivity analysis on the daily e-VAS features that exhibited the highest classification performance and had the highest missing-data ratio (mean (SD), 16.71% (20.89%)). This was the most stringent test of whether the imputation process impacted our results. Classifier performance, extracted features, and permutation-based importance rankings were compared in both datasets to assess whether imputation meaningfully changed the results.
All analyses were conducted using a within-participant framework: for each participant, features were extracted separately from their depressive and euthymic segments, and statistical comparisons were performed within the same participant to minimize confounding by between-participant heterogeneity, which can be wide in mood disorders69,70. The reported group-level findings therefore represent aggregated within-participant contrasts: each participant contributed paired depressive and euthymic segments (when applicable), and no analyses involved direct comparison of raw variability values across different participants.
All statistical analyses were conducted using Python version 3.11.

