Self-Supervised Learning Improves Accuracy and Data Efficiency for IMU-Based Ground Reaction Force Estimation

Objective: Recent deep learning techniques hold promise to enable IMU-driven kinetic assessment; however, they require large extents of ground reaction force (GRF) data to serve as labels for supervised model training. We thus propose using existing self-supervised learning (SSL) techniques to leverage large IMU datasets to pre-train deep learning models, which can improve the accuracy and data efficiency of IMU-based GRF estimation. Methods: We performed SSL by masking a random portion of the input IMU data and training a transformer model to reconstruct the masked portion. We systematically compared a series of masking ratios across three pre-training datasets that included real IMU data, synthetic IMU data, or a combination of the two. Finally, we built models that used pre-training and labeled data to estimate GRF during three prediction tasks: overground walking, treadmill walking, and drop landing. Results: When using the same amount of labeled data, SSL pre-training significantly improved the accuracy of 3-axis GRF estimation during walking compared to baseline models trained by conventional supervised learning. Fine-tuning SSL model with 1–10% of walking data yielded comparable accuracy to training baseline model with 100% of walking data. The optimal masking ratio for SSL is 6.25–12.5%. Conclusion: SSL leveraged large real and synthetic IMU datasets to increase the accuracy and data efficiency of deep-learning-based GRF estimation, reducing the need for labeled data. Significance: This work, with its open-source code and models, may unlock broader use cases of IMU-driven kinetic assessment by mitigating the scarcity of GRF measurements in practical applications.

Despite the value of estimating kinetics, measurement of GRF and joint moments typically requires gold standard measurements from force plates and marker-based motion capture.However, these devices are expensive, require trained personnel to operate, and confine the measurement environment.Consequently, such challenges associated with acquiring kinetic data limit the accessibility of biomechanical assessment and prevent translation of biomechanical studies to larger and diverse populations.
Using wearable sensors to monitor kinetic parameters in natural environments represents a promising opportunity for moving disease characterization, prevention, and rehabilitation assessment into real-world settings [7].Inertial measurement units (IMUs) are a practical choice for out-of-lab assessment due to their low cost and small form factor. IMUs typically measure acceleration and angular velocity (and magnetic field in some scenarios), which can directly be used to derive body segment orientation for subsequent kinematic estimation.
Prior studies have combined physics-based models and optimization methods to estimate kinetics from IMU data [8]- [10].However, optimization methods require a relatively high computation time and depend on task-specific movement objectives (e.g., minimizing energy, maximizing speed) [11], [12].Apart from physics-based models, other studies have attempted to employ deep learning models that directly map IMU data to kinetic parameters in an end-to-end manner [13]- [15].To train these models, previous studies have collected synchronized IMU, GRF, and marker-based motion capture data during hundreds of minutes of walking and running on treadmills equipped with force plates.The GRF and marker data then serve as "labels" to train models using conventional supervised learning.However, collecting these "labels" in large quantities to train accurate deep learning models is not always feasible for motions beyond treadmill gait.For overground walking, since GRF data is typically collected via one to three floor-embedded force plates, each trial often yields only one or several steps with "labels".The scarcity of data also holds true for a variety of other motions including sprinting, cutting, and jumping [16].
b Each window is a model input segmented from original trials.
data [20] and distribution shifts where the distribution of model training data differs from that of testing data [21].SSL aims to pre-train deep learning models by designing model pretraining tasks that are solely based on "unlabeled" data.One widely used pre-training task is masking and reconstruction, where a random portion of the input data is masked and the model is trained to learn the data representation by reconstructing the masked portion.When limited "labeled" data is available for a downstream prediction task of interest, representations learned from pre-trained models can enhance the training process, and lead to better performance than conventional supervised learning with randomly initialized parameters.SSL with masking and reconstruction has facilitated the training of powerful language models [17], [22], vision models [18], and multi-modal models [19].However, it is unknown whether SSL with masking and reconstruction can learn IMU representations that improve performance on downstream tasks of estimating kinetics.Additionally, unlike the fields of natural language and vision, the optimal pretraining dataset and strategy for this IMU-based application remain unclear.The aim of this paper is to use existing SSL techniques to improve the accuracy and data efficiency of an IMU-driven deep learning model, and open source it on GitHub.We first investigated the optimal SSL strategy using a large "unlabeled" dataset of 500 participants with real and synthetic IMU data.We then compared the performance and data efficiency of SSL pre-trained models on estimating GRF compared to baseline models whose parameters are randomly initialized following conventional supervised learning.We first estimated GRF on overground and treadmill walking datasets, which was within the distribution of the SSL pre-training datasets.Then, we estimated GRF on a drop landing dataset, which is not present in the pre-training dataset.To facilitate broader usage and reproduction of work, we make our framework broadly usable by releasing our source code and trained models.

A. Pre-training Datasets
We used three datasets for our SSL pre-training: real IMU data [23], synthetic IMU data [24], and a fused dataset combining them both (Table I).The real IMU dataset has 17 IMUs, and we used eight of them that were placed on the trunk and pelvis, as well as both thighs, shanks, and feet.For the synthetic IMU dataset, we generated data on the same eight body segments.Each IMU has a 3-axis accelerometer and a 3-axis gyroscope.The z-axes of the IMUs were aligned with the segment surface normal, x-axes were pointing left during standing, and y-axes were perpendicular to the x-and z-axes following the right-hand rule (Fig. 1).
1) Real IMU: This dataset is based on MoVi, a public human motion dataset with 90 participants who performed a collection of 21 everyday actions and sports movements such as walking, vertical jumping, and throwing [23].Participants wore a 17-IMU suit that operated at 120Hz (Noitom, China).We linearly resampled each trial from 120Hz to 100Hz and then segmented it into windows of 128 time steps (1.28s each) with 50% overlap (0.64s) between successive windows.We excluded 18,760 (56.5%) windows in which participants were standing or sitting still, when the average standard deviation of all accelerometer axes was smaller than 0.2m/s 2 .In total, we obtained 14,418 windows and randomized their order for pre-training.
2) Synthetic IMU: This dataset is based on AMASS, a large-scale, public human motion archive involving 500 participants [24].AMASS unifies 23 existing motion capture datasets that have different marker sets, sampling frequencies ranging from 59-120Hz, and a variety of motions, with walk-ing being one of the most common motions.Using kinematics to generate synthetic IMU data is a widely-used method for augmenting the volume of training data [28]- [31].To generate synthetic IMU data, we low-pass filtered the joint angles and positions of each trial at 15Hz using a zero-lag, fourth-order Butterworth filter and resampled the filtered data to 100Hz.Subsequently, for each trial, we placed eight synthetic IMUs on their corresponding body segments (Fig. 1) five times with five different randomly generated orientation variations.Specifically, for each orientation variation, each of the eight IMUs is rotated by three random angles (within ±10deg) along the x, y, and z axes, respectively.Then, we generated synthetic IMUs' positions and orientations via forward kinematics using a prior pipeline [32].Synthetic angular velocities were computed by taking the first derivative of orientation, whereas synthetic accelerations were computed by taking the second derivative of positions and incorporating gravity.Similar to the MoVi dataset pre-processing, we segmented the synthetic IMU data into windows of 128 time steps with 50% window overlap.We excluded 853,139 (54.0%) windows whose average acceleration was smaller than 0.2m/s 2 .We also excluded 12,408 (0.8%) windows that contained abnormally large synthetic acceleration or angular velocity with thresholds of 160m/s 2 and 2000deg/s, respectively, which were determined based on measurement ranges of typical IMUs (MPU-9250, Invensense, USA and BHI-160, Bosch Sensortec, Germany).In total, we obtained 713,828 windows and randomized their order for SSL.
3) Real and Synthetic IMU: This dataset combines real IMU data and synthetic IMU data described above, resulting in a total of 728,246 windows.Such a data dataset has the potential to leverage the benefits of a substantial data scale as well as authentic sensor noise representations.

B. Downstream Datasets
We evaluated the performance of IMU representations from SSL pre-training in estimating GRFs for three different tasks, with corresponding datasets of synchronized IMU and GRF data: overground walking [25], treadmill walking [26], and drop landing [27].The medial-lateral GRF (mlGRF), anteriorposterior GRF (apGRF), and vertical GRF (vGRF) of the right leg were used as the estimation target for all three tasks.
The IMU-to-segment position placements of downstream datasets do not need to be identical to those of the pretraining dataset (Fig. 1), as a prior study found that position variations within 100 mm may not substantially influence the accuracy of machine-learning-based GRF estimation models [29].Meanwhile, for orientation placement, the real IMU dataset for pre-training has natural human-introduced placement variation while the synthetic IMU dataset for pretraining has ±10deg orientation variations.Still, we assumed the following orientation placement convention: one of their surfaces aligned to the body surface and one of their edges aligned with the segment's long axis.We manually verified whether the orientation placement conventions of each dataset are consistent with the assumed alignment.If not, we rotated IMUs along axes by multiples of 90 deg to match the assumed alignment.
1) Overground Walking [25]: The original dataset contains 19 participants walking on stairs, ramps, treadmills, and level ground at various speeds, but we only used level-ground walking data to build the model to estimate GRF.The input IMU data are collected at 200Hz from four IMUs (Yost, USA) placed on the torso and right thigh, shank, and foot.GRF was measured by force plates (Bertec, USA) at 1,000Hz and low-pass filtered at 15Hz.We downsampled the synchronized IMU and GRF data to 100Hz to match that of the pretraining datasets.Each trial has three gait cycles with goldstandard GRF measurements.We extracted these gait cycles using windows of 128 time steps, starting at 40 samples before the stance phases.In total, there are 510 windows available for model training and evaluation.
2) Treadmill Walking [26]: This dataset includes 17 participants walking on an instrumented treadmill with various speeds, foot progression angles, trunk sway angles, and step widths.Eight 8 IMUs (SageMotion, USA) were placed on the trunk, pelvis, and both thighs, shanks, and feet to collect data at 100Hz.The GRF of the right leg is measured by an instrumented treadmill (Bertec, USA) at 1,000Hz, low-pass filtered at 15Hz, and downsampled to 100Hz.For each trial, continuous walking data were segmented into windows of 128 time steps with 50% overlap.We retained all data windows because GRF is measured by treadmill-embedded force plates throughout the entire trial.In total, there were 23,481 windows available for model training and evaluation.
3) Drop Landing [27]: This dataset includes 15 participants performing double-leg drop landing trials from a 30-cm-high box with various toe-out angles.Eight IMUs (Xsens, The Netherlands) were placed on the trunk, pelvis, and both thighs, shanks, and feet to collect data at 100Hz.The GRF of the right leg was measured by a floor-embedded force plate (AMTI, USA) at 1,000Hz, low-pass filtered at 15Hz, and downsampled to 100Hz.Each trial contains 0.8s of data starting at the moment of jumping, corresponding to 80 time steps.We appended zeros to the end of each trial to match the input length of our transformer.We retained all the trials that were successfully collected.In total, we obtained 509 windows from 509 drop landing trials for model training and evaluation.

C. Model Architecture
We built a transformer model to learn IMU representations during pre-training (Fig. 2(a)).The transformer's input is a window of IMU data with 128 samples and 48 IMU axes (eight 3-axis accelerometers and eight 3-axis gyroscopes).We first normalized each IMU axis by removing the mean and scaling to unit variance.Then, the 128 samples were sliced into patches and the values of each patch were flattened.We randomly masked a portion of patches with a trainable masking patch that is updated during the training process.The masking ratio is a critical parameter in SSL, and it depends on the patch length of transformers.We performed a grid search to jointly optimize the masking ratio (6.25%, 12.5%, 12.5%, 25%, 37.5%, 50%, and 62.5%) and the patch length (1, 2, 4, and 8).After masking, we linearly mapped each patch to a vector of length 192 and added sinusoidal positional The weights and biases of the self-supervised transformer were copied to initialize those of the transformer for downstream evaluation.A linear layer was appended to the transformer to map the transformer outputs to the final model outputs.During fine-tuning, we optimized the linear layer first and then fine-tuned the entire model to prevent distortion of pre-trained IMU representations [33], [34].The fine-tuning data preprocessing steps are identical to those used in SSL, except for the exclusion of masking.
encoding [35].Then, the data was fed into the transformer model, generating an output with dimensions identical to the input.Each output patch was linearly mapped to the original patch size (patch length × 48 IMU axes) to reconstruct the unmasked input data.The transformer was trained to minimize the mean squared error between the reconstructed data and the unmasked input data over the masked patches.The transformer has six self-attention blocks, with a per-block configuration of 8 attention heads, 512 feedforward units, 10% dropout, 10 −5 layer normalization, and rectified linear unit (ReLU) activation function.The total number of trainable parameters is 2 million.
The model for downstream prediction tasks (i.e., estimating 3-axis GRF) has the same architecture as the one used in pretraining (Fig. 2(b)).Both IMU and GRF data are normalized along each axis by removing the mean and scaling to unit variance.The output of each patch was linearly mapped to GRF.The data pre-processing steps are identical to those in pre-training except that no patch is masked.For each downstream task, we investigated the performance of three pre-trained models whose parameters are inherited from SSL pre-training on the three different pre-training datasets.
The proposed model uses data from eight IMUs as input (Fig. 1).For downstream tasks that used a smaller IMU set (e.g., overground walking [25]), we substituted each omitted IMU with a randomly initialized, trainable data patch that is updated during the training process.Prior studies reported high GRF estimation accuracy using a single foot-worn IMU [36], [37].Thus, we trained additional models using solely the IMU data from the right foot, in addition to the models trained with data from all IMUs.

D. Baseline and Other Pre-trained Models
The following baseline and pre-trained models have an identical model architecture to the proposed SSL model.Also, for downstream evaluation, the same fine-tuning protocol was applied to all the models.
1) Baseline (No Pre-Training): Parameters of this model are randomly initialized following conventional supervised learning with PyTorch's default settings.Specifically, we set linear layer weights following the Kaiming uniform method [38], attention layer weights following the Xavier uniform method [39], and model biases to zeros.
2) Motion Transfer Pre-training: This baseline model is pre-trained on a running dataset with real IMU data as input and 3-axis GRF data as output.The distribution shift between this pre-training and downstream dataset is predominantly from the motion.The data were collected from 15 participants running on an instrumented treadmill with various speeds, footwear, strike patterns, and step rates [14].The GRF is measured by an instrumented treadmill (Bertec, USA) at 1,000Hz and low-pass filtered at 15Hz.Five IMUs (MTi-300, Xsens Technologies B.V., the Netherlands) were placed on the trunk, pelvis, and left thigh, shank, and foot to collect data at 200Hz.We directly used the left thigh, shank, and foot IMU data to substitute the absent right thigh, shank, and foot IMU.We downsampled IMU and GRF data to 100Hz and segmented each trial into windows of 128 time steps (1.28s each) with 50% overlap (0.64s) to match those of downstream datasets.Both IMU and GRF data are normalized along each axis by removing the mean and scaling to unit variance.In total, there were 21,099 windows available for model pre-training.
3) Task Transfer Pre-training: This baseline model is pretrained on the AMASS dataset (Section II-A2) with synthetic IMU data as input and 3-axis body center acceleration in the global frame as output.The body center acceleration, which is measured by marker-based motion capture, is selected due to its association with GRF according to Newton's second law.Both synthetic IMU data and body center acceleration are normalized along each axis by removing the mean and scaling to unit variance.

E. Training Protocols
Models were implemented in Python 3.9 with PyTorch 1.13 and Numpy 1.23.One Nvidia RTX 2080 Ti graphics card was used to train and test all the models.For the proposed SSL pre-training, Motion Transfer Pre-training, and Task Transfer Pre-training, we used the AdamW optimizer [40] to minimize the mean squared error with a learning rate of 10 −4 , a weight decay of 0.01, and a minibatch size of 64 for 5 × 10 4 backpropagation steps without early stopping.We used a linear warmup of the learning rate for the first 20% of steps, and decayed the learning rate with a cosine decay schedule without restarts [41].We used the entire dataset for pre-training without splitting a test set because the model evaluation was performed based on downstream datasets that are different from pre-training datasets.
For downstream evaluation, we implemented a two-step fine-tuning strategy for pre-trained models, which optimizes the linear layer first and then fine-tunes the entire model to prevent distortion of pre-trained IMU representations [33], [34].For the first step where the linear layer was trained, we used AdamW optimizer with a minibatch size of 64, a learning rate of 10 −3 , and a weight decay of 0.01 for 300 backpropagation steps.For the second step where the entire model is fine-tuned, we used AdamW optimizer with a minibatch size of 64, a learning rate of 10 −4 , and a weight decay of 0.01 for 300 backpropagation steps.For both steps, we used the mean squared error loss function, linearly warmed up the learning rate for the first 20% of backpropagation steps, and decayed the learning rate with a cosine decay schedule without restarts [41].We trained and tested models with data from different participants for each dataset using a five-fold cross-validation approach, where participants were randomly split into five groups.Four groups were combined as the training set for each of the five iterations, and the remaining group was used as the test set.

F. Performance Evaluation
We compared different models using Pearson's correlation (ρ) and root mean square error (RMSE) between the estimated and gold-standard GRF across all the stance phases and landing phases of each participant.Swing phases of walking and flight phases of drop landing where vGRF is lower than 2% of the body weight are excluded from the analysis.In addition to RMSE, we compared the relative RMSE (rRMSE) of peak vGRF because its magnitude is larger during drop landing than walking.The rRMSE is defined as: where N denotes the number of windows of a participant and vGRF gold denotes gold-standard vGRF measurement.A oneway analysis of variance (ANOVA) was used to determine if estimation accuracy would be significantly affected by pre-training settings, i.e., baseline, using real IMU, using synthetic IMU, and using real and synthetic IMU.If significant differences were detected, paired t-tests with Bonferroni corrections were used to compare pairs of models.The level of significance was set to 0.05.Beyond overall GRF measures, we also quantified the error of our pre-trained models across a spectrum and compared it to that of baseline models.We performed spectrum analysis using Fast Fourier Transform (FFT) on the GRF measured by the force plate as well as on the GRF estimated by models for each window.Then, we computed the differences in FFT results and averaged the differences across all the windows.
In addition, we investigated scaling law curves for model performance by varying the downstream dataset sizes.For each fold of cross-validation, we randomly sampled a subset of the training set for model training, while used the entire test set for evaluation.We linearly reduced the sizes (100%, 90%, 80%, ..., 10%) for GRF estimation during overground walking and drop landing, whereas exponentially reduced the sizes (10 0 × 100%, 10 −0.2 × 100%, 10 −0.4 × 100%, ..., 10 −2 × 100%) for GRF estimation during treadmill walking due to its larger base size.

III. RESULTS
SSL pre-trained models achieved higher accuracy in estimating vGRF during overground walking and treadmill walking across various downstream dataset sizes compared to baseline models that have an identical model architecture (Fig. 3).The models pre-trained on synthetic IMU data and on a combination of the real and synthetic IMU data were both significantly more accurate than the baseline models when using the entire overground or treadmill walking dataset for fine-tuning.No differences were observed between pre-trained and baseline models in estimating mlGRF, apGRF, or vGRF during drop landing.Using 10% of overground walking data to fine-tune the SSL pre-trained models yielded comparable accuracy to the baseline model trained on 100% of overground walking data, indicating a 10x data efficiency improvement.Similarly, using 1% of treadmill walking data to fine-tune  the pre-trained models yielded comparable accuracy to the baseline models that are trained on 100% of treadmill walking data, indicating a 100x data efficiency improvement.There was no statistical difference between the three SSL pre-trained models for all three prediction tasks except that the model pre-trained on the synthetic IMU dataset was significantly more accurate than the model pre-trained on the real IMU dataset for treadmill walking.Apart from the baseline model, the proposed SSL achieved slightly better results than Motion Transfer, which pre-trained the model using running GRF (Table II-F).The proposed SSL achieved comparable results to Task Transfer, which pre-trained the model using body center acceleration measured by marker-based motion capture.
When evaluating performance on vGRF peaks, the SSL pre-trained models also outperformed the baseline model for overground and treadmill walking (Fig. 4).SSL pretraining increased the correlation of peak vGRF estimation from 0.59 to 0.64 for overground walking and from 0.69 to 0.90 for treadmill walking.SSL pre-training reduced the correlation of peak vGRF estimation from 0.70 to 0.68 for drop landing.The RMSEs and rRMSE of pre-trained models are 0.11 ± 0.03N/kg and 8.2 ± 2.4% for overground walking, 0.07 ± 0.02N/kg and 4.5 ± 1.0% for treadmill walking, and 0.20±0.06N/kgand 7.9±2.8%for drop landing, respectively.When using the right foot IMU data as model input, SSL pre-trained models consistently achieved higher accuracy than the baseline model across mlGRF, apGRF, and vGRF during overground walking, treadmill walking, and drop landing (Table III).Also, using a single foot-worn IMU resulted in lower correlation coefficients compared to using all the IMUs.Additionally, consistently smaller errors were observed in SSL pre-trained models compared to the baseline models across the spectrum for overground and treadmill walking (Fig. 5).For drop landing, the SSL pre-trained model exhibited comparable error to the baseline model across the spectrum.A masking ratio of 6.25-12.5% with shorter patch lengths achieved the    best performance across all three prediction tasks (Fig. 6).Thus, we implemented our self-supervised transformer models with a patch length of 1 (equivalent to 10 ms at 100Hz sampling frequency) and 12.5% masking, and all the above results are reported based on these models.

IV. DISCUSSION
This work shows that using SSL to pre-train deep learning models that predict walking GRF from IMUs can improve data efficiency.Prior deep learning models for biomechanical estimation are predominantly trained on proprietary datasets with a small number of participants.In contrast, we used opensource motion datasets with hundreds of participants to pretrain our models.Three key outcomes of this work are: • Compared to conventional supervised learning, using SSL and large "unlabeled" IMU datasets to pre-train deep learning models can improve data efficiency by 10-100x for IMU-based walking vGRF estimation.It can also improve correlation coefficient by 0.02-0.04when using the same amount of labeled data for training.• Low masking ratios (6.25-12.5%)are optimal for IMUbased SSL, indicating that IMU data have high information redundancy (Fig. 6).• This code and pre-trained model have been made publicly available, potentially facilitating researchers to improve the results of their customized downstream prediction tasks (Appendix A).In the absence of large-scale real IMU datasets, we utilized a large-scale motion dataset (AMASS) to generate synthetic IMU data for SSL.Its combined window duration is equivalent to 56 hours of data, substantially exceeding prior kinetic estimation studies that used 1.2-6 hours of real IMU data [13]- [15] and 0.2-2 hours of synthetic IMU data [42], [43].Models pre-trained on a relatively smaller real IMU dataset and a larger synthetic IMU dataset achieved similar performance.This suggests that the size of the synthetic dataset compensated for the distribution shift between real and synthetic data that arise from soft tissue artifacts, sensor noise, and human model simplification [28].The other two pre-training methods, Motion Transfer and Task Transfer, achieved slightly lower or comparable results to the proposed SSL pre-training.Their non-inferior performance might be attributed to the usage of gold-standard measurements from force plates and markerbased motion capture systems.The proposed SSL only relies on "unlabeled" IMU data.Thus, the results can potentially be improved and generalized to broader use cases by using "unlabeled" IMU data collected in real-life or sports environments.Ideally, the dataset should involve a wide range of motions that are related to clinical assessments and of interest to biomechanists, such as jumping, landing, and cutting.
SSL trains models to learn IMU representations specific to pre-training data.As a result, it improved GRF estimation during walking, which is one of the most common motions in both two pre-training datasets (MoVi and AMASS).In contrast, neither dataset has instances of drop landing.Consequently, the model cannot learn IMU representations of drop landing during pre-training, resulting in no enhancement during subsequent downstream evaluation.Tailoring the distribution of datasets for specific estimation tasks is a commonly used strategy for improving the performance of deep learning [44]- [46].To apply SSL, future researchers may need to increase the similarity between the distribution of pre-training and downstream datasets.This can be achieved by adjusting the weights of instances or by collecting additional pre-training data for the motions of the downstream datasets.
The RMSE of peak vGRF estimation of the pre-trained models were 0.56-0.88N/kgduring treadmill and overground walking, which are comparable to the smallest detectable change of peak vGRF during walking in patients with stroke or knee osteoarthritis (0.56-0.61N/kg) [47], [48].The RMSE of peak vGRF estimation of the pre-trained models was 3.53N/kg during drop landing, which is higher than the smallest detectable change of peak vGRF during drop landing (0.55-0.63N/kg) [49].Also, the performance of our pretrained models is comparable to or slightly higher than those reported in prior studies that used the same downstream datasets for vGRF estimation.Specifically, a prior deep learning model achieved a correlation coefficient of 0.96 ± 0.01 for overground walking and 0.96 ± 0.01 for treadmill walking [50], while our models achieved 0.95 ± 0.02 and 0.97 ± 0.01, respectively.Another prior deep learning model achieved a correlation coefficient of 0.92±0.11[27] for vGRF estimation during drop landing, while our model achieved 0.93 ± 0.04.Unlike prior studies, we excluded swing phase of walking and flight phases of drop landing for performance evaluation.The swing and flight phases have substantially smaller RMSE than stance and landing phases across three GRF axes, possibly because the prediction of zero GRF is inherently easier than that of positive and modulating GRF (Supplementary Table I).Thus, excluding swing and flight phases reduced correlations and increased RMSE of our models.Another reason for excluding swing phases is that prior studies used one IMU placed on foot for segmenting walking and running gait phases with less than 25 ms errors [51]- [53].Integration of such gait phase segmentation models into the GRF estimation model may obviate the need for GRF estimation during swing phase.Additionally, prior studies tailored their model architectures and hyperparameters for specific prediction tasks, and thus they might not be reused for other prediction tasks.In contrast, we demonstrated that our unified model architecture can be generalized to multiple downstream prediction tasks.
The optimal mask ratio is related to the information density of the input data.Previous language models used low masking ratios for natural language that is information-dense (e.g., 15% [22]), whereas vision models used high mask ratios for images that have high information redundancy (e.g., 75% [18], [19]).We observed that 6.26-12.5% masking ratios have the best performance (Fig. 6), indicating that IMU data are information-dense.Nevertheless, the optimal masking may increase when the IMU sampling frequencies are substantially higher than 100Hz due to increased information redundancy.Our proposed model used a window length of 128 samples (1.28s), which is longer than the typical duration of a walking gait cycle or a drop landing trial.However, we acknowledge that such a window length may limit the model performance on other tasks such as GRF estimation during balancing, which needs a longer window length to extend the temporal scope of the model.
Our code, pre-trained models, and tutorials are publicly available for researchers who are interested in using them for their downstream prediction tasks.Our models were trained with eight IMUs; however, they are robust to the distribution shift caused by a smaller IMU set.When using four IMUs for overground walking GRF estimation (Table II-F) or using one foot-worn IMU for both walking and drop landing GRF estimation (Table III), SSL pre-training improved accuracy across mlGRF, apGRF, and vGRF compared to no pre-training.Thus, the proposed SSL model can be used for sparse IMU configurations that are more practical for real-world measurements [31], [54], [55].We recommend placing IMUs with their z-axes aligned with the segment surface normal, x-axes pointing left during standing, and y-axes perpendicular to the x-and z-axes following the right-hand rule (Fig. 1).For a different IMU-to-segment orientation placement configuration, the recommended configuration can typically be achieved through a post-hoc rotation of the IMU data.
V. CONCLUSION SSL with large IMU datasets for model pre-training can improve data efficiency and accuracy compared to conventional supervised learning when the distribution of the downstream dataset is within that of pre-training datasets.Future studies may collect "unlabeled" IMU data under real-life or sports environments to expand the scale and distribution of pretraining datasets, thus potentially unlocking newer and broader use cases of IMU-driven kinetic assessment where only limited gold-standard GRF measurements are available.
APPENDIX A

Fig. 1 .
Fig. 1.Locations and orientations of eight synthetic IMUs generated from the AMASS dataset.Red arrows, green arrows, and blue dots represent x-, y-, and z-axes of IMUs, respectively.

Fig. 2 .
Fig. 2. Self-supervised IMU representation learning to boost downstream prediction tasks.(a) Continuous IMU data are segmented into windows, sliced into patches, randomly masked, and finally fed into a transformer model to reconstruct the original data window.(b)The weights and biases of the self-supervised transformer were copied to initialize those of the transformer for downstream evaluation.A linear layer was appended to the transformer to map the transformer outputs to the final model outputs.During fine-tuning, we optimized the linear layer first and then fine-tuned the entire model to prevent distortion of pre-trained IMU representations[33],[34].The fine-tuning data preprocessing steps are identical to those used in SSL, except for the exclusion of masking.

Fig. 3 .
Fig. 3. Correlation coefficient between gold-standard and estimated vGRF for SSL pre-trained and baseline models when using a range of reduced downstream datasets for training.The sizes of overground walking and drop landing datasets for model fine-tuning were reduced linearly whereas the size of treadmill walking dataset was reduced exponentially due to a significantly larger dataset size.Black arrows in task 1 -overground walking and task 2 -treadmill walking indicate improvements in data efficiency.

Fig. 4 .
Fig. 4. Scatter plot and regression line showing gold-standard and estimated vGRF peaks for SSL pre-trained and baseline models.All the windows are shown for task 1 (overground walking) and task 3 (drop landing), whereas 500 windows were randomly selected from 23,481 windows for task 2 (treadmill walking).

Fig. 5 .
Fig.5.The error between gold-standard measurements and model estimations across the spectrum.Pre-trained models (blue) exhibited smaller errors compared to baseline models (gray) for overground and treadmill walking.

Fig. 6 .
Fig.6.Correlation coefficient between the gold-standard and estimated vGRF for a variety of models trained with patch lengths of 1-8 and with 6.25-62.5% masking during pre-traing.Patch lengths of 1-2 and 6.25-12.5% masking have the highest accuracy.

TABLE I TWO
DATASETS FOR SSL PRE-TRAINING AND THREE DATASETS FOR DOWNSTREAM EVALUATION.

TABLE II CORRELATION
COEFFICIENT BETWEEN THE GOLD-STANDARD AND ESTIMATED GRF FOR THE BASELINE AND A VARIETY OF PRE-TRAINED MODELS.
a ml: medial-lateral; ap: anterior-posterior; v: vertical b The model is pre-trained on the synthetic IMU dataset.

TABLE III CORRELATION
COEFFICIENT BETWEEN THE GOLD-STANDARD AND ESTIMATED GRF WITH A SINGLE FOOT IMU DATA AS MODEL INPUT.