Federated Learning for multi-omics: a performance evaluation in Parkinson’s disease

Summary While machine learning (ML) research has recently grown more in popularity, its application in the omics domain is constrained by access to sufficiently large, high-quality datasets needed to train ML models. Federated Learning (FL) represents an opportunity to enable collaborative curation of such datasets among participating institutions. We compare the simulated performance of several models trained using FL against classically trained ML models on the task of multi-omics Parkinson’s Disease prediction. We find that FL model performance tracks centrally trained ML models, where the most performant FL model achieves an AUC-PR of 0.876 ± 0.009, 0.014 ± 0.003 less than its centrally trained variation. We also determine that the dispersion of samples within a federation plays a meaningful role in model performance. Our study implements several open source FL frameworks and aims to highlight some of the challenges and opportunities when applying these collaborative methods in multi-omics studies.


Introduction
In recent years, Machine Learning (ML) algorithms have gained popularity as a possible vehicle for solving many long-standing research questions in the clinical and biomedical setting.Concretely, the adoption of sophisticated ML models can aid in tasks such as biomarker detection, disease subtyping 1 , disease identification [2][3] , and the development of novel medical interventions.The nascence of powerful predictive methods such as ML has enabled investigations into advanced analytics of granular patient features, such as genomics and transcriptomics [4][5][6] , to achieve the ultimate goal of precision medicine.
Adopting ML methods is constrained by access to high-quality datasets.In biomedical studies, curating such quality datasets is particularly difficult due to the sample collection and processing costs and the barriers associated with recruiting patients who meet a study's inclusion/exclusion criteria.The challenge of this task is exacerbated by the fact that institutions that typically collect biomedical samples cannot easily share human specimens' data due to data privacy regulations, such as HIPAA, EU GDPR, India's PDPA, Canada's PIPEDA, mandated by medical systems, or at a national, and international level 7 .Federated Learning (FL) is an optimization method for performing machine learning model training among a group of clients, allowing each client to maintain governance of their local data.Initially developed for learning user behavior patterns on personal mobile devices without breaching individual privacy 8 , FL has found valuable applications in numerous domains, including finance 9 , medicine 10,11 , and the pharmaceutical industry 12 .In biomedical research, FL represents an opportunity to enable cross-silo analytics and more productive collaboration [13][14][15] .This work evaluates Federated Learning methods' practical availability and utility to enable large-scale, multi-institutional, and private analytics of multiple modality biomedical samples.Specifically, we aim to identify the frameworks biomedical researchers can use to perform federated learning in their work, the expected performance changes, and the implementation challenges they may face.We also discuss the opportunities and limitations of applying federated learning in multi-omics, where samples capture a patient's genomic, transcriptomic features, and clinical and demographic information through the case study of Parkinson's disease prediction.
We use the multi-modal Parkinson's disease prediction as a case study for testing Federated Learning on omics data.Timely and accurate diagnosis of neurodegenerative diseases like Parkinson's is crucial in exploring the efficacy of novel therapies to treat and manage the disease.Since the onset of these diseases typically begins many years before any visible symptoms, early detection is difficult to achieve at a clinical level alone.Usually, it requires information inherent to the patient's biology.Makarious et al. 2022 5 have already determined that leveraging genomic and transcriptomic information as part of the diagnosis process allows for higher model performance.This study demonstrates the feasibility and practicality of deploying FL for Parkinson's disease prediction.

Federated learning models trained using publicly available and accessible frameworks results follow central model performance.
We aim to evaluate the performance differences between classical ML algorithms frequently used in the biomedical setting, against federated learning algorithms suitable for cross-silo modeling.An overview of our approach (Fig. 1) shows the experimental design for evaluating several FL and central ML methods in the task of Parkinson's disease prediction based on genomic, transcriptomic, and clinico-demographic features (Supplementary Table 1).The datasets used in experiments originate from studies providing clinical, demographic, and biological information of Parkinson's Disease patients, the Parkinson's Progression Marker Initiative (PPMI) and Parkinson's Disease Biomarkers Program (PDBP).The PPMI dataset is a longitudinal, observational study where patients contribute clinical, demographic, imaging data, as well as biological samples for whole-genome sequencing and whole-blood RNA sequencing.PPMI specifically includes newly diagnosed and drug-naïve patients, collected at clinical sites globally over a span of 5 to 13 years.The PDBP dataset provides clinical, genetic, imaging, and biomarker data associated with Parkinson's disease, Lewy Body Dementia, and other parkinsonisms.Patients in PDBP are not necessarily newly diagnosed or drug-naïve.The PPMI dataset is used for model training, validation, and testing.The PDBP dataset is used strictly as an external test set.In our experiments, the PPMI dataset is split into K folds, one of which is used as a holdout test set, and the remaining folds are used for model training and validation.To establish a baseline performance of classically trained, central algorithms representative of methods used in the current biomedical research paradigm, several central ML algorithms are fit to the training set (Supplementary Table 2) and tested on the holdout PPMI fold, as well as the whole PDBP dataset.To simulate the cross-silo federated setting, the training set is split into N disjoint subsets, referred to as client datasets.Where each baseline ML algorithm is fit to the full centralized training dataset, the FL model is fit to N disjoint, siloed client datasets, the union of which equates to the entire training data set.An illustration of the FL training process is shown in Fig. 2. The fitted FL models are finally evaluated against the PPMI test fold, and the PDBP dataset.
In the optimistic FL setting where we compare a federation of N=2 client sites, which have been assigned samples through uniform stratified random sampling without replacement, against the central baseline algorithms (Fig. 3), it can be seen that for all the included FL methods, the absolute difference in performance is relatively small.For the internal test set, the central Logistic Regression (LR) Classifier 16 has an Area Under the Precision-Recall Curve (AUC-PR) of 0.915±0.039(standard deviation across K = 6 folds).Among the classifiers trained using FL, which implement the same local learner, FedAvg  AUC-PR of 0.842±0.009and FedAvgLR, μ = 0.5 LR, and FedProx μ = 2 LR, have an AUC-PR of 0.826±0.011,0.823±0.015,and 0.835±0.007.This general reduction in performance between the centrally trained classifier and the classifier trained collaboratively through FL is a trend observed in nearly all of the FL algorithms across both test sets.Similarly, the central MLP classifier 19 has an AUC-PR of 0.892±0.032and 0.826±0.007respectively, while the best performing FL variation, FedProx MLP μ = 2 has an AUC-PR of 0.868±0.06and 0.785±0.015for the internal and external test set respectively.A similar trend can be observed for FedAvg XGBRF 20 , where the performance reduction is proportional to that observed in other learners.The only exception to this pattern is in the case of SGDClassifier 21,22 , where in the PDBP exhibits a marginal improvement of 0.002 AUC-PR.The resulting performance details, including several additional centrally trained ML algorithms, are in Table 1, and Table 2.The statistical significance of pairwise observed differences in performance is presented in Supplementary Table 3.The results of this side-by-side comparison of FL methods in an idealistic setting with minimal heterogeneity among client datasets show a relatively small difference in performance compared to the central algorithms.

Sample dispersion among client sites negatively impacts global model performance.
By splitting the training samples among an increasing number of clients, we aim to understand the implications of federation configurations that have more dispersed samples (Fig. 4).In both the PPMI and PDBP data sets, there is a similar relative change in AUC-PR performance when increasing the number of client sites; the absolute performance scores, and variance are considerably higher for the PPMI test set than the PDBP test set.The performance of PPMI FedAvg XGBRF starts at 0.924 ± 0.015 AUC-PR in a federation of two client sites, and progressively drops to 0.861 ± 0.043 AUC-PR at 18 clients.For the PDBP performance, FedAvg XGBRF similarly reduces from 0.876 ± 0.009 to 0.752 ± 0.054 AUC-PR, at a minimum.Such a reduction in performance is also observed in the FedAvg SGD Classifier, which has an AUC-PR of 0.92 ± 0.025 for two clients, and an AUC-PR of 0.886 ± 0.055 for 18 clients in PPMI.In PDBP the same classifier starts at 0.847 ± 0.008 for two client sites, and ends at 0.798 ± 0.014 AUC-PR for 18 client sites.The trend of performance decline is observed for the LR classifiers as well, for both the PPMI and PDBP test sets.The FedAvg MLP, as well as both FedProx MLP classifiers do not exhibit such a reduction performance.In the PPMI test set, FedAvg MLP performance at N=2 client sites is 0.872 ± 0.072 AUC-PR, and at N=18 client sites, 0.876 ± 0.06 client sites.Similarly, for PDBP performance FedAvg MLP performance is 0.78 ± 0.012 for two client sites, and 0.781 ± 0.009 for N=18 client sites.A nearly identical trend is observed in FedProx μ = 0.5, μ = 2 across both test sets.Detailed results are in supplementary materials, Table 3, and Table 4.

Data heterogeneity at client sites does not significantly influence model performance.
To understand the implications of data heterogeneity among client sites, we examine the change in AUC-PR in a federation of two clients with respect to the split method (Fig 5).In our experiments, we find that the performance changes introduced by dataset heterogeneity vary.Some FL models, such as FedAvg LR, FedProx μ = 0.5, FedAvg MLP, FedProx μ = 0.5 MLP, FedProx μ = 2 MLP exhibit performance improvements as both label heterogeneity and dataset size heterogeneity are introduced.The greatest performance improvement in PDBP is 0.018 AUCPR by FedProx μ = 2 MLP, and 0.029 by FedAvg LR in PPMI.Conversely, find that FedProx μ = 0.5 LR, FedProx μ = 2 LR, FedAvg SGD, FedAvg XGBRF exhibit performance degradation on account of dataset heterogeneity, where the greatest reduction in performance is 0.014 AUC-PR by XGBRF in PDBP, and 0.031 AUC-PR by FedProx μ = 2 LR on PPMI.In all such cases, the performance changes induced by dataset heterogeneity are marginal relative to other parameters such as algorithm choice or quantity of participating clients.The implications of performance heterogeneity as the number of clients increases are shown in Supplementary Fig. 1, Supplementary Fig 2 .FL training time is not dramatically affected by choice in federated aggregation strategy.
To shed light on the computational costs associated with using different FL aggregation strategies, we measure the model training time for FL algorithms in which the federation consists of N=2 client sites.For the FedAvg aggregation strategy, FedAvg LR had the lowest mean runtime of 7.909e+00 ± 0.550 seconds.Similarly, for the algorithms implementing FedProx aggregation, FedProx μ = 0.5 LRClassifier had the lowest overall runtime of 8.747e+00 ± 0.158 seconds.FedProx μ = 2 LRClassifier had the second lowest runtime for FedProx variants with a runtime of 8.905e+00 ± 0.130 seconds.For the MLP classifier, FedAvg, FedProx μ = 0.5, and FedProx μ = 2 had a progressively increasing runtimes 8.755e+00 ± 0.141 seconds, 9.039e+00 ± 0.266 seconds, 9.260e+00 ± 0.163 seconds respectively.FedAvg XGBRF, and FedAvg SGD had a considerably higher runtime of 1.061e+01 ± 0.014 and 1.513e+01 ± 1.497 seconds respectively.A visualization of the runtimes for FL algorithms is presented in Figure 6.
In contrast to the FL algorithms, central algorithm training time is at least an order of magnitude lower.The fastest model to train, SGD Classifier fits the model in 6.771e-03 ± 0.001 seconds, and the slowest model, MLP Classifier fits its model in 1.609e-01 ± 0.009.Central LR Classifier and XGBRF run in 1.857e-02 ± 0.008 and 1.633e-01 ± 0.003 seconds respectively.The full list of runtimes including central and federated algorithms is presented in Table 7.

Discussion
In conjunction with increased access to genomic and transcriptomic data, the proliferation of high-quality Machine Learning open-source packages, has helped advance numerous longstanding challenges in biomedical research, such as disease subtyping, biomarker identification, and early disease diagnosis.The common bottleneck limiting such advances has thus shifted from the ability to apply ML methods to the availability of high-quality, well-designed datasets.
Federated Learning has been cited as a promising means of alleviating the data scarcity problem through data-private collaborative model training [23][24][25] .Previous works focus on applying FL to domains adjacent to multi-omics disease diagnosis, namely focusing on imaging data 13,26 , longitudinal health records 14 , named entity recognition 27 .Similar works such as Salmeron, Arévalo, and Ruiz-Celma 2023 28 approach benchmarking FL models on biomedical datasets, but focus on comparing different FL aggregation strategies, rather than evaluating FL against a central baseline, as is done in our study.By approaching federated learning for multi-omics disease diagnosis from a performance benchmarking perspective, focusing on algorithms that are broadly accessible in the open-source community, we hope to shed light on what kind of practical performance can be achieved in a real-world setting where deep AI and Software Systems expertise may be limited.We additionally aim to understand what fundamental pitfalls researchers must be aware of before applying such methods in their multi-omics tasks.
When comparing centrally trained models against collaboratively trained models that implement the same local learner algorithm, our results indicate the FL trained model performance tends to be consistently less than that of the central method, while approximately following the performance of the central ML method.The general reduction in AUC-PR testing score between the FL and central method is noteworthy, but not a substantial deterioration.It can also be observed that for the studied aggregation strategies, FL model performance follows central model performance.In cases where the central model is performant, the FL trained model will be as well.In the case of the strongest central classifier in the central setting, XGBRF, the FL method implementing the same algorithm as a local learner, FedAvg XGBRF, also had the highest performance among models trained using FL.Additionally, we see that in many cases, FedAvg XGBRF outperforms central ML classifiers such as Logistic Regression, SGD, and MLP at the same task by a significant margin.This empirical result indicates that in cases where institutions must decide between applying FL methods to their setting, or centralizing data by complying with potentially stringent regulations, FL can be considered an effective option.In addition to this, because the implementation of such methods is available through open-source, strongly documented frameworks, the resource investment to achieve scientifically meaningful results may not be significant.We also note that because an FL model's performance tracks, and seldom exceeds its central model performance, it can be crudely used to approximate the central model lower bound.Such an estimate of central performance, even if inexact, may be valuable for institutional stakeholders when deciding whether financial and administrative resources should be allocated to centralize several siloed datasets.We also note the overall reduction in performance between models trained using FL methods and models trained using central methods can be attributed to the federated aggregation process, which, in our case, is implemented as the unweighted average of the local learner model weights.Such a naive averaging process detracts from the parameter optimization implemented by local learners, but is a necessary cost to enable sample-private federated training.Furthermore, we note that several novel methods which implement more sophisticated weight aggregation strategies have been developed in academic settings, but are not always available as generally applicable open source packages.Overall, this test indicates that FL may be used to enable productive collaboration among institutions existing on opposite sides of geographic and policy boundaries, such as EU-GDPR, as well as across cloud providers and bare metal servers.
In the second arm of our study, we aim to understand the model performance cost of conducting collaborative training among a federation with increasing sample dispersion.Such a situation may arise, when institution stakeholders must comply with several layers of regulatory requirements, where centralizing some sites is easier than others.A concrete example of such a regulation is in the case of EU-GDPR, where transport of patient samples beyond the boundaries of the EU requires compliance with GDPR, and each country's respective legislation mandates the transport of samples between countries within the EU.In this series of experiments, we assume that the globally available set of samples is constant, but the quantity of federation members containing the samples varies.Our experiments show that some methods, such as the LR classifiers, FedAvg SGD and FedAvg XGBRF, tend to exhibit performance degradation when there are more siloes with fewer samples per silo.We also observe that methods implementing MLP as a local learner, tend not to exhibit performance degradation with respect to sample concentration at siloes.Such methods do not necessarily achieve the best performance for any federation configurations, however, in the most extreme federation configuration of 18 clients, are still outperformed by methods such as FedAvg SGD and FedAvg LR.Methods whose performance is not strongly affected by silo size may represent practical starting points for the application of FL in an exploratory task.Ultimately, because FL models appear to have an optimal operating point which is modulated by the federation configuration, the final choice in FL methods used to reach peak performance should be determined by an exhaustive search.This finding suggests that a practical future format for applying FL in the biomedical setting may be through the auto-ML paradigm, which frameworks such as H2O 29 , Auto Sklearn 30 currently implemented in the classical ML setting.We additionally find in our studies that the implementation of heterogeneous client sites, with respect to dataset size and label counts, does not necessarily result in performance reductions for all algorithms.Some models such as FedAvg LR, models implementing MLP as a local learner, tend to increase performance, while models such FedAvg XGBRF, and FedAvg SGD exhibit performance degradation when the number of client sites is two (Supplementary Figure 1).We further find that when the number of clients is four, such heterogeneity has varying effects on performance, different from the configuration with two client sites.Overall, performance changes with respect to client dataset heterogeneity are marginal relative to changes introduced by factors such a number of clients per federation, or algorithm selection.
When comparing training time among federated learning algorithms, we found a mild progression in training time between FedAvg LRClassifier, FedProx μ = 0.5 LRClassifier, and FedProx μ = 2 LRClassifier respectively.The same trend can be observed for the FL algorithms using MLP as a local learner.The progressive increase in runtime may be attributed to the relative difference in complexity between the FedAvg and FedProx optimization mechanism.The objective of FedProx weight aggregation function includes a regularization term, μ, designed to handle heterogeneity among clients 18 .The FedAvg optimization objective does not include this mechanism, making it conceptually simpler.In the context of this study, the performance differences incurred by choice in aggregation method are minor relative to parameters such as choice in local learner, number of federated aggregation rounds, and dataset size.
When comparing the training time of central and federated models, we find that the central model training time is at least an order of magnitude lower than the federated training time.This result is not surprising, given that federated models implement global weight aggregation and updating steps in model training.Given that our study performs federated optimization in simulation, production deployments of FL methods can be expected to have slower overall runtimes due to network latency and operating system throughput capabilities.
The algorithms used in evaluating collaboratively trained models using FL against centralized applications of their local learner methods are detailed in Supplementary Table 2.In our study, we omitted using closed-source FL methods available through platform interfaces since these methods allow data governance capabilities to external parties, or vendor security evaluations, which in some cases instantiates barriers to productive research.While numerous publications explore methodological improvements that push forward state-of-the-art FL model performance in an experimental setting, we encountered challenges in applying such methods in our case, as many of these academic studies do not result in broadly applicable packages.In our research, we found a set of open-source projects that implement FL methods and provide out-of-the-box solutions, or well-designed examples that could be interpolated to the multi-omics classification task to be limited.Ultimately, the FL interfaces made available by NVFlare 31 and Flower 32 were selected to conduct experiments, with local learners implemented using Sklearn 21 and DMLC 20 packages.Several open-source projects, such as Owkin 33 , Tensorflow Federated 34 , and OpenFL 35 provide full interfaces for implementing deep learning models in TensorFlow 34 , and PyTorch 36 , but such deep methods are less suitable for tabular tasks on datasets with only a handful of samples, as is the case the multi-omics datasets used in this study.Additionally, we found that while several packages provided abstract interfaces for implementing any arbitrary set of local learners and aggregation strategies, without detailed examples with a straightforward path to adaptation to a particular research task, the practical application of such methods becomes challenging, and less approachable for groups which may be resource constrained.
The extent to which Federation Site configurations could be studied was largely limited by the number of case patients within the dataset.Concretely, the implications of heterogeneity in site data could only be observed to the extent that each silo would maintain enough samples from case and control cohorts to allow the local learner to successfully train.Datasets at silos needed to have at least one sample from both the case and control groups.Similarly, although the PPMI dataset was collected across several geographically distributed institutions, point of origin information is not available for each sample, preventing the evaluation of performance on naturally occurring siloes.In our study, all experiments assume that the collective dataset available among all client sites has a constant size.An additional limitation of our work is in observing the effect of adding federation members, which contribute novel samples to the federation.
While FL methods enable data owners to maintain governance of their local datasets, on its own, FL does not provide end to end privacy guarantees.Our study examines the utility of FL methods in the multi-omics case study to understand the availability and characteristics of FL, and does not include a concrete evaluation of privacy, or security methods.Thus, we assume that the federated learning aggregation server is neither dishonest, curious, nor malicious in any way and fulfills its functions as an intermediary between client sites benevolently.Privacy preserving methods orthogonal to FL such as Differential Privacy (DP) enable the application of FL with formal guarantees of sample-privacy 37 .Such approaches were not included in the scope of this evaluation, but represent a factor which should be considered when applying FL methods in settings where verifiable sample privacy guarantees are critical.In our experimentation, we do not focus on the implications of the federation which has heterogeneous compute capabilities, since applying machine learning model fitting on datasets with few samples can be done without much difficulty.
The datasets utilized in our analysis, including PPMI and PDBP, are sourced from the Accelerating Medicines Partnership Parkinson's Disease (AMP-PD) initiative.This initiative plays a pivotal role in unifying transcriptomic and genomic samples, ensuring consistency and accuracy through central harmonization and joint-calling processes.Furthermore, the construction of machine learning features for our analysis is also centralized, leveraging these cohesive datasets.Recognizing the potential for broader application, our future focus includes exploring federated analysis tasks [38][39][40] .This involves enhancing cross-silo harmonization, jointcalling, and feature construction across diverse datasets.To facilitate this, the development of specialized federated learning libraries, specifically tailored for genomics and transcriptomics, is crucial.Such advancements will not only democratize access to federated learning (FL) methods for the wider biomedical community but also significantly broaden the scope for applying machine learning techniques in various biomedical contexts.
Overall, we believe that this work sheds light onto the feasibility, and noteworthy characteristics of applying Federated Learning for omics analysis.Through our experiments, we find that collaboratively trained FL models can achieve high classification accuracy in multi-omics Parkinson's Disease Diagnosis, and can remain relatively performance despite heterogeneity among client sites.We also find in our evaluation that although FL is a relatively novel research space in bioinformatics, there is sufficient access to open source methods which biomedical researchers may leverage to enable productive collaborations.
105 and is also made available for use under a CC0 license.
(which was not certified by peer review) is the author/funder.This article is a US Government work.It is not subject to copyright under 17  Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Faraz Faghri (faraz@datatecnica.com).

Materials Availability
This study did not generate new unique materials.

Method Details Datasets
The dataset used in this study as the basis for training and as the internal test set is the Parkinson's Progression Marker Initiative (PPMI) dataset.The PPMI dataset represents a longitudinal, observational study where patients contribute clinical, demographic, imaging data, biological samples for whole-genome sequencing, and whole-blood RNA sequencing.Samples are collected at 33 clinical sites globally and across a time span of anywhere from 5 to 13 years.This preprocessed dataset consists of 171 samples of case patients diagnosed with Parkinson's disease and 427 healthy control patients.The PPMI cohort contains newly diagnosed, and drug-naive patient samples.The PPMI cohort contains 209 (36%) female samples, 388 (67%) male samples (Supplementary Table 4).
The dataset used in this study for external, out-of-distribution validation is the Parkinson's Disease Biomarker Program (PDBP).It is a longitudinal, observational study where patients contribute clinical, demographic, and imaging data and biological samples for whole-genome and whole-blood RNA sequencing.The preprocessed dataset consists of 712 healthy control patients and 404 case patients diagnosed with Parkinson's disease.Each sample comprises 713 features, including genetic, transcriptomic, and clinico-demographic information collected at the baseline.The PDBP cohort consists of 480 (43%) female samples, and 636 (57%) male samples.(Supplementary Table 5).
Both PPMI and PDBP data used in this study were acquired through the AMP-PD initiative 42 , an effort to provide harmonized datasets that include common clinical and genomic data.Through this initiative, the PPMI and PDBP datasets are centrally joint-called and harmonized to allow standardization across cohorts.
Transcriptomic data from whole blood RNA sequencing was generated by the Translational Genomics Research Institute team using standard protocols for the Illumina NovaSeq technology and processed through variance-stabilization and limma pipelines 43 for experimental covariates.Gene expression counts for protein-coding genes were extracted, then differential expression p values were calculated between cases and controls using logistic regression adjusted for additional covariates of sex, plate, age, ten principal components, and percentage usable bases.A comprehensive description of the RNA-Sequencing method is presented in 44 for PPMI, and 45 for PDBP.
For genetic data, sequencing data were generated using Illumina's standard short-read technology, and the functional equivalence pipeline during alignment was the Broad Institute's implementation 46 .Applied quality control measures included criteria like gender concordance and call rate, with a focus on SNPs meeting the GATK gold standards pipeline and additional filters like non-palindromic alleles and missingness by case-control status thresholds.Polygenic risk scores (PRS) were constructed using effect sizes from a large European genome-wide metaanalysis, supplementing the genetic data from whole genome sequences.The process from sample prep to variant calling is comprehensively described in 42 .
Quality control for genetic samples based on genetic data output by the pipeline included the following inclusion criteria: concordance between genetic and clinically ascertained genders, call rate > 95% at both the sample and variant levels, heterozygosity rate < 15%, free mix estimated contamination rate < 3%, transition:transversion ratio > 2, unrelated to any other sample at a level of the first cousin or closer (identity by descent < 12.5%), and genetically ascertained European ancestry.For inclusion of whole-genome DNA sequencing data, the variants must have passed basic quality control as part of the initial sequencing effort (PASS flag from the joint genotyping pipeline) as well as meeting the following criteria: non-palindromic alleles, missingness by case-control status P > 1E-4, missingness by haplotype P > 1E-4, Hardy-Weinberg p value > 1E-4, minor allele frequency in cases > 5% (in the latest Parkinson's disease meta-GWAS by Nalls et al. 2019 47 ).As an a priori genetic feature to be included in our modeling efforts, Polygenic Risk Scores (PRS) were computed per individual in the PPMI and PDBP datasets, where we used the estimated effect sizes from the largest European Parkinson's disease meta-GWAS, using the genome-wide significant loci only 47 .The PPMI dataset is a minor portion of the most recent GWAS study (less than 0.1% of the sample size).The algorithmic feature selection process, described by Makarious et al. 2022 5 excluded the majority of features reaching the p value thresholds of interest thus reducing any impact caused by potential data leakage.The PDBP dataset is not included as a part of the study.
Compared to the PPMI dataset, PDBP includes an additional 40 genetic features, which are excluded from this study, allowing PPMI and PBDP to have the same feature set.Additionally, the PPMI samples are collected before any medical intervention, whereas the PDBP samples are, in some cases, collected after patient treatment has commenced.Since the PDBP dataset may include artifacts which result from disease treatment, the PDBP dataset is used exclusively for evaluation to avoid the possibility of label leakage.A shortened version of the final feature set is provided in Supplementary Table 1.A comprehensive feature list is available in the external code repository 48 .
Each sample consists of 673 features, including genetic, transcriptomic, and clinico-demographic information collected at the baseline.Of the 673 features, 72 originate from genome sequencing data and polygenic risk score, 596 are transcriptomic, and 5 are clinico-demographic.The clinico-demographic features include age, family history, inferred Ashkenazi Jewish status, sex, University of Pennsylvania Smell Identification (UPSIT) score.

Data Preprocessing
The construction of features from genomic, transcriptomic, and clinico-demographic data is handled for each cohort independently, and in a centralized manner, for the entirety of the cohort.As part of the initial data preprocessing, principal components summarizing genetic variation in DNA and RNA sequencing data modalities are generated separately.For the DNA sequencing, ten principal components were calculated based on a random set of 10,000 variants sampled after linkage disequilibrium pruning that kept only variants with r 2 < 0.1 with any other variants in ±1 MB.As a note, these variants were not p value filtered based on recent GWAS, but they do exclude regions containing large tracts of linkage disequilibrium 49 .Our genetic data pruning removed SNPs in long tracts of high LD such as in the HLA region (we excluded any SNPs within r 2 > 0.1 within a sliding window of 1 MB), while retaining known genetic risk SNPs within the region.For RNA sequencing data, all protein-coding genes' read counts per sample were used to generate a second set of ten principal components.All potential features representing genetic variants (in the form of minor allele dosages) from sequencing were then adjusted for the DNA sequence-derived principal components using linear regression, extracting the residual variation.This adjustment removes the effects of quantifiable European population substructure from the genetic features prior to training, this is similar in theory to adjusting analyses for the same principal components in the common variant regression paradigm employed by GWAS models.The same was done for RNA sequencing data using RNA sequencing derived principal components.This way, we statistically account for latent population substructure and experimental covariates at the feature level to increase generalizability across heterogeneous datasets.In its simplest terms, all transcriptomic data were corrected for possible confounders, and the same is done for genotype dosages.After adjustment, all continuous features were then Z-transformed to have a mean of 0 and a standard deviation of 1 to keep all features on the same numeric scale when possible.Once feature adjustment and normalization were complete, internal feature selection was carried out in the PPMI training dataset using decision trees (extraTrees Classifier) to identify features contributing information content to the model while reducing the potential for overfitting prior to model generation 21,50 .Overfitting here is defined as the over-performance of a model in the training phase with minimal generalizability in the validation dataset due to the inclusion of potentially correlated or unimportant features.The implementation of decision trees for feature selection helps remove redundant and low-impact features, helping us to generate the most parsimonious feature set for modeling.Feature selection was run on combined data modalities to remove potentially redundant feature contributions that could artificially inflate model accuracy.Export estimates for features most likely to contribute to the final model in order of importance were generated by the extraTrees classifier for each of the combined models, and are available on the Online Repository 48 .By removing redundant features, the potential for overfitting is limited while also making the models more conservative.Additionally, if a variant provided redundant model information, such as being in strong linkage with a PRS variant, it would be removed from the potential feature list.
Feature selection was performed using the extremely randomized trees classifier algorithm, extraTrees 50 , on combined data modalities to remove redundant feature contributions that could overfit the model to optimize the information content from the features and limit artificial inflation in predictive accuracy that might be introduced by including such a large number of features before filtering.In many cases, including more data might not be better for performance.With this in mind, we attempted to build the most parsimonious model possible using systematic feature selection criteria 51 .Among the top 5% of features ranked in the Shapley analysis, the mean correlation between features was r2 < 5%, with a maximum of 36%.By removing redundant features using correlation-based pruning and an extraTrees classifier as a data munging step, the potential for overfitting is limited while also making the models more conservative.
Clinical and demographic data ascertained as part of this project included age at diagnosis for cases and age at baseline visit for controls.Family history (self-reporting if a first or seconddegree relative has a PD diagnosis) was also a known clinico-demographic feature of interest.Ashkenazi Jewish status was inferred using principal component analysis comparing those samples to a genetic reference series, referencing the genotyping array data from GSE23636 at Gene Expression Omnibus as previously described elsewhere 42,52 .Sex was clinically ascertained but also confirmed using X chromosome heterozygosity rates.The University of Pennsylvania smell inventory test (UPSIT) was used in modeling 53 .A comprehensive description of data and preprocessing is described in 5 .

Quantification and Statistical Analysis
We conducted K-fold cross-validation on the PPMI dataset, where K=6, allowing each fold to contain approximately 100 samples.For each cross-validation fold, 1/K of the PPMI samples are withheld as a holdout test set.The remaining training split of K -1/K samples are further split using uniform stratified random sampling at an 80:20 ratio into training and validation subsets.The evaluation set was used for cross validated hyperparameter tuning in the central and federated models.The PPMI dataset was selected as the training and internal test set due to the fact that patients samples recorded in the PPMI protocol are newly diagnosed and drug-naive.Additionally, by training on the PPMI dataset, the model is developed on patients earlier on in their disease course.This intentional choice was made in the hopes the model would identify other individuals early on in their disease course and prioritize them for follow-up.The PDBP cohort samples are collected within five years after diagnosis, and may be actively taking medications.While the PDBP cohort is larger, because samples are collected several years after diagnosis, and because patients may be actively taking medication, there is a possibility of label leakage, ultimately motivating the usage of the PPMI dataset for training.Due to the similar nature of the PPMI and PDBP dataset, after processing, the PDBP dataset can be used as an external test set, approximating out-of-distribution model performance.
To conduct federated model training, the fully preprocessed PPMI dataset is split into disjoint client subsets, using one of the split strategies, and assigned to a local learner.To train a global model using the data among all federation participants, an iterative optimization process is run for a predefined set of rounds (Fig. 2).During this process, federation members fit local learner models to their locally available datasets.The parameters resulting from local model fitting are then sent to the central aggregation server.Once all model parameters are received, the aggregation server applies a federated learning strategy to the set of model weights, resulting in aggregated model weights, referred to as the global model.The global model is then sent back to the client sites, and used as the starting point for local learner optimization in subsequent iterations.The best performing global model on the evaluation set is used for final testing.
We simulate two types of heterogeneity in our experiments by distributing samples from the training dataset using different split methods.The split method uniform stratified sampling implements label, and site size homogeneity.The split method uniform random sampling implements label heterogeneity, but site size homogeneity.The split method linear random sampling implements label and site size heterogeneity (Fig. 3).
Local client data sets were formed by applying such sampling techniques to the centralized training data set.Uniform stratified sampling was used to form the K folds and during experiments that study homogenous federations.Uniform stratified sampling entails sampling the source dataset such that there is a nearly even distribution of samples among each of the N clients, and the ratio of cases to controls across each client subset is equivalent (where surplus samples are assigned to one of the client sites at random).This method was implemented by partitioning the source dataset by phenotype and then, without replacement, assigning 1/N samples of each phenotype partition to each client using uniform random sampling, with the last client receiving any extra samples.In practice, this additional data was less than 10 of samples.Uniform random sampling entails assigning 1/N using uniform random sampling.Linear random sampling entails assigning c i samples to a client site, where the following is true:

ൌ‫ܥ‬
In the above formula, C is the number of samples to distribute, in practice the size of the PPMI training set, and i is the index of the client site.As with previous methods, the final client receives any surplus samples left over.The effect of this linear random sampling strategy is that each of the N clients contains an increasing number of samples relative to the previous clients, and each client site contains a random distribution of cases versus controls.
To measure algorithm runtime, for central algorithms, we measured the quantity of seconds from model initialization, to model training completion.For federated algorithms, we measured the quantity of seconds from model initialization, to the end of the FL training simulation.For FL models, model optimization was conducted for a federation of N=2 federation clients, for 5 aggregation rounds.
In our simulation configurations, federation rounds operate synchronously, and without failure.Hyperparameters that were used to compute the final results are reported in the appendix, including the random seed used for the presented results.
The Federated Machine Learning methods implemented in the study utilize the federated aggregation methods FedAvg 17 , FedProx 18 , and the local learner classification methods, Logistic Regression 16 , Multi-Layer Perceptron 19 , Stochastic Gradient Descent 22 , and XGBoost 20 available through Sklearn 21 and DMLC 20,21 .The aggregation methods are implemented using NVFlare 31 , and Flower 32 , while local learner methods are implemented using Sklearn, and DMLC APIs.

Declaration of Interests
B.D., A.D., D.V., M.A.N., and F.F.'s declare no competing non-financial interests but the following competing financial interests as their participation in this project was part of a competitive contract awarded to Data Tecnica LLC by the National Institutes of Health to support open science research.M.A.N. also currently serves on the scientific advisory board for Character Bio and is an advisor to Neuron23 Inc.The study's funders had no role in the study design, data collection, data analysis, data interpretation, or writing of the report.Authors M.B.M, P.S.L and J.S. declare no competing financial or non-financial interests.All authors and the public can access all data and statistical programming code used in this project for the analyses and results generation.F.F. takes final responsibility for the decision to submit the paper for publication.The Bigger Picture The wide-scale application of artificial intelligence and computationally intensive analytical approaches in the biomedical and clinical domain is largely restricted by access to sufficient training data.This data scarcity exists due to the isolated nature of biomedical and clinical institutions, mandated by patient privacy policies in the health system or government legislation.Federated Learning (FL), a machine learning approach that facilitates collaborative model training is a promising strategy to address these restrictions.Therefore, understanding the limitations of cooperatively trained FL models, and their performance differences to similar, centrally trained models, is crucial to valuing their implementation in the broader biomedical research community.

Figures
Figures

Figure 1 :
Figure 1: The experiment workflow diagram and the data summary.The harmonized, and joint-called PPMI and PDBP cohorts originate from the AMP-PD initiative.The PPMI cohort is split into K folds, where one fold is left as a holdout (internal) test set, and the remaining are used for model fitting.The training folds are split using an 80:20 ratio to form the training validation split.The training split is distributed among N clients using one of the split strategies to simulate the cross-silo collaborative training setting.FL Methods consist of a local learner and an aggregation method.Similarly, several central algorithms are used to fit the training data.The resultant Global FL models and the ML models resulting from central training are tested on the PPMI holdout fold (internal test) and the whole PDBP test set (external test).

Figure 2 :Figure 3 :Figure 4 :
Figure 2: Federated architecture and training summary.The FL architecture used in the study also illustrates 1 round of FL training for the case of N=3 clients.The aggregation server aggregates trained local learner parameters from clients and computing a global model.Client Sites contain their own siloed dataset, each with different samples.The trained client parameters are represented by the blue, orange, and green weights; the black weights represent the aggregated global model.Client model aggregation implemented by the federated learning strategy is denoted by f.Once global weights are computed, a copy is sent to each client; the global model is used to initialize the local learner model weights in subsequent FL training rounds.=3 nd nt ts; ed is in

Figure 6 .
Figure 6.The mean runtime to train FL models using FedAvg and FedProx strategies. 17 LR, FedProx 18 μ = 0.5 LR, and FedProx μ = 2 LR, have an AUC-PR of 0.874±0.042,0.887±0.041,0.906±0.04.In the external test set, a similar relationship between central LR and federated LR is exhibited, where classical LR has an 105 and is also made available for use under a CC0 license.(whichwas not certified by peer review) is the author/funder.This article is a US Government work.It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.(whichwas not certified by peer review) is the author/funder.This article is a US Government work.It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.(whichwas not certified by peer review) is the author/funder.This article is a US Government work.It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.(whichwas not certified by peer review) is the author/funder.This article is a US Government work.It is not subject to copyright under 17 USC which was not certified by peer review) is the author/funder.This article is a US Government work.It is not subject to copyright under 17 USC ( 10.04.560604 doi: bioRxiv preprint

Table 3 :
AUC-PR score of models trained using Federated Learning as the quantity of client sites increased, tested on the PPMI dataset.Data reported is mean and standard deviation across K=6 fold cross validation.

Table 5 :
The total runtime in seconds to train central and federated models, averaged over K folds.Algorithms are grouped by aggregation strategy (Central, FedAvg, FedProx).The lowest training time for each group is bolded.105 and is also made available for use under a CC0 license.(which was not certified by peer review) is the author/funder.This article is a US Government work.It is not subject to copyright under 17 USC The copyright holder for this preprint this version posted December 26, 2023.; https://doi.org/10.1101/2023.10.04.560604 doi: bioRxiv preprint