A clustering approach to improve our understanding of the genetic and phenotypic complexity of chronic kidney disease

Chronic kidney disease (CKD) is a complex disorder that causes a gradual loss of kidney function, affecting approximately 9.1% of the world’s population. Here, we use a soft-clustering algorithm to deconstruct its genetic heterogeneity. First, we selected 322 CKD-associated independent genetic variants from published genome-wide association studies (GWAS) and added association results for 229 traits from the GWAS catalog. We then applied nonnegative matrix factorization (NMF) to discover overlapping clusters of related traits and variants. We computed cluster-specific polygenic scores and validated each cluster with a phenome-wide association study (PheWAS) on the BioMe biobank (n=31,701). NMF identified nine clusters that reflect different aspects of CKD, with the top-weighted traits signifying areas such as kidney function, type 2 diabetes (T2D), and body weight. For most clusters, the top-weighted traits were confirmed in the PheWAS analysis. Results were found to be more significant in the cross-ancestry analysis, although significant ancestry-specific associations were also identified. While all alleles were associated with a decreased kidney function, associations with CKD-related diseases (e.g., T2D) were found only for a smaller subset of variants and differed across genetic ancestry groups. Our findings leverage genetics to gain insights into the underlying biology of CKD and investigate population-specific associations.


Introduction
Chronic kidney disease (CKD) is a primarily asymptomatic disease characterized by a gradual loss of kidney function over a period extending from several months to years [1] .CKD affects approximately 9.1% of the global population, with a higher prevalence in high-income countries [2] .The leading risk factors for developing CKD are diabetes (40% of cases) and hypertension (29% of cases), followed by heart disease, family history of CKD, and obesity [3] .Other factors, such as exposure to HIV and contaminants, are additionally important in low-income countries [4,5] .The genetic ancestry also plays a crucial role, with increased risk rates of kidney failure in Black/African Americans and Hispanics/Latinos compared to individuals of European ancestry [6] .If left untreated, CKD increases the mortality risk for individuals with cardiovascular disease (CVD) and can result in the complete loss of kidney function [7] .Therefore, early detection is critical for improving quality of life and life expectancy.During the early stages of CKD, costeffective treatment options are available and can be tailored to the cause of the disease [8] .
CKD is de ned by a reduced functionality of the kidneys, which limits its ltering capability over a period of at least three months [9] .The main biomarkers for CKD detection include the urinary albumin/creatinine ratio (ACR) and the estimated glomerular ltration rate (eGFR) [10] .While ACR facilitated diagnosing albuminuria -an indicator of kidney damage characterized by an elevated excretion of urinary albuminthe eGFR estimates the lter volume of the glomerulus per unit of time using different biomarkers such as serum creatinine [10] .An abnormal kidney activity is indicated by high ACR values, reduced eGFR, or both.
Over the past few decades, many large-scale genomic studies, such as genome-wide association studies (GWAS), have successfully identi ed more than 500 independent genetic variants associated with reduced kidney function [11][12][13] .The association between genetic variants and various phenotypes has been studied, and the results are often shared in publicly available databases, like the GWAS Catalog [14] .
The association of one genetic variant with multiple traits can be considered to identify secondary traits associated with a phenotype.This understanding can help elucidate potentially shared disease mechanisms, assuming that genetic variants affecting a shared pathway also have a similar impact on the associated traits.
Soft-clustering methods provide a means to reduce the genetic complexity of a heterogeneous disease while also accounting for shared disease mechanisms.In contrast to hard-clustering approaches like Kmeans or hierarchical clustering, soft-clustering enables the factorization of high-dimensional data by identifying overlapping clusters [15] .Non-negative matrix factorization (NMF) is a family of algorithms within multivariate analysis that addresses the dimensionality challenge by extracting meaningful features from a given data set [16,17] .
In this study, we aimed to deconstruct the heterogeneity of CKD by identifying its genetic subtypes.First, we collect all published variant-trait associations for variants associated with reduced kidney function and apply soft-clustering using NMF.We used the algorithm's weights to calculate cluster-speci c polygenic scores (cPGS) within the BioMe biobank.Finally, we use a phenome-wide association study (cPGS-PheWAS) to validate and interpret the clusters.By deconstructing the complexity of CKD, this methodology contributes new insights into the disease pathways of CDK and enhances our understanding of population-speci c differences for CDK.

NMF identi ed nine clusters of CKD-associated variants
The most frequent CKD-associated secondary traits retrieved from the GWAS Catalog are related to kidney function (e.g., blood urea nitrogen, urea, uric acid, and cystatin C measurements), hemoglobin levels (e.g., hemoglobin measurements, hematocrit, and erythrocyte counts), T2D, body weight (e.g., body height, appendicular lean mass, BMI, BMI-adjusted waist-hip ratio), and pulse pressure (systolic and diastolic blood pressure measurements), among others (see Fig. S1).CKD-associated traits and their associated CKD variants were factorized into nine partly overlapping clusters by conducting NMF.To ensure the results were robust, we repeated the clustering with bNMF and got comparable results (Tab.S1).The top seven traits per cluster are summarised in Fig. 1.The 'Reduced lipids' cluster is associated with decreasing blood lipid levels (triglycerides, total cholesterol, use of lipid-lowering medications) and liver enzymes.The top traits of the cluster 'Increased body mass' show a positive association with body weight (appendicular lean mass, body height, and body weight).The clusters 'Increased blood volume' and 'Reduced blood volume' are positively and negatively associated with volemic traits (e.g., mean corpuscular volume and mean corpuscular hemoglobin), respectively.Similarly, clusters 'Increased/Reduced hematocrit' show opposite associations with hemoglobin content (e.g., hematocrit, hemoglobin measurements, red blood cell density, erythrocyte count), and clusters 'Increased/Reduced in ammation' convey opposite associations with markers of in ammation (Creactive protein) and blood lipids.Lastly, cluster 'Increased urate' is positively associated with kidney function biomarkers like urate, blood/serum urea nitrogen, blood proteins, and Cystatin C. The complete lists of the top features and variants per cluster, de ned as traits and variants in the top decile of the cluster weights of the matrices H and W, are listed in Tab.S2.The matrices H and W are also available as supplementary material.Fig. S2 summarises how the variants are distributed in each cluster, showing their overlaps.

cPGSs suggest clear differences between genetic ancestries
We extracted the cluster weights of the W matrix and used them to calculate cluster-speci c polygenic scores (cPGS) for participants of the BioMe cohort.Figure 4 shows the standardized polygenic score distributions for all NMF clusters across the BioMe cohort (ALL) and the individual continental populations EUR (n = 7,447), AMR (n = 5,336), and AFR (n = 5,660).A normal distribution was observed for the cluster 'Increased urate' (EUR, AMR, and AFR; Anderson-Darling test, all p-values available in Tab.S5).
Although polygenic scores are expected to have a normal distribution [18] , the other eight clusters present either a skewed tail (e.g., 'Increased hematocrit') or several peaks in their cPGS distributions (e.g., 'Reduced in ammation').As illustrated in Fig. 3, the peaks are caused by a few variants with relatively high cluster weights (the complete list of cluster weights for the top variants of each cluster is available in Tab.S2).For example, the top variant in cluster 'Increased in ammation' (rs429358, mapped gene: APOE) weighs 4.6, while the second one (rs17050272) weighs 0.2.In Fig. 4, we can also observe how this variant is more frequent in participants of inferred EUR ancestry.Similarly, the top variant of 'Reduced in ammation' (rs1260326, mapped gene: GCKR) weighs 5.7 and seems to be more frequent in the AFR population, while the second one (rs4418728) weighs 0.9.This unbalance in weight creates the three peaks of the distributions: the lower peak includes the scores of individuals without the top variant (0 copies), the middle one the heterozygous (1 copy), and the higher peak includes scores of participants with two copies of the top variant.Other ancestry-speci c differences are visible in the distributions of four clusters and are signi cant when testing with the Mann-Whitney test (all p-values available in Tab.S4).This suggests that some variants appear with different frequencies in people that do not share similar ancestry: 'Increased in ammation' (all combinations), 'Reduced in ammation' (all combinations), 'Reduced lipids' (EUR vs. AFR), and 'Increased body mass' (EUR vs AFR and AFR vs AMR).

Distribution of participants across clusters
As the cPGS are calculated separately per cluster, each BioMe participant might have high polygenic scores in multiple clusters.Therefore, to understand the cluster overlap in terms of relative risk, we checked how many individuals belonged to the top decile of 1 or more clusters.58% (18,431/31,701) of the whole BioMe cohort (ALL) were at high risk in at least 1 cluster.Of these, 60.2% were in the top decile for only 1 cluster, while 37% were at risk for 2-3 clusters (Fig. S4).

Discussion
CKD is typically de ned as a progressive loss of kidney function over time.Although numerous genetic variants have been identi ed as associated with CKD, their relationship to disease pathways remains largely unclear.The work described here is the most comprehensive assessment of how variants associated with CKD can be grouped according to different CKD-related factors.Speci cally, we included variant-trait associations of 322 CKD SNPs and 229 related metabolic traits from publicly available GWAS datasets.By analyzing these associations with NMF, a factorization approach that allows for minimal overlap between groups, we identi ed 9 clusters of CKD variants and associated traits.
CKD is commonly recognized as a heterogeneous condition with various underlying causes and risk factors, which are unlikely to represent a single disease process.This complexity is also re ected by the associated traits retrieved from published GWAS, which are related to kidney function, hemoglobin levels, T2D, body weight, and pulse pressure, among others.Attempting to deconvolute CKD's genetic heterogeneity and differentially grouping these traits, the nine clusters we identi ed represented different aspects of CKD.For example, the 'Increased urate' cluster, whose clustering weights represent abnormal levels of urinary metabolites like urate, blood/serum urea nitrogen, blood proteins, and Cystatin C, is related to decreasing kidney function.In normal conditions, such blood metabolites are excreted by the kidneys, but in CKD they accumulate and exert a detrimental biological activity [19,20] .A second cluster, which we summarised as 'Increased in ammation,' was strongly clustered around rising serum C-reactive protein (CRP) concentrations.CRP is a common in ammatory biomarker in chronic diseases like CKD, diabetes, and cardiovascular diseases [21][22][23] .In line with that, patients with CKD commonly experience chronic in ammatory states [24] .These states tend to worsen as the disease progresses toward end-stage renal disease and are re ected, or even modulated [25] , by increasing CRP levels [26][27][28] .
We then studied the genotype-phenotype correlation to demonstrate the utility of the clusters.We could replicate most of the top-weighted features on quantitative traits (i.e., biomarkers), while the validation on binary traits (i.e., diagnoses) was less robust and required additional clinical interpretation.For the clusters of 'Increased urate' and 'Increased in ammation,' the top traits were con rmed by the PheWAS.
CKD is also associated with dyslipidemia comprising high levels of triglycerides and LDL-cholesterol, and low levels of HDL-cholesterol and apolipoprotein A1 [29] .We could observe similar associations in clusters 'Increased in ammation,' 'Reduced lipids,' and 'Reduced in ammation.'Notably, we found multiple signi cant associations for cluster 'Increased in ammation' with reduced risk of dementias.
Glycerophospholipids play an essential role in neural membranes [30,31] , and their levels are directly correlated with serum triglycerides and inversely correlated with total cholesterol and eGFR [32] .
A limitation of this study is the need for more genetic diversity in the GWAS Catalog, which mainly consists of studies performed on the European population.This European bias is well described in the literature and has important implications for disease risk prediction across global populations [33] .Despite this lack of genetic diversity, we could still validate our results in BioMe, a biobank enriched for populations with non-European ancestries.We were most powered when jointly analyzing across ancestries (ALL), while signals replicated in different ancestral groups with some group-speci c differences.This result suggests that, although most CKD risk factors converge across ancestral groups, ancestry-speci c studies are essential.Another two limitations are the ltering rules used to select traits and variants for the algorithm's input matrix and the possible existence of non-additive interactions between risk factors that we did not consider in this study.Lastly, one of the input CKD studies, the PAGE study [34] , was also conducted using BioMe data.However, this should not impact the results since we are not looking at CKD case/control scenarios but at CKD subtypes.
Understanding the biological pathways that lead to CKD is essential to improve clinical management.For example, some clusters group similar traits but with opposite effect directions (e.g., 'Increased hematocrit' and 'Reduced hematocrit'), while others suggest potentially protective effects (e.g., against dyslipidemia in cluster 'Increased in ammation').This behavior might indicate that CKD can affect the same metabolic pathways differently, con rming the genetic complexity of the disease.Additionally, the clusters have a limited degree of overlap and, as each represents a speci c set of variants, participants might be high risk (i.e., in the top decile of the polygenic score) for more than one cluster.This additive disease model, similar to the mutational signatures in cancer, suggests a possible interplay of genetic susceptibility to multiple disease-causing mechanisms [35] .
In summary, by clustering genetic variants associated with CKD, we identi ed clusters with distinct trait associations, likely representing mechanistic pathways involved in CKD.We con rmed the validity of these clusters phenotypically.Further clinical investigations could explore whether individuals with a common disrupted pathway also share similar complications, a comparable rate of disease progression, or a different treatment response.In the future, classifying patients with CKD using their genotype may improve care by offering a more personalized and genetically informed clinical plan.

Trait-variants selection
We identi ed and aligned the alleles of 508 independent genetic variants associated either with decreased kidney function (de ned as low eGFR levels for at least three months) or with CKD (using ICD-9/10 codes) from the most recent GWAS and GWAS meta-analyses [11,13,34,36,37] (Fig. 5a).We then used the R package LDlinkR (R version 4.2.1) to retrieve all proxy SNPs in linkage disequilibrium (r 2 > = 0.6) with the lead variants, across all available 1000G human populations [38,39] and used the GWAS Catalog database to link the proxy SNPs to 805 associated traits (as of July 30th, 2022) [14] .We excluded genderspeci c GWAS and GWAS performed on less than 100 individuals.Additionally, as we are interested in secondary features associated with CKD, we excluded GWAS of traits directly related to eGFR or CKD (e.g., "Mild to moderate chronic kidney disease," "Estimated Glomerular Filtration Rate").We kept traitvariant associations with a signi cance threshold of less than 1×10 − 6 using a Bonferroni correction for all 2,401 associations in our data set.To reduce sparsity in the data, we excluded traits associated with less than ve variants; this threshold was empirically de ned by comparing the clustering results of traits associated with up to 15 CKD variants.We standardized effect sizes across all GWAS by dividing the regression coe cient beta (B) by the standard error, using the GWAS summary statistic results.Traits and variants were then arranged as a matrix with the standardized effect sizes (β) as values.Tab.S5 contains, for each input CKD variant, the list of CKD-associated secondary traits extracted from the GWAS Catalog and the corresponding exclusion criteria for those excluded during the ltering steps.NMF NMF factorizes the input matrix of trait-variant associations (X, of dimensions 229x322) into a matrix of traits (H, 229xK) and one of variants (W, Kx322), so that HxW ≈ X [16] (Fig. 5b).The factorization rank K corresponds to the number of clusters.We implemented NMF using the R package ButchR with 10,000 iterations, 30 random initiations, and the convolution threshold set to 80 [40] .The number of expected clusters was set between 2 and 20.ButchR suggests the optimal K based on six cluster evaluation metrics, like the mean silhouette width and the Frobenius error.If two or more K were presented, we considered results with the highest mean silhouette width and the lowest Frobenius error, as suggested by Alexandrov et al [35] .As additional validation, we also performed a Bayesian version of NMF [17] , using the code provided by Udler et al [41] .bNMF was run 1,000 times with up to 200,000 iterations in each run.

Cluster-speci c polygenic scores
The results of clustering provide cluster-speci c weights for each variant and trait.We used PLINK and the variant cluster weights to calculate cluster-speci c polygenic scores (cPGS) of the BioMe biobank participants [42] .cPGS were standardized within each cluster.The normality of each cPGS distribution was tested with the Anderson-Darling method.Differences between ancestry-speci c distributions were tested with the Mann-Whitney test.

Validation cohort (Bio Me )
We validated our results using the genetic and linked electronic health records (EHR) data of 31,701 BioMe biobank participants [43] (Fig. 5c).As a ne-scale population structure can improve the risk prediction of complex diseases within genetic groups [44] , we inferred the genetic ancestry of the BioMe participants.We then performed a Principal Component Analysis (PCA) using PLINK, excluding relatives above 2nd-degree (kinship method, estimated using KING [45] ) and variants with minor allele frequency below 0.05 [42,46] .We trained a random forest classi er to infer the genetic ancestry of BioMe participants using the 1000 Genomes labels as reference [47] .The labeled ancestries are Admixed American (AMR, n = 5,336), African (AFR, n = 5,660), European (EUR, n = 7,447), South Asian (SAS, n = 613), and East Asian (EAS, n = 728).For sub-population-speci c analyses, we removed participants with mixed ancestry (de ned as having a random forest probability ≤ 0.5) and outliers by only including the quantiles 0.25-0.90 [48](n = 11,404).

Modeling disease outcomes as a function of clusterspeci c polygenic scores
For each cluster, the cPGS were associated with the phenotypes available in the BioMe data set by performing a phenome-wide association study (cPGS-PheWAS).We tted linear regression models to analyze 988 quantitative traits (e.g., laboratory results) and logistic regression models for 832 binary traits with cPGS as independent variables, adjusting for sex, age, and the rst ten genetic principal components (stats R package [49] ).Binary traits included Phecodes mapped to ICD-9 and ICD-10 codes (a Phecode is considered if at least two relevant diagnostic codes were present in a patient's EHR) [50] and curated phenotypes [51] .Controls were identi ed as the reference category.Traits were only considered if present or measured in at least 100 biobank participants.The model parameters were standardized using the effectsize R package (re t method) [52] .Standardized coe cient estimates (linear regression) and odd ratios (logistic regression) were reported with the corresponding 95% con dence intervals.The Bonferroni method was used to adjust for multiple testing, and the alpha threshold was de ned as 2.7e-05 (0.05/(988 + 832)).We then compared the PheWAS results with the traits in the top decile of NMF's trait weights.Replication of cluster traits (also available as LaTeX code) The table lists traits replicated with the PheWAS on the ALL cohort for each cluster.'Dir' is the trait effect direction, 'activity' is the trait cluster weight, 'OR' is the standardized odds ratio (binary traits cPGS-PheWAS), 'Coeff' is the standardized coe cient estimate (quantitative traits cPGS-PheWAS), '95% CI' are The gure compares the standardized cPGS distributions between inferred ancestries of the BioMe participants.The x-axis represents the units of standard deviation (or z-scores).AFR, AMR, and EUR refer to the sub-cohorts of individuals with inferred African, Ad Mixed American, and European ancestry, respectively

Figures
Figures