Machine learning of COVID-19 clinical data identifies population structures with therapeutic potential

Summary Clinical outcomes for patients with COVID-19 are heterogeneous and there is interest in defining subgroups for prognostic modeling and development of treatment algorithms. We obtained 28 demographic and laboratory variables in patients admitted to hospital with COVID-19. These comprised a training cohort (n = 6099) and two validation cohorts during the first and second waves of the pandemic (n = 996; n = 1011). Uniform manifold approximation and projection (UMAP) dimension reduction and Gaussian mixture model (GMM) analysis was used to define patient clusters. 29 clusters were defined in the training cohort and associated with markedly different mortality rates, which were predictive within confirmation datasets. Deconvolution of clinical features within clusters identified unexpected relationships between variables. Integration of large datasets using UMAP-assisted clustering can therefore identify patient subgroups with prognostic information and uncovers unexpected interactions between clinical variables. This application of machine learning represents a powerful approach for delineating disease pathogenesis and potential therapeutic interventions.


INTRODUCTION
The COVID-19 pandemic has led to >4.6 million deaths to date, but the clinical outcome after primary infection is heterogeneous and approaches to predict outcome within individual patients are of value. Several demographic features increase the mortality risk, including age and comorbid conditions, and have been used to define clinical risk scores. The ISARIC4C mortality score is used widely and assesses nine demographic and laboratory values (Knight et al., 2020).
This clinical heterogeneity of COVID-19 has led to interest in further defining patient subgroups (Gutié rrez-Gutié rrez et al., 2021;Rodriguez et al., 2021), but although inclusion of more demographic and laboratory values can improve accuracy, the associated increase in data dimensionality is challenging for clustering algorithms. As such, integration of outputs from very high dimensional datasets usually requires either feature selection or dimensionality reduction. Although the latter approach loses some information, the associated compression and noise reduction greatly improves utility and is used commonly in biological assessments such as analysis of single-cell RNA sequencing data (Peyvandipour et al., 2020). The reduced dimensional outputs created through principal component analysis (PCA) are a linear combination of the input and struggle to capture complex nonlinearities (Jolliffe and Cadima, 2016). Nonlinear dimension reduction or manifold learning techniques such as Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018;Tang et al., 2016) are based on topological analysis and can define groups without prespecifying a target for the algorithm (Grollemund et al., 2020). The combination of UMAP dimensionality reduction and model-based clustering offer a powerful approach to unsupervised machine learning (Allaoui et al., 2020;Prabakaran et al., 2019).
To identify subgroups of patients with COVID-19, we obtained information on 28 clinical, demographic, and laboratory variables in a training cohort of 6099 patients on the day of their hospital admission with acute COVID-19 with outcomes followed for 28 days including inpatient and community mortality. Low-risk patients had a consistently high survival probability of 99% (training CI 98-100; wave 1 CI 96-100; wave 2 CI 97-100). For the intermediate risk group, the survival probability was 92%  in the training cohort, 93% (CI 90-97) in wave 1 and marginally higher in wave 2 at 95% (CI 93-97). High risk patients had a lower survival probability of 70% in both the training (CI 68-71) and wave 1 cohorts , but this improved in wave 2 to 82% (CI 78-85). However, patients in the very high-risk group had a consistently poor prognosis in each cohort. The survival probability was 41%  in the training cohort, 40% (CI 34-49) in wave 1, and 34% (CI 23-47) in wave 2.
Taken together, these data indicate that although overall prognosis improved for patients in wave 2, this was driven mainly by improvements in the outcome of intermediate and high-risk patients, with no apparent improvement for those in the very-high risk group.
UMAP transformation and GMM clustering of clinical variables identifies distinct patient groups 28 demographic and laboratory variables were obtained for each patient on the day of entry to hospital and used for inclusion within the UMAP analysis (Table 1). Patient subgroups were identified within the UMAP embedding using GMM clustering (Figure 2A).
A global estimate of intrinsic dimensionality indicated between two and six dimensions in the training cohort ( Figure S4A). The FAMD scree plot indicated an ''elbow'' point at four or five dimensions (Figure S4B). Cluster silhouette was relatively stable with increasing dimensionality ( Figure S4C). However,   Figure S4D). From this, a UMAP embedding with four dimensions ('4-D embedding') was selected as input for clustering analysis.
The number of mixture components -K -was substantially higher when selected by BIC than by silhouette width ( Figure S4E). As no clear optimal choice was apparent, several models were fitted, and the 28-day mortality rate was compared between clusters to ensure that the range of this value was comparable to that seen with the ISARIC4C score ( Figure S4F). The manual BIC model resulted in the largest range of 28-day mortality (2-65%) where K = 29, in addition to comprising markedly fewer components than the iScience Article optimal BIC model of K = 49, where the mortality range was 1-63%. In contrast, silhouette models resulted in substantially less diversity in the mortality range between clusters (e.g., 23-38% where K = 3 and 8-45% where K = 9). As such, K = 29 was chosen for onward analysis.
The final model -a GMM with 29 mixture components -was fitted to the training cohort using a 4-D embedding. The median number of patients per cluster was 175 (range 84-548). This model was then applied to patients within the validation datasets ( Figure 2B) with comparable representation. Patient distribution within clusters was broadly similar between the training and validation cohorts. Strong cluster separation was observed using silhouette width analysis, although some clusters showed low levels of separation ( Figure S5). In particular, clusters 17, 22, 26, and 14 each had consistently negative silhouette widths whereas clusters 27, 1, 24, and two had a negative silhouette only in the validation cohorts.
Patient clusters defined from the training cohort are predictive of mortality rates within confirmation cohorts 2-D visualization of the 4-D clustering was used to display clusters across the three patient cohorts (Figure 3A). Allocation of sex to the clusters was investigated as this is a significant determinant of mortality. Relative distribution of sex varied between clusters and supported its importance of a key factor in clinical outcome ( Figure 3B). We were next interested to assess their potential predictive power for subsequent mortality. The overall 28-day mortality rates for the training cohort and the wave 1 and 2 cohorts were 26%, 29, and 14%, respectively. Mortality at day 28 after admission varied markedly between clusters in the training dataset ranging from 2% within Cluster 18 to 65% in Cluster 24 ( Figure 3C).
These values were then compared to clinical outcomes for patients assigned to the equivalent cluster within the validation cohorts. Mortality prediction from cluster assignment modeled from the training iScience Article dataset was found to be highly predictive for patients within the validation cohorts. Indeed, of the 24 clusters within the wave 1 validation cohort, only two had significantly different odds of mortality from the training cohort (Table 2). Specifically, Cluster 12 had a mortality rate of 36% compared to 20% in the training cohort (OR 2.2, p = 0.03), and Cluster 26 had values of 39 vs 19%, respectively (OR 2.8, p = 0.02) ( Figure 4A). Cluster-associated mortality rates also showed a high degree of concordance between the wave 2 validation cohort and the training dataset ( Figure 4B) with differences observed only for Cluster 2 (2 vs 8%; OR 0.3, p = 0.03) and Cluster 28 (12 vs 25%; OR 0.4, p = 0.04).
These data show a high level of correspondence between the cluster-specific mortality rates in the training cohort and those seen in the validation cohorts indicating confidence that clustering can be generalized to external and temporally independent patient cohorts.  iScience Article Mortality rates within clusters reveal differential improvement in outcome between wave one and wave two of the COVID-19 pandemic The period between waves one and two witnessed changes in the clinical management of COVID-19, and the mortality rate of 29% in wave 1 fell to 14% in wave 2. As such, we were interested to determine the utility of cluster modeling to assess relative improvements within different patient subgroups across this period.
Several differences in the demographic profile of patients between the two validation cohorts were apparent. Patients within wave 1 were an average 6 years older than those in wave 2 and had higher rates of comorbidity. Indeed, 53% had cardiovascular disease compared to 29% in wave 2 and rates of dementia or cancer also both fell during wave 2 (34 vs 23%; 12 vs 11%) (Table 1). In addition, the proportion of patients with diabetic complications halved to 5% in wave 2 (Table S1). Changes in the profile of laboratory variables were also apparent over time. For instance, the neutrophil: lymphocyte (N:L) ratio fell from 6.0 to 4.8, driven by an increase in lymphocyte count and less marked neutrophilia, whereas estimated glomerular filtration rate (EGFR) and albumin both showed improvements (68 vs. 82 mL/min; 31 g/L vs 34 g/L) ( Figure 5A). iScience Article We next contrasted cluster-associated mortality rates for patients in wave 1 and wave 2, noting that these were managed in the same hospital. Comparison was possible for 20 clusters, of which, 17 showed a decrease in the odds of mortality, whereas two increased and Cluster 2 remained unchanged ( Figure 5B). However, these changes were only significant for Cluster 28, which showed a 70% fall in the odds of mortality within wave 2 (OR 0.3, p = 0.01). Clusters with above average mortality showed little evidence of improvement such as clusters 4, 17, and 24 (OR 0.7, p = 0.6; 0.9, p = 1; OR 0.8, p = 0.76). Strikingly, the number of patients within Cluster 2, which was associated with a very low mortality rate, increased markedly from 4% to 17% between waves 1 and 2.
As such, these findings reveal that the fall in mortality rate in wave 2 was not uniform across clusters. There was a marked improvement in outcomes in the middle of the risk spectrum, whereas high-risk clusters continued to have a bad prognosis. Moreover, low risk patients remained highly likely to survive.

Patient clusters reveal interactions between clinical variables that determine patient outcome
As patient clusters varied markedly in relation to 28-day mortality, we were next interested to determine the profile of clinical variables within each subgroup. In particular, the unsupervised nature of the analysis was thought likely to uncover interactions between variables that were not predictable before UMAP transformation and clustering.
Variables that were significantly associated with each cluster were identified and those which could be confirmed in more than one cohort were presented as a word cloud on the UMAP plot ( Figure 6A) where the size of the word was scaled according to strength of association with the cluster.
Interesting relationships between variables were observed within the clusters. Increasing age correlated with the clinical severity of cluster-associated outcome, but it was noteworthy that this was not uniform (Figure 6B). Cluster 24, which showed the highest rate of mortality, included many patients with dementia and was associated with low EGFR and high levels of lactate and urea, suggesting a prognostic interaction between impaired renal function and metabolic acidosis (Figures 7 and 8). In contrast, Cluster 17 comprised somewhat younger patients with markedly low EGFR and high levels of lactate and urea.
Furthermore, this process of deconvolution of the determinants of clinical risk within clusters does not need to be limited to variables included within the UMAP transformation, allowing additional features associated with each cluster to be assessed in relation to prognostic importance. As an example, the association tests for patients within the Cluster 17 validation cohorts were repeated to include additional variables that were absent in the training cohort (Table 3) and revealed high potassium concentration (K+) as an additional feature of this cluster.

DISCUSSION
Clinical practice is focused increasingly on defining and managing subgroups of patients to facilitate personalized approaches to therapy. Such a process is underway in conditions such as cancer where the use of genetic mutation testing has led to stratification of management (Middleton et al., 2020). However, across most conditions this process is not well advanced. The huge amount of demographic and clinical information that is acquired during patient assessment offers the opportunity to assess discrete subgroups of patients, but a major challenge has been the complexity of using such large datasets with the associated high-level dimensionality that they bring. Here, we used UMAP dimension reduction and GMM clustering to define subgroups of patients with acute COVID-19. We found that clusters were consistent between different populations, are predictive of subsequent mortality, and can be used to uncover unexpected relationships between clinical features.
We chose to use UMAP for dimensionality reduction; this uses topological data analysis and a cross validation optimization process to create a lower dimensional embedding, which retains much of the original data structure. This is an alternative approach to tSNE analysis but is scalable to larger datasets and more effectively captures local as well as global data structure (McInnes et al., 2018). A key advantage of UMAP is that the transformation can be applied to embed new observations into the same latent space, allowing it to be built into a model development pipeline. UMAP has been applied to a range of clinical and biological analyses, including stratification of patients with ALS ( iScience Article Grollemund et al. (2020) demonstrated that UMAP is capable of stratifying patients into risk groups and used dimension reduction to develop a 1-year mortality prognostic model based on the distribution of patients on a 2-D embedding and comparison of groups based on coordinates on a grid. We extended this approach by dividing the embedding into subgroups through unsupervised clustering. UMAP preprocessing can also improve the results of clustering algorithms (Allaoui et al., 2020;Hozumi et al., 2021). The combination of risk-stratification and improved clustering thus makes UMAP-assisted clustering a powerful approach for the detection of novel clinical risk groups. The default output from UMAP -a 2-D embedding -is ideal for visualization purposes. However, when applying UMAP for nonlinear dimension reduction, the choice is less clear. Our literature review did not indicate an established method to select dimensionality in this context. This decision will depend on both the complexity of the data and the ability of UMAP to represent topological structure on a lower-dimensional embedding. A qualitative approach was therefore undertaken, and a 4-D embedding was selected based on a global estimate of intrinsic dimensionality and assessment of the impact of dimensionality on the clustering model. Other parameters such as the number of nearest neighbors and minimum distance were fixed in our analysis.
Unsupervised clustering analysis was used to label patients by cluster given their distribution in UMAP space. BIC and silhouette width are validated methods for selecting the number of clusters and were adopted in this approach (Batool and Hennig, 2021;Gó mez-Rubio, 2021;Keribin, 2000;Rousseeuw, 1987). Selecting the number of clusters is a widely debated topic (Chiang and Mirkin, 2010;Fu and Perry, 2020;Fujita et al., 2014; Gó mez-Rubio, 2021). A useful property of a GMM is that it can be used to approximate any iScience Article other probability density function, given a sufficient number of mixture components (Nguyen et al., 2020). This allows the model to fit clusters which follow a non-Gaussian distribution. However, in this context the model may require multiple mixture components to fit a single cluster. BIC will optimize the number of components to best fit the underlying data distribution rather than the number of distinct clusters. When the assumption is made that each component corresponds to a single cluster of patients, BIC may lead to overestimates in the true number of clusters (Baudry et al., 2010;Gó mez-Rubio, 2021). Silhouette, on the other hand, measures the compactness within each cluster and the separation between them (Rousseeuw, 1987). Using silhouette to select the number of components therefore ensures the most distinct clustering configuration but can fail to give insight into clustering quality if cluster distributions require multiple components to be fitted (Gó mez-Rubio, 2021).
In our analysis, the optimal BIC indicated 49 components but began to plateaux beyond 29. In contrast, the maximum silhouette corresponded to three components but showed a secondary peak with 9. This large discrepancy suggested BIC may have somewhat overestimated the number of clusters, whereas silhouette provided a relative underestimate. We therefore compared the diversity of 28-day mortality risk between clusters, to optimize the potential clinical value of the modeling and found that the model with 29 iScience Article components resulted in the largest range (2-65%). As such, the decision to use BIC resulted in a model with a greater diversity of mortality risk between clusters, although some clusters had a negative silhouette width, indicative of relatively poor separation. Future work will investigate the use of an entropy criterion to combine components based on the loss of information (Baudry et al., 2010) and examine methods of determining groups from multiple survival curves (Villanueva et al., 2019). Other model-based clustering methods such as Hierarchical Density-based Spatial Clustering of Applications with Noise (Campello et al., 2015) and the use of Dirichlet processes can also be explored (Kottas, 2006).
As cluster labels are independent of outcome, unbiased comparisons can be made about the observations assigned to each subgroup. We chose to compare 28-day survival rates but other clinical outcomes, such as subsequent development of long-covid, could be compared without retraining the model. Model-based clustering allows validation of clusters in independent cohorts as the fitted model can be used to cluster new data. Indeed, we observed a concordance of results between the training and validation cohorts, which indicates that this model could potentially be used for analysis of any COVID-19 dataset provided the same predictors were available.
We applied the clustering approach to clinical variables taken on the day of hospital admission to predict subsequent mortality at day 28. This time point was chosen as most patients who die from acute COVID-19 succumb by 4 weeks after hospital entry, and it was felt that it would be challenging to extend prognostic modeling beyond this time. The associated mortality rates of individual patient clusters were very heterogeneous and ranged from 2% for the 316 patients in Cluster 18 through to 65% for 94 patients within Cluster 24. A further striking feature was that values obtained from the training cohort were strongly predictive of outcomes in the validation cohorts. As such, these findings show clustering of demographic, clinical, and laboratory features taken at the time of hospital entry is predictive of medium-term mortality. This is important as this information could be used to guide appropriate triage and clinical management at an early stage of the patient journey.
It was interesting to compare the mortality rates for patients within wave one and wave two of the pandemic treated at the same hospital. Dexamethasone emerged as a standard of care for all patients with severe COVID-19 who required oxygen therapy in July 2020 and as such would have been a default management iScience Article approach for such patients in wave 2. The overall mortality rate fell by nearly 50% during this period and lower 28-day mortality rates were seen for many clusters within wave 2, although no clear improvement was seen in the very high-risk clusters of 24, 4, and 17 which included many patients with cancer, dementia, and cardiovascular disease. This reveals that cluster analysis could be of value in assessing the differential impact of new therapies on patient groups and identifying those which represent an unmet need.
Furthermore, this use of an unsupervised machine learning approach to define patient subgroups enables identification of new and potentially unexpected interactions between clinical and laboratory variables. Patients in high-risk groups were characterized by medical comorbidity in association with laboratory variables such as impaired renal function and metabolic acidosis. The interaction between comorbidities within clusters also revealed a number of interesting features (Figure 8). For example, although patients with dementia and cardiovascular disease were enriched within the high-risk Cluster 24, this combination was also seen in several lower risk groups such as 15, 9, 19, and 29. As such, a striking feature from the analysis was that clusters comprised complex combinations of clinical determinants that would not have been easily predictable from clinical assessment. Ethnicity data was not available for the training cohort and in the validation cohorts >50% were of white ethnicity but this was not a key driver of clustering with only Cluster 15 associated with white ethnicity.
In summary, the application of untargeted machine learning to clinical data collected routinely at hospital entry allowed identification of subpopulations of COVID-19 patients with distinct mortality rates and clinical presentation on the day of admission. These were consistent across different clinical datasets and uncovered prognostic interactions between clinical variables that could influence management decisions. Furthermore, such clusters may also reveal activation of different biological pathways and might therefore uncover new mechanisms of COVID-19 susceptibility and a means to stratify therapeutic interventions in phenotype-informed randomized control trials, such as optimized management of renal impairment, metabolic acidosis, and cardiovascular disease for patients in Cluster 24. This approach has application to a wide variety of clinical conditions and contributes to the growing expectation that artificial intelligence systems will transform healthcare delivery and operate in real time to support clinical decision making in both acute and chronic conditions.

Limitations of the study
Potential limitations of our study include the fact that the training cohort was derived from several clinical sites and could contain site-specific effects with more exposure to reporting errors. The use of unconstrained covariance matrices in model fitting may limit scalability to larger and more complex datasets. In this context, a diagonal covariance matrix may be more appropriate. In addition, missing information was apparent across the three cohorts ( Figure S1).

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following: iScience Article d Any additional information required to reanalyse the data reported in this paper is available from the lead contact upon request.

Patient population
Clinical data were acquired for patients admitted to hospital with COVID-19. The CovidCollab cohort is a multicentre dataset of 6099 patients admitted between January and August 2020 and was used as the training cohort in model development (Alsahab et al., 2021). Two cohorts of patients admitted to the University Hospitals NHS Foundation Trust Birmingham (UHBFT) were used for validation. The first cohort ('wave 1') includes 996 patients admitted between January and August 2020. The second cohort ('wave 2') includes 1011 patients admitted between September 2020 and January 2021 (Table 1). Patients tested positive for SARS-CoV-2 by PCR and/or antibody test within 14 days of admission. Mortality at day 28 (including after discharge) was determined for each patient.
Ethical approval was provided by the East Midlands-Derby REC (reference: 20/EM/0158) for the PIONEER Research Database (data from University Hospitals Birmingham). For CovidCollab data, local, regional and national approvals were obtained from all participating sites. In the UK, this study was registered as clinical audit or service evaluation, with approval granted in line with local information governance policies, in line with assessment and guidance by the Health Research Authority. At the lead site (University Hospitals Birmingham NHS Trust), this study was registered as clinical audit (CARMS-15986). In other countries, local principal investigators were responsible for obtaining approvals in line with their local, regional and national guidelines and recommendations. Only routinely collected data were collected and patient care was not altered by this study. Anonymised data were securely transferred to the Birmingham Centre for Prospective and Observational Studies, University of Birmingham via REDCap. All sites were required to confirm that approvals were in place prior to being provided with logins; written data sharing agreements were arranged where requested by individual sites.

Variables and missing data
Data obtained for each patient included demographic, clinical and laboratory variables taken on the day of admission (Tables 1 and S1). Analysis was conducted in R (R Core Team, 2019) with RStudio (RStudio Team, 2016). Missing data were quantified by cohort ( Figure S1) and imputed from the observed data by predictive mean matching (Bailey et al., 2020;De Silva et al., 2019;Morris et al., 2014). The Multivariate Imputation by Chained Equations (MICE) (Van Buuren and Groothuis-Oudshoorn, 2011) algorithm was applied independently to each cohort under fully conditional specification. 28 variables (continuous and binary) (Table 1) were used for model development for which the maximum proportion of missing observations was 36% in the training cohort, 43% in wave 1 and 42% in wave 2. A single imputation of the training cohort was used for model development. Multiple imputations of waves 1 and 2 were analysed for validation (n. imputations = 5).

Data processing
UMAP was applied to the training cohort prior to model development (Melville, 2020) ( Figure S2). Binary variables were one-hot encoded (0/1), and all variables standardised prior to transformation (mean 0, variance 1). UMAP transformed the 28 clinical variables onto a lower dimensional embedding. Initial hyper-parameters were selected by visualising a 2-D embedding output ( Figure S3). A Euclidean distance metric was used with a target of 40 nearest neighbours and a minimum distance of 0.25.

Clustering model development
Gaussian mixture model (GMM) clustering was applied to automatically label the distribution of patients after transformation by UMAP (Rasmussen and Ghahramani, 2001;Scrucca et al., 2016) ( Figure S2). An Expectation-Maximization (EM) algorithm was used to estimate GMM parameters with unconstrained covariance matrices (Scrucca et al., 2016).
The number of UMAP dimensions to output for clustering, D, was determined qualitatively. Maximum likelihood estimation of intrinsic dimension (Johnsson, 2016(Johnsson, , 2019Levina and Bickel, 2004)  iScience Article used to test the effect of D on the clustering model. GMMs were fitted to each D-dimensional embedding with between 2 and 50 mixture components. Average silhouette width (Batool and Hennig, 2021;Maechler et al., 2019;Rousseeuw, 1987) and Bayesian information criterion (BIC) (Gideon, 1978;Scrucca et al., 2016) were measured. The maximum D was selected which did not substantially reduce silhouette width or result in high variability in BIC between models. From this, 4 dimensions (4-D embedding) were selected for onward analysis ( Figure S4).
The number of mixture components, K, and therefore clusters of patients, was determined using the optimal BIC, maximum silhouette width, and a qualitative assessment of BIC and silhouette width plots (manual BIC, manual silhouette). If multiple values for K were selected, the diversity of mortality rates was then compared between these models and the mortality rate at day 28 after hospital admission calculated for K clusters. ISARIC4C modelling had shown that mortality rates varied markedly between high risk and low risk groups (Figure 1) and the model with the largest range of mortality rate was therefore retained. Based on this, the final reported model was a GMM fitted with K = 29 ( Figure S4).

Clustering validation data
A two-stage process assigned patients from waves 1 and 2 into clusters detected by the GMM in model development. First, the UMAP transformation learned from the training cohort was applied to embed the validation cohorts onto the same 4-D embedding. Secondly, the trained GMM was applied to predict which clusters the new observations should be assigned to with the highest probability. This was repeated for 5 imputations and a majority vote was taken as the final cluster classification (Basagana et al., 2013). Separation of clusters was assessed for each imputation by silhouette analysis and estimates were pooled with a 95% confidence interval (CI) by applying Rubin's rules (Marshall et al., 2009).

2-D visualisation
For visualisation purposes the UMAP analysis was repeated to transform the 4-D embedding of the training cohort from 4-D to 2-D. The same transformation was applied to each validation cohort and scatter plots (UMAP plots) were created (Wickham, 2016).

Comparing cohorts
Complete cases of each variable were compared between the training and validation cohorts using a Wilcoxon test for continuous variables or Fisher's test for categorical variables (Fisher, 1935;Rey and Neuhä user, 2011). The Benjamini-Hochberg procedure was applied to adjust for multiple testing (Benjamini and Hochberg, 1995). Mortality rate at day 28 after hospital admission was calculated for individual clusters across each of the three patient cohorts and odds-ratios were calculated (95% CI, 1000 bootstrap resamples) (Mayer, 2020).

Cluster characterisation analysis
Fisher's tests were applied to detect variables significantly associated with each cluster (p = 0.001). Continuous variables were categorised into ordinal factors from literature review and tests applied to each factor level (Table S2). Only variables where the association was significant in the training cohort and at least one validation cohort were reported here. Results were summarised using word clouds with variables scaled by the strength of the association in the training cohort (Chen et al., 2010;Fellows, 2018;Le Pennec and Slowikowski, 2019). Word clouds were overlaid on to a UMAP plot of the training cohort to aid interpretation. Clusters with fewer than 10 patients in either validation cohort were excluded.

Risk classification
ISARIC4C mortality score was calculated for each patient (Knight et al., 2020) and revealed 0-88% risk of death within 28 days. Scores were divided into low (0-3), intermediate (4-8), high (9-14) and very high risk (15-21) for plotting purposes. The training cohort was assigned a score based off a single imputation, waves 1 and 2 were assigned a score based off the mode across 5 imputations. Kaplan-Meier (KM) (Therneau and Grambsch, 2000) curves were constructed to estimate survival rates with 95% CI. A log-rank test was applied to test for differences between groups. ll OPEN ACCESS