Secondary analysis of transcriptomes of SARS-CoV-2 infection models to characterize COVID-19

Summary Standard transcriptomic analyses alone have limited power in capturing the molecular mechanisms driving disease pathophysiology and outcomes. To overcome this, unsupervised network analyses are used to identify clusters of genes that can be associated with distinct molecular mechanisms and outcomes for a disease. In this study, we developed an integrated network analysis framework that integrates transcriptional signatures from multiple model systems with protein-protein interaction data to find gene modules. Through a meta-analysis of different enriched features from these gene modules, we extract communities of highly interconnected features. These clusters of higher-order features, working as a multifeatured machine, enable collective assessment of their contribution for disease or phenotype characterization. We show the utility of this workflow using transcriptomics data from three different models of SARS-CoV-2 infection and identify several pathways and biological processes that could enable understanding or hypothesizing molecular signatures inducing pathophysiological changes, risks, or sequelae of COVID-19.


In brief
We report a data-driven, network-based workflow to identify gene and functional modules in COVID-19 through joint analysis of gene expression data from three model systems of SARS-CoV-2 infection. Bringing together a consensus gene expression signature from these model systems and analyzing it jointly with other omics data, we build clusters of higher-order multifeature machines that provide a basis for addressing several basic and translational research questions and generation of hypotheses.

INTRODUCTION
In vitro and in vivo disease models often fail to completely recapitulate the disease manifestations in humans. Integrated secondary analysis approaches that can identify disease-related gene modules by leveraging knowledge from multiple disease models can find physiological functions in a disease. Functional complexes that arise out of these gene or protein modules are known to represent distinct biological functions. 1,2 Similarly, feature networks comprising biological processes, pathways, phenotypes, and cell types represent a higher-order multifeatured machines collectively working toward a common goal. Based on this premise, we implemented a multilayered datamining methodology that leverages protein modules to build functional modules or complexes in a disease. These functional complexes are built by linking together several heterogeneous data types such as single-cell RNA-sequencing (RNA-seq) markers, protein-protein interactions, and phenotype-genotype THE BIGGER PICTURE This study is based on the premise that combining information from multiple layers of data can result in new biologically interpretable associations in several ways. The underlying and unifying theme of this study is data integration, data mining, and meta-analysis for pattern detection that supports knowledge discovery and generation of hypotheses. The methods and the workflow used are disease agnostic and can be applied to any disease or phenotype that has multiple models and heterogeneous data elements. By integrating and joint analysis of several heterogeneous data types (multiple disease models, viral-host protein interaction data, single-cell RNA-sequencing data, protein-protein interactions, and genome-wide association study data), gene functional modules are identified that can have direct bearing on furthering the understanding of COVID-19.
Development/Pre-production: Data science output has been rolled out/validated across multiple domains/problems associations. To demonstrate the utility of this joint analysis approach, we analyzed transcriptomic data from two in vitro models (Calu-3 and Vero E6 cells) and one in vivo model (Ad5-hACE2-sensitized mice) of SARS-CoV-2 infection.
Coronavirus disease 2019 , caused by SARS-CoV-2, has affected more than 75 million people with more than 1.6 million deaths worldwide including $17.3 million confirmed infections and >311,000 deaths in the United States (World Health Organization, December 20, 2020). The limited and emerging stages of data and information surrounding this disease, and the necessity to find effective interventions (e.g., vaccines, small molecules), provides a strong rationale for a multilayered, secondary analysis of existing data collected from different models and studies. Some of the noteworthy discoveries surrounding SARS-CoV-2 are direct offshoots of secondary data analysis using available omics data generated in pre-COVID-19 times. These existing data include single-cell RNA-seq (scRNA-seq) data 3,4 from the Human Cell Atlas consortium or eQTL variant data 5 from the Genotype Tissue Expression (GTEx) database. 6 Thus, leveraging the available repository of datasets and information, even if they were not designed specifically to study COVID-19, can provide a jump start to discover different sides of this disease. Recently there have been several studies reporting network analysis-based approaches applied to both COVID-19-and non-COVID-19-related data to detect tissue-specific 7,8 or pan-tissue 9 networks of interacting genes specific to SARS-CoV-2 infections. These studies differ in the input ''seed'' genes used to construct the networks; some studies are focused on the SARS-CoV-2 entry-associated receptors and/or proteases 7,9 while the others use an expanded set of virus-host interactants in SARS-CoV-2. 8,10 However, most of these methods do not consider the differentially expressed genes (DEGs) in the host following the SARS-CoV-2 infection in their analysis. A recently published study 11 used differentially expressed host genes in SARS-CoV-2-infected bronchial epithelial cells (NHBE) along with the SARS-CoV-2 entry receptor ACE2 and SARS-CoV-2 entry-associated protease TMPRSS2 to construct a host gene regulatory network. This study, however, is based on a single SARS-CoV-2 infection model with a limited set (three samples) of SARS-CoV-2 infection samples. Additionally, the study did not consider other host-virus interactants specific to SARS-CoV-2 virus. To overcome these limitations and address some of these issues, we used transcriptomic data from three model systems (two in vitro and one in vivo) of SARS-CoV-2 infection, SARS-CoV-2 viral-host protein interaction data, and analyzed them jointly with non-COVID-19/SARS-CoV-2 data. For the latter, we used the scRNA-seq markers from three human lung studies, protein-protein interactions, and genome-wide association study (GWAS) data ( Figure 1). While we acknowledge the complexity of SARS-CoV-2 infection, we believe that our study supports knowledge discovery and formulation of testable hypotheses for COVID-19 pathogenesis.

Consensus transcriptome in SARS-CoV-2 infection
The pathophysiology of most viral infections is associated with host protein complexes, which are manipulated to hijack the individual cell biological processes. Therefore, to evaluate this phenomenon, we first built an interactome around the consensus transcriptome of SARS-CoV-2 infection. To obtain a consensus transcriptomic signature, we considered DEGs in at least two of the three SARS-CoV-2 models 12-14 compared (i.e., two cell lines, namely, transformed lung-derived Calu-3 cells and VeroE6 cells, and a mouse model) ( Figure 2A and Table  1). A strong concordance was seen among the upregulated and downregulated gene signatures from the three models. A total of 732 DEGs (537 upregulated and 195 downregulated) were shared between the SARS-CoV-2-infected human Calu-3 and non-human primate VeroE6 cell lines ( Figure 2B). Similarly, we found 325 upregulated and 369 downregulated genes common between the Calu-3 model and Ad5-hACE2-sensitized mice. While there was an overall concordance among the DEGs, each of the three models also had several DEGs unique to them ( Figure 2C and Table S1). We further validated these DEGs by comparing them with a transcriptomic signature from COVID-19 patients (GEO: GSE152075; nasopharyngeal swabs from 430 patients and 54 controls). 15 There was a stronger concordance with the transcriptomic signature from the Calu-3 and the Ad5-hACE2-sensitized mouse model systems than the one from the VeroE6 cell line model ( Figure S1). Finally, a total of 1,467 consensus genes (833 upregulated and 634 downregulated) were found ( Figure 2C and Table S2) from the three disease models. This included 106 genes upregulated and 41 genes downregulated in all three model systems ( Figures 2C and 2D), representing the ''core'' dysregulated transcriptome in SARS-CoV-2 infection. Both these sets of consensus signatures were enriched for several functional terms (Tables S3, S4, and S5) and human lung cell-type markers (Table S6 and Figure S2). Additionally, these gene sets were also enriched for several physiological and pathological traits (from the Phenotype-Genotype Integrator [PheGenI] 16 and GWAS catalog 17 databases) (Tables S7 and S8; Figure S3).

Interactome of consensus transcriptome of SARS-CoV-2 infection and virus-host protein-protein interactions
To build a consensus SARS-CoV-2 interactome, we used the SARS-CoV-2-human virus-host protein-protein interaction (PPI) dataset comprising 332 human proteins involved in assembly and trafficking of RNA. 18 These are in addition to the SARS-CoV-2 entry receptor ACE2, and SARS-CoV-2 entryassociated proteases, namely, TMPRSS2, CTSB, and CTSL. More than half (151 genes) of these 336 SARS-CoV-2-human interacting proteins were differentially expressed in at least one of the three model systems ( Figure 3A). Of these, 29 genes (16 upregulated and 13 downregulated) were part of the consensus signature.
Using the disease consensus transcriptomic signature and the SARS-CoV-2-proteome interacting human proteins as an input, we queried the STRING (v11) database 19 and generated a DEG-PPI integrated network. Only the interactions with highest confidence score (0.9) or experimental interaction score of 0.7 or more in STRING were used. We observed an enrichment for PPIs (p < 1.0 3 10 À16 ) among the combined gene set (Figure 3B). In other words, this combined set of SARS-CoV-2 consensus signature and SARS-CoV-2-human interaction map have significantly more interactions among themselves than would be expected for a random set of proteins of similar size drawn from the genome. We next identified network clusters from this joint interactome using a Markov clustering (MCL) algorithm. In brief, MCL clusters a network to determine modules of genes with more intramodular (within the module) than intermodular (with other modules) interactions. Each gene can only be assigned to a single module through this method. The inflation factor parameter determines the granularity (or ''tightness'') of the clusters and thereby the cluster size. In all our experiments with SARS-CoV-2 infection models, we used the default inflation parameter (2.5). With MCL clustering, we found 153 clusters of varying gene counts (Table S9). We selected 35 candidate clusters with each having at least five Transcriptomic data from three SARS-CoV-2 infection models are processed to identify differentially expressed genes (DEGs). Genes that are upor downregulated in two out of three models are considered as ''consensus signature.'' The consensus DEGs, along with the SARS-CoV-2-human virus-host interactome, are used to build an integrated network based on known protein-protein interactions (PPIs) retrieved from STRING database (v11). A Markov clustering (MCL) algorithm is then used to identify modules of highly interconnected genes from this integrated interactome. These gene modules are characterized through functional enrichments (pathways, biological processes), lung single-cell markers, and phenotypic trait (genomewide association [GWA]) loci. In the final step, an enriched feature network is constructed using all the enriched terms and features from the modules to extract functional complexes or communities of highly interconnected features.
genes. These 35 clusters were made up of a total of 797 genes of which 627 were consensus DEGs in SARS-CoV-2 infection models (see Figures 3C-3H for  six example clusters and Table S9 for more details). Of the 35 clusters, 29 clusters had at least one gene-encoding protein that interacts with the SARS-CoV-2 proteome. We hypothesize that these SARS-CoV-2-targeted human protein clusters are informative in deciphering the COVID-19 pathophysiology and inferring the function of the SARS-CoV-2 targets based on other members in the protein clusters.
Characterization of SARS-CoV-2targeted human protein modules Gene clusters: Functional enrichment The next step in our multilayered approach was to obtain enriched biological processes and pathways for the identified gene modules (Table S10). Cluster C-1 (190 genes) was enriched for innate immune response (48 genes) and type I interferon signaling (26 genes) while genes from cluster C-2 (92 genes) were involved in transport regulation (31 genes) and tube development (31 genes). We also found genes associated with abnormal cardiovascular development (21 genes) in cluster C-2. Clusters C-7 (20 genes) and C-8 (20 genes) had genes associated with abnormal interleukin and cytokine secretion phenotypes. Clusters C-12 (14 genes), C-28 (6 genes), and C-23 (8 genes) were all enriched for mitochondrion translation, organization, and transport. Finally, several genes regulating circadian rhythm in mammals (NFIL3, PER1, PER2, PER3, and SIK1) were seen in cluster C-25 (7 genes).
Gene clusters: Lung single-cell markers We next evaluated the candidate gene modules for lung singlecell associations by performing enrichment analysis of the modules against single-cell marker gene sets compiled from three different human lung scRNA-seq studies. [20][21][22] Of the 35 selected gene clusters, 17 clusters (633 genes) were enriched for markers of at least one lung cell type ( Figure 4 and Table  2). Cluster C-1 (190 genes) was enriched for proliferating cells including proliferating epithelial (e.g., proliferating basal), lymphoid (e.g., proliferating T cells, proliferating natural killer cells), and myeloid (e.g., proliferating macrophages) cell types. Cluster C-2 (92 genes) was heterogeneous showing enrichment for epithelial, mesenchyme, vascular endothelial, lymphoid, and myeloid cell types. Cluster C-9 (18 genes) showed enrichment for fibroblasts, myofibroblasts, and smooth muscle cells and shared enrichments with clusters C-1, C-2, and C-3. Some of the clusters were found to be specifically enriched for certain cell types. Ionocyte cell marker 22 genes, for instance, were specific to cluster C-5 (40 genes; 12 markers); clusters C-7, C-11, and C-13 were specifically enriched for myeloid cell markers (Table S11).
Gene clusters: Genotype-phenotype associations The 35 genes clusters also showed enrichment for several physiological and phenotypic traits that provide insights into COVID-19 pathogenesis (Tables S12 and S13). Among the most significantly enriched traits were respiratory system disease (clusters C-7 and C-8), asthma (C-7), autoimmune disease (clusters C-7 and C-29), allergic rhinitis (C-7), immune system disease (cluster C-7 and C-8), and diabetes (C-15). We also observed risk genes associated with several inflammatory disorders such as inflammatory bowel disease and Crohn's disease (C-7), ulcerative colitis (C-8), rheumatoid arthritis (clusters C-7 and C-8), and ankylosing spondylitis (C-8). Apart from elucidating the pathophysiology of COVID-19, the enriched traits can potentially help the researchers to understand or formulate hypotheses surrounding the long-hauler patients or survivors. For instance, could COVID-19 be a risk factor for autoimmune or neurodegenerative disease? A plausible mechanism could be through an overactivated innate immune system. [23][24][25] Both acute and delayed neurological and neuropsychiatric effects have been associated with previous viral pandemics. 26 (D) List of the ''core'' upregulated (106 genes, red font) and downregulated (41 genes, green font) genes in all the three model systems.
(E and F) Network representation of enriched biological processes and pathways in the core upregulated (E) or downregulated (F) genes from the SARS-CoV-2 infection models. Orange-and purple-colored nodes are genes up-or downregulated, respectively. The different colored rectangles are enriched biological processes and pathways. Enrichment analysis was done using the ToppFun application of the ToppGene Suite, and network was generated using the Cytoscape application.

SARS-CoV-1-targeted human protein modules
To demonstrate that the proposed workflow is disease agnostic and to identify modules that are specific to SARS-CoV-2 infection, we implemented the same workflow for another corona virus disease caused by SARS-CoV-1. To do this, we first extracted DEGs from three different SARS-CoV-1 infection models [28][29][30] (Calu3 model and two mouse models), and generated the consensus DEGs. There were 699 upregulated and 1,385 downregulated genes that were differentially expressed in at least two out of the three model systems. To generate the SARS-CoV-1-targeted human protein modules, we used 366 host-SARS-CoV-1 protein interactions identified on the basis of localization of viral proteins in human cells. 31 Comparing the DEGs and virus-host protein interactions of SARS-CoV-1 and SARS-CoV-2, we found over 300 DEGs (196 upregulated and 119 downregulated) and 135 viral interactants shared, and a large number of DEGs and protein interactions unique to each of them. We next generated SARS-CoV-1-targeted human protein modules following the same steps as described previously for SARS-CoV-2. We identified 68 modules that had at least five genes (Table S14). We also computed functional and lung cell marker enrichments for the SARS-CoV-1 modules. By analyzing the module compositions from both of the analyses (SARS-CoV-2 and SARS-CoV-1), we identified candidate modules that are potentially unique to each of these viruses. For instance, cluster C-5 (40 genes) from the SARS-CoV-2 interactome contained more than 90% of its gene members (37 out of 40) from the SARS-CoV-2 consensus signature or protein interactions. Interestingly, this module was enriched for marker genes from ionocytes and proximal ciliated cells, and several neurodegenerative disease pathways. Similarly, 9 out of 11 genes in cluster C-15 were specific to the SARS-CoV-2 interactome, which included genes belonging to trans-synaptic signaling and neurotrophic factor-mediated Trk receptor signaling pathways. Among lung cell markers, proliferating epithelial and basal cells along with transitional AT2 cell markers were specifically enriched in our identified SARS-CoV-2 protein modules. Likewise, we observed multiple functional pathways (e.g., TRAIL [tumor necrosis factor-related apoptosis-inducing ligand] signaling and IL12 [interleukin-12]-mediated signaling pathways), biological processes (e.g., endoderm formation, response to oxygen radical), and phenotypes (e.g., arteriosclerosis, abnormal mitochondrial crista morphology) enriched specifically in SARS-CoV-2. We also identified few protein modules containing a significant number of genes associated with both in-fections, potentially representing the pan-viral disease mechanisms involved (Table S14).

Meta-analysis of candidate gene modules and enrichment network visualization
To identify the semantic concordance between the enriched cell types, phenotypic traits, and functional terms for different gene clusters, we next undertook meta-analysis across all the enrichments. We selected a subset of enriched terms (top ten enriched terms from Gene Ontology-Biological Process, Reactome pathways, mouse phenotype), cell types, and traits (both Phe-GenI and GWAS catalog) from each of the 35 candidate clusters and converted them into a network layout. We used Gephi (https://gephi.org), an open-source graph visualization platform, 32 to construct and visualize the functional network. In this dense enriched feature network (1,198 nodes and 31,065 edges, Table S15), the enriched terms (biological processes, pathways, phenotypic traits, cell types) are represented as nodes, and two nodes are connected if they share at least one or more of the 35 candidate gene clusters from the combined interactome map. Since subunits of a functional complex (a cluster of, e.g., pathways, cell types, biological processes, phenotype) work toward the same biological goal, prediction of an unknown pathway or biological process or a phenotype as part of this complex also allows increased confidence in the annotation of that functional cluster. Additionally, by doing this, potential redundancies across different sources (e.g., ontology or cell types) could be reduced, apart from enabling interpretation of the enrichment results through intracluster and intercluster similarities of enriched terms. 33 We therefore investigated the substructure of the feature network by estimating community membership modules using the Louvain algorithm 34 (implemented in Gephi). Louvain clustering is a fast, iterative algorithm that is based on optimizing the modularity score 35 and is computationally fast, efficient, and suitable for large modular networks. The resolution parameter can be used to maintain the balance between module count and the individual cluster tightness. A low-resolution parameter value would lead to smaller, more tightly connected clusters and vice versa. With a resolution set to 0.25, we found 31 communities of highly interconnected biological terms and a high modularity score of 0.672 ( Figure 5). Visualizing these functional complexes, we observed high concordance between the functional terms, cell-type marker, and phenotype enrichments among the candidate gene modules (Table 3). For instance, cluster C-10 was enriched for vascular Article endothelial and smooth muscle cells, platelet degranulation, extracellular matrix, cell-substrate adhesion, and FOXP3 targets ( Figure 5). Elucidating the role of platelets in the thrombotic complications of COVID-19, two recent studies 36,37 reported that platelet hyperactivity contributes to the COVID-19-related coagulopathy. Furthermore, endothelial cell dysfunction and impaired microcirculatory function are reported to contribute to COVID-19 severity including venous thromboembolic disease and multiple organ involvement. 38 Foxp3 is a master regulator of regulatory T (Treg) cells, and its expression is associated with the immunosuppressive activity of these cells. Deficiency of functional Treg cells caused by mutations of Foxp3 leads to spontaneous systemic multiorgan autoinflammatory phenotypes in mice. [39][40][41][42] Interestingly, CD4 + CD25 + FoxP3 + regulatory T cell-based therapies are proposed for COVID-19 patient management. 43 Similarly, clusters C-11 and C-13, and C-7 ( Figures  3E-3G) were enriched for Toll-like receptor signaling pathway, cytokine-cytokine receptor interaction, nuclear factor kB (NF-kB) signaling, CD40 signaling, and myeloid cell types (conventional dendritic cells, mast cells, and monocytes). These clusters showed enrichment for abnormal interleukin secretion and T cell physiology and for several GWA loci such as granulocyte count, inflammatory biomarker measurement, Crohn's disease, and ulcerative colitis ( Figure 5 and Table 3).

DISCUSSION
We report a data-driven, network-based workflow to identify gene and functional modules in a disease through joint analysis of disease-specific and non-disease-specific data elements. By integrating high-confidence protein-protein interactions with disease-specific transcriptomic signatures, we first identified protein modules that could represent perturbed states in disease. As a first pass of characterizing these modules, we leverage existing heterogeneous omics data including different biological processes, pathways, single-cell associations, and genetic traits. Next, we construct a feature network using the enriched terms from different perturbed modules. These higher-order multifeature machines, or functional modules overlaid on protein modules representing perturbed states, enable us to identify biologically interpretable mechanisms underlying disease pathophysiology. This approach is disease agnostic and can be applied to any disease or phenotype that has one or more model systems with transcriptomic data. We demonstrate the utility of our approach by undertaking a secondary analysis of transcriptomic data from three models of SARS-CoV-2 infection. By integrating and analyzing the transcriptomic data from COVID-19 in vitro and in vivo models in the context of SARS-CoV-2-human virus-host protein interaction map, single-cell signatures of lung, gene annotations, and human genotype-phenotype associations, we have identified several functional modules that can have direct bearing on furthering the understanding of this devastating pandemic. We also demonstrate the disease-agnostic nature of our approach through analysis of transcriptomic data from infection model systems of another corona virus (SARS-CoV-1). Furthermore, we were able to identify SARS-CoV-2-specific gene modules and unique functional mechanisms by comparing the results from the SARS-CoV-2 with that of SARS-CoV-1 infection model systems. The various categories of cellular functions and phenotypic traits found by meta-analyses of SARS-CoV-2 model systems recovered both expected and potentially novel biological insights of COVID-19. The gene-level and higher-order feature-level clusters emerging from the joint analysis of COVID-19-and non-COVID-19 related data can serve as valuable resources for the scientific community to formulate or further investigate hypotheses.
Limitations of the study Our methodology holds certain limitations. The composition of the protein modules is dependent on the transcriptomic signature used, and any heterogeneity in the transcriptomic data can affect the module compositions. For instance, lack of tran-

infection models and SARS-CoV-2-host protein-protein interactions
Marker enrichment network of 17 candidate clusters (circular nodes) that were enriched for at least one lung cell type (rectangular nodes) marker in human lung. scriptomic concordance between different disease model systems could result in gene or protein modules that are very diverse. Although the Ad5-hACE2-transduced mice develop pneumonia after infection with SARS-CoV-2 and are useful for evaluation of vaccines and antiviral therapies, the infection is non-lethal. 14,44,45 A comparison of the transcriptomic signatures from the three model systems with DEGs from human samples indeed showed several genes that are uniquely dysregulated in COVID-19 patients (GEO: GSE152075) 15 suggesting the inherent drawbacks of current in vitro and in vivo models of COVID-19. Newer transcriptomic signatures as and when available from emerging refined and more representative in vitro and in vivo models of human COVID-19 [46][47][48][49] can be leveraged and analyzed using the current workflow. The small sample size from the three disease models used in the study is another limitation. However, we performed multiple randomized trials to demonstrate the robustness of the three transcriptomic datasets used in the study (see experimental procedures and Figure S4). Nevertheless, further (in vitro and in vivo) validations are warranted to test the hypotheses arising from the current study. Recently there have been several studies and databases reporting COVID-19-specific databases, 33,50-52 which can be either used to compile additional consensus gene sets or for further functional characterization of the modules discovered in the current study.
The STRING-based PPI network data suffer from incompleteness and a certain degree of noise. There are no set standards for the optimal STRING interaction score cutoff. Furthermore, although Markov clustering is recommended for module detection, 53 there are no specific guidelines for inflation factor threshold nor for the functional annotation of modules. Nevertheless, to overcome some of these limitations, we used a very stringent cutoff score of 0.9 for STRING interactions and selected 2.5 (default) as the MCL inflation factor. The cluster composition can also vary depending on the clustering algorithm and parameters. Additionally, the choice of external (non-disease) data elements is likely dependent on the disease or phenotype being studied. For instance, to identify cancer driver modules, DNA-level alterations (e.g., variants, copy number alterations) and RNA-level regulation data have been proved to be more effective. 54 To ll OPEN ACCESS Article alleviate the issues related to noise and incompleteness in PPI networks, graph neural network implementations, which are robust to structural noise in input networks, could be useful. Adding the expression profiles as node features could also be an efficient way to introduce disease-specific transcriptomics data into the network-based analysis. Using attention-based implementations allows us to assign dynamic similarity weights to nodes (proteins) based on the similarity of their neighborhoodaggregated features. Additionally, we also plan to explore mechanisms that can integrate heterogeneous human transcriptomic data coming from distinct sources including nasopharyngeal swabs and peripheral blood mononuclear cells. In summary, bringing together a consensus gene signature from multiple disease model systems and analyzing it jointly with other omics data provide a basis for addressing several basic and translational research questions for existing and emerging diseases.

EXPERIMENTAL PROCEDURES
Resource availability Lead contact Further information and requests should be directed to and will be fulfilled by the lead contact, Anil G. Jegga (anil.jegga@cchmc.org).

Materials availability
This study did not generate any unique reagents. Data and code availability All data generated or analyzed during this study are included in this article and its supplemental information files. Also, the code for reproducing our result files and figures is accessible publicly at https://github.com/ SudhirGhandikota/COVID19_secondary_analysis. Additional supplemental items are available Mendeley Data at https://doi.org/10.17632/3cwxv9swkc.1.

SARS-CoV-2 infection models: Differentially expressed genes
We used transcriptomic data from human (Calu-3) and non-human primate (VeroE6) cell lines, and from a mouse model (Ad5-hACE2) of SARS-CoV-2 infection ( Table 1). The SARS-CoV-2 infection triggered transcriptome in Calu-3 cell lines (GSE147507) 12 is based on six samples with three each of mock treated or infected with SARS-CoV-2. The second transcriptome signature is based on mRNA profiles of control and 24-h post-SARS-CoV-2-infection (USA-WA1/2020, multiplicity of infection = 0.3) in Vero E6 cells (kidney epithelial cells extracted from an African green monkey (GEO: GSE153940). 13 The third dataset is from a mouse model using Ad5-hACE2sensitized mice (GEO: GSE150847) 14 that develop pneumonia after infection with SARS-CoV-2, overcoming the natural resistance of mice to the infection. Raw data from GEO: GSE147507, 12 GSE153940, 13 and GSE150847 14 were obtained and analyzed using the Computational Suite for Bioinformatics and Biology (CSBB v3.0). 55 The raw data were downloaded from NCBI Sequence Read Archive (ProcessPublicData module), and the technical replicates were merged for individual samples before processing them (Process-RNASeq_-SingleEnd module). Quality checks 56 and quality trimming 57 were conducted prior to the transcript mapping/quantification step using the RSEM package. 58 Raw counts and transcripts per million were generated for all samples for further downstream analysis. Within each sample series, differential expression (DE) analysis was carried out based on treatment versus mock samples using the CSBB-Shiny server. 19 RUVSeq 59 was used to remove potential variation and sequencing effects from the data before performing DE analysis using edgeR. 60 DEGs were obtained by applying a 1.5-fold change threshold (i.e., log 2 FC R0:6 or log 2 FC% À 0:6) and a p value (false discovery rate [FDR] correction) of <0.05. For obtaining the human ortholog genes for mouse (Mus musculus) and green monkey (Chlorocebus sabaeus), we used ortholog mappings from the NCBI's HomoloGene.

SARS-CoV-2-human virus-host protein-protein interactions data
The SARS-CoV-2-human virus-host protein-protein interaction data included a set of 332 human proteins involved in assembly and trafficking of RNA viruses and shown recently through affinity purification and by mass spectrometry to interact physically with 26 of 29 SARS-CoV-2 structural proteins. 18 These are in addition to the SARS-CoV-2 entry receptor ACE2, and SARS-CoV-2 entry-associated proteases, namely, TMPRSS2, CTSB, and CTSL.

Consensus DEGs: Robustness tests
To test the robustness of DEGs and the consensus transcriptome from the three input disease models used in our framework, we performed four different randomized permutation tests. In the first set of experiments, we randomly permuted the phenotype labels in each individual study, identified the DEGs, and tried to obtain the consensus signature (genes that are differentially expressed in two or more studies). We repeated this for 1,000 iterations and observed that the number of DEGs found in each disease model is significantly less than the actual counts ( Figure S4A). Consequently, we did not identify a consensus signature in any of the randomized trials due to the low DEG counts. Given the small sample sizes, the same phenotype combinations were repeated a few times in our trials. In the second set of experiments, we permutated the labels in two of the three studies and reused the original DEGs from the third study. We again repeated this process 1,000 times for each combination (3,000 random trials in total) and computed the consensus DEGs in each case. Here too we did not observe a significant number of consensus DEGs ( Figure S4B) in any of our trials (<25 genes).
Our next set of experiments was designed to validate the level of connectivity observed among the SARS-CoV-2 consensus DEGs along with their interactions with the SARS-CoV-2 virus-host interactants. To achieve this, we first generated DEG sets in each individual study by randomly picking the Table 2. Candidate clusters in SARS-CoV-2 DEG and interaction map along with their enriched lung cell types same number of genes as obtained originally (Table 1) and identified the consensus signature from among them. As is the case in our earlier experiments, we observed that the counts of consensus DEGs in our random trials are significantly lower than the observed gene sets ( Figure S4C). These consensus genes were then combined with the SARS-CoV-2-human virushost interactome (336 genes), and the integrated gene set was tested for enrichment of protein-protein interactions from STRING. 19 We repeated these two independent steps 1,000 different times and plotted the enrichment p values in each case ( Figure S4C). On average, the consensus DEG counts from our random tests were around 300 genes while the empirical p values were less significant than the observed level (p < 1.0 3 10 À16 ). Although we found statistically significant (p % 0.05) PPI enrichments in some of our trials, we hypothesized that these might be driven by the 336 SARS-CoV-2 interactants. Therefore, we tried to test this in our final set of experiments by randomly picking 1,803 genes (1,467 conserved + 336 SARS-CoV-2 interactants) and then checked for their PPI enrichments. This time, we observed fewer significant enrichments (p % 0.05) among 1,000 independent trials ( Figure S4D). We also found that the average local clustering coefficient values (from STRING) in each trial were smaller than the actual value (0.42) ( Figure S4D). In all our experiments, we used the STRING API (https://string-db.org/help/api/) to compute PPI enrichments and to retrieve the clustering coefficient scores.
Functional and human lung cell markers enrichment analysis Functional enrichment for Gene Ontology biological processes, mouse phenotypes, pathways, and 4,872 immunologic 61 and 50 hallmark 62 gene sets from MSigDB 63 was done using the ToppGene suite 64 while the pathway enrichment analysis using the Elsevier Pathway Collection was done using Enrichr. 65 Additionally, to detect specific cell types potentially perturbed or affected in COVID-19, we intersected the DEGs and gene clusters from SARS-CoV-2 infection models with cell-type markers (FDR p % 0.05; logFC R 0.5) from normal adult human lung. [20][21][22] Genome-wide association trait enrichment analysis For gene and phenotype trait association analysis, we used data from the NCBI's Phenotype-Genotype Integrator (PheGenI) 16 and the NHGRI-EBI GWAS catalog. 17 We used significant (1 3 10 À5 ) vulnerability loci of various human physiological traits, excluding all intergenic variants. Additionally, we also included child trait associations for the mapped traits from the GWAS catalog. The child terms for each trait were obtained by parsing the experimental factor ontology hierarchy. 66 We applied Fisher's exact test to find the enrichments.

ACKNOWLEDGMENTS
This study was supported in part by National Institutes of Health grant 1UG3TR002612 and by the Cincinnati Children's Hospital Medical Center.   Table 3. Enriched functional terms shared among the 35 candidate gene clusters from the integrated network of consensus DEGs of the three SARS-CoV-2 infection models and the SARS-CoV-2-host protein interaction map