![]() |
Formats:
|
||||||||||||||||
Copyright © Springer Science+Business Media, LLC 2007 Moving toward a system genetics view of disease Rosetta Inpharmatics, LLC, 401 Terry Avenue N., Seattle, Washington 98109 USA Eric E. Schadt, Email: Eric_Schadt/at/merck.com. Corresponding author.Received March 23, 2007; Accepted May 21, 2007. This article has been cited by other articles in PMC.Abstract Testing hundreds of thousands of DNA markers in human, mouse, and other species for association to complex traits like disease is now a reality. However, information on how variations in DNA impact complex physiologic processes flows through transcriptional and other molecular networks. In other words, DNA variations impact complex diseases through the perturbations they cause to transcriptional and other biological networks, and these molecular phenotypes are intermediate to clinically defined disease. Because it is also now possible to monitor transcript levels in a comprehensive fashion, integrating DNA variation, transcription, and phenotypic data has the potential to enhance identification of the associations between DNA variation and diseases like obesity and diabetes, as well as characterize those parts of the molecular networks that drive these diseases. Toward that end, we review methods for integrating expression quantitative trait loci (eQTLs), gene expression, and clinical data to infer causal relationships among gene expression traits and between expression and clinical traits. We further describe methods to integrate these data in a more comprehensive manner by constructing coexpression gene networks that leverage pairwise gene interaction data to represent more general relationships. To infer gene networks that capture causal information, we describe a Bayesian algorithm that further integrates eQTLs, expression, and clinical phenotype data to reconstruct whole-gene networks capable of representing causal relationships among genes and traits in the network. These emerging network approaches, aimed at processing high-dimensional biological data by integrating data from multiple sources, represent some of the first steps in statistical genetics to identify multiple genetic perturbations that alter the states of molecular networks and that in turn push systems into disease states. Evolving statistical procedures that operate on networks will be critical to extracting information related to complex phenotypes like disease, as research goes beyond a single-gene focus. The early successes achieved with the methods described herein suggest that these more integrative genomics approaches to dissecting disease traits will significantly enhance the identification of key drivers of disease beyond what could be achieved by genetic association studies alone. Introduction Genetics is at the dawn of a new era with maturing technologies that enable low-cost, high-throughput genotyping of hundreds of thousands of DNA markers that in turn can be tested for association to complex traits of interest like disease and drug response. A number of studies have already leveraged the availability of such technologies to identify polymorphisms in genes that associate with diseases like age-related macular degeneration (Edwards et al. 2005; Haines 2005; Klein 2005), diabetes (Grant 2006; Sladek 2007), and obesity (Herbert 2006), to name just a few. In addition, there are scores of similar genome-wide association studies that are ongoing and that promise to deliver scores of genes that harbor variations that associate with diseases like obesity and diabetes. While these types of genetic discoveries provide a peek into pathways that underlie disease, they are usually devoid of context, so that elucidating the functional role such genes play in disease can take years, or even decades, as has been the case for ApoE, an Alzheimer’s susceptibility gene identified nearly 15 years ago (Peacock et al. 1993). Information that defines how variations in DNA that associate with disease actually impact the complex physiologic processes underlying disease flows through transcriptional and other molecular, cellular, tissue, and organism networks (Fig. 1
Integrating genetic and functional genomic data can provide a path to inferring causal associations between genes and disease. In the past, causal associations between genes and traits have been investigated using time series experiments, gene knockouts or transgenics that overexpress a gene of interest, RNAi-based knockdown or viral-mediated overexpression of genes of interest, and chemical activation or inhibition of genes of interest. A more systematic and arguably relevant source of perturbation to make such inferences regarding genes and disease are DNA polymorphisms, where gene expression and other molecular phenotypes in a number of species have been shown to be significantly heritable and at least partially under the control of specific genetic loci (Brem et al. 2002; DeCook et al. 2006; Hubner et al. 2005; Jin et al. 2001; Klose et al. 2002; Monks et al. 2004; Morley et al. 2004; Oleksiak et al. 2002; Schadt et al. 2003; Stranger et al. 2005). By examining the effects that naturally occurring variations in DNA have on variations in gene expression traits in human or experimental populations, other phenotypes (including disease) can be examined with respect to these same DNA variations and ultimately ordered with respect to genes to infer causal control (Fig. 2
Here we review different approaches for integrating expression quantitative trait loci (eQTLs), expression, and clinical data to infer causal relationships among gene expression traits and between expression and disease traits. We further review methods to integrate these data in a more comprehensive manner by constructing coexpression gene networks that leverage pairwise gene interaction data to represent more general relationships. This type of network provides a useful construct for characterizing the topologic properties of biological networks and for partitioning such networks into functional units (modules) that underlie complex phenotypes like disease. However, these networks are, by design, undirected and so do not explicitly capture causal relationships among genes. To infer gene networks that capture causal information, we review Bayesian network reconstruction algorithms that, like the methods operating on only two or three expression traits and/or clinical traits mentioned above, integrate eQTLs, expression, and clinical phenotype data to reconstruct whole-gene networks capable of representing direction along the edges of the network. Here, directionality among the edges corresponds to causal relationships among genes and between genes and clinical phenotypes related to disease. These emerging high-dimensional data analysis approaches that integrate large-scale data from multiple sources represent the first steps in statistical genetics, moving away from considering one trait at a time and toward operating in a network context. Evolving statistical procedures that operate on networks will be critical to extracting information related to complex phenotypes like disease as research goes beyond the single-gene focus. The early successes achieved with some of the methods described herein suggest that these more integrative genomics approaches to dissecting disease traits will significantly enhance the identification of key drivers of disease beyond what could be achieved by genetics alone. Leveraging the heritability of expression as a path to reconstructing networks Gene transcripts have been identified that are associated with complex disease phenotypes (Karp et al. 2000; Schadt et al. 2003), are alternatively spliced (Johnson et al. 2003), elucidate novel gene structures (Mural et al. 2002; Schadt et al. 2004; Shoemaker et al. 2001), can serve as biomarkers of disease or drug response (DePrimo et al. 2003), lead to the identification of disease subtypes (Mootha et al. 2003; Schadt et al. 2003; van’t Veer et al. 2002), and elucidate mechanisms of drug toxicity (Waring et al. 2001). Changes in gene expression often reflect changes in a gene’s activity and the impact a gene has on different phenotypes. Because gene expression is a quantitative trait, linkage and association methods can be directly applied to such traits to identify genetic loci that control them. In turn, genetic loci that control for expression traits may also associate with higher-order phenotypes affected by expression changes in the gene of interest, providing a path to directly identify genes controlling for phenotypes of interest. Therefore, identifying the heritable traits and the extent of their genetic variability provides insight about the evolutionary forces contributing to the changes in expression that associate with biological processes that underlie diseases like obesity and diabetes, beyond what can be gained by looking at the transcript abundance data alone. It is now well established that gene expression is a significantly heritable trait (Alberts et al. 2005; Brem et al. 2002; Chesler et al. 2005; Cheung et al. 2005; Jansen and Nap 2001; Monks et al. 2004; Morley et al. 2004; Petretto et al. 2006a, b; Schadt et al. 2003, 2005). If a gene expression trait is highly correlated with a disease trait of interest, and if the corresponding gene physically resides in a region of the genome that is associated with the disease trait, then knowing that the expression trait is also genetically linked to a region coincident with its physical location provides an objective and direct path to identify candidate causal genes for the disease trait (Alberts et al. 2005; Brem et al. 2002; Chesler et al. 2005; Cheung et al. 2005; Jansen and Nap 2001; Monks et al. 2004; Morley et al. 2004; Petretto et al. 2006a, b; Schadt et al. 2003, 2005). The genetic information therefore enables the dissection of the covariance structure for two traits of interest into genetic and nongenetic components, and the genetic component can then be leveraged to support whether an expression and disease trait are related in a causal, reactive, or independent manner (with respect to the expression trait). Elucidating causal relationships is possible in this setting given the unambiguous flow of information from changes in DNA to changes in RNA and protein function (Fig. 1 Leveraging DNA variation information to reconstruct gene networks supposes that we are able to systematically identify genetic loci that at least partially control transcript abundances for genes of interest. This of course is straightforward given that transcript abundance or gene expression traits are quantitative measures that can be analyzed like any other quantitative trait in a genetics context. However, the difficulty in analysis and interpretation comes with the large number of traits examined. Microarrays are capable of monitoring tens of thousands (or hundreds of thousands) of transcripts simultaneously. Therefore, methods to compute eQTLs must consider computational tractability given the need to run the analyses potentially millions of times. In addition, significance thresholds must take into account multiple testing. Multiple testing issues relate not only to the number of transcripts tested but also to the number of markers or proportion of the genome tested. However, the strong correlation structure that exists among many of the expression traits monitored in a segregating population can be leveraged to enhance the power to detect relationships among genes. A number of methods have been developed and applied to gene expression traits in segregating populations to identify eQTLs and to establish relationships among genes and between genes and disease traits, where multiple traits at a time can be considered. Typical approaches to the joint analysis of genetic traits involve mapping each gene expression trait individually and inferring the genetic correlation between pairs or sets of expression traits based on pairwise Pearson correlation, eQTL overlaps, and/or tests for pleiotropy. Using a family-based sample, Monks et al. (2004) estimated the genetic correlation between pairs of traits using a bivariate variance-component-based segregation analysis and showed that the genetic correlation was better able to distinguish clusters of genes in pathways than correlations based on the observed expression traits. This type of method can be extended to perform bivariate and multivariate QTL analyses, which can be more highly powered to detect QTLs when traits are correlated. Clusters of correlated gene expression traits can often contain hundreds or thousands of genes, which would be computationally prohibitive in a joint analysis. Kendziorski et al. (2006) approached this problem in a different way by employing a Bayesian mixture model to exploit the increased information from the joint mapping of correlated gene expression traits, which is computationally tractable for large sets of genes. Instead of doing a linkage scan by computing LOD scores at positions along the genome, Kendziorski et al. (2006) computed the posterior probability that a particular gene expression trait maps to marker m for each marker, as well as the posterior probability that the trait maps nowhere in the genome. Nonlinkage in this setting is declared for a transcript if the posterior probability of nonlinkage exceeds a threshold that bounds the posterior expected false discovery rate (FDR). One benefit to this approach is that it controls false discovery for the number of expression traits being tested, whereas assessing the appropriate significance cutoffs in single-transcript linkage analysis often requires data permutation analyses. The drawback of this method is that it assumes that linkage occurs at either one or none of the markers tested and it lacks a well-defined method for the case when multiple eQTLs control an expression trait. In a study of inbred strain crosses, the only valid way of estimating the extent of genetic control of a given trait is to explicitly model each eQTL, including any epistatic interactions if they exist. Brem and Kruglyak (2005) showed that epistatic interactions were prevalent in the gene expression levels in yeast, and similar suggestions have been made in other species as well (Schadt et al. 2005), but more definitive studies are needed to characterize the extent of epistasis among eQTLs in these other species. In the absence of epistasis, the genetic contribution for each transcript has been estimated by summing contributions for each individual eQTL, assuming that little or no allelic association exists between the eQTLs. In the presence of epistasis, however, this practice cannot yield a valid estimate, and multilocus models are instead required to obtain valid estimates. In addition, multilocus modeling can identify loci contributing to expression traits that would have been missed in single-locus eQTL scans (Brem and Kruglyak 2005; Storey et al. 2005). Integrating eQTLs and clinical trait linkage mapping to infer causality While understanding the mechanisms of RNA expression is in itself important for understanding biological processes, the ultimate use of this information is identifying the relationship between variation in expression levels and disease phenotypes in an organism of interest. Microarray experiments are commonly used to explore differential expression between disease and normal tissue samples or between samples from different disease subtypes. These studies are designed to detect association between gene expression and disease-associated traits, which in turn can lead to the identification of biomarkers of disease or disease subtypes. However, in the absence of supporting experimental data, these data alone are not able to distinguish genes that drive disease from those that respond. As discussed above, eQTL mapping can aid traditional clinical trait QTL (cQTL) mapping by narrowing the set of candidate genes underlying a given cQTL peak and by identifying expression traits that are causally associated with the clinical traits. Expression traits detected as significantly correlated with a clinical phenotype may reflect a causal relationship between the traits, either because the expression trait contributes to, or is causal for, the clinical phenotype, or because the expression trait is reactive to, or a marker of, the clinical phenotype. However, correlation may also exist in cases when the two traits are not causally associated. Two traits may appear correlated due to confounding factors such as tight linkage of causal mutations (Schadt et al. 2005) or may arise independently from a common genetic source. The Ay mouse provides an example of correlations between eumelanin RNA levels and obesity phenotypes induced by an allele that acts independently on these different traits, causing both decreased levels of eumelanin RNA and an obesity phenotype. More generally, a clinical and expression traits for a particular gene may depend on the activity of a second gene in such a manner that conditional on the second gene, the clinical and expression traits are independent. Correlation data alone cannot indicate which of the possible relationships between gene expression traits and a clinical trait are true. For example, given two expression traits and a clinical trait detected as correlated in a population of interest, there are 112 ways to order the traits with respect to one another. If we consider the traits as nodes in a network, then there are five possible ways the traits (or nodes) can be connected: (1) connected by an undirected edge, (2) connected by a directed edge moving left to right, (3) connected by a directed edge moving right to left, (4) connected by a directed edge moving right to left and a directed edge moving left to right, and (5) not connected by an edge. Since there are three pairs of nodes, there are 5 × 5 × 5 = 125 possible graphs. However, because we start with the assumption that the traits are all correlated with one another, we exclude 12 of the 125 possible graphs in which one node is not connected to either of the other two nodes, in addition to excluding the graph in which none of the nodes are connected, leaving us with 112 possible graphs (Fig. 2 It is important to note here that when we use the term causality, it is perhaps meant in a more nonstandard sense than most researchers in the life sciences may be accustomed to. In the molecular biology or biochemistry setting, claiming a causal relationship between, say, two proteins usually means that one protein has been determined experimentally to physically interact with or to induce processes that directly affect another protein and that in turn leads to a phenotypic change of interest. In such instances, an understanding of the causal factors relevant to this activity are known, and careful experimental manipulation of these factors subsequently allows for the identification of genuine causal relationships. However, in the present setting, the term “causal” is used from the standpoint of statistical inference, where statistical associations between changes in DNA, changes in expression (or other molecular phenotypes), and changes in complex phenotypes like disease are examined for patterns of statistical dependency among these variables that allows directionality to be inferred among them, where the directionality then provides the source of causal information (highlighting putative regulatory control as opposed to physical interaction). The graphical models (networks) described here, therefore, are necessarily probabilistic structures that use the available data to infer the correct structure of relationships among genes and between genes and clinical phenotypes (Schadt and Lum 2006). In a single experiment with one time point measurement, these methods cannot easily model more complex regulatory structures that are known to exist, like negative feedback control. However, the methods can be useful in providing a broad picture of correlation and causative relationships, and while the more complex structures may not be explicitly represented in this setting, they are captured nevertheless given that they represent observed states that are reached as a result of more complicated processes like feedback control. Distinguishing proximal (“cis”) eQTL effects from distal (“trans”) All genes expressed in living systems are cis-regulated at some level and so are under the control of various cis-acting elements such as promoters and TATA boxes (Fig. 3
Variations in expression levels induced by DNA variations in or near the gene itself may in turn induce changes in the expression levels of other genes (Fig. 3 In most cases it is not possible to infer the true regulatory effects (i.e., cis vs. trans) of an eQTL without complex bioinformatics study (GuhaThakurta et al. 2006) and experimental validation. As a result, eQTLs have been categorized into proximal and distal types based on the distance between the eQTL and the location of the structural genes. Obviously, if these are on different chromosomes the eQTLs are distal, but if they fall on the same chromosome then they are considered proximal only if the distance between the structural gene and the eQTLs do not exceed some threshold. The exact threshold is a function of the number of meioses and extent of recombination in a given population data set. In a completely outbred population where LD mapping has been used to fine-map the eQTLs, it has been reasonable to require the distance between the proximal eQTL and structural gene to be less than 1 Mb (Cheung et al. 2005). However, in an F2 intercross population constructed from two inbred lines of mice, the extent of LD will be extreme given that all animals are descended from a single F1 founder, with only two meiotic events separating any two mice in the population. In such cases the resolution of linkage peaks is quite low, requiring the threshold of peak-to-physical gene distance to be more relaxed, so that eQTLs that are within 20 or 30 Mb could be potentially cis-acting (Doss et al. 2005; Schadt et al. 2003). While the proximal eQTLs provide an easy path to making causal inference, given that the larger effect sizes commonly associated with proximal eQTLs make them easier to detect (Brem et al. 2002; Cervino et al. 2005; Cheung et al. 2005; Lum et al. 2006; Monks et al. 2004; Schadt et al. 2003), the methods discussed above work for distal as well as proximal eQTLs. In fact, if a given gene sits more centrally in a given gene network that drives disease, it may capture a larger percentage of the genetic variation associated with the disease (Fig. 2 More generally leveraging eQTL data to reconstruct gene networks The classic reductionist view applied to genetics has motivated the identification of single genes associated with disease as one means of getting a foot into disease pathways. However, even in cases where genes are involved in pathways that are well known, it is unclear whether the gene causes disease via the known pathway or whether the gene is involved in other pathways or more complex networks that lead to disease. This was the case with TGFBR2, a recently identified and validated obesity susceptibility gene (Schadt et al. 2005). The classic view of the signaling pathways involving the superfamily of transforming growth factor β (TGF-β) proteins is that TGF-β acts through receptor serine/threonine kinases to phosporylate regulatory proteins of the Smad family, which then move into the nucleus where they bind DNA to activate specific sets of target genes (Alberts 2002) (Fig. 4
RNA levels of the type II TGF-β receptor (TGFBR2) were recently shown to be very significantly correlated with thousands of other gene expression traits in the liver transcriptional network of a cross between two inbred lines of mice (referred to here as the BXD cross) (Schadt et al. 2003, 2005) This set of genes associated with TGFBR2 was enriched for a broad range of biological functions known to be associated with the classic TGF-β signaling pathway and with metabolic disease traits such as obesity. Furthermore, TGFBR2 RNA levels in the BXD cross were also found to be significantly correlated with many obesity-related traits like fat mass, percent body fat, and weight. Taking a view that a complex network of gene interactions underlies obesity phenotypes in the BXD cross, genotypic and gene expression data were systematically integrated to assess whether changes in DNA sequence at a given location in the genome (reflected as genotypes in the cross animals) leading to changes in transcript abundances for a given gene supported an independent, causative, or reactive function of that gene relative to various obesity phenotypes like fat mass (Schadt et al. 2005). In partitioning the thousands of genes associated with obesity in this way, TGFBR2 was one of 40 genes predicted as causal for obesity in the BXD cross. TGFBR2 and two other genes selected for validation were all validated as causal for obesity in this study (Schadt et al. 2005). These data directly demonstrated that TGFBR2 and other genes in this signaling pathway are involved in a more general gene network (Schadt et al. 2005a, b), so that it is possible that perturbations in these other genes or in TGFBR2 itself may drive diseases like obesity by influencing other parts of the network beyond the TGF-β signaling pathway (Fig. 4 Networks provide a convenient framework for exploring the context within which single genes operate. Networks are simply graphical models comprising nodes and edges. For gene networks associated with biological systems, the nodes in the network typically represent genes, and edges (links) between any two nodes indicate a relationship between the two corresponding genes. For example, an edge between two genes may indicate that the corresponding expression traits are correlated in a given population of interest (Zhu et al. 2004), that the corresponding proteins interact (Kim et al. 2005), or that changes in the activity of one gene lead to changes in the activity of the other gene (Schadt et al. 2005). Interaction or association networks have recently gained more widespread use in the biological community, where networks are formed by considering only pairwise relationships between genes, including protein interaction relationships [49], coexpression relationships (Gargalovic et al. 2006; Ghazalpour et al. 2006), and other straightforward measures that may indicate association between two genes. Genetic data can aid in the construction of association networks by helping to reduce artifactual correlations between expression traits. Significant artifactual correlations can arise because of correlated noise structures between array-based experiments networks. One way to leverage the eQTL data in this setting is to simply filter out gene-gene correlations in which the expression traits are not at least partially explained by common genetic effects (Lum et al. 2006). For example, we can connect two genes with an edge in a coexpression network if (1) the p value for the Pearson correlation coefficient between the two genes is less than some prespecified threshold, and (2) the two genes had at least one eQTL in common. This can be taken a step further by formally assessing whether two expression traits driven by a common QTL are related in a causal or reactive fashion, filtering out correlations driven by expression traits that are independently driven by common or closely linked QTLs (Doss et al. 2005; Schadt et al. 2005). As has been discussed, multiple traits driven by common QTLs is a central idea that can be leveraged to construct networks. One intuitive way to establish whether two genes share at least one eQTL is to perform single-trait eQTL mapping for each expression trait and then consider eQTLs for each trait overlapping if the corresponding LOD for the eQTLs are above some threshold and if the eQTLs are in close proximity to one another. The significance of the statistic corresponding to the strength of association between two genes in the coexpression networks is then chosen such that the resulting network exhibits the scale-free property (Gargalovic et al. 2006; Ghazalpour et al. 2006; Lum et al. 2006) and the false discovery rate for the gene-gene pairs represented in the network is constrained. Beyond the simple, albeit intuitively appealing, eQTL overlap method, we can formally test whether two overlapping eQTLs represent a single eQTL or closely linked eQTLs by employing a pleiotropy effects test (PET), such as that originally described by Jiang and Zeng (1995) and Zeng et al. (2000). The formation of gene clusters by simultaneously considering gene-gene and marker-gene correlations also promises to provide a more comprehensive characterization of shared genetic effects (Lee et al. 2006). Identifying modules of highly interconnected genes in coexpression networks Given the scale-free and hierarchical nature of coexpression networks (Barabasi and Oltvai 2004; Ghazalpour et al. 2006; Lum et al. 2006), one of the key problems is to identify the network modules, or functional units, in the network that represent those hub nodes (nodes that are significantly correlated with many other nodes) that are highly interconnected with one another but that are not as highly connected with other hub nodes. Figure 5
Ravasz et al. (2002) used manually selected height cutoff to separate tree branches after hierarchical clustering, in contrast to Lee et al. (2004) who formed maximally coherent gene modules with respect to gene ontology (GO) functional categories. Another strategy is to employ a measure similar to that used by Lee et al. (2004) but without the dependence on the GO functional annotations, given it is of interest to determine independently whether coexpression modules are enriched for GO functional annotations (Lum et al. 2006). The modules identified in this way are informative for identifying the functional components of the network that are associated with disease (Lum et al. 2006). It has been demonstrated that the types of modules depicted in Fig. 5 Using eQTL data to reconstruct probabilistic networks Coexpression networks are useful constructs for characterizing gross topological properties of biological networks, highlighting nodes that are highly connected, and identifying functional modules that aid in the characterization of subnetworks associated with disease. The edges in these networks, however, are undirected so that they do not provide explicit details on the connectivity structure among genes in the network. As suggested above, one way to incorporate causal information into the coexpression networks would be to define direction using the causality procedures described in Fig. 2 Bayesian network methods have been applied previously to reconstruct networks comprising only expression traits, and to networks comprising both expression and disease traits, where the aim has been to identify those portions of the network that are driving a given disease trait. Forming candidate relationships among genes was performed using an extension of standard Bayesian network reconstruction methods (Chiellini et al. 2002). In the first approach to extend this method using genetic data, QTL information for the transcript abundances of each gene considered in the network was incorporated into the reconstruction process. It is well known that searching for the best possible network linking a moderately sized set of genes is an NP-hard problem. Exhaustively searching for the optimal network with hundreds of genes is presently a computationally intractable problem. Therefore, various simplifications are typically applied to reduce the size of the search space and to reduce the number of parameters that need to be estimated from the data. Two simplifying assumptions to achieve such reductions are commonly employed. First, while any gene in a biological system can control many other genes, a given gene can be restricted so that it is allowed to be controlled by a reduced set of genes. Second, the set of genes that can be considered possible causal drivers (parent nodes) for a given gene can be restricted using the type of causality arguments discussed in previous sections, as opposed to allowing for the possibility of any gene in the complete gene set to serve as a parent node. The eQTL data in this case can be leveraged as prior information to restrict the types of relationships that can be established among genes and the QTL information can be more intimately integrated into the network reconstruction process. As indicated in previous sections, correlation measures are symmetric and so can indicate association but not causality. However, QTL mapping information for the gene expression traits can be used to help sort out causal relationships. The different tests described in the section on making causal inferences between pairs of traits provides one way to explicitly sort out such relationships. Zhu et al. (2004) leveraged the eQTL data in a different way to make similar types of inferences in their network reconstruction algorithm. With the various constraints and measures defined above, the goal in reconstructing whole-gene networks is to find a graphical model M (a gene network) that best represents the relationships between genes, given a gene expression data set D of interest. That is, given data D, we seek to find the model M with the highest posterior probability P(M|D). The prior probability P(M) of model M is where the product is taken over all paths in the network (M) under consideration. The algorithm Zhu et al. (2004) employed to search through all possible models to find the network that best fits the data is similar to the local maximum search algorithm implemented by Friedman et al. (2000). Zhu et al. (2007) recently demonstrated via simulation of biologically realistic networks that the integration of genetic and expression data in this fashion to reconstruct gene networks leads to networks that are more predictive than networks reconstructed from expression data alone. The Bayesian network reconstruction algorithm can be used to elucidate the module connectivity structure depicted in Fig. 5 Conclusions The identification of DNA polymorphisms that associate with diseases like obesity and diabetes can be considered only as the beginning in a long series of steps needed to elucidate disease pathways and to establish the specific role individual genes may play in the process. Diseases like obesity are diseases of the system, potentially involving many different pathways operating in many different tissues and ultimately giving rise to not only different disease subtypes but to different comorbidities of the disease as well. The integration of gene expression (and other molecular profiling data more generally) and genotypic data will be critical if we are ever to understand how genetic and environmental perturbations to a given system lead to complex traits like disease. If common forms of these diseases represent states of a network, then focusing on single-gene perturbations will likely never reveal the most effective ways to treat or prevent disease. The integration of the diverse sets of molecular data now being generated in population settings is only in its infancy. Many of the methods employed to date toward this end are more heuristic in nature and so will benefit from a more formal treatment. In addition, little to date has been done to integrate expression data from multiple tissues to dissect how modules in one tissue may communicate with modules in another tissue. The types of interactions considered along with eQTL data so far have been restricted to RNA-RNA association data, despite the availability of large-scale DNA-protein and protein-protein interaction data. The predictive power of the types of networks discussed in this review could be enhanced by more systematically integrating protein-protein interactions, protein-DNA interactions, protein-RNA interactions, RNA-RNA interactions, protein state information, methylation state, and interactions with metabolites as these types of data become available. These developments promise to take us beyond the single-gene view of disease and move us closer to the type of systems level view, depicted in Fig. 1 References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||
Science. 2005 Apr 15; 308(5720):421-4.
[Science. 2005]Science. 2005 Apr 15; 308(5720):419-21.
[Science. 2005]Science. 2005 Apr 15; 308(5720):385-9.
[Science. 2005]Nat Genet. 2006 Mar; 38(3):320-3.
[Nat Genet. 2006]Nature. 2007 Feb 22; 445(7130):881-5.
[Nature. 2007]J Lipid Res. 2006 Dec; 47(12):2601-13.
[J Lipid Res. 2006]Science. 2002 Apr 26; 296(5568):752-5.
[Science. 2002]Genetics. 2006 Feb; 172(2):1155-64.
[Genetics. 2006]Nat Genet. 2005 Mar; 37(3):243-53.
[Nat Genet. 2005]Nat Genet. 2001 Dec; 29(4):389-95.
[Nat Genet. 2001]Nat Genet. 2002 Apr; 30(4):385-93.
[Nat Genet. 2002]Nat Immunol. 2000 Sep; 1(3):221-6.
[Nat Immunol. 2000]Nature. 2003 Mar 20; 422(6929):297-302.
[Nature. 2003]Science. 2003 Dec 19; 302(5653):2141-4.
[Science. 2003]Science. 2002 May 31; 296(5573):1661-71.
[Science. 2002]Genome Biol. 2004; 5(10):R73.
[Genome Biol. 2004]Genetics. 2005 Nov; 171(3):1437-9.
[Genetics. 2005]Science. 2002 Apr 26; 296(5568):752-5.
[Science. 2002]Nat Genet. 2005 Mar; 37(3):233-42.
[Nat Genet. 2005]Nature. 2005 Oct 27; 437(7063):1365-9.
[Nature. 2005]Trends Genet. 2001 Jul; 17(7):388-91.
[Trends Genet. 2001]Am J Hum Genet. 2004 Dec; 75(6):1094-105.
[Am J Hum Genet. 2004]Biometrics. 2006 Mar; 62(1):19-27.
[Biometrics. 2006]Proc Natl Acad Sci U S A. 2005 Feb 1; 102(5):1572-7.
[Proc Natl Acad Sci U S A. 2005]PLoS Biol. 2005 Aug; 3(8):e267.
[PLoS Biol. 2005]J Lipid Res. 2006 Dec; 47(12):2601-13.
[J Lipid Res. 2006]Genome Res. 2005 May; 15(5):681-91.
[Genome Res. 2005]Science. 2002 Apr 26; 296(5568):752-5.
[Science. 2002]Genomics. 2005 Nov; 86(5):505-17.
[Genomics. 2005]Nature. 2005 Oct 27; 437(7063):1365-9.
[Nature. 2005]J Neurochem. 2006 Apr; 97 Suppl 1():50-62.
[J Neurochem. 2006]BMC Genomics. 2006 Sep 15; 7():235.
[BMC Genomics. 2006]Nature. 2005 Oct 27; 437(7063):1365-9.
[Nature. 2005]Genome Res. 2005 May; 15(5):681-91.
[Genome Res. 2005]Nature. 2003 Mar 20; 422(6929):297-302.
[Nature. 2003]Science. 2002 Apr 26; 296(5568):752-5.
[Science. 2002]Nature. 2003 Mar 20; 422(6929):297-302.
[Nature. 2003]Cytogenet Genome Res. 2004; 105(2-4):363-74.
[Cytogenet Genome Res. 2004]Science. 2005 May 20; 308(5725):1164-7.
[Science. 2005]Proc Natl Acad Sci U S A. 2006 Aug 22; 103(34):12741-6.
[Proc Natl Acad Sci U S A. 2006]PLoS Genet. 2006 Aug 18; 2(8):e130.
[PLoS Genet. 2006]J Neurochem. 2006 Apr; 97 Suppl 1():50-62.
[J Neurochem. 2006]Genome Res. 2005 May; 15(5):681-91.
[Genome Res. 2005]Proc Natl Acad Sci U S A. 2006 Aug 22; 103(34):12741-6.
[Proc Natl Acad Sci U S A. 2006]PLoS Genet. 2006 Aug 18; 2(8):e130.
[PLoS Genet. 2006]J Neurochem. 2006 Apr; 97 Suppl 1():50-62.
[J Neurochem. 2006]Genetics. 1995 Jul; 140(3):1111-27.
[Genetics. 1995]Genetics. 2000 Jan; 154(1):299-310.
[Genetics. 2000]Nat Rev Genet. 2004 Feb; 5(2):101-13.
[Nat Rev Genet. 2004]PLoS Genet. 2006 Aug 18; 2(8):e130.
[PLoS Genet. 2006]J Neurochem. 2006 Apr; 97 Suppl 1():50-62.
[J Neurochem. 2006]Science. 2002 Aug 30; 297(5586):1551-5.
[Science. 2002]Science. 2004 Nov 26; 306(5701):1555-8.
[Science. 2004]J Neurochem. 2006 Apr; 97 Suppl 1():50-62.
[J Neurochem. 2006]PLoS Genet. 2006 Aug 18; 2(8):e130.
[PLoS Genet. 2006]Cytogenet Genome Res. 2004; 105(2-4):363-74.
[Cytogenet Genome Res. 2004]J Cell Physiol. 2002 Feb; 190(2):251-8.
[J Cell Physiol. 2002]Cytogenet Genome Res. 2004; 105(2-4):363-74.
[Cytogenet Genome Res. 2004]Cytogenet Genome Res. 2004; 105(2-4):363-74.
[Cytogenet Genome Res. 2004]J Comput Biol. 2000; 7(3-4):601-20.
[J Comput Biol. 2000]PLoS Comput Biol. 2007 Apr 13; 3(4):e69.
[PLoS Comput Biol. 2007]