Front Genet. 2012; 3: 26.
Published online Feb 29, 2012. doi:  10.3389/fgene.2012.00026
PMCID: PMC3289795

Naïve Bayesian Classifier and Genetic Risk Score for Genetic Risk Prediction of a Categorical Trait: Not so Different after all!

Abstract

One of the most popular modeling approaches to genetic risk prediction is to use a summary of risk alleles in the form of an unweighted or a weighted genetic risk score, with weights that relate to the odds for the phenotype in carriers of the individual alleles. Recent contributions have proposed the use of Bayesian classification rules using Naïve Bayes classifiers. We examine the relation between the two approaches for genetic risk prediction and show that the methods are mathematically related. In addition, we study the properties of the two approaches and describe how they can be generalized to include various models of inheritance.

Keywords: genetic risk prediction, genetic score, Naïve Bayes classifier, classification score, classification rule

Introduction

Several statistical methods have been proposed to capture the complex genetic basis of common diseases. These approaches include standard regression models in which the contribution of several genetic variants is summarized by a genetic risk score (GRS; Meigs et al., 2008; Purcell et al., 2009; Paynter et al., 2010), multivariate regression models and “machine learning type” approaches such as support vector machines (Wei et al., 2009; Wu et al., 2011), Naïve Bayes classifiers (NBC; Okser et al., 2010), classification and regression trees, random forests (Bureau et al., 2005; McKinney et al., 2006), rule induction (Sebastiani and Perls, 2010; Stengard et al., 2010), multifactor dimensionality reduction (Moore et al., 2006), and Bayesian networks (Rodin and Boerwinkle, 2005; Sebastiani et al., 2005; Jiang et al., 2011; Kang et al., 2011a). NBCs use a simple but surprisingly effective Bayesian rule that classifies a subject as at risk of a trait if the posterior probability of the trait, given the individual’s genetic profile, is maximal (Hand, 2009). The classification rule can be built using a large number of genetic variants, such as single nucleotide polymorphisms (SNPs), by assuming that the SNPs are conditionally independent given the trait (Sebastiani et al., 2012). This hypothesis is often mistaken for “marginal independence,” but neither marginal nor conditional independence implies the other (Whittaker, 1990).

In this manuscript we show that there is a mathematical link between NBCs and logistic regression models that use a GRS to summarize the contribution of many SNPs to the susceptibility to a genetic disease. The link between these two approaches also highlights their limitations. We discuss how the directed graphical model underlying a NBC can be extended to include interactions between genes and/or environmental risk factors while keeping the computations scalable to genome-wide genotype data and even whole genome sequence data.

Methods and Results

We describe two approaches – logistic regression and Bayesian classifier – to define a classification score and a rule to be used for genetic risk prediction of a dichotomous trait denoted as T or “not T.” The classification score for genetic risk prediction is a function that maps a set of SNPs Σ = {S1, …, Sk} into real numbers. The classification rule links the output of the score function to the events T or “not T.” Formally, with S denoting the space of SNPs and R the real numbers:

$$\text{Classification score: } Sc(\Sigma): \mathcal{S} \rightarrow \mathbb{R}$$

$$\text{Classification rule: } Sc(\Sigma) > \tau \;\Rightarrow\; \text{classify as } T$$

Logistic regression with a genetic risk score

A logistic regression model that includes the general effects of k biallelic SNPs Σ = {S1, …, Sk} to model the odds for a dichotomous trait T is defined by the logit equation:

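$$\log\frac{p(T\mid\Sigma)}{1-p(T\mid\Sigma)}=\alpha_0+\sum_{j=1}^{k}\left(\alpha_{1j}X_{jAB}+\alpha_{2j}X_{jBB}\right)$$

where

$$X_{jAB}=\begin{cases}1 & \text{if } S_j \text{ genotype} = AB\\ 0 & \text{otherwise}\end{cases}\quad\text{and}\quad X_{jBB}=\begin{cases}1 & \text{if } S_j \text{ genotype} = BB\\ 0 & \text{otherwise}\end{cases}$$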

We assume that the alleles of the SNPs are ordered in lexicographical order (A < C < G < T), and A represents the first allele and B the second allele regardless of their frequency. The logit equation is the classification score that can be used to define a classification rule based on a threshold τ:

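$$\text{Classification score: } Sc(\Sigma)=\log\frac{p(T\mid\Sigma)}{1-p(T\mid\Sigma)}=\alpha_0+\sum_{j}\left(\alpha_{1j}X_{jAB}+\alpha_{2j}X_{jBB}\right)$$

$$\text{Classification rule: } Sc(\Sigma) > \tau \;\Rightarrow\; \text{classify as } T$$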

and τ can be determined to optimize sensitivity and specificity by receiver operating characteristic (ROC) curve analysis.
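The score and rule above can be illustrated with a minimal Python sketch. The coefficient values, the three-SNP profile, and the function names are hypothetical and chosen only for illustration; in practice the coefficients would be estimated from data as described in the text.

```python
def genotype_indicators(genotype):
    """Return the indicator pair (X_AB, X_BB) for a genotype 'AA', 'AB', or 'BB'."""
    return (1 if genotype == "AB" else 0, 1 if genotype == "BB" else 0)


def logistic_score(genotypes, alpha0, alpha1, alpha2):
    """Log-odds score: alpha0 + sum_j (alpha1[j] * X_jAB + alpha2[j] * X_jBB)."""
    score = alpha0
    for genotype, a1, a2 in zip(genotypes, alpha1, alpha2):
        x_ab, x_bb = genotype_indicators(genotype)
        score += a1 * x_ab + a2 * x_bb
    return score


def classify(score, tau):
    """Classification rule: classify as T when the score exceeds tau."""
    return score > tau


# Hypothetical coefficients for k = 3 SNPs (not estimates from any real data).
alpha0 = -1.0
alpha1 = [0.2, -0.1, 0.4]   # log-odds ratios, AB versus AA
alpha2 = [0.5, -0.3, 0.4]   # log-odds ratios, BB versus AA
profile = ["AB", "BB", "AA"]

score = logistic_score(profile, alpha0, alpha1, alpha2)  # -1.0 + 0.2 - 0.3 + 0.0
print(score, classify(score, tau=-1.5))
```

The threshold passed to `classify` is arbitrary here; the text suggests choosing it by ROC analysis.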

The coefficients of the logistic score are typically estimated by maximum likelihood (McCullagh and Nelder, 1989), or Bayesian methods using large sample approximations or Gibbs sampling (Balding, 2006). By definition, the intercept α0 represents the log-odds for the trait T for the referent group with all SNP genotypes equal to AA, while each parameter α1j represents the log-odds ratio for the trait T between the AB genotype and the AA genotype of the jth SNP, and each parameter α2j represents the log-odds ratio for T between the BB and AA genotype of the jth SNP, assuming the other SNP genotypes fixed. When α2j = 2α1j for all j = 1, …, k, then the logistic regression encodes the additive effects of the SNPs, and each parameter α1j represents the log-odds ratio for T for each additional copy of the B allele relative to the referent genotype AA.

It is well known that when the data are from a case–control study design, the intercept does not provide the correct estimate of the odds for T in the population, and several corrections have been proposed to limit this problem (Jewell, 2003). Bias of the intercept term is not a problem when the logistic regression model is meant to be used for classification, because different intercepts simply shift the logistic function, and classification scores that differ only by the intercept term lead to equivalent classification rules. We state this property formally because it will be used further.

Property 1: Irrelevance of the intercept term of a logistic regression model for classification

Let Sc1(Σ) and Sc2(Σ) be two classification scores defined as:

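$$Sc_1(\Sigma)=\log\frac{p(T\mid\Sigma)}{1-p(T\mid\Sigma)}=\alpha_0+\sum_{j}\left(\alpha_{1j}X_{jAB}+\alpha_{2j}X_{jBB}\right)$$

$$Sc_2(\Sigma)=\log\frac{p(T\mid\Sigma)}{1-p(T\mid\Sigma)}=\beta_0+\sum_{j}\left(\alpha_{1j}X_{jAB}+\alpha_{2j}X_{jBB}\right)$$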

The two classification scores can be used to define equivalent classification rules by using the relation:

$$\text{“if } Sc_1(\Sigma)>\tau \text{ classify as } T\text{”}\quad\text{if and only if}\quad\text{“if } Sc_2(\Sigma)>\tau+\beta_0-\alpha_0 \text{ classify as } T\text{”}$$

We note however that the correct estimate of the intercept term is necessary to be able to interpret the prediction from the logistic model in terms of prevalence of the trait in the population.
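Property 1 can be checked numerically with a small Python sketch. The values of the shared linear part (the sum over SNPs), the two intercepts, and the threshold are hypothetical; only the threshold shift τ + β0 − α0 comes from the text.

```python
# Hypothetical values of sum_j (alpha_1j * X_jAB + alpha_2j * X_jBB) for four subjects.
linear_part = [-0.8, 0.1, 0.5, 1.3]

alpha0 = -1.0   # intercept of the first score (hypothetical)
beta0 = 0.7     # intercept of the second score (hypothetical)
tau = -0.4      # threshold applied to the first score (hypothetical)

# Rule 1: classify as T if Sc1 > tau.
calls_1 = [alpha0 + lp > tau for lp in linear_part]

# Rule 2: classify as T if Sc2 > tau + beta0 - alpha0.
shifted_tau = tau + beta0 - alpha0
calls_2 = [beta0 + lp > shifted_tau for lp in linear_part]

print(calls_1)
print(calls_2)
```

Both lists agree for every subject, which is the sense in which the two rules are equivalent.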

One of the limitations of multivariate logistic regression is that the number of covariates is bounded above by the sample size. It is expected that many common genetic complex traits may be determined by hundreds of genetic variants (Kraft and Hunter, 2009), so that the sample size needed to build reliable logistic regression models for risk prediction can be prohibitively large.

A naïve but very popular alternative is to collapse the contribution of the k SNPs into a GRS to be used in a univariate logistic model. A GRS is typically defined as the weighted sum of the genotypes:

$$GRS=GRS(\Sigma)=\sum_{i=1}^{k}\left(w_iX_{iAA}+v_iX_{iAB}+z_iX_{iBB}\right)$$

with weights that can be appropriately chosen. The variables XiAB and XiBB are defined as above, and XiAA = 1 if the ith SNP genotype is AA and 0 otherwise. See Table 1 for a summary of three possible weighting schemes. The GRS is then used as a risk factor to define a classification score using a univariate logistic regression:

$$Sc(\Sigma)=\log\frac{p(T\mid GRS)}{1-p(T\mid GRS)}=\gamma_0+\gamma_1\,GRS$$

Table 1
Example of choice of weights for the weighted genetic risk score.

Case 1

Although this is often referred to as the “unweighted genetic score,” the heterozygote genotype is always assigned a weight 1, while the homozygous genotype for the risk allele is assigned weight 2 and the other genotype is assigned weight 0. By adopting this weighting scheme, we are simply counting the number of risk alleles each subject carries. The risk allele of each SNP is determined by a “one-SNP-at-a-time” association analysis, typically under an additive genetic model. Using the same notation and lexicographical order of the SNPs that we used earlier, the risk allele of each SNP will be the A allele if the regression coefficient αi of the logistic regression model

$$\log\frac{p(T\mid S_i)}{1-p(T\mid S_i)}=\alpha_{0i}+\alpha_i\left(X_{iAB}+2X_{iBB}\right)$$

is negative, and the B allele if αi is positive. In the first case (αi < 0), each copy of the B allele decreases the odds for T, while in the second case (αi > 0) each copy of the B allele increases the odds for T. With this definition, the GRS is only a function of the different number of risk alleles regardless of their individual genetic effects, and two identical GRS values can represent genetic profiles that are substantially different. See Figure 1 for an example.

Figure 1
Example of GRS (case 1 and case 2 in Table 1) based on three SNPs associated with exceptional longevity. The table on top reports the A/B alleles for the three SNPs, the frequencies of A allele in cases and controls, and the p-value for ...
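The case 1 score can be sketched in Python as a count of risk alleles. The single-SNP coefficients and the two profiles below are hypothetical. The text does not say how a coefficient of exactly zero is handled; this sketch arbitrarily treats it like the negative case.

```python
def b_allele_count(genotype):
    """Number of B alleles in a genotype coded 'AA', 'AB', or 'BB'."""
    return {"AA": 0, "AB": 1, "BB": 2}[genotype]


def unweighted_grs(genotypes, alphas):
    """Count risk alleles: B is the risk allele when alpha_i > 0, A otherwise."""
    total = 0
    for genotype, alpha in zip(genotypes, alphas):
        n_b = b_allele_count(genotype)
        total += n_b if alpha > 0 else 2 - n_b
    return total


# Hypothetical coefficients from one-SNP-at-a-time additive models.
alphas = [0.3, -0.2, 0.1]

profile_1 = ["BB", "BB", "AA"]   # 2 risk alleles, all at the first SNP
profile_2 = ["AA", "AA", "AA"]   # 2 risk alleles, all at the second SNP

grs_1 = unweighted_grs(profile_1, alphas)
grs_2 = unweighted_grs(profile_2, alphas)
print(grs_1, grs_2)
```

The two profiles share no genotype at the first two SNPs yet receive the same score, which illustrates how identical case 1 GRS values can hide different genetic profiles.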

The slope γ1 in the classification score:

$$Sc(\Sigma)=\log\frac{p(T\mid GRS)}{1-p(T\mid GRS)}=\gamma_0+\gamma_1\,GRS\qquad(1)$$

measures the association of the GRS with the trait T in terms of log-odds ratio for T between two GRS that differ by 1, and it is often estimated to test whether the GRS is significantly associated with T. However, the value of γ1 is irrelevant for classification because two classification scores defined as in Eq. 1 that differ by the slope will produce equivalent classification rules. This is stated in the next property.

Property 2: Irrelevance of the slope of a univariate logistic regression model for classification

Let Sc1(Σ) and Sc2(Σ) be two classification scores defined as:

$$Sc_1(\Sigma)=\log\frac{p(T\mid GRS)}{1-p(T\mid GRS)}=\gamma_0+\gamma_1\,GRS$$

$$Sc_2(\Sigma)=\log\frac{p(T\mid GRS)}{1-p(T\mid GRS)}=\beta_0+\beta_1\,GRS$$

The two classification scores can be used to define equivalent classification rules by using the relation:

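$$\text{“}Sc_1(\Sigma)>\tau \;\Rightarrow\; \text{classify as } T\text{”}\quad\text{if and only if}\quad\text{“}Sc_2(\Sigma)>\beta_0+\beta_1\frac{\tau-\gamma_0}{\gamma_1} \;\Rightarrow\; \text{classify as } T\text{”}$$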

The GRSs labeled 2 and 3 in Table 1 weight SNP alleles in different ways to reflect their individual associations with the trait T.
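A short Python check of Property 2 follows. The GRS values, intercepts, slopes, and threshold are hypothetical. The sketch assumes that both slopes are positive; the property as stated does not spell out a sign condition, and the direction of the inequality would flip if the two slopes had opposite signs.

```python
# Hypothetical GRS values for six subjects.
grs_values = [0, 1, 2, 3, 4, 5]

gamma0, gamma1 = -2.0, 0.5   # first score (hypothetical, positive slope)
beta0, beta1 = 1.0, 2.0      # second score (hypothetical, positive slope)
tau = -0.75                  # threshold applied to the first score

# Rule 1: classify as T if gamma0 + gamma1 * GRS > tau.
calls_1 = [gamma0 + gamma1 * g > tau for g in grs_values]

# Rule 2: classify as T if beta0 + beta1 * GRS > beta0 + beta1 * (tau - gamma0) / gamma1.
tau_2 = beta0 + beta1 * (tau - gamma0) / gamma1
calls_2 = [beta0 + beta1 * g > tau_2 for g in grs_values]

print(tau_2)
print(calls_1)
print(calls_2)
```

Both rules reduce to the same cut on the GRS itself, here GRS > 2.5.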

Case 2

The GRS can be written as:

$$GRS=\sum_{i=1}^{k}v_i\left(X_{iAB}+2X_{iBB}\right)$$

where each weight vi is the maximum likelihood estimate of the regression coefficient in the univariate logistic regression:

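$$\log\frac{p(T\mid X_i)}{1-p(T\mid X_i)}=\alpha_{i0}+v_iX_i;\qquad X_i=\begin{cases}1 & \text{if } S_i = AB\\ 2 & \text{if } S_i = BB\\ 0 & \text{otherwise}\end{cases}$$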

that measures the association between SNP Si and the trait T with an additive genetic model. Therefore, each weight

$$v_i=\log\frac{p(T\mid X_i=1)/\left(1-p(T\mid X_i=1)\right)}{p(T\mid X_i=0)/\left(1-p(T\mid X_i=0)\right)}$$

estimates the log-odds ratio for T for each copy of the B allele in an additive genetic model. Note that this formulation of the GRS does not require the specification of the risk allele of the SNPs, and the weighted genetic score will increase by vi for each copy of the B allele of SNP Si, if this is a risk allele, and decrease by vi for each copy of the B allele if this is the protective allele. See the example in Figure 1.

The classification score based on this GRS is computed using the logistic regression in Eq. 1, with parameters γ0, γ1 that can be estimated by maximum likelihood or Bayesian methods. The slope represents the log-odds ratio for T for a unit change of the GRS. In general, the log-odds ratio for T between two genetic profiles Σ1 = {S11, …, Sk1} and Σ2 = {S12, …, Sk2} associated with GRS1 and GRS2 is

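$$\log\left(\frac{p(T\mid GRS_1)/\left(1-p(T\mid GRS_1)\right)}{p(T\mid GRS_2)/\left(1-p(T\mid GRS_2)\right)}\right)=\gamma_1\sum_{i=1}^{k}\log\left(\frac{p(T\mid S_{i1})/\left(1-p(T\mid S_{i1})\right)}{p(T\mid S_{i2})/\left(1-p(T\mid S_{i2})\right)}\right)$$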

and this equation shows that the log-odds ratio for T between two weighted GRSs is an average of log-odds ratios of the individual genetic effects rescaled by the coefficient γ1.

The classification rule

$$\text{if } Sc_1(\Sigma)=\log\frac{p(T\mid GRS)}{1-p(T\mid GRS)}>\tau \;\Rightarrow\; \text{classify as } T$$

based on the score

$$Sc_1(\Sigma)=\log\frac{p(T\mid GRS)}{1-p(T\mid GRS)}=\gamma_0+\gamma_1\,GRS$$

is equivalent to:

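$$\text{if } \sum_{i=1}^{k}\log\frac{p(T\mid S_i)/\left(1-p(T\mid S_i)\right)}{p(T\mid S_i=AA)/\left(1-p(T\mid S_i=AA)\right)}>\frac{\tau-\gamma_0}{\gamma_1} \;\Rightarrow\; \text{classify as } T$$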

So the classification rule that uses the weighted GRS in case 2 is essentially based on an average of the individual log-odds ratio for T of each SNP genotype relative to the referent genotypes.

Case 3

The GRS is:

$$GRS=\sum_{i=1}^{k}\left(v_iX_{iAB}+z_iX_{iBB}\right)$$

where vi and zi are the maximum likelihood estimates of the regression coefficients of the univariate logistic regression

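$$\log\frac{p(T\mid S_i)}{1-p(T\mid S_i)}=\alpha_{i0}+v_iX_{iAB}+z_iX_{iBB};\qquad X_{iAB}=\begin{cases}1 & \text{if } S_i = AB\\ 0 & \text{otherwise}\end{cases};\qquad X_{iBB}=\begin{cases}1 & \text{if } S_i = BB\\ 0 & \text{otherwise}\end{cases}$$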

that measures the genotypic association between SNP Si and the trait T. Therefore

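$$v_i=\log\frac{p(T\mid S_i=AB)/\left(1-p(T\mid S_i=AB)\right)}{p(T\mid S_i=AA)/\left(1-p(T\mid S_i=AA)\right)};\qquad z_i=\log\frac{p(T\mid S_i=BB)/\left(1-p(T\mid S_i=BB)\right)}{p(T\mid S_i=AA)/\left(1-p(T\mid S_i=AA)\right)}$$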

are the log-odds ratios for T between the AB and AA genotypes, and between the BB and AA genotypes. See Figure 2 for an example. The classification score and classification rule are derived as in case 2 and can be interpreted as an average of the log-odds ratios of individual SNP genotypes. Compared to case 2, the weights based on genotype associations allow for more general models of association that are not restricted to a linear increase of the log-odds for T. Note also that when the SNPs included in a GRS (case 2 and 3) are independent, the two scores should be approximately equivalent to multivariate logistic regression with additive (case 2) or genotypic association (case 3). In addition, if the SNPs included in the GRS have similar effects, then the GRS in case 1 and 2 should be approximately equivalent.

Figure 2
Example of GRS (case 3 in Table 1). The table on top reports the A/B alleles for the three SNPs, the frequencies of A allele in cases and controls, and the odds ratio for exceptional longevity in carriers of the AB allele relative to carriers ...
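For the genotypic model of case 3, the single-SNP logistic regression is saturated (one parameter per genotype), so its maximum likelihood estimates reduce to empirical log-odds ratios computed from genotype counts. The following Python sketch uses that closed form. The counts are hypothetical, and the function would fail on a zero count, which a real analysis would need to handle.

```python
import math


def genotypic_weights(case_counts, control_counts):
    """Return (v, z): log-odds ratios for AB versus AA and BB versus AA.

    case_counts and control_counts map 'AA', 'AB', 'BB' to subject counts.
    """
    odds = {g: case_counts[g] / control_counts[g] for g in ("AA", "AB", "BB")}
    v = math.log(odds["AB"] / odds["AA"])
    z = math.log(odds["BB"] / odds["AA"])
    return v, z


# Hypothetical genotype counts for one SNP.
cases = {"AA": 100, "AB": 200, "BB": 100}
controls = {"AA": 200, "AB": 200, "BB": 50}

v, z = genotypic_weights(cases, controls)
print(v, z)   # log(2) and log(4) for these counts
```

In this example z equals 2v, so the genotypic weights coincide with what an additive model would give; counts that depart from that pattern are where case 2 and case 3 differ.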

Naïve Bayes classifiers

The classification score based on a NBC is the posterior probability of the trait T that is calculated using the formula:

$$Sc(\Sigma)=p(T\mid\Sigma)=\frac{p(T)\prod_{i=1}^{k}p(S_i\mid T)}{p(T)\prod_{i=1}^{k}p(S_i\mid T)+\left(1-p(T)\right)\prod_{i=1}^{k}p(S_i\mid \text{not } T)}$$

where p(T) and 1 − p(T) are the prior probabilities of having the trait T or not. The conditional probabilities p(Si | T) and p(Si | not T) represent the distribution of the ith SNP genotype in subjects with and without the trait T. They are typically estimated assuming genotypic association (Sebastiani et al., 2012), but they could also be estimated using an additive genetic model. The formula is derived using Bayes’ theorem and assuming that the SNPs are independent, conditionally on T (Hand, 2009). The usual Bayesian classification rule is to classify a subject with the most probable outcome

$$\text{if } Sc(\Sigma)>0.5 \;\Rightarrow\; \text{classify as } T.$$
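The posterior probability and the 0.5 rule can be sketched in Python as follows. The prior and the genotype distributions are hypothetical. The sketch multiplies probabilities directly, which is fine for a handful of SNPs; with thousands of SNPs one would usually sum logarithms instead to avoid numerical underflow.

```python
def nbc_posterior(genotypes, prior, p_given_t, p_given_not_t):
    """Posterior p(T | genotypes) under conditional independence of SNPs given T.

    p_given_t[i] and p_given_not_t[i] map 'AA', 'AB', 'BB' to the genotype
    probabilities of SNP i in subjects with and without the trait.
    """
    weight_t = prior
    weight_not_t = 1.0 - prior
    for genotype, dist_t, dist_not_t in zip(genotypes, p_given_t, p_given_not_t):
        weight_t *= dist_t[genotype]
        weight_not_t *= dist_not_t[genotype]
    return weight_t / (weight_t + weight_not_t)


# Hypothetical prior and genotype distributions for two SNPs.
prior = 0.5
p_given_t = [
    {"AA": 0.50, "AB": 0.25, "BB": 0.25},
    {"AA": 0.25, "AB": 0.50, "BB": 0.25},
]
p_given_not_t = [
    {"AA": 0.25, "AB": 0.50, "BB": 0.25},
    {"AA": 0.25, "AB": 0.25, "BB": 0.50},
]

posterior = nbc_posterior(["AA", "AB"], prior, p_given_t, p_given_not_t)
classified_as_t = posterior > 0.5
print(posterior, classified_as_t)
```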

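This rule is based on a 0–1 loss that assigns the same weight to misclassification errors. A general loss function that weights sensitivity and specificity differently would lead to the classification rule:

$$\text{if } Sc(\Sigma)>\frac{\lambda}{1+\lambda} \;\Rightarrow\; \text{classify as } T \quad\text{for } \lambda>0$$

that can also be written as:

$$Sc(\Sigma)>\frac{\lambda}{1+\lambda}\;\Longleftrightarrow\;\log\frac{p(T\mid\Sigma)}{1-p(T\mid\Sigma)}>\log(\lambda)$$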

and simple algebra shows that this is equivalent to:

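$$\begin{aligned}
\log\frac{p(T\mid\Sigma)}{1-p(T\mid\Sigma)}
&=\log\frac{p(T)\prod_{i=1}^{k}p(S_i\mid T)}{\left(1-p(T)\right)\prod_{i=1}^{k}p(S_i\mid \text{not } T)}
=\log\frac{\prod_{i=1}^{k}p(T)\,p(S_i\mid T)}{\prod_{i=1}^{k}\left(1-p(T)\right)p(S_i\mid \text{not } T)}\\
&=\log\prod_{i=1}^{k}\frac{p(T\mid S_i)}{1-p(T\mid S_i)}
=\sum_{i=1}^{k}\log\frac{p(T\mid S_i)}{1-p(T\mid S_i)}>\log(\lambda)
\end{aligned}$$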

As long as the log-odds ratios are calculated using the same genetic model, this classification rule is equivalent to the classification rule based on the GRS (either case 2 or 3)

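$$\text{if } \sum_{i=1}^{k}\log\frac{p(T\mid S_i)/\left(1-p(T\mid S_i)\right)}{p(T\mid S_i=AA)/\left(1-p(T\mid S_i=AA)\right)}>\frac{\tau-\gamma_0}{\gamma_1} \;\Rightarrow\; \text{classify as } T$$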

by setting the threshold

$$\tau=\gamma_0-\gamma_1\sum_{i=1}^{k}\log\frac{p(T\mid S_i=AA)}{1-p(T\mid S_i=AA)}+\gamma_1\log(\lambda)$$

We state this relation formally.

Property 3: Equivalence of classification rules based on the GRS and the NBC

The classification rules based on a logistic model of a GRS (case 2 or 3) and a NBC are equivalent when the same genetic models are used to link individual SNPs to the trait.

The details of the algebraic manipulations are in Section “Appendix.”

Note that the equivalence between the classification rules based on a NBC and a logistic regression model with a GRS as in case 2 or 3 is a simple consequence of the fact that both models base the prediction on a weighted average of ORs of the individual SNPs. This equivalence is independent of the choice of the prior for T because different prior distributions will lead to equivalent classification rules but with different classification thresholds. Also, the equivalence of classification rules based on GRS and NBC implies that when alternative classifiers are compared by the area under the receiver operating characteristic (ROC) curve they must reach the same value. This is shown in the next example.
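The equivalence can also be checked numerically. The Python sketch below assumes that the NBC genotype distributions and the case 3 weights are estimated from the same genotype counts, which are hypothetical here, as is the prior. It computes the NBC posterior log-odds and the case 3 GRS for every genotype profile of two SNPs and verifies that the two scores differ only by an additive constant, so any threshold on one corresponds to a shifted threshold on the other.

```python
import itertools
import math

GENOTYPES = ("AA", "AB", "BB")

# Hypothetical genotype counts for two SNPs (400 cases, 400 controls).
case_counts = [{"AA": 120, "AB": 200, "BB": 80}, {"AA": 90, "AB": 180, "BB": 130}]
control_counts = [{"AA": 160, "AB": 190, "BB": 50}, {"AA": 150, "AB": 170, "BB": 80}]


def nbc_log_odds(profile, prior):
    """Posterior log-odds of T under the NBC, with frequencies taken from counts."""
    total = math.log(prior / (1.0 - prior))
    for g, cc, nc in zip(profile, case_counts, control_counts):
        p_t = cc[g] / sum(cc.values())        # p(S_i = g | T)
        p_not_t = nc[g] / sum(nc.values())    # p(S_i = g | not T)
        total += math.log(p_t / p_not_t)
    return total


def grs_case3(profile):
    """Case 3 GRS: sum of empirical log-odds ratios relative to genotype AA."""
    total = 0.0
    for g, cc, nc in zip(profile, case_counts, control_counts):
        total += math.log((cc[g] / nc[g]) / (cc["AA"] / nc["AA"]))
    return total


profiles = list(itertools.product(GENOTYPES, repeat=len(case_counts)))
gaps = [nbc_log_odds(p, prior=0.1) - grs_case3(p) for p in profiles]
spread = max(gaps) - min(gaps)
print(len(profiles), spread)   # spread is zero up to rounding error
```

Changing the prior changes the constant gap but not the spread, in line with the remark that the prior only moves the classification threshold.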

Example

To demonstrate the connection between the NBC and the GRS in case 3, we performed a simple simulation. We simulated a dataset with 3000 cases and 3000 controls, and genotype data from 75 causal SNPs and 500,000 null SNPs. For the null SNPs, we randomly selected frequencies of the minor allele (p) from a uniform (0.05, 0.5) distribution and genotype frequencies were generated assuming Hardy–Weinberg equilibrium [p², 2p(1 − p), (1 − p)²]. The causal SNPs were simulated with ORs of 1.2, 1.3, 1.4, 1.5, and 1.6 and minor allele frequencies (MAFs) of 0.1, 0.2, 0.3, 0.4, and 0.5. A causal SNP was simulated for each combination of the above ORs and MAFs (25 combinations) under an additive, recessive, and dominant mode of inheritance (25 combinations × 3 modes of inheritance = 75 SNPs). The genotype frequencies in controls were generated to follow Hardy–Weinberg equilibrium [p², 2p(1 − p), (1 − p)²]. The genotype frequencies in cases for the additive, recessive, and dominant models were [p², 2·OR·p(1 − p), OR²(1 − p)²], [p², 2p(1 − p), OR(1 − p)²], and [p², 2·OR·p(1 − p), OR(1 − p)²], respectively. For the cases, the genotype frequencies were divided by the sum of the frequencies so that the frequencies add up to 1. Using the genotype frequencies for each SNP, we simulated a discovery set of 3000 cases and 3000 controls and a replication set with the same sample sizes.
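The genotype frequencies described above can be sketched in Python as follows. This is an illustration of the stated formulas, not the simulation code used for the paper; the function names, the random seed, and the coding of genotypes as positions 0, 1, 2 (in the order the frequencies are listed in the text) are assumptions.

```python
import random


def control_freqs(p):
    """Hardy-Weinberg genotype frequencies [p^2, 2p(1 - p), (1 - p)^2]."""
    return [p ** 2, 2 * p * (1 - p), (1 - p) ** 2]


def case_freqs(p, odds_ratio, mode):
    """Genotype frequencies in cases, rescaled so that they sum to 1."""
    multipliers = {
        "additive": (1.0, odds_ratio, odds_ratio ** 2),
        "recessive": (1.0, 1.0, odds_ratio),
        "dominant": (1.0, odds_ratio, odds_ratio),
    }[mode]
    raw = [f * m for f, m in zip(control_freqs(p), multipliers)]
    total = sum(raw)
    return [r / total for r in raw]


def sample_genotypes(freqs, n, rng):
    """Draw n genotypes coded 0, 1, 2 by position in the frequency vector."""
    return rng.choices([0, 1, 2], weights=freqs, k=n)


rng = random.Random(2012)   # arbitrary seed for reproducibility
freqs_cases = case_freqs(p=0.3, odds_ratio=1.5, mode="additive")
cases = sample_genotypes(freqs_cases, 3000, rng)
controls = sample_genotypes(control_freqs(0.3), 3000, rng)
print(freqs_cases, len(cases), len(controls))
```

Looping this over the 25 OR and MAF combinations and the three modes of inheritance would give the 75 causal SNPs; null SNPs would use `control_freqs` for both groups.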

The data in the discovery set were analyzed to generate genetic risk models based on GRS and NBCs in the following way. A Bayesian genome-wide association study was performed on the discovery set and SNPs were ordered according to the posterior probability for the genotypic association to build nested NBCs with increasing number of SNPs as in Sebastiani et al. (2012). To obtain the weights for the three GRSs, we ran two logistic regression models for each SNP, using an additive mode of inheritance and a genotypic mode of inheritance. The results of these analyses were used to detect the risk alleles of SNPs for nested GRS as in case 1; and to estimate the weights of GRS as in cases 2 and 3. Using SNPs ordered by the posterior probability for the genotypic association, we then built three sets of classification models based on logistic regression and the three different GRS, with increasing number of SNPs. The prediction models were tested on the replication set to avoid issues of over-fitting. The simulation described above was repeated five times and the mean AUC across the replicates was used to assess accuracy.

Figure 3 (left panel) shows the mean AUC across five replicates for the NBCs and logistic regression models for different GRSs, with increasing number of SNPs. As expected based on our mathematical calculations, the AUCs of the genetic risk models based on the NBCs and the GRSs with genotypic weights are identical (Figure 3, left panel), and the predicted probabilities are almost identical (Figure 3, right panel). The weighted and unweighted GRS using an additive mode of inheritance have lower AUCs, demonstrating the loss of accuracy from assuming additivity when some of the SNPs do not follow an additive mode of inheritance. Of course, if all SNPs do in fact follow an additive mode of inheritance, the genotypic and additive prediction models would perform similarly. The trend of the AUC shows that accuracy keeps increasing as true positive SNPs are included in the model, and then declines when each classification model starts including false positive SNPs. The decline is more evident for the case 1 GRS, while both weighted GRS based on additive or genotypic associations appear to be more robust.

Figure 3
Results of simulation for replication set. The left hand plot graphs the mean area under the ROC (AUC) versus the number of SNPs in the prediction model. The colored lines refer to the AUC of the NBC (black), the unweighted GRS from an additive model ...
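The reason equivalent rules give identical AUCs is that the AUC depends only on how subjects are ranked by the score. A minimal Python sketch, using hypothetical scores and the pairwise (Mann-Whitney) form of the AUC, shows that an increasing linear transformation of the scores leaves the AUC unchanged.

```python
def auc(case_scores, control_scores):
    """Fraction of case-control pairs in which the case scores higher (ties count 1/2)."""
    wins = 0.0
    for sc in case_scores:
        for sn in control_scores:
            if sc > sn:
                wins += 1.0
            elif sc == sn:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical classification scores.
case_scores = [0.9, 0.4, 1.7, 0.2]
control_scores = [0.1, 0.5, -0.3, 0.2]

base_auc = auc(case_scores, control_scores)

# Same scores after an increasing linear transformation (positive slope).
rescaled_auc = auc(
    [2.0 * s - 3.0 for s in case_scores],
    [2.0 * s - 3.0 for s in control_scores],
)
print(base_auc, rescaled_auc)
```

The quadratic loop is adequate for an illustration; for sample sizes like those in the simulation a rank-based implementation would be faster.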

Discussion

One of the selling points of genome-wide association studies was to discover genetic variants that are associated with increased susceptibility for disease and could be used for personalized diagnosis and prognosis. Initial results published, for example, in Meigs et al. (2008) and Paynter et al. (2010), however, showed that genetic data added limited predictive value to well established risk factors of Type II diabetes and cardiovascular disease. These initial studies limited the attention to those SNPs that reached genome-wide significance and their effect was summarized into a GRS. Since then, a growing body of literature has shown the increased value of deeper mining of genome-wide association studies, but the inclusion of large numbers of SNPs in genetic risk models has continued to rely on GRSs (Cui, 2009; Goddard et al., 2009; Kooperberg et al., 2009; Purcell et al., 2009; Yang et al., 2010; Chen et al., 2011; Chibnik et al., 2011), while machine learning type methods continue to be rare despite some successful applications (Wei et al., 2009; Okser et al., 2010; Kang et al., 2011b; Sebastiani et al., 2012).

Our study shows that risk prediction based on a GRS is mathematically equivalent to risk prediction based on a NBC, when the same SNPs with the same mode of inheritance are used in the models. The equivalence is based on the fact that both models essentially base the prediction on a weighted average of ORs of the individual SNPs. While this equivalence establishes the validity of methods based on the NBC for genetic risk prediction, and we hope it will contribute to making this approach more popular in this field, it also shows that, contrary to what is stated in Okser et al. (2010), a NBC does not include interactions of SNPs but only additive genetic effects. However, the directed graphical model underlying a NBC can be extended to more general structures to include interactions between genes and/or environmental risk factors while keeping the computations scalable to genome-wide genotype data and even whole genome sequence data (Sebastiani and Perls, 2008).

Figure 4 shows some ways to extend NBCs for risk prediction to include population ancestry, as well as genetic and non-genetic effects that may be missed by tests for marginal associations. Figure 4A describes a directed acyclic graph (DAG) with one parent node (T) and two children nodes (X1 and X2) that may represent SNPs. The DAG describes the conditional independence of X1 and X2 given T. This type of DAG with one root node and multiple conditionally independent children represents a NBC (Sebastiani and Abad-Grau, 2007). The DAG in Figure 4B extends the NBC in Figure 4A with an additional node X3 that is marginally independent of T, but conditionally dependent on T given X2. In the context of genetic risk modeling, the node X3 could represent a non-genetic risk factor that is associated with a trait T only in specific genetic backgrounds (the node X2). The DAG in Figure 4C includes an additional node X4 that is conditionally independent of all other nodes given X1. This additional node may represent a gene × gene interaction that is induced by linkage disequilibrium. Note that both DAGs in Figures 4B,C would give the same classification score for T, because of the independence of T from X4 given X1. So, the DAG in Figure 4C would be useful for a better explanation of the biology rather than for improving genetic risk prediction. Finally, the DAG in Figure 4D extends the DAG in Figure 4B by adding a link from T to X3. The inclusion of this link makes the node X3 marginally dependent on T, and the interaction between X2 and X3 changes the classification score compared to the DAG in Figure 4B.

Figure 4
Examples of directed acyclic graph (DAG). All nodes are random variables and the DAG represents Markov properties of marginal and conditional independence (Lauritzen and Sheehan, 2004). In particular, the global Markov property states that a node is independent ...

In addition, and most importantly, the fact that all variables in a DAG are random provides a sound framework for marginal and conditional inference. For example, a genetic risk model based on a DAG can be used for predicting the outcome of a subject by marginalizing out unobserved variables (Solovieff et al., 2011).

Our analysis is limited to binary outcomes, but we expect that similar results hold when the outcome to be predicted is a quantitative trait that follows a normal distribution. Furthermore, our analysis shows that linear transformations of a GRS do not impact predictive accuracy, and similarly, that the predictive accuracy of a NBC cannot be changed by a choice of prior for T. Improving the accuracy can be accomplished by selection of the most predictive SNPs and by choosing alternative weights to calculate the GRS. There is no obvious similar choice for a NBC. However, a closely related approach that we used in Sebastiani et al. (2012) to improve the predictive accuracy is to use an ensemble of nested NBCs. Finally, the machine learning community has developed many feature selection algorithms for building classifiers (Hastie et al., 2009) that, by the equivalence proved in this paper, may prove to be useful to generate better genetic risk models.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

Research supported with funds from NIH/NHLBI R01 HL089655-03 and R21HL114237 (Paola Sebastiani). We thank Dr. Maria Abad Grau for stimulating discussion during her visit at Boston University in August 2011 that triggered the writing of this manuscript.

Appendix

Derivation of property 3

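$$\sum_{i=1}^{k}\log\frac{p(T\mid S_i)/\left(1-p(T\mid S_i)\right)}{p(T\mid S_i=AA)/\left(1-p(T\mid S_i=AA)\right)}>\frac{\tau-\gamma_0}{\gamma_1} \;\Rightarrow\; \text{classify as } T$$

if and only if

$$\sum_{i=1}^{k}\log\left(\frac{p(T\mid S_i)}{1-p(T\mid S_i)}\right)-\sum_{i=1}^{k}\log\left(\frac{p(T\mid S_i=AA)}{1-p(T\mid S_i=AA)}\right)>\frac{\tau-\gamma_0}{\gamma_1} \;\Rightarrow\; \text{classify as } T$$

if and only if

$$\sum_{i=1}^{k}\log\left(\frac{p(T\mid S_i)}{1-p(T\mid S_i)}\right)>\frac{\tau-\gamma_0}{\gamma_1}+\sum_{i=1}^{k}\log\left(\frac{p(T\mid S_i=AA)}{1-p(T\mid S_i=AA)}\right) \;\Rightarrow\; \text{classify as } T$$

if and only if

$$\sum_{i=1}^{k}\log\left(\frac{p(T\mid S_i)}{1-p(T\mid S_i)}\right)>\log(\lambda)$$

where

$$\log(\lambda)=\frac{\tau-\gamma_0}{\gamma_1}+\sum_{i=1}^{k}\log\left(\frac{p(T\mid S_i=AA)}{1-p(T\mid S_i=AA)}\right)$$

and

$$\tau=\gamma_0-\gamma_1\sum_{i=1}^{k}\log\left(\frac{p(T\mid S_i=AA)}{1-p(T\mid S_i=AA)}\right)+\gamma_1\log(\lambda)$$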

References

  • Balding D. J. (2006). A tutorial on statistical methods for population association studies. Nat. Rev. Genet. 7, 781–791. doi:10.1038/nrg1916
  • Bureau A., Dupuis J., Falls K., Lunetta K. L., Hayward B., Keith T. P., Van Eerdewegh P. (2005). Identifying SNPs predictive of phenotype using random forests. Genet. Epidemiol. 28, 171–182. doi:10.1002/gepi.20041
  • Chen H., Poon A., Yeung C., Helms C., Pons J., Bowcock A. M., Kwok P. Y., Liao W. (2011). A genetic risk score combining ten psoriasis risk loci improves disease prediction. PLoS ONE 6, e19454. doi:10.1371/journal.pone.0019454
  • Chibnik L. B., Keenan B. T., Cui J., Liao K. P., Costenbader K. H., Plenge R. M., Karlson E. W. (2011). Genetic risk score predicting risk of rheumatoid arthritis phenotypes and age of symptom onset. PLoS ONE 6, e24380. doi:10.1371/journal.pone.0024380
  • Cui J. (2009). Overview of risk prediction models in cardiovascular disease research. Ann. Epidemiol. 19, 711–717. doi:10.1016/j.annepidem.2009.05.005
  • Goddard M. E., Wray N. R., Verbyla K., Visscher P. M. (2009). Estimating effects and making predictions from genome-wide marker data. Stat. Sci. 24, 517–529. doi:10.1214/09-STS306
  • Hand D. J. (2009). “Naive Bayes,” in The Top Ten Algorithms in Data Mining, eds Wu X., Kumar V. (London: Chapman and Hall), 163–178.
  • Hastie T., Tibshirani R., Friedman J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer.
  • Jewell R. (2003). Statistics for Epidemiology. Boca Raton: CRC/Chapman and Hall.
  • Jiang X., Barmada M. M., Cooper G. F., Becich M. J. (2011). A Bayesian method for evaluating and discovering disease loci associations. PLoS ONE 6, e22075. doi:10.1371/journal.pone.0022075
  • Kang J., Zheng W., Li L., Lee J., Yan X., Zhao H. (2011a). Use of Bayesian networks to dissect the complexity of genetic disease: application to the Genetic Analysis Workshop 17 simulated data. BMC Proc. 5(Suppl. 9), S37. doi:10.1186/1753-6561-5-S9-S37
  • Kang J., Kugathasan S., Georges M., Zhao H., Cho J. H. (2011b). Improved risk prediction for Crohn’s disease with a multi-locus approach. Hum. Mol. Genet. 20, 2435–2442. doi:10.1093/hmg/ddr116
  • Kooperberg C., LeBlanc M., Obenchain V. (2009). Risk prediction using genome-wide association studies. Genet. Epidemiol. 34, 643–652. doi:10.1002/gepi.20509
  • Kraft P., Hunter D. J. (2009). Genetic risk prediction – are we there yet? N. Engl. J. Med. 360, 1701–1703. doi:10.1056/NEJMp0810107
  • Lauritzen S. L., Sheehan N. A. (2004). Graphical models for genetic analysis. Stat. Sci. 18, 489–514.
  • McCullagh P., Nelder J. (1989). Generalized Linear Models. London: Chapman and Hall.
  • McKinney B. A., Reif D. M., Ritchie M. D., Moore J. H. (2006). Machine learning for detecting gene-gene interactions: a review. Appl. Bioinformatics 5, 77–88. doi:10.2165/00822942-200605020-00002
  • Meigs J. B., Shrader P., Sullivan L. M., McAteer J. B., Fox C. S., Dupuis J., Manning A. K., Florez J. C., Wilson P. W., D’Agostino R. B., Sr., Cupples L. A. (2008). Genotype score in addition to common risk factors for prediction of type 2 diabetes. N. Engl. J. Med. 359, 2208–2219. doi:10.1056/NEJMoa0804742
  • Moore J. H., Gilbert J. C., Tsai C. T., Chiang F. T., Holden T., Barney N., White B. C. (2006). A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J. Theor. Biol. 241, 252–261. doi:10.1016/j.jtbi.2005.11.036
  • Okser S., Lehtimaki T., Elo L. L., Mononen N., Peltonen N., Kahonen M., Juonala M., Fan Y. M., Hernesniemi J. A., Laitinen T., Lyytikainen L. P., Rontu R., Eklund C., Hutri-Kahonen N., Taittonen L., Hurme M., Viikari J. S., Raitakari O. T., Aittokallio T. (2010). Genetic variants and their interactions in the prediction of increased pre-clinical carotid atherosclerosis: the cardiovascular risk in young Finns study. PLoS Genet. 6, e1001146. doi:10.1371/journal.pgen.1001146
  • Paynter N. P., Chasman D. I., Pare G., Buring J. E., Cook N. R., Miletich J. P., Ridker P. M. (2010). Association between a literature-based genetic risk score and cardiovascular events in women. JAMA 303, 631–637. doi:10.1001/jama.2010.119
  • Purcell S. M., Wray N. R., Stone J. L., Visscher P. M., O’Donovan M. C., Sullivan P. F., Sklar P. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752.
  • Rodin A. S., Boerwinkle E. (2005). Mining genetic epidemiology data with Bayesian networks I: Bayesian networks and example application (plasma apoE levels). Bioinformatics 21, 3273–3278.
  • Sebastiani P., Abad-Grau M. (2007). “Bayesian networks for genetic analysis,” in Bioinformatics: An Engineering Case-Based Approach, eds Alterovitz G., Ramoni M. F. (Cambridge, MA: Artech House), 205–228.
  • Sebastiani P., Perls T. T. (2008). “Complex genetic models,” in Bayesian Networks, eds Pourret O., Naïm P., Marcot B. (Chichester: John Wiley & Sons), 53–72.
  • Sebastiani P., Perls T. T. (2010). Prediction models that include genetic data. Circ. Cardiovasc. Genet. 3, 1–2. doi:10.1161/CIRCGENETICS.109.933614
  • Sebastiani P., Ramoni M. F., Nolan V., Baldwin C. T., Steinberg M. H. (2005). Genetic dissection and prognostic modeling of overt stroke in sickle cell anemia. Nat. Genet. 37, 435–440. doi:10.1038/ng1533
  • Sebastiani P., Solovieff N., DeWan A., Walsh K., Puca A., Hartley S. W., Melista E., Andersen S., Dworkis D. A., Wilk J. B., Myers R. H., Steinberg M. H., Montano M., Baldwin C. T., Hoh J., Perls T. T. (2012). Genetic signatures of exceptional longevity in humans. PLoS ONE 7, e29848. doi:10.1371/journal.pone.0029848
  • Solovieff N., Baldwin C. T., Steinberg M. H., Perls T. T., Sebastiani P. (2011). “Incorporating genetic ancestry into risk prediction models,” in The 12th International Congress of Human Genetics and the American Society of Human Genetics 61st Annual Meeting, Montreal.
  • Stengard J. H., Dyson G., Frikke-Schmidt R., Tybjaerg-Hansen A., Nordestgaard B. G., Sing C. F. (2010). Context-dependent associations between variation in risk of ischemic heart disease and variation in the 5′ promoter region of the apolipoprotein E gene in Danish women. Circ. Cardiovasc. Genet. 3, 22–30. doi:10.1161/CIRCGENETICS.109.862748
  • Wei Z., Wang K., Qu H. Q., Zhang H., Bradfield J., Kim C., Frackleton E., Hou C., Glessner J. T., Chiavacci R., Stanley C., Monos D., Grant S. F., Polychronakos C., Hakonarson H. (2009). From disease association to risk assessment: an optimistic view from genome-wide association studies on type 1 diabetes. PLoS Genet. 5, e1000678. doi:10.1371/journal.pgen.1000678
  • Whittaker J. (1990). Graphical Models in Applied Multivariate Statistics. New York: John Wiley & Sons.
  • Wu C., Walsh K., DeWan A., Hoh J., Wang Z. (2011). Disease risk prediction with rare and common variants. BMC Proc. 5(Suppl. 9), S61. doi:10.1186/1753-6561-5-S9-S61
  • Yang J., Manolio T. A., Pasquale L. R., Boerwinkle E., Caporaso N., Cunningham J. M., de Andrade M., Feenstra B., Feingold E., Hayes M. G., Hill W. G., Landi M. T., Alonso A., Lettre G., Lin P., Ling H., Lowe W., Mathias R. A., Melbye M., Pugh E., Cornelis M. C., Weir B. S., Goddard M. E., Visscher P. M. (2010). Genome partitioning of genetic variation for complex traits using common SNPs. Nat. Genet. 43, 519–525. doi:10.1038/ng.823

Articles from Frontiers in Genetics are provided here courtesy of Frontiers Media SA