Bioinformatics. 2008 Jul 15; 24(14): 1603–1610.
Published online 2008 May 21. doi: 10.1093/bioinformatics/btn239
PMCID: PMC2638262

EM-random forest and new measures of variable importance for multi-locus quantitative trait linkage analysis


Motivation: We developed an EM-random forest (EMRF) for Haseman–Elston quantitative trait linkage analysis that accounts for marker ambiguity and weights each sib-pair according to its posterior identity-by-descent (IBD) distribution. The usual random forest (RF) variable importance (VI) index, used to rank markers for variable selection, is not optimal for linkage data because markers are correlated. We define new VI indices that borrow information from linked markers using the correlation structure inherent in IBD linkage data.
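
As a rough illustration of the weighting idea only, and not the authors' C implementation, the sketch below fits a single-marker Haseman–Elston regression in which every sib-pair contributes three pseudo-observations (IBD = 0, 1, 2), each weighted by its posterior IBD probability. The function names, the toy data and the use of fixed weights are assumptions made for this example; a full EM procedure would presumably re-estimate the weights at each iteration, and that step is omitted here.

```python
# Minimal sketch (hypothetical names, toy data): Haseman-Elston regression
# with sib-pairs weighted by their posterior IBD distribution.

def expand_sib_pairs(sq_diffs, ibd_posteriors):
    """Expand each sib-pair into three weighted rows, one per IBD state.

    sq_diffs       -- squared trait difference for each sib-pair
    ibd_posteriors -- (P(IBD=0), P(IBD=1), P(IBD=2)) for each sib-pair
    Returns a list of (x, y, weight) with x = IBD / 2, the proportion shared.
    """
    rows = []
    for y, probs in zip(sq_diffs, ibd_posteriors):
        for ibd, weight in enumerate(probs):
            rows.append((ibd / 2.0, y, weight))
    return rows


def weighted_fit(rows):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total_w = sum(w for _, _, w in rows)
    mean_x = sum(w * x for x, _, w in rows) / total_w
    mean_y = sum(w * y for _, y, w in rows) / total_w
    sxx = sum(w * (x - mean_x) ** 2 for x, _, w in rows)
    sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Toy data: three unambiguous pairs and one pair with ambiguous IBD status.
    sq_diffs = [4.0, 2.0, 0.0, 1.5]
    posteriors = [
        (1.0, 0.0, 0.0),
        (0.0, 1.0, 0.0),
        (0.0, 0.0, 1.0),
        (0.25, 0.50, 0.25),
    ]
    slope, intercept = weighted_fit(expand_sib_pairs(sq_diffs, posteriors))
    # In Haseman-Elston regression a negative slope is evidence for linkage.
    print("slope:", slope, "intercept:", intercept)
```

When the IBD status of a pair is known with certainty, its posterior is one-hot and the pair reduces to a single ordinary observation, so the weighted fit coincides with the classical Haseman–Elston regression.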

Results: In simulations, the new VI indices in EMRF performed better than the original RF VI index and performed similarly to or better than the EM-Haseman–Elston regression LOD score for various genetic models. Moreover, tree size and the size of the marker subset evaluated at each node are important considerations in RFs.

Availability: The source code for EMRF written in C is available at www.infornomics.utoronto.ca/downloads/EMRF

Contact: bull@mshri.on.ca

Supplementary information: Supplementary data are available at www.infornomics.utoronto.ca/downloads/EMRF



Articles from Bioinformatics are provided here courtesy of Oxford University Press