Bioinformatics. 2016 May 1;32(9):1366-72. doi: 10.1093/bioinformatics/btv752. Epub 2015 Dec 31.

Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project.

Author information

1 Institute of Genomic Mathematics, University of Bonn, Bonn, Germany.
2 Channing Division of Network Medicine, Brigham and Women's Hospital.
3 Department of Biostatistics, Harvard School of Public Health, Boston, USA.
4 Institute of Human Genetics, University of Bonn, Bonn, Germany.
5 Institut National de la Santé et de la Recherche Médicale (INSERM) Unité Mixte de Recherche (UMR) 1087, l'institut du thorax, Nantes, France; Centre National de la Recherche Scientifique (CNRS) UMR 6291, l'institut du thorax, Nantes, France; Université de Nantes, l'institut du thorax, Nantes, France; and Centre Hospitalier Universitaire (CHU) de Nantes, l'institut du thorax, Service de Cardiologie, Nantes, France.
6 Channing Division of Network Medicine, Brigham and Women's Hospital; Department of Biostatistics, Harvard School of Public Health, Boston, USA.
7 Institute of Genomic Mathematics, University of Bonn, Bonn, Germany; Department of Biostatistics, Harvard School of Public Health, Boston, USA.

Abstract

MOTIVATION:

Population stratification is one of the major sources of confounding in genetic association studies, potentially causing false-positive and false-negative results. Here, we present a novel approach for identifying population substructure in high-density genotyping data or next-generation sequencing data. The approach exploits the co-appearance of rare genetic variants in individuals. The method can be applied to all available genetic loci and is computationally fast. Using sequencing data from the 1000 Genomes Project, we illustrate the features of the approach and compare it with existing methodology (i.e. EIGENSTRAT). We examine how different cutoffs for the minor allele frequency affect the performance of the approach, and find that it works particularly well for genetic loci with very small minor allele frequencies. The results suggest that including rare-variant data or sequencing data in our approach provides a much higher-resolution picture of population substructure than can be obtained with existing methodology. Furthermore, in simulation studies, we find scenarios in which our method controlled the type 1 error more precisely and showed higher power.

AVAILABILITY AND IMPLEMENTATION:

CONTACT:

dmitry.prokopenko@uni-bonn.de

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.

PMID: 26722118
PMCID: PMC5860507
DOI: 10.1093/bioinformatics/btv752
[Indexed for MEDLINE]