Am J Hum Genet. Oct 10, 2008; 83(4): 445–456.
Published online Oct 3, 2008. doi:  10.1016/j.ajhg.2008.08.019
PMCID: PMC2561928

Japanese Population Structure, Based on SNP Genotypes from 7003 Individuals Compared to Other Ethnic Groups: Effects on Population-Based Association Studies


Because population stratification can cause spurious associations in case-control studies, understanding the population structure is important. Here, we examined Japanese population structure by “Eigenanalysis,” using the genotypes for 140,387 SNPs in 7003 Japanese individuals, along with 60 European, 60 African, and 90 East-Asian individuals, in the HapMap project. Most Japanese individuals fell into two main clusters, Hondo and Ryukyu; the Hondo cluster includes most of the individuals from the main islands in Japan, and the Ryukyu cluster includes most of the individuals from Okinawa. The SNPs with the greatest frequency differences between the Hondo and Ryukyu clusters were found in the HLA region in chromosome 6. The nonsynonymous SNPs with the greatest frequency differences between the Hondo and Ryukyu clusters were the Val/Ala polymorphism (rs3827760) in the EDAR gene, associated with hair thickness, and the Gly/Arg polymorphism (rs17822931) in the ABCC11 gene, associated with ear-wax type. Genetic differentiation was observed, even among different regions in Honshu Island, the largest island of Japan. Simulation studies showed that the inclusion of different proportions of individuals from different regions of Japan in case and control groups can lead to an inflated rate of false-positive results when the sample sizes are large.


Genome-wide association studies (GWASs) are a powerful tool for dissecting complex traits by identifying loci linked to particular diseases.1–3 Finding disease loci in a GWAS requires large sample sizes and sophisticated statistical techniques. Inclusion of a large number of subjects in a study increases the power, but it also increases the rate of false-positive results, which may be partly due to population stratification or cryptic relatedness in either the cases or the controls.4 In a case–control GWAS, we detect loci at which some alleles or genotypes are different in frequencies between cases and controls. This approach assumes a homogeneous population in which the relationship between an allele and a trait is random for marker loci unlinked to the trait. In the presence of population stratification, nonrandom associations between an allele and a trait can be found at marker loci that are completely unlinked to a trait locus; such associations are called “spurious associations”5,6. For two subpopulations that were derived from a common ancestral population and that have differentiated to some extent, a spurious association would occur when the case and the control groups are composed of different proportions of the two subpopulations.6 Therefore, it is important to know whether a population is stratified and how and to what extent the stratification affects the results of association studies.

Several methods have been developed for assessing the level of population stratification. A basic approach is a model-based clustering method that uses multilocus genotype data from individuals and detects the presence of population stratification.7,8 Another approach is the genomic-control method, in which Bayesian outlier methods are used.9–11 The genomic-control method is based on the assumption that, in a stratified population, the distribution of the Cochran-Armitage trend test12 statistic would deviate from the expected chi-square distribution for marker loci unlinked to the disease locus. Recently, two methods have been developed for examining population stratification by analyzing relatedness among individuals with SNP genotypes; these methods are applicable to thousands of SNPs.13,14 One is based on a principal-component analysis and also provides a method for correcting for the effects of stratification.13,15 The other uses identity-by-state and identity-by-descent information, and in it, the individuals are clustered by multidimensional scaling (MDS).14

The Japanese population has a rather small genetic diversity, according to data from the SNP discovery project in Japan.16 However, a detailed analysis of the population structure of the Japanese with the use of genome-wide SNPs has not yet been conducted. Previous studies on genetic variations in the Japanese population examined mtDNA-sequence variation,17,18 polymorphic markers on the Y chromosome,19,20 or some polymorphic loci in autosomes.21 Generally, their results are consistent with the hypothesis that the Japanese population has a “dual structure”22 and that immigrants came to Japan in at least two major migration events. If the “dual structure” of the Japanese population is supported by genetic variations in the entire genome, then the correlation between two alleles in an individual would be slightly higher than that in an ideal homogeneous population. In future GWASs with large numbers of subjects, the presence of a population structure or cryptic relatedness in a case–control sample may increase the rates of false-positive results.23 Therefore, it is important to examine the population structure of Japanese individuals with genotypes for genome-wide SNPs. In this study, to examine the population structure of the Japanese population using multilocus SNP genotypes, we analyze the relatedness of 7003 Japanese individuals, along with the African, European, and East-Asian individuals in the International HapMap project. We then show how inclusion of different proportions of individuals from different regions of Japan in case and control groups can lead to spurious associations.

Subjects and Methods


Genotype data for 60 European, 60 African, and 90 East-Asian (45 Japanese and 45 Han Chinese) individuals were obtained from the HapMap database (release 22).24 In addition, genotype data were obtained from 7003 self-identified Japanese patients in the BioBank Japan Project.25 These patients, who together had 35 of the 47 diseases studied in the BioBank Japan Project, were treated at hospitals in seven geographic regions (Figure 1): Hokkaido (514 individuals), Tohoku (466 individuals), Kanto-Koshinetsu (3978 individuals), Tokai-Hokuriku (358 individuals), Kinki (908 individuals), Kyushu (628 individuals), and Okinawa (151 individuals). Because none of the DNA samples were taken at hospitals in the area of Chugoku-Shikoku, analyses for Chugoku-Shikoku were not performed. The BioBank Japan Project collected human genomic DNA after the patients provided written informed consent to participate in this project. This project was approved by the ethical committees at The Institute of Medical Science, The University of Tokyo, and the Center for Genomics Medicine (formerly, SNP Research Center), Institutes of Physical and Chemical Research (RIKEN).

Figure 1
Geographical Regions of Japan


All of the Japanese DNA samples from the seven areas were grouped by types of diseases and were genotyped for 272,844 SNPs via Perlegen's platform.26,27 SNPs in autosomes (chromosomes 1–22) were selected for further analyses if they satisfied each of the following four criteria: (1) they were polymorphic in the Japanese population, (2) call rates were high enough (≥ 90%), (3) genotype frequencies were in accord with Hardy-Weinberg equilibrium, and (4) they were genotyped in the HapMap project. The Hardy-Weinberg test was used for removal of possibly mistyped SNPs (p < 0.01 by chi-square test) from raw genotyping data. After the selection of SNPs, the genotype data for 140,387 SNPs were used in additional analyses. When European and African samples were included in the analysis, the number of SNPs used was 135,754, because the genotype data for some SNPs were not available in the HapMap database for either the European or the African samples, although they were available for the other subpopulations.
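The four selection criteria above can be sketched as a per-SNP filter. This is a minimal illustration, not the pipeline used in the study: the function names, the genotype-count inputs, and the `in_hapmap` flag are hypothetical, and the Hardy-Weinberg test is assumed to be a 1-df chi-square test on the three genotype counts.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test of Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_missing, in_hapmap,
              min_call_rate=0.90, hwe_alpha=0.01):
    """Apply the four criteria: polymorphic, call rate, HWE, typed in HapMap."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    p = (2 * n_aa + n_ab) / (2 * called)
    polymorphic = 0.0 < p < 1.0
    call_rate = called / (called + n_missing)
    return (polymorphic
            and call_rate >= min_call_rate
            and hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= hwe_alpha
            and in_hapmap)
```

The thresholds default to the values stated in the text (call rate ≥ 90%, removal at p < 0.01); other details, such as how missing calls are counted, are assumptions of this sketch.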

Analysis of Relationship between Individuals

SNP autosomal genotypes were used in an examination of the relationship between individuals. The examination was performed via an “Eigenanalysis,” an application of principal-component analysis (PCA), in the computer program smartpca, from the EIGENSOFT package.13,15 The number of SNPs analyzed was 140,387 (when African and European individuals were not included) or 135,754 (when African and European individuals were included), and the PCA was run with correction for linkage disequilibrium. In an Eigenanalysis of individual SNP genotypes, the first component is the coordinate drawn in the multidimensional space so that the projections of the points (each point represents an individual) to the coordinate have the largest variance. The second component is the coordinate drawn in the multidimensional space so that the projections of the points to the coordinate have the second largest variance, and so forth. Intuitively, one can obtain the best separation of the individuals by use of the first component, the second-best separation by use of the second component, and so forth. The PCA plot with the first and second components showed two main clusters, formed by Japanese individuals, and a third cluster, formed by Han-Chinese individuals. The two main clusters for the Japanese individuals were defined by the K-means method, with the use of the first component. We also used the multidimensional scaling (MDS) method to examine relatedness among individuals, using PLINK.14
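The idea of the first component can be illustrated with a small, self-contained sketch. It is not smartpca: it omits the correction for linkage disequilibrium and any handling of missing genotypes, and it finds only the leading eigenvector by power iteration. The function names and the scaling of each SNP by the square root of p(1 − p) are assumptions of this sketch.

```python
import math


def normalize_genotypes(genotypes):
    """Center each SNP column and scale it by sqrt(p * (1 - p)).

    genotypes: list of rows (individuals), each a list of 0/1/2 allele counts.
    Monomorphic SNPs are dropped because they carry no information.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        p = mean / 2.0  # estimated allele frequency
        if p <= 0.0 or p >= 1.0:
            continue
        scale = math.sqrt(p * (1.0 - p))
        columns.append([(g - mean) / scale for g in col])
    return [[c[i] for c in columns] for i in range(n)]


def first_component(genotypes, n_iter=200):
    """Leading eigenvector of the individual-by-individual covariance matrix."""
    m = normalize_genotypes(genotypes)
    n, n_snps = len(m), len(m[0])
    cov = [[sum(a * b for a, b in zip(m[i], m[k])) / n_snps
            for k in range(n)] for i in range(n)]
    # Non-constant start vector: the constant vector lies in the null space
    # of the covariance matrix after column centering.
    v = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        w = [sum(cov[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

Each entry of the returned vector is the coordinate of one individual on the first component; individuals from two differentiated groups are expected to receive values of opposite sign.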

Calculation of FST

The FST value, as originally defined by Wright,28 was calculated between two clusters or between two local regions for each SNP site. Confidence intervals of the average FST over loci were calculated by bootstrap resampling, with 1000 replications.
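As a rough illustration of this calculation, the sketch below computes a per-SNP FST from the allele frequencies in two groups and a bootstrap confidence interval for the average over loci. Several details are assumptions rather than the procedure of the study: the two groups are weighted equally, the average is taken as the mean of per-SNP values, loci are the resampling unit, and the interval is a simple percentile interval. The function names are hypothetical.

```python
import random


def wright_fst(p1, p2):
    """FST = (HT - HS) / HT for one biallelic SNP, equal group weights."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # SNP is fixed for the same allele in both groups
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def bootstrap_mean_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Mean of per-SNP values with a percentile bootstrap interval over loci."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2.0) * n_boot)]
    upper = means[int((1.0 - alpha / 2.0) * n_boot) - 1]
    return sum(values) / n, lower, upper
```

Because the estimator and weighting used in the study are not specified beyond the reference to Wright, values from this sketch should not be expected to reproduce the figures reported in the tables.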

Simulation of GWAS with Individuals from Subpopulations

To examine the effect of the Japanese population structure on a GWAS, we conducted simulations by sampling individuals from the subpopulations in different proportions between cases and controls, then evaluated possible inflation of false-positive rates with the use of the genome-wide χ2 inflation factor for the genomic control.9–11 Imagine that we have n1 case individuals, consisting of m1 and m2 individuals from subpopulations 1 and 2, respectively, and n2 control individuals, consisting of m3 and m4 individuals from subpopulations 1 and 2, respectively. For simulation of an association study, m1 + m3 individuals were randomly chosen without replacement from subpopulation 1 and m2 + m4 individuals were chosen from subpopulation 2 in the same way. With a case sample (m1 + m2 individuals) and a control sample (m3 + m4 individuals), the Cochran-Armitage trend test12 was performed with the genotypes for the 140,387 SNPs for calculation of a genome-wide inflation factor, λ, for the genomic control.9–11 The value of λ was computed as the median χ2 statistic divided by 0.455, the predicted median χ2 if there is no inflation. This procedure was repeated 100 times, and the mean and the standard deviation of observed λ values were calculated.
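One replicate of the procedure above can be sketched as follows, using only the standard library. The function names and data layout are hypothetical; the trend statistic uses additive weights (0, 1, 2), which is an assumption, and the divisor 0.455 is the value given in the text.

```python
import random
import statistics


def trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) from genotype counts.

    cases, controls: counts of the three genotypes (AA, AB, BB).
    """
    r_tot, s_tot = sum(cases), sum(controls)
    n = r_tot + s_tot
    col = [r + s for r, s in zip(cases, controls)]
    t = sum(w * (r * s_tot - s * r_tot)
            for w, r, s in zip(weights, cases, controls))
    sum_wc = sum(w * c for w, c in zip(weights, col))
    sum_w2c = sum(w * w * c for w, c in zip(weights, col))
    var = (r_tot * s_tot / n) * (n * sum_w2c - sum_wc ** 2)
    return t * t / var if var > 0 else 0.0


def inflation_factor(chi2_values, expected_median=0.455):
    """Genome-wide inflation factor: median chi-square over its null median."""
    return statistics.median(chi2_values) / expected_median


def draw_case_control(subpop1, subpop2, m1, m2, m3, m4, rng):
    """Sample without replacement: m1 + m3 from subpop1, m2 + m4 from subpop2."""
    picked1 = rng.sample(subpop1, m1 + m3)
    picked2 = rng.sample(subpop2, m2 + m4)
    cases = picked1[:m1] + picked2[:m2]
    controls = picked1[m1:] + picked2[m2:]
    return cases, controls
```

In a full replicate, `trend_chi2` would be evaluated for every SNP on the genotype counts of the sampled cases and controls, and `inflation_factor` applied to the resulting list; repeating this 100 times gives the mean and standard deviation of λ.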


Japanese Population Structure

To examine the relationship between Japanese individuals, the genotypes of 7003 Japanese individuals and those for 60 European, 60 African, and 90 East-Asian (45 Japanese and 45 Han Chinese) individuals from the International HapMap project were analyzed by Eigenanalysis with the program smartpca.15 The two-dimensional plots with the first and the second components (Figure 2A) showed that African (HapMap population of Yoruba in Ibadan, Nigeria [YRI]), European (HapMap population of Utah, USA residents with ancestry from northern and western Europe [CEU]), and East Asian (HapMap populations of Japanese in Tokyo [JPT] and Han Chinese in Beijing [CHB]) populations were clearly separated from each other, as shown in a previous study of worldwide human relationships based on genome-wide patterns of variation.29 Two Japanese individuals (denoted by “+” in Figure 2A) fell outside the above groups, probably because they had mixed East-Asian and European ancestry. In addition, the plots with the third and fourth components (Figure 2B) separated the East-Asian subpopulations, suggesting that East-Asian subpopulations have differentiated SNPs.

Figure 2
Relatedness between Japanese, Han-Chinese, European, and African Individuals

Then, the relationship between East-Asian individuals was analyzed, with the use of 7001 Japanese individuals (excluding two outliers) and the 45 Japanese and 45 Han-Chinese individuals from the HapMap project. In the plots with the first and second components (Figure 3A), Han-Chinese individuals formed a distinct cluster (Han-Chinese cluster), and almost all of the Japanese individuals fell into two main clusters. We also examined the relationship between the same individuals with the MDS method14 and obtained a very similar result (Supplemental Data, available online). We classified the Japanese individuals into two main clusters by K-means clustering on Eigenvector 1 values, because most of the differentiation appears to be reflected in Eigenvector 1.
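The classification on Eigenvector 1 can be illustrated with a two-cluster K-means in one dimension. This is a minimal sketch, assuming Lloyd's algorithm initialized at the smallest and largest values; the function name is hypothetical and the initialization is not taken from the study.

```python
def two_means_1d(values, n_iter=100):
    """Split one-dimensional values into two clusters by K-means (K = 2).

    Returns (labels, centers); label 0 is the cluster with the lower center.
    """
    c_lo, c_hi = min(values), max(values)
    labels = []
    for _ in range(n_iter):
        new_labels = [0 if abs(v - c_lo) <= abs(v - c_hi) else 1
                      for v in values]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if group0:
            c_lo = sum(group0) / len(group0)
        if group1:
            c_hi = sum(group1) / len(group1)
    return labels, (c_lo, c_hi)
```

Applied to the Eigenvector 1 value of each individual, the two labels would correspond to the two main clusters; which label maps to which cluster depends on the sign convention of the eigenvector.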

Figure 3
Relatedness between the 7001 Japanese Individuals

After the geographical regions of the Japanese individuals were disclosed, it was found that the largest cluster included most of the Japanese individuals whose samples were taken in an area of Japan other than the Okinawa area (Table 1). The second cluster includes most of the individuals whose samples were taken in Okinawa (Table 1). We call the largest cluster (6732 individuals) the Hondo cluster (the Japanese word “Hondo” literally means “the Japanese main islands other than Okinawa”), and we call the second cluster (265 individuals) the Ryukyu cluster (“Ryukyu” is the name of a kingdom that once existed as a chain of islands including Okinawa). The level of genetic differentiation between the clusters was evaluated by FST.28 The average FST between the Ryukyu and the Hondo clusters was 0.00276 (95% CI: 0.00274–0.00278), and that between the Ryukyu and the Han-Chinese clusters was 0.01108 (95% CI: 0.01101–0.01116). Thus, the Ryukyu cluster is more distant from the Han-Chinese cluster than the Hondo cluster is, given that the average FST between the Hondo and the Han-Chinese clusters is 0.00641 (95% CI: 0.00637–0.00647).

Table 1
Classification of Japanese Individuals into Hondo and Ryukyu Clusters

Genetic Differentiation among Geographical Regions

To evaluate genetic differentiation among different regions in Japan, the PCA plots in Figure 3A and the classification of Japanese individuals into two clusters were reexamined according to the geographic regions where samples of the individuals were taken (Figures 4A–4G, Table 1). A measure of genetic differentiation, FST,28 between each pair of subpopulations at each SNP site was also estimated, and average FST values over all the autosomes were calculated (Table 2). There is a remarkable genetic differentiation between Okinawa and other regions in Japan. The FST values between Okinawa and the regions in Hondo were 0.00282–0.00352 (Table 2), whereas those for pairs of subpopulations in Hondo were much smaller (0.00023–0.00077).

Figure 4
PCA Plots of the Japanese Individuals for Each Geographical Region
Table 2
Genetic Differentiation between Subpopulations

Four of the geographical regions (Hokkaido, Kanto-Koshinetsu, Kinki, and Kyushu) in Hondo included small proportions of individuals from the Ryukyu cluster. Kyushu is located in the southwest part of Hondo, and it includes a part of the Ryukyu Islands in the Kagoshima prefecture. Although most of the individuals (565/628, 89.97%) in the Kyushu area belonged to the Hondo cluster (Figure 4F), a significant number belonged to the Ryukyu cluster (63/628, 10.03%). Kanto-Koshinetsu and Kinki, both of which have cities with large populations, included a small fraction of individuals from the Ryukyu cluster (0.70% and 2.97%, respectively). Hokkaido (Figure 4A), which shows similarity to Kanto-Koshinetsu (Figure 4C) in both the PCA plots and the FST value, included four individuals (0.78%) from the Ryukyu cluster.

Most of the individuals in the Kanto-Koshinetsu area belonged to the Hondo cluster (3945/3977, 99.22%; Figure 4C), and four individuals in this area belonged to the Han-Chinese cluster. All of the HapMap JPT individuals belonged to the Hondo cluster. The PCA plots in Figure 4C are consistent with the fact that HapMap JPT samples were from Tokyo. The PCA plots in Figure 4C also show that genetic diversity in the Kanto-Koshinetsu area is a little greater than that in Tokyo. All of the individuals in the Tohoku and Tokai-Hokuriku areas belonged to the Hondo cluster (Figures 4B and 4D). However, our data show clear genetic differentiation between Tohoku and Tokai-Hokuriku. Interestingly, the FST value between Tohoku and Tokai-Hokuriku (0.00077; Table 2) was the highest among those between all the pairs of Hondo subpopulations. In the PCA plots, the average values of Eigenvector 2 were higher for the individuals from the eastern area, Tohoku, than for individuals from the western areas (Kinki and Kyushu). Tokai-Hokuriku is located in the middle of Honshu Island, and the average value of Eigenvector 2 for the individuals from Tokai-Hokuriku was intermediate between the average values of those from Tohoku and Kinki. The average values of Eigenvector 2 were highly correlated with the longitudes of the seven regions (r2 = 0.82, p = 0.0051; Figure 5), probably because the Han Chinese have much smaller values of Eigenvector 2 than do the Japanese (Figure 3A) and because the individuals from the western areas were a little closer to Han Chinese than those from the Tohoku area were.

Figure 5
Relationship between Average Eigenvector 2 Values and Longitude, for Seven Regions of Japan

Genetic Differentiation between Hondo and Ryukyu Clusters

To clarify the genetic differences between the Hondo and the Ryukyu clusters over the genome and to know at which regions spurious associations are likely to occur, we examined the differences in allele and genotype frequencies between the two clusters. We examined the empirical distribution of FST for all of the SNPs (see Supplemental Data). In spite of the low level of differentiation between the two clusters (average FST = 0.0028), a substantial proportion of SNPs were located in the tails of the distribution; 165 of 140,368 SNPs have FST ≥ 0.03. Then, we searched for genomic regions that showed relatively higher differentiation by the FST values for each SNP (Table 3). The SNP that showed the highest FST (0.0598) was rs2071652 C/T in an intron of the MOG gene (MIM 159465), in the HLA region on chromosome 6 (at Chr6:29743296), for which the frequencies of allele C were 0.74 and 0.95 for the Hondo and Ryukyu clusters, respectively. In addition, another SNP (rs3094187) showing a high FST value (0.0492) was found in the HLA region. SNPs that were highly differentiated between the two clusters were also found in other chromosomes (Table 3).

Table 3
Highly Differentiated SNPs between the Hondo and Ryukyu Clusters

The nonsynonymous SNP showing the greatest difference in genotype frequency between the Hondo and Ryukyu clusters, as determined by the Cochran-Armitage trend test,12 was rs3827760 T/C (370Val/Ala) in the EDAR gene (MIM 604095) (Table 4). The frequencies of the T allele in the Hondo and Ryukyu clusters were 0.222 and 0.398, respectively. This SNP is highly differentiated between Asian and other populations, and its C allele is associated with thick hair.30,31 The nonsynonymous SNP that showed the second greatest difference in genotype frequencies was rs17822931 G/A (180Gly/Arg) in the ABCC11 gene (MIM 607040). The frequency of the G allele, which is associated with wet ear wax (MIM 117800), was higher in the Ryukyu cluster (0.258) than in the Hondo cluster (0.121). The A allele, which is associated with dry ear wax and whose frequencies were highest in Chinese and Koreans,32 was predominant in both the Hondo and Ryukyu clusters.

Table 4
Nonsynonymous SNPs Showing Significant Differences in Genotype Frequencies between the Hondo and Ryukyu Clusters

Effects of the Population Structure on a Case–Control Study

To examine how the Japanese population structure affects a case–control study, we conducted simulations by sampling individuals as cases and controls from subpopulations in different compositions. Then, we calculated the genome-wide χ2 inflation factor, λ, for genomic control,9–11 an indicator of the inflation of false-positive rates due to the effects of population structure. First, we examined how much the difference in proportions between case individuals from the Hondo and the Ryukyu clusters and control individuals from the Hondo and the Ryukyu clusters would affect the genome-wide χ2 inflation factor, λ. We conducted simulations in which the control group consisted of individuals from the Hondo cluster and the case group was a mixture of individuals from the Hondo and the Ryukyu clusters. Under these conditions, with 200 individuals for both cases and controls, the λ value reached 1.1 when the proportion of the individuals from the Ryukyu cluster was 23% (Figure 6A). Then, we examined how the sample size affects λ when the proportion of the individuals from the Ryukyu cluster in the case group was 10% or 20% (Figure 6B). As expected, we observed a linear increase of λ as the sample size increased. When 10% of the cases were from the Ryukyu cluster, the λ value was close to 1.1 when the sample size was 1000. This suggests that the inflation of false-positive rates would be within an acceptable level (λ ≤ 1.1) for a study design when the proportion of the individuals from the Ryukyu cluster is less than 10% and the sample size is 1000. If the sample size is larger than 1000, inclusion of individuals from the Ryukyu cluster could affect the results of the association study even when the proportion is small. Conversely, inclusion of a higher proportion of individuals from the Ryukyu cluster may not seriously affect the results of the association study when the sample size is smaller than 1000. 
Because λ is expected to increase linearly when the sample size increases, the acceptable proportion of the subjects from the Ryukyu cluster can be estimated for different sample sizes. When 20% of cases were from the Ryukyu cluster, the average value of λ exceeded 1.1 when the sample size was 300. This suggests that including a substantial proportion of individuals from the Ryukyu cluster would increase the rate of false-positive results even if the sample sizes were much smaller than 1000.
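The linear relationship can be turned into a rough planning calculation. The sketch below assumes that the excess λ − 1 is proportional to the sample size, so that a single observed point (a λ value at a known sample size, for a fixed proportion of Ryukyu-cluster individuals) determines the line. The intercept of 1 at zero sample size is an assumption of this sketch, and the function names are hypothetical.

```python
def extrapolate_lambda(lambda_ref, n_ref, n_new):
    """Predict lambda at n_new from one observed (n_ref, lambda_ref) point."""
    return 1.0 + (lambda_ref - 1.0) * n_new / n_ref


def max_sample_size(lambda_ref, n_ref, lambda_max=1.1):
    """Largest sample size at which the predicted lambda stays <= lambda_max."""
    if lambda_ref <= 1.0:
        return float("inf")  # no inflation observed at the reference point
    return n_ref * (lambda_max - 1.0) / (lambda_ref - 1.0)
```

For example, taking the observation above that λ was close to 1.1 at a sample size of 1000 when 10% of the cases were from the Ryukyu cluster, `extrapolate_lambda(1.1, 1000, 2000)` predicts a λ of about 1.2 at a sample size of 2000 under this assumption.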

Figure 6
Increase of the Genome-Wide Inflation Factor by Different Compositions of Individuals from the Two Main Clusters

We also examined how different proportions of case and control individuals from subpopulations in Hondo affect λ, although the genetic differences within the Hondo cluster are much smaller than the genetic difference between the Hondo and the Ryukyu clusters. As a combination of two subpopulations in Hondo, individuals from Tohoku and Kinki (FST = 0.00064 between the two subpopulations) were used. In this simulation, all of the controls were from Kinki and the cases were a mixture of individuals from Kinki and Tohoku in different proportions. In simulations with 400 cases and 400 controls, we observed that λ reached 1.1 when the proportion of Tohoku case individuals was 53% (Figure 7).

Figure 7
Increase of the Genome-Wide Inflation Factor by Different Compositions of Individuals from the Tohoku and Kinki Areas

To examine the effects of genetic differences between different regions within Hondo in a GWAS, we then conducted simulations by using pairs of the Hondo regions; one subpopulation was used for the cases and the other was used for the controls. For each condition of simulations with different numbers of individuals (200–350), the average value of λ was calculated (Figure 8). For the pairs of two subpopulations excluding Tohoku and Kyushu, the λ values were close to 1.0 and never reached 1.1 even when the sample size was 350. On the other hand, the pairs including Kyushu or Tohoku showed higher values of λ. The two pairs, Tohoku versus Kinki and Tohoku versus Kyushu, showed the highest λ values. For sample sizes larger than 350, the λ values can be approximated by extrapolation, because λ increases linearly with the sample size.

Figure 8
Increase of the Genome-Wide Inflation Factor with the Use of Two Different Subpopulations as Cases and Controls


Our present study has clearly shown, on the basis of analysis of genome-wide SNP genotypes, that most Japanese individuals fall into two main clusters: the Hondo cluster and the Ryukyu cluster. Our results also show that local regions in Honshu Island (the largest island of Japan) are still genetically differentiated, even though human migration within Japan has become rather frequent in the past 100 years or so. Our finding that the individuals from Tohoku were less related to Han-Chinese individuals than were the individuals from Kinki and Kyushu suggests that the individuals in Tohoku were less affected by immigrants from the Asian continent than were the individuals in Kinki. The immigrants who came to Japan from the Asian continent through the Korean Peninsula may have entered Japan from northern Kyushu, the Japan Sea side of Kinki or Chugoku. Our finding that the individuals from the western areas in the Hondo cluster had smaller values of Eigenvector 2 than did those in the eastern areas may be because the northeast areas of Japan, such as Tohoku, are distant from the main contact point to the Asian continent. On the other hand, the individuals from Kanto-Koshinetsu and Hokkaido were broadly distributed in the PCA plots, which is not consistent with the east-west trend of genetic differentiation. The broad distribution of the individuals from Kanto-Koshinetsu may be due to recent migrations from various areas of Japan into the Kanto area. The Kanto area includes large cities, such as Tokyo and Yokohama, and recent migrations from various areas of Japan into the Kanto area may have obscured ancient genetic differentiation in the Kanto area. The individuals in Hokkaido are similar to those in the Kanto-Koshinetsu area, even though Hokkaido is located at the north end of Japan. This is probably because most of the people living in Hokkaido are descendants of people who moved from Honshu. The current population of Ainu (an ethnic group indigenous to Hokkaido) was estimated to be about 25,000, and this is ~0.5% of the whole population in Hokkaido.

Previous studies showed genetic affinities between the Ainu and Ryukyu peoples,21,33 who live in the north and south ends of Japan, respectively, and who are thought to be descendants of the Jomon people. These observations are consistent with the “dual-origin hypothesis”,22 which states that the ancestral Japanese populations were brought by two major migration events.17,19–21 Archeological studies have suggested that the Jomon period (the Japanese Neolithic age) started about 16,000 years ago and ended about 3000 years ago, when the Yayoi period, a rice-farming and metal-using age, started. In the Yayoi period, immigrants from the Asian continent moved to western Japan via Korea or China and expelled or mixed with the Jomon people. Our observations of the two main clusters and genetic differentiation among geographic regions are not discordant with the dual-origin hypothesis, although most of the Hokkaido individuals in this study are probably different from the indigenous Ainu people. Most of the people living in Okinawa Island are probably derived from the Jomon people, whereas most of the people living in Hondo are probably derived from the Yayoi people or are a mixture of the Yayoi and Jomon peoples. Individuals in Tohoku showed two interesting features that are difficult to attribute to only local genetic differentiation. First, within the Hondo cluster, the individuals from Tohoku were closest to the individuals from Okinawa with respect to Eigenvector 1 (Supplemental Data). Second, the FST value between Tohoku and Okinawa was smaller than the FST value between Tokai-Hokuriku and Okinawa, even though the geographical distance between Okinawa and Tohoku is greater than that between Okinawa and Tokai-Hokuriku. These observations might reflect ancient population affinities between Tohoku and Okinawa, which have been obscured by the gene flow between their geographic neighbors in Honshu Island. The presence of two main clusters may also be explained by the long-term isolation of populations in the Ryukyu Islands.34 However, the finding that the FST value between Okinawa and Tohoku was smaller than that between Okinawa and Tokai-Hokuriku cannot be explained by only local genetic differentiation. The distinct difference between the Hondo and the Ryukyu clusters is probably due to two factors: there were two major migrations to Japan, and populations in the Ryukyu Islands became genetically differentiated by isolation.

Although we classified the 7001 Japanese individuals into the two main clusters, most of the individuals in the Hondo cluster were located in a limited area in the PCA plot (between −0.02 and 0.01 for the first component and between −0.02 and 0.02 for the second component). If we define a “core Hondo-cluster area” as this area including most of the individuals, we notice a small fraction of individuals who were located between the Han-Chinese cluster and the core Hondo-cluster area (Figure 3A). Some of those individuals might be genetically non-Japanese East Asians, and others may have mixed Japanese and non-Japanese East-Asian ancestries. Further analyses including individuals from other areas of Asia would be desirable for understanding the Japanese population structure in detail, considering recent migrations from neighboring countries.

It is interesting that the genotype frequencies of two nonsynonymous SNPs, one in EDAR and the other in ABCC11, were significantly different between the Hondo and Ryukyu clusters, because these SNPs are associated with phenotypic variations,30,31 and it was suggested that the increase in the frequencies of the specific alleles was driven by positive selection. These observations suggest that a search for differentiated nonsynonymous SNPs between closely related subpopulations, like the Hondo and the Ryukyu clusters, would be an efficient approach to finding SNPs that are involved in phenotypic variations and have been under natural selection.

We should be careful when inferring from allele–trait associations that are detected in the genomic regions where relatively higher differentiations were observed (e.g., particular regions in chromosome 6).3 As a result of the considerable heterogeneity in the level of genetic differentiation over the human genome,35,36 spurious associations are more likely to occur in differentiated regions than in other regions, even if the value of the genome-wide inflation factor is within an acceptable range. To avoid possible false-positive results at differentiated SNPs, a method for correcting the effect of population stratification (implemented in the EIGENSTRAT program in EIGENSOFT13) would be effective.

Because of the genetic differentiation among geographical regions in Japan, the design of a GWAS needs to take into account the structure of the Japanese population, especially if there are differences in disease prevalence among geographical regions of Japan. In the present study, we used individual genotype data to conduct simulations in order to examine to what extent the population stratification causes an increase of false-positive rates in association studies. On the basis of the genome-wide χ2 inflation factor, λ, we found the conditions under which an increase of false-positive rates would be acceptable or negligible. More generally, we propose the following approaches to avoiding an inflation of false-positive rates in a GWAS for the Japanese population: (1) If either cases or controls include individuals from the Ryukyu cluster in different but small proportions, simply exclude them from the studies. (2) If both case and control groups include significant proportions of individuals from the Ryukyu cluster, examine the heterogeneity of the odds ratios among the clusters and the entire sample (e.g., by using the Mantel-Haenszel test37). (3) Select controls so that the proportions of individuals from the Ryukyu cluster in case and control groups are as equal as possible. (4) If one examines the relatedness between case and control groups by any method (e.g., the smartpca program in EIGENSOFT,13,15 PLINK14) and obtains a result in a two-dimensional graph, then select the controls so that the graph areas including cases and controls are equivalent.
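For approach (2), a first look at heterogeneity can be obtained by comparing the odds ratio within each cluster with a Mantel-Haenszel pooled odds ratio. The sketch below is not a formal heterogeneity test; it only computes the quantities to be compared. The function names and the layout of each 2 × 2 table (a, b, c, d) are hypothetical, with one table per cluster.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of one 2 x 2 table.

    a = cases with the allele, b = controls with the allele,
    c = cases without it, d = controls without it.
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator
```

If the per-cluster values from `odds_ratio` differ markedly from each other or from the pooled value, the association signal may depend on cluster membership and deserves closer inspection.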

Web Resources

The URLs for data presented herein are as follows:

Supplemental Data

Supplemental Data include two figures and one table and can be found with this paper online at http://www.ajhg.org/.



We thank Kazuharu Misawa, Takahiro Nakamura, Tatsuhiko Tsunoda, and Ryo Yamada for helpful discussions. We thank Toshihiro Tanaka for his effort in the SNP-discovery project in Japan. We thank all of the members of the Laboratory for Genotyping. We also thank all of the members of the BioBank Japan Project for their efforts in organizing the project. This study was supported by the Ministry of Education, Culture, Sports, Science and Technology of Japan.


1. Ozaki K., Ohnishi Y., Iida A., Sekine A., Yamada R., Tsunoda T., Sato H., Sato H., Hori M., Nakamura Y., Tanaka T. Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nat. Genet. 2002;32:650–654.
2. Klein R.J., Zeiss C., Chew E.Y., Tsai J.Y., Sackler R.S., Haynes C., Henning A.K., SanGiovanni J.P., Mane S.M., Mayne S.T. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308:385–389.
3. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678.
4. Voight B.F., Pritchard J.K. Confounding from cryptic relatedness in case-control association studies. PLoS Genetics. 2005;1:e32.
5. Reich D.E., Goldstein D.B. Detecting association in a case-control study while correcting for population stratification. Genet. Epidemiol. 2001;20:4–16.
6. Pritchard J.K., Rosenberg N.A. Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet. 1999;65:220–228.
7. Pritchard J.K., Stephens M., Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959.
8. Hoggart C.J., Shriver M.D., Kittles R.A., Clayton D.G., McKeigue P.M. Design and analysis of admixture mapping studies. Am. J. Hum. Genet. 2004;74:965–978.
9. Devlin B., Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004.
10. Devlin B., Roeder K., Wasserman L. Genomic control for association studies: a semiparametric test to detect excess-haplotype sharing. Biostatistics. 2000;1:369–387.
11. Devlin B., Roeder K., Wasserman L. Genomic control, a new approach to genetic-based association studies. Theor. Popul. Biol. 2001;60:155–166.
12. Armitage P. Tests for linear trends in proportions and frequencies. Biometrics. 1955;11:375–386.
13. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909.
14. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A., Bender D., Maller J., Sklar P., de Bakker P.I., Daly M.J., Sham P.C. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575.
15. Patterson N., Price A.L., Reich D. Population structure and eigenanalysis. PLoS Genetics. 2006;2:e190.
16. Haga H., Yamada R., Ohnishi Y., Nakamura Y., Tanaka T. Gene-based SNP discovery as part of the Japanese Millennium Genome Project: identification of 190,562 genetic variations in the human genome. Single-nucleotide polymorphism. J. Hum. Genet. 2002;47:605–610.
17. Horai S., Murayama K., Hayasaka K., Matsubayashi S., Hattori Y., Fucharoen G., Harihara S., Park K.S., Omoto K., Pan I.H. mtDNA polymorphism in East Asian Populations, with special reference to the peopling of Japan. Am. J. Hum. Genet. 1996;59:579–590.
18. Tanaka M., Cabrera V.M., Gonzalez A.M., Larruga J.M., Takeyasu T., Fuku N., Guo L.J., Hirose R., Fujita Y., Kurata M. Mitochondrial genome variation in eastern Asia and the peopling of Japan. Genome Res. 2004;14:1832–1850.
19. Hammer M.F., Horai S. Y chromosomal DNA variation and the peopling of Japan. Am. J. Hum. Genet. 1995;56:951–962.
20. Hammer M.F., Karafet T.M., Park H., Omoto K., Harihara S., Stoneking M., Horai S. Dual origins of the Japanese: common ground for hunter-gatherer and farmer Y chromosomes. J. Hum. Genet. 2006;51:47–58.
21. Omoto K., Saitou N. Genetic origins of the Japanese: a partial support for the dual structure hypothesis. Am. J. Phys. Anthropol. 1997;102:437–446.
22. Hanihara K. Dual structure model for the population history of the Japanese. Japan Review. 1991;2:1–33.
23. Nakamura T., Shoji A., Fujisawa H., Kamatani N. Cluster analysis and association study of structured multilocus genotype data. J. Hum. Genet. 2005;50:53–61.
24. The International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320.
25. Nakamura Y. The BioBank Japan Project. Clin. Adv. Hematol. Oncol. 2007;5:696–697.
26. Hinds D.A., Stuve L.L., Nilsen G.B., Halperin E., Eskin E., Ballinger D.G., Frazer K.A., Cox D.R. Whole-genome patterns of common DNA variation in three human populations. Science. 2005;307:1072–1079.
27. Peacock E., Whiteley P. Perlegen sciences, inc. Pharmacogenomics. 2005;6:439–442.
28. Wright S. The genetical structure of populations. Ann. Eugen. 1951;15:323–354.
29. Li J.Z., Absher D.M., Tang H., Southwick A.M., Casto A.M., Ramachandran S., Cann H.M., Barsh G.S., Feldman M., Cavalli-Sforza L.L., Myers R.M. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319:1100–1104.
30. Fujimoto A., Kimura R., Ohashi J., Omi K., Yuliwulandari R., Batubara L., Mustofa M.S., Samakkarn U., Settheetham-Ishida W., Ishida T. A scan for genetic determinants of human hair morphology: EDAR is associated with Asian hair thickness. Hum. Mol. Genet. 2008;17:835–843.
31. Sabeti P.C., Varilly P., Fry B., Lohmueller J., Hostetter E., Cotsapas C., Xie X., Byrne E.H., McCarroll S.A., Gaudet R. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007;449:913–918.
32. Yoshiura K., Kinoshita A., Ishida T., Ninokata A., Ishikawa T., Kaname T., Bannai M., Tokunaga K., Sonoda S., Komaki R. A SNP in the ABCC11 gene is the determinant of human earwax type. Nat. Genet. 2006;38:324–330.
33. Bannai M., Ohashi J., Harihara S., Takahashi Y., Juji T., Omoto K., Tokunaga K. Analysis of HLA genes and haplotypes in Ainu (from Hokkaido, northern Japan) supports the premise that they descent from Upper Paleolithic populations of East Asia. Tissue Antigens. 2000;55:128–139.
34. Novembre J., Stephens M. Interpreting principal component analyses of spatial population genetic variation. Nat. Genet. 2008;40:646–649.
35. Weir B.S., Cardon L.R., Anderson A.D., Nielsen D.M., Hill W.G. Measures of human population structure show heterogeneity among genomic regions. Genome Res. 2005;15:1468–1476.
36. Akey J.M., Zhang G., Zhang K., Jin L., Shriver M.D. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 2002;12:1805–1814.
37. Mantel N., Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 1959;22:719–748.

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics