Am J Hum Genet. 2002 Mar; 70(3): 635–651.
Published online 2002 Feb 8. doi: 10.1086/338999
PMCID: PMC384943

Phylogeographic Differentiation of Mitochondrial DNA in Han Chinese


To characterize the mitochondrial DNA (mtDNA) variation in Han Chinese from several provinces of China, we have sequenced the two hypervariable segments of the control region and the segment spanning nucleotide positions 10171–10659 of the coding region, and we have identified a number of specific coding-region mutations by direct sequencing or restriction-fragment–length–polymorphism tests. This allows us to define new haplogroups (clades of the mtDNA phylogeny) and to dissect the Han mtDNA pool on a phylogenetic basis, which is a prerequisite for any fine-grained phylogeographic analysis, the interpretation of ancient mtDNA, or future complete mtDNA sequencing efforts. Some of the haplogroups under study differ considerably in frequencies across different provinces. The southernmost provinces show more pronounced contrasts in their regional Han mtDNA pools than the central and northern provinces. These and other features of the geographical distribution of the mtDNA haplogroups observed in the Han Chinese make an initial Paleolithic colonization from south to north plausible but would suggest subsequent migration events in China that mainly proceeded from north to south and east to west. Lumping together all regional Han mtDNA pools into one fictive general mtDNA pool or choosing one or two regional Han populations to represent all Han Chinese is inappropriate for prehistoric considerations as well as for forensic purposes or medical disease studies.


The Han people constitute China’s and the world’s largest ethnic group, making up ~93% of the country’s population and nearly 20% of all humankind. The formation of the Han people was a process of continuous expansion by integration of numerous tribes or ethnic groups; it began with the ancient Huaxia tribe, which was formed during the 21st–8th centuries b.c. Although the Han people are now spread all over the country, the highest population concentrations are in the basins of the Yellow River, the Yangtze River, and the Zhujiang River and on the Songhuajiang-Liaohe plain in northeast China, as well as on the islands of Taiwan and Hainan (Du and Yip 1993; Ge et al. 1997). The migration of Han people to provinces such as Xinjiang and Yunnan occurred relatively recently, having started mainly ~100–600 years ago, and was caused by war, plague, and other reasons (Ge et al. 1997). Do these populations bear some genetic differences from those from the historical Han regions, such as Wuhan and Qingdao? To what extent can the genetic data reflect those recent migration events? A prerequisite for answering these and more-specific questions with genetic data is a thorough screening of mtDNA and Y-chromosome variation across China.

Hitherto, mtDNA from Han Chinese has been poorly sampled and understood in its variation, with only limited data available from Guangdong (T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data), Hong Kong (Betty et al. 1996), Shanghai (Nishimaki et al. 1999), Shandong (Wang et al. 2000), and Taiwan (Horai et al. 1996; Tsai et al. 2001). Moreover, previous genetic studies of the Chinese populations either grouped the various regional Han populations into “Southern Han” and “Northern Han” (Su et al. 1999, 2000) or simply used Han samples from only one or two regions to stand for all Han Chinese (Horai et al. 1996; Hou et al. 2001; Karafet et al. 2001), thereby neglecting potential geographic differences between different Han populations, as well as migrations between north and south. Although genetic contrast between southern and northern populations has been claimed in classical genetic markers (e.g., Zhao and Lee 1989; Chen et al. 1993; Du et al. 1998), dermatoglyphic data (Zhang et al. 1998), archaeological assemblages (Wu et al. 1989), as well as in nuclear microsatellites (Chu et al. 1998) and Y-chromosome single-nucleotide polymorphism (SNP) data (Su et al. 1999; Karafet et al. 2001), no detailed mtDNA study has been performed to substantiate this claim. Chu et al. (1998) and Su et al. (1999) also argued for a southern origin of northern populations, whereas Ding et al. (2000) emphasized that the regional genetic difference observed in the principal-component (PC) maps of mtDNA, nuclear short tandem repeats (STRs), and Y-chromosome SNPs might be more properly explained by a simple model of isolation by distance (IBD). Given the large census size of the Han people, the complexity of the migration events, and these hotly debated issues, it is necessary to gather detailed information about the regional Han populations.

To take full advantage of a uniparental marker system, such as mtDNA, one needs a sufficiently resolved phylogeny that is not overly blurred by recurrent mutations. Because the two hypervariable segments (HVS-I and HVS-II) alone—although useful for forensic purposes—cannot support a very reliable estimate of the mtDNA phylogeny (Bandelt et al. 2000), we opted for sequencing one stretch of the coding region (10171–10659) as well, which turned out to be highly informative for East Asian mtDNAs. Another segment (14055–14590) was sequenced in a few samples, helping to define four haplogroups. In addition, a number of further sites relevant for Eurasian mtDNAs (Macaulay et al. 1999; Schurr et al. 1999; T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data) were checked either by direct sequencing or through RFLP testing in specific mtDNAs.

Material and Methods


From six provinces in China, 263 unrelated Han individuals were analyzed: 43 from Kunming, Yunnan; 42 from Wuhan, Hubei; 50 from Qingdao, Shandong; 47 from Yili, Xinjiang; 51 from Fengcheng, Liaoning; and 30 from Zhanjiang, Guangdong (see fig. 1 for sample locations). The maternal pedigrees (unrelated through at least three generations) of all individuals were ascertained before sampling. Except for 17 samples from Xinjiang, all subjects were able to confirm that the birthplace of their maternal grandmothers was in the same province.

Figure 1
Geographic locations of the Han samples under study

Previously published Han mtDNA data used here for comparison include 69 mtDNAs from Guangzhou, Guangdong (with HVS-I, HVS-II, and additional coding-region information; T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data), 20 mtDNAs from Hong Kong (HVS-I; Betty et al. 1996), 120 mtDNAs from Shanghai (HVS-I; Nishimaki et al. 1999—however, these data are not fully reliable; see Bandelt et al. 2001), 155 Taiwanese mtDNAs (HVS-I and HVS-II; Tsai et al. 2001), and another 66 Taiwanese mtDNAs (HVS-I; Horai et al. 1996). Further, mtDNAs (HVS-I) from 78 patients with type 2 diabetes mellitus (Y.-G. Yao, P.-L. Geng, Q.-P. Kong, and Y.-P. Zhang, unpublished data) from Xining, Qinghai, who do not bear the 3243 A→G transition (a well-known pathogenic mutation), were included here. Fifty mtDNAs from Zibo, Shandong, represented by a 185-bp fragment of HVS-I (16194–16378; Wang et al. 2000), were tentatively taken into consideration.

Amplification and Sequencing of HVS-I, HVS-II, and Region 10171–10659

Genomic DNA was extracted from whole blood by standard phenol/chloroform methods. The sequences of HVS-I from position 16001 to 16497 (relative to the revised Cambridge reference sequence [CRS]; Andrews et al. 1999) were amplified and sequenced as described elsewhere (Yao et al. 2000a). For HVS-II, the primer pair L29 and H408 was used in amplification and sequencing. For the segment 10171–10659, which covers the tRNAArg gene (10405–10469) and parts of the ND3 (10059–10406) and ND4L (10470–10766) genes, we used primers L10170 and H10660 for amplification and sequencing (table 1). Since several segments of the same mtDNA had to be screened, care was taken to avoid artificial recombination caused by potential sample crossover; therefore, doubtful segments were resequenced.

Table 1
Primers for Amplification, Sequencing, and RFLP Analyses

Typing of Other Polymorphisms

First, those Han individuals who had not yet been screened for the mtDNA 9-bp deletion in the COII/tRNALys intergenic region (Yao et al. 2000b) were analyzed as described in that study. Then, as for the typing of further coding-region polymorphisms in specific lineages, we took advantage of the phylogenetic analyses of Eurasian mtDNAs provided by Macaulay et al. (1999) and Kivisild and colleagues (T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data), which employed coding-region information (mainly derived from Ozawa et al. 1991, 1995; Ikebe et al. 1995; Ingman et al. 2000). In each run, a few (random) controls were tested. Some (unexpected) mutations observed in the controls were then systematically screened in related mtDNAs, which eventually led to the identification of novel characteristic markers for some haplogroups. In total, 13 pairs of primers were designed for RFLP typing and coding-region sequencing, as listed (along with the PCR conditions) in table 1.

Data Analyses

The sequences were edited and aligned by the DNASTAR software (DNASTAR, Inc.) and were compared with the revised CRS (Andrews et al. 1999). The length polymorphisms of the A and C stretches in 16180–16188 (triggered by the 16189 T→C substitution) were disregarded in the analyses. We adopted the classification tree proposed by Kivisild and colleagues (T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data), but without highlighting haplogroup E (which is still poorly described and apparently very rare in China) and subhaplogroups of A and Y. We then assigned the mtDNAs to the (nested) haplogroups according to HVS-I, HVS-II, and coding-region information, in such a way that each mtDNA was allocated to the most-derived (i.e., smallest) named haplogroup it belongs to. If the haplogroup has further named subhaplogroups, then (following Richards et al. 1998) a star is attached to the haplogroup name that refers to the mtDNA under consideration, to emphasize that the haplogroup status of the mtDNA cannot be specified further (relative to the classification tree). Coalescence times, along with standard deviations, were estimated according to the methods of Forster et al. (1996) and Saillard et al. (2000) for the major haplogroups detected in the 332 mtDNAs (263 from this study and 69 from T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems [unpublished data]).
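The allocation rule described above (most-derived named haplogroup, with a star when named subhaplogroups exist) can be sketched as follows. This is only an illustration under simplifying assumptions: the `TREE` dictionary is a hypothetical, heavily reduced excerpt that uses a few defining sites named in this article, R9 is treated as a root for brevity, and back mutations or recurrent mutations are not handled.

```python
# Toy classification tree: name -> (parent, defining sites).
# Hypothetical, reduced excerpt; the real tree has many more clades and sites.
TREE = {
    "M":   (None, {"10400"}),
    "M10": ("M",  {"10646", "16311"}),
    "R9":  (None, {"10310"}),   # simplification: R9 treated as a root here
    "R9a": ("R9", {"10320"}),
}

def lineage_sites(name, tree):
    """Collect the defining sites of a haplogroup and of all its ancestors."""
    sites = set()
    while name is not None:
        parent, own = tree[name]
        sites |= own
        name = parent
    return sites

def depth(name, tree):
    """Number of steps from a haplogroup up to its root in the tree."""
    steps = 0
    while tree[name][0] is not None:
        name = tree[name][0]
        steps += 1
    return steps

def assign(variants, tree):
    """Return the most-derived haplogroup whose full motif is present.

    A star is appended when the chosen haplogroup has named subhaplogroups,
    meaning the mtDNA cannot be specified further relative to the tree.
    """
    variants = set(variants)
    candidates = [h for h in tree if lineage_sites(h, tree) <= variants]
    if not candidates:
        return None
    best = max(candidates, key=lambda h: depth(h, tree))
    has_children = any(parent == best for parent, _ in tree.values())
    return best + "*" if has_children else best
```

With this toy tree, a haplotype carrying 10400, 10646, and 16311 is allocated to M10, whereas one carrying only 10400 is reported as M*.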

Haplogroup frequencies were then computed for the regional Han mtDNA samples. To compare these haplogroup profiles with those from the previously published Han HVS-I data sets (lacking coding-region information), we classified the published mtDNAs in another, coarser scheme guided by HVS-I and HVS-II motifs and (near-)matching with the 332 Han mtDNAs. This necessarily precluded the finer subdivision of haplogroup D4, the recognition of F2, and the distinction between M* and N*. The frequency vectors of the basal mtDNA profiles (which only record the frequencies of the 10 basal haplogroups M7, M8, M9, M10, G2, D, A, N9, B, and R9 and the R* and M*/N* haplotypes in 13 Han samples) and the coarse mtDNA profiles were then subjected to PC analysis by the POPSTR program.
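A minimal sketch of how such frequency vectors could be built from per-individual haplogroup assignments is given below. The function names and the `TO_BASAL` mapping are illustrative assumptions, not part of the original analysis; the mapping covers only a few subhaplogroups whose basal haplogroup follows from their names.

```python
from collections import Counter

# Hypothetical partial mapping from named subhaplogroups to basal haplogroups.
TO_BASAL = {"D4a": "D", "D4b": "D", "D5a": "D", "M8a": "M8"}

def collapse(assignments, mapping):
    """Replace each fine-grained label by its basal haplogroup, if one is listed."""
    return [mapping.get(label, label) for label in assignments]

def haplogroup_profile(assignments, haplogroups):
    """Return frequencies (%) in the fixed order given by `haplogroups`."""
    if not assignments:
        raise ValueError("sample is empty")
    counts = Counter(assignments)
    total = len(assignments)
    return [100.0 * counts[h] / total for h in haplogroups]
```

Keeping the haplogroup order fixed across samples ensures that the resulting vectors are directly comparable when they are passed on to a multivariate analysis.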


Classification Tree

The sequence variation in HVS-I, HVS-II, region 10171–10659, and at further polymorphic sites detected in the 263 Han individuals is shown in table 2. The present data suggest two new subhaplogroups of M, which we name “M9” and “M10,” as well as subhaplogroups of D4 (D4a and D4b), D5 (D5a), and F1 (F1c). Except for M10 and F1c, these new haplogroups each have at least one representative in the complete sequence database (T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data). Altogether we distinguish 44 named nested haplogroups in the Han mtDNA classification tree. Figure 2 displays these haplogroups, along with the defining sites considered in this study. Almost all samples can be affiliated with proper haplogroups of macrohaplogroups M and N, with the exception of a few M* haplotypes and one N* haplotype that could not be specified further. Evidently, some of the M* haplotypes belong to specific clades (one with motif 16234-16290-125-127 and another with 318-326), the mutual relationships of which are not yet clear. Among the three R* haplotypes that could not be classified as B or R9, two bear a mutation motif of 185-189-10398-16189-16311, similar to the motif of B5, but were found to lack the 9-bp deletion.

Figure 2
Classification tree of the mtDNA haplogroups observed in Han Chinese. The diagnostic mutations considered here (relative to the revised CRS; Andrews et al. 1999) are indicated on the branches. Nucleotide changes are specified for transversions by suffixes, ...

Table 2
Sequence Variation in the 263 Chinese Han Individuals Analyzed in the Present Study

Two mtDNAs, one sampled in Yunnan and the other in Liaoning, are regarded as resulting from admixture from western Eurasia (via central Asia), as they belong to the west Eurasian haplogroups HV and T1 (Macaulay et al. 1999). Note that the sample from Guangzhou contains one W haplotype (T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data).

The region 10171–10659 harbors numerous sites that support basal branches in the Asian mtDNA phylogeny. To begin with, site 10400 is one of the defining sites for macrohaplogroup M, whereas 10398 is one of the characteristic sites for macrohaplogroup N (Quintana-Murci et al. 1999). Back mutations at 10398—which occur occasionally (Macaulay et al. 1999)—are then characteristic of haplogroups Y and B5. The transition at 10397, which defines haplogroup D5, leads to the simultaneous loss of two prominent RFLP sites (10394 DdeI and 10397 AluI; Bandelt et al. 1999). Site 10181 defines haplogroup D4b, and site 10410 defines a subclade of D4a that seems to be frequent in Japan (T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data) but occurs only once in our Han data (from Liaoning). Subhaplogroup M7b2 of M7b can also be recognized by 10345. We define the new haplogroup M10 by sites 10646 (+10646 RsaI) and 16311, although one should bear in mind that both sites are prone to recurrent mutations. Haplogroup R9, as defined by Kivisild and colleagues (T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data), is identified by 10310. A branch of R9, haplogroup R9a, is further characterized by 10320 in addition to its HVS-I motif. Haplogroup F1 (F sensu stricto, as originally introduced by Torroni et al. [1994]) may be characterized by 10609 as well, whereas its sister group F2 (T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data) likely has the defining sites 10535 and 10586. The complete mtDNA sequence (XLIND) from China, reported by Ingman et al. (2000), is thus identified as an F type that does not belong to F1 or F2 (fig. 2). The newly defined subhaplogroup F1c of F1 has the characteristic site at 10454.
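The simultaneous loss of the 10394 DdeI and 10397 AluI sites through a single transition at 10397 can be illustrated with a short sketch. The recognition sequences (DdeI: CTNAG; AluI: AGCT) are standard; the 7-bp window below is a hypothetical stand-in for positions 10394–10400 of an mtDNA carrying both sites, and the base placed at 10396 is an arbitrary placeholder rather than a value taken from the reference sequence.

```python
import re

# Standard recognition sequences; N stands for any base.
ENZYMES = {"DdeI": "CTNAG", "AluI": "AGCT"}

def site_positions(sequence, recognition, offset):
    """Return the positions (numbered from `offset`) where a site starts."""
    pattern = "(?=" + recognition.replace("N", "[ACGT]") + ")"
    return [offset + m.start() for m in re.finditer(pattern, sequence)]

def mutate(sequence, offset, position, base):
    """Return a copy of `sequence` with `base` substituted at `position`."""
    index = position - offset
    return sequence[:index] + base + sequence[index + 1:]

# Hypothetical window for positions 10394-10400 (base at 10396 is a placeholder).
WINDOW_START = 10394
window = "CTCAGCT"
derived = mutate(window, WINDOW_START, 10397, "G")  # transition A->G at 10397

for name, recognition in ENZYMES.items():
    before = site_positions(window, recognition, WINDOW_START)
    after = site_positions(derived, recognition, WINDOW_START)
    print(name, before, after)
```

Because the two recognition sequences overlap at position 10397, one substitution there destroys both sites, which is why the pair of RFLP losses behaves as a single phylogenetic character.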

The region 14055–14590 is also quite informative for the Asian mtDNA phylogeny. It harbors one marker each for haplogroups C (14318), Y (14178), and M8a (14470, also recognizable by +14465 AccI). Haplogroup M9, introduced here, has the two characteristic sites 14308 and 3394 (identifiable by +3391 HaeIII).

In the recently published complete sequence data (Ingman et al. 2000; Finnilä et al. 2001), haplogroups C and Z were found to share the transition at 4715 and the A→T transversion at 15487 (among other mutations). Our typing of an M8a mtDNA confirms that the former two mutations are also shared by haplogroup M8a, thus supporting the phylogenetic position of M8, with CZ and M8a forming sister clades (T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data). The 9-bp deletion in the COII/tRNALys intergenic region, which is a diagnostic marker for haplogroup B, was found sporadically in lineages from A, D, and M*, thus confirming our previous results about the multiple origin of the deletion in these individuals (Yao et al. 2000b).

As to the dating of the nodes in the classification tree, table 3 lists the age estimates of the major haplogroups. Haplogroups M7, CZ, M8, G2, N9, B, B4, B5, F, F1, and R9 are all rather ancient, with ages >50,000 years. The ages of the other haplogroups seem to fall into the range 30,000–50,000 years, except for that of M8a, which may be <20,000 years.
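For orientation, the kind of estimate reported here can be sketched with the rho statistic, i.e., the average number of mutations separating the sampled lineages from the root haplotype of a clade. The sketch below makes two assumptions that should be checked against the cited methods: the default calibration of one transition per 20,180 years applies to a specific HVS-I stretch (as commonly attributed to Forster et al. 1996), and the standard deviation uses the star-like simplification sqrt(rho/n), which is what a branch-based variance estimator reduces to only when every sampled lineage descends independently from the root.

```python
import math

def rho_age(mutation_counts, years_per_mutation=20180.0):
    """Estimate clade age and its standard deviation from rho.

    mutation_counts: number of mutations from the root haplotype to each
        sampled individual in the clade.
    years_per_mutation: calibration (assumed default; verify before use).
    """
    n = len(mutation_counts)
    if n == 0:
        raise ValueError("clade contains no sampled lineages")
    rho = sum(mutation_counts) / n
    sd_rho = math.sqrt(rho / n)  # valid only for a star-like phylogeny
    return rho * years_per_mutation, sd_rho * years_per_mutation

# Hypothetical clade: four lineages at 1, 2, 3, and 2 steps from the root.
age, sd = rho_age([1, 2, 3, 2])
```

For clades with pronounced internal structure, the star-like standard deviation will understate the uncertainty, so it should be read as a lower bound.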

Table 3
Estimated Haplogroup Coalescent Times

From Coding Region to Control Region

The present Han mtDNA data (including those of T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data) with coding-region information can serve as a starting point for provisional haplogroup assignment of those east Asian mtDNAs for which only a segment of the control region is available (see GenBank). Potential haplogroup status can then be inferred through a motif search and (near-)matching with the 332 Han mtDNAs. For illustration, we take ancient mtDNA data, which usually offer only short fragments of HVS-I (and HVS-II). The mtDNAs from the 2,000-year-old Yixi site from Shandong Province (Oota et al. 1999), with polymorphic sites reported from 16203 to 16362 and from 146 to 263, can all be assigned to specific haplogroups, albeit at different levels of certainty. For example, sequence 01 (16203-16291-16304, 249d-263) does not match any of the 332 Han mtDNAs but has three one-step neighbors (XJ8414, XJ8407, and GD7809), all in F2; since it bears the full motif 16291-16304-249d for F2a, we can quite safely conclude that the sequence belongs to F2a. In contrast, sequence 19 (16223, 146-263) has no close companion (at distance two or fewer mutational steps) in the Han data and lacks any salient motif of the haplogroups considered here; therefore, if it can be assigned at all, we could at best assign it to M*.
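The motif search and (near-)matching strategy can be sketched as follows. The query and the F2a motif are taken from the example in the text; the two database entries (`ref1`, `ref2`) are hypothetical placeholders, not actual haplotypes from the 332 Han mtDNAs, and the distance is a plain count of differing variants that assumes both haplotypes were scored over the same sequence range.

```python
def distance(a, b):
    """Number of variants present in exactly one of the two haplotypes."""
    return len(set(a) ^ set(b))

def near_matches(query, database, max_steps=1):
    """Return (distance, name, haplogroup) for entries within `max_steps`."""
    hits = []
    for name, (variants, haplogroup) in database.items():
        steps = distance(query, variants)
        if steps <= max_steps:
            hits.append((steps, name, haplogroup))
    return sorted(hits)

def has_motif(query, motif):
    """True if every site of the motif is present in the query haplotype."""
    return set(motif) <= set(query)

# Sequence 01 and the F2a motif as quoted in the text.
sequence_01 = {"16203", "16291", "16304", "249d", "263"}
F2A_MOTIF = {"16291", "16304", "249d"}

# Hypothetical reference entries: name -> (variants, haplogroup).
DATABASE = {
    "ref1": ({"16291", "16304", "249d", "263"}, "F2"),
    "ref2": ({"16223", "16362", "263"}, "D4"),
}

hits = near_matches(sequence_01, DATABASE)
motif_found = has_motif(sequence_01, F2A_MOTIF)
```

An assignment is more secure when both lines of evidence agree, as for sequence 01; a query such as sequence 19 (16223, 146-263), with neither a near-match nor a motif, remains unresolved.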

An interesting case is constituted by the 29 mtDNAs from the 4,500-year-old Nakazuma Jomon site that were sequenced for the region 16209–16402 (Shinoda and Kanai 1999). The haplogroup affiliations of the resulting nine haplotypes, except for type 9 (16256-16278-16295), can be recognized by following our classification strategy. Type 1 (16223-16311-16357) matches haplotypes from M10 (one sampled in Liaoning and another one in Yunnan), and type 7 (16284) matches a B4b haplotype from Liaoning. The other six types have one-step neighbors in the Han mtDNA database: type 2 (16223-16234-16290-16319) is thus related to A haplotypes from Wuhan and Yunnan; type 3 (16223-16298-16319-16355) to M8a haplotypes from Qingdao and Wuhan; type 4 (16223-16266-16274-16362) to a D4 haplotype from Liaoning and to D5a haplotypes from Liaoning, Wuhan, Xinjiang, and Qingdao; type 6 (16223-16278-16362) to two G2 haplotypes and type 8 (16223-16245-16362-16368) to one D4 haplotype, all from Liaoning; finally, type 5 (16223-16357) is a one-step descendant of the matched M10 type 1 (but, alternatively, it would also be a one-step neighbor of an M* haplotype from Qingdao). It is conspicuous that the Jomon mtDNAs find their near-matches within the Han mtDNA database mainly in the northern and central pools, especially in the Liaoning sample.

Haplogroup Profiles

Haplogroup frequencies varied among the regional Han populations (table 4). Five main features can be discerned. (1) Haplogroups A, Z, and Y are absent in the two Guangdong samples. These two samples differ significantly in the number of M* mtDNAs. Haplogroup M7b (including M7b1, M7b2, and M7b*) is absent in the Zhanjiang sample but is present, with a frequency of 8.7%, in the Guangzhou sample. The frequency of F1a in the Guangzhou sample (17.4%) is higher than that in the Zhanjiang sample (6.7%). (2) Haplogroup M7b1 has by far the highest frequency (14.0%) in the Yunnan sample, whereas, in central and northeast China, it only occurs at low frequencies (<5.0%). (3) The Wuhan sample shows a relatively high frequency of haplogroup A (16.7%), followed by the Shanghai (11.7%) and Xinjiang (10.6%) samples. These three samples and the Zibo sample have relatively high frequencies (> 7.5%) of CZ. (4) Most of the mtDNAs that belong to haplogroups M9, M8a, Y, and G2 are restricted to the northern and northwestern populations of Liaoning, Qingdao, Xinjiang, and Qinghai, although the Taiwanese samples also include a good number of M9, Y, and G2 mtDNAs. The newly defined haplogroup, M10, has the highest frequency in the Liaoning sample (5.9%). (5) Generally, the frequencies of haplogroups F1 and B tend to decrease from south to north, whereas the D4 frequency increases.

Table 4
Estimated Frequencies (%) of mtDNA Haplogroups in Regional Han Populations

PC Maps for mtDNA Data

The basal mtDNA haplogroup profiles of the 13 Han samples were treated as input vectors for the PC analysis. Figure 3 displays the PC map for the first two principal components, which together account for 63% of the total variation. A geographic patterning of the samples is evident in the map, as mainly expressed by the first PC. The second PC, however, also contributes to the south-to-north cline (leaving aside the outlier—the Zhanjiang sample from southernmost mainland China). The two populations from Guangdong, Guangzhou and Zhanjiang, are distant from each other in the PC map, although they are geographically proximate. In contrast, the four northern populations (Qinghai, Liaoning, Qingdao, and Zibo) are close together. Although the Zibo data were extremely meager (185-bp fragments of HVS-I), the haplogroup classification, by and large, seems to be correct, since Zibo comes next to Qingdao (from the same province, Shandong) in the map. The populations with recent migration history, Taiwanese and Xinjiang Han, take intermediate positions in the PC map, in the vicinity of the populations from central and east China.
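For readers who wish to reproduce this type of map, a generic principal-component computation on frequency vectors is sketched below. It is not the POPSTR program used in this study, whose internals are not described here; it is a plain covariance-based PCA by power iteration with deflation, written in pure Python, and the function name and parameters are illustrative assumptions.

```python
import math
import random

def pca(rows, n_components=2, iterations=1000, seed=0):
    """Return (scores, explained) for the leading principal components.

    rows: one frequency vector per population (at least two rows).
    explained: fraction of total variance captured by each component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total = sum(cov[i][i] for i in range(d))
    rng = random.Random(seed)  # random start avoids a start vector in the null space
    axes, explained = [], []
    for _ in range(n_components):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:       # no variance left in the deflated matrix
                vec = [0.0] * d
                break
            vec = [x / norm for x in nxt]
        eigval = sum(vec[i] * cov[i][j] * vec[j]
                     for i in range(d) for j in range(d))
        for i in range(d):         # deflate: remove the component just found
            for j in range(d):
                cov[i][j] -= eigval * vec[i] * vec[j]
        axes.append(vec)
        explained.append(eigval / total if total > 0 else 0.0)
    scores = [[sum(c[j] * axis[j] for j in range(d)) for axis in axes]
              for c in centered]
    return scores, explained
```

The sum of the first two entries of `explained` corresponds to the share of total variation that a two-dimensional map accounts for. The sign of each axis is arbitrary, so maps from different runs or programs may appear mirrored.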

Figure 3
PC map of the mtDNA data (with respect to the basal haplogroup profiles) of 13 regional Han samples.

In the PC map, with respect to the coarse profiles (with 33 entries; see table 4), the south-to-north cline of the populations observed in the basal PC map does not change considerably (map not shown). Since the basal haplogroups are probably ≥50,000 years old, one could expect that the ancient imprints of the earliest settlement processes on regional mtDNA pools are slightly more pronounced in the basal PC map.


The phylogenetic analysis of the Han HVS-I and HVS-II sequences is greatly enhanced by the information provided by the region 10171–10659 and other specific polymorphisms, which enables us to distinguish between the two macrohaplogroups M and N and to identify several new haplogroups. The region 10171–10659, which had not been studied before (unless complete sequencing was carried out), overlaps with the ND3 gene that was sequenced in a small worldwide sample by Nachman et al. (1996); with respect to our classification scheme, we can immediately infer that their types 11 and 13 belong to haplogroup D5, type 6 to B4a, and type 3 to R9. The now-emerging tree of East Asian mtDNAs (present study; T. Kivisild, H.-V. Tolk, J. Parik, Y. Wang, S. S. Papiha, H.-J. Bandelt, and R. Villems, unpublished data) can help to direct complete sequencing efforts, in that lineages would be selected from those deep branches that are not yet represented by complete sequences, thus filling the lacunae. Another benefit is the tracing of pathogenic mtDNA lineages: if a new mutation is found in the coding region of a patient’s mtDNA, one could speed up the diagnosis by first typing this mutation in normal individuals from the same haplogroup, to see whether it is haplogroup-specific or pathogenic. The type 2 diabetes mellitus sample from Qinghai Province included here can serve as a good example in this respect. Although no normal controls from the same province have yet been analyzed for mtDNA, it is reasonable to expect that slight fluctuations in haplogroup frequencies, compared with neighboring regions (as shown in table 4 and fig. 3), reflect regional differences rather than association with type 2 diabetes mellitus.

Coding-region information is indispensable for phylogenetic analysis of mtDNA. In cases where direct information from the coding region is not available, one can at least link combinations of HVS-I mutations with certain mutations in the coding region. Specifically, we can anticipate the haplogroup status of most East Asian HVS-I sequences via the Han database through (near-)matching and motif recognition. This classification strategy can be very useful for ancient DNA analysis, as demonstrated above. Attempts at estimating a phylogeny solely from HVS-I without any reference to coding-region sites would go astray, in particular, if neighbor joining (NJ) with midpoint rooting comes into action (see the appendix of the article by Richards et al. [1996]). For instance, this approach applied to the large Thai HVS-I data set (see fig. 3 of Fucharoen et al. [2001]) resulted in highly polyphyletic clusters: haplogroup B was distributed over two clusters, 1 and 3b; cluster 3a includes haplogroups D5, M7c, N9a, and M*; cluster 4 groups C and Z together with R9a; and cluster 8 harbors D4, D5, and A lineages. Most of the apparent clades of this NJ tree intermingle lineages from macrohaplogroups M and N and therefore would not pass the test with complete sequence data. The same kind of problem is also manifest in the NJ analysis of the HVS-I data performed by Qian et al. (2001). Even a mass screening of East Asian mtDNA data based on HVS-I alone, assisted by a network method, cannot provide a much more favorable picture. Among the six “radiation groups” I–VI, erected by Oota et al. (1999), three groups (I–III) each comprise both M and N lineages, one group (IV) comprises Y and R lineages, and only two groups (V and VI) could potentially serve as proxies for monophyletic groups (B4 and F, respectively).

The comparison of the regional Han mtDNA samples revealed an obvious geographic differentiation in the Han Chinese, as shown by the haplogroup-frequency profiles and the PC maps. The south-to-north cline observed in the frequencies of haplogroups F1, B, and D4 is quite similar to the distributions of immunoglobulin Gm allotypes Gm1,3;5 and Gm1;21 in Chinese populations (Zhao and Lee 1989). Hence, the grouping of different Han populations into just “Southern Han” and “Northern Han” (Su et al. 1999, 2000) or the use of one or two Han regional populations to stand for all Han Chinese (Horai et al. 1996; Hou et al. 2001; Karafet et al. 2001) constitutes a Procrustean bed and does not appropriately reflect the genetic structure of the Han. Intriguingly, despite the numerous historically recorded migrations and substantial gene flow across China from the Bronze Age to the present time (Ge et al. 1997), differences between geographic regions have been maintained. The regional difference is more pronounced in south and southwest China: in the PC map, the southern and southwestern populations show a more diverse pattern than the populations from central, east, and northeast China. The Zhanjiang and Guangzhou samples, though from the same province (Guangdong), differ considerably in their mtDNA haplogroup distribution. It thus seems that the Neolithic expansions from the Yellow River basin and later from the Yangtze River basin to other parts of China, as well as Bronze Age movements, did not erase local populations. The subsequent conquest by the Han in historical time, starting from central China, constituted mainly a political expansion process that led to the cultural assimilation of numerous ethnic groups under the dominant Han culture (Ge et al. 1997).

The spread of Han people to Yunnan, Xinjiang, and Taiwan happened relatively recently, within the past several hundred years. For the Yunnan Han, according to historical records, many movements were caused by an expansion policy, especially during the Ming dynasty (1368–1644 a.d.) (Ge et al. 1997). Since at that time the local population density was very high, the relative contribution of the Han to the local gene pools was overall rather minor, although eventually Han culture was generally accepted. Therefore, the genetic makeup of the Yunnan Han should show more influence from the autochthonous people than that of Han people from their early historical homelands in the basins of the Yellow River and the Yangtze River (see Du et al. 1998). The Taiwanese and Xinjiang Han have similar demographic histories: after World War II, both populations received a heavy influx of Han people from across almost all of China. However, before the withdrawal of the Guomindang, Han people from the proximal Fujian and Guangdong provinces and other parts of China continually migrated to Taiwan, with two main waves arriving in the 18th and 19th centuries (Ge et al. 1997). The high frequencies of haplogroups F1a and M7b in the Taiwanese Han, if not an autochthonous signal, might well reflect this connection with south mainland China, whereas other haplogroups (such as G2 and Y, mainly present in the north) hint at recent migrations from north and northeast China. The presence of two R9a types in Xinjiang (incidentally matching the two R9a haplotypes from Hong Kong; Betty et al. 1996), as well as the M7b haplotypes, points to connections with south and southwest China, where R9a and M7b are prevalent. On the other hand, the relatively high percentage of haplogroups A, C, and Z in this population may stem from recent migrations of Han people from central and east China to Xinjiang Province during the 1950s and 1960s. Evidence for recent migration is also reflected by the fact that no west Eurasian mtDNA types were found in the Xinjiang Han, whereas, among the Uygurs and Kazakhs from the same geographic areas (Yao et al. 2000a), >30% of individuals belong to west Eurasian haplogroups (Macaulay et al. 1999).

In summary, our phylogenetic analysis of 263 Han mtDNAs shows that ~94% of the lineages can be allocated to specific subhaplogroups of the Eurasian founder haplogroups M, N, and R (which is itself a subhaplogroup of N shared between Europe and East Asia). Most of the nested haplogroups that are not infrequent have ages >30,000 years. It is conspicuous that the potentially most ancient of these haplogroups, R9 and B, may have their earliest diversification in southern China and/or Southeast Asia. A few possibly basal branches of M, present in Guangdong but absent or rare in northern China, still await a full description with more data from Southeast Asia. Only a restricted number of major subhaplogroups of M and N—namely, G, M8, M9, A, and N9—may be of central or northern Chinese provenance. All this makes an initial pioneer colonization of China ~60,000 years ago from Southeast Asia conceivable (as proposed by Su et al. 1999; Jin and Su 2000) but still leaves much room for speculation about the population dynamics during the long period between then and the Last Glacial Maximum. The contrast between the northern and southern genetic pools might have its roots in this period. Subsequent migration events may have somewhat blurred this early distinction, with the genetic pools of central China possessing mtDNA features of both the northern and the southern pools.


We thank Dr. Vincent Macaulay for helpful comments on an earlier version of this paper and Professor Henry C. Harpending for providing the program POPSTR. We are also grateful to Professor Pai-Li Geng and Qing-Wei Li for sample collection and Gou Shi-Kang and Wu Shi-Fang for technical assistance. This research was supported by grants from the Natural Sciences Foundation of China, the Chinese Academy of Sciences, and the Natural Sciences Foundation of Yunnan Province, as well as by a short-term research scholarship from the German Deutscher Akademischer Austauschdienst.

Electronic-Database Information

Accession numbers and the URL for data in this article are as follows:

GenBank Overview, http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html (for mtDNA control region data; accession numbers AY052834–AY053358)


References

Andrews RM, Kubacka I, Chinnery PF, Lightowlers RN, Turnbull DM, Howell N (1999) Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nat Genet 23:147
Bandelt H-J, Forster P, Röhl A (1999) Median-joining networks for inferring intraspecific phylogenies. Mol Biol Evol 16:37–48
Bandelt H-J, Macaulay V, Richards M (2000) Median networks: speedy construction and greedy reduction, one simulation, and two case studies from human mtDNA. Mol Phylogenet Evol 16:8–28
Bandelt H-J, Lahermo P, Richards M, Macaulay V (2001) Detecting errors in mtDNA data by phylogenetic analysis. Int J Legal Med 115:64–69
Betty DJ, Chin-Atkins AN, Croft L, Sraml M, Easteal S (1996) Multiple independent origins of the COII/tRNALys intergenic 9-bp mtDNA deletion in aboriginal Australians. Am J Hum Genet 58:428–433
Chen R, Ye G, Geng Z, Wang Z, Kong F, Tian D, Bao P, Liu R, Liu J, Song F, Fan L, Zhang G, Guo S, Xu L, Xu X, Cheng D, Zhao X (1993) Revelations of the origin of Chinese nation from clustering analysis and frequency distribution of HLA polymorphism in major minority nationalities in mainland China. Acta Genetica Sinica 20:389–398 (in Chinese)
Chu JY, Huang W, Kuang SQ, Wang JM, Xu JJ, Chu ZT, Yang ZQ, Lin KQ, Li P, Wu M, Geng ZC, Tan CC, Du RF, Jin L (1998) Genetic relationship of populations in China. Proc Natl Acad Sci USA 95:11763–11768
Ding Y-C, Wooding S, Harpending H, Chi H-C, Li H-P, Fu Y-X, Pang J-F, Yao Y-G, Xiang YJG, Moyzis R, Zhang Y-P (2000) Population structure and history in East Asia. Proc Natl Acad Sci USA 97:14003–14006
Du R, Xiao CJ, Cavalli-Sforza LL (1998) Genetic distances between Chinese populations calculated on gene frequencies of 38 loci. Sci China C 28:83–89
Du R, Yip VF (1993) Ethnic groups in China. Science Press, Beijing
Finnilä S, Lehtonen MS, Majamaa K (2001) Phylogenetic network for European mtDNA. Am J Hum Genet 68:1475–1484
Forster P, Harding R, Torroni A, Bandelt H-J (1996) Origin and evolution of native American mtDNA variation: a reappraisal. Am J Hum Genet 59:935–945
Fucharoen G, Fucharoen S, Horai S (2001) Mitochondrial DNA polymorphisms in Thailand. J Hum Genet 46:115–125
Ge JX, Wu SD, Chao SJ (1997) Zhongguo yimin shi (The migration history of China). Fujian People Press, Fuzhou, China (in Chinese)
Horai S, Murayama K, Hayasaka K, Matsubayashi S, Hattori Y, Fucharoen G, Harihara S, Park KS, Omoto K, Pan IH (1996) mtDNA polymorphism in east Asian populations, with special reference to the peopling of Japan. Am J Hum Genet 59:579–590
Hou YP, Zhang J, Li YB, Wu J, Zhang SZ, Prinz M (2001) Allele sequences of six new Y-STR loci and haplotypes in the Chinese Han population. Forensic Sci Int 118:147–152
Ikebe S, Tanaka M, Ozawa T (1995) Point mutations of mitochondrial genome in Parkinson's disease. Brain Res Mol Brain Res 28:281–295
Ingman M, Kaessmann H, Pääbo S, Gyllensten U (2000) Mitochondrial genome variation and the origin of modern humans. Nature 408:708–713
Jin L, Su B (2000) Natives or immigrants: modern human origin in East Asia. Nat Rev Genet 1:126–133
Karafet T, Xu L, Du R, Wang W, Feng S, Wells RS, Redd AJ, Zegura SL, Hammer MF (2001) Paternal population history of east Asia: sources, patterns, and microevolutionary process. Am J Hum Genet 69:615–628
Macaulay V, Richards M, Hickey E, Vega E, Cruciani F, Guida V, Scozzari R, Bonné-Tamir B, Sykes B, Torroni A (1999) The emerging tree of west Eurasian mtDNAs: a synthesis of control-region sequences and RFLPs. Am J Hum Genet 64:232–249
Nachman MW, Brown WM, Stoneking M, Aquadro CF (1996) Nonneutral mitochondrial DNA variation in humans and chimpanzees. Genetics 142:953–963
Nishimaki Y, Sato K, Fang L, Ma M, Hasekura H, Boettcher B (1999) Sequence polymorphism in the mtDNA HV1 region in Japanese and Chinese. Legal Med 1:238–249
Oota H, Saitou N, Matsushita T, Ueda S (1999) Molecular genetic analysis of remains of a 2,000-year-old human population in China—and its relevance for the origin of the modern Japanese population. Am J Hum Genet 64:250–258
Ozawa T (1995) Mechanism of somatic mitochondrial DNA mutations associated with age and diseases. Biochim Biophys Acta 1271:177–189
Ozawa T, Tanaka M, Ino H, Ohno K, Sano T, Wada Y, Yoneda M, Tanno Y, Miyatake T, Tanaka T, Itoyama S, Ikebe S, Hattori N, Mizuno Y (1991) Distinct clustering of point mutations in mitochondrial DNA among patients with mitochondrial encephalomyopathies and with Parkinson's disease. Biochem Biophys Res Commun 176:938–946
Qian YP, Chu Z-T, Dai Q, Wei C-D, Chu JY, Tajima A, Horai S (2001) Mitochondrial DNA polymorphism in Yunnan nationalities in China. J Hum Genet 46:211–220
Quintana-Murci L, Semino O, Bandelt H-J, Passarino G, McElreavey K, Santachiara-Benerecetti AS (1999) Genetic evidence of an early exit of Homo sapiens sapiens from Africa through eastern Africa. Nat Genet 23:437–441
Richards M, Côrte-Real H, Forster P, Macaulay V, Wilkinson-Herbots H, Demaine A, Papiha S, Hedges R, Bandelt H-J, Sykes B (1996) Paleolithic and Neolithic lineages in the European mitochondrial gene pool. Am J Hum Genet 59:185–203
Richards M, Macaulay V, Bandelt H-J, Sykes B (1998) Phylogeography of mitochondrial DNA in western Europe. Ann Hum Genet 62:241–260
Saillard J, Forster P, Lynnerup N, Bandelt H-J, Nørby S (2000) mtDNA variation among Greenland Eskimos: the edge of the Beringian expansion. Am J Hum Genet 67:718–726
Schurr TG, Sukernik RI, Starikovskaya YB, Wallace DC (1999) Mitochondrial DNA variation in Koryaks and Itel'men: population replacement in the Okhotsk Sea–Bering Sea region during the Neolithic. Am J Phys Anthropol 108:1–39
Shinoda K, Kanai S (1999) Intracemetery genetic analysis at the Nakazuma Jomon site in Japan by mitochondrial DNA sequencing. Anthropol Sci 107:129–140
Su B, Xiao C, Deka R, Seielstad MT, Kangwanpong D, Xiao J, Lu D, Underhill P, Cavalli-Sforza L, Chakraborty R, Jin L (2000) Y chromosome haplotypes reveal prehistorical migrations to the Himalayas. Hum Genet 107:582–590
Su B, Xiao J, Underhill P, Deka R, Zhang W, Akey J, Huang W, Shen D, Lu D, Luo J, Chu J, Tan J, Shen P, Davis R, Cavalli-Sforza L, Chakraborty R, Xiong M, Du R, Oefner P, Chen Z, Jin L (1999) Y-chromosome evidence for a northward migration of modern humans into eastern Asia during the last ice age. Am J Hum Genet 65:1718–1724
Torroni A, Miller JA, Moore LG, Zamudio S, Zhuang J, Droma T, Wallace DC (1994) Mitochondrial DNA analysis in Tibet: implications for the origin of the Tibetan population and its adaptation to high altitude. Am J Phys Anthropol 93:189–199
Tsai LC, Lin CY, Lee JC, Chang JG, Linacre A, Goodwin W (2001) Sequence polymorphism of mitochondrial D-loop DNA in the Taiwanese Han population. Forensic Sci Int 119:239–247
Wang L, Oota H, Saitou N, Jin F, Matsushita T, Ueda S (2000) Genetic structure of a 2,500-year-old human population in China and its spatiotemporal changes. Mol Biol Evol 17:1396–1400
Wu R, Wu X, Zhang S (1989) Early humankind in China. Science Press, Beijing (in Chinese)
Yao Y-G, Lü X-M, Luo H-R, Li W-H, Zhang Y-P (2000a) Gene admixture in the silk road of China: evidence from mtDNA and melanocortin 1 receptor polymorphism. Genes Genet Syst 75:173–178
Yao Y-G, Watkins WS, Zhang Y-P (2000b) Evolutionary history of the mtDNA 9-bp deletion in Chinese populations and its relevance to the peopling of East and Southeast Asia. Hum Genet 107:504–512
Zhang H, Ding M, Jiao Y, Wang X, Yan Z, Jin G, Meng X, Bai C, Lu Z, Chen R (1998) A dermatoglyphic study of the Chinese population III. Dermatoglyphics cluster of fifty-two nationalities in China. Acta Genetica Sinica 25:381–391 (in Chinese)
Zhao TM, Lee TD (1989) Gm and Km allotypes in 74 Chinese populations: a hypothesis of the origin of the Chinese nation. Hum Genet 83:101–110

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics