Proc Natl Acad Sci U S A. Aug 19, 1997; 94(17): 9197–9201.
PMCID: PMC23111
Genetics

Detecting immigration by using multilocus genotypes

Abstract

Immigration is an important force shaping the social structure, evolution, and genetics of populations. A statistical method is presented that uses multilocus genotypes to identify individuals who are immigrants, or have recent immigrant ancestry. The method is appropriate for use with allozymes, microsatellites, or restriction fragment length polymorphisms (RFLPs) and assumes linkage equilibrium among loci. Potential applications include studies of dispersal among natural populations of animals and plants, human evolutionary studies, and typing zoo animals of unknown origin (for use in captive breeding programs). The method is illustrated by analyzing RFLP genotypes in samples of humans from Australian, Japanese, New Guinean, and Senegalese populations. The test has power to detect immigrant ancestors, for these data, up to two generations in the past even though the overall differentiation of allele frequencies among populations is low.

Classical theory in population genetics has focused on the long-term effects of immigration on allele frequency distributions in semi-isolated populations, concentrating on the stationary distribution resulting from a balance between forces of immigration, genetic drift, and mutation (1–4). Less theory exists addressing the effect of recent immigration among populations with low levels of genetic differentiation. A theory describing the effects of immigration on the genetic composition of individuals in populations that are not at genetic equilibrium is needed to interpret much of the data being generated using current genetic techniques.

In this paper we consider the multilocus genotypes that result when individuals are immigrants, or have recent immigrant ancestry. We propose a test that allows recent immigrants to be identified on the basis of their multilocus genotypes; the test has considerable power for detecting immigrant individuals even when the overall level of genetic differentiation among populations is low. Molecular genetic techniques that allow multilocus genotypes to be described from single individuals are relatively new, and much of the information contained in these types of data is not fully exploited by estimators of long-term gene flow that are currently available (5–7). We provide an example of an application of the method to restriction fragment length polymorphism (RFLP) genotypes from human populations; the method may also be applied to analyze multilocus allozyme and microsatellite data.

Theory

A collection of I discrete populations of a diploid species exchange immigrants, with random mating among individuals within populations. Consider a set of l loci in linkage equilibrium, and let kj be the number of alleles at the jth locus. Let x = {xhji} be a matrix of the allele frequencies in each population, where xhji is the frequency of the hth allele (h = 1, 2, … , kj) at the jth locus (j = 1, 2, … , l) in the ith population (i = 1, 2, … , I). A set of n = {n1, n2, … , nI} chromosomes are sampled from the I populations, where ni is the number sampled from the ith population. Let X = {Xijm} be the matrix of genotypes among the sampled individuals, where Xijm is the genotype at the jth locus of the mth individual sampled from the ith population.

The population allele frequencies are generally unknown, and we therefore derive the probability density of allele frequencies in each population by using a Bayesian approach. The tests presented in this paper make use of the allele frequency distributions to calculate genotype probabilities. It is assumed that the total number of alleles at the jth locus in each population is identically kj. The set of alleles observed in the collection of populations as a whole is used as an estimate of kj. Without additional information, we initially assign an equal probability density to the frequencies of the alleles at the jth locus in the ith population. The prior probability density of allele frequencies (i.e., before sampling) is then (8)

equation M1
1

The posterior probability density of the allele frequencies at the jth locus, conditioned on the alleles observed in a sample from population i, is now determined. Let the vector nji = {n1ji, … , nkjji}, where nhji is the observed number of copies of the hth allele at the jth locus in a sample from the ith population. The posterior probability density of allele frequencies is then

equation M2
2

where

equation M3
3

and we define the total count nji = Σ_{h=1}^{kj} nhji. The marginal distribution of the count vector nji (conditional on the total nji) is

equation M4
4

Eq. 2 then simplifies to

equation M5
5

where θ = nji + 1 and

equation M6
6

Eq. 5 is a Dirichlet probability density function (8) with parameters θ and ah, where h = 1, 2, … , kj.
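The posterior parameters can be illustrated with a short Python sketch. It assumes a symmetric Dirichlet prior with parameter 1/kj for each allele, which is one choice consistent with θ = nji + 1 in the text; the function name and the example counts are hypothetical.

```python
from fractions import Fraction


def posterior_dirichlet(counts):
    """Return (theta, a) for the posterior at one locus in one population.

    counts[h] is the observed number of copies of allele h (nhji).
    ASSUMPTION: the prior is a symmetric Dirichlet with parameter 1/k per
    allele, so that theta = n + 1 and a_h = (n_h + 1/k) / (n + 1).
    """
    k = len(counts)          # number of alleles at the locus (kj)
    n = sum(counts)          # total number of sampled allele copies (nji)
    theta = n + 1
    a = [(Fraction(c) + Fraction(1, k)) / theta for c in counts]
    return theta, a


# Hypothetical sample: three alleles observed 5, 3, and 0 times.
theta, a = posterior_dirichlet([5, 3, 0])
print(theta)
print(a)
```

Under this assumption each ah is the posterior mean frequency of allele h, and an allele absent from the sample still receives a small positive weight.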

Genotype Probabilities

If individual m is born to nonimmigrant parents in population i, the probability that the individual has genotype Xijm at the jth locus, assuming random mating, is

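\[
\Pr(X_{ijm} \mid \mathbf{x}) =
\begin{cases}
x_{hji}^{2}, & X_{ijm} = hh,\\
2\,x_{hji}\,x_{gji}, & X_{ijm} = hg,
\end{cases}
\tag{7}
\]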

for all h = 1, 2, … , kj and g = 1, 2, … , kj where g ≠ h. The actual allele frequencies in population i are unknown and we therefore consider the probability of the genotype for individual m, conditional on the sample of alleles from the ith population, denoted as Pr(Xijm|nji). The genotype of individual m is created by sampling two alleles at random from population i. Since the allele frequencies are not known we use the Dirichlet density of Eq. 5 to describe the probability density of allele frequencies and integrate over all possible allele frequencies

equation M8
8

The marginal probability of the observed genotype, obtained in this way, is equal to the probability of sampling two alleles from a compound multinomial-Dirichlet distribution (7)

equation M9
9

This is the posterior probability that the genotype Xijm is observed for the jth locus when the individual is a nonimmigrant from population i. If the allele frequencies are independent among loci (i.e., there is no linkage disequilibrium), the genotype probabilities at other loci are calculated similarly. The probability of the multilocus genotype Xim = {Xi1m, … , Xilm} of individual m is then a product over the probabilities of the observed allelic configurations for each locus

\[
\Pr(X_{im} \mid \mathbf{n}_{i}) = \prod_{j=1}^{l} \Pr(X_{ijm} \mid \mathbf{n}_{ji})
\tag{10}
\]
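The single-locus and multilocus probabilities for a nonimmigrant can be sketched in Python as follows. This is a minimal illustration, assuming a symmetric Dirichlet prior with parameter 1/kj per allele (consistent with θ = nji + 1); the function names and the data layout (one list of allele counts per locus) are hypothetical.

```python
import math


def locus_genotype_prob(counts, h, g):
    """Posterior probability of the unordered genotype {h, g} at one locus.

    ASSUMPTION: symmetric Dirichlet prior with parameter 1/k per allele,
    giving the compound multinomial-Dirichlet probabilities
      homozygote hh:   (n_h + 1/k)(n_h + 1/k + 1) / ((n + 1)(n + 2))
      heterozygote hg: 2 (n_h + 1/k)(n_g + 1/k) / ((n + 1)(n + 2))
    """
    k = len(counts)
    n = sum(counts)
    denom = (n + 1) * (n + 2)
    a_h = counts[h] + 1.0 / k
    if h == g:
        return a_h * (a_h + 1) / denom
    a_g = counts[g] + 1.0 / k
    return 2 * a_h * a_g / denom


def multilocus_log_prob(sample, genotype):
    """Log of the product over loci, assuming linkage equilibrium."""
    return sum(math.log(locus_genotype_prob(counts, h, g))
               for counts, (h, g) in zip(sample, genotype))


# Hypothetical sample from one population: allele counts at two loci.
sample = [[5, 3, 0], [6, 2]]
genotype = [(0, 1), (0, 0)]
print(multilocus_log_prob(sample, genotype))

# Sanity check: probabilities over all unordered genotypes sum to 1.
total = sum(locus_genotype_prob(sample[0], h, g)
            for h in range(3) for g in range(h, 3))
print(total)
```

Working with log probabilities avoids numerical underflow when the product runs over many loci.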

We now consider situations in which one parent is a resident of population i, and the other is an immigrant born to nonimmigrant parents in population i′. In this case, one allele copy is of immigrant origin. There is generally no prior information regarding the source of an individual’s alleles (chromosomes), and each copy at a locus is therefore equally likely to have been derived from the immigrant source. If we consider the genotype of individual m, born in population i, and denote an immigrant allele from population i′ using a prime symbol, the probability of the mixed genotype X(i,i′)jm at the jth locus is

equation M11
11

where parentheses in the subscripts indicate that the alleles making up the genotype are averaged with respect to possible source populations, and brackets indicate that the alleles making up the genotype are labeled according to their source population. If an individual has alleles h and g at a particular locus, for example, these possibilities are X[i,i′]jm = hg′ and X[i′,i]jm = h′g, where h′ indicates that allele h is derived from population i′ and h indicates that it is derived from population i. Conditional on the allele frequencies, the probabilities of the genotypes, labeled according to source population, are

equation M12

equation M13
12

Sampling a single allele for each population from a multinomial-Dirichlet density with the appropriate population parameters, we obtain

equation M14
13

equation M15

for all h = 1, 2, … , kj and g = 1, 2, … , kj where g ≠ h. This is the posterior probability that the genotype X(i′,i)jm is observed for the jth locus when the individual is of mixed ancestry from populations i and i′. The probability of the multilocus genotype X(i′,i)m = {X(i′,i)1m, … , X(i′,i)lm} is then

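\[
\Pr(X_{(i',i)m} \mid \mathbf{n}_{i}, \mathbf{n}_{i'}) = \prod_{j=1}^{l} \Pr(X_{(i',i)jm} \mid \mathbf{n}_{ji}, \mathbf{n}_{ji'})
\tag{14}
\]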
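The mixed-ancestry case can be sketched in the same way: one allele copy is drawn from the posterior for population i and the other from the posterior for population i′. The code below is an illustration only, assuming a symmetric Dirichlet prior with parameter 1/kj per allele and the same allele set in both populations; all names are hypothetical.

```python
def allele_prob(counts, h):
    """Posterior probability of drawing allele h as a single copy.

    ASSUMPTION: symmetric Dirichlet prior with parameter 1/k per allele,
    so the probability is (n_h + 1/k) / (n + 1).
    """
    k = len(counts)
    return (counts[h] + 1.0 / k) / (sum(counts) + 1)


def mixed_locus_prob(counts_i, counts_ip, h, g):
    """Probability of the unordered genotype {h, g} when one copy comes
    from population i and the other from population i'."""
    p_h, p_g = allele_prob(counts_i, h), allele_prob(counts_i, g)
    q_h, q_g = allele_prob(counts_ip, h), allele_prob(counts_ip, g)
    if h == g:
        return p_h * q_h
    # Either copy may be the immigrant one: hg' or h'g.
    return p_h * q_g + p_g * q_h


# Hypothetical allele counts at one locus in populations i and i'.
counts_i = [20, 4]
counts_ip = [4, 20]
print(mixed_locus_prob(counts_i, counts_ip, 0, 1))

# Sanity check: the probabilities of all unordered genotypes sum to 1.
total = sum(mixed_locus_prob(counts_i, counts_ip, h, g)
            for h in range(2) for g in range(h, 2))
print(total)
```

The multilocus probability is then the product of these single-locus values across loci, under the same linkage-equilibrium assumption as before.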

Identifying Immigrant Genotypes

In this section, we describe a test for detecting individuals born in a population other than the one from which they are sampled; these individuals are first-generation immigrants. Consider an individual m randomly sampled from population i. The probability of the observed multilocus genotype for the individual, given that the individual was born in population i and has no recent immigrant ancestry, is calculated using Eq. 10 above as Pr(Xim|ni). If individual m was instead born in population i′ to parents with no recent immigrant ancestry and subsequently immigrated to population i, then the probability of observing the multilocus genotype of the individual is calculated using Eq. 10 above as Pr(Xim|ni′). The relative probability that the individual was born to parents with no recent immigrant ancestry in population i, rather than population i′, is therefore given by the ratio of the probabilities

\[
\Lambda = \frac{\Pr(X_{im} \mid \mathbf{n}_{i})}{\Pr(X_{im} \mid \mathbf{n}_{i'})}
\tag{15}
\]

In practice, we take logarithms and use the equivalent form

\[
\ln \Lambda = \ln \Pr(X_{im} \mid \mathbf{n}_{i}) - \ln \Pr(X_{im} \mid \mathbf{n}_{i'})
\tag{16}
\]

Positive values of ln Λ indicate that the null hypothesis (that the individual is not an immigrant) is favored, while negative values indicate that the alternative hypothesis (that the individual is an immigrant) is favored. A value of ln Λ = ln(10) = 2.30, for example, indicates that individual m is 10 times more likely to have arisen in population i, while a value of ln Λ = −2.30 indicates the individual is 10 times less likely to have arisen in i than i′. The distribution of the statistic Λ under the null hypothesis (that the individual is not an immigrant) was examined using Monte Carlo simulation (see below).
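As a small worked illustration of the statistic, the sketch below computes ln Λ for one hypothetical genotype and two hypothetical samples. It assumes a symmetric Dirichlet prior with parameter 1/kj per allele; the names and numbers are invented for the example and are not taken from the data analyzed here.

```python
import math


def locus_prob(counts, h, g):
    """Posterior probability of genotype {h, g} for a nonimmigrant.
    ASSUMPTION: symmetric Dirichlet prior with parameter 1/k per allele."""
    k = len(counts)
    n = sum(counts)
    denom = (n + 1) * (n + 2)
    a_h = counts[h] + 1.0 / k
    if h == g:
        return a_h * (a_h + 1) / denom
    return 2 * a_h * (counts[g] + 1.0 / k) / denom


def log_prob(sample, genotype):
    """Multilocus log probability (product over loci, in logs)."""
    return sum(math.log(locus_prob(c, h, g))
               for c, (h, g) in zip(sample, genotype))


def ln_lambda(sample_i, sample_ip, genotype):
    """ln of the ratio Pr(X | n_i) / Pr(X | n_i')."""
    return log_prob(sample_i, genotype) - log_prob(sample_ip, genotype)


# Hypothetical allele counts at two loci in populations i and i'.
sample_i = [[20, 4], [15, 9]]
sample_ip = [[4, 20], [9, 15]]
genotype = [(0, 0), (0, 1)]   # individual sampled in population i

stat = ln_lambda(sample_i, sample_ip, genotype)
print(stat)             # positive values favor birth in population i
print(math.exp(stat))   # how many times more likely under population i
```

Swapping the roles of the two samples changes only the sign of the statistic, which is why values of 2.30 and −2.30 correspond to tenfold support in opposite directions.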

We now describe a test for detecting an individual with a single parent that is an immigrant, or is descended from an immigrant. In this case, one allele copy at each locus is of possible immigrant origin and the other is of local origin. For l independent loci, the probability of observing the genotype of individual m, given that the individual was born in population i and has an immigrant parent from population i′, is calculated using Eq. 14 as Pr(X(i′,i)m|ni, ni′).

The individual might instead have an ancestor d generations in the past that was an immigrant. The probability of the observed genotype X(i′,i)jm at the jth locus under this hypothesis is

equation M19
17

For l independent loci, the posterior probability of observing the genotype of individual m, given that this individual has an immigrant ancestor d generations removed from population i′, is

equation M20
18

The relative probability that individual m, born in population i, did not have an immigrant ancestor from population i′ at generation d in the past is

equation M21
19

We again use logarithms to calculate this statistic as ln Λd. The analysis can be extended to consider individuals of mixed parentage over several generations, but the number of possible outcomes makes an exhaustive analysis difficult. In certain cases, when fewer ancestral immigration patterns are possible, based on prior information, the method outlined above might be extended to decide among the possible alternatives.
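One way to read the d-generation hypothesis can be sketched as follows. The sketch assumes unlinked loci, a single immigrant ancestor d ≥ 1 generations back (d = 1 for a parent, d = 2 for a grandparent), and that at each locus one allele copy descends from that ancestor with probability (1/2)^(d−1); the genotype probability at a locus is then a mixture of the mixed-ancestry and nonimmigrant probabilities. This mixture form, the symmetric 1/kj prior, and all names are assumptions made for illustration and are not a transcription of Eq. 17.

```python
def allele_prob(counts, h):
    """Single-copy posterior probability of allele h.
    ASSUMPTION: symmetric Dirichlet prior with parameter 1/k per allele."""
    return (counts[h] + 1.0 / len(counts)) / (sum(counts) + 1)


def resident_locus_prob(counts, h, g):
    """Genotype probability when both copies come from population i."""
    k, n = len(counts), sum(counts)
    a_h = counts[h] + 1.0 / k
    denom = (n + 1) * (n + 2)
    if h == g:
        return a_h * (a_h + 1) / denom
    return 2 * a_h * (counts[g] + 1.0 / k) / denom


def mixed_locus_prob(counts_i, counts_ip, h, g):
    """Genotype probability when one copy comes from each population."""
    if h == g:
        return allele_prob(counts_i, h) * allele_prob(counts_ip, h)
    return (allele_prob(counts_i, h) * allele_prob(counts_ip, g)
            + allele_prob(counts_i, g) * allele_prob(counts_ip, h))


def ancestor_locus_prob(counts_i, counts_ip, h, g, d):
    """ASSUMED mixture for an immigrant ancestor d >= 1 generations back."""
    if d < 1:
        raise ValueError("d must be at least 1 (d = 1 is an immigrant parent)")
    w = 0.5 ** (d - 1)   # chance one copy traces back to the immigrant
    return (w * mixed_locus_prob(counts_i, counts_ip, h, g)
            + (1 - w) * resident_locus_prob(counts_i, h, g))


# Hypothetical allele counts at one locus in populations i and i'.
counts_i, counts_ip = [20, 4], [4, 20]
for d in (1, 2, 3):
    print(d, ancestor_locus_prob(counts_i, counts_ip, 1, 1, d))
```

Under this reading the alternative hypothesis approaches the null as d grows, which matches the loss of power for more distant ancestors.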

Critical Region and Power of Tests

The critical (rejection) region for the test statistic calculated using the methods described in the preceding sections contains all values of the statistic such that Λ < C, where C is chosen to satisfy Pr(Λ < C) = α under the null hypothesis. For a specified value of C, the value of α is given by

equation M22
20

where

equation M23
21

and the sum is over the total number of possible genotype configurations G = ∏_{j=1}^{l} (kj + 1)kj/2. A Monte Carlo estimator of α is

equation M24
22

where Xi(r) is the rth simulated genotype with R genotypes simulated in total from the posterior probability distribution Pr(Xih|ni). The random variables Xi(r) can be generated using the following procedure: for the jth locus, generate the first allele by assigning to allele type h the probability

equation M25
23

If the first allele is of type h, generate the second allele by assigning to allele type h the probability

equation M26
24

or to allele type g ≠ h the probability

equation M27
25
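The two-step procedure for generating a genotype can be sketched as sequential sampling in which the weight of the first allele drawn is increased by one before the second draw. The sketch assumes a symmetric Dirichlet prior with parameter 1/kj per allele, so that the first allele h has probability (nhji + 1/kj)/(nji + 1), a repeat of h has probability (nhji + 1/kj + 1)/(nji + 2), and a different allele g has probability (ngji + 1/kj)/(nji + 2). Function names, the seed, and the example counts are hypothetical.

```python
import random


def simulate_locus_genotype(counts, rng):
    """Draw one genotype at one locus from the posterior for a population.

    ASSUMPTION: symmetric Dirichlet prior with parameter 1/k per allele.
    Returns the genotype as a sorted pair of allele indices.
    """
    k = len(counts)
    weights = [c + 1.0 / k for c in counts]
    first = rng.choices(range(k), weights=weights)[0]
    weights[first] += 1.0            # the drawn copy raises that allele's weight
    second = rng.choices(range(k), weights=weights)[0]
    return tuple(sorted((first, second)))


def simulate_genotype(sample, rng):
    """Draw a multilocus genotype, one locus at a time (unlinked loci)."""
    return [simulate_locus_genotype(counts, rng) for counts in sample]


rng = random.Random(1)               # fixed seed for repeatability
counts = [5, 3, 0]                   # hypothetical allele counts at one locus
draws = [simulate_locus_genotype(counts, rng) for _ in range(20000)]
freq_00 = draws.count((0, 0)) / len(draws)

# Exact probability of the homozygote 00 under the same assumption.
a0 = counts[0] + 1.0 / 3
exact_00 = a0 * (a0 + 1) / ((sum(counts) + 1) * (sum(counts) + 2))
print(freq_00, exact_00)
```

The simulated frequency of each genotype should approach its compound multinomial-Dirichlet probability as the number of draws grows.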

It is also possible to determine C by generating a set of genotypes as outlined above and considering the value of the test statistic that falls below 100(1 − α) percent of the values for the simulated genotypes (see Fig. 1). The power of the test to reject the null hypothesis when it is false, for a specified critical region α, is

equation M28
26

where

equation M29
27

where C(α) is the value of C that specifies the critical region with probability α determined using Eq. 20. A Monte Carlo estimator of the power β is

equation M30
28

where Xi(r) is the rth simulated genotype, with R genotypes simulated in total from the posterior probability distribution Pr(Xih|ni). The power of the test is illustrated graphically as the overlap between the distributions of the statistic generated by simulating genotypes under the null and alternative hypotheses (see Fig. 2).
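The Monte Carlo estimates of the critical value and the power can be sketched together for the first-generation immigrant test. The sketch assumes a symmetric Dirichlet prior with parameter 1/kj per allele, simulates null genotypes from the posterior for population i, and simulates alternative genotypes from the posterior for population i′. The allele counts, seed, and function names are hypothetical, and this is not the C program mentioned under Program availability.

```python
import math
import random


def alpha_params(counts):
    """ASSUMED posterior weights n_h + 1/k (symmetric 1/k prior)."""
    return [c + 1.0 / len(counts) for c in counts]


def locus_prob(counts, h, g):
    """Posterior probability of the unordered genotype {h, g}."""
    a, n = alpha_params(counts), sum(counts)
    denom = (n + 1) * (n + 2)
    if h == g:
        return a[h] * (a[h] + 1) / denom
    return 2 * a[h] * a[g] / denom


def ln_lambda(sample_i, sample_ip, genotype):
    """ln Pr(X | n_i) - ln Pr(X | n_i') summed over loci."""
    return sum(math.log(locus_prob(ci, h, g)) - math.log(locus_prob(cp, h, g))
               for ci, cp, (h, g) in zip(sample_i, sample_ip, genotype))


def simulate_genotype(sample, rng):
    """Sequentially draw two allele copies per locus from the posterior."""
    genotype = []
    for counts in sample:
        w = alpha_params(counts)
        h = rng.choices(range(len(w)), weights=w)[0]
        w[h] += 1.0
        g = rng.choices(range(len(w)), weights=w)[0]
        genotype.append((h, g))
    return genotype


def simulate_stats(source, sample_i, sample_ip, R, rng):
    """ln Lambda for R genotypes simulated from the `source` population."""
    return sorted(ln_lambda(sample_i, sample_ip, simulate_genotype(source, rng))
                  for _ in range(R))


def rejection_rate(stats, C):
    """Fraction of simulated statistics falling in the critical region."""
    return sum(s < C for s in stats) / len(stats)


rng = random.Random(7)
sample_i = [[20, 4]] * 10     # hypothetical: 10 loci, resident population
sample_ip = [[4, 20]] * 10    # hypothetical: 10 loci, potential source
alpha, R = 0.05, 2000

null_stats = simulate_stats(sample_i, sample_i, sample_ip, R, rng)
C = null_stats[int(alpha * R)]            # lower alpha quantile of ln Lambda
alpha_hat = rejection_rate(null_stats, C)

alt_stats = simulate_stats(sample_ip, sample_i, sample_ip, R, rng)
power_hat = rejection_rate(alt_stats, C)
print(C, alpha_hat, power_hat)
```

Because ln Λ takes discrete values, the realized rejection rate under the null can fall below the nominal α. For tests of more distant ancestry the alternative genotypes would be simulated under the corresponding mixed-ancestry hypothesis instead.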

Figure 1
Illustration of Monte Carlo method for examining significance of test statistic ln Λ for comparison of Australian (sample) and New Guinean (potential source) populations. Histogram of 1,000 values of the ln-probability difference generated by ...
Figure 2
Histograms indicating the power of the immigration tests for two cases. (a) The hypothesis that an Australian individual is an immigrant (d = 0) from New Guinea is considered. The shaded columns represent the distribution of ln Λ generated given ...

Application

We have applied our method to a set of 12 individuals from each of four human populations. We chose to compare two population samples with quite low genetic differentiation (9) and two population samples with quite high genetic differentiation from a set of population samples studied previously (10). The samples with low differentiation are from an Australian population and a New Guinean population (FST distance = 0.056). The samples with high differentiation are from a Japanese population and a Senegalese population (FST distance = 0.232). The Australian sample was collected from a coastal region of Australia, and the New Guinea sample from the highland region of New Guinea. The Japanese sample consists of individuals born in Japan and was collected in the San Francisco Bay Area (11). The Senegalese sample consists of Niokolonke individuals of the Mandenka population collected in southeastern Senegal (10). These 48 individuals have been typed at approximately 50 loci (12) by using RFLPs. The physical locations of the loci suggest that most are unlinked. Multiple restriction enzymes were used to type several of the loci so that the total number of genetic markers was approximately 75. The procedures employed in the sampling and the genetic analysis are described in detail elsewhere (11).

The power of the test to detect immigrants depends on the extent of differentiation between the populations compared (Table 1) as well as the number of loci examined and the number of individuals sampled (unpublished observations). A test of the hypothesis that an individual is an immigrant has high power in all the population comparisons. A test of the hypothesis that an individual has an immigrant parent has lower power for a comparison of individuals from the Australian and New Guinean samples than for a comparison of individuals from the Japanese and Senegalese samples. The test has power to detect an immigrant ancestor through the grandparent generation for a comparison of individuals from Japan and Senegal.

Table 1
Power of posterior probability ratio tests for recent immigration, with α = 0.05

The distribution of the statistic under Monte Carlo simulation (Fig. 2) illustrates the power of the tests. In Fig. 2a, individuals sampled in Australia are postulated to have immigrated from New Guinea. There is little overlap between the distribution of the test statistic generated by Monte Carlo simulation under the null hypothesis that an individual was born in the Australian population (at right of Fig. 2a) and that under the alternative hypothesis that an individual is an immigrant from the New Guinea population (at left of Fig. 2a). In Fig. 2b, individuals in the Australian sample have a single parent that is an immigrant from New Guinea under the alternative hypothesis. In this case there is more overlap between the distributions generated under the null and alternative hypotheses, indicating that the test has reduced power by comparison with the test for detecting first-generation immigrants (i.e., Fig. 2a).

We applied the test to predict whether individuals sampled in Australia have New Guinean ancestry, and vice versa, and whether individuals sampled in Japan have African ancestry, and vice versa. A total of four individuals from the complete set of 48 comparisons produced significant test statistics at some level of ancestry (Table 2). Three of the four individuals (Australia 1, Australia 2, and Australia 3) who appeared to be immigrants, or descended from immigrants, were drawn from the Australian population, which appears likely to have experienced recent exchanges of immigrants (11). In the case of three individuals (Australia 1, Australia 3, and Japanese 1) it appears possible that an ancestor two or more generations removed was an immigrant, whereas in the case of one individual (Australia 2) it appears most probable that the individual is a first-generation immigrant. Given these results, one might consider excluding individual Australia 1, for example, from the Australian population sample for evolutionary studies, as it is quite probable that this individual has recent immigrant ancestry.

Table 2
Power of the posterior probability ratio test to detect immigrant ancestry: Four individuals with posterior probability ratios indicating possible immigration (α < 0.05)

Discussion

The test for detecting recent immigration developed in this paper provides information relevant to a wide range of problems in population biology and human genetics. In the area of human genetics, for example, the method may be used to identify individuals whose genomes are not typical of the populations in which they currently live, or of their ethnic group. This may be helpful in genetic counselling. In the area of evolutionary biology, it is often important to identify immigrant individuals to study their behavior and interactions with resident individuals. It may also be important to quantify the amount of recent immigration in populations that are not at genetic equilibrium. In the field of conservation genetics, this test may be useful for identifying the population of origin for zoo animals whose history is poorly known to implement successful captive breeding programs.

At least three potentially misleading results may arise when applying the method considered here. First, the failure to reject the null hypothesis that an individual is not an immigrant, or descended from immigrants, may simply reflect the fact that the appropriate populations for comparison were not included in the analysis. Second, an individual might incorrectly appear to have originated in a particular population other than the one from which it was sampled. This might be due to similarities in allele frequencies, due to long-term gene flow, between that population and a third population from which the individual actually originated, but which was not included in the sample of populations. Third, the fact that many pairwise comparisons between populations are performed for each of a large number of individuals means that some individuals will appear to be immigrants purely by chance. This can be corrected for by using smaller values for α.

The analyses of human populations presented in this paper show that, even with a sample of only 60 independent loci, the method we have proposed has power to detect immigrant ancestry up to two generations in the past. This is despite our conservative correction for uncertainties of allele frequencies. A larger number of loci will increase the power and could allow a single immigrant great-grandparent (out of 8 total), or a single immigrant great-great-grandparent (out of 16 total), to be identified. The precise number of loci needed to obtain a given level of power depends on the degree of genetic differentiation between populations; with greater differentiation, fewer loci are needed to obtain the same level of power. Computer simulations should prove useful in exploring the statistical performance of the method more generally.

Program availability.

A program written in the C computer language for performing the calculations described in this paper is available by anonymous ftp from mw511.biol.berkeley.edu in directory /pub, or on the World-Wide Web at site http://mw511.biol.berkeley.edu/homepage.html.

Acknowledgments

This research was supported, in part, by a National Institutes of Health Grant (GM40282) to Montgomery Slatkin and by a postdoctoral fellowship from the Natural Sciences and Engineering Research Council of Canada to B.R.

ABBREVIATION

RFLP, restriction fragment length polymorphism

References

1. Wright S. Genetics. 1931;16:97–159.
2. Kimura M. Annu Rep Natl Inst Genet. 1953;3:63.
3. Maruyama T. Theor Pop Biol. 1970;1:273–306.
4. Slatkin M. Annu Rev Ecol System. 1985;16:393–430.
5. Slatkin M, Barton N H. Evolution. 1989;43:1349–1368.
6. Weir B S, Cockerham C C. Evolution. 1984;38:1358–1370.
7. Rannala B, Hartigan J A. Genet Res. 1996;67:147–158.
8. Johnson N L, Kotz S. Distributions in Statistics: Continuous Multivariate Distributions. New York: Wiley; 1970.
9. Reynolds J, Weir B S, Cockerham C C. Genetics. 1983;105:767–779.
10. Poloni E S, Excoffier L, Mountain J L, Langaney A, Cavalli-Sforza L L. Ann Hum Genet. 1995;59:43–61.
11. Lin A A, Hebert J M, Mountain J L, Cavalli-Sforza L L. Gene Geography. 1994;8:191–214.
12. Mountain J L. Ph.D. thesis. Stanford, CA: Stanford University; 1994.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences