NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Griffiths AJF, Gelbart WM, Miller JH, et al. Modern Genetic Analysis. New York: W. H. Freeman; 1999.

  • By agreement with the publisher, this book is accessible by the search feature, but cannot be browsed.
Covariance and correlation

Another statistical notion that is of use in the study of quantitative genetics is the association, or correlation, between variables. As a result of complex paths of causation, many variables in nature vary together but in an imperfect or approximate way. Figure 18-15a provides an example, showing the lengths of two particular teeth in several individual specimens of a fossil mammal, Phenacodus primaevis. The longer an individual’s first lower molar is, the longer its second molar is, but the relation between the two teeth is imprecise. Figure 18-15b shows that the total length and tail length in individual snakes (Lampropeltis polyzona) are quite closely related to each other.

Figure 18-15. Scatter diagrams of relations between pairs of variables. (a) Relation between the lengths of the first and second lower molars (M1 and M2) in the extinct mammal Phenacodus primaevis. Each point gives the M1 and M2 measurements for one individual. (b) Relation between tail length and total body length in individual snakes (Lampropeltis polyzona).

The usual measure of the precision of a relation between two variables x and y is the correlation coefficient (rxy). It is calculated in part from the product of the deviation of each observation of x from the mean of the x values and the deviation of each observation of y from the mean of the y values—a quantity called the covariance of x and y (cov xy):

$$\operatorname{cov} xy = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)$$

A formula that is exactly algebraically equivalent but that makes computation easier is

$$\operatorname{cov} xy = \frac{1}{N}\sum_{i=1}^{N} x_i y_i - \bar{x}\,\bar{y}$$

Using this formula, we can calculate the covariance between the right (x) and the left (y) leg counts in Table 18-2.

[worked calculation of cov xy from the Table 18-2 leg counts]
The correlation, rxy, is defined as
$$r_{xy} = \frac{\operatorname{cov} xy}{s_x s_y}$$

In the formula for correlation, the products of the deviations are divided by the product of the standard deviations of x and y (sx and sy). This normalization by the standard deviations has the effect of making rxy a dimensionless number that is independent of the units in which x and y are measured. So defined, rxy will vary from −1, which signifies a perfectly linear negative relation between x and y, to +1, which indicates a perfectly linear positive relation between x and y. If rxy = 0, there is no linear relation between the variables. It is important to notice, however, that even when there is no linear relation between two variables, a regular nonlinear relation between them may allow one variable to be predicted perfectly from the other. Consider, for example, the parabola shown in Figure 18-16. The values of y are perfectly predictable from the values of x; yet rxy = 0 because, on average over the whole range of x values, larger x values are associated with neither larger nor smaller y values. The data in Figure 18-15a and b have rxy values of 0.82 and 0.99, respectively. In the example of the sex comb teeth of Table 18-2, the correlation between left and right legs is

[worked calculation of r_xy for the Table 18-2 leg counts]

Figure 18-16. A parabola. Each value of y is perfectly predictable from the value of x, but there is no linear correlation.

a very small value.
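These definitions are easy to check numerically. The sketch below (plain Python with made-up illustrative data, not the Table 18-2 counts) computes cov xy and rxy exactly as defined above, and confirms that a perfect parabola over a symmetric range of x values gives rxy = 0:

```python
import math

def cov(x, y):
    # cov xy = (1/N) * sum of (x_i - xbar)(y_i - ybar)
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n

def corr(x, y):
    # r_xy = cov xy / (s_x * s_y); s is the N-divisor standard deviation,
    # obtainable as the square root of a variable's covariance with itself
    sx = math.sqrt(cov(x, x))
    sy = math.sqrt(cov(y, y))
    return cov(x, y) / (sx * sy)

# A symmetric parabola: y is perfectly predictable from x, yet r_xy = 0,
# because larger x is associated with neither larger nor smaller y on average.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]
print(corr(x, y))  # → 0.0
```

The computational form of the covariance (mean of the products minus the product of the means) gives the same value; it merely avoids computing each deviation separately.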

Correlation and equality

It is important to notice that correlation between two sets of numbers is not the same as numerical identity. For example, two sets of values can be perfectly correlated, even though the values in one set are very much larger than the values in the other set. Consider the following pairs of values:

[table of paired x and y values, each y about 20 units greater than its x]

The variables x and y in the pairs are perfectly correlated (r = +1.0), although each value of y is about 20 units greater than the corresponding value of x. Two variables are perfectly correlated if, for a unit increase in one, there is a constant increase in the other (or a constant decrease if r is negative). The importance of the difference between correlation and identity arises when we consider the effect of environment on heritable characters. Parents and offspring can be perfectly correlated in some trait such as height, yet, because of an environmental difference between generations, every child can be taller than the parents. This phenomenon appears in adoption studies, in which children may be correlated with their biological parents but, on the average, may be quite different from the parents as a result of a change in social situation.
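The point can be verified directly. In the sketch below (plain Python; the pairs are our own illustration, not the values from the original table), each y is exactly 20 units above its x, so no y equals its x, yet r = +1.0:

```python
import math

def pearson_r(x, y):
    # r = cov(x, y) / (s_x * s_y); the 1/N factors cancel,
    # so plain sums of squared and cross deviations suffice
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pairs: y = x + 20 exactly, so the relation is perfectly
# linear and r = +1.0 even though the two sets of values never coincide.
x = [1.0, 4.0, 7.0, 10.0, 13.0]
y = [xi + 20 for xi in x]
print(pearson_r(x, y))  # → 1.0
```

Adding a constant shifts every value but leaves every deviation from the mean unchanged, which is why the correlation is unaffected.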


Regression

The measurement of correlation provides us with only an estimate of the precision of the relation between two variables. A related problem is predicting the value of one variable given the value of the other. If x increases by two units, by how much will y increase? If the two variables are linearly related, then that relation can be expressed as

$$y = bx + a$$

where b is the slope of the line relating y to x and a is the y intercept of that line.

Figure 18-17 shows a scatter diagram of points for two variables, y and x, together with a straight line expressing the general linear trend of y with increasing x. This line, called the regression line of y on x, has been positioned so that the deviations of the points from the line are as small as possible. Specifically, if Δy is the distance of any point from the line in the y direction, then the line has been chosen so that

$$\sum (\Delta y)^2 = \text{a minimum}$$

Figure 18-17. A scatter diagram showing the relation between two variables, x and y, with the regression line of y on x. This line, with a slope of 2/4, minimizes the squares of the deviations (Δy).

Any other straight line passed through the points on the scatter diagram will have a larger total squared deviation of the points from it.

Obviously, we cannot find this least-squares line by trial and error. It turns out, however, that if the slope b of the line is calculated as

$$b = \frac{\operatorname{cov} xy}{s_x^2}$$

and if a is then calculated from

$$a = \bar{y} - b\bar{x}$$

so that the line passes through the point (x̄, ȳ), then these values of b and a will yield the least-squares prediction equation.

Note that the prediction equation cannot predict y exactly for a given x, because there is scatter around the line. The equation predicts the average y for a given x, if large samples are taken.
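As a concrete illustration, here is a minimal sketch in plain Python of the least-squares calculation just described. The data are hypothetical points scattered around a known line, not an example from the book:

```python
def least_squares(x, y):
    # b = cov(x, y) / s_x^2 ;  a = ybar - b * xbar
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - xbar) ** 2 for xi in x) / n
    b = cov_xy / var_x
    a = ybar - b * xbar          # forces the line through (xbar, ybar)
    return a, b

# Hypothetical points scattered around the line y = 3 + 0.5x:
x = [0, 2, 4, 6, 8]
y = [3.1, 3.9, 5.2, 5.8, 7.0]
a, b = least_squares(x, y)
print(a, b)  # close to the intercept 3 and slope 0.5 of the underlying line
```

Because the fitted line passes through (x̄, ȳ), predicting y at x = x̄ always returns the sample mean of y; away from x̄, the prediction is the average y expected for that x, not the exact y of any one point.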

Samples and populations

The preceding sections have described the distributions and some statistics of particular assemblages of individuals that have been collected in some experiments or sets of observations. For some purposes, however, we are not really interested in the particular 100 undergraduates or 27 snakes that have been measured. Instead, we are interested in a wider world of phenomena, of which those particular individuals are representative. Thus, we might want to know the average seed weight in general of plants of the species Crinum longifolium. That is, we are interested in the characteristics of a universe, of which our small collection of observations is only a sample. The characteristics of any particular sample are not identical with those of the universe but vary from sample to sample.

We can use the sample mean to estimate the true mean of the universe, but the sample variance and covariance are, on average, a little smaller than the true values in the universe. The reason is that the deviations from the sample mean are not all independent of one another: the same data are used both to calculate the sample mean and to calculate the deviations from that mean. It is simple to correct for this bias. Whenever we are interested in the variance of a set of measurements, not as a characteristic of the particular collection but as an estimate of the universe that the sample represents, then the appropriate quantity to use, rather than s2 itself, is [N/(N − 1)]s2. Note that this new quantity is equivalent to dividing the sum of squared deviations by N − 1 instead of N in the first place, so

$$\frac{N}{N-1}\,s^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2$$

All these considerations about bias also apply to the sample covariance. In the formula for the correlation coefficient (page 599), however, the factor N/(N − 1) would appear in both the numerator and the denominator and therefore cancel out, so we can ignore it for the purposes of computation.
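The effect of the N versus N − 1 divisor is easy to see numerically. The sketch below (plain Python, with a hypothetical data set) computes both versions of the variance:

```python
def variance(v, sample=False):
    # Divide by N to describe the collection itself; divide by N - 1
    # when estimating the variance of the universe the sample came from.
    n = len(v)
    m = sum(v) / n
    ss = sum((vi - m) ** 2 for vi in v)
    return ss / (n - 1) if sample else ss / n

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(data))                # → 4.0  (divisor N)
print(variance(data, sample=True))   # ≈ 4.5714  (divisor N - 1)
```

The N − 1 version is always the larger of the two, and the difference shrinks as N grows, which is why the correction matters most for small samples.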


Copyright © 1999, W. H. Freeman and Company.
Bookshelf ID: NBK21288

