Figure 18-15. Scatter diagrams of relations between pairs of variables. (a) Relation between
the lengths of the first and second lower molars (M1 and M2) in the extinct mammal Phenacodus
primaevis. Each point gives the M1 and M2 measurements for one individual. (b) Tail length and
body length of 18 individuals of the snake Lampropeltis polyzona. (Image: Negative #2430,
Phenacodus, painting by Charles Knight; courtesy Dept. of Library Services, American Museum of
Natural History. Photo: Animals Animals/© Zig Leszczynski)
Another statistical notion that is of use in the study of quantitative genetics is the
association, or correlation, between variables. As a result of complex paths of causation,
many variables in nature vary together but in an imperfect or approximate way. Figure 18-15a
provides an example, showing the lengths of two particular teeth in several individual
specimens of a fossil mammal, Phenacodus primaevis. The longer an individual's first lower
molar is, the longer its second molar is, but the relation between the two teeth is imprecise.
Figure 18-15b shows that the total length and tail length in individual snakes (Lampropeltis
polyzona) are quite closely related to each other.
The usual measure of the precision of a relation between two variables x and y is the
correlation coefficient (rxy). It is calculated in part from the average product of the
deviation of each observation of x from the mean of the x values and the deviation of the
corresponding observation of y from the mean of the y values, a quantity called the covariance
of x and y (cov xy). For N paired observations,

$$\operatorname{cov}\,xy = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})$$
A formula that is algebraically equivalent but that makes computation easier is

$$\operatorname{cov}\,xy = \frac{1}{N}\sum_{i=1}^{N} x_i y_i - \bar{x}\,\bar{y}$$
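The equivalence of the definition and the computational form can be checked by expanding the product of deviations. The short derivation below is a sketch that assumes the covariance is divided by N, as the variance is elsewhere in this section, and uses the facts that the x values sum to N times their mean and the y values sum to N times their mean.

$$
\begin{aligned}
\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})
&= \frac{1}{N}\sum_{i=1}^{N} x_i y_i
 - \bar{x}\,\frac{1}{N}\sum_{i=1}^{N} y_i
 - \bar{y}\,\frac{1}{N}\sum_{i=1}^{N} x_i
 + \bar{x}\,\bar{y} \\
&= \frac{1}{N}\sum_{i=1}^{N} x_i y_i - \bar{x}\,\bar{y} - \bar{x}\,\bar{y} + \bar{x}\,\bar{y} \\
&= \frac{1}{N}\sum_{i=1}^{N} x_i y_i - \bar{x}\,\bar{y}
\end{aligned}
$$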
Using this formula, we can calculate the covariance between the right (x) and the left (y) leg
counts in Table 18-2.
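As a rough illustration of such a calculation, the following Python sketch computes a covariance both from the products of deviations and from the computational shortcut. The counts are hypothetical stand-ins, not the values in Table 18-2, and the function names are invented for this example.

```python
def covariance_from_deviations(xs, ys):
    """Mean product of the deviations of x and y from their means (divisor N)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def covariance_shortcut(xs, ys):
    """Mean of the products x*y minus the product of the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y


# Hypothetical right-leg (x) and left-leg (y) counts for six individuals.
right = [6, 7, 5, 8, 6, 7]
left = [5, 7, 6, 7, 6, 8]

cov_definition = covariance_from_deviations(right, left)
cov_shortcut = covariance_shortcut(right, left)

# The two forms agree up to floating-point rounding.
print(cov_definition, cov_shortcut)
```

The shortcut needs only running sums of x, y, and xy, which is why it is the easier form for hand calculation.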
The correlation, rxy, is defined as

$$r_{xy} = \frac{\operatorname{cov}\,xy}{s_x s_y}$$
Figure 18-16. A parabola. Each value of y is perfectly predictable from the value of x, but
there is no linear correlation.
In the formula for correlation, the products of the deviations are divided by the product of
the standard deviations of x and y (sx and sy). This normalization by the standard deviations
has the effect of making rxy a dimensionless number that is independent of the units in which
x and y are measured. So defined, rxy will vary from −1, which signifies a perfectly linear
negative relation between x and y, to +1, which indicates a perfectly linear positive relation
between x and y. If rxy = 0, there is no linear relation between the variables. It is
important to notice, however, that sometimes when there is no linear relation between two
variables but there is a regular nonlinear relation between them, one variable may be
perfectly predicted from the other. Consider, for example, the parabola shown in Figure 18-16.
The values of y are perfectly predictable from the values of x; yet rxy = 0 because, on
average over the whole range of x values, larger x values are not associated with either
larger or smaller y values. The data in Figure 18-15a and b have rxy values of 0.82 and 0.99,
respectively. In the example of the sex comb teeth of Table 18-2, the correlation between left
and right legs is a very small value.
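The behavior of rxy described above can be sketched in a few lines of Python. The data are hypothetical and chosen only to show the three situations: a perfect straight line, the same line with x expressed in different units, and a parabola that is symmetric about x = 0. The function name is an assumption made for this example.

```python
import math


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations (divisor N)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov_xy / (s_x * s_y)


xs = [-3, -2, -1, 0, 1, 2, 3]

# A perfectly linear positive relation: r is +1 (up to rounding).
line = [2 * x + 1 for x in xs]
r_line = correlation(xs, line)

# Re-expressing x in other units (here multiplied by 10) leaves r unchanged.
xs_rescaled = [10 * x for x in xs]
r_rescaled = correlation(xs_rescaled, line)

# A parabola: y is perfectly predictable from x, yet there is no linear relation.
parabola = [x * x for x in xs]
r_parabola = correlation(xs, parabola)

print(r_line, r_rescaled, r_parabola)
```

The parabola gives zero here because the x values are placed symmetrically around the vertex; over a range that covers only one arm of the curve, rxy would not be zero.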
Correlation and equality
It is important to notice that correlation between two sets of numbers is not the same as
numerical identity. For example, two sets of values can be perfectly
correlated, even though the values in one set are very much larger than the
values in the other set. Consider the following pairs of values:

The variables x and y in the pairs are perfectly correlated
(r = +1.0), although each value of y is about 20 units
greater than the corresponding value of x. Two variables are perfectly
correlated if, for a unit increase in one, there is a constant increase in the other (or a
constant decrease if r is negative). The importance of the difference between
correlation and identity arises when we consider the effect of environment on heritable
characters. Parents and offspring can be perfectly correlated in some trait such as height,
yet, because of an environmental difference between generations, every child can be taller than
the parents. This phenomenon appears in adoption studies, in which children may be correlated
with their biological parents but, on the average, may be quite different from the parents as a
result of a change in social situation.
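The distinction between correlation and identity can be made concrete with a small Python sketch. The heights are hypothetical, and the constant 5-unit shift between generations is an assumption chosen only to stand in for an environmental difference; the function name is likewise invented for this example.

```python
import math


def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations (divisor N)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov_xy / (s_x * s_y)


# Hypothetical parental heights and offspring heights shifted upward by a
# constant amount, as an environmental change between generations might do.
parents = [160, 165, 170, 175, 180]
offspring = [p + 5 for p in parents]

r = correlation(parents, offspring)
mean_difference = sum(offspring) / len(offspring) - sum(parents) / len(parents)
every_child_taller = all(c > p for p, c in zip(parents, offspring))

# r is +1 (up to rounding) even though no offspring value equals its parental value.
print(r, mean_difference, every_child_taller)
```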
Regression
The measurement of correlation provides us with only an estimate of the precision of the
relation between two variables. A related problem is predicting the value of one variable
given the value of the other. If x increases by two units, by how much will y increase? If the
two variables are linearly related, then that relation can be expressed as

$$y = bx + a$$

where b is the slope of the line relating y to x and a is the y intercept of that line.
Figure 18-17. A scatter diagram showing the relation between two variables, x and y, with the
regression line of y on x. This line, with a slope of 2/4, minimizes the sum of the squared
deviations (Δy).
Figure 18-17 shows a scatter diagram of points for two variables, y and x, together with a
straight line expressing the general linear trend of y with increasing x. This line, called
the regression line of y on x, has been positioned so that the deviations of the points from
the line are as small as possible. Specifically, if Δy is the distance of any point from the
line in the y direction, then the line has been chosen so that

$$\sum (\Delta y)^2 = \text{a minimum}$$
Any other straight line passed through the points on the scatter diagram will have a larger
total squared deviation of the points from it.
Obviously, we cannot find this least-squares line by trial and error. It turns out, however,
that, if slope b of the line is calculated by

$$b = \frac{\operatorname{cov}\,xy}{s_x^2}$$

and if a is then calculated from

$$a = \bar{y} - b\,\bar{x}$$

so that the line passes through the point $(\bar{x}, \bar{y})$, then these values of b and a
will yield the least-squares prediction equation.
Note that the prediction equation cannot predict y exactly for a given
x, because there is scatter around the line. The equation predicts the
average y for a given x, if large samples are taken.
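The least-squares recipe can be illustrated with a minimal Python sketch: the slope is taken as the covariance divided by the variance of x, and the intercept is chosen so that the line passes through the point of means. The data points and function names are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = b*x + a fitted by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    b = cov_xy / var_x          # slope
    a = mean_y - b * mean_x     # intercept: line passes through the means
    return a, b


def sum_squared_deviations(xs, ys, a, b):
    """Sum of squared vertical distances of the points from the line."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

a, b = least_squares_line(xs, ys)
best = sum_squared_deviations(xs, ys, a, b)

# Nudging the slope or the intercept away from the fitted values
# makes the total squared deviation larger.
steeper = sum_squared_deviations(xs, ys, a, b + 0.1)
shifted = sum_squared_deviations(xs, ys, a + 0.1, b)

print(a, b, best, steeper, shifted)
```

For these points the fitted slope is about 0.97 and the intercept about 1.09, so an increase of two units in x predicts an average increase of about 1.94 units in y.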
Samples and populations
The preceding sections have described the distributions and some statistics of particular
assemblages of individuals that have been collected in some experiments or sets of
observations. For some purposes, however, we are not really interested in the particular 100
undergraduates or 27 snakes that have been measured. Instead, we are interested in a wider
world of phenomena, of which those particular individuals are representative. Thus, we might
want to know the average seed weight in general of plants of the species
Crinum longifolium. That is, we are interested in the characteristics of a
universe, of which our small collection of observations is only a
sample. The characteristics of any particular sample are not identical with those
of the universe but vary from sample to sample.
We can use the sample mean to estimate the true mean of the universe, but the sample variance
and covariance are on the average a little smaller than the true value in the universe. This is
because the deviations from the sample mean are not all independent of one another, because the
data used to calculate the sample mean are the same as those used to calculate the deviations
from that mean. It is simple to correct for this bias. Whenever we are interested in the
variance of a set of measurements, not as a characteristic of the particular collection but as
an estimate for the universe that the sample represents, then the appropriate quantity to use,
rather than s² itself, is [N/(N − 1)]s². Note that this new quantity is equivalent to dividing
the sum of squared deviations by N − 1 instead of N in the first place, so

$$\frac{N}{N-1}\,s^2 = \frac{N}{N-1}\cdot\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$$

All these considerations about bias also apply to the sample covariance. In the formula for
the correlation coefficient (page 599), however, the factor N/(N − 1) would
appear in both the numerator and the denominator and therefore cancel out, so we can ignore it
for the purposes of computation.
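Both points, the downward bias of the uncorrected sample variance and the cancellation of the correction in the correlation coefficient, can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the universe is taken to be a normal distribution with variance 1, samples of N = 5 are drawn repeatedly, and the paired data at the end are hypothetical. It uses `statistics.pvariance` (divisor N) and `statistics.variance` (divisor N − 1) from the Python standard library.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so that the run is repeatable

N = 5           # size of each sample
TRIALS = 20000  # number of samples drawn from the simulated universe

# Average the uncorrected (divisor N) and corrected variances over many samples.
total_s2 = 0.0
total_corrected = 0.0
for _ in range(TRIALS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    s2 = statistics.pvariance(sample)        # divides by N
    total_s2 += s2
    total_corrected += N / (N - 1) * s2      # bias-corrected estimate
mean_s2 = total_s2 / TRIALS
mean_corrected = total_corrected / TRIALS

# The correction is the same as dividing by N - 1 in the first place.
one_sample = [4.1, 5.3, 3.8, 6.0, 5.2, 4.6]
n = len(one_sample)
corrected_once = n / (n - 1) * statistics.pvariance(one_sample)
direct_once = statistics.variance(one_sample)  # divides by N - 1


def correlation(xs, ys, divisor):
    """Correlation with the same divisor used for covariance and variances."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / divisor
    var_x = sum((x - mean_x) ** 2 for x in xs) / divisor
    var_y = sum((y - mean_y) ** 2 for y in ys) / divisor
    return cov_xy / math.sqrt(var_x * var_y)


# Hypothetical paired data: the divisor cancels, so r is the same either way.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 1.5, 3.9, 3.1, 5.2, 4.8]
r_with_n = correlation(xs, ys, len(xs))
r_with_n_minus_1 = correlation(xs, ys, len(xs) - 1)

print(mean_s2, mean_corrected)
print(corrected_once, direct_once)
print(r_with_n, r_with_n_minus_1)
```

With samples of five, the uncorrected variance should average close to (N − 1)/N = 0.8 of the true value, whereas the corrected quantity should average close to the true value of 1.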