
NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Griffiths AJF, Gelbart WM, Miller JH, et al. Modern Genetic Analysis. New York: W. H. Freeman; 1999.

## Covariance and correlation

Another statistical notion that is of use in the study of quantitative genetics is the association, or correlation, between variables. As a result of complex paths of causation, many variables in nature vary together but in an imperfect or approximate way. Figure 18-15a provides an example, showing the lengths of two particular teeth in several individual specimens of a fossil mammal, *Phenacodus primaevis*. The longer an individual’s first lower molar is, the longer its second molar is, but the relation between the two teeth is imprecise. Figure 18-15b shows that the total length and tail length in individual snakes (*Lampropeltis polyzona*) are quite closely related to each other.

The usual measure of the precision of a relation between two variables *x* and
*y* is the correlation coefficient (*r*_{xy}). It is calculated in part from the
product of the deviation of each observation of *x* from the mean of the *x*
values and the deviation of each observation of *y* from the mean of the *y*
values, a quantity called the covariance of *x* and *y* (cov_{xy}):

cov_{xy} = (1/*N*) Σ (*x* − x̄)(*y* − ȳ)

A formula that is exactly algebraically equivalent but that makes computation easier is

cov_{xy} = (1/*N*) Σ *xy* − x̄ȳ

Using this formula, we can calculate the covariance between the right (*x*) and
the left (*y*) leg counts in Table 18-2.
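The algebraic equivalence of the two forms can be checked directly in a short sketch (the numbers here are illustrative only, not the Table 18-2 counts):

```python
# Covariance of x and y computed two ways; both divide by N, matching the
# definitions in the text. The data are hypothetical example values.

def covariance(xs, ys):
    """Deviation form: cov_xy = (1/N) * sum((x - x_bar) * (y - y_bar))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

def covariance_computational(xs, ys):
    """Computational form: cov_xy = (1/N) * sum(x * y) - x_bar * y_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - x_bar * y_bar

xs = [5, 6, 7, 8, 9]
ys = [6, 6, 8, 7, 10]
# The two forms agree (up to floating-point rounding).
assert abs(covariance(xs, ys) - covariance_computational(xs, ys)) < 1e-12
```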

The correlation coefficient, *r*_{xy}, is defined as

r_{xy} = cov_{xy} / (s_{x} s_{y})

In the formula for correlation, the products of the deviations are divided by the product of
the standard deviations of *x* and *y* (*s*_{x} and *s*_{y}). This normalization by the
standard deviations has the effect of making *r*_{xy} a dimensionless number that is
independent of the units in which *x* and *y* are measured. So defined, *r*_{xy} will vary
from −1, which signifies a perfectly linear negative relation between *x* and *y*, to +1,
which indicates a perfectly linear positive relation between *x* and *y*. If
*r*_{xy} = 0, there is no linear relation between the variables. It is important to notice,
however, that sometimes when there is no *linear* relation between two variables but there is
a regular *nonlinear* relation between them, one variable may be perfectly predicted from the
other. Consider, for example, the parabola shown in Figure 18-16. The values of *y* are
perfectly predictable from the values of *x*; yet *r*_{xy} = 0 because, on average over the
whole range of *x* values, larger *x* values are not associated with either larger or smaller
*y* values. The data in Figure 18-15a and b have *r*_{xy} values of 0.82 and 0.99,
respectively. In the example of the sex comb teeth of Table 18-2, the correlation between
left and right legs is a very small value.
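A minimal sketch of the definition, using an illustrative symmetric parabola rather than the actual data of Figure 18-16:

```python
# r_xy = cov_xy / (s_x * s_y), illustrated on a symmetric parabola:
# y is perfectly predictable from x, yet the *linear* correlation is zero.

def correlation(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = (sum((x - x_bar) ** 2 for x in xs) / n) ** 0.5
    s_y = (sum((y - y_bar) ** 2 for y in ys) / n) ** 0.5
    return cov / (s_x * s_y)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]       # a regular nonlinear relation: y = x^2
print(correlation(xs, ys))     # 0.0: larger x is not associated with larger y
```

Over the symmetric range, the positive and negative deviation products cancel exactly, so the covariance (and hence *r*) is zero even though the relation is deterministic.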

## Correlation and equality

It is important to notice that correlation between two sets of numbers is not the same as
numerical identity. For example, two sets of values can be perfectly
*correlated,* even though the values in one set are very much larger than the
values in the other set. Consider the following pairs of values:

The variables *x* and *y* in the pairs are perfectly correlated
(*r* = +1.0), although each value of *y* is about 20 units
greater than the corresponding value of *x*. Two variables are perfectly
correlated if, for a unit increase in one, there is a constant increase in the other (or a
constant decrease if *r* is negative). The importance of the difference between
correlation and identity arises when we consider the effect of environment on heritable
characters. Parents and offspring can be perfectly correlated in some trait such as height,
yet, because of an environmental difference between generations, every child can be taller than
the parents. This phenomenon appears in adoption studies, in which children may be correlated
with their biological parents but, on the average, may be quite different from the parents as a
result of a change in social situation.
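The distinction can be illustrated with hypothetical values in which each *y* exceeds the corresponding *x* by exactly 20 units:

```python
# Perfect correlation without numerical identity: every y is exactly
# 20 units greater than the corresponding x, so r = +1 even though the
# two sets of numbers are nowhere equal. The values are hypothetical.

def correlation(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    s_x = (sum((x - x_bar) ** 2 for x in xs) / n) ** 0.5
    s_y = (sum((y - y_bar) ** 2 for y in ys) / n) ** 0.5
    return cov / (s_x * s_y)

xs = [1, 2, 3, 4, 5]
ys = [x + 20 for x in xs]   # a unit increase in x gives a constant increase in y
r = correlation(xs, ys)     # +1.0 (up to floating-point rounding)
```

Adding a constant shifts every value but leaves the deviations from the mean, and hence the correlation, unchanged.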

## Regression

The measurement of correlation provides us with only an estimate of the
*precision* of relation between two variables. A related problem is predicting
the value of one variable given the value of the other. If *x* increases by two
units, by how much will *y* increase? If the two variables are linearly related,
then that relation can be expressed as

y = bx + a

where *b* is the slope of the line relating *y* to *x* and *a* is the
*y* intercept of that line.

Figure 18-17 shows a scatter diagram of points for
two variables, *y* and *x*, together with a straight line
expressing the general linear trend of *y* with increasing *x*.
This line, called the **regression line of *y* on *x*,** has been positioned so that the
deviations of the points from the line are as small as possible. Specifically, if Δ*y* is the
distance of any point from the line in the *y* direction, then the line has been chosen so that

Σ (Δ*y*)² is a minimum
Any other straight line passed through the points on the scatter diagram will have a larger total squared deviation of the points from it.

Obviously, we cannot find this **least-squares line** by trial and error. It turns
out, however, that, if slope *b* of the line is calculated by

b = cov_{xy} / s_{x}²

and if *a* is then calculated from

a = ȳ − b x̄

so that the line passes through the point (x̄, ȳ), then these values of *b* and
*a* will yield the least-squares prediction equation.

Note that the prediction equation cannot predict *y* exactly for a given
*x*, because there is scatter around the line. The equation predicts the
*average y* for a given *x*, if large samples are taken.
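The least-squares recipe can be sketched in a few lines; for clarity the illustrative points here happen to lie exactly on a line, so the fit recovers the slope and intercept exactly:

```python
# Least-squares regression of y on x, following the text:
# b = cov_xy / s_x**2 and a = y_bar - b * x_bar.
# Illustrative points lying exactly on y = 2x + 3.

def least_squares(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n   # s_x squared
    b = cov / var_x                                 # slope
    a = y_bar - b * x_bar                           # intercept: line passes
    return b, a                                     # through (x_bar, y_bar)

xs = [0, 1, 2, 3]
ys = [2 * x + 3 for x in xs]
b, a = least_squares(xs, ys)   # b = 2.0, a = 3.0
```

With scattered real data the fitted line would not pass through every point; it would instead minimize the total squared vertical deviation, predicting the average *y* for each *x*.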

## Samples and populations

The preceding sections have described the distributions and some statistics of particular
assemblages of individuals that have been collected in some experiments or sets of
observations. For some purposes, however, we are not really interested in the particular 100
undergraduates or 27 snakes that have been measured. Instead, we are interested in a wider
world of phenomena, of which those particular individuals are representative. Thus, we might
want to know the average seed weight *in general* of plants of the species
*Crinum longifolium.* That is, we are interested in the characteristics of a
**universe,** of which our small collection of observations is only a
**sample.** The characteristics of any particular sample are not identical with those
of the universe but vary from sample to sample.

We can use the sample mean to estimate the true mean of the universe, but the sample variance
and covariance are on the average a little smaller than the true value in the universe. This is
because the deviations from the sample mean are not all independent of one another, because the
data used to calculate the sample mean are the same as those used to calculate the deviations
from that mean. It is simple to correct for this bias. Whenever we are interested in the
variance of a set of measurements—not as a characteristic of the particular collection but as
an estimate of a universe that the sample represents—then the appropriate quantity to use,
rather than *s*² itself, is [*N*/(*N* − 1)]*s*².
Note that this new quantity is equivalent to dividing the sum of squared deviations by
*N* − 1 instead of *N* in the first place, so

[*N*/(*N* − 1)]*s*² = Σ (*x* − x̄)² / (*N* − 1)
All these considerations about bias also apply to the sample covariance. In the formula for
the correlation coefficient (page 599), however, the factor *N/(N − 1)* would
appear in both the numerator and the denominator and therefore cancel out, so we can ignore it
for the purposes of computation.
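Python's standard `statistics` module implements both versions of the variance, which makes the *N*/(*N* − 1) correction easy to see in a sketch (the data here are hypothetical measurements):

```python
import statistics

# Dividing the squared deviations by N gives the sample variance s^2;
# multiplying by N/(N - 1) gives the estimate of the universe variance.
# statistics.pvariance divides by N; statistics.variance divides by N - 1.

data = [4.1, 4.7, 5.0, 5.3, 5.9]   # hypothetical measurements
n = len(data)
mean = sum(data) / n
s2 = sum((x - mean) ** 2 for x in data) / n   # divide by N
unbiased = (n / (n - 1)) * s2                 # apply the N/(N - 1) factor

assert abs(s2 - statistics.pvariance(data)) < 1e-12
assert abs(unbiased - statistics.variance(data)) < 1e-12
```

The correction matters most for small samples; as *N* grows, *N*/(*N* − 1) approaches 1 and the two quantities converge.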
