# Nonparametric regression to the mean

^{*}Department of Statistics, University of California, 1 Shields Avenue, Davis, CA 95616; and

^{‡}Department of Mathematics, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093

^{†}To whom correspondence should be addressed. E-mail: mueller@wald.ucdavis.edu.

## Abstract

Available data may reflect a true but unknown random variable of interest plus an additive error, which is a nuisance. The problem of predicting the unknown random variable arises in many applied situations where measurements are contaminated with errors; it is known as the regression-to-the-mean problem. There exists a well-known solution when the distributions of both the true underlying random variable and the contaminating errors are normal. This solution is given by the classical regression-to-the-mean formula, which has a data-shrinkage interpretation. We discuss the extension of this solution to cases where one or both of these distributions are unknown and demonstrate that the fully nonparametric case can be solved when the contaminating errors are small. The resulting nonparametric regression-to-the-mean paradigm can be implemented by a straightforward data-sharpening algorithm that is based on local sample means. Asymptotic justifications and practical illustrations are provided.

The regression-to-the-mean phenomenon was named by Galton (1), who noticed that the heights of sons tend to be closer to the population mean than the heights of their fathers. The phenomenon is observed in uncontrolled clinical trials, where subjects with a pathological measurement tend to yield closer-to-normal subsequent measurements (2, 3), and it motivates controlled clinical trials for the evaluation of therapeutic interventions (4, 5). Classical regression to the mean has been studied mainly in the context of multivariate normal distributions (6).

In the typical regression-to-the-mean situation, one has observations that are contaminated by random errors. The well known basic result for the situation of a multivariate normal distribution corresponds to shrinkage to the mean and provides the best prediction for a new observation based on past observations and also a method for denoising contaminated observations.

Extensions of the normality-based regression-to-the-mean strategies have been studied by various authors. Although the contaminating errors are still assumed to be normal, Das and Mulder (7) derived a regression-to-the-mean formula allowing for an arbitrary distribution of the underlying observations. This result was combined with an Edgeworth approximation of this unknown distribution in ref. 8, and it forms the starting point of our investigation as well (see Eq. 2 below). Regression to the mean for more complex treatment effects has been studied in refs. 9 and 10.

We propose a procedure for the case where both the distribution of the true underlying uncontaminated observations (which are to be predicted) as well as the distribution of the contaminating errors are unknown. As we demonstrate, if repeated observations are available, it is possible to obtain consistent predictors under minimal assumptions on the distributions if either the error variance declines or the number of repeated measurements increases asymptotically. We establish asymptotic normality and propose an intuitively appealing and simple implementation based on local sample moments that is illustrated with a data set consisting of a bivariate sample of repeated blood-sugar measurements for pregnant women.

## The Regression-to-the-Mean Problem

The general problem can be stated as follows: Given unknown independently and identically distributed (i.i.d.) random variables *X*_{i}, we observe a sample of data *X̃*_{i} contaminated with errors δ_{i},

$$\tilde{X}_i = X_i + \delta_i, \qquad i = 1, \ldots, n.$$

Here, *X*_{i} and δ_{i} are independent, and the contaminating errors δ_{i} are i.i.d. with zero means. The goal is to predict the uncontaminated values *X*_{i} from the observed contaminated data *X̃*_{i}. The best linear unbiased predictor for *X*_{i} is given by the Bayes estimator *E*(*X*_{i}|*X̃*_{i}). Assuming the existence of probability density functions (PDFs) *f*_{X̃} for *X̃*, *f*_{X} for *X*, and *f*_{δ} for δ, we find by elementary calculations

$$f_{\tilde{X}}(x) = \int f_X(y)\, f_\delta(x - y)\, dy$$

and

$$f_{\tilde{X}, X}(x, y) = f_X(y)\, f_\delta(x - y),$$

where we denote the joint PDF of (*X̃*, *X*) by *f*_{X̃,X}. This leads to the following general form for the regression-to-the-mean function:

$$E(X \mid \tilde{X} = x) = \frac{\int y\, f_X(y)\, f_\delta(x - y)\, dy}{\int f_X(y)\, f_\delta(x - y)\, dy}. \tag{1}$$

We show that the difficulty that is caused by the fact that both
*f*_{δ} and *f*_{X} are unknown
can be addressed with a nonparametric method. The proposed method produces
consistent predictors of the uncontaminated *X*, whenever the errors
δ can be assumed to be shrinking asymptotically, as in situations where
an increasing number of repeated measurements become available. In classical
regression to the mean, a critical assumption is that the contaminating PDF
*f*_{δ} is Gaussian; even then its variance is typically
unknown and must be estimated, requiring the availability of repeated
measurements for at least some subjects.
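When both densities are specified, the regression-to-the-mean function is a ratio of two integrals, ∫ *y f*_{X}(*y*) *f*_{δ}(*x* – *y*) *dy* divided by ∫ *f*_{X}(*y*) *f*_{δ}(*x* – *y*) *dy*, and can be evaluated numerically. The following is a minimal sketch rather than the authors' implementation; the function names `posterior_mean` and `normal_pdf`, the integration grid, and the parameter values are illustrative assumptions. It checks the quadrature against the classical shrinkage formula in the all-Gaussian case.

```python
import numpy as np

def normal_pdf(u, mean=0.0, sd=1.0):
    """Density of N(mean, sd^2) evaluated at u."""
    return np.exp(-0.5 * ((u - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def posterior_mean(x, f_X, f_delta, grid):
    """Approximate E(X | contaminated value = x) as a ratio of two integrals
    on a uniform grid; the grid spacing cancels in the ratio."""
    w = f_X(grid) * f_delta(x - grid)   # joint density of (contaminated, true) at (x, y)
    return np.sum(grid * w) / np.sum(w)

# Illustrative values (assumptions): X ~ N(0, 1), errors ~ N(0, 0.5^2).
mu_X, sigma_X, sigma = 0.0, 1.0, 0.5
grid = np.linspace(-10.0, 10.0, 20001)
x0 = 2.0

numeric = posterior_mean(
    x0,
    lambda y: normal_pdf(y, mu_X, sigma_X),
    lambda e: normal_pdf(e, 0.0, sigma),
    grid,
)
# Classical Gaussian shrinkage toward the mean, for comparison.
classical = x0 - sigma**2 / (sigma_X**2 + sigma**2) * (x0 - mu_X)
print(numeric, classical)
```

For non-Gaussian choices of `f_X` or `f_delta`, the same function applies unchanged, which is why the ratio form is useful as a reference when the densities are known.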

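The key argument for the Gaussian case can be found in ref. 7 (see also refs. 11 and 12). We reproduce the argument here for the one-dimensional case. Assume *f*_{X̃}(*x*_{0}) > 0 for a given *x*_{0} and denote the standard Gaussian density function by ϕ. Then, substituting (1/σ)ϕ(·/σ) for *f*_{δ} in Eq. **1**, and using the fact that *x* = –ϕ^{(1)}(*x*)/ϕ(*x*),

$$E(X \mid \tilde{X} = x_0) = x_0 - \frac{\int (x_0 - y)\, f_X(y)\, \frac{1}{\sigma}\phi\!\left(\frac{x_0 - y}{\sigma}\right) dy}{f_{\tilde{X}}(x_0)} = x_0 + \sigma^2\, \frac{f_{\tilde{X}}^{(1)}(x_0)}{f_{\tilde{X}}(x_0)}. \tag{2}$$

Under the additional assumption *X* ~ *N*(μ_{X}, σ_{X}^{2}), we have *X̃* ~ *N*(μ_{X}, σ_{X}^{2} + σ^{2}). Substituting the corresponding Gaussian density for *f*_{X̃} in Eq. **2** then produces the classical regression-to-the-mean formula

$$E(X \mid \tilde{X} = x_0) = x_0 - \frac{\sigma^2}{\sigma_X^2 + \sigma^2}\,(x_0 - \mu_X). \tag{3}$$

Both Eqs. **1** and **2** reveal that regression to the mean corresponds to shrinkage toward the mean; in Eq. **2**, this becomes shrinkage toward the mode, rather, because *f*_{X̃}^{(1)}(*x*) = 0 at a mode of the density *f*_{X̃}.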
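To make the shrinkage-toward-the-mode reading concrete, the sketch below evaluates the Gaussian-error predictor *x*_{0} + σ^{2} *f*^{(1)}(*x*_{0})/*f*(*x*_{0}), with *f* the density of the contaminated variable, for a two-component normal mixture. It is an illustration under assumed values (component means ±1, component variance 1/8, error variance 1/4); the function names are hypothetical. Because everything is Gaussian within a component, the exact conditional mean is available in closed form for comparison.

```python
import numpy as np

def normal_pdf(u, mean, var):
    """Density of N(mean, var) evaluated at u (broadcasts over arrays)."""
    return np.exp(-0.5 * (u - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def score_prediction(x0, means, var_X, sigma2):
    """x0 + sigma^2 * f'(x0) / f(x0), where f is the density of the contaminated
    variable: an equal-weight normal mixture with component variance var_X + sigma2."""
    means = np.asarray(means, dtype=float)
    total_var = var_X + sigma2
    comps = normal_pdf(x0, means, total_var)              # component densities at x0
    f = comps.mean()                                       # mixture density
    f_prime = np.mean(-(x0 - means) / total_var * comps)   # derivative of the mixture density
    return x0 + sigma2 * f_prime / f

def exact_posterior_mean(x0, means, var_X, sigma2):
    """Closed-form conditional mean of X given the contaminated value x0."""
    means = np.asarray(means, dtype=float)
    total_var = var_X + sigma2
    w = normal_pdf(x0, means, total_var)
    w = w / w.sum()                                        # posterior component weights
    return float(np.sum(w * (means + var_X / total_var * (x0 - means))))

means, var_X, sigma2 = [-1.0, 1.0], 0.125, 0.25
x0 = 0.7
pred = score_prediction(x0, means, var_X, sigma2)
exact = exact_posterior_mean(x0, means, var_X, sigma2)
print(pred, exact)
```

At *x*_{0} = 0.7 the prediction moves upward, toward the nearby mode close to 1, whereas shrinkage toward the overall mean 0 would move it downward.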

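Extending Eq. **2** to the *p*-dimensional case, one finds analogously

$$E(X \mid \tilde{X} = x_0) = x_0 + V\, \frac{\nabla f_{\tilde{X}}(x_0)}{f_{\tilde{X}}(x_0)}. \tag{4}$$

Here *V* = cov(δ) is the *p* × *p* covariance matrix of the contaminating errors δ, which are assumed *p*-variate normal, δ ~ *N*(0, *V*), and ∇*f*_{X̃} is the gradient of the *p*-dimensional PDF *f*_{X̃}.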

## The Nonparametric Case

The general regression-to-the-mean formula (Eq. **1**) is not applicable in practice when neither *f*_{δ} nor *f*_{X} is contained in a parametric class; indeed, it is easily seen that these components are then unidentifiable. The derivation of Eqs. **2**–**4** is tied to the feature that the Gaussian PDF is the unique solution of the differential equation *g*^{(1)}(*x*)/*g*(*x*) = –*x*.
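The uniqueness claim can be verified by direct integration. In the short derivation below, the constants *c* and *C* are introduced only for illustration; normalizing *g* to integrate to 1 fixes *C*.

```latex
\frac{g^{(1)}(x)}{g(x)} = \frac{d}{dx}\log g(x) = -x
\;\Longrightarrow\;
\log g(x) = -\tfrac{1}{2}x^{2} + c
\;\Longrightarrow\;
g(x) = C\,e^{-x^{2}/2},
\qquad
\int g(x)\,dx = 1 \;\Longrightarrow\; C = (2\pi)^{-1/2}.
```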

The following basic assumptions are made.

**Assumption A1.** *The p-dimensional* (*p* ≥ 1) *measurements X̃*_{i} *that are observed for n subjects are generated as follows*:

$$\tilde{X}_i = X_i + \delta_i, \qquad i = 1, \ldots, n,$$

*where the uncontaminated unobservable data X*_{i} *are i.i.d. with PDF f*_{X}, *and the measurement errors* δ_{i} *are i.i.d. with PDF*

$$f_\delta(x) = |V|^{-1/2}\, \psi\!\left(V^{-1/2} x\right),$$

*where* ψ *is an unknown PDF and V*_{n} *is a sequence of covariance matrices V* = *V*_{n} = (*v*_{kl})_{1≤k, l≤p} *of full rank, with V*_{n} → 0, *and* |*V*| *denotes the determinant of V. Moreover, X*_{i} *and* δ_{i} *are independent for all i. For the case p* = 1, *we set V*_{n} = σ_{n}^{2} = σ^{2}. *The X̃*_{i} *are i.i.d. with PDF f*_{X̃}.

**Assumption A2.** *At a given point x*_{0} *in the
interior of the support of f*_{X} *such that
f*_{X}(*x*_{0}) > 0, *the PDFs*
ψ *and f*_{X} *are twice continuously
differentiable, and* ψ *satisfies the moment conditions*
(*p* = 1)

*and for p* > 1, ψ *satisfies*

*and all third-order moments are bounded.*

We note that in the case of repeated measurements per subject,

assuming all δ_{ij} and
(*X*_{i}, δ_{ij}) are
independent, one may work with averages

where
,
and analogously for
_{i},
*X*_{i}. Then, for *p* = 1, Eq. **5** is
replaced by

for fixed *m* (and analogously for *p* > 1). If the number of
repeated measurements is large, we may consider the case *m* =
*m*(*n*) → ∞ as *n* → ∞, where

for σ_{m(n)} = σ/*m*(*n*)^{1/2}, with ψ replaced by ψ_{n}, satisfying the moment properties as in *Assumption A2*; this case is covered as long as ψ_{n} and its first-order derivatives are uniformly bounded for all *n*.

For simplicity, we develop the following argument for the case *p* =
1; the extension to *p* > 1 is straightforward. The central
observation under *Assumptions A1* and *A2* is the following
argument: From Eq. **1**,

and for the denominator

Let μ_{j} = ∫ψ(*x*)*x*^{j}*dx* for *j* ≥ 1. Combining a Taylor expansion with the moment conditions (*Assumption A2*) and observing that, because ψ is a PDF, ∫ψ^{(1)}(*z*)*dz* = 0, ∫ψ^{(1)}(*z*)*z dz* = –∫ψ(*z*)*dz* = –1, ∫ψ^{(1)}(*z*)*z*^{2}*dz* = –2∫ψ(*z*)*z dz* = 0, and ∫ψ^{(1)}(*z*)*z*^{3}*dz* = –3μ_{2}, we find

We note that in the Gaussian case, where ψ = ϕ, the term on the
left-hand side of Eq. **11** vanishes, because then
ψ^{(1)}(*z*) = –*z*ψ(*z*). In case
the contaminating errors have a symmetric PDF or, more generally, whenever
μ_{3} = 0, and the PDFs are three times continuously
differentiable, the Taylor expansion can be carried one step further to yield

Likewise, the difference in Eqs. **11** and **12** can be made of even
smaller order by requiring additional moments to be equal to those of a
Gaussian distribution. Finally,

Combining Eqs. **10**, **11**, and **13**,

and if μ_{3} = 0, the leading remainder term is
.
Finally, for the multivariate case the same arguments lead to the following
extension of Eq. **14**:

## Local Sample Means for Nonparametric Regression to the Mean

The concept of local moments and local sample moments is related to the data-sharpening ideas proposed in ref. 13 and was formulated in ref. 14. The special case of a local sample mean is used implicitly in “mean update” mode-finding algorithms (15, 16) and provides an attractive device for implementing nonparametric regression to the mean.

The starting point is a random variable *Z* with twice continuously differentiable density *f*_{Z}. Given an arbitrary point *x*_{0} = (*x*_{01},..., *x*_{0p})′ and choosing a sequence of window widths γ = γ_{n} > 0, define a sequence of local neighborhoods

The local mean at *x*_{0} is defined as
μ_{z} = (μ_{z}_{1},...,
μ_{z}_{p})′, with

where in *e*_{j} = (0,..., 1,..., 0)′ the 1
occurs in the *j*th position. According to ref.
14,

The empirical counterparts of these local means are the local sample means. Given an i.i.d. sample (*Z*_{1},..., *Z*_{n}) of random variables with PDF *f*_{Z}, where *Z*_{i} = (*Z*_{i1},..., *Z*_{ip})′, the local sample mean is μ_{Z} = (μ_{Z1},..., μ_{Zp})′, where

and γ = γ_{n} > 0 is a sequence with γ → 0 as *n* → ∞. This is the sample mean found from the data falling into the local neighborhood *S*(*x*_{0}), standardized by γ^{–2}. By equations 3.4 and 3.8 in ref. 14,

motivating the connection to nonparametric regression to the mean as in Eq.
**15**.
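A local sample mean can be computed in a few lines. The sketch below is an illustration under stated assumptions rather than the authors' code: it takes the neighborhood *S*(*x*_{0}) to be the cube with half-width γ centered at *x*_{0}, and it averages the deviations *Z*_{i} – *x*_{0} of the points falling into that cube before dividing by γ^{2}. The function name `local_sample_mean` and the simulation settings are hypothetical. For standard normal data, where *f*^{(1)}(*x*)/*f*(*x*) = –*x*, the result should be close to –*x*_{0}/3, in line with the factor 3 that appears in the local-mean limit quoted in the Appendix.

```python
import numpy as np

def local_sample_mean(Z, x0, gamma):
    """Z has shape (n, p); x0 has shape (p,).
    Returns the mean deviation from x0 of the rows of Z lying in the cube
    |z_j - x0_j| <= gamma for all j, divided by gamma**2."""
    Z = np.atleast_2d(np.asarray(Z, dtype=float))
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    dev = Z - x0
    inside = np.all(np.abs(dev) <= gamma, axis=1)
    if not inside.any():
        raise ValueError("no observations fall into the neighborhood")
    return dev[inside].mean(axis=0) / gamma**2

rng = np.random.default_rng(0)
Z = rng.standard_normal((200_000, 1))   # one-dimensional sample, p = 1
x0 = np.array([1.0])
mu_hat = local_sample_mean(Z, x0, gamma=0.3)
print(mu_hat)  # compare with -x0 / 3
```

The window width trades bias against variance: a smaller γ reduces the bias of the approximation but leaves fewer points in the neighborhood.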

Usually the covariance matrix *V* of the contaminating errors
δ is unknown and can be estimated via the sample covariance matrix

given a contaminated sample with repeated measurements,
(_{ik}_{1},...,
_{ikp})′, 1
≤ *i* ≤ *n*, 1 ≤ *k* ≤
*m*_{i}, and
,
where *m*_{i} ≥ 2, 1 ≤ *r* ≤
*p*.
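One common way to estimate *V* from repeated measurements is the pooled within-subject sample covariance. The sketch below assumes that form, namely the sum of within-subject cross-products divided by the total of (*m*_{i} – 1); the exact normalization used in the source is not reproduced here, and the function name `pooled_within_covariance` is hypothetical.

```python
import numpy as np

def pooled_within_covariance(groups):
    """groups: list of arrays, one per subject, each of shape (m_i, p) with m_i >= 2.
    Returns the p x p pooled within-subject sample covariance matrix."""
    p = np.asarray(groups[0]).shape[1]
    scatter = np.zeros((p, p))
    dof = 0
    for g in groups:
        g = np.asarray(g, dtype=float)
        centered = g - g.mean(axis=0)      # deviations from the subject mean
        scatter += centered.T @ centered   # within-subject cross-products
        dof += g.shape[0] - 1
    return scatter / dof

# Tiny hand-checkable example with two subjects and two repetitions each.
groups = [np.array([[0.0, 0.0], [2.0, 2.0]]),
          np.array([[1.0, 0.0], [1.0, 2.0]])]
V_hat = pooled_within_covariance(groups)
print(V_hat)
```

If predictions are based on subject averages over *m* independent repetitions, the covariance of the averaged error is *V*/*m*, so the estimate would be rescaled accordingly.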

We note that consistency *V̂* =
*V*(1 + *o*_{p}(1)) holds as long as
, *n*
→ ∞. Then the estimate

satisfies

as long as γ → 0, σ → 0, and *n*γ^{2+p} → ∞.

The following additional regularity conditions are needed for asymptotic results.

**Assumption A3.** *As n* → ∞, γ → 0, *n*γ^{2+p} → ∞, *and for a* λ ≥ 0, *n*γ^{2+p+4} → λ^{2}.

**Assumption A4.** *It holds that V* = σ^{2}*V*_{0} *for a fixed covariance matrix V*_{0} *with trace*(*V*_{0}) = *p and a sequence* σ = σ_{n} → 0 *as n* → ∞. *Here, V*_{0} *is the covariance matrix associated with the error PDF* ψ *defined in* Assumption A2.

**Assumption A5.** *As n* → ∞, (*n*γ^{2+p})^{1/2}σ → 0, σ/γ → 0.

We then obtain, using local sample means of Eq. **18** and estimates
of Eq. **20**, the following main
result on asymptotic normality and consistency of the shrinkage estimates in
Eq. **21**.

**Theorem 1.** *Under* Assumptions A1–A5, *as n*
→ ∞,

*in distribution, where B* = (β_{1},...,
β_{p})′,

*and*

*In the one-dimensional case* (*p* = 1), *this simplifies
to*

## Simulation Results

To illustrate the advantage of nonparametric regression to the mean in Eq. **21**, we compare it with the Gaussian analog. If *X* ~ *N*(μ_{X}, Σ), δ ~ *N*(0, *V*), *X̃* = *X* + δ, with *X*, δ independent, the extension of Eq. **3** to the multivariate case is

$$E(X \mid \tilde{X} = x_0) = x_0 - V(\Sigma + V)^{-1}(x_0 - \mu_X). \tag{26}$$
A total of 300 observations were generated from the (½,
½)-mixture of two bivariate normal distributions with means (–1,
–1) and (1, 1) and common covariance matrix ⅛*I*, where
*I* stands for the identity matrix. Samples then were contaminated by
adding Gaussian noise with zero mean and covariance matrix *V* =
¼*I*.

Parametric and nonparametric regression-to-the-mean estimates, assuming that *V* is known while μ_{X} is estimated through the sample mean of the observed *X̃*_{i}, are presented in Fig. 1 for a typical simulation run. Circles represent the generated uncontaminated data, and arrows point from contaminated data to predicted data, which correspond to the tips of the arrows. The graphical results clearly indicate that the nonparametric procedure tracks the original uncontaminated data well, whereas the parametric procedure shrinks the data toward the origin, which is the wrong strategy for these nonnormal data.

**Fig. 1.** ... (*Top Left*), contaminated sample (*Top Right*), nonparametric regression to the mean using Eq. **21** (*Middle Left*), arrows pointing from contaminated to predicted observations (*Middle Right*) ...

As a measure of accuracy in recovering the original uncontaminated data, we
computed the average sum of squared differences between original
uncontaminated data and regression-to-the-mean estimates for the Gaussian
method of Eq. **26** and the nonparametric method of Eq. **21** over 500
Monte Carlo samples under the specifications described above. The resulting
average squared error measures for the Gaussian and nonparametric procedures
were 414.44 and 60.19, respectively, indicating an almost 7-fold improvement
for nonparametric relative to Gaussian regression to the mean in this
example.
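The setup of this simulation can be sketched in code. The block below is a hedged illustration, not a reproduction of the reported study: it assumes that the nonparametric predictor takes the plug-in form *x*_{0} + 3*V*μ̂, with μ̂ the local sample mean over a cube neighborhood of half-width γ, that Σ in the Gaussian method is estimated by the sample covariance of the contaminated data minus *V*, and that γ = 0.9; none of these choices is stated in the text. The function names are hypothetical.

```python
import numpy as np

def nonparametric_prediction(data, V, gamma):
    """For each row x0 of data, return x0 + 3 * V @ (local sample mean at x0),
    using the cube neighborhood |z_j - x0_j| <= gamma (assumed form)."""
    preds = np.empty_like(data)
    for i, x0 in enumerate(data):
        dev = data - x0
        inside = np.all(np.abs(dev) <= gamma, axis=1)  # always contains x0 itself
        mu_hat = dev[inside].mean(axis=0) / gamma**2
        preds[i] = x0 + 3.0 * V @ mu_hat
    return preds

def gaussian_prediction(data, V, Sigma):
    """Gaussian shrinkage x0 - V (Sigma + V)^{-1} (x0 - sample mean) for each row."""
    center = data.mean(axis=0)
    shrink = V @ np.linalg.inv(Sigma + V)
    return data - (data - center) @ shrink.T

rng = np.random.default_rng(1)
n = 300
# (1/2, 1/2)-mixture of bivariate normals with means (-1, -1) and (1, 1).
centers = np.where(rng.random((n, 1)) < 0.5, -1.0, 1.0) * np.ones((1, 2))
X = centers + rng.normal(scale=np.sqrt(1 / 8), size=(n, 2))   # uncontaminated data
V = 0.25 * np.eye(2)
X_tilde = X + rng.normal(scale=np.sqrt(1 / 4), size=(n, 2))   # contaminated data

# Assumption: Sigma estimated as sample covariance of contaminated data minus V.
Sigma_hat = np.cov(X_tilde, rowvar=False) - V
sse_raw = np.sum((X_tilde - X) ** 2)
sse_gauss = np.sum((gaussian_prediction(X_tilde, V, Sigma_hat) - X) ** 2)
sse_nonpar = np.sum((nonparametric_prediction(X_tilde, V, gamma=0.9) - X) ** 2)
print(sse_raw, sse_gauss, sse_nonpar)
```

The printed sums of squared errors depend on the seed, on γ, and on how Σ is handled, so they are not expected to match the values 414.44 and 60.19 reported above; the sketch is meant only to show how the two predictors are applied to the same contaminated sample.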

## Application to Repeated Blood-Sugar Measurements

Blood-sugar measurements are a common tool in diabetes testing. In a glucose-tolerance test, the glucose level in blood is measured after a period of fasting (fasting-glucose measurement) and again 1 h after giving the subject a defined dose of glucose (postprandial glucose measurement). Pregnant women are prone to develop subclinical or manifest diabetes, and establishing the distribution of blood-glucose levels after a period of fasting and after a dose of glucose is therefore of interest.

O'Sullivan and Mahan (17) collected data on 52 pregnant women whose blood-glucose levels (fasting and postprandial) were measured during three subsequent pregnancies, thus establishing a series of repeated bivariate measurements with three repetitions (*m* = 3, *p* = 2) (see also ref. 18, p. 211). In a preprocessing step, the data were standardized by subtracting the mean and dividing by the SD for each of the two variables, fasting glucose (mean 72.9 mg/100 ml, SD 6.05) and postprandial glucose (mean 107.8 mg/100 ml, SD 18.65), separately. Subsequently, 52 bivariate sample means *X̃*_{i.} were obtained by averaging over the three repeated measurements for each subject. These data are shown as open circles in Fig. 2.
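The preprocessing described above amounts to a per-variable standardization followed by averaging over repetitions. The sketch below shows one way to do this for an array of shape (subjects, repetitions, variables). It uses synthetic placeholder values rather than the O'Sullivan and Mahan data, and the pooling of all subjects and repetitions for the mean and SD, the use of the sample SD, and the function name `standardize_and_average` are assumptions.

```python
import numpy as np

def standardize_and_average(measurements):
    """measurements: array of shape (n_subjects, m_repetitions, p_variables).
    Standardizes each variable with the mean and SD pooled over all subjects
    and repetitions, then averages over repetitions.
    Returns an array of shape (n_subjects, p_variables)."""
    x = np.asarray(measurements, dtype=float)
    flat = x.reshape(-1, x.shape[-1])
    mean = flat.mean(axis=0)
    sd = flat.std(axis=0, ddof=1)          # assumption: sample SD (ddof=1)
    return ((x - mean) / sd).mean(axis=1)

# Synthetic placeholder values, NOT the original glucose data.
rng = np.random.default_rng(2)
fake = rng.normal(loc=[72.9, 107.8], scale=[6.05, 18.65], size=(52, 3, 2))
subject_means = standardize_and_average(fake)
print(subject_means.shape)
```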

**Fig. 2.** ... (Eq. **21**) for glucose measurements for 52 women, with repeated measurements over three pregnancies. Circles are observed sample means obtained from the three repetitions of the standardized values of (fasting ...

Applying Eqs. **19**–**21** with window width γ = 1.4 and
sample covariance matrix
,
and
,
we obtain the predictions
*Ê*(*X*_{i}|*X̃*_{i.}).
The arrows in Fig. 2 show the
displacement from observed to predicted values, the latter corresponding to
the tips of the arrows.

Moving from the original observations to the predictions has a
data-sharpening effect. This can be seen quite clearly from
Parzen–Rosenblatt nonparametric kernel density estimates of the
bivariate density, comparing the density of the original observations
(*Upper*) with that of the predicted observations (*Lower*) in
Fig. 3.

## Concluding Remarks

We have generalized the regression-to-the-mean paradigm to a nonparametric situation, where both the nature of the target distribution of given observations as well as that of the contaminating errors are unknown. It is shown that in this fairly general situation regression to the mean corresponds to shrinkage towards the mode of the distribution. We propose a straightforward estimation scheme for the shrinkage factor based on local sample means. Thus a connection emerges between nonparametric regression to the mean with data-shrinkage ideas and the mean update algorithm that has been used previously for mode finding and cluster analysis.

Open questions concern the choice of smoothing parameters. A plug-in approach could be based on estimating the unknown quantities in the asymptotic distribution provided in Eqs. **23**–**25**, and bootstrap methods based on residuals are another option. Procedures for more elaborate designs, where nonparametric regression to the mean would be incorporated into more-complex models involving comparison of means, analysis of variance, or regression components, are also of interest, as is the estimation of the contaminating errors and their distribution from the “residuals” *Ê*(*X*|*X̃*) – *X̃*.

## Acknowledgments

We are grateful for the helpful and detailed comments of two reviewers. This research was supported in part by National Science Foundation Grants DMS-9971602, DMS-0204869, and 0079430.

## Appendix: Proof of Theorem 1

We first establish the following result on multivariate asymptotic
normality of local sample means, computed from random samples
(*X*_{1},..., *X*_{n}) with PDF
*f*_{X}.

**Theorem A.1.** *For vectors of local sample means of Eq.* **18** *and* μ = (μ_{1},..., μ_{p})′, μ_{j} = *D*^{e_{j}}*f*_{X}(*x*_{0})/3*f*_{X}(*x*_{0}) *of Eq.* **17**, *it holds under* Assumptions A1–A3 *that*

*in distribution, where* = *B*/3 (*see Eq.* **24**) *and* (*see Eq.* **25**).

*Proof:* Extending an argument of ref. 14 (p. 105), consider random variables

By third-order Taylor expansion of
and *EU*_{j}*U*_{k} =
*O*(γ^{4}|*S*|)for *j* ≠
*k*. Defining random variables

and using fixed constants α_{1},...,
α_{p}, we find that

and

Applying the Cramér–Wold device and Slutsky's theorem completes the proof.

*Proof of* Theorem 1: Observing Eqs. **15**, **19**, and
**21**, *Assumptions A4* and *A5*, and the consistency of
,

is seen to have the same limiting distribution as
.
Therefore, *Theorem 1* is a direct consequence of *Theorem A.1*
once we establish the following two results:

and

The moment conditions for ψ (see *Assumption A2*) in the
multivariate case are, with constants β_{α} and
ζ_{α},

and this leads to (see chapter 6 in ref. 19 and ref. 20)

By using these moment conditions in second-order Taylor expansions,

whence Eq. **A.2** follows by *Assumption A5.*

We next discuss the denominators of
and
.
Abbreviating ρ_{n} = (*n*γ^{p+2})^{1/2}, we find, based on the kernel density estimator with uniform kernel and window *S*, denoting the indicator function by *I*(·),

and because by Eq. **A.4**, *EI*(*X̃*_{i} ∈ *S*) – *EI*(*X*_{i} ∈ *S*) = *O*(|*S*|σ^{2}), we arrive at

Note that due to *Assumption A5*,

which implies

again using Eq. **A.4**. We conclude that var
,
whence, with Eq. **A.6**,

Regarding the numerator, the terms to consider are

and the terms that include *x*_{0}_{j} are
handled in the same way as the denominator by using Eq. **A.7**. Because
_{ij} =
*X*_{ij} + δ_{ij}, it therefore
remains to consider

The same argument as for the denominator and additional Cauchy–Schwarz
bounds lead to *EI* → 0, *EI*^{2} → 0, and
therefore . For II, note that

because *X*_{ij} and δ_{ij} are
independent. Furthermore,
*E*(δ_{ij}*I*(*X*_{ij}
*S*))^{2} =
*O*(σ^{2}|*S*|) leads to var(II) =
*O*(γ^{2}σ^{2}) = *o*(1), according
to *Assumption A5.* Therefore,
, and Eq. **A.3**
follows, concluding the proof.

## Notes

Abbreviations: i.i.d., independently and identically distributed; PDF, probability density function.

## References

**,**246–263.

**,**121–130. [PubMed]

**,**214–218. [PubMed]

**,**780. [PMC free article] [PubMed]

**,**241–243. [PubMed]

**,**1163–1190.

**,**493–497.

**,**431–435.

**,**593–598. [PubMed]

**,**939–947. [PubMed]

**,**1073–1077.

**,**1163–1190.

**,**941–947.

**,**90–109.

**,**32–40.

**,**345–351. [PubMed]

**,**439–458.

**National Academy of Sciences**


Nonparametric regression to the mean. Proceedings of the National Academy of Sciences of the United States of America. Aug 19, 2003; 100(17):9715.
