Comput Stat Data Anal. Author manuscript; available in PMC 2011 Jul 5.
PMCID: PMC3129714
NIHMSID: NIHMS90438
PMID: 21738282

Sharp Quadratic Majorization in One Dimension

Abstract

Majorization methods solve minimization problems by replacing a complicated problem by a sequence of simpler problems. Solving the sequence of simple optimization problems guarantees convergence to a solution of the complicated original problem. Convergence is guaranteed by requiring that the approximating functions majorize the original function at the current solution. The leading examples of majorization are the EM algorithm and the SMACOF algorithm used in Multidimensional Scaling. The simplest possible majorizing subproblems are quadratic, because minimizing a quadratic is easy to do. In this paper quadratic majorizations for real-valued functions of a real variable are analyzed, and the concept of sharp majorization is introduced and studied. Applications to logit, probit, and robust loss functions are discussed.

Keywords: Successive approximation, iterative majorization, convexity

1 Introduction

Majorization algorithms, including the EM algorithm, are used for more and more computational tasks in statistics [De Leeuw, 1994; Heiser, 1995; Hunter and Lange, 2004; Lange et al., 2000]. The basic idea is simple. A function g majorizes a function f at a point y if g ≥ f and g(y) = f(y). If we are minimizing a complicated objective function f iteratively, then we construct a majorizing function at the current best solution x(k). We then find a new solution x(k+1) by minimizing the majorization function. Then we construct a new majorizing function at x(k+1), and so on.

Majorization algorithms are worth considering if the majorizing functions can be chosen to be much easier to minimize than the original objective function, for instance linear or quadratic. In this paper we will look in more detail at majorization with quadratic functions. We restrict ourselves to functions of a single real variable. This is not as restrictive as it seems, because many functions F(x1, ⋯ , xn) in optimization and statistics are separable in the sense that

$$F(x_1,\dots,x_n) = \sum_{i=1}^{n} f_i(x_i),$$

and majorization of the univariate functions fi automatically gives a majorization of F.

Many of our results generalize without much trouble to real-valued functions on Rn and to constrained minimization over subsets of Rn. The univariate context suffices to explain most of the basic ideas.

2 Majorization

2.1 Definitions

We formalize the definition of majorization at a point.

Definition 2.1

Suppose f and g are real-valued functions on Rn. We say that g majorizes f at y if

  • g(x) ≥ f(x) for all x,
  • g(y) = f(y).

If the first condition can be replaced by

  • g(x) > f(x) for all x ≠ y,

we say that majorization is strict.

Thus g majorizes f at y if d = g − f has a minimum, equal to zero, at y. And majorization is strict if this minimizer is unique. If g majorizes f at y, then f minorizes g at y. Alternatively we also say that f supports g at y.

It is also useful to have a global definition, which says that f can be majorized at all y.

Definition 2.2

Suppose f is a real-valued function on Rn and g is a real-valued function on Rn⨂Rn. We say that g majorizes f if

  • g(x,y) ≥ f(x) for all x and all y,
  • g(x,x) = f(x) for all x.

Majorization is strict if the first condition is

  • g(x,y) > f(x) for all x ≠ y.

2.2 Majorization Algorithms

The basic idea of majorization algorithms is simple. Suppose our current best approximation to the minimum of f is x(k), and we have a g that majorizes f at x(k). If x(k) already minimizes g we stop, otherwise we update x(k) to x(k+1) by minimizing g. If we do not stop, we have the sandwich inequality

f(x(k+1)) ≤ g(x(k+1)) < g(x(k)) = f(x(k)), 

and in the case of strict majorization

f(x(k+1)) < g(x(k+1)) < g(x(k)) = f(x(k)).

Repeating these steps produces a decreasing sequence of function values, and appropriate additional compactness and continuity conditions guarantee convergence of both sequences x(k) and f(x(k)) [De Leeuw, 1994]. In fact, it is not necessary to actually minimize the majorization function; it is sufficient to have a continuous update function h such that g[h(y)] < g(y) for all y. In that case the sandwich inequality still applies with x(k+1) = h(x(k)).
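To fix ideas, here is a minimal sketch of this generic majorization loop in Python. The function names (make_majorizer, argmin_majorizer) and the simple stopping rule are illustrative assumptions, not part of any particular algorithm in this paper.

```python
def mm_minimize(f, make_majorizer, argmin_majorizer, x0, tol=1e-10, max_iter=1000):
    """Generic majorization (MM) loop.

    make_majorizer(y) must return a function g with g >= f everywhere
    and g(y) = f(y); argmin_majorizer(g) returns a minimizer of g.
    """
    x = x0
    for _ in range(max_iter):
        g = make_majorizer(x)          # majorize f at the current iterate
        x_new = argmin_majorizer(g)    # minimize the surrogate
        # sandwich inequality: f(x_new) <= g(x_new) <= g(x) = f(x)
        if abs(f(x_new) - f(x)) < tol:
            return x_new
        x = x_new
    return x
```

In practice the stopping rule and the surrogate minimizer are problem specific; the sandwich inequality only guarantees that the objective values decrease.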

2.3 Majorizing Differentiable Functions

We first show that majorization functions must have certain properties at the point where they touch the target.

Theorem 2.1

Suppose f and g are differentiable at y. If g majorizes f at y, then

  • g(y) = f(y),
  • g′(y) = f′(y).

If f and g are twice differentiable at y, then in addition

  • g″(y) ≥ f″(y).

Proof

If g majorizes f at y then d = gf has a minimum at y. Now use the familiar necessary conditions for the minimum of a differentiable function, which say the derivative at the minimum is zero and the second derivative is non-negative.

Theorem 2.1 can be generalized in many directions if differentiability fails. If f has left and right derivatives at y, for instance, and g is differentiable, then

$$f'_R(y) \le g'(y) \le f'_L(y).$$

If f is convex, then f′_L(y) ≤ f′_R(y), and f′(y) must exist in order for a differentiable g to majorize f at y. In this case g′(y) = f′(y). For nonconvex f more general differential inclusions are possible using the four Dini derivatives of f at y [see, for example, McShane, 1944, Chapter V].

3 Quadratic Majorizers

As we said, it is desirable that the subproblems in which we minimize the majorization function are simple. One way to guarantee this is to try to find a convex quadratic majorizer. We limit ourselves mostly to convex quadratic majorizers because concave ones have no minima and are of limited use for algorithmic purposes.

The first result, which is widely used, applies to functions with a continuous and uniformly bounded second derivative [Böhning and Lindsay, 1988].

Theorem 3.1

If f is twice differentiable and there is a B > 0 such that f″(x) ≤ B for all x, then for each y the convex quadratic function

$$g(x) = f(y) + f'(y)(x-y) + \tfrac12 B (x-y)^2$$

majorizes f at y.

Proof

Use Taylor’s theorem in the form

$$f(x) = f(y) + f'(y)(x-y) + \tfrac12 f''(\xi)(x-y)^2$$

with ξ on the line connecting x and y. Because f″(ξ) ≤ B, this implies f(x) ≤ g(x), where g is defined above.

This result is very useful, but it has some limitations. In the first place we would like a similar result for functions that are not everywhere twice differentiable, or even those that are not everywhere differentiable. Second, the bound does not take into account that we only need to bound the second derivative on the interval between x and y, and not on the whole line. This may result in a bound which is not sharp. In particular we shall see below that substantial improvements can result from a non-uniform bound B(y) that depends on the support point y.

Why do we want the bounds on the second derivative to be sharp? The majorization algorithm corresponding to this result is

$$x^{(k+1)} = x^{(k)} - \frac{1}{B}\, f'(x^{(k)}),$$

which converges linearly, say to x∞, by Ostrowski's Theorem [De Leeuw, 1994]. More precisely,

$$\lim_{k\to\infty} \frac{|x^{(k+1)} - x_\infty|}{|x^{(k)} - x_\infty|} = 1 - \frac{f''(x_\infty)}{B}.$$

Thus the smaller we choose B, the faster our convergence. We mention some simple properties of quadratic majorizers.
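Before turning to those properties, a small numerical illustration of the uniform-bound update may be helpful. The sketch below applies x(k+1) = x(k) − f′(x(k))/B to the logistic loss f(x) = log(1 + e^(−x)) analyzed in Section 5.1, for which f″(x) ≤ 1/4; the added linear term with slope 0.3 is a hypothetical choice that gives the objective a finite minimizer.

```python
import math

def f(x, c=0.3):
    # logistic loss from Section 5.1 plus a linear term so the minimum is finite
    return math.log(1.0 + math.exp(-x)) + c * x

def fprime(x, c=0.3):
    return -math.exp(-x) / (1.0 + math.exp(-x)) + c

B = 0.25          # uniform bound on f''(x) = e^{-x}/(1+e^{-x})^2 <= 1/4
x = 0.0           # starting point
for k in range(200):
    x = x - fprime(x) / B          # x^{(k+1)} = x^{(k)} - f'(x^{(k)})/B

print(x, math.log(0.7 / 0.3))      # both values approach the minimizer log((1-c)/c)
```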

Property 1

If a quadratic g majorizes a twice-differentiable convex function f at y, then g is convex. This follows from g″(y) ≥ f″(y) ≥ 0.

Property 2

Quadratic majorizers are not necessarily convex. In fact, they can even be concave. Take f(x) = −x² and g(x) = −x² + ½(x − y)².

Property 3

If a concave quadratic g majorizes a twice-differentiable function f at y, then f is concave at y. This follows from 0 ≥ g″(y) ≥ f″(y).

Property 4

For some functions quadratic majorizers may not exist. Suppose, for example, that f is a cubic. If g is quadratic and majorizes f, then we must have d = g − f ≥ 0. But d = g − f is a cubic, and thus d < 0 for at least one value of x.

Property 5

Quadratic majorizers may exist almost everywhere, but not everywhere. Suppose, for example, that f(x) = |x|. Then f has a quadratic majorizer at each y except y = 0. If y ≠ 0 we can use, following Heiser [1986], the arithmetic mean-geometric mean inequality in the form

$$\sqrt{x^2 y^2} \le \tfrac12 (x^2 + y^2),$$

and find

$$|x| \le \frac{1}{2|y|}\, x^2 + \tfrac12 |y|.$$

That a quadratic majorizer does not exist at y = 0 follows from the discussion at the end of Section 2: f is convex and f′_L(0) = −1 < f′_R(0) = +1.

Example 1 For a nice regular example we use the celebrated functions

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \qquad \Phi(x) = \int_{-\infty}^{x} \phi(z)\,dz.$$

Then

$$\Phi'(x) = \phi(x), \quad \Phi''(x) = \phi'(x) = -x\,\phi(x), \quad \Phi'''(x) = \phi''(x) = (x^2 - 1)\,\phi(x), \quad \Phi''''(x) = \phi'''(x) = x(3 - x^2)\,\phi(x).$$

To obtain quadratic majorizers we must bound the second derivatives. We can bound Φ″(x) by setting its derivative equal to zero. We have Φ‴(x) = 0 for x = ±1. Moreover Φ″″(−1) < 0, and thus Φ″(x) ≤ φ(1). In the same way φ‴(x) = 0 for x = 0 and x = ±√3. At x = 0 the function φ″(x) has a minimum; at x = ±√3 it has two maxima. Thus φ″(x) ≤ 2φ(√3). More precisely, it follows that

$$0 \le \Phi'(x) = \phi(x) \le \phi(0), \qquad -\phi(1) \le \Phi''(x) = \phi'(x) \le \phi(1), \qquad -\phi(0) \le \Phi'''(x) = \phi''(x) \le 2\phi(\sqrt{3}).$$

Thus we have the quadratic majorizers

$$\Phi(x) \le \Phi(y) + \phi(y)(x-y) + \tfrac12 \phi(1)(x-y)^2$$

and

$$\phi(x) \le \phi(y) - y\,\phi(y)(x-y) + \phi(\sqrt{3})(x-y)^2.$$

The majorizers are illustrated for both Φ and φ at the points y = 0 and y = −3 in Figures 1 and 2. The inequalities in this section may be useful in majorizing multivariate functions involving φ and Φ. They are mainly intended, however, to illustrate construction of quadratic majorizers in the smooth case.

Figure 1. Quadratic majorization of the cumulative normal.

Figure 2. Quadratic majorization of the normal density.
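The quadratic majorizer of Φ displayed above is easy to check numerically. The following sketch verifies the inequality on a grid for the support point y = −3 used in Figure 1; the function names, grid, and tolerance are arbitrary choices.

```python
import math

def phi(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

y = -3.0
def g(x):
    # quadratic majorizer of Phi at y with curvature phi(1)
    return Phi(y) + phi(y) * (x - y) + 0.5 * phi(1.0) * (x - y) ** 2

grid = [-10.0 + 0.01 * i for i in range(2001)]    # points in [-10, 10]
print(all(g(x) >= Phi(x) - 1e-12 for x in grid))  # True: g majorizes Phi
```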

4 Sharp Quadratic Majorization

We now drop the assumption that the objective function is twice differentiable, even locally, and we try to improve our bound estimates at the same time.

4.1 Differentiable Case

Let us first deal with the case in which f is differentiable at y. Consider all a > 0 for which

$$f(x) \le f(y) + f'(y)(x-y) + \tfrac12 a (x-y)^2$$

for a fixed y and for all x. Equivalently, we must have, for all x,

$$a \ge \frac{f(x) - f(y) - f'(y)(x-y)}{\tfrac12 (x-y)^2}.$$
(1)

Define the function

$$\delta(x,y) = \frac{f(x) - f(y) - f'(y)(x-y)}{\tfrac12 (x-y)^2}$$

for all x ≠ y. The system of inequalities (1) has a solution if and only if

$$A(y) = \sup_{x \neq y} \delta(x,y) < \infty.$$

If this is the case, then any a ≥ A(y) will satisfy (1). Because we want a to be as small as possible, we will usually prefer to choose a = A(y). This is what we mean by the sharp quadratic majorization. If the second derivative is uniformly bounded by B, we have A(y) ≤ B, and thus our bound improves on the uniform bound considered before.

The function δ has some interesting properties. For differentiable convex f we have f(x) ≥ f(y) + f′(y)(x − y) and thus δ(x,y) ≥ 0. In the same way for concave f we have δ(x,y) ≤ 0. For strictly convex and concave f these inequalities are strict. If δ(x,y) ≤ 0 for all x and y, then f must be concave. Consequently A(y) ≤ 0 only if f is concave, and without loss of generality we can exclude this case from consideration.

The function δ(x,y) is closely related to the second derivative at or near y. If f is twice differentiable at y, then, by the definition of the second derivative,

$$\lim_{x \to y} \delta(x,y) = f''(y).$$
(2)

If f is three times differentiable, we can use the Taylor Expansion to sharpen this to

$$\lim_{x \to y} \frac{\delta(x,y) - f''(y)}{x-y} = \tfrac13 f'''(y).$$

Moreover, in the twice differentiable case, the Mean Value Theorem implies there is a ξ in the interval extending from x to y with δ(x,y) = f″(ξ). We can also derive an integral representation of δ(x,y) and its first derivative with respect to x [Tom Ferguson, Personal Communication, 03/12/04].

Lemma 4.1

δ(x,y) can be written as the expectation

$$\delta(x,y) = E\{f''[Vy + (1-V)x]\},$$

where the random variable V follows a β (2,1) distribution. Likewise

$$\delta'(x,y) = \tfrac13\, E\{f'''[Wy + (1-W)x]\},$$

where the random variable W follows a β(2,2) distribution. Thus δ(x,y) and δ′(x,y) can be interpreted as smoothed versions of f″ and f‴.

Proof

The first representation follows from the second-order Taylor’s expansion

$$f(x) = f(y) + f'(y)(x-y) + (x-y)^2 \int_0^1 f''[vy + (1-v)x]\, v\, dv$$

with integral remainder [Lange, 2004]. This can be rewritten as

$$\delta(x,y) = 2 \int_0^1 f''[vy + (1-v)x]\, v\, dv.$$
(3)

Since the density of β(2,1) at v is 2v, this gives the first result in the lemma. Differentiation under the integral sign of (3) yields the second representation.

In view of Lemma 4.1, δ(x,y) is jointly continuous in x and y when f″(x) is continuous. Furthermore, if f″(x) tends to ∞ as x tends to −∞ or +∞, then δ(x,y) is unbounded in x for each fixed y. Thus, quadratic majorizations do not exist for any y if the second derivative grows unboundedly. It also follows from Lemma 4.1 that the best quadratic majorization does not exist if the third derivative f‴ is always positive (or always negative). This happens, for instance, if the first derivative f′ is strictly convex or strictly concave. Thus, as mentioned earlier, cubics do not have quadratic majorizations.
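As a quick numerical check of the first representation in Lemma 4.1, the sketch below compares δ(x,y) with the β(2,1) expectation computed by simple midpoint quadrature; the logistic loss of Section 5.1 serves as the test function, and the evaluation points are arbitrary choices.

```python
import math

def f(x):
    return math.log(1.0 + math.exp(-x))

def f1(x):
    return -math.exp(-x) / (1.0 + math.exp(-x))

def f2(x):
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2

def delta(x, y):
    return (f(x) - f(y) - f1(y) * (x - y)) / (0.5 * (x - y) ** 2)

def delta_beta(x, y, n=20000):
    # E{ f''[V y + (1-V) x] } with V ~ beta(2,1), density 2v on (0,1)
    total = 0.0
    for i in range(n):
        v = (i + 0.5) / n                      # midpoint rule on (0,1)
        total += f2(v * y + (1.0 - v) * x) * 2.0 * v
    return total / n

print(delta(2.0, -1.0), delta_beta(2.0, -1.0))  # the two values should agree closely
```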

Property 6

Majorization may be possible at all points y without the function A(y) being bounded. Suppose the graph of f″(x) is 0 except for an isosceles triangle centered at each integer n ≥ 2. If we let the base of the triangle be 2n⁻³ and the height of the triangle be n, then the area under the triangle is n⁻². The formulas

$$f'(x) = \int_0^x f''(y)\,dy, \qquad f(x) = \int_0^x f'(y)\,dy$$

define a nonnegative convex function f(x) satisfying

$$f'(x) \le \sum_{n=2}^{\infty} \frac{1}{n^2} < \infty.$$

To prove that A(y) is finite for every y, recall the limit (2) and observe that

$$\delta(x,y) = \frac{f'(w)(x-y) - f'(y)(x-y)}{\tfrac12 (x-y)^2} = \frac{f'(w) - f'(y)}{\tfrac12 (x-y)}$$

for some w between x and y. It follows that δ(x,y) tends to 0 as |x| tends to ∞. Because A(n) ≥ f″(n) = n, it is clear that A(y) is unbounded.

4.2 Computing the Sharp Quadratic Majorization

Let us study the case in which the supremum of δ(x,y) over x ≠ y is attained at, say, z ≠ y. In our earlier notation A(y) = δ(z,y). Differentiating δ(x,y) with respect to x gives

$$\delta'(x,y) = \frac{\tfrac12 (x-y)^2 [f'(x) + f'(y)] - (x-y)[f(x) - f(y)]}{\tfrac14 (x-y)^4},$$

and

$$\frac{f(z) - f(y)}{z-y} = \tfrac12\, [f'(z) + f'(y)]$$
(4)

is a necessary and sufficient condition for δ′(z,y) to vanish. At the optimal z we have

$$A(y) = \delta(z,y) = \frac{f'(z) - f'(y)}{z-y}.$$
(5)

It is interesting that the fundamental theorem of calculus allows us to recast equations (4) and (5) as

$$\tfrac12\,[f'(z) + f'(y)] = \int_0^1 f'[z + t(y-z)]\,dt, \qquad A(y) = \int_0^1 f''[z + t(y-z)]\,dt.$$

When f is convex, A(y) ≥ 0. For the second derivative at z, we have

$$\delta''(z,y) = \frac{(z-y)^2 f''(z) - [f'(z) - f'(y)](z-y)}{\tfrac12 (z-y)^4}.$$

At a maximum we must have δ″(z,y) ≤ 0, which is equivalent to

$$f''(z) \le \frac{f'(z) - f'(y)}{z-y} = A(y).$$
(6)

We can achieve more clarity by viewing these questions from a different angle. If the quadratic g majorizes f at y, then it satisfies

$$g(x) = f(y) + f'(y)(x-y) + \tfrac12 a (x-y)^2$$

for some a. If z is a second support point, then g not only intersects f at z, but it also majorizes f at z. The condition g′(z) = f′(z) yields

$$a = \frac{f'(z) - f'(y)}{z-y}.$$

If we match this value with the requirement δ(z,y) = a, then we recover the second equality in (5). Conversely, if a point z satisfies the second equality in (5), then it is a second support point. In this case, one can easily check condition (4) guaranteeing that z is a stationary point of δ(x,y).

4.3 Optimality with Two Support Points

Building on earlier work by Groenen et al. [2003], Van Ruitenburg [2005] proves that a quadratic function g majorizing a differentiable function f at two points must be a sharp majorizer. The idea of looking for quadratic majorizers with two support points has been used earlier by Heiser [1986] and others. Van Ruitenburg, however, is the first to present the result in full generality. Our approach is more analytical and computational, and designed to be applied eventually to multivariate quadratic majorizations. For completeness, we now summarize in our language Van Ruitenburg’s [2005] lovely proof of the two-point property.

Lemma 4.2

Suppose two quadratic functions g1 ≠ g2 both majorize the differentiable function f at y. Then either g1 strictly majorizes g2 at y or g2 strictly majorizes g1 at y.

Proof

We have

$$g_1(x) = f(y) + f'(y)(x-y) + \tfrac12 a_1 (x-y)^2,$$
(7)
$$g_2(x) = f(y) + f'(y)(x-y) + \tfrac12 a_2 (x-y)^2,$$
(8)

with a1 ≠ a2. Subtracting (8) from (7) proves the lemma.

Lemma 4.3

Suppose the quadratic function g1 majorizes a differentiable function f at y and at z1 ≠ y, and that the quadratic function g2 majorizes f at y and at z2 ≠ y. Then g1 = g2.

Proof

Suppose g1 ≠ g2. Since both g1 and g2 majorize f at y, Lemma 4.2 applies. If g2 strictly majorizes g1 at y, then g1(z2) < g2(z2) = f(z2), and g1 does not majorize f. If g1 strictly majorizes g2 at y, then similarly g2(z1) < g1(z1) = f(z1), and g2 does not majorize f. Unless g1 = g2, we reach a contradiction.

We now come to Van Ruitenburg’s main result.

Theorem 4.4

Suppose a quadratic function g1 majorizes a differentiable function f at y and at z ≠ y, and suppose g2 ≠ g1 is a quadratic function majorizing f at y. Then g2 strictly majorizes g1 at y.

Proof

Suppose g1 strictly majorizes g2. Then g2(z) < g1(z) = f(z) and thus g2 does not majorize f. The result now follows from Lemma 4.2.

Property 7

It is not true, by the way, that a quadratic majorizer can have at most two support points. There can even be an infinite number of them. Consider the function h(x) = c sin²(x) for some c > 0. Clearly h(x) ≥ 0 and h(x) = 0 for all integer multiples of π. Now define f(x) = x² − h(x) and g(x) = x². Then g is a quadratic majorizer of f at all integer multiples of π. This is plotted in Figure 3 for c = 10.

Property 8

There is no guarantee that a second support point z ≠ y exists. Consider the continuously differentiable convex function

$$f(x) = \begin{cases} x^2 & x \le 1, \\ 2x - 1 & x > 1, \end{cases}$$

and fix y > 1. For x > 1

$$\delta(x,y) = \frac{2x - 1 - 2y + 1 - 2(x-y)}{\tfrac12 (x-y)^2} = 0.$$

For x ≤ 1

$$\delta(x,y) = \frac{x^2 - 2y + 1 - 2(x-y)}{\tfrac12 (x-y)^2} = \frac{(x-1)^2}{\tfrac12 (x-y)^2}.$$

It follows that limx→−∞ δ(x,y) = 2. On the other hand, one can easily demonstrate that δ(x,y) < 2 whenever x ≤ 1. Hence, A(y) = 2, but δ(x,y) < 2 for all x ≠ y.
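The sketch below evaluates δ(x,y) for this piecewise function at y = 2 and increasingly negative x, showing the supremum A(y) = 2 being approached but never attained; the support point and the chosen x values are arbitrary.

```python
def f(x):
    return x * x if x <= 1.0 else 2.0 * x - 1.0

def f1(x):
    return 2.0 * x if x <= 1.0 else 2.0

def delta(x, y):
    return (f(x) - f(y) - f1(y) * (x - y)) / (0.5 * (x - y) ** 2)

y = 2.0
for x in [0.0, -10.0, -100.0, -1000.0]:
    print(x, delta(x, y))    # values increase toward the supremum A(y) = 2
```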

4.4 Even Functions

Assuming that f(x) is even, i.e. f(x) = f(−x) for all x, simplifies the construction of quadratic majorizers. If an even quadratic g satisfies g(y) = f(y) and g′(y) = f′(y), then it also satisfies g(−y) = f(−y) and g′(−y) = f′(−y). If in addition g majorizes f at either y or −y, then it majorizes f at both y and −y, and Theorem 4.4 implies that it is the best possible quadratic majorization at both points. This means we only need an extra condition to guarantee that g majorizes f. The next theorem, essentially proved in the references [Groenen et al., 2003; Jaakkola and Jordan, 2000; Hunter and Li, 2005] by other techniques, highlights an important sufficient condition.

Theorem 4.5

Suppose f(x) is an even, differentiable function on R such that the ratio f′(x)/x is decreasing on (0,∞). Then the even quadratic

$$g(x) = \frac{f'(y)}{2y}\,(x^2 - y^2) + f(y)$$

is the best quadratic majorizer of f(x) at the point y.

Proof

It is obvious that g(x) is even and satisfies the tangency conditions g(y) = f(y) and g′(y) = f′(y). For the case 0 ≤ x ≤ y, we have

$$f(y) - f(x) = \int_x^y f'(z)\,dz = \int_x^y \frac{f'(z)}{z}\, z\, dz \ge \frac{f'(y)}{y} \int_x^y z\,dz = \frac{f'(y)}{y}\cdot \tfrac12 (y^2 - x^2) = f(y) - g(x),$$

where the inequality comes from the assumption that f′(x)/x is decreasing. It follows that g(x) ≥ f(x). The case 0 ≤ y ≤ x is proved in similar fashion, and all other cases reduce to these two cases given that f(x) and g(x) are even.

There is a condition equivalent to the sufficient condition of Theorem 4.5 that is sometimes easier to check.

Theorem 4.6

The ratio f′(x)/x is decreasing on (0,∞) if and only if f(√x) is concave. The set of functions satisfying this condition is closed under the formation of (a) positive multiples, (b) convex combinations, (c) limits, and (d) composition with a concave increasing function g(x).

Proof

Suppose f(√x) is concave in x and x > y. Then the two inequalities

$$f(\sqrt{x}) \le f(\sqrt{y}) + \frac{f'(\sqrt{y})}{2\sqrt{y}}\,(x-y), \qquad f(\sqrt{y}) \le f(\sqrt{x}) + \frac{f'(\sqrt{x})}{2\sqrt{x}}\,(y-x)$$

are valid. Adding these, subtracting the common sum f(√x) + f(√y) from both sides, and rearranging give

$$\frac{f'(\sqrt{x})}{2\sqrt{x}}\,(x-y) \le \frac{f'(\sqrt{y})}{2\sqrt{y}}\,(x-y).$$

Dividing by (x − y)/2 yields the desired result

$$\frac{f'(\sqrt{x})}{\sqrt{x}} \le \frac{f'(\sqrt{y})}{\sqrt{y}}.$$

Conversely, suppose the ratio is decreasing and x > y. Then the mean value expansion

$$f(\sqrt{x}) = f(\sqrt{y}) + \frac{f'(\sqrt{z})}{2\sqrt{z}}\,(x-y)$$

for z ∈ (y,x) leads to the concavity inequality.

$$f(\sqrt{x}) \le f(\sqrt{y}) + \frac{f'(\sqrt{y})}{2\sqrt{y}}\,(x-y).$$

The asserted closure properties are all easy to check.

As examples of property (d) of Theorem 4.6, note that the functions g(x) = ln x and g(x) = √x are concave and increasing. Hence, if f(√x) is concave, then ln f(√x) and f(√x)^(1/2) are concave as well.

The above discussion suggests that we look at more general transformations of the argument of f. If we define f̃(x) = f(α + βx) for an arbitrary function f(x), then a brief calculation shows that

$$\tilde{A}(y) = \beta^2 A(\alpha + \beta y), \qquad \tilde{z}(y) = \frac{z(\alpha + \beta y) - \alpha}{\beta},$$

using the identity δ̃(x,y) = β²δ(α + βx, α + βy). An even function f(x) satisfies f̃(x) = f(x) for α = 0 and β = −1.

4.5 Non-Differentiable Functions

If f is not differentiable at y, then we must find a and b such that

$$f(x) \le f(y) + b(x-y) + \tfrac12 a (x-y)^2$$

for all x. This is an infinite system of linear inequalities in a and b, which means that the solution set is a closed convex subset of the plane.

Analogous to the differentiable case we define

$$\delta(x,y,b) = \frac{f(x) - f(y) - b(x-y)}{\tfrac12 (x-y)^2},$$

as well as

$$A(y,b) = \sup_{x \neq y} \delta(x,y,b).$$

If A(y,b) < +∞, we have the sharpest quadratic majorization for given y and b. The sharpest quadratic majorization at y is given by

$$A(y) = \inf_b A(y,b).$$

5 Examples

As we explained in the introduction, majorizing univariate functions is usually not useful in itself. The results become relevant for statistics if they are used in the context of separable multivariate problems. In this section we first illustrate how to compute sharp quadratic majorizers for some common univariate functions occurring in maximum likelihood problems, and then we apply these majorizers to the likelihood problems themselves.

5.1 Logistic

Our first example is the negative logarithm of the logistic cdf

$$\Psi(x) = \frac{1}{1 + e^{-x}}.$$

Thus

$$f(x) = \log(1 + e^{-x}).$$

Clearly

$$f'(x) = -\frac{e^{-x}}{1 + e^{-x}} = \Psi(x) - 1,$$

and

$$f''(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \Psi(x)[1 - \Psi(x)].$$

Thus f″(x) > 0 and f(x) is strictly convex. Since f″(x) ≤ 1/4, a uniform bound is readily available.

The symmetry relations

$$f(-x) = x + f(x), \qquad f'(-x) = -[1 + f'(x)] = -\Psi(x), \qquad f''(-x) = f''(x)$$

demonstrate that z = −y satisfies equation (4) and hence maximizes δ(x,y). The optimum value is determined by (5) as

$$A(y) = \delta(z,y) = \frac{2\Psi(y) - 1}{2y}.$$

The same result was derived, using quite different methods, by Jaakkola and Jordan [2000] and Groenen et al. [2003].

We plot the function δ(x,y) for y = 1 and y = 8 in Figure 4. Observe that the uniform bound 1/4 is not improved much for y close to 0, but for large values of y the improvement is huge. This is because A(y) ≈ 1/(2|y|) for large |y|. Thus for large values of y we will see close to superlinear convergence if we use A(y).

Figure 4. δ for logistic at y = 1 (left) and y = 8 (right).
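The sharp bound A(y) = (2Ψ(y) − 1)/(2y) is easy to tabulate. The sketch below compares it with the uniform bound 1/4 at a few support points (the chosen points are arbitrary).

```python
import math

def Psi(x):
    return 1.0 / (1.0 + math.exp(-x))

def A(y):
    # sharp curvature for f(x) = log(1 + exp(-x)) at the support point y
    return (2.0 * Psi(y) - 1.0) / (2.0 * y)

for y in [0.1, 1.0, 2.0, 4.0, 8.0]:
    print(y, A(y), 0.25)   # A(y) falls well below the uniform bound 1/4 for large |y|
```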

Alternatively, we can majorize f(x) = log(1 + e^(−x)) by writing

$$\log(1 + e^{-x}) = -\tfrac12 x + \log\bigl(e^{x/2} + e^{-x/2}\bigr)$$

and majorizing the even function h(x) = log(e^(x/2) + e^(−x/2)). Straightforward but tedious differentiation shows that

$$\left[\frac{h'(x)}{x}\right]' = \frac{e^{-2x} + 2xe^{-x} - 1}{2x^2(1 + e^{-x})^2} = \frac{e^{-2x}}{2x^2(1 + e^{-x})^2} \sum_{k=2}^{\infty}\left[\frac{2x\, x^k}{k!} - \frac{(2x)^{k+1}}{(k+1)!}\right] = \frac{e^{-2x}}{x^2(1 + e^{-x})^2} \sum_{k=2}^{\infty} \frac{x^{k+1}}{k!}\left[1 - \frac{2^k}{k+1}\right] \le 0.$$

Hence, h′(x)/x is decreasing on (0,∞), and Theorem 4.5 applies.

5.2 The Absolute Value Function

Because |x| is even, Theorem 4.5 yields the majorization

$$g(x) = \frac{1}{2|y|}\,(x^2 - y^2) + |y| = \frac{1}{2|y|}\,x^2 + \tfrac12 |y|,$$

which is just the result given by the arithmetic/geometric mean inequality in Property 5. When y = 0, recall that no quadratic majorization exists.

If we approach majorization of |x| directly, we need to find a > 0 and b such that

$$\tfrac12 a (x-y)^2 + b(x-y) + |y| \ge |x|$$

for all x. Let us compute A(y,b). If y < 0 then b = −1, and thus

$$A(y,-1) = \sup_{x \neq y} \frac{|x| + x}{\tfrac12 (x-y)^2} = \frac{1}{|y|}.$$

If y > 0 then b = +1, and again

$$A(y,+1) = \sup_{x \neq y} \frac{|x| - x}{\tfrac12 (x-y)^2} = \frac{1}{|y|}.$$

In both cases, the best quadratic majorizer can be expressed as

$$g(x) = \frac{1}{2|y|}\,(x-y)^2 + \mathrm{sign}(y)(x-y) + |y| = \frac{1}{2|y|}\,x^2 + \tfrac12 |y|.$$

5.3 The Huber Function

Majorization for the Huber function, specifically quadratic majorization, has been studied earlier by Heiser [1987] and Verboon and Heiser [1994]. In those papers quadratic majorization functions appear more or less out of the blue, and it is then verified that they are indeed majorization functions. This is not completely satisfactory. Here we attack the problem by applying Theorem 4.5. This leads to the sharpest quadratic majorization.

The Huber function is defined by

$$f(x) = \begin{cases} \tfrac12 x^2 & \text{if } |x| < c, \\ c|x| - \tfrac12 c^2 & \text{otherwise}. \end{cases}$$

Thus we really deal with a family of even functions, one for each c > 0. The Huber functions are differentiable with derivative

$$f'(x) = \begin{cases} x & \text{if } |x| < c, \\ c & \text{if } x \ge c, \\ -c & \text{if } x \le -c. \end{cases}$$

Since it is obvious that f′(x)/x is decreasing on (0,∞), Theorem 4.5 immediately gives the sharpest majorizer

$$g(x) = \begin{cases} \dfrac{c}{2|y|}\,(x-y)^2 - cx - \tfrac12 c^2 & \text{if } y \le -c, \\[4pt] \tfrac12 x^2 & \text{if } |y| < c, \\[4pt] \dfrac{c}{2|y|}\,(x-y)^2 + cx - \tfrac12 c^2 & \text{if } y \ge c. \end{cases}$$
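A quick numerical check of this majorizer, written in the equivalent Theorem 4.5 form g(x) = (c/(2|y|))(x² − y²) + f(y); the value of c, the support point, and the grid are arbitrary choices.

```python
def huber(x, c=1.0):
    return 0.5 * x * x if abs(x) < c else c * abs(x) - 0.5 * c * c

def huber_majorizer(x, y, c=1.0):
    # sharpest quadratic majorizer of the Huber function at y (Theorem 4.5 form)
    if abs(y) < c:
        return 0.5 * x * x
    return 0.5 * (c / abs(y)) * (x * x - y * y) + c * abs(y) - 0.5 * c * c

y = -3.0
grid = [-10.0 + 0.01 * i for i in range(2001)]
print(all(huber_majorizer(x, y) >= huber(x) - 1e-12 for x in grid))  # True
print(huber_majorizer(y, y), huber(y))                               # equal at y
```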

5.4 General Logit and Probit Problems

Suppose we observe counts ni from n independent binomials, with parameters Ni and πi(x), where

$$\pi_i(x) = \frac{1}{1 + \exp(-h_i(x))}$$

for given functions hi(x) of p unknown parameters x. This covers both linear and nonlinear logistic regression problems. The deviance, i.e. twice the negative log-likelihood, is

$$D(x) = -2\sum_{i=1}^{n}\left[ n_i \log \pi_i(x) + (N_i - n_i)\log\bigl(1 - \pi_i(x)\bigr) \right] = 2\sum_{i=1}^{n} N_i \left\{ f(h_i(x)) - (p_i - 1)\,h_i(x) \right\},$$

where, as before, f(x) = log(1 + exp(−x)), and p_i = n_i/N_i. Using f′(−h_i(x)) = −Ψ(h_i(x)) = −π_i(x), we see that a quadratic majorization of f

$$f(h_i(x)) \le f(h_i(y)) + f'(h_i(y))\,(h_i(x) - h_i(y)) + \tfrac12 A_i(y)\,(h_i(x) - h_i(y))^2$$

leads to the quadratic majorization of the deviance by a weighted least squares function of the form

$$\sigma(x) = \sum_{i=1}^{n} N_i A_i(y)\,\bigl(h_i(x) - z_i(y)\bigr)^2,$$

where

$$z_i(y) = h_i(y) - \frac{\pi_i(y) - p_i}{A_i(y)}.$$

This means we can solve the logistic problem by solving a sequence of weighted least squares problems. If the hi are linear, then these are just linear regression problems. If the hi are bilinear, the subproblems are weighted singular value decompositions, and if the hi are distances the subproblems are least squares multidimensional scaling problems. If we have algorithms to solve the weighted least squares problems, then we automatically have an algorithm to solve the corresponding logistic maximum likelihood problem.
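To make the recipe concrete, here is a minimal sketch of the resulting algorithm for the linear case h_i(x) = α + βz_i, using the sharp bound A_i(y) from Section 5.1. The data, the fixed number of iterations, and the closed-form solution of the 2 × 2 weighted least squares subproblem are all illustrative assumptions.

```python
import math

# Hypothetical data: predictor z_i and n_i successes out of N_i binomial trials.
Z = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
N = [10, 10, 10, 10, 10, 10]
n = [1, 2, 4, 7, 8, 10]

def Psi(t):
    return 1.0 / (1.0 + math.exp(-t))

def A(t):
    # sharp curvature of f(x) = log(1 + exp(-x)) at support point t (Section 5.1)
    return 0.25 if abs(t) < 1e-8 else (2.0 * Psi(t) - 1.0) / (2.0 * t)

alpha, beta = 0.0, 0.0
for it in range(200):
    # weighted least squares subproblem: minimize sum_i w_i (alpha + beta*z_i - u_i)^2
    w, u = [], []
    for zi, Ni, ni in zip(Z, N, n):
        hi = alpha + beta * zi
        ai = A(hi)
        w.append(Ni * ai)
        u.append(hi - (Psi(hi) - ni / Ni) / ai)    # working response z_i(y)
    sw = sum(w)
    swz = sum(wi * zi for wi, zi in zip(w, Z))
    swzz = sum(wi * zi * zi for wi, zi in zip(w, Z))
    swu = sum(wi * ui for wi, ui in zip(w, u))
    swzu = sum(wi * zi * ui for wi, zi, ui in zip(w, Z, u))
    det = sw * swzz - swz * swz
    alpha = (swzz * swu - swz * swzu) / det
    beta = (sw * swzu - swz * swu) / det

print(alpha, beta)   # approximate maximum likelihood estimates for the hypothetical data
```

Each pass solves one weighted least squares problem, and the sandwich inequality guarantees that the deviance never increases.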

In fact, it is shown by De Leeuw [2006] that essentially the same results apply if we replace the logit Ψ by the probit function Φ. The difference is that for the probit we have A(y) ≡ 1 and thus uniform quadratic majorization is sharp.

As one of the reviewers correctly points out, sharp univariate quadratic majorization of f does not imply sharp multivariate quadratic majorization of D. It is, of course, true that sharp univariate majorization gives better results than unsharp univariate majorization. The problem of sharp multivariate majorization is basically unexplored, although we do have some tentative results.

5.5 Application to Discriminant Analysis

Discriminant analysis is another attractive application. In discriminant analysis with two categories, each case i is characterized by a feature vector zi and a category membership indicator yi taking the values −1 or 1. In the machine learning approach to discriminant analysis [Vapnik, 1995], the hinge loss function [1 − yi(α + zi^tβ)]_+ plays a prominent role. Here (u)_+ is short-hand for the convex function max{u,0}. Just as in ordinary regression, we can penalize the overall separable loss

$$g(\theta) = \sum_{i=1}^{m} \bigl[ 1 - y_i(\alpha + z_i^t \beta) \bigr]_+,$$

where θ = (α, β), by imposing a lasso or ridge penalty + λθ^tθ.

Most strategies for estimating θ pass to the dual of the original minimization problem. A simpler strategy, proposed by Groenen et al. [2007], is to majorize each contribution to the loss by a quadratic and minimize the surrogate loss plus penalty. In Groenen et al. [2008] this approach is extended to quadratic and Huber hinges, still maintaining the idea of using quadratic majorizers. A little calculus shows that the absolute value hinge (u)_+ is majorized at un ≠ 0 by the quadratic

$$q(u \mid u_n) = \frac{1}{4|u_n|}\,(u + |u_n|)^2.$$
(9)

In fact, by the same reasoning as for the absolute value function, this is the best quadratic majorizer. To avoid the singularity at 0, we recommend replacing q(u | un) by

$$r(u \mid u_n) = \frac{1}{4|u_n| + \varepsilon}\,(u + |u_n|)^2.$$

In double precision, a good choice of ε is 10⁻⁵. Of course, the dummy variable u is identified in case i with 1 − yi(α + zi^tβ). If we impose a ridge penalty, then the majorization (9) leads to a majorization algorithm exploiting weighted least squares.
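A short check of the hinge majorizer (9) and its perturbed version; the support point, grid, and tolerance are arbitrary choices.

```python
def hinge(u):
    return max(u, 0.0)

def q(u, un):
    # best quadratic majorizer of the hinge (u)_+ at un != 0, equation (9)
    return (u + abs(un)) ** 2 / (4.0 * abs(un))

def r(u, un, eps=1e-5):
    # perturbed version used to avoid the singularity at un = 0
    return (u + abs(un)) ** 2 / (4.0 * abs(un) + eps)

un = -0.7
grid = [-5.0 + 0.01 * i for i in range(1001)]
print(all(q(u, un) >= hinge(u) - 1e-12 for u in grid))  # True: q majorizes (u)_+
print(q(un, un), hinge(un))                              # equal at the support point
```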

6 Iterative Computation of A(y)

In general, one must find A(y) numerically. We do not suggest here that the combination of finding A(y) by an iterative algorithm, and then using this A(y) in the iterative quadratic majorization algorithm, is necessarily an efficient way to construct overall algorithms. It requires an infinite number of iterations within each of an infinite number of iterations. The results in this section can be used, however, for computing A(y) for specific functions to find out how it behaves as a function of y. This has been helpful to us in finding A(y) for the logit and probit, where the computations suggested the final analytic result.

For a convex function f, two similar iterative algorithms are available. They both depend on minorizing f by the linear function f(x^(k)) + f′(x^(k))(x − x^(k)) at the current point x^(k) in the search for the maximum z of δ(x,y). This minorization propels the further minorization

$$\delta(x,y) \ge \frac{f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)}) - f(y) - f'(y)(x-y)}{\tfrac12 (x-y)^2} = \frac{[f'(x^{(k)}) - f'(y)](x-y) + f(x^{(k)}) + f'(x^{(k)})(y - x^{(k)}) - f(y)}{\tfrac12 (x-y)^2}.$$

Maximizing the displayed minorizer drives δ(x,y) uphill. Fortunately, the minorizer is a function of the form

$$h(w) = \frac{cw + d}{w^2} = \frac{c}{w} + \frac{d}{w^2}$$

with w = x − y, where c = 2[f′(x^(k)) − f′(y)] and d is defined below. The stationary point w = −2d/c furnishes the maximum of h(w) provided

$$h''\!\left(-\frac{2d}{c}\right) = \left.\frac{2c}{w^3}\right|_{w=-2d/c} + \left.\frac{6d}{w^4}\right|_{w=-2d/c} = \frac{c^4}{8d^3}$$

is negative. If f(x) is strictly convex, then

$$d = 2\left[ f(x^{(k)}) + f'(x^{(k)})(y - x^{(k)}) - f(y) \right]$$

is negative, and the test for a maximum succeeds. The update can be phrased as

$$x^{(k+1)} = y - 2\,\frac{f(x^{(k)}) + f'(x^{(k)})(y - x^{(k)}) - f(y)}{f'(x^{(k)}) - f'(y)}.$$

A brief calculation based on equations (4) and (5) shows that the iteration map x(k+1) = g(x(k)) has derivative

$$g'(z) = \frac{f''(z)(z-y)}{f'(z) - f'(y)} = \frac{f''(z)}{A(y)}$$

at the optimal point z.

On the other hand, the Dinkelbach [1967] maneuver for increasing h(w) considers the function e(w) = cw + d − h(w^(k))w² with value e(w^(k)) = 0. If we choose

$$w^{(k+1)} = \frac{c}{2h(w^{(k)})}$$

to maximize e(w), then it is obvious that h(w^(k+1)) ≥ h(w^(k)). This gives the iteration map

$$x^{(k+1)} = y + \frac{\tfrac12\,[f'(x^{(k)}) - f'(y)]\,(x^{(k)} - y)^2}{f(x^{(k)}) - f(y) - f'(y)(x^{(k)} - y)} = y + \frac{f'(x^{(k)}) - f'(y)}{\delta(x^{(k)}, y)}$$

with derivative at z equal to f″(z)/A(y) by virtue of equations (4) and (5). Hence, the two algorithms have the same local rate of convergence. We recommend starting both algorithms near y. In the case of the Dinkelbach algorithm, this entails

$$h(w^{(0)}) \approx \delta(x^{(0)}, y) \approx f''(y) > 0$$

for f(x) strictly convex. Positivity of h(w(0)) is required for proper functioning of the algorithm.

In view of the convexity of f(x), it is clear that f″(z)/A(y) ≥ 0. The inequality f″(z) ≤ A(y) follows from the condition A(y) = A(z) determined by Theorem 4.4 and inequality (6). Ordinarily, strict inequality f″(z) < A(y) prevails, and the two iteration maps just defined are locally contractive. Globally, the standard convergence theory for iterative majorization (MM algorithms) suggests that lim_{k→∞}|x^(k+1) − x^(k)| = 0 and that the limit of every convergent subsequence must be a stationary point of δ(x,y) [Lange, 2004].
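For a concrete instance, the sketch below runs both update maps on the logistic loss of Section 5.1 at y = 2. Both should approach the second support point z = −y and the sharp bound A(y) = (2Ψ(y) − 1)/(2y); the starting point and the fixed number of iterations are arbitrary choices.

```python
import math

def f(x):
    return math.log(1.0 + math.exp(-x))

def f1(x):
    return -math.exp(-x) / (1.0 + math.exp(-x))

def delta(x, y):
    return (f(x) - f(y) - f1(y) * (x - y)) / (0.5 * (x - y) ** 2)

y = 2.0

x = 0.0                      # arbitrary starting point
for k in range(100):         # first update map
    x = y - 2.0 * (f(x) + f1(x) * (y - x) - f(y)) / (f1(x) - f1(y))
print(x, delta(x, y))        # approaches the second support point z = -y and A(y)

x = 0.0
for k in range(100):         # Dinkelbach update map
    x = y + (f1(x) - f1(y)) / delta(x, y)
print(x, delta(x, y))

print(-y, (2.0 / (1.0 + math.exp(-y)) - 1.0) / (2.0 * y))   # analytic z and A(y), Section 5.1
```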

7 Discussion

In separable problems in which quadratic majorizers exist we have shown that it is often possible to use univariate sharp quadratic majorizers to increase convergence speed of iterative majorization (or MM) algorithms. This is true even for multivariate problems with a potentially very large number of parameters, such as the logit and probit problems in Section 5.4.

There is, however, still plenty of room for improvement. We do not have, at the moment, a satisfactory theory of multivariate sharp quadratic majorization, and such a theory would obviously help to boost convergence rates even more. Such a theory should be based on the fact that a multivariate quadratic majorizer must satisfy

$$f(x) - f(y) - f'(y)^t(x-y) - \tfrac12 (x-y)^t A\,(x-y) \le 0$$

for all x. This is an infinite system of linear inequalities in A, and thus the solution set is either empty or convex. Sharp quadratic majorization will concentrate on the extreme points of this convex set, although it is unrealistic to expect that we can find an A(y) which is sharp in all directions. Clearly both the theoretical and the implementation aspects of sharp multivariate majorization are useful topics for further research.

Acknowledgments

This research was supported in part by NIH grants GM53275 and MH59490 to KL. Comments by two anonymous reviewers have been very helpful in preparing this final version.


References

1. Böhning D, Lindsay BG. Monotonicity of quadratic approximation algorithms. Annals Institute Stat Math. 1988;40:641–663.
2. De Leeuw J. Block relaxation algorithms in statistics. In: Bock HH, Lenski W, Richter MM, editors. Information Systems and Data Analysis. Berlin: Springer-Verlag; 1994. pp. 308–325.
3. De Leeuw J. Principal component analysis of binary data by iterated singular value decomposition. Comput Stat Data Anal. 2006;50:21–39.
4. Dinkelbach W. On nonlinear fractional programming. Management Science. 1967;13:492–498.
5. Groenen PJF, Giaquinto P, Kiers HAL. Technical Report EI 2003-09. Econometric Institute, Erasmus University; Rotterdam, Netherlands: 2003. Weighted majorization algorithms for weighted least squares decomposition models.
6. Groenen PJF, Nalbantov G, Bioch JC. Nonlinear support vector machines through iterative majorization and I-splines. In: Lenz HJ, Decker R, editors. Studies in Classification, Data Analysis, and Knowledge Organization. Springer-Verlag; Heidelberg-Berlin: 2007. pp. 149–161.
7. Groenen PJF, Nalbantov G, Bioch JC. SVM-Maj: a majorization approach to linear support vector machines with different hinge errors. Advances in Data Analysis and Classification. 2008;2:17–43.
8. Heiser WJ. Technical Report RR-86-12. Department of Data Theory, University of Leiden; Leiden, Netherlands: 1986. A majorization algorithm for the reciprocal location problem.
9. Heiser WJ. Correspondence analysis with least absolute residuals. Comput Stat Data Anal. 1987;5:337–356.
10. Heiser WJ. Convergent computation by iterative majorization: Theory and applications in multidimensional data analysis. In: Krzanowski WJ, editor. Recent Advances in Descriptive Multivariate Analysis. Clarendon Press; Oxford: 1995. pp. 157–189.
11. Hunter DR, Lange K. A tutorial on MM algorithms. Amer Statistician. 2004;58:30–37.
12. Hunter DR, Li R. Variable selection using MM algorithms. Ann Stat. 2005;33:1617–1642.
13. Jaakkola TS, Jordan MI. Bayesian parameter estimation via variational methods. Stat Computing. 2000;10:25–37.
14. Lange K. Optimization. Springer-Verlag; New York: 2004.
15. Lange K, Hunter DR, Yang I. Optimization transfer using surrogate objective functions (with discussion). J Comput Graph Stat. 2000;9:1–59.
16. McShane EJ. Integration. Princeton University Press; Princeton: 1944.
17. Van Ruitenburg J. Measurement and Research Department Reports 2005-04. Arnhem, Netherlands: 2005. Algorithms for parameter estimation in the Rasch model.
18. Vapnik V. The Nature of Statistical Learning Theory. Springer-Verlag; New York: 1995.
19. Verboon P, Heiser WJ. Resistant lower rank approximation of matrices by iterative majorization. Comput Stat Data Anal. 1994;18:457–467.