
Restricted Maximum Likelihood

Restricted maximum likelihood (REML) was suggested by Thompson (1962) and was described formally by Patterson and Thompson (1971). The term 'restricted' is perhaps not the most accurate description of the method. The procedure requires that y have a multivariate normal distribution. The method is translation invariant. The maximum likelihood approach automatically keeps the estimator within the allowable parameter space (i.e. zero to plus infinity), and therefore REML is a biased procedure.

The Likelihood Function

The vector of observations, y, is assumed to have a multivariate normal distribution with mean Xb and variance-covariance matrix V, so that the likelihood function is

L(y) = (2 pi)^(-N/2) |V|^(-1/2) exp( -(1/2)(y - Xb)'V^(-1)(y - Xb) ).
In REML, however, the likelihood function of a set of error contrasts, K'y, is used, where K'X = 0, and K' has rank equal to N - r(X). That is,

L(K'y) = (2 pi)^(-(N - r(X))/2) |K'VK|^(-1/2) exp( -(1/2) y'K(K'VK)^(-1)K'y ).
The next step is to take the natural log of the likelihood function to give

Notice that the term involving log 2 pi is a constant that does not depend on the unknown variance components or factors in the model, and therefore can be ignored. Searle (1979) showed that

and that

for any K' as long as K'X = 0 and K' has full row rank equal to N - r(X). Hence, L1 can be re-written as

Various alternative forms of L2 can be derived. Some basic results on determinants are necessary to follow the derivations.

First, if A and B are square matrices of the same order, then

|AB| = |A| |B|.
This can be used to show that

Then it follows that

Secondly, for a general partitioned matrix with diagonal blocks A and D being square and non-singular, then

det[ A B ; C D ] = |A| |D - C A^(-1) B| = |D| |A - B D^(-1) C|,
which can be used to show that if and , then

Also, if c is a scalar constant, then

|cA| = c^m |A|,

where m is the order of A. These relationships are used to show that if

then

and letting

then

From the last equality, then

Combining the above results gives

Now note that

Then

and finally, the log-likelihood function becomes

where

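The determinant results used in the derivation above can be checked numerically. The sketch below verifies the partitioned-matrix identity and the scalar-multiple identity with random matrices (the dimensions and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 2))
C = rng.normal(size=(2, 3))
D = rng.normal(size=(2, 2))
M = np.block([[A, B], [C, D]])

# |M| = |A| |D - C A^(-1) B| for non-singular A
lhs = np.linalg.det(M)
rhs = np.linalg.det(A) * np.linalg.det(D - C @ np.linalg.inv(A) @ B)

# |cA| = c^m |A| with m the order of A
c, m = 2.5, 3
diff = np.linalg.det(c * A) - c**m * np.linalg.det(A)
```

Both `lhs - rhs` and `diff` come out numerically zero, confirming the identities.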
Derivatives of the Log Likelihood

Recall that the variance-covariance matrix of y is

V = ZGZ' + R,

and a projection matrix is

P = V^(-1) - V^(-1) X (X'V^(-1)X)^- X'V^(-1).
From Searle (1979), the following results about derivatives hold:
(1)

(2)

(3)

(4)

and
(5)

To derive formulas for estimating the variance components, take the derivative of L2 with respect to the unknown components.

Combine the two terms involving the traces and note that

then

for i = 1, ..., s, or

for i = 0, the residual component. Using the form of V and the fact that

then

and

The other terms, which are quadratic forms in y, were simplified by Henderson (1973) to show that they could be calculated from solutions to his Mixed Model Equations. To simplify these quadratic forms, notice that

Therefore,

Henderson noticed that

where I sigma_i^2 is the assumed variance-covariance matrix of u_i, and the BLUP of u_i is its solution from his MME, then

Similarly, Henderson showed that

where

Now equate the derivatives to zero to obtain

sigma_i^2 = ( u_i'u_i + sigma_e^2 tr(C_ii) ) / q_i,   for i = 1, ..., s,

where u_i is the vector of MME solutions (BLUP) for the i-th random factor, C_ii is the corresponding submatrix of the inverse of the MME coefficient matrix, and q_i is the number of levels of that factor, and

sigma_e^2 = ( y'y - b'X'y - u'Z'y ) / ( N - r(X) ),

where b and u are the solutions to the MME.
Use of the above formulas is known as the EM (Expectation Maximization) algorithm. Note that the EM algorithm is iterative. One must take the new estimates, re-form the mixed model equations with the new variance ratios, recalculate solutions to Henderson's MME, and obtain new estimates of the variances. The process continues until the change between new and old estimates is smaller than a pre-specified amount, such as .000001%. The main problems with the EM algorithm are that convergence, if it occurs, is very slow, and the iterations may instead diverge. The more components there are to estimate, the greater the number of iterations needed to reach convergence, depending upon the starting values in the iterative process.
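As an illustration of these update formulas, the following sketch runs EM REML on simulated data for a model with a single random factor, y = Xb + Zu + e. All names, dimensions, true values, and starting values here are assumptions made for the example, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(1)

# simulate: 20 random levels, 10 records each, true sig_u2 = 4, sig_e2 = 16
q, n_per = 20, 10
N = q * n_per
X = np.ones((N, 1))                              # overall mean only
Z = np.kron(np.eye(q), np.ones((n_per, 1)))      # incidence of random levels
y = 50.0 + Z @ rng.normal(0.0, 2.0, q) + rng.normal(0.0, 4.0, N)

rank_X = 1
rhs = np.concatenate([X.T @ y, Z.T @ y])
sig_u2, sig_e2 = 1.0, 1.0                        # starting values
for it in range(500):
    k = sig_e2 / sig_u2                          # ratio added to the diagonals
    C = np.block([[X.T @ X, X.T @ Z],
                  [Z.T @ X, Z.T @ Z + k * np.eye(q)]])
    Cinv = np.linalg.inv(C)
    sol = Cinv @ rhs                             # MME solutions [b_hat, u_hat]
    u_hat = sol[1:]
    # EM updates: residual from y'y - sol'rhs; random component from
    # u'u plus sig_e2 times the trace of the uu-block of the inverse
    new_e2 = (y @ y - sol @ rhs) / (N - rank_X)
    new_u2 = (u_hat @ u_hat + new_e2 * np.trace(Cinv[1:, 1:])) / q
    done = abs(new_u2 - sig_u2) < 1e-10 and abs(new_e2 - sig_e2) < 1e-10
    sig_u2, sig_e2 = new_u2, new_e2
    if done:
        break

print(sig_u2, sig_e2)
```

Note how each round re-forms the coefficient matrix with the new ratio, exactly as described above; for this small balanced problem the full inverse is cheap, which is precisely what is not true for realistic animal models.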

Another major problem is the requirement for the trace of a submatrix of the inverse of the MME coefficient matrix. In practice, the inverse of the MME is not explicitly derived, and solutions to the MME are obtained by Gauss-Seidel iteration. Thus, there have been many attempts to approximate the value of this trace, and most have not been suitable.

The above problems have led to attempts to discover other computational algorithms for REML. One of those has been Derivative Free REML which is described in the next section.

Derivative Free Method

Derivative Free REML was proposed by Smith and Graser (1986) and Graser, Smith and Tier (1987), and has been expanded upon by Meyer (1987, 1991), who has developed a set of programs for computing estimates of variance components for a whole range of univariate and multivariate models. The description given below is a much simplified version of the method, intended to give a basic understanding of the technique.

In essence, one can imagine an s-dimensional array containing the values of the likelihood function for every possible set of values of the ratios of the components to the residual variance. The technique is to search this array and find the set of ratios for which the likelihood function is maximized. There is more than one way to conduct this search. Care must be taken to find the 'global' maximum rather than one of possibly many 'local' maxima. At the same time, the number of likelihood evaluations must be kept as small as possible.

One begins by looking at alternative forms of the log likelihood function given by L2, that is,

Note that

so that

The computations are achieved by constructing the following matrix,

then by Gaussian elimination of one row at a time, the sum of the log of the non-zero pivots (using the same ordering for each evaluation of the likelihood) gives

Gaussian elimination, using sparse matrix techniques, requires less computing time than inverting the coefficient matrix of the mixed model equations. The ordering of factors within the equations can be critical to the computational process, and some experimentation may be necessary to determine the best ordering. The likelihood function can be evaluated without calculating solutions to the mixed model equations, without inverting the coefficient matrix, and without computing any of the trace terms. The formulations for more general models and multiple-trait models are more complex, but follow the same ideas.
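The 'sum of the log of the non-zero pivots' computation can be sketched as follows. This is a dense toy version with a small assumed matrix; Meyer's programs use sparse matrix techniques on much larger systems:

```python
import numpy as np

def logdet_by_gaussian_elimination(M):
    # Accumulate the log of each pivot during Gaussian elimination.
    # For a positive definite coefficient matrix all pivots are positive,
    # so no row interchanges are needed and the sum of log pivots is log|M|.
    A = np.array(M, dtype=float)
    n = A.shape[0]
    logdet = 0.0
    for i in range(n):
        pivot = A[i, i]
        logdet += np.log(pivot)
        A[i + 1:] -= np.outer(A[i + 1:, i] / pivot, A[i])
    return logdet

M = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(logdet_by_gaussian_elimination(M))   # log(18), agrees with slogdet
```

Because the pivots depend on the elimination order, the same ordering must be used for every likelihood evaluation, as the notes point out.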

The general technique is best described with a small example. Below are ordinary least squares equations for a model with three factors, two of which (A and B) are random. The total sum of squares is .

EM Algorithm

To demonstrate the EM algorithm, let and be the starting values of the ratios for factors A and B, respectively, which must be added to the diagonals of the above equations. The solution vector is

Then

and from the inverse of the coefficient matrix,

which give rise to the following estimates,

New ratios are formed as

and

and these are used to form the mixed model equations again, followed by new solutions and traces, and so on, until the estimated ratios and the prior values of the ratios are equal. The estimates appear to converge to

DF REML

Begin by fixing the ratio for factor B at 10 and letting the ratio for factor A take on values of 5, 10, 20, and 30. Using L2 to evaluate the likelihood gives the following results:

  Ratio for factor A        L2
           5            -207.4442
          10            -207.1504
          20            -206.9822
          30            -206.9274
To find the value of the ratio for factor A that maximizes L2 when the ratio for factor B is fixed at 10, let

then

From this a prediction equation for L2 is

This equation can be differentiated with respect to the ratio and then equated to zero to find the value of the ratio that maximizes the likelihood. This gives

The search for the global maximum continues by keeping the ratio for factor A at 25.4612 and looking at four new values of the ratio for factor B, which give the following results.

  Ratio for factor B        L2
           2            -206.2722
           3            -206.2055
           4            -206.2473
           5            -206.3426
Applying the quadratic regression to these points gives

The search continues by alternating between fixing one ratio or the other and narrowing the field of search. Keeping the ratio for factor B at 3.19, the possible values for the ratio for factor A might now be 24, 25, 26, and 27.
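Each quadratic-regression step of this search can be sketched as below. The points fed to the function here come from a made-up concave function with a known peak, not from the example above, and whether the fit is done on the ratio itself or on a transformation of it is a detail of the particular implementation:

```python
import numpy as np

def quadratic_maximizer(x, L):
    # Fit L = b0 + b1*x + b2*x^2 by least squares, then set the
    # derivative b1 + 2*b2*x to zero: the maximizer is -b1 / (2*b2).
    Xmat = np.column_stack([np.ones_like(x), x, x * x])
    b0, b1, b2 = np.linalg.lstsq(Xmat, L, rcond=None)[0]
    assert b2 < 0, "fitted quadratic must be concave to have a maximum"
    return -b1 / (2.0 * b2)

# illustrative likelihood evaluations from a concave curve peaking at x = 3
x = np.array([1.0, 2.0, 4.0, 5.0])
L = -(x - 3.0) ** 2
print(quadratic_maximizer(x, L))   # close to 3.0
```

Alternating this one-dimensional update between the two ratios reproduces the search pattern described in the example.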

To ensure that a global maximum has been found, the entire process could be re-started with vastly different starting values for the ratios, such as fixing one ratio well away from its current estimate and letting the other take on values of 40, 50, 60, and 70.

The more components there are to estimate, the more evaluations of the likelihood will be needed, and the more probable it is that convergence will be to a local maximum rather than the global maximum. Another problem is that a component may converge towards zero. There has been much work on searching algorithms for DF REML.

Please refer to the literature for the specification of the log likelihood function for particular models and situations. Also refer to the work of Boldman and Van Vleck (1993), who found a simplification of Meyer's algorithms that reduced computational time by several orders of magnitude. Even so, DF REML has been applied mainly to fairly small data sets and can take considerable time to find estimates even for these. The available software may also not be able to handle particular models, and the user should be aware of these possible limitations.

Algorithms to Maximize a Nonlinear Function

Hofer (1998; Variance component estimation in animal breeding: a review. Journal of Animal Breeding and Genetics 115:247-266) gives a general description of the algorithms used with REML. A gradient (the change in the log likelihood resulting from a change in the parameters) determines the direction of the search in the next step. The gradient is modified by a matrix and by a scalar, c. The new estimate is given by

where

evaluated at .

For the Newton-Raphson algorithm, evaluated at ,

For the Method of Scoring algorithm, evaluated at ,

The traces in the Newton-Raphson algorithm and in the Method of Scoring algorithm both involve, in complex forms, elements of the inverse of the mixed model equations, which makes them computationally difficult. Johnson and Thompson (1995) suggested averaging the observed and expected information matrices and coined the term Average Information REML, or AI REML. They defined the modifying matrix as the inverse of the following average information matrix, which does not require elements of the inverse of the mixed model equations.

These last quantities are calculated by augmenting some previous equations as follows:

where

There would be one such vector for each variance component to be estimated. All of them may be added to the above matrix at one time, before Gaussian elimination is applied. The calculation of each vector involves solutions to the mixed model equations.

Hofer (1998) compared the various algorithms for various models. The AI REML algorithm seemed to be advantageous over the other methods. DF REML was advantageous over EM REML when the number of parameters to be estimated was small, but EM REML became more efficient than DF REML when the number of parameters was large, as in multiple-trait models. Animal breeders, in general, need to be able to access and utilize REML programs written by experts, in order to save the time of writing their own programs and to avoid the duplication of effort needed to get new programs working correctly and efficiently. However, for some models there may be no programs available, and then writing your own programs becomes necessary.

Bayesian Estimation

These notes are based on Sorensen (1998), from a course on "Gibbs Sampling in Quantitative Genetics" given at AGBU, Armidale, NSW, Australia. Gibbs sampling, which derives estimates of variance components from the joint posterior distribution, appears to be a very powerful tool to use with animal models. Computationally, one needs only a program that calculates solutions to Henderson's mixed model equations, very good random number generators, and the computer time to process very large numbers of samples.

Begin with the simple single trait animal model. That is,

y = Xb + Za + e.
The Bayesian approach is to derive the joint posterior distribution by application of Bayes theorem. If theta is a vector of random variables and y is the data vector, then

p(theta, y) = p(theta) p(y | theta) = p(y) p(theta | y).
Re-arranging gives

p(theta | y) = p(theta) p(y | theta) / p(y), which is proportional to p(theta) p(y | theta).
In terms of the simple animal model, theta includes b, a, sigma_a^2, and sigma_e^2. The conditional distribution of y given theta is

y | b, a, sigma_e^2  ~  N(Xb + Za, I sigma_e^2).
Prior distributions need to be assigned to the components of theta, and these need to be multiplied together and by the conditional distribution of y given theta. For the fixed effects vector, b, there is little prior knowledge about the values that its elements might take. This is represented by assuming

p(b) proportional to a constant.
For a, the vector of additive genetic values, quantitative genetics theory suggests that they follow a normal distribution, i.e.

a | A, sigma_a^2  ~  N(0, A sigma_a^2),

where q is the length of a and A is the additive relationship matrix. A natural estimator of sigma_a^2 is S2a, which has

S2a = a'A^(-1)a / q.

Multiplying both sides by q and dividing by sigma_a^2 gives

q S2a / sigma_a^2  =  a'A^(-1)a / sigma_a^2  ~  Chi-square(q),

which is a scaled, inverted Chi-square distribution, written as

sigma_a^2 | va, S2a  ~  va S2a inverse-Chi-square(va),
where va and S2a are hyperparameters, with S2a being a prior guess about the value of the additive genetic variance and va being the degrees of belief in that prior value. Usually q is much larger than va, and therefore the data provide nearly all of the information about the additive genetic variance. Similarly, for the residual variance,

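A quick way to see what a scaled, inverted Chi-square prior implies is to draw from it and check the mean. The values of va and S2a below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
va, S2a = 10.0, 4.0       # degrees of belief and prior value (assumed)

# a draw from va*S2a*inverse-Chi-square(va) is va*S2a divided by
# a Chi-square(va) deviate
draws = va * S2a / rng.chisquare(va, size=200_000)

# the mean of a scaled inverted Chi-square is va*S2a/(va - 2) for va > 2
print(draws.mean())        # close to 10*4/8 = 5
```

Note that the prior mean (5 here) is pulled above the prior guess S2a = 4, a known property of this distribution family.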
Now form the joint posterior distribution as

which can be written as

In order to implement Gibbs sampling, all of the fully conditional posterior distributions (one for each component of theta) need to be derived from the above joint posterior distribution. Let

Now we adopt a new notation: let

where theta_i is a scalar representing just one element of the vector theta, and theta_-i is a vector representing all of the other elements except theta_i. Similarly, the other arrays of the model can be partitioned in the same manner as

In general terms, the conditional posterior distribution of theta_i is

where

Then

for

Also,

where

for each such element. The conditional posterior distributions for the variances are

for the additive genetic variance, and

for the residual variance.

Computational Scheme

Gibbs sampling is much like Gauss-Seidel iteration, except that

1.
when a new solution is calculated, a random amount is added to it, based upon the variance of its conditional posterior distribution, before proceeding to the next equation, and
2.
after all equations have been processed, new values of the variances are calculated and a new variance ratio is determined prior to beginning the next round.
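These two steps can be sketched in a few lines of Python for a toy version of the model. This is not the five-animal example below: the data, the priors, and the simplification of using an identity matrix in place of the relationship matrix A are all assumptions made to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(7)

# toy data for five animals with one overall mean; A = I is assumed here
y = np.array([3.0, 5.0, 6.0, 4.0, 7.0])
n = len(y)
W = np.hstack([np.ones((n, 1)), np.eye(n)])      # [X Z]
va, ve, S2a, S2e = 10.0, 10.0, 0.5, 7.0          # priors (assumed values)
sig_a2, sig_e2 = S2a, S2e
theta = np.zeros(n + 1)                          # [mean, a1..a5]
keep_a2, keep_e2 = [], []
for rnd in range(2000):
    k = sig_e2 / sig_a2
    C = W.T @ W + np.diag([0.0] + [k] * n)       # MME coefficient matrix
    rhs = W.T @ y
    for i in range(n + 1):
        # Gauss-Seidel style update plus a random amount (step 1)
        mean_i = (rhs[i] - C[i] @ theta + C[i, i] * theta[i]) / C[i, i]
        theta[i] = mean_i + rng.normal() * np.sqrt(sig_e2 / C[i, i])
    a = theta[1:]
    e = y - W @ theta
    # new variances from scaled inverted Chi-square conditionals (step 2)
    sig_a2 = (a @ a + va * S2a) / rng.chisquare(n + va)
    sig_e2 = (e @ e + ve * S2e) / rng.chisquare(n + ve)
    keep_a2.append(sig_a2)
    keep_e2.append(sig_e2)

print(np.mean(keep_a2[500:]), np.mean(keep_e2[500:]))
```

Each round re-forms the equations with the new ratio k before the next pass, mirroring the scheme described above; the first 500 rounds are dropped here as a crude burn-in.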
Suppose we have the following MME for five animals:

where , and

Let va = ve = 10, and let the prior values of the variances be such that the starting variance ratio is k = 14. Let RND represent a random normal deviate from a random normal deviate generator, and let CHI(idf) represent a random Chi-square variate with idf degrees of freedom from a random Chi-square variate generator. Below are descriptions of the calculations in the first two rounds.

Round 1

Now calculate the residuals and their sum of squares in order to obtain a new residual variance.

The new residual variance is

The additive genetic variance requires calculation of a'A^(-1)a using the a-values obtained above, which gives

Then

The new variance ratio becomes

k = 95.6382 / 8.0605 = 11.8650.

Round 2

Round 2 begins by re-forming the MME using the new variance ratio. The equations have changed to

Now the process is repeated using the latest values of the solutions and the new variances.

The residuals and their sum of squares are

The new residual variance is

The new variance ratio becomes

k = 76.1377 / 6.2343 = 12.2127.

Continue taking samples for thousands of rounds.

Burn In Period

Generally, the Gibbs chain of samples does not immediately converge to give samples from the joint posterior distribution. A burn-in period is usually allowed, during which the sampling process moves from the initial values of the parameters to values from the joint posterior distribution. The length of the burn-in period is usually judged by visually inspecting a plot of sample values across rounds. In the example figure below, the burn-in period would likely be 2000 rounds. This figure represents an animal model analysis of over 323,000 records on slightly over 1 million animals. The trait was one of the type traits in the Holstein breed. The samples generated during the burn-in period are discarded from further use.

Estimates

Each round of Gibbs sampling is dependent on the results of the previous round. Depending on the total number of observations and parameters, one round may be positively correlated with the next twenty to three hundred rounds. Even though samples are correlated, the user can determine the effective number of samples by calculating lag correlations. Suppose a total of 12,000 samples (after removing the burn-in rounds) gave an effective number of samples equal to 500. This implies that roughly every 24th sample should be uncorrelated.
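One common way to turn lag correlations into an effective number of samples is sketched below. The 0.05 cut-off and the autoregressive test chain are arbitrary choices for illustration, not a prescription from the notes:

```python
import numpy as np

def effective_sample_size(x, max_lag=1000):
    # ESS = n / (1 + 2 * sum of lag autocorrelations), with the sum
    # truncated at the first lag where the correlation becomes negligible
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    var = x @ x / n
    total = 0.0
    for lag in range(1, min(max_lag, n)):
        rho = (x[:-lag] @ x[lag:]) / (n * var)
        if rho < 0.05:
            break
        total += rho
    return n / (1.0 + 2.0 * total)

# illustrative chain: AR(1) samples with strong round-to-round correlation
rng = np.random.default_rng(5)
chain = np.zeros(12000)
for t in range(1, 12000):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()
print(effective_sample_size(chain))   # far fewer than 12000
```

For a chain with this much autocorrelation the effective number of samples is only a few hundred, which matches the 12,000-to-500 reduction used in the example.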

An overall estimate of a parameter can be obtained by averaging all of the 12,000 samples (after the burn-in). However, to derive a confidence interval or to plot the distribution of the samples or to calculate the standard deviation of the estimate, the variance of the 500 independent samples should be used.

Influence of the Priors

In the small example, va = ve = 10, whereas N was only 5. Thus, the prior values of the variances received more weight than the information coming from the data. This is probably appropriate for this small example, but if N were 5,000,000, then the influence of the priors would be next to none. The amount of influence of the priors is not directly determined by the ratio of the degrees of belief to N; even though va = ve, the influence of S2a could be greater than that of S2e. Schenkel (1998) applied some methods recommended to him by Gianola in his thesis work.

Long Chain or Many Chains?

Early papers on MCMC (Markov chain Monte Carlo) methods recommended running many chains of samples and then averaging the final values from each chain, to ensure independence of the samples. That philosophy seems to have changed to a recommendation of one single long chain. For animal breeding applications this could mean 100,000 samples or more. If a month is needed to run 50,000 samples, then perhaps three chains of 50,000 would be preferable; if only an hour is needed for 50,000 samples, then 1,000,000 samples would not be difficult to run. The main conclusion is that it is very difficult to know if or when a Gibbs chain has converged to the joint posterior distribution. This is currently an active area of statistical research.


Larry Schaeffer
2000-03-17