This LaTeX document is available as postscript or as Adobe PDF.

L. R. Schaeffer, March 1999

**Introduction**

The quality of any statistical analysis is best judged by the model that describes the data. A model represents the sampling nature of the data and reflects the biology of the problem. There are three conceptual levels of models.

- A *true* model describes the data perfectly, leaving no unexplained variation. The *true* model is never known exactly and may not necessarily be a linear model.
- An *ideal* model is as close to the true model as possible. The *ideal* model should be used for the analysis, but often cannot be utilized for various reasons.
- An *operational* model is a simplified version of the ideal model, and is used in the analysis.

A good operational model is derived from the ideal model. Given the data and resource limitations under which the researcher must function, the operational model is a simplification of the ideal model. If too many simplifications have to be made, then perhaps an analysis may not be worthwhile. The assumptions to go from the ideal model to the operational model should be known, and thus the quality of the operational model can be judged.

**Observations**

The observation vector contains elements resulting from measurements, either subjective or objective, on the experimental units (usually animals) under study. The elements in the observation vector are random variables that have a multivariate distribution, and if the form of the distribution is known, then advantage should be taken of that knowledge.

**Factors**

*Factors* refer to variables, either discrete or continuous, which
may influence or are related to the elements in the observation vector.
For example, the milk yield of a dairy cow is known to be influenced by
her age at calving, the season in which she calved, her genetic potential,
and the number of days nonpregnant, to list a few. Model building
requires that all useful factors be identified.

Discrete factors usually have *classes* or *levels*. For example,
age at calving could have four levels (e.g. 20 to 24 months,
25 to 28 months, 29 to 32 months, and 33 months or greater). Hence
an analysis of data would provide estimates of differences in milk yields
of cows in the various age levels. Alternatively, the effect of age
might be considered as a covariable with a linear and quadratic effect
upon milk yield, and the regression coefficients relating age and
age squared to milk yields would be estimated in the analysis.
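
The two treatments of age described above lead to different columns in the design matrix. The following is a minimal sketch using NumPy; the ages are hypothetical values chosen only for illustration, and the helper `age_class` is an assumed name, not something from the notes.

```python
import numpy as np

# Hypothetical ages at calving in months (illustrative values, not from the notes).
ages = np.array([22, 26, 30, 35, 24, 28])

def age_class(age_months):
    """Return the 0-based index of the age class described in the text."""
    if age_months <= 24:
        return 0  # 20 to 24 months
    if age_months <= 28:
        return 1  # 25 to 28 months
    if age_months <= 32:
        return 2  # 29 to 32 months
    return 3      # 33 months or greater

# Option 1: discrete factor, one indicator column per age level.
classes = np.array([age_class(a) for a in ages])
X_classes = np.zeros((len(ages), 4))
X_classes[np.arange(len(ages)), classes] = 1.0

# Option 2: covariable, one column for age and one for age squared.
X_covariate = np.column_stack([ages, ages ** 2]).astype(float)

print(X_classes)
print(X_covariate)
```

With the discrete coding the analysis estimates one effect per age level; with the covariable coding it estimates two regression coefficients.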

Some factors may have a special interest to the researcher while other
factors need to be included in the model in order to reduce the
residual variation. For example, a researcher could be interested in
the effects of various levels of application of growth hormones to beef
cattle, but the model must also include the effects of age of the
animal, the sex, the location, diet, and breed. The latter group of
effects is often referred to as *nuisance factors*. Nuisance
factors cannot be ignored or omitted from the model because this
could drastically alter the interpretation of results for the factors
of main interest.

**Fixed and Random Factors**

In the traditional "frequentist" approach, fixed and random factors need to be distinguished. In a Bayesian approach there is no such distinction between factors. Both approaches will be used in this course.

Fixed factors are factors in which the classes comprise all of the possible classes of interest that could be observed. For example, the sex of an animal is either male, female, sterilized male, or sterilized female. If the number of classes in a factor is small and confined to this number even if conceptual resampling were performed an infinite number of times, then the factor is likely fixed. Other examples are age classes, lactation number, management system, cage number, and breed class. Usually if the sampling were to be repeated a second time, those factors which maintain the same classes between the two samplings would be fixed factors. For example, a growth trial on pigs using two diets would probably need to use the same housing facilities, the same age groups of pigs, and the same diets, but the individual pigs would necessarily have to be new animals because an animal could not go through the same growth phase a second time in its life. Pig effects would be considered a random factor while the other effects would be fixed.

Random factors are factors whose levels are considered to be drawn randomly from an infinitely large population of levels. In the previous pig experiment, pigs were considered random because the pig population of the world is large enough to be considered infinitely large, and the group involved in that experiment was a random sample from that population. In actual fact, however, the pigs in that experiment were likely sampled from those relatively few pigs that were available at the time the trial started, but they are still considered to be a random factor because if the experiment were to be repeated, there would likely be a completely different group of pigs involved.

Another way to determine if a factor is fixed or random is to know how the results will be used. In a nutrition trial the results infer something about the diets in the trial. The diets are specific and no inferences should be made about other diets not tested in the experiment. Hence diet effects would be a fixed factor. In contrast, if animal effects were in the model, inferences about how any animal might respond to a specific diet may need to be made. There should not be anything peculiar about the animal on the trial that would nullify that inference. Animal effects would be a random factor.

In general, a few questions need to be answered to make the correct choice of fixed or random factor designation. Some of the questions are

1. How many levels of the factor are in the model? If small, then perhaps this is a fixed factor. If large, then perhaps this is a random factor.
2. Is the number of levels in the population large enough to be considered infinite? If yes, then perhaps this factor is random.
3. Would the same levels be used again if the experiment were to be repeated a second time? If yes, then perhaps this factor is fixed.
4. Are inferences to be made about levels not included in the experiment? If yes, then perhaps this factor should be random.
5. Were the levels of a factor determined in a nonrandom manner? If yes, then perhaps this factor should be treated as fixed.

In a Bayesian context, a prior distribution needs to be assumed for each of the factors. Random factors might typically be assumed to have a Normal distribution with a particular mean and variance. For fixed factors, a uniform distribution may be assumed, or a prior distribution whose density is proportional to a constant. Even the variances need to have an assumed prior distribution. The prior distributions are then combined with the distribution of the data to arrive at a posterior distribution from which inferences may be made.
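
Written out, and only as a sketch under notation assumed here rather than taken from the notes ($\mathbf{y}$ for the observations, $\mathbf{b}$ for the fixed factors, $\mathbf{u}$ for the random factors with variance $\sigma^2_u$, and $\sigma^2_e$ for the residual variance), the combination of priors and data follows Bayes' theorem:

$$
p(\mathbf{b}, \mathbf{u}, \sigma^2_u, \sigma^2_e \mid \mathbf{y}) \;\propto\; p(\mathbf{y} \mid \mathbf{b}, \mathbf{u}, \sigma^2_e)\; p(\mathbf{u} \mid \sigma^2_u)\; p(\mathbf{b})\; p(\sigma^2_u)\; p(\sigma^2_e),
$$

with, for example, $p(\mathbf{b}) \propto \text{constant}$ and $\mathbf{u} \mid \sigma^2_u \sim N(\mathbf{0}, \mathbf{I}\sigma^2_u)$. The priors for the two variances are left unspecified here because the text does not name them.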

**Models**

Only linear models are discussed. A linear model contains
a set of factors which additively affect the observations, but a variable
within a factor may represent, for example, a squared term. Linear
models are adequate in most biological circumstances. This does not
imply that nonlinear models are unimportant. Nonlinear relationships
may often be approximated by a linear model. However, if a nonlinear
model gives a better *ideal* model than a linear model, then the
nonlinear model should be utilized. Texts that deal with nonlinear
model methods should be consulted.

A linear model, in the traditional sense, is composed of three parts:

1. The equation.
2. Expectations and Variance-Covariance matrices of random variables.
3. Assumptions, restrictions, and limitations.

**The Equation**

The equation of the model defines the factors that may have an effect on the
observed trait. A matrix formulation of a general model equation is
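as follows:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e},$$

where

- $\mathbf{y}$ is the vector of observed values of the trait,
- $\mathbf{b}$ is a vector of factors, collectively known as fixed effects,
- $\mathbf{u}$ is a vector of factors known as random effects,
- $\mathbf{e}$ is a vector of residual terms, also random,
- $\mathbf{X}$ and $\mathbf{Z}$ are known matrices, commonly known as design matrices, that describe the precise relationship between the elements of $\mathbf{b}$ and $\mathbf{u}$ with those of $\mathbf{y}$.

**Expectations and VCV Matrices**

In general terms, the expectations are

$$
E\begin{pmatrix} \mathbf{y} \\ \mathbf{u} \\ \mathbf{e} \end{pmatrix}
= \begin{pmatrix} \mathbf{X}\mathbf{b} \\ \mathbf{0} \\ \mathbf{0} \end{pmatrix},
$$

and the variance-covariance matrices are

$$
Var\begin{pmatrix} \mathbf{u} \\ \mathbf{e} \end{pmatrix}
= \begin{pmatrix} \mathbf{G} & \mathbf{0} \\ \mathbf{0} & \mathbf{R} \end{pmatrix},
$$

where $\mathbf{G}$ and $\mathbf{R}$ are general square matrices assumed to be nonsingular and positive definite, and the elements of which are assumed known. Also,

$$
Var(\mathbf{y}) = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}, \qquad
Cov(\mathbf{y}, \mathbf{u}) = \mathbf{Z}\mathbf{G}, \qquad
Cov(\mathbf{y}, \mathbf{e}) = \mathbf{R}.
$$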

**Assumptions and Limitations**

The third part of a model includes items that are not apparent in parts 1 and 2; for example, information about the data or the manner in which the data were collected. Were the animals randomly selected or did they have to meet some minimum standards? Did the data arise from many environments, at random, or were the environments specially chosen?

In this part of the model the differences between the *operational*
model and the *ideal* model should be listed, and the possible
effects of those differences on the analysis should be explained.
Such a comparison is frequently overlooked or ignored, but part 3 of
the model contains the most important information for assessing the
quality of the analysis.

A linear model is not complete unless all three parts of the model are present. Statistical procedures and strategies for data analysis are determined only after a complete model is in place.

**Examples of Models**

**Beef Calf Weights**

Suppose we have weights on beef calves taken at 200 days of age as shown in the table below.

| Males | Females |
|-------|---------|
| 198   | 187     |
| 211   | 194     |
| 220   | 202     |
|       | 185     |

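The **equation** of the model might be

$$y_{ij} = s_i + c_{ij} + e_{ij},$$

where $y_{ij}$ is the 200-day weight of calf $j$ of sex $i$, $s_i$ is the effect of sex $i$ (fixed), $c_{ij}$ is the effect of the calf (random), and $e_{ij}$ is a residual effect (random).

The **expectations** and **variance-covariance matrices** of the
random factors are

$$E(c_{ij}) = 0, \qquad E(e_{ij}) = 0, \qquad Var(c_{ij}) = \sigma^2_c, \qquad Var(e_{ij}) = \sigma^2_e.$$

Additionally,

$$Cov(c_{ij}, c_{i'j'}) = 0, \qquad Cov(e_{ij}, e_{i'j'}) = 0, \qquad Cov(c_{ij}, e_{i'j'}) = 0,$$

where the first two covariances refer to two different calves.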

The **assumptions and limitations** of the model could be listed
as follows:

1. All calves are assumed to be of the same breed.
2. All calves were reared in the same environment and time period.
3. All calves were from dams of the same age (e.g. 3 yr olds).
4. Maternal effects do not influence 200 day weights.
5. Calf effects contain all genetic effects.
6. All weights were accurately recorded (i.e. not guessed).

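Ordering the observations by males, then females, the matrix
representation of the model would be

$$
\begin{pmatrix} 198 \\ 211 \\ 220 \\ 187 \\ 194 \\ 202 \\ 185 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} s_1 \\ s_2 \end{pmatrix}
+ \mathbf{Z}\mathbf{c} + \mathbf{e},
$$

where $s_1$ and $s_2$ are the male and female effects, $\mathbf{c}$ and $\mathbf{e}$ are the vectors of calf and residual effects, and $\mathbf{Z} = \mathbf{I}$ of order 7. Also,

$$\mathbf{G} = Var(\mathbf{c}) = \mathbf{I}\sigma^2_c \qquad \text{and} \qquad \mathbf{R} = Var(\mathbf{e}) = \mathbf{I}\sigma^2_e,$$

where $\sigma^2_c$ and $\sigma^2_e$ are the calf and residual variances.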
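
The matrix form of the beef calf example can be checked numerically. The following is a minimal NumPy sketch, not the author's own computation. It assumes that the unpaired weight 185 in the table belongs to a female calf, and the two variance values are hypothetical numbers chosen only to make the matrices concrete.

```python
import numpy as np

# 200-day weights from the table, ordered males first, then females.
# Assumption: the unpaired value 185 is a female record.
y = np.array([198, 211, 220, 187, 194, 202, 185], dtype=float)
n_males = 3

# X: one column per sex (fixed factor); each row contains a single 1.
X = np.zeros((7, 2))
X[:n_males, 0] = 1.0
X[n_males:, 1] = 1.0

# Z: each calf has exactly one record, so Z is an identity matrix of order 7.
Z = np.eye(7)

# Hypothetical variances, for illustration only (not given in the notes).
var_calf, var_resid = 60.0, 40.0
G = var_calf * np.eye(7)
R = var_resid * np.eye(7)

# Variance of the observations implied by the model.
V = Z @ G @ Z.T + R

# Generalized least squares estimates of the two sex effects.
Vinv_X = np.linalg.solve(V, X)
Vinv_y = np.linalg.solve(V, y)
b_hat = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)
print(b_hat)
```

Because every calf has a single record and both variance matrices are multiples of the identity, `V` is also a multiple of the identity, so the generalized least squares estimates reduce to the simple means of each sex.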

**Sire Model**

Suppose we have progeny data of three sires on temperament scores (on a scale of 1 to 40) taken at milking time as shown below:

| CG | Age | Sire | Score |
|----|-----|------|-------|
| 1  | 1   | 1    | 17    |
| 1  | 2   | 2    | 29    |
| 1  | 1   | 2    | 34    |
| 1  | 2   | 3    | 16    |
| 2  | 2   | 3    | 20    |
| 2  | 1   | 3    | 24    |
| 2  | 2   | 1    | 13    |
| 2  | 1   | 1    | 18    |
| 2  | 2   | 2    | 25    |
| 2  | 1   | 2    | 31    |

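The **equation** of the model might be

$$y_{ijkl} = c_i + a_j + s_k + e_{ijkl},$$

where $y_{ijkl}$ is the temperament score of daughter $l$ of sire $k$ in age group $j$ and contemporary group (CG) $i$, $c_i$ is the effect of contemporary group $i$, $a_j$ is the effect of age group $j$, $s_k$ is the effect of sire $k$ (random), and $e_{ijkl}$ is a residual effect (random).

The **expectations and variance-covariance matrices** are

$$E(s_k) = 0, \qquad E(e_{ijkl}) = 0, \qquad Var(s_k) = \sigma^2_s, \qquad Var(e_{ijkl}) = \sigma^2_{e_i}.$$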

Thus, the residual variance differs between contemporary groups. The sire variance represents one quarter of the additive genetic variance because all progeny are assumed to be half-sibs (i.e. from different dams). The sires are assumed to be unrelated.

The **assumptions and limitations** of the model could be

1. Daughters were approximately in the same stage of lactation when temperament scores were taken.
2. The same person assigned temperament scores for all daughters.
3. The age groupings were appropriate.
4. Sires were unrelated to each other.
5. Sires were mated randomly to dams (with respect to milking temperament or any correlated traits).
6. Only one offspring per dam.
7. Only one score per daughter.
8. No preferential treatment towards particular daughters.

The reader is asked to set out the matrix formulation of the model, including the design matrices.
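
One way to check an attempt at this exercise is to build the matrices from the data table. The sketch below uses NumPy and rests on several assumptions that are not stated in the notes: CG and age group are treated as fixed, sire as random, and the variance values are hypothetical. The helper `indicator` is an assumed name.

```python
import numpy as np

# Records from the table: (CG, age group, sire, score).
records = [
    (1, 1, 1, 17), (1, 2, 2, 29), (1, 1, 2, 34), (1, 2, 3, 16),
    (2, 2, 3, 20), (2, 1, 3, 24), (2, 2, 1, 13), (2, 1, 1, 18),
    (2, 2, 2, 25), (2, 1, 2, 31),
]
cg, age, sire, score = (np.array(col) for col in zip(*records))

def indicator(levels, n_levels):
    """Build a 0/1 design matrix with one column per level (levels are 1-based)."""
    M = np.zeros((len(levels), n_levels))
    M[np.arange(len(levels)), levels - 1] = 1.0
    return M

# Assumption: CG and age group are fixed factors, sire is a random factor.
X = np.hstack([indicator(cg, 2), indicator(age, 2)])
Z = indicator(sire, 3)
y = score.astype(float)

# Hypothetical variances, for illustration only (not given in the notes).
var_sire = 4.0
var_resid_by_cg = np.array([30.0, 50.0])  # one residual variance per CG

G = var_sire * np.eye(3)                  # sires assumed unrelated
R = np.diag(var_resid_by_cg[cg - 1])      # residual variance depends on CG

print(X.shape, Z.shape, G.shape, R.shape)
```

Under this coding the columns of `X` are linearly dependent, because the two CG columns and the two age columns each sum to a column of ones, so a restriction or a generalized inverse would be needed before solving for the fixed effects.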

**Feed Intake**

Feed intake can be measured on individual pigs which might be
modeled as

where

All pigs were purebred Landrace. Two males and two females were taken randomly from each litter for feed intake measurements. The model assumes that there are no sex differences in feed intake and that maternal effects have no influence. All measurements were taken at approximately the same age of the pigs within a controlled environment at one location. Feed and handling of pigs were therefore uniform for all pigs within a herd-year-month subclass. Litters were related through the use of boars from artificial insemination. Feed intake was the average of 3 daily intakes during the week, and weekly averages were available for 5 consecutive weeks. Growth was assumed to be linear during the test period.

The reader should try to identify weaknesses in the above model and recommend changes or further assumptions that are missing. Note that all three parts of the model have been given without labelling each part.
