This LaTeX document is available as postscript or asAdobe PDF.
A categorical trait is one which is expressed phenotypically as a member of one of m categories. When m=2, then the trait is an "all-or-none" trait. Most disease traits are binomial in nature, where the animal is either diseased or not diseased. Traits with more than one category are calving ease or conformation traits. The categories are arranged in a delineated sequence from one extreme expression to the opposite extreme expression. Calving ease, for example, the categories are (1) unassisted calving, (2) slight assistance, (3) hard pull, or (4) caesarian section required. Rump angle, for example, has nine categories where the pins range from too low to too high with the desirable category being somewhere in the middle.
Categorical traits may be inherited in a polygenic manner with a very large number of genes involved. The underlying susceptibility to a disease trait, calving ease, or type trait may actually be continuous and may follow a normal distribution. The underlying continuous scale is known as the liability scale. On this liability scale is a threshold point (or points) where above this threshold the animal expresses the disease phenotype, and below the threshold point the animal does not express the disease. The liability scale is only conceptual and cannot be observed.
Two papers were published in 1983 by Gianola and Foulley, and 1984 by Harville and Mee both describing the same procedure for the analysis of categorical data. The reader is referred to these papers for the theoretical details. The derivation is Bayesian in nature. The model employed is known as a "threshold" model. The details of this procedure are described in these notes.
The general linear model is
There are various quantities which need to be computed repeatedly in the analysis, and these are based on normal distribution functions.
Data should be arranged by smallest subclasses. Consider the calving ease data from the paper by Gianola and Foulley (1983).
Herd- | Age of | Sex of | Sire of | Category | Total | ||
Year | Dam(yr) | Calf | Calf | 1 | 2 | 3 | |
1 | 2 | M | 1 | 1 | 0 | 0 | 1 |
1 | 2 | F | 1 | 1 | 0 | 0 | 1 |
1 | 3 | M | 1 | 1 | 0 | 0 | 1 |
1 | 2 | F | 2 | 0 | 1 | 0 | 1 |
1 | 3 | M | 2 | 1 | 0 | 1 | 2 |
1 | 3 | F | 2 | 3 | 0 | 0 | 3 |
1 | 2 | M | 3 | 1 | 1 | 0 | 2 |
1 | 3 | F | 3 | 0 | 1 | 0 | 1 |
1 | 3 | M | 3 | 1 | 0 | 0 | 1 |
2 | 2 | F | 1 | 2 | 0 | 0 | 2 |
2 | 2 | M | 1 | 1 | 0 | 0 | 1 |
2 | 3 | M | 1 | 0 | 0 | 1 | 1 |
2 | 2 | F | 2 | 1 | 0 | 1 | 2 |
2 | 3 | M | 2 | 1 | 0 | 0 | 1 |
2 | 2 | F | 3 | 0 | 1 | 0 | 1 |
2 | 3 | M | 3 | 0 | 0 | 1 | 1 |
2 | 2 | M | 4 | 0 | 1 | 0 | 1 |
2 | 2 | F | 4 | 1 | 0 | 0 | 1 |
2 | 3 | F | 4 | 2 | 0 | 0 | 2 |
2 | 3 | M | 4 | 2 | 0 | 0 | 2 |
Let n_{jk} represent the number of observations in the k^{th}category for the j^{th} subclass, and n_{j.} represent the marginal total for the j^{th} subclass. There are s=20 subclasses in the above example data, and m=3 categories for the trait.
The system of equations to be solved can be written as follows:
The process is begun by choosing starting values for , , and . Let and , then starting values for can be obtained from the data by knowing the fraction of animals in the first category (i.e. 0.6786), and in the first two categories (i.e. 0.8572). The threshold values that give those percentages are t_{1} = 0.3904, and t_{2} = 0.9563.
Suppose that several iterations have been performed and the
latest solutions are
The following calculations are performed for each subclass from j=1 to s, those for j=1 are shown below:
.5167 | .6082 | .5692 | 1.3058 | 1.5721 | 1.4002 | |
.5447 | .6687 | 1.2380 | .7196 | .6923 | 1.3247 | .7234 |
.6776 | .7334 | .7146 | .6110 | 1.1373 | 1.3723 | ) |
-.3475 | -.4671 | .9848 | 1.0414 | -1.0678 | -.0928 | |
1.0521 | -.5731 | -.9676 | -.6993 | 1.4632 | .9961 | -.7115 |
.6416 | 1.3042 | .4859 | -.4713 | -.8220 | -1.2216 | ) |
Assume that
for the example data.
There are dependencies in
and with the threshold points,
so that 4 restrictions on the solutions are needed. Let the
solutions for ,
H_{1}, A_{1}, and S_{1} be restricted
to zero by removing those columns from .
The
remaining calculations give the following submatrices of the
mixed model equations.
The solutions (change in estimates) to these equations are added to the estimates used to start the iteration to give the new estimates, as shown in the table below.
t_{1} | .000229 | .37529 | .375519 |
t_{2} | .000158 | 1.01135 | 1.011508 |
H_{1} | 0.0 | 0.0 | 0.0 |
H_{2} | -.000046 | .29752 | .297473 |
A_{1} | 0.0 | 0.0 | 0.0 |
A_{2} | -.000013 | -.12687 | -.126883 |
S_{1} | 0.0 | 0.0 | 0.0 |
S_{2} | .000064 | -.39066 | -.390596 |
u_{1} | .000011 | -.08154 | -.081529 |
u_{2} | -.000013 | .06550 | .065487 |
u_{3} | -.000014 | .12280 | .122786 |
u_{4} | .000017 | -.10676 | -.106743 |
As further iterations are completed, then the right hand sides to the equations approach a null vector, so that the solutions, , also approach a null vector.
There could be problems in obtaining convergence, or a valid set of estimates. In theory the threshold points should be strictly increasing. That is, t_{m} should be greater than t_{(m-1)} should be greater than t_{(m-2)} and so on. The solutions may not come out in this order. This could happen when two thresholds are close to each other due to the definition of categories. An option would be to combine two or more of the categories involved in the problem, and to re-analyze with fewer categories.
Another problem could occur when all observations fall into either extreme category. This is known as an "extreme category" problem. Suggestions have been to delete subclasses with this problem, but throwing away data is not a good idea. Other types of analyses may be needed.
Harville and Mee (1984) recommend a REML-like procedure for
estimating the variance ratio of residual to random effects
variances. One of the assumptions in a threshold model is
that the residual variance is fixed to a value of 1. Hence
only that variances of the random effects need to be
estimated. Let a generalized inverse of the coefficient
matrix in the equations be represented as
Using the example data, and the recent estimates (as final values),
The final estimate of the variances are in terms of the
liability or underlying normal distribution scale. If the
number of categories was only two, then heritability can
be converted to the "observed" scale using the formula
Using the estimates from the non-linear system of equations,
the probability of a sire's offspring to fall into each
category can be calculated. Take sire 3 as an example.
First the threshold point for male offspring of 2-yr-old dams
in herd-year 1 needs to be determined as
Notice that the probabilities are defined in terms of one of the smallest fixed effects subclasses. If 3-yr-old dams and female calves were considered, then the new threshold values would be lower, thereby giving fewer offspring in category 1 and maybe more in categories 2 or 3, because the solutions for 3-yr-old dams is negative and for female calves is negative. If an average offspring is envisioned, then solutions for ages of dam can be averaged and solutions for sex of calf can be averaged. A practical approach would be to choose the subclass where there seems to be the most economic losses and rank sires on that basis.
Real*8 Height,Q Real*4 x,p print *,'Enter a number greater than 99 to quit' 5 print *,'Enter an x-axis value' read(5,*)x if(x.gt.99.0)go to 99 p = pnor(x) Q = -0.5*x*x Height = dexp(Q)/2.506628 print 104,x,p,Height 104 format(' x = ',f15.5/ ' p = ',f15.5/ ' Height = ',f10.5) go to 5 99 print 105 105 format(' End of Program') stop end
This LaTeX document is available as postscript or asAdobe PDF.
Larry Schaeffer