# A Generalization of the Transmission/Disequilibrium Test for Uncertain-Haplotype Transmission

## Summary

A new transmission/disequilibrium-test statistic is proposed for situations in which transmission is uncertain. Such situations arise when transmission of a multilocus marker haplotype is considered, since haplotype phase is often unknown in a substantial number of instances. Even for single-locus markers, transmission is uncertain if one or both parents are missing. In both these situations, uncertainty may be reduced by the typing of further siblings, whose disease status may be unaffected or unknown. The proposed test is a score test based on a partial score function that omits the terms most influenced by hidden population stratification.

## Introduction

Until recently, the literature on transmission/disequilibrium testing assumed, for the most part, that marker genotypes can be directly measured in affected cases and in both their parents and, therefore, that transmissions of haplotypes from parents to affected offspring can be counted. In two important situations this is not the case:

1. both parents may not always be available, particularly for diseases with late onset, and
2. since binary markers such as single-nucleotide polymorphisms (SNPs) may carry insufficient information if considered alone, it may be necessary to consider haplotypes constructed from several such markers; however, haplotype phase is often uncertain.

In the former case, missing parental genotypes can be inferred from offspring genotypes, but Curtis and Sham (1995) have shown that restriction of the analysis to families in which such inference is certain leads to bias. Several alternative proposals have been published recently, mostly relying on the typing of further siblings, either unaffected or with unknown disease phenotype (Curtis 1997; Martin et al. 1997; Boehnke and Langefeld 1998; Horvath and Laird 1998; Schaid and Rowland 1998; Spielman and Ewens 1998). These approaches have been reviewed and compared by Monks et al. (1998).

Despite this work, no solution has yet been proposed for the problem of phase uncertainty for marker haplotypes, although the problems clearly have much in common. Again, our ability to infer parental haplotypes is enhanced by the typing of additional siblings, but it is to be expected that, here too, restriction of the analysis to families with known haplotype phase may lead to bias.

This report proposes a new and unified approach to these problems. In the next section, two different likelihood approaches to the analysis of transmission/disequilibrium studies are compared for the case in which all genotypes and transmissions are completely observed. The likelihood approach to uncertain transmission is then described, and its disadvantages are discussed. A new approach is described, and some extensions are outlined. These sections of the report emphasize the theoretical principles behind the method; detailed algebraic expressions are given in the Appendix.

## Likelihoods, Score Tests, and the Transmission/Disequilibrium Test (TDT)

The idea that underlies TDTs is that, in the presence of association between a genetic marker and disease susceptibility (DS) (when such association is due to coincidence of linkage and gametic-phase disequilibrium between marker and DS gene), the probability of transmission of a marker gene from parents to an *affected* offspring is increased from the .5 value predicted by Mendelian inheritance. Several such tests have been proposed; the relationship between these is clarified by consideration of the likelihood for a disease-association model that is parametrized in terms of probabilities of disease, conditional on marker genotype, and of population-allele (or, more generally, haplotype) frequencies (Schaid 1996).

At the marker locus, the genotype *g* consists of a pair of haplotypes (*i,j*), and there is association between marker and disease when, at the population level, the probability of disease depends on the genotype. If this probability is represented by π_{g}, the genotype relative risk (GRR) is defined as φ_{g}=π_{g}/π_{(1,1)} (here, the genotype (1,1) has been taken as the reference genotype, so that φ_{(1,1)}=1). If there are *H* distinct haplotypes, then there are *G*=*H*(*H*+1)/2 distinct genotypes and, even for moderate *H*, many GRR parameters are required in order to model association. Thus, to obtain powerful tests, it may be necessary to consider a more restrictive model.

Falk and Rubinstein (1987) implicitly assumed a “genotype relative risk” model in which a single copy of the high-risk allele at a biallelic locus suffices to confer maximum risk on the genotype. This model has only a single association parameter and leads to a 1-df test statistic. However, the model does not easily generalize to multiallelic markers. Instead, most authors, following Terwilliger and Ott (1992), concentrate on a “haplotype-based” approach. This implicitly assumes the model φ_{(i,j)}=θ_{i}θ_{j}, in which the two haplotypes act multiplicatively. The parameters θ_{i} may be regarded as haplotype relative risk (HRR) parameters (again, to ensure identifiability, it will be necessary to impose a constraint such as the “corner” constraint θ_{1}=1). With this model, the (*i,j*) heterozygous genotype carries a relative risk equal to the geometric mean of the relative risks for the two homozygous genotypes, φ_{(i,j)}=√(φ_{(i,i)}φ_{(j,j)}). The model represents marker-disease association with *H*-1 free parameters and therefore leads to association tests with *H*-1 df.

A full-probability model for the data must also describe the probability distribution of parental genotypes in the population. Again, unless simplifying assumptions are made, such a model can involve very many parameters. Accordingly, it is usual to assume no population stratification and Hardy-Weinberg equilibrium, so that the probability that a parent drawn from the population at random has genotype *g*=(*i*,*j*) is

$$
P\bigl(g=(i,j)\bigr)=\begin{cases}\psi_{i}^{2}, & i=j,\\ 2\psi_{i}\psi_{j}, & i\neq j,\end{cases}
$$

and the two parents' genotypes are independent. The parameters ψ_{i} are the population haplotype relative frequencies and obey the constraint Σ_{i}ψ_{i}=1. If the multiplicative model for HRRs also holds, it is easily shown that disease cases are also in Hardy-Weinberg equilibrium, with modified haplotype frequencies

$$
\psi^{*}_{i}=\frac{\psi_{i}\theta_{i}}{\sum_{j}\psi_{j}\theta_{j}}. \tag{1}
$$

It is generally more convenient to work with unbounded parameters; accordingly, henceforth the HRRs will be replaced by their logarithms, β_{i}, and the haplotype frequencies by their multinomial logit transformations, γ_{i}, so that θ_{i}=*exp*β_{i} and ψ_{i}=*exp*γ_{i}/Σ_{k}*exp*γ_{k}. With these transformations, the score functions are simply differences between observed and expected counts of haplotypes (see the Appendix).
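The claim that cases remain in Hardy-Weinberg equilibrium under the multiplicative model can be checked numerically. The following is a minimal sketch only: the function names and the parameter values are hypothetical choices made for illustration, and genotypes are treated as ordered haplotype pairs for simplicity.

```python
import numpy as np

def haplotype_freqs(gamma):
    """Multinomial-logit transform: psi_i = exp(gamma_i) / sum_k exp(gamma_k)."""
    w = np.exp(gamma - np.max(gamma))  # subtracting the max only improves stability
    return w / w.sum()

def case_haplotype_freqs(beta, gamma):
    """Haplotype frequencies in cases under the multiplicative model:
    psi*_i = psi_i * theta_i / sum_k psi_k * theta_k, with theta_i = exp(beta_i)."""
    w = haplotype_freqs(gamma) * np.exp(beta)
    return w / w.sum()

def case_genotype_probs(beta, gamma):
    """Ordered-pair genotype probabilities in cases, by Bayes' theorem:
    P(i, j | disease) is proportional to psi_i * psi_j * theta_i * theta_j."""
    a = haplotype_freqs(gamma) * np.exp(beta)
    joint = np.outer(a, a)
    return joint / joint.sum()

# Hypothetical values for H = 3 haplotypes, with corner constraints
# beta_1 = 0 and gamma_1 = 0 (index 0 in the arrays).
beta = np.array([0.0, 0.7, -0.3])
gamma = np.array([0.0, -1.0, 0.5])

psi_star = case_haplotype_freqs(beta, gamma)
# Hardy-Weinberg equilibrium in cases: the genotype table factorizes.
hwe_ok = np.allclose(case_genotype_probs(beta, gamma),
                     np.outer(psi_star, psi_star))
```

With β=0 the case frequencies reduce to the population frequencies, which is the situation under the null hypothesis.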

The likelihood contribution of a parent-offspring trio ascertained via the affected offspring is the joint probability of parental and offspring genotypes, say *PG* and *OG*, conditional on disease in the offspring, say *OD*=1; this, in turn, factorizes into two parts:

$$
P(PG, OG \mid OD=1)=P(PG \mid OD=1)\,P(OG \mid PG, OD=1). \tag{2}
$$

The full-likelihood contribution and its two factors for the *i*th such trio will be denoted by *L*^{(F)}_{(i)}, *L*^{(P)}_{(i)}, and *L*^{(C)}_{(i)}, respectively, so that, corresponding to equation (2),

$$
L^{(F)}_{(i)}=L^{(P)}_{(i)}\,L^{(C)}_{(i)}. \tag{3}
$$

Detailed expressions for these likelihood contributions are given in the Appendix, but here it need only be noted that, whereas *L*^{(F)}_{(i)} and *L*^{(P)}_{(i)} depend on both sets of parameters, β and γ, the “conditional” likelihood contribution *L*^{(C)}_{(i)} depends only on the HRR parameters β. If the corresponding log-likelihood contributions are denoted by ℓ, the log likelihood decomposes additively: ℓ^{(F)}_{(i)}=ℓ^{(P)}_{(i)}+ℓ^{(C)}_{(i)}. After summation over families, the total log likelihood decomposes in the same way (total log likelihoods will be indicated by omission of the subscript *i*).

This decomposition is central to what follows. Tests for no association (*H*_{0}:β=0) can be constructed by use of either the full log likelihood, ℓ^{(F)}, or the conditional log likelihood, ℓ^{(C)}. The former extracts more information, but at a price; the extra term included, ℓ^{(P)}, is the one that depends on the population model for parental genotypes, and the integrity of the added information depends strongly on correct specification of this model. Thus, ℓ^{(P)} depends on β, because the conditioning on presence of disease in offspring may induce deviation from Hardy-Weinberg equilibrium and from independence of the two parental genotypes. Since the model assumes that neither of these exists in the population, evidence of such deviations for the parents within the study constitutes evidence for disease-marker association. In contrast, the conditional-likelihood term ℓ^{(C)} does not depend on assumptions of the model for parental genotypes. This informal argument would suggest that methods based on the full log likelihood may be more powerful than those based on the conditional likelihood (Terwilliger and Ott 1992) but that they may give incorrect answers if the assumptions of the population model are not met (Spielman et al. 1993). For the latter reason, conditional-likelihood methods are usually preferred.

Tests of association can be constructed in two ways, from each of these likelihoods:

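1. the log likelihood-ratio test, which takes twice the difference between the log likelihood at the global maximum-likelihood estimate, (β̂, γ̂), and the maximized log likelihood under the null hypothesis, *H*_{0}:β=0;
2. the score test, based on the first-derivative vector of ℓ evaluated at the null hypothesis, β=0 (with γ, where it is involved, at its estimate under the null hypothesis).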

Both tests simplify if the conditional likelihood is used, since γ is not involved. The two testing strategies are asymptotically equivalent, leading to χ^{2} tests on *H*-1 df. Table 1 sets out some tests previously proposed, classified by the testing strategy and likelihood on which they are based.

The approach proposed here for situations in which the transmission pattern may be uncertain is based on the score test, and this section concludes with some further notation for this approach. The score vector, denoted by **u**, is the vector of first derivatives of the log likelihood with respect to the parameters. This can be partitioned into two parts corresponding to the two sets of parameters:

$$
\mathbf{u}=\begin{pmatrix}\mathbf{u}_{\beta}\\ \mathbf{u}_{\gamma}\end{pmatrix}.
$$

The score vector is of length 2*H*, with the first *H* elements, **u**_{β}, concerning the HRR parameters β and with the next *H* elements, **u**_{γ}, concerning the logit-transformed gene frequencies γ. The information matrix, **J**, is of size 2*H*×2*H* and can be partitioned similarly:

$$
\mathbf{J}=\begin{pmatrix}\mathbf{J}_{\beta\beta} & \mathbf{J}_{\beta\gamma}\\ \mathbf{J}_{\gamma\beta} & \mathbf{J}_{\gamma\gamma}\end{pmatrix}.
$$

Standard likelihood theory shows that, when evaluated at the correct parameter values, the variance-covariance matrix of **u** is given by the expected value of **J** (since, in the cases considered here, **J** does not depend on random variables, a distinction between the observed and expected values of **J** will not be made hereafter). Each set of parameters will require a linear constraint for identifiability (e.g., one β and one γ might be set to zero), leaving 2(*H*-1) free parameters, so that the rank of **J** will also be 2(*H*-1).

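In the same way as the log likelihood, ℓ, the values of **u** and **J** can be decomposed into contributions of parents and of transmission from parents to offspring:

$$
\mathbf{u}^{(F)}=\mathbf{u}^{(P)}+\mathbf{u}^{(C)},\qquad \mathbf{J}^{(F)}=\mathbf{J}^{(P)}+\mathbf{J}^{(C)}.
$$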

Similarly, these arrays are sums of contributions for each parent-offspring trio, which will be denoted by “**u**_{(i)}” and “**J**_{(i)}.” The full expressions for these contributions are given in the Appendix.

After the score vector has been calculated at the null hypothesis, β=0, the score test is given by

$$
X^{2}=\mathbf{u}_{\beta}^{\mathsf{T}}\,\mathbf{V}_{\beta\beta}^{-}\,\mathbf{u}_{\beta}, \tag{4}
$$

where **V**^{-}_{ββ} is a generalized inverse of the variance-covariance matrix of **u**_{β}. Asymptotically, this is distributed as χ^{2} with df equal to the rank of **V**_{ββ} (usually *H*-1). Single-df tests for excess transmission of specific haplotypes may be constructed by dividing the square of the appropriate element of **u**_{β} by its variance. Clayton and Jones (1999) have pointed out that both of these testing strategies lack power for large *H* and have suggested an alternative test based on a quadratic form in the score vector, using a haplotype “similarity” matrix. This approach could be extended to the case in which haplotype assignments are uncertain, along the lines suggested below; but there are a number of technical problems, and this will not be attempted here.
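As an illustration of the quadratic form and its degrees of freedom, the sketch below uses the Moore-Penrose pseudoinverse as one possible choice of generalized inverse. The function name `score_statistic` and the counts are hypothetical; the example is a biallelic marker with *b*=30 and *c*=18 transmissions of the two alleles from heterozygous parents.

```python
import numpy as np

def score_statistic(u_beta, V_bb):
    """Quadratic-form score statistic u' V^- u.

    V^- is taken to be the Moore-Penrose pseudoinverse (one valid
    generalized inverse); the df is the numerical rank of V.
    """
    u = np.asarray(u_beta, dtype=float)
    V = np.asarray(V_bb, dtype=float)
    x2 = float(u @ np.linalg.pinv(V) @ u)
    df = int(np.linalg.matrix_rank(V))
    return x2, df

# Hypothetical biallelic example: b = 30, c = 18.
b, c = 30, 18
u = np.array([(b - c) / 2.0, (c - b) / 2.0])      # observed minus expected
V = (b + c) / 4.0 * np.array([[1.0, -1.0],
                              [-1.0, 1.0]])       # singular, rank 1
x2, df = score_statistic(u, V)

# Single-df test for the first haplotype: squared score over its variance.
x2_single = u[0] ** 2 / V[0, 0]
```

In this two-allele case the statistic reduces to (*b*-*c*)^{2}/(*b*+*c*), the familiar McNemar form of the TDT, on 1 df.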

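If the test is based on the conditional argument (that is, if **u**^{(C)} is used), then **V**_{ββ} is given by the corresponding information matrix, **J**^{(C)}_{ββ}. However, if the full likelihood is used, a correction must be made for the fact that γ must be estimated. The estimated value of γ for fixed β, say γ̂(β), is obtained by solving the estimating equation **u**_{γ}=0. A linear approximation leads to

$$
\mathbf{u}_{\beta}(\beta,\hat{\gamma})\approx\mathbf{u}_{\beta}-\mathbf{J}_{\beta\gamma}\,\mathbf{J}_{\gamma\gamma}^{-}\,\mathbf{u}_{\gamma}, \tag{5}
$$

where the score and information components on the right-hand side are evaluated at the true values of β and γ. This, in turn, leads to the approximation

$$
\mathbf{V}_{\beta\beta}\approx\mathbf{J}_{\beta\beta}-\mathbf{J}_{\beta\gamma}\,\mathbf{J}_{\gamma\gamma}^{-}\,\mathbf{J}_{\gamma\beta} \tag{6}
$$

for the variance of **u**_{β}(β, γ̂). To perform the score test in practice, **J** is evaluated at β=0 and γ=γ̂(0).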

## Incomplete Data

Likelihood methods generalize naturally to situations in which data are incomplete. Well-known results (Dempster et al. 1977; Little and Rubin 1987) give the score and information for incomplete data in terms of moments of the corresponding functions for the complete data, taken over the “posterior” distribution of the complete data, given all the available data. Thus, the data for the *i*th trio may be consistent with a set, 𝒢, of possible genotypes and transmissions, each giving a different likelihood contribution. If these are denoted by *L*_{(j)}, *j*∈𝒢, the log-likelihood contribution of such a trio is ℓ_{𝒢}=*log*Σ_{j∈𝒢}*L*_{(j)}. If the score and information contributions corresponding to the complete-data likelihoods *L*_{(j)} are denoted by **u**_{(j)} and **J**_{(j)}, the score contribution for a trio with incomplete data is a weighted mean of the possible complete-data contributions consistent with the observed data:

$$
\mathbf{u}_{\mathcal{G}}=\frac{\sum_{j\in\mathcal{G}}L_{(j)}\,\mathbf{u}_{(j)}}{\sum_{j\in\mathcal{G}}L_{(j)}}. \tag{7}
$$

The variance-covariance matrix of **u**_{𝒢} is obtained from the corresponding information contribution:

$$
\mathbf{J}_{\mathcal{G}}=\frac{\sum_{j\in\mathcal{G}}L_{(j)}\bigl[\mathbf{J}_{(j)}-\mathbf{u}_{(j)}\mathbf{u}_{(j)}^{\mathsf{T}}\bigr]}{\sum_{j\in\mathcal{G}}L_{(j)}}+\mathbf{u}_{\mathcal{G}}\mathbf{u}_{\mathcal{G}}^{\mathsf{T}}. \tag{8}
$$

These expressions could be used to compute the score vector and information matrix for incomplete data, so that, with use of expression (4) and equation (6), a score test for association could be calculated.
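The incomplete-data score and information can be sketched as posterior-weighted moments of the complete-data quantities. The function below is illustrative only: its name and array layout are assumptions, and the information is computed from the standard missing-information identity (mean complete-data information minus the posterior variance of the complete-data score). The toy check uses a hypothetical one-parameter model in which two complete-data outcomes, with likelihoods θ^{2} and θ(1-θ), cannot be distinguished, so that the observed-data likelihood is simply θ.

```python
import numpy as np

def incomplete_data_score_info(L, u, J):
    """Score and information for one incompletely observed unit.

    L : (K,)      complete-data likelihoods of the K consistent configurations
    u : (K, P)    complete-data score vectors
    J : (K, P, P) complete-data information matrices
    """
    L = np.asarray(L, dtype=float)
    u = np.asarray(u, dtype=float)
    J = np.asarray(J, dtype=float)
    w = L / L.sum()                                   # posterior weights
    u_bar = w @ u                                     # weighted mean score
    mean_J = np.einsum('k,kpq->pq', w, J)             # weighted mean information
    mean_uu = np.einsum('k,kp,kq->pq', w, u, u)       # weighted mean of u u'
    info = mean_J - mean_uu + np.outer(u_bar, u_bar)  # subtract posterior variance
    return u_bar, info

# Toy check (hypothetical model): L1 = theta^2, L2 = theta * (1 - theta).
theta = 0.4
L = [theta ** 2, theta * (1 - theta)]
u = [[2 / theta], [1 / theta - 1 / (1 - theta)]]
J = [[[2 / theta ** 2]], [[1 / theta ** 2 + 1 / (1 - theta) ** 2]]]
u_bar, info = incomplete_data_score_info(L, u, J)
# The observed-data likelihood is theta, so the score should be 1/theta
# and the information 1/theta^2.
```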

This approach provides a solution to the problem of uncertain-haplotype transmission, but it has a major disadvantage. Equations (7) and (8) require that *L*_{(j)}, **u**_{(j)}, and **J**_{(j)} refer to the *full* likelihood, so that they should more correctly be written as *L*^{(F)}_{(j)}, **u**^{(F)}_{(j)}, and **J**^{(F)}_{(j)}, respectively. Because of the summation over possible genotypes consistent with the available data, the score contributions from each trio no longer decompose into a parental contribution, **u**^{(P)}, and a conditional-likelihood contribution, **u**^{(C)}. Thus, this approach would not seem to lead to a test that is robust against deviation from the assumptions of the model for the distribution of parental genotypes.

## A Partial-Likelihood Argument

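In this section, a compromise between the full-likelihood approach and the conditional-likelihood approach is proposed. For complete data in which both parental genotypes and haplotype transmission to offspring are observed with certainty, this uses the parental term *L*^{(P)}, defined by equations (2) and (3), for inference concerning γ but reverts to the conditional part of the likelihood, *L*^{(C)}, for inference concerning β. Thus, the contribution of the *i*th such trio to the partial score function is

$$
\mathbf{u}^{(*)}_{(i)}=\begin{pmatrix}\mathbf{u}^{(C)}_{\beta(i)}\\ \mathbf{u}^{(P)}_{\gamma(i)}\end{pmatrix}.
$$

Because they are based on factorization (3), the parental and conditional score contributions **u**^{(P)}_{(i)} and **u**^{(C)}_{(i)} are independent, and the variance-covariance matrix of **u**^{(*)}_{(i)} is

$$
\mathbf{V}^{(*)}_{(i)}=\begin{pmatrix}\mathbf{J}^{(C)}_{\beta\beta(i)} & \mathbf{0}\\ \mathbf{0} & \mathbf{J}^{(P)}_{\gamma\gamma(i)}\end{pmatrix}. \tag{9}
$$

Note that, because **u**^{(*)}_{(i)} are no longer true score functions, **V**^{(*)}_{(i)} is no longer equal to the (negated) matrix of derivatives of **u**^{(*)}_{(i)}, which is

$$
\mathbf{J}^{(*)}_{(i)}=\begin{pmatrix}\mathbf{J}^{(C)}_{\beta\beta(i)} & \mathbf{0}\\ \mathbf{J}^{(P)}_{\gamma\beta(i)} & \mathbf{J}^{(P)}_{\gamma\gamma(i)}\end{pmatrix}. \tag{10}
$$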

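When the data are incomplete, consistent with a set 𝒢 of complete-data scenarios *j*, the proposed partial-score function is a weighted mean of the complete-data partial-score functions over 𝒢, using the full-likelihood contributions *L*^{(F)}_{(j)} as weights; that is,

$$
\mathbf{u}^{(*)}_{\mathcal{G}}=\frac{\sum_{j\in\mathcal{G}}L^{(F)}_{(j)}\,\mathbf{u}^{(*)}_{(j)}}{\sum_{j\in\mathcal{G}}L^{(F)}_{(j)}}. \tag{11}
$$

This approach does not entirely do away with the need for a model for the distribution of parental genotypes, but this model contributes only to the weights given to the complete-data-score contributions for β in the mean score and not to these contributions themselves. Intuitively, this strategy would be expected to be much more robust against departures from the population model than would a full-likelihood approach. The variance of **u**^{(*)}_{𝒢} may be estimated by an expression similar to that given in equation (8):

$$
\mathbf{V}^{(*)}_{\mathcal{G}}=\frac{\sum_{j\in\mathcal{G}}L^{(F)}_{(j)}\bigl[\mathbf{V}^{(*)}_{(j)}-\mathbf{u}^{(*)}_{(j)}\mathbf{u}^{(*)\mathsf{T}}_{(j)}\bigr]}{\sum_{j\in\mathcal{G}}L^{(F)}_{(j)}}+\mathbf{u}^{(*)}_{\mathcal{G}}\mathbf{u}^{(*)\mathsf{T}}_{\mathcal{G}}, \tag{12}
$$

and the matrix of derivatives of **u**^{(*)}_{𝒢} is given by

$$
\mathbf{J}^{(*)}_{\mathcal{G}}=\frac{\sum_{j\in\mathcal{G}}L^{(F)}_{(j)}\bigl[\mathbf{J}^{(*)}_{(j)}-\mathbf{u}^{(*)}_{(j)}\mathbf{u}^{(F)\mathsf{T}}_{(j)}\bigr]}{\sum_{j\in\mathcal{G}}L^{(F)}_{(j)}}+\mathbf{u}^{(*)}_{\mathcal{G}}\mathbf{u}^{(F)\mathsf{T}}_{\mathcal{G}}. \tag{13}
$$

The total partial-score vector, **u**^{(*)}, is obtained by summation of such contributions over families, as are its variance, **V**^{(*)}, and the matrix of derivatives, **J**^{(*)}.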

Although the complete-data-score contributions to **u**^{(*)}_{β} and **u**^{(*)}_{γ} are independent, as demonstrated by the block-diagonal structure of the contributions **V**^{(*)} seen in equation (9), the process of averaging over *j* to obtain the incomplete-data-score contributions leads to the variance contributions given by equation (12), which are *not* block diagonal. It is then necessary to take account of the need to estimate γ when tests for β are constructed. As outlined in the earlier discussion of the standard-likelihood theory, this may be done by adopting the linear approximation to the estimating equations, leading to equation (5). Because the information and variance matrices no longer coincide, the variance estimate for **u**^{(*)}_{β}(β, γ̂) is now rather more complicated than the expression given in equation (6):

$$
\mathbf{V}^{(*)}_{\beta\beta}-\mathbf{A}\mathbf{V}^{(*)}_{\gamma\beta}-\mathbf{V}^{(*)}_{\beta\gamma}\mathbf{A}^{\mathsf{T}}+\mathbf{A}\mathbf{V}^{(*)}_{\gamma\gamma}\mathbf{A}^{\mathsf{T}},\qquad \mathbf{A}=\mathbf{J}^{(*)}_{\beta\gamma}\,\mathbf{J}^{(*)-}_{\gamma\gamma}. \tag{14}
$$
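The adjustment for estimation of γ can be sketched numerically. The function below is a hedged illustration, not a transcription of the published formula: it computes the variance of the linear combination **u**_{β} minus **J**_{βγ}**J**^{-}_{γγ}**u**_{γ} implied by the linear approximation, with the pseudoinverse standing in for the generalized inverse. The function name, argument layout, and test matrix are assumptions.

```python
import numpy as np

def adjusted_score_variance(V, J, n_beta):
    """Variance of u_beta - A u_gamma, where A = J_bg J_gg^-.

    V : variance-covariance matrix of the stacked score (beta block first)
    J : matrix of (negated) derivatives of the stacked score
    n_beta : number of beta parameters
    """
    V = np.asarray(V, dtype=float)
    J = np.asarray(J, dtype=float)
    b = slice(0, n_beta)
    g = slice(n_beta, None)
    A = J[b, g] @ np.linalg.pinv(J[g, g])
    return V[b, b] - A @ V[g, b] - V[b, g] @ A.T + A @ V[g, g] @ A.T

# When V and J coincide (a true score function), the result should
# reduce to the Schur complement J_bb - J_bg J_gg^{-1} J_gb.
rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
J_full = M @ M.T + 4.0 * np.eye(4)        # arbitrary symmetric positive-definite matrix
adj = adjusted_score_variance(J_full, J_full, n_beta=2)
schur = J_full[:2, :2] - J_full[:2, 2:] @ np.linalg.inv(J_full[2:, 2:]) @ J_full[2:, :2]
```

When the β-γ block of the derivative matrix is zero, the function simply returns the β block of **V**, so no correction is applied.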

The theory of the proposed partial-score function has been set out above in some generality and would allow consistent estimation of β as well as the testing of hypothetical values. However, some simplification is possible when the null hypothesis β=0 is tested. In this special case, we can embed the null hypothesis within the model for parental genotypes, so that these are assumed to be sampled at random from the population. The dependence of *L*^{(P)}_{(j)} and *u*^{(P)}_{γ(j)} on β may then be ignored, leading to a number of simplifications:

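1. the derivative matrix in the certain-transmission case, **J**^{(*)}_{(i)}, coincides with the variance **V**^{(*)}_{(i)} and is given by equation (9);
2. similarly, the derivative matrix in the uncertain-transmission case, **J**^{(*)}, is given by equation (12); and
3. equation (14) reverts to the simpler form of equation (6).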

This simplification becomes essential when, as below, more-extended nuclear-family structures are considered. In such cases, the dependence of *L*^{(P)} on β becomes more complicated and, particularly when multiple affected offspring are considered, may involve ascertainment corrections. In such circumstances, consistent estimation of β under uncertain-haplotype transmission may prove difficult, whereas testing the null hypothesis remains straightforward.

To perform a score test of *H*_{0}:β=0, the score equations **u**^{(*)}_{γ}=0 are solved, with β=0, for the maximum-likelihood estimate γ̂. The vector **u**^{(*)}_{β}, evaluated at β=0 and γ=γ̂, is then used in expression (4), to obtain an asymptotic χ^{2} statistic.

## Additional Offspring

Uncertainty of parental genotype and transmission to the affected offspring, whether due to missing parental genotype or phase uncertainty, may be reduced by typing further offspring in the family. For the moment, assume that these are either unaffected or of unknown disease status. Although, strictly speaking, transmission to unaffected offspring provides some information about the association parameters β, since high-risk haplotypes will be *less* likely to be transmitted to unaffected offspring, the amount of information provided is negligible when the risks of disease conditional on genotype, π_{g}, are small. In these circumstances, unaffected offspring and offspring with unknown disease status can be treated in exactly the same way; such offspring are ignored in the conditional-likelihood contributions *L*^{(C)}_{(i)} and, consequently, in **u**^{(*)}_{β}. However, they are used both when the set, 𝒢, of genotypes consistent with the available data is determined and when the likelihood weights used to average score contributions over 𝒢 are calculated.
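Determining the set of consistent configurations starts from the possible phase resolutions of each individual's unphased multilocus genotype. The sketch below covers only that first step for biallelic markers; it does not apply the family constraints (Mendelian consistency with parents and siblings) that would prune the set further. The function name and the coding of genotypes as allele counts are assumptions made for illustration.

```python
from itertools import product

def phase_resolutions(genotype):
    """Enumerate haplotype pairs consistent with an unphased SNP genotype.

    genotype : tuple giving, for each SNP, the number of copies (0, 1, or 2)
               of the allele coded 1.
    Returns a set of unordered pairs of haplotypes (tuples of 0/1).
    """
    het = [k for k, g in enumerate(genotype) if g == 1]
    # At homozygous loci both haplotypes carry the same allele (0 or 1);
    # heterozygous positions are placeholders that are overwritten below.
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for k, bit in zip(het, bits):
            h1[k], h2[k] = bit, 1 - bit
        # Sorting makes the pair unordered, so mirror images collapse.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs

double_het = phase_resolutions((1, 1))      # two possible phases
single_het = phase_resolutions((2, 0, 1))   # phase is unambiguous
```

A genotype heterozygous at *k*≥1 loci has 2^{k-1} phase resolutions, which is why typing additional relatives is valuable for multilocus haplotypes.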

Additional *affected* offspring are more difficult to deal with, since these potentially contribute to *L*^{(C)} and, therefore, to **u**^{(*)}_{β}. However, if the null hypothesis allows for the presence of *linkage* between marker locus and disease-susceptibility locus, specifying only gametic-phase equilibrium, then the contributions of multiple affected offspring to the family contribution to *L*^{(C)} are not independent. In this situation, some authors have advocated use of only the first affected offspring, but this may lead to considerable loss of information. More-satisfactory approaches have been suggested by Martin et al. (1997) and by Horvath and Laird (1998). The approach proposed here can be extended to deal with this difficulty, by application of the general results, reported by Huber (1967), concerning “robust” variance estimation when a likelihood is misspecified. When the null hypothesis allows for linkage and more than one affected offspring may be included for each family, the expected value of **u**^{(*)} is still zero, but its variance is incorrectly estimated by the results given above. When the method of Huber is followed, an alternative estimate is provided by the empirical variance-covariance matrix of the contributions of each family to **u**^{(*)}. Thus, if families 1,…,*N* are consistent with genotype configurations 𝒢_{1},…,𝒢_{N}, then the overall score is **u**^{(*)}=Σ^{N}_{i=1}**u**^{(*)}_{𝒢_{i}}, and its variance can be estimated by

$$
\hat{\mathbf{V}}^{(*)}=\sum_{i=1}^{N}\mathbf{u}^{(*)}_{\mathcal{G}_{i}}\,\mathbf{u}^{(*)\mathsf{T}}_{\mathcal{G}_{i}}. \tag{15}
$$

After estimation of γ, a robust variance estimate for *u*^{(*)}_{β} can be obtained from equations (14) and (15).
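The empirical variance estimate is a sum of outer products of the per-family score contributions, which have expectation zero under the null hypothesis. A minimal sketch follows, with a hypothetical function name and made-up contributions from three families:

```python
import numpy as np

def robust_score_variance(family_scores):
    """Total score and its empirical variance, sum_i u_i u_i'.

    family_scores : (N, P) array, one row per family, each row being that
                    family's contribution to the score vector.
    """
    U = np.asarray(family_scores, dtype=float)
    total = U.sum(axis=0)     # overall score vector
    V_hat = U.T @ U           # sum over families of outer products
    return total, V_hat

# Hypothetical per-family contributions for a biallelic marker.
U = [[1.0, -1.0],
     [0.5, -0.5],
     [-1.0, 1.0]]
total, V_hat = robust_score_variance(U)
```

Because each family enters as a single unit, dependence among affected siblings within a family does not invalidate the estimate, although it relies on a reasonably large number of families.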

## Discussion

This report has set out the statistical theory behind the proposed new approach to transmission/disequilibrium testing. Numerical results concerning its performance in the situation in which one parent is unavailable have been provided by A. Cervino, A. Hill, and P. Donnelly (personal communication), who demonstrate that the method is indeed highly robust against violation of the assumption of no population stratification—at least in the situations that they considered. The method has a number of advantages over other approaches that have been proposed. First, it is efficient, making full use of whatever parental data are available. Second, by use of the robust variance estimator (15), more than one affected offspring per family may be used, even in the presence of linkage. Third, this would seem to be the only approach so far proposed that will deal with the problem of phase uncertainty for multilocus haplotypes. This will be an important problem as attention turns to SNP markers within candidate genes. (A computer program, “TRANSMIT,” which implements the methods described in this report, is available from the author [Medical Research Council Biostatistics Unit].)

## Appendix A : Likelihood, Score, and Information Contributions

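Consider a marker with *m* alleles or possible haplotypes and a family (parent-offspring trio), *f*, in which the parental genotypes are (*p,q*) and (*r,s*) and the affected offspring is (*p,r*). In terms of the original gene frequency and HRR parameters, the full-likelihood contribution of the family is, up to a constant factor that does not involve the parameters,

$$
L^{(F)}_{(f)}\propto\frac{\psi_{p}\psi_{q}\psi_{r}\psi_{s}\,\theta_{p}\theta_{r}}{\bigl(\sum_{i}\psi_{i}\theta_{i}\bigr)^{2}}=\psi^{*}_{p}\,\psi_{q}\,\psi^{*}_{r}\,\psi_{s},
$$

where the summation is over the range 1,…,*m* and ψ^{*}_{i}, the relative frequency of haplotype *i* in cases, is given by equation (1). The final simplified form makes it clear that this likelihood is equivalent to the suggestion, by Terwilliger and Ott (1992), that haplotypes *p* and *r* may be regarded as “case” haplotypes and that the untransmitted haplotypes *q* and *s* may be regarded as “controls.” The two factors in the factorization (3) of this contribution are

$$
L^{(P)}_{(f)}\propto\frac{\psi_{p}\psi_{q}\psi_{r}\psi_{s}\,(\theta_{p}+\theta_{q})(\theta_{r}+\theta_{s})}{\bigl(\sum_{i}\psi_{i}\theta_{i}\bigr)^{2}},\qquad
L^{(C)}_{(f)}=\frac{\theta_{p}\theta_{r}}{(\theta_{p}+\theta_{q})(\theta_{r}+\theta_{s})}.
$$

Thus the conditional-likelihood contribution involves only the HRR parameters, whereas the parental contribution involves both sets of parameters.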

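All the score functions may be regarded as observed counts minus expected counts. When *N*_{i} is used to denote the number of times that haplotype *i* occurs in the two parents, the score contributions with respect to γ_{i} are

$$
\frac{\partial\ell^{(F)}_{(f)}}{\partial\gamma_{i}}=\frac{\partial\ell^{(P)}_{(f)}}{\partial\gamma_{i}}=N_{i}-2\psi_{i}-2\psi^{*}_{i}. \tag{A1}
$$

When *T*_{i} is used to denote the number of times that haplotype *i* is transmitted to the affected offspring, the score contribution for β_{i}, based on the full likelihood, is ∂ℓ^{(F)}_{(f)}/∂β_{i}=*T*_{i}-2ψ^{*}_{i}. The corresponding score contribution based on the conditional likelihood has a similar observed-minus-expected form, but the expected frequencies are now based on transmission probabilities:

$$
\frac{\partial\ell^{(C)}_{(f)}}{\partial\beta_{i}}=T_{i}-E_{i},\qquad
E_{i}=\frac{\delta_{ip}\theta_{p}+\delta_{iq}\theta_{q}}{\theta_{p}+\theta_{q}}+\frac{\delta_{ir}\theta_{r}+\delta_{is}\theta_{s}}{\theta_{r}+\theta_{s}}, \tag{A2}
$$

where δ_{ij} is the Kronecker delta, taking the value 1 if *i*=*j* and 0 otherwise. It follows that the score contribution for β_{i} based on the parental likelihood is ∂ℓ^{(P)}_{(f)}/∂β_{i}=*E*_{i}-2ψ^{*}_{i}. It is clear from this expression why this term is so dependent on the assumptions of no population stratification and Hardy-Weinberg equilibrium.

The important score functions are equations (A1) and (A2), and, under the null hypothesis, these take the values

$$
\frac{\partial\ell^{(P)}_{(f)}}{\partial\gamma_{i}}=N_{i}-4\psi_{i},\qquad
\frac{\partial\ell^{(C)}_{(f)}}{\partial\beta_{i}}=T_{i}-\tfrac{1}{2}N_{i}.
$$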

The corresponding information contributions are obtained by differentiation of equations (A1) and (A2). At the null hypothesis, after it is noted that Σ_{k}ψ_{k}=1,

$$
-\frac{\partial^{2}\ell^{(P)}_{(f)}}{\partial\gamma_{i}\,\partial\gamma_{j}}=4\psi_{i}\,(\delta_{ij}-\psi_{j}).
$$

Only a small number of the elements of the information matrix [-∂^{2}ℓ^{(C)}/∂β∂β^{T}] (again evaluated at the null hypothesis) need to be updated for each family. If *p*≠*q*, the diagonal elements (*p,p*) and (*q,q*) must be incremented by +1/4, and the off-diagonal elements (*p,q*) and (*q,p*) must be incremented by -1/4. A similar procedure is used for *r,s*.
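For completely observed trios, the null-hypothesis score and information for β can be accumulated directly from the update rules above. The following is a sketch under stated assumptions: the function name, the 0-based haplotype indices, and the input layout (each parent given as a pair of transmitted and untransmitted haplotypes) are all hypothetical.

```python
import numpy as np

def conditional_score_info(trios, n_hap):
    """Accumulate the conditional score and information at beta = 0.

    trios : iterable of ((p, q), (r, s)); within each parent the first
            haplotype is the one transmitted to the affected offspring.
    n_hap : number of distinct haplotypes (0-based indices).
    """
    u = np.zeros(n_hap)
    J = np.zeros((n_hap, n_hap))
    for parents in trios:
        for t, n in parents:          # t transmitted, n not transmitted
            if t == n:
                continue              # homozygous parent contributes nothing
            u[t] += 0.5               # observed 1 minus expected 1/2
            u[n] -= 0.5               # observed 0 minus expected 1/2
            J[t, t] += 0.25
            J[n, n] += 0.25
            J[t, n] -= 0.25
            J[n, t] -= 0.25
    return u, J

# Hypothetical data: three trios at a biallelic marker.
trios = [((0, 1), (0, 0)),
         ((0, 1), (1, 0)),
         ((1, 0), (0, 1))]
u, J = conditional_score_info(trios, n_hap=2)
x2 = float(u @ np.linalg.pinv(J) @ u)   # quadratic-form score statistic
```

In this example, heterozygous parents transmit haplotype 0 three times and haplotype 1 twice, so the statistic equals (3-2)^{2}/(3+2).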

## Electronic-Database Information

The URL for data in this article is as follows:

## References

**American Society of Human Genetics**


A Generalization of the Transmission/Disequilibrium Test for Uncertain-Haplotype Transmission. *American Journal of Human Genetics.* Oct 1999; 65(4):1170.
