# Modeling genetic inheritance of copy number variations

^{1,2,*}, Zhen Chen^{3}, Mahlet G. Tadesse^{4}, Joseph Glessner^{2}, Struan F. A. Grant^{2}, Hakon Hakonarson^{2}, Maja Bucan^{1} and Mingyao Li^{3}

^{1}Department of Genetics, University of Pennsylvania,

^{2}Center for Applied Genomics and Division of Human Genetics, The Children's Hospital of Philadelphia,

^{3}Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA 19104 and

^{4}Department of Mathematics, Georgetown University, Washington, DC 20057, USA

^{*}Corresponding author.

## Abstract

Copy number variations (CNVs) are being used as genetic markers or functional candidates in gene-mapping studies. However, unlike single nucleotide polymorphism or microsatellite genotyping techniques, most CNV detection methods are limited to detecting total copy numbers, rather than copy number in each of the two homologous chromosomes. To address this issue, we developed a statistical framework for intensity-based CNV detection platforms using family data. Our algorithm identifies CNVs for a family simultaneously, thus avoiding the generation of calls with Mendelian inconsistency while maintaining the ability to detect *de novo* CNVs. Applications to simulated data and real data indicate that our method significantly improves both call rates and accuracy of boundary inference, compared to existing approaches. We further illustrate the use of Mendelian inheritance to infer SNP allele compositions in each of the two homologous chromosomes in CNV regions using real data. Finally, we applied our method to a set of families genotyped using both the Illumina HumanHap550 and Affymetrix genome-wide 5.0 arrays to demonstrate its performance on both inherited and *de novo* CNVs. In conclusion, our method produces accurate CNV calls, gives probabilistic estimates of CNV transmission and builds a solid foundation for the development of linkage and association tests utilizing CNVs.

## INTRODUCTION

A central strategy in the genetic study of human diseases is to identify genomic DNA variations related to clinical phenotypes. Human genomic variation exists in many forms, including single nucleotide polymorphisms (SNPs), simple repeat elements, microsatellites and structural variations such as copy number variations (CNVs) (1). A CNV is defined as a chromosomal segment, at least 1 kb in length, whose copy number varies in comparison with a reference genome (2). A significant fraction of CNVs are likely to have functional consequences, due to gene dosage alteration, disruption of genes, positional effects or the uncovering of deleterious alleles (3,4). Thus, comprehensive identification and cataloging of CNVs will greatly benefit the genetic and functional analysis of human genome variation.

Multiple techniques have been developed to detect deletions or duplications in the human genome and other mammalian genomes (5), and many of them depend on analyzing patterns of signal intensities across the genome. Traditionally, large chromosome rearrangements have been detected by array-comparative genomic hybridization (CGH) techniques that analyze the fluorescence signal intensities of clones (6–9). Another comparable platform for CNV detection is whole genome oligonucleotide arrays. Since design of the arrays does not depend on SNPs, such technology can achieve complete genome coverage with higher precision for boundary inference of CNVs. Due to recent increased popularity of genome-wide association studies, high-density SNP genotyping arrays have been commonly used for CNV detection and analysis. With such arrays, signal intensity is measured for each allele of a given SNP, and analysis of signal intensities across all SNPs in the genome is used to infer CNVs (10,11). More recently, to improve the coverage of SNP arrays for CNV analysis, manufacturers of SNP genotyping arrays, such as Affymetrix and Illumina, have incorporated nonpolymorphic (NP) markers into their SNP genotyping arrays, especially in known CNV regions.

Although ‘losses’ and ‘gains’ have traditionally been used to describe the major classes of CNVs, CNVs in a diploid genome are in fact chromosome-specific events. That is, a CNV can exist on either of the two homologous chromosomes; for example, a segment can be deleted on one chromosome but duplicated on the other. Knowing the chromosome-specific copy number is important for the development of linkage and association tests for CNVs. However, the commonly used CNV detection techniques mentioned above all depend on signal intensity measures, and are therefore unable to infer the copy number on each homologous chromosome. The efficient utilization of family information can potentially help circumvent this issue. Furthermore, since most CNVs follow Mendelian inheritance (8), the use of family information can improve the sensitivity and specificity of CNV detection (12). In fact, family-based designs are now commonly used in genome-wide association studies, making it highly desirable to develop methods to infer chromosome-specific copy numbers. For example, in a recent CNV study on autism spectrum disorders, 751 families were genotyped using the Affymetrix genome-wide 5.0 Human SNP arrays (13); in our ongoing study, 943 autism families were genotyped using the Illumina HumanHap550 SNP arrays (14). Other family-based genome-wide association studies include the Framingham heart study (15), a multiple sclerosis study (16) and type I diabetes studies (17,18).

To use family information in analysis of CNVs, Kosta *et al.* (19) developed an approach to infer chromosome-specific copy numbers for nuclear families after the total copy numbers are obtained from quantitative PCR. In our previous CNV analysis (12), we incorporated family information in a two-step procedure in which family members were first used independently to generate CNV calls, and then combined together to post-validate calls obtained in the first step by incorporating family relationships. Although this approach has been shown to significantly increase the sensitivity and specificity of CNV detection, the family information is not optimally used. Moreover, if the CNV boundary is inferred incorrectly in the first step, it cannot be corrected in the second step. More recently, Marioni *et al.* (20) discussed similar issues extensively for array CGH data, and proposed that copy numbers can be inferred on each chromosome, using HapMap family data as examples.

Efficient utilization of family information in CNV detection requires incorporation of the family relationships when modeling the joint probability distribution of signal intensities for family members. Similar to traditional multipoint linkage analysis with families, such a modeling procedure requires consideration of two levels of dependency—the dependency of signal intensities both between adjacent markers for each family member and at the same marker between family members. The first level of dependency can be modeled by a hidden Markov chain, in which the degree of dependency is determined by transition probabilities of the hidden copy number states, whereas the second level of dependency is determined by Mendelian inheritance. However, unlike the analysis of SNPs or microsatellites, family-based CNV studies are limited by the technical platforms, which can only give intensity estimates of the total copy number of a diploid genome. The analysis of CNVs in families is further complicated by the occurrence of *de novo* events, which occur as germline, somatic or cell line-induced chromosome aberrations in offspring that were not inherited from either parent.

To address these complications, we describe a unified statistical framework developed to jointly model the signal intensities for a parents–offspring trio. We demonstrate that our model is computationally feasible and can be used to analyze trios in a more efficient manner than existing methods, which do not consider family relationships or use family relationships separately (12). By computer simulations and analysis of experimentally validated CNVs on real data, we demonstrate its superior performance in increasing call rates and in identifying the exact boundaries of CNVs. In addition, by analyzing a set of families genotyped using both the Illumina and Affymetrix SNP arrays, we further show the applicability of our method on different technical platforms and in detecting both inherited and *de novo* CNVs. Although CNV detection only concerns the total copy number, our model gives probabilistic estimates of chromosome-specific copy numbers, which can be used for the future development of linkage and association tests that require chromosome-specific copy number information.

## METHODS

### Overview of the hidden Markov model framework

The hidden Markov model (HMM) is a statistical technique that models data generated from an underlying Markov process. The HMM assumes that the distribution of an observed data point depends on an unobserved (hidden) state, where the sequence of hidden states follows a Markov process. Since CNV detection typically involves aggregating information from multiple consecutive SNPs, the HMM provides a natural framework for modeling dependence structures between copy numbers at nearby markers. Figure 1 shows a schematic representation of our proposed model for the joint CNV distribution in a parents–offspring trio. The model consists of three chains that are independent of each other: a chain for the copy number states of the father, a chain for the copy number states of the mother and a chain for the *de novo* event status of the offspring. Although the offspring copy number at each marker depends on the copy number at the previous marker, it is also determined by six other elements: the copy number states of both parents and the *de novo* status, each at the current marker and at the previous marker (dashed lines in Figure 1). Below we describe how to explicitly model the joint CNV distribution of a parents–offspring trio through likelihood calculation of the signal intensities.

### Signal intensities for Illumina SNP arrays

To illustrate our method, we focus on data generated from the Illumina SNP arrays and the Affymetrix arrays with both SNP and NP markers. Illumina SNP arrays produce two measures of signal intensity at each SNP, the log *R* ratio (LRR) and the B allele frequency (BAF); these two measures were originally proposed by Illumina for copy number inference (10). To obtain LRR and BAF for each SNP, the raw signal intensities are subjected to a normalization procedure, which produces the *X*- and *Y*-values, representing the normalized signal intensities for the A and B alleles, respectively. Two measures, *R* = *X* + *Y* and *θ* = arctan(*Y*/*X*)/(*π*/2), are then calculated for each SNP. As a normalized measure of total signal intensity, LRR is calculated as log_{2}(*R*_{observed}/*R*_{expected}), where *R*_{expected} is computed from linear interpolation of the canonical genotype clusters. The BAF is a normalized measure of the relative signal intensity ratio of the B and A alleles. Let *θ*_{g}, *g* ∈ {AA, AB, BB}, denote the mean *θ* value for genotype cluster *g* obtained from a set of reference samples. The corresponding BAFs are defined as 0.0, 0.5 and 1.0, respectively. Then, for a subject with *θ*_{subject}, the BAF is defined through linear interpolation among the three clusters:

$$
\mathrm{BAF} =
\begin{cases}
0, & \theta_{\mathrm{subject}} < \theta_{\mathrm{AA}} \\
0.5\,\dfrac{\theta_{\mathrm{subject}} - \theta_{\mathrm{AA}}}{\theta_{\mathrm{AB}} - \theta_{\mathrm{AA}}}, & \theta_{\mathrm{AA}} \le \theta_{\mathrm{subject}} < \theta_{\mathrm{AB}} \\
0.5 + 0.5\,\dfrac{\theta_{\mathrm{subject}} - \theta_{\mathrm{AB}}}{\theta_{\mathrm{BB}} - \theta_{\mathrm{AB}}}, & \theta_{\mathrm{AB}} \le \theta_{\mathrm{subject}} < \theta_{\mathrm{BB}} \\
1, & \theta_{\mathrm{subject}} \ge \theta_{\mathrm{BB}}
\end{cases}
\tag{1}
$$
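The LRR and BAF transformations described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the Illumina or PennCNV implementation: the function names are hypothetical, and *R*_{expected} is taken as an input because its interpolation from the canonical clusters depends on cluster data not given here.

```python
import math

def r_theta(x, y):
    """Return R = X + Y and theta = arctan(Y/X) / (pi/2) for one SNP.

    atan2 is used so that X = 0 does not raise; for non-negative
    intensities it equals arctan(Y/X).
    """
    return x + y, math.atan2(y, x) / (math.pi / 2)

def log_r_ratio(r_observed, r_expected):
    """LRR = log2(R_observed / R_expected)."""
    return math.log2(r_observed / r_expected)

def b_allele_freq(theta, theta_aa, theta_ab, theta_bb):
    """BAF by linear interpolation among the AA, AB and BB cluster means."""
    if theta < theta_aa:
        return 0.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    if theta < theta_bb:
        return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)
    return 1.0
```

Values of *θ* outside the AA to BB range are clipped to 0 or 1, which is the truncation that motivates the truncated normal distribution used later for the BAF emission probability.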

### Signal intensities for Affymetrix SNP arrays

The Affymetrix genome-wide 5.0 and 6.0 SNP arrays contain approximately equal numbers of SNP markers and NP markers to improve the genome coverage. We followed a similar procedure as used by the Illumina platform to derive the LRR and BAF values for SNP markers, and the LRR values for NP markers. We used the Affymetrix Power Tools (http://www.affymetrix.com/support/developer/powertools/changelog/index.html) to perform data normalization, signal extraction and genotype calling from raw CEL files generated in genotyping experiments. For each SNP marker, we then relied on the allele-specific signal intensities for the AA, AB and BB genotypes on all genotyped samples to construct three canonical genotype clusters. Since each NP marker has only one reference cluster, we set the center value of the cluster as the median of the signal intensities for all genotyped samples. Once the canonical genotype clusters are constructed, we can then transform the signal intensity values for each SNP into R, LRR, θ and BAF values. The method described below uses both LRR and BAF values, but for NP markers, the BAF information is ignored in the likelihood calculation.

### Likelihood of signal intensities for a parents–offspring trio

Assume a parents–offspring trio is genotyped at *T* consecutive SNPs. For SNP *j* (1 ≤ *j* ≤ *T*), let *r*_{j} = (*r*_{j,f}, *r*_{j,m}, *r*_{j,o}) denote the triplet of LRRs of the father, the mother and the offspring, *b*_{j} = (*b*_{j,f}, *b*_{j,m}, *b*_{j,o}) denote the corresponding BAFs, *z*_{j} = (*z*_{j,f}, *z*_{j,m}, *z*_{j,o}) denote the underlying hidden copy number states, and DN_{j} (1: *de novo* event; 0: inherited from parents) denote the *de novo* event status of the offspring. The observed signal intensities for the trio can be represented by *r* = (*r*_{1},…, *r*_{T}) and *b* = (*b*_{1},…, *b*_{T}), and the hidden copy number states can be represented by *z* = (*z*_{1},…, *z*_{T}). Let *λ* denote all parameters in the HMM (including means and standard deviations in the emission probabilities of signal intensities, initial probabilities of copy number states and transition probabilities). The likelihood of the signal intensities for the trio is

Figure 1 provides a schematic representation of the dependence structure specified in Equation (2). This equation requires a few simplifying but reasonable assumptions, including the conditional independence of LRR and BAF values at each marker (supported by empirical data), the conditional independence of LRR/BAF values between adjacent markers, as well as the conditional independence of BAF values and the *de novo* event status at each marker. For the starting SNP, its contribution to the likelihood is the product of the emission probability of the signal intensities, the initial probability of copy number states, and the initial probability of the *de novo* event status. Based on empirical data from HapMap, we estimate that *ε* = *P*(DN_{1} = 1|λ) = 1.5e – 6. For the remaining SNPs (2 ≤ *j* ≤ *T*), the contribution of each SNP to the likelihood is the product of the emission probability, the transition probability of copy number states and the transition probability of the *de novo* event status. The challenge of the HMM lies in the inference of the hidden copy number states of each marker and the *de novo* event status of the offspring, given the observed signal intensities. Below, we describe elements needed in the HMM calculation. We note that the calculation in Equation (2) can be easily extended to nuclear families with multiple offspring, in which each additional offspring requires a variable specifying copy number state and a variable indicating *de novo* status for each marker.
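A form of Equation (2) consistent with the description in this section can be written as follows. It is a reconstruction from the stated conditional-independence assumptions rather than a verbatim display, with the sums taken over all hidden copy number state sequences *z* and all *de novo* status sequences DN = (DN_{1},…, DN_{T}):

```latex
P(r, b \mid \lambda)
  = \sum_{z} \sum_{DN}
    P(DN_1 \mid \lambda)\, P(z_1 \mid DN_1, \lambda)\,
    P(r_1 \mid z_1, \lambda)\, P(b_1 \mid z_1, \lambda)
    \prod_{j=2}^{T}
      P(DN_j \mid DN_{j-1}, \lambda)\,
      P(z_j \mid z_{j-1}, DN_{j-1}, DN_j, \lambda)\,
      P(r_j \mid z_j, \lambda)\, P(b_j \mid z_j, \lambda)
```

The first-marker factors are the initial probability of the *de novo* status, the initial probability of the copy number states and the emission probabilities; each later marker contributes the two transition probabilities and the emission probabilities.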

#### Hidden copy number states

We adopt a five-state definition of hidden copy number states (Table 1) to reflect possible copy number changes, including double-copy deletion (zero copies), single-copy deletion (one copy), normal state (two copies), single-copy duplication (three copies) and double or more copy duplication (four or more copies). A copy number of more than four is usually indistinguishable from four copies in patterns of signal intensity, so we combine this rare scenario with four copies.

#### Emission probabilities of signal intensity

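Given the copy number states of the father, the mother and the offspring, their signal intensities are independent; thus, for marker *j*, *P*(*r*_{j}, *b*_{j} | *z*_{j}) = ∏_{k ∈ {f,m,o}} *P*(*r*_{j,k} | *z*_{j,k}) *P*(*b*_{j,k} | *z*_{j,k}). We propose to model the emission probability of the LRRs as a normal distribution based on empirical observations, *P*(*r*_{j,k} | *z*_{j,k}) = *φ*(*r*_{j,k}; *μ*_{z_{j,k}}, *σ*_{z_{j,k}}), where *φ* is the normal density function with unknown mean *μ*_{z_{j,k}} and SD *σ*_{z_{j,k}}.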

The emission probability of BAF is slightly more complicated than that of the LRR. For the zero-copy state, we used a mixture of a normal distribution with mean 0.5 and unknown SD, and a point mass at 0 or 1, to model the distribution of BAF. For each state other than the zero-copy state, there are multiple possible genotypes with distinct patterns of BAF. Let *C*(*z*_{j,k}) denote the total number of genotypes (Table 1) for state *z*_{j,k} of individual *k* at SNP *j*. For each genotype that is consistent with the copy number state, let *g* denote the number of copies of allele B. Let *p*_{j,B} be the population frequency of allele B at marker *j*, which can be estimated from a set of reference samples such as the HapMap. Then the emission probability of BAF can be modeled as a mixture of distributions, where

is the probability of genotype *g*, and

The use of truncated normal distribution is due to the truncation in BAF calculation. The point mass probabilities are set as *M*_{0} = *M*_{1} = 0.5.
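To make the mixture idea concrete, the sketch below shows one plausible way to evaluate the LRR and BAF emission terms for a state with at least one copy. Several details are assumptions made for illustration and are not taken from the text: genotype weights are binomial in the B allele frequency, the BAF component for a genotype with *g* B alleles out of *K* copies is centered at *g*/*K*, and the point mass (0.5, as stated above) applies only when the BAF is exactly 0 or 1 for a homozygous genotype. Function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sd):
    return 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))

def lrr_emission(r, mu, sd):
    """Normal density of an LRR value given the state-specific mean and SD."""
    return normal_pdf(r, mu, sd)

def genotype_weights(copies, p_b):
    """ASSUMPTION: binomial probability of g B alleles among `copies` copies."""
    return [math.comb(copies, g) * p_b ** g * (1 - p_b) ** (copies - g)
            for g in range(copies + 1)]

def baf_emission(b, copies, p_b, sd, point_mass=0.5):
    """Mixture over genotypes for copies >= 1 (component mean g/copies is assumed)."""
    total = 0.0
    for g, weight in enumerate(genotype_weights(copies, p_b)):
        mean = g / copies
        homozygous = g in (0, copies)
        if homozygous and b == mean:
            dens = point_mass  # BAF was truncated to exactly 0 or 1
        else:
            # normal density renormalized to the interval [0, 1]
            mass = normal_cdf(1.0, mean, sd) - normal_cdf(0.0, mean, sd)
            dens = normal_pdf(b, mean, sd) / mass
            if homozygous:
                dens *= 1 - point_mass
        total += weight * dens
    return total
```

Mixing a point mass with continuous densities in a single number is a simplification that treats both as likelihood contributions on a common scale.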

#### Initial probability of copy number states

For the first marker, the initial probability of the copy number states for the trio is

$$
P(z_1 \mid DN_1, \lambda) = P(z_{1,f} \mid \lambda)\, P(z_{1,m} \mid \lambda)\, P(z_{1,o} \mid z_{1,f}, z_{1,m}, DN_1, \lambda),
$$

where the first two terms are the initial probabilities of copy number states for the father and mother, respectively, and the third term is the conditional probability of the copy number state of the offspring given the parental copy number states and the *de novo* event status of the offspring. If DN_{1} = 1, then a *de novo* event occurs. Assuming the offspring is equally likely to take any one of the five copy number states, *P*(*z*_{1,o} | *z*_{1,f}, *z*_{1,m}, DN_{1} = 1) = 1/5. We note that this is a simplifying assumption made for computational convenience, since in reality some *de novo* events (such as duplicating two additional copies) require more dramatic changes in genomic content than others (such as duplicating one copy). If DN_{1} = 0, then the offspring's copy number is determined by the parental copy numbers through Mendelian inheritance.

To model Mendelian inheritance of CNVs, it is necessary to specify models for chromosome-specific copy numbers. Given the total copy number, there might be multiple compatible chromosome-specific copy number configurations. A detailed illustration is given in Supplementary Figure 1. Because of this combinatorial complexity, the likelihood function should explicitly incorporate and appropriately weigh the different configurations of chromosome-specific copy numbers. To model the probability distribution of chromosome-specific copy numbers at a single marker, we propose a single-parameter model with the parameter *a*, which specifies the probability of the less likely chromosome-specific copy number configuration (Table 2). Once the parental chromosome-specific copy numbers are known, the probability distribution of the offspring's chromosome-specific copy numbers can then be obtained following Mendel's first law (Supplementary Tables 1–3).
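The single-marker inheritance calculation can be sketched as follows. The `CONFIGS` table is a hypothetical stand-in for Table 2, which is not reproduced here: in particular, which configuration is treated as "less likely" for each total copy number is an assumption. The value *a* = 0.0009 is the one used in the parameter estimation section, and offspring totals above four are folded into the four-or-more state.

```python
from collections import defaultdict

A = 0.0009  # parameter a, fixed at this value in the analysis

# ASSUMED stand-in for Table 2:
# total copy number -> {(copies on homolog 1, copies on homolog 2): probability}
CONFIGS = {
    0: {(0, 0): 1.0},
    1: {(0, 1): 1.0},
    2: {(1, 1): 1 - A, (0, 2): A},
    3: {(1, 2): 1 - A, (0, 3): A},
    4: {(2, 2): 1 - A, (1, 3): A},
}

def gamete_dist(total):
    """Distribution of the copy number carried by the transmitted homolog."""
    dist = defaultdict(float)
    for (c1, c2), prob in CONFIGS[total].items():
        # Mendel's first law: each homolog is transmitted with probability 1/2
        dist[c1] += 0.5 * prob
        dist[c2] += 0.5 * prob
    return dict(dist)

def offspring_dist(father_total, mother_total, max_state=4):
    """Distribution of the offspring's total copy number for an inherited marker."""
    out = defaultdict(float)
    for cf, pf in gamete_dist(father_total).items():
        for cm, pm in gamete_dist(mother_total).items():
            out[min(cf + cm, max_state)] += pf * pm
    return dict(out)
```

For example, `offspring_dist(1, 0)` assigns probability 0.5 each to an offspring with zero copies and one copy, as expected when one parent carries a single-copy deletion and the other a double-copy deletion.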

#### Transition probabilities of copy number states

For a parents–offspring trio, the transition probability of their copy number states from SNP *j* – 1 to SNP *j* is

The transition probability describes the probability of having a copy number state change between two adjacent SNPs. Intuitively, the copy number state is unlikely to change for SNPs that are nearby but is more likely to change for SNPs that are far apart. To appropriately model such spatial dependency, we use the following model to characterize the transition probability for the parents (*k* = *f* or *m*),

where *d*_{j} is the physical distance between SNPs *j* – 1 and *j*, and *D* is a constant that is set as 100 kb. The values of the *γ*s are treated as unknown parameters.
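As an illustration of a distance-dependent transition model, the sketch below scales baseline off-diagonal rates by 1 − exp(−*d*/*D*), so that nearby markers rarely change state and distant markers approach the baseline rates. This specific functional form is an assumption chosen to match the verbal description (it is a common choice in CNV HMMs); the function name and the example `gamma` values are hypothetical.

```python
import math

D_BP = 100_000  # the constant D = 100 kb, expressed in base pairs

def transition_matrix(gamma, d, big_d=D_BP):
    """Build a row-stochastic transition matrix for one marker interval.

    gamma[h][l] : baseline probability of moving from state h to state l (l != h)
    d           : physical distance in bp between markers j-1 and j
    ASSUMED form: off-diagonal = gamma[h][l] * (1 - exp(-d / D)),
                  diagonal     = 1 - sum of the off-diagonal entries in the row.
    """
    scale = 1.0 - math.exp(-d / big_d)
    n = len(gamma)
    out = [[0.0] * n for _ in range(n)]
    for h in range(n):
        off_diagonal = 0.0
        for l in range(n):
            if l != h:
                out[h][l] = gamma[h][l] * scale
                off_diagonal += out[h][l]
        out[h][h] = 1.0 - off_diagonal
    return out
```

The same construction can be reused for the *de novo* status chain by passing a 2 × 2 matrix of *δ* values instead of the *γ* values.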

For the offspring, we need to calculate . We note that

Thus, we need to calculate . This probability can be classified into four categories: *de novo* at both SNPs, *de novo* at SNP *j* – 1 only, *de novo* at SNP *j* only, and inherited at both SNPs.

When both SNPs are *de novo*, the parental copy numbers become irrelevant, implying that we can assume the offspring's copy number states follow a hidden Markov chain that is independent of the parents. Under this assumption,

where , and can be calculated based on the transition probability as described earlier for the parents.

When SNP *j* – 1 is *de novo* and SNP *j* is inherited, then a CNV breakpoint occurs between markers *j* – 1 and *j*, thus it is reasonable to assume that the copy number states of the offspring at these two markers are independent. Under this assumption,

where can be calculated based on Mendelian inheritance as specified in Supplementary Tables 1–3. The conditional probability when SNP *j* – 1 is inherited and SNP *j* is *de novo* can be calculated in a similar fashion.

When both SNPs *j* – 1 and *j* are inherited, the probability can be calculated based on Mendelian inheritance. Given the high density of SNPs on Illumina's whole-genome SNP genotyping arrays, it is reasonable to assume that there is no recombination between SNPs *j* – 1 and *j*, suggesting that we can treat these two SNPs as a single unit when calculating the Mendelian inheritance probabilities. To model Mendelian inheritance, we need to specify models for chromosome-specific copy numbers given the total copy numbers at two adjacent SNPs. Following a similar derivation of the chromosome-specific copy number model for a single SNP, here we propose a single-parameter model with parameter, *b*, which specifies the probability of the less likely chromosome-specific copy number configuration in which copy number changes occur at both chromosomes (Table 3). Such an assumption is reasonable since it is unlikely that copy number changes occur on both chromosomes unless the CNV is common. Due to the high-dimensionality (25 × 25 × 25) of the table for two-marker copy number inheritance, we do not provide it in the manuscript, but it is available in the source code of our software.

#### Transition probabilities of de novo event statuses for the offspring

The majority of the CNVs in the offspring are inherited from the parents, but a small fraction of the offspring's CNVs may occur due to meiotic recombination, mitotic recombination, or cell line-induced chromosome rearrangements. The transition probability of *de novo* event status describes the probability of the offspring's CNV changing from inherited to *de novo* or vice versa. Clearly, the transition probability is dependent on distance between two adjacent markers since markers that are close to each other are more likely to be located in the same inherited or *de novo* region. To model such spatial dependency, we adopt the same transition probability model that was previously described for copy number states but with different transition parameters,

where the values of *δ*s are treated as unknown parameters.

#### Parameter estimation and CNV calling

Inference on the hidden copy number states requires estimation of all unknown parameters, including the *μ*s and *σ*s for the signal intensity, the initial probabilities for copy number states *π*, the transition probability matrix Γ = (*γ*_{h,l}) for copy number states, the transition probability matrix Δ = (*δ*_{h,l}) for the *de novo* event status, and *a* and *b*, the parameters in the single- and two-marker chromosome-specific copy number models. It is computationally challenging to estimate these parameters given the high dimension of the data. Moreover, a single sample may not carry sufficient information for estimating all model parameters. However, assuming the samples are homogeneous, then we can select a set of training samples with large CNV regions through visually examining patterns of LRRs and BAFs to estimate the corresponding *μ*s and *σ*s for regions with different numbers of copies. In our analysis, we fixed the values of *a* and *b* at 0.0009. Evaluations with different values of *a* and *b* suggest that the results are robust to misspecification of their values (data not shown). The initial probabilities *π*, the *de novo* rate *ε* and the Δ matrix are estimated from previously published HapMap CNV results (12). To estimate the parameters in the transition matrix Γ, we used the Baum–Welch algorithm (21) to maximize the likelihood in Equation (2). Given a set of HMM parameters and the signal intensity data from a trio, we then used the Viterbi algorithm (22) to infer the most likely path (state sequences for all SNPs along each chromosome) for each of the individuals in the trio simultaneously. A CNV is called from the most likely state sequence, whenever a stretch of states that is different from the normal state is observed.
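The last two steps, Viterbi decoding and calling CNVs from runs of non-normal states, can be illustrated on a deliberately simplified model. The sketch below decodes a single chain with a fixed transition matrix in log space; it is not the joint trio model implemented in PennCNV, where the state at each marker is the combination of the three copy number states and the *de novo* status, and where transitions depend on marker distance. The function names and the 0-based state encoding (index 2 as the normal two-copy state) are assumptions.

```python
def viterbi(log_init, log_trans, log_emit):
    """Most likely state path for one chain.

    log_init[s]     : log initial probability of state s
    log_trans[p][s] : log probability of moving from state p to state s
    log_emit[t][s]  : log emission probability of the data at marker t in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][s] + log_emit[t][s])
        score = new_score
        backpointers.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def call_cnvs(path, normal_state=2):
    """Return (first_marker, last_marker, state) for each run of a non-normal state."""
    calls, start = [], None
    for i, s in enumerate(path + [normal_state]):  # sentinel closes a trailing run
        if start is not None and s != path[start]:
            calls.append((start, i - 1, path[start]))
            start = None
        if start is None and s != normal_state:
            start = i
    return calls
```

Adjacent runs with different non-normal states, for example a deletion immediately followed by a duplication, are reported as separate calls.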

### Availability

All the CNV calling algorithms have been implemented in the latest version of PennCNV, which is freely and publicly available at http://www.openbioinformatics.org/penncnv/. The Affymetrix LRR/BAF data transformation programs, which were used in this study for the Affymetrix genome-wide 5.0 arrays, were also made available as a beta version. A set of standard HMM models are provided for commonly used arrays; however, like commercial software such as Partek and GoldenHelix, users have the freedom to tweak all HMM parameters, the CNV inheritance models, as well as the population frequency of B allele parameters, which are suitable for custom-made arrays.

## RESULTS

We have developed a joint-calling algorithm for CNV detection in parent–offspring trios, using a hidden Markov framework that simultaneously models family relationship and signal intensities. This CNV calling algorithm differs substantially from the previously described family-based CNV calling algorithm (12) in that, first, copy number estimates are given with respect to a parent–offspring trio simultaneously in one step instead of two steps (Figure 1), and second, it gives probabilistic estimates of chromosome-specific copy numbers. Therefore, we compared the performance of the proposed joint-calling algorithm with existing algorithms that either do not incorporate family relationship or use them separately (12). We first performed simulation studies and then analyzed experimentally validated CNVs from multiple families in a real dataset genotyped using the Illumina SNP arrays. Furthermore, we used several concrete examples from real data to demonstrate how chromosome-specific copy numbers and SNP allele composition within CNVs can be inferred from family data. Finally, to demonstrate the versatility of the proposed method, we tested it on inherited and *de novo* CNVs from a set of families genotyped using both the Illumina HumanHap550 and the Affymetrix genome-wide 5.0 SNP arrays.

### Comparative analysis of CNV detection on simulated data

To evaluate the performance of the proposed joint-calling algorithm under various scenarios of CNV inheritance, we performed computer simulations. We generated signal intensity data, as represented by LRR and BAF values, for 27,742 SNPs on chromosome 11 for the HumanHap550 SNP array, based on allele frequency and CNV size distribution from the empirical data obtained from the HapMap CEU samples (12). We tested a total of eight different inheritance scenarios of parent–offspring CNV combinations (Figure 2); for each scenario, we called CNVs using (i) the individual-calling algorithm that treats family members as if they were unrelated, (ii) the posterior-calling algorithm as described before (12) and (iii) the joint-calling algorithm as proposed in this paper. A total of 1000 data sets are simulated for each scenario, and there are either 1000 or 2000 true CNVs for each scenario depending on whether the CNVs are transmitted to the offspring. Given that *de novo* CNVs are rare (23–25), we did not consider them in the simulations; instead, we evaluated *de novo* CNVs in real data analysis as shown in a later section.

**...**

For each simulation scenario, the number of ‘exactly correct’ CNV calls (CNV calls with identical copy number and identical boundaries as the true CNVs) is shown in Figure 2. We can see that the three calling algorithms have similar performance for scenarios 3, 4 and 6, but for the other scenarios, the joint-calling algorithm yielded a substantially larger number of ‘exactly correct’ calls. Another important criterion of comparing different CNV calling algorithms is the number of false positive and false negative calls. Here, we refer to a CNV call as false positive if the call does not overlap with the true CNV, and we refer to a CNV as false negative if it is not detected by the CNV calling algorithm. Supplementary Figure 2 shows the numbers of false positive and false negative calls for the simulated data. We observed that when the offspring's CNV is inherited, the performance of the joint-calling algorithm far exceeds the other two algorithms, especially for duplication CNVs. For example, for scenario 7, where the father has duplication on both chromosomes and the offspring has duplication on only one chromosome, the number of false negative calls for the individual-calling algorithm is 646; it drops down to 404 and 162, respectively, for the posterior-calling algorithm and the joint-calling algorithm. Our results suggest that efficient utilization of family information can significantly improve the call rates as well as the accuracy of CNV boundary inference.
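The three evaluation criteria used in this comparison can be expressed compactly. The sketch below assumes that calls and true CNVs are given as `(first_marker, last_marker, copy_number)` tuples with inclusive marker indices, and it interprets "not detected" as "no call overlaps the true CNV"; both the representation and that interpretation are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if two inclusive marker intervals share at least one marker."""
    return a[0] <= b[1] and b[0] <= a[1]

def evaluate(calls, truths):
    """Count exactly correct calls, false positives and false negatives."""
    exact = sum(1 for c in calls if c in truths)  # same boundaries and copy number
    false_pos = sum(1 for c in calls
                    if not any(overlaps(c, t) for t in truths))
    false_neg = sum(1 for t in truths
                    if not any(overlaps(c, t) for c in calls))
    return {"exact": exact, "false_positive": false_pos, "false_negative": false_neg}
```

A call that overlaps a true CNV but has the wrong boundaries or copy number counts toward none of the three totals, which is why the ‘exactly correct’ count and the false negative count are reported separately.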

### Comparative analysis of CNV detection on real data

To test the performance of the joint-calling algorithm on real data, we examined 10 families (34 subjects) from the Autism Genetic Resource Exchange (AGRE) Consortium. All study samples were genotyped using the Illumina HumanHap550 SNP array (14). To compare the performance of the three calling algorithms, we focused on the 4p16.1 deletions between *WDR1* and *ZNF518B*, which span only four SNPs, making CNV detection especially difficult. We designed PCR-walking experiments to validate the CNVs and mapped the approximate breakpoints. We then selected a pair of primers to infer the true copy numbers of all subjects by PCR amplification of the genomic segment encompassing the CNV breakpoints. Finally, we re-sequenced the short PCR product to confirm that the breakpoints are identical among unrelated families. Since the true copy numbers for all subjects are known experimentally (Supplementary Figure 3), we compared the CNV calls from the three algorithms with the true copy numbers. Collapsing all families together, there are a total of 30 true CNVs. For the individual-calling algorithm, only 15 CNVs were detected, implying a relatively high false negative rate. In contrast, both the posterior-calling algorithm and the joint-calling algorithm detected all 30 CNVs in all families. However, for one family, the posterior-calling algorithm produced a CNV call with only three SNPs, resulting in a slight discordance in boundary inference. The joint-calling algorithm, on the other hand, completely recovered all true CNVs, and all of its CNV calls have the correct boundaries with four SNPs. This comparative analysis on real data corroborates our analysis on simulated data, and confirms that the joint-calling algorithm improves accuracy in boundary inference and decreases the false negative rate.

### Inference of chromosome-specific copy numbers from family data

Family information can be used to infer CNV genotypes, that is, chromosome-specific copy numbers on each of the two homologous chromosomes. To illustrate this point with a concrete example, we show in Figure 3 an AGRE family in which all family members carry a ∼130 kb duplication on 22q11.21, which encompasses the *PRODH* and *DGCR6* genes. The results from the individual-calling, the posterior-calling and the joint-calling algorithms are concordant for this family, revealing that the first child has four copies of this CNV region, whereas the father, the mother and the sibling each carry three copies. As shown in Table 2, when the total copy number is 3, the corresponding chromosome-specific copy numbers can be either 1/2 or 0/3, and when the total copy number is 4, the corresponding chromosome-specific copy numbers can be either 1/3 or 2/2. Despite such uncertainty, when the family relationship is considered, we can infer confidently that the chromosome-specific copy numbers for the father, the mother, the first child and the second child must be 1/2, 1/2, 2/2 and 1/2, respectively, and that the first child inherits the duplicated chromosome from both parents. If only one child were available in this family, we could still infer the most likely chromosome-specific copy number combinations in the parents–offspring trio, albeit with less confidence. This example is merely an illustration of how additional family information can be used to increase confidence in chromosome-specific copy number estimates, compared to the ‘prior distribution’ in Table 2.
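The reasoning in this example can be checked by brute-force enumeration. The sketch below uses only the configurations named in the text for total copy numbers 3 and 4, ignores *de novo* events, and keeps every assignment in which each child receives one homolog from each parent; the function names are hypothetical.

```python
from itertools import product

# Chromosome-specific configurations named in the text for totals 3 and 4
CONFIGS = {3: [(1, 2), (0, 3)], 4: [(1, 3), (2, 2)]}

def consistent(father, mother, child):
    """True if the child's two homologs can come one from each parent."""
    return any(sorted((cf, cm)) == sorted(child)
               for cf in father for cm in mother)

def family_solutions(father_total, mother_total, child_totals):
    """All (father, mother, child_1, ...) configurations obeying Mendelian inheritance."""
    solutions = []
    for f, m in product(CONFIGS[father_total], CONFIGS[mother_total]):
        options = [[c for c in CONFIGS[t] if consistent(f, m, c)]
                   for t in child_totals]
        if all(options):  # every child has at least one compatible configuration
            solutions.extend((f, m) + kids for kids in product(*options))
    return solutions
```

Calling `family_solutions(3, 3, [4, 3])` leaves a single assignment, 1/2, 1/2, 2/2 and 1/2 for the father, mother, first child and second child, whereas `family_solutions(3, 3, [4])` leaves three, which is the loss of confidence described for a trio with only the first child.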

### Inference of chromosome-specific SNP genotypes in CNVs from family data

For CNV calls generated on SNP genotyping arrays, we can also use the SNP genotypes within the CNV to infer the SNP allele composition for each of the two homologous chromosomes, that is, chromosome-specific SNP genotypes. Unlike the ‘called SNP genotypes’ given by genotype-calling software, which comprise only three types of allele compositions (AA, AB and BB), the ‘real SNP genotypes’ within a CNV can be jointly inferred from the BAF values and the total copy numbers (for example, A, BB, ABB and AABB, in Table 1). Knowing the SNP allele composition within an inherited CNV is important for the development of linkage and association tests on CNVs for disease phenotypes. To illustrate this, we analyzed a large segregating pedigree in AGRE with six offspring, five of whom are affected by autism spectrum disorders (four with a strict autism diagnosis and one with a broad spectrum diagnosis). Figure 4 displays the chromosome-specific SNP genotypes on the first 10 SNPs in the 10q11.22 duplication CNV region for each of the individuals. By examining SNP genotypes, we can disambiguate the four parental CNV haplotypes with clear SNP allele composition (Figure 4, Supplementary Table 4).
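To make the relationship between BAF, total copy number and the ‘real SNP genotype’ concrete, the following minimal sketch picks the allele composition whose expected BAF is closest to the observed value. It assumes that a genotype with k B alleles among n copies has an expected BAF of k/n and uses a simple nearest-value rule; this is an illustrative simplification, not the emission model of the HMM, and the function name is hypothetical.

```python
def real_snp_genotype(total_copy_number, baf):
    """Pick the allele composition whose expected BAF is closest to baf."""
    if total_copy_number == 0:
        return ""  # homozygous deletion: no alleles present
    n = total_copy_number
    # A genotype with k B alleles out of n copies has expected BAF k / n.
    best_k = min(range(n + 1), key=lambda k: abs(k / n - baf))
    return "A" * (n - best_k) + "B" * best_k


# Illustrative (made-up) BAF values for four total copy numbers.
for copy_number, baf in [(1, 0.02), (2, 0.97), (3, 0.66), (4, 0.50)]:
    print(copy_number, baf, real_snp_genotype(copy_number, baf))
```

In practice BAF values are noisy, so a probabilistic assignment would be preferable to this hard nearest-value call.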


Furthermore, Supplementary Figure 4 and Supplementary Table 5 show another example of chromosome-specific SNP genotypes in this pedigree at a duplication CNV on 6q27, which co-segregates with autism in this family. However, unlike the 10q11.22 duplication, since the mother is homozygous without copy number change in the 6q27 region, the transmission patterns of the two maternal chromosomes cannot be discriminated. In addition, we also analyzed the 22q11.21 duplication in the family presented in Figure 3: the use of family relationship allows the identification of the SNP allele composition and parental origin for each of the two duplicated homologous chromosomes in the first child, who carries four copies of this region (Supplementary Figure 5 and Supplementary Table 6). All these examples suggest the importance of examining family relationship and incorporating SNP genotypes into the analysis of CNVs. Efficient utilization of such information can provide valuable insights into studying the biological aspects of CNVs, including their evolutionary history as well as their genetic transmission patterns.

### Detection of inherited and *de novo* CNVs from Illumina and Affymetrix SNP arrays

Although our algorithm was originally developed for Illumina data, it is general enough to be readily applied to data generated from other technical platforms. To demonstrate such utility, we analyzed a set of AGRE families genotyped with both the Illumina HumanHap550 SNP arrays by us and the Affymetrix genome-wide 5.0 SNP arrays by others (13). All these families contain at least one family member with an experimentally validated 16p11.2 deletion or duplication, including three CNVs inherited from the father in family AU0029 and five *de novo* CNVs in the offspring in four other families. This CNV region is flanked by two ∼146 kb segmental duplications, which share 99.6% sequence identity with each other and are 593 kb apart (Figure 5). For the Illumina HumanHap550 array, the CNV is covered by 47 SNPs spanning 530 kb. The Affymetrix genome-wide 5.0 Human SNP array contains 82 markers between the segmental duplications; it also contains three additional markers within the segmental duplications that lack a unique genomic location, so we removed these three markers from our analysis. An exactly correct CNV call from the Affymetrix array should therefore contain 82 markers (28 SNP markers and 54 NP markers) spanning 569 kb. These families provide an ideal basis for comparison of different CNV calling algorithms and different technical platforms.


We compared the performance of the individual-calling, the posterior-calling and the joint-calling algorithms (Figure 5). All three algorithms gave correct CNV calls in all individuals carrying the CNV, indicating the high sensitivity and specificity of these algorithms in detecting large inherited or *de novo* CNVs. However, the algorithms differ in their ability to detect the exact CNV boundaries, which is especially obvious for the Affymetrix array due to its higher levels of background noise in signal intensity data. This example illustrates the ability of the joint-calling algorithm to detect both inherited and *de novo* CNVs with accurate boundary prediction, and its broad applicability to arrays that incorporate NP markers.

## DISCUSSION

We have developed a formal statistical framework to model the genetic inheritance of CNVs, via an HMM that simultaneously considers family relationship and signal intensities for parent–offspring trios. Our method considers the trio as a unit and calls their CNVs simultaneously, thus avoiding the generation of calls with Mendelian inconsistency while maintaining the ability to detect *de novo* events. Moreover, our method allows the probabilistic estimation of chromosome-specific copy numbers, which can be used in subsequent CNV analysis. By extensive simulations and analysis of real family data, we showed that when the offspring's CNVs are inherited from the parents, the proposed method improves the call rates and the accuracy of boundary inference over existing methods. Although we present the algorithm for parent–offspring trios only, our method can be extended to the analysis of nuclear families with multiple offspring, and we demonstrated the utility of information from additional family members in Figure 3. Altogether, we hope that our method and software (http://www.openbioinformatics.org/penncnv) will be of great value to genome-wide CNV studies using family data.

Although we described our CNV calling algorithm for data generated from Illumina HumanHap550 and Affymetrix genome-wide 5.0 SNP arrays, we note that data derived from Illumina Human1M and Affymetrix genome-wide 6.0 SNP arrays are similar in nature and can therefore be analyzed directly with the proposed algorithm. Moreover, our algorithm can be extended to other platforms, such as array-CGH experiments, or oligonucleotide tiling arrays. For these non-SNP arrays, since no allele frequency information can be inferred, only the signal intensities at each marker contribute to likelihood calculation. We note that due to the lower precision of array-CGH experiments, one might consider using ‘loss’ and ‘gain’, rather than the exact copy number, in the model. In such cases, the number of hidden copy number states reduces to three, and the various CNV inheritance tables need to be adjusted accordingly by combining copy numbers zero and one into a single ‘loss’ state, and copy numbers three and four into a single ‘gain’ state.
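The state reduction described above can be sketched as a simple mapping from exact copy numbers to three coarse states. The example below collapses a distribution over copy numbers 0 to 4 into ‘loss’, ‘normal’ and ‘gain’; the probabilities are made-up values for illustration only, and the names `COLLAPSE` and `collapse_distribution` are hypothetical. Conditional inheritance tables would need an analogous, appropriately weighted adjustment, which this sketch does not attempt.

```python
# Map each exact copy number state to a coarse state (assumed mapping,
# following the loss/gain grouping described in the text).
COLLAPSE = {0: "loss", 1: "loss", 2: "normal", 3: "gain", 4: "gain"}


def collapse_distribution(copy_number_probs):
    """Sum probabilities over exact copy numbers into loss/normal/gain."""
    collapsed = {"loss": 0.0, "normal": 0.0, "gain": 0.0}
    for copy_number, prob in copy_number_probs.items():
        collapsed[COLLAPSE[copy_number]] += prob
    return collapsed


# Made-up distribution over the five copy number states.
example = {0: 0.01, 1: 0.04, 2: 0.90, 3: 0.04, 4: 0.01}
print(collapse_distribution(example))
```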

Compared to a previously published posterior-calling algorithm (12), the proposed joint-calling algorithm has several distinct advantages. First, instead of using family information in a separate step, the joint-calling algorithm models the family information jointly with the signal intensities and thus uses the data more efficiently. Second, if the CNV boundary is inferred incorrectly in the first step of the posterior-calling algorithm, it cannot be corrected in the second step; in contrast, as evidenced by our simulation results and analysis of the AGRE families, the joint-calling algorithm is more likely to infer the correct boundary for inherited CNVs. Another unique feature of the joint-calling algorithm is its ability to give probabilistic estimates of chromosome-specific copy numbers, which is only feasible when family information is modeled simultaneously with signal intensities.

Despite the distinct advantages of the joint-calling algorithm, we recognize that it is computationally intensive and requires more assumptions than the posterior-calling algorithm. First, in the joint-calling algorithm, the family relationship needs to be modeled jointly with the signal intensities, which requires 5 × 5 × 5 × 2 states (five states for each individual and two *de novo* states for the offspring) in the HMM for each marker in the genome; in contrast, the original formulation of the posterior-calling algorithm needs only six HMM states for each of the three individuals in a trio. Second, due to the increased number of hidden states, the joint-calling algorithm requires more memory than the posterior-calling algorithm, which may be a problem for future ultra high-density oligonucleotide arrays with tens of millions of markers. However, these problems can be solved by analyzing chromosome segments sequentially and then combining the results. Third, we note that the joint-calling algorithm makes more assumptions (such as the parameters used in Tables 2 and 3, as well as the *de novo* indicator transition rate) than the posterior-calling algorithm. The inherent complexity of the model dictates that some parameters must be estimated directly from empirical data rather than inferred by maximum likelihood. However, we note that the accuracy of these parameters only affects rare scenarios and has little effect on the overall CNV calls. For example, increasing the transition rate of the DN indicator from ‘inherited’ to ‘*de novo*’ 10-fold only makes the detection of *de novo* events in the child less sensitive, but has virtually no effect on the detection of inherited CNVs in the trio or non-transmitted CNVs in the parents, which comprise the majority of CNVs in a family (data not shown).
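The difference in state-space size can be made explicit with a short calculation. The sketch below enumerates the joint states for a trio and gives a rough memory estimate for a dynamic-programming table with one entry per marker and state. The one-byte entry size, the marker counts and the function name are illustrative assumptions, not properties of the published software.

```python
from itertools import product

COPY_NUMBERS = range(5)              # five copy number states: 0-4
DE_NOVO = ("inherited", "de novo")   # indicator for the offspring

# Joint hidden states: (father, mother, offspring, de novo indicator).
joint_states = list(product(COPY_NUMBERS, COPY_NUMBERS, COPY_NUMBERS, DE_NOVO))
n_joint = len(joint_states)          # 5 * 5 * 5 * 2
n_posterior = 6 * 3                  # six states for each of three individuals


def table_bytes(n_markers, n_states, bytes_per_entry=1):
    """Rough size of a table with one entry per marker and hidden state."""
    return n_markers * n_states * bytes_per_entry


print(n_joint, n_posterior)
# Hypothetical marker counts: a whole array versus one chromosome segment.
print(table_bytes(10_000_000, n_joint))
print(table_bytes(100_000, n_joint))
```

Because the table size grows linearly with the number of markers, processing one segment at a time bounds the memory by the segment length rather than by the size of the array.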

For CNVs identified from high-density SNP genotyping data, another important piece of information is the SNP genotypes of markers within the CNVs; for example, SNP genotypes can be used to characterize the parental origin of *de novo* events (12). In addition, SNP genotypes can help interpret inherited CNVs and extract more biological information, including probabilistic models for chromosome-specific copy numbers. We demonstrated the use of SNP genotype information on an AGRE family in which we can infer with certainty the chromosome-specific copy numbers and the corresponding chromosome-specific SNP genotypes. Results from such analysis can be used to evaluate preferential transmission patterns in a transmission disequilibrium test framework. In addition, information on the parental origin of a CNV will be particularly important for the analysis of allelic imbalance and can help interpret gene expression differences between the two homologous chromosomes.

In summary, we have developed a statistical framework to model the genetic inheritance of CNVs for parents–offspring trios. The likelihood calculation makes it easily extendable to nuclear families with multiple offspring. We believe that the application, adaptation and extension of our model in future studies will greatly facilitate the development of CNV detection algorithms for data generated from various technical platforms, and will foster the development of powerful and efficient linkage and association tests utilizing CNVs.

## SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

## FUNDING

NARSAD Distinguished Investigator Award (to M.B.); University Research Foundation grant, McCabe Pilot Award from the University of Pennsylvania (to M.L.); National Institutes of Health (grant R01HG004517), an Institute Development Award to the Center for Applied Genomics from the Children's Hospital of Philadelphia (to H.H., S.F.A.G., J.G.). Funding for open access charge: National Institutes of Health (grant R01HG004517 to M.L.).

*Conflict of interest statement*. None declared.

## ACKNOWLEDGEMENTS

We thank Edmund Weisberg in the Center for Clinical Epidemiology and Biostatistics at the University of Pennsylvania for editing assistance. We thank two anonymous reviewers for their insightful suggestions on realistic data simulations and presentation of concrete real data examples. We gratefully acknowledge the resources provided by the Autism Genetic Resource Exchange (AGRE) Consortium (members of the consortium listed in Appendix 1) and the participating AGRE families. We are most grateful to the Children's Hospital of Philadelphia and the Broad Institute for providing us with access to the genotype data from the Illumina and Affymetrix platforms, respectively. National Institute of Mental Health (grant 1U24MH081810 to Clara M. Lajonchere PI partially); National Institutes of Health (grant R01-MH604687).

## Appendix 1: The AGRE Consortium

Daniel Geschwind, MD, PhD, UCLA, Los Angeles, CA; Maja Bucan, PhD, University of Pennsylvania, Philadelphia, PA; W. Ted Brown, MD, PhD, FACMG, N.Y.S. Institute for Basic Research in Developmental Disabilities, Staten Island, NY; Rita M. Cantor, PhD, UCLA School of Medicine, Los Angeles, CA; John N. Constantino, MD, Washington University School of Medicine, St Louis, MO; T.Conrad Gilliam, PhD, University of Chicago, Chicago, IL; Martha Herbert, MD, PhD, Harvard Medical School, Boston, MA; Clara Lajonchere, PhD, Autism Speaks, Los Angeles, CA; David H. Ledbetter, PhD, Emory University, Atlanta, GA; Christa Lese-Martin, PhD, Emory University, Atlanta, GA; Janet Miller, J.D., PhD, Autism Speaks, Los Angeles, CA; Stanley F. Nelson, MD, UCLA School of Medicine, Los Angeles, CA; Gerard D. Schellenberg, PhD, University of Washington, Seattle, WA; Carol A. Samango-Sprouse, Ed.D, George Washington University, Washington, DC; Sarah Spence, MD, PhD, UCLA, Los Angeles, CA; Matthew State, MD, PhD, Yale University, New Haven, CT; Rudolph E. Tanzi, PhD, Massachusetts General Hospital, Boston, MA.

## REFERENCES

**Oxford University Press**

Modeling genetic inheritance of copy number variations. Nucleic Acids Research. 2008 Dec; 36(21): e138.
