Hum Genet. Author manuscript; available in PMC Apr 3, 2009.
Published in final edited form as:
PMCID: PMC2665168
NIHMSID: NIHMS101226

Identification of common genetic variants that account for transcript isoform variation between human populations

Wei Zhang, Shiwei Duan, Wasim K. Bleibel, Steven A. Wisel, R. Stephanie Huang, Xiaolin Wu, and Lijun He
Section of Hematology/Oncology, Department of Medicine, The University of Chicago, Box MC6091, 5841 S. Maryland Ave., Chicago, IL 60637, USA
Tyson A. Clark, Tina X. Chen, Anthony C. Schweitzer, and John E. Blume
Expression Research Laboratory, Affymetrix Inc., Santa Clara, CA 95051, USA
M. Eileen Dolan
Section of Hematology/Oncology, Department of Medicine, The University of Chicago, Box MC6091, 5841 S. Maryland Ave., Chicago, IL 60637, USA

Abstract

In addition to the differences between populations in transcriptional and translational regulation of genes, alternative pre-mRNA splicing (AS) is also likely to play an important role in regulating gene expression and generating variation in mRNA and protein isoforms. Recently, the genetic contribution to transcript isoform variation has been reported in individuals of recent European descent. We report here results of an investigation of the differences in AS patterns between human populations. AS patterns in 176 HapMap lymphoblastoid cell lines derived from individuals of European and African ancestry were evaluated using the Affymetrix GeneChip® Human Exon 1.0 ST Array. A variety of biological processes such as response to stimulus and transcription were found to be enriched among the differentially spliced genes. The differentially spliced genes also include some involved in human diseases that have different prevalence or susceptibility between populations. The genetic contribution to the population differences in transcript isoform variation was then evaluated by a genome-wide association using the HapMap genotypic data on single nucleotide polymorphisms (SNPs). The results suggest that local and distant genetic variants account for a substantial fraction of the observed transcript isoform variation between human populations. Our findings provide new insights into the complexity of the human genome as well as the health disparities between the two populations.

Introduction

The existence of health disparities between human populations, for example, the differential response to therapeutic treatments (Huang et al. 2007) and higher risks of certain common diseases, has been reported by clinical scientists. However, the genetic basis for population differences in clinical outcomes and risk of common disease is not fully understood (Huang et al. 2007; Ioannidis et al. 2004; Kurian and Cardarelli 2007). In the past few years, gene expression has been studied as a quantitative complex phenotype (Morley et al. 2004; Stranger et al. 2005), which sits between genetic/non-genetic variations and other more complicated cellular or whole-body phenotypes. Therefore, studying variation in gene expression between populations may help explain these health disparities. In addition to several differences between populations in transcriptional and translational regulation of genes, alternative pre-mRNA splicing (AS) is also likely to play an important role in regulating gene expression and generating variation in mRNA and protein isoforms. The initial sequencing and analysis of the human genome suggested an unexpectedly low gene number of 30,000-35,000 (Lander et al. 2001; Venter et al. 2001), which raises the question of the source of the complexity of the human genome. Numerous studies such as those using expressed sequence tags (ESTs) and cDNAs aligned to the genomic sequences have shown that AS is prevalent in mammalian genomes (Sorek et al. 2004). It has been estimated that between one-third and two-thirds of all human genes undergo alternative splicing (Sorek et al. 2004), and the disruption of specific AS events has been implicated in several human genetic diseases including cancer (Brinkman 2004; Faustino and Cooper 2003; Novoyatleva et al. 2006).

Studies using the International HapMap Project (http://www.hapmap.org) resources (Frazer et al. 2007; International HapMap Consortium 2003, 2005; Zhang et al. 2008b) have shown that common genetic variants in the form of single nucleotide polymorphisms (SNPs) contribute to gene expression variation within the same HapMap population (Duan et al. 2008a) as well as between the different HapMap populations (Spielman et al. 2007; Storey et al. 2007; Stranger et al. 2007; Zhang et al. 2008a). The Phase I/II HapMap samples are comprised of a panel of lymphoblastoid cell lines (LCLs) derived from individuals of northern and western European ancestry collected by Centre d'Etude du Polymorphisme Humain (CEPH) (CEU: CEPH individuals from Utah, USA, 30 parents-child trios), individuals of African ancestry (YRI: Yoruba people from Ibadan, Nigeria, 30 parents-child trios) and individuals of eastern Asian ancestry (CHB: Han Chinese from Beijing, China, 45 unrelated samples; JPT: Japanese from Tokyo, Japan, 45 unrelated samples). Recently, studies have begun to demonstrate the genetic contribution to the transcript isoform variation in the unrelated CEU samples (Hull et al. 2007; Kwan et al. 2007, 2008). However, the systematic comparison of the transcript isoform variation including AS events between human populations and their regulation by common genetic variants have not been comprehensively investigated. We therefore utilized the Affymetrix GeneChip® Human Exon 1.0 ST Array (exon array), which contains probes for ~20,000 well-annotated human genes (~1.4 million annotated and predicted exons corresponding to 17,745 transcript clusters using the core set of exon-level probesets supported by RefSeq (Pruitt et al. 2007)), to study 176 HapMap samples (87 CEU and 89 YRI) from parents-offspring trios.

One potential problem with the use of oligonucleotide expression arrays is the possibility that SNPs located within probes could affect hybridization efficiency (Gilad et al. 2005) and lead to false expression quantitative trait loci (eQTLs) (Alberts et al. 2007). This effect was also observed in our exon array expression data. We described this effect in a previous publication using HLA-DPB1 as an example (Zhang et al. 2008a). To reduce the potential variability associated with this effect, we filtered out probesets (exon-level) containing all known SNPs in the current dbSNP database (version 129) (Duan et al. 2008b) maintained by the National Center for Biotechnology Information (NCBI) before summarizing transcript cluster (gene-level) expression signals. In addition, a recent publication suggested that the effect of unannotated or undiscovered SNPs is quite small for the exon array using the unrelated CEU samples (Kwan et al. 2008). Our goals were then to identify probesets that showed transcript isoform variation between these two populations, to determine what biological processes or pathways were enriched in the genes containing differentially spliced probesets and to evaluate the contribution of local and distant genetic variants (SNPs) to the observed population differences in transcript isoform variation (see Supplemental Fig. 1 for the workflow). Specifically, we focused on the differences between the CEU and YRI samples in simple cassette exon skipping events. Splicing index (SI), defined as the relative contribution of a probeset (exon-level) to transcript cluster (gene-level) expression (Affymetrix Inc. 2006; Gardina et al. 2006), was used to evaluate any transcript isoform variation between the two populations.

Materials and methods

Cell lines, RNA isolation and chip hybridization

Details for this part including our approach to avoid systematic bias were described in a previous publication (Zhang et al. 2008a). Briefly, HapMap cell lines (International HapMap Consortium 2003, 2005) (30 CEU trios and 30 YRI trios) were purchased from Coriell Institute for Medical Research (Camden, NJ). Two CEU samples (GM10855 and GM12236) were not available from Coriell at the time of the study. The viability of two lines (GM12716, GM18871) was below 85% at the sample collection time. Therefore, a total of 176 cell lines (87 CEU samples and 89 YRI samples) were included in this study. Total RNA was extracted using Qiagen Qiashredder and RNeasy plus kits (Qiagen, Germantown, MD) according to manufacturer's protocol. All 176 RNA samples had high quality and showed no signs of DNA contamination or RNA degradation. RNA samples were immediately frozen and stored at -80°C. For each cell line, ribosomal RNA was depleted and cDNA was generated, which was fragmented and end labeled. Approximately 5.5 μg of labeled DNA target was hybridized to the Affymetrix GeneChip® Human Exon 1.0 ST Array at 45°C for 16 h per manufacturer's recommendation (http://www.affymetrix.com/products/arrays/exon_application.affx). Hybridized arrays were then washed and scanned on a GCS3000 Scanner (Affymetrix, Santa Clara, CA).

Data Filtering for SNPs in probes, signal normalization and summarization

Expression arrays were analyzed using the Affymetrix PowerTools v1.8.6 (http://www.affymetrix.com/support/developer/powertools/index.affx). The start and end coordinates of all probes represented on the exon array were queried and determined against the human genome (hg18). The coordinates for all SNPs were then queried in the dbSNP database (version 129) (http://www.ncbi.nlm.nih.gov/projects/SNP) and used to identify probes harboring known SNPs. Of the ~1.4 million probesets on the exon array, 350,382 probesets contained at least one probe with a SNP (~600,000 probes). The probeset signal intensity files were filtered by removing those ~600,000 probes from the probesets harboring these known SNPs (Duan et al. 2008b). Probe intensities were then background corrected and quantile normalized over all 176 samples. The data were then log2 transformed with a median polish. Gene-level expression of 17,745 transcript clusters was summarized using the RMA (robust multi-array average) (Irizarry et al. 2003) method with signals generated on a core set [i.e., with RefSeq-supported (Pruitt et al. 2007) annotation] of exons (~110,000 probesets). A transcript cluster or probeset was defined to be reliably expressed in LCLs if the log2 transformed expression signal was greater than 6 in at least 80% of the 176 samples. A total of 8,565 of the 17,745 core transcript clusters met these criteria. To avoid annotation ambiguity, the final analysis dataset is comprised of 7,701 expressed transcript clusters (corresponding to 102,729 probesets, a minimum of 3 probesets for each transcript cluster) with unique gene annotations (based on NCBI Human Genome Build 34) as retrieved from the Affymetrix NetAffx Analysis Center (http://www.affymetrix.com/analysis/index.affx).
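The expression filter described above (log2 signal greater than 6 in at least 80% of samples) can be sketched as follows. This is a minimal illustration, not the authors' pipeline; the function name and the toy signal values are hypothetical.

```python
def reliably_expressed(log2_signals, threshold=6.0, min_fraction=0.8):
    """True if the log2 signal exceeds `threshold` in at least
    `min_fraction` of the samples."""
    n_above = sum(1 for s in log2_signals if s > threshold)
    return n_above / len(log2_signals) >= min_fraction

# Hypothetical toy data: one row of log2 signals per transcript cluster.
toy = {
    "cluster_A": [7.1, 6.5, 8.0, 6.2, 5.9],  # 4 of 5 samples above 6
    "cluster_B": [5.0, 6.1, 4.8, 5.5, 6.3],  # 2 of 5 samples above 6
}
expressed = [name for name, row in toy.items() if reliably_expressed(row)]
print(expressed)
```

In the study the same rule would be applied across all 176 samples, to both transcript clusters and probesets.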

Detecting differentially spliced probesets between populations

Candidate probesets differentially spliced between the CEU and YRI samples were detected by calculating the splicing index (SI) (Affymetrix Inc. 2006; Gardina et al. 2006). The SI is the log-transformed exon-level probeset intensity normalized by the gene-level transcript cluster intensity in each sample: $SI_{i,j} = \log(e_{i,j}/g_i)$, where $e_{i,j}$ is the intensity of the jth probeset of the ith transcript cluster, $g_i$ is the intensity of the ith transcript cluster and $SI_{i,j}$ is the splicing index of the jth probeset of the ith transcript cluster. The permutation-based free step-down approach of Westfall-Young (W-Y approach) (Westfall and Young 1993) was used to detect probesets with differential SI values between the CEU and YRI samples. The basic test was the standard pooled variance t statistic. Because of the relatedness among family members, trios were permuted between the two populations. The W-Y approach (n = 10,000 permutations) was then used to compute simultaneous P values that control the overall or family-wise error rate. In addition, the W-Y approach (n = 10,000 permutations) was applied on the unrelated CEU or YRI samples to detect potential differential probesets between males and females. The probesets with a significant permutation-adjusted P value (Pc < 0.01) were chosen for further analyses. The permutation-adjusted one-sided P values were calculated using the software Permax 2.2, http://biowww.dfci.harvard.edu/~gray/permax.html, which has an implementation of the W-Y approach and is provided as a contributory library by Robert Gray in the R statistical package (R Development Core Team 2005). The annotations for the differentially spliced probesets including gene symbol, cytoband and whether the probeset overlaps coding regions were retrieved from the Affymetrix NetAffx Analysis Center.
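The procedure above can be illustrated with a small sketch: the SI is computed on log-scale intensities, a pooled-variance t statistic is evaluated per probeset, and step-down adjusted P values are obtained by permuting population labels at the family level so that relatives stay together. This is not the Permax implementation. It assumes two-sided |t| statistics (the study reports one-sided values), and all function names, labels and toy data are hypothetical.

```python
import math
import random

def splicing_index(log_exon, log_gene):
    """SI = log(e / g) = log(e) - log(g), one value per sample."""
    return [e - g for e, g in zip(log_exon, log_gene)]

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    sp2 = ss / (nx + ny - 2)
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

def wy_stepdown(si, family, fam_pop, n_perm=1000, seed=0):
    """si: {probeset: SI per sample}; family: family id per sample;
    fam_pop: {family id: 'A' or 'B'}. Returns adjusted P values."""
    rng = random.Random(seed)
    names = list(si)

    def abs_t(labels):
        stats = []
        for name in names:
            a = [v for v, f in zip(si[name], family) if labels[f] == "A"]
            b = [v for v, f in zip(si[name], family) if labels[f] == "B"]
            stats.append(abs(pooled_t(a, b)))
        return stats

    obs = abs_t(fam_pop)
    order = sorted(range(len(names)), key=lambda i: -obs[i])
    counts = [0] * len(names)
    fams = list(fam_pop)
    pops = [fam_pop[f] for f in fams]
    for _ in range(n_perm):
        shuffled = pops[:]
        rng.shuffle(shuffled)  # whole families swap population labels
        perm = abs_t(dict(zip(fams, shuffled)))
        running = 0.0
        # Successive maxima, from the least to the most significant probeset.
        for rank in range(len(order) - 1, -1, -1):
            running = max(running, perm[order[rank]])
            if running >= obs[order[rank]]:
                counts[rank] += 1
    adjusted, prev = {}, 0.0
    for rank, i in enumerate(order):
        prev = max(prev, counts[rank] / n_perm)  # enforce monotonicity
        adjusted[names[i]] = prev
    return adjusted

# Hypothetical toy data: 6 families of 3 samples, 3 families per population.
family = [f for f in range(6) for _ in range(3)]
fam_pop = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
log_gene = [8.0] * 18
pattern = (0.0, 0.1, -0.1)
log_exon_diff = [8.0 + d for d in pattern] * 3 + [10.0 + d for d in pattern] * 3
log_exon_null = [8.0 + d for d in (0.5, -0.5, 0.0)] * 6
si = {
    "ps_diff": splicing_index(log_exon_diff, log_gene),
    "ps_null": splicing_index(log_exon_null, log_gene),
}
adjusted = wy_stepdown(si, family, fam_pop, n_perm=200)
print(adjusted)
```

With only six families the number of distinct label assignments is small, so the adjusted P values in this toy example are coarse; the study used 10,000 permutations over 60 trios.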

Biological process and pathway analyses

We used the DAVID (Database for Annotation, Visualization and Integrated Discovery) (Dennis et al. 2003; Huang da et al. 2007) (http://david.abcc.ncifcrf.gov) to identify enriched Gene Ontology (GO) (Ashburner et al. 2000) (http://www.geneontology.org) or PANTHER (Thomas et al. 2003) (http://www.pantherdb.org/pathway) biological processes as well as known pathways such as those in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al. 2004) (http://www.genome.jp/kegg), Biocarta (http://www.biocarta.com) and PANTHER (Thomas et al. 2003) among the genes that showed differential transcript isoform variation between the CEU and YRI samples. The analysis set of 7,701 uniquely-annotated transcript clusters was used as the background list. Biological processes that were overrepresented relative to the background were selected (5 hits or more, Fisher's exact test Pc<0.50 after Benjamini-Hochberg, BH correction) (Benjamini and Hochberg 1995; Huang da et al. 2007). The same criteria were applied to identify enriched pathways. In addition, DAVID was also used to check if there were any genes with known AS events among our identified genes with differentially spliced probesets between the two populations. We further examined if these identified genes were involved in any Mendelian diseases as annotated in the Online Mendelian Inheritance in Man (OMIM) database (McKusick 1998) (http://www.ncbi.nlm.nih.gov/Omim).
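For readers unfamiliar with the two statistical steps named above, the following sketch shows a plain one-sided Fisher's exact test for overrepresentation (the hypergeometric upper tail) and the Benjamini-Hochberg adjustment. It is an illustration only: DAVID's own scoring may differ in detail, and the function names and toy numbers are hypothetical.

```python
from math import comb

def fisher_enrichment_p(hits, list_size, bg_hits, bg_size):
    """One-sided P(X >= hits) when `list_size` genes are drawn from a
    background of `bg_size` genes, `bg_hits` of which carry the term."""
    upper = min(list_size, bg_hits)
    tail = sum(
        comb(bg_hits, k) * comb(bg_size - bg_hits, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(bg_size, list_size)

def benjamini_hochberg(pvals):
    """BH-adjusted P values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy numbers: 3 of 5 listed genes carry a term that
# annotates 5 of 20 background genes.
p_term = fisher_enrichment_p(hits=3, list_size=5, bg_hits=5, bg_size=20)
p_adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(p_term, p_adj)
```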

Genotypic data for the HapMap samples

SNP genotypes were downloaded from the International HapMap Project website (http://www.hapmap.org) (Thorisson et al. 2005) (release 22 March 2008). To reduce the effect of possible genotyping errors, we excluded the SNPs with Mendelian allele transmission errors on 22 autosomes in the CEU and YRI samples, respectively. Thus, our final genotypic dataset was comprised of about 1.57 million SNPs for the two populations.
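A Mendelian transmission check of the kind used for this filter can be sketched as below for a single autosomal SNP in one trio. The genotype encoding (two-character strings such as "AG") and the function name are assumptions for illustration; missing genotypes are not handled.

```python
def mendel_consistent(father, mother, child):
    """True if the child could carry one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)

# Hypothetical trio genotypes at one SNP.
print(mendel_consistent("AA", "AG", "AG"))  # child can inherit A and G
print(mendel_consistent("AA", "AA", "AG"))  # G is absent from both parents
```

A SNP would be excluded if any trio in the population failed this check.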

Fst values

Fst, a metric representation of the effect of population subdivision, was estimated according to Wright's approximate formula Fst = (HT - HS)/HT, where HT represents expected heterozygosity per locus of the total population and HS represents expected heterozygosity of a subpopulation (Wright 1950). An Fst value was calculated for each SNP of interest using allele frequencies estimated from the unrelated individuals for each population.
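As a worked illustration of the formula, the sketch below computes Fst for one biallelic SNP from the allele frequencies in two populations. It assumes the two populations are weighted equally when forming the total-population frequency, which the text does not specify; the function name is hypothetical.

```python
def fst_two_pops(p1, p2):
    """Fst = (HT - HS) / HT for a biallelic SNP with allele
    frequencies p1 and p2 in two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)                        # total heterozygosity
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2    # mean within-population
    return 0.0 if ht == 0 else (ht - hs) / ht

# Frequencies 0.2 and 0.8: HT = 0.5, HS = 0.32, so Fst = 0.36.
print(fst_two_pops(0.2, 0.8))
```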

Cluster analysis

For the differentially spliced probesets, the Pearson correlation coefficients of the SI values were computed for the 176 samples to represent pairwise similarity. The probesets were then grouped by a hierarchical clustering algorithm (Eisen et al. 1998) using the average linkage method, which was implemented in the MeV:MultiExperiment Viewer (Saeed et al. 2003) (http://www.tm4.org).
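The clustering step can be illustrated with a naive average-linkage procedure that uses 1 minus the Pearson correlation as the distance between SI profiles. This is a sketch for small inputs, not the MeV implementation; the function names and toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_linkage(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    dist = {
        (a, b): 1 - pearson(profiles[a], profiles[b])
        for a in profiles for b in profiles if a != b
    }
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical SI profiles for four probesets across four samples.
profiles = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [2.0, 4.0, 6.0, 8.5],
    "p3": [4.0, 3.0, 2.0, 1.0],
    "p4": [8.0, 6.0, 5.0, 2.0],
}
groups = average_linkage(profiles, n_clusters=2)
print(groups)
```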

Identifying common genetic variants correlated with AS patterns

The SI values of the differential probesets were evaluated for association with SNP genotype using the QTDT software (Abecasis et al. 2000a, b). The association study was carried out in the combined CEU and YRI data with gender and population as covariates (QTDT $P < 3.18 \times 10^{-8}$, Pc < 0.05 after Bonferroni correction by ~1.57 million common SNPs in both the HapMap CEU and YRI populations). The incomplete trios were also used in the QTDT analysis. We defined a probeset as locally-regulated if the SI was associated with a SNP(s) within 2.5 Mb on the same chromosome, while a probeset was distantly-regulated if the SI was associated with SNP(s) on different chromosome(s) or more than 2.5 Mb away on the same chromosome.
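The significance threshold and the local/distant definition can be checked with a few lines of arithmetic. In this sketch the position used for the probeset (for example its start coordinate) and the treatment of a distance of exactly 2.5 Mb as local are assumptions; the function name is hypothetical.

```python
N_SNPS = 1_570_000                      # ~1.57 million common SNPs
bonferroni_threshold = 0.05 / N_SNPS    # about 3.18e-8

LOCAL_WINDOW_BP = 2_500_000             # 2.5 Mb

def classify_association(snp_chrom, snp_pos, probeset_chrom, probeset_pos):
    """'local' if the SNP lies within 2.5 Mb on the same chromosome,
    otherwise 'distant'."""
    same_chrom = snp_chrom == probeset_chrom
    if same_chrom and abs(snp_pos - probeset_pos) <= LOCAL_WINDOW_BP:
        return "local"
    return "distant"

print(f"{bonferroni_threshold:.3g}")
print(classify_association("chr10", 1_000_000, "chr10", 2_000_000))
print(classify_association("chr10", 1_000_000, "chr12", 1_000_000))
```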

Validation of transcript isoform variation between populations

Total RNA from 53 unrelated CEU and 48 unrelated YRI cell lines was extracted using the RNeasy Mini kit (Qiagen Inc., Valencia, CA) following the manufacturer's protocol. RNA quality assessment and quantification were conducted using the optical spectrometry 260/280 nm ratio. Subsequently, mRNA was reverse transcribed to cDNA using the Applied Biosystems High Capacity Reverse Transcription kit (Applied Biosystems, Foster City, CA). Reverse transcription reactions were prepared to yield a final cDNA concentration of 50 ng/μL. Primers used for quantitative RT-PCR (Supplemental Table 3) were designed using Primer3 software (Rozen and Skaletsky 2000). Expression measurements were performed on the Applied Biosystems 7500 Real-Time PCR system. Each reaction was carried out in a 25-μL volume, which consisted of 12.5 μL ABI SYBR Universal mix (Applied Biosystems, Foster City, CA) and 1.5 μL primers along with 10 μL diluted cDNA. The thermocycler parameters were: 50°C for 2 min, 95°C for 10 min, and 40 cycles of 95°C for 15 s/60°C for 1 min. Each cycle threshold (Ct) value obtained for each probeset of interest was converted into a relative expression level using the relative standard curve method (Applied Biosystems 2004). Each standard curve was created using a mixture of cDNA of known concentration from all samples being tested. Each experiment was conducted in duplicate for samples from both populations. The ratio of the relative quantity of the probeset of interest to the quantity of the constitutive exon was compared with the splicing index values from the expression arrays to determine replication of findings.
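The relative standard curve calculation can be sketched as follows: a line relating Ct to log10 of input quantity is fitted from a dilution series and then inverted to convert a sample Ct into a relative quantity. The dilution values and Ct values below are hypothetical, and a single curve is reused for both amplicons for brevity; in practice each amplicon would normally have its own curve.

```python
import math

def fit_standard_curve(quantities, cts):
    """Least-squares line Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mx, my = sum(xs) / n, sum(cts) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, cts))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to obtain a relative quantity."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical ten-fold dilution series and its measured Ct values.
slope, intercept = fit_standard_curve([100, 10, 1, 0.1], [20.0, 23.3, 26.6, 29.9])

target_qty = quantity_from_ct(24.95, slope, intercept)        # exon of interest
constitutive_qty = quantity_from_ct(23.3, slope, intercept)   # constitutive exon
isoform_ratio = target_qty / constitutive_qty
print(round(slope, 3), round(isoform_ratio, 3))
```

The resulting ratio plays the same role as the splicing index: it expresses the exon of interest relative to overall transcript abundance.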

Results

Detecting differentially spliced probesets between populations

We compared the SI values of 102,729 probesets (belonging to 7,701 uniquely-annotated transcript clusters with reliable expression in LCLs). Using the W-Y approach (Westfall and Young 1993) that adjusts for the trio structure in these samples, 782 probesets within 570 transcript clusters had significantly different SI values (Pc < 0.01, permutation-adjusted), indicating variations in AS events between the CEU and YRI samples. Among the 782 probesets, we found that 397 probesets had significantly lower SI values in the CEU samples, while 385 probesets had significantly lower SI values in the YRI samples. Figure 1 shows the genomic distribution of these differentially spliced probesets. No chromosomes were overrepresented or underrepresented in terms of the number of differentially spliced probesets (Pc < 0.05 after BH correction). In addition, 514 out of the 782 differential probesets were in coding regions and the remaining 268 probesets were in untranslated regions (UTRs). The details of these 782 probesets are presented in Supplemental Table 1. The 782 probesets could be grouped into two distinguishable clusters representing the populations based on the splicing index values (Fig. 2).

Fig. 1
Genomic distribution of the differentially spliced probesets between the CEU and YRI samples. 397 probesets had significantly lower SI values in the CEU samples (top ticks along chromosomes), while 385 probesets had significantly lower SI values in the ...
Fig. 2
Cluster analysis of the differentially spliced probesets. The 782 differentially spliced probesets were grouped into two clusters representing the two populations based on their splicing index values. The columns are cell lines and the rows are probesets ...

Biological process and pathway analyses

Three GO biological processes (“response to stimulus”, “regulation of cellular process” and “transcription”) and four PANTHER biological processes (“nucleoside, nucleotide and nucleic acid metabolism”, “asymmetric protein localization”, “cell proliferation and differentiation” and “cell structure and motility”) were found to be enriched in the 570 uniquely-annotated transcript clusters (Pc < 0.50 after BH correction) (Table 1). At the same significance level, one KEGG pathway (“antigen processing and presentation”) was found to be enriched among these transcript clusters (Pc < 0.50 after BH correction) (Table 1). Among the 570 differentially spliced genes, 80 are linked to certain diseases as maintained in the OMIM database (Supplemental Table 1), though no individual disease was enriched (Pc < 0.50 after BH correction). These diseases include, for example, type I diabetes and certain types of cancer. In contrast, the term “immune (disease)” (41 genes, P = 0.0075, Pc = 0.13 after BH correction) was enriched among these genes by searching the “GENETIC_ASSOCIATION_DB_DISEASE_CLASS” database, which compiles ~9,000 associations from the literature, through DAVID (Dennis et al. 2003; Huang da et al. 2007). Notably, among the 570 differentially spliced genes we identified, 171 genes are known to have alternative products (Supplemental Table 1) by searching the Protein Information Resource (PIR) (McGarvey et al. 2000) through DAVID (Dennis et al. 2003; Huang da et al. 2007). The category of “alternative products” was enriched relative to the analysis set of 7,701 genes (P = 0.027, Pc = 0.093 after BH correction).

Table 1
Enriched biological processes and pathways among the genes with differentially spliced probesets

Identifying common genetic variants that associate with differentially spliced probesets

Association with ~2 million common HapMap (International HapMap Consortium 2003, 2005) SNPs (minor allele frequency ≥5% in the unrelated parents of each population) using the QTDT software (Abecasis et al. 2000a, b) was evaluated in both the CEU and YRI samples with population and gender as covariates. We identified 2,393 local SNPs that were correlated with the SI values of 97 differentially spliced probesets in 85 transcript clusters. In addition, 419 distant SNPs were found to be correlated with the SI values of 152 probesets in 124 transcript clusters. Details for these associated SNPs are listed in Supplemental Table 2. Among them, both local and distant SNPs were identified for 36 differentially spliced probesets in 34 transcript clusters. Table 2 and Figs. 3 and 4 show some representative local SNP/SI relationships with relatively higher Fst values (Fst > 0.15). Supplemental Table 2 lists the details for all significant SNP/SI relationships (Pc < 0.05 after Bonferroni correction).

Fig. 3
Common genetic variants account for transcript isoform variation of MRPL43 between populations. Probesets PS3303658, PS3303664 and PS3303666 of MRPL43 were differentially spliced between the CEU and YRI samples. PS3303658 had a lower splicing index in ...
Fig. 4
Common genetic variants account for transcript isoform variation of OAS1 between populations. Probesets PS3432457, PS3432458, PS3432462 and PS3432463 of OAS1 were differentially spliced between the CEU and YRI samples. PS3432462 had a lower ...
Table 2
Representative local SNPs associated with differential SI values (Pc < 0.05 after Bonferroni correction)

Validation of transcript isoform variation between populations

From the probesets that were differentially spliced (Supplemental Table 1), we randomly chose 3 internal exons: PS3764493 (MTMR4), PS3303658 (MRPL43) and PS3476020 (MPHOSPH9) to experimentally validate. In addition, we included PS3527423 (PARP2) as the positive control, which was previously shown to be differentially spliced in the unrelated CEU samples (Kwan et al. 2008). Using the unrelated CEU cell lines, we confirmed the within-population variation of probeset PS3527423 (PARP2) demonstrated by Kwan et al. (2008) (Supplemental Fig. 2). Quantitative Real-Time PCR showed a difference in the ratio of isoforms between the two populations for probesets PS3764493 (MTMR4) and PS3303658 (MRPL43) (Supplemental Fig. 3). The quantitative Real-Time PCR results for PS3764493 (MTMR4) and PS3303658 (MRPL43) were consistent with the trend of SI values calculated from the exon array data (Supplemental Table 1).

Discussion

The Affymetrix GeneChip® Human Exon 1.0 ST Array was utilized to measure probeset (exon-level) expression in EBV (Epstein-Barr Virus)-transformed LCLs derived from 176 apparently healthy individuals (CEU, 87 cell lines; YRI, 89 cell lines) (Zhang et al. 2008a). Transcript cluster (gene-level) expressions were computed by summarizing signals from RefSeq-supported (Pruitt et al. 2007) exons (core set) within each transcript cluster. Our first goal was to identify probesets in transcript clusters with evidence for between-population transcript isoform variation. We compared the splicing index values (Affymetrix Inc. 2006; Gardina et al. 2006) of 7,701 uniquely-annotated transcript clusters (containing 102,729 probesets). Because non-expressed exons are known to introduce false positive results in the SI calculation, particularly in the presence of gene expression level changes (Affymetrix Inc. 2006), we limited our analyses to transcript clusters and probesets with reliable expression in the two populations as a whole. To identify meaningful AS events, we only focused on transcript clusters with a minimum of three expressed probesets. A significantly lower SI value in a population indicates that the particular probeset (exon-level) may be skipped in an AS isoform or that the respective transcription isoform has a lower relative ratio among all isoforms. The proportion of expressed genes (~50%) we defined is comparable to previous observations in LCLs (Cheung et al. 2003; Spielman et al. 2007), though a precise profiling of expressed genes in these samples has not been investigated experimentally.

Using the permutation-based W-Y approach (Westfall and Young 1993), we identified 782 probesets within 570 transcript clusters that showed differential SI values between the two populations (Fig. 1). The advantages of the W-Y approach are that (1) it considers dependence between genes when testing expression and (2) it allows cluster-level permutation, thus taking into account the parents-child trio structure of the CEU and YRI samples. Although differential gene expression between males and females has been detected in a panel of CEPH LCLs (Zhang et al. 2007), no probesets (at Pc < 0.05, permutation-adjusted) were found to show gender-specific differences in either CEU or YRI samples, suggesting transcript isoform variation may not commonly contribute to gender-specific gene expression. Using RT-PCR, two of the three randomly-chosen exons (67%) from the 782 probesets could be validated for population differences in abundance of respective transcript isoforms (Supplemental Fig. 3), though a more comprehensive validation would be necessary to provide a more accurate estimation of the current findings. In addition, among the 570 transcript clusters containing differentially spliced probesets, approximately a third (171 genes) are known to have AS events or alternative products (literature-based evidence) as maintained in the PIR database (McGarvey et al. 2000) (Supplemental Table 1). Our list of differentially spliced genes between the two populations was found to overrepresent the category of “alternative products” relative to the analysis set of 7,701 genes (P = 0.027, Pc = 0.093 after BH correction), indicating that many of the identified genes have known alternatively spliced transcript isoforms. Another interesting question would be whether the population differences in transcript isoform variation are mainly regulatory in nature at the level of RNA expression or due to changes at the protein level. We classified the 782 differentially spliced probesets based on their locations in the gene structure. More were located in coding regions (514 probesets) than UTRs (268 probesets) ($P < 2.2 \times 10^{-16}$, binomial test), suggesting that the majority of these population differences are potentially at the protein level.
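The binomial comparison of coding versus UTR probesets can be reproduced approximately with an exact test. This sketch assumes a null hypothesis in which a differentially spliced probeset is equally likely to fall in a coding region or a UTR (probability 0.5), which the text does not state explicitly; the function name is hypothetical.

```python
from math import comb

def binomial_two_sided_p(k, n):
    """Exact two-sided P value for k successes in n trials under
    H0: p = 0.5 (the null is symmetric, so the tail is doubled)."""
    tail = sum(comb(n, i) for i in range(max(k, n - k), n + 1))
    return min(1.0, 2 * tail / 2 ** n)

# 514 coding-region probesets out of 782 differentially spliced probesets.
p_coding = binomial_two_sided_p(514, 782)
print(p_coding < 2.2e-16)
```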

Since the disruption of specific AS events has been implicated in several human genetic diseases (Faustino and Cooper 2003), we searched the OMIM database to see if any of the differentially spliced genes are involved in human diseases. Among the diseases found (Supplemental Table 1), FSGS (glomerulosclerosis, focal segmental, 1) is known to be more common in African Americans than Europeans (Sorof et al. 1998). We found that one probeset (PS3832645) of the causal gene ACTN4 (actin, alpha 4) showed significantly lower SI values in CEU, indicating possible skipping in these samples (Supplemental Table 1). Another interesting example is TIDM (type I diabetes mellitus). It has been known that fewer African American children develop type 1 diabetes (also known as juvenile onset diabetes) than white children (Diabetes Epidemiology Research International Study Group 1988). We found that two probesets of OAS1 (2′,5′-oligoadenylate synthetase 1), which has been implicated in TIDM, showed significantly lower SI values in the CEU (PS3432462, PS3432463) and YRI (PS3432451, PS3432457, PS3432458), separately, suggesting different transcript isoforms could play a role in the racial disparity of this disease (Supplemental Table 1). Interestingly, Tessier et al. recently confirmed the association of TIDM with a splicing alteration in OAS1 (Tessier et al. 2006).

Furthermore, using the DAVID (Dennis et al. 2003) web application, three GO biological processes, four PANTHER biological processes, and one KEGG pathway were found to be enriched among the 570 differentially spliced genes relative to the background (Table 1). Notably, both the enriched GO term “response to stimulus” and the enriched KEGG pathway “antigen processing and presentation” are related to immune response. We previously found that transcript clusters (gene-level) differentially expressed between the CEU and YRI samples were enriched in immune response genes (Zhang and Dolan 2008a; Zhang et al. 2008a). It has been reported that African Americans may be more susceptible to infection by certain bacteria than Caucasians (Noble and Miller 1980) and that some genetic polymorphisms may lead to different antimicrobial responses (Jordan et al. 2005). Our finding that the immune response-related genes were enriched among the differentially spliced genes suggests that AS or transcript isoform variation could be a critical mechanism in defining the racial differences in infectious diseases. Another enriched GO term is “transcription”, which includes lower level processes required for the maturation of mRNA such as “mRNA splicing via spliceosome”. In addition, the PANTHER biological process “nucleoside, nucleotide and nucleic acid metabolism” was enriched, suggesting that the splicing of these transcription-related genes, including spliceosome-related genes (e.g., splicing factors SFPQ and SFRS5, Supplemental Table 1), could potentially be involved in the regulation of transcript isoform variation between human populations. However, since a large proportion of genes have no pathway annotation and the validation of pathways in the databases is often not rigorously performed, interpretation of these results warrants some caution.

Previous studies have shown that common genetic variants account for the population differences in gene expression (Spielman et al. 2007; Storey et al. 2007; Stranger et al. 2007; Zhang et al. 2008a, b) and transcript isoform variation within the unrelated CEU samples (Hull et al. 2007; Kwan et al. 2007, 2008). We tried to investigate if the differences in allele frequency of common genetic variants contribute to the observed differences in transcript isoform variation between the CEU and YRI samples. To identify genetic variants that account for this variation, we carried out a genome-wide eQTL analysis by associating the HapMap genotypic data (International HapMap Consortium 2003, 2005) on ~1.57 million SNP markers with the SI values of the 782 differentially spliced probesets using the QTDT software, which has the advantages of conducting the powerful total association analyses using the entire panel of samples while correcting for internal correlations among all the members (Abecasis et al. 2000a, b). A probeset associated with SNP(s) within 2.5 Mb on the same chromosome was defined as locally-regulated, while a probeset associated with SNP(s) on different chromosome(s) or more than 2.5 Mb away on the same chromosome was defined as distantly-regulated. By combining the CEU and YRI data and using population identity as a covariate, the QTDT analysis after Bonferroni correction provided us a list of SNPs whose associations with differential SI values of probesets were the most striking, suggesting that the allele frequency differences of these associated common genetic variants account for a substantial fraction of the differences in transcript isoform variation between the two populations. Notably, many of the locally associated SNPs and some distantly associated SNPs are in linkage disequilibrium (LD) (Supplemental Table 2). For example, two local SNPs, rs2791650 and rs2791648 associated with a probeset of FRAP1 are in complete LD (Supplemental Table 2). 
The allele-frequency-driven transcript isoform variation difference between the CEU and YRI samples is further illustrated in Figs. 3 and 4, which show some examples of the contribution of local genetic variants to the observed differences in transcript isoform variation. Because of the existence of both local and distant SNPs, our findings also suggest that a complete network of regulation of differential AS patterns could potentially be the result of interactions among various local and distant genetic elements. On one hand, our findings suggest that approximately 30% of the differentially spliced genes could be accounted for by the allele frequency differences of either local or distant single SNPs, an observation similar to what we observed for gene-level expression differences between these two populations (Zhang et al. 2008a). On the other hand, our findings suggest that the remainder could be due to other mechanisms such as DNA methylation or controlled by multiple SNPs.
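The local versus distant definition and the Bonferroni step described above can be sketched as follows. This is an illustration only, not the QTDT workflow: the function names are hypothetical, distance is assumed to be measured from a single probeset coordinate, and counting one test per SNP-probeset pair is an assumption about how the correction might be applied.

```python
LOCAL_WINDOW_BP = 2_500_000  # 2.5 Mb window used to define local associations


def classify_association(snp_chrom, snp_pos, probe_chrom, probe_pos,
                         window=LOCAL_WINDOW_BP):
    """Label a SNP-probeset association as 'local' or 'distant'.

    Local: same chromosome and within `window` base pairs.
    Distant: different chromosome, or farther than `window` on the same one.
    """
    if snp_chrom == probe_chrom and abs(snp_pos - probe_pos) <= window:
        return "local"
    return "distant"


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Hypothetical coordinates, used only to show the two outcomes.
print(classify_association("chr1", 11_200_000, "chr1", 11_250_000))
print(classify_association("chr1", 11_200_000, "chr7", 11_250_000))

# Assumed test count: every SNP tested against every probeset.
n_tests = 1_570_000 * 782
print(bonferroni_threshold(0.05, n_tests))
```

Under this assumed test count the per-test threshold is very stringent, which is consistent with the text describing the surviving associations as the most striking ones.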

In this study, we present the first comprehensive view of the transcript isoform variation and its regulation by genetic variants between individuals of European and African ancestry. Our results suggest that although between one-third and two-thirds of all human genes could undergo alternative splicing (Sorek et al. 2004), the proportion of genes with differential AS between human populations could be much lower (~8% based on our estimate at Pc<0.01). A number of biological processes such as those involving immune response and mRNA synthesis were found to be enriched in the differentially spliced genes between the CEU and YRI samples. Our results suggest that genetic variation of DNA sequence contributes to a substantial fraction of the population-level transcript isoform variation, though some other non-genetic factors could also potentially influence the observed differences between populations. Technically, although the reproducibility of the exon arrays is generally high (Affymetrix Inc. 2007; Kwan et al. 2007), one limitation of this work is that technical replicates were not available for these samples (Zhang et al. 2008a), thus limiting our focus to only sets of genes that are differentially spliced between populations. For a more comprehensive view of the AS patterns, one would need to consider inter-individual and inter-population variation together. Finally, in addition to the intrinsic limitations of using the HapMap samples (e.g., one tissue type), there are other challenges and confounding factors (such as capturing unknown SNPs, YRI samples collected decades after CEU) that might be considered in future studies (Zhang and Dolan 2008b, c) to help us better utilize this tremendous resource to yield new insights into the alternative splicing process in humans.

Data availability

Gene expression data deposited in Gene Expression Omnibus (GEO): GSE9703.

Supplementary Material

Supplemental Fig 1

Supplemental Fig 2

Supplemental Fig 3

Supplemental Table 1

Supplemental Table 2

Supplemental Table 3

Acknowledgments

This Pharmacogenetics of Anticancer Agents Research (PAAR) Group (http://www.pharmacogenetics.org) study was supported by NIH/NIGMS grants U01 GM61393 and U01 GM61374. We are grateful to Dr. Jeong-Ah Kang for maintaining cell lines, Cheryl A. Roe for reviewing the manuscript and Drs. James Fackenthal and Emily Kistner for helpful discussion. T.A.C., T.X.C., A.C.S., and J.E.B. are employees of Affymetrix, Inc.

Footnotes

Electronic supplementary material The online version of this article (doi:10.1007/s00439-008-0601-x) contains supplementary material, which is available to authorized users.

References

  • Abecasis GR, Cardon LR, Cookson WO. A general test of association for quantitative traits in nuclear families. Am J Hum Genet. 2000a;66:279–292. [PMC free article] [PubMed]
  • Abecasis GR, Cookson WO, Cardon LR. Pedigree tests of transmission disequilibrium. Eur J Hum Genet. 2000b;8:545–551. [PubMed]
  • Affymetrix Inc. Affymetrix Technical Note. 2006. Identifying and validating alternative splicing events.
  • Affymetrix Inc. Affymetrix GeneChip Gene and Exon Array Whitepaper Collection. 2007. Human Gene 1.0 ST Array Performance.
  • Alberts R, Terpstra P, Li Y, Breitling R, Nap JP, Jansen RC. Sequence polymorphisms cause many false cis eQTLs. PLoS ONE. 2007;2:e622. [PMC free article] [PubMed]
  • Applied Biosystems. Technical Note. 2004. Guide to performing relative quantitation of gene expression using Real-Time quantitative PCR.
  • Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–29. [PMC free article] [PubMed]
  • Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57:289–300.
  • Brinkman BM. Splice variants as cancer biomarkers. Clin Biochem. 2004;37:584–594. [PubMed]
  • Cheung VG, Conlin LK, Weber TM, Arcaro M, Jen KY, Morley M, Spielman RS. Natural variation in human gene expression assessed in lymphoblastoid cells. Nat Genet. 2003;33:422–425. [PubMed]
  • Dennis G, Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003;4:P3. [PMC free article] [PubMed]
  • Diabetes Epidemiology Research International Study Group Geographic patterns of childhood insulin-dependent diabetes mellitus. Diabetes Epidemiology Research International Group. Diabetes. 1988;37:1113–1119. [PubMed]
  • Duan S, Huang RS, Zhang W, Bleibel WK, Roe CA, Clark TA, Chen TX, Schweitzer AC, Blume JE, Cox NJ, Dolan ME. Genetic architecture of transcript-level variation in humans. Am J Hum Genet. 2008a;82:1101–13. [PMC free article] [PubMed]
  • Duan S, Zhang W, Bleibel WK, Cox NJ, Dolan ME. SNPinProbe_1.0: a database for filtering out probes in the Affymetrix GeneChip® Human Exon 1.0 ST array potentially affected by SNPs. Bioinformation. 2008b;2:469–470. [PMC free article] [PubMed]
  • Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863–14868. [PMC free article] [PubMed]
  • Faustino NA, Cooper TA. Pre-mRNA splicing and human disease. Genes Dev. 2003;17:419–437. [PubMed]
  • Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q, Zhao H, Zhao H, Zhou J, Gabriel SB, Barry R, Blumenstiel B, Camargo A, Defelice M, Faggart M, Goyette M, Gupta S, Moore J, Nguyen H, Onofrio RC, Parkin M, Roy J, Stahl E, Winchester E, Ziaugra L, Altshuler D, Shen Y, Yao Z, Huang W, Chu X, He Y, Jin L, Liu Y, Shen Y, Sun W, Wang H, Wang Y, Wang Y, Xiong X, Xu L, Waye MM, Tsui SK, Xue H, Wong JT, Galver LM, Fan JB, Gunderson K, Murray SS, Oliphant AR, Chee MS, Montpetit A, Chagnon F, Ferretti V, Leboeuf M, Olivier JF, Phillips MS, Roumy S, Sallee C, Verner A, Hudson TJ, Kwok PY, Cai D, Koboldt DC, Miller RD, Pawlikowska L, Taillon-Miller P, Xiao M, Tsui LC, Mak W, Song YQ, Tam PK, Nakamura Y, Kawaguchi T, Kitamoto T, Morizono T, Nagashima A, Ohnishi Y, et al. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007;449:851–861. [PMC free article] [PubMed]
  • Gardina PJ, Clark TA, Shimada B, Staples MK, Yang Q, Veitch J, Schweitzer A, Awad T, Sugnet C, Dee S, Davies C, Williams A, Turpaz Y. Alternative splicing and differential gene expression in colon cancer detected by a whole genome exon array. BMC Genomics. 2006;7:325. [PMC free article] [PubMed]
  • Gilad Y, Rifkin SA, Bertone P, Gerstein M, White KP. Multispecies microarrays reveal the effect of sequence divergence on gene expression profiles. Genome Res. 2005;15:674–680. [PMC free article] [PubMed]
  • Huang da W, Sherman BT, Tan Q, Collins JR, Alvord WG, Roayaei J, Stephens R, Baseler MW, Lane HC, Lempicki RA. The DAVID gene functional classification tool: a novel biological module-centric algorithm to functionally analyze large gene lists. Genome Biol. 2007;8:R183. [PMC free article] [PubMed]
  • Huang RS, Kistner EO, Bleibel WK, Shukla SJ, Dolan ME. Effect of population and gender on chemotherapeutic agent-induced cytotoxicity. Mol Cancer Ther. 2007;6:31–36. [PMC free article] [PubMed]
  • Hull J, Campino S, Rowlands K, Chan MS, Copley RR, Taylor MS, Rockett K, Elvidge G, Keating B, Knight J, Kwiatkowski D. Identification of common genetic variation that modulates alternative splicing. PLoS Genet. 2007;3:e99. [PMC free article] [PubMed]
  • International HapMap Consortium The International HapMap Project. Nature. 2003;426:789–796. [PubMed]
  • International HapMap Consortium A haplotype map of the human genome. Nature. 2005;437:1299–1320. [PMC free article] [PubMed]
  • Ioannidis JP, Ntzani EE, Trikalinos TA. 'Racial' differences in genetic effects for complex diseases. Nat Genet. 2004;36:1312–1318. [PubMed]
  • Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264. [PubMed]
  • Jordan WJ, Eskdale J, Lennon GP, Pestoff R, Wu L, Fine DH, Gallagher G. A non-conservative, coding single-nucleotide polymorphism in the N-terminal region of lactoferrin is associated with aggressive periodontitis in an African-American, but not a Caucasian population. Genes Immun. 2005;6:632–635. [PubMed]
  • Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004;32:D277–D280. [PMC free article] [PubMed]
  • Kurian AK, Cardarelli KM. Racial and ethnic differences in cardiovascular disease risk factors: a systematic review. Ethn Dis. 2007;17:143–152. [PubMed]
  • Kwan T, Benovoy D, Dias C, Gurd S, Serre D, Zuzan H, Clark TA, Schweitzer A, Staples MK, Wang H, Blume JE, Hudson TJ, Sladek R, Majewski J. Heritability of alternative splicing in the human genome. Genome Res. 2007;17:1210–1218. [PMC free article] [PubMed]
  • Kwan T, Benovoy D, Dias C, Gurd S, Provencher C, Beaulieu P, Hudson TJ, Sladek R, Majewski J. Genome-wide analysis of transcript isoform variation in humans. Nat Genet. 2008;40:225–231. [PubMed]
  • Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. [PubMed]
  • McGarvey PB, Huang H, Barker WC, Orcutt BC, Garavelli JS, Srinivasarao GY, Yeh LS, Xiao C, Wu CH. PIR: a new resource for bioinformatics. Bioinformatics. 2000;16:290–291. [PubMed]
  • McKusick VA. A catalog of human genes and genetic disorders. 12th edn. Johns Hopkins University Press; Baltimore: 1998. Mendelian inheritance in man.
  • Morley M, Molony CM, Weber TM, Devlin JL, Ewens KG, Spielman RS, Cheung VG. Genetic analysis of genome-wide variation in human gene expression. Nature. 2004;430:743–747. [PMC free article] [PubMed]
  • Noble RC, Miller BR. Auxotypes and antimicrobial susceptibilities of Neisseria gonorrhoeae in black and white patients. Br J Vener Dis. 1980;56:26–30. [PMC free article] [PubMed]
  • Novoyatleva T, Tang Y, Rafalska I, Stamm S. Pre-mRNA missplicing as a cause of human disease. Prog Mol Subcell Biol. 2006;44:27–46. [PubMed]
  • Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65. [PMC free article] [PubMed]
  • R Development Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing; Vienna: 2005.
  • Rozen S, Skaletsky H. Primer3 on the WWW for general users and for biologist programmers. Methods Mol Biol. 2000;132:365–386. [PubMed]
  • Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, Sturn A, Snuffin M, Rezantsev A, Popov D, Ryltsov A, Kostukovich E, Borisovsky I, Liu Z, Vinsavich A, Trush V, Quackenbush J. TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 2003;34:374–378. [PubMed]
  • Sorek R, Shamir R, Ast G. How prevalent is functional alternative splicing in the human genome? Trends Genet. 2004;20:68–71. [PubMed]
  • Sorof JM, Hawkins EP, Brewer ED, Boydstun II, Kale AS, Powell DR. Age and ethnicity affect the risk and outcome of focal segmental glomerulosclerosis. Pediatr Nephrol. 1998;12:764–768. [PubMed]
  • Spielman RS, Bastone LA, Burdick JT, Morley M, Ewens WJ, Cheung VG. Common genetic variants account for differences in gene expression among ethnic groups. Nat Genet. 2007;39:226–231. [PMC free article] [PubMed]
  • Storey JD, Madeoy J, Strout JL, Wurfel M, Ronald J, Akey JM. Gene-expression variation within and among human populations. Am J Hum Genet. 2007;80:502–509. [PMC free article] [PubMed]
  • Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, Hunt S, Kahl B, Antonarakis SE, Tavare S, Deloukas P, Dermitzakis ET. Genome-wide associations of gene expression variation in humans. PLoS Genet. 2005;1:e78. [PMC free article] [PubMed]
  • Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, Ingle CE, Dunning M, Flicek P, Koller D, Montgomery S, Tavare S, Deloukas P, Dermitzakis ET. Population genomics of human gene expression. Nat Genet. 2007;39:1217–1224. [PMC free article] [PubMed]
  • Tessier MC, Qu HQ, Frechette R, Bacot F, Grabs R, Taback SP, Lawson ML, Kirsch SE, Hudson TJ, Polychronakos C. Type 1 diabetes and the OAS gene cluster: association with splicing polymorphism or haplotype? J Med Genet. 2006;43:129–132. [PMC free article] [PubMed]
  • Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003;13:2129–2141. [PMC free article] [PubMed]
  • Thorisson GA, Smith AV, Krishnan L, Stein LD. The International HapMap Project Web site. Genome Res. 2005;15:1592–1593. [PMC free article] [PubMed]
  • Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, Gocayne JD, Amanatides P, Ballew RM, Huson DH, Wortman JR, Zhang Q, Kodira CD, Zheng XH, Chen L, Skupski M, Subramanian G, Thomas PD, Zhang J, Gabor Miklos GL, Nelson C, Broder S, Clark AG, Nadeau J, McKusick VA, Zinder N, Levine AJ, Roberts RJ, Simon M, Slayman C, Hunkapiller M, Bolanos R, Delcher A, Dew I, Fasulo D, Flanigan M, Florea L, Halpern A, Hannenhalli S, Kravitz S, Levy S, Mobarry C, Reinert K, Remington K, Abu-Threideh J, Beasley E, Biddick K, Bonazzi V, Brandon R, Cargill M, Chandramouliswaran I, Charlab R, Chaturvedi K, Deng Z, Di Francesco V, Dunn P, Eilbeck K, Evangelista C, Gabrielian AE, Gan W, Ge W, Gong F, Gu Z, Guan P, Heiman TJ, Higgins ME, Ji RR, Ke Z, Ketchum KA, Lai Z, Lei Y, Li Z, Li J, Liang Y, Lin X, Lu F, Merkulov GV, Milshina N, Moore HM, Naik AK, Narayan VA, Neelam B, Nusskern D, Rusch DB, Salzberg S, Shao W, Shue B, Sun J, Wang Z, Wang A, Wang X, Wang J, Wei M, Wides R, Xiao C, Yan C, et al. The sequence of the human genome. Science. 2001;291:1304–1351. [PubMed]
  • Westfall PH, Young SS. Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley Publishers; New York: 1993.
  • Wright S. Genetical structure of populations. Nature. 1950;166:247–249. [PubMed]
  • Zhang W, Dolan ME. Ancestry-related differences in gene expression: findings may enhance understanding of health disparities between populations. Pharmacogenomics. 2008a;9:489–492. [PMC free article] [PubMed]
  • Zhang W, Dolan ME. Beyond the HapMap genotypic data: prospects of deep resequencing projects. Curr Bioinform. 2008b;3 [PMC free article] [PubMed]
  • Zhang W, Dolan ME. On the challenges of the HapMap resource. Bioinformation. 2008c;2:238–239. [PMC free article] [PubMed]
  • Zhang W, Bleibel WK, Roe CA, Cox NJ, Dolan ME. Gender-specific differences in expression in human lymphoblastoid cell lines. Pharmacogenet Genomics. 2007;17:447–450. [PMC free article] [PubMed]
  • Zhang W, Duan S, Kistner EO, Bleibel WK, Huang RS, Clark TA, Chen TX, Schweitzer AC, Blume JE, Cox NJ, Dolan ME. Evaluation of genetic variation contributing to differences in gene expression between populations. Am J Hum Genet. 2008a;82:631–640. [PMC free article] [PubMed]
  • Zhang W, Ratain MJ, Dolan ME. The HapMap resource is providing new insights into ourselves and its application to pharmacogenomics. Bioinform Biol Insights. 2008b;2:15–23. [PMC free article] [PubMed]