Copyright © 2006, Cold Spring Harbor Laboratory Press

Transcription-mediated gene fusion in the human genome

1 Compugen Ltd., Tel Aviv 69512, Israel
2 Faculty of Life Sciences, Bar Ilan University, Ramat Gan 52900, Israel
3 These two authors contributed equally to this work.
4 Present address: Tel Aviv University, Department of Human Genetics, Sackler Faculty of Medicine, Tel Aviv, 69978 Israel.
5 Corresponding author. E-mail sorek/at/post.tau.ac.il; fax 972-3-7658555.

Received May 16, 2005; Accepted September 13, 2005.

Abstract

Transcription of a gene usually ends at a regulated termination point, preventing the RNA-polymerase from reading through the next gene. However, sporadic reports suggest that chimeric transcripts, formed by transcription of two consecutive genes into one RNA, can occur in human. The splicing and translation of such RNAs can lead to a new, fused protein, having domains from both original proteins. Here, we systematically identified over 200 cases of intergenic splicing in the human genome (involving 421 genes), and experimentally demonstrated that at least half of these fusions exist in human tissues. We showed that unique splicing patterns dominate the functional and regulatory nature of the resulting transcripts, and found intergenic distance bias in fused compared with nonfused genes. We demonstrate that the hundreds of fused genes we identified are only a subset of the actual number of fused genes in human. We describe a novel evolutionary mechanism where transcription-induced chimerism followed by retroposition results in a new, active fused gene. Finally, we provide evidence that transcription-induced chimerism can be a mechanism contributing to the evolution of protein complexes.

Eukaryotic genes are generally well defined on the genome.
Transcription usually begins from a transcription start site, which is guided by the promoter, and ends at a regulated termination point (Zhao et al. 1999; Proudfoot et al. 2002). Consecutive genes are usually separated from each other by intergenic, nonexpressed regions (Lander et al. 2001). In recent years, however, evidence for the existence of mammalian transcripts that span two adjacent, independent genes has emerged. Typically, such chimeric transcripts begin at the promoter of the upstream gene and end at the termination point of the downstream gene. The intergenic region is spliced out of the transcript as an intron, so that the resulting fused transcripts possess exons from the two different genes (Fig. 1).
Fused RNAs were shown to be regulated and to have unique expression patterns. For example, the HHLA1-OC90 fusion transcript is restricted to teratocarcinoma cell lines while absent from normal cells (Kowalski et al. 1999). In the case of the LY75-CD302 (CD205-DCL1) fusion, the chimera is predominant in Hodgkin and Reed-Sternberg cell lines (Kato et al. 2003). The fusion can change the properties of the participating proteins, or change their localization, such as in the case of the Kua-UBE2V1 fusion, where the fused protein is localized to the cytoplasm, while UBE2V1 is a nuclear protein (Thomson et al. 2000). The best-characterized human fusion transcript is that of two members of the TNF ligand family, TNFSF12 (previously known as TWEAK) and TNFSF13 (APRIL), which represent a type-II transmembrane and a secreted protein, respectively (Pradet-Balade et al. 2002). The fused protein, composed of TNFSF12 cytoplasmic and transmembrane domains fused to the TNFSF13 C-terminal domain, is expressed and translated endogenously in human primary T cells and monocytes. The fused protein is membrane anchored and presents the TNFSF13 receptor-binding domain at the cell surface. It is a biologically active ligand, stimulating cycling in T- and B-lymphoma cell lines (Pradet-Balade et al. 2002).

As no systematic effort has been carried out to detect transcription-induced chimerism (TIC) events across the human genome, the extent of this phenomenon and its implications are unknown. In this study, we systematically identified TICs in human and discovered unique features that characterize them. We describe unappreciated roles of TICs in the evolution of proteins and protein complexes, and demonstrate novel implications for the regulation and function of genes.
Results and Discussion

Computational identification of transcription-induced chimerism events

To characterize the phenomenon of TIC in a genome-wide manner, we first clustered ESTs and cDNAs from GenBank version 136 onto the human genome sequence (build 33) using the LEADS software platform (Sorek et al. 2002; Sorek and Safer 2003) (Table 2; see Methods). The clustering phase resulted in 26,057 clusters of expressed sequences aligned to the genome, containing at least one RNA sequence. We next searched for clusters reporting on fusion between two genes. To avoid cases in which such clusters are caused by natural antisense overlaps, we used the “Antisensor” algorithm (Yelin et al. 2003) to identify and separate such clusters into discrete sense and antisense genes.
To isolate reliable events of TIC we looked for clusters that contain at least two nonoverlapping cDNA sequences that are annotated as “complete CDS.” To avoid contaminations, we required that the sequences connecting these two cDNAs be canonically spliced and share at least one splice site with each of the two separate genes. This requirement also screens out cases of naturally overlapping genes (Veeramachaneni et al. 2004). To discard cases of alignment artifacts, in which two consecutive homologous genes were falsely connected, we filtered out connecting sequences having high-scoring alignments in both genes. Our computational search resulted in the identification of 281 putative TICs. Manually inspecting these 281 putative events revealed 55 spurious cases, where the two fused RNAs were apparently parts of one gene (see Methods). These cases, as well as an additional 14 inconclusive events, were discarded (Table 2).

After removing the above-described artifacts, we acquired a reliable data set of 212 TIC events (Supplemental Table S1). The data set contained 421 genes, with four of them participating in more than one fusion (i.e., there was evidence for their fusion both with their upstream and downstream neighboring genes). Of the 212 fusion events, 54 (25%) were supported by more than one expressed sequence, suggesting that in most cases the fusion event is relatively rare or confined to a certain tissue/condition. To understand the extent to which these events are conserved between species, we searched for evidence of our 212 TICs in ESTs of other mammals (see Methods). In 22 cases (10%), we were able to identify an EST from another species supporting the human TIC (Supplemental Table S1). This rate of conservation is similar to the reported 11% conservation of alternative splicing events between human and mouse (Yeo et al. 2005).
Overall, 70 (33%) of the TIC events in our set were supported by multiple sequence evidence, i.e., by more than one human sequence and/or by additional sequences from other species. As mentioned above, 13 human TICs were reported previously (Table 1). Of these, five were supported by ESTs, with 3/5 supported by one EST only, indicating that even one supporting spliced EST can reliably report on true transcriptional fusion. For the remaining eight reported events, there was no EST showing their existence. We therefore conclude that the 421 genes we detected are only a subset of the actual number of fused genes in human.

To test the possibility that transcription-induced chimerism is cancer induced, we used EST library annotations to extract the histological origin (cancer/normal) of each EST. We then compared the histology distribution of the fusing ESTs with the general distribution in all ESTs. Of the fusing sequences, 51% originate from normal tissues, compared with 46% in the entire EST population. These results indicate that the transcription-induced chimeras present in our data are not the outcome of a cancerous condition.

Unique intergenic splicing patterns and intergenic distance bias in fusion events

We further analyzed the TIC events to understand the splicing patterns of the fused transcript. The most abundant intergenic splicing type, occurring in 44% of the events (93/212), was between the n-1 exon (one before last) of the upstream gene and the second exon (+2) of the downstream gene (Fig. 2).
Figure 2

To test whether there is a preference for specific intergenic distance between fused genes, we calculated the intergenic distance distribution of the 212 fusion events, and compared it with the distance distribution of 12,395 human adjacent genes (see Methods). As shown in Figure 3
RT–PCR experimental validation

To experimentally test our predicted TIC events, we selected 10% of the data (20 events) for RT–PCR screening using RNAs from a panel of 19 different tissues and cell lines (see Methods). For nine of the events, fusion was detected in at least one of the tissues tested. Six of these events were found to be ubiquitously expressed, while the remaining three were tissue specific (see Table 3).

Figure 4
Functions of fusion products

What are the possible functions of transcription-induced chimeras? To understand this, we examined the fusion patterns with respect to the resulting ORF. In 53 events in our set (25%), a fusion protein containing coding sequences of both genes (without a premature stop codon) was created. This kind of fusion might generate a bifunctional protein having properties from both original proteins, as happens in the known cases of TWE-PRIL (Pradet-Balade et al. 2002) or Kua-UEV1 (Thomson et al. 2000). An example of this type of fusion, between NME1 and NME2, is presented in Figure 4A.

Another functional impact could be at the transcriptional regulation level. This will occur when the fusion involves only the first exon of the upstream gene, so that the upstream gene mainly contributes its 5′UTR to the fused transcript. Indeed, 26 (12%) of our events correspond to this type of fusion. This will potentially cause the downstream gene to be regulated as the upstream one, both transcriptionally (promoter) and translationally (5′UTR; see Fig. 4B).

TIC might also serve to suppress the expression of the upstream gene through the nonsense-mediated decay (NMD) mechanism (Hillman et al. 2004). This would occur when the fusion causes a frame-shift that results in a premature stop codon. Indeed, 120 (56%) TIC events in our set are expected to undergo NMD. A frame-shift can also result in an alternative C terminus of the upstream gene, if the resulting stop codon occurs in the last (or one before last) exon, as in the known ANKHD1-EIF4EBP3 (MASK-BP3) case (Poulin et al. 2003).

Finally, some of the events seen in our database could represent transcriptional “leakage”, where the transcriptional machinery accidentally ignores the termination of the upstream gene and transcribes through the downstream one. Such a leakage can be a rare, nonregulated stochastic event that does not contribute to the fitness of the organism.
Indeed, only 33% of the cases in our database were supported by multiple-sequence evidence, demonstrating that the majority of our events occur at low frequency. In addition, only 10% of TICs were found to be conserved between species. It could also be argued that the low frequency of protein fusion events (25% of the total) is indicative of the stochastic nature of the TIC phenomenon; however, a similar frequency of fusion proteins (23%) was also detected in the subset of events that are conserved between mammals, indicating that nonfusion-protein events can be under selective pressure as well and are hence possibly functional. Overall, although we cannot determine the actual fraction of TICs that is functional, our results suggest that at least a subset of these events have a biological role.

Currently, regulation of TIC is generally uncharacterized. Models for transcription termination indicate that cis-acting sequence elements and trans-acting termination factors, which belong to both the transcriptional and the splicing machineries, act together to generate an accurate 3′ end (Zhao et al. 1999; Proudfoot et al. 2002). Regulated transcriptional read-through was described in viruses and suggested also for TIC (Hardy and Wertz 1998; Magrangeas et al. 1998). Presumably, both cis-acting sequences, such as weak polyadenylation signals, and trans-acting suppressors/regulators of the termination machinery, could regulate the transcriptional read-through involved in TIC.

What is the proportion of TIC events in the genome? Our data suggest that ~2% of all human genes might be involved in such fusion. However, as ESTs are merely a sample of the transcriptome (Sorek et al. 2004), they do not represent all possible transcripts. Indeed, only five of the 13 known cases (40%) are represented in GenBank dbEST, indicating that many more genes might be fused than actually detected.
In addition, our strict filtering process removed many events that might actually be real (see Methods). Indeed, an EST-independent search for TICs in the ENCODE regions suggests that ~5% of all human genes are involved in TIC (Parra et al. 2006).

Evolution of protein complexes

It has been shown that gene fusion events across genomes can be used for predicting functional associations of proteins, including physical interactions and complex formation (Enright and Ouzounis 2001). This relies on the observation that two proteins that function in the same complex in one organism are frequently fused into a single “Rosetta Stone” protein in another organism (Marcotte et al. 1999). The TIC phenomenon might be a major process supporting this evolutionary mechanism. For example, the nucleoside diphosphate kinase (NDK) complex is a hexamer composed of the “A” (encoded by NME1) and “B” (NME2) polypeptides (Gilles et al. 1991). We detected and experimentally validated a ubiquitously expressed chimeric transcript fusing these two subunits, which codes for a natural NME1-NME2 fusion protein, in agreement with the above hypothesis (Fig. 4A).

Intriguingly, we were able to identify a processed pseudogene indicating fusion of the genes PIP5K1A and PSD4. Although no EST supported this fusion event, we verified experimentally the existence of a PIP5K1A-PSD4 fusion transcript in human RNA (Fig. 4C). This unique pseudogene example establishes transcription-induced chimerism followed by retroposition as a novel molecular evolutionary mechanism enabling the creation of new, fused “Rosetta Stone” sequences. This mechanism is expected to affect mainly eukaryotes, where the splicing machinery can efficiently remove the intergenic region. Presumably, additional fused genes were created through this mechanism during the evolutionary history of metazoa.
Conclusions

We have demonstrated that transcription-induced chimerism is much more widespread in the human transcriptome than initially appreciated, forming yet another layer of protein diversity. The fusion transcripts might function at various levels, either by creating newly functioning proteins, or by changing the regulation of pre-existing proteins. The function of each fusion event, as well as the mechanism enabling such fusion and the actual impact of this phenomenon on the human genome, remain to be elucidated.

Methods

Computational search

Human ESTs and cDNAs were obtained from NCBI GenBank version 136 (June 2003; http://www.ncbi.nlm.nih.gov/Genbank/) and aligned to the human genome build 33 (April 2003; http://www.ncbi.nlm.nih.gov/genome/guide/human/) using the LEADS clustering and assembly software as described previously (Sorek et al. 2002). Briefly, the software cleans the expressed sequences of vectors and immunoglobulins, masking them for repeats and low-complexity regions. It then aligns the expressed sequences to the genome, taking alternative splicing into account, and groups overlapping expressed sequences into “clusters” that represent genes or partial genes. Clusters were separated into sense/antisense clusters using the “Antisensor” algorithm as described in Yelin et al. (2003). “Complete CDS” annotation of RNA sequences was obtained from the “DEFINITION” field in the GenBank sequence records. In each cluster, overlapping complete CDS cDNA sequences that aligned fully to the genome were grouped together. Each group was referred to as a gene. Gene boundaries were extended in cases where ESTs suggested longer UTRs than present in the RNA. In clusters containing more than one gene, connecting sequences were identified. Connecting sequences were required to have canonical splice sites at the fusion junction, and to share at least one splice site with each of the two separate genes.
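The connecting-sequence criterion above can be illustrated with a minimal Python sketch. This is not the LEADS implementation, which is not described here in code form; the `Intron` record, the function names, and the reading of "canonical" as a GT-AG intron are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: "canonical" is taken to mean a GT donor and an AG acceptor.
CANONICAL_PAIRS = {("GT", "AG")}


@dataclass(frozen=True)
class Intron:
    """Hypothetical record for one aligned intron of an expressed sequence."""
    donor: int           # genomic coordinate of the 5' splice site
    acceptor: int        # genomic coordinate of the 3' splice site
    donor_dinuc: str     # first two intronic bases
    acceptor_dinuc: str  # last two intronic bases


def splice_sites(introns):
    """Collect splice sites as (kind, coordinate) pairs."""
    sites = set()
    for intron in introns:
        sites.add(("donor", intron.donor))
        sites.add(("acceptor", intron.acceptor))
    return sites


def is_connecting_sequence(seq_introns, junction, upstream_introns, downstream_introns):
    """Return True if an expressed sequence passes both stated requirements.

    junction is the intron of the sequence that spans the intergenic region;
    it must be canonically spliced, and the sequence must share at least one
    splice site with each of the two genes.
    """
    pair = (junction.donor_dinuc.upper(), junction.acceptor_dinuc.upper())
    if pair not in CANONICAL_PAIRS:
        return False
    sites = splice_sites(seq_introns)
    shares_upstream = bool(sites & splice_sites(upstream_introns))
    shares_downstream = bool(sites & splice_sites(downstream_introns))
    return shares_upstream and shares_downstream


# Toy coordinates, invented for illustration only.
upstream_gene = [Intron(100, 200, "GT", "AG"), Intron(300, 400, "GT", "AG")]
downstream_gene = [Intron(1000, 1100, "GT", "AG"), Intron(1200, 1300, "GT", "AG")]
fusion_junction = Intron(300, 1100, "GT", "AG")
fusing_est = [Intron(100, 200, "GT", "AG"), fusion_junction]
passes = is_connecting_sequence(fusing_est, fusion_junction, upstream_gene, downstream_gene)
```

In the toy data the fusion junction reuses the donor of the upstream gene's last intron and the acceptor of the downstream gene's first intron, so the sequence passes; a sequence whose junction lands on unannotated sites in either gene would be rejected.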
For the “alignment artifacts” filtering, each exon in the connecting sequences was aligned to the cDNA sequences of both connected genes. Sequences with exons aligned to both genes were discarded. For the manual filtration of fusion events, we used the following information from the UCSC genome browser: (1) occurrence of CpG islands before both genes; (2) existence of SWISS-PROT annotations for both genes; (3) existence of ORF, 5′ and 3′ UTRs for both genes.

For gene distance calculation, known RefSeqs were localized to the genome using the UCSC genome browser annotations (Karolchik et al. 2003). The distance between each gene pair was calculated from the most downstream hit of the upstream gene to the most upstream hit of the downstream gene. Only distances up to 400 kb were considered, as this is the maximum intron length allowed by the LEADS software.

To predict possible NMD of transcripts, the fused transcript was first assembled using the upstream and downstream RefSeqs connected by the fusing EST. In this transcript, a premature stop codon was searched for according to the rule of 55 nucleotides or more upstream of the last exon–exon junction.

For the “evolution of protein complexes” analysis, annotations of genes were downloaded from the “RefSeq Summary” field in the UCSC genome browser and from the comments fields in SWISS-PROT. Processed pseudogenes indicating TIC were systematically searched for in the database of >8000 processed pseudogenes compiled by Zhang et al. (2003).

Experimental validation of fusion events

Total RNA was isolated from a variety of human tissues (kidney, liver, brain, ovary, white blood cells [WBC], testis, prostate, spleen, heart, breast, pancreas, lung, colon, bladder, thymus, Farage B lymphocyte cell line [CRL-2360, ATCC], K562 cell line [CCL-243, ATCC], JURKAT T lymphocyte cell line [TIB-152, ATCC], and HepG2 liver cell line [HB-8065, ATCC]) by Tri-Reagent (MRC) according to the manufacturer's instructions.
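As an aside on the computational Methods, the NMD rule stated there (a stop codon 55 nucleotides or more upstream of the last exon–exon junction) can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function names, the 0-based coordinates, and the choice to measure the distance from the end of the stop codon to the junction are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(mrna, cds_start):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    for i in range(cds_start, len(mrna) - 2, 3):
        if mrna[i:i + 3].upper() in STOP_CODONS:
            return i
    return None


def predicted_nmd(mrna, cds_start, exon_lengths, min_distance=55):
    """Apply the 55-nucleotide rule to an assembled (fused) transcript.

    exon_lengths lists exon sizes in transcript order. The last exon-exon
    junction lies at the cumulative length of all exons but the last.
    Assumption: distance is counted from the base after the stop codon
    up to the junction.
    """
    if len(exon_lengths) < 2:
        return False  # no exon-exon junction, so the rule cannot apply
    stop = first_in_frame_stop(mrna, cds_start)
    if stop is None:
        return False
    last_junction = sum(exon_lengths[:-1])
    return last_junction - (stop + 3) >= min_distance


# Toy transcript, invented for illustration: the stop codon ends at
# position 12 and the last junction is at position 72 (60 nt downstream).
toy_mrna = "ATG" + "GCC" * 2 + "TAA" + "A" * 60 + "C" * 30
toy_is_nmd_candidate = predicted_nmd(toy_mrna, 0, [72, 30])
```

A stop codon that falls in the last exon, or closer than 55 nucleotides to the last junction, is not flagged by this sketch, which matches the text's observation that such transcripts can instead yield an alternative C terminus.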
First-strand cDNAs were prepared using SuperScript II reverse transcriptase (Invitrogen) primed with random hexamers and oligo dT (Invitrogen). RT–PCR was performed in 50-μL reactions containing 1 μL of RT product in the presence of 2 mM dNTPs, 10 pmol of primers, and 2.5 units of SUPER-THERMO polymerase, and subjected to 28–35 amplification cycles. Reactions were designed to include a forward primer from the upstream gene and two reverse primers, one from the last exon of the upstream gene and one from the internal exon of the downstream gene, in order to identify both the wild-type RNA and the fused RNA (primers appear in Supplemental Table S2). RT–PCR products were isolated from agarose gels and verified by sequencing.

Acknowledgments

We thank D. Schaffer, A. Golubev, A. Haviv, and N. Keren for biocomputational assistance; M. Oz for providing critical resources; G. Naveh for literature assistance; and Z. Levine, U. Nir, K. Savitsky, E. Levanon, D. Milo, S. Pollock, G. Cojocaru, E. Eisenberg, and D. Dahary for fruitful discussions.

Notes

Article published online ahead of print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4137606.

Footnotes

[Supplemental material is available online at www.genome.org.]