Repetitive sequence environment distinguishes housekeeping genes

1 UCLA Department of Human Genetics, David Geffen School of Medicine, Gonda Center, 695 E. Young Drive South, Los Angeles, California 90095-7088, USA
2 UCLA Department of Biostatistics, School of Public Health, Box 951772, Los Angeles, California 90095-1772, USA
* To whom correspondence should be addressed: York Marahrens, UCLA Department of Human Genetics, Gonda Center, Room 4554b, 695 Charles E. Young Drive, Los Angeles, CA 90095, USA, Phone: (310) 267-2466, Fax: (310) 794-5446, E-mail: ymarahrens/at/mednet.ucla.edu
3 Current address: Moira Regelson, Senior Research Scientist, Yahoo!, 3333 West Empire Ave, Burbank CA 91504, (818) 524-3549, moira/at/Yahoo-inc.com

Abstract

Housekeeping genes are expressed across a wide variety of tissues. Since repetitive sequences have been reported to influence the expression of individual genes, we employed a novel approach to determine whether housekeeping genes can be distinguished from tissue-specific genes by their repetitive sequence context. We show that Alu elements are more highly concentrated around housekeeping genes while various longer (>400-bp) repetitive sequences ("repeats"), including Long Interspersed Nuclear Element 1 (LINE-1) elements, are excluded from these regions. We further show that isochore membership does not distinguish housekeeping genes from tissue-specific genes and that repetitive sequence environment distinguishes housekeeping genes from tissue-specific genes in every isochore. The distinct repetitive sequence environment, in combination with other previously published sequence properties of housekeeping genes, was used to develop a method of predicting housekeeping genes on the basis of DNA sequence alone. Using expression across tissue types as a measure of success, we demonstrate that repetitive sequence environment is by far the most important sequence feature identified to date for distinguishing housekeeping genes.
Keywords: random forest, Alu, SINE, LINE, repeat, tissue-specific genes, isochores

1. INTRODUCTION

Housekeeping genes perform the basic functions common to all dividing cells; they are widely expressed across tissues, and are associated with CpG islands (Bird, 1986). Housekeeping genes also typically have small introns (Eisenberg and Levanon, 2003) that lack repetitive sequences (Han et al., 2004). Housekeeping genes have been found to cluster together on the genome to some degree (Lercher et al., 2002), and to preferentially localize to GC-rich fractions of genomic DNA known as isochores on cesium sulfate gradients (Lercher et al., 2003).

We were interested in the relationship of housekeeping genes to repetitive sequences. Nearly half of the human genome consists of repetitive sequence, the majority of which is transposon-derived and widely considered to be "junk DNA." The major classes of repeats are LTR retrotransposons (7.9% of genome sequence), non-LTR retrotransposons (32.0%), DNA transposons (2.8%), satellite and satellite-related sequences (0.34%), low complexity repeats (0.54%), and simple sequence repeats (0.84%). The non-LTR retrotransposons consist primarily of the Long Interspersed Nuclear Element-1 (LINE-1, 15.6%) and the non-autonomous Alu element (10.1%). LINE-1 transposons encode the enzymatic activities required for both their own mobility and for the mobilization of Alu elements (Hagan and Rudin, 2002; Kajikawa and Okada, 2002; Dewannieux et al., 2003). Alu transposons differ from other SINEs in that they are not derived from tRNA genes, but rather from the 7SL RNA gene (Ullu and Tschudi, 1984; Quentin, 1994; Smit, 1996; Okada and Hamada, 1997; Terai et al., 1998; Lander et al., 2001) which encodes the RNA component of the signal recognition particle that mediates the translocation of nascent secretory and membrane proteins (Wild et al., 2004).
Aside from favoring TT|AAAA target sequences (Feng et al., 1996; Jurka, 1997; Cost and Boeke, 1998), human Alu and LINE-1 elements have been reported to insert at random positions in the genome (Smit, 1999; Boissinot et al., 2001; Lander et al., 2001; Ovchinnikov et al., 2001; Gilbert et al., 2002; Myers et al., 2002; Symer et al., 2002; Szak et al., 2002; Jurka et al., 2004; Gilbert et al., 2005). However, there is some evidence for insertional hot spots (Cost and Boeke, 1998; Myers et al., 2002; Graham and Boissinot, 2006). A popular idea is that the non-random distribution of these repetitive sequences arises from their loss via purifying selection (Boissinot et al., 2001; Myers et al., 2002; Graham and Boissinot, 2006). There are a number of reports of repetitive sequences influencing gene expression. For example, in fragile X patients expansion mutations of a tandem simple sequence repeat located in an intron of the FMR1 gene result in the transcriptional silencing of the FMR1 gene (Pieretti et al., 1991). Transposable elements in Drosophila and plants have been implicated in the transcriptional silencing of nearby genes by the spread of heterochromatin (Lippman et al., 2004; Sun et al., 2004) raising the possibility that transposons may also be capable of reducing expression if located near genes in humans. DNA methylation is an important feature of heterochromatin that silences gene expression (Stancheva, 2005). Tissue-specific differences in DNA methylation have been reported for various repetitive sequences including LINE-1 elements (Sano and Sager, 1982; Breznik et al., 1984; Nishioka, 1988; Mietz and Kuff, 1990; Allingham-Hawkins et al., 1996; Hassan et al., 2001; Chalitchagorn et al., 2004; Khodosevich et al., 2004) raising the expectation that repetitive sequences are more repressive to the expression of nearby genes in some tissues than in others. 
Here we sought to determine whether the repetitive sequence environments flanking housekeeping genes that are widely expressed and important across tissues are subject to unique constraints. We show that long (>400-bp) repeats including LINE-1 elements are excluded from the regions flanking housekeeping genes and that short repeats, in particular Alu elements, are highly enriched around these genes. We demonstrate that repetitive sequence environment is by far the most important sequence feature identified to date for distinguishing housekeeping genes and speculate that Alu elements are advantageous for housekeeping genes.

2. METHODS

2.1 Assembly of Gene Lists

For housekeeping genes, we used a published list of 575 genes that were expressed in all available tissues above 200 standard Affymetrix average-difference units on an Affymetrix U95A microarray chip containing 12,600 probes from 47 different human tissues and cell lines (Eisenberg and Levanon, 2003). We assembled a list of tissue-specific genes by combining the published lists of two studies (Warrington et al., 2000; Hsiao et al., 2001). Genes from all lists were identified by either Genbank RefSeq ID or Unigene ID and were converted to RefSeq ID via DAVID Tools (http://apps1.niaid.nih.gov/david) (Dennis et al., 2003). We then looked up each gene by its RefSeq ID in the UCSC Genome Browser (http://genome.ucsc.edu) and marked it as HK (housekeeping) or TS (tissue-specific), as appropriate. Genes with common transcription start or stop positions on the same chromosome were considered to be the same gene and were treated identically. Seventeen genes appeared in both the housekeeping and tissue-specific lists and were therefore excluded from both groups. After all conversions, we had 586 autosomal housekeeping genes and 468 autosomal tissue-specific genes.
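The list-assembly rules in Section 2.1 can be illustrated with a short sketch. It is written in Python purely for illustration (the authors report using Perl and MySQL). The `Gene` record, the function names and the accession strings are hypothetical, and matching the two lists by RefSeq ID is an assumption about how the overlap was detected.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    """Hypothetical record for one RefSeq gene annotation."""
    refseq_id: str
    chrom: str
    tx_start: int
    tx_end: int


def collapse(genes: List[Gene]) -> List[Gene]:
    """Keep the first gene seen for each shared start or stop position.

    Two entries on the same chromosome that share a transcription start
    OR a transcription end are treated as the same gene.
    """
    seen_starts, seen_ends, kept = set(), set(), []
    for g in genes:
        start_key = (g.chrom, g.tx_start)
        end_key = (g.chrom, g.tx_end)
        if start_key in seen_starts or end_key in seen_ends:
            continue  # same gene as an earlier entry
        seen_starts.add(start_key)
        seen_ends.add(end_key)
        kept.append(g)
    return kept


def split_lists(hk: List[Gene], ts: List[Gene]) -> Tuple[List[Gene], List[Gene]]:
    """Collapse each list, then drop genes that appear in both lists."""
    hk, ts = collapse(hk), collapse(ts)
    both = {g.refseq_id for g in hk} & {g.refseq_id for g in ts}
    hk_only = [g for g in hk if g.refseq_id not in both]
    ts_only = [g for g in ts if g.refseq_id not in both]
    return hk_only, ts_only
```

The greedy pass in `collapse` is a simplification: it does not chain matches transitively, which a full implementation might want to do.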
2.2 Sequence Characteristics

Human sequence information including repetitive sequences was obtained from the July 2003 assembly (hg16) UCSC annotation tracks in the chrN_rmsk tables (http://hgdownload.cse.ucsc.edu/downloads.html#human). For each gene, we initially defined a region of analysis extending from 100-kb upstream of the transcription start position (txStart) to 100-kb downstream of the transcription end position (txEnd). We excluded the transcribed gene regions from our analysis to avoid effects that can be attributed to displacement by the coding sequence or splicing elements, or to the interference of transcriptional elongation by repetitive sequences in introns (Han et al., 2004). Gaps in the DNA sequence were also omitted. The total number of base pairs comprising each repeat by family was then calculated and divided by the total number of base pairs included in the region. These data were extracted from a copy of the UCSC Genome Browser Database running locally on MySQL (http://hgdownload.cse.ucsc.edu/downloads.html#human; http://www.mysql.com) (Regelson et al., 2006). Calculations were performed using the Perl scripting language (http://www.perl.org) and MySQL functions.

CpG islands were treated in the same manner as repetitive sequence composition: the total number of bases comprising CpG islands was divided by total bases in each region of analysis. CpG islands are defined by the UCSC Genome Browser as sequences that are at least 50 base pairs long and have at least 50% GC content. The extent of gene clustering was estimated by counting the number of transcription start positions found within each region of analysis. Sense and antisense orientation were determined relative to the gene being analyzed; repeats oriented in the same direction as the gene are designated "sense," while repeats in the opposite orientation are designated "antisense."
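The per-family repeat fraction described above can be sketched as follows. This is a minimal Python illustration, not the authors' Perl/MySQL code. The function names and the tuple layout of `repeats` and `gaps` are assumptions, coordinates are treated as half-open intervals, and clipping at the distal chromosome end is ignored.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def merged(intervals: Iterable[Interval]) -> List[List[int]]:
    """Merge overlapping intervals so no base is counted twice."""
    out: List[List[int]] = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1]:
            out[-1][1] = max(out[-1][1], e)
        else:
            out.append([s, e])
    return out


def covered(intervals: Iterable[Interval], regions: List[Interval]) -> int:
    """Number of bases in `regions` that are covered by `intervals`."""
    total = 0
    for s, e in merged(intervals):
        for rs, re_ in regions:
            total += max(0, min(e, re_) - max(s, rs))
    return total


def repeat_fractions(tx_start: int, tx_end: int,
                     repeats: Iterable[Tuple[str, int, int]],
                     gaps: Iterable[Interval],
                     flank: int = 100_000) -> Dict[str, float]:
    """Fraction of the two flanking regions occupied by each repeat family.

    The transcribed region [tx_start, tx_end) is excluded, and assembly
    gaps are subtracted from the denominator.
    """
    regions = [(max(0, tx_start - flank), tx_start), (tx_end, tx_end + flank)]
    analysed = sum(e - s for s, e in regions) - covered(gaps, regions)
    if analysed <= 0:
        return {}
    by_family: Dict[str, List[Interval]] = {}
    for family, s, e in repeats:
        by_family.setdefault(family, []).append((s, e))
    return {fam: covered(iv, regions) / analysed for fam, iv in by_family.items()}
```

The same function would serve for CpG islands by passing them in as a single "family".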
The long-range distribution of each characteristic was measured in much the same way as above, except flanking regions extended 40-Mb in each direction, excluding the gene, rather than 100-kb. Also, rather than working with the entire flanking region at once, we divided the 80-Mb regions into 1-Mb segments and calculated the fraction of each segment comprising repeats, the fraction comprising CpG islands, and the number of genes that have transcription start sites in that segment.

Isochores are defined as contiguous regions along a chromosome sharing a homogeneous GC composition and are identified by a custom track of the UCSC Genome Browser provided by IsoFinder (http://genome.ucsc.edu) (Oliver et al., 2004). Genes were assigned to isochores if their transcription start positions fell within the isochore’s boundaries.

2.3 Statistical Analysis

Statistical analysis was performed using the software R (http://www.R-project.org) (Team, 2003). Means for each group were compared using the Kruskal-Wallis test. Error bars indicate 1.96x standard error (95% confidence interval). The long-range effect of each sequence characteristic was plotted using local regression (loess) curves, as implemented in the R loess function (below). These plots represent a moving average and can be adjusted with respect to the smoothness of the curve. We used a smoothness parameter of 0.02 for plots presented in Figure 1b.
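The error-bar convention stated in Section 2.3 (mean ± 1.96 × standard error) amounts to the following calculation. The sketch is in Python rather than the R used by the authors. The function name is hypothetical, and using the sample (n − 1) standard deviation is an assumption.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_ci95(values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) with bounds at mean +/- 1.96 * SE.

    SE is the sample standard deviation divided by sqrt(n); at least two
    observations are required.
    """
    n = len(values)
    mean = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * se
    return mean, mean - half_width, mean + half_width
```

For example, `mean_ci95(alu_fraction_per_gene)` applied to the per-gene Alu fractions of one gene group (a hypothetical list) would give the bar height and the two whisker positions for that group.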
2.4 Random Forest Classification

A random forest predictor is an ensemble of individual classification tree predictors (Breiman, 2001). For each observation, each individual tree votes for one class and the forest predicts the class that has the plurality of votes. The user specifies the number of randomly selected variables to be searched through for the best split at each node, using the Gini index (Breiman et al., 1984) as the splitting criterion. We used the randomForest package in R, which also implements partial dependence plots (Liaw and Wiener, 2002). The root node of each tree in the forest contains a bootstrap sample of roughly 2/3 of the original data as the training set. Observations not in the training set are referred to as out-of-bag observations. For a case in the original data, the outcome is predicted by plurality vote involving only those trees that did not contain the case in their corresponding bootstrap sample. By contrasting these out-of-bag predictions with the training set outcomes, one can arrive at an estimate of the prediction error rate. Our out-of-bag error rate was 23%, which compared favorably with the 45% error rate expected if every gene were simply assigned to the majority class (housekeeping genes made up 55% of the training set). Thus the input variables contain predictive information of housekeeping status. We estimated the specificity of our random forest classifier by dividing the number of incorrectly classified genes (i.e. tissue-specific genes classified as housekeeping genes) by the total number of genes classified with a stringent probability (>90%).

The random forest construction allows one to define several measures of variable importance. In this article, we used the "node purity"-based variable importance measure. For each variable, it measures the mean decrease in Gini index over all node splits that involve it. The absolute values of the variable importance measure have no meaning; instead, this measure is used to rank variables.
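To make the node-purity measure concrete, the sketch below computes the Gini index of a node and the decrease achieved by a single split. It is a Python illustration of the quantity that is averaged over splits, not the internals of the R package. The function names are hypothetical, and the class labels "HK" and "TS" in the usage example are used only for illustration.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[str]) -> float:
    """Gini index of a node: 1 - sum over classes of p_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent: Sequence[str],
                  left: Sequence[str],
                  right: Sequence[str]) -> float:
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children
```

As a worked case, a node holding 6 HK and 4 TS genes has a Gini index of 1 − (0.6² + 0.4²) = 0.48. Splitting it into children of (5 HK, 1 TS) and (1 HK, 3 TS) leaves a weighted impurity of about 0.317, a decrease of about 0.163 credited to the variable used for that split.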
2.5 Affymetrix Microarray Atlases

Two independent microarray datasets were used. Gene Atlas #1 was generated by the authors (A.D., B.M. and S.N., submitted). Gene Atlas #2 was provided by the Genomics Institute of the Novartis Research Foundation (http://symatlas.gnf.org) (Su et al., 2004). Conversion from RefSeq gene ID to Affymetrix probe ID was performed using the DAVID website (http://apps1.niaid.nih.gov/david). We considered a gene to be expressed in a tissue if its presence call p-value was below 0.05. Gene Atlas #1 presence call p-values were generated with the software MAS 5.0 (http://www.affymetrix.com/products/software/specific/mas.affx). Gene Atlas #2 presence call p-values were generated with the software DNA-Chip Analyzer (dChip) (http://biosun1.harvard.edu/complab/dchip).

3. RESULTS

3.1 Differences in flanking sequence composition between housekeeping and tissue-specific genes

We chose for our analysis 1000 randomly selected autosomal Reference Sequence (RefSeq) genes, published lists of 583 housekeeping genes (Eisenberg and Levanon, 2003), and 468 tissue-specific genes (Warrington et al., 2000; Hsiao et al., 2001). For each gene, we initially considered a region extending 100-kb upstream from the transcription start and 100-kb downstream from the transcription end, but excluding the transcribed gene region. We excluded the transcribed regions from our analysis to avoid effects attributable to displacement by the coding sequence or splicing elements, or to the interference of transcriptional elongation by repetitive sequences in introns (Han et al., 2004). For the 200-kb region flanking each gene, we identified the types and positions of repeats from the RepeatMasker output provided by the UCSC genome browser (http://genome.ucsc.edu) (Supplementary Table 1 online). We then obtained a value for each repeat type representing the percent of the 200-kb flanking sequence occupied by that repeat type.
We also considered the percent CpG island sequence, the size of each gene, and the number of neighboring genes (gene clustering) whose transcription start position fell within the 200-kb regions. Housekeeping (HK) genes contained more genes in their 200-kb flanking regions and were flanked by significantly higher concentrations of CpG island sequence than tissue-specific (TS) genes or the random sample of genes (Fig. 1a).

3.2 Comparison of housekeeping and tissue-specific genes among isochores

Isochores are defined as long stretches of a chromosome that exhibit a more or less homogeneous GC composition. There are currently five commonly recognized isochores, identified by their relative GC content (low 1 or 2; high 1, 2 or 3) (http://genome.ucsc.edu) (Oliver et al., 2004). To determine whether isochore membership distinguishes HK from TS genes, we determined the isochore to which each HK and TS gene belonged. Both HK and TS genes resided in the high-GC isochores (H1, H2 and H3) and in the low-GC isochores (L1 and L2) (Table 1), indicating that isochore membership did not distinguish HK from TS genes, in agreement with previous reports (D’Onofrio, 2002). However, there were very few HK genes in isochore L1 (Fig. 2a).
3.3 Sequence composition differences extend over megabase distances

To determine how far along the chromosome the flanking sequence differences between housekeeping and tissue-specific genes extended, we calculated the density of each repeat for 1-Mb intervals extending 40-Mb upstream and downstream from the three sets of genes. Alu elements were significantly more enriched around housekeeping genes than tissue-specific genes across an 18-Mb region (Fig. 1b). Visual inspection revealed that most of the trends that distinguish housekeeping genes are also evident around tissue-specific genes, but to a lesser extent. Nevertheless, most repeats displayed elevated or reduced densities over similar distances around tissue-specific genes as around housekeeping genes (Fig. 1b).

3.4 Using repetitive sequence environment to identify housekeeping genes

The finding that repetitive sequence environment distinguishes housekeeping genes suggested that it may be possible to identify housekeeping genes based on DNA sequence features alone. We constructed a random forest classifier of housekeeping status using repeats, CpG islands, gene size and gene clustering as input variables. The out-of-bag accuracy (1 – error) was 77%, which compares favorably with the accuracy of 45% that would have been obtained if all genes in the training set had been called tissue-specific. However, only a small proportion of the published set of housekeeping genes was successfully classified (Supplementary Table 3 online). We evaluated the importance of each characteristic in the random forest classification by removing the characteristics one at a time and recreating the classifier. Removal of CpG islands, gene size or clustering effects had no significant impact on the accuracy of the classifier, whereas removal of repeat information significantly increased the classifier’s error rate (Fig. 3a).
3.5 Validation of the Random Forest prediction using microarray data

We applied the intact random forest classifier to more than 16,000 RefSeq genes and identified more than 800 genes with housekeeping gene prediction probabilities ("HK probability") greater than 80%. If these high-scoring genes are indeed housekeepers, they should be expressed across tissues. To test this, we calculated the proportion of tissues in which each gene is reported to be expressed in two independent Affymetrix microarray atlases. Genes scoring over 80% HK probability had markedly higher expression levels than those with lower scores (Fig. 4).

3.6 Alu concentration around a gene correlates with percent of tissues in which the gene is expressed

A gradual increase in the predicted probability of being a housekeeping gene coincided with a corresponding gradual increase in the number of tissues in which those genes were expressed (Fig. 5a).
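The validation measure, the proportion of tissues in which a gene is called expressed, follows from the presence-call rule in Section 2.5 (expressed if p < 0.05). A minimal Python sketch is given here. The function name and the input layout (one presence-call p-value per tissue for a single gene) are assumptions.

```python
from typing import Sequence


def fraction_of_tissues_expressed(presence_p_values: Sequence[float],
                                  alpha: float = 0.05) -> float:
    """Proportion of tissues whose presence-call p-value is below alpha."""
    if not presence_p_values:
        raise ValueError("at least one tissue is required")
    expressed = sum(1 for p in presence_p_values if p < alpha)
    return expressed / len(presence_p_values)
```

A gene with p-values of 0.01, 0.20, 0.04 and 0.05 across four tissues would score 0.5, because the threshold is strict and 0.05 does not count as expressed.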
4. DISCUSSION

4.1 Housekeeping genes are distinguished by a distinct repetitive sequence environment

Repetitive sequences have been shown to silence nearby genes via the spread of heterochromatin (Lippman et al., 2004; Sun et al., 2004). However, several longer repetitive sequences or tracts of repetitive sequences have been reported to show differences in an important heterochromatin property, DNA methylation (Sano and Sager, 1982; Breznik et al., 1984; Nishioka, 1988; Mietz and Kuff, 1990; Allingham-Hawkins et al., 1996; Hassan et al., 2001; Chalitchagorn et al., 2004; Khodosevich et al., 2004), suggesting that such influences should vary with tissue type. We therefore reasoned that housekeeping genes, which are utilized across all or nearly all tissues, ought to be subject to stronger constraints in the repetitive sequence environment of their flanking regions.

We show that housekeeping genes indeed reside in a distinct repetitive sequence environment that distinguishes these genes more accurately than their previously identified sequence characteristics: association with CpG islands, gene clustering and the presence of small introns. Alu elements were enriched in large regions around housekeeping genes but peaked sharply near the genes themselves. Most other repetitive sequences, including LINE-1 elements, showed the inverse pattern: partial exclusion over large regions with a sharp drop in concentration in the immediate vicinity of the genes. Most repeats showed similar but significantly less pronounced trends around the less widely expressed tissue-specific genes. Interestingly, the most significant correlations were the enrichment of all short (<400-bp) transposons around HK genes and the exclusion of all long (>400-bp) transposons or repeat tracts from these regions. Though isochore membership influenced repeat abundance, both HK and TS genes were distributed across isochores, and repeat environment distinguished HK and TS genes in every isochore.
Isochore membership, therefore, failed to distinguish the two types of genes. Several recent studies have concluded that isochores, while excellent for broad descriptions of chromosomal properties, lack the resolution necessary to analyze subtle influences on gene expression (Häring and Kypr, 2001; Lander et al., 2001; Cohen et al., 2005). Alu enrichment was only 16% correlated with housekeeping genes, but even this modest correlation was highly significant. Clearly, there are one or more processes at work that cause housekeeping genes to acquire a distinctive repetitive sequence environment. Therefore, we find it prudent to consider all of the many quantifiable elements that contribute to a gene’s sequence environment.

4.2 The possibility that non-random insertion patterns contribute to the distinct repeat environment of housekeeping genes

Why are Alu elements abundant around housekeeping genes while long repeats are scarce? One possibility is that Alu elements preferentially insert around housekeeping genes while longer transposons including LINE-1 elements preferentially insert elsewhere. The notion that human transposons preferentially integrate in certain regions gains support from evidence for insertion hotspots. For example, it was reported that nine of 14 disease-causing LINE-1 insertions were restricted to only three genes (Ostertag and Kazazian, 2001). Alu elements might preferentially recognize the chromatin near housekeeping genes while LINE-1 elements and other long transposons avoid such chromatin. However, the notion of strikingly distinct insertion biases seems contrary to the finding that the LINE-1 transposition machinery mobilizes both LINE-1 and Alu elements (Jurka, 1997; Dewannieux et al., 2003), which consequently insert at the same consensus TT|AAAA sequence (Feng et al., 1996; Jurka, 1997; Cost and Boeke, 1998).
Indeed, the factor IX gene has been shown to be targeted by one disease-causing LINE-1 insertion and two independent Alu insertions (Ostertag and Kazazian, 2001), suggesting that Alu and LINE-1 elements can share an insertion hotspot. The notion that Alu and LINE-1 elements have different insertion biases is also inconsistent with the finding that evolutionarily recent insertions of active Alu and LINE-1 subfamilies do not follow the non-random genomic distribution of the older elements (Feng et al., 1996; Ovchinnikov et al., 2001; Gilbert et al., 2002; Symer et al., 2002; Szak et al., 2002; Gilbert et al., 2005). Indeed, we show that all older (>2.4 myr) Alu subfamilies examined are significantly more abundant around housekeeping genes than tissue-specific genes while this trend is weaker or not evident among the youngest subfamilies. The insertion bias scenario would therefore also require all young Alu and LINE-1 subfamilies to have lost their contrasting regional insertion preferences.

4.3 Non-random repeat distributions via natural selection

Another explanation is that natural selection rather than insertion bias is responsible for the non-random repeat distribution. Selection may simply involve new insertions being lost when they are deleterious. In this case longer (>400-bp) repeats would be disadvantageous when located near housekeeping genes but not detrimental in gene-poor regions. In contrast, Alu elements would not be deleterious near housekeeping genes, but disadvantageous when too abundant around tissue-specific genes and most detrimental when concentrated in gene-poor regions. This selection scenario allows us to avoid invoking a mechanism whereby the LINE-1 transposition machinery leads to Alu elements being inserted in different regions from LINE-1 elements.
It would also account for why all of the youngest Alu and LINE-1 subfamilies are more randomly distributed than the older subfamilies as not enough time would have passed to select for the favored distributions (Gu et al., 2000; Pavlícek et al., 2001; Medstrand et al., 2002; Belle et al., 2005; Hackenberg et al., 2005). How might high Alu concentrations be increasingly detrimental as one moves away from housekeeping genes? We show that this decrease in Alu concentration is accompanied by an increase in long repeats and repeat tracts, the most abundant of which are LINE-1 elements. Long repeats including LINE-1 elements are normally heavily DNA-methylated (Woodcock et al., 1988; Crowther et al., 1991; Woodcock et al., 1997), a feature of heterochromatin. DNA methylation (and heterochromatin in general) has been reported to suppress homologous recombination (Pàldi et al., 1995; Maloisel and Rossignol, 1998; Schnable et al., 1998; Fu et al., 2002; Yao et al., 2002; Yamada et al., 2004; Myers et al., 2005) as well as transposition (Yoder et al., 1997; Walsh et al., 1998; Hirochika et al., 2000; Robertson, 2001; Bird, 2002; Kato et al., 2003). Accordingly, reports of deletions caused by homologous recombination between LINE-1 elements are rare (Segal et al., 1999) except in cancers (Florl and Schulz, 2003) where LINE-1 elements are frequently hypomethylated (Santourlidis et al., 1999; Takai et al., 2000; Ehrlich, 2002; Carnell and Goodman, 2003; Florl et al., 2004; Roman-Gomez et al., 2005). LINE-1 transposition has also been reported to be elevated in cancers (Schulz, 2006). If high concentrations of Alu elements cause nearby LINE-1 elements and other repeats to be hypomethylated (or lose other heterochromatic features), the expected result would be genome instability due to an increase in illegitimate homologous recombination and an increase in transposition. 
Studies on a small number of Alu elements describe properties that, if widespread among Alu elements, suggest a general mechanism for the genome destabilization scenario. In an impressively thorough pair of studies (Thorey et al., 1993; Willoughby et al., 2000), one of three Alu elements was shown to protect transgenes against transcriptional repression by position effects and also by being integrated as tandem multicopy repeats (Garrick et al., 1998; Selker, 1999). This led to the suggestion that a subset of all Alu elements define transcriptionally permissive (euchromatic) domains (Willoughby et al., 2000). A substantial proportion of Alu elements are hypomethylated (Schmid, 1998; Yang et al., 2004). In one study, it was shown that seven out of 21 Alu elements examined displayed properties of euchromatin (H3-K4 methylation and histone acetylation) and furthermore were bound by a SNF2h-containing chromatin remodeling complex (Hakimi et al., 2002). Treatment of cells with the DNA methyltransferase inhibitor 5-azacytidine resulted in additional Alu elements being bound by the chromatin remodeling complex (Hakimi et al., 2002). Unfortunately, it was not addressed whether the same Alu elements are consistently hypomethylated or whether Alu elements slowly alternate between heterochromatic (e.g., DNA methylated) and euchromatic (e.g., DNA hypomethylated) states as has been documented for a mouse IAP element near a gene (Whitelaw and Martin, 2001) and a proviral reporter (Lorincz et al., 2002). We favor alternation between heterochromatic and euchromatic Alu states in part because the same Alu element has been shown to be associated with both heterochromatic and euchromatic histone modifications (Kondo et al., 2004). An Alu element has also been shown to display variability in its histone modifications in mice (Martens et al., 2005).
If a significant proportion of all Alu elements are euchromatic, a consequence should be illegitimate recombination-induced genome instability that is limited by the short lengths of the repeats but possibly increased by the presence of a chi-like recombination sequence (Lupski, 2004). Indeed, there are numerous reports of disease-causing deletions resulting from recombination between Alu elements (Batzer and Deininger, 2002; Nishimura et al., 2005; Casarin et al., 2006; Has et al., 2006; Kozak et al., 2006; Li et al., 2006; Matejas et al., 2006; Nissen et al., 2006; Sen et al., 2006; Shabbeer et al., 2006; Uddin et al., 2006; Xie et al., 2006; Zhang et al., 2006). If a large proportion of Alu elements indeed fosters euchromatic domains as suggested by the aforementioned study (Willoughby et al., 2000), then flanking sequences may also be destabilized. Interestingly, high concentrations of Alu elements have been associated with disease-causing deletions whose breakpoints were not in the Alu elements themselves (Abrao et al., 2006; Abu-Safieh et al., 2006). LINE-1 elements located among high concentrations of such Alu elements might be rendered euchromatic. Since efficiency of illegitimate homologous recombination increases with length of homology (Waldman and Liskay, 1988; Baker et al., 1996), this would engender genome instability via illegitimate homologous recombination and possibly also via transposition if the element is suitably intact (Baker et al., 1996; Ostertag and Kazazian, 2001; Robertson, 2001). We would expect the genome instability engendered by euchromatic LINE-1 elements to result in these elements’ being selected against.

4.4 Positive selection may enrich Alu elements near genes

Why are Alu elements more enriched around housekeeping genes and less so around other genes? One possibility is that nearby Alu elements increase gene expression levels under normal circumstances.
There are several reports of variant Alu sequences functioning as tissue-specific enhancers for genes (for a partial list, see http://zmbe.uni-muenster.de/expath/alltables.htm) (Britten, 1996). However, Alu elements might also aid the expression of the more widely expressed housekeeping genes. The Alu element shown to protect transgenes (Thorey et al., 1993; Willoughby et al., 2000) was proposed to support open chromatin domains that acted as barriers against the spread of heterochromatin. A mutational analysis implicated conserved elements in the internal Pol III promoter in the protective function of this Alu (Thorey et al., 1993). Pol III promoters do not display tissue-specificity (Lander et al., 2001) and have been shown to also function as barriers that protect genes from the spread of heterochromatin in S. cerevisiae (Donze et al., 1999) and S. pombe (Noma et al., 2006; Scott et al., 2006). In S. cerevisiae, Pol III promoter barrier function has been shown to require histone acetyltransferases, which are chromatin remodeling enzymes that foster the formation of euchromatin (Donze and Kamakaka, 2002). Other heterochromatin barriers are also believed to serve as entry sites for the recruitment of histone acetyltransferase or chromatin remodeling activities that disrupt the binding of silencing proteins to histones and thereby terminate the spread of heterochromatin structures (Grewal and Moazed, 2003). Pol III heterochromatin barriers in yeast have been shown to require subunits of the cohesin complex (Donze and Kamakaka, 2002) and the Alu elements reported to be bound by a SNF2h-containing chromatin remodeling complex also recruited cohesin subunits (Hakimi et al., 2002). Since transgenes tend to become partially or completely silenced over time by repressive chromatin (Pannell and Ellis, 2001; Iba et al., 2003; Kwaks and Otte, 2006; Lavigne and Gorecki, 2006), most chromosomal regions appear to be repressive to gene function.
Such repressiveness would select for the appearance and retention of elements that facilitate the expression of housekeeping genes. We therefore speculate that Alu elements are not only tolerated by housekeeping genes due to the scarcity of long repeats, but that they are more abundant around the highly expressed housekeeping genes than around other genes because they promote gene expression across tissues. Another possible pressure driving the greater enrichment of Alu elements around housekeeping genes arises from the observation that, in a number of species, stress has been shown to trigger Alu transcription, resulting in Alu transcripts that bind the protein kinase PKR; this blocks the ability of PKR to inhibit protein translation (Chu et al., 1998; Schmid, 1998; Deininger and Batzer, 1999). The concentration of Alu elements in the euchromatic environment around genes may facilitate this. Since housekeeping genes provide a more consistent euchromatic environment than other genes, the positive selective pressure may be stronger for housekeeping genes.

4.8 Long repeats may be disadvantageous to nearby housekeeping genes

Regardless of whether Alu elements are beneficial or merely tolerated, the increased abundance of Alu elements around housekeeping genes stood in stark contrast to the scarcity of longer (>400-bp) repeats and repeat tracts (LINE-1 elements and various other repeats) in these same regions. Although the lower abundance of the TT|AAAA target sequence near housekeeping genes may very well contribute to LINE-1 scarcity (Jurka, 1997; Cost and Boeke, 1998; Lander et al., 2001; Graham and Boissinot, 2006) despite their otherwise random insertion pattern (Smit, 1999; Boissinot et al., 2001; Lander et al., 2001; Ovchinnikov et al., 2001; Gilbert et al., 2002; Myers et al., 2002; Symer et al., 2002; Szak et al., 2002; Jurka et al., 2004; Gilbert et al., 2005), at least one additional explanation is needed since long repeats in general were scarce.
One reason why long repeats might be selected against near housekeeping genes is that an abundance of these repeats might reduce gene expression via heterochromatin spread. Long transposons (Lyon, 1998; Marahrens, 1999; Bailey et al., 2000; Allen et al., 2003; Lippman et al., 2004; Sun et al., 2004) and long tracts of tandem repeats (Pieretti et al., 1991; Hansen et al., 1997; Saveliev et al., 2003) have been implicated in gene silencing via the spread of heterochromatin. Another reason why LINE-1 and other long transposons might be scarce around housekeeping genes is that the euchromatin could spread into the transposons and activate their internal promoters (Swergold, 1990; Minakami et al., 1992; Leib-Mösch and Seifarth, 1995; Speek, 2001; Athanikar et al., 2004). This could cause gene over-expression, as has been reported for an IAP element insertion in the 5′ upstream region of the mouse Agouti gene (Whitelaw and Martin, 2001), or down-regulate genes if the transcription proceeds through the gene in the antisense direction (Whitelaw and Martin, 2001). Promoter activation in a transposon could also cause transposition if the element is suitably intact. Finally, the encroachment of euchromatin into long repeats would facilitate illegitimate homologous recombination between repeats. Long repeats were much more abundant around tissue-specific genes than around housekeeping genes. Tissue-specific genes are known to employ tissue-specific enhancers and LCRs to open the chromatin for transcription. Long repeat content might help prevent unwanted expression in all other tissues and might overwhelm the Alu elements near tissue-specific genes. However, in tissues where the gene is to be expressed, Alu elements might interact with tissue-specific promoter or enhancer elements to modulate gene expression, as has been reported for the K18 gene (Rhodes and Oshima, 1998; Willoughby et al., 2000).
Note that although short repeats (other than demethylated Alu elements) also display properties of heterochromatin, there is no evidence that heterochromatin can spread more than a few bases beyond the repeat. Indeed, in plants a polymorphic SINE element has been reported to serve as a nucleation center for DNA methylation, but the methylation only spread a few bases beyond the SINE (Arnaud et al., 2000). Limited spread of methylation would provide an additional explanation for why Alu elements are tolerated in the vicinity of widely expressed genes.

Acknowledgments

C.D.E. was supported by a UCLA-IGERT bioinformatics traineeship (NSF DGE-9987641). M.R. was supported by a Tumor Cell Biology Fellowship (USHHS Institutional National Research Service Award #T32 CA09056). Y.M. was supported in part by National Institutes of Health Grants GM6100701 and HD041451-02.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.