Identification of cis-Regulatory Elements in Gene Co-expression Networks Using A-GLAM
The publisher's final edited version of this article is available at Methods Mol Biol.
Abstract
Reliable identification and assignment of cis-regulatory elements in promoter regions is a challenging problem in biology. The sophistication of transcriptional regulation in higher eukaryotes, particularly in metazoans, could be an important factor contributing to their organismal complexity. Here we present an integrated approach where networks of co-expressed genes are combined with gene ontology–derived functional networks to discover clusters of genes that share both similar expression patterns and functions. Regulatory elements are identified in the promoter regions of these gene clusters using a Gibbs sampling algorithm implemented in the A-GLAM software package. Using this approach, we analyze the cell-cycle co-expression network of the yeast Saccharomyces cerevisiae, showing that this approach correctly identifies cis-regulatory elements present in clusters of co-expressed genes.
Keywords: Promoter sequences, transcription factor–binding sites, co-expression, networks, gene ontology, Gibbs sampling
1. Introduction
The identification and classification of the entire collection of transcription factor–binding sites (TFBSs) are among the greatest challenges in systems biology. Recently, large-scale efforts involving genome mapping and identification of TFBS in lower eukaryotes, such as the yeast Saccharomyces cerevisiae, have been successful (1). On the other hand, similar efforts in vertebrates have proven difficult due to the presence of repetitive elements and an increased regulatory complexity (2–4). The accurate prediction and identification of regulatory elements in higher eukaryotes remains a challenge for computational biology, despite recent progress in the development of algorithms for this purpose (5).
Typically, computational methods for identifying cis-regulatory elements in promoter sequences fall into two classes, enumerative and alignment techniques (6). We have developed algorithms that use enumerative approaches to identify cis-regulatory elements that are statistically significantly over-represented in promoter regions (7). Subsequently, we developed an algorithm that combines both enumeration and alignment techniques to identify statistically significant cis-regulatory elements that are positionally clustered relative to a specific genomic landmark (8).
Here, we present a systems biology framework to study cis-regulatory elements in networks of co-expressed genes. This approach includes a network comparison operation, namely the intersection between co-expression and functional networks, which reduces complexity and removes false positives that arise when genes are linked by co-expression but not by function. First, co-expression (9, 10) and functional networks (11, 12) are created using user-selected thresholds. Second, a single network is constructed from the intersection between the co-expression and functional networks (13). Third, the highly interconnected regions in the intersection network are identified (14). Fourth, upstream regions of the gene clusters that are linked by both co-expression and function are extracted. Fifth, candidate cis-regulatory elements present in the dense cluster regions of the intersection network are identified using A-GLAM (8). In principle, the intersection of other types of networks with co-expression and/or functional networks could also be used to identify groups of co-regulated genes of interest (15) that may share cis-regulatory elements.
2. Materials
2.1. Hardware Requirements
2.2. Software Requirements
3. Methods
The size of a co-expression network depends on the number of nodes in the network and the threshold used to define an edge between two nodes. A number of distance measures are often used to compare gene expression profiles (16). Here we use the Pearson correlation coefficient (PCC) as a metric to measure the similarity between expression profiles and to construct gene co-expression networks (17, 18). We link two genes, represented by nodes, with an edge if the PCC value is greater than or equal to 0.7; this is an arbitrary cut-off that can be adjusted depending on the dataset used. The microarray dataset used here is the yeast cell-cycle progression experiment from Cho et al. (9) and Spellman et al. (10). The semantic similarity method (11) was used to quantitatively assess the functional relationships between S. cerevisiae genes.
The A-GLAM software package uses a Gibbs sampling algorithm to identify functional motifs (such as TFBSs, mRNA splicing control elements, or signals for mRNA 3’-cleavage and polyadenylation) in a set of sequences. Gibbs sampling (or more descriptively, successive substitution sampling) is a respected Markov-chain Monte Carlo procedure for discovering sequence motifs (19). Briefly, A-GLAM takes a set of sequences as input. The Gibbs sampling step in A-GLAM uses simulated annealing to maximize an ‘overall score’, a figure of merit corresponding to a Bayesian marginal log-odds score. The overall score is given by
In Eq. [1], m! = m(m – 1)…1 denotes a factorial; aj, the pseudocounts for nucleic acid j in each position; a = a1 + a2 + a3 + a4, the total pseudocounts in each position; cij, the count of nucleic acid j in position i; and c = ci1 + ci2 + ci3 + ci4, the total number of aligned windows, which is independent of the position i. The rationale behind the overall score s in A-GLAM is explained in detail elsewhere (8).
To initialize its annealing maximization, A-GLAM places a single window of arbitrary size and position on every sequence, generating a gapless multiple alignment of the windowed subsequences. It then proceeds through a series of iterations; on each iteration step, A-GLAM proposes a set of adjustments to the alignment. The proposal step is either a repositioning step or a resizing step. In a repositioning step, a single sequence is chosen uniformly at random from the alignment, and the set of adjustments includes all possible positions in the sequence where the alignment window would fit without overhanging the ends of the sequence. In a resizing step, either the right or the left end of the alignment window is selected, and the set of proposed adjustments includes expanding or contracting the corresponding end of all alignment windows by one position at a time. Each adjustment leads to a different value of the overall score s. Then, A-GLAM accepts one of the adjustments randomly, with probability proportional to exp(s/T). A-GLAM may even exclude a sequence if doing so would improve alignment quality. The temperature T is gradually lowered to T = 0, with the intent of finding the gapless multiple alignment of the windows maximizing s. The maximization implicitly determines the final window size. The randomness in the algorithm helps it avoid local maxima and find the global maximum of s. Due to the stochastic nature of the procedure, finding the optimum alignment is not guaranteed.
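The acceptance rule described above can be sketched in Python as follows. This is a minimal illustration and not A-GLAM's own code: the function names are hypothetical, and `overall_score` assumes that Eq. [1] has the usual Bayesian marginal log-odds form built from the quantities just defined, with (m – 1)! written as Γ(m) and with background frequencies pj, which is an assumption on our part.

```python
import math
import random


def overall_score(counts, pseudo, background):
    """Assumed form of the overall score s, in bits.

    counts[i][j] : c_ij, count of nucleotide j in alignment column i
    pseudo[j]    : a_j, pseudocount for nucleotide j
    background[j]: p_j, background frequency of nucleotide j
    """
    a = sum(pseudo)
    s = 0.0
    for column in counts:
        c = sum(column)
        # log[(a - 1)! / (c + a - 1)!], using Gamma(m) = (m - 1)!
        s += math.lgamma(a) - math.lgamma(c + a)
        for c_ij, a_j, p_j in zip(column, pseudo, background):
            # log[(c_ij + a_j - 1)! / (a_j - 1)!] - c_ij * log(p_j)
            s += math.lgamma(c_ij + a_j) - math.lgamma(a_j)
            s -= c_ij * math.log(p_j)
    return s / math.log(2)  # convert natural log to log base 2


def choose_adjustment(scores, temperature, rng=random):
    """Return the index of one proposed adjustment, chosen with
    probability proportional to exp(s / T)."""
    best = max(scores)
    if temperature <= 0:
        # T = 0 limit: only a maximal-score adjustment is accepted
        return scores.index(best)
    # Subtracting the maximum leaves the probabilities unchanged
    # and avoids overflow in exp() at low temperatures.
    weights = [math.exp((s - best) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]
```

At high temperature the choice is close to uniform over the proposed adjustments; as T approaches 0 it concentrates on the highest-scoring one. Because each choice is random, separate runs can finish at different alignments.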
Therefore, A-GLAM repeats this procedure ten times from different starting points (ten runs). The idea is that if several of the runs converge to the same best alignment, the user has increased confidence that it is indeed the optimum alignment. The steps (below) corresponding to E-values and post-processing were then carried out with the PSSM corresponding to the best of the ten scores s.
The individual score and its E-value in A-GLAM
The Gibbs sampling step produces an alignment whose overall score s is given by Eq. [1]. Consider a window of length w that is about to be added to A-GLAM’s alignment. Let δi(j) equal 1 if the window has nucleic acid j in position i, and 0 otherwise. The addition of the new window changes the overall score by
The score change corresponds to scoring the new window according to a position-specific scoring matrix (PSSM) that assigns the ‘individual score’
The assignment of an E-value to a subsequence with a particular individual score is done as follows: consider the alignment sequence containing the subsequence. Let n be the sequence length, and recall that w is the window size. If ΔSi denotes the quantity in Eq. [2] when the final letter in the window falls at position i of the alignment sequence, then ΔS* = max{ΔSi : i = w,…,n} is the maximum individual score over all sequence positions i. We assigned an E-value to the actual value ΔS* = Δs*, as follows. Staden’s method (21) yields the probability P{ΔSi ≥ Δs*} (independent of i) under the null hypothesis of bases chosen independently and randomly from the frequency distribution {pj}. The E-value E = (n – w + 1) P{ΔSi ≥ Δs*} is therefore the expected number of sequence positions with an individual score reaching or exceeding Δs*. The factor n – w + 1 in E is essentially a multiple test correction.
More recently, the A-GLAM package has been improved to allow the identification of multiple instances of an element within a target sequence (22). The optional ‘scanning step’ after Gibbs sampling produces a PSSM given by Eq. [3]. The new scanning step resembles an iterative PSI-BLAST search based on the PSSM. First, it assigns an ‘individual score’ to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in the order of increasing E-values; users then have a statistical evaluation of the predicted elements in a convenient presentation.
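The E-value calculation can be illustrated with a short Python sketch. It is not A-GLAM's implementation: the function names are hypothetical, the PSSM entry for nucleic acid j in position i is assumed to be the log-odds log2{[(cij + aj)/(c + a)]/pj}, and the tail probability is obtained by building the exact score distribution column by column, in the spirit of (but not a reproduction of) Staden's method (21).

```python
import math
from collections import defaultdict


def build_pssm(counts, pseudo, background):
    """Assumed individual scores: log2 of (c_ij + a_j)/(c + a) over p_j."""
    a = sum(pseudo)
    pssm = []
    for column in counts:
        c = sum(column)
        pssm.append([math.log2(((c_ij + a_j) / (c + a)) / p_j)
                     for c_ij, a_j, p_j in zip(column, pseudo, background)])
    return pssm


def tail_probability(pssm, background, threshold):
    """P(window score >= threshold) when bases are drawn independently
    from the background. Enumerates up to 4**w distinct sums, so it is
    only suitable for short motifs; a practical implementation would
    typically discretize the scores to keep the table small."""
    dist = {0.0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for s_ij, p_j in zip(column, background):
                nxt[score + s_ij] += prob * p_j
        dist = nxt
    return sum(p for s, p in dist.items() if s >= threshold - 1e-9)


def best_window_evalue(sequence, pssm, background, alphabet="ACGT"):
    """Return (start, score, E-value) of the best-scoring window."""
    w, n = len(pssm), len(sequence)
    index = {base: j for j, base in enumerate(alphabet)}
    scores = [sum(pssm[i][index[sequence[start + i]]] for i in range(w))
              for start in range(n - w + 1)]
    best = max(scores)
    # (n - w + 1) window positions act as the multiple test correction
    evalue = (n - w + 1) * tail_probability(pssm, background, best)
    return scores.index(best), best, evalue
```

For a toy two-column motif built from four aligned copies of "AC" and a uniform background, the best window in "GGACGT" starts at index 2; only one of the sixteen possible dinucleotides reaches that score, so the tail probability is 1/16 and the E-value is 5 × 1/16.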
Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence.
3.1. Co-expression Network Construction
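As described in the Methods overview, two genes are linked by an edge when the PCC between their expression profiles is greater than or equal to 0.7. A minimal Python sketch of that rule follows; the function names and the gene identifiers in the usage note are hypothetical, and treating a flat (zero-variance) profile as unlinked is our assumption rather than a documented choice.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined for a flat profile; treat as no link
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def coexpression_edges(profiles, threshold=0.7):
    """Return the set of gene pairs whose PCC is >= threshold.

    profiles maps a gene identifier to its list of expression values
    (one value per time point or condition).
    """
    edges = set()
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            edges.add((g1, g2))
    return edges
```

For example, with `{"geneA": [1, 2, 3, 4], "geneB": [2, 4, 6, 8.5], "geneC": [4, 3, 2, 1]}` only the pair (geneA, geneB) is linked, because geneC is anti-correlated with both. Raising or lowering `threshold` changes the size of the resulting network.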
3.2. Functional Similarity Network Construction
3.3. Intersection Network Construction
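The intersection operation keeps only those gene pairs that are linked in both the co-expression network and the functional network. The sketch below shows the idea with plain Python sets; it is an illustration under our own naming (all function names are hypothetical) and does not reproduce the network software (13) used for this step.

```python
def normalize(edges):
    """Represent each undirected edge as a frozenset so that
    (A, B) and (B, A) compare equal; self-loops are dropped."""
    return {frozenset(edge) for edge in edges if len(set(edge)) == 2}


def intersect_networks(coexpression_edges, functional_edges):
    """Edges present in both networks."""
    return normalize(coexpression_edges) & normalize(functional_edges)


def adjacency(edges):
    """Adjacency map (node -> set of neighbours) of an edge set, a
    convenient input for a later search for dense regions."""
    neighbours = {}
    for edge in edges:
        u, v = tuple(edge)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    return neighbours
```

Genes that lose all of their edges in the intersection simply drop out of the adjacency map, which is how the operation reduces the size of the network passed on to cluster detection.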
3.4. Identification of Highly Interconnected Regions
3.5. Identification of Proximal Promoter Regions
3.6. Identification of cis-Regulatory Elements in Promoter Regions
4. Notes
Acknowledgments
The authors would like to thank King Jordan for important suggestions and helpful discussions and Alex Brick for his assistance in obtaining intergenic regions during his internship at NCBI. This research was supported by the Intramural Research Program of the NIH, NLM, NCBI.
References
1. Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, et al. Transcriptional regulatory code of a eukaryotic genome. Nature. 2004;431(7004):99–104.
2. Bieda M, Xu X, Singer MA, Green R, Farnham PJ. Unbiased location analysis of E2F1-binding sites suggests a widespread role for E2F1 in the human genome. Genome Res. 2006;16(5):595–605.
3. Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ, et al. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell. 2004;116(4):499–509.
4. Guccione E, Martinato F, Finocchiaro G, Luzi L, Tizzoni L, Dall’ Olio V, Zardo G, Nervi C, Bernard L, Amati B. Myc-binding-site recognition in the human genome is determined by chromatin context. Nat Cell Biol. 2006;8(7):764–770.
5. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23(1):137–144.
6. Ohler U, Niemann H. Identification and analysis of eukaryotic promoters: recent computational approaches. Trends Genet. 2001;17(2):56–60.
7. Marino-Ramirez L, Spouge JL, Kanga GC, Landsman D. Statistical analysis of over-represented words in human promoter sequences. Nucleic Acids Res. 2004;32(3):949–958.
8. Tharakaraman K, Marino-Ramirez L, Sheetlin S, Landsman D, Spouge JL. Alignments anchored on genomic landmarks can aid in the identification of regulatory elements. Bioinformatics. 2005;21 Suppl 1:i440–i448.
9. Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, Wolfsberg TG, Gabrielian AE, Landsman D, Lockhart DJ, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell. 1998;2(1):65–73.
10. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998;9(12):3273–3297.
11. Lord PW, Stevens RD, Brass A, Goble CA. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics. 2003;19(10):1275–1283.
12. Azuaje F, Wang H, Bodenreider O. Ontology-driven similarity approaches to supporting gene functional assessment. Proceedings of the ISMB’2005 SIG Meeting on Bio-Ontologies; Detroit, MI. 2005. pp. 9–10.
13. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–2504.
14. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003;4:2.
15. Tsaparas P, Marino-Ramirez L, Bodenreider O, Koonin EV, Jordan IK. Global similarity and local divergence in human and mouse gene co-expression networks. BMC Evol Biol. 2006;6:70.
16. Babu MM. An introduction to microarray data analysis. In: Grant RP, editor. Computational Genomics: Theory and Application. Cambridge, UK: Horizon Bioscience; 2004. pp. 225–249.
17. Jordan IK, Marino-Ramirez L, Koonin EV. Evolutionary significance of gene expression divergence. Gene. 2005;345(1):119–126.
18. Jordan IK, Marino-Ramirez L, Wolf YI, Koonin EV. Conservation and coevolution in the scale-free human gene coexpression network. Mol Biol Evol. 2004;21(11):2058–2070.
19. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262(5131):208–214.
20. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–3402.
21. Staden R. Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci. 1989;5(2):89–96.
22. Tharakaraman K, Marino-Ramirez L, Sheetlin S, Landsman D, Spouge JL. Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements. BMC Bioinformatics. 2006;7:408.
23. Orwant J, Hietaniemi J, Macdonald J. Mastering Algorithms with Perl. Sebastopol, CA: O’Reilly; 1999.
24. Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 2005;21(16):3448–3449.
25. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B. 1995;57(1):289–300.
26. Marino-Ramirez L, Jordan IK, Landsman D. Multiple independent evolutionary solutions to core histone gene regulation. Genome Biol. 2006;7(12):R122.
27. Eriksson PR, Mendiratta G, McLaughlin NB, Wolfsberg TG, Marino-Ramirez L, Pompa TA, Jainerin M, Landsman D, Shen CH, Clark DJ. Global regulation by the yeast Spt10 protein is mediated through chromatin structure and the histone upstream activating sequence elements. Mol Cell Biol. 2005;25(20):9127–9137.
28. Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18(20):6097–6100.
29. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14(6):1188–1190.