Copyright © 2005, Cold Spring Harbor Laboratory Press
Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences
1 Center for Comparative Genomics and Bioinformatics, Huck Institutes of Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
2 Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
3 Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
4 Department of Statistics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
5 Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
6 National Human Genome Research Institute, Rockville, Maryland 20852, USA
7 Corresponding author. E-mail rch8/at/psu.edu; fax (814) 863-7024.
Received January 5, 2005; Accepted June 2, 2005.
This article has been cited by other articles in PMC.
Abstract
Techniques of comparative genomics are being used to identify candidate functional DNA sequences, and objective evaluations are needed to assess their effectiveness. Different analytical methods score distinctive features of whole-genome alignments among human, mouse, and rat to predict functional regions. We evaluated three of these methods for their ability to identify the positions of known regulatory regions in the well-studied HBB gene complex. Two methods, multispecies conserved sequences and phastCons, quantify levels of conservation to estimate a likelihood that aligned DNA sequences are under purifying selection. A third function, regulatory potential (RP), measures the similarity of patterns in the alignments to those in known regulatory regions.
The methods can correctly identify 50%–60% of noncoding positions in the HBB gene complex as regulatory or nonregulatory, with RP performing better than do other methods. When evaluated by the ability to discriminate genomic intervals, RP reaches a sensitivity of 0.78 and a true discovery rate of ~0.6. The performance is better on other reference sets; both phastCons and RP scores can capture almost all regulatory elements in those sets along with ~7% of the human genome. A major aim of genomics is to identify the functional segments of DNA (Collins et al. 2003; The ENCODE Project Consortium 2004), and comparative methods play a critical role in achieving this goal (Miller et al. 2004). Inclusion of comparative data has improved the accuracy of bioinformatic methods for predicting exons and gene structures (Brent and Guigó 2004), but full annotation of genes has not yet been achieved for complex genomes (Lander et al. 2001; Waterston et al. 2002; Gibbs et al. 2004; International Human Genome Sequencing Consortium 2004). DNA segments needed to control the level, developmental timing, and spatial pattern of gene expression, termed cis-regulatory modules (CRMs), are even more difficult to identify accurately. Few constraints analogous to the rules of the genetic code, used for coding exons, can be universally applied to CRMs (Wasserman and Sandelin 2004), and thus bioinformatic predictions of CRMs commonly use one or more sources of information in addition to the DNA sequence. These can include the presumptive start site for transcription (Trinklein et al. 2003), matches to transcription factor binding sites (Wingender et al. 2001; Sandelin et al. 2004a), overrepresentation of motifs in coexpressed genes (Spellman et al. 1998), and noncoding sequence conservation (for review, see Pennacchio and Rubin 2001; Cooper and Sidow 2003; Frazer et al. 2003; Hardison 2003). 
However, CRMs are difficult to distinguish from neutral DNA in mammalian genomes by relatively simple human–mouse conservation scores (Waterston et al. 2002; Elnitski et al. 2003). Interspecies comparisons can be used to infer function if aligned sequences are scored for the likelihood that they are under evolutionary constraint (purifying selection), which is often measured by an evolutionary rate slower than that observed in neutral DNA (Waterston et al. 2002; Cooper et al. 2004). Hidden Markov models and phylogenetic information have been used to estimate rates of evolution (Felsenstein and Churchill 1996) and to improve predictions of functional elements such as genes (Pedersen and Hein 2003; McAuliffe et al. 2004). Phylogenetic hidden Markov models (Siepel and Haussler 2003) have been applied to multiple sequence alignments to estimate a likelihood that a particular sequence is among the most highly conserved in a genome. A recent implementation of these models computes a score called phastCons (Siepel et al. 2005), which allows for rate variation in different lineages and assumes that adjacent bases score similarly. An earlier method finds multispecies conserved sequences (MCSs) (Margulies et al. 2003) as blocks of highly constrained aligned sequences. This algorithm weights the score by phylogenetic distance and adjusts estimates of significance by a neutral substitution rate.
In addition, aligned sequences can be analyzed for features other than degree of constraint in order to discriminate between alignments in distinct functional classes. Elnitski et al. (2003) introduced a regulatory potential (RP) score that evaluates the extent to which patterns in an alignment (strings of alignment columns) are more similar to patterns found in alignments of known regulatory elements than in alignments of ancestral repeats, which are a model for neutral DNA. This approach has been adapted to three-way alignments among human, mouse, and rat sequences (Kolbe et al. 2004). Alignments with positive scores have patterns similar to those found in the regulatory region training set, whereas those with negative scores have patterns more similar to those in aligned ancestral repeats. Although this method examines patterns in alignments, it does not use information about known factor binding sites.
In this article, we derive a set of all the known regulatory elements in the intensively studied β-globin gene complex of mammals (HBB gene complex) and use it as a reference set to evaluate the sensitivity (Sn) and specificity (Sp) of the constraint-based alignment scores (phastCons and MCSs) and the pattern-matching RP score. This calibration of the scores allows their effectiveness to be evaluated genome-wide, using alignments of the human, mouse, and rat genomes.
Results
Reference set of known regulatory elements from the HBB complex
DNA sequences needed to regulate the set of developmentally controlled, erythroid-specific genes encoding β-globin and its relatives have been studied intensively, and in this gene complex, the fraction of human sequences aligning with mouse and rat (35%) is very close to the genome average (Gibbs et al. 2004). Thus, regulatory elements in this gene complex comprise a good (but not perfect) data set with which to assess false-positive and false-negative rates for bioinformatic predictions of CRMs. This cluster of genes at human chromosome 11p15.5, referred to here as the HBB complex, includes the embryonically expressed HBE1, the fetally expressed HBG1 and HBG2, and HBD and HBB, which are expressed in adult life, along with a pseudogene HBBP1. A reference set of all known CRMs was assembled from the literature describing this gene complex, including promoters for the genes, upstream sequences (adjacent to the promoter) implicated in regulation, and five DNase hypersensitive sites in the distal strong enhancer called the locus control region (for review, see Hardison et al. 1997; Forget 2001; Hardison 2001; Li et al. 2002). The reference set includes 23 CRMs (Table 1).
One limitation to using interspecies conservation to predict CRMs is that some bona fide regulatory elements do not align between the species being examined. Of the 23 CRMs in the human HBB complex, 20 are conserved in mouse, 19 are conserved in rat, and only four are conserved in chicken (Table 1), based on BLASTZ pairwise alignments (Schwartz et al. 2003b). Fortunately, a substantial majority (19 of 23) is conserved among human, mouse, and rat, and 18 are in the genome-wide multiple alignments considered here (see Methods). The four CRMs conserved between human and chicken are in the promoters of the genes (Table 1); none of the upstream or distal CRMs, including enhancers, align at the stringencies used for the whole-genome human–chicken alignments (Hillier et al. 2004). It is possible that more sensitive alignment methods will discover more distantly related sequences in future studies.
It is important to realize that knowledge of CRMs is still incomplete, even in a rigorously studied region such as the HBB complex. DNA intervals identified by comparative genomics methods but not in our reference set are considered false positives (FPs), but in reality, they could be regulatory elements not yet tested for function.
Calibration of discriminatory thresholds
Since the conservation and RP scores are computed on human–mouse–rat three-way alignments, there is an associated score for the 18 CRMs that are in the alignments (green peaks in Fig. 1).
Our goal is to find the threshold for each score that optimizes the ability to find the CRMs (high Sn) while minimizing the amount of other DNA that also passes the threshold (high Sp; see Methods). As expected, Sn decreases and Sp increases with increasing score thresholds for each method (Fig. 2).
In this binary discrimination analysis, the optimal threshold for RP scores is –0.006 (Table 2). The fact that it is a negative number is initially surprising, because negative values mean that the patterns of the alignments are more like those in the negative training set (aligned ancestral repeats) than those in the known regulatory elements. However, it is important to realize that in this binary discrimination analysis, the methods are evaluated by what fraction of all positions within the regulatory regions is found. Another important feature to evaluate is whether any part of a regulatory element passes a given threshold. Thus, we conducted a second analysis, in which the regulatory regions are considered as intervals (not individual positions) and the relevant score is the maximum within the interval. The intervals containing nonregulatory regions are continuous runs of positions whose RP scores meet or exceed the threshold; thus their size and number vary with the threshold. They also were evaluated by the maximum score within the interval. We computed the fraction of regulatory region intervals that exceed a threshold score, called the interval Sn, or Snint, and the fraction of intervals exceeding a threshold that are regulatory regions, called the “true discovery rate.” With this approach, an RP threshold of zero achieves a Snint of 0.78 and a true discovery rate of ~0.6 (Table 2). Thus an RP of zero is a useful operational threshold for the human–mouse–rat RP scores. The interval-based evaluation did not reveal an improved performance for MCS and phastCons (Table 2).
The RP scores were trained on a set of 93 known regulatory regions, which included four of the CRMs in the reference set from the HBB gene complex, namely, HS2 of the LCR and promoters for the HBE1, HBG2, and HBB genes. To remove bias introduced by this overlap in training and testing sets, we repeated the training of the RP model excluding the CRMs from the HBB gene complex.
The threshold, Sn, and Sp of RP scores generated in this way are similar to the ones generated by including the CRMs from the HBB gene complex (Table 2).
Genome-wide evaluation of alignment scores for regulatory elements
The HBB complex provides a continuous region of intensively tested regulatory regions for evaluation, but it is important to know how well the results on this locus apply to known regulatory elements in the rest of the genome. Additional sets of functional elements distributed across the human genome were collected, and the maximum RP and phastCons scores within each interval were determined. (The MCS score was not included because it has not been computed across the entire human genome.) Because of the sparseness of known regulatory elements across the entire human genome, it is not possible to assess a false-positive or true-negative rate for these data sets. Hence, Sp cannot be determined, but it is useful to compare the distribution of scores for each functional set with the genome-wide scores.
The distribution of RP scores for positions in the human genome (including coding regions) that align with mouse and rat shows that ~20% has RP scores above zero (Fig. 3A).
The distribution of phastCons scores for the sequences in the human genome that align with mouse and rat is dramatically skewed toward low values (Fig. 3B).
One of the intriguing results for both methods is that their diagnostic effectiveness for the HBB complex reference set is less than that observed for the other sets of regulatory elements. The cumulative distributions for both scores in the HBB complex CRMs are shifted considerably to the left of the scores for other sets of CRMs, coding sequences, and miRNAs (Fig. 3).
Discussion
We organized the extensive experimental results on the DNA segments regulating expression of the HBB gene complex into a reference data set, and then used this data set to evaluate three different approaches for analyzing multispecies alignments to find CRMs. Two of the methods, MCSs and phastCons, are based exclusively on conservation, whereas a third method, RP, uses a pattern-matching discriminatory function within the conserved regions. All three methods had some success in detecting the CRMs in the reference set, with sensitivities and specificities ranging from 50% to 60% when evaluated on all nucleotide positions. The RP function performed better than did the conservation measures on the reference set of CRMs in the HBB gene complex. This is expected, given that high conservation should reflect the effects of purifying selection for any function, whereas the RP function was trained to find patterns in alignments similar to those in transcriptional regulatory regions. When the performance was evaluated on the maximum score in each interval, RP scores reached a Sn of 78% with a true discovery rate of ~60%. Strikingly, both conservation-based scores and RP scores perform much better against other sets of CRMs.
The RP and phastCons scores are deposited in databases such as the Genome Browser (Kent et al. 2002; Karolchik et al. 2004) and GALA (Giardine et al. 2003; Elnitski et al. 2005), and MCS scores can be computed on a Web server.
Investigators can apply the thresholds determined in the current study to search the databases and predict CRMs with reasonable but not perfect reliability. Given that the fraction of the human genome with positive RP scores (~7%) exceeds the fraction estimated to be under selection between primates and rodents (~5%) (Waterston et al. 2002; Chiaromonte et al. 2003), one should expect some FPs in the RP-based predictions. Recent studies show that Sp of predictions is improved by combining bioinformatic features associated with regulatory elements, such as conservation with overrepresented motifs (Blanchette et al. 2002; Liu et al. 2004) or conservation with clusters of transcription factor binding sites (Berman et al. 2004). We find that when RP scores above the discriminatory threshold are combined with a conserved transcription factor binding site, the predicted CRMs are validated at a high rate (Hardison et al. 2003a; Welch et al. 2004).
Although the prospects for application of the current measures to experimental investigation of gene regulation are promising, our study also illustrates some important limitations in using multispecies alignments to predict CRMs. First, RP scores fail to distinguish at least 20% of the conserved CRMs in the HBB complex, and other methods have less Sn. Fortunately, for other reference data sets, the performance is better, but it is important to realize that some conserved CRMs will be missed using current methods. Second, some human CRMs have no reliable matches with mouse and rat sequence, as is the case for four of the 23 CRMs known in the human HBB complex. Obviously, CRMs such as these are invisible to predictive algorithms based on primate–rodent alignments, but they may be detectable over shorter phylogenetic distances using techniques such as phylogenetic shadowing (Boffelli et al. 2003). The failure of these CRMs to align could be explained in at least three different ways.
One is that homologs to some of the “nonconserved” human CRMs could be present in rodent or avian species, but current alignment algorithms are not sufficiently sensitive to detect them. For example, the HS2 enhancer in the LCR does not match in whole-genome human–chicken alignments generated with BLASTZ, but it does align between mammals and chicken when TBA (Blanchette et al. 2004) is used to align homologs to this ENCODE region (The ENCODE Project Consortium 2004). However, in this case the functional protein-binding sites are not preserved in the aligned chicken sequence (data not shown), which raises the issue of whether the alignment is biologically meaningful. It is also possible that some of the nonconserved human CRMs function only in one lineage and are not preserved in other lineages. Still other nonconserved CRMs possibly could be explained by turnover in the factor binding sites (Ludwig et al. 2000; Dermitzakis and Clark 2002; Berman et al. 2004). The hypersensitive site 3′HS1 may be an example of turnover. A hypersensitive site is located at a similar location downstream of the β-globin gene in both mouse and human (Fleenor and Kaufman 1993; Bulger and Groudine 1999), and this region has been implicated in boundary function (Bulger et al. 2003). However, the sequences of these HSs between mouse and human are not very similar, perhaps because the critical factor binding sites have changed order or location between the species. A third limitation to the use of comparative genomics approaches for finding potential cis-regulatory elements is the incomplete knowledge of protein-coding regions. All the methods examined here, including RP, give high scores to exons. Exons that have not been annotated will not be excluded from the analysis of “noncoding” regions, and thus they can contribute to FPs in the predictions. The reasons for Sn differing among data sets are of considerable interest. 
Recent studies show that genes encoding proteins involved in developmental and transcriptional regulation tend to have highly constrained CRMs (Sandelin et al. 2004b; Plessy et al. 2005; Woolfe et al. 2005). In contrast, the extensive studies in the HBB gene complex, many of which were not driven by sequence conservation, may have revealed some types of regulatory elements that do not have as strong a conservation signal as do those in developmental regulatory genes. Detailed analyses of the evolutionary features of different types of regulatory elements are an important area for future research.
Improvements are expected in the predictive power of all the scores being computed on multispecies alignments. The discriminatory power of alignments increases as more sequences are added, both for a particular locus (Thomas et al. 2003) and genome-wide (Gibbs et al. 2004). Indeed, all three of the methods evaluated here perform better on three-way human–mouse–rat alignments than on pairwise human–mouse alignment (data not shown). Including the sequences of other species, such as dog and opossum, should improve the discriminatory power. Other studies that address statistical challenges in developing discriminatory models (Kolbe et al. 2004) should also lead to improved performance.
Methods
Reference sets of transcriptional regulatory regions
The β-globin gene (HBB) complex contains several regulatory regions that have been well studied experimentally. A set of 23 experimentally determined CRMs was compiled from a literature survey and mapped within a 95-kb interval (chr11:5185001–5280000 in hg16), which encompasses the HBB complex and terminates at the surrounding olfactory receptor genes (Bulger et al. 2000).
Several types of experimental data were used in establishing that a CRM is functional, including naturally occurring thalassemia mutations in humans, analysis of large DNA constructs in transgenic mice, effects on expression of reporter genes in either transient transfections or stably transformed cultured cells, DNase hypersensitive sites in chromatin, and in vivo footprints (see references in Table 1). Regions identified solely by electrophoretic mobility shift assays were not included. Of the 23 CRMs in this reference set, 19 can be found in multiple alignments of the human, mouse, and rat sequences. However, only 18 were available for the evaluation of the scores computed on the multiple alignment of hg16, mm3, and rn3 (see below) because much of the sequence of hypersensitive site HS4 (Stamatoyannopoulos et al. 1995) was masked as a repeat. Specifically, it is within an ERV1 transposable element, a member of a family that was active around the time of the primate-rodent divergence. This history makes it difficult to accurately determine whether to include the repeats in alignments (soft-masking) or to exclude them entirely (hard-masking) (Schwartz et al. 2003b). For the whole-genome multiple alignment set used in this study, the ERV1 family was hard-masked, and consequently, we could not include HS4 in the evaluations. However, it is well-known that the sequence of HS4 aligns among mammals, including humans and rodents (Stamatoyannopoulos et al. 1995; Hardison et al. 1997), and hence it is listed as conserved in rat and mouse in Table 1. A set of 40,000 predicted promoters were compiled by Trinklein et al. (2003). Of these, 152 were tested for promoter activity in transient transfection assays, with 138 verified (termed functional promoters). The 93 known regulatory regions were compiled from the literature and comprise the training set of RP (Elnitski et al. 2003). 
The developmental enhancers are the human homologs of a collection of 26 enhancers for mouse genes whose products regulate early development (Plessy et al. 2005). Other sets of functional sequences were the 176 miRNAs obtained from the miRNA Registry (Griffiths-Jones 2004; http://www.sanger.ac.uk/Software/Rfam/mirna/index.shtml) and the ~200,000 coding exons from RefSeq (Pruitt and Maglott 2001).
Alignments
Three-way human–mouse–rat alignments were computed on the July 2003 human genome assembly (hg16, NCBI build 34), the February 2003 mouse genome assembly (mm3), and the June 2003 rat assembly (rn3), using MULTIZ (Blanchette et al. 2004) on the relevant pairwise BLASTZ alignments (Schwartz et al. 2003b). Of the 95,000 bp in the HBB gene complex, 33,642 bp (35%) are in the whole-genome human–mouse–rat alignments, similar to the fraction obtained genome-wide (Gibbs et al. 2004).
Scores based on alignments
Regulatory potential
The regulatory model was trained by using the frequencies of patterns in the alignment of a set of known regulatory regions (Elnitski et al. 2003). The coordinates of the known regulatory regions are available from GALA (Giardine et al. 2003; Elnitski et al. 2005) at http://www.bx.psu.edu/. The neutral model was trained by using frequencies of patterns in the alignment of a set of ancestral repeats, a model for neutral DNA (Hardison et al. 2003b). Patterns in the human–mouse–rat alignment were described by a 10-symbol alphabet using an order 2 Markov model (Kolbe et al. 2004). The RP score assigned to each position is the RP score computed for a 100-bp window centered on the position. Details on the computation are provided at http://www.bx.psu.edu/projects/rp/.
MCS
MCSs were calculated on MultiPipMaker alignments (Schwartz et al. 2003a) of human, mouse, and/or rat sequences of the HBB complex downloaded from University of California, Santa Cruz (UCSC) Genome Browser (Kent et al. 2002).
The MCSs were computed by using WebMCS at http://research.nhgri.nih.gov/MCS/ (Margulies et al. 2003). The data were averaged into 25-bp windows with 1-bp slides. The scoring threshold representing the top 5% fraction was 3.7 for three-way (human–mouse–rat) alignment scores.
phastCons
The phastCons scores (Siepel et al. 2005) were calculated by using the software package Woody, obtained from Adam Siepel (Center for Biomolecular Science and Engineering, UCSC). The parameters used were λ = 0.9, k = 10 rate categories, Rev model; these were the same as those used to generate the data available on the UCSC Genome Browser when the calibration study was done. The probability that each alignment column is observed in the first of the k rate categories, hence the most conserved category, constituted the raw data. These data were then averaged in 100-bp windows with 1-bp slides.
Evaluation of alignment scores for detecting known CRMs
For binary classification, all noncoding aligned positions in the HBB gene complex were assigned to be positive (regulatory) or negative (not regulatory), based on the positions of the known CRMs. Each aligned position was also assigned a score by each method, as described in the previous section. The distributions of scores for the “regulatory” and “nonregulatory” positions were evaluated to determine Sn and Sp at each score threshold. Positions in regulatory regions were classified as true positives (TPs) if they scored at or above a threshold and as false negatives (FNs) if they scored below it. The Sn was calculated as TP/(TP + FN). The positions in nonregulatory regions were classified as true negatives (TNs) if they scored below the threshold and as FPs if they scored at or above it. These results gave the Sp, which is TN/(TN + FP). The threshold at which Sn and Sp were most similar (crossover point) was taken as optimal for the binary classification by position.
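The position-based calibration described above can be illustrated with a minimal Python sketch. It is not the software used in the study: the function names `sn_sp` and `crossover_threshold`, the toy scores, and the grid of candidate thresholds are hypothetical, and the sketch assumes that a position scoring exactly at the threshold counts as a predicted positive in both the regulatory and nonregulatory classes.

```python
from typing import Sequence, Tuple


def sn_sp(scores: Sequence[float], is_regulatory: Sequence[bool],
          threshold: float) -> Tuple[float, float]:
    """Return (Sn, Sp) for one threshold over per-position scores."""
    tp = fn = tn = fp = 0
    for score, regulatory in zip(scores, is_regulatory):
        passes = score >= threshold  # assumption: "at or above" passes
        if regulatory:
            if passes:
                tp += 1
            else:
                fn += 1
        else:
            if passes:
                fp += 1
            else:
                tn += 1
    # Sn = TP/(TP + FN); Sp = TN/(TN + FP); guard against empty classes.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    return sn, sp


def crossover_threshold(scores: Sequence[float],
                        is_regulatory: Sequence[bool],
                        candidates: Sequence[float]) -> float:
    """Return the candidate threshold at which Sn and Sp are most similar."""
    def gap(threshold: float) -> float:
        sn, sp = sn_sp(scores, is_regulatory, threshold)
        return abs(sn - sp)
    return min(candidates, key=gap)


if __name__ == "__main__":
    # Toy data (hypothetical): three regulatory and three other positions.
    toy_scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
    toy_labels = [True, True, True, False, False, False]
    grid = [0.0, 0.25, 0.5, 0.75, 1.0]
    best = crossover_threshold(toy_scores, toy_labels, grid)
    print(best, sn_sp(toy_scores, toy_labels, best))
```

In practice the candidate thresholds would span the observed score range for each method, and ties in the Sn and Sp gap are resolved here by taking the first candidate in the grid, which is a design choice of this sketch rather than something stated in the text.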
A separate analysis was used to evaluate the ability of each score to discriminate the regulatory intervals from nonregulatory DNA, based on the highest score in each interval. In this analysis, the average value for each of the three scores was computed in 100-bp windows (with a 1-bp slide) for all the aligned positions in the HBB complex (and genome-wide for phastCons and RP). Windows whose average score met or exceeded a given threshold comprised the predicted set for that threshold. Overlapping windows were combined to make a single contiguous interval that passed the threshold. Regulatory regions that overlapped an interval that passed the threshold were counted as TPs, and those that did not were FNs. The intervals that passed the threshold but did not overlap with a regulatory region were counted as FPs. Note that the size of each regulatory interval is determined experimentally as the region required for regulation. The sizes vary among CRMs but are not affected by the score threshold. The sizes also vary among the FP intervals, being determined by the scores of overlapping windows; in addition, the sizes can differ for each threshold. Defining a TN interval is difficult, and thus the evaluation was based on interval-based Sn (Snint = TP/[TP + FN]) and the true discovery rate (TP/[TP + FP]). The optimal threshold is approximately the crossover between Snint and the true discovery rate. This evaluation is similar to procedures used to evaluate gene prediction programs (Burset and Guigó 1996). A similar discrimination based on the maximum RP or phastCons score in each interval was performed for several sets of functional elements in the human genome, and cumulative distributions are shown in Figure 3.
Availability
The whole-genome alignments, RP scores, and phastCons scores can be downloaded from or queried at the UCSC Genome Browser (Kent et al. 2002) and Table Browser (Karolchik et al. 2004; http://www.genome.ucsc.edu/) and GALA (Giardine et al. 2003) (http://www.bx.psu.edu/). The promoter, known regulatory region, and miRNA data sets are available from GALA or the original investigators. The reference set of regulatory regions for the HBB gene complex is available at http://www.bx.psu.edu/~ross/dataset/DatasetHome.html, in both hg16 and hg17 coordinates. More information, along with references to the supporting literature, is recorded in dbERGE II (Elnitski et al. 2005), which allows users to obtain the data in a variety of formats (graphical or textual) and in various depths of detail. This resource can be accessed from the home page for Penn State Center for Comparative Genomics and Bioinformatics (http://www.bx.psu.edu/). After selecting the link to dbERGE II, on the Start page select dbERGE II v.2 (for hg16), on the Quick Links page select “Other” for Type of Data and ENCODE region “ENm009β Globin” for Restrict to Region, place no additional filters on the Detailed Experiments page, and select the desired output from the Display Options page. If viewed in the UCSC Genome Browser, links are provided back to dbERGE II for detailed information. A table of the intervals is available upon request. The operations for the evaluations reported here can be performed in a UNIX environment using command-line pipes and wrapper scripts for software that is available on request.
Acknowledgments
This work was supported by NIH grants DK65806 (R.H.), HG02238 (W.M.), and HG02325 (L.E.). We thank Adam Siepel and David Haussler for access to the phastCons scores and programs prior to publication.
Notes
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3642605. Article published online before print in July 2005.