![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||
Copyright © 2005, Cold Spring Harbor Laboratory Press Whole genome shotgun sequencing of Brassica oleracea and its application to gene discovery and annotation in Arabidopsis The Institute for Genomic Research, Rockville, Maryland 20850, USA 1Current address: Pioneer Hi-Bred International, Inc./A DuPont Company, Johnston, IA 50131-0552. 2Corresponding author. E-mail cdtown/at/tigr.org; fax (301) 838-2808. Received August 21, 2004; Accepted January 5, 2005. This article has been cited by other articles in PMC.Abstract Through comparative studies of the model organism Arabidopsis thaliana and its close relative Brassica oleracea, we have identified conserved regions that represent potentially functional sequences overlooked by previous Arabidopsis genome annotation methods. A total of 454,274 whole genome shotgun sequences covering 283 Mb (0.44×) of the estimated 650 Mb Brassica genome were searched against the Arabidopsis genome, and conserved Arabidopsis genome sequences (CAGSs) were identified. Of these 229,735 conserved regions, 167,357 fell within or intersected existing gene models, while 60,378 were located in previously unannotated regions. After removal of sequences matching known proteins, CAGSs that were close to one another were chained together as potentially comprising portions of the same functional unit. This resulted in 27,347 chains of which 15,686 were sufficiently distant from existing gene annotations to be considered a novel conserved unit. Of 192 conserved regions examined, 58 were found to be expressed in our cDNA populations. Rapid amplification of cDNA ends (RACE) was used to obtain potentially full-length transcripts from these 58 regions. The resulting sequences led to the creation of 21 gene models at 17 new Arabidopsis loci and the addition of splice variants or updates to another 19 gene structures. In addition, CAGSs overlapping already annotated genes in Arabidopsis can provide guidance for manual improvement of existing gene models. Published genome-wide expression data based on whole genome tiling arrays and massively parallel signature sequencing were overlaid on the Brassica–Arabidopsis conserved sequences, and 1399 regions of intersection were identified. Collectively our results and these data sets suggest that several thousand new Arabidopsis genes remain to be identified and annotated. Arabidopsis thaliana is the first plant model species to have been sequenced on a genome-wide scale. Since the publication of its genome in the year 2000 (The Arabidopsis Genome Initiative 2000), there has been a continuous effort to complete sequence gaps and to improve gene annotation. Large-scale sequencing of expressed sequence tags (ESTs) and full-length cDNAs from Arabidopsis and other plants, coupled with better ab initio gene structure prediction programs, have contributed to improvements in the annotation of Arabidopsis (Wortman et al. 2003). Notable among them include the refinement of exon–intron boundaries and the addition of untranslated regions (UTRs) (Haas et al. 2002, 2003; Zhu et al. 2003). However, despite the availability of a great number of ab initio and database search–based gene prediction methods, identification and annotation of all the genes in any organism is still one of the major challenges in biology. This is partly because the existing EST/cDNA sequences in databases do not represent all transcription units and/or available gene-prediction programs are limited in their capacity. Comparative genomics is rapidly emerging as a powerful tool for genome analysis and annotation. Over the course of evolution, functional regions such as exons and regulatory sequences tend to be more conserved than are nonfunctional regions, thus local sequence similarity has implications for biological functionality. Several recent studies have demonstrated that a comparative genomics approach is useful in refining gene predictions in human (Ansari-Lari et al. 1998; Bouck et al. 2000; Mallon et al. 2000; Flicek et al. 2003; Gugio et al. 2003), Drosophila (Bergman et al. 2002), yeast (Brachat et al. 2003), and Plasmodium (Carlton et al. 2002). B. oleracea and A. thaliana diverged 15–20 million years ago (Yang et al. 1999). Earlier comparative mapping studies of Brassica and Arabidopsis using molecular markers revealed extensive synteny between these two species, suggesting that knowledge gained in one species can be productively applied to the other (Lan et al. 2000; Babula et al. 2003) although deviation from colinearity was also noted (Kowalski et al. 1994; Sadowski et al. 1996; Ryder et al. 2001). Nucleotide sequence conservation between these two species has been reported to be in the range of 75%–90% in exons, whereas in introns and intergenic regions, it is ≤70% (Quiros et al. 2001). Thus a genome scale comparison of Arabidopsis with Brassica at the sequence level provides an excellent opportunity to test the applicability of this phylogenetic footprinting approach in the annotation of plant genomes. To this end, we and the groups at Cold Spring Harbor Laboratories and Washington University carried out Brassica whole genome shotgun (WGS) sequencing, collectively generating 538,418 good reads. These data were aligned with and compared to Data Release 4.0 of The Institute for Genomic Research (TIGR) (ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/PREVIOUS_RELEASE_VERSIONS/release4.tar.gz) Arabidopsis genome sequence to discover new genes and refine the existing Arabidopsis annotation. Our in silico analysis was augmented with in vitro validation of conserved sequences for transcription by using PCR. This work, a similar effort by our colleagues at Cold Spring Harbor (Katari et al. 2005), and the use of the Brassica WGS sequence alignments, along with Arabidopsis cDNA alignments to the Arabidopsis genome to train TWINSCAN (Korf et al. 2001; P. Hu and M. Brent, in prep.), represent the first efforts to utilize comparative whole genome sequencing to improve gene annotation in plants. The Gramene database (www.gramene.org) also provides comparative alignments between maize and rice genomic sequence, but these have not yet been utilized in the improvement of gene models. Results Brassica genome sequencing The WGS effort at TIGR resulted in 415,093 distinct high-quality reads of 667 bases average length. The distribution of these sequences across the various libraries constructed is shown in Supplement 1, Table S1. These sequences were submitted to the Genome Survey Sequence (GSS) division of GenBank. Accession numbers can be found in Supplement 2. Another set of 122,897 high-quality reads generated by Cold Spring Harbor Laboratory and Washington University in St. Louis (MO) were acquired from GenBank (Katari et al. 2005). Prior to the identification of conserved sequences, these sequences were searched against chloroplast, mitochondrial, ribosomal DNA, transposon-related sequences and repeats, and matching sequences were removed. The filtered combined sequence set consists of 454,274 reads with an average read length of 623 bases, representing a total of 283.0 Mb (0.44× coverage) of the B. oleracea genome, estimated at 650 Mb (Paterson et al. 2001). Mapping Brassica WGS sequences to the Arabidopsis genome The individual Brassica WGS sequences were mapped to the Arabidopsis chromosome sequences by using BLASTZ (Schwartz et al. 2000, 2003) at its default settings. Since many reads mapped to more than one genome location, the position of each WGS sequence was determined by the highest scoring alignment cluster. Of the 197,344 Brassica sequences that could be mapped to the Arabidopsis genome, 144,821 were found to intersect 84% (23,005 out of 27,384) of the annotated genes and 54% (84,481 out of 157,737) of the annotated exons. Incremental sampling of the Brassica sequences that matched the Arabidopsis genome provides an indication of the extent of gene and exon coverage as a function of the number of sequence reads (Fig. 1
The average coverage of the Arabidopsis genome by Brassica reads on a nucleotide basis (using only alignments from the topscoring read) was calculated within 5-kb windows across each chromosome, scoring only nucleotides in aligned positions. The distribution of these matches along the Arabidopsis chromosome 1 is shown in Figure 2
Distribution of conserved regions across the Arabidopsis genome and the prediction of novel genes and exons To evaluate the relationship between Brassica–Arabidopsis sequence conservation and Arabidopsis gene structure, the filtered Brassica sequences were aligned against a sample of 1000 Arabidopsis genes whose structures were supported by full-length cDNAs. An example of such alignments to a single Arabidopsis gene (At1g20090) is shown in Figure 3
BLASTZ alignments against this set of cDNA-validated gene models were evaluated with respect to sensitivity and specificity for detection of nucleotides in exons by using sets of alignments having a range of either minimum percentage identity or minimum match length. Specificity showed steady increase with both match length and percentage identity up to a certain point, whereas the opposite was true for sensitivity (Fig. 4
For the purposes of enumerating intergenic regions that might represent novel genes or other conserved regions such as noncoding RNAs, regulatory elements, and conserved noncoding sequences (CNSs) (Kaplinsky et al. 2002), individual CAGSs were joined together into “chains” (Fig. 5C
Improving existing gene annotation using CAGSs A set of 370 BACs, each containing intergenic chains comprised of three or more CAGSs, was identified and subjected to manual curation by using our ATH1 database and Annotation Station, resulting in the creation of 104 new gene models, identification of an additional 295 pseudogenes, and updates to 60 existing genes models. The Brassica alignments alone do not provide sufficient information for the creation of new gene models or the precise definition of splice junctions, even with careful manual curation, much less simply by computation. Thus we have not attempted to use the remaining set of CAGSs for automated gene refinement. Investigation of expression from CAGSs Transcriptional activity was assessed from 192 conserved regions selected approximately equally from the singleton and paired (chains of two) CAGS data sets, without regard to their location with respect to neighboring genes. PCR was performed on a population of cDNAs pooled from diverse Arabidopsis tissues that have been shown to contain transcripts for 70%–80% of all hypothetical genes assayed (Xiao et al. 2002). The sequences of the CAGSs examined and primers used are provided in Supplements 5 through 7. PCR products were sequenced to determine whether they indeed arose from the target locus. The results are summarized in Table 3. It is striking that although singleton and paired CAGSs yielded approximately the same the number of highquality sequences, the proportion of these sequences that were derived from the expected target was much higher for the paired (~45%) than for the singleton CAGSs (~18%). The reason(s) for this difference are not clear. All the paired CAGS sequences were unique in the genome in that there were no other regions with ≥80% identity to any part of the CAGS. Among the singleton CAGSs, 86% were unique while the remainder had one or more additional matches of >80% identity over ≥80% of their length (from two to nine repeats). However, among the singleton CAGSs that were repeated in the genome, 50% of those that sequenced successfully matched their target, while the other 50% did not. Thus the higher proportion of sequences from singleton CAGSs that did not match their targets cannot be ascribed to their repetitive nature but is more likely due to the idiosyncracies of PCR possibly coupled with the shorter target regions. The average length of the paired CAGS targets was 382 ± 9 bp (SEM), whereas for the singleton CAGSs it was 176 ± 9 (SEM). Table 3 also shows the breakdown of expression between perigenic (27%) and truly intergenic CAGSs (32%).
Cloning of full-length cDNAs and its application to genome annotation Those regions showing evidence of expression were subjected to 5′ and 3′ RACE analysis (Frohman et al. 1988). The resulting sequences from each location (33 paired and 18 singleton CAGSs, with seven loci producing no good sequences) were assembled by using TIGR assembler (Sutton et al. 1995) as described previously (Xiao et al. 2002) and a total of 70 transcript assemblies submitted to GenBank as FL-cDNAs or partial cDNAs (GenBank accession nos. AY299234–AY299303). Subsequently, the same sequences were aligned against the Arabidopsis genome and assembled into a minimal set of distinct transcripts by using the PASA pipeline (Haas et al. 2003). Of the 70 assemblies, 63 aligned properly to the genome sequence and were collapsed by PASA into 49 alignment assemblies. From these, 21 new genes were modeled at 19 loci (two having splice variants), new splice variants were added to two existing genes and the structures of another 17 existing genes were updated. The other nine alignment assemblies matched unannotated intergenic regions but provided insufficient information to instantiate new gene models. In almost every case, the regions in which the experimental data supported the annotation of a novel gene were already spanned by one or more gene predictions. However, there were always significant inconsistencies between the gene predictions in that region such that no new gene could have been modeled on the basis of predictions alone. Furthermore, the experimentally supported gene model always differed at one or more splice sites from any of the predictions. A list of new genes instantiated and existing genes matched and updated in this study is shown in Table 4. Some examples of newly created genes are shown in Figure 6
Global expression profiling of CAGSs In a recent publication, Yamada et al. (2003) described the development of an Arabidopsis whole genome tiling array and its deployment to characterize genome-wide transcription. Their “intergenic clusters” correspond to regions of the genome that were unannotated in TIGR release 3.0 but to which Arabidopsis ESTs could be mapped. Most of these clusters were incorporated into annotated gene models in later releases. Their “intergenic regions” contained neither an annotated gene nor an intergenic cluster. We have mapped both the intergenic clusters and the intergenic regions from the Yamada et al. (2003) data set onto our intergenic chains of CAGSs and find that 102 expressed clusters and 458 expressed intergenic regions intersect our 15,686 intergenic CAGS chains. Similarly, we have taken the massively parallel signature sequencing (MPSS) data provided by Blake Meyers (Meyers et al. 2004) and find that 1060 CAGSs of the 15,686 CAGS chains contain a significantly expressed MPSS signature. Together these two data sets provide evidence for expression of 1399 intergenic chains of CAGS. Discussion So far, the annotation of Arabidopsis and other plant genomes has relied on ab initio gene prediction and alignment of known features (ESTs, cDNAs, proteins, functional domains, etc.) against the target genome. In this work we have described a comparative genomic approach to annotation by using genomewide alignment of WGS sequences from a fairly close relative (B. oleracea) to a finished and annotated genome (Arabidopsis) to direct both in silico and experimentally based improvements to the genome annotation. Mapping Brassica reads onto the Arabidopsis genome At 0.44× coverage of the B. oleracea genome, the number of Arabidopsis genes having a Brassica match (84%) appears to be approaching saturation, while the number exons matched (54%) is far from saturated. Thus further sequencing of Brassica should lead to matches to more exons, a fraction of which would be part of novel genes. The relatively high level of coverage of the Arabidopsis genome by <0.5× coverage of the B. oleracea genome is most likely due to the fact that this genome is triplicated with respect to Arabidopsis (Cavell et al. 1998; Lagercrantz 1998). It is also interesting to note that in spite of the prior removal of sequences matching transposable element ORFs and known Arabidopsis repeats, there is still a higher density of Brassica alignments near the centromeres of each chromosome (e.g., between ~12.5 and 14.8 Mb and between 15.2 and 16.5 Mb on chromosome 1). This suggests the presence in the Brassica genome of additional classes repetitive sequences not removed by our filtration process that are also found in (peri)-centromeric regions of the Arabidopsis genome. Sequence conservation between Brassica and Arabidopsis and the development and chaining of CAGSs BLASTZ produced a very large number of HSP alignments between the individual Brassica reads and the Arabidopsis genome. To maximize the number of conserved regions identified at the expense of specificity, values of minimum match length (40 bp) and minimum percentage identity (60%) toward the low side of the broad maxima for Sn.Sp were used to generate CAGSs for further analysis. Although the cut-off value of 60% identity for exon definition is lower than that reported by Quiros et al. (2001), our Sn.Sp analysis and the results exemplified in Figure 3 One possible factor contributing to the effectiveness of the lower values for exon detection is the fact that in these Brassica–Arabidopsis alignments (CAGSs), no distinction is made between alignments between possible orthologs and paralogs. For alignments between a paralogous Brassica–Arabidopsis pair, it is likely that both exon and intron similarity are reduced while the exon–intron contrast is maintained. We should also note that, based on our experience with Arabidopsis genome annotation (Wortman et al. 2003), alignments between paralogous genes are often informative in defining gene structure. Consistent with observation from many comparative genomics studies, the majority of the 229,735 CAGSs matched existing gene models, although the boundaries of conservation delineated by the BLASTZ alignments often do not coincide with experimentally determined gene boundaries. Approximately one-fourth (60,378) of the CAGSs fell between annotated genes and were classified as intergenic. The average coverage per nucleotide of 4.04 and 5.78 in the genic and intergenic CAGSs, respectively, likely reflects contributions both from the triplicated nature of the Brassica genome and the contribution to CAGSs of both orthologous and paralogous sequences that cannot be distinguished with the limited Brassica sequence available. Expression of CAGS chains Expression analysis showed that ~30% (58 of 192) of CAGSs investigated were expressed. Cloning and sequencing of 5′ and 3′ RACE products from these regions allowed us to create 23 new gene models and to update the annotation of another 17 genes. Additional evidence for CAGS expression comes from our analysis of published data on whole genome tiling arrays (Yamada et al. 2003) and MPSS (Meyers et al. 2004), which revealed that ~10% of all intergenic CAGSs may produce transcripts. The higher proportion of expressed CAGSs detected by our PCR-based approach (~30%) likely reflects the greater sensitivity of this method. Taken together, these results suggest that between 10% and 30% of the ≥15,000 intergenic chains of CAGSs may correspond to novel and as yet to be annotated genes. Further analysis of the possible identity of CAGSs as genes will require targeted approaches such as high-throughput RACE/PCR, since current microarray technologies are limited in their ability to detect low level transcripts from such regions. Other clues as to the function of CAGSs may come from the study of T-DNA, transposon, or RNAi mutants that disrupt their function. Use of CAGS in modeling gene structures Because CAGSs do not precisely define exon boundaries, they cannot be used to automatically model or update gene structures. However, they do serve to flag regions where manual intervention can improve gene models as described above for 370 BACs examined. They also provide credibility for hypothetical gene predictions. Release 4.0 of the TIGR annotation contained 4253 hypothetical genes, of which 3433 overlapped one or more CAGSs. The Brassica–Arabidopsis genome alignments generated in this study were used to train a novel gene prediction program, Twinscan (Korf et al. 2001). Of 2229 genes predicted by Twinscan in previously unannotated regions of the genome, 1674 coincide with our CAGSs. Based on experimental evidence, Twinscan is reported to perform somewhat better than other ab initio gene prediction programs on Arabidopsis (M. Brent, pers. comm.). Although, our limited investigations reported here support this observation, they also indicate that not all Twinscan predictions are correct and that expressed genes can be found in regions defined by CAGS chains that are devoid of a Twinscan prediction. Summary In this study, we have shown that WGS sequencing of a related species can be a valuable guide to identify genes that were not properly annotated and also to locate regions of the genome that have escaped the annotation process. Expression analysis of a sample of CAGS using PCR-based methods indicated that as many as 30% of the CAGS chains might be expressed, and showed that RACE can be effectively used to obtain full-length sequences and model gene structures. In future studies, we plan to use the global expression profiling data to target those regions most likely to harbor expressed genes for FL-cDNA cloning. In addition to its use to identify and improve the annotation of protein coding genes in Arabidopsis sequence conservation with Brassica, the conserved sequences can also be used to identify other genome features such as promoter elements (Colinas et al. 2002) and noncoding RNAs (Rivas and Eddy 2001), but the sequence information reported here has not yet been fully exploited for these purposes. Methods DNA sequencing Several small (2–3 kb) and one large (10–15 kb) insert libraries were constructed in the high- or medium-copy vectors pHOS1 and pHOS2 by standard in-house methods (Fleischmann et al. 1995) using either total or “nuclear” DNA (generously provided by Dr. Pablo Rabinowicz) isolated from TO1000DH3, a doubled haploid derived from a rapid cycling Brassica oleracea kindly provided by Dr. Tom Osborn. End sequences were generated on ABI PRISM 3700 DNA sequencers with ABI “Big Dye” dye terminator cycle sequencing kits. In silico identification of conserved segments The filtered sequences were aligned against version 4.0 of the TIGR Arabidopsis Genome Annotation by using BLASTZ (Schwartz et al. 2003), the sequence alignment program underlying PIPMaker (Schwartz et al. 2000), at its default setting. To determine search parameters that would best discriminate between the alignments of the Brassica reads to Arabidopsis exons relative to introns, a sample of 1000 Arabidopsis genes that were supported by FL-cDNAs and had matches to Brassica sequences were selected, 200 from each of the five chromosomes, and concatenated. The Brassica WGS sequences were searched against this data set, and the resulting alignments were filtered by using a range of values for minimum percentage identity and minimum match length. Sensitivity and specificity values for the identification of exons from BLASTZ alignments were calculated at each nucleotide position from regions of the genomic sequence where exons and introns intersected with the regions spanned by BLASTZ alignments. Sensitivity (probability that a base known to be in an exon is found within a BLASTZ alignment) and specificity (probability that a base found within a BLASTZ alignment is actually in an exon) were calculated as follows: BLASTZ alignment segments were collapsed into CAGSs by using appropriate values for minimum match length and minimum percentage identify as described in Figure 5 Laboratory validation of CAGS chains A sample of 192 of the longer singleton and paired CAGS chains was selected for in vitro validation. Primers were designed within the CAGS, one in each CAGS in the case of paired CAGSs, avoiding regions close to the gap. PCR was performed by using a cDNA population derived from a variety of Arabidopsis tissues and treatments as template (Xiao et al. 2002). PCR products were treated with a mixture of alkaline phosphatase and exonuclease to remove residual primers and deoxynucleotide triphosphates and then sequenced from both ends using the same PCR primers. To obtain RACE products for the expressed regions, a second set of nested primers was designed and used for 5′ and 3′ RACE. Four to five independent clones from each of the 5′- and 3′-RACE products were sequenced from both ends by using generic sequencing primers, and the sequences from each locus tested were assembled by using TIGR assembler (Sutton et al. 1995). The assembled cDNA sequences for each gene were aligned against the corresponding CAGS sequences by using the AAT program (Huang et al. 1997). This was followed by manual inspection and submission of the assemblies to GenBank. Acknowledgments We thank Drs. Tom Osborn and Pablo Rabinowicz for providing the Brassica seed and DNA, respectively. We gratefully acknowledge the contributions and support of the TIGR sequencing, bioinformatics and information technology and Arabidopsis annotation groups. We also thank Dr. Michael Brent and Dr. Blake Meyers for discussions and access to data for Twinscan predictions and MPSS, respectively, as well as Dr. Steven Salzberg for advice on the specificity and sensitivity analysis. This project was supported by National Science Foundation (Cooperative Agreement DBI 9813586). Notes Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3176505. Footnotes [Supplemental material is available online at www.genome.org. The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: T. Osborn and P. Rabinowicz.] References
Web site references
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||
Nature. 2000 Dec 14; 408(6814):796-815.
[Nature. 2000]Plant Physiol. 2003 Jun; 132(2):461-8.
[Plant Physiol. 2003]Genome Biol. 2002; 3(6):RESEARCH0029.
[Genome Biol. 2002]Nucleic Acids Res. 2003 Oct 1; 31(19):5654-66.
[Nucleic Acids Res. 2003]Plant Physiol. 2003 Jun; 132(2):469-84.
[Plant Physiol. 2003]Genome Res. 1998 Jan; 8(1):29-40.
[Genome Res. 1998]Nat Genet. 2000 May; 25(1):31-3.
[Nat Genet. 2000]Genome Res. 2000 Jun; 10(6):758-75.
[Genome Res. 2000]Genome Res. 2003 Jan; 13(1):46-54.
[Genome Res. 2003]Proc Natl Acad Sci U S A. 2003 Feb 4; 100(3):1140-5.
[Proc Natl Acad Sci U S A. 2003]J Mol Evol. 1999 May; 48(5):597-604.
[J Mol Evol. 1999]Genome Res. 2000 Jun; 10(6):776-88.
[Genome Res. 2000]Mol Genet Genomics. 2003 Feb; 268(5):656-65.
[Mol Genet Genomics. 2003]Genetics. 1994 Oct; 138(2):499-510.
[Genetics. 1994]Mol Gen Genet. 1996 Jun 12; 251(3):298-306.
[Mol Gen Genet. 1996]Genome Biol. 2001; 2(3):REVIEWS1011.
[Genome Biol. 2001]Genome Res. 2000 Apr; 10(4):577-86.
[Genome Res. 2000]Genome Res. 2003 Jan; 13(1):103-7.
[Genome Res. 2003]Plant Physiol. 2002 Dec; 130(4):2118-28.
[Plant Physiol. 2002]Proc Natl Acad Sci U S A. 1988 Dec; 85(23):8998-9002.
[Proc Natl Acad Sci U S A. 1988]Plant Physiol. 2002 Dec; 130(4):2118-28.
[Plant Physiol. 2002]Nucleic Acids Res. 2003 Oct 1; 31(19):5654-66.
[Nucleic Acids Res. 2003]Science. 2003 Oct 31; 302(5646):842-6.
[Science. 2003]Nat Biotechnol. 2004 Aug; 22(8):1006-11.
[Nat Biotechnol. 2004]Genome. 1998 Feb; 41(1):62-9.
[Genome. 1998]Genetics. 1998 Nov; 150(3):1217-28.
[Genetics. 1998]Genetics. 2001 Mar; 157(3):1321-30.
[Genetics. 2001]Plant Physiol. 2003 Jun; 132(2):461-8.
[Plant Physiol. 2003]Science. 2003 Oct 31; 302(5646):842-6.
[Science. 2003]Nat Biotechnol. 2004 Aug; 22(8):1006-11.
[Nat Biotechnol. 2004]Bioinformatics. 2001; 17 Suppl 1():S140-8.
[Bioinformatics. 2001]Plant Physiol. 2002 Jun; 129(2):451-4.
[Plant Physiol. 2002]BMC Bioinformatics. 2001; 2():8.
[BMC Bioinformatics. 2001]Science. 1995 Jul 28; 269(5223):496-512.
[Science. 1995]Genome Res. 2003 Jan; 13(1):103-7.
[Genome Res. 2003]Genome Res. 2000 Apr; 10(4):577-86.
[Genome Res. 2000]Plant Physiol. 2002 Dec; 130(4):2118-28.
[Plant Physiol. 2002]Genomics. 1997 Nov 15; 46(1):37-45.
[Genomics. 1997]