Nucleic Acids Res. 1998 Jun 15; 26(12): 2941–2947.
PMCID: PMC147632

Combining diverse evidence for gene recognition in completely sequenced bacterial genomes.


Analysis of a newly sequenced bacterial genome starts with identification of protein-coding genes. Functional assignment of proteins requires exact knowledge of protein N-termini. We present a new program, ORPHEUS, that identifies candidate genes and accurately predicts gene starts. The analysis starts with a database similarity search and identification of reliable gene fragments. These fragments are used to derive statistical characteristics of protein-coding regions and ribosome-binding sites, and to predict the complete set of genes in the analyzed genome. In a test on the Bacillus subtilis and Escherichia coli genomes, the program correctly identified 93.3% and 96.3%, respectively, of experimentally annotated genes longer than 100 codons described in the PIR-International database, and for these genes 96.3% and 83.9% of starts were predicted exactly. Furthermore, 98.9% and 99.1% of genes longer than 100 codons annotated in GenBank were found, and 92.9% and 75.7% of predicted starts coincided with the feature table description. Finally, for the complete gene complements of B. subtilis and E. coli, including genes shorter than 100 codons, gene prediction accuracy was 88.9% and 87.1%, respectively, with 94.2% and 76.7% of starts coinciding with the existing annotation.
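The abstract does not specify the scoring functions ORPHEUS uses, so the following is only a minimal sketch of the general idea, not the program's actual method. It assumes that the reliable fragments are already available as in-frame DNA strings, derives a codon log-odds table from them against a uniform background, and picks a start codon by adding a crude ribosome-binding-site score. All function names, the `AGGAGG` Shine-Dalgarno consensus, the 20-nt upstream window, and the weighting are illustrative assumptions; spacing between the binding site and the start codon is ignored.

```python
import math
from collections import Counter
from itertools import product

# Common bacterial start codons (an assumption of this sketch).
START_CODONS = {"ATG", "GTG", "TTG"}


def codons(seq):
    """Split an in-frame sequence into complete codons, dropping any remainder."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def train_codon_log_odds(fragments, pseudocount=1.0):
    """Log-odds of each codon in trusted fragments versus a uniform background."""
    counts = Counter(c for frag in fragments for c in codons(frag))
    total = sum(counts.values()) + 64 * pseudocount
    table = {}
    for triplet in product("ACGT", repeat=3):
        codon = "".join(triplet)
        freq = (counts[codon] + pseudocount) / total
        table[codon] = math.log(freq * 64)  # freq divided by 1/64
    return table


def coding_score(seq, table):
    """Sum of codon log-odds over an in-frame sequence."""
    return sum(table.get(c, 0.0) for c in codons(seq))


def rbs_score(upstream, motif="AGGAGG"):
    """Best number of positions matching the motif in any window of `upstream`."""
    best = 0
    for i in range(len(upstream) - len(motif) + 1):
        window = upstream[i:i + len(motif)]
        best = max(best, sum(x == y for x, y in zip(window, motif)))
    return best


def pick_start(seq, orf_start, orf_end, table, rbs_weight=1.0, window=20):
    """Return (position, score) of the best in-frame start codon, or None.

    `orf_start` must be in frame with the ORF; `orf_end` is the index of the
    stop codon (exclusive end of the coding region).
    """
    best = None
    for pos in range(orf_start, orf_end - 3, 3):
        if seq[pos:pos + 3] not in START_CODONS:
            continue
        upstream = seq[max(0, pos - window):pos]
        score = coding_score(seq[pos:orf_end], table) + rbs_weight * rbs_score(upstream)
        if best is None or score > best[1]:
            best = (pos, score)
    return best
```

In this sketch a longer candidate gene accumulates more coding score, so the upstream-most start tends to win unless a downstream start has a clearly better binding-site match; a real predictor would need a better-calibrated balance between the two signals.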



Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

