NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP)

For information on Eukaryotic Genome Annotation and Assembly, go here.

For specific instructions, check the README file.

For specific procedures, check the NCBI Annotation Procedures.

Overview

The Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) is currently under development.

The pipeline is intended for use during the annotation of genomes in preparation for submission to GenBank, and several external groups have used the NCBI Annotation Pipeline to prepare their submissions. The pipeline is capable of annotating complete genomes as well as WGS genomes consisting of multiple contigs (at least 200 bases per contig).
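The 200-base minimum can be checked before sequences are sent. The following is a minimal sketch in Python, not part of the pipeline itself: the function names `read_fasta` and `short_contigs` and the example contig names are hypothetical, and the FASTA parsing is deliberately simple.

```python
from typing import Dict, Iterator, List, Optional, Tuple

MIN_CONTIG_LENGTH = 200  # minimum bases per contig, as stated above


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def short_contigs(text: str, min_length: int = MIN_CONTIG_LENGTH) -> Dict[str, int]:
    """Return {identifier: length} for every contig shorter than min_length."""
    return {
        name: len(seq)
        for name, seq in read_fasta(text)
        if len(seq) < min_length
    }


# Hypothetical input: contig1 has 240 bases, contig2 has only 40.
example = ">contig1\n" + "ACGT" * 60 + "\n>contig2\n" + "ACGT" * 10 + "\n"
print(short_contigs(example))
```

An empty result means every contig meets the length requirement.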

The pipeline has been used in the RefSeq project to improve the annotation of complete microbial genomes (Daraselia et al., 2003).

If you are interested in using PGAAP, please contact us at NCBI Genomes.

For detailed instructions, view the README file.

Pipeline

The PGAAP combines HMM-based gene prediction methods with a sequence similarity-based approach, in which the predicted gene products are compared to the non-redundant protein database, Entrez Protein Clusters, the Conserved Domain Database, and the COGs (Clusters of Orthologous Groups).

Submitters requesting the use of the annotation pipeline for their genomic sequences submit them to NCBI in FASTA format.

Gene predictions are made using a combination of GeneMark and Glimmer (Borodovsky and McIninch, 1993; Lukashin and Borodovsky, 1998; Delcher et al., 1999). A short step then resolves conflicts between predicted start sites. Ribosomal RNAs are predicted by sequence similarity searching using BLAST against an RNA sequence database and/or using Infernal and Rfam models. Transfer RNAs are predicted using tRNAscan-SE (Lowe and Eddy, 1997).

To detect missing genes, a complete six-frame translation of the nucleotide sequence is done and the proteins predicted above are masked. All predictions are then searched using BLAST against all proteins from complete microbial genomes. Annotation is based on comparison to protein clusters and on the BLAST results. Conserved Domain Database and Clusters of Orthologous Groups information is then added to the annotation. Finally, frameshifts are detected and cleaned up, and the output is sent back to the submitters, who can analyze the results in preparation for submission to GenBank.
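The six-frame translation step can be illustrated with a short Python sketch. It is not the pipeline's own code. It assumes the standard genetic code (this page does not say which translation table the pipeline uses), the names `translate`, `reverse_complement`, and `six_frame_translation` are hypothetical, and the masking of already predicted proteins is not shown.

```python
from itertools import product
from typing import Dict

# Standard genetic code, with codons enumerated in T, C, A, G order.
# Assumption: the pipeline's actual translation table is not stated here.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> Dict[str, str]:
    """Translate a DNA sequence in three forward and three reverse frames."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    frames: Dict[str, str] = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Stop codons appear as `*` in the translated strings, so open reading frames in each frame are the stretches between them.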

End Products

The output of the annotation pipeline can be used for submission to GenBank.
For each genomic contig, the annotation results include:

  • DNA FASTA - *.fsa files
  • Feature table in Sequin format - *.tbl files
  • ASN.1 produced from pairs of table and FASTA sequence files - *.sqn files
  • GenBank format produced from the ASN.1 - *.gbf files
For Bacterial Genome Submission Guidelines, see this page.
Supplementary data are available upon request for further manual evaluation and analysis of the annotation results:
  • BLAST results of predicted proteins against the NCBI non-redundant protein and protein clusters databases
  • Domain assignments for each protein, produced by running rps-BLAST against the CDD database
  • COG assignments, produced by running Cognitor against the COG database
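When the results come back, it can be useful to confirm that every contig has all four file types listed above. The sketch below is one way to do that in Python. It assumes that the files for a contig share a base name and differ only in extension, which this page does not state, and the function name `missing_outputs` and the file names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath
from typing import Dict, Iterable, List, Set

# Extensions and descriptions taken from the list of end products above.
OUTPUT_TYPES: Dict[str, str] = {
    ".fsa": "DNA FASTA",
    ".tbl": "feature table in Sequin format",
    ".sqn": "ASN.1 produced from the table and FASTA files",
    ".gbf": "GenBank format produced from the ASN.1",
}


def missing_outputs(filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Return {base name: missing extensions} for incomplete contigs.

    Assumption: all files for one contig share the same base name.
    """
    found: Dict[str, Set[str]] = defaultdict(set)
    for filename in filenames:
        path = PurePath(filename)
        if path.suffix in OUTPUT_TYPES:
            found[path.stem].add(path.suffix)

    expected = set(OUTPUT_TYPES)
    return {
        stem: sorted(expected - suffixes)
        for stem, suffixes in found.items()
        if expected - suffixes
    }


# Hypothetical file names: contig2 lacks its .sqn and .gbf files.
files = [
    "contig1.fsa", "contig1.tbl", "contig1.sqn", "contig1.gbf",
    "contig2.fsa", "contig2.tbl",
]
for stem, missing in missing_outputs(files).items():
    print(stem, "is missing", ", ".join(missing))
```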

README and Submission

It is essential that the submission is in the proper format before we can proceed. This README file shows the correct steps and file formats.

Anyone wishing to submit sequences to the annotation pipeline must contact us first at: NCBI Genomes

References

1. GeneMark
Borodovsky M and McIninch J. 1993. GeneMark: Parallel Gene Recognition for both DNA Strands. Comput. Chem. 17: 123-133.

2. GeneMark.hmm
Lukashin A. and Borodovsky M. 1998. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 26(4): 1107-1115. PMID: 9461475

3. GeneMarkS
Besemer, J., Lomsadze, A., and Borodovsky, M. 2001. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29(12): 2607-2618. PMID: 11410670

4. Glimmer
Delcher A L, Harmon D, Kasif S, White O and Salzberg S L. 1999. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27: 4636-4641. PMID: 10556321

5. Shewanella oneidensis
Daraselia N, Dernovoy D, Tian Y, Borodovsky M, Tatusov R, Tatusova T. 2003. Reannotation of Shewanella oneidensis genome. OMICS. 7(2): 171-175. PMID: 14506846

6. Rfam
Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A., and Eddy, S.R. 2003. Rfam: an RNA family database. Nucleic Acids Res. 31(1): 439-441. PMID: 15608160

7. tRNAscan-SE
Lowe, T.M. & Eddy, S.R. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucl. Acids Res. 25: 955-964. PMID: 9023104

8. Infernal
Eddy, S.R. 2002. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics. 3: 18. PMID: 12095421


Revised Feb 6, 2008
