NCBI Prokaryotic Genome Annotation Pipeline and Submission Checks

William Klimke, Azat Badretdin, Dima Dernovoy, Sergei Resenchuk, Tatiana Tatusova
(email: genomes@ncbi.nlm.nih.gov)

NCBI has developed an automatic annotation pipeline (PGAAP) for complete and WGS prokaryotic chromosomes and plasmids. The output of this annotation is intended for manual curation. Many of the procedures described in this document also refer to the cleanup procedures employed during RefSeq processing.

Information on the PGAAP can be found here:
  http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html
Detailed instructions:
  http://www.ncbi.nlm.nih.gov/genomes/static/Annotation_pipeline_README.txt
Microbial Genome Submission Check:
  http://web.ncbi.nlm.nih.gov/genomes/frameshifts/frameshifts.cgi
Information on submission to GenBank can be found on this page:
  http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html
Information on submitting Genome Project information can be found here:
  http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi

1. gene/CDS prediction.

1A. Initial gene/CDS predictions.
Gene predictions combine Glimmer (3.02), GeneMark (2.5f) and GeneMark.HMM (2.6m) predictions (complete genomes), or only GeneMark (2.5f) and GeneMark.HMM (2.6m) predictions (WGS genomes), based on an input-sequence-specific model/matrix built with GeneMarkS (4.6a). For WGS genomes, the longest contigs may be used for the training phase. A heuristic model/matrix (based on GC content) is applied for short sequences.

1B. Start site resolution.
This multi-step process uses a number of methods to correct the initiation site for coding regions. During initial prediction, start conflicts are resolved in favor of GeneMark.HMM predictions (which use the RBS, if any) in the middle of DNA contigs, or in favor of GeneMark predictions at the ends of contigs, to compensate for partial coding regions due to incomplete sequences in WGS submissions.
Once all BLAST results (Clusterprot) are calculated, additional start site correction occurs based on the subjects in the following manner. Only hits with >= 75% identity are used, where the aligned length includes almost the entire query and subject (except for the protein start). Start site correction tallies the votes of subjects:

  1B1. A minimum of 5 positive votes is required for correction to occur.
  1B2. Opposing votes must be less than half the total of the positive votes for correction to occur.

If these two criteria are met, the start site is corrected to the exact position of the positive votes (the start site must match in the query and the positive-vote subjects). Otherwise, no start site correction occurs.

1C. 6-frame translation.
In order to find proteins missed in the above prediction step, all 6 reading frames are translated and used in an additional BLAST search (against Bactprot). The entire nucleotide sequence is translated in all frames, the existing predictions are masked in their exact frame only, and the entire polypeptide is sliced into overlapping intervals, including stop codons, that are used in BLAST searches (approximately 3000 nucleotides, or approximately 1000 amino acids, depending on contig size). Queries that have hits with >= 35% identity, that can be extended to a start and a stop codon, and where the extension does not produce a protein whose length differs from that of the subject by more than 10%, are added to the pool of proteins encoded by a given genome.

2. gene/RNA prediction

tRNA predictions are made using tRNAscan-SE (1.4) with cove models for prokaryotic tRNAs. rRNAs are predicted by BLAST searches against an rRNA database. For 5S rRNA, sequences delimited by BLAST hits (+/- 200 bases) to the 5S rRNA database are searched further against Rfam using the Infernal software (cmsearch). For 16S and 23S rRNA, search results against the respective RNA databases are used to generate the rRNA annotation.

3. frameshift prediction/resolution of overlaps/dropped predictions.

3A. Dropped predictions.
BLAST results against Bactprot are analyzed, and the following predictions may be removed from the annotation results prior to overlap resolution and frameshift detection:

  3A1. All predictions shorter than 60 nucleotides without any BLAST hits.
  3A2. Predictions (of any length) that do not have any BLAST hits, but that overlap (>= 40 nucleotides) an adjacent prediction that does have a BLAST hit.

3B. Frameshift prediction.
An additional step is the identification of potential frameshifted coding regions. Using BLAST results from Bactprot, genes that produce proteins that hit common subjects are collected and analyzed (two nearby genes on a nucleotide sequence that encode proteins that hit the same subject in BLAST results). These are analyzed to determine whether the two genes are the result of a real frameshift or simply a split gene (since gene fusions/splits occur in bacterial genome evolution). When it is decided that a frameshift has occurred, both genes and coding regions are removed and replaced with a misc_feature noting the location; no protein is translated for it or represented in the finished genomic record.

Genes are marked as frameshifted if:

  (Both genes hit at least one common subject and the subjects are not "hypothetical proteins".
   AND
   There are no genes with BLAST hits that are located between the genes of these two proteins.
   AND
   (One of the query proteins is annotated as hypothetical.
    OR
    (Both proteins match by at least two different BLAST hits.
     AND
     Both proteins do not hit full-length subjects (where the sum of the unaligned portion of query and subject < 0.1))))

3C. Overlap resolution.
Adjacent genes are also analyzed for overlaps. As in frameshift detection, a removed coding region/RNA/gene is replaced by a misc_feature that marks the annotation with this information.

3C1. RNA/protein overlaps.
An RNA and a coding region are considered significantly overlapping if the overlap is more than 30 bp.

  3C1A. If the RNA is pseudo or atypical and the coding region encodes a protein that is not hypothetical, then the RNA feature is removed.
  3C1B. Otherwise, if the coding region encodes a "hypothetical protein", then the coding region is removed.
  3C1C. Otherwise, both are left as is.

3C2. Protein/protein overlaps.
Two coding regions are considered significantly overlapping if the location of one completely covers the location of the other. Any coding region in the pair that encodes a "hypothetical protein" is removed. Otherwise all coding regions are left as is.

4. annotation (gene/protein names, etc.)

Sequence similarity search results are calculated for each protein against Bactprot and Clusterprot (see Appendix). Annotation transfer occurs in the following manner.

4A. From curated protein cluster annotation to protein (BLAST against Clusterprot with the top 3 hits belonging to the same protein cluster):
  a. Protein name
  b. Gene name
  c. EC number

4B. If there is no hit to Clusterprot, or the above requirement is not met, then Bactprot results are used:
  a. Protein name is transferred from the top BLAST hit

4C. Additional information is added:
  a. COG note from COGnitor assignment
  b. CDD from RPS-BLAST hit

5. Annotation issues.

A number of checks are now run to detect potential annotation issues.

5A. Discrepancy report.
The tbl2asn utility can now produce a discrepancy report that is used by GenBank and genome submitters to detect potential issues, as detailed here:
  http://www.ncbi.nlm.nih.gov/Genbank/asndisc.html
The options used in PGAAP produce reports for a limited set of potential problems.

  tbl2asn -p path_to_files -t template -a s -V v -Z discrep

  EXTRA_GENES
  RNA_CDS_OVERLAP
  SUSPECT_PRODUCT_NAMES

5B. Conserved functions.
Conserved functions are checked in two ways.

5B1. Universal/near universal functions that are conserved.
Protein clusters that correspond to functions expected to be in all/nearly all taxonomic branches.

5B2. Taxonomically conserved functions.
Protein clusters that correspond to functions that are present in the taxonomic branch of the submitted sequence.

Predicted genes (proteins) are searched against BLAST databases corresponding to 5B1 and 5B2, and the nucleotide sequence is also searched (translated BLAST). For 5B1, if a submission does not contain a conserved function in its protein set, but a nucleotide location is found that hits the same conserved function, then a report is generated with the nucleotide location of the hit. For 5B2, submissions that do not contain the taxonomically conserved function in the set of proteins are simply reported as 'missing', and the nucleotide location of a hit is reported if one is found.

Note that there may be ambiguities in conserved functions. One arises from related protein clusters (one example is the set of tRNA synthetases for isoleucine, leucine and valine). The other may lie in the definition of 'conserved' for a taxonomic branch with a small number of representative genomic sequences, which can skew the results.

Appendix: BLAST databases.

The BLAST databases used for correction of start sites, for 6-frame translation, for sequence similarity searches, and for annotation transfer are the following.

1. Bactprot = all proteins from complete prokaryotic chromosomes and plasmids from RefSeq records, updated every week.
2. Clusterprot = all proteins from protein clusters that are large enough to be represented by at least 3 taxonomic nodes at the level of species or above (note, this is a subset of the proteins in #1), updated every week.
3. NCBI Conserved Domain Database, used in RPS-BLAST searches for domain assignment and retrieved from the NCBI CDD FTP site (a superset including SMART, Pfam, COG, alignments from the LOAD database (Library Of Ancient Domains) contributed by L. Aravind, E. Koonin, and colleagues, and CD alignments curated at NCBI).
4. Protein clusters database (dynamically generated, dependent upon the input taxonomy of the sequence submission).

BLAST for RNA searches uses RNAs collected from complete prokaryotic chromosomes and plasmids, culled of outliers and updated approximately every 3 months.

Protein BLAST searches use the following parameters:
  -FT -e 1e-06 -z 500000000
RPS-BLAST searches:
  -e 1e-02 -d cdd
Conserved function BLAST searches:
  -e 1e-03 -FF
RNA searches:
  5S rRNA BLAST search:           -FF -e 10 -q -1
  16S and 23S rRNA BLAST search:  -FT -e 1e-20
Cmsearch parameters:
  For the 5S rRNA search, a window size of 130 is set, and the 5S rRNA model is used.

References:

1. GeneMark/GeneMark.hmm/GeneMarkS
   Borodovsky M. and McIninch J. 1993. GeneMark: Parallel Gene Recognition for both DNA Strands. Comput. Chem. 17: 123-133.
   Lukashin A. and Borodovsky M. 1998. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 26: 1107-1115.
   Besemer J., Lomsadze A., and Borodovsky M. 2001. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29: 2607-2618.
2. Glimmer
   Delcher A.L., Harmon D., Kasif S., White O., and Salzberg S.L. 1999. Improved microbial gene identification with GLIMMER. Nucleic Acids Res. 27: 4636-4641.
3. Rfam
   Griffiths-Jones S., Bateman A., Marshall M., Khanna A., and Eddy S.R. 2003. Rfam: an RNA family database. Nucleic Acids Res. 31: 439-441.
4. tRNAscan-SE
   Lowe T.M. and Eddy S.R. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25: 955-964.
5. Infernal
   Eddy S.R. 2002. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics 3: 18.
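
Illustration: start site vote (section 1B).

The two voting criteria in section 1B (at least 5 positive votes, and opposing votes fewer than half of the positive votes) can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline's code: the Hit class and its fields are hypothetical names introduced here, hits are assumed to have already passed the near-full-length alignment requirement, and every usable hit that does not support the candidate start is assumed to count as an opposing vote.

```python
from dataclasses import dataclass

MIN_IDENTITY = 75.0     # percent identity cutoff stated in section 1B
MIN_POSITIVE_VOTES = 5  # criterion 1B1


@dataclass
class Hit:
    """Hypothetical record for one Clusterprot BLAST hit (not a pipeline type)."""
    identity: float                 # percent identity of the alignment
    supports_candidate_start: bool  # True if the subject agrees with the candidate start


def should_correct_start(hits):
    """Return True if the candidate start wins the vote under 1B1 and 1B2."""
    # Only hits at or above the identity cutoff take part in the vote.
    usable = [h for h in hits if h.identity >= MIN_IDENTITY]
    positive = sum(1 for h in usable if h.supports_candidate_start)
    # Assumption: every other usable hit is an opposing vote.
    opposing = len(usable) - positive
    # 1B1: at least 5 positive votes.
    # 1B2: opposing votes strictly fewer than half of the positive votes.
    return positive >= MIN_POSITIVE_VOTES and 2 * opposing < positive


example = [Hit(80.0, True)] * 6 + [Hit(90.0, False)] * 2
print(should_correct_start(example))  # True: 6 positive, 2 opposing
```

Writing criterion 1B2 as 2 * opposing < positive keeps the comparison in integers, so the strict "less than half" boundary (for example 6 positive against 3 opposing) is rejected without any rounding.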