NCBI Prokaryotic Genome Annotation Process

NCBI has developed a new approach to genome annotation that combines alignment-based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. The Prokaryotic Genome Annotation Pipeline (PGAP) determines structural annotation by comparing open reading frames (ORFs) to libraries of protein hidden Markov models (HMMs), representative RefSeq proteins, and proteins from well-characterized reference genomes. GeneMarkS+ then makes ab initio coding region predictions for genomic regions that lack HMM or protein evidence and selects start sites for ORFs whose evidence comes from HMMs.

The flowchart below describes the major components of the pipeline:

[Figure: Flowchart describing the major components of the pipeline]

Structural annotation of proteins

All ORFs are tested against a library of HMMs. Currently, PGAP uses the HMM libraries from TIGRFAM and Pfam; a new library created from our previously described PRK clusters; and HMMs (NCBIfams) custom-built to identify high-value protein families, including proteins involved in antimicrobial resistance. In addition, to improve the annotation of highly specialized, high-interest, and relatively rare proteins, PGAP uses the BlastRules database. Lineage-specific high-quality reference proteins and proteins identified by ORF hits to the protein cluster set are (re)aligned to the genome using ProSplign, a frameshift-aware protein aligner. GeneMarkS+, an extension of the GeneMarkS ab initio gene-finding program, makes ab initio coding region predictions for genomic regions that lack HMM or protein evidence and selects start sites for ORFs whose evidence comes from HMMs.
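The idea of "genomic regions that lack HMM or protein evidence" can be illustrated with a small interval computation. The sketch below is not PGAP or GeneMarkS+ code; the function name and the half-open `(start, end)` interval representation are assumptions made for illustration, and it treats the sequence as linear and ignores strand.

```python
def regions_without_evidence(genome_length, evidence):
    """Return the half-open (start, end) intervals not covered by any evidence.

    `evidence` is a list of half-open (start, end) intervals on a linear
    sequence of length `genome_length` (a hypothetical representation).
    """
    gaps = []
    cursor = 0  # first position not yet known to be covered
    for start, end in sorted(evidence):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < genome_length:
        gaps.append((cursor, genome_length))
    return gaps
```

In this simplified picture, the returned intervals are the stretches where only ab initio prediction is available, while the covered stretches are those where alignment evidence can guide the gene model.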

Non-coding RNA

Structural RNAs/small ncRNAs

Structural RNAs (5S, 16S, and 23S rRNAs) are highly conserved in closely related prokaryotic species. For the 16S and 23S rRNAs, the NCBI Reference Sequence Collection (RefSeq) contains a curated set of reference sequences. The pipeline uses a BLASTn search against the reference set to identify these rRNAs. 5S rRNAs and small ncRNAs are identified using RFAM HMMs, and these hits are further refined using Cmsearch. Partial alignments that fall below 50% of the average length are dropped.
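The length filter for partial alignments can be sketched as follows. This is a minimal illustration, not the pipeline's code: it assumes that "average length" means the mean length of the reference sequences for the same RNA type, and the function name and the `aligned_length` key are hypothetical.

```python
def filter_partial_alignments(hits, reference_lengths, min_fraction=0.5):
    """Drop hits shorter than `min_fraction` of the mean reference length.

    `hits` is a list of dicts with an "aligned_length" key (hypothetical
    layout); `reference_lengths` holds the lengths of the reference
    sequences for the same RNA type (an assumed reading of "average length").
    """
    if not reference_lengths:
        raise ValueError("reference_lengths must not be empty")
    average = sum(reference_lengths) / len(reference_lengths)
    threshold = min_fraction * average
    return [hit for hit in hits if hit["aligned_length"] >= threshold]
```

With reference lengths of 1500, 1540, and 1520 nt, the mean is 1520 nt, so a 759 nt hit would be dropped and a 760 nt hit kept.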

tRNAs

To identify tRNA genes, the input genome sequence is split into ~200 nt windows with an overlap of ~100 nt and passed through tRNAscan-SE. tRNAscan-SE identifies 99–100% of transfer RNA genes in DNA sequence while giving less than one false positive per 15 gigabases. It is currently one of the most powerful tRNA identification tools, and uses different, targeted parameter sets for Archaea and Bacteria. tRNA predictions below a tRNAscan-SE score of 20 are discarded.
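The windowing and score cutoff described above can be sketched as below. This is an illustration under stated assumptions: the window and overlap are taken as exactly 200 and 100 nt (the text gives them only approximately), the function names and the `score` key are hypothetical, and the call to tRNAscan-SE itself is not shown.

```python
def split_into_windows(sequence, window=200, overlap=100):
    """Return (offset, subsequence) pairs covering `sequence` with overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    if len(sequence) <= window:
        return [(0, sequence)]
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, sequence[start:start + window]))
        if start + window >= len(sequence):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


def keep_confident_trnas(predictions, min_score=20.0):
    """Discard predictions whose score is below `min_score`."""
    return [p for p in predictions if p["score"] >= min_score]
```

Each window keeps its offset so that coordinates reported within a window can be mapped back to the full sequence. With a 100 nt overlap, any feature no longer than 100 nt lies entirely inside at least one window.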

Mobile/fast evolving genes

Phages

The annotation of phage-related proteins is based on homology to a reference set of curated phage proteins. The phage reference data set comes from an independent effort to calculate and curate protein clusters from complete bacteriophage genomes.

CRISPR

CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats) are identified by searching the CRISPR database. CRISPRs are a family of DNA direct repeats of 20 to 40 nucleotides separated by unique sequences of similar length and are commonly found in prokaryotic genomes. These defense systems are encoded by operons that have an extraordinarily diverse architecture and a high rate of evolution for both the cas genes and the unique spacer content.

Frameshift detection

Detecting frameshifts is a critical component of resolving ambiguities in automated annotation and provides important feedback in assessing the quality of an assembly. Proteins from the target set are aligned to the genome with ProSplign, a global alignment algorithm that detects alignments with frameshifts. In a second step, newly predicted GeneMarkS+ genes are evaluated for potential frameshifts. All proteins are aligned to the search set used for protein identification and naming. All candidate search proteins are then aligned to the region with ProSplign and evaluated for frameshifts. The original gene models are replaced with a new gene feature, carrying a pseudo qualifier, that covers the maximal extent of the aligned frameshifted protein. PGAP now also annotates programmed frameshifts (ribosomal slippage) for some transposases and PrfB genes and provides a translated CDS feature for these genes. Universal proteins (proteins common to all bacteria) that are frameshifted are annotated with a translation exception and have a translated CDS feature.
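The replacement of a frameshifted gene model by a single feature spanning the aligned protein can be pictured with a toy data model. The sketch below does not reflect ProSplign's actual output format; the `AlignedSegment` class, its fields, and both functions are hypothetical and only illustrate "a change of reading frame within one alignment" and "maximal extent".

```python
from dataclasses import dataclass


@dataclass
class AlignedSegment:
    start: int  # genomic start, 0-based inclusive
    end: int    # genomic end, exclusive
    frame: int  # reading frame of this segment: 0, 1, or 2


def has_frameshift(segments):
    """True if the segments of one protein alignment use more than one frame."""
    return len({segment.frame for segment in segments}) > 1


def pseudo_gene_extent(segments):
    """Return the (start, end) span covering every aligned segment."""
    return (min(s.start for s in segments), max(s.end for s in segments))
```

For an alignment with one segment at 100..250 in frame 0 and another at 251..400 in frame 1, `has_frameshift` returns `True` and the replacement feature would span 100..400.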

Functional annotation

Proteins identified by HMMs or BlastRules are assigned the name curated for the HMM or BlastRule. For proteins not identified by an HMM or BlastRule, the protein naming procedure uses a BLAST search against a special naming BLAST database (the search set). The search set includes representatives from global protein clusters generated from all prokaryotic RefSeq genomes. Protein alignments are screened for quality (identity and symmetric overlap). A candidate protein is assigned to an identification cluster if a sufficient number of high-quality search set proteins point consistently to the same cluster. The names given to proteins follow the International Protein Nomenclature Guidelines, agreed upon by the European Bioinformatics Institute (EMBL-EBI), the National Center for Biotechnology Information (NCBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB).
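The cluster assignment step can be sketched as a filter followed by a vote. The source gives no numeric cutoffs, so every threshold below is a placeholder, and the dictionary keys and function names are hypothetical. Symmetric overlap is interpreted here as the smaller of the query and subject coverage, which is one plausible reading rather than the documented definition.

```python
from collections import Counter


def symmetric_overlap(hit):
    """Smaller of query coverage and subject coverage (assumed definition)."""
    return min(hit["aligned_length"] / hit["query_length"],
               hit["aligned_length"] / hit["subject_length"])


def assign_cluster(hits, min_identity=0.9, min_overlap=0.8,
                   min_support=3, min_agreement=0.8):
    """Return the cluster supported by enough consistent high-quality hits.

    All four thresholds are illustrative placeholders, not PGAP values.
    Returns None when no cluster qualifies.
    """
    good = [h for h in hits
            if h["identity"] >= min_identity
            and symmetric_overlap(h) >= min_overlap]
    if not good:
        return None
    cluster, count = Counter(h["cluster"] for h in good).most_common(1)[0]
    if count < min_support or count / len(good) < min_agreement:
        return None
    return cluster
```

Requiring both a minimum count and a minimum agreement fraction separates "too few hits" from "hits that disagree"; either condition leaves the protein unassigned in this sketch.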

Annotation results

The annotation pipeline produces ASN.1 files (*.sqn) ready for GenBank submission. It can also convert ASN.1 to traditional GenBank flat file format for manual review. The summary report is generated as part of the output file and includes the total number of predictions by each feature type.

Example of the summary report in structured comment:

##Genome-Annotation-Data-START##
Annotation Provider          :: NCBI
Annotation Date              :: 04/17/2013 07:59:04
Annotation Pipeline          :: NCBI Prokaryotic Genome Annotation Pipeline
Annotation Method            :: Best-placed reference protein set; GeneMarkS+
Annotation Software revision :: . (rev. 395869)
Features Annotated           :: Gene; CDS; rRNA; tRNA; ncRNA; repeat_region
Genes                        :: 3,625
CDS                          :: 2,838
Pseudo Genes                 :: 739
rRNAs                        :: 3 ( 5S, 16S, 23S )
tRNAs                        :: 45
Frameshifted Genes           :: 723
##Genome-Annotation-Data-END##
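A structured comment like the one above is easy to read programmatically. The sketch below is not an NCBI tool; the function names are hypothetical, and it assumes only what the example shows: a `##...-START##` line, a `##...-END##` line, and `key :: value` pairs in between.

```python
import re


def parse_structured_comment(text):
    """Return a dict of the `key :: value` pairs between START and END."""
    fields = {}
    inside = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("##") and line.endswith("-START##"):
            inside = True
            continue
        if line.startswith("##") and line.endswith("-END##"):
            break
        if inside and "::" in line:
            key, value = line.split("::", 1)
            fields[key.strip()] = value.strip()
    return fields


def leading_count(value):
    """Convert a value such as '3,625' or '3 ( 5S, 16S, 23S )' to an int."""
    match = re.match(r"[\d,]+", value)
    if match is None:
        raise ValueError(f"no leading number in {value!r}")
    return int(match.group().replace(",", ""))
```

Blank lines and any text outside the START/END markers are ignored. In the example report, the CDS, pseudo gene, rRNA, and tRNA counts (2,838 + 739 + 3 + 45) add up to the reported 3,625 genes; this is an observation about this one report, not a rule stated by the pipeline documentation.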

Last updated: 2018-09-10T14:07:54Z