The NCBI Eukaryotic Genome Annotation Pipeline

The NCBI Eukaryotic Genome Annotation Pipeline provides content for various NCBI resources including Nucleotide, Protein, BLAST, Gene and the Map Viewer genome browser.

This page provides an overview of the annotation process. Please refer to the Eukaryotic Genome Annotation chapter of the NCBI Handbook for algorithmic details.

The pipeline uses a modular framework to execute all annotation tasks, from fetching raw and curated data from public repositories (sequence and Assembly databases), through aligning sequences and predicting genes, to submitting the accessioned annotation products to public databases. Core components of the pipeline are alignment programs (Splign and ProSplign) and an HMM-based gene prediction program (Gnomon) developed at NCBI.

Important features of the pipeline include:

  • flexibility and speed
  • higher weight given to curated evidence than non-curated evidence
  • utilization of RNA-Seq for gene prediction
  • production of models that compensate for assembly issues
  • tracking of gene loci from one annotation to the next
  • ability to co-annotate multiple assemblies for the same organism

The products of an annotation run (chromosomes, scaffolds, and model transcripts and proteins) are labeled with an Annotation Release number. The Annotation Release name combines the organism name and the Annotation Release number (e.g. NCBI Homo sapiens Annotation Release 105) and is used throughout NCBI to uniquely identify annotation products originating from the same annotation run.

Process

The figure below provides an overview of the annotation process. The genomic sequences are masked (grey) and transcripts (blue), proteins (green) and RNA-Seq reads (orange) are aligned to the genome. If available for the organism being annotated, curated RefSeq genomic sequences are also aligned (pink). Gene model prediction based on transcript and protein alignments is then performed (brown). The best models are selected among the RefSeq and the predicted models, named and accessioned (purple). Finally, the annotation products are formatted and deployed to public resources (yellow).

[Figure: overview of the annotation pipeline]

Source of genome assemblies

The RefSeq assemblies that are annotated by NCBI are copies of the genome assemblies that are public in DDBJ, ENA and GenBank. Both RefSeq and GenBank assemblies are further described in the Assembly resource.

Masking

Masking is done using RepeatMasker or WindowMasker. Organisms with well-characterized repeats are masked with RepeatMasker, others with WindowMasker.

Transcript alignments

The set of transcripts selected for alignment to the genome varies by species, and may include transcripts from other organisms. This set generally includes:

  • Known RefSeq transcripts: Coding and non-coding RefSeq transcripts with NM_ or NR_ prefixes, respectively, are generated by NCBI staff based on automatic processes, manual curation, or data from collaborating groups (see more details here)
  • GenBank transcripts from the taxonomically relevant GenBank divisions, and the Third-Party Annotation (TPA), High-throughput cDNA (HTC) and Transcriptome Shotgun Assembly (TSA) divisions
  • ESTs from dbEST

Sequences highly likely to be mitochondrial or to have cloning vector or IS element contamination, and sequences identified as low quality by RefSeq curation staff are screened out.

RefSeq transcripts and non-RefSeq transcripts that pass the contamination screen are aligned locally to the genome using BLAST to identify the location(s) at which transcripts align. Global re-alignment at these locations is performed with Splign to refine the identification of splice sites. Alignments are then ranked and filtered based on customizable criteria (such as coverage, identity, rank). Typically, only the best-placed (rank 1) alignment for a given query is selected for use in downstream steps.
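The ranking and filtering step can be illustrated with a minimal sketch. This is not the pipeline's code: the `Alignment` fields, the threshold values and the sort key are assumptions chosen for illustration, and the query names and locations are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Alignment:
    query: str       # transcript identifier (hypothetical names below)
    location: str    # label for the genomic placement
    coverage: float  # fraction of the query that is aligned, 0..1
    identity: float  # fraction of identical bases, 0..1


def best_placed(alignments, min_coverage=0.9, min_identity=0.95):
    """Return the rank-1 alignment per query among those passing the thresholds.

    The thresholds and the ranking key are illustrative assumptions.
    """
    by_query = defaultdict(list)
    for aln in alignments:
        if aln.coverage >= min_coverage and aln.identity >= min_identity:
            by_query[aln.query].append(aln)

    selected = {}
    for query, alns in by_query.items():
        # Rank by identity, then coverage; rank 1 is the first after sorting.
        ranked = sorted(alns, key=lambda a: (a.identity, a.coverage), reverse=True)
        selected[query] = ranked[0]
    return selected


alignments = [
    Alignment("transcript_A", "chr1:1000-5000", 0.99, 0.998),
    Alignment("transcript_A", "chr7:200-4100", 0.97, 0.960),
    Alignment("transcript_B", "chr2:50-900", 0.40, 0.990),
]
best = best_placed(alignments)
print(best["transcript_A"].location)
```

Here `transcript_A` keeps only its better placement, and `transcript_B` is dropped entirely because its coverage falls below the assumed threshold.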

RNA-Seq read alignments

RNA-Seq reads for the species are aligned to the genome. When a very large number of reads (multiple billions) are available in SRA, reads spanning the widest range of tissues and developmental stages are chosen over others, with a preference for untreated or non-diseased samples. RNA-Seq reads are aligned with BLAST and Splign like traditional transcripts. However, additional steps are performed to address the short length, redundancy and abundance of the reads:

  • only single representatives of identical sequences retrieved from SRA are aligned
  • alignments with the same splice structure and the same or similar start and end points are collapsed into a single representative alignment
  • alignments representing very rare introns likely to be background noise are filtered out

At each step, information is recorded about the samples and number of reads represented by each read and alignment, so the level of support can be used to filter alignments and evaluate gene predictions.
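The three reduction steps and the bookkeeping of read counts can be sketched as follows. This is a simplified illustration, not the pipeline's implementation: the dictionary layout, the `end_tolerance` of 10 bases and the `min_reads` cutoff are all assumptions.

```python
from collections import Counter, defaultdict


def deduplicate_reads(reads):
    """Keep one representative per identical sequence, with its read count."""
    return Counter(reads)


def collapse_alignments(alignments, end_tolerance=10):
    """Collapse alignments with the same introns and similar start/end points.

    Each alignment is a dict with 'start', 'end', 'introns' (a tuple of
    (donor, acceptor) pairs) and 'reads' (number of reads it represents).
    """
    groups = defaultdict(list)
    for aln in alignments:
        groups[aln["introns"]].append(aln)

    collapsed = []
    for members in groups.values():
        representatives = []
        for aln in sorted(members, key=lambda a: (a["start"], a["end"])):
            for rep in representatives:
                if (abs(aln["start"] - rep["start"]) <= end_tolerance
                        and abs(aln["end"] - rep["end"]) <= end_tolerance):
                    rep["reads"] += aln["reads"]  # record the added support
                    break
            else:
                representatives.append(dict(aln))  # copy, keep input intact
        collapsed.extend(representatives)
    return collapsed


def filter_rare_introns(alignments, min_reads=2):
    """Drop spliced alignments supported by fewer than min_reads reads."""
    return [a for a in alignments if not a["introns"] or a["reads"] >= min_reads]


unique_reads = deduplicate_reads(["ACGT", "ACGT", "GGCC"])
alignments = [
    {"start": 100, "end": 500, "introns": ((200, 300),), "reads": 2},
    {"start": 103, "end": 498, "introns": ((200, 300),), "reads": 1},
    {"start": 100, "end": 500, "introns": ((210, 300),), "reads": 1},
]
collapsed = collapse_alignments(alignments)
supported = filter_rare_introns(collapsed)
```

Carrying the `reads` count through every step is what allows the level of support to be used later when filtering alignments and evaluating predictions.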

Protein alignments

The set of proteins selected for alignment to the genome varies by species, and may include proteins from other organisms. This set generally includes:

  • Known RefSeq proteins
  • GenBank proteins derived from cDNAs from the taxonomically relevant GenBank divisions

Highly repetitive sequences are removed from the set. Proteins are aligned locally to the genome with BLAST and re-aligned globally using ProSplign. Alignments are then ranked and filtered based on customizable criteria.

Model prediction

Protein, transcript and RNA-Seq read alignments are passed to Gnomon for gene prediction. Gnomon first chains together non-conflicting alignments into putative models. In a second step, Gnomon extends predictions missing a start or a stop codon or internal exon(s) using an HMM-based algorithm. Gnomon additionally creates pure ab initio predictions where open reading frames of sufficient length but with no supporting alignment are detected.

This first set of predictions is further refined by alignment against a subset of the nr (non-redundant) database of protein sequences. The additional alignments are added to the initial alignments, and the chaining and ab initio extension steps are repeated. The results constitute the set of Gnomon predictions.

Frameshifts, indels and stop codons may occur in the resulting Gnomon predictions. These reflect sequence differences between the input transcript and protein alignments and the genome assembly.

Curated RefSeq genomic sequence alignments

For some organisms, a set of genomic sequences is curated (RefSeq accessions with NG_ prefixes). These sequences represent non-transcribed pseudogenes, manually annotated gene clusters that are difficult to annotate via automated methods, or human RefSeqGene records. They are aligned to the genome, and their best placement is identified.

Choosing the best models for a gene

The final set of annotated features comprises, in order of preference, pre-existing RefSeq sequences and a subset of well-supported Gnomon-predicted models. It is built by evaluating together at each locus the known RefSeq transcripts, the features projected from curated RefSeq genomic alignments and the models predicted by Gnomon.

1. Models based on known and curated RefSeq

RefSeq transcripts are given precedence over overlapping Gnomon models with the same splice pattern. Alignments of known same-species RefSeq transcripts or curated genomic sequences are used directly to annotate the gene, RNA and CDS features on the genome. Since the RefSeq sequence may not align perfectly or completely to the genomic sequence, a consequence of this rule is that the annotated product may differ from the conceptual translation of the genome. Differences between the RefSeq transcripts and the genome are provided in a note on the RefSeq genomic record (scaffold or chromosome).

2. Models based on Gnomon predictions

Gnomon predictions are included in the final set of annotations if they do not share all splice sites with a RefSeq transcript and if they meet certain quality thresholds including:

  • Only fully- or partially-supported Gnomon predictions, or pure ab initio Gnomon predictions with high coverage hits to UniProtKB/SwissProt proteins are selected
  • When multiple fully-supported transcript variants are predicted for a gene, only the Gnomon predictions supported in their entirety by a single long alignment (e.g. a full-length mRNA) or by RNA-Seq reads from a single BioSample are selected
  • Poorly-supported Gnomon predictions conflicting with better-supported models annotated on the opposite strand are excluded from the final set of models
  • Gnomon predictions with high homology to transposable or retro-transposable elements are excluded from the final set of models
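As a rough illustration, most of the criteria above can be expressed as a single predicate. The `Prediction` fields and the 0.8 coverage cutoff for "high coverage" are assumptions made for this sketch, and the rule about multiple fully-supported variants is left out for brevity; the actual thresholds used by the pipeline are not stated here.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    support: str                              # "full", "partial" or "ab_initio"
    swissprot_coverage: float = 0.0           # coverage of best SwissProt hit, 0..1
    same_splice_sites_as_refseq: bool = False
    transposon_like: bool = False
    conflicts_with_better_opposite_strand: bool = False


def keep_prediction(p, min_ab_initio_coverage=0.8):
    """Decide whether a Gnomon prediction enters the final annotation set.

    The coverage cutoff is an illustrative assumption.
    """
    if p.same_splice_sites_as_refseq:
        return False  # the RefSeq transcript takes precedence
    if p.transposon_like or p.conflicts_with_better_opposite_strand:
        return False
    if p.support in ("full", "partial"):
        return True
    return p.support == "ab_initio" and p.swissprot_coverage >= min_ab_initio_coverage


examples = [
    Prediction("partial"),
    Prediction("ab_initio", swissprot_coverage=0.3),
    Prediction("ab_initio", swissprot_coverage=0.95),
    Prediction("full", same_splice_sites_as_refseq=True),
]
decisions = [keep_prediction(p) for p in examples]
```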

3. Integrating RefSeq and Gnomon annotations

As a result of the model selection process, a gene may be represented by multiple splice variants, with some of them known RefSeq and others model RefSeq (originating from Gnomon predictions).

Gnomon predictions selected for the final annotation set are assigned model RefSeq accessions with XM_ or XR_ prefixes for transcripts and XP_ prefixes for proteins to distinguish them from known RefSeq with NM_/NR_ and NP_ prefixes. Model RefSeq can be searched in Entrez with the query “srcdb_refseq_model[properties]” while known RefSeq sequences can be obtained with the query “srcdb_refseq_known[properties]”.
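The prefix convention makes it easy to tell known and model RefSeq apart programmatically. The helper below is a small sketch; the function name and the placeholder accessions are hypothetical, while the prefixes and Entrez queries are those given above.

```python
# Accession prefix -> (RefSeq category, molecule type)
REFSEQ_PREFIXES = {
    "NM_": ("known", "transcript"),
    "NR_": ("known", "transcript"),
    "NP_": ("known", "protein"),
    "XM_": ("model", "transcript"),
    "XR_": ("model", "transcript"),
    "XP_": ("model", "protein"),
}

# Entrez property filters quoted in the text above
ENTREZ_QUERIES = {
    "known": "srcdb_refseq_known[properties]",
    "model": "srcdb_refseq_model[properties]",
}


def classify_accession(accession):
    """Return (category, molecule type), or None for other accessions."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# Placeholder accessions, for illustration only
for acc in ("XM_000000.1", "NP_000000.1", "AB000000.1"):
    result = classify_accession(acc)
    if result is None:
        print(acc, "-> not a known or model RefSeq transcript/protein")
    else:
        category, molecule = result
        print(acc, "->", category, molecule, "| query:", ENTREZ_QUERIES[category])
```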

Protein naming and determination of locus type

  • Genes represented by known or curated RefSeq sequences inherit the Gene symbol, name and locus type (e.g. coding, pseudogene...) of the RefSeq sequence
  • Genes represented by predicted models are named based on homology to SwissProt proteins
  • Most Gnomon models with insertions, deletions or frameshifts are labeled pseudogenes
  • Gnomon models with insertions, deletions or frameshifts may be considered coding if they have a strong unique hit to the SwissProt database or appear to be orthologs of known protein-coding genes. Titles for these models are prefixed with “PREDICTED: LOW QUALITY PROTEIN”. There may be defects in the assembly and/or model in these cases.
  • Gnomon models that appear to be single-exon retrocopies of protein-coding genes may also be annotated as pseudogenes
  • When multiple assemblies are annotated, a partial or imperfect model may be called coding because a complete model exists at the corresponding locus on one of the other annotated assemblies

Assignment of GeneIDs

Genes in the final set of models are assigned GeneIDs in NCBI's Gene database.

  • A gene represented by a known RefSeq transcript will receive the GeneID of the RefSeq transcript.
  • All alternative splice forms of a gene get the same GeneID.
  • As much as possible, GeneIDs are carried forward from one annotation run to the next, using the mapping of the new assembly to the previous one if the assembly was updated.
  • Gene features mapped to equivalent locations of co-annotated assemblies are assigned the same GeneIDs.

Annotation of small RNAs

  • miRNAs are imported from miRBase, accessioned with NR_ prefixes and placed using Splign.
  • tRNAs are predicted with tRNAscan-SE.

Special considerations

Annotation of multiple assemblies

When multiple assemblies of good quality are available for a given organism, they are all annotated in coordination. To ensure that matching regions across assemblies are annotated the same way, the assemblies are aligned to each other before annotation.

  • Assembly-assembly alignment results are used to rank the transcript and the curated genomic alignments: for a given query sequence, alignments to corresponding regions of two assemblies receive the same rank.
  • Corresponding loci of multiple assemblies are assigned the same GeneID and locus type.

Assembly-assembly alignments are available through the NCBI Genome Remapping Service.

Re-annotation

Organisms are periodically re-annotated when new evidence is available (e.g. RNA-Seq) or when a new assembly is released. Special attention is given to tracking of models and genes from one release of the annotation to the next. Previous and current models annotated at overlapping genomic locations are identified and the locus type and GeneID of the previous models are taken into consideration when assigning GeneIDs to the new models. If the assembly was updated between the two rounds of annotation, the assemblies are aligned to each other and the alignments used to match previous and current models in mapped regions.
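The idea of carrying GeneIDs forward by genomic overlap can be sketched as below. This is a deliberately simplified model, not the pipeline's logic: it assumes previous and current loci are already in the same coordinate system (for example after mapping through assembly-assembly alignments), it ignores locus type, and the GeneIDs and the starting value for new IDs are made up.

```python
def overlap(a, b):
    """Overlapping bases between two loci given as (sequence, start, end, strand)."""
    if a[0] != b[0] or a[3] != b[3]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]) + 1)


def carry_forward_gene_ids(previous, current, next_new_id):
    """Assign GeneIDs to current models, reusing previous IDs where loci overlap.

    previous: {gene_id: locus}, current: {model_name: locus}.
    Each previous GeneID is reused at most once; unmatched models get new IDs.
    """
    assigned = {}
    used = set()
    for name, locus in current.items():
        candidates = [
            (overlap(locus, prev_locus), gene_id)
            for gene_id, prev_locus in previous.items()
            if gene_id not in used
        ]
        best_overlap, best_id = max(candidates, default=(0, None))
        if best_overlap > 0:
            assigned[name] = best_id
            used.add(best_id)
        else:
            assigned[name] = next_new_id  # hypothetical new identifier
            next_new_id += 1
    return assigned


previous = {101: ("chr1", 1000, 5000, "+"), 102: ("chr1", 8000, 9000, "-")}
current = {"model_a": ("chr1", 1200, 5200, "+"), "model_b": ("chr2", 100, 900, "+")}
gene_ids = carry_forward_gene_ids(previous, current, next_new_id=900001)
print(gene_ids)
```

In this example `model_a` inherits GeneID 101 from the overlapping previous locus, while `model_b` has no counterpart and receives a new identifier.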

Annotation products

  • The products of the annotation process comprise:
    • The scaffolds and chromosomes of the assembled genomes, with the annotation products as features.
    • The individual products (transcripts and proteins)
Product | Origin of the product | Note for the features on the scaffolds and chromosomes*
Known transcripts/proteins (NM_, NR_, NP_) | curated RefSeq genomic | "Derived by automated computational analysis using gene prediction method: Curated Genomic"
Known transcripts/proteins (NM_, NR_, NP_) | known RefSeq transcript | "Derived by automated computational analysis using gene prediction method: BestRefseq"
Model transcripts/proteins (fully or partially supported) (XM_, XR_, XP_) | Gnomon | "Derived by automated computational analysis using gene prediction method: Gnomon"
tRNAs (no accession) | tRNAscan-SE | "tRNA features were annotated by tRNAscan-SE"
Non-transcribed pseudogenes (no accession) | curated RefSeq genomic | "Derived by automated computational analysis using gene prediction method: Curated Genomic"
Non-transcribed pseudogenes (no accession) | Gnomon | "Derived by automated computational analysis using gene prediction method: Gnomon"
Full set of Gnomon predictions (no accession) | Gnomon | NA. Not in the sequence database. Available on the FTP site and as BLAST databases

* For predicted models, the note is also on the records of individual annotation products.

  • Sequence records for predicted models, scaffolds and chromosomes contain the Annotation Release number, which in combination with the species uniquely identifies the annotation. For example, the sequence records for scaffolds, chromosomes and predicted transcripts and proteins for NCBI Homo sapiens Annotation Release 105 contain the following comment:

##Genome-Annotation-Data-START##
Annotation Provider         :: NCBI
Annotation Status           :: Full annotation
Annotation Version          :: Homo sapiens Annotation Release 105
Annotation Pipeline         :: NCBI eukaryotic genome annotation pipeline
Annotation Software Version :: 5.1
Annotation Method           :: Best-placed RefSeq; Gnomon
Features Annotated          :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
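A comment block in this form is straightforward to read programmatically. The parser below is a minimal sketch written for the example above; the function name is hypothetical, and it assumes one "key :: value" pair per line between the START and END markers.

```python
def parse_annotation_comment(text):
    """Parse a Genome-Annotation-Data block into a dict of field -> value."""
    fields = {}
    inside = False
    for line in text.splitlines():
        line = line.strip()
        if line == "##Genome-Annotation-Data-START##":
            inside = True
        elif line == "##Genome-Annotation-Data-END##":
            inside = False
        elif inside and "::" in line:
            key, value = line.split("::", 1)
            fields[key.strip()] = value.strip()
    return fields


comment = """\
##Genome-Annotation-Data-START##
Annotation Provider         :: NCBI
Annotation Status           :: Full annotation
Annotation Version          :: Homo sapiens Annotation Release 105
Annotation Pipeline         :: NCBI eukaryotic genome annotation pipeline
Annotation Software Version :: 5.1
Annotation Method           :: Best-placed RefSeq; Gnomon
Features Annotated          :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
"""

fields = parse_annotation_comment(comment)

# Split the Annotation Version into organism name and release number.
organism, release = fields["Annotation Version"].rsplit(" Annotation Release ", 1)
features = fields["Features Annotated"].split("; ")
print(organism, release, features)
```

Together, the organism name and release number recovered this way identify the annotation run that produced the record.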

Data availability

The data produced by the annotation pipeline is available in various resources:

Last updated: 2014-09-16T15:41:32-04:00