Prokaryotic RefSeq Genome Re-annotation Project

What we are doing?

Prokaryotic RefSeq genomes are being re-annotated using NCBI's annotation pipeline. The scope of this project includes all new WGS and complete genomes and all pre-existing RefSeq WGS and complete genomes with the exception of a small number of identified highest quality reference genomes which have been extensively curated, community-vetted and/or have a large amount of experimental confirmation. Coincident with this update, we are converting prokaryotic RefSeq genomes to a new non-redundant protein data model that was first announced via the NCBI NEWS site and an accompanying longer announcement. Where an important representative genome exists, such as Escherichia coli K12 substr. MG1655, it will continue to be annotated with species and strain-specific RefSeq proteins (with NP_ or YP_ accession prefix) which in turn cross-reference the non-redundant RefSeq protein dataset in the CONTIG line that appears above the protein sequence in the GenPept display format (e.g., NP_414543). Identical proteins in other non-reference genome strains will only be represented by the non-redundant protein accession number (WP_).

Image of the CONTIG line in the GenPept display of RefSeq protein accession NP_414543.1, which cross-references a non-redundant RefSeq protein (WP_001264707.1).

The rapidly growing redundancy in the prokaryotic database due to an increase in sequence submissions has caused us to change the scope definition for Gene. The scope of Gene, for prokaryotes, was changed from all complete genomes, to only those genomes that are: a) a reference genome; or b) a representative genome for which there are at least 10 sequenced genomes for the species or clade. In other words, Gene will now include only the highest quality and most supported subset of RefSeq prokaryotic genomes.

We also refined our quality criteria and data validation for prokaryotic genomes. We realize that it is important to provide genome annotation for the multiplicity of bacterial genomes; however, we do not want to release annotation results that may have significant quality issues. We may continue to refine our quality criteria in the coming year which may result in an additional changes. At this time (March 2015), we have identified and suppressed several hundred prokaryotic RefSeq genomes that we feel are not currently of sufficient quality to provide as a RefSeq record(RefSeq genome quality criteria). Some of these may be added back in the future if the source GenBank genome assembly is updated or following additional improvements to NCBI's prokaryotic genome annotation pipeline.

The bacterial re-annotation project is nearly complete as of March 2015. We are also applying this approach to RefSeq archaeal genomes and although at this time we have re-annotated some archaeal genomes we have not yet completed that effort.

Why are we doing this?

With the advent of high-throughput sequencing, the generation and use of genome sequences is evolving. One change is the sequencing of very large numbers of nearly identical bacterial genomes to analyze food borne pathogens or infection outbreaks. In these cases, annotating the genes on these genome is important and useful for identifying the small number of places where mutations may affect function, but most of the genes annotated on these new genomes encode proteins that are identical to proteins already in the RefSeq dataset.

This health-related use case makes it critically important to provide genome annotation that has been generated using a consistent method. Previously, RefSeq bacterial genomes represented a mix of submitted and NCBI-generated annotation that was provided using different methods reflecting different states of technology development over time. Therefore, in order to continue to provide a prokaryotic dataset that is of highest utility for disease, pathogen, and other comparative analysis needs, NCBI decided to re-annotate all RefSeq prokaryotic genomes using an improved genome annotation pipeline.

In the past, differences in annotation methods resulted in different protein names and lengths from identical DNA sequences submitted from the same species. These problems will be avoided with the re-annotation project, and the same DNA sequence from the same species will now be annotated with exactly the same protein sequence and name in RefSeq.

How are we doing this?

In order to manage the flood of identical proteins resulting from the decision to provide genome annotation for this volume of sequenced genomes, we introduced a new data model for prokaryotic RefSeq genomes, and a new protein data type in the RefSeq collection which is signified by a "WP_" accession prefix. Previously, each annotated CDS on each RefSeq prokaryotic genome was tracked with a distinct RefSeq protein accession number. This practice led to redundancies in the protein database, since the identical protein was represented by multiple accessions as a result of extensive sequencing of bacterial genomes. To avoid such redundancies in the database, NCBI has adopted the use of non-redundant WP proteins for RefSeq prokaryotic genomes that are annotated using NCBI pipelines. If the identical protein sequence (exactly the same protein sequence and length) appears on more than one RefSeq genome, NCBI simply re-uses the existing WP accession number instead of creating a new accession for each new occurrence and genome. For conserved proteins the same WP accession may appear on thousands of genomes. This is a first step toward dealing with a world when genomes are sequenced just for assays, rather than to discover novel proteins. We appreciate that this is new and a major change for RefSeq prokaryotic genomes, and that there are some issues still to be worked out, but we felt we needed to start making this change as the number of disease-outbreak and other isolate sequencing continues to increase rapidly.

For example, the RefSeq non-redundant 50S ribosomal protein L11, WP_003156430.1, is annotated on over 70 RefSeq bacterial genomes. The CDS feature annotated on those genomes cross-references this same WP protein accession number, indicated by a red arrow. Examples include:

1) 50S ribosomal protein L11 from Bacillus sp. JS:

Image of CDS feature for 50S ribosomal protein L11 as annotated on NC_017743.1. The CDS cross-references nonredundant protein WP_003156430.1.

2) 50S ribosomal protein L11 from Bacillus subtilis subsp. spizizenii strain NRS 231:

Image of CDS feature for 50S ribosomal protein L11 as annotated on NZ_CP010434.1. The CDS cross-references nonredundant protein WP_003156430.1


Consistent genome annotation

This approach enables direct comparison of RefSeq prokaryotic genome annotation across genomes because annotation is generated using a consistent method. The following figure provides an example of this. To generate this figure, the displayed RefSeq and GenBank genomes were each aligned to subject NC_021200.1, the pair-wise alignments were added to NCBI's graphical sequence display tool (using the Configure tool and selecting the option to add local data, add BLAST RID), and the display was configured to hide the set of primary alignments (leaving only the cleaned and merged alignments) and to display the annotated features (using the Configure tool, select the 'cleaned alignment' row and check the box to project features). The display was then zoomed into the range 2,123 to 2,125 kb. Note that the three RefSeq genomes shown have consistently annotated a gene, between the two vertical blue bars, for phosphoribosyl transferase (229 amino acids) but the annotation differs in the three GenBank genomes both in terms of the protein length and name. For two of the RefSeq genomes, the annotated protein is identical and cross-references a single non-redundant protein (WP_003872216.1), and the third RefSeq genome differs in sequence but not length and cross-references a different non-redundant RefSeq protein (WP_033715495.1). Also note that the GenBank genome AE016958.1 is missing two genes (blue circle) in the displayed region, and the GenBank genome CP005829.1 is missing one gene (not circled).

image showing feature annotation for a series of aligned genomic accessions (NC_002944.2, NC_008595.1, NC_021200.1, AE016958.1, CP000479.1, CP005928.1). The image shows RefSeq annotation has greater consistency.

Reduced protein redundancy

The use of non-redundant RefSeq accessions significantly reduces redundancy in the prokaryotic proteome represented in RefSeq. For example, the non-redundant protein WP_003872216.1 is annotated on 26 RefSeq genomes of Micobacterium avium subsp. paratuberculosis; this is a 26 fold reduction in redundancy for this one protein sequence in the RefSeq collection as only one RefSeq protein accession is annotated on these 26 genomes, instead of creating 26 accessioned records - one for each genome. The reduction can be even greater for other highly conserved proteins that are annotated on multiple hundred, or even over a thousand, RefSeq genomes. For example, WP_000289090.1 is annotated on over 500 genomes of Escherichia and Shigella. Each of these RefSeq genomes has an annotated CDS feature that cross-references this single protein accession.

Image illustrating a CDS feature annotated on NC_021200.1 which cross-references non-redundant RefSeq protein accession WP_003872216.1

Improved name management

As protein names are updated on the non-redundant RefSeq protein records, the improved protein name will automatically become available on the annotated genome record which cross-references the non-redundant dataset. We are continuing to work on protein name improvements by leveraging information from: a) curated trusted protein datasets for families; b) curated Swiss-Prot names; c) community curated names on reference genomes; d) Hidden Markov Models (HMMs); e) conserved domains; and f) protein clustering approaches.

Easy navigation

Protein records now include an "Identical Proteins" link, which provides convenient access to all 100% identical proteins as well as their annotated locations. For example, all genomes containing a protein identical to WP_013315922.1 can be reviewed through the Identical Proteins report, allowing quick review of species and strain diversity for this protein as well as the option to view the corresponding genomic annotations. Image of the Identical protein report page for non-redundant RefSeq protein WP_013315922.1

Description of data changes

  • Re-annotated genomes: Over 32,000 prokaryotic RefSeq genomes have now been annotated using NCBI s improved prokaryotic genome annotation pipeline. At the same time, the approach used to annotate CDS features and represent RefSeq proteins was change to use the non-redundant protein data model. The vast majority of these are bacterial genomes. Processing of archaeal genomes is not complete as of March 18, 2015. {: #reannotatedgenomes }

  • Finished transition to new data model: The new data model is now in use for both WGS and Complete bacterial genomes. Archaeal RefSeq genomes are also using this data model but the transition has not been completed as of March 2015.

    • This data model was first introduced in mid-2013 and has been in place since then for NCBI annotations of whole genome shotgun sequences.
    • As planned, we have now finished transitioning all WGS and complete bacterial genomes to this new model with the exception of a small number of selected reference genomes.
    • Now, genomes are annotated with non-redundant proteins (WP_ accessions).
    • referencegenomesReference genomes: 122 bacterial reference genomes have been selected based on genome and annotation quality, established community reference standards, and experimentally supported community annotation as further described here. This data set supports those who want a minimal prokaryotic dataset of the highest quality. The reference genome status is maintained in NCBI's Assembly and Genome resources. You can find this data in Genome (browse reference genomes), in the Assembly database (query hint: prokaryota[orgn] AND "reference genome"[RefSeq Category] AND latest[filter]), or on our FTP site.
  • Representative genomes: 3152 representative genomes have been selected based on considerations that include genome quality, annotation quality, and comparison to similar genomes after clustering clades. Selection criteria are described here. We select one to a small number of best genome(s) for a clade cluster, or, for very diverse species, we also select the best available for that species. In some cases there may be only one genome available for that diverse species. This data set can be found in the Assembly database (query hint: prokaryota[orgn] AND "representative genome"[RefSeq Category] AND latest[filter] ), or on our FTP site.

  • Suppressed genomes: Several hundred RefSeq bacterial genomes that did not pass assembly or annotation quality validation have been suppressed(RefSeq genome quality criteria). An FTP report file listing the most recent suppressions will be provided soon in the RefSeq FTP site.

  • Suppressed proteins: Approximately 7 million prokaryotic NP_ and YP_ accessioned proteins have been suppressed as a result of both suppressing RefSeq genomes with quality issues and due to the large-scale re-annotation project which transitioned bacterial genomes to the new non-redundant protein data model. The new data model provides CDS annotation that cross-references non-redundant RefSeq proteins. This data model applies to all non-reference prokaryotic genomes. Our original intention was to retain strain-specific NP_ and/or YP_ accessions for the layer of 'representative' genomes; however, the designation of 'representative' status is subject to change over time and it would be a great disservice to add and remove NP_/YP_ accessions if the RefSeq category changes to or away from the 'representative' status.

  • Changes in the Gene resource: The scope definition for annotating prokaryotic genomes with NCBI GeneIDs was changed resulting in the suppression of a large number of entries. Gene continues to provide current entries for all prokaryotic reference genomes. For bacteria, Gene includes the best supported sub-set of representative genomes for which there are >= 10 sequenced genomes for the clade. We are still refining the Gene policy for archaeal genomes. Suppressed Gene entries can still be accessed and we are supplementing the current set of suppressed records with information to facilitate navigating to the replacement non-redundant RefSeq protein. The embedded graphical display will continue to show annotation of the genomic coordinates that the Gene entry represents. If that RefSeq genome was re-annotated, then the display in Gene will automatically show the updated annotation for the accession.version:from-to coordinates associated with the Gene record. Thus, while the Gene entry may have initially displayed a CDS annotation associated with a YP_ accession number (still reported in the Reference Sequences section of the record), it may now display a CDS annotation associated with a non-redundant WP_ accession number. These records will not be subject to any further update.

  • Locus_tag changes: New locus_tags were assigned as genomes were re-annotated using the format of <original locus_tag prefix>_RS<digits>, for example "MAP4_RS09710". The locus_tag that was previously annotated on this genome is reported in the 'old_locus_tag' qualifier when the equivalent CDS was re-annotated on the record. In some cases a new, or substantially modified, CDS was annotated on the genome in which case there is no corresponding previous locus_tag. In order to continue to provide locus_tag annotation comprehensively on these records we had to provide it in a new format. Image of an annotated CDS feature that includes a qualifier for locus_tag and one for old_locus_tag.

What we are doing to help you transition to the new data

We are working to provide information to facilitate the transition to the new annotation data using several approaches:

  • datamappingreportsData mapping and other report files: We provided supplemental report files on the RefSeq FTP site, with RefSeq release 70, which can be used to identify removed records, and to map removed records to the non-redundant protein accessions. These files are available in the release archive directory at: Two files are provided that report: a) mapped protein accessions and locus_tags, for the set of NP_ or YP_ accessioned proteins that were suppressed and replaced by a nonredundant WP_ accessioned protein (file: release70.bacterial-reannotation-report.txt.gz); and b) a set of assemblies that were suppressed between March and May 2015 due to changes in the RefSeq quality criteria (file: release70.addedQa-suppressedAssemblies.txt).
    • protein2genomiclocationSuppressed proteins: One or more files will be provided in the RefSeq FTP site that map suppressed NP_ and YP_ protein accessions to non-redundant WP_ accessions which are identical in sequence and length to the suppressed protein record. Because these updates have occurred over time we are planning to provide a comprehensive report of all previously suppressed NP_ and YP_ proteins that can be mapped to a non-redundant WP_ accession. Note that this is mapping at the level of protein sequence identity, not at the level of genome annotation, and this report may include some suppressed accessions that were removed a number of years prior to development of our re-annotation plans. In addition, the RefSeq release FTP site includes a report of accessions that have been removed since the previous release (file name format: release##.removed-records.gz). This file can be parsed to identify all removed bacterial records. An example command line that identifies removed bacterial protein accessions is:
      • >zcat release69.removed-records.gz |grep bacteria|grep P_|more
    • Locus_tags: We will also provide mapping data between old and new locus_tags for those annotated genes and CDSs which are identical or highly similar to the previous version of the annotated genome. This data is also available directly within the re-annotated RefSeq genomes as the previous locus_tag has been provided in the "/old_locus_tag:" qualifier.
    • Suppressed genomes: We will provide a report of RefSeq bacterial genomes that were removed due to concerns about either the assembly or the annotation. In some cases, the quality issue identified requires an update to NCBI's prokaryotic genome annotation pipeline. Once the pipeline software has been updated, then the affected genomes will be updated and made publicly available again. In other cases, where the issue is about the assembled sequence quality, the RefSeq genome would only become publicly available again if an improved quality update is submitted to GenBank, ENA, or DDBJ.
    • Suppressed Gene entries: The Gene FTP site provides a file (gene_history.gz) reporting all removed records for all taxa. This file can be parsed to find removed entries for a given date range or NCBI taxonomy ID. An example command line argument is:
      • >zcat gene_history.gz | grep 2015 | grep ^470 | more
    • The "Identical Proteins" link at the top of the record can be used to view a table of all genomic locations for the protein accession, as well as other 100% identical proteins. This table provides convenient links to each annotated genomic record in GenBank format. See the above image.
    • The "Genomic records" link under "Related Information" (shown below) can be used to navigate to the set of nucleotide records on which the protein is annotated.
    • The "Species level organisms" link under "Related Information" can be used to find the set of species on which this specific non-redundant protein record has been annotated.

      Image of links available in the related information section of the GenPept display of non-redundant RefSeq protein WP_013315922.1.

  • Navigation from suppressed protein records to replacement non-redundant records: When accessing a suppressed NP_ or YP_ protein sequence in NCBI's Protein resource, an informational message will appear on suppressed sequences that are identical to a non-redundant RefSeq protein record (see below). The message will include links to the replacement WP_ accession, the Identical Protein display, and this documentation. Initially this message will be provided when viewing the record (GenBank format) and for suppressed records for which there is an identical non-redundant sequence record. We are working to provide a similar message on suppressed records for which there is a similar but not identical non-redundant RefSeq record, and to expose this or similar text in other contexts (Summary view as one example). The additional links may not be available in all places due to technical details of how different display options are managed.

    Image of informative message added to suppressed bacterial YP/NP accessions.

  • BLAST: We are evaluating technical aspects of some modifications to BLAST results to improve access to species, strain, and genome annotation information for BLAST hits to non-redundant WP_ accessions. This improvement will be added as soon as possible and some information may be easier to add before other content.
  • Documentation and announcements {: #documentation }
    • We have added extensive documentation for Prokaryotic genomes, non-redundant proteins, and the re-annotation project to the RefSeq web site. We anticipate that these documents will continue to be expanded upon in the coming weeks and months.
    • Announcements of changes to RefSeq policy and data content are distributed to subscribers to the refseq-announce email list. In addition, announcements of significant changes are distributed in the NCBI NEWS site, NCBI Staff Twitter account and more.

Measurable reduction in protein redundancy

Counts of annotated proteins for four species

Included file '' not found

Percent reduction in protein accessions

Included file '' not found

Singleton WPs per Genome

Included file '' not found

Table Definitions

  • "Total Proteins" counts the number of times non-redundant proteins accessions are annotated on the set of genomes for the species.
  • "Total Unique WPs" counts the distinct number of non-redundant proteins used across all genomes. This is the truly non-redundant set of proteins for the species.
  • "Total Singleton WPs" counts the number of non-redundant proteins used only once in the set of genomes for the species.
  • "Percent Reduction" measures the compression in protein identifier space gained by using non-redundant protein accessions (WP_ prefix)
  • "Percent Singleton WPs" measures the percent of all non-redundant proteins for that species that are used only once in that species.

Last updated: 2019-01-08T18:00:20Z