Prokaryotic RefSeq Genomes

All RefSeq archaeal and bacterial genomes, with the exception of selected reference genomes, are annotated using NCBI's prokaryotic genome annotation pipeline, which improves consistency across the dataset. Each annotated genome continues to represent a set of gene and protein feature annotations that are unique to that genome. Gene features are provided with unique locus_tags, but NCBI GeneID cross-references are annotated only for reference genomes and a subset of representative genomes. Therefore, online links between RefSeq genomes (or their annotated proteins) and the Gene resource are available only for that subset of RefSeq prokaryotic genomes. Protein-coding regions (CDS features) include cross-references to RefSeq non-redundant protein accessions (WP_ prefix). A given non-redundant protein accession may be annotated on more than one genome.

A subset of the RefSeq archaeal and bacterial genomes is categorized as either 'reference' or 'representative' genomes. This classification makes it easier to study sequence variation: proteins from organisms with variable levels of genome sequence quality, or with small sequence variations that arise during the course of an acute infectious disease outbreak, can be compared to those from the highest-quality reference or representative genomes. This organization of RefSeq prokaryotic genome data also better supports comparisons across broad taxonomic ranges, as well as within a species or clade. However, a genome assembly may be excluded from RefSeq for reasons related to assembly quality or completeness.

Reference genomes

RefSeq reference genomes represent the highest-quality dataset; they are curated by NCBI scientific staff and, in many cases, by collaborators as well. Some reference genomes are selected based on a long history of collaboration and wide recognition as a community standard, such as the reference genome of Escherichia coli str. K-12 substr. MG1655. Other reference genomes are selected based on medical importance, assembly and annotation quality, and the availability of experimental support. NCBI generates annotation for these genomes using our prokaryotic genome annotation pipeline and compares the results to the submitted genome annotation that is available in GenBank. NCBI staff scientists review the annotation to resolve differences, to add annotation of missing genes and of features such as pathogenicity islands and virulence factors, and to note experimental evidence for a protein. Reference genomes are annotated with YP_ or NP_ protein accessions, which in turn cross-reference the non-redundant protein records. Reference genomes are also annotated with a GeneID cross-reference to NCBI's Gene resource. You can browse the list of reference genomes in NCBI's Genome resource, retrieve them by searching the Assembly resource, or download a report file from the FTP site. Reference genome records, such as the nucleotide record from Fusobacterium nucleatum, NC_003454.1, include a custom comment in the COMMENT section that may include attributes indicating why the genome was selected as a reference genome.

Image of the Reference Genome comment on NC_003454.1.

The protein product name that is provided on these records most often reflects the primary community-curated data that is available in GenBank for that genome assembly. For example, the RefSeq reference genome for Bacillus anthracis str. Ames (NC_003997.3) reflects the annotation that was submitted to GenBank for accession AE016879.1.

  1. Annotation on GenBank accession AE016879.1: Image of the first CDS feature annotated on AE016879.1
  2. Annotation on reference genome NC_003997.3 (from 1-2,000 bp), derived from AE016879.1: Image of the first CDS feature annotated on NC_003997.3

The annotated protein record NP_842573.1 includes an additional line, "CONTIG", which is not typically found on protein records. The CONTIG line provides a link to the identical RefSeq protein WP_000428021.1 that is part of the non-redundant protein dataset.

Image of NP_842573.1 showing the CONTIG line with link to identical non-redundant RefSeq protein WP_000428021.1
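
The accession prefixes described above can be told apart programmatically. The following is a minimal Python sketch, not an NCBI tool: the function name, the descriptive labels, and the regular expression for the accession.version shape are assumptions made for illustration, and only the prefixes mentioned in this document are covered.

```python
import re

# Prefixes taken from this document; the descriptive labels are illustrative.
PROTEIN_PREFIXES = {
    "WP_": "non-redundant protein accession",
    "NP_": "protein accession annotated on a reference genome",
    "YP_": "protein accession annotated on a reference genome",
}

# Assumed shape: two uppercase letters, underscore, digits, dot, version number.
ACCESSION_RE = re.compile(r"^([A-Z]{2}_)(\d+)\.(\d+)$")


def classify_protein_accession(accession: str) -> str:
    """Return a short description of a RefSeq protein accession by prefix."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return "not an accession.version string"
    # Prefixes outside the table (for example NC_ nucleotide records) fall through.
    return PROTEIN_PREFIXES.get(match.group(1), "prefix not covered here")


if __name__ == "__main__":
    for acc in ("NP_842573.1", "WP_000428021.1"):
        print(acc, "->", classify_protein_accession(acc))
```

Here NP_842573.1 is reported as a reference genome protein accession and WP_000428021.1 as the non-redundant record that it links to.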

Representative genomes

Additional high-quality genomes are identified by clustering genomes and applying weighting metrics that consider species-level taxonomic classification (e.g., a preference for type strains) and assembly quality (e.g., a preference for complete genomes, although WGS assemblies are allowed). Additional quality assurance analysis is being added to take annotation quality metrics into account, such as the number of frameshifted proteins (compared to close neighbors), the presence of the expected set of rRNAs and tRNAs, and gene density. We also take taxonomic diversity into consideration and will include in the representative genome collection some genomes that are taxonomic outliers for which little functional information is available. Representative genomes are annotated with non-redundant RefSeq protein accessions (WP_ accession prefix) and display the protein product name that appears on the WP-accessioned record (see Protein data model below). The best-supported subset of representative genomes is annotated with a GeneID cross-reference to NCBI's Gene resource; currently, this is provided for representative genomes having at least 10 nearly identical variant genomes. Representative genomes are provided for clades and species that do not have a designated reference genome. You can search Assembly to retrieve the set of prokaryotic representative genomes, or download a report file from the FTP site.
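
To make the idea of choosing one genome per cluster concrete, here is a small Python sketch. It is not NCBI's selection procedure: the actual weighting metrics are not specified in this document, so the sketch substitutes a simple lexicographic ordering (type strain first, then complete assembly, then fewer frameshifted proteins). The class, field names, and genome labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class GenomeCandidate:
    name: str                   # hypothetical label, not a real accession
    is_type_strain: bool        # preference stated in the text
    is_complete: bool           # False stands for a WGS assembly
    frameshifted_proteins: int  # lower is treated as better


def selection_key(genome: GenomeCandidate) -> Tuple[bool, bool, int]:
    # Tuples compare left to right, so earlier fields dominate later ones.
    # This ordering is an assumption standing in for the real weights.
    return (genome.is_type_strain, genome.is_complete, -genome.frameshifted_proteins)


def pick_representative(cluster: Sequence[GenomeCandidate]) -> GenomeCandidate:
    """Return the best-ranked genome of a non-empty cluster."""
    if not cluster:
        raise ValueError("cluster must contain at least one genome")
    return max(cluster, key=selection_key)


if __name__ == "__main__":
    cluster = [
        GenomeCandidate("strain_A", False, True, 3),
        GenomeCandidate("strain_B", True, False, 12),
        GenomeCandidate("strain_C", True, True, 20),
    ]
    print(pick_representative(cluster).name)
```

With these toy values the type strain with a complete assembly is chosen even though it has more frameshifted proteins, which shows how strongly the result depends on the assumed ordering.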

Variant genomes

The large remainder of prokaryotic genomes that are not tracked as either a reference or representative genome may include some taxonomically diverse organisms that have more fragmented genomes. However, the vast majority of these records represent isolate- and strain-specific RefSeq genomes. These sequenced genomes primarily represent sequence variation and so are not tracked as a reference or representative genome. The RefSeq project continues to provide annotation for these genomes at this time, as many are of medical importance. Variant genomes are annotated with non-redundant RefSeq protein accessions (WP_ accession prefix) and display the protein product name that appears on the WP-accessioned record.
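
The annotation differences among the three genome categories can be summarized in a few lines of Python. This is only a restatement of the rules given in this document, with hypothetical names (`AnnotationPolicy`, `annotation_policy`, `GENEID_MIN_VARIANTS`); it does not query any NCBI service.

```python
from dataclasses import dataclass
from typing import Tuple

# Threshold stated in the text for representative genomes ("at least 10").
GENEID_MIN_VARIANTS = 10


@dataclass(frozen=True)
class AnnotationPolicy:
    protein_prefixes: Tuple[str, ...]  # accession prefixes annotated on CDS features
    has_gene_id: bool                  # whether a GeneID cross-reference is annotated


def annotation_policy(category: str, nearly_identical_variants: int = 0) -> AnnotationPolicy:
    """Map a genome category to the annotation described in this document."""
    if category == "reference":
        return AnnotationPolicy(("NP_", "YP_"), True)
    if category == "representative":
        # GeneID is provided only for the best-supported representatives.
        return AnnotationPolicy(("WP_",), nearly_identical_variants >= GENEID_MIN_VARIANTS)
    if category == "variant":
        return AnnotationPolicy(("WP_",), False)
    raise ValueError(f"unknown genome category: {category!r}")


if __name__ == "__main__":
    print(annotation_policy("representative", nearly_identical_variants=12))
```

The threshold is described in the text as the current practice, so it is kept as a named constant rather than hard-coded in the function.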

Protein data model

With the exception of reference genomes, only non-redundant protein accessions (WP_ accession prefix) will be annotated on new or re-annotated RefSeq prokaryotic WGS and Complete genomes. A single non-redundant protein may be annotated on many RefSeq genomes when the CDS annotated on each of those genomes encodes the same protein, identical in both sequence and length. For example, the coding sequence for the 50S ribosomal protein L11 that is annotated on NC_017743.1 provides a cross-link, shown below, to the non-redundant RefSeq protein WP_003156430.1. Approximately 75 prokaryotic genomes are annotated with a CDS feature that encodes this identical sequence of the same length, as shown in the Identical Protein report, which can be accessed by clicking on the "Identical Proteins" link near the top of the protein record.

Image of CDS feature for 50S ribosomal protein L11 as annotated on NC_017743.1. The CDS cross-references nonredundant protein WP_003156430.1.
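
The non-redundant model described in this section amounts to grouping CDS products by exact sequence. The Python sketch below illustrates that idea with toy data; the genome labels and peptide strings are invented, and the function is a hypothetical helper rather than the way NCBI builds the Identical Protein report.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_identical_proteins(
    cds_records: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Group genomes by the exact protein sequence their CDS encodes.

    Each record is (genome_label, protein_sequence). String equality
    requires the same residues and the same length, which matches the
    identity rule described in the text.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for genome, sequence in cds_records:
        groups[sequence].append(genome)
    return dict(groups)


if __name__ == "__main__":
    # Toy records: genome_C differs by one extra residue, so it is not identical.
    records = [
        ("genome_A", "MAKKV"),
        ("genome_B", "MAKKV"),
        ("genome_C", "MAKKVT"),
    ]
    for sequence, genomes in group_identical_proteins(records).items():
        print(sequence, genomes)
```

In this picture each distinct key would correspond to one WP_ accession, and the list of genomes corresponds to the entries of an Identical Protein report.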

Last updated: 2018-06-12T17:24:12Z