NCBI RefSeq Functional Elements

Overview

NCBI provides RefSeq and Gene records for non-genic functional elements that have been described in the literature and are experimentally validated. Elements in scope include experimentally-verified gene regulatory regions (e.g., enhancers, silencers, locus control regions), known structural elements (e.g., insulators, DNase I hypersensitive sites, matrix/scaffold-associated regions), well-characterized DNA replication origins, and clinically-significant sites of DNA recombination and genomic instability. Priority is given to genomic regions that are implicated in human disease or are otherwise of significant interest to the research community. The project is currently restricted to human and mouse. It does not include functional elements predicted from large-scale epigenomic mapping studies, nor elements defined solely on the basis of disease-associated variation.

Each functional element RefSeq has a corresponding record in NCBI's Gene database (see example in Figure 1). NCBI Gene records for functional elements differ from conventional genes in that they have the Gene type 'biological region.' All functional element Gene records also include a list and a graphical view of annotated feature types, a brief summary of the function of the region, a list of related INSDC accessions, and a comprehensive bibliography of relevant publications. A link to the orthologous human or mouse record is provided where appropriate.

Gene Summary

Figure 1. An example of an NCBI Gene record for a functional element (only the Summary section is shown here). Note that the Gene type is 'biological region' and 'Feature type(s)' are listed.

Functional Element RefSeq Records

Functional element RefSeq records are represented as follows:

  • As genomic RefSeqs with an NG_ prefix (e.g., NG_046887.1).
  • As DNA sequences encompassing the genomic range of one or more experimentally-validated functional elements.
  • Based on the plus strand of the current human or mouse reference genome assembly, unless otherwise indicated.
  • With 100-nt padding on each end for extra genomic context.
  • Elements considered to be functionally related and closely situated in the genome are included together on the same NG_ record (e.g., an enhancer and contained protein-binding sites; multiple nearby enhancer and/or promoter fragments).
  • Experimentally-validated features are annotated on the record through manual curation by NCBI RefSeq scientific experts as described below.
  • Manually curated functional element RefSeq records have a REVIEWED status (see About RefSeq for status descriptions).
  • Records generated through automatic bulk processing, such as the validated dataset from the VISTA Enhancer Browser, have a PROVISIONAL status.
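As a rough illustration of how the record range relates to the elements it contains, the following minimal Python sketch computes a padded span from a set of element coordinates. The function name, the example coordinates, and the clamping at position 1 are assumptions made for illustration; the page does not describe how NCBI computes record boundaries beyond the 100-nt padding.

```python
PADDING_NT = 100  # padding on each end, as stated in the list above


def padded_span(elements, padding=PADDING_NT):
    """Return the (start, end) range covering all elements plus padding.

    `elements` is a list of (start, end) pairs in 1-based, inclusive
    coordinates. Clamping the start at position 1 is an assumption of
    this sketch, not documented behavior.
    """
    if not elements:
        raise ValueError("at least one element is required")
    start = min(s for s, _ in elements)
    end = max(e for _, e in elements)
    return max(1, start - padding), end + padding


# Hypothetical coordinates for an enhancer and a nearby promoter fragment
# that would be placed together on one NG_ record.
span = padded_span([(5000, 5600), (5900, 6200)])
print(span)  # (4900, 6300)
```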

Functional Element RefSeq Feature Annotation

Functional element NG_ RefSeqs include manually curated features in accordance with International Nucleotide Sequence Database Collaboration (INSDC) standards. Features that are supported by direct experimental evidence include at least one '/experiment' qualifier with an evidence code (ECO ID) from the Evidence & Conclusion Ontology, and at least one citation from PubMed. It is important to note that annotated sequence ranges may be approximate depending on the experimental evidence type, and that features may include extraneous sequences that are not necessary for function. Feature annotation can be viewed on the RefSeq Nucleotide flat file (Figure 2), in the graphical view in Gene records (Figure 3), and in NCBI genome browsers (see Access via NCBI Graphical Displays below).
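For readers who process flat files programmatically, the sketch below pulls ECO IDs and PubMed IDs out of an '/experiment' qualifier value. The bracketed layout of the example string is an assumption made for illustration, and the IDs in it are placeholders; check an actual record (such as the one shown in Figure 2) for the exact form before relying on it.

```python
import re

# ECO identifiers are written as "ECO:" followed by seven digits.
ECO_RE = re.compile(r"ECO:\d{7}")
PMID_RE = re.compile(r"PMID:(\d+)")


def parse_experiment(value):
    """Return the ECO IDs and PubMed IDs found in an /experiment value."""
    return {
        "eco_ids": ECO_RE.findall(value),
        "pmids": [int(p) for p in PMID_RE.findall(value)],
    }


# Invented qualifier value; the text, ECO ID and PMID are placeholders.
example = "EXISTENCE:reporter gene assay evidence [ECO:0000049][PMID:12345678]"
print(parse_experiment(example))
```

Because the patterns match the identifiers themselves rather than their position in the string, the function should tolerate minor differences in how the surrounding text is laid out.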

Flat File Features

Figure 2. Example of a functional element NG_ flat file and description of feature annotation and common qualifiers.

Graphical View

Figure 3. Feature annotation in the Gene graphical display. Additional track sets, including conventional gene annotation, repeats, and variation, may be exposed using the Tracks button (see NCBI Sequence Viewer Documentation).

Feature Annotation Glossary

Features are annotated on functional element NG_ records at the discretion of RefSeq curators based on review of the experimental literature. Annotated features conform to the INSDC Feature Table specifications, in which some feature keys have specific feature classes (e.g., the 'misc_recomb' and 'regulatory' feature keys). In addition, RefSeq-specific controlled vocabulary terms are sometimes used to provide further specificity for 'misc_feature' features, and for 'misc_recomb' or 'regulatory' features that are not defined by a specific feature class. Most feature keys, feature classes and controlled vocabularies can be mapped to equivalent terms in the Sequence Ontology (SO), and those SO terms are usually used as SO_types for genome-annotated features in column 3 of NCBI GFF3 files (see below). The following feature types are used for RefSeq Functional Elements, with equivalent SO IDs shown in parentheses:

misc_feature

Used for functionally significant features that currently lack a more specific INSDC feature key. Controlled vocabularies are provided to facilitate bulk search and retrieval. In GenBank flat files, the controlled vocabulary term appears at the beginning of the '/note' qualifier and is separated from any additional '/note' text by a semicolon. In ASN.1 and GFF3 formats, the same terms are written with underscores in place of spaces.

Flat file qualifier example: /note="conserved region; ultraconserved element uc.328"
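The convention above can be sketched in a few lines of Python. The function names are hypothetical, and the code assumes the '/note' value really does begin with a controlled vocabulary term; a note without one would be split incorrectly.

```python
def split_note(note):
    """Split a /note value into (controlled_term, additional_text).

    Assumes the controlled vocabulary term comes first and is separated
    from any remaining text by the first semicolon.
    """
    term, _, rest = note.partition(";")
    return term.strip(), rest.strip()


def to_underscore_form(term):
    """Convert a flat-file term to the form used in ASN.1 and GFF3."""
    return term.replace(" ", "_")


term, text = split_note("conserved region; ultraconserved element uc.328")
print(to_underscore_form(term), "|", text)
# conserved_region | ultraconserved element uc.328
```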

RefSeq controlled vocabularies for 'misc_feature':

  • biological_region (SO:0001411; Special note: This is a parental feature spanning all other feature annotation on each functional element RefSeq. It is a 'misc_feature' in GenBank flat files but a 'Region' feature in ASN.1 and GFF3 formats.)
  • CAGE_cluster (SO:0001917)
  • conserved_region (SO:0000330)
  • nucleotide_cleavage_site (SO:0002204)
  • nucleotide_motif (SO:0000714)
  • repeat_instability_region (SO:0002202)
  • replication_start_site (SO:0002203)
  • sequence_alteration (SO:0001059)
  • sequence_comparison (SO:0002072)
  • sequence_feature (SO:0000110)
  • transcription_start_site (SO:0000315)

misc_recomb

Used for genomic regions known to undergo recombination events. See INSDC's Controlled vocabulary for /recombination_class for details on 'recombination_class' types.

Flat file qualifier example: /recombination_class="non_allelic_homologous"

INSDC 'recombination_class' types used for RefSeq Functional Elements:

RefSeq controlled vocabularies for recombination_class="other":

misc_structure (SO:0000002)

Used for miscellaneous structural regions that cannot be annotated by another feature and are characterized in the literature as being important for the function of a region. Additional detail is provided in flat file '/note' qualifiers.

mobile_element (SO:0001037)

Used for mobile elements, including transposable elements, retrotransposons and endogenous retroviruses, that are described in the literature as being functionally significant and/or represent genomic landmarks for a region. The repeat family (e.g., SINE:AluSg) is indicated in the '/mobile_element_type' qualifier. An '/inference' qualifier may be provided with a reference to a publicly accessible algorithm, e.g., RepeatMasker:4.0.5. Note that the vast majority of mobile elements located within the span of a functional element NG_ will not be annotated. Mobile elements can be viewed in NCBI graphical displays by exposing the 'Repeats identified by RepeatMasker' track when viewing annotation on the genome.

protein_bind (SO:0000410)

Used where there is experimental evidence of direct protein binding to a DNA fragment; e.g., electrophoretic mobility shift assay (EMSA) or DNase I footprint evidence for binding of a specific protein, family, or complex. Predicted binding sites lacking experimental validation, or validated solely by chromatin immunoprecipitation, are not annotated. The annotated range is based on the experimental fragment described in the literature (e.g., an EMSA probe) and is typically longer than the core binding motif. The bound protein name (or protein family name) is provided in the '/bound_moiety' qualifier, where the protein name is derived from the HUGO Gene Nomenclature Committee official full name of the encoding gene (e.g., CTCF is 'CCCTC-binding factor').

regulatory

See INSDC's Controlled vocabulary for /regulatory_class for details on 'regulatory_class' types. Typically, the annotated region corresponds to an experimentally-defined fragment that was found to be sufficient for function (e.g., a fragment used in a reporter assay). Annotated sequences may therefore include extraneous sequences that are not necessary for function.

Specific notes:

  • Short feature motifs (i.e., CAAT_signal, TATA_box, GC_signal) lacking direct experimental validation may occasionally be annotated when described in a publication as a significant genomic landmark.

  • DNase_I_hypersensitive_site ranges are determined on a case-by-case basis based on examination of published experimental evidence. They typically include generous padding equivalent to at least one nucleosome plus linker span (~200 nt) in addition to the determined core site. Closely situated sites that are difficult to resolve may be combined into a single annotated feature, as indicated in the '/note' qualifier.

INSDC 'regulatory_class' types used for RefSeq Functional Elements:

RefSeq controlled vocabularies for regulatory_class="other":

repeat_region

Used for tandem repeats, microsatellites, and other low-complexity repeats that are described in the literature as being functionally significant and/or represent genomic landmarks for a region. Note that not all low-complexity regions located within the span of a functional element NG_ will be annotated. An '/inference' qualifier may be provided with a reference to a publicly accessible algorithm, e.g., RepeatMasker:4.0.5.

INSDC repeat types used for RefSeq Functional Elements:

rep_origin (SO:0000296)

Used for DNA replication origins that are well-supported and reproducibly observed in the literature. A '/direction' qualifier may be set when it is known that the origin fires in one or both directions.

stem_loop (SO:0000313)

Used for stem-loop regions that are characterized in the literature as being important for the function of a region. Note that a stem-loop structure is formed by complementary pairing on the same strand, and frequently corresponds to a cruciform-like structure in double-stranded DNA. Details are provided in the '/note' qualifier.

Data Access

Functional element RefSeq records can be accessed via the following NCBI resources:

Access via Gene

Functional element records can be found in Gene by searching in a variety of ways, including by record names or symbols, associated PubMed IDs, accession IDs, annotated chromosome and base locations, organism, text words, and properties (e.g., genetype biological region[prop]). Additional options can be found in the Gene Advanced Search Builder. Results can be further filtered by selecting side facets in search results pages. See the Gene Help document for more information on querying Gene.

Example queries:

To find all represented functional elements in human: genetype biological region[prop] AND homo sapiens[orgn]

To find named recombination regions in human: genetype biological region[prop] AND homo sapiens[orgn] AND recombination region[gene/protein name]

To find human records that include a locus control region feature: genetype biological region[prop] AND homo sapiens[orgn] AND feattype locus control region[prop]

To find VISTA enhancer records in mouse: genetype biological region[prop] AND mus musculus[orgn] AND VISTA*[gene/protein name]
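The same query strings can also be submitted programmatically. The sketch below builds a request URL for the NCBI E-utilities ESearch service, which this page does not itself mention; treat the choice of E-utilities and the helper name as assumptions, and consult the E-utilities documentation for usage limits before sending requests. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_search_url(term, retmax=20):
    """Build an ESearch URL that runs `term` against the Gene database."""
    params = {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


# First example query from the list above.
query = "genetype biological region[prop] AND homo sapiens[orgn]"
print(gene_search_url(query))
```

`urlencode` takes care of escaping spaces and square brackets, so the query can be written exactly as it would be typed into the Gene search box.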

Access via Nucleotide

The Nucleotide database displays functional element RefSeqs in GenBank flat file format by default. See the Entrez Sequences Help document for details on Nucleotide record display options, including instructions on how to retrieve FASTA sequences for specific features annotated on each RefSeq.

Functional element records can be queried in the Nucleotide database in several ways, including by names, symbols, accession ID, organism or associated publications. See the Nucleotide Advanced Search Builder for further options. Queries by BioProject ID (see Access via BioProject below) and by Feature key are particularly useful for retrieving functional element RefSeqs.

Example queries:

To retrieve all human functional element RefSeqs: homo sapiens[orgn] AND PRJNA343958[bioproject]

To retrieve mouse functional element RefSeqs with an annotated enhancer feature: regulatory enhancer[feature key] AND mus musculus[orgn] AND PRJNA343958[bioproject]

To retrieve human functional element RefSeqs with an annotated rep_origin feature: rep origin[feature key] AND homo sapiens[orgn] AND PRJNA343958[bioproject]

Access via BLAST

Functional element RefSeqs are in NCBI's Nucleotide database, so matching RefSeqs can be retrieved through Nucleotide BLAST sequence searches when one of the following databases is selected in the 'Choose Search Set' area:

  • Nucleotide collection (nr/nt) -- pull-down menu selection
  • Reference genomic sequences (refseq_genomic) -- pull-down menu selection

Access via BioProject

All functional element RefSeqs are represented in the BioProject accession PRJNA343958. Sequence records can be retrieved from links within the 'Project Data' section. We recommend that all Nucleotide database queries be appended with that BioProject accession for retrieval of RefSeq functional element records (see Access via Nucleotide example queries above).
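When Nucleotide queries are generated by a script, the recommendation above can be applied automatically. The following is a minimal sketch with a hypothetical helper name; wrapping the original term in parentheses is a design choice of this sketch, made so that a term containing OR keeps its intended grouping.

```python
BIOPROJECT = "PRJNA343958"


def scoped_query(term, bioproject=BIOPROJECT):
    """Append the BioProject filter to a Nucleotide query if it is missing."""
    tag = f"{bioproject}[bioproject]"
    # Tolerate a stray space before the field tag when checking for it.
    normalized = term.lower().replace(" [", "[")
    if tag.lower() in normalized:
        return term
    return f"({term}) AND {tag}"


print(scoped_query("rep origin[feature key] AND homo sapiens[orgn]"))
# (rep origin[feature key] AND homo sapiens[orgn]) AND PRJNA343958[bioproject]
```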

Access via NCBI Graphical Displays

All genome-annotated features from functional element RefSeqs can be viewed by opening the 'Biological regions' track available in the 'NCBI Other Features' category in NCBI graphical displays, such as the Genome Data Viewer, Sequence Viewer and graphical images in Gene records. For RefSeq graphical images, individual feature types may be viewed by opening the desired track types in the 'Features' category (e.g., 'regulatory' Features, 'protein_bind' Features); see the track configuration example in Figure 4. Note that some tracks may be viewable by default depending on the NCBI browser image, e.g., the 'Biological regions' track will be on view in Gene records when RefSeq features are annotated on the reference genome. If the RefSeq has been annotated only in an interim annotation, the appropriate interim annotation track must be loaded, because only full annotation tracks are displayed by default. Note that the 'Biological regions' track does not include overlapping or nearby conventional gene annotations, which can be found in the 'Genes' track (Figure 3). Similarly, users should refer to 'Variation' type tracks to see overlapping variation features (e.g., dbSNP, ClinVar or dbVar tracks).

Graphical view example: https://goo.gl/6LgWQ7

Configure Page

Figure 4. Track configuration page example for a functional element RefSeq (NG_046887.1). The 'Biological regions' track can be selected in the 'Other tracks' category and different feature types in the 'Features' category.

Access via FTP

Functional element RefSeqs are available for download from the NCBI FTP site, including the following specific locations:

  • RefSeq FTP -- all RefSeq records including those not yet annotated on a genome assembly. Includes biological_region subdirectories containing weekly updated FASTA- and GenBank flat file-formatted functional element RefSeqs for human and mouse.
  • Genomes FTP -- genome-annotated feature data for human and mouse. See the Feature Annotation Glossary above for feature types used in RefSeq Functional Elements. For extracting specific feature types from GFF3 files (e.g., enhancers), please note that equivalent Sequence Ontology (SO) terms are usually used as SO_types in column 3. Alternatively, specific feature types may be extracted from GFF3 files based on feature key, feature class or controlled vocabulary attributes in column 9 (e.g., regulatory_class=enhancer_blocking_element).
  • Gene FTP -- all RefSeqs and associated Gene data, and genomic context if annotated on a genome assembly.
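The two GFF3 extraction approaches described for the Genomes FTP files (by SO_type in column 3, or by attributes in column 9) can be sketched as follows. The function names are hypothetical and the sample rows are invented placeholders, including their sequence IDs, source names, types and coordinates; they are not copied from an NCBI file.

```python
def parse_attributes(column9):
    """Parse GFF3 column 9 ("key=value;key=value") into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def select_features(lines, so_type=None, **attributes):
    """Yield (columns, attrs) for rows matching column 3 and/or column 9."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature row
        if so_type is not None and cols[2] != so_type:
            continue
        attrs = parse_attributes(cols[8])
        if all(attrs.get(k) == v for k, v in attributes.items()):
            yield cols, attrs


# Invented rows for illustration only.
SAMPLE = [
    "##gff-version 3",
    "chrN\tSRC\tenhancer\t1000\t1500\t.\t.\t.\tID=id-1;regulatory_class=enhancer",
    "chrN\tSRC\tenhancer_blocking_element\t2000\t2300\t.\t.\t.\t"
    "ID=id-2;regulatory_class=enhancer_blocking_element",
    "chrN\tSRC\tbiological_region\t900\t2400\t.\t.\t.\tID=id-3",
]

by_type = [a["ID"] for _, a in select_features(SAMPLE, so_type="enhancer")]
by_attr = [
    a["ID"]
    for _, a in select_features(
        SAMPLE, regulatory_class="enhancer_blocking_element"
    )
]
print(by_type, by_attr)  # ['id-1'] ['id-2']
```

To run this against a downloaded file, pass an open file handle in place of `SAMPLE`; the generator reads one line at a time, so large genome annotation files do not need to fit in memory.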

Last updated: 2018-04-26T19:30:32Z