NCBI RefSeq Functional Elements


NCBI provides RefSeq and Gene records for non-genic functional elements that have been described in the literature and are experimentally validated. Elements in scope include experimentally-verified gene regulatory regions (e.g., enhancers, silencers, locus control regions), known structural elements (e.g., insulators, DNase I hypersensitive sites, matrix/scaffold-associated regions), well-characterized DNA replication origins, and clinically-significant sites of DNA recombination and genomic instability. Priority is given to genomic regions that are implicated in human disease or are otherwise of significant interest to the research community. Currently, the scope of this project is restricted to human and mouse. It does not include functional elements predicted from large-scale epigenomic mapping studies, nor elements defined solely on the basis of disease-associated variation.

Each RefSeq Functional Element sequence has a corresponding record in NCBI's Gene database (see example in Figure 1). NCBI Gene records for Functional Elements differ from conventional genes in that they have the Gene type 'biological region.' All Functional Element Gene records include a list and a graphical view of annotated feature types, a brief summary of the function of the region, a list of related INSDC accessions, and a comprehensive bibliography of relevant publications. A link to the orthologous human or mouse record is provided where appropriate.

Gene Summary

Figure 1. An example of an NCBI Gene record for a biological region (only the Summary section is shown here). Note that the 'Gene type' is 'biological region' and 'Feature type(s)' are listed.

RefSeq Functional Element Records

RefSeq Functional Element sequences are represented as follows:

  • As genomic RefSeqs with an NG_ prefix (e.g., NG_046887.1)
  • As DNA sequences encompassing the genomic range of one or more experimentally-validated functional elements.
  • Based on the plus strand of the current human or mouse reference genome assembly, unless otherwise indicated.
  • With 100-nt padding on each end for extra genomic context.
  • Elements considered to be functionally related and closely situated in the genome are included together on the same NG_ record (e.g., an enhancer and contained protein-binding sites; multiple nearby enhancer and/or promoter fragments).
  • Experimentally-validated features are annotated on each sequence record through manual curation by NCBI RefSeq staff as described below.
  • Manually curated RefSeq Functional Element records have a REVIEWED status (see About RefSeq for status descriptions).
  • Records generated through automatic bulk processing, such as the validated dataset from the VISTA Enhancer Browser, have a PROVISIONAL status.
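The 100-nt padding described above is simple coordinate arithmetic. The following is a minimal sketch, assuming 1-based inclusive coordinates; the function name `padded_range` and the clamping at position 1 are illustrative assumptions, not part of any NCBI tool:

```shell
# Hypothetical helper: given the 1-based start and end of a curated element
# range, print the padded start, padded end, and padded length.
PAD=100

padded_range() (
  pstart=$(( $1 - PAD ))
  pend=$(( $2 + PAD ))
  # Assumption: clamp at the first base if the padding would run off the start.
  if [ "$pstart" -lt 1 ]; then pstart=1; fi
  echo "$pstart $pend $(( pend - pstart + 1 ))"
)

padded_range 5000 5999   # prints: 4900 6099 1200
```

Under these assumptions, a 1,000-nt element range yields a 1,200-nt record in which the curated range occupies positions 101 to 1100.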

RefSeq Functional Element Feature Annotation

RefSeq Functional Element sequences include manually curated features in accordance with International Nucleotide Sequence Database Collaboration (INSDC) standards. Features that are supported by direct experimental evidence include at least one '/experiment' qualifier with an evidence code (ECO ID) from the Evidence & Conclusion Ontology, and at least one citation from PubMed. It is important to note that annotated sequence ranges may be approximate depending on the experimental evidence type, and that features may include extraneous sequences that are not necessary for function. Feature annotation can be viewed on the RefSeq Nucleotide flat file (Figure 2), in the graphical view in Gene records (Figure 3), and in NCBI genome browsers (see Access via NCBI Graphical Displays below).

Flat File Features

Figure 2. Example of a RefSeq Functional Element NG_ flat file and descriptions of feature annotation and common qualifiers.

Graphical View

Figure 3. Feature annotation in the Gene graphical display. Additional track sets, including conventional gene annotation, repeat region and variation tracks, may be exposed using the Tracks button (see NCBI Sequence Viewer Documentation).

Feature Annotation Glossary

Features are annotated on RefSeq Functional Element NG_ records based on review of the scientific literature. Annotated features are in accord with INSDC Feature Table specifications, where some INSDC feature keys have specific feature classes, e.g., the 'misc_recomb' and 'regulatory' feature keys. In addition, RefSeq-specific controlled vocabulary terms are sometimes used to provide further feature specificity, e.g., for 'misc_feature' features, or for 'misc_recomb' or 'regulatory' features that are not defined by a specific feature class. The feature keys, feature classes and controlled vocabularies can be mapped to equivalent terms in the Sequence Ontology (SO); those SO terms are used as SO_types for genome-annotated features in column 3 of NCBI GFF3 files (see the feature table below). The following feature types are used for RefSeq Functional Elements, with equivalent SO IDs shown in parentheses:


misc_feature

Used for functionally significant features that currently lack a more specific INSDC feature key. Controlled vocabularies are provided for additional feature specificity and to facilitate bulk search and retrieval. In GenBank flat files, controlled vocabulary terms are used at the beginning of a '/note' qualifier and are separated from any additional '/note' text by a semicolon. In ASN.1 and GFF3 formats, the same terms are written with underscores in place of spaces.

Flat file qualifier example: /note="conserved region; ultraconserved element uc.328"
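The relationship between the flat file '/note' text and the underscore form can be sketched as follows. This is a minimal illustration, assuming the note value begins with a controlled vocabulary term; the function name `note_term` is hypothetical:

```shell
# Hypothetical helper: print the leading controlled vocabulary term of a
# '/note' value, converted to the underscore form used in ASN.1 and GFF3.
note_term() (
  term=${1%%;*}                           # keep the text before the first semicolon
  printf '%s\n' "$term" | tr ' ' '_'      # replace spaces with underscores
)

note_term "conserved region; ultraconserved element uc.328"   # prints: conserved_region
```

A note that does not start with a controlled vocabulary term would simply be returned with its spaces replaced, so the output should be checked against the vocabulary list.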

RefSeq controlled vocabularies for 'misc_feature':

  • biological_region (SO:0001411; Special note: This is a parental feature spanning all other feature annotation on each RefSeq Functional Element record. It is a 'misc_feature' in GenBank flat files but a 'Region' feature in ASN.1 and GFF3 formats.)
  • CAGE_cluster (SO:0001917)
  • conserved_region (SO:0000330)
  • nucleotide_cleavage_site (SO:0002204)
  • nucleotide_motif (SO:0000714)
  • repeat_instability_region (SO:0002202)
  • replication_start_site (SO:0002203)
  • sequence_alteration (SO:0001059)
  • sequence_comparison (SO:0002072)
  • sequence_feature (SO:0000110)
  • transcription_start_site (SO:0000315)


misc_recomb

Used for genomic regions known to undergo recombination events. See INSDC's Controlled vocabulary for recombination_class for details on 'recombination_class' types.

Flat file qualifier example: /recombination_class="non_allelic_homologous"

INSDC 'recombination_class' types used for RefSeq Functional Elements:

RefSeq controlled vocabularies for recombination_class="other":

misc_structure (SO:0000002)

Used for miscellaneous structural regions that cannot be annotated by another feature and are characterized in the literature as being important for the function of a region. Additional details are provided in flat file '/note' qualifiers.

mobile_element (SO:0001037)

Used for mobile elements, including transposable elements, retrotransposons and endogenous retroviruses, that are described in the literature as being functionally significant and/or represent genomic landmarks for a region. The repeat family (e.g., SINE:AluSg) is indicated in the '/mobile_element_type' qualifier. An '/inference' qualifier may be provided with a reference to a publicly accessible algorithm, e.g., RepeatMasker:4.0.5. Note that the vast majority of mobile elements located within the span of a RefSeq Functional Element NG_ will not be annotated. Mobile elements can be viewed in NCBI graphical displays by exposing the 'Repeats identified by RepeatMasker' track when viewing annotation on the genome.

protein_bind (SO:0000410)

Used where there is experimental evidence of direct protein binding to a DNA fragment, e.g., electrophoretic mobility shift assay (EMSA) or DNase I footprint evidence for binding of a specific protein, family, or complex. Predicted binding sites that lack experimental validation, or which are validated solely by chromatin immunoprecipitation, are not annotated. The annotated range is based on the experimental fragment described in the literature (e.g., an EMSA probe) and is typically longer than the core binding motif. The bound protein name (or protein family name) is provided in the '/bound_moiety' qualifier, where the protein name is derived from the HUGO Gene Nomenclature Committee official full name of the encoding gene (e.g., CTCF is 'CCCTC-binding factor').


regulatory

See INSDC's Controlled vocabulary for regulatory_class for details on 'regulatory_class' types. Typically, the annotated region corresponds to an experimentally-defined fragment that was found to be sufficient for function, e.g., a fragment used in a reporter assay. Annotated sequences may therefore include extraneous sequences that are not necessary for function.

Specific notes:

  • Short feature motifs (i.e., CAAT_signal, TATA_box, GC_signal) that lack direct experimental validation may occasionally be annotated when described in a publication as a significant genomic landmark.

  • DNase_I_hypersensitive_site ranges are determined on a case-by-case basis based on examination of published experimental evidence. They typically include generous padding equivalent to at least one nucleosome plus linker span (~200 nt) in addition to the determined core site. Closely situated sites that are difficult to resolve may be combined into a single annotated feature, as indicated in the '/note' qualifier.

INSDC 'regulatory_class' types used for RefSeq Functional Elements:

RefSeq controlled vocabularies for regulatory_class="other":


repeat_region

Used for tandem repeats, microsatellites, and other low-complexity repeats that are described in the literature as being functionally significant and/or represent genomic landmarks for a region. Note that not all low-complexity regions located within the span of a RefSeq Functional Element NG_ will be annotated. An '/inference' qualifier may be provided with a reference to a publicly accessible algorithm, e.g., RepeatMasker:4.0.5.

INSDC repeat types used for RefSeq Functional Elements:

rep_origin (SO:0000296)

Used for DNA replication origins that are well-supported and reproducibly observed in the literature. A '/direction' qualifier may be included when it is known that the origin fires in one or both directions.

stem_loop (SO:0000313)

Used for stem-loop regions that are characterized in the literature as being important for the function of a region. Note that a stem-loop structure is formed by complementary pairing on the same strand, and frequently corresponds to a cruciform-like structure in double-stranded DNA. Details are provided in the '/note' qualifier.

Data Access

RefSeq Functional Element records can be accessed via the following NCBI resources:

Access via Gene

RefSeq Functional Element records can be found in Gene by searching in a variety of ways, including by record names or symbols, associated PubMed IDs, accession IDs, annotated chromosome and base locations, organism, text words, and properties (e.g., genetype biological region[prop]). Additional options can be found in the Gene Advanced Search Builder. Results can be further filtered by selecting side facets in search results pages. See the Gene Help document for more information on querying Gene.

Example queries:

To find all RefSeq Functional Element records in human: genetype biological region[prop] AND homo sapiens[orgn]

To find named recombination regions in human: genetype biological region[prop] AND homo sapiens[orgn] AND recombination region[gene/protein name]

To find human records that include a locus control region feature: genetype biological region[prop] AND homo sapiens[orgn] AND feattype locus control region[prop]

To find VISTA enhancer records in mouse: genetype biological region[prop] AND mus musculus[orgn] AND VISTA*[gene/protein name]
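The example queries above share a common shape, so they can be composed programmatically. The following is a minimal sketch; the function name `gene_region_query` is hypothetical, and the commented `esearch` call assumes that NCBI's Entrez Direct command-line tools are installed:

```shell
# Hypothetical helper: compose a Gene query for biological regions.
# $1 = organism name; $2 (optional) = feature type, e.g. "locus control region"
gene_region_query() (
  query="genetype biological region[prop] AND ${1}[orgn]"
  if [ -n "${2:-}" ]; then
    query="$query AND feattype ${2}[prop]"
  fi
  printf '%s\n' "$query"
)

gene_region_query "homo sapiens" "locus control region"
# prints: genetype biological region[prop] AND homo sapiens[orgn] AND feattype locus control region[prop]

# Assumption: with Entrez Direct installed, the string could be passed on as
#   esearch -db gene -query "$(gene_region_query "homo sapiens")"
```

The printed string can also be pasted directly into the Gene search box.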

Access via Nucleotide

The Nucleotide database displays RefSeq Functional Element records in GenBank flat file format by default. See the Entrez Sequences Help document for details on Nucleotide record display options, including instructions on how to retrieve FASTA sequences for specific features annotated on each RefSeq.

RefSeq Functional Element records can be queried in the Nucleotide database in several ways, including by names, symbols, accession ID, organism or associated publications. See the Nucleotide Advanced Search Builder for further options. Queries by BioProject ID (see Access via BioProject below) and by Feature key are particularly useful for retrieving Functional Element RefSeqs.

Example queries:

To retrieve all human RefSeq Functional Element records: homo sapiens[orgn] AND PRJNA343958[bioproject]

To retrieve mouse RefSeq Functional Element records with an annotated enhancer feature: regulatory enhancer[feature key] AND mus musculus[orgn] AND PRJNA343958[bioproject]

To retrieve human RefSeq Functional Element records with an annotated rep_origin feature: rep origin[feature key] AND homo sapiens[orgn] AND PRJNA343958[bioproject]
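For scripted retrieval, queries of this form can in principle be submitted to the NCBI E-utilities `esearch` endpoint with `db=nuccore`. The sketch below only builds the URL and does not contact NCBI. The function name `nuccore_esearch_url` is hypothetical, the encoding step handles only spaces and square brackets, and the assumption that the web query fields behave identically through E-utilities should be verified before relying on it:

```shell
# Hypothetical helper: build an esearch URL for Functional Element RefSeqs.
# $1 = feature key text (e.g. "regulatory enhancer"); $2 = organism name
nuccore_esearch_url() (
  term="${1}[feature key] AND ${2}[orgn] AND PRJNA343958[bioproject]"
  # Minimal URL encoding: spaces become '+', brackets become %5B and %5D.
  encoded=$(printf '%s' "$term" | sed -e 's/ /+/g' -e 's/\[/%5B/g' -e 's/]/%5D/g')
  printf '%s\n' "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=${encoded}"
)

nuccore_esearch_url "regulatory enhancer" "mus musculus"
```

The resulting URL could be fetched with a tool such as curl. Terms containing other reserved characters would need fuller encoding than this sketch provides.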

Access via BLAST

RefSeq Functional Element sequences are in NCBI's Nucleotide database, so matching RefSeqs can be retrieved through Nucleotide BLAST sequence searches when the following databases are selected in the 'Choose Search Set' area:

  • Nucleotide collection (nr/nt) -- pull-down menu selection
  • Reference genomic sequences (refseq_genomic) -- pull-down menu selection

Access via BioProject

All RefSeq Functional Elements are represented in the BioProject accession PRJNA343958. Sequence records can be retrieved from links within the 'Project Data' section. We recommend that all Nucleotide database queries be appended with that BioProject accession for retrieval of RefSeq Functional Element records (see Access via Nucleotide example queries above).

Access via NCBI Graphical Displays

All genome-annotated features from RefSeq Functional Elements can be viewed by turning on the 'Biological regions' track, which is available in the 'Genes/Products' track group under the 'NCBI Other Features' category in NCBI graphical displays, including the Genome Data Viewer, Sequence Viewer, Variation Viewer and graphical images in Gene records. The 'Biological regions' track may be shown by default depending on the NCBI browser. For example, it is shown in Gene records when RefSeq features are annotated on the reference genome, and in the Genome Data Viewer or Variation Viewer when the 'Genes' track set is selected under 'NCBI Recommended Track Sets'. It may be necessary to turn on the track if other track sets or the default tracks are selected. The track can be turned on under Tracks -> Configure Tracks -> Genes/Products -> Category: NCBI Other Features, where selection of the most recent 'Biological regions, aggregate' track is recommended (see the track configuration interface in Figure 4).

Note that the 'Biological regions' track does not include overlapping or nearby conventional gene annotations, which can be found in the 'Genes' track (Figure 3). Similarly, users should refer to 'Variation' type tracks to see overlapping variation features, e.g., dbSNP, ClinVar or dbVar tracks.

If the RefSeq has not yet been annotated on the genome, only the RefSeq (NG_ accession) sequence will be available for graphical viewing. For RefSeq graphical images, individual feature types that are not already shown by default may be viewed by turning on the desired track types in the 'Features' track group (e.g., 'regulatory' Features, 'protein_bind' Features).

Graphical view example for HBB-LCR, GeneID:109580095:

Configure Page

Figure 4. NCBI genome browser track configuration dialog box. The most recent 'Biological regions' track can be selected in the 'Genes/Products' track group and 'NCBI Other Features' category.

Access via FTP

RefSeq Functional Elements are available for download from the NCBI FTP site, including the following specific subsites:

  • RefSeq FTP -- all RefSeq records including those not yet annotated on a genome assembly. Includes biological_region subdirectories containing weekly updated FASTA- and GenBank flat file-formatted functional element RefSeqs for human and mouse.

  • Gene FTP -- all RefSeqs and associated Gene data, and genomic context if annotated on a genome assembly.

  • Genomes FTP -- genome-annotated feature data for human and mouse. See the Feature Annotation Glossary above for descriptions of RefSeq Functional Element feature types. For extracting specific feature types from GFF3 files, please note that equivalent Sequence Ontology (SO) terms are used as SO_types in column 3. Alternatively, specific feature types may be extracted from GFF3 files based on feature key, feature class or controlled vocabulary attributes in column 9. The following table shows how features are indicated in columns 3 and 9 of GFF3 files:

Feature table

INSDC feature | Feature class or controlled vocabulary | SO ID | GFF3 column 3 SO_type | GFF3 column 9 specific attribute(s)
misc_feature | biological_region | SO:0001411 | biological_region | gbkey=Region
misc_feature | CAGE_cluster | SO:0001917 | CAGE_cluster | feat_class=CAGE_cluster
misc_feature | conserved_region | SO:0000330 | conserved_region | feat_class=conserved_region
misc_feature | nucleotide_cleavage_site | SO:0002204 | nucleotide_cleavage_site | feat_class=nucleotide_cleavage_site
misc_feature | nucleotide_motif | SO:0000714 | nucleotide_motif | feat_class=nucleotide_motif
misc_feature | repeat_instability_region | SO:0002202 | repeat_instability_region | feat_class=repeat_instability_region
misc_feature | replication_start_site | SO:0002203 | replication_start_site | feat_class=replication_start_site
misc_feature | sequence_alteration | SO:0001059 | sequence_alteration | feat_class=sequence_alteration
misc_feature | sequence_comparison | SO:0002072 | sequence_comparison | feat_class=sequence_comparison
misc_feature | sequence_feature | SO:0000110 | sequence_feature | feat_class=sequence_feature
misc_feature | transcription_start_site | SO:0000315 | sequence_feature | feat_class=transcription_start_site
misc_recomb | chromosome_breakpoint | SO:0001021 | chromosome_breakpoint | recombination_class=chromosome_breakpoint
misc_recomb | meiotic | SO:0002155 | meiotic_recombination_region | recombination_class=meiotic
misc_recomb | mitotic | SO:0002154 | mitotic_recombination_region | recombination_class=mitotic
misc_recomb | non_allelic_homologous | SO:0002094 | non_allelic_homologous_recombination_region | recombination_class=non_allelic_homologous
misc_recomb | recombination_hotspot | SO:0000298 | recombination_feature | recombination_class=recombination_hotspot
misc_structure | n/a | SO:0000002 | sequence_secondary_structure | gbkey=misc_structure
mobile_element | n/a | SO:0001037 | mobile_genetic_element | gbkey=mobile_element
protein_bind | n/a | SO:0000410 | protein_binding_site | gbkey=protein_bind or bound_moiety=
regulatory | CAAT_signal | SO:0000172 | CAAT_signal | regulatory_class=CAAT_signal
regulatory | DNase_I_hypersensitive_site | SO:0000685 | DNAseI_hypersensitive_site | regulatory_class=DNase_I_hypersensitive_site
regulatory | enhancer | SO:0000165 | enhancer | regulatory_class=enhancer
regulatory | enhancer_blocking_element | SO:0002190 | enhancer_blocking_element | regulatory_class=enhancer_blocking_element
regulatory | GC_signal | SO:0000173 | GC_rich_promoter_region | regulatory_class=GC_signal
regulatory | imprinting_control_region | SO:0002191 | imprinting_control_region | regulatory_class=imprinting_control_region
regulatory | insulator | SO:0000627 | insulator | regulatory_class=insulator
regulatory | locus_control_region | SO:0000037 | locus_control_region | regulatory_class=locus_control_region
regulatory | matrix_attachment_region | SO:0000036 | matrix_attachment_site | regulatory_class=matrix_attachment_region
regulatory | epigenetically_modified_region | SO:0001720 | epigenetically_modified_region | regulatory_class=epigenetically_modified_region
regulatory | micrococcal_nuclease_hypersensitive_site | SO:0005836 | regulatory_region | regulatory_class=micrococcal_nuclease_hypersensitive_site
regulatory | promoter | SO:0000167 | promoter | regulatory_class=promoter
regulatory | response_element | SO:0002205 | response_element | regulatory_class=response_element
regulatory | replication_regulatory_region | SO:0001682 | replication_regulatory_region | regulatory_class=replication_regulatory_region
regulatory | silencer | SO:0000625 | silencer | regulatory_class=silencer
regulatory | TATA_box | SO:0000174 | TATA_box | regulatory_class=TATA_box
regulatory | transcriptional_cis_regulatory_region | SO:0001055 | transcriptional_cis_regulatory_region | regulatory_class=transcriptional_cis_regulatory_region
repeat_region | minisatellite | SO:0000643 | minisatellite | satellite=minisatellite
repeat_region | microsatellite | SO:0000289 | microsatellite | satellite=microsatellite
repeat_region | direct_repeat | SO:0000314 | direct_repeat | rpt_type=direct or rpt_type=flanking or rpt_type=tandem
rep_origin | n/a | SO:0000296 | origin_of_replication | gbkey=rep_origin
stem_loop | n/a | SO:0000313 | stem_loop | gbkey=stem_loop
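As a quick check of which column 3 SO_types from the table are present in a given GFF3 file, the features can be tallied by type. The following is a minimal sketch that reads GFF3 text on standard input and keeps only unstranded features plus 'stem_loop' features; the function names and the sample lines (sequence name, source, coordinates and IDs) are made up for illustration:

```shell
# Hypothetical helper: count unstranded features (plus stem_loop) per SO_type.
so_type_counts() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { next }                                    # skip header and comment lines
    $7 == "." || $3 == "stem_loop" { count[$3]++ }   # tally by column 3
    END { for (t in count) print t, count[t] }' | sort
}

# Made-up sample lines; printf repeats the 9-field format for each group of 9 arguments.
demo_gff() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr_demo demo gene     1    5000 . + . 'ID=gene-DEMO;gbkey=Gene' \
    chr_demo demo enhancer 101  900  . . . 'ID=e1;gbkey=regulatory;regulatory_class=enhancer' \
    chr_demo demo enhancer 1200 1800 . . . 'ID=e2;gbkey=regulatory;regulatory_class=enhancer' \
    chr_demo demo silencer 2500 2900 . . . 'ID=s1;gbkey=regulatory;regulatory_class=silencer'
}

demo_gff | so_type_counts
```

On a real annotation file, the same function could be fed from a decompressed stream of the GFF3 file in place of `demo_gff`.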

Feature extraction examples

The following Unix command line examples use the GCF_000001405.39_GRCh38.p13_genomic.gff.gz file from NCBI's human Updated Annotation Release 109.20190607. The GFF3 file from NCBI's latest human or mouse annotation release can be downloaded to a local directory (see the human and mouse Genomes FTP links above), with substitution of the GFF3 file names in the following commands. Please refer to the feature table above to determine appropriate strings for extraction of desired feature types.

  • Extraction of all RefSeq Functional Element features using awk, excluding the parental/spanning 'biological_region' features. This command takes advantage of the fact that RefSeq Functional Element features are not stranded (except for 'stem_loop' features), while all other NCBI-annotated features are stranded, including 'gene', 'exon' and 'CDS' features:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk 'BEGIN{FS="\t";OFS="\t"}$7=="."||$3=="stem_loop"'

  • Extraction of all regulatory_class features using awk:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk 'BEGIN{FS="\t";OFS="\t"}$9~/regulatory_class=/'

    Alternatively, $9~/regulatory_class=/ can be substituted with $9~/gbkey=regulatory/

  • Extraction of enhancer features based on the column 3 SO_type using awk:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk 'BEGIN{FS="\t";OFS="\t"}$3=="enhancer"'

  • Extraction of enhancer features based on the column 9 specific attribute using awk:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk 'BEGIN{FS="\t";OFS="\t"}$9~/regulatory_class=enhancer/&&$9!~/regulatory_class=enhancer_blocking_element/'

  • Extraction of enhancer features based on the column 9 specific attribute using grep:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep regulatory_class=enhancer | grep -v regulatory_class=enhancer_blocking_element

Caution for extraction of specific features: Some feature strings may also be present in names or other free-text attributes in GFF3 files. The use of explicit terms with associated attribute strings is therefore recommended to avoid the extraction of non-specific features. Explicit features can be extracted with awk based on the SO_type in column 3 in the format $3=="feature_name", while full attribute strings may be necessary for extractions based on column 9 attributes. For example, the awk and grep examples above match the full string regulatory_class=enhancer and additionally exclude 'enhancer_blocking_element' features, because the string regulatory_class=enhancer is also a prefix of regulatory_class=enhancer_blocking_element.
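One way to avoid such prefix collisions without a second exclusion step is to split column 9 on semicolons and compare each attribute as a whole string. The following is a minimal sketch of that approach, not an NCBI-provided command; the function names and the two sample lines are made up for illustration:

```shell
# Hypothetical helper: print GFF3 lines whose column 9 contains an attribute
# exactly equal to $1 (e.g. regulatory_class=enhancer).
gff_exact_attr() {
  awk -v want="$1" 'BEGIN { FS = OFS = "\t" }
    /^#/ { next }
    {
      n = split($9, attrs, ";")           # one array element per key=value pair
      for (i = 1; i <= n; i++) {
        if (attrs[i] == want) { print; next }
      }
    }'
}

# Made-up sample lines: one enhancer and one enhancer_blocking_element.
demo_enhancer_lines() {
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr_demo demo enhancer 101 900 . . . 'ID=e1;gbkey=regulatory;regulatory_class=enhancer' \
    chr_demo demo enhancer_blocking_element 950 1400 . . . 'ID=b1;gbkey=regulatory;regulatory_class=enhancer_blocking_element'
}

demo_enhancer_lines | gff_exact_attr regulatory_class=enhancer   # prints only the e1 line
```

Because the comparison is against the complete attribute, `regulatory_class=enhancer_blocking_element` does not match, and free-text attributes that merely contain the string are also skipped.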

Last updated: 2019-08-26T16:44:38Z