NCBI RefSeq Functional Elements

Overview

NCBI provides RefSeq and Gene records for non-genic functional elements that have been described in the literature and are experimentally validated. Elements in scope include experimentally verified gene regulatory regions (e.g., enhancers, silencers, locus control regions), known structural elements (e.g., insulators, DNase I hypersensitive sites, matrix/scaffold-associated regions), well-characterized DNA replication origins, and clinically-significant sites of DNA recombination and genomic instability. Priority is given to genomic regions that are implicated in human disease or are otherwise of significant interest to the research community. Currently, the scope of this project is restricted to human and mouse. Our current scope does not include functional elements predicted from large-scale epigenomic mapping studies, nor elements that exist solely based on disease-associated variation.

Each RefSeq Functional Element sequence has a corresponding record in NCBI's Gene database (see example in Figure 1). NCBI Gene records for Functional Elements differ from conventional genes in that they have the Gene type 'biological region.' All Functional Element Gene records include a list and a graphical view of annotated feature types, a brief summary of the function of the region, a list of related INSDC accessions, and a comprehensive bibliography of relevant publications. A link to the orthologous human or mouse record is provided where appropriate.

Gene Summary

Figure 1. An example of an NCBI Gene record for a biological region (only the Summary section is shown here). Note that the 'Gene type' is 'biological region' and 'Feature type(s)' are listed.

RefSeq Functional Element Records

RefSeq Functional Element sequences are represented as follows:

  • As genomic RefSeqs with an NG_ prefix (e.g., NG_046887.1)
  • As DNA sequences encompassing the genomic range of one or more experimentally-validated functional elements.
  • Based on the plus strand of the current human or mouse reference genome assembly, unless otherwise indicated.
  • With 100-nt padding on each end for extra genomic context.
  • Elements considered to be functionally related and closely situated in the genome are included together on the same NG_ record (e.g., an enhancer and contained protein-binding sites; multiple nearby enhancer and/or promoter fragments).
  • Experimentally-validated features are annotated on each sequence record through manual curation by NCBI RefSeq staff as described below.
  • Manually curated RefSeq Functional Element records have a REVIEWED status (see About RefSeq for status descriptions).
  • Records generated through automatic bulk processing, such as the validated dataset from the VISTA Enhancer Browser, have a PROVISIONAL status.

RefSeq Functional Element Feature Annotation

RefSeq Functional Element sequences include manually curated features in accordance with International Nucleotide Sequence Database Collaboration (INSDC) standards. Features that are supported by direct experimental evidence include at least one '/experiment' qualifier with an evidence code (ECO ID) from the Evidence & Conclusion Ontology, and at least one citation from PubMed. It is important to note that annotated sequence ranges may be approximate depending on the experimental evidence type, and that features may include extraneous sequences that are not necessary for function. Feature annotation can be viewed on the RefSeq Nucleotide flat file (Figure 2), in the graphical view in Gene records (Figure 3), and in NCBI genome browsers (see Access via NCBI Graphical Displays below).

Flat File Features

Figure 2. Example of a RefSeq Functional Element NG_ flat file and descriptions of feature annotation and common qualifiers.

Graphical View

Figure 3. Feature annotation in the Gene graphical display. Additional track sets, including conventional gene annotation, repeat region and variation tracks, may be exposed using the Tracks button (see NCBI Sequence Viewer Documentation).

Feature Annotation Glossary

Features are annotated on RefSeq Functional Element NG_ records based on review of the scientific literature. Annotated features are in accord with INSDC Feature Table specifications, where some INSDC feature keys have specific feature classes, e.g., the 'misc_recomb' and 'regulatory' feature keys. In addition, RefSeq-specific controlled vocabulary terms are sometimes used to provide further feature specificity, e.g., for 'misc_feature,' or 'misc_recomb' or 'regulatory' features that are not defined by a specific feature class. The feature keys, feature classes and controlled vocabularies can be mapped to equivalent terms in the Sequence Ontology (SO), where those SO terms are used as SO_types for genome-annotated features in column 3 of NCBI GFF3 files (see the feature table below). The following feature types are used for RefSeq Functional Elements, with equivalent SO IDs shown in parentheses:

misc_feature

Used for functionally significant features that currently lack a more specific INSDC feature key. Controlled vocabularies are provided for additional feature specificity and to facilitate bulk search and retrieval. In GenBank flat files, controlled vocabulary terms are used at the beginning of a '/note' qualifier and are separated from any additional '/note' text by a semi-colon. Underscores replace spaces for the same terms in ASN.1 and GFF3 formats.

Flat file qualifier example: /note="conserved region; ultraconserved element uc.328"

RefSeq controlled vocabularies for 'misc_feature':

  • biological_region (SO:0001411; Special note: This is a parental feature spanning all other feature annotation on each record. It is a 'misc_feature' denoted by '/note="Region: biological region"' in GenBank flat files, a 'Region' feature in ASN.1 format and a 'biological_region' in GFF3 format.
  • CAGE_cluster (SO:0001917)
  • conserved_region (SO:0000330)
  • nucleotide_cleavage_site (SO:0002204)
  • nucleotide_motif (SO:0000714)
  • repeat_instability_region (SO:0002202)
  • replication_start_site (SO:0002203)
  • sequence_alteration (SO:0001059)
  • sequence_comparison (SO:0002072)
  • sequence_feature (SO:0000110)
  • transcription_start_site (SO:0000315)

misc_recomb

Used for genomic regions known to undergo recombination events. See INSDC's Controlled vocabulary for recombination_class for details on 'recombination_class' types.

Flat file qualifier example: recombination_class="non_allelic_homologous"

INSDC 'recombination_class' types used for RefSeq Functional Elements:

RefSeq controlled vocabularies for recombination_class="other":

misc_structure (SO:0000002)

Used for miscellaneous structural regions that are considered functionally important, including G-quadruplex and cruciform structures. Additional details are provided in flat file '/note' qualifiers.

mobile_element (SO:0001037)

Used for mobile elements, including transposable elements, retrotransposons and endogenous retroviruses, that are described in the literature as being functionally significant and/or represent genomic landmarks for a region. The repeat family (e.g., SINE:AluSg) is indicated in the '/mobile_element_type' qualifier. An '/inference' qualifier may be provided with a reference to a publicly accessible algorithm, e.g., RepeatMasker:4.0.5. Note that the vast majority of mobile elements located within the span of a RefSeq Functional Element NG_ will not be annotated. Mobile elements can be viewed in NCBI graphical displays by exposing the 'Repeats identified by RepeatMasker' track when viewing annotation on the genome.

protein_bind (SO:0000410)

Used where there is experimental evidence of direct protein binding to a DNA fragment, e.g., electrophoretic mobility shift assay (EMSA) or DNase I footprint evidence for binding of a specific protein, family, or complex. Predicted binding sites that lack experimental validation, or which are validated solely by chromatin immunoprecipitation, are not annotated. The annotated range is based on the experimental fragment described in the literature (e.g., an EMSA probe) and is typically longer than the core binding motif. The bound protein name (or protein family name) is provided in the '/bound_moiety' qualifier, where the protein name is the HUGO Gene Nomenclature Committee (HGNC) official full name for the encoding gene (e.g., CTCF is 'CCCTC-binding factor').

regulatory

See INSDC's Controlled vocabulary for regulatory_class for details on 'regulatory_class' types. Typically, the annotated region corresponds to an experimentally-defined fragment that was found to be sufficient for function, e.g., a fragment used in a reporter assay. Annotated sequences may therefore include extraneous sequences that are not necessary for function.

Specific notes:

  • Short feature motifs (i.e., CAAT_signal, TATA_box, GC_signal) that lack direct experimental validation may occasionally be annotated when described in a publication as a significant genomic landmark.

  • DNase_I_hypersensitive_site ranges are determined on a case-by-case basis based on examination of published experimental evidence. They typically include generous padding equivalent to at least one nucleosome plus linker span (~200 nt) in addition to the determined core site. Closely situated sites that are difficult to resolve may be combined into a single annotated feature, as indicated in the '/note' qualifier.

INSDC 'regulatory_class' types used for RefSeq Functional Elements:

RefSeq controlled vocabularies for regulatory_class="other":

repeat_region

Used for tandem repeats, microsatellites, and other low-complexity repeats that are described in the literature as being functionally significant and/or represent genomic landmarks for a region. Note that not all low-complexity regions located within the span of a RefSeq Functional Element NG_ will be annotated. An '/inference' qualifier may be provided with a reference to a publicly accessible algorithm, e.g., RepeatMasker:4.0.5.

INSDC repeat types used for RefSeq Functional Elements:

rep_origin (SO:0000296)

Used for DNA replication origins that are well-supported and reproducibly observed in the literature. A '/direction' qualifier may be included when it is known that the origin fires in one or both directions.

stem_loop (SO:0000313)

Used prior to August 2020 for stem-loop regions that are considered functionally important. Note that stem-loop structures are formed by complementary pairing on the same strand and frequently correspond to cruciform-like structures in double-stranded DNA, thus these are now represented as misc_structure features.

Interaction Data

Non-genic biological regions are linked to target genes and other biological regions when there is experimental evidence for interactions. Interaction data is provided for regulatory interactions, typically between a gene regulatory element and a target gene, and for recombination partner interactions between recombination-type biological regions. Biological region ranges are based on parental 'biological_region' features that may include several underlying gene regulatory or other non-genic functional features, while target gene ranges are based on NCBI 'gene' feature annotations. Each interaction is supported by experimental data in the literature, and publications are provided in our bigInteract files described below.

Interaction data can be accessed as follows:

  • By download from the RefSeq FTP site in bigInteract file format. Pairwise interactions with genomic coordinates are provided in the 'FEregintxns_AR##.inter.bb' and 'FErecombpartners_AR##.inter.bb' files, where '##' represents the NCBI annotation release identifier. These files are available at this link for human, or at this link for mouse. We recommend the use of data from the most recent annotation release (AR##) directory. Separate bigInteract files are provided for regulatory interactions (both human and mouse) and recombination partners (human only). These files also include a custom metadata column (#19) containing a comma-separated list of PubMed IDs with experimental support for each interaction.

  • By download from the RefSeq FTP site in bigBed file format. Interactions are provided for biological regions in a custom metadata column (#13) in the 'FEbiolregions_AR##.bb' files as pipe-separated lists of loci known to interact with the given biological region. These files are available from the same FTP directories indicated for the bigInteract files above. Additional information about 'FEbiolregions_AR##.bb' files can be found below.

  • By download from the RefSeq FTP site in GenBank flat file format ('[human/mouse].biological_region.gbff.gz' files). RefSeq accessions for biological regions with validated functional interactions contain a '/function' qualifier listing loci that are known to interact with the biological region. The qualifier is present on the 'misc_feature' denoted as '/note="Region: biological region"'. For example, on NG_042043.1 the '/function' qualifier appears as:
    /function="regulatory_interactions: H19 | Igf2 | LOC105311846"
    The weekly-updated files are available at this link. Alternatively, individual RefSeq accessions may be queried, viewed and downloaded in flat file format in the Nucleotide database, as described below.

  • By download from the Genomes FTP site in GFF3 file format. Relevant 'biological_region' features have a 'function' attribute in column 9, where interacting loci are indicated as described for GenBank flat files above. GFF3 file availability is described below.

  • By graphical visualization from the RefSeq Functional Elements track hub (RefSeqFE Hub). Interaction data may be viewed in the regulatory interactions and recombination partner tracks, as described for our track hub below and shown in Figure 5.

Data Access

RefSeq Functional Element records can be accessed via the following NCBI resources:

Access via Gene

RefSeq Functional Element records can be found in Gene by searching in a variety of ways, including by record names or symbols, associated PubMed IDs, accession IDs, annotated chromosome and base locations, organism, text words, and properties (e.g., genetype biological region[prop]). Additional options can be found in the Gene Advanced Search Builder. Results can be further filtered by selecting side facets in search results pages. See the Gene Help document for more information on querying Gene.

Example queries:

To find all RefSeq Functional Element records in human: genetype biological region[prop] AND homo sapiens[orgn]

To find named recombination regions in human: genetype biological region[prop] AND homo sapiens[orgn] AND recombination region[gene/protein name]

To find human records that include a locus control region feature: genetype biological region[prop] AND homo sapiens[orgn] AND feattype locus control region[prop]

To find VISTA enhancer records in mouse: genetype biological region[prop] AND mus musculus[orgn] AND VISTA*[gene/protein name]

To find human Functional Element records associated with coronavirus biology: coronavirus related[filter] AND genetype biological region[properties]
Or find subsets of those records involved in particular processes with:

Access via Nucleotide

The Nucleotide database displays RefSeq Functional Element records in GenBank flat file format by default. See the Entrez Sequences Help document for details on Nucleotide record display options, including instructions on how to retrieve FASTA sequences for specific features annotated on each RefSeq.

RefSeq Functional Element records can be queried in the Nucleotide database in several ways, including by names, symbols, accession ID, organism or associated publications. See the Nucleotide Advanced Search Builder for further options. Queries by BioProject ID (see Access via BioProject below) and by Feature key are particularly useful for retrieving Functional Element RefSeqs.

Example queries:

To retrieve all human Functional Element RefSeqs: homo sapiens[orgn] AND PRJNA343958[bioproject]

To retrieve mouse Functional Element RefSeqs with an annotated enhancer feature: regulatory enhancer[feature key] AND mus musculus[orgn] AND PRJNA343958[bioproject]

To retrieve human Functional Element RefSeqs with an annotated rep_origin feature: rep origin[feature key] AND homo sapiens[orgn] AND PRJNA343958 [bioproject]

To retrieve human Functional Element RefSeqs associated with coronavirus biology: coronavirus related[prop] AND biomol genomic[prop]
Or find subsets of those records involved in particular processes with:

Access via BLAST

RefSeq Functional Element sequences are in NCBI's Nucleotide database, thus matching RefSeqs can be retrieved through Nucleotide BLAST sequence searches when the following databases are selected in the 'Choose Search Set' area:

  • Nucleotide collection (nr/nt) -- pull-down menu selection
  • Reference genomic sequences (refseq_genomic) -- pull-down menu selection

Access via BioProject

All RefSeq Functional Elements are represented in BioProject accession PRJNA343958. Sequence records can be retrieved from links within the 'Project Data' section. We recommend that Nucleotide database queries be appended with that BioProject accession for retrieval of RefSeq Functional Element records (see Access via Nucleotide example queries above).

Access via NCBI Graphical Displays

All genome-annotated features from RefSeq Functional Elements can be viewed by turning on the 'Biological regions' track available in the 'Genes/Products' track group and 'NCBI Other Features' category in NCBI graphical displays, including the Genome Data Viewer, Sequence Viewer, Variation Viewer and graphical images in Gene records. Note that the 'Biological regions' track may be viewable by default depending on the NCBI browser. For example, it will be on view in Gene records when RefSeq features are annotated on the reference genome, or in the Genome Data Viewer or Variation Viewer when the 'Genes' track set is selected under 'NCBI Recommended Track Sets', but it may be necessary to turn on the track if other track sets or the default tracks are selected. The track can be turned on under Tracks -> Configure Tracks -> Genes/Products -> Category: NCBI Other Features, where selection of the most recent 'Biological regions, aggregate' track is recommended; see the Figure 4 track configuration interface. Note that the 'Biological regions' track does not include overlapping or nearby conventional gene annotations, which can be found in the 'Genes' track (Figure 3). Similarly, users should refer to 'Variation' type tracks to see overlapping variation features, e.g., dbSNP, ClinVar or dbVar tracks. If the RefSeq has not yet been annotated on the genome, only the RefSeq (NG_ accession) sequence will be available for graphical viewing. For RefSeq graphical images, individual feature types, if not already viewable by default, may be viewed by turning on desired track types in the 'Features' track group (e.g., 'regulatory' Features, 'protein_bind' Features).

Graphical view example for HBB-LCR, GeneID:109580095: https://go.usa.gov/xf9ea

Configure Page

Figure 4. NCBI genome browser track configuration dialog box. The most recent 'Biological regions' track can be selected in the 'Genes/Products' track group and 'NCBI Other Features' category.

Access via the RefSeq Functional Elements Track Hub

The RefSeq Functional Elements track hub (RefSeqFE Hub) includes tracks for biological regions, features, regulatory interactions and recombination partners. The hub is in UCSC track hub format and can be viewed on a compatible genome browser, including the UCSC Genome Browser (all tracks and metadata, see Figure 5 example), the NCBI Genome Data Viewer (biological region and feature tracks only; metadata is best viewed in the existing 'Biological regions' track described above) or the Ensembl genome browser (biological region and feature tracks only).

Use the following URL to connect to the RefSeqFE Hub: https://ftp.ncbi.nlm.nih.gov/refseq/FunctionalElements/trackhub/hub.txt

Please see additional information in the track hub document. Briefly, the following tracks are represented:

  • RefSeq Functional Element biological regions (FE_biol_regions): Biological regions with metadata from the 'FEbiolregions_AR##.bb' bigBed file described below. See the biological regions track document for more details.

  • RefSeq Functional Element features (FE_features): Functional features with metadata from the 'FEfeats_AR##.bb' bigBed file described below. See the features track document for more details.

  • RefSeq Functional Element recombination partners (Recomb_partners; currently human only): Recombination interactions from the 'FErecombpartners_AR##.inter.bb' bigInteract file described above. See the recombination partners track document for more details.

  • RefSeq Functional Element regulatory interactions (Reg_interactions): Regulatory interactions from the 'FEregintxns_AR##.inter.bb' bigInteract file described above. See the regulatory interactions track document for more details.

Note on track display in other genome browsers: While not all commonly used genome browsers are compatible with UCSC-formatted track hubs, many browsers, including those mentioned above, may offer options to view RefSeq Functional Element tracks by remote connection from file URLs, or as custom tracks based on downloaded files saved locally. These browsers will vary in their ability to accept bigBed or bigInteract files, or to display color-coding or metadata, and some may require the conversion of binary files to non-binary formats. Please check the relevant genome browser documentation for futher information.

Configure Page

Figure 5. UCSC Genome Browser image displaying tracks from the RefSeqFE Hub at the human alpha-globin locus.

Access via FTP

RefSeq Functional Elements are available for FTP download in the NCBI FTP site, including the following specific subsites:

RefSeq FTP

All RefSeq records including those not yet annotated on a genome assembly. Includes a FunctionalElements directory with the following content:

Weekly updated RefSeq accession files:

  • [human/mouse].biological_region.fna.gz -- RefSeq accessions for genomic biological regions (NG_ prefix) in FASTA format
  • [human/mouse].biological_region.gbff.gz -- RefSeq accessions for genomic biological regions (NG_ prefix) in GenBank flatfile format

These files can also be found in species-specific 'biological_region' directories for human and mouse.

trackhub directory:

Files and subdirectories necessary to support the RefSeq Functional Elements track hub described above. See UCSC track hub documentation for an explanation of track hub structure and the necessary components. The following data files are provided within the data subdirectory and the species-specific annotation release (AR##) subdirectories therein:

  • Genome-annotated biological region and feature files in bigBed format:

    • FEbiolregions_AR##.bb -- biological regions with metadata, provided as binary indexed BED 9+5 files where columns 1-9 are in standard BED format, with five additional custom columns representing:

      • Column 10: the Gene database identifier cross-reference
      • Column 11: the locus name
      • Column 12: the locus summary
      • Column 13: known interacting loci, regulatory interactions or recombination partners as described above
      • Column 14: the RefSeq accession ID
        Note that the locus symbol is used for the name (column 4). These files use UCSC-style chromosome notation (column 1).
    • FEfeats_AR##.bb -- functional features with metadata, provided as binary indexed BED 9+41 files where columns 1-9 are in standard BED format, with four additional custom columns representing:

      • Column 10: database cross-references, including to the Gene record and other database records if applicable
      • Column 11: functional and descriptive information including experimental evidence and publication support
      • Column 12: supporting publications from PubMed 2
      • Column 13: the RefSeq accession ID 3
        Note that the name (column 4) is in the format <SO_type>_<GeneID> for most features, where the 'SO_type' is the 'GFF3 column 3 SO_type' in the feature table below. Protein binding site features are labeled in the format protein_binding_site_<bound_moiety>_<GeneID>, where the bound_moiety is the HGNC official full name for the encoding gene. These files use UCSC-style chromosome notation (column 1).

      1 BED 9+3 format for the human FEfeats_AR109.20200522.bb and mouse FEfeats_AR108.20200622.bb files
      2 not included in the human FEfeats_AR109.20200522.bb and mouse FEfeats_AR108.20200622.bb files; publications can be extracted from column 11
      3 this is column 12 in the human FEfeats_AR109.20200522.bb and mouse FEfeats_AR108.20200622.bb files

  • Pairwise interaction data files in bigInteract format, provided as binary indexed BED 5+14 files where columns 1-18 are in standard bigInteract format, and column 19 is a custom column listing supporting publications:

    • FErecombpartners_AR##.inter.bb -- recombination interactions as described above
    • FEregintxns_AR##.inter.bb -- regulatory interactions as described above

All of the bigBed and bigInteract files can also be found in species-specific 'biological_region' directories and annotation release (AR##) subdirectories therein for human and mouse.

Gene FTP

All RefSeqs and associated Gene data, and genomic context if annotated on a genome assembly. See the Gene README file for a description of the Gene FTP directory contents.

Genomes FTP

Genome-annotated feature data for human and mouse provided in GFF3 format. See the Feature Annotation Glossary above for descriptions of RefSeq Functional Element feature types. These GFF3 files use RefSeq accession chromosome notation (column 1). Effective as of human Updated Annotation Release 109.20200522 and mouse Updated Annotation Release 108.20200622, RefSeq Functional Element features are indicated by the 'RefSeqFE' source in column 2 of NCBI GFF3 files.

For extracting specific feature types from GFF3 files, please note that equivalent Sequence Ontology (SO) terms are used as SO_types in column 3. Alternatively, specific feature types may be extracted from GFF3 files based on feature key, feature class or controlled vocabulary attributes in column 9. The following table shows how features are indicated in columns 3 and 9 of GFF3 files:

Feature Table

INSDC feature Feature class or controlled vocabulary SO ID GFF3 column 3 SO_type GFF3 column 9 specific attribute(s)
misc_feature biological_region SO:0001411 biological_region gbkey=Region
misc_feature CAGE_cluster SO:0001917 CAGE_cluster feat_class=CAGE_cluster
misc_feature conserved_region SO:0000330 conserved_region feat_class=conserved_region
misc_feature nucleotide_cleavage_site SO:0002204 nucleotide_cleavage_site feat_class=nucleotide_cleavage_site
misc_feature nucleotide_motif SO:0000714 nucleotide_motif feat_class=nucleotide_motif
misc_feature repeat_instability_region SO:0002202 repeat_instability_region feat_class=repeat_instability_region
misc_feature replication_start_site SO:0002203 replication_start_site feat_class=replication_start_site
misc_feature sequence_alteration SO:0001059 sequence_alteration feat_class=sequence_alteration
misc_feature sequence_comparison SO:0002072 sequence_comparison feat_class=sequence_comparison
misc_feature sequence_feature SO:0000110 sequence_feature feat_class=sequence_feature
misc_feature transcription_start_site SO:0000315 sequence_feature feat_class=transcription_start_site
misc_recomb chromosome_breakpoint SO:0001021 chromosome_breakpoint recombination_class=chromosome_breakpoint
misc_recomb meiotic SO:0002155 meiotic_recombination_region recombination_class=meiotic
misc_recomb mitotic SO:0002154 mitotic_recombination_region recombination_class=mitotic
misc_recomb non_allelic_homologous SO:0002094 non_allelic_homologous_recombination_region recombination_class=non_allelic_homologous
misc_recomb recombination_hotspot SO:0000298 recombination_feature recombination_class=recombination_hotspot
misc_structure n/a SO:0000002 sequence_secondary_structure gbkey=misc_structure
mobile_element n/a SO:0001037 mobile_genetic_element gbkey=mobile_element
protein_bind n/a SO:0000410 protein_binding_site gbkey=protein_bind or bound_moiety=
regulatory CAAT_signal SO:0000172 CAAT_signal regulatory_class=CAAT_signal
regulatory DNase_I_hypersensitive_site SO:0000685 DNAseI_hypersensitive_site regulatory_class=DNase_I_hypersensitive_site
regulatory enhancer SO:0000165 enhancer regulatory_class=enhancer
regulatory enhancer_blocking_element SO:0002190 enhancer_blocking_element regulatory_class=enhancer_blocking_element
regulatory GC_signal SO:0000173 GC_rich_promoter_region regulatory_class=GC_signal
regulatory imprinting_control_region SO:0002191 imprinting_control_region regulatory_class=imprinting_control_region
regulatory insulator SO:0000627 insulator regulatory_class=insulator
regulatory locus_control_region SO:0000037 locus_control_region regulatory_class=locus_control_region
regulatory matrix_attachment_region SO:0000037 matrix_attachment_site regulatory_class=matrix_attachment_region
regulatory epigenetically_modified_region SO:0001720 epigenetically_modified_region regulatory_class=epigenetically_modified_region
regulatory micrococcal_nuclease_hypersensitive_site SO:0005836 regulatory_region regulatory_class=micrococcal_nuclease_hypersensitive_site
regulatory promoter SO:0000167 promoter regulatory_class=promoter
regulatory response_element SO:0002205 response_element regulatory_class=response_element
regulatory replication_regulatory_region SO:0001682 replication_regulatory_region regulatory_class=replication_regulatory_region
regulatory silencer SO:0000625 silencer regulatory_class=silencer
regulatory TATA_box SO:0000174 TATA_box regulatory_class=TATA_box
regulatory transcriptional_cis_regulatory_region SO:0001055 transcriptional_cis_regulatory_region regulatory_class=transcriptional_cis_regulatory_region
repeat_region minisatellite SO:0000643 minisatellite satellite=minisatellite
repeat_region microsatellite SO:0000289 microsatellite satellite=microsatellite
repeat_region direct_repeat SO:0000314 direct_repeat rpt_type=direct or rpt_type=flanking or rpt_type=tandem
rep_origin n/a SO:0000296 origin_of_replication gbkey=rep_origin
stem_loop1 n/a SO:0000313 stem_loop1 gbkey=stem_loop1

1 No longer used for Functional Element features as of August 2020, replaced with misc_structure

Feature and Metadata Extraction Examples

Annotated RefSeq Functional Element features can be extracted from NCBI-provided GFF3 files or from the 'FEfeats_AR##.bb' bigBed files following their conversion to a non-binary format using the bigBedToBed application provided in the UCSC binary utilities directory. The following data extraction examples are based on the Unix command line.

Feature extraction from GFF3 files:

The GFF3 file from NCBI's latest human or mouse annotation release can be downloaded to a local directory (see the human and mouse Genomes FTP links above), with substitution of the GFF3 file names (if applicable) in the following commands. Please refer to the feature table above to determine appropriate strings for extraction of desired feature types.

  • Extraction of all RefSeq Functional Element features using awk, excluding parental 'biological_region' features, for GFF3 files from human AR 109.20200522 and mouse AR 108.20200622 onwards. This command uses the 'RefSeqFE' source to extract all relevant features:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk 'BEGIN{FS="\t";OFS"\t"}$2=="RefSeqFE"&&$3!="biological_region"'

  • Extraction of all RefSeq Functional Element features using awk, excluding parental 'biological_region' features, for GFF3 files prior to human AR 109.20200522 and mouse AR 108.20200622. This command takes advantage of the fact that RefSeq Functional Element features are not stranded (except for 'stem_loop' features), while all other NCBI-annotated features are stranded, including 'gene', 'exon' and 'CDS' features:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk 'BEGIN{FS="\t";OFS"\t"}$7=="."||$3=="stem_loop"'

  • Extraction of all regulatory_class features using awk:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk 'BEGIN{FS="\t";OFS"\t"}$9~/regulatory_class=/'

    Alternatively, $9~/regulatory_class=/ can be substituted with $9~/gbkey=regulatory/

  • Extraction of enhancer features based on the column 3 SO_type using awk:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk 'BEGIN{FS="\t";OFS"\t"}$3=="enhancer"'

  • Extraction of enhancer features based on the column 9 specific attribute using awk:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk 'BEGIN{FS="\t";OFS"\t"}$9~/regulatory_class=enhancer/&&$9!~/regulatory_class=enhancer_blocking_element/'

  • Extraction of enhancer features based on the column 9 specific attribute using grep:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep regulatory_class=enhancer | grep regulatory_class=enhancer_blocking_element -v

  • Extraction of features from human records associated with coronavirus biology, here selecting relevant GeneIDs based on flat file 'coronavirus related' attributes found in the human.biological_region.gbff.gz file described above:

    zgrep -v "^#" GCF_000001405.39_GRCh38.p13_genomic.gff.gz | awk -F '\t' '$2=="RefSeqFE"&&$3!="biological_region"' | fgrep -f <(zgrep -A30 "coronavirus related" human.biological_region.gbff.gz | grep -o "GeneID:[0-9]*" | sort -u) -w

Feature extraction from bigBed files:

The 'FEfeats_AR##.bb' file from NCBI's latest human or mouse annotation release can be downloaded to a local directory (see file availability from the RefSeq FTP site above).

  • File conversion to non-binary format using the bigBedToBed application provided in the UCSC binary utilities directory:

    bigBedToBed FEfeats_AR109.20201120.bb FEfeats_AR109.20201120.bed

  • Extraction of enhancer features from the bigBedToBed-converted file based on the column 4 name using awk:

    awk -F '\t' '$4~/enhancer_GeneID/' FEfeats_AR109.20201120.bed

  • Extraction of features from the bigBedToBed-converted file for human functional elements associated with coronavirus biology, here selecting relevant GeneIDs based on flat file 'coronavirus related' attributes found in the human.biological_region.gbff.gz file described above:

    cat FEfeats_AR109.20201120.bed | fgrep -f <(zgrep -A30 "coronavirus related" human.biological_region.gbff.gz | grep -o "GeneID:[0-9]*" | sort -u) -w

Caution for extraction of specific features:

Some feature strings may also be present in names or other free-text attributes in GFF3 and bigBedToBed-converted files. The use of explicit terms with associated strings is therefore recommended to avoid the extraction of non-specific features. Explicit features can be extracted by using awk based on the SO_type in column 3 of GFF3 files in the format $3=="<feature_name>", while full attribute strings may be necessary for extractions based on GFF3 column 9 attributes. For example, the use of regulatory_class=enhancer with awk or grep as in the GFF3 examples above, where it is additionally necessary to exclude 'enhancer_blocking_element' features given that the 'enhancer' string is also found within regulatory_class=enhancer_blocking_element. In bigBedToBed-converted files, 'enhancer' features can be distinguished from 'enhancer_blocking_element' features by specifying $4~/enhancer_GeneID/ in an awk command given that an <SO_type>_<GeneID> naming format is used in column 4.

Metadata extraction from bigBedToBed-converted files:

Specific types of metadata may also be extracted from either the 'FEbiolregions_AR##.bb' or 'FEfeats_AR##.bb' files following their conversion to non-binary format. See above for a description of the custom metadata columns within these files.

  • Extraction of basic enhancer features (columns 1-4) together with just functional and descriptive information (column 11) from a converted non-binary FEfeats_AR109.20201120.bed file:

    awk -F '\t' '$4~/enhancer_GeneID/' FEfeats_AR109.20201120.bed | cut -f1-4,11

  • To extract the basic enhancer features (columns 1-4) along with just publications from column 12:

    awk -F '\t' '$4~/enhancer_GeneID/' FEfeats_AR109.20201120.bed | cut -f1-4,12

  • Using sed to extract the basic enhancer features (columns 1-4) along with just publications from column 11 for the human FEfeats_AR109.20200522.bb and mouse FEfeats_AR108.20200622.bb files, which lack a standalone supporting publications column:

    awk -F '\t' '$4~/enhancer_GeneID/' FEfeats_AR109.20200522.bed | cut -f1-4,11 | sed 's/\(_GeneID:[0-9]*\t\).*\[PMID/\1PMID/; s/\(PMID:[0-9]*\)\].*/\1/'

  • Extraction of just GeneIDs, locus summaries and RefSeq accessions from a similarly converted non-binary 'FEbiolregions_AR##.bed' file:

    cut -f10,12,14 FEbiolregions_AR109.20201120.bed

Similar command line strategies may also be used to extract feature-specific metadata from column 9 of GFF3 files, where several semi-colon-separated functional and descriptive attributes are present.

All of the above examples are just a sampling. The extraction of numerous other combinations of features or metadata is also possible, including by using other computer programming languages, tools or data parsers. We encourage users to explore our data based on individual data needs.

Feedback

We welcome user comments and suggestions. Please use the green Feedback button on the bottom right of the page to enter your comments.

Last updated: 2020-12-16T17:12:44Z