Frequently Asked Questions

Genome Reference Assembly

Obtaining Data and Assembly Updates

Bioinformatics Tools


Genome Reference Assembly

What is the correct name for the human genome reference assembly?

The official name for the current human reference genome assembly is Genome Reference Consortium Human Build 38. It is abbreviated as GRCh38. GRCh38 is referred to as hg38 in the UCSC Genome Browser, but this is not the official assembly name or abbreviation. The GenBank accession for GRCh38 is GCA_000001405.15. RefSeq annotates an identical copy of GRCh38, which has the accession GCF_000001405.26. An accession provides a unique and unambiguous assembly identifier and the GRC recommends its use in all publications and assembly communications. Assembly patch releases, which provide corrections and add new alternate sequence representations to the reference without changing chromosome coordinates, are named by the addition of the suffix “.p” to the assembly name and a version increment to the GenBank accession. For example, the ninth patch release of GRCh38 is officially known as Genome Reference Consortium Human Build 38 patch release 9, abbreviated as GRCh38.p9 and has GenBank accession GCA_000001405.24 and RefSeq accession GCF_000001405.35. Please see the GRC website’s Human Overview for the current patch release.

What is the correct name for the mouse genome reference assembly?

The official name for the current mouse reference genome assembly is Genome Reference Consortium Mouse Build 38. It is abbreviated as GRCm38. GRCm38 is referred to as mm10 in the UCSC Genome Browser, but this is not the official assembly name or abbreviation. The GenBank accession for GRCm38 is GCA_000001635.2. RefSeq annotates an identical copy of GRCm38, which has the accession GCF_000001635.20. An accession provides a unique and unambiguous assembly identifier and the GRC recommends its use in all publications and assembly communications. Assembly patch releases, which provide corrections and add new alternate sequence representations to the reference without changing chromosome coordinates, are named by the addition of the suffix “.p” to the assembly name and a version increment to the GenBank accession. For example, the fifth patch release of GRCm38 is officially known as Genome Reference Consortium Mouse Build 38 patch release 5, abbreviated as GRCm38.p5 and has GenBank accession GCA_000001635.7 and RefSeq accession GCF_000001635.25. Please see the GRC website’s Mouse Overview for the current patch release.

What is the correct name for the zebrafish genome reference assembly?

The official name for the current zebrafish reference genome assembly is Genome Reference Consortium Zebrafish Build 11. It is abbreviated as GRCz11. GRCz11 is referred to as danRer11 in the UCSC Genome Browser, but this is not the official assembly name or abbreviation. The GenBank accession for GRCz11 is GCA_000002035.4. RefSeq annotates an identical copy of GRCz11 which has the accession GCF_000002035.6. An accession provides a unique and unambiguous assembly identifier and the GRC recommends its use in all publications and assembly communications.

What is the correct name for the chicken genome reference assembly?

The official name for the current chicken reference genome assembly is Genome Reference Consortium Chicken Build 6a. It is abbreviated as GRCg6a. The GenBank accession for the assembly is GCA_000002315.5. RefSeq annotates an identical copy of GRCg6a which has the accession GCF_000002315.5. An accession provides a unique and unambiguous assembly identifier and the GRC recommends its use in all publications and assembly communications.
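
The naming and accession conventions described in the answers above can be checked programmatically. The following is a minimal Python sketch, not a GRC-provided tool. The function names and regular expressions are illustrative assumptions inferred only from the examples given in this FAQ (such as GRCh38.p9, GRCg6a and GCA_000001405.24), and they may not cover every assembly name or accession found in GenBank.

```python
import re
from typing import NamedTuple, Optional


class AssemblyName(NamedTuple):
    species_code: str     # single letter after "GRC", e.g. "h" in GRCh38
    build: str            # build label, e.g. "38", "11" or "6a"
    patch: Optional[int]  # patch number, or None for a major release


class AssemblyAccession(NamedTuple):
    prefix: str   # "GCA" (GenBank) or "GCF" (RefSeq)
    number: str   # digits shared by every version of the assembly
    version: int  # incremented with each release, including patch releases


# Assumed patterns, inferred from the examples quoted in this FAQ.
_NAME_RE = re.compile(r"^GRC([a-z])(\d+[a-z]?)(?:\.p(\d+))?$")
_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d+)\.(\d+)$")


def parse_assembly_name(name: str) -> AssemblyName:
    """Split a name such as 'GRCh38.p9' into species code, build and patch."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a GRC-style assembly name: {name!r}")
    code, build, patch = match.groups()
    return AssemblyName(code, build, None if patch is None else int(patch))


def parse_accession(accession: str) -> AssemblyAccession:
    """Split an accession such as 'GCA_000001405.24' into its parts."""
    match = _ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(prefix, number, int(version))


if __name__ == "__main__":
    print(parse_assembly_name("GRCh38.p9"))
    print(parse_accession("GCA_000001405.24"))
```

With these assumed patterns, a UCSC-style name such as hg38 raises a ValueError, which is consistent with it not being an official assembly name.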

How many individuals were sequenced for the human reference genome assembly?

The human reference genome is a composite genome, derived from the sequence of several different anonymous individuals. Approximately 93% of the GRCh38 primary assembly (the assembled chromosomes, unlocalized and unplaced sequences) consists of sequences from 11 genomic clone libraries (a library can generally be considered a proxy for an individual’s genome). One of these libraries, the RP11 (RPCI-11) Human Male BAC Library, has a much higher representation than all others and contributes 70% of the primary assembly. The donor of the RP11 library was an anonymous male, though analysis suggests his DNA is an African-European admixture (see page 146 of the Supporting Online Material of PMID:20448178). The remaining 7% represents sequences from >50 libraries. These libraries were developed from individuals (male and female), as well as flow-sorted chromosomes from various cell lines. The make-up of GRCh37 is largely similar to that of GRCh38 (Figure 1).

Figure 1. Contribution of genomic libraries to GRCh37 and GRCh38.
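
The percentages quoted in the answer above can be broken down with a little arithmetic. The sketch below is illustrative only. It assumes that the 70% contributed by RP11 is counted within the 93% contributed by the 11 main libraries (RP11 is described as one of them), and the function and constant names are made up for this example.

```python
# Percentages of the GRCh38 primary assembly, as quoted in the answer above.
TOP_LIBRARIES_PCT = 93   # contributed by 11 genomic clone libraries
RP11_PCT = 70            # contributed by the RP11 library alone
TOP_LIBRARY_COUNT = 11


def library_breakdown() -> dict:
    """Derive the shares implied by the quoted percentages."""
    # Share of the 10 main libraries other than RP11 (assumes RP11 is
    # included in the 93% figure).
    other_top_pct = TOP_LIBRARIES_PCT - RP11_PCT
    return {
        "RP11": RP11_PCT,
        "other_top_libraries": other_top_pct,
        "mean_per_other_top_library": other_top_pct / (TOP_LIBRARY_COUNT - 1),
        "remaining_libraries": 100 - TOP_LIBRARIES_PCT,
    }


if __name__ == "__main__":
    for label, pct in library_breakdown().items():
        print(f"{label}: {pct}%")
```

Under that assumption, the ten other main libraries together account for about 23% of the primary assembly, or roughly 2.3% each on average.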

Where can I get information about the DNA sources for the human reference genome?

Publications describing the generation of the human reference assembly provide information about the DNA sources used:

Also, see FAQ How many individuals were sequenced for the human reference genome assembly?

Where can I get information about the DNA sources for the mouse reference genome?

Publications describing the mouse reference assembly include information about DNA sources:

In the mouse reference assembly, all sequences in the primary assembly unit (chromosomes, unlocalized and unplaced scaffolds) represent the C57BL/6J strain. These sequences overwhelmingly come from genomic libraries derived from 3 different mice. The RP23 and RP24 BAC libraries come from a female and male mouse, respectively, both of which are from generation F204-207. The fosmid WI1 library was generated from a female mouse in generation F208-F214. For several genomic regions that exhibit strain variation, the GRC provides alternate loci assembly units containing scaffolds comprised of clone-based sequences from other strains.

Where can I get information about the DNA source for the zebrafish reference genome?

The publication describing the zebrafish reference assembly includes information about DNA source:

The zebrafish genome reference assembly represents the sequence of the Tuebingen strain. Several genomic clone libraries and WGS were used. The main description of the zebrafish genome assembly construction can be found in the Supplementary information to the paper.

Where can I get information about the DNA source for the chicken reference genome?

The publication describing the chicken reference assembly: A New Chicken Genome Assembly Provides Insight into Avian Genome Structure

The chicken genome reference assembly GRCg6a was generated from a single individual from the Red Jungle Fowl strain, inbred line UCD001. The assembly is a hybrid comprised primarily of WGS contigs, into which genomic clones from libraries of CH261, J_AD, J_AE, TAM31, TAM32 and TAM33 have been integrated.


Obtaining Data and Assembly Updates

There are a lot of human genome assemblies in GenBank; which one is the reference?

GRCh38 is the current major release of the human reference assembly. It is curated by the GRC, who also release minor assembly updates to reflect corrections and addition of new alternate sequence representations, without changing chromosome coordinates. You can recognize minor releases by the suffix “.p”. For example, GRCh38.p9 is the ninth minor/patch release of the human reference assembly. An assembly’s “Genome representation” indicates the completeness of the assembly as “full/partial”. The “version status”, which is shown in parentheses after the assembly accession, indicates whether the assembly is the “latest/replaced” version. See the FAQ What is the correct name for the human genome reference assembly? for additional information.

Where can I see the current working version of the reference assembly?

Working assembly versions include pre-release assembly edits and are subject to change at any time. You can view the tiling path files (TPFs) for the current working versions of all GRC-curated assemblies on the TPF Overview page of the GRC website. The TPFs and corresponding AGP (A Golden Path) files are also available on the GRC FTP site in the “MOST_RECENT” sub-directory of the GRC sub-directory for each organism (e.g. ftp://ftp.ncbi.nlm.nih.gov/pub/grc/human/GRC/MOST_RECENT/).

Where can I download genome sequences?

GRC assemblies should be downloaded from the GenBank Assembly FTP site. The GRC organism overview webpages provide convenient access to these data. From the GRC Home page select an organism (Human, Mouse, Zebrafish, Chicken). The FTP links to the genome data and sequences can be found in the page section “Download data”.

Where can I find the component sequences for the human reference assembly?

The construction of all GRC-curated assemblies is described in AGP (A Golden Path) files, which can be downloaded from the corresponding GenBank FTP directory. The organism overview pages on the GRC website (e.g. https://www.ncbi.nlm.nih.gov/grc/human) provide easy access to these data via the “Download data” section. The AGP files are found in the assembly unit subdirectories. Please see the current AGP specifications for a detailed description of this file format.
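
To give a sense of what an AGP file contains, the sketch below reads AGP lines into simple records. It is a minimal example rather than a complete or official parser: it assumes the nine-column, tab-delimited layout of AGP version 2.x, in which lines with component type N or U describe gaps and all other lines describe sequence components. The class names, function name and sample lines (including the identifiers chrExample and EXAMPLE01.1) are made up for illustration. Consult the AGP specification for the authoritative definition.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass(frozen=True)
class Component:
    obj: str             # object being assembled, e.g. a chromosome or scaffold
    obj_beg: int         # 1-based start of this part on the object
    obj_end: int         # 1-based inclusive end of this part on the object
    part_number: int
    component_type: str
    component_id: str    # accession.version of the contributing sequence
    component_beg: int
    component_end: int
    orientation: str


@dataclass(frozen=True)
class Gap:
    obj: str
    obj_beg: int
    obj_end: int
    part_number: int
    component_type: str  # "N" (known size) or "U" (unknown size)
    gap_length: int
    gap_type: str
    linkage: str
    linkage_evidence: str


def parse_agp(lines: Iterable[str]) -> Iterator[Union[Component, Gap]]:
    """Yield one record per non-comment AGP line."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 columns, got {len(fields)}: {line!r}")
        obj, beg, end, part, ctype = fields[:5]
        common = (obj, int(beg), int(end), int(part), ctype)
        if ctype in ("N", "U"):
            yield Gap(*common, int(fields[5]), fields[6], fields[7], fields[8])
        else:
            yield Component(*common, fields[5], int(fields[6]),
                            int(fields[7]), fields[8])


# Hypothetical sample lines, not real assembly data.
EXAMPLE_AGP = [
    "# made-up example",
    "chrExample\t1\t1000\t1\tF\tEXAMPLE01.1\t1\t1000\t+",
    "chrExample\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
]

if __name__ == "__main__":
    for record in parse_agp(EXAMPLE_AGP):
        print(record)
```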

Where can I find assembly statistics?

Assembly statistics for GRC-curated organisms are shown on the GRC website and can be downloaded from GenBank:

Human: Human Assembly Data for the current release. To download the statistics report for the latest release, go to https://www.ncbi.nlm.nih.gov/assembly/?term=GCF_000001405. On this page, clicking the assembly in the list takes you to the assembly page for the latest release. Under “Access the data” on the right side of the page, select “Download the statistics report”.

Mouse: Mouse Assembly Data for the current release. To download the statistics report for the latest release, go to https://www.ncbi.nlm.nih.gov/assembly/?term=GCF_000001635. On this page, clicking the assembly in the list takes you to the assembly page for the latest release. Under “Access the data” on the right side of the page, select “Download the statistics report”.

Zebrafish: Zebrafish Assembly Data for the current release. To download the statistics report for the latest release, go to https://www.ncbi.nlm.nih.gov/assembly/?term=GCF_000002035. On this page, clicking the assembly in the list takes you to the assembly page for the latest release. Under “Access the data” on the right side of the page, select “Download the statistics report”.

Chicken: Chicken Assembly Data for the current release. To download the statistics report for the latest release, go to https://www.ncbi.nlm.nih.gov/assembly/?term=GCF_000002315. On this page, clicking the assembly in the list takes you to the assembly page for the latest release. Under “Access the data” on the right side of the page, select “Download the statistics report”.
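
The four search links above differ only in the accession used as the search term. As a small convenience sketch, they can be generated as follows. The dictionary and function names are illustrative, and the code only builds the URLs quoted above without contacting NCBI.

```python
from urllib.parse import urlencode

# Unversioned RefSeq assembly accessions, taken from the links listed above.
REFSEQ_ASSEMBLY_TERMS = {
    "human": "GCF_000001405",
    "mouse": "GCF_000001635",
    "zebrafish": "GCF_000002035",
    "chicken": "GCF_000002315",
}

ASSEMBLY_SEARCH_BASE = "https://www.ncbi.nlm.nih.gov/assembly/"


def assembly_search_url(organism: str) -> str:
    """Return the Assembly search URL for a GRC-curated organism."""
    try:
        term = REFSEQ_ASSEMBLY_TERMS[organism.lower()]
    except KeyError:
        raise ValueError(f"not a GRC-curated organism: {organism!r}") from None
    return f"{ASSEMBLY_SEARCH_BASE}?{urlencode({'term': term})}"


if __name__ == "__main__":
    for name in REFSEQ_ASSEMBLY_TERMS:
        print(name, assembly_search_url(name))
```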

When are you going to update the human/mouse/zebrafish/chicken reference genome assembly again?

All assembly release plans, including those for non-coordinate changing patch updates, are provided on the organism-specific pages of the GRC website (Human, Mouse, Zebrafish, Chicken). When we start to plan for a major release for any organism, we make community-specific announcements via multiple routes of communication and we continue to provide updates as timelines become more defined. To receive information about the latest assembly updates, subscribe to the GRC-announce email list.

What is the difference between a GRC major assembly release and a patch (minor assembly) release?

A GRC major assembly release such as GRCh38 (human) or GRCm38 (mouse) is comprised of a primary assembly unit (consisting of the chromosomes, unlocalized and unplaced scaffolds) and one or more alternate loci assembly units that contain scaffolds providing alternate sequence representations for discrete regions of the primary assembly unit, and the alignments of those scaffolds to the chromosomes. A patch release, such as GRCh38.p12 (human) or GRCm38.p6 (mouse), is a minor assembly release that does not change any sequence coordinates in the major assembly release. In addition to having the same assembly units of the corresponding major release, patch releases include a “PATCHES” assembly unit. This additional unit includes scaffolds that provide updated sequence for particular genomic regions in the form of fix patches (assembly corrections) and novel patches (alternate sequence representations). Patch releases are cumulative, such that the PATCHES assembly unit for the latest minor release contains the scaffolds and associated data for all prior patch releases. For additional information, please see Introductions to Patches.

What are alternate loci and novel patches?

Alternate loci and novel patches enable the reference assembly to represent allelic diversity. They are scaffold sequences that are given chromosome context through alignments to the corresponding chromosome regions. Alternate loci scaffolds and their alignments are included in major assembly releases, while novel patch scaffolds and their alignments are included in subsequent patch releases for that assembly. They can be considered functionally equivalent, as novel patches will be reassigned to the role of alternate loci scaffolds at the time of the next major assembly release. Assembly regions for which the GRC provides alternate loci or novel patch scaffolds are typically those with known alternate haplotypes (e.g. immune-associated regions), highly variable genomic regions (e.g. olfactory receptor regions) or those where there are structural variants having 5 Kb or more sequence not represented on the chromosome. Human alternate loci and all novel patch scaffolds also include one or more anchor sequence components to ensure their robust alignment to the chromosomes. Anchor sequences are component(s) that are also found in the corresponding chromosome. The sequence locations corresponding to anchor components are annotated on the GenBank records for all alternate loci and patch scaffolds. For more detail on patches please see Introductions to Patches.

What are fix patches?

Fix patches represent changes to existing assembly sequences. These are generally error corrections (such as base changes, component replacements/updates, switch point updates or tiling path changes) or assembly improvements (such as extension of sequence into gaps). Fix patches are scaffold sequences that are given chromosome context through alignments to the corresponding chromosome regions. A fix patch scaffold represents a preview of what the assembly will look like at the next major release. When the next major release occurs, the accessions for the fix patch scaffolds will be deprecated and the changes will be found in the chromosomes. Human fix patch scaffolds also include one or more anchor sequence components to ensure their robust alignment to the chromosomes. Anchor sequences are component(s) that are also found in the corresponding chromosome. The sequence locations corresponding to anchor components are annotated on the GenBank records for all fix patch scaffolds. For more detail on patches please see Introductions to Patches.

What are assembly regions?

Human alternate loci and all patches are assigned to named assembly regions. The regions to which alternate loci and novel patches are assigned are defined as sequence ranges on primary assembly unit sequences (chromosome and unlocalized or unplaced scaffolds), while regions for fix patches may also be found on alternate loci scaffolds. A region contains one or more alternate loci or patch scaffolds. While scaffolds from different assembly units may overlap within a region, there is no overlap of scaffolds from the same assembly unit within a region. The GRC provides web pages with detailed information for each region that can be accessed from the “Patches and alternate loci” tables found on the organism overview pages of the GRC website (e.g. https://www.ncbi.nlm.nih.gov/grc/human). Assembly region reports can also be downloaded from the GenBank FTP site for the assembly of interest (e.g. GRCh38.p12_assembly_regions.txt and GRCm38.p6_assembly_regions.txt).
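
The overlap rule stated above (scaffolds from different assembly units may overlap within a region, but scaffolds from the same assembly unit may not) can be expressed as a simple check. The sketch below uses made-up region, scaffold and coordinate values, and its record layout is an assumption for illustration. It does not reflect the format of the assembly region reports.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Placement(NamedTuple):
    region: str         # named assembly region
    assembly_unit: str  # e.g. "ALT_REF_LOCI_1" or "PATCHES"
    scaffold: str
    start: int          # 1-based inclusive start on the primary sequence
    stop: int           # 1-based inclusive stop on the primary sequence


def same_unit_overlaps(placements: Iterable[Placement]) -> List[Tuple[str, str]]:
    """Return scaffold pairs that overlap within one region and assembly unit."""
    groups = defaultdict(list)
    for p in placements:
        groups[(p.region, p.assembly_unit)].append(p)

    conflicts = []
    for group in groups.values():
        group.sort(key=lambda p: p.start)
        furthest = group[0]  # placement reaching furthest to the right so far
        for current in group[1:]:
            if current.start <= furthest.stop:
                conflicts.append((furthest.scaffold, current.scaffold))
            if current.stop > furthest.stop:
                furthest = current
    return conflicts


# Hypothetical data: scaffold_1 and scaffold_2 overlap but belong to different
# assembly units (allowed); scaffold_3 and scaffold_4 overlap within PATCHES.
EXAMPLE_PLACEMENTS = [
    Placement("REGION_A", "ALT_REF_LOCI_1", "scaffold_1", 100, 500),
    Placement("REGION_A", "ALT_REF_LOCI_2", "scaffold_2", 300, 700),
    Placement("REGION_A", "PATCHES", "scaffold_3", 100, 200),
    Placement("REGION_A", "PATCHES", "scaffold_4", 150, 400),
]

if __name__ == "__main__":
    print(same_unit_overlaps(EXAMPLE_PLACEMENTS))
```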

What MHC haplotypes are used in the reference assembly?

We use sequences defined by the MHC consortium.

GenBank ID  RefSeq ID     Assembly unit   Clone library  Cell line  Haplotype
CM000668.2  NC_000006.12  Primary         CH501          PGF        A3-B7-DR15
GL000250.2  NT_167244.2   ALT_REF_LOCI_1  DAAP           APD        A1-B60-DR13
GL000251.2  NT_113891.3   ALT_REF_LOCI_2  CH502          COX        A1-B8-DR3
GL000252.2  NT_167245.2   ALT_REF_LOCI_3  DADB           DBB        A2-B57-DR7
GL000253.2  NT_167246.2   ALT_REF_LOCI_4  DAMA           MANN       A29-B44-DR7
GL000254.2  NT_167247.2   ALT_REF_LOCI_5  DAMC           MCF        A2-B62-DR4
GL000255.2  NT_167248.2   ALT_REF_LOCI_6  DAQB           QBL        A26-B18-DR3
GL000256.2  NT_167249.2   ALT_REF_LOCI_7  DASS           SSTO       A32-B44-DR4

What LRC-KIR haplotypes are used in the reference assembly?

LRC haplotypes provided as alternate loci or novel patches in the GRCh37 and GRCh38 assemblies are described in the following publications: Traherne et al., 2010, Horton et al., 2006, and Barrow and Trowsdale, 2008.

More information can be found at the LRC Haplotype Project and IPD-KIR.

GenBank ID  RefSeq ID       Assembly unit    Clone library
CM000681.2  NC_000019.10    Primary          CHM1
GL949746.1  NW_003571054.1  ALT_REF_LOCI_1   COX1
GL949747.2  NW_003571055.2  ALT_REF_LOCI_2   COX2
GL949748.2  NW_003571056.2  ALT_REF_LOCI_3   LRC_i
GL949749.2  NW_003571057.2  ALT_REF_LOCI_4   LRC_j
GL949750.2  NW_003571058.2  ALT_REF_LOCI_5   LRC_s
GL949751.2  NW_003571059.2  ALT_REF_LOCI_6   LRC_t
GL949752.1  NW_003571060.1  ALT_REF_LOCI_7   PGF1
GL949753.2  NW_003571061.2  ALT_REF_LOCI_8   PGF2
KI270938.1  NT_187693.1     ALT_REF_LOCI_9   mixed?
KI270882.1  NT_187636.1     ALT_REF_LOCI_10  FH15_B
KI270883.1  NT_187637.1     ALT_REF_LOCI_11  G085_A
KI270884.1  NT_187638.1     ALT_REF_LOCI_12  G085_BA1
KI270885.1  NT_187639.1     ALT_REF_LOCI_13  G248_A
KI270886.1  NT_187640.1     ALT_REF_LOCI_14  G248_BA2
KI270887.1  NT_187641.1     ALT_REF_LOCI_15  GRC212_AB
KI270888.1  NT_187642.1     ALT_REF_LOCI_16  GRC212_BA1
KI270889.1  NT_187643.1     ALT_REF_LOCI_17  LUCE_A
KI270890.1  NT_187644.1     ALT_REF_LOCI_18  LUCE_Bdel
KI270891.1  NT_187645.1     ALT_REF_LOCI_19  RSH_A
KI270914.1  NT_187668.1     ALT_REF_LOCI_20  RSH_BA2
KI270915.1  NT_187669.1     ALT_REF_LOCI_21  T7526_A
KI270916.1  NT_187670.1     ALT_REF_LOCI_22  T7526_Bdel
KI270917.1  NT_187671.1     ALT_REF_LOCI_23  ABC08_A
KI270918.1  NT_187672.1     ALT_REF_LOCI_24  ABC08_AB
KI270919.1  NT_187673.1     ALT_REF_LOCI_25  ABC08_AB
KI270920.1  NT_187674.1     ALT_REF_LOCI_26  FH05_A
KI270921.1  NT_187675.1     ALT_REF_LOCI_27  FH05_B
KI270922.1  NT_187676.1     ALT_REF_LOCI_28  FH06_A
KI270923.1  NT_187677.1     ALT_REF_LOCI_29  FH06_BA1
KI270929.1  NT_187683.1     ALT_REF_LOCI_30  FH08_A
KI270930.1  NT_187684.1     ALT_REF_LOCI_31  FH08_BAX
KI270931.1  NT_187685.1     ALT_REF_LOCI_32  FH13_A
KI270932.1  NT_187686.1     ALT_REF_LOCI_33  FH13_BA2
KI270933.1  NT_187687.1     ALT_REF_LOCI_34  FH15_A
GL000209.2  NT_113949.2     ALT_REF_LOCI_35  RP5_B
KV575252.1  NW_016107306.1  PATCHES          B05-tA01
KV575253.1  NW_016107307.1  PATCHES          A01-tA01
KV575246.1  NW_016107300.1  PATCHES          A01-tA01
KV575251.1  NW_016107305.1  PATCHES          A01-tA01
KV575255.1  NW_016107309.1  PATCHES          A01-tA01
KV575259.1  NW_016107313.1  PATCHES          A01-tA01
KV575247.1  NW_016107301.1  PATCHES          A01-tA01
KV575248.1  NW_016107302.1  PATCHES          A01-tA01
KV575256.1  NW_016107310.1  PATCHES          B01-tB01
KV575258.1  NW_016107312.1  PATCHES          B04-tB03
KV575254.1  NW_016107308.1  PATCHES          A03-tB02
KV575260.1  NW_016107314.1  PATCHES          B02-tA01
KV575249.1  NW_016107303.1  PATCHES          A01-tB04
KV575250.1  NW_016107304.1  PATCHES          A01-tB01
KV575257.1  NW_016107311.1  PATCHES          A04

Can I get reference assembly data sets formatted for use by sequence read alignment pipelines?

The GenBank FTP site provides assembly data for GRCh37.p13, GRCh38 and GRCm38 that are formatted and packaged for use with tools in several common sequence analysis pipelines, including BWA, Samtools and Bowtie. Known as analysis sets, the various data packages include copies of the assemblies with and without alternate loci scaffolds, with and without additional sequences commonly used as alignment targets, such as chr. EBV and the GRCh37 (https://media.nature.com/full/nature-assets/nature/journal/v526/n7571/extref/nature15393-s1.pdf, section 3.6.1) and GRCh38 decoys constructed by Heng Li, and with masking of genomic regions such as PAR and the centromeres. For complete information, see the README file provided with each analysis set.

Links to analysis sets:

Human genome assembly GRCh38

Human genome assembly GRCh37.p13

Mouse genome assembly GRCm38.p3

Does the human reference genome represent common/major allele at all chromosomal loci?

The human reference genome is a composite genome, derived from the genomes of several different individuals. The reference assembly chromosomes overwhelmingly represent the alleles found in their underlying component sequences, which are derived from these various DNA sources (see FAQ Where can I get information about the DNA sources for the human reference genome?). As part of its curation effort, the GRC strives to ensure that the reference genome represents alleles that are not unique to a specific individual or universally rare, but are commonly found in one or more populations. Some loci from GRCh37 were updated in GRCh38 as part of this work (see “Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly”). As sequence from more humans, representing even more populations, becomes available, we continue with this effort.

It should be noted that in some instances, the most “common” allele is neither the longest nor the ancestral allele, two other reference representations that are often requested of the GRC. The GRC uses alternate loci scaffolds to provide additional sequence representations for diverse genomic regions. By including multiple representations of such loci, the human reference genome assembly is better able to represent population genomic diversity. Thus, in some instances, the most “common”, longest or ancestral allele may be found on an alternate loci scaffold instead of the chromosome. For additional information on alternate loci, see the FAQ What are alternate loci and novel patches?.

What assembly method was used to create the reference assembly?

The reference assembly is distinguished from most other human assemblies by virtue of being a clone-based assembly comprised of DNA from multiple individuals, rather than a whole genome shotgun assembly of a single individual. As a result, each chromosome assembly is a haploid mosaic, rather than a haploid consensus, in which valid haplotypes may transition at clone boundaries. For additional information, please see: Finishing the euchromatic sequence of the human genome

What types of assembly resources are available from the GRC?

The GRC webpage provides users with a summary of all assembly regions under review, announcements about assembly release plans, links to download the assembly data and assembly statistics. A GRC blog discusses recent curation events and highlights genomic regions of interest. On the GRC website, organism-specific data (Human, Mouse, Zebrafish, and Chicken) are provided under separate tabs:

https://www.ncbi.nlm.nih.gov/grc/human/issues

https://www.ncbi.nlm.nih.gov/grc/mouse/issues

https://www.ncbi.nlm.nih.gov/grc/zebrafish/issues

https://www.ncbi.nlm.nih.gov/grc/chicken/issues

Assembly regions under GRC curation can also be viewed in tracks in several common genome browsers. A GRC-provided Track Hub is available in the Ensembl and UCSC browsers. Examples of individual tracks available in the GRC hub include GRC curation issues, alignments between the primary assembly and alternate loci and patches, optical mapping data and clone sequence anomalies.

At NCBI, the “Assembly Support” track set, accessed via the “Tracks” menu in the Genome Data Viewer (GDV), 1000 Genomes and Variation Viewer browsers, provides many of the same tracks as the GRC track hub. For more information on using NCBI Track Sets, see https://www.youtube.com/watch?v=Q9kOLBHZR4s and https://www.ncbi.nlm.nih.gov/tools/sviewer/faq/#tracksets.

What are the different types of GRC curation issues?

The GRC has defined the following categories for curation issues:

  • Clone Problem: Issue is related to a specific clone
  • Variation: No error, problem is related to biological variation
  • Path Problem: Issue is related to a tiling path problem
  • Localization Problem: Issue is related to a sequence localization (unlocalized or unplaced)
  • Missing sequence: Issue is related to sequence that is missing from assembly
  • Gap: Issue is related to a specific assembly gap (inter- or intra-scaffold)
  • GRC Housekeeping: General assembly improvement issues not affiliated with reported assembly errors. These include, but are not limited to, YAC replacements (not associated with a clone problem) and switch point updates not associated with clone or path problems
  • Unknown: Issue type is unknown

What is meant by the different statuses for GRC curation issues?

The progress of an issue in the GRC curation workflow is defined by its status:

  • Open: New issue, not yet reviewed
  • Under Review: Issue has undergone initial review, and it has been determined that work is required
  • Awaiting Elec Data: Work on issue has started, and is awaiting electronic data from trace analysis, alignment review, Genome Workbench analysis, PGP Viewer analysis, optical map (OM) review, etc.
  • Awaiting Exptl Data: Work on issue has started, and is awaiting experimental data from PCR, clone sequencing and/or mapping
  • Awaiting External Info: Work on issue has started, and an external (non-GRC) consultation has been initiated
  • Continuing Investigation: Issue requires further work
  • Resolved: Issue has been addressed, and all relevant data changes have been submitted to the GRC database
  • Reopened: Issue was previously resolved, but review indicates the problem has not been corrected
  • Stalled: Issue has been reviewed and determined to be unresolvable with current technologies
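
For scripts that process lists of GRC issues, the statuses above could be modeled as an enumeration. This is only a sketch. The status labels are copied from the list above, but the class name, the helper function and the choice to treat “Resolved” and “Stalled” as not in progress are assumptions made for illustration rather than a GRC definition.

```python
from enum import Enum


class IssueStatus(Enum):
    OPEN = "Open"
    UNDER_REVIEW = "Under Review"
    AWAITING_ELEC_DATA = "Awaiting Elec Data"
    AWAITING_EXPTL_DATA = "Awaiting Exptl Data"
    AWAITING_EXTERNAL_INFO = "Awaiting External Info"
    CONTINUING_INVESTIGATION = "Continuing Investigation"
    RESOLVED = "Resolved"
    REOPENED = "Reopened"
    STALLED = "Stalled"


# Assumed grouping: statuses for which no work is currently under way.
_NOT_IN_PROGRESS = frozenset({IssueStatus.RESOLVED, IssueStatus.STALLED})


def is_in_progress(status: IssueStatus) -> bool:
    """Return True if the issue is still moving through the workflow."""
    return status not in _NOT_IN_PROGRESS


if __name__ == "__main__":
    status = IssueStatus("Awaiting Exptl Data")  # look up by displayed label
    print(status.name, is_in_progress(status))
```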

Where can I find gene content and annotations of the genome reference assembly?

The GRC produces and curates reference assemblies, but does not perform gene annotation. You can obtain the most recent RefSeq annotation by the NCBI Genome annotation pipeline for Human (GRCh38), Mouse (GRCm38), Zebrafish (GRCz11), and Chicken (GRCg6a). GENCODE offers annotations for Human and Mouse, while Ensembl annotates zebrafish and chicken. UCSC provides annotations for Human, Mouse, Zebrafish and Chicken.

Does the human reference assembly contain representation for ribosomal DNA sequences?

Due to the highly repetitive nature of the 5S rDNA cluster on 1q42, and the 45S cluster on the p-arms of the acrocentric chromosomes, we are unable to provide a complete, biologically accurate representation for these regions in the human reference assembly with currently available resources. However, the GRCh38 reference assembly does provide a representation for a limited number of repeat copies in each cluster. Sequence from the 5S cluster (representing ~19 copies) is located at NC_000001.11 (GenBank accession: CM000663.2): 228,408,802-228,664,283. We recognize that this is a gross under-representation of this cluster, which is estimated to occur at ~100 repeats (PMID: 18025267), and will address this as resources become available. Sequence from the 45S cluster on the acrocentric p-arms is found on the unplaced scaffold NT_167214.1 (GenBank accession: GL000220.1). This includes ~1.5 copies of the 45S cluster. At this time, we do not know from which of the acrocentric chromosomes this sequence derives. You can track ongoing work for the 45S (HG-1101) and 5S (HG-2002) regions at the GRC website.

In the GRCm38 assembly, the Rn45s 45S pre-ribosomal RNA is not annotated, but there is a related sequence (Rn18s-rs5) located on chromosome 17. You can track GRC ongoing work on the issue related to Rn45s 45S (MG-4232). We recognize the importance of providing accurate representations for these important clusters and are continuing to look at new technologies (e.g. long read sequencing, optical maps) that may help us improve the reference for these regions.

How do I find the latest data about reference assembly problems?

The latest information about assembly problems, ongoing work and other curation issues related to GRC-managed genome assemblies is available on the GRC website:

Human: https://www.ncbi.nlm.nih.gov/grc/human/issues

Mouse: https://www.ncbi.nlm.nih.gov/grc/mouse/issues

Zebrafish: https://www.ncbi.nlm.nih.gov/grc/zebrafish/issues

Chicken: https://www.ncbi.nlm.nih.gov/grc/chicken/issues

You can search for issues and regions under review. The GRC provides a brief description of each issue, its resolution status and its mapping to the current and previous reference assembly versions. For more details, see the “Genome Issues Under Review” and “Individual Genome Issues” posts on the GRC blog.

How can I report an assembly problem?

If you find an error in the assembly or a variant region that needs to be better represented, please let us know at “Report an Issue”. If you have a question for the GRC, we welcome your comments at “Contact us”.

How do I cite the GRC or a reference assembly?

You can cite the GRC website or the articles “Modernizing reference genome assemblies” (for GRCh37) and “Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly” (for GRCh38).


Bioinformatics Tools

What tools can I use to map/convert data between different releases of the reference assembly or between the reference and other genome assemblies?

You can use the NCBI Genome Remapping Service to map annotation data from one coordinate system to another; select “Assembly-Assembly” from the top menu. This tool accepts inputs in a variety of formats, and uses assembly-assembly alignments to track the relationship between the two assemblies. The underlying assembly-assembly alignments to prior versions used by this tool are sanctioned by the GRC and are available on FTP in ASN.1 and GFF3 format (with CIGAR string). The tool can be accessed as a web interface or as an API. Documentation on the website provides details for use.

UCSC offers the LiftOver tool, which converts genome coordinates and genome annotation files between assemblies.

Ensembl offers the Assembly Converter to convert coordinates between different releases of a genome assembly.

Remapping results are straightforward and identical for regions of the genome that align well and are free of complications from repeats or structural variation, but can be problematic for complicated genomic regions, such as regions duplicated or collapsed between the old and new assemblies, or newly added regions, where reads may be missing or mis-mapped in the older assembly. Since all remapping tools are limited by their reliance on alignments between the old and new assemblies, de novo read mapping will likely be more accurate than remapping in such regions.
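
The dependence of remapping on assembly-assembly alignments can be illustrated with a toy example. The sketch below maps a position through a list of ungapped, same-strand aligned blocks and returns nothing when the position falls outside every block. It is a simplified illustration under those assumptions, with made-up coordinates and names, and it does not reproduce how the NCBI Genome Remapping Service, LiftOver or the Assembly Converter are implemented.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class AlignedBlock(NamedTuple):
    old_start: int  # 1-based start of the block on the old assembly
    new_start: int  # 1-based start of the block on the new assembly
    length: int     # number of aligned bases (ungapped, same strand)


def remap_position(blocks: Sequence[AlignedBlock], pos: int) -> Optional[int]:
    """Map an old-assembly position to the new assembly, or return None.

    `blocks` must be sorted by old_start and must not overlap.
    """
    starts = [b.old_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None  # position lies before the first aligned block
    block = blocks[i]
    offset = pos - block.old_start
    if offset >= block.length:
        return None  # position lies in an unaligned stretch after the block
    return block.new_start + offset


# Hypothetical alignment: the new assembly gained 50 bases between the blocks,
# and old positions 101-150 have no alignment.
EXAMPLE_BLOCKS = [
    AlignedBlock(old_start=1, new_start=1, length=100),
    AlignedBlock(old_start=151, new_start=201, length=50),
]

if __name__ == "__main__":
    for old_pos in (50, 120, 160):
        print(old_pos, "->", remap_position(EXAMPLE_BLOCKS, old_pos))
```

Positions in the unaligned stretch have no counterpart to report, which is the situation in which mapping reads directly to the new assembly is likely to be more reliable.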

What tools can I use to convert data between chromosome coordinates and alternate loci or patch scaffold coordinates?

You can use the NCBI Genome Remapping Service to map annotation data between the Primary Assembly and the Alternate Loci or patches; select “Alt loci remap” from the top menu. The tool can be accessed as a web interface or as an API.