Introduction

The Assembly database has information about the structure of assembled genomes as represented in an AGP file or as a collection of completely sequenced chromosomes. The database provides a versioned Assembly accession number that tracks changes to assemblies as they are updated by submitting groups over time. The web resource provides meta-data about assemblies such as assembly names (and alternate names), simple statistical reports of the assembly (type and number of contigs, scaffolds; N50s) and a history view of updates. It also tracks the relationship between an assembly submitted to the International Nucleotide Sequence Database Collaboration (INSDC), i.e. DDBJ, ENA or GenBank, and the assembly represented in the NCBI Reference Sequence (RefSeq) project.

Scope

The Assembly resource includes prokaryotic and eukaryotic genomes with a Whole Genome Shotgun (WGS) assembly, clone-based assembly, or completely sequenced genome (gapless chromosomes). Organelle genomes are included only when there is also a nuclear genome assembly. Similarly, plasmids are only included when they are associated with chromosome sequences. Viral genomes are included if they are in the NCBI Reference Sequence (RefSeq) database or have been selected as viral neighbor genomes by the NCBI viral genomes group (more details). Metagenomes are also included.

The database represents genomes assembled to different levels:

  • Complete genome assemblies
  • Assemblies that include chromosomes or linkage groups, scaffolds, and contigs
  • Assemblies that include scaffolds and contigs
  • Assemblies that include only contigs

Assembly accessions and versions

The assembly accession and version numbers shown in the Assembly resource unambiguously identify the set of sequences in a particular version of an assembly. The full assembly identifier is accession.version, e.g. GCA_000002285.2. These assembly identifiers allow anyone using them to know whether or not they are working with the same version of an assembly.
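
As an illustration of the accession.version format, the following is a minimal sketch that splits an assembly identifier into its parts and checks whether two identifiers name the same assembly version. It assumes that GenBank assembly accessions begin with "GCA_" and RefSeq assembly accessions with "GCF_"; the function and class names are hypothetical and not part of any NCBI tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: "GCA_" (GenBank) or "GCF_" (RefSeq), digits, optional ".version".
_ASSEMBLY_ID_RE = re.compile(r"^(GC[AF])_(\d+)(?:\.(\d+))?$")


class AssemblyId(NamedTuple):
    prefix: str             # "GCA" or "GCF"
    number: str             # digits of the accession, kept as text to preserve zeros
    version: Optional[int]  # None when no version was given

    @property
    def accession(self) -> str:
        """Accession without the version, e.g. 'GCA_000002285'."""
        return f"{self.prefix}_{self.number}"


def parse_assembly_id(text: str) -> AssemblyId:
    """Split 'accession.version' (or a bare accession) into its parts."""
    match = _ASSEMBLY_ID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    prefix, number, version = match.groups()
    return AssemblyId(prefix, number, int(version) if version else None)


def same_assembly_version(a: str, b: str) -> bool:
    """True only when both identifiers carry a version and are identical."""
    id_a, id_b = parse_assembly_id(a), parse_assembly_id(b)
    return id_a.version is not None and id_a == id_b
```

For example, parse_assembly_id("GCA_000002285.2") yields the accession GCA_000002285 with version 2, and an identifier without a version is never treated as naming a specific version.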

The first instance of an assembly that is provided by a submitter receives an assembly accession number with version 1. If the submitter later provides an updated genome assembly, that update has the same assembly accession number and its version is incremented. We describe this collection of all versions of an assembly accession as an assembly chain. The Assembly resource offers users a choice between seeing all assembly versions or only seeing the latest version in each assembly chain.
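
The idea of an assembly chain can be sketched as grouping identifiers by accession and keeping the highest version in each group. This is only an illustration of the concept, not how the Assembly resource is implemented; the function names are hypothetical and the input list is illustrative.

```python
from typing import Dict, Iterable, List, Tuple


def split_id(assembly_id: str) -> Tuple[str, int]:
    """Split 'accession.version' into (accession, version)."""
    accession, _, version = assembly_id.partition(".")
    return accession, int(version)


def latest_per_chain(assembly_ids: Iterable[str]) -> List[str]:
    """Return the highest version seen for each assembly accession (chain)."""
    latest: Dict[str, int] = {}
    for assembly_id in assembly_ids:
        accession, version = split_id(assembly_id)
        if version > latest.get(accession, 0):
            latest[accession] = version
    return [f"{acc}.{ver}" for acc, ver in sorted(latest.items())]


# Illustrative input: two chains, each with two versions.
ids = ["GCA_000146045.1", "GCA_000146045.2", "GCA_000002285.2", "GCA_000002285.1"]
print(latest_per_chain(ids))  # ['GCA_000002285.2', 'GCA_000146045.2']
```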

Finding assemblies of interest

Assemblies of interest can be found either by searching in the resource or by browsing the assemblies available for a particular organism.

Finding an assembly by searching

Search terms

Search terms should be entered into the search box in the grey bar found at the top of the page. The Assembly resource may be searched using several fields including:

  • an organism or species name
    • search terms that are recognized as the common name for an organism, e.g. cow, will automatically be translated to the scientific name
    • you can explicitly specify that your query term is a species or organism name by including '[orgn]' in the query (e.g., 'Ruminantia[orgn]' or 'Bos[orgn]'), although in most cases this is not necessary
  • an assembly accession
    • a search using an assembly accession without any version, e.g. GCA_000146045, will return the latest version of that assembly plus any earlier versions of the assembly
    • a search using an assembly accession with a version, e.g. GCA_000146045.2, will return the specified version whether or not it is the latest
    • either GenBank or RefSeq assembly accessions may be used
  • an assembly name, or synonym (e.g., hg19, GRCh37, GRCh38.p4)
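
The same search terms can in principle be sent programmatically. The sketch below only builds a query URL and makes no network request; it assumes access through the NCBI E-utilities ESearch service with db=assembly, which this page does not itself describe, and the helper names are hypothetical. Default filters applied by the web interface may not apply to such queries.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def organism_term(name: str) -> str:
    """Mark a query term explicitly as an organism name, e.g. 'Bos[orgn]'."""
    return f"{name}[orgn]"


def assembly_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for the Assembly database (no request is sent)."""
    params = {"db": "assembly", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(assembly_search_url(organism_term("Bos")))
print(assembly_search_url("GCA_000146045.2"))
```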

Search results

If only one assembly was matched by the search, then the Assembly details page for that assembly is shown. If the search matched multiple assemblies, then the results are presented in the standard NCBI document summary format. The assembly names in the record titles are linked to the Assembly details page.

The sidebar on the left provides a selection of filters (also known as facets) that can be used to refine the search results, e.g. to restrict the results to assemblies from a particular organism group, level of assembly, annotation status or RefSeq category. Select a filter to apply it to the search results. A checkmark will appear next to the activated filters. Subsequent searches will be filtered until the selected filters are cleared.

Filters to limit the search results to only the latest version for each assembly and to exclude anomalous assemblies are applied by default. These filters must be turned off if you want older versions of assemblies or anomalous assemblies to be included in the search results. Unselect a filter to stop applying an individual filter, use the "clear" button to stop applying all filters in a group, or use the "Clear all" button to turn off all filters.

Commonly used filters are shown in the sidebar by default. Additional filter groups can be exposed using the "Show additional filters" menu. Additional filters within a group can be exposed using the "Customize…" menus; note that selecting a filter for display does not automatically turn it on.

NCBI has a video tutorial showing how to use a similar set of filters to refine the search results in the PubMed database.

Some more specialized filters are only available under the "Advanced" search menu option "Filter" field, such as the set of "vhost" filters that enable genome assemblies for viruses with a particular host to be selected, e.g. "vhost human".

Information presented for each assembly in the search results

The document summaries reported in the search results include the following information:

Assembly name - the submitter's name for the assembly when one was provided, otherwise a default name is provided by NCBI; this is linked to a view of the Assembly details page.

Description - a short description of the assembly, when provided.

Organism - the scientific name of the organism from which the sequences in the assembly were derived, followed by the common name in parentheses (if the organism itself does not have a common name, a common name for the organism group is shown)

Infraspecific name - the strain, breed, cultivar or ecotype of the organism from which the sequences in the assembly were derived. [Field is not shown if the source material for the assembly is only described at the species level.]

Sex - the sex of the organism from which the sequences in the assembly were derived. [Field is not shown if the source material for the assembly does not specify a sex.]

Submitter - the submitting consortium, or the first organization listed when the submitter is a list of organizations. The full submitter information is available in BioProject.

Date - the date the sequences in the assembly were released in the INSDC databases (DDBJ, ENA or GenBank).

Assembly type - haploid, haploid-with-alt-loci (a haploid assembly with alternative loci, for example as provided by the Genome Reference Consortium for the human genome), diploid, unresolved diploid, or alternate pseudohaplotype. See the NCBI Assembly Data Model for a definition of these terms. [Field is not shown for haploid assemblies since this is the default type.]

Assembly level - the highest level of assembly for any object in the assembly:

  • Complete genome - all chromosomes are gapless and have no runs of 10 or more ambiguous bases (Ns), there are no unplaced or unlocalized scaffolds, and all the expected chromosomes are present (i.e. the assembly is not noted as having partial genome representation). Plasmids and organelles may or may not be included in the assembly but if present then the sequences are gapless.
  • Chromosome - there is sequence for one or more chromosomes. This could be a completely sequenced chromosome without gaps or a chromosome containing scaffolds or contigs with gaps between them. There may also be unplaced or unlocalized scaffolds.
  • Scaffold - some sequence contigs have been connected across gaps to create scaffolds, but the scaffolds are all unplaced or unlocalized
  • Contig - nothing is assembled beyond the level of sequence contigs
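
The four assembly levels can be read as a decision procedure. The following is a rough sketch of that reading, assuming the relevant facts about an assembly have already been summarized into a few fields; the AssemblySummary class and its field names are hypothetical, and NCBI's actual assignment may consider details not captured here.

```python
from dataclasses import dataclass


@dataclass
class AssemblySummary:
    chromosome_count: int       # chromosomes or linkage groups with sequence
    chromosomes_gapless: bool   # no gaps and no runs of 10 or more Ns
    loose_scaffold_count: int   # unplaced plus unlocalized scaffolds
    has_scaffolds: bool         # some contigs are connected across gaps
    full_representation: bool   # not noted as partial genome representation


def assembly_level(a: AssemblySummary) -> str:
    """Return the highest assembly level reached by any object in the assembly."""
    if a.chromosome_count > 0:
        complete = (
            a.chromosomes_gapless
            and a.loose_scaffold_count == 0
            and a.full_representation
        )
        return "Complete genome" if complete else "Chromosome"
    if a.has_scaffolds:
        return "Scaffold"
    return "Contig"
```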

Genome representation - whether the goal for the assembly was to represent the whole genome or only part of it:

  • Full - the data used to generate the assembly was obtained from the whole genome, as in Whole Genome Shotgun (WGS) assemblies for example. There may still be gaps in the assembly.
  • Partial - the data used to generate the assembly came from only part of the genome. Most assemblies have full genome representation with a minority being partial genome representation. Reasons for the genome representation being set to partial include:
    • the assembly description indicates that the assembly was targeted to a single chromosome or a subset of the genome
    • the chromosome set in the assembly is less than the expected chromosome complement for the organism, ignoring any plasmids, organelle chromosomes and the small sex chromosome (Y for mammals, W for birds)
    • the genome coverage in a WGS assembly is less than 1
    • the ungapped sequence length of the assembly is less than half the average for other assemblies from the same species
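
The reasons listed above for partial genome representation can be sketched as a set of checks. This is a simplified illustration under stated assumptions: the function and parameter names are hypothetical, the chromosome counts passed in are assumed to already exclude plasmids, organelle chromosomes and the small sex chromosome, and the "average for other assemblies from the same species" is taken to be a mean supplied by the caller.

```python
from typing import List, Optional


def partial_representation_reasons(
    targeted_subset: bool,
    chromosomes_present: Optional[int],
    chromosomes_expected: Optional[int],
    wgs_coverage: Optional[float],
    ungapped_length: int,
    species_mean_ungapped_length: Optional[float],
) -> List[str]:
    """Collect the reasons, if any, for flagging an assembly as partial."""
    reasons: List[str] = []
    if targeted_subset:
        reasons.append("targeted to a single chromosome or subset of the genome")
    if (
        chromosomes_present is not None
        and chromosomes_expected is not None
        and chromosomes_present < chromosomes_expected
    ):
        reasons.append("fewer chromosomes than the expected complement")
    if wgs_coverage is not None and wgs_coverage < 1:
        reasons.append("WGS genome coverage below 1")
    if (
        species_mean_ungapped_length is not None
        and ungapped_length < species_mean_ungapped_length / 2
    ):
        reasons.append("ungapped length below half the species average")
    return reasons
```

An empty list corresponds to full genome representation in this sketch.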

RefSeq category - shown if the assembly is a reference or representative genome in the NCBI Reference Sequence (RefSeq) project classification:

  • Reference genome - a manually selected high quality genome assembly that NCBI and the community have identified as being important as a standard against which other data are compared
  • Representative genome - a genome computationally or manually selected as a representative from among the best genomes available for a species or clade that does not have a designated reference genome
  • Prokaryotes may have more than one reference or representative genome per species. For more information see the Prokaryotic RefSeq Genomes web page
  • Eukaryotes have no more than one reference or representative genome per species. If there are no assemblies in RefSeq for a particular eukaryotic species, then the GenBank assembly that RefSeq would select as the best available for that species will be designated as the representative genome.
  • Viruses may have one or more reference genomes per species. The representative genome designation is not applied to viruses and viroids.

Relation to type material - shown if the sequences in the genome assembly were derived from type material, synonym type material or other type material (see Federhen 2015):

  • assembly from type material - the sequences in the genome assembly were derived from type material
  • assembly from synonym type material - the sequences in the genome assembly were derived from synonym type material
  • assembly from pathotype material - the sequences in the genome assembly were derived from pathovar type material
  • assembly designated as neotype - the sequences in the genome assembly were derived from neotype material
  • assembly designated as reftype - the sequences in the genome assembly were derived from reftype material
  • ICTV species exemplar - the International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly as the exemplar for the virus species
  • ICTV additional isolate - the International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly an additional isolate for the virus species

Synonyms - synonyms for the assembly are provided when known, including the UCSC name.

GenBank assembly accession - Accession & version for the GenBank version of the assembly.

RefSeq assembly accession - Accession & version for the RefSeq version of the assembly. Note: this is not always present because only certain assemblies are selected for RefSeq.

Version status - the status for the assembly version (latest, replaced, or suppressed) is shown in parentheses after the GenBank or RefSeq assembly accession & version.

Release type - whether this version of the genome assembly is a major, minor or patch release:

  • Major - changes from the previous assembly version result in a significant change to the coordinate system. The first version of an assembly is always a major release. Most subsequent genome assembly updates are also major releases. [Field is not shown for major releases since this is the default type.]
  • Minor - changes from the previous assembly version are limited to the following changes, none of which result in a significant change to the coordinate system of the primary assembly-unit:
    • adding, removing or changing a non-nuclear assembly-unit
    • dropping unplaced or unlocalized scaffolds
    • adding up to 50 unplaced or unlocalized scaffolds which are shorter than the current scaffold-N50 value
    • replacing a component with a gap of the same length
  • Patch - the only change from the previous assembly version is the addition or modification of a patch assembly-unit (relevant for assemblies maintained by the Genome Reference Consortium)
  • See the NCBI Assembly Data Model for definitions of assembly-units and genome patches.

Assembly anomaly - any assembly anomalies noted by the NCBI Reference Sequence (RefSeq) group.

Excluded from RefSeq - reasons the assembly was excluded from the NCBI Reference Sequence (RefSeq) project, including any assembly anomalies.

IDs - unique identifiers that are internal to NCBI's Entrez search and retrieval system.

Browsing assemblies by organism

Follow the link provided on the Home page or that provided under the top-most search bar ("Browse by organism") to access this page. The default view presents the list of Eukaryotic organisms for which there is at least one complete or partial genome assembly available. To find a subset, delete the current entry in the "Assembly information by organism" search box, begin typing an organism name, select from the list of auto-completion terms and then click on the "Search by organism" button. For example, as you begin typing 'yeast' a list appears with options for budding yeasts, lager yeast, fission yeast etc.

All assemblies are shown by default; this includes previous versions when an assembly has been updated over time. Select the tab "Show only latest assemblies" to see only the most recent version for each assembly. The table can be sorted by organism, name, submitter, date or assembly level.

Follow the link from the assembly name to see more detailed information on that assembly.

Assembly details page

Assembly meta-data section

The top of the Assembly details page includes the assembly name as the page title. Below that are several meta-data elements which are found in the document summary display as described above. Additional data elements are provided on the Assembly details page including:

Taxonomy check - the result of NCBI's taxonomy validation when available, currently for prokaryotes only. The status displayed comes from the Average Nucleotide Identity (ANI) results available in the FTP file ANI_report_prokaryotes.txt, described in README_ANI_report_prokaryotes.txt. The specific method used is described in Ciufo et al. 2018. The best-match-status (column 23) and comment (column 24) are converted into three Taxonomy check statuses as follows.

  • OK - the ANI result is consistent with the declared species
    • the best-match-status is species-match, subspecies-match, derived-species-match, synonym-match, genus-match, approved-mismatch, or the comment indicates either that the assembly is the type-strain and no match is expected, or that the assembly is the type-strain, the mismatch is within genus and is expected
  • Inconclusive - the ANI result is inconclusive
    • the best-match-status is low-coverage, below-threshold-match, below-threshold-mismatch, na, or the comment indicates that the assembly is a type-strain that failed to match other type-strains of its species
  • Failed - the ANI result is inconsistent with the declared species
    • the best-match-status is mismatch and the comment is na
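
The conversion described above can be sketched as a lookup. The best-match-status values are those listed in the text; the comment categories are hypothetical labels standing in for the free-text comment column, whose exact wording is not given here. The precedence used when a status and a comment point to different results is also an assumption.

```python
OK_STATUSES = {
    "species-match", "subspecies-match", "derived-species-match",
    "synonym-match", "genus-match", "approved-mismatch",
}
INCONCLUSIVE_STATUSES = {
    "low-coverage", "below-threshold-match", "below-threshold-mismatch", "na",
}

# Hypothetical labels for what the comment column indicates.
OK_COMMENTS = {
    "type-strain-no-match-expected",
    "type-strain-mismatch-within-genus-expected",
}
INCONCLUSIVE_COMMENTS = {"type-strain-failed-to-match"}


def taxonomy_check(best_match_status: str, comment_kind: str = "na") -> str:
    """Map ANI best-match-status and comment category to a Taxonomy check status."""
    if best_match_status in OK_STATUSES or comment_kind in OK_COMMENTS:
        return "OK"
    if best_match_status in INCONCLUSIVE_STATUSES or comment_kind in INCONCLUSIVE_COMMENTS:
        return "Inconclusive"
    if best_match_status == "mismatch" and comment_kind == "na":
        return "Failed"
    raise ValueError(f"unrecognized combination: {best_match_status!r}, {comment_kind!r}")
```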

Isolate - the isolate from which the sequences in the assembly were derived. [Field is not shown if the source material for the assembly does not specify an isolate.]

BioProject - the BioProject that generated the assembly

BioSample - the BioSample from which the assembly was generated

RefSeq Assembly and GenBank Assembly Identical - yes, no, or n/a, indicating whether the RefSeq and GenBank assemblies are identical.

WGS project - the accession prefix and version of the whole genome shotgun (WGS) project.

Linked assembly - the accession.version and designation (principal or alternate pseudohaplotype) of a paired genome assembly derived from the same diploid individual (see the assembly type definitions).

Assembly version history

Click on the link "Show revision history" to display a table reporting assembly updates over time for the GenBank and RefSeq assemblies. Columns include:

GenBank assembly accession - the accession.version of the GenBank assembly. The table is sorted by this column.

(no header) Identity column - indicates when paired GenBank and RefSeq assembly accessions are identical or not identical. n/a is shown when the two assemblies are not paired.

RefSeq assembly accession - the accession.version of the RefSeq assembly.

Assembly name - the submitter's name for the assembly when one was provided, otherwise a default name is provided by NCBI

Assembly level - the highest level of assembly for any object in the assembly (values are as described above).

Status - the current status of the GenBank and/or RefSeq assembly accession.version. The possible values are latest, replaced, or suppressed.

Global statistics

The Global statistics section reports general statistics information including:

Total sequence length - total length of all top-level sequences.

Total ungapped length - total length of all top-level sequences ignoring gaps. Any stretch of 10 or more Ns in a sequence is treated like a gap.

Gaps between scaffolds - number of unspanned gaps between scaffolds.

Number of scaffolds - number of scaffolds including placed, unlocalized, unplaced, alternate loci and patch scaffolds.

Scaffold N50 - length such that scaffolds of this length or longer include half the bases of the assembly.

Scaffold L50 - number of scaffolds that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly.

Number of Contigs - total number of sequence contigs in the assembly. Any stretch of 10 or more Ns in a sequence is treated as a gap between two contigs in a scaffold when counting contigs and calculating contig N50 & L50 values.
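
The rule that a run of 10 or more Ns counts as a gap can be illustrated with a short sketch that splits a sequence into contigs and sums their lengths. The function names are hypothetical, and gaps recorded in an AGP file rather than as Ns in the sequence are not handled here.

```python
import re
from typing import List

# A run of 10 or more Ns is treated as a gap; shorter runs stay inside a contig.
_N_RUN = re.compile(r"[Nn]{10,}")


def split_contigs(sequence: str) -> List[str]:
    """Split a scaffold or chromosome sequence into contigs at runs of 10+ Ns."""
    return [piece for piece in _N_RUN.split(sequence) if piece]


def ungapped_length(sequence: str) -> int:
    """Length of the sequence ignoring runs of 10 or more Ns."""
    return sum(len(contig) for contig in split_contigs(sequence))


seq = "ACGT" + "N" * 10 + "ACGNNNT" + "N" * 12 + "GG"
print(split_contigs(seq))    # ['ACGT', 'ACGNNNT', 'GG']
print(len(seq), ungapped_length(seq))  # 35 13
```

Note that the three Ns inside the second contig are kept and counted, since only runs of 10 or more are treated as gaps.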

Contig N50 - length such that sequence contigs of this length or longer include half the bases of the assembly.

Contig L50 - number of sequence contigs that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly.
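
The N50 and L50 definitions apply equally to scaffolds and contigs, and can be computed from a list of sequence lengths as in this minimal sketch. The function name is hypothetical, and the total is taken as the sum of the lengths supplied.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of sequence lengths.

    N50 is the length such that sequences of this length or longer
    include half the bases; L50 is the number of such sequences.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if 2 * running >= total:
            return length, count
    raise ValueError("no sequence lengths given")


# Total is 300; the two longest sequences (80 + 70) reach half of it.
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # (70, 2)
```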

Total number of chromosomes and plasmids - total number of chromosomes, organelle genomes, and plasmids in the assembly.

Number of component sequences (WGS or clone) - total number of component WGS or clone sequences in the assembly.

Number of regions with alternate loci or patches - number of genomic regions that contain one or more alternate loci or patch scaffolds.

Some statistics are omitted if they are not relevant for the structure of a particular assembly, for example if an assembly consists entirely of completely sequenced chromosomes.

If the assembly type is either haploid-with-alt-loci or diploid, then all statistics in the Global statistics box are for the Primary Assembly, except for the number of regions with alternate loci or patches.

Assembly definition

This tab includes a table providing the name and sequence accession.version for each chromosome in the assembly. Both GenBank and RefSeq sequence accession.versions are shown when the assembly has both a GenBank version and a RefSeq version, and an identity column indicates whether or not the GenBank and RefSeq sequences are identical. The table also shows counts for any unlocalized or unplaced scaffolds in the assembly, and provides a link to download a tab-delimited file that gives the sequence accession.version for every object in the assembly, including the unlocalized and unplaced scaffolds.

If the full assembly is composed of more than one assembly-unit, then additional tables list the assembly-units and any genomic regions that have been defined in the assembly. These provide two different ways of viewing the alternate loci or patch scaffolds present in some assemblies. The first is by assembly-unit and the second is by region.

Assembly-unit table

The Assembly-Unit table is shown only when the full assembly is composed of more than one assembly-unit. Examples of additional assembly-units include a non-nuclear assembly-unit containing sequences from organelle(s), and alternate-loci assembly-units in a haploid-with-alt-loci type assembly. The right-most table refreshes as you select a different assembly-unit from the list. The columns shown in the table vary depending on the type of assembly-unit selected. The content of the table for the Primary Assembly unit is described above, and non-nuclear type units have the same layout. The table displayed for alternate-loci assembly units shows the name, chromosome assignment, and GenBank and RefSeq sequence accession.versions for all the scaffolds in the assembly-unit. The information for a PATCHES assembly-unit also shows whether each patch scaffold is of type fix or novel.

Regions table

The Regions table is shown only for haploid-with-alt-loci type assemblies that have alternate loci or patch scaffolds assigned to defined genomic regions. The Regions table shows the region name and its chromosomal location (in the format chromosome name:start-end) and is sorted by chromosomal location. Selecting a region refreshes the right-most table which reports the name, chromosome assignment, and GenBank and RefSeq sequence accession.versions for all the scaffolds in the region. The scaffolds in a region will be alternate loci scaffolds or patch scaffolds and may come from more than one assembly-unit.
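
A location in the chromosome name:start-end format can be parsed as in the sketch below. The names are hypothetical, the example value is illustrative rather than a real region, and the handling of thousands separators is an assumption about how the coordinates might be displayed.

```python
from typing import NamedTuple


class RegionLocation(NamedTuple):
    chromosome: str
    start: int
    end: int


def parse_region_location(text: str) -> RegionLocation:
    """Parse 'chromosome name:start-end' into its three parts."""
    chromosome, colon, span = text.strip().rpartition(":")
    start_text, dash, end_text = span.partition("-")
    if not (chromosome and colon and dash):
        raise ValueError(f"not in chromosome:start-end format: {text!r}")
    # Assumption: coordinates may be displayed with thousands separators.
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    return RegionLocation(chromosome, start, end)


print(parse_region_location("1:1000-2000"))
```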

Full assembly definition report

The “Download the full sequence report” link provides a tab-delimited file that gives the role and sequence accession.version for every top-level object in the full assembly, including all the unlocalized and unplaced scaffolds. The report covers all assembly-units indicated in the Assembly-unit table, hence it includes any alternate loci scaffolds or patch scaffolds in the assembly.

The Assembly definition report has a header that provides some meta-data for the assembly, including the assembly name, assembly accession.version (both GenBank & RefSeq if available), the scientific name of the organism, and the organism's taxonomy ID. This header is followed by rows consisting of the following columns:

Object name - name of the chromosome, linkage-group, plasmid, or scaffold.

Role - the role the object has in the assembly:

  • chromosome - the object represents a chromosome, linkage-group, or plasmid
  • unlocalized-scaffold - the object is a scaffold that is associated with a particular chromosome but has not been localized to a specific position on the chromosome
  • unplaced-scaffold - the object is a scaffold that does not have a chromosome assignment
  • alt-scaffold - the object is either an alternate loci scaffold or a patch scaffold

Chromosome - the chromosome with which the object is associated; 'unknown' is used for unplaced scaffolds.

GenBank Accn - GenBank sequence accession.version.

RefSeq Accn - RefSeq sequence accession.version.

Assembly Unit - the name of the assembly-unit that contains the object.

If the GenBank and RefSeq versions of the assembly are not identical, some GenBank sequence accession.version fields may be empty, or may show an accession.version in parentheses, both of which indicate that this sequence only appears in the RefSeq version of the assembly. Even though a GenBank sequence accession.version appearing in parentheses is not part of the GenBank assembly, it is listed because it is an identical pair to a RefSeq sequence that is part of the RefSeq assembly. Conversely, an empty RefSeq sequence accession.version field or a RefSeq sequence accession.version in parentheses indicates that this sequence only appears in the GenBank version of the assembly. Any object that is part of both the GenBank and RefSeq assembly versions but which has a different sequence in the two versions will appear as two lines in the report, one line containing the GenBank sequence accession.version and the other line containing the RefSeq sequence accession.version.
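
Reading the report can be sketched as follows, using only the six columns described above. Several details are assumptions: that header lines start with "#", that an absent accession appears as an empty field or "na", and that the columns occur in the order listed; the downloaded file may contain additional columns. The accessions in the sample are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterator, NamedTuple, Optional

# Assumed column order, following the description in the text.
COLUMNS = ("name", "role", "chromosome", "genbank", "refseq", "unit")


class Accn(NamedTuple):
    accession: Optional[str]  # None when the field is empty
    in_assembly: bool         # False when empty or shown in parentheses


def parse_accn(field: str) -> Accn:
    """Interpret an accession.version field, including the parenthesized form."""
    value = field.strip()
    if value in ("", "na"):
        return Accn(None, False)
    if value.startswith("(") and value.endswith(")"):
        return Accn(value[1:-1], False)
    return Accn(value, True)


def read_report(text: str) -> Iterator[Dict[str, object]]:
    """Yield one record per data row of a tab-delimited sequence report."""
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the assumed '#' header lines
        padded = (row + [""] * len(COLUMNS))[: len(COLUMNS)]
        record: Dict[str, object] = dict(zip(COLUMNS, padded))
        record["genbank"] = parse_accn(padded[3])
        record["refseq"] = parse_accn(padded[4])
        yield record


# Made-up rows: the second sequence appears only in the RefSeq assembly.
sample = "\n".join([
    "# Assembly name: example",
    "\t".join(["1", "chromosome", "1", "AA000001.1", "ZZ_000001.1", "Primary Assembly"]),
    "\t".join(["MT", "chromosome", "MT", "(AA000002.1)", "ZZ_000002.1", "non-nuclear"]),
])
for rec in read_report(sample):
    print(rec["name"], rec["genbank"], rec["refseq"])
```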

Assembly statistics

Detailed statistics are provided in one or more tables under the Assembly Statistics tab. These tables include counts of scaffolds, total sequence length, ungapped sequence length, scaffold N50, number of spanned gaps, and number of unspanned gaps. The table in the first second-level tab contains statistics for each chromosome in the Primary Assembly unit. If a chromosome has unlocalized scaffolds, then additional rows are included to show the statistics for the unlocalized scaffolds alone and for 'All' the sequences assigned to a chromosome (i.e. the chromosome and unlocalized scaffolds combined). If the assembly includes any unplaced scaffolds, then an additional row shows the statistics for all the unplaced scaffolds.

If the full assembly is composed of more than one assembly-unit, then a second tab provides a table showing statistics for the scaffolds in each assembly-unit. Assemblies that include alternate loci or patch assembly-units may also have a third tab that provides statistics for the alternate or patch scaffolds in each genomic region.

Access the data

Links to download the full sequence report (as described above) and a statistics report are provided for all assemblies. A link to download a regions report is also provided for those haploid-with-alt-loci type assemblies that have defined genomic regions.

Links to the relevant FTP directories from which users can download genome sequences and annotation (if available) are shown whenever the assembly being viewed is the latest GenBank or RefSeq version in the assembly chain. Some older assembly versions also have data on the genomes FTP site and a download link is also shown when one of these versions is viewed. GenBank submissions may or may not include annotation depending on what the submitter provided. In contrast, annotation data is available for all RefSeq genome assemblies except for some viruses.

The "Access the data" section for eukaryotic genome assemblies also provides a link to a BLAST web page preconfigured to search against the genomic sequences in the assembly. In addition, many chromosome-level eukaryotic genome assemblies in RefSeq have a "View the Genome" link to an interactive genome annotation viewer. Genome assemblies annotated by the NCBI Eukaryotic Genome Annotation pipeline also have a link to annotation summary reports.

Glossary

Assembly designated as neotype - the sequences in the genome assembly were derived from neotype material (see Federhen 2015)

Assembly designated as reftype - the sequences in the genome assembly were derived from reftype material (see Ciufo 2018)

Assembly from pathotype material - the sequences in the genome assembly were derived from pathovar type material

Assembly from synonym type material - the sequences in the genome assembly were derived from synonym type material (see Federhen 2015)

Assembly from type material - the sequences in the genome assembly were derived from type material (see Federhen 2015)

Assembly level - the highest level of assembly for any object in the assembly:

  • Complete genome - all chromosomes are gapless and have no runs of 10 or more ambiguous bases (Ns), there are no unplaced or unlocalized scaffolds, and all the expected chromosomes are present (i.e. the assembly is not noted as having partial genome representation). Plasmids and organelles may or may not be included in the assembly but if present then the sequences are gapless.
  • Chromosome - there is sequence for one or more chromosomes. This could be a completely sequenced chromosome without gaps or a chromosome containing scaffolds or contigs with gaps between them. There may also be unplaced or unlocalized scaffolds.
  • Scaffold - some sequence contigs have been connected across gaps to create scaffolds, but the scaffolds are all unplaced or unlocalized
  • Contig - nothing is assembled beyond the level of sequence contigs

Contig - a contiguous stretch of sequence bounded by the end of the sequence, a gap of any kind or a run of 10 or more Ns.

Full genome representation - the data used to generate the assembly was obtained from the whole genome, as in Whole Genome Shotgun (WGS) assemblies for example. There may still be gaps in the assembly.

ICTV additional isolate - the International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly an additional isolate for the virus species

ICTV species exemplar - the International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly as the exemplar for the virus species

Infraspecific name - the strain, breed, cultivar or ecotype of the organism from which the sequences in the assembly were derived.

Major release - changes from the previous assembly version result in a significant change to the coordinate system. The first version of an assembly is always a major release. Most subsequent genome assembly updates are also major releases.

Minor release - changes from the previous assembly version are limited to the following changes, none of which result in a significant change to the coordinate system of the primary assembly-unit:

  • adding, removing or changing a non-nuclear assembly-unit
  • dropping unplaced or unlocalized scaffolds
  • adding up to 50 unplaced or unlocalized scaffolds which are shorter than the current scaffold-N50 value
  • replacing a component with a gap of the same length

Partial genome representation - the data used to generate the assembly came from only part of the genome. Most assemblies have full genome representation with a minority being partial genome representation. Reasons for the genome representation being set to partial include:

  • the assembly description indicates that the assembly was targeted to a single chromosome or a subset of the genome
  • the chromosome set in the assembly is less than the expected chromosome complement for the organism, ignoring any plasmids, organelle chromosomes and the small sex chromosome (Y for mammals, W for birds)
  • the genome coverage in a WGS assembly is less than 1
  • the ungapped sequence length of the assembly is less than half the average for other assemblies from the same species

Patch release - the only change from the previous assembly version is the addition or modification of a patch assembly-unit (relevant for assemblies maintained by the Genome Reference Consortium)

Reference genome - a category in the NCBI Reference Sequence (RefSeq) project classification applied to a manually selected high quality genome assembly that NCBI and the community have identified as being important as a standard against which other data are compared

Representative genome - a category in the NCBI Reference Sequence (RefSeq) project classification applied to a genome computationally or manually selected as a representative from among the best genomes available for a species or clade that does not have a designated reference genome. Notes:

  • Prokaryotes may have more than one reference or representative genome per species. For more information see the Prokaryotic RefSeq Genomes web page
  • Eukaryotes have no more than one reference or representative genome per species. If there are no assemblies in RefSeq for a particular eukaryotic species, then the GenBank assembly that RefSeq would select as the best available for that species will be designated as the representative genome.
  • Viruses may have one or more reference genomes per species. The representative genome designation is not applied to viruses and viroids.

Scaffold - an object comprised of one or more sequence contigs connected by spanned gaps, i.e. gaps that have linkage evidence (see the AGP specification), also called within-scaffold gaps. Scaffolds are bounded by the end of the sequence or by unspanned gaps, i.e. gaps without linkage evidence, also called between-scaffold gaps. If an assembly has some contigs linked into scaffolds, then any singleton contigs are also treated as scaffolds.

Top-level sequences - the most highly assembled sequences in a genome assembly, i.e. chromosomes, plasmids, unplaced/unlocalized scaffolds or contigs, alt-loci scaffolds and patch scaffolds. The set of top-level sequences provides a non-redundant representation of the assembly since it excludes lower level sequences that are components in a higher level sequence, i.e. contigs are not included if they are part of scaffolds, scaffolds are not included if they are part of chromosomes.

Last updated: 2021-02-26T04:33:44Z