Genome Size Check

GenBank compares the size of a submitted genome assembly to the expected genome size range for the species to identify outliers that can result from errors such as:

  • incorrect organism assignment
  • metagenome submitted as an organism genome
  • targeted sub-genome assembly not flagged as partial genome representation
  • gross contamination with other sequences

The NCBI Genome Size Check API can be used to check the size of a genome assembly against the expected genome size range in advance of submission.

Expected Genome Size Range

NCBI calculates an expected genome size range for all species that have at least four assemblies in the NCBI Assembly database. The "genome size" is the ungapped length of the genome assembly, i.e., gaps and runs of 10 or more Ns are ignored. The expected genome size for eukaryotes is the value for a haploid genome assembly. The rules used to calculate the expected genome size range for a species can be summarized as:

  • skip assemblies that are flagged as partial, anomalous or excluded-from-RefSeq for other reasons
  • if the species has at least four genome assemblies remaining
    • calculate the median and standard-deviation (std-dev) of the genome assembly sizes
    • if 4 standard-deviations is between 20% and 50% of the median,
      • then the expected genome size range is: median - 4x std-dev to median + 4x std-dev
    • if 4 standard-deviations is less than 20% of the median,
      • then the expected genome size range is: 80% to 120% of median
    • if 4 standard-deviations is more than 50% of the median,
      • then the expected genome size range is: 50% to 150% of median
  • if the species has fewer than four genome assemblies remaining but one of them is a RefSeq reference genome assembly
    • the expected genome size range is: 80% to 120% of the size of the reference genome
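
The rules above can be sketched in Python as follows. This is only an illustration of the listed rules, not NCBI's implementation: the function name and parameters are hypothetical, the input is assumed to be already filtered, and the sample standard deviation is an assumption because the text does not say which estimator is used.

```python
import statistics

def expected_size_range(sizes, reference_size=None):
    """Return (minimum, maximum) expected genome size, or None if no rule applies.

    sizes: ungapped lengths of the assemblies left after skipping partial,
           anomalous, or otherwise excluded assemblies.
    reference_size: ungapped length of a RefSeq reference genome assembly,
           if the species has one (hypothetical parameter).
    """
    if len(sizes) >= 4:
        median = statistics.median(sizes)
        # Assumption: sample standard deviation; the source does not specify.
        spread = 4 * statistics.stdev(sizes)
        # Clamp the half-width of the range to between 20% and 50% of the median.
        if spread < 0.2 * median:
            spread = 0.2 * median
        elif spread > 0.5 * median:
            spread = 0.5 * median
        return (median - spread, median + spread)
    if reference_size is not None:
        # Fewer than four assemblies, but a reference genome is available.
        return (0.8 * reference_size, 1.2 * reference_size)
    return None
```

Clamping the half-width reproduces the three cases in the list: a tight distribution falls back to 80% to 120% of the median, and a very wide one is capped at 50% to 150%.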

The expected genome sizes are calculated daily and reported in a file on the NCBI genomes FTP site:

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/species_genome_size.txt.gz

Accepted sizes for other genomes

We fall back to using broader size ranges when an expected genome size is not available because the submitted assembly is from:

  • a species for which there are fewer than four suitable public genome assemblies in the Assembly database
  • an unidentified species, e.g. Vibrio sp. X123
  • an unspecified organism, e.g. uncultured proteobacterium or most Candidatus
  • a metagenome or unclassified organism

In such cases, the accepted genome size ranges are as follows:

  • organism is in the archaea superkingdom: 100,000 bp to 15,000,000 bp
  • organism is in the bacteria superkingdom: 100,000 bp to 15,000,000 bp
  • organism is in the eukaryota superkingdom: 100,000 bp to unlimited (100 Gbases in practice)
  • organism is in the viruses superkingdom: 100 bp to 15,000,000 bp
  • metagenome or unclassified organism: at least 200 bp (the minimum length for a Whole Genome Shotgun sequence in the International Nucleotide Sequence Database Collaboration)
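
As a rough illustration, the fallback ranges above can be written as a lookup table. The group labels and the function name are hypothetical, the bounds are treated as inclusive (an assumption), and the eukaryote upper limit is left open because the text gives no hard limit.

```python
# Fallback ranges in base pairs, copied from the list above.
# None as the upper bound means that no upper limit is stated.
FALLBACK_RANGES = {
    "archaea": (100_000, 15_000_000),
    "bacteria": (100_000, 15_000_000),
    "eukaryota": (100_000, None),  # "unlimited (100 Gbases in practice)"
    "viruses": (100, 15_000_000),
    "metagenome_or_unclassified": (200, None),
}

def fallback_status(group, length):
    """Classify an ungapped length against the fallback range for a group."""
    low, high = FALLBACK_RANGES[group]
    if length < low:
        return "too_small"
    if high is not None and length > high:
        return "too_large"
    return "within_range"
```

For example, fallback_status("bacteria", 20_000_000) evaluates to "too_large" under these assumptions.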

Genome Size Check API

URL

https://api.ncbi.nlm.nih.gov/genome/v0/expected_genome_size

Input parameters

  • species_taxid=<INTEGER>
    • a species level NCBI taxonomy ID, or a sub-species taxonomy ID that will be automatically mapped up to the species level
    • integer
  • length=<VALUE>
    • the size of the genome assembly in base-pairs, ignoring gaps and N bases
    • either an integer number of basepairs or a length expressed with standard suffixes: K, M, G, KB, MB, GB, Kbp, Mbp, Gbp
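
To see what a suffixed length corresponds to in base pairs before sending it, a client-side sketch like the one below may help. It is not the API's own parser: the function name is hypothetical, the suffixes are assumed to be decimal multipliers (K = 10^3, M = 10^6, G = 10^9), matching is case-sensitive, and fractional base pairs are truncated, all of which are assumptions.

```python
import re
from decimal import Decimal

# Assumption: decimal multipliers; the source only says "standard suffixes".
_SUFFIXES = {
    "": 1,
    "K": 10**3, "KB": 10**3, "Kbp": 10**3,
    "M": 10**6, "MB": 10**6, "Mbp": 10**6,
    "G": 10**9, "GB": 10**9, "Gbp": 10**9,
}
_LENGTH_RE = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-z]*)$")

def parse_length(text):
    """Convert a length such as '4.41M' or '3.1Gbp' to an integer number of bp."""
    match = _LENGTH_RE.match(str(text).strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported length: {text!r}")
    number, suffix = match.groups()
    # Decimal avoids floating-point rounding in values such as 4.41 * 10**6.
    return int(Decimal(number) * _SUFFIXES[suffix])
```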

Examples

  1. expected_genome_size?species_taxid=287&length=6264404
  2. expected_genome_size?species_taxid=1773&length=4.41M
  3. expected_genome_size?species_taxid=9606&length=3.1Gbp

Results

Successful requests output XML as in the following example.

<?xml version="1.0" encoding="ISO-8859-1"?>
<genome_size_response>
<input>
 <species_taxid>9606</species_taxid>
 <length>3000000000</length>
</input>
 <organism_name>Homo sapiens</organism_name>
 <species_taxid>9606</species_taxid>
 <size_source>species</size_source>
 <genome_count>95</genome_count>
 <expected_ungapped_length>2833537881</expected_ungapped_length>
 <minimum_ungapped_length>1416768000</minimum_ungapped_length>
 <maximum_ungapped_length>4250307000</maximum_ungapped_length>
 <length_status>within_range</length_status>
</genome_size_response>

The two fields under input are the data that was entered.

  • species_taxid is unchanged.
  • length is the parsed value. This can be helpful if the input length had a suffix that is not supported.

The other fields are:

  • species_taxid - the taxonomy ID that was used after mapping the input taxonomy ID to species level
  • size_source - indicates the source used to obtain the genome size range, for example species or insdc-seq-min
  • genome_count - the number of genome assemblies used to calculate the expected size range
  • expected_ungapped_length - the median genome assembly size, only present when size_source is species
  • minimum_ungapped_length - the lower bound of the expected genome size range
  • maximum_ungapped_length - the upper bound of the expected genome size range, omitted when size_source is insdc-seq-min
  • length_status - the evaluation of the input length; one of:
    • within_range
    • too_small
    • too_large
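
A response in this format can be read with Python's standard library. The sketch below is one possible approach, with a hypothetical function name: it collects the top-level fields into a dictionary, skips the echoed input block, and converts the numeric fields to integers when they are present.

```python
import xml.etree.ElementTree as ET

_INTEGER_FIELDS = (
    "genome_count",
    "expected_ungapped_length",
    "minimum_ungapped_length",
    "maximum_ungapped_length",
)

def parse_genome_size_response(xml_text):
    """Return the result fields of a genome_size_response as a dictionary."""
    root = ET.fromstring(xml_text)
    # Keep only the result fields; the <input> element echoes the request.
    result = {child.tag: child.text for child in root if child.tag != "input"}
    for key in _INTEGER_FIELDS:
        # Some fields are optional, e.g. maximum_ungapped_length.
        if result.get(key) is not None:
            result[key] = int(result[key])
    return result
```

Because optional fields may be absent, callers should use result.get(...) rather than indexing when reading expected_ungapped_length or maximum_ungapped_length.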

Errors

There are two errors with HTTP status 400 "Bad Request" that return bare strings rather than XML.

  • "Given taxid XXXXXX is not a known taxid."
  • "Given taxid XXXXX is above species."
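
Because these errors arrive as plain text rather than XML, a client should check the HTTP status before parsing. A minimal sketch using the Python standard library follows. The function name and the opener parameter (which allows the network call to be replaced in tests) are hypothetical, and the encoding of the error body is an assumption.

```python
import urllib.error
import urllib.request

def fetch_genome_size_check(url, opener=urllib.request.urlopen):
    """Return (ok, text): the XML text on success, the bare error string on HTTP 400."""
    try:
        with opener(url) as response:
            # The sample response declares ISO-8859-1 as its encoding.
            return True, response.read().decode("ISO-8859-1")
    except urllib.error.HTTPError as err:
        if err.code == 400:
            # Assumption: the bare error string is plain ASCII or UTF-8 text.
            return False, err.read().decode("utf-8", errors="replace").strip()
        raise  # Other HTTP errors are not described here, so pass them on.
```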

Last updated: 2019-08-23T22:34:29Z