Commonly Used Genome Terms
Accession number An accession number is a unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ). The initial deposition of a sequence record is referred to as version 1. If the sequence is updated, the version number is incremented but the accession number will remain constant.
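The relationship between accession and version can be illustrated in code. This is a minimal sketch that assumes the common "ACCESSION.VERSION" notation (for example, AB123456.2); the function names and the example identifiers are hypothetical and are not part of any repository's software.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier carries no '.VERSION' suffix.
    """
    accession, sep, version = identifier.partition(".")
    if not sep:
        return accession, None
    return accession, int(version)


def is_update_of(newer, older):
    """True when both identifiers share an accession and 'newer' has a higher version."""
    acc_new, ver_new = split_accession(newer)
    acc_old, ver_old = split_accession(older)
    if ver_new is None or ver_old is None:
        return False
    return acc_new == acc_old and ver_new > ver_old
```

An updated record keeps its accession and differs only in the version, so is_update_of("AB123456.2", "AB123456.1") is true, while two different accessions never count as updates of each other.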
AGP file A file that describes how primary sequences can be assembled to make a non-redundant, contiguous sequence. The sequence being assembled may be a contig or a chromosome. For each component sequence, the file records the portion of the component that is used in the assembled sequence and the location of that portion on the assembled sequence. For more information about the file specification, see the format definition page.
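To make the idea concrete, the sketch below reads component lines from an AGP file. It assumes the tab-delimited, nine-column layout of the published AGP specification (object, object start, object end, part number, component type, component ID, component start, component end, orientation) and skips gap lines (types N and U). The field names, the function name and the example records (chrX, CLONE_A.1, CLONE_B.1) are illustrative assumptions, not real data.

```python
from collections import namedtuple

# Field names are illustrative labels for the nine AGP columns.
Component = namedtuple(
    "Component",
    "obj obj_beg obj_end part comp_type comp_id comp_beg comp_end orientation",
)


def parse_agp_components(lines):
    """Return the non-gap component lines of an AGP file as Component tuples."""
    components = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        if f[4] in ("N", "U"):
            continue  # gap lines describe gaps, not component sequence
        components.append(
            Component(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                      f[5], int(f[6]), int(f[7]), f[8])
        )
    return components


# Hypothetical example: two clones placed on an object, separated by a gap.
example = [
    "chrX\t1\t1000\t1\tF\tCLONE_A.1\t1\t1000\t+",
    "chrX\t1001\t1100\t2\tN\t100\tscaffold\tyes\tpaired-ends",
    "chrX\t1101\t1600\t3\tF\tCLONE_B.1\t201\t700\t-",
]
components = parse_agp_components(example)
```

In each component line, the span on the assembled object and the span on the component have the same length, which is a convenient sanity check when reading such files.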
Allelic series A collection of distinct mutations that affect a single locus. Often, these different mutations will produce different phenotypes, thus providing a powerful genetic tool for the dissection of gene function.
Read more about alleles and complementation
references:
Vivian JL et al. An allelic series of mutations in Smad2 and Smad4 identified in a genotype-based screen of N-ethyl-N- nitrosourea-mutagenized mouse embryonic stem cells. Proc Natl Acad Sci U S A 2002; 99(24):15542-7.
Steingrimsson E et al. Interallelic complementation at the mouse mitf locus. Genetics 2003; 163(1):267-76.
Annotation Adding biological information to genome sequence. This is a very complex task, and the process for doing this is rapidly evolving. Several groups are doing automated computational annotation of several genomes. Features that are added to the genome often include gene models, SNPs, and STSs.
Annotation at NCBI
Annotation at Ensembl
Annotation at UCSC
references:
Reese MG et al. Genome annotation assessment in Drosophila melanogaster. Genome Res. 2000; 10(4):483-501.
Hubbard T et al. The Ensembl genome database project. Nucleic Acids Res. 2002; 30(1):38-41.
BAC Bacterial Artificial Chromosome.
PAC P1 Artificial Chromosome.
Commonly used cloning vectors for the Human Genome Project. These vectors can hold large inserts, typically 80-200 kb, and propagate in E. coli as a single-copy episome.
Read more about using BACs
Track specific clones at the NCBI Clone Registry
BacPac Resources: for information on the construction and maintenance of several BAC and PAC libraries.
references:
Osoegawa et al. Bacterial artificial chromosome libraries for mouse sequencing and functional analysis. Genome Res. 2000; 10(1): 116-28.
Osoegawa et al. A bacterial artificial chromosome library for sequencing the complete human genome. Genome Res. 2001; 11(3): 483-96.
BES BAC end sequence. The ends of BACs are sequenced and the clone association information is retained. In this way, BAC clones that do not have insert sequence can be integrated with other BAC clones, or with WGS assemblies.
Human BAC end sequencing info
Mouse BAC end sequencing info
Rat BAC end sequencing info
references:
Mahairas GG et al. Sequence-tagged connectors: a sequence approach to mapping and scanning the human genome. Proc Natl Acad Sci 1999; 96(17): 9739-44.
Zhao S et al. Human BAC ends quality assessment and sequence analyses. Genomics 2000; 63(3): 321-22.
Zhao S et al. Mouse BAC ends quality assessment and sequence analysis. Genome Res 2001; 11(10): 1736-45.
BLAST Basic Local Alignment Search Tool. A method for performing sequence comparisons. Either protein sequences or nucleotide sequences can be used. This algorithm has been extended and now includes a suite of programs including megaBLAST and discontiguous megaBLAST.
choose a BLAST program
Learn more about similarity searching
references:
Altschul et al. Basic local alignment search tool. J Mol Biol 1990; 215:403-10.
Zhang et al. A greedy algorithm for aligning DNA sequences. J. Comput Biol. 2000; 7(1-2):203-14.
BLAT

A hashing algorithm developed by Jim Kent to allow rapid searching of large amounts of genome sequence. A hashing algorithm divides the database into words of a prescribed size (often 12-14 bases). The locations of these words are stored in memory. The query sequence is scanned for exact matches to words stored in memory. These types of algorithms tend to be very fast and effective for closely related sequences, but fail as sequences diverge.
In addition to nucleotide BLAT, translated BLAT allows for comparison of protein sequences.
This sequence aligner also allows for accurate alignment of transcribed sequences by looking at splice site information.
reference:
Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res 2002; 12(4):656-64.
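The word-lookup idea described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not BLAT itself: the database is cut into non-overlapping words, the query is scanned at every position, and only exact word matches are reported. The function names and the tiny sequences are hypothetical.

```python
from collections import defaultdict


def build_word_index(database, word_size):
    """Store the start positions of each non-overlapping word of the database."""
    index = defaultdict(list)
    for start in range(0, len(database) - word_size + 1, word_size):
        index[database[start:start + word_size]].append(start)
    return index


def find_word_hits(query, index, word_size):
    """Scan the query at every offset; return (query_pos, database_pos) exact hits."""
    hits = []
    for q in range(len(query) - word_size + 1):
        for d in index.get(query[q:q + word_size], ()):
            hits.append((q, d))
    return hits


index = build_word_index("ACGTACGTTTGACCA", 4)
hits = find_word_hits("GGTTGAC", index, 4)
```

Because only exact word matches seed a hit, a single mismatch inside every word of a region leaves that region undetected, which is why such methods lose sensitivity as sequences diverge.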

 
CDS

Coding sequence. This is the portion of an mRNA or genomic sequence that encodes a protein sequence.

Chromosomal rearrangement

These are events that are mediated by double-strand breaks in the genome and their subsequent repair. When these breaks are repaired, landmarks in the genome have often changed location or been removed completely. There are many different types of rearrangements:

  • deletion: the removal of a DNA sequence.
  • insertion: the addition of a DNA sequence.
  • translocation: the fusion of one part of a chromosome to another.
  • inversion: an intra-chromosomal event in which two breaks occur on the chromosome, the piece in the middle is flipped, and the ends are then repaired.

These events may have no phenotypic consequences, depending upon the amount of DNA involved and the location of the breakpoints. However, there are many well-characterized human syndromes that are associated with these events.
Read more about chromosomal rearrangements.
Learn more about rearrangements associated with cancer.
reference:
Inoue K, Lupski JR. Molecular mechanisms for genomic disorders. Annu Rev Genomics Hum Genet 2002; 3:199-242.

Contig

This is short for contiguous sequence. When two sequences overlap at their ends (known as a "dove-tail" overlap), the sequences can be collapsed into a single, non-redundant sequence.
Read more about contigs
references:
Jang W et al. Making effective use of human genomic sequence data. Trends Genet 1999; 15(7):284-6.
Kent WJ, Haussler D. Assembly of the working draft of the human genome with GigAssembler. Genome Res 2001; 11(9):1541-8.
Agarwala R et al (Manuscript in preparation).
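The collapse of two dove-tailed sequences into one can be sketched as follows. This is a simplified illustration that assumes the overlap is an exact match between the end of one sequence and the start of the other; real assemblers tolerate sequencing errors and consider both strands. The function name and the min_overlap parameter are hypothetical.

```python
def merge_dovetail(left, right, min_overlap=3):
    """Merge two sequences when a suffix of 'left' equals a prefix of 'right'.

    The longest exact overlap of at least min_overlap bases is used.
    Returns the merged, non-redundant sequence, or None if no overlap exists.
    """
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


merged = merge_dovetail("ACGTTGCA", "TGCAGGA")
```

Here the shared bases TGCA appear only once in the merged result, which is what makes the contig non-redundant.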

Cosmid

Cloning vector that typically contains insert sizes of approximately 35-45 kb. These vectors are hybrids of lambda phages and plasmids and can be propagated as plasmids or packaged like phage. The name comes from the fact that these vectors retain the phage cos sites that are used for packaging DNA into lambda phage heads. These are generally maintained in multiple copies in E. coli.
Read more about cosmids
references:
Evans GA et al. High efficiency vectors for cosmid microcloning and genomic analysis. Gene 1989; 79(1):9-20.
Coulson A et al. The physical map of the Caenorhabditis elegans genome. Methods Cell Biol 1995; 48:533-50.

Draft sequence

This term has had several definitions, but generally refers to sequence that is not yet finished but is of generally high quality. For clone-based projects, draft sequence refers to a project in which greater than 90% of the bases are of high quality. This means that a clone project will have several fragments connected by Ns. Often, the order and orientation of these fragments are unknown. However, these sequences, in conjunction with other data, are a useful substrate for genome assembly and annotation.
See HTGS for additional information.
references:
Collins FS et al. New goals for the U.S. Human Genome Project: 1998-2003. Science 1998; 282(5389):682-9.
The Human Genome Sequence Consortium. Initial sequencing and analysis of the human genome. Nature 2001; 409(6822):860-921.

e-PCR Electronic PCR. A program that searches a given sequence for the presence of primer pairs. These primers must be in the proper orientation and a specified distance apart to define a match.
reference:
Schuler GD. Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol 1998; 16(11):456-9.
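The matching rule described for e-PCR (both primers present, in the proper orientation, and a specified distance apart) can be sketched as below. This is a simplified illustration and not the e-PCR program: it assumes exact primer matches and searches a single strand only. The function names, the primers and the example sequence are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_amplicons(sequence, forward, reverse, min_size, max_size):
    """Find (start, end) spans where 'forward' is followed by the reverse
    complement of 'reverse' and the span length lies in [min_size, max_size].

    Coordinates are 0-based and half-open.
    """
    reverse_site = reverse_complement(reverse)
    hits = []
    start = sequence.find(forward)
    while start != -1:
        pos = sequence.find(reverse_site, start + len(forward))
        while pos != -1:
            end = pos + len(reverse_site)
            if min_size <= end - start <= max_size:
                hits.append((start, end))
            pos = sequence.find(reverse_site, pos + 1)
        start = sequence.find(forward, start + 1)
    return hits


amplicons = find_amplicons("TTACGTAAAAAAGGCATT", "ACGT", "TGCC", 10, 20)
```

The reverse primer is searched for as its reverse complement because it anneals to the opposite strand; the size window is what rejects primer pairs that are present but too far apart.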
EST

Expressed sequence tag. These are single-pass sequences of cDNA clones. Databases of EST sequences are highly redundant but quite useful for gene identification. There are many efforts to cluster EST sequences to remove the redundancy and low-quality sequences.
EST clusters in UniGene
Gene indexes at TIGR
EST clusters at Allgenes
references:
Adams MD et al. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 1991; 252(5013):1651-6.
Marra M et al. An encyclopedia of mouse genes. Nat Genet 1999; 21(2):191-4.

ExoFish

A technique that utilizes Whole Genome Shotgun (WGS) reads from the pufferfish, Tetraodon nigroviridis, to identify potential coding sequences in mammalian genomes based on homology. This technique was first used to annotate the human genome.
Read more about exoFISH
reference:
Roest Crollius et al. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat Genet 2000; 25(2):235-8.

Fingerprint The pattern of bands produced by a clone when digested with a particular restriction enzyme, such as HindIII. Clones that are related will have fingerprint bands in common. The more bands in common, the greater the degree of overlap.
A BAC fingerprint map of the Mouse Genome
Human BAC map information
references:
Marra M et al. High throughput fingerprint analysis of large-insert clones. Genome Res 1997; 7(11):1072-84.
Marra M et al. A map for sequence analysis of the Arabidopsis thaliana genome. Nat Genet 1999; 22(3):265-70.
McPherson JD et al. A physical map of the human genome. Nature 2001; 409(6822):934-41.
Soderlund C et al. Contigs built with fingerprints, markers and FPC v4.7. Genome Res 2000; 10:1772-1787.
Soderlund C et al. FPC: a system for building contigs from restriction fingerprinted clones. CABIOS 1997; 13: 523-535.
Finished Sequence

A clone insert that has been sequenced with an error rate of <0.01%. These sequence records generally have no gaps.
references:
Collins FS et al. New goals for the US Human Genome Project: 1998-2003. Science 1998; 282(5389):682-9.
The Human Genome Sequence Consortium. Initial sequencing and analysis of the human genome. Nature 2001; 409(6822):860-921.

FISH

Fluorescent in situ hybridization. Genomic clones are fluorescently labeled and hybridized to chromosome spreads. In this way a clone can be mapped to a discrete cytogenetic band. If the clone has sequence associated with it, this information can be used to integrate sequence with cytogenetic information.
Read more about FISH
FISH methodology
reference:
Dyer SA and Green EK. Fluorescent in situ hybridization. Methods Mol Biol 2002; 187:73-86.

Fosmid A cloning system based on the E. coli F factor. These clones have an average insert size of 40 Kb, with a very small standard deviation.
reference:
Birren BW et al. A human chromosome 22 fosmid resource: mapping and analysis of 96 clones. Genomics 1996; 34(1):97-106.
Gene targeting

This is a specific type of transgenesis that targets a particular gene. If a mutated copy of a gene is electroporated into a cell, the inserted DNA will find the endogenous copy of itself and recombination will occur with some frequency (1-25%). If this event occurs in embryonic stem cells, cells carrying the new copy of the gene can be used to generate embryos that can be assessed for the phenotypic consequences of the mutation. This technique is used frequently in mice to study loss-of-function mutations.
Read more about gene targeting
references:
Thomas et al. High frequency targeting of genes to specific sites in the mammalian genome. Cell 1986; 44(3):419-28.
van der Weyden L, et al. Tools for targeted manipulation of the mouse genome. Physiol Genomics 2002; 11(3):133-64.

Gene trapping

This strategy uses transgenesis to introduce DNA carrying a reporter gene (lacZ or GFP) flanked by various genomic signals (splice donor or acceptor sites, promoters, etc.). Expression of the reporter gene indicates that the DNA has integrated into a region of the genome containing a gene. The gene that has been trapped can be recovered using the DNA sequences associated with the reporter construct. Often, the introduction of the gene trapping vector inactivates the gene into which it was introduced.
Go to the Gene Trap web page
references:
Gossler et al. Mouse embryonic stem cells and reporter constructs to detect developmentally regulated genes. Science 1989; 244(4903):463-5.
Stanford et al. Gene-trap mutagenesis: past, present and beyond. Nat Rev Genet 2001; 2(10):756-68.

HTGS High Throughput Genome Sequence. This is a term to distinguish all genomic sequence generated in a high-throughput manner. In order to release data more rapidly, it became standard for all sequence centers to submit unfinished sequence into public repositories (the "Bermuda Rules"). This sequence is deposited into the HTG division of GenBank/EMBL/DDBJ. In general, these terms are used to describe clone (BAC/PAC/fosmid) based projects.
keywords associated with HTGS:
HTGS_phase0: A project that has very light coverage, generally 1-2 fold coverage of the clone. This initial light coverage is produced to ensure that the clone is not redundant to other sequence.
HTGS_phase1: An unfinished project, usually representing 3-6 fold coverage of the clone. The fragments within the clone are not ordered or oriented with respect to each other.
HTGS_phase2: An unfinished project, usually representing 5-10 fold coverage of the clone. The fragments within the clone are ordered and oriented.
HTGS_phase3: A finished project. A single fragment of very high quality.
HTGS_draft: A draft project is either a phase 1 or phase 2 project that has exceeded a specified quality standard. Generally, this translates to 3-4 fold sequence coverage of the BAC clone in high-quality bases.
HTGS_fulltop: Added to a record when the center responsible for finishing the clone has added sufficient new shotgun coverage for their finishing process to begin.
HTGS_activefin: Added when the center responsible for finishing actually begins the process of finishing the sequence.
HTGS_cancelled: Added to clones that will never be finished.
reference:
Marshall E. Bermuda rules: community spirit, with teeth. Science 2001; 291(5507):1192.
Linkage mapping This type of mapping measures meiotic recombination using polymorphic markers to produce the relative order of markers with respect to each other. Distance between markers is measured in centiMorgans (cM). A centiMorgan is equivalent to a 1% cross-over rate.
Read more about genetic mapping
Read about genetic mapping in mice
references:
Donis-Keller H, et al. A genetic linkage map of the human genome. Cell 1987; 51(2):319-37.
Sturtevant AH. The linear arrangement of six sex-linked factors in Drosophila, as shown by their mode of association. J Exp Zool 1913; 14:43-59. (see a reprint (pdf) of this paper)
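The centiMorgan definition above translates directly into a small calculation. The sketch below assumes the simple case in which distance is estimated as the percentage of recombinant offspring between two markers, with no correction for double cross-overs, so it is only a reasonable approximation for closely linked markers. The function name and the example counts are hypothetical.

```python
def map_distance_cm(recombinants, total_offspring):
    """Estimate map distance in centiMorgans as the percentage of recombinants.

    No mapping function is applied, so double cross-overs are not corrected for.
    """
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinants / total_offspring


# Hypothetical cross: 12 recombinant offspring out of 400 scored.
distance = map_distance_cm(12, 400)
```

With 12 recombinants among 400 offspring the cross-over rate is 3%, so the two markers are estimated to lie 3 cM apart.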
LocusID Unique identifier, assigned by LocusLink, given to all of the transcripts, proteins, and models associated with a given locus.
reference:
Pruitt KD, Maglott DR. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001; 29(1):137-140.
Mate pair The sequence obtained from opposite ends of a particular clone are referred to as mate pairs. Knowing that two sequences are derived from the same clone allows these sequences to be linked, even if the full insert of the clone is unavailable. This is key to WGS assemblies.
references:
Weber JL, Myers EW. Human whole-genome shotgun sequencing. Genome Res 1997; 7(5):401-409.
Venter JC et al. The sequence of the human genome. Science 2001; 291(5507):1304-51.
Batzoglou S et al. ARACHNE: a whole-genome shotgun assembler. Genome Res 2002; 12(1):177-89.
Mullikin JC, Ning Z. The phusion assembler. Genome Res 2003; 13(1):81-90.
N50 The contig or scaffold length such that half of the bases in a given assembly reside in contigs or scaffolds of at least that length. This provides a measure of continuity. For instance, a scaffold N50 of 15 Mb means that at least half of the bases in the assembly are in scaffolds that are at least 15 Mb long.
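The N50 definition can be computed as follows. This is a minimal sketch that takes a list of contig or scaffold lengths, sorts them from longest to shortest, and returns the length at which the running total first reaches half of all bases; the function name and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the running total to half.
        if 2 * running >= total:
            return length
    raise ValueError("lengths must contain at least one sequence")


example_n50 = n50([2, 3, 4, 5, 6])
```

For the lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest sequences together hold 11 bases, which passes the halfway mark of 10, so the N50 is 5.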
Mutation

A sequence variation that deviates from the reference, or "wild type", sequence. This variation can be a SNP, an insertion of sequence, or a deletion of sequence. There can be a great deal of sequence variation between individuals in a population. For example, different humans may have as many as 1 basepair difference every 1000 bp. In practice, mutations are distinguished from variation because they have phenotypic consequences. Mutations in the Pax6 gene that lead to a loss of the function of that gene lead to the eyeless mutation in flies, the Small eye mutation in mice, and aniridia in humans.
Read more about mutations and mutant analysis
reference:
Gehring WJ. The master control gene for morphogenesis and evolution of the eye. Genes Cells 1996; 1(1):11-5.

Phenotype

An observable characteristic displayed by an organism. These characteristics can be controlled by genes, by the environment, or a combination of both. The characteristic can be directly observable, such as having brown eyes. In some cases, the phenotype will be measurable, such as having high blood pressure.
OMIM is an on-line catalog of human phenotypes

Positional cloning

Identification of a gene based on its physical location in the genome. Often, an individual has a phenotype, but the gene underlying this phenotype is unknown. Using linkage mapping, the phenotype can be assigned a position in the genome. Once a phenotype has been localized, overlapping sets of clones (for example, BACs) that cover the region are identified. Genes within the region are identified and compared to DNA from individuals that display the phenotype until the underlying mutation is identified.
Read more about positional cloning
references:
Kerem B, et al. Identification of the cystic fibrosis gene: genetic analysis. Science 1989; 245(4922):1073-80.

RefSeq

Reference Sequence. The goal of the RefSeq project is to produce a reference sequence for all naturally occurring molecules from the central dogma (DNA, RNA, Protein).
reference:
Pruitt KD, Maglott DR. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001; 29(1):137-140.

RFLP

Restriction fragment length polymorphism. A type of polymorphism detectable in a genome by the size differences in DNA fragments generated by restriction enzyme analysis.
Read more about RFLPs
references:
Botstein D, et al. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 1980; 32(3):314-31.
Donis-Keller H, et al. A genetic linkage map of the human genome. Cell 1987; 51(2):319-37.

RH mapping

Radiation Hybrid mapping. A physical mapping method that estimates linkage and distance relative to radiation-induced chromosome breaks. This is analogous to genetic mapping.
More on RH mapping
references:
Goss SJ, Harris H. New method for mapping genes in human chromosomes. Nature 1975; 255(5511):680-4.
Cox et al. Radiation hybrid mapping: a somatic cell genetic method for constructing high-resolution maps of mammalian chromosomes. Science 1990; 250(4978):245-50.
Hudson TJ, et al. A radiation hybrid map of mouse genes. Nat Genet 2001; 29(2):201-5.

Scaffold (see supercontig)
SNP Single Nucleotide Polymorphism. A single base difference found when comparing the same DNA sequence from two different individuals.
More on SNPs
reference:
Weiss KM. In search of human variation. Genome Res 1998; 8(7): 691-7.
SSAHA A hashing algorithm developed for rapid searching of large amounts of genome sequence. This program is similar to BLAT but does not use splice information to align mRNA sequences, nor can it perform translated searches.
reference:
Ning et al. SSAHA: a fast search method for large DNA databases. Genome Res 2001; 11(10):1725-9.
SSLP

Simple sequence length polymorphisms. Common examples of these in mammalian genomes include runs of dinucleotide or trinucleotide repeats (CACACACACACACACACA).
Read more about SSLPs
reference:
Weissenbach J, et al. A second-generation linkage map of the human genome. Nature 1992; 359(6398):777-8.

Stem Cells

Most cells of the adult body are terminally differentiated, that is, they are no longer able to replace themselves or to become another cell type. Stem cells are undifferentiated cells that are able to both proliferate and differentiate into numerous cell types. For example, embryonic stem cells are able to differentiate into any cell type found within the embryo, whereas hematopoietic stem cells can differentiate into any blood cell.
Read more about stem cells
reference:
Rossant J. Stem cells from the mammalian blastocyst. Stem Cells 2001; 19(6):477-82.

STS

Sequence Tagged Site. In general, short sequences (200-500 bp) are produced throughout a genome. Oligonucleotide primers are generated such that this sequence can be amplified using PCR to produce a discrete band when analyzed by electrophoresis. STS markers can be polymorphic or monomorphic. They are critical to integrating non-sequence-based maps (such as genetic or RH maps) with sequence-based maps.
Read more about STSs
references:
Green ED, Green P. Sequence-tagged site (STS) content mapping of human chromosomes: theoretical considerations and early experiences. PCR Methods Appl 1991; 1(2):77-90.
Hudson TJ et al. An STS-based map of the human genome. Science 1995; 270(5244):1919-20.

Supercontig (Scaffold)

A supercontig is formed when an association can be made between two contigs that have no sequence overlap. This commonly occurs using information obtained from paired plasmid ends. For example, both ends of a BAC clone are sequenced. It can be inferred that these two sequences are approximately 150-200 Kb apart (based on the average size of a BAC). If the sequence from one end is found in a particular sequence contig, and the sequence from the other end is found in a different sequence contig, the two sequence contigs are said to be linked. In general, it is useful to have end sequences from more than one clone to provide evidence for linkage.
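The linking step described above can be sketched as follows. This is a simplified illustration, assuming each end sequence has already been placed in at most one contig; it counts how many clones connect each pair of contigs and keeps only pairs supported by a minimum number of clones. The function name, the min_support parameter and the read and contig names are hypothetical.

```python
from collections import Counter


def link_contigs(read_to_contig, mate_pairs, min_support=2):
    """Count clones whose two end sequences fall in different contigs.

    read_to_contig maps an end-sequence name to the contig that contains it.
    mate_pairs lists the (end, end) names that come from the same clone.
    Returns {(contig, contig): supporting_clone_count} for linked pairs.
    """
    support = Counter()
    for end_a, end_b in mate_pairs:
        contig_a = read_to_contig.get(end_a)
        contig_b = read_to_contig.get(end_b)
        if contig_a is None or contig_b is None or contig_a == contig_b:
            continue  # unplaced end, or both ends in the same contig
        support[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


# Hypothetical placements of BAC end sequences in three contigs.
placements = {
    "bac1.f": "contigA", "bac1.r": "contigB",
    "bac2.f": "contigA", "bac2.r": "contigB",
    "bac3.f": "contigA", "bac3.r": "contigC",
}
pairs = [("bac1.f", "bac1.r"), ("bac2.f", "bac2.r"), ("bac3.f", "bac3.r")]
links = link_contigs(placements, pairs)
```

Requiring support from more than one clone reflects the point made above: a single pair of end sequences is weaker evidence for linkage than several independent clones that agree.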

TPF Tiling Path File. This is a simple file that lists the order of clones along a chromosome. These files are often used in genome assemblies to convey mapping information to the assembly program.
Transgenesis The introduction of exogenous DNA into a cell. Typically, this term refers to the introduction of a gene into an embryo or other eukaryotic cell. In general, this DNA will insert into the genome at random, although specific loci can be targeted. The size of the DNA molecule introduced can range from small (a few basepairs) to quite large (over 100 Kb).
Read more about transgenesis
reference:
Selbert S, Rannie D. Analysis of transgenic mice. Methods Mol Biol 2002;180:305-41.
WGS

Whole Genome Shotgun. A sequencing method by which an entire genome is cut into chunks of discrete sizes (usually 2, 10, 50 and 150 Kb) and cloned into an appropriate vector. The ends of these clones are sequenced. The two ends from the same clone are referred to as mate pairs. The distance between the two reads of a mate pair can be inferred if the library insert size is known, and it should fall within a narrow window of deviation.
references:
Weber JL, Myers EW. Human whole-genome shotgun sequencing. Genome Res 1997; 7(5): 401-409.
Venter JC et al. The sequence of the human genome. Science 2001; 291(5507):1304-51.
Batzoglou S et al. ARACHNE: a whole-genome shotgun assembler. Genome Res 2002; 12(1):177-89.
Mullikin JC, Ning Z. The phusion assembler. Genome Res 2003; 13(1):81-90.

YAC Yeast artificial chromosome. These cloning vectors were developed using yeast centromere and telomere sequences. The average insert size of these clones ranges from 100-1000 Kb. These clones can span large portions of the genome but can be highly unstable.
Read more about YACs
reference:
Burke DT et al. Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 1987; 236(4803):806-12.