|
| Accession number | An accession number is a unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ). The initial deposition of a sequence record is referred to as version 1. If the sequence is updated, the version number is incremented but the accession number will remain constant. | |
| AGP file | A file that describes how primary sequences can be assembled to make a non-redundant, contiguous sequence. The sequence being assembled may be a contig or a chromosome. This file describes the portion of each component sequence used in the contig, as well as the location of that component sequence on the contig. For more information about the file specification, see the format definition page. | |
| Allelic series | A collection of distinct mutations
that affect a single locus. Often, these different mutations will produce
different phenotypes, thus providing a powerful genetic tool for the dissection
of gene function. Read more about alleles and complementation. references: Vivian JL et al. An allelic series of mutations in Smad2 and Smad4 identified in a genotype-based screen of N-ethyl-N-nitrosourea-mutagenized mouse embryonic stem cells. Proc Natl Acad Sci U S A 2002; 99(24):15542-7. Steingrimsson E et al. Interallelic complementation at the mouse mitf locus. Genetics 2003; 163(1):267-76. |
|
| Annotation | Adding biological information to
genome sequence. This is a very complex task, and the process for doing
this is rapidly evolving. Several groups are doing automated computational
annotation of several genomes. Features that are added to the genome often
include gene models, SNPs, and STSs. Read more: Annotation at NCBI, Annotation at Ensembl, Annotation at UCSC. references: Reese MG et al. Genome annotation assessment in Drosophila melanogaster. Genome Res. 2000; 10(4):483-501. Hubbard T et al. The Ensembl genome database project. Nucleic Acids Res. 2002; 30(1):38-41. |
|
| BAC PAC |
Bacterial Artificial Chromosome and P1 Artificial Chromosome. Commonly used cloning vectors for the Human Genome Project. These vectors can hold large inserts, typically 80-200 kb, and propagate in E. coli as a single-copy episome. Read more about using BACs. Track specific clones at the NCBI Clone Registry. BacPac Resources: information on the construction and maintenance of several BAC and PAC libraries. references: Osoegawa et al. Bacterial artificial chromosome libraries for mouse sequencing and functional analysis. Genome Res. 2000; 10(1): 116-28. Osoegawa et al. A bacterial artificial chromosome library for sequencing the complete human genome. Genome Res. 2001; 11(3): 483-96. |
|
| BES | BAC end sequence. The ends of BACs
are sequenced and the clone association information is retained. In this
way, BAC clones that do not have insert sequence can be integrated with
other BAC clones, or with WGS assemblies. Human BAC end sequencing info. Mouse BAC end sequencing info. Rat BAC end sequencing info. references: Mahairas GG et al. Sequence-tagged connectors: a sequence approach to mapping and scanning the human genome. Proc Natl Acad Sci 1999; 96(17): 9739-44. Zhao S et al. Human BAC ends quality assessment and sequence analyses. Genomics 2000; 63(3): 321-22. Zhao S et al. Mouse BAC ends quality assessment and sequence analysis. Genome Res 2001; 11(10): 1736-45. |
|
| BLAST | Basic Local Alignment Search Tool. A method for
performing sequence comparisons. Either protein sequences or nucleotide
sequences can be used. This algorithm has been extended and now includes
a suite of programs, including megaBLAST and discontiguous
megaBLAST. Choose a BLAST program. Learn more about similarity searching. references: Altschul et al. Basic local alignment search tool. J Mol Biol 1990; 215:403-10. Zhang et al. A greedy algorithm for aligning DNA sequences. J Comput Biol 2000; 7(1-2):203-14. |
|
| BLAT |
A hashing algorithm developed by Jim
Kent to allow rapid searching of large amounts of genome sequence.
A hashing algorithm divides the database into words of a prescribed size
(often 12-14 bases). The locations of these words are stored in memory.
The query sequence is scanned for exact matches to words stored in memory.
These types of algorithms tend to be very fast and effective for closely
related sequences, but fail as sequences diverge. |
|
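The word-hashing idea in the BLAT entry can be sketched as follows. This is a minimal illustration, not BLAT itself: the function names are hypothetical, the database is indexed with non-overlapping words (an assumption about how the database is "divided"), and only exact word matches are reported, with no extension or scoring.

```python
from collections import defaultdict


def build_word_index(database, word_size=12):
    """Store the start position of every non-overlapping word in the database."""
    index = defaultdict(list)
    for start in range(0, len(database) - word_size + 1, word_size):
        index[database[start:start + word_size]].append(start)
    return index


def find_word_hits(query, index, word_size=12):
    """Scan every query offset and report (query_start, database_start) exact matches."""
    hits = []
    for q_start in range(len(query) - word_size + 1):
        word = query[q_start:q_start + word_size]
        for db_start in index.get(word, ()):
            hits.append((q_start, db_start))
    return hits


# Tiny example with a 4-base word so the behavior is easy to follow.
database = "ACGTACGTTTGACCAGTA"
index = build_word_index(database, word_size=4)
exact_hits = find_word_hits("GGTTGACC", index, word_size=4)
diverged_hits = find_word_hits("GGTTCACC", index, word_size=4)
```

A single mismatch in the second query removes its only hit, which illustrates why this kind of search works well for closely related sequences but loses sensitivity as sequences diverge.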
| CDS |
Coding sequence. This is the portion of an mRNA or genomic sequence that encodes a protein sequence. |
|
| Chromosomal rearrangement |
These are events that are mediated by double-strand breaks in the genome and their subsequent repair. When these breaks are repaired, landmarks in the genome have often changed location or been removed completely. There are many different types of rearrangements.
These events may have no phenotypic consequences, depending upon the
amount of DNA involved and the location of the breakpoints. However, there
are many well-characterized human syndromes that are associated with these
events. |
|
| Contig |
This is short for contiguous sequence. When two sequences overlap at
their ends (known as a "dove-tail" overlap), the sequences can
be collapsed into a single, non-redundant sequence. |
|
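Collapsing a dove-tail overlap can be sketched as below. This is a simplified illustration under stated assumptions: the overlap must be exact (real assemblers tolerate sequencing errors), both sequences are on the same strand, and the function name and `min_overlap` parameter are hypothetical.

```python
def merge_dovetail(left, right, min_overlap=3):
    """Merge two sequences if a suffix of `left` exactly equals a prefix of `right`.

    Returns the single non-redundant sequence, or None if no overlap of at
    least `min_overlap` bases exists. The longest overlap is preferred.
    """
    longest_possible = min(len(left), len(right))
    for length in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            # Keep the shared bases once, then append the rest of `right`.
            return left + right[length:]
    return None


merged = merge_dovetail("ACGTTGCA", "TGCAGGT")
no_overlap = merge_dovetail("AAAA", "CCCC")
```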
| Cosmid |
Cloning vector that typically contains insert sizes of 35-45 kb. These
vectors are hybrids of lambda phages and plasmids and can be propagated
as plasmids or packaged like phage. The name comes from the fact that
these vectors retain the phage cos sites that are used for lambda head
stuffing. These are generally maintained in multiple copies in E. coli. |
|
| Draft sequence |
This term has had several definitions, but generally refers to sequence
that is not yet finished but is of generally high quality. For clone-based
projects, draft sequence refers to a project in which greater
than 90% of the bases are of high quality. A draft clone project
typically has several fragments connected by Ns. Often, the order and orientation
of these fragments are unknown. However, these sequences, in conjunction
with other data, are a useful substrate for genome assembly and annotation.
|
|
| e-PCR | Electronic PCR. A program that searches a given
sequence for the presence of primer pairs. These primers must be in the
proper orientation and a specified distance apart to define a match. reference: Schuler GD. Electronic PCR: bridging the gap between genome mapping and genome sequencing. Trends Biotechnol 1998; 16(11):456-9. |
|
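The primer-pair search described in the e-PCR entry can be sketched as follows. This is not the e-PCR program; it is a minimal illustration that assumes exact primer matches on the forward strand only, with hypothetical function and parameter names.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts(sequence, forward_primer, reverse_primer, min_size, max_size):
    """Report (start, end, product_size) for each properly oriented primer pair.

    The forward primer must match the sequence as given, the reverse primer
    must match as its reverse complement downstream of the forward primer,
    and the implied product size must fall within [min_size, max_size].
    """
    reverse_site = reverse_complement(reverse_primer)
    hits = []
    fwd = sequence.find(forward_primer)
    while fwd != -1:
        rev = sequence.find(reverse_site, fwd + len(forward_primer))
        while rev != -1:
            end = rev + len(reverse_site)
            size = end - fwd
            if min_size <= size <= max_size:
                hits.append((fwd, end, size))
            rev = sequence.find(reverse_site, rev + 1)
        fwd = sequence.find(forward_primer, fwd + 1)
    return hits


sequence = "TTACGTACGGGGGGGGTTCAGACC"
in_range = find_sts(sequence, "ACGTAC", "TCTGAA", min_size=15, max_size=25)
too_far = find_sts(sequence, "ACGTAC", "TCTGAA", min_size=5, max_size=10)
```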
| EST |
Expressed sequence tag. These are single-pass sequences of cDNA clones.
Databases of EST sequences are highly redundant but quite useful for gene
identification. There are many efforts to cluster EST sequences to remove
the redundancy and low-quality sequences. |
|
| ExoFish |
A technique that utilizes Whole Genome Shotgun (WGS)
reads from the pufferfish, Tetraodon nigroviridis, to identify
potential coding sequences in mammalian genomes based on homology. This
technique was first used to annotate the Human Genome. |
|
| Fingerprint | The pattern of bands produced by
a clone when digested with a particular restriction enzyme, such as HindIII. Clones
that are related will have fingerprint bands in common. The more bands
in common, the greater the degree of overlap. A BAC fingerprint map of the Mouse Genome. Human BAC map information. references: Marra M et al. High throughput fingerprint analysis of large-insert clones. Genome Res 1997; 7(11):1072-84. Marra M et al. A map for sequence analysis of the Arabidopsis thaliana genome. Nat Genet 1999; 22(3):265-70. McPherson JD et al. A physical map of the human genome. Nature 2001; 409(6822):934-41. Soderlund C et al. Contigs built with fingerprints, markers and FPC v4.7. Genome Res 2000; 10:1772-1787. Soderlund C et al. FPC: a system for building contigs from restriction fingerprinted clones. CABIOS 1997; 13: 523-535. |
|
| Finished Sequence |
A clone insert that has been sequenced with an error rate of <0.01%. These
sequence records generally have no gaps. |
|
| FISH |
Fluorescence in situ hybridization. Genomic clones are fluorescently
labeled and hybridized to chromosome spreads. In this way a clone can
be mapped to a discrete cytogenetic band. If the clone has sequence associated
with it, this information can be used to integrate sequence with cytogenetic
information. |
|
| Fosmid | A cloning system based on the E.
coli F factor. These clones have an average insert size of 40 Kb, with
a very small standard deviation. reference: Birren BW et al. A human chromosome 22 fosmid resource: mapping and analysis of 96 clones. Genomics 1996; 34(1):97-106. |
|
| Gene targeting |
This is a specific type of transgenesis that targets a particular gene.
If a mutated copy of a gene is electroporated into a cell, the inserted
DNA will find the endogenous copy of itself and recombination will occur
with some frequency (1-25%). If this event occurs in embryonic stem cells,
cells carrying the new copy of the gene can be used to generate embryos
that can be assessed for the phenotypic consequences of the mutation.
This technique is used frequently in mice to study |
|
| Gene trapping |
This strategy uses transgenesis to introduce DNA carrying a reporter
gene (lacZ or GFP) flanked by various genomic signals (splice donor or
acceptor sites, promoters, etc.). Expression of the reporter gene indicates
that the DNA has integrated into a region of the genome containing a gene.
The gene that has been trapped can be recovered using the DNA sequences
associated with the reporter construct. Often, the introduction of the
gene trapping vector inactivates the gene into which it was introduced. |
|
| HTGS | High Throughput Genome Sequence.
This is a term to distinguish all genomic sequence generated in a high-throughput
manner. In order to release data more rapidly, it became standard for all
sequence centers to submit unfinished sequence into public repositories
(the "Bermuda Rules"). This sequence is deposited into the HTG
division of GenBank/EMBL/DDBJ. In general, these terms are used to describe
clone (BAC/PAC/fosmid) based projects. Keywords associated with HTGS:
HTGS_phase0: A project that has very light coverage, generally 1-2 fold coverage of the clone. This initial light coverage is produced to ensure that the clone is not redundant to other sequence.
HTGS_phase1: An unfinished project, usually representing 3-6 fold coverage of the clone. The fragments within the clone are not ordered or oriented with respect to each other.
HTGS_phase2: An unfinished project, usually representing 5-10 fold coverage of the clone. The fragments within the clone are ordered and oriented.
HTGS_phase3: A finished project. A single fragment of very high quality.
HTGS_draft: A draft project is either a phase 1 or phase 2 project that has exceeded a specified quality standard. Generally, this translates to 3-4 fold sequence coverage of the BAC clone in high-quality bases.
HTGS_fulltop: Added to a record when the center responsible for finishing the clone has added sufficient new shotgun coverage for their finishing process to begin.
HTGS_activefin: Added when the center responsible for finishing actually begins the process of finishing the sequence.
HTGS_cancelled: Added to clones that will never be finished.
reference: Marshall E. Bermuda rules: community spirit, with teeth. Science 2001; 291(5507):1192. |
|
| Linkage mapping | This type of mapping measures meiotic
recombination using polymorphic markers to produce the relative order of
markers with respect to each other. Distance between markers is measured
in centiMorgans (cM). A centiMorgan is equivalent to a 1% cross-over rate. Read more about genetic mapping. Read about genetic mapping in mice. references: Donis-Keller H, et al. A genetic linkage map of the human genome. Cell 1987; 51(2):319-37. Sturtevant AH. The linear arrangement of six sex-linked factors in Drosophila, as shown by their mode of association. J Exp Zool 1913; 14:43-59. |
|
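The relationship between recombination and map distance in the linkage mapping entry can be sketched as a small calculation. This is a minimal illustration with hypothetical function names and made-up offspring counts; it applies the 1% = 1 cM rule directly, which is an assumption that only holds for closely spaced markers where double cross-overs are rare.

```python
def map_distance_cm(recombinant_offspring, total_offspring):
    """Convert an observed recombination frequency to centiMorgans (1% = 1 cM)."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinant_offspring / total_offspring


def order_three_markers(distances):
    """Infer the relative order of three markers from pairwise distances.

    `distances` maps frozenset({marker_a, marker_b}) to a distance in cM.
    The pair that is farthest apart forms the two ends; the remaining
    marker lies between them.
    """
    ends = max(distances, key=distances.get)
    all_markers = set().union(*distances)
    (middle,) = all_markers - ends
    first, last = sorted(ends)
    return (first, middle, last)


# Illustrative counts only: 12 recombinants among 400 offspring.
distance_ab = map_distance_cm(12, 400)
order = order_three_markers({
    frozenset("AB"): distance_ab,
    frozenset("BC"): 5.0,
    frozenset("AC"): 7.5,
})
```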
| LocusID | Unique identifier, assigned by LocusLink,
given to all of the transcripts, proteins, and models associated with a
given locus. reference: Pruitt KD, Maglott DR. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res 2001; 29(1):137-140. |
|
| Mate pair | The sequences obtained from opposite ends of a
particular clone are referred to as mate pairs. Knowing that two sequences
are derived from the same clone allows these sequences to be linked, even
if the full insert of the clone is unavailable. This is key to WGS
assemblies. references: Weber JL, Myers EW. Human whole-genome shotgun sequencing. Genome Res 1997; 7(5):401-409. Venter JC et al. The sequence of the human genome. Science 2001; 291(5507):1304-51. Batzoglou S et al. ARACHNE: a whole-genome shotgun assembler. Genome Res 2002; 12(1):177-89. Mullikin JC, Ning Z. The phusion assembler. Genome Res 2003; 13(1):81-90. |
|
| N50 | The contig or scaffold length such that half of the bases in a given assembly reside in contigs or scaffolds of at least that length. This provides a measure of continuity. For instance, a scaffold N50 of 15 Mb means that at least half of the bases in the assembly are in scaffolds that are at least 15 Mb long. |
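The N50 definition can be expressed as a short calculation. This is a minimal sketch with a hypothetical function name: sort the lengths from longest to shortest and return the length at which the running total first reaches half of the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid fractional halves.
        if 2 * running >= total:
            return length
    raise ValueError("lengths must contain at least one sequence")


# Nine contigs totalling 54 bases: 10 + 9 + 8 = 27 reaches half.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because N50 depends only on the length distribution, it says nothing about whether the contigs or scaffolds are correctly assembled.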
| Mutation |
A sequence variation that deviates from the reference, or "wild
type", sequence. This variation can be a SNP,
an insertion of sequence, or a deletion of sequence. There can be a great
deal of sequence variation between individuals in a population. For example,
different humans may have as many as 1 basepair difference every 1000
bp. In practice, mutations are distinguished from variation because they
have phenotypic consequences. Mutations in the Pax6
gene that lead to a loss of the function of that gene lead to the eyeless
mutation in flies, the Small
eye mutation in mice, and aniridia in humans. |
| Phenotype |
An observable characteristic displayed by an organism. These characteristics
can be controlled by genes, by the environment, or a combination of both.
The characteristic can be directly observable, such as having brown eyes.
In some cases, the phenotype will be measurable, such as having high blood
pressure. |
| Positional cloning |
Identification of a gene based on its physical location in the genome.
Often, an individual has a phenotype, but the gene
underlying this phenotype is unknown. Using linkage mapping,
the phenotype can be assigned a position in the genome. Once a phenotype
has been localized, overlapping sets of clones (for example, BACs)
that cover the region are identified. Genes within the region are identified
and compared to DNA from individuals that display the phenotype until the
underlying mutation is identified. |
| RefSeq |
Reference Sequence. The goal of the RefSeq project is to produce a reference
sequence for all naturally occurring molecules from the central dogma (DNA,
RNA, Protein). |
| RFLP |
Restriction fragment length polymorphism. A type of polymorphism detectable
in a genome by the size differences in DNA fragments generated by restriction
enzyme analysis. |
|
| RH mapping |
Radiation Hybrid mapping. A physical mapping method that estimates linkage
and distance relative to radiation-induced chromosome breaks. This is
analogous to genetic mapping. |
|
| Scaffold | (see supercontig) | |
| SNP | Single Nucleotide Polymorphism. A
single base difference found when comparing the same DNA sequence from two
different individuals. More on SNPs. reference: Weiss KM. In search of human variation. Genome Res 1998; 8(7): 691-7. |
|
| SSAHA | A hashing algorithm developed for rapid searching
of large amounts of genome sequence. This program is similar to BLAT but
does not use splice information to align mRNA sequences, nor can it perform
translated searches. reference: Ning et al. SSAHA: a fast search method for large DNA databases. Genome Res 2001; 11(10):1725-9. |
|
| SSLP |
Simple sequence length polymorphisms. Common examples of these in mammalian
genomes include runs of dinucleotide or trinucleotide repeats (CACACACACACACACACA).
|
|
| Stem Cells |
Most cells of the adult body are terminally differentiated, that is,
they are no longer able to replace themselves or to become another cell
type. Stem cells are undifferentiated cells that are able to both proliferate
and differentiate into numerous cell types. For example, embryonic stem
cells are able to differentiate into any cell type found within the embryo,
whereas hematopoietic stem cells can differentiate into any blood cell. |
|
| STS |
Sequence-Tagged Site. In general, short sequences (200-500 bp) are produced
throughout a genome. Oligonucleotide primers are generated such that this
sequence can be amplified using PCR to produce a discrete band when analyzed
by electrophoresis. STS markers can be polymorphic or monomorphic. They
are critical to integrating non-sequence based maps (such as genetic or
RH) with sequence based maps. |
|
| Supercontig (Scaffold) |
A supercontig is formed when an association can be made between two contigs that have no sequence overlap. This commonly occurs using information obtained from paired plasmid ends. For example, both ends of a BAC clone are sequenced. It can be inferred that these two sequences are approximately 150-200 Kb apart (based on the average size of a BAC). If the sequence from one end is found in a particular sequence contig, and the sequence from the other end is found in a different sequence contig, the two sequence contigs are said to be linked. In general, it is useful to have end sequences from more than one clone to provide evidence for linkage. |
|
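The linking step in the supercontig entry can be sketched as follows. This is a minimal illustration with hypothetical names and made-up clone and contig identifiers; it only counts how many clones support each contig pair and ignores orientation and the expected distance between the ends.

```python
from collections import Counter


def link_contigs(end_placements, min_clones=2):
    """Find contig pairs linked by clone end sequences.

    `end_placements` is an iterable of (clone, contig_of_one_end,
    contig_of_other_end). A pair of contigs is reported only when at
    least `min_clones` clones have their two ends in those two contigs.
    """
    support = Counter()
    for _clone, contig_a, contig_b in end_placements:
        if contig_a != contig_b:  # both ends in one contig give no link
            support[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: count for pair, count in support.items() if count >= min_clones}


placements = [
    ("bac1", "ctg1", "ctg2"),
    ("bac2", "ctg2", "ctg1"),
    ("bac3", "ctg1", "ctg1"),
    ("bac4", "ctg2", "ctg3"),
]
links = link_contigs(placements)
```

Requiring support from more than one clone reflects the entry's point that several end-sequence pairs give better evidence for linkage than a single pair.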
| TPF | Tiling Path File. This is a simple file that lists the order of clones along a chromosome. These files are often used in genome assemblies to convey mapping information to the assembly program. | |
| Transgenesis | The introduction of exogenous DNA
into a cell. Typically, this term refers to the introduction of a gene into
an embryo or other eukaryotic cell. In general, this DNA will insert into
the genome at random, although specific loci can be targeted.
The size of the DNA molecule introduced can range from small (a few basepairs) to
quite large (over 100 Kb). Read more about transgenesis. reference: Selbert S, Rannie D. Analysis of transgenic mice. Methods Mol Biol 2002;180:305-41. |
|
| WGS |
Whole Genome Shotgun. A sequencing method by which an entire genome is
cut into chunks of discrete sizes (usually 2, 10, 50 and 150 Kb) and cloned
into an appropriate vector. The ends of these clones are sequenced. The
two ends from the same clone are referred to as mate pairs. The distance
between the two reads of a mate pair can be inferred if the library insert size is known and
has a narrow window of deviation. |
|
| YAC | Yeast artificial chromosome. These
cloning vectors were developed using yeast centromere and telomere sequences.
The average insert size of these clones ranges from 100 to 1000 Kb. These clones
can span large portions of the genome but can be highly unstable. Read more about YACs. reference: Burke DT et al. Cloning of large segments of exogenous DNA into yeast by means of artificial chromosome vectors. Science 1987; 236(4803):806-12. |
|