Abstract Syntax Notation 1 (ASN.1)
ASN.1 is a standard data description language
that is used for encoding structured data. ASN.1 allows both the content and the structure of the data to be
read by and exchanged between a variety of computer programs and platforms. ASN.1 is the language used to
store and manipulate data at the NCBI. All NCBI software reads and writes ASN.1.
The accession number is the most general identifier used in the NCBI sequence
databases. This is the identifier that should be used when citing a database record in a
publication. The accession number points to a sequence record and does not change when the sequence is
modified. In the Entrez system, using the accession number as a query will retrieve the most recent
version of the record. The update history of a particular sequence record is tracked by the accession.version
number. Changes in version numbers occur only when the actual sequence of a record has been modified
and do not reflect any changes in the annotation. The specific version of a record is also tracked by
another identifier that is mainly for internal NCBI use called the GI number.
An algorithm is a formal stepwise path to solving a problem, for example the
problem of finding high-scoring local alignments between two sequences. Algorithms are
the basis of computer programs.
The alignment score is a number assigned to a pairwise or multiple alignment of
sequences that provides a numerical value reflecting the quality of the alignment. Alignment scores
are usually calculated by referring to a substitution table or alignment scoring matrix and
summing the values for each pair or column in the alignment. (See also raw score and bit score.)
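As a minimal sketch of the column-by-column summing described above, the following scores an ungapped pairwise alignment with a simple match/mismatch scheme. The function name and the +1/-3 values are illustrative assumptions, not the parameters of any particular NCBI program.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-3):
    """Sum per-column scores for two equal-length, ungapped aligned sequences.

    The match/mismatch values are hypothetical and stand in for a lookup
    in a substitution table or scoring matrix.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Each aligned column contributes one value; the score is their sum.
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# Five matching columns and two mismatching columns.
score = alignment_score("GATTACA", "GACTATA")
```

A protein alignment would replace the match/mismatch test with a lookup of each residue pair in a matrix such as BLOSUM62.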
With certain scoring matrices, high scores of local ungapped alignments between two random sequences have the special property
of following the extreme value distribution. This property allows a significance level to be assigned
to local alignment scores obtained from database searches using such tools as BLAST and FASTA. (See also
Alignment Scoring Matrix
A scoring matrix is a table of
values used to assign a numerical score to a pair or column of aligned residues in
a sequence alignment. The simplest kind, an identity matrix, assigns a high value
for a match
and some low, often negative value, for a mismatch. The identity matrix is used in the NCBI's
nucleotide-nucleotide BLAST program. Protein alignment scoring matrices are
usually more complicated and take into account the relative abundance of the amino
acids in real proteins and the observation that some amino acids substitute for
each other more readily in related proteins (e.g., Phe and Tyr) and others do not
(e.g., Phe and Asp). One way of generating such a matrix is to examine alignments
of real proteins that are known to be homologous (see Homolog) and tabulate the
substitution frequencies of the various amino acid pairs at all positions. The
resulting frequency table is then converted to a log-odds additive matrix by
taking the log of the ratio of the observed substitution frequency for a particular
pair and the
background substitution frequency. The PAM and the BLOSUM series are
examples of widely used protein-scoring matrices that are derived in this way.
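The log-odds conversion described above can be sketched as follows. This is an illustration only: it assumes the background substitution frequency of a pair is the product of the two individual residue frequencies, and the scaling factor and example frequencies are made up rather than taken from PAM or BLOSUM data.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=2.0):
    """Return a rounded log-odds score for one residue pair.

    observed_pair_freq: how often the pair is seen aligned in real alignments.
    freq_a, freq_b: background frequencies of the two residues (assumption:
    the chance pairing frequency is simply their product).
    scale: hypothetical multiplier; 2.0 gives scores in half-bit units.
    """
    expected = freq_a * freq_b
    return round(scale * math.log2(observed_pair_freq / expected))


# A pair seen four times more often than chance gets a positive score,
# and a pair seen four times less often gets a negative one.
favoured = log_odds_score(0.0625, 0.125, 0.125)
disfavoured = log_odds_score(0.00390625, 0.125, 0.125)
```

Because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying odds ratios, which is why such matrices are called additive.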
The matrices described above do not take into account differences in substitution
frequencies at different positions in the alignments. More sensitive
position-specific scoring matrices can also be generated.
Scores of local alignments of random sequences derived from these log-odds
matrices are described by the extreme value distribution. Thus, significance levels
can be assigned to results of database searches with these matrices using tools such
as BLAST and FASTA. (See also
Alus are the most common class of short
interspersed repetitive elements (SINEs) in the human genome.
Alus may account for more than 10% of the human genome. They appear to be derived from a
signal recognition particle pseudogene. The name Alu derives from the fact that
these elements usually contain an AluI restriction enzyme recognition site.
A sequence assembly is a large sequence or ordered set
of sequences that may be derived from overlapping smaller sequences and
sometimes anchored to a genome
or chromosome scale map using information from STS content and other evidence.
Bacterial Artificial Chromosome (BAC)
A BAC is a
large insert cloning vector capable of handling large segments of cloned DNA,
typically around 150 kb. BACs can be propagated in laboratory strains of Escherichia
coli. These vectors are used in the construction of genomic libraries for genome scale
sequencing projects including human, mouse, Arabidopsis, and rice.
BankIt is a Web form for
submitting sequences to GenBank.
Basic Local Alignment Search Tool
BLAST is the NCBI's sequence similarity search tool. It finds high-scoring local
alignments between a query sequence
and nucleotide and protein database sequences. Although BLAST is less sensitive
than the complete Smith-Waterman algorithm, it provides a useful compromise between
speed and sensitivity, especially for searching large databases. Because BLAST
reports back local alignment scores, it provides statistics that may allow biologically interesting
alignments to be distinguished from chance alignments.
The bit score represents
the information content in a sequence alignment. It is expressed in base 2 log
units. The bit score is in essence a normalized score adjusted by database and
matrix scaling parameters. Hence, bit scores for different searches may be
compared and only the search space size is needed to calculate the
significance (Expect value) of the score. The relationship between Expect value
(E) and bit score
(S') is shown in equation 3 below.
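The equation itself is not reproduced in this copy of the text. As a hedged reminder, the standard Karlin-Altschul relations (assumed here, not quoted from this document) connect the raw score S, the bit score S', and the Expect value E for sequences of lengths m and n as follows:

```latex
S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad E = m \, n \, 2^{-S'}
```

Substituting the first relation into the second gives E = K m n e^{-\lambda S}, so the two forms agree and only the search space size m n is needed once S' is known.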
BLOcks SUbstitution Matrices (BLOSUM) are a set of protein log-odds alignment scoring matrices
calculated from substitution frequencies obtained from ungapped multiple
alignments of real proteins. Each BLOSUM matrix is identified with a number
that indicates the percent identity cut-off for inclusion in that matrix. For
example, BLOSUM62 includes substitution information for proteins up to 62%
identical in the alignment, and BLOSUM90 for proteins up to 90% identical. Each BLOSUM matrix
works best at finding proteins at a particular level of similarity. Hence,
BLOSUM90 is better at finding more closely related proteins whereas BLOSUM62 is
best at finding more distantly related ones. Experiments have shown that
BLOSUM62 also works well at
finding similar proteins. For this reason, BLOSUM62 is the default protein
scoring matrix for NCBI BLAST.
In the molecular sense, a
clone is a physical copy of a
piece of DNA. The term is most often used to refer to the recombinant cloning vector DNA
containing this copy such as a plasmid, BAC, or bacteriophage DNA that can be propagated
in a bacterial or other microbial host.
A cluster is a group of
sequences associated with each other, usually by some procedure that
relies on sequence similarity. Such clusters of sequences are used to
produce the UniGene datasets and the clusters of orthologous groups (COGs) dataset.
Clusters of Orthologous Groups
A COG is a group of related proteins or groups of proteins (paralogs) from different genomes that are
thought to derive from a common ancestral gene. COGs are formed based on
sequence similarity using a BLAST-based approach. COGs originally were made for the complete microbial
genomes, but the dataset is expanding to include more complex organisms.
The COGs data are very useful for annotating genes on microbial genomes
and can be used to provide potential functional classification for
uncharacterized proteins. (See also paralog and ortholog.)
Cn3D (pronounced "see in
three dee") is NCBI's structure viewer. It reads Entrez structure data and
renders either single structures or structural alignments from the NCBI's
molecular modeling database (MMDB). Cn3D functions as
a helper application to the Web browser and will launch automatically when
the browser downloads NCBI structure data. Cn3D can also function as a
stand-alone viewer and can act as a network client to download structures
from NCBI. It also has a built-in BLAST and threading capability and can create sequence
alignments to fit similar sequences to known structures.
Conserved Domain Architecture Retrieval Tool
CDART provides a graphical browser that
allows one to find proteins with a similar domain
architecture (content and arrangement) beginning with the results of a CDD search.
Conserved Domain Database (CDD) Search
CDD Search uses reverse position-specific BLAST (RPS-BLAST) to identify conserved domains contained in
a protein query. CDD databases are position-specific scoring matrices (PSSMs) created from
multiple sequence alignments from three domain databases: SMART, Pfam, and LOAD.
Contig is short for contiguous
sequence. Contigs are assembled overlapping primary sequences. The term contig arises
in two different contexts in the NCBI databases. Draft sequences (HTG)
contain two or more contigs assembled from sequencing reads made from plasmid libraries
for that clone. The NCBI also produces contigs made by assembling
overlapping GenBank records from large-scale genome projects, such as the human genome
project. These contigs are included in the NCBI RefSeq databases and are assigned
accession numbers beginning with the prefix NT_.
A curated database is a
derivative database containing
molecular records that are compiled and edited from primary molecular data by experts
who maintain and are responsible for the content of the records. The Swiss-Prot database is
an important example of a curated protein sequence database. The NCBI produces a curated non-redundant RefSeq
dataset of transcripts and proteins for important organisms.
In molecular biology, a derivative database contains
information derived and compiled from primary molecular data but includes some
type of additional
information provided by expert curators or automated computational procedures.
DNA Databank of Japan
A primary nucleotide
sequence database that is maintained as part of the Center for Information Biology and DNA Data Bank of
Japan (CIB/DDBJ) under the National Institute of Genetics (NIG) in Japan. DDBJ began
accepting DNA sequence submissions in 1986 and is a part of the International Nucleotide
Sequence Database Collaboration that also includes GenBank and the EMBL nucleotide sequence database.
A domain is a discrete structural unit of
a protein. In principle, protein domains are capable of folding independently from the rest
of the protein. Domains can often be identified by non-structural approaches based on
conserved amino acid sequences. The NCBI's CDD-search uses information from curated multiple
sequence alignments to identify domains in protein sequences.
Draft sequence is unfinished genomic or cDNA
sequence. See HTG and HTC.
e-PCR is an analysis tool that tests a DNA sequence for the presence of sequence tagged
sites (STSs). e-PCR looks for STSs in DNA sequences by searching for
subsequences that closely match the PCR primers and have the correct order,
orientation, and spacing such that they could plausibly prime the amplification of a
PCR product of the correct length.
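The search described above can be sketched in a much simplified form. This is not the e-PCR implementation: it assumes exact primer matches (the real tool tolerates close matches), scans only the given strand, and uses hypothetical function and parameter names.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_sts(template, fwd_primer, rev_primer, expected_len, tolerance=0):
    """Return (start, product_length) for each plausible STS hit.

    A hit needs the forward primer, then downstream the site where the
    reverse primer would anneal, spaced so that the product length is
    within `tolerance` of `expected_len`.
    """
    template = template.upper()
    fwd = fwd_primer.upper()
    # The reverse primer anneals to the opposite strand, so on the given
    # strand we look for its reverse complement.
    rev_site = reverse_complement(rev_primer)
    hits = []
    start = template.find(fwd)
    while start != -1:
        end = template.find(rev_site, start + len(fwd))
        while end != -1:
            product_len = end + len(rev_site) - start
            if abs(product_len - expected_len) <= tolerance:
                hits.append((start, product_len))
            end = template.find(rev_site, end + 1)
        start = template.find(fwd, start + 1)
    return hits


# Made-up template and primers for illustration.
toy_template = "TT" + "ACGTAC" + "GGGGGGGG" + "TTCAGA" + "CC"
toy_hits = find_sts(toy_template, "ACGTAC", "TCTGAA", expected_len=20)
```

Checking order, orientation, and spacing together is what separates a plausible amplification product from two unrelated primer-like matches.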
European Molecular Biology Laboratory (EMBL) Database
A nucleotide sequence database produced and maintained at the European
Bioinformatics Institute (EBI) in Hinxton, UK, that collaborates with GenBank
and the DNA Data Bank of Japan (DDBJ) to form the International Nucleotide
Sequence Database Collaboration.
Ensembl is a joint project between EBI-EMBL and the Sanger Institute to provide automatic
annotation of eukaryotic genomes.
Entrez is an integrated search and retrieval system that combines information from
various databases at NCBI, including nucleotide and protein sequences, 3D
structures and structural domains, genomes, variation data (SNPs), gene
expression data, genetic mapping data, population studies, OMIM, taxonomy, books
online, and the biomedical literature.
European Bioinformatics Institute (EBI)
A non-profit academic organization that performs research in bioinformatics and
maintains the EMBL nucleotide sequence database.
A feature within the human genome Map Viewer that provides a graphical display
of the molecular evidence supporting the existence of a gene model. ev displays
reference sequences, GenBank mRNAs, annotated known or potential transcripts, and
ESTs that align to the genomic area of interest.
Expect Value (E-value)
In BLAST statistics, the Expect value is the number of alignments with a particular score,
or a better score, that are expected to occur by chance when comparing two random sequences. The relationship between Expect value (E) and alignment
score (S) is given by Equation 1: E = K m n e^(-lambda S).
In Equation 1, e is the base of the natural logarithm scale, n and m
are the lengths of the two sequences, essentially the search space size for database
searching, and K and lambda are scaling factors for the search space and
the scoring system, respectively. The bit score incorporates
lambda and K so that scores can be meaningfully compared when different
databases and scoring systems are used.
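The following sketch assumes Equation 1 has the standard Karlin-Altschul form E = K m n e^(-lambda S) and shows how folding K and lambda into a bit score leaves only the search space size in the final calculation. The function names and the numerical values of K and lambda are illustrative assumptions, not values reported by any specific BLAST run.

```python
import math


def expect_value(raw_score, m, n, K, lam):
    """Expect value from a raw alignment score: E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalize a raw score so it no longer depends on K and lambda."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_bits(bits, m, n):
    """Expect value from a bit score: E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Hypothetical scaling parameters and search space.
K, lam = 0.13, 0.32
m, n = 250, 1_000_000
direct = expect_value(100, m, n, K, lam)
via_bits = expect_from_bits(bit_score(100, K, lam), m, n)
```

Both routes give the same Expect value, which is the sense in which bit scores from different scoring systems can be compared directly.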
Expressed Sequence Tag (EST)
A short (300-1000 nucleotide), single-pass, single-read DNA sequence derived from
a randomly picked cDNA clone. EST sequences comprise the largest GenBank
division. There are numerous high-throughput sequencing projects
that continue to produce large numbers of EST sequences for important organisms. Many ESTs
are classified into gene-specific clusters in the UniGene data set.
A sequence similarity search tool developed by
William Pearson and David Lipman. The term FASTA is also used to identify a text format
for sequences that is widely used. A FASTA-formatted sequence file may contain multiple
sequences. Each sequence in the file is identified by a single line title preceded by
the greater than sign (">"). Example.
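A minimal sketch of reading the format described above follows. The record titles and sequences are made up for illustration, and the parser name is an assumption rather than part of any NCBI toolkit.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) tuples from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = """>seq1 hypothetical example record
ATGGCGTACG
TTAGC
>seq2 another made-up record
MKVLAAGIVG
"""
records = parse_fasta(example)
```

Each title line starts a new record, so a single file can hold any number of nucleotide or protein sequences.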
The feature table is the portion of the GenBank record that provides information about the
biological features that have been annotated on the nucleotide sequence, including coding and non-coding regions, genes, variations,
and sequence tagged
sites. The International Sequence Database Collaboration produces a document describing
and identifying allowed features on GenBank, DDBJ, and EMBL records.
File Transfer Protocol (FTP)
FTP is a standard Internet protocol used to transfer files to
and from a remote network site.
Fluorescence in Situ Hybridization (FISH) map
A FISH map is a cytogenetic map derived from the localization of fluorescently-labeled probes
to chromosomes. Genes are mapped according to their cytogenetic (band
position) location on the chromosome.
GenBank is a primary nucleotide sequence database produced and maintained at the National Center
for Biotechnology Information (NCBI) at the National Institutes of Health (NIH)
in Bethesda, MD, USA. GenBank collaborates with EMBL and
DDBJ to form the
International Nucleotide Sequence Database Collaboration.
GenBank divisions are partitions of the GenBank data into categories based on the origin of the sequence.
At first the GenBank divisions were established so that one division would be one file in the GenBank
distribution. However, the number of GenBank divisions has not kept pace with the growth of the sequence data;
the EST division now has over 150 files. There are currently 17 GenBank divisions.
GenBank Flatfile Format
This is the format of the sequence records in the GenBank flatfile release. It is a text-only format containing multiple entries or records. Each record in the large text file,
also called a flatfile,
begins with a LOCUS line and ends with a single line consisting of a pair of forward
slashes ("//"). The term "GenBank format" is often used to
refer to the format of individual records within the flatfile. Each record contains a header
containing the database identifiers, the title of the record, references, and submitter information.
The header is followed by the feature table and then the sequence itself. The
GenBank flatfile is described in detail in the GenBank release notes.
In the Entrez system, the GenBank format is the default display format for non-bulk sequence entries.
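The record boundaries described above (a LOCUS line at the start, a line of two forward slashes at the end) are enough to split a flatfile into individual records. The sketch below assumes only those two markers; the function name and the toy records are hypothetical and far shorter than real GenBank entries.

```python
def split_genbank_records(text):
    """Split flatfile text into records that start at LOCUS and end at '//'."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # Start collecting a new record.
            current = [line]
        elif line.strip() == "//":
            # Terminator line: store the record without the '//' itself.
            if current:
                records.append("\n".join(current))
            current = []
        elif current:
            current.append(line)
        # Lines before the first LOCUS (e.g. a release header) are skipped.
    return records


toy_flatfile = (
    "release header line\n"
    "LOCUS       TOY00001\n"
    "DEFINITION  first made-up record\n"
    "//\n"
    "LOCUS       TOY00002\n"
    "ORIGIN\n"
    "//\n"
)
toy_records = split_genbank_records(toy_flatfile)
```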
Gene Expression Omnibus (GEO)
GEO is a primary database at the NCBI that is an archived repository for gene expression data
derived from different experimental platforms.
A gene model is a mapping of gene features such as coding regions and exon-intron
boundaries onto
the genomic DNA of an organism. Gene models typically provide a predicted transcript and protein
sequence. A simple kind of gene model can be made by aligning an expressed
sequence (cDNA) to the genomic DNA sequence. More precise exon intron boundaries can
be identified by constraining the aligned segments using consensus splicing signals. This type
of alignment-based gene model is used to generate many of the NCBI RefSeq model transcripts for higher genomes.
Gene features can also be predicted computationally in the absence of aligned expressed sequences.
The simplest candidate gene predictions can be made on microbial genomic DNA by searching for long
open reading frames. Database sequence similarity searches with the predicted translations of these
ORFs are used to support these gene predictions. Computational gene prediction in higher eukaryotic genomes is complicated
by the interruption of gene coding regions by intronic sequences. There are a number of methods
that are used in eukaryotic gene prediction. The NCBI uses the program GenomeScan to annotate
putative genes on the human, mouse and rat genomes.
Genetic Linkage Map
A linkage map is an ordered display of genetic information referenced to linkage groups (ultimately
chromosomes) in a genome.
The mapping units (centiMorgans) are based on recombination
frequency between various polymorphic markers traced through a pedigree. One centiMorgan equals one recombination event in 100 meioses.
Genetics Computer Group (GCG)
The GCG is a bioinformatics software development group, originally at the Department of
Genetics at the University of Wisconsin, then later existing as a private
company, and merging with Oxford Molecular, MSI and Synopsys to collectively
form Accelrys. GCG is widely known for its sequence analysis software package
properly known as the Wisconsin Package. The initials GCG have been widely
used as a synonym for that package.
Genome Survey Sequence (GSS)
GSS sequences comprise a bulk sequence division of GenBank. GSS sequences are
in essence the genomic equivalent of
the ESTs. The GSS division contains first pass,
single reads of genomic DNA. Typical GSS records are initial sequencing surveys
and end reads of large insert clones
from genomic libraries, exon-trapped genomic sequences and Alu PCR sequences.
GenomeScan is a gene prediction program (algorithm) developed by Christopher Burge at the
Massachusetts Institute of Technology. This is the algorithm used at the
NCBI to produce gene models for higher genomes.
The GI number is an identifier
assigned to all sequences at the NCBI. The GI number points to a specific version of a sequence record.
This identifier is largely superseded
by the accession.version number for outside users.
GI stands for GenInfo, a database system at NCBI
that preceded the Entrez system.
A global alignment is a sequence alignment that extends the
full length of the sequences that are being compared. Global alignment procedures usually
will produce an alignment that includes the entire length of all sequences including regions
that are not similar, and can be made to produce meaningless alignments between unrelated
sequences. Compare with local alignment.
The Golden Path refers to the human and mouse genome annotation and assembly
projects at the University of California Santa Cruz (UCSC).
High Throughput Genomic Sequence (HTG)
HTG sequences comprise a GenBank division containing unfinished genomic sequence. HTG records typically
are incomplete sequence assemblies of BAC or other large insert clones. There are
four stages of completion (phases) for these sequences. Phase 0 records contain one or a few
single pass reads of a given genomic clone. Phase 1 records contain two or more contigs assembled from
the sequence data; however the contigs are unordered and unoriented and there are still gaps in the
sequence. Phase 2 records also contain two or more contigs with gaps, but the order and
orientation are known. Once the sequence gaps are resolved, and there is enough sequence coverage
to give an
accuracy of 99.99%, the record moves to phase 3 and leaves the HTG division for the appropriate
taxonomic GenBank division; a human sequence would move to the primate division (PRI), a mouse
sequence to the rodent division (ROD).
High Throughput cDNA (HTC)
HTC is a GenBank division containing draft cDNA sequences. HTC records are similar
to ESTs, but often contain more information. Unlike ESTs but like
the genomic draft (HTG) records, HTC sequences
may be updated with additional sequence data and move to the appropriate traditional
division of GenBank.
Two biological entities (structures or molecules) are said to be homologues (or are
homologous) if it is
thought that they descend from
a common ancestral structure or molecule. Corresponding body parts and genes in different or
the same species can be homologous. The term has often been extended to
include sequences as well. However, it is incorrect to report a relative homology or percent
homology as is
sometimes said of sequences; genes or sequences are either homologous or they are not. See also
orthologue and paralogue.
Human Genome Nomenclature Committee
The HGNC is a non-profit organization located at University College London that
assigns authoritative and unique gene names and symbols for all known human genes.
Human Mouse Homology Maps
The human mouse homology maps show the syntenic chromosome regions
between the two organisms and allow the corresponding sequences and other related information to
be retrieved from one organism given a gene or map location in the other.
The data used to construct these homology maps
are derived from UCSC and NCBI human genome assemblies and the mouse MGD genome
map and Whitehead/MRC radiation hybrid maps.
International Sequence Database Collaboration (ISDC)
The ISDC involves the three major primary nucleotide sequence repositories GenBank, the DDBJ (DNA Data Bank of Japan), and the EMBL (European Molecular Biology Laboratory) databases.
Each database has its own set of submission and retrieval tools, but the three exchange data daily
and have shared standards for sequence submission and annotation. All three share data
so that all contain the same set of sequence data.
Interspersed repetitive sequences are primarily degenerate copies of transposable elements - also called mobile elements - that, in humans, comprise over a third of the genome. The most common mobile elements are LINEs and SINEs (long and short interspersed nuclear elements, respectively). The Alu families of repeats are the primary SINEs in primates.
Long interspersed nuclear elements are a class of transposable element, also called an
interspersed repeat. These constitute about 20% of the human genome.
A typical LINE is 6 kb long and encodes a reverse transcriptase and a DNA-nick-looping enzyme, allowing it to move about the genome autonomously.
LINEs are also called non-LTR retrotransposons.
LinkOut is a registry service to create links from specific articles, journals, or biological data in Entrez
to resources on external web sites. Third parties can provide a URL, resource name, brief description of
their web site, and specification of the NCBI data from which they would like to establish links.
LOAD is the library of ancient domains, a small number of conserved domain alignments
that add to the position specific scoring matrices (PSSMs or profiles) in the
Conserved Domain Database (CDD) at NCBI. The majority of domains in CDD come
from the databases SMART (Simple Modular Architecture Research Tool) and Pfam.
A local alignment is a high scoring alignment between sub-sequences of two or more longer
sequences. Unlike a global alignment, there may be multiple high scoring local alignments between
sequences. Local alignments are useful for database searches because their scores can
be used to assess the biological significance of the alignments found. (See also Alignment Score and Expect Value.) Local alignments are
produced by the popular sequence similarity search tools BLAST and FASTA.
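To illustrate what a local alignment score is, the sketch below computes the best-scoring local alignment between two sequences with the full Smith-Waterman dynamic programming method. This is not how BLAST or FASTA work internally (both use faster heuristics), and the match, mismatch, and gap values are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b.

    Uses a linear gap penalty and keeps only two rows of the scoring
    table, so it reports the score but not the alignment itself.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Resetting to zero lets an alignment start anywhere, which is
            # what makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The two made-up sequences share only the sub-sequence ACGT.
shared = smith_waterman_score("TTTACGTGG", "CCACGTAA")
```

Unrelated flanking regions do not lower the score, in contrast to a global alignment, which must account for every position of both sequences.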
LocusLink is an NCBI resource that provides a single query interface to curated
sequence and descriptive information about genetic loci.
It is a good place to begin a search for information about a particular gene.
LocusLink currently contains human, mouse, rat, zebrafish, fruit fly and HIV-1 loci.
Low Complexity Sequence
Low complexity sequence is a region of amino acid or nucleotide sequence with a biased residue
composition. Low complexity sequence includes homopolymeric runs, short-period repeats, and some
subtler over-representation of one or a few
residues. Such sequences often look very redundant, for example
the protein sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT.
Low-complexity regions can result in misleading high scores in sequence similarity searches.
These scores reflect compositional bias rather than significant position-by-position alignment.
Filter programs are usually used to eliminate these potentially confusing matches from
sequence similarity search results. The NCBI BLAST programs use filters that replace low
complexity regions in the query sequence with an anonymous residue (n for nucleic acid, X for amino acid).
These regions are thus effectively removed from the search because these anonymous residues are
treated as mismatches by the BLAST programs.
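The masking idea can be sketched with a simple sliding-window entropy filter. This is an assumption-laden illustration, not the filter algorithm used by the BLAST programs; the window size, threshold, and function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=8, threshold=1.5, mask_char="X"):
    """Replace residues in low-entropy windows with an anonymous residue."""
    masked = list(seq)
    for start in range(0, len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            # The whole window is masked when its composition is biased.
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# The biased protein example from the text is masked completely, while a
# made-up sequence of sixteen different residues is left untouched.
masked_biased = mask_low_complexity("PADPPPDPPPP")
masked_diverse = mask_low_complexity("ACDEFGHIKLMNPQRS")
```

For nucleotide queries the same function could be called with mask_char="N".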
The Map Viewer is a software component of the NCBI Entrez Genomes
that provides special browsing capabilities for genomes of higher organisms.
It allows one to view and search an organism's
complete genome, display chromosome maps, and zoom into
progressively greater levels of detail, down to the sequence data.
If multiple maps are available for a chromosome, it displays
them aligned to each other based on shared marker and gene names,
and, for the sequence maps, based on a common sequence coordinate
system. The number and types of available maps vary by organism,
but include maps for: genes, contigs, BAC tiling path, STSs,
FISH mapped clones, ESTs, GenomeScan models, and SNPs.
MEDLINE is the NLM's premier bibliographic database covering the
fields of medicine, nursing, dentistry, veterinary medicine, the
health care system, and the preclinical sciences. MEDLINE contains
bibliographic citations and author abstracts from more than
4,600 biomedical journals published in the United States and 70 other countries.
The file contains over 11 million citations dating back to the mid-1960s.
Coverage is worldwide, but most records are from English-language sources
or have English abstracts. MEDLINE is included in PubMed, which contains
MegaBLAST is a local pairwise nucleotide alignment tool that is optimized for finding long alignments between
nearly identical sequences. MegaBLAST is most useful for comparing sequences from the same
species, and is particularly suited to such tasks as clustering ESTs, aligning genomic clones or
aligning cDNA sequences and genomic DNA. MegaBLAST can be up to 10 times faster
than many standard sequence similarity programs, including standard nucleotide-nucleotide BLAST.
It also efficiently handles much longer DNA sequences. MegaBLAST is the only BLAST program
on the NCBI's web site that can perform batch searches.
Model Maker is a tool associated with the Map Viewer that allows one to view the evidence (mRNAs, ESTs, and gene predictions)
that was aligned to assembled genomic sequence in order to build a gene model.
Model Maker also allows editing the model by selecting or removing putative exons.
Model Maker can then display the mRNA sequence and potential ORFs for the edited model,
and save the mRNA sequence data for use in other programs.
Model Maker is accessible from sequence maps displayed in the Map Viewer. To see an example, follow the "mm" link beside any gene annotated on the human "Gene_Sequence" map in the Map Viewer.
Molecular Modeling Database (MMDB)
NCBI's structure database, MMDB, contains experimentally determined, three-dimensional,
biomolecular structures obtained from the Protein Data Bank (PDB); the PDB's theoretical models are not imported.
MMDB was designed for flexibility, and as such, is capable of archiving
conventional structural data as well as future descriptions of biomolecules,
such as those generated by electron microscopy (surface models).
Most 3D-structure data are obtained from X-ray crystallography and NMR-spectroscopy.
A motif is a short, well-conserved nucleotide or amino acid sequence that represents a minimal functional domain. It is often a consensus for several aligned sequences. The PROSITE database is a popular collection of protein motifs, including motifs for enzyme catalytic sites, prosthetic group attachment sites (heme, biotin, etc), and regions involved in binding another protein. Examples of DNA motifs are transcription factor binding sites.
The National Center for Biotechnology Information
The NCBI is a division of the
National Library of Medicine at the National Institutes of Health in Bethesda, MD. The NCBI
was established in 1988 to create automated systems for
storing and analyzing knowledge about molecular biology, biochemistry, and
genetics; to support the use of such databases and software by the scientific
community; to coordinate efforts to gather biotechnology information both
nationally and internationally; and to perform research in computational
biology. Currently the NCBI maintains the GenBank database along with several
The National Institute of Genetics (NIG)
The National Institute of Genetics (NIG) was established in 1949 in
Mishima, Japan and reorganized in 1988 as an inter-university research institute
in genetics. The Institute currently provides graduate education in genetics and
also maintains the DNA Data Bank of Japan.
Nonredundant is a term used to describe nucleotide or amino acid sequence databases that contain
only one copy of each unique sequence. Nonredundant databases have the advantage of smaller size
and, therefore, shorter search times and more meaningful statistics.
The default database on most protein BLAST web pages is labeled "nr". This is a nonredundant
database where multiple copies of the same sequence such as the corresponding sequences of the
same protein from SWISS-PROT, PIR, and GenPept, are combined to make one sequence entry.
The default nucleotide database on the standard nucleotide-nucleotide BLAST web page is also
labeled "nr", but is no longer a nonredundant database.
Online Mendelian Inheritance in Man (OMIM)
OMIM is a catalog of human genes and genetic disorders authored and
edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins
and elsewhere, and developed for the World Wide Web by NCBI. The
database contains textual information, references, and copious
links to MEDLINE and sequence records in the NCBI's Entrez system,
plus links to additional related resources at NCBI and elsewhere.
Open Reading Frame (ORF)
An ORF is a DNA (or mRNA) sequence that is potentially able to encode a polypeptide.
ORFs begin with a start codon (ATG) and are read in triplets until they
end with a stop codon (TAA, TGA, or TAG in the standard code).
The NCBI ORF finder is useful
for identifying ORFs in cDNA or in intron-less genomic DNA.
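The definition above translates into a short scan over the three forward reading frames. The sketch below is not the NCBI ORF finder: it assumes the standard genetic code, ignores the reverse strand, and uses a hypothetical minimum-length parameter.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for forward-strand ORFs.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    the codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Read the sequence in triplets within this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Made-up sequence with one ORF: ATG AAA TTT TAG.
toy_seq = "CCATGAAATTTTAGGG"
toy_orfs = find_orfs(toy_seq)
```

Introns would interrupt the triplet reading, which is why this kind of scan suits cDNA or intron-less genomic DNA.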
Orthologues are genes derived from a common ancestor through vertical descent. This is often
stated as the same gene in different species. In contrast, paralogs are genes within the same
genome that have evolved by duplication.
The hemoglobin genes are a good example. Two separate genes (proteins) make up the molecule hemoglobin (alpha and beta). The alpha and beta DNA sequences are very similar and it is believed that they arose from duplication of a single gene, followed by separate evolution in each of the sequences. Alpha and beta are considered paralogs. Alpha hemoglobins in different species are considered orthologs.
The original Percent Accepted Mutation (PAM) scoring matrix (see M.O. Dayhoff, ed., 1978,
Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3) was derived from observing how often different amino acids replace other amino acids in evolution, and was based on a relatively small dataset of 1,572 changes in 71 groups of closely related proteins.
Further, matrix values are based on the model that one sequence is derived
from the other by a series of independent mutations, each changing one amino
acid in the first sequence to another amino acid in the second. PAM250
was a very popular matrix, but is now often replaced by the BLOSUM series of matrices, particularly when looking for more distantly related proteins. Lower-numbered PAM matrices correspond roughly to higher-numbered BLOSUM matrices.
Paralogs are usually described as genes within
the same genome that have evolved by duplication. See Ortholog.
Pfam is a database of conserved protein regions or domains. It is one of three databases that make up the NCBI's Conserved Domain Database (CDD). The other two are SMART and LOAD.
A PopSet is a set of DNA sequences that have been collected to analyze the evolutionary relatedness of a population. The population could originate from different members of the same species, or from organisms from different species. They are submitted to GenBank via the program Sequin, often as a sequence alignment.
Position Hit Initiated BLAST (PHI-BLAST)
PHI-BLAST is a variation of BLAST that is designed to search for proteins that both contain a pattern
specified by the user, and are similar to the query sequence in the vicinity
of the pattern. This dual requirement is intended to reduce the number of
database hits that contain the pattern and are likely to have no true homology
to the query.
Position Specific Iterated BLAST (PSI-BLAST)
PSI-BLAST is a derivative of protein-protein BLAST that is more sensitive because it incorporates
position specific substitution rates in the scoring system. This makes PSI-BLAST useful
for finding very distantly related proteins.
PSI-BLAST works by first generating a position specific score matrix (PSSM) from the sequences
found from a standard BLAST search. The database is then searched with the PSSM. PSI-BLAST can be
run in multiple iterations, with a new PSSM being made from the new information collected from
the previous search.
Position Specific Scoring Matrix (PSSM)
A PSSM is an alignment scoring matrix that provides substitution scores for each
position in a protein sequence. PSSMs are often based upon the frequencies of each amino acid
substitution at each position of a protein sequence alignment. This gives rise to a scoring matrix that has the length of
the alignment as one dimension and the possible substitutions as the other. In a PSSM a particular
substitution, for example Ser substituting for Thr, can have a different score at different positions in the alignment. This is
in contrast to a position-independent matrix like BLOSUM62, where the Ser-Thr substitution gets
the same score no matter where it occurs in the protein. PSSMs are more realistic models
for related protein sequences since substitution rates are expected to vary across the length
of a protein; some aligned positions, such as the active site residues, are more important than
others.
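The idea of a matrix with one row of scores per alignment position can be sketched as follows. This is a simplified illustration and not the method PSI-BLAST uses: it assumes an ungapped alignment, a uniform background frequency, and a flat pseudocount, whereas real tools weight sequences and use more elaborate pseudocounts. The function name `build_pssm` and the toy alignment are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            # Observed frequency at this position, smoothed by the pseudocount.
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

# Toy alignment: position 0 is invariant His, position 1 mixes Ser and Thr.
alignment = ["HSGT", "HTGS", "HSGS", "HTGT"]
pssm = build_pssm(alignment)
```

In this sketch Ser scores negatively at position 0, where only His is observed, and positively at position 1, where Ser and Thr alternate. That is the position dependence that a single matrix such as BLOSUM62 cannot express.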
In the context of alignments displayed in BLAST output,
positives are those non-identical substitutions that
receive a positive score in the underlying scoring matrix, BLOSUM62 by default.
Most often, positives indicate a conservative substitution or substitutions that
are often observed in related proteins.
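Counting positives in an ungapped pairwise alignment can be sketched as below. The dictionary holds only a small excerpt of BLOSUM62 covering the residue pairs in the example; a real implementation would load the full matrix. The function names are hypothetical, and the sketch follows the definition given here, counting identities separately from positives.

```python
# Excerpt of BLOSUM62, keyed by alphabetically sorted residue pairs.
BLOSUM62_EXCERPT = {
    ("S", "S"): 4, ("S", "T"): 1, ("T", "T"): 5,
    ("L", "L"): 4, ("I", "L"): 2, ("D", "L"): -4,
    ("K", "K"): 5, ("K", "R"): 2,
}

def substitution_score(a, b):
    """Look up the score for a residue pair regardless of order."""
    return BLOSUM62_EXCERPT[tuple(sorted((a, b)))]

def count_identities_and_positives(query, subject):
    """Return (identities, positives) for two aligned, ungapped sequences."""
    identities = positives = 0
    for a, b in zip(query, subject):
        if a == b:
            identities += 1
        elif substitution_score(a, b) > 0:
            positives += 1  # non-identical pair with a positive score
    return identities, positives

identities, positives = count_identities_and_positives("SLKL", "TIKD")
```

Here Ser/Thr and Leu/Ile are positives, Lys/Lys is an identity, and Leu/Asp scores negatively and is counted as neither.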
A primary sequence database contains sequences submitted by the researchers who originally
produced the data. In primary sequence databases the submitters of the sequence
control the contents and disposition of the data. GenBank is an example of a
primary database. The content, accuracy and updating of GenBank sequences are
largely the responsibility of the submitter.
This is in contrast to a curated database, such as RefSeq or SWISS-PROT,
where additional information is added to each record by the staff
maintaining the database.
ProbeSet is a by-experiment view of NCBI's Gene Expression Omnibus (GEO),
which is a gene expression and hybridization array repository.
ProbeSet is intended to facilitate searches of the GEO database
and link the search results to internal and external resources.
Protein matches for ESTs (ProtEST) are the best protein matches to translations of EST sequences
in UniGene. The nucleotide sequences (mRNAs as well as ESTs) are matched with possible translational products through sequence comparison using BLASTX with an expect value of 1 x 10^-6. The sequences are compared with proteins from eight organisms and the best match in each organism is recorded. UniGene nucleotide sequences can thus have up to eight matches in ProtEST.
In order to exclude protein sequences that are strictly conceptual translations or models, the proteins used in ProtEST are those originating from the structural databases SWISS-PROT, PIR, PDB or PRF.
Protein Data Bank (PDB)
PDB is the repository for the processing and distribution of 3-D
biological macromolecular structure data.
As of April, 2002, PDB contained almost 18,000 structures,
including more than 1,000 nucleic acids and 400 theoretical models.
Except for theoretical models, the PDB data are used to produce the
NCBI's structure database, MMDB, and are
included in the default BLAST database ("nr").
Protein Information Resource (PIR)
PIR is a curated protein sequence database produced and maintained by the National Biomedical Research Foundation at Georgetown
University in Washington, D.C. PIR protein sequences are included in
BLAST "nr" database and in the Entrez protein system. PIR contains more than
Protein Resource Foundation (PRF)
PRF is a protein sequence database maintained in Osaka, Japan, and is one of the protein
databases included in BLAST "nr" database searches and in the Entrez protein system. Release 84,
March 2002, included
PubMed, a service of the National Library of Medicine, provides access to over
11 million MEDLINE citations, from more than 4,300 biomedical journals published
in the United States and 70 other countries. Citations cover the fields of
medicine, nursing, dentistry, veterinary medicine, the health care system, and
the preclinical sciences, and date back to the mid-1960s. PubMed includes additional
life science journals not found in MEDLINE, as well as links to many sites
providing full text articles and other related resources.
Radiation Hybrid (RH) map
A radiation hybrid map is an STS-based physical genome map produced by first breaking
chromosomes of a donor cell line with a lethal dose of radiation, and then
rescuing the cells by fusion with a recipient cell line. Distances between
markers are measured in centirays (cR), with 1 cR representing a 1% probability
that a break occurred between two markers.
RasMol is a structure rendering software package produced at the University of
Massachusetts. RasMol interprets the native format of structure files from PDB.
A raw score in BLAST output is the non-normalized score of an alignment of a query and target
sequence. The raw score is derived directly from the scoring matrix by summing the
individual substitution scores of the aligned residues in the alignment. For gapped
BLAST the raw score also includes gap penalties.
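The summation described above can be sketched as follows. The substitution values are a small excerpt of BLOSUM62 covering only the pairs in the example, and the gap costs (11 to open a gap plus 1 per gapped position) are assumed here for illustration. The function name `raw_score` and the example sequences are hypothetical.

```python
# Excerpt of BLOSUM62, keyed by alphabetically sorted residue pairs.
BLOSUM62_EXCERPT = {
    ("S", "T"): 1, ("K", "K"): 5, ("L", "L"): 4, ("G", "G"): 6,
}

def raw_score(query, subject, gap_open=11, gap_extend=1):
    """Sum substitution scores and subtract affine gap penalties.

    query and subject are aligned strings of equal length; '-' marks a gap.
    """
    total = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for a, b in zip(query, subject):
        if a == "-" or b == "-":
            side = "query" if a == "-" else "subject"
            if side != gap_side:
                total -= gap_open  # charged once per gap
            total -= gap_extend    # charged for every gapped position
            gap_side = side
        else:
            total += BLOSUM62_EXCERPT[tuple(sorted((a, b)))]
            gap_side = None
    return total

ungapped = raw_score("SKLG", "TKLG")   # 1 + 5 + 4 + 6
gapped = raw_score("SKLLG", "TK--G")   # 1 + 5 - (11 + 2) + 6
```

The second alignment shows how a single two-residue gap can outweigh the substitution scores of a short alignment.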
Reference single nucleotide polymorphisms (refSNP) are curated
dbSNP records that define a non-redundant set of markers used for annotation of
reference genome sequence and integration with other NCBI resources. Each refSNP
record provides a summary list of submitter records in dbSNP and a list of
external resource and database links.
Reference Sequences are curated nucleotide or protein records developed
by NCBI staff. They attempt to summarize the available information about a given
sequence and to provide the most reliable and up-to-date sequence and annotation. RefSeqs
include curated transcripts and proteins, noncoding transcribed RNAs, contig and
supercontig assemblies, gene
models and chromosome records.
Reverse Position Specific BLAST (RPS-BLAST)
RPS-BLAST is a variation of BLAST in which a protein
query sequence is searched against a database of pre-computed Position-Specific
Score Matrices as used in PSI-BLAST. This kind of search forms the basis of the
A sequence alignment is a residue by residue comparison of two or more sequences. In
the alignment, the relative positions of the sequences are adjusted to optimize
the alignment score derived by reference to some scoring matrix. In some cases gaps
with associated penalties may be inserted into one or more sequences to optimize the
Sequence Tagged Site (STS)
STSs are sequence records that contain a short sequence of genomic DNA that can be uniquely
amplified by the polymerase chain reaction (PCR) using a pair of primers. The primer
sequences and PCR conditions are usually included in the record. Sequence
tagged sites comprise the STS GenBank division. These markers are used in linkage
and radiation hybrid mapping techniques. They are useful for integrating these kinds
of mapping data with each other and also with the assembled genomic sequence. The ePCR tool is useful for identifying known STS markers in a DNA sequence.
Sequin is a stand-alone application
package produced by NCBI for preparing and annotating sequences
for submission to GenBank.
Serial Analysis of Gene Expression (SAGE)
SAGE is an experimental method of generating a cDNA library that contains concatenated short
(usually ten-base) fragments, called tags, of all cDNA species present in the library. These tags
may be counted to give a quantitative measure of gene expression in the library. The NCBI SAGE Map resources match SAGE tag sequences to
UniGene clusters to identify genes expressed in SAGE libraries and provide several mechanisms
for exploring relative expression patterns in SAGE libraries.
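Counting tags to obtain a relative measure of expression can be sketched as below. This is an illustration only: it assumes the ten-base tags have already been extracted from the sequenced concatemers, and it normalizes counts to tags per million so that libraries of different sizes can be compared. The function name `tag_abundance` and the tag sequences are hypothetical.

```python
from collections import Counter

def tag_abundance(tags):
    """Return {tag: (count, tags_per_million)} for a list of SAGE tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: (n, n * 1_000_000 / total) for tag, n in counts.items()}

# Hypothetical tags from one small library.
tags = ["GTGACCACGG", "TACCATCAAT", "GTGACCACGG", "CCCATCGTCC", "GTGACCACGG"]
abundance = tag_abundance(tags)
```

Mapping each counted tag to a gene, for example through a UniGene cluster, is a separate lookup step that is not shown here.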
Shotgun sequencing is a sequencing method
in which a large genomic clone is
broken into small segments that are then subcloned and randomly sequenced. Once enough
random clones have been sequenced, these random sequences
are then assembled to establish the large insert sequence. In some cases, an entire genome may
be fragmented and cloned into small insert vectors without first being cloned and arrayed in
large insert vectors. This latter technique is called whole genome shotgun sequencing and
has been used successfully with many smaller genomes and has provided important preliminary
assemblies for the human, mouse and rice genomes.
SINEs (Short Interspersed Elements) are transposable repeat elements in the human
genome that are typically 100-400 bp, harbor an internal polymerase III
promoter, and encode no proteins.
Single Nucleotide Polymorphism (SNP)
Strictly speaking, a SNP is a variation or polymorphism in the genome sequence involving a
single nucleotide position. The NCBI maintains dbSNP as a primary repository of SNP
data. The SNP data at the NCBI also includes some variations
involving multiple positions such as repeat polymorphisms.
Spectral Karyotyping and Comparative Genomic Hybridization Database (SKY/CGH database)
SKY/CGH is a repository of publicly submitted data from SKY and CGH, which are complementary fluorescent molecular cytogenetic techniques. SKY facilitates identification of chromosomal aberrations;
CGH can be used to generate a map of DNA copy number changes in tumor genomes.
SMART (Simple Modular Architecture Retrieval
Tool) is a database of conserved domains that allows automatic identification and
annotation of domains in user-supplied protein sequences. The SMART data are used to create
one of the sets of PSSMs used in the CD-Search.
Smith-Waterman algorithm
The Smith-Waterman algorithm is a local alignment computational protocol that uses dynamic programming to find all possible high-scoring
alignments between a pair of sequences. This is the most sensitive local alignment algorithm but is
computationally too expensive to be generally useful for high-throughput searches of large sequence
databases. The BLAST and FASTA programs are generally used in these kinds of applications.
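The dynamic programming approach can be sketched as follows. This is a minimal version under stated assumptions: a simple match/mismatch score instead of a substitution matrix, a linear gap penalty, and a traceback that reports only one best-scoring local alignment. The function name `smith_waterman` and the parameter values are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix; scores never drop below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + pair(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

result = smith_waterman("TTTACGTAAA", "GGACGTCC")
```

The matrix has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths. That cost is why the method is impractical for searching large databases.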
SWISS-PROT is a highly curated database of protein sequences established in 1986
and currently maintained by the Swiss Institute of
Bioinformatics and the
European Bioinformatics Institute (EBI).
The TaxBrowser is an aspect of the Entrez system that allows one to browse sequence, genome and structure
records based on the taxonomic classification of the source organism. The TaxBrowser
allows access at all levels of the taxonomic hierarchy and can be used to acquire records
at any taxonomic node.
TrEMBL (Translated EMBL) is a derivative protein dataset
that is an automatically annotated supplement to SWISS-PROT. TrEMBL contains all the translations of coding regions of EMBL nucleotide
sequence entries. The
TrEMBL data set serves as a source of proteins that may eventually be incorporated into
SWISS-PROT.
UniGene is a database created and maintained at NCBI as an experimental system
for automatically partitioning expressed nucleotide sequences into a non-redundant set of
gene-oriented clusters. Each UniGene cluster contains sequences that represent a
unique gene, as well as related information such as the map location and tissue
types in which the gene has been expressed. UniGene is particularly important for reducing the redundancy and
complexity of EST data and is an important resource for gene discovery.
UniSTS is a resource created and maintained at NCBI that reports information
about Sequence Tagged Sites (STS). For each STS, UniSTS displays the primer
sequences, product size, and mapping information, as well as cross references to
other NCBI databases.
Vector Alignment Search Tool (VAST)
VAST is an algorithm created at NCBI that searches
for three-dimensional structures that are geometrically similar to a query
structure by first representing the secondary structure elements of each
structure as vectors, and then attempting to align these sets of vectors. VAST is
used at the NCBI to establish relationships between structures and create
structural alignments in
the Entrez system.
Word size is a parameter of the BLAST algorithm that determines the length of the
residue segments (either nucleotides or amino acids) into which BLAST partitions
the query sequence. The resulting dictionary of "words" is then used to search
the selected sequence database.
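Partitioning a query into overlapping words of a fixed length can be sketched as below. This illustrates only the partitioning step and is not BLAST's implementation, which also handles details such as neighborhood words for protein queries. The function names and the word size of 3 used in the example are assumptions for illustration.

```python
def query_words(query, word_size):
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]

def word_index(query, word_size):
    """Map each word to the list of query positions where it starts."""
    index = {}
    for pos, word in enumerate(query_words(query, word_size)):
        index.setdefault(word, []).append(pos)
    return index

words = query_words("MKTAYI", 3)
index = word_index("ACACAC", 2)
```

A query of length n yields n - w + 1 words of size w. A smaller word size produces more words and a more sensitive but slower search.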
Yeast Artificial Chromosome (YAC)
A YAC is a functional (self-replicating) artificial
chromosome widely used as a vector for genomic clones in sequencing projects
involving large genomes. As the name implies, YACs are propagated in yeast (Saccharomyces).
A typical YAC clone can contain fragments up to ~2 Mb. A major problem with YAC clones is the
tendency to rearrange in the host. YAC technology has largely been
replaced by BAC cloning vectors.
Revised September 2, 2002
Questions or Comments?
Write to Peter Cooper