NCBI Glossary





Abstract Syntax Notation 1 (ASN.1)

ASN.1 is a standard data description language that is used for encoding structured data. ASN.1 allows both the content and the structure of the data to be read by and exchanged between a variety of computer programs and platforms. ASN.1 is the language used to store and manipulate data at the NCBI. All NCBI software reads and writes ASN.1.

Accession Number

The accession number is the most general identifier used in the NCBI sequence databases. This is the identifier that should be used when citing a database record in a publication. The accession number points to a sequence record and does not change when the sequence is modified. In the Entrez system, using the accession number as a query will retrieve the most recent version of the record. The update history of a particular sequence record is tracked by the accession.version number. Changes in version numbers occur only when the actual sequence of a record has been modified and do not reflect any changes in the annotation. The specific version of a record is also tracked by another identifier that is mainly for internal NCBI use called the GI number.
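As an illustration of the accession.version convention, the following minimal Python sketch splits an identifier into its stable accession and its version. The function name and the sample identifier are made up for this example and are not part of any NCBI tool.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    Returns (identifier, None) when no numeric version suffix is present.
    """
    text = identifier.strip()
    accession, dot, version = text.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return text, None


# "X00000.2" is a hypothetical identifier used only for demonstration.
print(split_accession("X00000.2"))
print(split_accession("X00000"))
```

A query on the bare accession corresponds to the first element of the pair; the second element changes only when the sequence itself changes.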


Algorithm

An algorithm is a formal stepwise path to solving a problem, for example the problem of finding high-scoring local alignments between two sequences. Algorithms are the basis of computer programs.

Alignment Score

The alignment score is a number assigned to a pairwise or multiple alignment of sequences that provides a numerical value reflecting the quality of the alignment. Alignment scores are usually calculated by referring to some sort of substitution table or alignment scoring matrix and summing the values for each pair or column in the alignment. (See also raw score and bit score). With certain scoring matrices, high scores of local ungapped alignments between two random sequences have the special property of following the extreme value distribution. This property allows a significance level to be assigned to local alignment scores obtained from database searches using such tools as BLAST and FASTA. (See also Expect value.)
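The summing step can be sketched in a few lines of Python. This is a simplified illustration for an ungapped pairwise alignment; the match and mismatch values are arbitrary assumptions chosen for the example, not the parameters of any particular program.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-3):
    """Sum per-column scores for an ungapped pairwise alignment.

    match and mismatch are illustrative values, not official defaults.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# Three matching columns and one mismatch.
print(alignment_score("ACGT", "ACGA"))
```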

Alignment Scoring Matrix

A scoring matrix is a table of values used to assign a numerical score to a pair or column of aligned residues in a sequence alignment. The simplest kind, an identity matrix, assigns a high value for a match and some low, often negative value, for a mismatch. The identity matrix is used in the NCBI's nucleotide-nucleotide BLAST program. Protein alignment scoring matrices are usually more complicated and take into account the relative abundance of the amino acids in real proteins and the observation that some amino acids substitute for each other more readily in related proteins (e.g., Phe and Tyr) and others do not (e.g., Phe and Asp). One way of generating such a matrix is to examine alignments of real proteins that are known to be homologous (see Homolog) and tabulate the substitution frequencies of the various amino acid pairs at all positions. The resulting frequency table is then converted to a log-odds additive matrix by taking the log of the ratio of the observed substitution frequency for a particular pair and the background substitution frequency. The PAM and the BLOSUM series are examples of widely used protein-scoring matrices that are derived in this way. The matrices described above do not take into account differences in substitution frequencies at different positions in the alignments. More sensitive position-specific scoring matrices can also be generated. Scores of local alignments of random sequences derived from these log-odds matrices are described by the extreme value distribution. Thus, significance levels can be assigned to results of database searches with these matrices using tools such as BLAST and FASTA. (See also Expect value.)
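The log-odds conversion described above can be sketched as follows. The frequencies are invented toy numbers, and the base-2 logarithm is an assumption made for the example; published matrices also apply scaling and rounding that are not shown here.

```python
import math


def log_odds(observed_freq, background_freq):
    """Log (base 2) of the ratio of observed to background pair frequency."""
    return math.log2(observed_freq / background_freq)


# Toy numbers: a pair seen 4x more often than expected scores positive,
# a pair seen 4x less often than expected scores negative.
print(log_odds(0.02, 0.005))
print(log_odds(0.001, 0.004))
```

Because the scores are logarithms, adding them across the columns of an alignment corresponds to multiplying the underlying odds ratios.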


Alu

Alus are the most common class of short, interspersed, repetitive element (SINE) in the human genome. Alus may account for more than 10% of the human genome. They appear to be derived from a signal recognition particle pseudogene. The name Alu derives from the fact that these elements usually contain an AluI restriction enzyme recognition site.


Assembly

A sequence assembly is a large sequence or ordered set of sequences that may be derived from overlapping smaller sequences and sometimes anchored to a genome or chromosome scale map using information from STS content and other evidence.




Bacterial Artificial Chromosome (BAC)

A BAC is a large insert cloning vector capable of handling large segments of cloned DNA, typically around 150 kb. BACs can be propagated in laboratory strains of Escherichia coli. These vectors are used in the construction of genomic libraries for genome scale sequencing projects including human, mouse, Arabidopsis, and rice.


BankIt

BankIt is a Web form for submitting sequences to GenBank.

Basic Local Alignment Search Tool (BLAST)

BLAST is the NCBI's sequence similarity search tool. It finds high-scoring local alignments between a query sequence and nucleotide and protein database sequences. Although BLAST is less sensitive than the complete Smith-Waterman algorithm, it provides a useful compromise between speed and sensitivity, especially for searching large databases. Because BLAST reports back local alignment scores, it provides statistics that may allow biologically interesting alignments to be distinguished from chance alignments.

Bit Score

The bit score represents the information content in a sequence alignment. It is expressed in base 2 log units. The bit score is in essence a normalized score adjusted by database and matrix scaling parameters. Hence, bit scores for different searches may be compared and only the search space size is needed to calculate the significance (Expect value) of the score. The relationship between the Expect value (E) and the bit score (S') is given by Equation 3, where m and n are the lengths of the two sequences being compared:

\[ E = m n \, 2^{-S'} \tag{3} \]
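A short numerical sketch may help here. It assumes the standard Karlin-Altschul normalization, S' = (lambda * S - ln K) / ln 2, and E = m * n * 2^(-S'). The lambda and K values passed in below are placeholders chosen for illustration and are not tied to any particular matrix or database.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_bits(bits, m, n):
    """Expect value from a bit score and the search space size m * n."""
    return m * n * 2.0 ** (-bits)


# Placeholder parameters for illustration only.
s_prime = bit_score(raw_score=50, lam=0.3, k=0.1)
print(s_prime, expect_from_bits(s_prime, m=100, n=1000))
```

Once the score is in bits, the scoring-system parameters no longer appear in the Expect value calculation; only the search space size is needed.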


BLOSUM

The BLock SUbstitution Matrices are a set of protein log-odds alignment scoring matrices calculated from substitution frequencies obtained from ungapped multiple alignments of real proteins. Each BLOSUM matrix is identified with a number that indicates the percent identity cut-off for inclusion in that matrix. For example, BLOSUM62 includes substitution information for proteins up to 62% identical in the alignment, and BLOSUM90 up to 90% identical. Each BLOSUM matrix works best at finding proteins at a particular level of similarity. Hence, BLOSUM90 is better at finding more closely related proteins, whereas BLOSUM62 is best at finding more distantly related ones. Experiments have shown that BLOSUM62 also works well at finding similar proteins. For this reason, BLOSUM62 is the default protein scoring matrix for NCBI BLAST.





Clone

In the molecular sense, a clone is a physical copy of a piece of DNA. The term is most often used to refer to the recombinant cloning vector DNA containing this copy such as a plasmid, BAC, or bacteriophage DNA that can be propagated in a bacterial or other microbial host.


Cluster

A cluster is a group of sequences associated with each other, usually by some procedure that relies on sequence similarity. Such clusters of sequences are used to produce the UniGene datasets and the clusters of orthologous groups (COGs) dataset.

Clusters of Orthologous Groups (COGs)

A COG is a group of related proteins or groups of proteins (paralogs) from different genomes that are thought to derive from a common ancestral gene. COGs are formed based on sequence similarity using a BLAST-based approach. COGs originally were made for the complete microbial genomes, but the dataset is expanding to include more complex organisms. The COGs data are very useful for annotating genes on microbial genomes and can be used to provide potential functional classification for uncharacterized proteins. (See also paralog and ortholog.)


Cn3D

Cn3D (pronounced "see in three dee") is NCBI's structure viewer. It reads Entrez structure data and renders either single structures or structural alignments from the NCBI's molecular modeling database (MMDB). Cn3D functions as a helper application to the Web browser and will launch automatically when the browser downloads NCBI structure data. Cn3D can also function as a stand-alone viewer and can act as a network client to download structures from NCBI. It also has a built-in BLAST and threading capability and can create sequence alignments to fit similar sequences to known structures.

Conserved Domain Architecture Retrieval Tool (CDART)

CDART provides a graphical browser that allows one to find proteins with a similar domain architecture (content and arrangement) beginning with the results of a CDD search.

Conserved Domain Database (CDD) Search

CDD Search uses reverse position-specific BLAST (RPS-BLAST) to identify conserved domains contained in a protein query. CDD databases are position-specific scoring matrices (PSSMs) created from multiple sequence alignments from three domain databases: SMART, PFAM, and LOAD.


Contig

Contig is short for contiguous sequence. Contigs are assembled overlapping primary sequences. The term contig arises in two different contexts in the NCBI databases. Draft sequences (HTG division) will contain two or more contigs assembled from sequencing reads made from plasmid libraries for that clone. The NCBI also produces contigs made by assembling overlapping GenBank records from large-scale genome projects, such as the human genome project. These contigs are included in the NCBI RefSeq databases and are assigned accession numbers beginning with the prefix NT_.

Curated Database

A curated database is a derivative database containing molecular records that are compiled and edited from primary molecular data by experts who maintain and are responsible for the content of the records. The Swiss-Prot database is an important example of a curated protein sequence database. The NCBI produces a curated non-redundant RefSeq dataset of transcripts and proteins for important organisms.




Derivative Database

In molecular biology, a derivative database contains information derived and compiled from primary molecular data but includes some type of additional information provided by expert curators or automated computational procedures.

DNA Databank of Japan (DDBJ)

A primary nucleotide sequence database that is maintained as part of the Center for Information Biology and DNA Data Bank of Japan (CIB/DDBJ) under the National Institute of Genetics (NIG) in Japan. DDBJ began accepting DNA sequence submissions in 1986 and is a part of the International Nucleotide Sequence Database Collaboration that also includes GenBank and the EMBL nucleotide sequence database.


Domain

A domain is a discrete structural unit of a protein. In principle, protein domains are capable of folding independently from the rest of the protein. Domains can often be identified by non-structural approaches based on conserved amino acid sequences. The NCBI's CDD-search uses information from curated multiple sequence alignments to identify domains in protein sequences.

Draft Sequence

Draft sequence is unfinished genomic or cDNA sequence. See HTG and HTC.




Electronic PCR (e-PCR)

e-PCR is an analysis tool that tests a DNA sequence for the presence of sequence tagged sites (STSs). e-PCR looks for STSs in DNA sequences by searching for subsequences that closely match the PCR primers and have the correct order, orientation, and spacing such that they could plausibly prime the amplification of a PCR product of the correct length.
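The idea can be sketched in Python as below. This is a deliberately simplified illustration that requires exact primer matches on the forward strand, whereas e-PCR itself looks for close matches; the function names, the toy sequences, and the tolerance parameter are assumptions made for the example.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts(template, forward, reverse, expected_size, tolerance=0):
    """Return (start, product_size) pairs where the primer pair could
    prime a product of about expected_size (exact matches only)."""
    hits = []
    reverse_site = reverse_complement(reverse)
    start = template.find(forward)
    while start != -1:
        end = template.find(reverse_site, start + len(forward))
        while end != -1:
            size = end + len(reverse_site) - start
            if abs(size - expected_size) <= tolerance:
                hits.append((start, size))
            end = template.find(reverse_site, end + 1)
        start = template.find(forward, start + 1)
    return hits


# Toy template: forward primer, 5-base spacer, reverse primer site.
template = "TT" + "ACGTAC" + "GGGGG" + "TTCAGA" + "CC"
print(find_sts(template, "ACGTAC", "TCTGAA", expected_size=17))
```

Order and orientation are enforced by searching for the reverse primer's complement downstream of the forward primer, and spacing is enforced by the product-size check.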

European Molecular Biology Laboratory (EMBL) Database

A nucleotide sequence database produced and maintained at the European Bioinformatics Institute (EBI) in Hinxton, UK, that collaborates with GenBank and the DNA Data Bank of Japan (DDBJ) to form the International Nucleotide Sequence Database Collaboration.


Ensembl

Ensembl is a joint project between EBI-EMBL and the Sanger Institute to provide automatic annotation of eukaryotic genomes.


Entrez

Entrez is a search and retrieval system that integrates information from various databases at NCBI, including nucleotide and protein sequences, 3D structures and structural domains, genomes, variation data (SNPs), gene expression data, genetic mapping data, population studies, OMIM, taxonomy, books online, and the biomedical literature.

European Bioinformatics Institute (EBI)

A non-profit academic organization that performs research in bioinformatics and maintains the EMBL nucleotide sequence database.

Evidence Viewer

A feature within the human genome Map Viewer that provides a graphical display of the molecular evidence supporting the existence of a gene model. The Evidence Viewer (ev) displays reference sequences, GenBank mRNAs, annotated known or potential transcripts, and ESTs that align to the genomic area of interest.

Expect Value (E-value)

In BLAST statistics, the Expect value is the number of alignments with a particular score, or a better score, that are expected to occur by chance when comparing two random sequences. The relationship between the Expect value (E) and the alignment score (S) is given by Equation 1:

\[ E = K m n \, e^{-\lambda S} \tag{1} \]

In Equation 1, e is the base of the natural logarithm scale, n and m are the lengths of the two sequences, essentially the search space size for database searching, and K and lambda are scaling factors for the search space and the scoring system, respectively. The bit score incorporates lambda and K so that scores can be meaningfully compared when different databases and scoring systems are used.
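To make the role of each term concrete, the sketch below evaluates the Expect value in the standard Karlin-Altschul form, E = K * m * n * e^(-lambda * S), which matches the symbols defined above. The numeric values of lambda and K are placeholders chosen for illustration, not parameters of any specific scoring system.

```python
import math


def expect_value(raw_score, m, n, lam, k):
    """Expected number of chance alignments scoring at least raw_score.

    m, n are the sequence lengths (search space size); lam and k are the
    scaling factors for the scoring system and the search space.
    """
    return k * m * n * math.exp(-lam * raw_score)


# Placeholder parameters: a higher score gives a smaller Expect value,
# and a larger search space gives a larger one.
for score in (20, 40, 60):
    print(score, expect_value(score, m=300, n=1_000_000, lam=0.3, k=0.1))
```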

Expressed Sequence Tag (EST)

A short (300-1000 nucleotide), single-pass, single-read DNA sequence derived from a randomly picked cDNA clone. EST sequences comprise the largest GenBank division. There are numerous high-throughput sequencing projects that continue to produce large numbers of EST sequences for important organisms. Many ESTs are classified into gene-specific clusters in the UniGene data set.





FASTA

A sequence similarity search tool developed by William Pearson and David Lipman. The term FASTA is also used to identify a text format for sequences that is widely used. A FASTA-formatted sequence file may contain multiple sequences. Each sequence in the file is identified by a single line title preceded by the greater than sign (">").
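The FASTA text format is simple enough to read with a few lines of code. The following is a minimal Python sketch, not a full-featured parser; the function name and the two toy records are made up for the example.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = ">seq1 toy record\nACGT\nAC\n>seq2\nGG\n"
for title, sequence in parse_fasta(example):
    print(title, sequence)
```

Sequence lines that follow a title are concatenated, so a sequence may be wrapped across any number of lines.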

Feature Table

The feature table is the portion of the GenBank record that provides information about the biological features that have been annotated on the nucleotide sequence, including coding and non-coding regions, genes, variations, and sequence tagged sites. The International Sequence Database Collaboration produces a document describing and identifying allowed features on GenBank, DDBJ, and EMBL records.

File Transfer Protocol (FTP)

FTP is a standard Internet protocol used to transfer files to and from a remote network site.

Fluorescence in Situ Hybridization (FISH) map

A FISH map is a cytogenetic map derived from the localization of fluorescently-labeled probes to chromosomes. Genes are mapped according to their cytogenetic (band position) location on the chromosome.





GenBank

GenBank is a primary nucleotide sequence database produced and maintained at the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) in Bethesda, MD, USA. GenBank collaborates with EMBL and DDBJ to form the International Nucleotide Sequence Database Collaboration.

GenBank Division

GenBank divisions are partitions of the GenBank data into categories based on the origin of the sequence. At first the GenBank divisions were established so that one division would be one file in the GenBank distribution. However, the number of GenBank divisions has not kept pace with the growth of the sequence data; the EST division now has over 150 files. There are currently 17 GenBank divisions.

GenBank Flatfile Format

This is the format of the sequence records in the GenBank flatfile release. This is a text-only format containing multiple entries or records. Each record in the large text file, also called a flatfile, begins with a LOCUS line and ends with a single line consisting of a pair of forward slashes ("//"). The term "GenBank format" is often used to refer to the format of individual records within the flatfile. Each record contains a header containing the database identifiers, the title of the record, references, and submitter information. The header is followed by the feature table and then the sequence itself. The GenBank flatfile is described in detail in the GenBank release notes. In the Entrez system, the GenBank format is the default display format for non-bulk sequence entries.
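Because each record starts with a LOCUS line and ends with a "//" line, a flatfile can be split into records with a simple scan. The sketch below is a minimal illustration in Python; the function name is hypothetical and the sample text is a toy stand-in, not a real GenBank entry.

```python
def split_flatfile(text):
    """Split flatfile text into records delimited by LOCUS ... //."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            current = [line]            # start of a new record
        elif line.strip() == "//":
            if current:
                records.append("\n".join(current))
            current = []                # record terminator
        elif current:
            current.append(line)
    return records


# Toy text with the record delimiters only; real records carry far more.
toy = "LOCUS       A\nDEFINITION  first toy record\n//\nLOCUS       B\n//\n"
print(len(split_flatfile(toy)))
```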

Gene Expression Omnibus (GEO)

GEO is a primary database at the NCBI that is an archived repository for gene expression data derived from different experimental platforms.

Gene Model

A gene model is a mapping of gene features such as coding regions and exon-intron boundaries onto the genomic DNA of an organism. Gene models typically provide a predicted transcript and protein sequence. A simple kind of gene model can be made by aligning an expressed sequence (cDNA) to the genomic DNA sequence. More precise exon-intron boundaries can be identified by constraining the aligned segments using consensus splicing signals. This type of alignment-based gene model is used to generate many of the NCBI RefSeq model transcripts for higher genomes. Gene features can also be predicted computationally in the absence of aligned expressed sequences. The simplest candidate gene predictions can be made on microbial genomic DNA by searching for long open reading frames. Database sequence similarity searches with the predicted translations of these ORFs are used to support these gene predictions. Computational gene prediction in higher eukaryotic genomes is complicated by the interruption of gene coding regions by intronic sequences. There are a number of methods that are used in eukaryotic gene prediction. The NCBI uses the program GenomeScan to annotate putative genes on the human, mouse and rat genomes.

Genetic Linkage Map

A linkage map is an ordered display of genetic information referenced to linkage groups (ultimately chromosomes) in a genome. The mapping units (centiMorgans) are based on recombination frequency between various polymorphic markers traced through a pedigree. One centiMorgan equals one recombination event in 100 meioses.
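As a worked example of the unit, the sketch below converts observed recombination counts into centiMorgans. It uses the direct proportion stated above, which is an approximation that is reasonable only for closely linked markers; the function name and the counts are invented for illustration.

```python
def centimorgans(recombinants, meioses):
    """Approximate map distance: recombination events per 100 meioses."""
    if meioses <= 0:
        raise ValueError("meioses must be positive")
    return 100.0 * recombinants / meioses


# Invented counts: 3 recombinants observed in 200 meioses.
print(centimorgans(3, 200))
```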

Genetics Computer Group (GCG)

The GCG is a bioinformatics software development group, originally at the Department of Genetics at the University of Wisconsin, then later existing as a private company, and merging with Oxford Molecular, MSI and Synopsys to collectively form Accelrys. GCG is widely known for its sequence analysis software package, properly known as the Wisconsin Package. The initials GCG have been widely used as a synonym for that package.

Genome Survey Sequence (GSS)

GSS sequences comprise a bulk sequence division of GenBank. GSS sequences are in essence the genomic equivalent of the ESTs. The GSS division contains first pass, single reads of genomic DNA. Typical GSS records are initial sequencing surveys and end reads of large insert clones from genomic libraries, exon-trapped genomic sequences and Alu PCR sequences.


GenomeScan

GenomeScan is a gene prediction program (algorithm) developed by Christopher Burge at the Massachusetts Institute of Technology. This is the algorithm used at the NCBI to produce gene models for higher genomes.

GI Number

The GI number is an identifier assigned to all sequences at the NCBI. The GI number points to a specific version of a sequence record. This identifier is largely superseded by the accession.version number for outside users. GI stands for GenInfo, a database system at NCBI that preceded the Entrez system.

Global Alignment

A global alignment is a sequence alignment that extends the full length of the sequences that are being compared. Global alignment procedures usually will produce an alignment that includes the entire length of all sequences, including regions that are not similar, and can be made to produce meaningless alignments between unrelated sequences. Compare with local alignment.

Golden Path

The Golden Path refers to the human and mouse genome annotation and assembly projects at the University of California Santa Cruz (UCSC).




High Throughput Genomic Sequence (HTG)

HTG sequences comprise a GenBank division containing unfinished genomic sequence. HTG records typically are incomplete sequence assemblies of BAC or other large insert clones. GenBank recognizes four stages of completion (phases) for these sequences. Phase 0 records contain one or a few single pass reads of a given genomic clone. Phase 1 records contain two or more assembled contigs of the sequence data; however, the contigs are unordered and unoriented and there are still gaps in the sequence. Phase 2 records also contain two or more contigs with gaps, but the order and orientation are known. Once the sequence gaps are resolved, and there is enough sequence coverage to give an accuracy of 99.99%, the record moves to phase 3 and leaves the HTG division for the appropriate taxonomic GenBank division; a human sequence would move to the primate division (PRI), a mouse sequence to the rodent division (ROD).

High Throughput cDNA (HTC)

HTC is a GenBank division containing draft cDNA sequences. HTC records are similar to ESTs, but often contain more information. Unlike ESTs but like the genomic draft (HTG) records, HTC sequences may be updated with additional sequence data and move to the appropriate traditional division of GenBank.


Homolog

Two biological entities (structures or molecules) are said to be homologues (or are homologous) if it is thought that they descend from a common ancestral structure or molecule. Corresponding body parts and genes in different or the same species can be homologous. The term has often been extended to include sequences as well. However, it is incorrect to report a relative homology or percent homology as is sometimes said of sequences; genes or sequences are either homologous or they are not. See also orthologue and paralogue.

Human Genome Nomenclature Committee

The HGNC is a non-profit organization located at University College London that assigns authoritative and unique gene names and symbols for all known human genes.

Human Mouse Homology Maps

The human mouse homology maps show the syntenic chromosome regions between the two organisms and allow the corresponding sequences and other related information to be retrieved from one organism given a gene or map location in the other. The data used to construct these homology maps are derived from UCSC and NCBI human genome assemblies and the mouse MGD genome map and Whitehead/MRC radiation hybrid maps.




International Sequence Database Collaboration (ISDC)

The ISDC involves the three major primary nucleotide sequence repositories GenBank, the DDBJ (DNA Data Bank of Japan), and the EMBL (European Molecular Biology Laboratory) databases. Each database has its own set of submission and retrieval tools, but the three exchange data daily and have shared standards for sequence submission and annotation. All three share data so that all contain the same set of sequence data.

Interspersed Repeats

Interspersed repetitive sequences are primarily degenerate copies of transposable elements - also called mobile elements - that, in humans, comprise over a third of the genome. The most common mobile elements are LINEs and SINEs (long and short interspersed nuclear elements, respectively). The Alu families of repeats are the primary SINEs in primates.


LINE

Long interspersed nuclear elements are a class of transposable element, also called an interspersed repeat. These constitute about 20% of the human genome. A typical LINE is 6 kb long and encodes a reverse transcriptase and a DNA-nick-looping enzyme, allowing it to move about the genome autonomously. LINEs are also called non-LTR retrotransposons.


LinkOut

LinkOut is a registry service to create links from specific articles, journals, or biological data in Entrez to resources on external web sites. Third parties can provide a URL, resource name, brief description of their web site, and specification of the NCBI data from which they would like to establish links.


LOAD

LOAD is the library of ancient domains, a small number of conserved domain alignments that add to the position specific scoring matrices (PSSMs or profiles) in the Conserved Domain Database (CDD) at NCBI. The majority of domains in CDD come from the databases SMART (Simple Modular Architecture Research Tool) and Pfam.

Local Alignment

A local alignment is a high scoring alignment between sub-sequences of two or more longer sequences. Unlike a global alignment, there may be multiple high scoring local alignments between sequences. Local alignments are useful for database searches because their scores can be used to assess the biological significance of the alignments found. (See also Alignment Score and Expect Value.) Local alignments are produced by the popular sequence similarity search tools BLAST and FASTA.
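To illustrate what "high scoring alignment between sub-sequences" means, the sketch below computes the best local alignment score with the Smith-Waterman dynamic programming recurrence and a linear gap penalty. It is not how BLAST or FASTA work internally (both use faster heuristics), and the match, mismatch, and gap values are arbitrary assumptions for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Scoring values are illustrative; the gap penalty is linear.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below zero, which lets an alignment
            # start and end anywhere inside the two sequences.
            curr[j] = max(0, prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment "ACG" is found despite unrelated flanking sequence.
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```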


LocusLink

LocusLink is an NCBI resource that provides a single query interface to curated sequence and descriptive information about genetic loci. It is a good place to begin a search for information about a particular gene. LocusLink currently contains human, mouse, rat, zebrafish, fruit fly and HIV-1 loci.

Low Complexity Sequence

Low complexity sequence is a region of amino acid or nucleotide sequence with a biased residue composition. Low complexity sequence includes homopolymeric runs, short-period repeats, and some subtler over-representation of one or a few residues. Such sequences often look very redundant, for example the protein sequence PADPPPDPPPP or the nucleotide sequence AAATTTAAAAAT. Low-complexity regions can result in misleading high scores in sequence similarity searches. These scores reflect compositional bias rather than significant position-by-position alignment. Filter programs are usually used to eliminate these potentially confusing matches from sequence similarity search results. The NCBI BLAST programs use filters that replace low complexity regions in the query sequence with an anonymous residue (n for nucleic acid, X for amino acid). These regions are thus effectively removed from the search because these anonymous residues are treated as mismatches by the BLAST programs.
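The masking idea can be illustrated with a toy filter that handles only the simplest case, homopolymeric runs. This is not the algorithm used by the BLAST filter programs, which also detect short-period repeats and subtler compositional bias; the function name and the run-length threshold are assumptions made for the example.

```python
import re


def mask_homopolymers(seq, min_run=4, mask_char="X"):
    """Replace runs of min_run or more identical residues with mask_char."""
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


# Only the final run of four P residues reaches the threshold.
print(mask_homopolymers("PADPPPDPPPP"))
```

The masked sequence keeps its original length, so alignment coordinates reported against it still refer to the unmasked query.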




Map Viewer

The Map Viewer is a software component of the NCBI Entrez Genomes that provides special browsing capabilities for genomes of higher organisms. It allows one to view and search an organism's complete genome, display chromosome maps, and zoom into progressively greater levels of detail, down to the sequence data. If multiple maps are available for a chromosome, it displays them aligned to each other based on shared marker and gene names, and, for the sequence maps, based on a common sequence coordinate system. The number and types of available maps vary by organism, but include maps for: genes, contigs, BAC tiling path, STSs, FISH mapped clones, ESTs, GenomeScan models, and SNPs.


MEDLINE

MEDLINE is the NLM's premier bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences. MEDLINE contains bibliographic citations and author abstracts from more than 4,600 biomedical journals published in the United States and 70 other countries. The file contains over 11 million citations dating back to mid-1960. Coverage is worldwide, but most records are from English-language sources or have English abstracts. MEDLINE is included in PubMed, which contains additional citations.


MegaBLAST

MegaBLAST is a local pairwise nucleotide alignment tool that is optimized for finding long alignments between nearly identical sequences. MegaBLAST is most useful for comparing sequences from the same species, and is particularly suited to such tasks as clustering ESTs, aligning genomic clones or aligning cDNA sequences and genomic DNA. MegaBLAST can be up to 10 times faster than many standard sequence similarity programs, including standard nucleotide-nucleotide BLAST. It also efficiently handles much longer DNA sequences. MegaBLAST is the only BLAST program on the NCBI's web site that can perform batch searches.

Model Maker

Model Maker is a tool associated with the Map Viewer that allows one to view the evidence (mRNAs, ESTs, and gene predictions) that was aligned to assembled genomic sequence in order to build a gene model. Model Maker also allows editing the model by selecting or removing putative exons. Model Maker can then display the mRNA sequence and potential ORFs for the edited model, and save the mRNA sequence data for use in other programs. Model Maker is accessible from sequence maps displayed in the Map Viewer. To see an example, follow the "mm" link beside any gene annotated on the human "Gene_Sequence" map in the Map Viewer.

Molecular Modeling Database (MMDB)

NCBI's structure database, MMDB, contains experimentally determined, three-dimensional, biomolecular structures obtained from the Protein DataBank (PDB); the PDB's theoretical models are not imported. MMDB was designed for flexibility, and as such, is capable of archiving conventional structural data as well as future descriptions of biomolecules, such as those generated by electron microscopy (surface models). Most 3D-structure data are obtained from X-ray crystallography and NMR-spectroscopy.


Motif

A motif is a short, well-conserved nucleotide or amino acid sequence that represents a minimal functional domain. It is often a consensus for several aligned sequences. The PROSITE database is a popular collection of protein motifs, including motifs for enzyme catalytic sites, prosthetic group attachment sites (heme, biotin, etc), and regions involved in binding another protein. Examples of DNA motifs are transcription factor binding sites.



The National Center for Biotechnology Information (NCBI)

The NCBI is a division of the National Library of Medicine at the National Institutes of Health in Bethesda, MD. The NCBI was established in 1988 to create automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics; to support the use of such databases and software by the scientific community; to coordinate efforts to gather biotechnology information both nationally and internationally; and to perform research in computational biology. Currently the NCBI maintains the GenBank database along with several related databases.

The National Institute of Genetics (NIG)

The National Institute of Genetics (NIG) was established in 1949 in Mishima, Japan and reorganized in 1988 as an inter-university research institute in genetics. The Institute currently provides graduate education in genetics and also maintains the DNA Data Bank of Japan.

Nonredundant (nr)

Nonredundant is a term used to describe nucleotide or amino acid sequence databases that contain only one copy of each unique sequence. Nonredundant databases have the advantage of smaller size and, therefore, shorter search times and more meaningful statistics. The default database on most protein BLAST web pages is labeled "nr". This is a nonredundant database where multiple copies of the same sequence, such as the corresponding sequences of the same protein from SWISS-PROT, PIR, and GenPept, are combined to make one sequence entry. The default nucleotide database on the standard nucleotide-nucleotide BLAST web page is also labeled "nr", but is no longer a nonredundant database.
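The merging of identical sequences into one entry can be sketched as below. This is a minimal illustration that groups records by exact sequence identity; the function name and the record labels are invented for the example and do not reflect how the nr database is actually built.

```python
def make_nonredundant(records):
    """Group (name, sequence) records so each unique sequence appears once.

    Returns a dict mapping each sequence to the list of names that share it.
    """
    merged = {}
    for name, sequence in records:
        merged.setdefault(sequence, []).append(name)
    return merged


# Invented labels: two source records carry the same toy sequence.
records = [("sourceA|1", "MKV"), ("sourceB|7", "MKV"), ("sourceC|3", "MKL")]
for sequence, names in make_nonredundant(records).items():
    print(sequence, names)
```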




Online Mendelian Inheritance in Man (OMIM)

OMIM is a catalog of human genes and genetic disorders authored and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins and elsewhere, and developed for the World Wide Web by NCBI. The database contains textual information, references, and copious links to MEDLINE and sequence records in the NCBI's Entrez system, plus links to additional related resources at NCBI and elsewhere.

Open Reading Frame (ORF)

An ORF is a DNA (or mRNA) sequence that is potentially able to encode a polypeptide. ORFs begin with a start codon (ATG) and are read in triplets until they end with a stop codon (TAA, TGA, or TAG in the standard code). The NCBI ORF finder is useful for identifying ORFs in cDNA or in intron-less genomic DNA.
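
A minimal Python sketch of this definition follows. It is not the NCBI ORF finder: the function `find_orfs` is hypothetical, it scans only the three forward-strand reading frames, it assumes the standard code, and it reports an ORF only when an in-frame stop codon is found.

```python
START = "ATG"
STOPS = {"TAA", "TGA", "TAG"}  # stop codons in the standard code


def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `start` is the 0-based index of the A in ATG and `end` is the index
    just past the stop codon, so dna[start:end] is the whole ORF.
    `min_codons` counts codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the open ATG in this frame, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


dna = "CCATGAAATAGGG"
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])  # prints: 2 ATGAAATAG
```

A fuller tool would also scan the reverse complement and could support alternative genetic codes.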


Orthologs are genes derived from a common ancestor through vertical descent. This is often stated as the same gene in different species. In contrast, paralogs are genes within the same genome that have evolved by duplication.

The hemoglobin genes are a good example. Two separate genes (proteins) make up the molecule hemoglobin (alpha and beta). The alpha and beta DNA sequences are very similar and it is believed that they arose from duplication of a single gene, followed by separate evolution in each of the sequences. Alpha and beta are considered paralogs. Alpha hemoglobins in different species are considered orthologs.




PAM Matrix

The original Percent Accepted Mutation scoring matrix (see M.O. Dayhoff, ed., 1978, Atlas of Protein Sequence and Structure, Vol 15) was derived from observing how often different amino acids replace other amino acids in evolution, and was based on a relatively small dataset of 1,572 changes in 71 groups of closely related proteins. Further, matrix values are based on the model that one sequence is derived from the other by a series of independent mutations, each changing one amino acid in the first sequence to another amino acid in the second. PAM250 was a very popular matrix, but is now often replaced by the BLOSUM series of matrices, particularly when looking for more distantly related proteins. Lower-numbered PAM matrices correspond roughly to higher-numbered BLOSUM matrices.


Paralogs are usually described as genes within the same genome that have evolved by duplication. See Ortholog.

PFAM database

Pfam is a database of conserved protein regions or domains. It is one of three databases that make up the NCBI's Conserved Domain Database (CDD). The other two are SMART and LOAD.


A PopSet is a set of DNA sequences that have been collected to analyze the evolutionary relatedness of a population. The population could originate from different members of the same species, or from organisms from different species. PopSets are submitted to GenBank via the program Sequin, often as a sequence alignment.

Pattern Hit Initiated BLAST (PHI-BLAST)

PHI-BLAST is a variation of BLAST that is designed to search for proteins that both contain a pattern specified by the user, and are similar to the query sequence in the vicinity of the pattern. This dual requirement is intended to reduce the number of database hits that contain the pattern and are likely to have no true homology to the query.

Position Specific Iterated BLAST (PSI-BLAST)

PSI-BLAST is a derivative of protein-protein BLAST that is more sensitive because it incorporates position specific substitution rates in the scoring system. This makes PSI-BLAST useful for finding very distantly related proteins. PSI-BLAST works by first generating a position specific score matrix (PSSM) from the sequences found from a standard BLAST search. The database is then searched with the PSSM. PSI-BLAST can be run in multiple iterations, with a new PSSM being made from the new information collected from the previous search.

Position Specific Scoring Matrix (PSSM)

A PSSM is an alignment scoring matrix that provides substitution scores for each position in a protein sequence. PSSMs are often based upon the frequencies of each amino acid substitution at each position of a protein sequence alignment. This gives rise to a scoring matrix that has the length of the alignment as one dimension and the possible substitutions as the other. In a PSSM a particular substitution, for example Ser substituting for Thr, can have a different score at different positions in the alignment. This is in contrast to a position independent matrix like BLOSUM62, where the Ser-Thr substitution gets the same score no matter where it occurs in the protein. PSSMs are more realistic models for related protein sequences since substitution rates are expected to vary across the length of a protein; some aligned positions, such as the active site residues, are more important than others.
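
The idea of a frequency-based PSSM can be sketched as below. This is a simplified illustration and not the method PSI-BLAST uses: the uniform background frequency, the pseudocount of 1, the log2-odds scaling, and the names `build_pssm` and `score_sequence` are all assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build one {residue: log2-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            if seq[col] in counts:  # gap characters are simply skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores of an ungapped sequence."""
    return sum(column[res] for column, res in zip(pssm, sequence))


alignment = ["GST", "GSS", "GTS"]  # toy ungapped alignment
pssm = build_pssm(alignment)

# The same residue scores differently depending on the column.
print(round(pssm[0]["T"], 2), round(pssm[1]["T"], 2))
```

In this toy alignment Thr never appears in the first column but does appear in the second, so it receives a lower score at position 1 than at position 2.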


In the context of alignments displayed in BLAST output, positives are those non-identical substitutions that receive a positive score in the underlying scoring matrix, BLOSUM62 by default. Most often, positives indicate a conservative substitution or substitutions that are often observed in related proteins.
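
As a rough illustration of the definition above, the sketch below counts identities and positives in a gapped pairwise alignment. The score table `TOY_SCORES` is a small hand-entered set of values for illustration only, not a full scoring matrix, and the function name and the default score for unlisted pairs are assumptions.

```python
# Illustrative scores for a few residue pairs (not a complete matrix).
TOY_SCORES = {
    frozenset("ST"): 1,
    frozenset("IV"): 3,
    frozenset("KR"): 2,
    frozenset("AW"): -3,
}


def count_identities_and_positives(query, subject, scores=TOY_SCORES,
                                   default=-1):
    """Return (identities, positives) for two aligned, equal-length strings.

    Positives follow the definition above: non-identical pairs whose
    score in the table is greater than zero. Gap columns are ignored.
    """
    identities = positives = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            continue
        if q == s:
            identities += 1
        elif scores.get(frozenset((q, s)), default) > 0:
            positives += 1
    return identities, positives


print(count_identities_and_positives("MSIKA", "MTVRW"))  # prints: (1, 3)
```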

Primary Database

A primary sequence database contains sequences submitted by the researchers who originally produced the data. In primary sequence databases the submitters of the sequence control the contents and disposition of the data. GenBank is an example of a primary database. The content, accuracy and updating of GenBank sequences are largely the responsibility of the submitter. This is in contrast to a curated database, such as RefSeq or SWISS-PROT, where additional information is added to each record by the staff maintaining the database.


ProbeSet is a by-experiment view of NCBI's Gene Expression Omnibus (GEO), which is a gene expression and hybridization array repository. ProbeSet is intended to facilitate searches of the GEO database, and to link the search results to internal and external resources where possible.


Protein matches for ESTs (ProtEST) are the best protein matches to translations of EST sequences in UniGene. The nucleotide sequences (mRNAs as well as ESTs) are matched with possible translational products through sequence comparison using BLASTX with an expect value of 1 x 10^-6. The sequences are compared with proteins from eight organisms and the best match in each organism is recorded. UniGene nucleotide sequences can thus have up to eight matches in ProtEST.
In order to exclude protein sequences that are strictly conceptual translations or models, the proteins used in ProtEST are those originating from the structural databases SWISS-PROT, PIR, PDB or PRF.

Protein Data Bank (PDB)

PDB is the repository for the processing and distribution of 3-D biological macromolecular structure data. As of April 2002, PDB contained almost 18,000 structures, including more than 1,000 nucleic acids and 400 theoretical models. Except for theoretical models, the PDB data are used to produce the NCBI's structure database, MMDB, and are included in the default BLAST database ("nr").

Protein Information Resource (PIR)

PIR is a curated protein sequence database produced and maintained by the National Biomedical Research Foundation at Georgetown University in Washington, D.C. PIR protein sequences are included in the BLAST "nr" database and in the Entrez protein system. PIR contains more than 200,000 entries.

Protein Resource Foundation (PRF)

PRF is a protein sequence database maintained in Osaka, Japan, and is one of the protein databases included in BLAST "nr" database searches and in the Entrez protein system. Release 84, March 2002, included 195,660 entries.


PubMed, a service of the National Library of Medicine, provides access to over 11 million MEDLINE citations from more than 4,300 biomedical journals published in the United States and 70 other countries. Citations cover the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences, and date back to the mid-1960s. PubMed includes additional life science journals not found in MEDLINE, as well as links to many sites providing full text articles and other related resources.



Radiation Hybrid (RH) map

A radiation hybrid map is an STS-based physical genome map produced by first breaking chromosomes of a donor cell line with a lethal dose of radiation, and then rescuing the cells by fusion with a recipient cell line. Distances between markers are measured in centirays (cR), with 1 cR representing a 1% probability that a break occurred between two markers.


RasMol is a structure rendering software package produced at the University of Massachusetts. RasMol interprets the native format of structure files from PDB.

Raw Score

A raw score in BLAST output is the non-normalized score of an alignment of a query and target sequence. The raw score is derived directly from the scoring matrix by summing the individual substitution scores of the aligned residues in the alignment. For gapped BLAST the raw score also includes gap penalties.
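
The summation can be sketched as follows. The score table is a small illustrative subset rather than a full matrix, and the gap model (an opening cost plus a per-position extension cost, with the values 11 and 1) is an assumption chosen for the example; the function name `raw_score` is hypothetical.

```python
# Illustrative substitution scores for a few residue pairs.
TOY_SCORES = {
    frozenset("M"): 5,   # M aligned with M
    frozenset("K"): 5,   # K aligned with K
    frozenset("L"): 4,   # L aligned with L
    frozenset("ST"): 1,  # S aligned with T
}


def raw_score(query, subject, scores=TOY_SCORES, gap_open=11, gap_extend=1):
    """Sum substitution scores and subtract gap penalties for an alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    gap_in = None  # which sequence currently holds an open gap, if any
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            if side != gap_in:
                total -= gap_open  # charged once when a gap starts
            total -= gap_extend    # charged for every gap position
            gap_in = side
        else:
            total += scores[frozenset((q, s))]
            gap_in = None
    return total


# 5 (M/M) + 5 (K/K) + 1 (S/T) - 12 (one-residue gap) + 4 (L/L)
print(raw_score("MKS-L", "MKTAL"))  # prints: 3
```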

Reference SNP

Reference single nucleotide polymorphisms (refSNP) are curated dbSNP records that define a non-redundant set of markers used for annotation of reference genome sequence and integration with other NCBI resources. Each refSNP record provides a summary list of submitter records in dbSNP and a list of external resource and database links.


Reference Sequences are curated nucleotide or protein records developed by NCBI staff. They attempt to summarize the available information about a given sequence and to provide the most reliable and up-to-date sequence and annotation. RefSeqs include curated transcripts and proteins, noncoding transcribed RNAs, contig and supercontig assemblies, gene models and chromosome records.

Reverse Position Specific BLAST (RPS-BLAST)

RPS-BLAST is a variation of BLAST in which a protein query sequence is searched against a database of pre-computed Position-Specific Score Matrices as used in PSI-BLAST. This kind of search forms the basis of the CD-Search.




Sequence Alignment

A sequence alignment is a residue by residue comparison of two or more sequences. In the alignment, the relative positions of the sequences are adjusted to optimize (usually maximize) the alignment score derived by reference to some scoring matrix. In some cases gaps with associated penalties may be inserted into one or more sequences to optimize the alignment score.

Sequence Tagged Site (STS)

STSs are sequence records that contain a short sequence of genomic DNA that can be uniquely amplified by the polymerase chain reaction (PCR) using a pair of primers. The primer sequences and PCR conditions are usually included in the record. Sequence tagged sites comprise the STS GenBank division. These markers are used in linkage and radiation hybrid mapping techniques. They are useful for integrating these kinds of mapping data with each other and also with the assembled genomic sequence. The ePCR tool is useful for identifying known STS markers in a DNA sequence.
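
The idea of locating an STS by its primer pair can be illustrated with the sketch below. It is not the ePCR program: the function names are hypothetical, only exact primer matches on the forward strand are considered, and the example sequence and primers are made up.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_sts(sequence, forward, reverse, min_size, max_size):
    """Return (start, product_size) for each primer pair placement.

    The forward primer must match the sequence directly and the reverse
    primer must match as its reverse complement, downstream of the
    forward primer, giving a product within the allowed size range.
    """
    sequence = sequence.upper()
    forward = forward.upper()
    reverse_site = reverse_complement(reverse)
    hits = []
    start = sequence.find(forward)
    while start != -1:
        end = sequence.find(reverse_site, start + len(forward))
        while end != -1:
            size = end + len(reverse_site) - start
            if min_size <= size <= max_size:
                hits.append((start, size))
            end = sequence.find(reverse_site, end + 1)
        start = sequence.find(forward, start + 1)
    return hits


# Made-up sequence and primers for illustration.
seq = "TTACGTACCGGGAAATTTCCATGCAA"
print(find_sts(seq, "ACGTAC", "GCATGG", min_size=15, max_size=30))
# prints: [(2, 22)]
```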


Sequin is a stand-alone application package produced by NCBI that provides a platform for preparing and annotating sequences for submission to GenBank.

Serial Analysis of Gene Expression (SAGE)

SAGE is an experimental method of generating a cDNA library that contains concatenated short (usually ten base) fragments, called tags, of all cDNA species present in the library. These tags may be counted to give a quantitative measure of gene expression in the library. The NCBI SAGE Map resources match SAGE tag sequences to UniGene clusters to identify genes expressed in SAGE libraries and provide several mechanisms for exploring relative expression patterns in SAGE libraries.
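
Tag counting and tag-to-cluster matching can be sketched as below. This is only an illustration of the counting idea, not the SAGE Map pipeline: the tags are assumed to have been extracted already, and the tag sequences, the cluster labels, and the function names are all made up for the example.

```python
from collections import Counter


def count_tags(tags, tag_length=10):
    """Count how often each tag of the expected length was observed."""
    return Counter(t.upper() for t in tags if len(t) == tag_length)


def expression_by_cluster(tag_counts, tag_to_cluster):
    """Sum tag counts per cluster; tags with no known cluster are skipped."""
    totals = Counter()
    for tag, n in tag_counts.items():
        cluster = tag_to_cluster.get(tag)
        if cluster is not None:
            totals[cluster] += n
    return totals


# Hypothetical tags and a hypothetical tag-to-cluster table.
tags = ["AAACCCGGGT", "AAACCCGGGT", "TTTGGGCCCA", "aaacccgggt", "ACGT"]
tag_to_cluster = {"AAACCCGGGT": "cluster_A", "TTTGGGCCCA": "cluster_B"}

counts = count_tags(tags)
print(expression_by_cluster(counts, tag_to_cluster))
```

The truncated tag "ACGT" is dropped, and the three observations of the first tag are credited to its cluster.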

Shotgun Sequencing

Shotgun sequencing is a sequencing method in which a large genomic clone is broken into small segments that are then subcloned and randomly sequenced. Once enough random clones have been sequenced, these random sub-sequences are then assembled to establish the large insert sequence. In some cases, an entire genome may be fragmented and cloned into small insert vectors without first being cloned and arrayed in large insert vectors. This latter technique is called whole genome shotgun sequencing and has been used successfully with many smaller genomes and has provided important preliminary assemblies for the human, mouse and rice genomes.


SINEs (Short Interspersed Nuclear Elements) are transposable repeat elements in the human genome that are typically 100-400 bp, harbor an internal polymerase III promoter, and encode no proteins.

Single Nucleotide Polymorphism (SNP)

Strictly speaking a SNP is a variation or polymorphism in the genome sequence involving a single nucleotide position. The NCBI maintains dbSNP as a primary repository of SNP data. The SNP data at the NCBI also includes some variations involving multiple positions such as repeat polymorphisms.

Spectral Karyotyping and Comparative Genomic Hybridization Database (SKY/CGH database)

SKY/CGH is a repository of publicly submitted data from SKY and CGH, which are complementary fluorescent molecular cytogenetic techniques. SKY facilitates identification of chromosomal aberrations; CGH can be used to generate a map of DNA copy number changes in tumor genomes.


SMART (Simple Modular Architecture Research Tool) is a database of conserved domains that allows automatic identification and annotation of domains in user-supplied protein sequences. The SMART data are used to create one of the sets of PSSMs used in the CD-Search.

Smith-Waterman Algorithm

The Smith-Waterman algorithm is a local alignment computational protocol that uses dynamic programming to find all possible high-scoring local alignments between a pair of sequences. This is the most sensitive local alignment algorithm but is computationally too expensive to be generally useful for high throughput searches of large sequence databases. The BLAST and FASTA programs are generally used in these kinds of applications.
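
A compact sketch of the dynamic programming recurrence is given below. It makes several simplifying assumptions: a fixed match/mismatch score instead of a substitution matrix, a linear gap penalty, and reporting of only one best-scoring local alignment rather than all high-scoring ones. The parameter values are arbitrary.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at i, j
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores are never allowed to drop below zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTAAA", "GGACGTCC"))  # prints: (8, 'ACGT', 'ACGT')
```

Every cell of the matrix is filled, so the running time grows with the product of the two sequence lengths, which is the cost that makes the method expensive for large database searches.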


SWISS-PROT is a highly curated database of protein sequences established in 1986 and currently maintained by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute (EBI).


The TaxBrowser is an aspect of the Entrez system that allows one to browse sequence, genome and structure records based on the taxonomic classification of the source organism. The TaxBrowser allows access at all levels of the taxonomic hierarchy and can be used to acquire records at any taxonomic node.


TrEMBL (Translated EMBL) is a derivative protein dataset that is an automatically annotated supplement to SWISS-PROT. TrEMBL contains all the translations of coding regions of EMBL nucleotide sequence entries. The TrEMBL data set serves as a source of proteins that may eventually be incorporated into SWISS-PROT.





UniGene is a database created and maintained at NCBI as an experimental system for automatically partitioning expressed nucleotide sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the map location and tissue types in which the gene has been expressed. UniGene is particularly important for reducing the redundancy and complexity of EST data and is an important resource for gene discovery.


UniSTS is a resource created and maintained at NCBI that reports information about Sequence Tagged Sites (STS). For each STS, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to other NCBI databases.

Vector Alignment Search Tool (VAST)

An algorithm created at NCBI that searches for three-dimensional structures that are geometrically similar to a query structure by first representing the secondary structure elements of each structure as vectors, and then attempting to align these sets of vectors. VAST is used at the NCBI to establish relationships between structures and create structural alignments in the Entrez system.

Word Size

A parameter of the BLAST algorithm that determines the length of the residue segments (either nucleotides or amino acids) into which BLAST partitions the query sequence. The resulting dictionary of "words" is then used to search the selected sequence database.
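
The partitioning step can be illustrated with the sketch below, which lists every overlapping word of a given size in a query along with the positions where it occurs. It shows only this one step under simple assumptions and is not BLAST's actual implementation; the function name `query_words` is hypothetical.

```python
def query_words(query, word_size):
    """Map each word of length `word_size` to its 0-based query offsets."""
    if word_size < 1:
        raise ValueError("word_size must be at least 1")
    words = {}
    # Slide a window of length word_size along the query, one residue at a time.
    for i in range(len(query) - word_size + 1):
        words.setdefault(query[i:i + word_size], []).append(i)
    return words


for word, offsets in query_words("MKVLAMKV", 3).items():
    print(word, offsets)
```

For the eight-residue example query and a word size of 3 there are six overlapping windows; the word MKV occurs twice, at offsets 0 and 5. A larger word size yields fewer, more specific words.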

Yeast Artificial Chromosome (YAC)

A YAC is a functional (self-replicating) artificial chromosome widely used as a vector for genomic clones in sequencing projects involving large genomes. As the name implies, YACs are propagated in yeast (Saccharomyces). A typical YAC clone can contain fragments up to ~2 Mb. A major problem with YAC clones is the tendency to rearrange in the host. YAC technology has largely been replaced by BAC cloning vectors.

Revised September 2, 2002

Questions or Comments?
Write to Peter Cooper