Electronic Notebook 1

  Making Sense of DNA and Protein Sequences: an Interactive NCBI Mini-Course

Electronic Notebook for Protein Sequence Analysis

Links to glossary terms are marked with an icon. Definitions appear at the bottom of the web page; use the browser's 'back' button to return to the Notebook.

Start here with your DNA sequence

Initial DNA Sequence

To identify any exons in the DNA sequence and generate a predicted protein sequence, click here:


Paste your DNA sequence into the GenScan input window and press the "Run Genscan" button. Select the predicted protein whose exons have the highest P-values, and paste this FASTA-formatted output into your notebook. The exon probability, P(E), is defined as the sum, under the model, of the probabilities of all possible "parses" (gene structure descriptions) that contain the exact exon E in the correct reading frame.
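The definition of P(E) can be illustrated with a small Python sketch. This is not GenScan's algorithm: the parses, exon coordinates, and probabilities below are made-up toy values, and the names `Exon`, `exon_probability`, and `toy_parses` are hypothetical. The sketch only shows the summation described in the definition.

```python
from typing import FrozenSet, List, Tuple

# An exon is identified here by (start, end, reading_frame).
Exon = Tuple[int, int, int]
# A parse is a set of exons together with the probability the model assigns to it.
Parse = Tuple[FrozenSet[Exon], float]


def exon_probability(parses: List[Parse], exon: Exon) -> float:
    """Sum the probabilities of every parse that contains exactly this exon."""
    return sum(prob for exons, prob in parses if exon in exons)


# Toy example (assumed values): three candidate parses whose probabilities sum to 1.
e1 = (100, 250, 0)
e2 = (400, 520, 1)
e2_alt = (400, 530, 1)  # same start, different end: counts as a different exon

toy_parses: List[Parse] = [
    (frozenset({e1, e2}), 0.6),
    (frozenset({e1, e2_alt}), 0.3),
    (frozenset({e2}), 0.1),
]

if __name__ == "__main__":
    for name, exon in [("e1", e1), ("e2", e2), ("e2_alt", e2_alt)]:
        print(name, round(exon_probability(toy_parses, exon), 3))
```

An exon that appears in most of the likely parses receives a P(E) close to 1, which is why exons with high P-values are the more reliable predictions.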

Protein Sequence from Genscan

Cheat Now!

To scan the protein sequence for the occurrence of motifs/patterns found in the PROSITE database, use:


Paste the protein sequence from GenScan into the ScanProsite input box and press the "Start the Scan" button. Paste the ScanProsite hit into your notebook. To see the Prosite summary for the hit, click on the PDOCxxxx number.
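To see what a pattern scan does conceptually, the following Python sketch converts a PROSITE-style pattern into a regular expression and reports where it matches. It is a simplified illustration, not the ScanProsite implementation: the function names are hypothetical, only the basic syntax elements are handled (`x`, `[...]`, `{...}`, repeat counts, and `<` / `>` anchors at the ends of the pattern), and the example sequence is made up.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # pattern anchored at the C-terminus
            element, suffix = element[:-1], "$"
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = match.groups()
        if core == "x":                      # any residue
            token = "."
        elif core.startswith("[") and core.endswith("]"):   # allowed residues
            token = core
        elif core.startswith("{") and core.endswith("}"):   # forbidden residues
            token = "[^" + core[1:-1] + "]"
        elif len(core) == 1 and core.isalpha() and core.isupper():
            token = core
        else:
            raise ValueError(f"unsupported element {element!r}")
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + token + suffix)
    return "".join(parts)


def scan(sequence: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every hit, using 1-based positions."""
    regex = prosite_to_regex(pattern)
    # The lookahead lets overlapping hits be reported as well.
    return [
        (m.start() + 1, m.start() + len(m.group(1)), m.group(1))
        for m in re.finditer(f"(?=({regex}))", sequence)
    ]


if __name__ == "__main__":
    # N-{P}-[ST]-{P} is the N-glycosylation site pattern; the sequence is invented.
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(scan("MKNGSALNPTQ", "N-{P}-[ST]-{P}"))
```

Short patterns like this one match by chance in many proteins, so a hit is a hint to follow up in the Prosite summary rather than proof of function.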

Hit from ScanProsite

Prosite pattern

Cheat Now!

To search for proteins with similar sequences, use BLAST:


Run a BLASTp search against the Swiss-Prot database: paste the protein sequence from GenScan into the input box on the BLASTp page, choose the SwissProt database from the database listbox, then press the "BLAST" button. On the results page, select the "Reformat these Results" link, format your results as "Flat query-anchored with dots for identities", and paste this alignment into your notebook.
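The "dots for identities" display can be understood with a short Python sketch. This is an illustration of the idea only, not BLAST's formatter: the function name `anchor_with_dots` is hypothetical and the two aligned rows are invented. Each subject residue that is identical to the query residue in the same column is replaced by a dot, so differences stand out.

```python
def anchor_with_dots(query_row: str, subject_row: str) -> str:
    """Show a subject row relative to the query: identities become dots."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "." if s == q and s != "-" else s   # keep gaps and mismatches visible
        for q, s in zip(query_row, subject_row)
    )


if __name__ == "__main__":
    query = "MKTAYIAK"      # made-up aligned query row
    subject = "MKSAY-AK"    # made-up aligned subject row with one gap
    print(query)
    print(anchor_with_dots(query, subject))
```

Reading down a column of such rows, a run of dots marks a conserved stretch, while letters and gaps mark where a database sequence departs from the query.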

BLASTP Alignment (against SwissProt)

Cheat Now!

To search against the COGs database, click here:



"COG" stands for Clusters of Orthologous Groups of proteins. The proteins that make up each COG are assumed to have evolved from an ancestral protein, and are therefore either orthologs or paralogs and thus correspond to an ancient conserved domain. The initial version of the COGs was generated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages. Use the COGnitor to compare the protein sequence to the COGs database.

Paste the FASTA-formatted protein sequence from GenScan into the COGnitor input box and press the "compare to COGs" button. Click on the link to the highest-scoring COG, then click on the disk icon to save the sequences in that COG to a local file on your desktop; this file will be used as input to Multalin below. Drag the file from your desktop onto your "tools" browser window to display the sequences, then copy and paste them into your notebook under "COGs FASTA Sequences".


COGs FASTA Sequences

Cheat Now!

To generate a multiple sequence alignment, use:


Paste the sequences from your best-hit COG, saved in your "COGs FASTA Sequences" notebook area, into the input box of Multalin. Also paste in the protein sequence derived from GenScan, so that your unknown sequence is included in the alignment, and press the "Start Multalin!" button. Display the results in text form by clicking on the "-Results as a text page (msf)" link. Paste this Multalin display into your notebook.
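If you prefer to prepare the combined input in a script rather than by copy and paste, a minimal Python sketch is given below. It assumes both inputs are plain FASTA text; the function names and the example records are hypothetical, and the sketch does not contact Multalin or any other server.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA text into (header, sequence) records."""
    records: List[Record] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:      # ignore any text before the first header
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def format_fasta(records: List[Record], width: int = 60) -> str:
    """Write records back out as FASTA, wrapping sequence lines at `width`."""
    lines: List[str] = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def merge_fasta(cog_fasta: str, query_fasta: str) -> str:
    """Combine the COG sequences and the query into one FASTA block."""
    return format_fasta(parse_fasta(cog_fasta) + parse_fasta(query_fasta))


if __name__ == "__main__":
    cog = ">P1\nMKT\nAYI\n>P2\nMKS\n"     # made-up COG members
    query = ">query\nMKTAY\n"             # made-up GenScan peptide
    print(merge_fasta(cog, query), end="")
```

Giving the query a short, recognizable header makes it easier to find your unknown sequence among the COG members in the resulting alignment.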

Multalin Alignment

Cheat Now!

To search for protein domains and view a model structure for your protein, click here:

NCBI's Conserved Domain Search allows you to match your protein sequence to conserved protein domains in the Conserved Domain Database, generate a multiple sequence alignment based on this match, and explore 3D modeling templates for your sequence.
Paste your protein sequence from GenScan into the CD-Search query box and run the search. From the search results page, generate a multiple sequence alignment for the top 10 sequences representative of the conserved domain hit by clicking on the cartoon of the domain. To view a structure with Cn3D, click on the "+Structure" link and use the listbox to specify "up to 5" sequences. Then press the "Structure View" button to invoke Cn3D, which displays a 3D modeling template together with a multiple sequence alignment that includes your query sequence. Residues identical in your sequence and the structural template are shown in red. Locate the Prosite Motif you found earlier within the Cn3D alignment window by using View--Find Pattern. Use Style--Annotate from the Cn3D window to color the highlighted residues and show their side chains.
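Finding a motif inside an alignment row involves one extra step compared with scanning a plain sequence: the gaps must be skipped during matching, and the hit must then be mapped back to alignment columns. The Python sketch below illustrates that idea under stated assumptions. It is not how Cn3D implements Find Pattern; the function name is hypothetical, the pattern is supplied as an ordinary regular expression, and the aligned row is invented.

```python
import re
from typing import List, Tuple


def find_pattern_in_aligned_row(
    aligned_row: str, regex: str, gap_chars: str = "-."
) -> List[Tuple[int, int, str]]:
    """Return (first_column, last_column, matched_residues), columns 1-based."""
    # columns[i] is the alignment column (0-based) of the i-th ungapped residue.
    columns = [i for i, ch in enumerate(aligned_row) if ch not in gap_chars]
    ungapped = "".join(aligned_row[i] for i in columns)
    hits = []
    for m in re.finditer(regex, ungapped):
        if m.end() == m.start():      # skip empty matches
            continue
        hits.append((columns[m.start()] + 1, columns[m.end() - 1] + 1, m.group()))
    return hits


if __name__ == "__main__":
    row = "MK--NGS-AL"                # made-up aligned row with gaps
    print(find_pattern_in_aligned_row(row, r"N[^P][ST][^P]"))
```

Because the reported range is in alignment columns, it can span gap positions; the residues of the motif are still contiguous in the underlying sequence.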

Cheat Now!

Other Tools for DNA and Protein Sequence Analysis


Questions, Comments:

Medha Bhagwat, PhD
David Wheeler, PhD


Revised March 8, 2007

Definition of CD: Conserved Domain. CD refers to a domain (a distinct functional and/or structural unit of a protein) that has been conserved during evolution. CDs are generated from multiple sequence alignments and may be refined by comparison to solved structures. During evolution, amino acid changes occur in ways that preserve the physico-chemical properties of critical residues, and hence the structural and/or functional properties of that domain.
