| GenBank, RefSeq, and Entrez |
- Use Entrez Nucleotide to retrieve the finished record AC009453 from the human genome project.
How many times has it been updated since it first appeared? Trace the history all the
way back to the first version. Based on the update date, when did this record first appear?
How many unordered pieces were there then? Now use electronic PCR (linked as a "hotspot"
on the NCBI homepage) to identify STS markers present in this record. How many are there?
These include radiation hybrid and genetic markers. Notice that one of these markers is
also a repeat polymorphism that is mapped on two human genetic maps (Marshfield and Genethon).
Follow the links from the ePCR results to see which marker it is.
- Retrieve the SWISS-PROT record for the human CFTR (cystic fibrosis) protein by searching
for CFTR_HUMAN in Protein from the search box on the NCBI home page. View the record and look
at the extensive annotations. How many primary database records are linked to this record? How
many literature citations are linked? What is the prevalence of cystic fibrosis in the
Caucasian population?
Use the FEATURES table to find the nature and location of the most common mutation in this gene
in cystic fibrosis. Now compare these annotations to those on the RefSeq record NP_000483 and
the corresponding primary sequence database record M28668.
Go back to the original SWISS-PROT entry at NCBI. Now use the BLink link to retrieve related proteins. Click the Best Hits button
and find the related protein from the fish Fundulus heteroclitus. Follow the PubMed link
from this record to read about the biology of this protein.
What is the physiological role of this CFTR homologue in this animal?
CFTR contains conserved domains that are homologous to bacterial transporters. These bacterial homologues
do not appear in the BLink output because only the top 200 proteins are shown. You can use the "Related sequences"
link on the CFTR_HUMAN record to find these. Go back to the CFTR_HUMAN record and follow the "Related sequences"
link. How many related proteins are there? To identify the ones from bacteria, click on the History tab. Follow the
instructions on that page for constructing a query that combines the protein neighbors with an Organism field search
for bacteria. Your query will be similar to the following:
#13 AND bacteria[Organism]
How many of the related sequences are bacterial proteins?
Find the genomic scaffold AE003584 from Drosophila melanogaster using Entrez
Nucleotide. Display protein links to see the predicted
proteins for this scaffold. (You will need to increase the number of records displayed
to see all of the proteins on one page. Then use the browser's "Find in page" function
to find the protein that you want.)
Identify conserved domains present in predicted protein CG10879 (AAF51293) by clicking on the BLink link
and then clicking the CDD button. These conserved domains suggest a potential function for this
hypothetical protein. Now perform a search against the Prosite patterns
using the ScanProsite tool at ExPASy. Did you find the same protein family signature? To
verify the Pfam results, try the search against the Prosite profiles. Do your results agree now?
This points out the problems with representing a profile as a pattern.
The Entrez Nucleotide [Properties] field
stores information about the kind of sequence and its source. You can use
the index feature on the Preview/Index tab to display the terms that are indexed for this field.
The Properties field terms are somewhat cryptic, but they are very useful for
searching. Three useful types are the biomol, gbdiv and srcdb sets.
The biomol terms classify records based on the type and origin
of the molecule, for example biomol mrna or biomol genomic. The
gbdiv terms index records by GenBank division code: gbdiv est,
gbdiv pri, gbdiv htg and so on. The srcdb terms classify records based upon their database
origin. For nucleotide records these could be GenBank, EMBL, DDBJ, RefSeq or PDB (srcdb
genbank, srcdb embl, srcdb ddbj, srcdb refseq).
Perform an organism search for mouse, then use Preview/Index tab and the Properties field
terms to count the number of mouse genomic records. How many of these are draft sequences
(gbdiv htg)? How many are finished records (gbdiv rod)? How many are genome survey
sequences? How many of these genomic records are RefSeqs? What kind of RefSeqs are they?
Now retrieve all mouse mRNA records. How many of these are in the rodent
division? How many are in the EST division? Using these properties field terms, design a query
and retrieve all the mouse known mRNA RefSeqs (NM_).
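The Properties-term searches in this exercise all share one shape: an organism restriction combined with one or more [Properties] terms using AND. The following Python sketch only formats such query strings; it does not contact Entrez. The helper name `properties_query` is hypothetical, and the term spellings simply follow the ones quoted above, so check them against the Preview/Index tab before relying on them.

```python
def properties_query(organism, *property_terms):
    """Build an Entrez query string: organism AND each Properties term.

    Hypothetical helper for illustration; it only formats text.
    """
    parts = [f"{organism}[Organism]"]
    parts += [f"{term}[Properties]" for term in property_terms]
    return " AND ".join(parts)


# Mouse genomic records in the HTG (draft) division.
print(properties_query("mouse", "biomol genomic", "gbdiv htg"))
# Mouse mRNA records whose source database is RefSeq.
print(properties_query("mouse", "biomol mrna", "srcdb refseq"))
```

The second query is only a starting point for the last question: it does not by itself separate the known (NM_) RefSeqs from other RefSeq mRNAs.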
Use Entrez Nucleotide to find the full-length cDNA (mRNA) sequence for Plasmodium falciparum
glyceraldehyde 3-phosphate dehydrogenase (GAPD). This time start by typing Plasmodium in the search box without limiting to
any field. How many records do you retrieve? Browse through your results to find some records
that are not from Plasmodium. Display a few of these to see why you retrieved them;
you should find "Plasmodium" somewhere on the record. Now use the Limits tab to restrict to
Plasmodium in the Organism field [Organism]. How many nucleotide records in Entrez are from
Plasmodium? Now find GAPD records by using the
Preview/Index tab to add glyceraldehyde 3-phosphate dehydrogenase as a [Title] Word.
How many records did you retrieve?
Search for population and phylogenetic studies on bears in Entrez PopSet. Find the study on brown
bears and polar bears and display the alignment. What gene or molecular regions were used in this study? Use the tool bar
link to display variations in the alignment. Are there fixed differences in the sequences from the brown bear, Ursus
arctos, and the polar bear sequences in the alignment? What if the
Ursus arctos sequence from the "ABC" islands (Sequence 7) is removed? Link to the article to read more about these remarkable results.
Substantial data are available for two species of filarial
nematodes that are human parasites. Use the Taxonomy Browser to examine the
number of nucleotide sequences for the superfamily Filarioidea and determine
which two species these are. How many nucleotide and protein sequences are
there for each of these two species? Display nucleotide records for each of
these. What kinds of sequences are most of these?
 | The last known
Tasmanian tiger died in the Hobart Zoo in 1936. DNA sequences have been
obtained from museum specimens. (In fact, there is an effort to clone this
animal using museum material.) You can retrieve Tasmanian tiger sequences
using the Taxonomy Browser.
Search the taxonomy database for Tasmanian Tiger. How many DNA and protein sequences are there? What genes were cloned?
You can build a phylogenetic dataset that could be used to analyze the taxonomic position of the Tasmanian Tiger with
the Taxonomy Browser. Click on the Metatheria (Marsupial) link in the lineage of the tiger. How many nucleotide
sequences are there for Metatheria? Retrieve the entry for Metatheria and get the nucleotide sequences. In Entrez you can
refine the query to include only cytochrome b sequences through the Preview/Index
tab. How many marsupial cytochrome b sequences are there? You could save these in FASTA format for use in phylogenetic
analysis if you wanted. You could browse further up the lineage to get an outgroup sequence.
There are a number of sequences for extinct organisms in the NCBI databases.
Visit the list of extinct taxa in the Taxonomy pages.
|
Inositol polyphosphate phosphatases contain conserved acidic residues involved in binding metal
ions. Retrieve the human INPP1 protein (INPP_HUMAN) from Entrez Protein.
Follow the "Domains" link to display pre-computed Conserved Domain Database (CDD) search results. Click on the "Details"
button to display the complete results.
Follow the link to the Pfam inositol_P domain and display the domain in Cn3D by clicking on the "View 3D Structure"
button. Identify the conserved residues surrounding the magnesium ions by double clicking
on them in the structure. The corresponding residues will be highlighted in the sequence
alignment. You can annotate the side chains on these if you like. First change the setting
on the CDD page from "Virtual Bonds" to "All Atoms" then display the structure. You
can then use the Style->Edit Global Style menu to turn off side chains and the
Style->Annotate menu to selectively turn on the side chains for amino acids that
coordinate the magnesium ions.
Leptin, the product of the obese gene, is a four-helix-bundle cytokine. This relationship
to the cytokines cannot be shown by sequence similarity, however. You can verify this by performing
at least two iterations of a PSI-BLAST search with the mouse leptin precursor (accession P41160).
Retrieve the mouse leptin precursor using Entrez Protein. Display related sequences
and then display structure links from the display pull-down list. View this structure
with Cn3D and verify the four-helix bundle. Go back to the Web browser and click on the protein chain
graphic to display related structures. Confirm that these are four-helix-bundle cytokines. View the structural alignment between leptin and its structural neighbor,
interleukin 6. Do this by checking the box next to 1IL6 and clicking the View 3D structure
button. Notice that the aligned residues in the two proteins are not similar by standard protein scoring matrices.
This is a classic example of apparently homologous proteins that have diverged beyond the sensitivity of sequence
similarity approaches.
Michael Crichton's fantasy about cloning
dinosaurs, Jurassic Park, contains a putative dinosaur DNA sequence.
Use nucleotide-nucleotide BLAST
against the default nucleotide database, nr, to identify the real source
of the following sequence. Select, copy and paste it into the BLAST form window.
This is probably the most common use of nucleotide-nucleotide BLAST:
sequence identification, establishing whether an exact match for a sequence
is already present in the database.
>DinoDNA from JURASSIC PARK p. 103 nt 1-1200
GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC
GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC
TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG
CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA
AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG
ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT
CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT
GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG
CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA
CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG
CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA
CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA
GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG
CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG
ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA
ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC
GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG
CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG
CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT
NCBI scientist Mark Boguski noticed this obvious "contaminant"
and supplied Crichton with a better sequence, shown below,
for the sequel, The Lost World. Identify the most likely source of
this sequence using nucleotide-nucleotide BLAST. Mark embedded his name in
the sequence he provided. To see Mark's name, use the translating BLAST (blastx) page
with the sequence below. (Look for MARK WAS HERE NIH.)
The proper use of
the translating BLAST services is to look for similar proteins (identify potential homologues)
in other species.
>DinoDNA from THE LOST WORLD p. 135
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACG
GACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCC
ATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAA
GCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCC
TCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGGCAGACACGGGTACTTTGGGG
ACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTG
CAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGC
GGGCCCCCACCCTGCGAGGCCCGTGAGTGCGTCATGGCCAGGAAGAACTGCGGAGCGACG
GCAACGCCGCTGTGGCGCCGGGACGGCACCGGGCATTACCTGTGCAACTGGGCCTCAGCC
TGCGGGCTCTACCACCGCCTCAACGGCCAGAACCGCCCGCTCATCCGCCCCAAAAAGCGC
CTGCTGGTGAGTAAGCGCGCAGGCACAGTGTGCAGCCACGAGCGTGAAAACTGCCAGACA
TCCACCACCACTCTGTGGCGTCGCAGCCCCATGGGGGACCCCGTCTGCAACAACATTCAC
GCCTGCGGCCTCTACTACAAACTGCACCAAGTGAACCGCCCCCTCACGATGCGCAAAGAC
GGAATCCAAACCCGAAACCGCAAAGTTTCCTCCAAGGGTAAAAAGCGGCGCCCCCCGGGG
GGGGGAAACCCCTCCGCCACCGCGGGAGGGGGCGCTCCTATGGGGGGAGGGGGGGACCCC
TCTATGCCCCCCCCGCCGCCCCCCCCGGCCGCCGCCCCCCCTCAAAGCGACGCTCTGTAC
GCTCTCGGCCCCGTGGTCCTTTCGGGCCATTTTCTGCCCTTTGGAAACTCCGGAGGGTTT
TTTGGGGGGGGGGCGGGGGGTTACACGGCCCCCCCGGGGCTGAGCCCGCAGATTTAAATA
ATAACTCTGACGTGGGCAAGTGGGCCTTGCTGAGAAGACAGTGTAACATAATAATTTGCA
CCTCGGCAATTGCAGAGGGTCGATCTCCACTTTGGACACAACAGGGCTACTCGGTAGGAC
CAGATAAGCACTTTGCTCCCTGGACTGAAAAAGAAAGGATTTATCTGTTTGCTTCTTGCT
GACAAATCCCTGTGAAAGGTAAAAGTCGGACACAGCAATCGATTATTTCTCGCCTGTGTG
AAATTACTGTGAATATTGTAAATATATATATATATATATATATATCTGTATAGAACAGCC
TCGGAGGCGGCATGGACCCAGCGTAGATCATGCTGGATTTGTACTGCCGGAATTC
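The blastx step above relies on translating the nucleotide query in all six reading frames (three per strand). As a rough illustration of what that translation involves, here is a minimal Python sketch using the standard genetic code. The function names are hypothetical, and this is not how BLAST itself is implemented; it only shows the idea.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase/lowercase DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# A 12-base stretch taken from the Lost World sequence above.
for frame, protein in six_frame_translations("ATGGCCAGGAAG").items():
    print(frame, protein)
```

Running the same function on the full sequence and scanning each frame for the words is one way to check what blastx reports; the words need not be adjacent in the translation.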
- Higher eukaryotic genomes contain large amounts of repetitive DNA. The most abundant
interspersed repeat in the human genome is the Alu element. Alus tend to occur near genes,
within the introns of genes, or in the regions between genes. In some cases, their presence and absence
can fairly accurately show the intron-exon structure of a gene. Demonstrate this by performing a
nucleotide-nucleotide BLAST search against the Alu database with the genomic sequence of the human
von Hippel-Lindau syndrome gene (Accession AF010238). Note that the exons appear in the BLAST graphic
as places where the Alu elements do not align.
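Reading the Alu-free stretches off the BLAST graphic amounts to finding the parts of the query that no hit covers. A small Python sketch of that bookkeeping is shown below. The function name and the coordinates are invented for illustration; they are not taken from AF010238, and an uncovered region is only a candidate for containing an exon.

```python
def uncovered_regions(seq_length, hits):
    """Return 1-based inclusive query intervals not covered by any hit.

    hits is a list of (start, end) query coordinates, in any order,
    possibly overlapping.
    """
    gaps = []
    next_free = 1  # first query position not yet covered
    for start, end in sorted(hits):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= seq_length:
        gaps.append((next_free, seq_length))
    return gaps


# Invented Alu hit coordinates on a 1000-base query.
alu_hits = [(700, 900), (100, 400), (350, 500)]
print(uncovered_regions(1000, alu_hits))
```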
The Caenorhabditis elegans gene SMA-4 is a member of the dwarfins
gene family, also called the MAD family, which plays a role in transforming
growth factor beta-mediated signal transduction.
In this example we will attempt to find homologs for the SMA-4 protein (SMA4_CAEEL, Accession P45897)
in vertebrate species using protein-protein BLAST.
Of course, this protein already is in the Entrez
Protein and BLAST databases. Remember that if the goal is to find a homolog in another species
for a protein that is already present in the Entrez system, it is not necessary to perform
a BLAST search; the precalculated similarities are already available through BLink.
Verify this
by following the BLink link from P45897 in Entrez Protein. Click the best hits button and find the
best protein hit to chicken (Gallus gallus). The alignment between SMA-4 and the best chicken
match is available by clicking on the linked BLAST score.
To simulate performing a BLAST search with a novel
protein, we will use an Entrez query to remove all Caenorhabditis proteins
from the BLAST database.
Link to the protein-protein
BLAST page and enter the SMA-4 accession number (P45897) in the Search
text area. We will search against
the default database, nr. To remove the Caenorhabditis proteins from the nr database, enter
the following Entrez search in the "Limit by Entrez query" box under the "Options" section
of the form:
protein all[Filter] NOT Caenorhabditis[Organism]
Because there are a large number of related proteins in the BLAST database, we also need to
increase the number of descriptions or BLAST hits that will be shown. Do this by increasing
the number of descriptions to 500 in the "Format" section of the BLAST form. Run the search
by clicking the BLAST button.
On the formatting page, you can see that the CD-search has identified conserved domains in this
protein. You can click on the graphic to see what these domains are and what their function is.
Click the format button to retrieve your BLAST results.
Look at your BLAST graphical output and verify that the Entrez query eliminated the
query protein from the database; you should see no full-length matches. Now look at your
descriptions and their e-values. Among the hits with non-significant e-values (> 1) there
are two proteins from sheep (Ovis aries) labeled as MAD proteins (Smad4 and Smad7).
These protein fragments are homologs of SMA-4, but we did not demonstrate that with this particular
search. In the following exercise we will use PSI-BLAST to show that these sheep proteins are
significant matches to SMA-4. Be sure to retain your formatting page for these results or copy your
request ID so you can format them for PSI-BLAST for the next exercise.
Look at the BLAST output and find all chicken (Gallus gallus) proteins that
are similar to SMA-4.
(Use the Tax Blast link at the upper left of the graphic to help in finding the chicken proteins.)
These should be the same proteins found by BLink previously.
Open a new browser window so you don't lose your results against nr and
run the same search again. Restrict the search to chicken proteins using the
Entrez query option as you did before. This time use the query
Chicken[Organism]
Are the same proteins found?
Compare the expectation values of these hits to the same hits found against
nr with no organism restriction. Why are the e-values different for
the same scores and alignments?
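One way to explore this question is through the usual relationship between a bit score S' and the expectation value, E = m * n * 2^(-S'), where m is the query length and n is the total length of the database searched. The Python sketch below applies that formula directly. It is a simplification: real BLAST uses effective lengths and other corrections, and the lengths used here are invented, not taken from nr.

```python
def expect_value(bit_score, query_length, database_length):
    """Simplified E-value: search space (m * n) scaled by 2 ** -bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same alignment and bit score, searched against databases of different size.
# All numbers below are made up for illustration.
bit_score = 50.0
query_length = 500
whole_database = 500_000_000      # residues in a large, unrestricted database
restricted_database = 5_000_000   # residues left after an organism limit

print(expect_value(bit_score, query_length, whole_database))
print(expect_value(bit_score, query_length, restricted_database))
```

Under this model the score of an alignment does not change when the database shrinks, but the number of chance matches expected at that score does.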
The Sma-4 protein that we used previously belongs to a large
family of proteins. (Jump back to the protein-protein search.) Some members of this family are not readily
identified in an ordinary blastp search. However, additional Sma-4 homologs
can be found by using the more sensitive position-specific iterated BLAST (PSI-BLAST).
Any protein-protein BLAST search on the NCBI web pages can be extended to a PSI-BLAST
search simply by reformatting the results. Check the "Format for PSI-BLAST" box on the formatting
page for the first search that you saved from the exercise above and click format.
The results are the same except that they are formatted differently. There is a line across the
descriptions section of the results
corresponding to the PSI-BLAST inclusion threshold of 0.005. Position-specific
information from a multiple sequence alignment of the sequences above this line
is used to generate a position-specific score matrix (PSSM) in the next iteration.
Notice that one of the first proteins below this line is the Smad4 from the
sheep (Ovis aries). What is the e-value of this hit?
Now click the "Run PSI-BLAST iteration 2" button. Note that the Formatting
page is refreshed in its separate window, generating a new Request ID number.
Click the "Format" button and the results of iteration 2 will load. Click on
the "Skip to the first new sequence" link on the Iteration 2 results page. What
is this sequence? What is its new expect value? Notice that there are now several
new sequences above threshold. Some of them are not annotated as Sma/Mad homologs
but are clearly significant hits. These new sequences will be used to construct
a new PSSM for iteration 3 and so on. After a few more iterations no more sequences
will be found; at this point the search is said to have converged.
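To make the idea of a position-specific score matrix more concrete, the following Python sketch builds log-odds scores column by column from a toy ungapped alignment. It is a deliberately simplified stand-in, not PSI-BLAST's actual procedure: it assumes a uniform background frequency and a flat pseudocount of 1, whereas PSI-BLAST weights sequences and uses more elaborate pseudocounts. The function names and the toy alignment are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2 odds.

    Assumes equal-length, ungapped sequences and a uniform background.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, peptide):
    """Sum the position-specific scores of a peptide aligned at column 0."""
    return sum(column[res] for column, res in zip(pssm, peptide))


toy_alignment = ["ACD", "ACE", "ACD"]  # invented sequences
pssm = build_pssm(toy_alignment)
print(round(score(pssm, "ACD"), 2), round(score(pssm, "WWW"), 2))
```

Because each column has its own scores, a residue that is conserved at one position is rewarded there and nowhere else, which is what lets later iterations pick up more distant family members.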
- The prion protein is found in
high concentrations in the brains of humans and other mammals. In certain
degenerative neurological diseases, prion proteins aggregate into polymers.
Several of these prion diseases seem to be transmissible. Perhaps the most
remarkable aspect of these is that the infectious agent appears to be an
aberrant form of the prion protein itself. Bovine spongiform encephalopathy
(BSE) is one of the transmissible prion diseases that has received much recent
notoriety. There are a number of polymorphisms that have been identified
in the prion proteins for several mammals, notably human, mouse, and sheep.
Some of these are associated with inherited prion diseases and some with susceptibility
to transmissible forms. Retrieve the SWISS-PROT record for the human prion
protein (PRIO_HUMAN) and look at the FEATURES table to see the various polymorphisms.
Use this protein to perform a translated BLAST search (PROTEIN query - TRANSLATED
database) against human ESTs and look at your results to see if any
of these polymorphisms are present in the EST data. This is easier to see
if you change the formatting options on the BLAST form to display one of
the query-anchored alignment options. Try the "flat query-anchored with identities" option. (See the problem on prion
SNPs in the Genomes section.)
The human fragile histidine triad protein (FHIT, Accession P49789)
is structurally related to
galactose-1-phosphate uridylyltransferases. However, this relationship is not
apparent in an ordinary BLAST search. Perform a protein-protein
BLAST search against the
SWISS-PROT database with P49789 and search your results for
galactose-1-phosphate uridylyltransferases. Now use PSI-BLAST to verify the
relationship between these two protein families.
A frequent use of nucleotide-nucleotide BLAST is to
check oligonucleotides for hybridization or PCR. The goal most people have
when doing this is to make sure that the primer will give a unique product
from the target genome or cDNA population. Because BLAST is local and searches
both strands, one can simply concatenate a pair of +/- strand primers and use
them in a single search. Combine the following pair of candidate
PCR primers in a nucleotide-nucleotide
search against the default nucleotide database and identify the gene amplified.
F12 GTCAAGTGGCAACTCCGTCAG
R8 TTGAGAGATGGATTGTTGCGC
Now try these modified primers. There is one mismatch in each near the middle.
F12_mod GTCAAGTGGCTACTCCGTCAG
R8_mod TTGAGAGATGTATTGTTGCGC
Notice that the previous hits are completely missing. Now adjust the Word Size from
11 to 7 under the BLAST Advanced Options and try the search again. Do you find the
original hits again?
Are they still among the best hits? Can you devise a modification in the search strategy that will make them
the best hits again?
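The effect of the Word Size setting can be seen by measuring the longest run of identical bases between each modified primer and its original, which stands in here for the matching database sequence. The Python sketch below does this for the primers given above. It assumes an ungapped, position-by-position comparison and treats a hit as possible only if some exact run is at least as long as the word size; the function names are hypothetical.

```python
def longest_exact_run(a, b):
    """Length of the longest stretch of identical positions in a and b."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def can_seed(a, b, word_size):
    """True if an exact word of the given size is shared at the same offset."""
    return longest_exact_run(a, b) >= word_size


PRIMERS = {
    "F12": ("GTCAAGTGGCAACTCCGTCAG", "GTCAAGTGGCTACTCCGTCAG"),
    "R8": ("TTGAGAGATGGATTGTTGCGC", "TTGAGAGATGTATTGTTGCGC"),
}

for name, (original, modified) in PRIMERS.items():
    run = longest_exact_run(original, modified)
    print(name, "longest exact run:", run,
          "| word size 11:", can_seed(original, modified, 11),
          "| word size 7:", can_seed(original, modified, 7))
```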
As the database grows, so does the number of chance occurrences
of amino acid motifs that spell out words or people's names in single-letter
amino acid codes. One such name motif is ELVIS. Find the number of
occurrences of ELVIS in the protein nr. To get any hits at all, you will
have to adjust several of the advanced BLAST parameters including the Expect
value, Word size, and Score Matrix. Adjust some of these in the "Other
advanced options" box. Options are entered in a command-line style. For example,
typing -e 10000 sets the Expect value cut-off to 10000. Visit the BLAST "Frequently Asked
Questions" by following the
link on the left side bar of the BLAST page for more information. See especially the entry on "How do I perform a
similarity search with a short peptide/nucleotide sequence?". We now have a page with presets optimized to find short nearly exact matches. You can run the search on this page to see the correct parameters to
use.
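For a sense of what a "short nearly exact match" search is looking for, the following Python sketch counts exact occurrences of a motif in a handful of local sequences. This is plain substring matching, not BLAST, and it will not reproduce the counts from the protein nr database; the function name and the toy sequences are invented for illustration.

```python
def count_motif(sequences, motif):
    """Return {name: count} for sequences containing the motif.

    Matching is exact and case-insensitive; overlapping matches count.
    """
    motif = motif.upper()
    counts = {}
    for name, seq in sequences.items():
        seq = seq.upper()
        hits = sum(
            1
            for i in range(len(seq) - len(motif) + 1)
            if seq.startswith(motif, i)
        )
        if hits:
            counts[name] = hits
    return counts


# Invented toy sequences, not real database records.
toy_proteins = {
    "toy1": "MKELVISAAELVIS",
    "toy2": "MKTAYIAK",
    "toy3": "elvis",
}
print(count_motif(toy_proteins, "ELVIS"))
```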
| Genome Resources Questions |
- Mycobacterium tuberculosis. (You may want to launch two browsers to do this example.) Display the distribution of BLAST hits by Taxa for each
and compare the distribution of homologs. Which organism has more best hits to Eukaryotes? Now display the BLAST hits for each by COGs (clusters of orthologous groups). The tuberculosis organism has a
disproportionate portion of the genome devoted to metabolism of what class of
biomolecules?
- Visit the COGs
pages. It has been noted that in some respects the proteome of archaea retains
more similarity to eukaryotes than to bacteria. Use the phylogenetic
patterns search to count the number of COGs shared between the yeast (COGs code
Y) and the archaea only (A, O, M, P, K, Z). Now count the number of COGs
shared between the yeast and the bacteria only (Q, V, D, R, L, C, E, F, H, S, N, U,
J, X, I, T, W). Use the "Differences in closely related genomes" option at the bottom
of the phylogenetic patterns search page to compare the COGs for the three
enterobacteria (two E. coli strains and the highly specialized aphid
endosymbiont Buchnera sp. APS). The column containing an "x" for both
the E. coli strains and a "-" for Buchnera shows the COGs that are
missing in the endosymbiont. These pathways are apparently provided by the host
cells of the insect.
UniGene is the best NCBI resource to use to find out to what gene
(or suspected gene) a particular
database sequence belongs. This is especially true for ESTs where there may be no annotations on the sequence, but may
also be important for other sequences where the annotation may be incomplete or obsolete. Database identifiers for UniGene searches may come from BLAST
output or from microarray (hybridization) data. For example, mRNA that hybridized to the
EST sequence with accession number BG618105
was highly expressed in a human liver tumor sample.
Retrieve the record from the nucleotide database using
the accession number in the search box on the NCBI homepage. Display the record. Is there
any annotation indicating what gene this is?
- Now link to UniGene from the "Links" menu in the upper right. What is the name of this gene?
Link to LocusLink from the UniGene cluster. What is the function of this protein?
Go back to UniGene. Look at the ESTs in this cluster. How many are
there? Identify a pair of ESTs (a 5' and 3' read) that come from the same clone ID. You'll need to
display all ESTs and scroll down to see these. Use BLAST 2 sequences
to align these to the full length RefSeq mRNA from the LocusLink entry. Note the mismatches that are most likely due
to sequencing errors in the ESTs.
Expression information is implied by the sources of the cDNA libraries in a
particular cluster. NCBI also has linked tag counts from quantitative SAGE libraries
to the UniGene clusters. Follow the "Gene to Tag" mapping link to see a "virtual Northern"
display of the counts of reliable tags from this cluster in SAGE libraries. What library
shows the highest relative expression of this gene?
On the LocusLink page use the main map viewer link (mv) in the "Map Information" section to display this gene in the
MapViewer. What chromosomal region is this? What maps are displayed? You can click on the map name at the top to learn
more about the information displayed for each map. Uncheck the "Compress Maps" option
on the left-hand-side to see the full marker labels. The UniGene map shows the
density of EST hits on the genome. Generally the peaks in this histogram highlight
the exons of expressed genes. Notice that there are some hits that don't correspond to
the exons shown in the gene model on the Genes map. What could these represent? To see
another view of the alignment based gene model follow the "ev" link to display this in the
evidence viewer.
Use the zoom graphic on the left-hand side of the map viewer to zoom out and
display two other members of this small gene family, AFP and AFM. Are these in the same orientation? There is also a fourth member of this small family, called GC, somewhat removed from these
but also on chromosome 4. Display the entire region between GC and AFM by typing these symbols
in the "Region Shown" boxes on the left-hand side and pressing the "Go" button.
From the LocusLink entry, click on the mouse gene symbol entry under mapping information to
display the corresponding mouse LocusLink record. Follow the mouse map viewer link to
display the corresponding region in the mouse map viewer. What chromosome is this? Adjust the view to see if the
same gene family is present with the same structure in the mouse genome. Link to the contig record for this
region of mouse genome from the map viewer. How large is this contig? Examine the bottom of the record
and notice that it is assembled from both BAC clone (draft and finished) and whole genome shotgun sequence. Retrieve one of
the whole genome shotgun pieces (e.g. CAAA01153721). Link from this record to the master record
for the mouse whole genome shotgun project (CAAA01000000). How many records are in this set?
The gene causing the juvenile form of nephronophthisis was recently identified on
human chromosome 1. We will use related protein and nucleotide records to identify this gene
in other species. Retrieve the human NPHP4 entry from LocusLink. This protein apparently
has a homolog in C. elegans. Demonstrate this by following the BLink link (BL) next to
the provisional RefSeq protein in this entry. Clicking the "Best hits" button will make it
easier to identify. Notice that there is also a homolog in mouse. Retrieve the mouse protein
by linking through the Accession number. Display the linked nucleotide sequence. Use this
Accession number (AY118229) in rat genome BLAST to find the gene in the rat genome.
Search against
the genome assembly. What supercontig did you hit? On what rat chromosome is this gene? Display
your results in the Map Viewer by clicking on the Genome View button that appears on the
BLAST results page and link to the contig map element. Use the "Maps and Options" link and add
the "Genes" map to the display.
Is this gene annotated on the rat Map Viewer?
Your BLAST hits imply an exon-intron structure for this gene. How many exons do your
BLAST hits imply? How large is this gene? You can make a more precise alignment-based model for this gene using the
Spidey tool. To do this
you will need to adjust the base pair range displayed on the Map Viewer to the smallest
interval that contains all of the BLAST hits. Then get this sequence using the "Download/View Sequence/Evidence"
link. Display the genomic region in the browser and save it to disk. Use this
genomic sequence on the Spidey page. Use the mouse cDNA (AY118229) you used before for the mRNA sequence.
Use the other genomic BLAST pages to try to find a homolog in other vertebrates
(fugu, zebrafish).
You'll probably need to search at the protein level (AAM78559, tblastn) to find any lower vertebrate
homologs.
The following amplified DNA sequence is associated with a human disease gene polymorphism:
TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAAC
ATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGAC
CTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCCTGGATCAGC
CCCTCATTGTGATCTGGG
Use this sequence in the human genome BLAST service to identify this gene.
Follow the linked identifier on the BLAST results to display your results in the Map Viewer.
On what chromosome is this gene? What gene is it? Examine the BLAST alignment to identify the position and
nature of the polymorphism. In what exon is this?
We can now see if this polymorphism has been mapped to the genome from the SNP database.
Use the "Maps and Options" link to add the Variation map to the display. To zoom in to the region,
place the mouse pointer over the map and click to display the pop-up zoom menu. Choose an appropriate
level to see the polymorphisms in the region of interest. Find the coding region SNP that maps
to the same place as your polymorphism identified by BLAST. Link from the Map Viewer to the RefSNP
record. Does this SNP imply a change in the amino acid sequence? What is it? (You will notice that
there are multiple splice variants for this gene, but the amino acid change is consistent in all
of those that contain this coding exon.)
This is a well known polymorphism in the HFE gene that causes hemochromatosis when homozygous. From the
RefSNP record you can link to OMIM to learn more about this. You can also follow the links to 3D structure
mappings to display the position of this polymorphism in the structure (1A6Z) of the HFE protein. Based on this,
why does this amino acid change have a detrimental effect on the function of this protein?
Try finding the mouse homologue of HFE by performing a BLAST search on the mouse genome page with
the above human sequence fragment. First do this with the Megablast box checked. Do you find any hits?
Now try it with Megablast turned off. Do you find it now?
- The human MLH1 gene, which encodes a DNA mismatch repair protein, is mutated in a form of familial
colon cancer. Go to LocusLink and retrieve the entry for human MLH1. Link to the
variations mapped for this gene. (You can click on the "Var" link in the full
report or the "V" in the summary.) What non-synonymous substitutions are
reported? Note the Isoleucine-Valine polymorphism at position 32 of the
protein. Could such a substitution affect the structure and function of
MLH1? Some light can be shed on this question by examining the sequences and
structures of homologous proteins from other organisms. You can view the E. coli mutL structure and
align its sequence with the
human MLH1 in Cn3D. Do this by following the "BL" link (BLink) from the MLH1 LocusLink
report. Once you have the BLink report, click on the 3D structures button. Then
retrieve the E. coli structure and sequence alignment by clicking on the
blue dot next to 1B63A. What amino acid does the E. coli protein have at the
equivalent position to Ile-32 in the human protein?
Use LocusLink to find the entry for
the human glyceraldehyde 3-phosphate dehydrogenase gene. Click on the Map Viewer link (mv)
to find the map location and the contig containing the GAPD gene. Zoom
in to see the exon-intron structure of the gene. How
many exons are there? Now use human genome BLAST to verify the location
and structure of this gene. Use the GAPD RefSeq (NM_002046) to perform this
search. Set both the alignments and descriptions to 250. How many contigs
do you hit in the human genome? Click on the Genome View button to see the
distribution of these hits on the genome. Look at some of the high scoring
single hits to see what's unusual about them. How can you account for
these results?
|