NCBI Problem Set
  Exercises  
    
Updated 11/24/03
 GenBank, RefSeq, and Entrez

  1. Use Entrez nucleotides to retrieve the finished record AC009453 from the human genome project. How many times has it been updated since it first appeared? Trace the history all the way back to the first version. Based on the update date, when did this record first appear? How many unordered pieces were there then? Now use electronic PCR (linked as a "hotspot" on the NCBI homepage) to identify STS markers present in this record. How many are there? These include radiation hybrid and genetic markers. Notice that one of these markers is also a repeat polymorphism that is mapped on two human genetic maps (Marshfield and Genethon). Follow the links from the ePCR results to see which marker it is.
  2. Retrieve the SWISS-PROT record for the human CFTR (cystic fibrosis) protein by searching with CFTR_HUMAN in proteins in the search box on the NCBI home page. View the record and look at the extensive annotations. How many primary database records are linked to this record? How many literature citations are linked? What is the prevalence of cystic fibrosis in the Caucasian population? Use the FEATURES table to find the nature and location of the most common mutation in this gene in cystic fibrosis. Now compare these annotations to those on the RefSeq record NP_000483 and the corresponding primary sequence database record M28668.

    Go back to the original SWISS-PROT entry at NCBI. Now use the BLink link to retrieve related proteins. Click the Best Hits button and find the related protein from the fish Fundulus heteroclitus. Follow the PubMed link from this record to read about the biology of this protein. What is the physiological role of this CFTR homologue in this animal?

    CFTR contains conserved domains that are homologous to bacterial transporters. These bacterial homologues do not appear in the BLink output because only the top 200 proteins are shown. You can use the "Related sequences" link on the CFTR_HUMAN record to find these. Go back to the CFTR_HUMAN record and follow the "Related sequences" link. How many related proteins are there? To identify the ones from bacteria, click on the History tab. Follow the instructions on that page for constructing a query that combines the protein neighbors with an organism field search for bacteria. Your query will be something similar to the following:

    #13 AND bacteria[Organism]
    How many of the related sequences are bacterial proteins?
  3. Find the genomic scaffold AE003584 from Drosophila melanogaster using Entrez Nucleotide. Display protein links to see the predicted proteins for this scaffold. (You will need to increase the number of records displayed to see all of the proteins on one page. Then use the browser's "Find in page" function to find the protein that you want.) Identify conserved domains present in predicted protein CG10879 (AAF51293) by clicking on the BLink link and then clicking the CDD button. These conserved domains suggest a potential function for this hypothetical protein. Now perform a search against the Prosite patterns using the ScanProsite tool at ExPASy. Did you find the same protein family signature? To verify the Pfam results, try the search against the ProSite profiles. Do your results agree now? This points out the problems with representing a profile as a pattern.


  4. The Entrez nucleotides [Properties] field stores information about the kind of sequence and its source. You can use the index feature on the Preview/Index tab to display the terms that are indexed for this field. The Properties field terms are somewhat cryptic, but they are very useful for searching. Three useful types are the biomol, gbdiv and srcdb sets. The biomol terms classify records based on the type and origin of the molecule, for example biomol mrna or biomol genomic. The gbdiv sets of terms index records by the GenBank division code: gbdiv est, gbdiv pri, gbdiv htg and so on. The srcdb terms classify records based upon their database origin. For nucleotide records these could be GenBank, EMBL, DDBJ, RefSeq or PDB (srcdb genbank, srcdb embl, srcdb ddbj, srcdb refseq). Perform an organism search for mouse, then use the Preview/Index tab and the Properties field terms to count the number of mouse genomic records. How many of these are draft sequences (gbdiv htg)? How many are finished records (gbdiv rod)? How many are genome survey sequences? How many of these genomic records are RefSeqs? What kind of RefSeqs are they? Now retrieve all mouse mRNA records. How many of these are in the rodent division? How many are in the EST division? Using these Properties field terms, design a query and retrieve all the mouse known mRNA RefSeqs (NM_).
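
    Fielded Entrez queries like the ones in this exercise are plain strings of the form term[Field] joined with AND. The sketch below assembles such a string and, as an assumption that goes beyond this exercise (which uses the Entrez web pages), wraps it in a URL for the NCBI E-utilities ESearch service using the nuccore database name. The helper names fielded, build_query and esearch_url are hypothetical, the organism and Properties terms are only an illustration, and nothing is fetched.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service. Using E-utilities at all
# is an assumption of this sketch; the exercise uses the Entrez web pages.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field):
    """Return one Entrez search term restricted to a field, e.g. x[Organism]."""
    return f"{term}[{field}]"


def build_query(organism, properties):
    """AND together an organism restriction and any number of Properties terms."""
    parts = [fielded(organism, "Organism")]
    parts += [fielded(p, "Properties") for p in properties]
    return " AND ".join(parts)


def esearch_url(query, db="nuccore"):
    """Build (but do not fetch) an ESearch URL asking only for the hit count."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "rettype": "count"})


# Illustrative terms only: human mRNA records in the primate division.
query = build_query("Homo sapiens", ["biomol mrna", "gbdiv pri"])
print(query)
print(esearch_url(query))
```

    The same query string can be pasted directly into the Entrez search box; the URL form is only convenient when many counts have to be collected.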


  5. Use Entrez Nucleotide to find the full-length cDNA (mRNA) sequence for Plasmodium falciparum glyceraldehyde 3-phosphate dehydrogenase (GAPD). This time start by typing Plasmodium in the search box without limiting to any field. How many records do you retrieve? Browse through your results to find some records that are not from Plasmodium. Display a few of these to see why you retrieved them; you should find "Plasmodium" somewhere on the record. Now use the Limits tab to restrict to Plasmodium in the Organism field [Organism]. How many nucleotide records in Entrez are from Plasmodium? Now find GAPD records by using the Preview/Index tab to add glyceraldehyde 3-phosphate dehydrogenase as a [Title] Word. How many records did you retrieve?


  6. Search for population and phylogenetic studies on bears in Entrez PopSet. Find the study on brown bears and polar bears and display the alignment. What gene or molecular regions were used in this study? Use the tool bar link to display variations in the alignment. Are there fixed differences in the sequences from the brown bear, Ursus arctos, and the polar bear sequences in the alignment? What if the Ursus arctos sequence from the "ABC" islands (Sequence 7) is removed? Link to the article to read more about these remarkable results.


  7. Substantial data are available for two species of filarial nematodes that are human parasites. Use the Taxonomy Browser to examine the number of nucleotide sequences for the superfamily Filarioidea and determine which two species these are. How many nucleotide and protein sequences are there for each of these two species? Display nucleotide records for each of these. What kinds of sequences are most of these?




  8. The last known Tasmanian tiger died in the Hobart Zoo in 1936. DNA sequences have been obtained from museum specimens. (In fact, there is an effort to clone this animal using museum material.) You can retrieve Tasmanian tiger sequences using the Taxonomy Browser. Search the taxonomy database for Tasmanian tiger. How many DNA and protein sequences are there? What genes were cloned? With the Taxonomy Browser you can build a phylogenetic dataset that could be used to analyze the taxonomic position of the Tasmanian tiger. Click on the Metatheria (Marsupial) link in the lineage of the tiger. How many nucleotide sequences are there for Metatheria? Retrieve the entry for Metatheria and get the nucleotide sequences. In Entrez you can refine the query to include only cytochrome b sequences through the Preview/Index tab. How many marsupial cytochrome b sequences are there? You could save these in FASTA format for use in phylogenetic analysis if you wanted. You could browse up the lineage further to get an outgroup sequence.

    There are a number of sequences for extinct organisms in the NCBI databases. Visit the list of extinct taxa in the Taxonomy pages.


   Structures
  1. Inositol polyphosphate phosphatases contain conserved acidic residues involved in binding metal ions. Retrieve the human INPP1 protein (INPP_HUMAN) from Entrez proteins. Follow the "Domains" link to display pre-computed Conserved Domain Database (CDD) search results. Click on the "Details" button to display the complete results. Follow the link to the pfam inositol_P domain and display the domain in Cn3D by clicking on the "View 3D Structure" button. Identify the conserved residues surrounding the magnesium ions by double clicking on them in the structure. The corresponding residues will be highlighted in the sequence alignment. You can annotate the side chains on these if you like. First change the setting on the CDD page from "Virtual Bonds" to "All Atoms" then display the structure. You can then use the Style->Edit Global Style menu to turn off side chains and the Style->Annotate menu to selectively turn on the side chains for amino acids that coordinate the magnesium ions.


  2. Leptin, the product of the obese gene, is a four-helix-bundle cytokine. This relationship to the cytokines cannot be shown by sequence similarity, however. You can verify this by performing at least two iterations of a PSI-BLAST search with the mouse leptin precursor (accession P41160). Retrieve the mouse leptin precursor using Entrez Protein. Display related sequences and then display structure links from the display pull-down list. View this structure with Cn3D and verify the four-helix bundle. Go back to the Web browser and click on the protein chain graphic to display related structures. Confirm that these are four-helix-bundle cytokines. View the structural alignment between leptin and its structural neighbor, interleukin 6. Do this by checking the box next to 1IL6 and clicking the View 3D structure button. Notice that the aligned residues in the two proteins are not similar by standard protein scoring matrices. This is a classic example of apparently homologous proteins that have diverged beyond the sensitivity of sequence similarity approaches.


   BLAST Problems
  1. Michael Crichton's fantasy about cloning dinosaurs, Jurassic Park, contains a putative dinosaur DNA sequence. Use nucleotide-nucleotide BLAST against the default nucleotide database, nr, to identify the real source of the following sequence. Select, copy and paste it into the BLAST form window.

    This is probably the most common use of nucleotide-nucleotide BLAST: sequence identification, establishing whether an exact match for a sequence is already present in the database.

    >DinoDNA from JURASSIC PARK  p. 103 nt 1-1200
    GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC
    GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
    TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC
    TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG
    CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA
    AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG
    ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT
    CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT
    GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG
    CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA
    CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG
    CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA
    CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA
    GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG
    CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG
    ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA
    ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC
    GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG
    CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG
    CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT
    
    NCBI scientist Mark Boguski noticed this obvious "contaminant" and supplied Crichton with a better sequence, shown below, for the sequel, The Lost World. Identify the most likely source of this sequence using nucleotide-nucleotide BLAST. Mark embedded his name in the sequence he provided. To see Mark's name, use the translating BLAST (blastx) page with the sequence below. (Look for MARK WAS HERE NIH.)

    The proper use of the translating BLAST services is to look for similar proteins (identify potential homologues) in other species.

    >DinoDNA from THE LOST WORLD  p. 135
    GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACG
    GACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCC
    ATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAA
    GCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCC
    TCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGGCAGACACGGGTACTTTGGGG
    ACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTG
    CAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGC
    GGGCCCCCACCCTGCGAGGCCCGTGAGTGCGTCATGGCCAGGAAGAACTGCGGAGCGACG
    GCAACGCCGCTGTGGCGCCGGGACGGCACCGGGCATTACCTGTGCAACTGGGCCTCAGCC
    TGCGGGCTCTACCACCGCCTCAACGGCCAGAACCGCCCGCTCATCCGCCCCAAAAAGCGC
    CTGCTGGTGAGTAAGCGCGCAGGCACAGTGTGCAGCCACGAGCGTGAAAACTGCCAGACA
    TCCACCACCACTCTGTGGCGTCGCAGCCCCATGGGGGACCCCGTCTGCAACAACATTCAC
    GCCTGCGGCCTCTACTACAAACTGCACCAAGTGAACCGCCCCCTCACGATGCGCAAAGAC
    GGAATCCAAACCCGAAACCGCAAAGTTTCCTCCAAGGGTAAAAAGCGGCGCCCCCCGGGG
    GGGGGAAACCCCTCCGCCACCGCGGGAGGGGGCGCTCCTATGGGGGGAGGGGGGGACCCC
    TCTATGCCCCCCCCGCCGCCCCCCCCGGCCGCCGCCCCCCCTCAAAGCGACGCTCTGTAC
    GCTCTCGGCCCCGTGGTCCTTTCGGGCCATTTTCTGCCCTTTGGAAACTCCGGAGGGTTT
    TTTGGGGGGGGGGCGGGGGGTTACACGGCCCCCCCGGGGCTGAGCCCGCAGATTTAAATA
    ATAACTCTGACGTGGGCAAGTGGGCCTTGCTGAGAAGACAGTGTAACATAATAATTTGCA
    CCTCGGCAATTGCAGAGGGTCGATCTCCACTTTGGACACAACAGGGCTACTCGGTAGGAC
    CAGATAAGCACTTTGCTCCCTGGACTGAAAAAGAAAGGATTTATCTGTTTGCTTCTTGCT
    GACAAATCCCTGTGAAAGGTAAAAGTCGGACACAGCAATCGATTATTTCTCGCCTGTGTG
    AAATTACTGTGAATATTGTAAATATATATATATATATATATATATCTGTATAGAACAGCC
    TCGGAGGCGGCATGGACCCAGCGTAGATCATGCTGGATTTGTACTGCCGGAATTC
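
    A blastx search translates the nucleotide query in all six reading frames before comparing it with proteins, which is why words written in one-letter amino acid codes can show up in the alignments. The sketch below is a minimal six-frame translation using the standard genetic code; it is not how BLAST is implemented, and the function names and the 12-base demo sequence are made up for illustration. The words of a message such as MARK WAS HERE NIH need not be contiguous in a translation, so it is safer to search for each word separately.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations keyed by frame name: +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_word(dna, word):
    """List the frames whose translation contains the given peptide word."""
    return [name for name, protein in six_frames(dna).items() if word in protein]


# Made-up demo sequence (not taken from the novel) that encodes M-A-R-K.
demo = "ATGGCACGTAAA"
print(six_frames(demo))
print(find_word(demo, "MARK"))
```

    To try it on the exercise, join the sequence lines above into one string (dropping the header line) and call find_word once per word.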
    


  2. Higher eukaryotic genomes contain large amounts of repetitive DNA. The most abundant interspersed repeat in the human genome is the Alu element. Alus tend to occur near genes, within the introns of genes, or in the regions between genes. In some cases, their presence and absence can fairly accurately show the intron-exon structure of a gene. Demonstrate this by performing a nucleotide-nucleotide BLAST search against the Alu database with the genomic sequence of the human Von Hippel Lindau syndrome gene (Accession AF010238). Note that the exons appear in the BLAST graphic as places where the Alu elements do not align.


  3. The Caenorhabditis elegans gene SMA-4 is a member of the dwarfins gene family, also called the MAD family, which plays a role in transforming growth factor beta-mediated signal transduction. In this example we will attempt to find homologs for the SMA-4 protein (SMA4_CAEEL, Accession P45897) in vertebrate species using protein-protein BLAST.

    Of course, this protein already is in the Entrez Protein and BLAST databases. Remember that if the goal is to find a homolog in another species for a protein that is already present in the Entrez system, it is not necessary to perform a BLAST search; the precalculated similarities are already available through BLink. Verify this by following the BLink link from P45897 in Entrez Protein. Click the best hits button and find the best protein hit to chicken (Gallus gallus). The alignment between SMA-4 and the best chicken match is available by clicking on the linked BLAST score.

    To simulate performing a BLAST search with a novel protein, we will use an Entrez query to remove all Caenorhabditis proteins from the BLAST database.

    Link to the protein-protein BLAST page and enter the SMA-4 accession number (P45897) in the Search text area. We will search against the default, nr, database. In order to remove the Caenorhabditis proteins from the nr database, enter the following Entrez search in the "Limit by Entrez query" box under the "Options" section of the form:

    protein all[Filter] NOT Caenorhabditis[Organism]
    
    Because there are a large number of related proteins in the BLAST database, we also need to increase the number of descriptions or BLAST hits that will be shown. Do this by increasing the number of descriptions to 500 in the "Format" section of the BLAST form. Run the search by clicking the BLAST button.

    On the formatting page, you can see that the CD-search has identified conserved domains in this protein. You can click on the graphic to see what these domains are and what their function is.

    Click the format button to retrieve your BLAST results. Look at your BLAST graphical output and verify that the Entrez query eliminated the protein from the database; you should see no full-length matches. Now look at your descriptions and their e-values. Among the non-significant e-values (> 1) there are two proteins from sheep (Ovis aries) labeled as MAD proteins (Smad4 and Smad7). These protein fragments are homologs of SMA-4, but we did not demonstrate that with this particular search. In the following exercise we will show using PSI-BLAST that these sheep proteins are significant matches to SMA-4. Be sure to retain your formatting page for these results or copy your request ID so you can format them for PSI-BLAST for the next exercise.

    Look at the BLAST output and find all chicken (Gallus gallus) proteins that are similar to SMA-4. (Use the Tax Blast link at the upper left of the graphic to help in finding the chicken proteins.) These should be the same proteins found by BLink previously.

    Open a new browser window so you don't lose your results against the nr and run the same search again. Restrict the search to chicken proteins using the Entrez query option as you did before. This time use the query

    Chicken[Organism]
    Are the same proteins found? Compare the expectation values of these hits to the same hits found against nr with no organism restriction. Why are the e-values different for the same scores and alignments?


  4. The Sma-4 protein that we used previously belongs to a large family of proteins. (Jump back to the protein-protein search). Some members of this family are not readily identified in an ordinary blastp search; however, additional Sma-4 homologs can be found by using the more sensitive position-specific iterated BLAST (PSI-BLAST). Any protein-protein BLAST search on the NCBI web pages can be extended to a PSI-BLAST search simply by reformatting the results. Check the "Format for PSI-BLAST" box on the formatting page for the first search that you saved from the exercise above and click format.

    The results are the same except that they are formatted differently. There is a line across the descriptions section of the results corresponding to the PSI-BLAST inclusion threshold of 0.005. Position-specific information from a multiple sequence alignment of the sequences above this line is used to generate a position-specific score matrix (PSSM) in the next iteration. Notice that one of the first proteins below this line is the Smad4 from the sheep (Ovis aries). What is the e-value of this hit?

    Now click the "Run PSI-BLAST iteration 2" button. Note that the Formatting page is refreshed in its separate window, generating a new Request ID number. Click the "Format" button and the results of iteration 2 will load. Click on the "Skip to the first new sequence" link on the Iteration 2 results page. What is this sequence? What is its new expect value? Notice that there are now several new sequences above threshold. Some of them are not annotated as Sma/Mad homologs but are clearly significant hits. These new sequences will be used to construct a new PSSM for iteration 3 and so on. After a few more iterations no more sequences will be found; at this point the search is said to have converged.
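
    The idea of a position-specific score matrix can be illustrated with a deliberately simplified sketch: count the residues in each column of a multiple alignment, add a pseudocount, and take the log-odds against a background frequency. Everything here is an assumption made for illustration (a uniform background, a fixed pseudocount, no sequence weighting, a made-up four-sequence alignment, and the hypothetical names build_pssm and score); PSI-BLAST itself computes its matrix differently.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background: a simplification
    pssm = []
    for col in range(len(alignment[0])):
        # Gap characters and other non-residue symbols are ignored.
        column = [seq[col] for seq in alignment if seq[col] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, peptide):
    """Sum the column scores for a peptide of the same length as the PSSM."""
    return sum(col[aa] for col, aa in zip(pssm, peptide))


# Made-up toy alignment, not real Sma/Mad sequences.
toy_alignment = ["MKVL", "MRVL", "MKIL", "MKVL"]
pssm = build_pssm(toy_alignment)
for peptide in ("MKVL", "MRVL", "AAAA"):
    print(peptide, round(score(pssm, peptide), 2))
```

    Residues that dominate a column score above zero and residues never seen there score below zero, so a sequence is rewarded for matching the family at each position rather than for matching any single member.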


  5. The prion protein is found in high concentrations in the brains of humans and other mammals. In certain degenerative neurological diseases, prion proteins aggregate into polymers. Several of these prion diseases seem to be transmissible. Perhaps the most remarkable aspect of these is that the infectious agent appears to be an aberrant form of the prion protein itself. Bovine spongiform encephalopathy (BSE) is one of the transmissible prion diseases that has received much recent notoriety. There are a number of polymorphisms that have been identified in the prion proteins for several mammals, notably human, mouse, and sheep. Some of these are associated with inherited prion diseases and some with susceptibility to transmissible forms. Retrieve the SWISS-PROT record for the human prion protein (PRIO_HUMAN) and look at the FEATURE table to see the various polymorphisms. Use this protein to perform a translated BLAST search (PROTEIN query - TRANSLATED database) against human ESTs and look at your results to see if any of these polymorphisms are present in the EST data. This is easier to see if you change the formatting options on the BLAST form to display one of the query-anchored alignment options. Try the "flat query-anchored with identities". (See the problem on prion SNPs in the Genomes section.)


  6. The human fragile histidine triad protein (FHIT, Accession P49789) is structurally related to galactose-1-phosphate uridylyltransferases. However, this relationship is not apparent in an ordinary BLAST search. Perform a protein-protein BLAST search against the SWISS-PROT database with P49789 and search your results for galactose-1-phosphate uridylyltransferases. Now use PSI-BLAST to verify the relationship between these two protein families.


  7. A frequent use of nucleotide-nucleotide BLAST is to check oligonucleotides for hybridization or PCR. The goal most people have when doing this is to make sure that the primer will give a unique product from the target genome or cDNA population. Because BLAST is local and searches both strands, one can simply concatenate a pair of +/- strand primers and use them in a single search. Combine the following pair of candidate PCR primers in a nucleotide-nucleotide search against the default nucleotide database and identify the gene amplified.

    F12 GTCAAGTGGCAACTCCGTCAG          
    
    R8  TTGAGAGATGGATTGTTGCGC

    Now try these modified primers. There is one mismatch in each near the middle.

    F12_mod GTCAAGTGGCTACTCCGTCAG          
    
    R8_mod  TTGAGAGATGTATTGTTGCGC 
    

    Notice that the previous hits are completely missing. Now adjust the Word Size from 11 to 7 under the BLAST Advanced Options and try the search again. Do you find the original hits again? Are they still among the best hits? Can you devise a modification in the search strategy that will make them the best hits again?
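
    The effect of the Word Size setting on these primers can be checked with a small sketch. Nucleotide BLAST needs an exact match at least one word long before it tries to extend an alignment, so a single central mismatch in a 21-base primer matters a great deal. The code below assumes, purely for illustration, that the database target is identical to the original primer; the function names are hypothetical and this is not BLAST's own seeding code.

```python
def longest_exact_run(a, b):
    """Length of the longest run of identical positions in two aligned strings."""
    best = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        best = max(best, current)
    return best


def can_seed(a, b, word_size):
    """True if the two strings share an exact match of at least word_size."""
    return longest_exact_run(a, b) >= word_size


F12, F12_MOD = "GTCAAGTGGCAACTCCGTCAG", "GTCAAGTGGCTACTCCGTCAG"
R8, R8_MOD = "TTGAGAGATGGATTGTTGCGC", "TTGAGAGATGTATTGTTGCGC"

# The two primers concatenated into a single query, as described in the exercise.
query = F12_MOD + R8_MOD

for name, original, modified in (("F12", F12, F12_MOD), ("R8", R8, R8_MOD)):
    run = longest_exact_run(original, modified)
    print(name, "longest exact run:", run,
          "| seeds at W=11:", can_seed(original, modified, 11),
          "| seeds at W=7:", can_seed(original, modified, 7))
```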


  8. As the database grows, so does the number of chance occurrences of amino acid motifs that spell out words or people's names in single-letter amino acid codes. One such name motif is ELVIS. Find the number of occurrences of ELVIS in the protein nr. To get any hits at all, you will have to adjust several of the advanced BLAST parameters including the Expect value, Word size, and Score Matrix. Adjust some of these in the "Other advanced options" box. Options are entered in a command-line style. For example, typing

    -e 10000
    sets the Expect value cut-off to 10000. Visit the BLAST "Frequently Asked Questions" by following the link on the left side bar of the BLAST page for more information. See especially the entry on "How do I perform a similarity search with a short peptide/nucleotide sequence?". We now have a page with presets optimized to find short nearly exact matches. You can run the search on this page to see the correct parameters to use.
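
    What the adjusted BLAST search approximates is an exact search for a five-letter word in every protein sequence. As a point of comparison, the sketch below counts exact occurrences of a motif in a small set of sequences held in memory. The sequences are made-up toys rather than database records, and the function name is hypothetical; counting against the real protein nr would require downloading that database, which is outside the scope of this exercise.

```python
def count_motif(sequences, motif):
    """Count exact (possibly overlapping) motif occurrences per sequence.

    Returns a dict holding only the sequences with at least one occurrence.
    """
    counts = {}
    for name, seq in sequences.items():
        n = 0
        start = seq.find(motif)
        while start != -1:
            n += 1
            start = seq.find(motif, start + 1)
        if n:
            counts[name] = n
    return counts


# Made-up toy sequences, not real database records.
toy = {
    "toy1": "MKTELVISGGA",
    "toy2": "MAAAGLLK",
    "toy3": "ELVISPELVIS",
}
hits = count_motif(toy, "ELVIS")
print(hits, "total occurrences:", sum(hits.values()))
```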

   Genome Resources Questions

  1. Mycobacterium tuberculosis. (You may want to launch two browsers to do this example.) Display the distribution of BLAST hits by Taxa for each and compare the distribution of homologs. Which organism has more best hits to Eukaryotes? Now display the BLAST hits for each by COGs (clusters of orthologous groups). The tuberculosis organism has a disproportionate portion of the genome devoted to metabolism of what class of biomolecules?
  2. Visit the COGs pages. It has been noted that in some respects the proteome of archaea retains more similarity to eukaryotes than to bacteria. Use the phylogenetic patterns search to count the number of COGs shared between the yeast (COGs code Y) and the archaea only (A, O, M, P, K, Z). Now count the number of COGs shared between the yeast and the bacteria only (Q, V, D, R, L, C, E, F, H, S, N, U, J, X, I, T, W). Use the "Differences in closely related genomes" option at the bottom of the phylogenetic patterns search page to compare the COGs for the three enterobacteria (two E. coli strains and the highly specialized aphid endosymbiont Buchnera sp. APS). The column containing an "x" for both the E. coli strains and a "-" for Buchnera shows the COGs that are missing in the endosymbiont. These pathways are apparently provided by the host cells of the insect.
  3. UniGene is the best NCBI resource to use to find out to what gene (or suspected gene) a particular database sequence belongs. This is especially true for ESTs where there may be no annotations on the sequence, but may also be important for other sequences where the annotation may be incomplete or obsolete. Database identifiers for UniGene searches may come from BLAST output or from microarray (hybridization) data. For example, mRNA that hybridized to the EST sequence with accession number BG618105 was highly expressed in a human liver tumor sample.

    1. Retrieve the record from the nucleotide database using the accession number in the search box on the NCBI homepage. Display the record. Is there any annotation indicating what gene this is?

    2. Now link to UniGene from the "Links" menu in the upper right. What is the name of this gene? Link to LocusLink from the UniGene cluster. What is the function of this protein?

    3. Go back to UniGene. Look at the ESTs in this cluster. How many are there? Identify a pair of ESTs (a 5' and 3' read) that come from the same clone ID. You'll need to display all ESTs and scroll down to see these. Use BLAST 2 sequences to align these to the full-length RefSeq mRNA from the LocusLink entry. Note the mismatches that are most likely due to sequencing errors in the ESTs.

    4. Expression information is implied by the sources of the cDNA libraries in a particular cluster. NCBI also has linked tag counts from quantitative SAGE libraries to the UniGene clusters. Follow the "Gene to Tag" mapping link to see a "virtual Northern" display of the counts of reliable tags from this cluster in SAGE libraries. What library shows the highest relative expression of this gene?

    5. On the LocusLink page use the main map viewer link (mv) in the "Map Information" section to display this gene in the MapViewer. What chromosomal region is this? What maps are displayed? You can click on the map name at the top to learn more about the information displayed for each map. Uncheck the "Compress Maps" option on the left-hand-side to see the full marker labels. The UniGene map shows the density of EST hits on the genome. Generally the peaks in this histogram highlight the exons of expressed genes. Notice that there are some hits that don't correspond to the exons shown in the gene model on the Genes map. What could these represent? To see another view of the alignment based gene model follow the "ev" link to display this in the evidence viewer.

    6. Use the zoom graphic on the left-hand side of the map viewer to zoom out and display two other members of this small gene family, AFP and AFM. Are these in the same orientation? There is also a fourth member of this small family, called GC, somewhat removed from these but also on chromosome 4. Display the entire region between GC and AFM by typing these symbols in the "Region Shown" boxes on the left-hand side and pressing the "Go" button.

    7. From the LocusLink entry, click on the mouse gene symbol entry under mapping information to display the corresponding mouse LocusLink record. Follow the mouse map viewer link to display the corresponding region in the mouse map viewer. What chromosome is this? Adjust the view to see if the same gene family is present with the same structure in the mouse genome. Link to the contig record for this region of mouse genome from the map viewer. How large is this contig? Examine the bottom of the record and notice that it is assembled from both BAC clone (draft and finished) and whole genome shotgun sequence. Retrieve one of the whole genome shotgun pieces (e.g. CAAA01153721). Link from this record to the master record for the mouse whole genome shotgun project (CAAA01000000). How many records are in this set?


  4. The gene causing the juvenile form of nephronophthisis was recently identified on human chromosome 1. We will use related protein and nucleotide records to identify this gene in other species. Retrieve the human NPHP4 entry from LocusLink. This protein apparently has a homolog in C. elegans. Demonstrate this by following the BLink link (BL) next to the provisional RefSeq protein in this entry. Clicking the "Best hits" button will make it easier to identify. Notice that there is also a homolog in mouse. Retrieve the mouse protein by linking through the Accession number. Display the linked nucleotide sequence. Use this Accession number (AY118229) in rat genome BLAST to find the gene in the rat genome. Search against the genome assembly. What supercontig did you hit? On what rat chromosome is this gene? Display your results in the Map Viewer by clicking on the Genome View button that appears on the BLAST results page and link to the contig map element. Use the "Maps and Options" link and add the "Genes" map to the display. Is this gene annotated on the rat Map Viewer?

    Your BLAST hits imply an exon-intron structure for this gene. How many exons do your BLAST hits imply? How large is this gene? You can make a more precise alignment-based model for this gene using the Spidey tool. To do this you will need to adjust the base pair range displayed on the Map Viewer to the smallest interval that contains all of the BLAST hits. Then get this sequence using the "Download/View Sequence/Evidence" link. Display the genomic region in the browser and save it to disk. Use this genomic sequence on the Spidey page. Use the mouse cDNA (AY118229) you used before for the mRNA sequence.

    Use the other genomic BLAST pages to try to find a homolog in other vertebrates (fugu, zebrafish). You'll probably need to search at the protein level (AAM78559, tblastn) to find any lower vertebrate homologs.
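
    The exon count and gene size implied by a set of genomic BLAST hits, as asked earlier in this exercise, come down to merging overlapping hit intervals and measuring the span they cover. The sketch below shows that bookkeeping on invented coordinates; the function names and numbers are hypothetical, intervals are assumed to be 1-based, inclusive and written with start <= end, and a splice-aware tool such as Spidey will place exon boundaries more precisely than this.

```python
def merge_hits(hits, max_gap=0):
    """Merge overlapping or adjacent (start, end) intervals on the genome.

    Intervals are 1-based and inclusive; hits separated by at most max_gap
    unaligned bases are also merged into one putative exon.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + max_gap + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gene_span(hits):
    """Number of genomic bases from the first hit start to the last hit end."""
    return max(end for _, end in hits) - min(start for start, _ in hits) + 1


# Hypothetical genomic coordinates, invented for illustration only.
hits = [(1000, 1150), (1140, 1200), (5000, 5120), (9800, 9950)]
exons = merge_hits(hits)
print("putative exons:", exons)
print("exon count:", len(exons), "| span in bp:", gene_span(hits))
```

    The start of the first merged interval and the end of the last one also give the smallest base pair range that contains all of the hits, which is the range needed before downloading the genomic sequence for Spidey.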

  5. The following amplified DNA sequence is associated with a human disease gene polymorphism:

    TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAGTGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGAAC
    ATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGATGCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGATGGGAC
    CTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGGGGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCCTGGATCAGC
    CCCTCATTGTGATCTGGG
    Use this sequence in the human genome BLAST service to identify this gene. Follow the linked identifier on the BLAST results to display your results in the Map Viewer. On what chromosome is this gene? What gene is it? Examine the BLAST alignment to identify the position and nature of the polymorphism. In what exon is this?

    We can now see if this polymorphism has been mapped to the genome from the SNP database. Use the "Maps and Options" link to add the Variation map to the display. Zoom in to the region by placing the mouse pointer over the map and clicking to display the pop-up zoom menu. Choose an appropriate level to see the polymorphisms in the region of interest. Find the coding region SNP that maps to the same place as your polymorphism identified by BLAST. Link from the Map Viewer to the RefSNP record. Does this SNP imply a change in the amino acid sequence? What is it? (You will notice that there are multiple splice variants for this gene, but the amino acid change is consistent in all of those that contain this coding exon.)

    This is a well known polymorphism in the HFE gene that causes hemochromatosis when homozygous. From the RefSNP record you can link to OMIM to learn more about this. You can also follow the links to 3D structure mappings to display the position of this polymorphism in the structure (1A6Z) of the HFE protein. Based on this, why does this amino acid change have a detrimental effect on the function of this protein?

    Try finding the mouse homologue of HFE by performing a BLAST search on the mouse genome page with the above human sequence fragment. First do this with the Megablast box checked. Do you find any hits? Now try it with Megablast off. Do you find it now?


  6. The human MLH1 gene, a DNA mismatch repair protein, is mutated in a form of familial colon cancer. Go to LocusLink and retrieve the entry for human MLH1. Link to the variations mapped for this gene. (You can click on the "Var" link in the full report or the "V" in the summary.) What non-synonymous substitutions are reported? Note the Isoleucine-Valine polymorphism at position 32 of the protein. Could such a substitution affect the structure and function of MLH1? Some light can be shed on this question by examining the sequences and structures of homologous proteins from other organisms. You can view the E. coli mutL structure and align its sequence with the human MLH1 in Cn3D. Do this by following the "BL" link (BLink) from the MLH1 LocusLink report. Once you have the BLink report, click on the 3D structures button. Then retrieve the E. coli structure and sequence alignment by clicking on the blue dot next to 1B63A. What amino acid does the E. coli protein have at the equivalent position to Ile-32 in the human protein?
  7. Use LocusLink to find the entry for the human glyceraldehyde 3-phosphate dehydrogenase gene. Click on the Map Viewer link (mv) to find the map location and the contig containing the GAPD gene. Zoom in to see the exon-intron structure of the gene. How many exons are there? Now use human genome BLAST to verify the location and structure of this gene. Use the GAPD RefSeq (NM_002046) to perform this search. Set both the alignments and descriptions to 250. How many contigs do you hit in the human genome? Click on the Genome View button to see the distribution of these hits on the genome. Look at some of the high-scoring single hits to see what's unusual about them. How can you account for these results?