Using NCBI's BLAST: Exercises

   Basic Web BLAST

  1. Learning Goals

    The following is the product of a gene in C. elegans that seems to be important in development.

       >C.elegans protein
       MFHPGMTSQPSTSNQMYYDPLYGAEQIVQCNPMDYHQANILCGMQYFNNSHNRYPLLPQMPPQFTNDHPY
       DFPNVPTISTLDEASSFNGFLIPSQPSSYNNNNISCVFTPTPCTSSQASSQPPPTPTVNPTPIPPNAGAV
       LTTAMDSCQQISHVLQCYQQGGEDSDFVRKAIESLVKKLKDKRIELDALITAVTSNGKQPTGCVTIQRSL
       DGRLQVAGRKGVPHVVYARIWRWPKVSKNELVKLVQCQTSSDHPDNICINPYHYERVVSNRITSADQSLH
       VENSPMKSEYLGDAGVIDSCSDWPNTPPDNNFNGGFAPDQPQLVTPIISDIPIDLNQIYVPTPPQLLDNW
       CSIIYYELDTPIGETFKVSARDHGKVIVDGGMDPHGENEGRLCLGALSNVHRTEASEKARIHIGRGVELT
       AHADGNISITSNCKIFVRSGYLDYTHGSEYSSKAHRFTPNESSFTVFDIRWAYMQMLRRSRSSNEAVRAQ
       AAAVAGYAPMSVMPAIMPDSGVDRMRRDFCTIAISFVKAWGDVYQRKTIKETPCWIEVTLHRPLQILDQL
       LKNSSQFGSS
    Use the protein-protein BLAST page to search the non-redundant protein database (nr) for vertebrate homologues of this protein.

    Set up this search.


    Based on the identical hit to C. elegans, what is the identity of this protein?

    Aside from the C. elegans proteins, what is the most significant hit? What is its E-value? What is the identity of the least significant hit, and what is its E-value? Find the protein with the E-value closest to 1. In principle, this should be the first non-homologue in the output. Link to this record from the search results. Is it a homologue? We'll show later, using PSI-BLAST, that this protein is in fact a homologue. Be sure to note the Request ID (RID) of the present search so we can use it again later on.

    (Jump ahead to the PSI-BLAST example.)

    Go back to your results and find all chicken (Gallus gallus) proteins that are similar to this protein. The Tax Blast link at the upper left of the graphic makes the chicken proteins much easier to identify. How many similar chicken proteins are there? What are the identity and E-value of the best chicken hit? Later on, we'll perform this search again using an advanced option to restrict it to chicken sequences only. (Jump ahead to this example.)

    Your BLAST results show that this protein is identical to a protein already in the database. For proteins already in the database, there is generally no need to run this type of BLAST search.

    We have already compiled BLAST results for all proteins in our Entrez system. Load the SWISS-PROT record for the C. elegans Sma-4 protein (P45897). Link to the precompiled BLAST results for this protein by clicking on the BLink link in the upper right-hand side of the page. The BLink output shows a graphic depicting the extent of the alignment with the query (P45897); the BLAST raw score is linked to the Blast 2 Sequences alignment; the accession number is linked to the record in Entrez; and the gi is linked to the corresponding BLink output for that sequence. The taxonomic distribution numbers in the boxes show that this signaling pathway is probably restricted to multicellular animals (metazoa). Buttons at the top of the page allow you to sort your results in various ways. Click on the "Best Hits" button to see the best hit for each species. Click on the "Common Tree" button to see a phylogenetic view of the taxonomy of the organisms involved. Which two species of ray-finned fishes (Actinopterygii) have hits to Sma-4?


  2. Learning Goals

    The prion protein is found in high concentrations in the brains of humans and other mammals. In certain degenerative neurological diseases, prion proteins aggregate into polymers. Several of these prion diseases seem to be transmissible. Perhaps the most remarkable aspect of these diseases is that the infectious agent appears to be an aberrant form of the prion protein itself. Bovine spongiform encephalopathy (BSE) is one of the transmissible prion diseases and has received much recent notoriety. A number of polymorphisms have been identified in the prion proteins of several mammals, notably human, mouse and sheep. Some of these are associated with inherited prion diseases and some with susceptibility to transmissible forms.

    Retrieve the SWISS-PROT record for the human prion protein (PRIO_HUMAN) and look at the FEATURE table to see the various polymorphisms. Use this protein to perform a translated BLAST search (PROTEIN query - TRANSLATED database) against human ESTs, and look at your results to see if any of these polymorphisms are present in the EST data. This is easier to see if you change the formatting options on the BLAST form to display one of the query-anchored alignment options. Try the "flat query-anchored with identities" option.


    Set up this search.


   Advanced Options
  1. Learning Goals

    In this example, perform a protein-protein BLAST search with the C. elegans Sma-4 protein we used previously. (Jump back to the previous search.) This time use the "Options for advanced blasting" and use an Entrez query to restrict the search to chicken sequences only. You can enter the query in the left-hand box by typing:

    	   Gallus gallus[organism]
    The following query also works:
    chicken[organism]

    Set up this search.


    The chicken[organism] query works because chicken is the accepted common name in our Entrez system for the species Gallus gallus. (Simply typing the common name without the organism restriction won't necessarily give the same results. The Entrez help document provides details about formulating Entrez queries.) You can also select the organism from the pull-down list on the right. Compare the expect value of the best BLAST hit with the expect value obtained from the previous search of the entire nr. Is the hit more or less significant? Why?
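
    One way to reason about this comparison is the standard relationship between a bit score S' and the expect value, E = m * n * 2^(-S'), where m is the query length and n is the total length of the database searched. The sketch below only illustrates that relationship. The database sizes and the bit score are made-up round numbers, not the real sizes of nr or of its chicken subset, and real BLAST uses edge-corrected "effective" lengths rather than raw ones.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expect value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)

# All numbers below are hypothetical and chosen only for illustration.
query_len = 570                 # query length in residues
full_db_len = 1_000_000_000     # total residues in a large database
subset_db_len = 10_000_000      # total residues in an organism-restricted subset
bit_score = 200.0               # same alignment, so same bit score in both searches

e_full = expect_value(bit_score, query_len, full_db_len)
e_subset = expect_value(bit_score, query_len, subset_db_len)

print(f"E against full database:   {e_full:.3g}")
print(f"E against restricted set:  {e_subset:.3g}")
print(f"ratio full / restricted:   {e_full / e_subset:.0f}")
```

    Under these assumptions the same alignment receives an expect value that scales in direct proportion to the size of the search space.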

    It may be helpful in many cases to eliminate certain taxa in BLAST searches. For example:

    protein all[filter] NOT human[organism]
    Try the search with Sma-4 using an Entrez query to eliminate sequences from mammals.
   PSI- and RPS-BLAST

  1. Learning Goals

    The C. elegans protein Sma-4 that we used previously belongs to a large family of proteins involved in TGF-beta mediated signaling. (Jump back to the blastp search.) Some members of this family are not readily identified in an ordinary blastp search. However, additional Sma-4 homologues can be found by using the more sensitive position-specific iterated BLAST (PSI-BLAST). Start a PSI-BLAST search with the accession number for the C. elegans Sma-4 protein, P45897, as the query. You will need to increase the number of descriptions in the Format section of the page to 500.


    Set up this search.


    The initial results from PSI-BLAST are the same as those of the previous blastp search, except that they are formatted differently. There is a line across the results corresponding to the PSI-BLAST inclusion threshold of 0.005. Position-specific information from a multiple sequence alignment of the sequences above this line is used to generate a position-specific score matrix (PSSM) for the next iteration. Notice that one of the first proteins below this line is Smad7 from sheep (Ovis aries). What is the E-value of this hit?
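
    To make the idea of a PSSM concrete, here is a deliberately simplified sketch that turns a small gap-free alignment into per-column log-odds scores. It is not the PSI-BLAST algorithm: real PSI-BLAST weights sequences, uses data-dependent pseudocounts and real background frequencies, whereas this sketch assumes a uniform background of 1/20 and a flat pseudocount of 1. The function name and the toy alignment are invented for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: score} dict per alignment column.

    Scores are log-odds in half-bit units, 2 * log2(frequency / background),
    using a uniform background of 1/20 (a simplifying assumption).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column_scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column_scores[aa] = round(2 * math.log2(freq / background))
        pssm.append(column_scores)
    return pssm

# Toy alignment (made up): column 0 is an invariant D, column 1 varies.
toy_alignment = ["DA", "DG", "DS", "DT"]
pssm = build_pssm(toy_alignment)

print("D at conserved column:", pssm[0]["D"])
print("A at conserved column:", pssm[0]["A"])
print("A at variable column: ", pssm[1]["A"])
```

    The conserved column rewards its invariant residue more strongly than the variable column rewards any of its residues, which is why a PSSM can detect homologues that a single general-purpose matrix misses.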

    Now click the "Run PSI-BLAST iteration 2" button. Note that the Formatting page is refreshed (in its separate window), generating a new Request ID number. Click the "Format" button and the results of iteration 2 will load. Click on the "Skip to the first new sequence" link on the Iteration 2 results page. What is this sequence? What is its new expect value? Notice that there are now several new sequences above the threshold. Some of them are not annotated as Sma/Mad homologues but are clearly significant hits. These new sequences will be used to construct a new PSSM for iteration 3, and so on. After a few more iterations no more new sequences will be found; at this point the search is said to have converged.


  2. Learning Goals

    Ethylene is an important hormone in higher plants and is involved in the process of fruit ripening. The rate-limiting step in the biosynthesis of ethylene is catalyzed by 1-aminocyclopropane-1-carboxylate (ACC) synthase, and this enzyme and related enzymes are found in many disparate species. Perform a PSI-BLAST search using ACC synthase from the banana (accession AAD22099) against the pdb database.


    Set up this search.


    From the CD results, what conserved domain is present in this sequence? What is the best hit to pdb?

    Continue on to the second iteration. How many new hits do you find? Search your results for 1FG7A, histidinol phosphate aminotransferase. What is its E-value?

    Make a note of the gi number of the best pdb hit (see your BLAST output), and use BLAST 2 sequences to align the best pdb hit with 1FG7A (gi 15826679). Be sure to use blastp! Retrieve the structure record (follow the links from the BLAST output) for the best hit to pdb and display the list of structure neighbors. Look for 1FG7A in the list of neighbors. Display the structural alignment between the best pdb hit and 1FG7A in Cn3D. Compare the structural alignment with the sequence alignment you made previously.


   Genomic BLAST Pages


  1. Learning Goals

    Use BLAST to find a mouse genomic sequence that includes the gene implicated in Menkes Syndrome (a copper-transporting ATPase, ATP7a). You can use the RefSeq entry for the mouse mRNA, NM_009726, as the query, but choosing the database to search is critical. For example, a search against nucleotide nr will not work, and a search against the trace database is not useful. Try the search against the mouse assembly, Arachne_Feb01, from the Mouse Genome BLAST page. How many database sequences did BLAST find, and how many BLAST hits? Now use the Spidey tool to align the mRNA with the FASTA sequence of the contig containing ATP7a (cut and paste the genomic sequence into the Spidey window). Note the correlation between the number of BLAST hits and the number of exons in ATP7a. Now find the human ATP7a gene by using the Human genome BLAST page and the same mouse mRNA. The human gene also has 22 exons. Why is the number of BLAST hits less than 22? (Hint: use blastn rather than megablast to search against the human genome.)


   Standalone BLAST and the BLAST Client

  1. Learning Goals

    All of the BLAST programs can be installed locally to run on most computer platforms. Executables and databases are available from the NCBI ftp site. The BLAST package is already installed on this PC. Click on the "My Computer" icon on your desktop to navigate to the Blast directory. It should be on the "Local Disk" or "C Drive". There should be a set of 12 executable programs, a set of README files, a data directory and a db directory.

    In this exercise we'll use formatdb.exe to create a binary-formatted, BLASTable version of the SwissProt database from a file of FASTA sequences. (See README.formatdb.)

    The BLAST programs are all run from the command prompt. On Windows, find a command prompt by going to Start-> Programs (or Start-> Programs-> Accessories). After launching the command prompt, change to the Blast directory by typing at the prompt (C:\>):

    cd \NCBI\Blast [Enter]
    You should see:
    C:\NCBI\Blast>
    To see a listing of the files in the Blast directory, type:
    DIR [ENTER]
    You can invoke any of the BLAST programs by entering the name of the program. They all require various arguments, such as the database name, the input file, and the output file; and also accept other options. Arguments and options are preceded by a dash or minus sign '-'. You can see a list of all possible arguments and options for each program by entering the name and just the dash. Type:
    formatdb - [Enter]
    You should see a list of all formatdb options.

    The FASTA version of the SwissProt database is also in the C:\NCBI\Blast directory. Run formatdb on the swissprot database:

    C:\NCBI\Blast>formatdb -i swissprot -p T -o T [Enter]
    Now list the contents of C:\NCBI\Blast:
    C:\NCBI\Blast> DIR [Enter]
    There should be seven new files that are the components, called index files, of the formatted swissprot database.
  2. Learning Goals

    In this example we will use the E. coli purF protein, accession P00496. Load the record in FASTA format into a new web browser window. Use the File menu of your web browser to save it to the Blast directory on your machine. Give it the file name purf.txt. We will use this protein sequence to search the swissprot database with blastall. Blastall performs all five flavors of standard BLAST searches: blastn, blastp, blastx, tblastn and tblastx (see README.bls).

    Run blastall with a '-' from the Blast directory on your machine to see all the blastall options.

    C:\NCBI\Blast>blastall - [Enter]
    Now perform a protein search against swissprot using the purf sequence.
    C:\NCBI\Blast>blastall -p blastp -i purf.txt -d swissprot -o out.txt [Enter]
    View the contents of out.txt by launching a new browser window from the File menu of your browser, then use File->Open and browse to the file. You can also display the file one screen at a time using the "more" command.
    C:\NCBI\Blast>more out.txt [Enter]
    Tap the spacebar to continue scrolling the screen.

    What kind of enzyme is purF?

    What is the highest scoring eukaryotic protein? Not easy to tell is it?

    Let's use another option to restrict the blastall search to only eukaryotic proteins. To do this we will first make a file of all eukaryotic protein identifiers using the Entrez system. From the NCBI homepage click on TaxBrowser. Enter eukaryota in the search box. Click on the Eukaryota link in the TaxBrowser. Now select the radio button for Protein and click "Submit Query". This will retrieve all eukaryotic proteins. Restrict this to swissprot records by adding

    AND srcdb swiss-prot[properties]

    in the Entrez search box and clicking "Go." Now change the display option to "GI list" and click the "Display" button. Use the "Save" button to save this to the C:\NCBI\Blast directory. Give it the name euk.gil. Check to be sure the file is now in the Blast directory. Run blastall, restricting the search to this list of sequences with the '-l' option.
    C:\NCBI\Blast>blastall -p blastp -i purf.txt -d swissprot -l euk.gil -o out.txt [Enter]
    Now examine the results in out.txt to find the highest scoring eukaryotic protein.

    BLAST output such as this can be voluminous and often needs to be processed to extract useful information. People commonly do this in an automated way using a text-processing script, often called a BLAST parser. These are typically written in Perl. Certain types of BLAST output are easier to parse than others. The XML format and the tabular format are intended to be computer-readable; they are produced by adding the "-m 7" (XML) or "-m 8" (tabular) option to your blastall command line.
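
    As an illustration of why the tabular format is convenient, the following minimal sketch reads "-m 8" style output, which has twelve tab-separated columns per hit: query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value and bit score. The sample lines and accessions are made up and are not results from the purF search; the Hit type and function name are likewise invented here.

```python
import csv
import io
from typing import NamedTuple

class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float

def parse_tabular(handle):
    """Yield one Hit per data line of 12-column tabular BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        yield Hit(row[0], row[1], float(row[2]), int(row[3]),
                  float(row[10]), float(row[11]))

# Made-up example lines in the tabular layout; a real run would open out.txt.
sample = (
    "query1\tgi|111|sp|X00001|AAA_HUMAN\t45.20\t300\t150\t5\t1\t300\t10\t305\t2e-60\t230\n"
    "query1\tgi|222|sp|X00002|BBB_YEAST\t30.10\t280\t180\t9\t5\t284\t12\t290\t1e-20\t98.5\n"
)
hits = list(parse_tabular(io.StringIO(sample)))
best = min(hits, key=lambda h: h.evalue)
print(best.subject, best.evalue)
```

    To use it on a real file, pass an open file handle (for example, one opened on a tabular out.txt) instead of the io.StringIO sample.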

    You can parse out the gi numbers from the current output file, out.txt, with this simple JavaScript parser. These can then be submitted to the Entrez system to retrieve the sequences in FASTA or another format.
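
    The same extraction can be sketched in a few lines of Python. This is not the parser linked above; it is a minimal stand-in that assumes sequence identifiers in the report look like gi|NUMBER|..., and it runs on a made-up report fragment rather than on the real out.txt.

```python
import re

GI_PATTERN = re.compile(r"gi\|(\d+)")

def extract_gis(text):
    """Return gi numbers in order of first appearance, without duplicates."""
    seen = set()
    ordered = []
    for gi in GI_PATTERN.findall(text):
        if gi not in seen:
            seen.add(gi)
            ordered.append(gi)
    return ordered

# Made-up report fragment; for a real run, read the contents of out.txt.
sample_report = """\
gi|111|sp|X00001|AAA_HUMAN  Example protein A        230   2e-60
gi|222|sp|X00002|BBB_YEAST  Example protein B         98   1e-20
>gi|111|sp|X00001|AAA_HUMAN  Example protein A
"""
print("\n".join(extract_gis(sample_report)))
```

    Each gi appears in both the one-line descriptions and the alignment section of a report, which is why the sketch removes duplicates before printing one gi per line.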


  3. Learning Goals

    PSI-BLAST is implemented in standalone BLAST as the executable program blastpgp. List all command-line options for blastpgp.

     C:\NCBI\Blast>blastpgp - [Enter]

    The blastpgp program can write out a position-specific score matrix in a human-readable format. In this example we'll create such a matrix using the sequences collected by searching with inositol polyphosphate 1-phosphatase (IPPase). Inositol monophosphatases and related enzymes contain conserved acidic residues that are essential for binding metal ions (Mg++). The PSSM generated should show high scores for residues in these positions. Display the sequence for IPPase in a new browser window and save the sequence with the file name ipp.txt to your C:\NCBI\Blast directory.

    Run blastpgp for 4 iterations with IPPase against the swissprot database and write out the PSSM.

    C:\NCBI\Blast>blastpgp -i ipp.txt -d swissprot -j 4 -Q pssm.txt [Enter]
    

    Open pssm.txt using a web browser as before. Can you identify three conserved acidic residues? Compare their self-substitution scores in the PSSM with those in BLOSUM62. There are also two other residues that are conserved. What are they? Confirm these findings by using CD-search on the web with IPPase. Display the alignment for the inositol_P domain and look for these conserved residues.
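
    The comparison with BLOSUM62 can also be scripted. The sketch below rests on an assumption about the layout of the matrix file: a header line of one-letter residue codes, followed by rows that start with a position number and the query residue and then give twenty integer scores in header order. Check this against your own pssm.txt before relying on it. The two-row sample is made up and is not output from the IPPase search. For reference, the BLOSUM62 self-substitution scores are 6 for aspartate (D) and 5 for glutamate (E).

```python
BLOSUM62_SELF = {"D": 6, "E": 5}  # BLOSUM62 diagonal scores for Asp and Glu

def read_pssm_scores(lines):
    """Return (position, residue, {amino acid: score}) for each matrix row."""
    columns = None
    rows = []
    for line in lines:
        fields = line.split()
        if columns is None:
            # Assumed header line: one-letter residue codes only.
            if len(fields) >= 20 and all(len(f) == 1 and f.isalpha() for f in fields):
                columns = fields[:20]
            continue
        if len(fields) >= 22 and fields[0].isdigit():
            scores = dict(zip(columns, (int(f) for f in fields[2:22])))
            rows.append((int(fields[0]), fields[1], scores))
    return rows

# Made-up two-row matrix in the assumed layout; not output from a real search.
sample = """\
A R N D C Q E G H I L K M F P S T W Y V
1 D -2 -2 1 8 -4 -1 1 -2 -2 -4 -4 -1 -4 -4 -2 -1 -2 -5 -4 -4
2 L -2 -3 -4 -4 -1 -2 -3 -4 -3 2 5 -3 2 0 -3 -3 -1 -2 -1 1
"""
for position, residue, scores in read_pssm_scores(sample.splitlines()):
    if residue in BLOSUM62_SELF:
        print(position, residue, scores[residue], "BLOSUM62:", BLOSUM62_SELF[residue])
```

    An acidic position whose score for its own residue is well above the BLOSUM62 value is a candidate for one of the conserved metal-binding residues.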


  4. Learning Goals

    Except for the web Megablast service, there is no supported way to submit batches of sequences to NCBI for BLAST searching. However, we have a network BLAST client that allows batch searches and sends results back directly to your computer's hard drive. Like the standalone BLAST package, the network BLAST client is available from the NCBI ftp site. Network BLAST is installed on your computer in the C:\NCBI\Netblast directory. The network client executable is called blastcl3.exe. It is run from the command line and has options similar to those of blastall. Launch a command prompt and display all of the blastcl3 options.

    C:\NCBI\Netblast>blastcl3 - [Return]
    In this example we will submit a file of human protein sequences known to be involved in the inflammatory response, and display the best matches to zebrafish ESTs. Display the query sequence file in a web browser window. Save this file to the C:\NCBI\Netblast directory on your computer using the browser's File menu as before. Call the file inflam.txt. We will use a few advanced options to restrict our search and output. Use the '-b' and '-v' options to get only a single description and alignment for each query sequence, and use the '-u' option to apply an Entrez query limit that restricts the search to zebrafish sequences. Note that the '-u' option is analogous to the '-l' option in standalone BLAST; '-u' lets the NCBI servers do the work of creating and applying a gi list.
    C:\NCBI\Netblast>blastcl3 -i inflam.txt -d est_others -p tblastn -u zebrafish[Organism] -v 1 -b 1 -o bcl3out.txt [Return]
    Examine the file bcl3out.txt with the web browser to identify the zebrafish EST clones. You can use the JavaScript parser to strip the sequence IDs as before.