The National Center for Biotechnology Information (NCBI) provides several focused bioinformatics mini-courses (http://www.ncbi.nlm.nih.gov/Class/minicourses/). The mini-courses are either problem-based, such as “Identification of Disease Genes,” or NCBI resource-based, such as “BLAST Quick Start.” Each course is 2.5 h long. The first 90 min are devoted to an overview and an online demonstration of a problem or problem set by an instructor. This is followed by a 1-h hands-on session in which students practice, at their own computers, a problem or problem set similar to the one demonstrated. The courses are taught on the National Institutes of Health (NIH) campus in Bethesda and at academic institutes in the United States.
This chapter describes the mini-course that focuses on the identification of a disease gene using NCBI’s human genome assembly. The reference human genome assembly along with integrated maps, literature, and expression information comprises a powerful discovery system for exploring candidate human disease genes.
The pathway of genetic information transfer in a cell begins with the transcription of genes within a genome to produce mRNAs and ends with the translation of mRNAs to produce proteins. The sequence databases contain genomic, mRNA, and protein sequences representing all three stages in the pathway.

One way to identify genes in a genome is to generate a cDNA library from the pool of RNA messages. To generate a cDNA library, the RNA messages from a tissue or from cells representing a developmental stage are copied into more stable cDNA molecules, which are then placed into an appropriate vector to generate a collection of cDNA clones (vector and the individual cDNA insert). The single pass, short 300–500 nucleotide sequences obtained from sequencing either end of the cDNA insert are called expressed sequence tags (ESTs).
For more background information about genetic terms, the user may refer to the following webpages:
Talking Glossary of Genetic Terms (http://www.genome.gov/glossary.cfm).
The NCBI Handbook Glossary (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.glossary.1237).
A Science Primer (http://www.ncbi.nlm.nih.gov/About/primer/index.html).
NCBI Bookshelf (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books).
One way to identify the genes responsible for a particular phenotype is to generate a cDNA library from patient tissues or samples and obtain a number of ESTs. The ESTs are then used to determine the genes expressing them and whether they contain any nucleotide variations, such as single nucleotide polymorphisms (SNPs), when compared to normal individuals. SNPs are sites in the DNA sequence where individuals differ at a single nucleotide. The SNP database is described in more detail later in this chapter. The analysis involves the following steps:
Compare the sequences of ESTs from a patient to the sequences of the human genome (using Basic Local Alignment Search Tool [BLAST]).
Identify the genes aligning to the ESTs and download their sequences (using Map Viewer).
Identify whether the EST sequences contain any known SNPs (using dbSNP).
Determine whether a gene variant is known to cause a phenotype (using Online Mendelian Inheritance in Man [OMIM]).
Thus, starting from the transcribed sequences derived from patients, we will obtain information about expressed genes and determine whether these genes contain known variations that lead to the disease phenotype.
NCBI assembles component sequences from the human genome sequencing project into longer sequences called contigs whose accession numbers begin with prefix “NT_”. NCBI also performs a number of annotations on the assembly to identify genes, transcripts, clones, repeats, markers, and SNPs. NCBI releases the updated human genome assembly or the new “Build” periodically. For more information about the human genome assembly and annotation, see ref. 1 and the help document (http://www.ncbi.nlm.nih.gov/mapview/static/humansearch.html).
This problem-based mini-course guides us through the use of NCBI resources such as BLAST, Map Viewer, dbSNP, and OMIM as tools to identify disease genes (2).
BLAST provides a method for rapid searching of nucleotide and protein databases for similarities with a query nucleotide or protein sequence (3,4). The human genome BLAST page at (http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606) provides centralized access to the NCBI human genome assembly and annotated transcript and protein sequences. The BLAST output links directly to the Human Genome Map Viewer, where database hits can be analyzed in their genomic context to see the relationship with other annotated features.
The Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/) allows us to view and search an organism’s complete genome (5). It shows integrated views of a collection of genetic, physical, and sequence maps for annotated genes, expressed sequences, SNPs, and other features, and, thus, is a valuable tool for the identification and localization of genes that contribute to human disease (as demonstrated in this mini-course).
NCBI’s SNP database (http://www.ncbi.nlm.nih.gov/SNP/) contains both single nucleotide substitutions and short deletions and insertions (6). The data in dbSNP are integrated with other NCBI genomic data. SNPs are aligned to the human genome, and the locations of SNPs with respect to the annotated genes and mRNAs are identified.
OMIM (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM) is the database of human genes and genetic disorders developed and edited by Dr. Victor A. McKusick and his colleagues at Johns Hopkins and elsewhere, and adapted for the Internet by NCBI (see Note 1 about Online Mendelian Inheritance in Animals) (7).
In this problem, we will use hemochromatosis, a disease characterized by iron overload, as an example. Consider a researcher who is working on hemochromatosis and needs to obtain information about the gene(s) causing the phenotype. The following steps describe the analysis of EST sequences that might have been obtained from a hemochromatosis patient.
It is recommended to follow the link to the “Identification of Disease Genes” through the mini-course webpage (http://www.ncbi.nlm.nih.gov/Class/minicourses/).
This page contains a link to a file containing the up-to-date screen images of each of the steps described below. Referring to the file is strongly recommended to follow the mini-course steps. However, a number of screen images are provided in this chapter as well for a reader to follow along. These screen images are from human Build 35.1.
One way to identify the genes expressing the ESTs is to compare their sequences using BLAST with the human genome assembly and the genes annotated on it. The specialized BLAST page for searching against the annotated human genome assembly is at (http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=9606) (see Note 2). The user may directly access the human genome BLAST page through the “Identification of Disease Genes” mini-course web page by clicking on the “BLAST (human genome)” link. We can concatenate a number of EST sequences to run the search as a batch. However, we will use only one EST sequence as a query for this analysis (see Note 3). Paste the EST sequence provided on the mini-course page in the query box of the BLAST page and select the “genome (reference only)” database from the pull down menu and use the default program MegaBlast (8) (see Note 4). Start the search by clicking on the “Begin Search” button and obtain the results by clicking on the “Format” button. The BLAST results page shows only one match to the contig sequence NT_007592.14 on chromosome 6 in the human genome Build 35.1. In certain cases, there may be multiple matches to the human genome assembly (see Notes 5 and 6).
MegaBlast of the query EST against the human genome: alignment overview. The difference between the EST and genomic sequence (a G to A variation at the nucleotide 16951392 of the contig NT_007592.14) is highlighted by a rectangle.
The difference may be due to a sequencing error in the low quality EST sequence or it may represent a real SNP in the human genome. For future reference, paste your results, such as the alignment and the nucleotide difference (sequence difference at the nucleotide 16951392 on NT_007592.14; G in the genomic and A in the query EST sequence), in the window provided in the mini-course webpage.
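The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the BLAST algorithm itself: it assumes an ungapped plus/plus alignment already extracted from the BLAST report, and the sequences, the start coordinate, and the function names are hypothetical placeholders rather than the actual EST or contig data.

```python
def find_mismatches(query, subject, subject_start):
    """List (subject_position, subject_base, query_base) for each mismatch.

    Assumes an ungapped, plus/plus alignment given as two equal-length
    strings; subject_start is the 1-based coordinate of the first
    subject (genomic) base in the alignment.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    mismatches = []
    for offset, (q, s) in enumerate(zip(query.upper(), subject.upper())):
        if q != s:
            mismatches.append((subject_start + offset, s, q))
    return mismatches


def percent_identity(query, subject):
    """Percentage of aligned positions at which the two sequences agree."""
    matches = sum(q == s for q, s in zip(query.upper(), subject.upper()))
    return 100.0 * matches / len(query)


# Toy data (hypothetical): a genomic segment and an EST with one G-to-A change.
genomic = "ATATACGTGCCAGGTG"
est = "ATATACGTACCAGGTG"

print(find_mismatches(est, genomic, subject_start=1001))
print(percent_identity(est, genomic))
```

Recording the genomic coordinate together with both bases, as the function does, gives exactly the information that is later needed to look the position up in dbSNP.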
Map Viewer display of the Basic Local Alignment Search Tool hit from Subheading 5.2. The four maps displayed in this view, Model, RNA, Genes_seq, and Contig, are highlighted by a rectangle.
The Genes_seq map shows the “known” genes annotated by alignment of EST and/or mRNA sequences to the assembly. The Contig map shows the assembled genome contig sequence in the region, the Model map shows the ab initio model genes predicted by NCBI’s program Gnomon, and the RNA map shows the alignments of the known alternatively spliced transcripts. For more information about the human genome assembly and annotation, see ref. 4 and the help document (http://www.ncbi.nlm.nih.gov/mapview/static/humansearch.html).
Map Viewer display of the Basic Local Alignment Search Tool (BLAST) hit from Subheading 5. with Genes_seq Map as a master map. The BLAST hit, indicated by a bar on the right side of the Genes_seq map, is in the region of one of the exons of the HFE gene.
The thick bars in the Genes_seq map indicate the exons and the thin lines joining them indicate the introns of the gene. To see the entire HFE gene structure, zoom out several times by clicking on the gray line and selecting the option “Zoom out 2 times” from the menu that appears. The query EST represents a known gene, HFE. The arrow next to the gene link indicates whether the gene lies on the forward or the reverse strand. A gene annotated on the forward strand is indicated by an arrow pointing downward, whereas a gene annotated on the reverse strand is indicated by an arrow pointing upward (see Note 7). The HFE gene is annotated on the forward strand of chromosome 6.
The right-most map is called the master map, and links that give more information about map elements are provided next to it. For example, the current master map, the Genes_seq map, has links to resources that provide more information about the HFE gene, such as OMIM, sv (Sequence Viewer), pr (Reference Proteins), dl (Download Sequence), ev (Evidence Viewer), mm (Model Maker), and hm (HomoloGene). For more information, please refer to the Human Maps Help document (http://www.ncbi.nlm.nih.gov/mapview/static/humansearch.html). We will use some of these links in this mini-course. Display the entire HFE gene sequence by clicking on the download “dl” link and then on “Display” on the next page (see Notes 8 and 9). Copy the sequence and paste it in the area provided on the mini-course page. Note the accession number of the longest transcript, NM_000410. We will use this information in the next step.
Go back to the Map Viewer report by clicking on the back button of the browser twice. Click the Maps and Options link.
Remove all the maps except the Genes_seq map by selecting the map under the “Maps Displayed” menu and clicking on “Remove.” Now add the Variation map from the “Available maps” menu (by selecting the map and clicking Add). Make the Variation map the master map by selecting it and clicking the “Make Master/Move to Bottom” option. Then click “Apply.” (The Mini-Course Map Viewer Quick Start describes the usage of the Map Viewer in detail.)
Map Viewer display, containing the variation and Genes_seq maps, zoomed in the region of the Basic Local Alignment Search Tool (BLAST) hit in Subheading 5. There are two SNPs, rs1800562 and rs4986950, in the BLAST hit area indicated by a bar on the right side of each map.
Fasta sequence section of the SNP entry rs1800562. The A/G allele in the SNP, indicated in the definition line on the record, is highlighted by an oval.
Integrated maps section of the SNP entry rs1800562. The location of the SNP, nucleotide position 16951392 on the contig NT_007592.14 of the reference assembly, is highlighted by a rectangle.
GeneView section of the SNP entry rs1800562 for the mRNA NM_000410 alignment on the reference assembly contig NT_007592. The resulting amino acid change, 282nd amino acid in the protein NP_000401.1, from cysteine to tyrosine, is highlighted by a rectangle.
Thus, the query EST sequence contains a known SNP in the HFE gene that results in a cysteine to tyrosine change in the 282nd amino acid (Cys282Tyr) of the protein expressed by the longest HFE transcript variant, variant 1 (see Note 13). The next obvious step is to find out whether the SNP in the HFE gene is known to be associated with a disease phenotype.
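The effect of a single nucleotide change on the encoded amino acid can be checked with the standard genetic code. The sketch below is illustrative only: the reference codon TGC is an assumption that is not stated in the text (both cysteine codons, TGT and TGC, carry a G at the second position, so a G to A change there yields tyrosine in either case), and the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def substitute(codon, position, new_base):
    """Return the codon with the base at a 1-based position replaced."""
    i = position - 1
    return codon[:i] + new_base + codon[i + 1:]


def classify(ref_codon, alt_codon):
    """Return (ref_aa, alt_aa, kind) using one-letter amino acid codes."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return ref_aa, alt_aa, kind


# Assumed reference codon for Cys282; the SNP changes G to A at position 2.
ref_codon = "TGC"
alt_codon = substitute(ref_codon, 2, "A")
print(ref_codon, alt_codon, classify(ref_codon, alt_codon))
```

Because HFE is annotated on the forward strand, the G to A change seen on the contig can be applied to the codon directly, without complementing the bases.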
Allelic variants list section from the Online Mendelian Inheritance in Man report for the HFE gene. The Cys282Tyr variant, highlighted by a rectangle, is reported to be associated with hemochromatosis.
This Mini-Course describes the steps needed to identify the gene producing an EST obtained from a hemochromatosis patient, download the gene sequence, identify known SNPs in the gene, and find SNP-associated phenotypes.
Results of Subheading 5.1.: the query EST sequence was found to align to contig NT_007592.14 on chromosome 6 with one nucleotide difference (G to A with respect to the nucleotide 16951392 on the contig).
Results of Subheading 5.2.: the query EST was found to align to the HFE gene.
Results of Subheading 5.3.: the query EST sequence contains a known SNP (G/A with respect to the nucleotide 16951392 on contig NT_007592.14) that results in the Cys282Tyr change in the hemochromatosis protein expressed by the longest HFE mRNA variant.
Results of Subheading 5.4.: the Cys282Tyr change in the HFE protein is associated with hemochromatosis.
For more practice, we will now perform a similar analysis using another EST sequence, this time from a sickle cell anemia patient. Sickle cell anemia is a disease in which the red blood cells are curved in shape and have difficulty passing through small blood vessels. It is recommended to follow along from the webpage (http://www.ncbi.nlm.nih.gov/Class/minicourses/diseasegene2.html).
This page contains a link to a file containing the screen images of each of the steps described next. Referring to the file is strongly recommended to follow the mini-course steps. However, a number of screen images are provided in this chapter as well for a reader to follow along. These screen images are from human Build 35.1.
MegaBlast of the query expressed sequence tags against the human genome: graphical overview. There are four hits to the contig sequence NT_009237.17 on chromosome 11 as highlighted by rectangles.
These multiple hits could arise from similarity to multiple gene family members and/or from the query EST sequence originating from multiple exons (see Notes 5 and 6).
To determine the gene expressing the EST in this case, it is much easier to view the BLAST hits in the Map Viewer. Click the “Genome View” button at the top of the BLAST results page, then on the Map element “NT_009237.”
Map Viewer display obtained from the Genome View button link on the BLAST results page of Subheading 6.1. The four BLAST hits are indicated by the shaded areas on the right side of each map. Two of these align to the two exons of the HBB gene and two align to the two exons of the HBD gene (highlighted by the ovals). Note the percent identity of the BLAST hits (highlighted by the rectangles).
One of the exons of the HBB gene is only 99 % identical to the query EST. To note the location of the nucleotide difference between the two sequences, click on the corresponding “Blast hit” link to go back to the BLAST results page.
Alignment of the 99 % identical Basic Local Alignment Search Tool (BLAST) hit in Subheading 6.1. The BLAST hit is on the minus (reverse) strand (highlighted by an oval) of the contig NT_009237.17. There is one nucleotide difference (highlighted by a rectangle) between the query EST and genomic sequences at nucleotide 4035473 of the contig NT_009237.17.
Map Viewer display showing the recenter option by clicking on the gray line indicating the contig Map.
The upward pointing arrow next to the HBB gene link shows the placement of the gene on the reverse strand of chromosome 11 (see Note 7).
Click on the “dl” link next to the HBB gene. Because the gene is on the reverse strand, select minus on the Strand pull down menu and click on the “Change Region/Strand” button. Display the gene sequence by clicking on the “Display” option (see Notes 8 and 9). Copy the sequence and paste it in the area provided on the mini-course page. You can adjust the nucleotide locations to download the upstream or downstream sequence by using the “adjust by” and “Change Region/Strand” options.
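Selecting the minus strand amounts to taking the reverse complement of the contig sequence over the chosen region. The following sketch shows the operation on a toy sequence; the function names and the example contig are hypothetical, and this is not how the Map Viewer itself is implemented.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def extract_region(contig, start, end, strand="+"):
    """Return contig[start..end] (1-based, inclusive) on the given strand."""
    if not 1 <= start <= end <= len(contig):
        raise ValueError("region lies outside the contig")
    region = contig[start - 1:end]
    if strand == "+":
        return region
    if strand == "-":
        return reverse_complement(region)
    raise ValueError("strand must be '+' or '-'")


# Toy contig (hypothetical), not real genomic sequence.
toy_contig = "AAACCCGGGTTT"
print(extract_region(toy_contig, 2, 5, "+"))
print(extract_region(toy_contig, 2, 5, "-"))
```

Widening `start` and `end` before extraction plays the same role as the “adjust by” option when upstream or downstream sequence is needed.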
Go back to the Map Viewer report by clicking on the back button of the browser twice. Click the Maps and Options link.
Map Viewer, displaying the variation and Genes_seq maps, zoomed in the region of BLAST hit in Subheading 6. There are three SNPs in the BLAST hit area indicated by the bars on the right side of each map; rs713040, rs334, rs11549407.
Fasta sequence section of the SNP entry rs334. The A/T allele in the SNP, indicated in the definition line of the record, is highlighted by an oval.
Integrated maps section of the SNP entry rs334. The location of the SNP, nucleotide position 4035473 on the contig NT_009237.17 of the reference assembly, is highlighted by a rectangle.
GeneView section of the SNP entry rs334 for the mRNA NM_000518 alignment on the reference assembly contig NT_009237. The SNP results in the change at the seventh amino acid in the protein NP_000509.1 from glutamate to valine.
Thus, the query EST sequence contains a known SNP in the HBB gene that results in a glutamate to valine change in the seventh amino acid (Glu7Val) of the β-globin protein (see Note 13). The next obvious step is to find out whether the SNP in the HBB gene is known to be associated with a disease phenotype.
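Because HBB lies on the reverse strand, alleles reported relative to the contig must be complemented before they are applied to the mRNA codon. The sketch below illustrates this step. The codon GAG for the affected glutamate and the second-base position of the change are assumptions not stated in the text, the two-entry codon table is only an excerpt of the standard genetic code, and the function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Excerpt of the standard genetic code, sufficient for this example.
CODON_EXCERPT = {"GAG": "Glu", "GTG": "Val"}


def to_transcript_orientation(allele, gene_strand):
    """Convert a contig-strand allele to the orientation of the transcript."""
    if gene_strand == "+":
        return allele
    if gene_strand == "-":
        return COMPLEMENT[allele]
    raise ValueError("gene_strand must be '+' or '-'")


# Contig-strand alleles at the SNP: T in the genomic sequence, A in the EST.
mrna_ref = to_transcript_orientation("T", "-")
mrna_alt = to_transcript_orientation("A", "-")

# Assumed codon: the SNP is taken to fall on the second base of GAG.
ref_codon = "GAG"
assert ref_codon[1] == mrna_ref
alt_codon = ref_codon[0] + mrna_alt + ref_codon[2]

print(ref_codon, CODON_EXCERPT[ref_codon], "->", alt_codon, CODON_EXCERPT[alt_codon])
```

The check that the reference codon carries the complemented reference allele is a cheap guard against applying a contig-strand allele to a reverse-strand gene by mistake.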
To determine whether the Glu7Val variant is known to cause a disease phenotype, we will access the OMIM database. Go back to the Map Viewer report by clicking on the back button of the web browser. Make the Genes_seq map the master map by clicking on the arrow at the top of the Genes_seq map. Click on the OMIM link next to the HBB gene. This takes us to the OMIM report for the HBB gene, which details how variants of the HBB gene are associated with various phenotypes. As mentioned in the OMIM report under the “Pseudogenes” section, the allelic variants are listed for the mature HBB (β-globin) protein, which lacks the initiator methionine. The SNP database reports them for the precursor protein. Hence, the allelic variants in the OMIM report are off by one amino acid compared to the variants in the SNP report (see Note 14). Thus, the Glu7Val variant in the SNP report corresponds to the Glu6Val variant in the OMIM report. Access the allelic variants list by clicking on “View list” in the blue side bar. The Glu6Val variant, called hemoglobin S, is reported to cause the sickle cell anemia phenotype. The query EST contains a known variation that leads to the expression of the Glu7Val variant protein associated with the sickle cell anemia phenotype (see Note 15).
This mini-course describes steps to identify the gene expressing the ESTs obtained from a sickle cell anemia patient, download the gene sequence, identify known SNPs in the gene and find SNP-associated phenotypes.
Results of Subheading 6.1.: the query EST sequence was found to align to the contig NT_009237.17 on chromosome 11 with one nucleotide difference (T to A with respect to the nucleotide 4035473 on the contig).
Results of Subheading 6.2.: the query EST was found to be expressed by the HBB gene.
Results of Subheading 6.3.: the query EST sequence contains a known SNP (T/A with respect to the nucleotide 4035473 on contig NT_009237.17).
Results of Subheading 6.4.: the Glu7Val change in the HBB protein is associated with sickle cell anemia.
The mini-course describes a procedure to identify a known gene and a SNP from the NCBI databases starting from one EST sequence. The same procedure can be used with a batch of EST sequences in the initial human genome BLAST search (see Note 3) followed by a similar analysis to identify genes corresponding to them. Some ESTs may be produced by known genes (as described in the mini-course) and some may be produced by novel genes not yet annotated on the Genes_seq map. The Model map may be useful to identify the novel genes. Also, some ESTs may contain new SNPs, which can be deposited in dbSNP. By comparing the DNA sequence from patients and normal individuals, it can be discerned whether the novel SNP and/or the novel gene are associated with the disease.
The mini-course “Correlating Disease Gene and Phenotype” elucidates the biochemical and structural basis for the function of the mutant proteins and their relationship to the particular phenotype.
Online Mendelian Inheritance in Animals is a database of genes, inherited disorders and traits in animal species (other than human and mouse) authored by Professor Frank Nicholas of the University of Sydney, Australia, with help from many collaborators over the years.
In addition to the human genome, a number of other genomes are available as BLAST databases. A complete list is available under the Genomes panel on the BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/).
MegaBlast also accepts a batch of query sequences. Each query sequence must have a unique identifier written on a separate line before the sequence and the identifier line should begin with a greater than (“>”) sign. For example,
>identifier1
atgcggctta…
>identifier2
ttggcatactg…
>identifier3
ggatcgatcag…
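A batch query in this format can be assembled programmatically. The sketch below is one possible way to do so; the function name, the line width, and the placeholder sequences are illustrative assumptions, and the uniqueness check simply enforces the requirement stated above.

```python
def format_fasta_batch(records, line_width=60):
    """Build a multi-sequence FASTA string from (identifier, sequence) pairs.

    Raises ValueError if an identifier is empty or used more than once,
    since each query in a batch needs its own unique identifier line.
    """
    seen = set()
    lines = []
    for identifier, sequence in records:
        if not identifier or identifier in seen:
            raise ValueError(f"identifier must be unique and non-empty: {identifier!r}")
        seen.add(identifier)
        lines.append(">" + identifier)
        # Wrap the sequence so that no line exceeds line_width characters.
        for i in range(0, len(sequence), line_width):
            lines.append(sequence[i:i + line_width])
    return "\n".join(lines) + "\n"


# Placeholder sequences (hypothetical), not real ESTs.
batch = format_fasta_batch([("est1", "atgc"), ("est2", "ttgg")])
print(batch, end="")
```

The resulting text can be pasted into the query box in place of a single sequence.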
Since the human Build 36, NCBI also provides access to the previous assembly release (Build) as a BLAST database. More information about each build is provided in the release notes at http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html. You may choose to run the BLAST search against the previous build by using the appropriate option in the database field.
If the gene of interest is on the reverse strand, then change the Strand pull down menu to minus and click on the “Change Region/Strand” option before displaying the sequence (refer to Subheading 6.2.).
The user can also adjust the nucleotide locations to download the upstream or downstream sequence by using the “adjust by” and “Change Region/Strand” options.
When a single nucleotide polymorphism is submitted to dbSNP, an identifier with prefix “ss” is assigned to the entry. It is possible that multiple laboratories may submit information on the same SNP as new techniques are developed to assay variation, or new populations are typed for frequency information. Each of these SNP entries is assigned a unique identifier with prefix “ss”. When two or more submitted SNP records refer to the same location in the genome, a Reference SNP record is created, with an “rs” prefix on the identifier, by NCBI during periodic “builds” of the SNP database. This reference record provides a summary list of submitted “ss” records in dbSNP.
For example, the Reference SNP record from Subheading 5., rs1800562, contains three submitted SNP records; ss2420669, ss5586582, and ss24365242 in the dbSNP build 125. The Reference SNP record from Subheading 6., rs334, contains six submitted SNP records; ss335, ss1536049, ss4397657, ss4440139, ss16249026, and ss24811263 in the dbSNP build 125.
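The relationship between submitted and reference records is a simple one-to-many mapping, which can be represented as follows. The identifiers are those listed above for dbSNP build 125; the variable and function names are hypothetical, and the sketch does not query dbSNP.

```python
# Reference SNP (rs) records and their submitted (ss) members,
# as listed in this note for dbSNP build 125.
REFSNP_CLUSTERS = {
    "rs1800562": ["ss2420669", "ss5586582", "ss24365242"],
    "rs334": [
        "ss335", "ss1536049", "ss4397657",
        "ss4440139", "ss16249026", "ss24811263",
    ],
}


def build_ss_index(clusters):
    """Invert the rs -> [ss, ...] mapping into an ss -> rs lookup table."""
    index = {}
    for rs_id, ss_ids in clusters.items():
        for ss_id in ss_ids:
            if ss_id in index:
                raise ValueError(f"{ss_id} assigned to more than one rs record")
            index[ss_id] = rs_id
    return index


ss_index = build_ss_index(REFSNP_CLUSTERS)
print(ss_index["ss335"])
print(len(REFSNP_CLUSTERS["rs1800562"]))
```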
The GeneView panel shows the locations of the SNPs on the genomic assemblies with respect to the genes, their alternatively spliced mRNAs and encoded proteins. The view is color coded for quick identification of the location and to show whether the change is synonymous (not altering the amino acid translation) or nonsynonymous (altering the amino acid translation). For example, nonsynonymous SNPs are represented in red, synonymous in green and those in introns are in yellow. A link to the “Color Legend” is provided next to the Gene Model under the GeneView panel.
Some OMIM entries report the allelic variants for the mature protein, whereas dbSNP reports variants for the precursor protein. Thus, for the same SNP, the amino acid numbering of the allelic variant may differ between these databases. For example, refer to Subheading 6.6. OMIM reports allelic variants for the mature beta globin protein (after removal of the initiator methionine). Thus, the Glu7Val change reported in dbSNP is the same as the Glu6Val allelic variant reported in the OMIM database.
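The renumbering between the two conventions is a fixed offset equal to the number of residues removed from the precursor. A minimal sketch of the conversion follows; the function names and the three-letter notation pattern are illustrative assumptions, not an interface offered by either database.

```python
import re

# Matches three-letter variant notation such as "Glu7Val".
VARIANT_PATTERN = re.compile(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def shift_variant(variant, offset):
    """Return the variant with its residue number shifted by offset."""
    match = VARIANT_PATTERN.match(variant)
    if match is None:
        raise ValueError(f"unrecognized variant notation: {variant!r}")
    ref_aa, position, alt_aa = match.group(1), int(match.group(2)), match.group(3)
    new_position = position + offset
    if new_position < 1:
        raise ValueError("shifted position falls before the first residue")
    return f"{ref_aa}{new_position}{alt_aa}"


def precursor_to_mature(variant, removed_residues=1):
    """Convert precursor numbering to mature-protein numbering."""
    return shift_variant(variant, -removed_residues)


def mature_to_precursor(variant, removed_residues=1):
    """Convert mature-protein numbering to precursor numbering."""
    return shift_variant(variant, removed_residues)


print(precursor_to_mature("Glu7Val"))
print(mature_to_precursor("Glu6Val"))
```

The default of one removed residue reflects the beta globin case, where only the initiator methionine is absent from the mature protein; other proteins may need a different value.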
The OMIM report and thus its allelic variants list are manually derived from publications. dbSNP contains the SNPs reported by the submitters. Currently, a link is provided from dbSNP to OMIM if the amino acid number of the allelic variant in the OMIM report matches the number of the changed amino acid due to a SNP in dbSNP. Since the sources of the two databases, OMIM and dbSNP, are different, each may contain information not found in the other.