Influenza virus infection is a major threat to public health in the United States, resulting in over 200,000 hospitalizations and 30,000 deaths each year. The Influenza Virus Genome Project [1] is providing researchers with a growing collection of virus sequences essential to the identification of the genetic determinants of influenza pathogenicity. NCBI provides online tools for the analysis of these and other influenza sequences in GenBank that allow researchers to:

- Retrieve: viral genomic, gene-encoding, or protein sequences, and download them in a number of formats
- Align: locally stored sequences with those in NCBI databases
- Cluster: sequences for phylogenetic analysis using a variety of algorithms and weight matrices, constructing dendrograms from the result
- Download: complete genomic sequences
- Search: influenza sequences using BLAST®

An Example

The analysis of the coding region (CDS) of the hemagglutinin (HA) sequence for influenza virus A, GenBank® accession AY653200, serves as an example of the use of these tools to classify a new sequence. Prior to the analysis, the CDS portion of the sequence was downloaded in FASTA format using NCBI's Entrez, and the FASTA definition line was changed from:

    >gi-50365728:29-1735 Influenza A virus(/chicken/Jilin/9/2004(H5N1))segment 4, complete sequence

to read:

    >local chicken

Selection of influenza sequences for analysis

To begin, use the Database link from the Influenza Virus Resource page to reach the Query Builder shown in Fig. 1.
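The definition-line edit described in the example above can also be scripted rather than done by hand. The following is a minimal sketch, assuming a single-record FASTA string; the function name `rename_fasta_record` is hypothetical, and the bases in the sample record are placeholders, not the actual AY653200 sequence.

```python
def rename_fasta_record(fasta_text: str, new_defline: str) -> str:
    """Replace the definition line of a single-record FASTA string."""
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not start with a FASTA definition line")
    # Keep the sequence lines unchanged; swap only the header.
    return "\n".join([">" + new_defline] + lines[1:]) + "\n"


# Hypothetical, shortened record: the bases below are placeholders.
example = (
    ">gi-50365728:29-1735 Influenza A virus(/chicken/Jilin/9/2004(H5N1))"
    "segment 4, complete sequence\n"
    "ATGAAACCCGGGTTT\n"
)

renamed = rename_fasta_record(example, "local chicken")
print(renamed)
```

A short, distinctive definition line such as "local chicken" makes the uploaded sequence easy to spot in the alignment and the dendrogram.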
Figure 1. Query Builder for influenza sequences. Queries are built by making selections in three different sections of the form, labeled A, B, and C.

Check the 'Coding region' radio button, indicated in section A, to specify the type of sequence to retrieve. From the menus in section B, select 'Influenza A', 'Avian', 'Asia', and 'HA' as the 'Virus Species', 'Host', 'Country/Region', and 'Segment', respectively. In addition, check 'Full-length sequences only' and restrict the search to H5N1 subtype sequences from the year 2005 using the check boxes and text fields in section C. Clicking on 'Add to Query Builder' will return the number of sequences that match, as shown in section D. Click on 'Get sequences' to generate the form shown in Fig. 2, containing a table of summaries for the 85 selected sequences.
Figure 2. Selection of sequences for further analysis. For brevity, only the first three of the 85 selected entries are shown.

The table is sortable, and the controls in section A have been used to sort the records by "Virus Name". Ten sequences from various hosts (3 goose, 1 quail, 2 duck, 2 chicken, 1 gull, 1 heron) have then been selected for further analysis using the check boxes next to each entry; only the first two of the checked entries are visible in the figure. Using the button in section B, the FASTA sequence called "local chicken" has been uploaded, as indicated in section C. Click on 'Do multiple alignment' to align the "local chicken" sequence to the 10 selected database sequences using the multiple sequence alignment program MUSCLE [2]. This generates the alignment shown in Fig. 3.
Figure 3. Multiple sequence alignment for the "local chicken" HA sequence and 10 influenza HA coding sequences selected from the NCBI databases. The portion of the alignment displayed, indicated in section A, begins near base 950 and ends near base 1040. Two major groups of sequences are evident, characterized by non-synonymous base changes (sections B), one synonymous base change (section C), and a three-base deletion (section D).

Clustering and phylogenetic analysis

Click on 'Build a Tree' to open the setup page for phylogenetic analysis, where sequences may be selected for inclusion in the subsequent analysis using check boxes. Click on 'Phylogenetic Analysis' to display the next page, where a clustering algorithm may be selected and the tree built. The resulting dendrogram is shown in Fig. 4.
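The distinction drawn in the Figure 3 caption between synonymous changes, non-synonymous changes, and a three-base deletion can be illustrated with a short sketch. It assumes two aligned, in-frame coding sequences whose gaps fall on codon boundaries, and it uses the standard genetic code. The function names and the toy sequences are hypothetical and are not taken from the alignment in the figure.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG order at each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Label the difference between two aligned codons."""
    ref, alt = ref_codon.upper(), alt_codon.upper()
    if "-" in ref or "-" in alt:
        return "indel"
    if ref == alt:
        return "identical"
    if ref not in CODON_TABLE or alt not in CODON_TABLE:
        return "ambiguous"  # e.g. codons containing N
    if CODON_TABLE[ref] == CODON_TABLE[alt]:
        return "synonymous"
    return "non-synonymous"


def compare_cds(ref_cds: str, alt_cds: str):
    """Return (codon_index, ref_codon, alt_codon, label) for each differing codon."""
    if len(ref_cds) != len(alt_cds):
        raise ValueError("aligned sequences must have equal length")
    changes = []
    # Step through complete codons only (0-based codon index).
    for start in range(0, len(ref_cds) - len(ref_cds) % 3, 3):
        ref, alt = ref_cds[start:start + 3], alt_cds[start:start + 3]
        label = classify_codon_change(ref, alt)
        if label != "identical":
            changes.append((start // 3, ref, alt, label))
    return changes


# Toy aligned sequences (hypothetical, not from Fig. 3).
changes = compare_cds("AAAGGGCTTACG", "AAGGAGCTT---")
for change in changes:
    print(change)
```

In this toy example, AAA to AAG leaves lysine unchanged (synonymous), GGG to GAG replaces glycine with glutamate (non-synonymous), and the final codon is reported as an indel.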
Figure 4. Dendrogram built using the Local Search Neighbor Joining method.

The dendrogram shows two clusters, as might be anticipated on the basis of the alignment in Fig. 3. Two influenza sequences from a goose host and one from a gull host lie in the first of these clusters, while three from a chicken host (including our "local chicken" sequence), two from a duck host, and one from a heron host lie in the second. An outlying sequence, branching from the base of the tree, came from a goose host in Mongolia. The dendrogram may be recomputed after adjusting several parameters. A 'non-linear' two-dimensional dot plot (not shown), which groups sequences to provide an overview of a large dataset, may also be generated. Phylogenetic comparisons of this type have provided valuable insight into the genomic reassortments that lead to influenza outbreaks [3].
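Distance-based clustering methods such as neighbor joining start from a matrix of pairwise distances between the aligned sequences. The sketch below computes a simple p-distance (the proportion of differing sites, ignoring gapped positions) for a toy alignment. It is an illustration under stated assumptions only: the distance measure, function names, and sequences are hypothetical and are not the parameters used by the NCBI tool.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences, skipping gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # gapped columns carry no substitution information here
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared


def distance_matrix(aligned: dict) -> dict:
    """Return {(name_i, name_j): p-distance} for every ordered pair of sequences."""
    names = list(aligned)
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m in names
        for n in names
    }


# Toy alignment (hypothetical sequences and names).
alignment = {
    "local chicken": "ATGGAGAAAA",
    "duck_example": "ATGGAGAAGA",
    "goose_example": "ATGCAG---A",
}

matrix = distance_matrix(alignment)
for (m, n), d in matrix.items():
    if m < n:
        print(f"{m} vs {n}: {d:.3f}")
```

Sequences separated by small distances end up close together in the resulting tree, which is why the two groups visible in the alignment reappear as two clusters in the dendrogram.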
[1] Ghedin E, et al. Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution. Nature. 2005 Oct 20;437(7062):1162-6. Epub 2005 Oct 5. PMID: 16208317.

[2] Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004 Mar 19;32(5):1792-7. Print 2004. PMID: 15034147.

[3] Holmes EC, et al. Whole-genome analysis of human influenza A virus reveals multiple persistent lineages and reassortment among recent H3N2 viruses. PLoS Biol. 2005 Sep;3(9):e300. Epub 2005 Jul 26. PMID: 1602618