Nucleic Acids Res. Jul 1, 2011; 39(Web Server issue): W347–W352.
Published online Jun 14, 2011. doi:  10.1093/nar/gkr485
PMCID: PMC3125810

PHAST: A Fast Phage Search Tool

Abstract

PHAge Search Tool (PHAST) is a web server designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage ‘cornerstone’ feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and GenBank files, provide richly annotated tables on prophage features and prophage ‘quality’ and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views. PHAST is available at http://phast.wishartlab.com.

INTRODUCTION

Bacteriophages, the viruses that infect bacteria, can typically be divided into two groups, lytic and temperate. Lytic phages infect, propagate within and then lyse their host bacterial cells as part of their life cycle, while temperate phages may exist benignly within the DNA of their bacterial host. Temperate phages can physically integrate into one of the native replicons (plasmid or chromosome) of their preferred bacterial host, although a few phages can exist as independent plasmids (1). Integrated phages are termed prophages. Prophages tend to be inserted at specific integration sites within the host genome, but their location can vary depending on the phage species. Prophages are essentially dormant phages that are only replicated through bacterial DNA replication and cell division. Bacteria containing a prophage are called lysogens because their prophage is in the lysogenic cycle, in which the viral (especially the lytic) genes are not expressed. Upon damage to the host cell DNA or other physiological cues, the prophage may be induced to excise itself from the bacterial genome. After induction, the phage's lytic genes are turned on, infectious virions are assembled within the host cell and the cell is lysed (killed), releasing infectious phage particles that can go on to infect more cells.

Bacterial genomes can contain a significant proportion (>20%) of functional and non-functional bacteriophage genes (1). Consequently, prophage sequences can account for a significant fraction of the variation within bacterial species or clades. The presence of prophage sequences may also allow some bacteria to acquire antibiotic resistance, to exist in new environmental niches, to improve adhesion or to become pathogenic (1). Because bacterial genome fragments can also be carried by phage particles, the lytic process is thought to be an important vehicle for horizontal gene transfer. Furthermore, because of their high specificity and high potency, phages are also being investigated as potential candidates for novel antibiotics (2) or even cancer therapies (3). As the most abundant biological entity on Earth (1), phages are also thought to play a crucial role in cycling nutrients and boosting photosynthesis in the world's oceans (4). However, not all prophages or prophage-like entities are functional. Indeed, many prophages are dormant due to mutational decay or the loss of critical genes over thousands of host generations. Defective or cryptic prophages are abundant in many bacterial genomes and they can carry a number of genes that may be beneficial to the host. These can include genes encoding proteins with homologous recombination functions or bacteriocins that may be used to inhibit the growth of other bacteria in competition for nutrients.

There are generally two methods to identify prophages: (i) experimental and (ii) computational. The experimental approach involves inducing the host bacteria to release phage particles by exposing them to UV light or other DNA-damaging conditions. This approach can certainly prove the existence of viable phages, but will not reveal defective prophages (1). In addition, not all viable phages can be induced under the same conditions and the required conditions often are not known a priori. Given the ease with which bacterial genomes can now be sequenced, the computational identification of prophages from genomic sequence data has become the preferred route. Early sequence-based efforts often depended upon manual inspection of disrupted genes and attachment sites (5) or the analysis of atypical nucleotide content (6,7). However, prophage regions do not always exhibit atypical nucleotide content (8). Likewise, phages do not always integrate into the same coding regions, nor do they exclusively use tRNAs as the target site for integration. Consequently, scanning for atypical gene content or simple searches for disrupted genes or tRNAs are unreliable for finding prophage regions. More recent methods rely on a more holistic or integrated approach. These combine sequence comparisons to known phage or prophage genes, comparisons to known bacterial genes, tRNA and dinucleotide analysis and hidden Markov scanning for attachment site recognition. These combined methods are now available in a number of excellent programs and web servers such as Phage_Finder (9), Prophinder (10) and Prophage Finder (11). These tools have helped revolutionize finding prophages in bacterial genomes.

Except for Prophage Finder, these phage finding programs and web servers still require that the input genome sequence be well annotated, with all open reading frames (ORFs) and/or tRNA sites pre-identified. This annotation process is not only time consuming, it is also highly dependent on the choice of the annotation programs or methods. Furthermore, the choice and accuracy of the genome annotation method can significantly affect the accuracy of the phage predictions (vide infra). In addition, even with a fully annotated bacterial genome, all other phage-finding methods require 30 min to 2 h to complete their analyses. Now that it is possible to sequence an entire bacterial genome in less than a day, prophage identification needs to be faster, more accurate and much less dependent on the availability of fully annotated bacterial genomes. To address these issues, we have developed a web-based application named PHAST (a fast PHAge Search Tool), to support rapid and accurate prophage identification using either raw or annotated bacterial genome sequence data. The main features of PHAST include:

  • Prophage region identification support for both raw nucleotide sequence input (using GLIMMER gene prediction and local genome annotation tools) as well as annotated GenBank file input;
  • Support for detailed prophage annotation including position, length, boundaries, number of genes, tRNAs, identified phage-like genes and attachment sites (att);
  • A customized phage and prophage database that is automatically updated on a biweekly basis;
  • Support for the prediction of the completeness or potential viability of identified prophages (intact, questionable or incomplete);
  • Extremely fast processing times (about 3 min for a typical bacterial genome);
  • Graphical output that supports both circular and linear genomic views as well as interactive browsing and labeling of dynamically generated figures;
  • Fully downloadable text and graphics; and
  • Support for scriptable operations through an application programming interface (API).

PHAST's prophage finding performance is generally superior to other applications. When given an annotated genome, it achieves 85.4% sensitivity and 94.2% positive predictive value (PPV) when evaluated against the collection of prophages referenced by Prophinder. When given a raw sequence file, PHAST achieves 79.4% sensitivity and 86.5% PPV using the same evaluation set. This is about 10% more accurate than existing phage finding tools. PHAST is freely available at (http://phast.wishartlab.com).

MATERIALS AND METHODS

PHAST is an integrated search and annotation tool that combines genome-scale ORF prediction and translation (via GLIMMER), protein identification (via BLAST matching and annotation by homology), phage sequence identification (via BLAST matching to a phage-specific sequence database), tRNA identification, attachment site recognition and gene clustering density measurements using density-based spatial clustering of applications with noise (DBSCAN) (17) and sequence annotation text mining. In addition to these basic operations, PHAST also evaluates the completeness of the putative prophage, tabulates data on the phage or phage-like features and renders the data into several colorful graphs and charts. Details about the databases, algorithms and implementation are given below.

Creation of custom prophage and bacterial sequence databases

PHAST's prophage sequence database consists of a custom collection of phage and prophage protein sequences from two sources. One is the National Center for Biotechnology Information (NCBI) phage database that includes 46 407 proteins from 598 phage genomes. The other source is from the prophage database (12), which consists of 159 prophage regions and 9061 proteins not found in the NCBI phage database. Since many of the prophage proteins in the prophage database are actually bacterial proteins and some have only been identified computationally, we only selected those prophage proteins that have been associated with a clear phage function. This set includes a total of 379 phage protease, integrase and structural proteins. This PHAST phage library is used to identify putative phage proteins in the query genome via BLASTP (13) searches.

In addition to a custom, self-updating phage sequence library, PHAST also maintains a bacterial sequence library consisting of 1300 non-redundant bacterial genomes/proteomes from all major eubacterial and archaebacterial phyla. This bacterial sequence library contains more than four million annotated or partially annotated protein sequences. Relative to the full GenBank protein sequence library (100+ million sequences), this bacterial-specific library is 25× smaller. This means that PHAST's genome annotation step (see below) can be accomplished 25× faster.

Genome annotation and comparison

PHAST accepts both raw DNA sequence and GenBank annotated genomes. If given a raw genomic sequence (FASTA format), PHAST identifies all ORFs using GLIMMER 3.02 (14). This ORF identification step takes about 45 s for an average bacterial genome of 5.0 Mb. The translated ORFs are then rapidly annotated via BLAST using PHAST's non-redundant bacterial protein library (~2–3 min/genome). Because tRNA and tmRNA sites provide valuable information for identifying the attachment sites, they are located using the programs tRNAscan-SE (15) and ARAGORN (16). If an input (GenBank formatted) file is provided with complete protein and tRNA information, these steps are skipped. Phage or phage-like proteins are then identified by performing a BLAST search against PHAST's local phage/prophage sequence database along with specific keyword searches to facilitate further refinement and identification. Matched phage or phage-like sequences with BLAST e-values less than 10^-4 are saved as hits and their positions tracked for subsequent evaluation for local phage density by DBSCAN (17).
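The e-value filtering step described above can be sketched as follows. This is a minimal illustration, not PHAST's actual code: it assumes the BLAST results are available as standard 12-column tab-separated output (query ID in column 1, subject ID in column 2, e-value in column 11), and the function name and the example identifiers are hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-4  # threshold stated in the text


def filter_phage_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return (query_id, subject_id, evalue) for hits with e-value below cutoff.

    Assumes standard 12-column BLAST tabular output; comment lines
    starting with '#' and empty lines are skipped.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        evalue = float(row[10])  # column 11 holds the e-value
        if evalue < cutoff:
            hits.append((row[0], row[1], evalue))
    return hits


# Hypothetical two-line example: one strong hit, one weak hit.
example = (
    "orf1\tphageA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "orf2\tphageB\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20\n"
)
print(filter_phage_hits(example))
```

Only the first hit passes the cutoff; in a full pipeline the genomic coordinates of each retained ORF would be kept alongside the hit so that they can be passed on to the clustering step.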

Identification of prophage regions and prediction of their completeness

Prophages can be considered as clusters of phage-like genes within a bacterial genome. The primary challenge (after phage-like genes have been identified) is to determine if these genes are sufficiently well clustered or proximal to each other to be considered prophage candidates. Although there are a few reported clustering methods for identifying phage gene clusters (9–11), we found the general DBSCAN algorithm performs just as well, likely because the identification of clusters of prophage genes is not a particularly difficult task. DBSCAN takes two parameters: the cluster size n and a distance e. The parameter n defines the minimal number of phage-like genes required to form a prophage cluster and e is the maximal spatial distance between two neighbor genes within the same cluster. In our case, the spatial distance between two genes is just the number of nucleotides between them. In other words, n can be considered as the minimal prophage size and e as a measure of the protein density within the prophage region. Empirically, we set n to be 6, since prophages generally have more than five proteins. The value of e was set to 3000 based on assessments from a small number of identified prophages in ProphageDB (12). We found that using a moderately different value of e (the DBSCAN distance, not to be confused with the BLAST e-value) will generally not change the prediction sensitivity. If PHAST's input file is an annotated GenBank file, an additional text scan is performed to identify prophages that may not have been found by clustering. This secondary (moving window) scan looks for specific phage-related keywords in the GenBank protein name field of the input file, such as ‘protease’, ‘integrase’ and ‘tail fiber’. If 6 or more proteins associated with these keywords are found within a window of 60 proteins, the region is considered as a putative prophage region even if an insufficient number of phage-like genes were found by DBSCAN within this region.
Finally, if the identified prophage contains an integrase, potential phage attachment sites (one for each integrase in tandem prophages) are then identified by scanning the region for short nucleotide repeats (12–80 bases) (18).
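The two region-finding steps described above can be illustrated with a short sketch. It is not PHAST's implementation: it assumes each phage-like hit is reduced to a single genomic coordinate, uses the textbook DBSCAN definition (a core point has at least n points, itself included, within distance e), and uses only the three example keywords quoted in the text. The function names `dbscan_1d` and `keyword_windows` are hypothetical.

```python
from bisect import bisect_left, bisect_right

PHAGE_KEYWORDS = ("protease", "integrase", "tail fiber")  # examples from the text only


def dbscan_1d(positions, eps=3000, min_pts=6):
    """Cluster 1-D genomic coordinates with DBSCAN; return clusters as sorted lists."""
    pts = sorted(positions)

    def neighbours(i):
        # Indices of all points within eps of pts[i] (including i itself).
        return range(bisect_left(pts, pts[i] - eps), bisect_right(pts, pts[i] + eps))

    labels = [None] * len(pts)  # None = unvisited, -1 = noise, >= 0 = cluster id
    cluster_id = -1
    for i in range(len(pts)):
        if labels[i] is not None:
            continue
        seed = neighbours(i)
        if len(seed) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(seed)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reach = neighbours(j)
            if len(reach) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(reach)

    clusters = {}
    for pos, lab in zip(pts, labels):
        if lab >= 0:
            clusters.setdefault(lab, []).append(pos)
    return [clusters[k] for k in sorted(clusters)]


def keyword_windows(protein_names, window=60, min_hits=6):
    """Return start indices of windows of `window` proteins with >= min_hits keyword matches."""
    flags = [any(k in name.lower() for k in PHAGE_KEYWORDS) for name in protein_names]
    starts = []
    count = sum(flags[:window])
    for start in range(max(len(flags) - window, 0) + 1):
        if start > 0:
            # Slide the window by one protein: add the new one, drop the old one.
            count += flags[start + window - 1] - flags[start - 1]
        if count >= min_hits:
            starts.append(start)
    return starts


# Eight hits spaced 1 kb apart form one cluster; two isolated hits are noise.
hits = [10000 + 1000 * k for k in range(8)] + [100000, 200000]
print(dbscan_1d(hits))
```

With n = 6 and e = 3000, the eight evenly spaced hits are returned as a single cluster and the two isolated hits are discarded as noise. A sparse stretch in the middle of a real prophage would break the chain of core points, which is one way a single prophage can end up reported as several smaller ones.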

After all prophage regions have been detected, a completeness score is assigned to each identified prophage. Three potential scenarios are considered: (i) the region only contains genes/proteins of a known phage; (ii) >50% of the genes/proteins in the region are related to a known phage and (iii) <50% of the genes/proteins in the region are related to a known phage. In scenario (i), the region automatically has a completeness score of 150 (the maximum). In scenarios (ii) and (iii), the region's completeness score is calculated as the sum of the scores corresponding to the region's size and number of genes. If it is found that the region is related to a known phage, both scores are calculated using the size and number of matched genes of the related phage; otherwise they are calculated using the average size (30 kb) and average number of genes (40) of typical phages. The total score in scenario (iii) also counts the number of identified ‘cornerstone’ genes as well as the density of phage-like genes in the region. ‘Cornerstone genes’ are genes encoding proteins involved in phage structure, DNA regulation, insertion and lysis (1). Table 1 shows the details of PHAST's completeness score calculation. A prophage region is considered to be incomplete if its completeness score is less than 60, questionable if the score is between 60 and 90, and intact if the score is above 90.
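The final classification step can be written down directly from the thresholds above. The sketch below covers only the mapping from score to label; how the score itself is computed is given in Table 1 and is not reproduced here. Treating the boundary values 60 and 90 as ‘questionable’ is our reading of the text, and the function name is hypothetical.

```python
MAX_COMPLETENESS_SCORE = 150  # score assigned automatically in scenario (i)


def classify_prophage(score):
    """Map a completeness score to the labels used in the text.

    Assumption: the boundary values 60 and 90 fall in the 'questionable' class.
    """
    if score < 60:
        return "incomplete"
    if score <= 90:
        return "questionable"
    return "intact"


for s in (45, 60, 90, 91, MAX_COMPLETENESS_SCORE):
    print(s, classify_prophage(s))
```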

Table 1.
PHAST's phage completeness score calculation

Program and web server characteristics

PHAST's search, annotation and DBSCAN clustering software was written using a combination of C and Java. PHAST's web interface was implemented using a standard CGI framework. PHAST's interactive Google-Map style graphics were built using Adobe's Flash Builder. PHAST also supports remote scripting using a URL API (this is described under PHAST's ‘Instructions’ link) and it maintains a large, hyperlinked database of pre-computed bacterial genomes for rapid prophage identification among known/well-studied genomes (see PHAST's ‘Databases’ link). A screenshot montage of PHAST's output is given in Figure 1. The web application is platform independent and has been tested successfully on Internet Explorer 8.0, Mozilla Firefox 3.0 and Safari 4.0. However, in order to view the Flash output the user must have Adobe Flash Player installed. This is freely available at http://www.adobe.com/products/flashplayer. For the most up to date instructions on how to use the server please read the online help page at http://phast.wishartlab.com/how_to_use.html.

Figure 1.
A screenshot montage of some of PHAST's different graphical and tabular views including its linear and circular genome renderings as well as PHAST's corresponding prophage annotation.

Performance evaluation

In order to compare PHAST's performance with other programs, we used a collection of hand-annotated prophages from 54 bacterial genomes (1,10) as our ‘gold standard’ reference control. PHAST was evaluated using both GenBank annotated sequences (i.e. bacterial genomes with manually or semi-automatically generated annotations) as well as raw DNA sequence files. The performance was measured using both sensitivity [TP/(TP + FN)] and positive predictive value or PPV [TP/(TP + FP)]. Using this 54 genome data set, PHAST achieved a sensitivity (Sn) of 85.4% and a PPV of 94.2% when evaluated using GenBank annotated files. When using raw DNA sequence and its own ORF finding and genome annotation tools, PHAST achieved a sensitivity of 79.4% and a PPV of 86.5% for the same 54 genomes. PHAST's performance using the same annotated GenBank data was superior to Prophinder (Sn 77.5%, PPV 93.6%), Prophage Finder (Sn 92.1%, PPV 52.1%) and Phage_Finder (Sn 68.5%, PPV 94.3%). PHAST's performance using raw DNA sequence data does not quite match that of the pre-annotated data, but its combined sensitivity/positive predictive value is still comparable to Prophinder and superior to both Prophage Finder and Phage_Finder. Detailed comparisons for both the GenBank and raw sequence inputs for all 54 genomes can be found in PHAST's documentation page (http://phast.wishartlab.com/documentation.html). PHAST's improved performance does not necessarily indicate that Prophinder, Prophage Finder or Phage_Finder's phage finding algorithms are inferior to PHAST's algorithm. Rather, some of the performance gain appears to be due to PHAST's implementation of a newer, larger phage sequence library and perhaps a better exploitation of keyword annotations.
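For reference, the two performance measures defined above can be computed as in the following sketch. The counts used in the example are purely hypothetical and are not the counts from the 54 genome evaluation, which are not given in the text.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): fraction of reference prophages that were found."""
    return tp / (tp + fn)


def positive_predictive_value(tp, fp):
    """PPV = TP / (TP + FP): fraction of predictions that match a reference prophage."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only.
tp, fn, fp = 85, 15, 5
print(f"Sn  = {sensitivity(tp, fn):.1%}")
print(f"PPV = {positive_predictive_value(tp, fp):.1%}")
```

Because the two measures share only the TP term, a tool can raise its sensitivity by predicting more regions at the cost of a lower PPV, which is the pattern reported above for Prophage Finder.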

A further challenge with evaluating any kind of prophage identification software is that there is no ‘absolute’ or ‘gold’ standard. Careful manual annotation by phage experts is certainly a high standard, but it is more than likely that some prophages in the 54 evaluation genomes were not identified, having decayed or mutated too much for them to appear in the Casjens reference list (1). In other words, some of the false positive predictions may in fact be true positives. Indeed, through manual inspection of PHAST's results we found a number of ‘dense’ positive BLAST hits to phage proteins in several genomes, but these were not labeled as prophages in the Casjens reference list. Instead of ‘false positives’, we believe that they should be considered as prophage-related regions that have not been previously reported in the literature.

In addition to evaluating PHAST's prophage identification performance, we also evaluated its speed. Given that PHAST accepts two kinds of file input (raw FASTA DNA sequence and GenBank formatted files), we assessed its performance for both kinds of input files. When given raw genomic sequence, PHAST must run GLIMMER as well as several gene/protein identification programs. Using the raw Escherichia coli O157:H7 genome sequence only (GenBank accession NC_002655), PHAST completed its prophage identification in just over 4 min. When tested on the same input file, Prophage Finder returned results after 20 min. However, it is important to note that Prophage Finder does not annotate bacterial genes, its output is very ‘crude’ and its combined Sn/PPV score is significantly worse than PHAST's (Table 2). Using the GenBank annotated E. coli O157:H7 file, PHAST completed its prophage identification in 140 s. With the same annotated NC_002655 file, the Prophinder (10) web server took 33 min, while a local copy of Phage_Finder (9) running on a 2.1 GHz Pentium PC with 12 GB of RAM took 93 min. These data suggest that PHAST is between 5 and 40 times faster than existing prophage finding programs. A more complete feature and performance comparison between PHAST and other existing prophage finding tools is given in Table 2.
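The ‘between 5 and 40 times’ range follows from the run times quoted above, as the short calculation below shows. The only assumption is that ‘just over 4 min’ is taken as 240 s, so the first ratio is an upper estimate.

```python
# Run times quoted in the text, converted to seconds.
PHAST_RAW_S = 4 * 60              # "just over 4 min", taken here as 240 s (assumption)
PHAST_GENBANK_S = 140
PROPHAGE_FINDER_RAW_S = 20 * 60
PROPHINDER_GENBANK_S = 33 * 60
PHAGE_FINDER_GENBANK_S = 93 * 60


def speedup(other_seconds, phast_seconds):
    """Ratio of another tool's run time to PHAST's run time on the same input."""
    return other_seconds / phast_seconds


speedups = {
    "Prophage Finder (raw)": speedup(PROPHAGE_FINDER_RAW_S, PHAST_RAW_S),
    "Prophinder (GenBank)": speedup(PROPHINDER_GENBANK_S, PHAST_GENBANK_S),
    "Phage_Finder (GenBank)": speedup(PHAGE_FINDER_GENBANK_S, PHAST_GENBANK_S),
}
for tool, factor in speedups.items():
    print(f"{tool}: {factor:.1f}x")
```

The ratios come out at roughly 5, 14 and 40, respectively.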

Table 2.
Features and performance of PHAST relative to other prophage identification tools

Limitations

PHAST is not without some limitations. First, like all other database-driven annotation systems, PHAST obviously performs poorly at identifying novel phages, whose genes/proteins are not closely related to any record in the PHAST database. In this regard, the appearance of large numbers of proximal proteins with unknown function could be a good indication of a novel phage. Second, the DBSCAN algorithm used by PHAST assumes an even density of phage-like hits in every prophage genomic sequence, which is not generally true in practice. Consequently, a highly uneven distribution of phage-like genes could potentially fool the DBSCAN algorithm. Finally, PHAST will occasionally ‘split’ larger prophages into a number of smaller prophages due to a paucity of BLAST hits.

CONCLUSIONS

PHAST represents a new generation of fast prophage identification and phage annotation tools that produces accurate results in minutes using only raw or lightly annotated genome sequence data. PHAST also produces extensive text summaries, downloadable figures, circular and linear genome views as well as colorful, zoomable, user-interactive graphics. As phage and prophage databases continue to expand [with >100 million different phage genomes still to be sequenced (19)], we believe that PHAST's integrated comparative approach to phage finding will only lead to continued improvements in its sensitivity and specificity.

FUNDING

The authors wish to thank the Canadian Institutes of Health Research (CIHR) and Genome Alberta (a division of Genome Canada) for financial support. Funding for open access charge: Canadian Institutes of Health Research.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors wish to thank the referees for their helpful suggestions. Y.Z. wrote the phage clustering, the phage identification scripts and the FLASH viewer. Y.L. prepared the phage and bacterial sequence databases, built the web interface and implemented various external programs (BLAST, GLIMMER, CGVIEW). K.H.L. co-designed the methods and tested the server. D.S.W. and J.J.D. conceived the ideas, server requirements and general methodology. All authors contributed to the writing of the manuscript and all have seen and approved of its content.

REFERENCES

1. Casjens S. Prophages and bacterial genomics: what have we learned so far? Mol. Microbiol. 2003;49:277–300.
2. Coates AR, Hu Y. Novel approaches to developing new antibiotics for bacterial infections. Br. J. Pharmacol. 2007;152:1147–1154.
3. Bar H, Yacoby I, Benhar I. Killing cancer cells by targeted drug-carrying phage nanomedicines. BMC Biotechnol. 2008;8:37.
4. Sullivan MB, Waterbury JB, Chisholm SW. Cyanophages infecting the oceanic cyanobacterium Prochlorococcus. Nature. 2003;424:1047–1051.
5. Fouts DE. Bacteriophage bioinformatics. In: Fraser CM, Read TD, Nelson KE, editors. Microbial Genomes. Totowa, NJ: Humana Press Inc.; 2004. pp. 71–91.
6. Nicolas P, Bize L, Muri F, Hoebeke M, Rodolphe F, Ehrlich SD, Prum B, Bessières P. Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res. 2002;30:1418–1426.
7. Srividhya KV, Alaguraj V, Poornima G, Kumar D, Singh GP, Raghavenderan L, Katta AV, Mehta P, Krishnaswamy S. Identification of prophages in bacterial genomes by dinucleotide relative abundance difference. PLoS One. 2007;2:e1193.
8. Nelson KE, Weinel C, Paulsen IT, Dodson RJ, Hilbert H, Martins dos Santos VA, Fouts DE, Gill SR, Pop M, Holmes M, et al. Complete genome sequence and comparative analysis of the metabolically versatile Pseudomonas putida KT2440. Environ. Microbiol. 2002;4:799–808.
9. Fouts DE. Phage_Finder: automated identification and classification of prophage regions in complete bacterial genome sequences. Nucleic Acids Res. 2006;34:5839–5851.
10. Lima-Mendez G, Van Helden J, Toussaint A, Leplae R. Prophinder: a computational tool for prophage prediction in prokaryotic genomes. Bioinformatics. 2008;24:863–865.
11. Bose M, Barber RD. Prophage Finder: a prophage loci prediction tool for prokaryotic genome sequences. In Silico Biol. 2006;6:223–227.
12. Srividhya KV, Rao GV, Raghavenderan L, Mehta P, Prilusky J, Sankarnarayanan M, Sussman JL, Krishnaswamy S. Database and comparative identification of prophages. Lec. Notes Control Informat. Sci. 2006;344:863–868.
13. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
14. Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 1998;26:544–548.
15. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–964.
16. Laslett D, Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004;32:11–16.
17. Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD-1996 Proceedings. Menlo Park, CA: AAAI Press; 1996. pp. 226–231.
18. Williams KP. Integration sites for genetic elements in prokaryotic tRNA and tmRNA genes: sublocation preference of integrase subfamilies. Nucleic Acids Res. 2002;30:866–875.
19. Rohwer F. Global phage diversity. Cell. 2003;113:141.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press