Next-generation sequencing in clinical virology: Discovery of new viruses
Abstract
Viruses are a cause of significant health problems worldwide, especially in developing nations. Due to various anthropogenic activities, human populations are exposed to different viral pathogens, many of which emerge as outbreaks. In such situations, discovery of novel viruses is of utmost importance for deciding prevention and treatment strategies. Since the last century, a number of virus discovery methods, based on cell culture inoculation and sequence-independent PCR, have been used for identification of a variety of viruses. However, the recent emergence and commercial availability of next-generation sequencers (NGS) has entirely changed the field of virus discovery. These massively parallel sequencing platforms can sequence genetic material from a very heterogeneous mix with high sensitivity. Moreover, these platforms work in a sequence-independent manner, making them ideal tools for virus discovery. However, for their application in clinics, sample preparation or enrichment is necessary to detect low-abundance virus populations. A number of techniques have also been developed for enrichment of viral nucleic acids. In this manuscript, we review the evolution of sequencing, the NGS technologies available today, and widely used virus enrichment technologies. We also discuss the challenges associated with their application in clinical virus discovery.
Core tip: Rapid development and commercial availability of next-generation sequencing (NGS) systems have dramatically changed almost every field of biological research, especially microbiology and metagenomics. Different NGS systems have been adapted and used for numerous applications in virology too. These systems are capable of rapidly sequencing and analyzing a complex mixture of nucleic acid templates in a massively parallel fashion, making them ideal tools for viral metagenomics and discovery. This manuscript reviews the prevailing NGS technologies and their application in virus discovery, to serve as a guide for readers working in the fields of virology, public health and biothreat mitigation.
INTRODUCTION
Viral infections are a cause of significant health burden globally, particularly in less developed countries. During the 20th century, methods for virus detection, characterization and taxonomic classification were established, which helped in the discovery of a number of important viruses and in the prevention and treatment of viral infections. By the late 1950s, it was generally believed that most of the human pathogenic viruses had been discovered, but the emergence of a number of previously unknown viruses [hepatitis viruses, Hantavirus, human immunodeficiency virus, Marburg virus, severe acute respiratory syndrome (SARS) coronavirus, Ebola virus] during the later part of the century strongly challenged this belief[1].
It has now become obvious that due to various anthropogenic activities, such as extensive globalization of travel and business, rapid unplanned urbanization, deforestation, etc., the epidemiology of viral diseases has changed significantly[1]. This change has led to increased exposure of different human populations to newer pathogens, including viruses, mostly zoonotic in nature[2,3]. The emergence of Ebola virus, Nipah virus, Sin Nombre Hantavirus, SARS coronavirus, influenza viruses (H1N1, H7N9), and MERS coronavirus in the recent past[4] clearly signals that many others are likely to emerge in the near future. According to a recent statistical estimate, there are at least 320,000 mammalian viruses that are waiting to be discovered[5]. The World Health Organization (WHO) has correctly cautioned that, “It would be extremely naïve and complacent to assume that there will be no other disease like AIDS, Ebola, or SARS, sooner or later”[6].
Apart from natural outbreaks, the risk of pathogens, especially deadly viruses, being used as biological weapons and agents of bioterrorism has also increased in recent years[7]. Because viruses are exceptionally diverse in terms of etiology, morphology, nucleic acid type and sequence information, clinical manifestations, etc., their rapid detection and identification pose a great challenge to clinical investigators. Nevertheless, during natural or deliberate outbreaks, identification and characterization of viruses in clinical samples is essential to facilitate prevention and quarantine strategies, implement specific diagnostic tools, and determine an explicit treatment strategy.
This article will review the gradual evolution and recent advances in the field of virus discovery, with special reference to the next-generation sequencing (NGS) technologies and related molecular biology methodologies.
EVOLUTION OF VIRUS DISCOVERY TECHNIQUES
Classical approaches to virus discovery
Classically, virus discovery from clinical samples was based on filtration (to remove host cells and other microbes) and inoculation of the cell-free filtrate into suitable cell cultures, followed by purification of the viruses from the cultures and their characterization[8-10]. Morphological changes in the cultured cells, collectively known as cytopathic effects, such as formation of syncytia, cell rounding, lysis, detachment, or inclusion bodies, indicate the presence of the virus(es) and successful infection of the cells[11]. Virus isolates are purified from the cultured cells or culture supernatant using density gradient and other high-speed centrifugation techniques. This is followed by structural characterization of viral particles, antigens and nucleic acids through different biophysical and biochemical methods[4]. Although classical methods are often considered time-consuming and tedious, and require a significant experimental basis, cell inoculation still remains an exceptional source of the enriched viral particles required for serological and molecular characterization and for other purposes. Nonetheless, in many cases viruses do not readily infect cell cultures, which severely hampers their characterization. Additionally, repeated passaging of the virus to obtain a high titer could change the population of the virus being sought[12].
Nucleic acid sequence-dependent amplification approaches to virus discovery
Subsequently, with the development of nucleic acid sequence-dependent techniques, such as PCR-sequencing and microarrays, traditional cell culture-based methods became obsolete to a large extent[13-16]. These techniques were much faster than classical techniques and led to the discovery of several new genotypes of known viruses. Of the PCR- and microarray-based methods, the former gained enormous popularity due to its ability to rapidly amplify very small amounts of viral sequences from clinical samples. However, the requirement of prior sequence information (to design primers and hybridization probes) made these techniques suitable for discovery of new genotypes of known viruses, but not for entirely novel viruses. This limitation was later addressed by the development of consensus or degenerate PCR[17,18]. Although this PCR method tolerated considerable sequence variation, it was less sensitive than specific PCR and was still critically dependent on prior sequence information about the virus genus/family being investigated. Moreover, this method could only amplify small fractions of the viral genome, which were sometimes not enough for further analysis.
Nucleic acid sequence-independent amplification approaches to virus discovery
The limitations of sequence-dependent techniques prompted investigators to resort to “metagenomics”, an approach that does not presume any knowledge about the organisms being investigated[19]. Metagenomics is the study of the total genetic material present in a given sample, without culturing the organisms present in it. Conventional metagenomic analyses involved direct amplification of the nucleic acids through PCR, cloning and sequencing, etc.[15,20]. At the outset, this technique was used intensively for assessing bacterial diversity within highly diverse samples ranging from soil, oceans, and lakes to human gut and stool, which demonstrated its power to discover genetic material of unknown origin[21]. Subsequently, early virus discovery investigators developed a number of random amplification techniques for viral metagenomics, such as sequence-independent single-primer amplification (SISPA), virus discovery based on cDNA-AFLP (VIDISCA), rolling circle amplification (RCA), etc., to amplify viral genetic material for cloning and sequencing[15,20,22]. Extensive use of these viral metagenomic techniques led to the discovery of different viruses in clinical samples, including human T-cell lymphotropic virus type-1, Torque Teno virus, different Parvoviruses, Coronaviruses, Polyomaviruses, Hepatitis C virus, Sin Nombre virus, Human Herpesviruses 6 and 8, and West Nile virus[23-26].
NGS-based metagenomic approaches to virus discovery
In all of the above nucleic acid sequence-based virus discovery approaches, the Sanger sequencing method played a very significant role. However, with the commercial availability of high-throughput sequencing technologies in 2005, a gradual shift to a new generation of sequencing technologies became evident. These massively parallel sequencing technologies evolved rapidly and transformed almost every field of biological research, including clinical research laboratories[27]. NGS is presently the most attractive approach to metagenomics, including viral metagenomics, because it does not require prior sequence information. Furthermore, being highly sensitive, NGS can rapidly recover nearly full genome sequences of viruses from relatively small amounts of starting material compared with conventional cloning-based approaches[28,29]. Moreover, the large dynamic detection range of NGS has established it as the most powerful technology available to date, and it has accelerated the rate of virus discovery[30-33]. In combination with conventional methods such as SISPA, VIDISCA, RCA, etc., NGS can dramatically improve the turnaround time and sensitivity of virus discovery[23]. Additionally, NGS has numerous applications in virology, including analysis of viral evolution and quasispecies, antiviral resistance, vaccines, etc.[23,33-35].
A comparison of different virus discovery approaches, their advantages and limitations, applicability in different scenarios, etc., is presented in Table 1.
Table 1
A comparative evaluation of the different virus discovery approaches showing advantages and disadvantages associated with them
| | Classical approaches (cell culture and infection based) | Nucleic acid sequence-dependent amplification approaches | Nucleic acid sequence-independent amplification approaches | Next-generation sequencer-based metagenomic approaches |
| Requirement of cell culture systems | Yes, required for virus particle enrichment | Not required | Not required | Not required |
| Information about the cytopathic effects of the virus | Yes, could be achieved through cell changes | No information could be achieved | No information could be achieved | No information could be achieved |
| Requirement of special equipment for purification | Yes, ultracentrifuge/high-speed centrifuges and density gradients are required for preparing pure virus | Not necessary, semi-pure preparations obtained through low-speed centrifugation are suitable | Not necessary, semi-pure preparations obtained through low-speed centrifugation are suitable | Not necessary, semi-pure preparations obtained through low-speed centrifugation are suitable |
| Information about detailed morphological/structural features of the virus | Yes, could be achieved through Electron/Atomic Force microscopy | No information on virus morphology/structure could be achieved directly | No information on virus morphology/structure could be achieved directly | No information on virus morphology/structure could be achieved directly |
| Time required for virus identification | Long, ranging from days to weeks | Comparatively faster; days are required if cloning and sequencing are involved. Faster with microarray-based approaches | Comparatively faster; virus could be identified within a few days | Fastest available approach; identification could be done within days and sometimes even within hours |
| Requirement of prior knowledge about the virus | Not required | Some information is required regarding genus/family to design primers/probes | Being sequence independent technique, no information is required | Being sequence independent technique, no information is required |
| Dynamic detection range | Very narrow | Narrow | Wide | Extremely wide |
| Tolerance to non-viral materials | Vulnerable to other pathogens capable of infecting the cells | Being sequence dependent, less vulnerable to other sequences from host and other pathogens | Being sequence independent, more vulnerable to other sequences from host and other pathogens. Virus enrichment techniques required before analysis | Being sequence independent, more vulnerable to other sequences from host and other pathogens. Virus enrichment techniques required before analysis |
| Suitability for discovery of new viruses | Yes | Less suitable, good at discovery of genotypes/variants of known viruses | Yes | Yes |
| Suitability during outbreaks | Not suitable due to requirement of long time | Not suitable due to requirement of prior sequence information | Yes, but still considerable time is required during outbreaks | Being fast, very much suitable in detecting pathogens in an outbreak scenario |
EVOLUTION OF SEQUENCING TECHNOLOGIES
First-generation sequencers
Originally, two different DNA sequencing methods were described almost simultaneously, Sanger’s method and the Maxam-Gilbert method[36,37], both considered first-generation sequencing methods. Sanger’s method was based on DNA sequencing with chain-terminating inhibitors, while the Maxam-Gilbert method was based on base-specific chemical modification and cleavage of the DNA backbone[38]. Due to its ease and the possibility of automation, Sanger’s method quickly became popular and was successfully commercialized in DNA sequencing machines. As a result, for almost three decades, Sanger’s method dominated as the gold standard for DNA sequencing[39]. This sequencing method was primarily accomplished by amplification of templates with fluorescently labeled chain-terminating nucleotides, followed by capillary electrophoresis of the amplicons and reading of the fluorescence signals, which can provide consistent sequence information for templates of up to 1000 bp. Despite its wide use for sequencing pure templates, this sequencing method was constrained by its low throughput and by the higher cost, time and labor involved in sequencing larger genomes. Furthermore, its complete dependence on specific primers and its inability to sequence genetic material from a mix of diverse organisms severely restricted its use for direct metagenomic applications.
Second-generation sequencers
To overcome the technological constraints of the Sanger sequencers, second-generation or NGS technologies were developed, based on a large number of innovations in amplification technology, sequencing chemistry, microfluidics, imaging technologies, bioinformatics, etc.[40]. These novel sequencing technologies, initially commercialized by two companies, Roche and Illumina, and later by Life Technologies, have spectacularly high throughput and high sensitivity, making them more appropriate for direct application in metagenomic studies. Compared with Sanger sequencers, currently available second-generation NGS platforms generate only short sequence reads, but the real strength of NGS lies in its capability to sequence and analyze complex mixes of DNA in a massively parallel manner, generating millions to billions of sequence reads in a single run. Consequently, these technologies are often referred to as “short read” technologies and are distinguished from “third generation” (or “long read”) sequencing technologies that provide significantly longer reads (kilobases). However, at present, these long read technologies have, on the whole, lower throughput and accuracy[41-43].
Although widely distinct in their sequencing chemistry and detection technology, NGS platforms share the principle of massively parallel sequencing of clonally amplified or single DNA molecules. On these platforms, sequencing is executed by repetitive cycles of polymerase-mediated nucleotide extension (Roche-454, Illumina GA) or oligonucleotide ligation (SOLiD). Using a “wash-and-scan” technique, sequence data are acquired as large sets of fluorescence or luminescence images of the flow-cell surface after each sequencing cycle[44]. These data are later compiled using a computer-intensive pipeline for image integration, quality assessment, storage, processing and analysis. A typical NGS run generates several hundred megabases (Mb) to gigabases (Gb) of nucleotide sequence data, depending on the platform.
Although the NGS platforms commercially available today all provide massively parallel sequencing, their technological features and data output capabilities make each platform suitable for certain specific applications. Hence, the NGS platform needs to be carefully selected according to the explicit requirements. In the case of virus discovery, which is the scope of this review, NGS platforms capable of generating longer sequence reads are preferable. Long reads are extremely useful for de novo read assembly and generation of longer contigs, which provide improved statistical power for finding related sequences in nucleotide database searches[45]. Conversely, for characterization and analysis of virus variants and quasispecies, platforms providing high-quality reads, i.e., fewer errors and increased depth, are preferred over those providing longer reads. In this review, we briefly discuss the most popular NGS technologies widely used in virology (Illumina and Roche 454). The details of the technologies, sequencing chemistries and other applications have been reviewed in detail elsewhere[10,31,34].
The most widely used NGS technology is Illumina sequencing, in which the template is clonally amplified to form DNA clusters using primers attached to a solid surface, and sequencing is achieved via reversible dye-terminator technology. Although Illumina sequencing has a higher sequence yield at a relatively low cost per base, this platform shows a characteristic systematic base-calling bias, differences in sequence quality, a higher sequencing error rate, and increased single-base errors associated with GGC motifs[46-49].
On the other hand, 454 sequencing platforms are based on parallel pyrosequencing, which uses sequencing-by-synthesis chemistry and detects chemiluminescence to determine the nucleotide sequence. This method amplifies DNA through emulsion PCR, generating clones of DNA from a single template. The main benefit of this technology is its ability to produce long reads, while it is restricted by its high error rate in homopolymer-containing regions and a high rate of artificial amplification[50-52]. The error rates of NGS are higher than those of Sanger sequencers, and NGS data also require advanced computational tools and statistical calculations before further data processing and assembly[53]. Because of these platform-specific errors, the use of barcoding strategies, simultaneous sequencing of the samples on two different NGS platforms, or high-coverage sequencing has been recommended to counteract their effects[54-56]. Nevertheless, these issues are being continually addressed and resolved in newer versions of these platforms to make them more robust, both in terms of quality and quantity.
With the advancement in instrumentations, NGS platforms are now available as benchtop sequencing instruments in the form of the 454 GS Junior (Roche) and MiSeq (Illumina) which, despite having a small footprint, offer exciting NGS capabilities for clinical settings, at modest running costs[45]. MiSeq includes the Nextera, TruSeq, and reversible terminator-based sequencing by synthesis chemistry and has highest data integrity with broader range of application, including amplicon sequencing, clone checking, small genome sequencing etc. The MiSeq provides maximum throughput per run with lowest error rates, while the 454 GS Junior generates longer reads (approximately 600 bases) with better assemblies, but is limited by lower throughput and homopolymer-associated errors.
Apart from the two most widely used NGS technologies, another technology known as the SOLiD technology (by Life Technologies) is commercially available, but its representation in the scientific literature is limited compared to Roche 454 and Illumina, which might be attributable to its recent availability or complexity of data processing and assembly[57]. Nevertheless, SOLiD is slowly but gradually being accepted as a very reliable platform and has recently been used for de novo sequencing of a large mammalian genome[58].
Technical details of the NGS technologies have been extensively reviewed earlier[23,45,59]. A comparison of the currently available NGS systems is also available at the Genohub website (https://genohub.com/ngs-instrument-guide/).
Third-generation sequencers
The third generation of sequencers has evolved lately and includes the Ion Torrent (Life Technologies), Single-Molecule Real-Time (SMRT) technology (Pacific Biosciences), and Nanopore sequencing technology (Oxford Nanopore Technologies). Third-generation sequencers are distinct from their predecessors in two primary features: (1) template amplification is not needed prior to sequencing, which cuts down template preparation time and cost; and (2) the signal is registered in real time, directly, during the enzymatic reaction. Apart from the Ion Torrent, the third-generation sequencing technologies are quite recent and still in the evaluation stages. Moreover, data on their application in the field of virus discovery are extremely scanty. Hence, they are discussed only briefly in this review.
The Ion Torrent Personal Genome Machine is based on a semiconductor based sequencing technology and does not require a fluorescence or chemiluminescence based image scanning, resulting in high speed, low cost sequencing system within small size equipment. Cyclically, the semiconductor microfluidic chip is flooded with each nucleotide, and a voltage is generated if it is incorporated, and no voltage is generated when not incorporated. This is based on the fact that every time a nucleotide is incorporated into the DNA molecules, a proton is released, causing a change in voltage, which is subsequently detected and registered by the chip[45,60].
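The cyclic flooding described above can be illustrated with a small toy model. The sketch below is not the vendor's base-calling algorithm; it assumes a fixed T-A-C-G flow order, noise-free signals, and a signal proportional to the number of bases incorporated in one flow, and the function names `simulate_flows` and `decode_flows` are hypothetical. It shows why a flow with no incorporation yields no signal and why a homopolymer run is read in a single flow.

```python
# Toy model of flow-based (semiconductor) sequencing.
# Assumptions (illustrative only): fixed T-A-C-G flow order, no signal noise,
# and a signal equal to the number of bases incorporated during the flow.
FLOW_ORDER = "TACG"


def simulate_flows(read, flow_order=FLOW_ORDER):
    """Return the idealized signal (bases incorporated) for each flow.

    `read` is the sequence of the strand being synthesized.
    """
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError("unsupported bases: %s" % sorted(unknown))
    signals = []
    pos = 0
    flow = 0
    while pos < len(read):
        base = flow_order[flow % len(flow_order)]
        count = 0
        # A homopolymer run is incorporated within one flow, so the
        # idealized signal grows with the length of the run.
        while pos < len(read) and read[pos] == base:
            count += 1
            pos += 1
        signals.append(count)  # 0 means the flowed nucleotide was not incorporated
        flow += 1
    return signals


def decode_flows(signals, flow_order=FLOW_ORDER):
    """Rebuild the read from per-flow signals (inverse of simulate_flows)."""
    return "".join(
        flow_order[i % len(flow_order)] * n for i, n in enumerate(signals)
    )


if __name__ == "__main__":
    flows = simulate_flows("TTAGC")
    print(flows)                # [2, 1, 0, 1, 0, 0, 1]
    print(decode_flows(flows))  # TTAGC
```

In a real instrument the per-flow signal is noisy, so estimating the length of long homopolymer runs is error-prone; the toy model deliberately ignores this.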
Using the SMRT, single large DNA molecules can be sequenced with high processivity of up to 7 kb, with average read lengths of 3-4 kb[23,61]. On a SMRT cell, numerous Zero-Mode Waveguides are embedded with single set of enzymes and DNA template. During the reaction, enzyme incorporates a nucleotide into the complementary strand, cleaving off fluorescent dye linked with the nucleotide, and this fluorescent signal is captured[61].
Nanopore sequencing is another recently developed third-generation sequencing method[62,63]. A nanopore is a tiny biopore with a diameter in the nanoscale range; the method described here uses the heptameric transmembrane channel α-haemolysin (αHL) from Staphylococcus aureus. This protein has the ability to tolerate extraordinary voltage and current conditions (up to 100 mV, 100 pA). Under a standard condition of ionic flow, when a DNA molecule is passed through the αHL channel, the current is modulated according to the size difference between the deoxyribonucleoside monophosphates (dNMPs). This current modulation is detected by standard electrophysiological techniques and the dNMP is identified[62]. Nanopore sequencers are extremely small (the size of a USB drive), can sequence long reads quickly (> 5 kb at a rate of 1 bp/ns), do not rely on fluorescence/chemiluminescence or additional enzymes, and are less sensitive to temperature and other conditions. These benefits make nanopore sequencing suitable as an extremely rapid sequencing approach for field conditions, but the requirement for highly purified DNA needs to be addressed before it can be widely applied in virus discovery.
Among the different NGS platforms available today, choosing the right one for correct application is extremely essential before embarking on a metagenomic project. In case of absence of a reference genome, or where highly divergent sequences are expected, such as in case of virus discovery, de novo sequencing and assembly is necessary. Such an assembly requires extensive computational power and datasets containing longer reads with higher coverage are preferable[64-66]. When reference genomes for assembly are available, technologies that generate short reads could also be used to have a high coverage of the metagenomes[53].
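The trade-off between read length, read number and coverage mentioned above can be made concrete with the standard mean-coverage relation C = N × L / G. The following is a minimal sketch under the idealized assumption that reads are placed uniformly at random on the genome (the Lander-Waterman model); the function names and the example genome size are hypothetical and are not taken from the studies cited here.

```python
import math


def mean_coverage(n_reads, read_length, genome_length):
    """Mean depth of coverage C = N * L / G."""
    return n_reads * read_length / genome_length


def expected_fraction_covered(coverage):
    """Expected fraction of bases covered at least once, assuming reads
    land uniformly at random (Lander-Waterman idealization)."""
    return 1.0 - math.exp(-coverage)


def reads_for_coverage(target_coverage, read_length, genome_length):
    """Number of reads needed to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    genome = 10_000  # hypothetical 10 kb viral genome
    print(mean_coverage(1_000, 150, genome))   # 15.0
    print(reads_for_coverage(30, 150, genome)) # 2000
    print(round(expected_fraction_covered(1.0), 3))
```

Real datasets deviate from this model because of amplification bias and uneven genome representation, so the numbers should be read as rough planning estimates only.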
In terms of number of publications, Illumina technology is the most widely used platform, irrespective of application. Earlier, this platform was not well suited to virus discovery or de novo sequencing projects due to its short reads. However, steady increases in read length on Illumina platforms have made them suitable for de novo assembly of genomes, at a sensitivity comparable to that of specific PCR[53,67,68]. Nevertheless, according to the number of publications specifically on metagenomic studies, pyrosequencing technology (Roche 454) is preferred over the other NGS approaches, which produce shorter reads. Of late, Roche has announced the discontinuation of its 454 technology by mid-2016, which leaves new investigators with the alternative NGS platforms available today.
SAMPLE PREPARATION FOR VIRAL METAGENOMICS AND DISCOVERY
NGS has emerged as the most promising tool for the detection and discovery of novel infectious agents in clinical specimens[23]. However, being an unbiased method of sequencing, NGS is greatly affected by the very low virus-to-host genome ratios in clinical samples[69-71]. Hence, enrichment of pathogen genetic material or depletion of host genetic material is essential to maximize sensitivity for the discovery of novel pathogens, including viruses, in clinical samples[23,72,73]. A schematic representation of the different steps involved in NGS-based virus metagenomics and discovery is depicted in Figure 1.

Figure 1
Diagrammatic representation of the main steps of clinical virus discovery by next-generation sequencer-based technologies.
Physical enrichment of virus particles
A number of virus enrichment protocols involving physical and enzymatic techniques have been successfully applied to clinical samples. These include virus capsid purification through cell disruption by freeze/thaw cycles, filtration through membranes of appropriate pore size (0.45 μm and 0.22 μm), centrifugation, prior nuclease digestion of the host genome, etc., followed by extraction of capsid-protected viral nucleic acids, their conversion to cDNA (in the case of RNA viruses) and non-specific PCR amplification[15]. The efficiency of enrichment in NGS-mediated virus discovery, especially of prior nuclease digestion, has been clearly demonstrated by different studies[12,72,74]. Recently, Hall et al[74] reviewed the literature available on methods for enrichment of viral nucleic acids from clinical samples for NGS-based studies. They found that both ultracentrifugation-mediated enrichment and low-speed centrifugation together with filtration and a nuclease digestion step are widely used for enrichment of viral nucleic acids.
Alternatively, approaches to deplete host genetic material include the use of methylation-specific DNase activity, host ribosomal RNA removal, and duplex-specific nuclease normalization methods[75-77]. Such techniques increase the detection sensitivity of the NGS platform and, at the same time, avoid the cost and time involved in generating and analyzing huge amounts of background data. Ideally, in a clinical setting, virus enrichment methods should be rapid, standardized and undemanding in terms of cost, manpower and instrumentation.
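The effect of a low virus-to-host ratio on detection, and the benefit of enrichment or host depletion, can be sketched numerically. The example below assumes that each read is independently viral with a fixed probability (so the number of viral reads is approximately Poisson distributed); the viral fractions and the 100-fold enrichment factor are hypothetical values chosen for illustration, not measurements from the cited studies.

```python
import math


def expected_viral_reads(total_reads, viral_fraction):
    """Expected number of viral reads in a run."""
    return total_reads * viral_fraction


def prob_detect(total_reads, viral_fraction, min_reads=1):
    """Poisson approximation of P(at least `min_reads` viral reads)."""
    lam = total_reads * viral_fraction
    p_fewer = sum(
        math.exp(-lam) * lam ** k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    total = 1_000_000   # reads in the run (hypothetical)
    raw_fraction = 1e-6 # viral fraction without enrichment (hypothetical)
    enriched = raw_fraction * 100  # assumed 100-fold enrichment
    for label, frac in (("raw", raw_fraction), ("enriched", enriched)):
        print(label,
              round(expected_viral_reads(total, frac), 2),
              round(prob_detect(total, frac, min_reads=5), 4))
```

Requiring several viral reads (`min_reads`) rather than one is a simple way to express that a single read is rarely enough to support the identification of a virus.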
Enrichment of viral nucleic acids through non-specific amplification techniques
A number of virus enrichment methods have been applied successfully for NGS studies of different clinical samples. Of them, the sequence-independent single primer amplification (SISPA), developed by Reyes and Kim[78], was modified for successful amplification of viral sequences from serum by Allander et al[79] and later by others for identification of novel viruses through Sanger sequencing[80-83]. Recently, SISPA was used in combination with NGS and shown to be successful in detection of Hepatitis B and C viruses (HBV, HCV) in solid tissue samples[72]. In a recent study, SISPA-NGS strategy was found to be helpful in detection of Schmallenberg virus (SBV) in veterinary samples[84], suggesting the utility of this technique in screening of field animals that are intermediate hosts to many human viruses. In some of the recent studies no specific physical enrichment of virus particles was applied, but NGS was done on SISPA generated random PCR products, that also resulted in rapid detection of hemorrhagic fever-associated Yellow Fever Virus (YFV), Lujo virus (LUJV), and a new Arenavirus (related to lymphocytic choriomeningitis virus, LCMV) in diverse clinical samples[85-87].
Likewise, another well-established sequence-independent amplification technique is virus discovery cDNA-amplified fragment length polymorphism (VIDISCA), used for the discovery of a novel human coronavirus, HCoV-NL63[88,89]. Later, this technique was successfully used in combination with Sanger sequencing to discover other novel viruses in clinical samples[90,91]. Of late, the utility of this technique in combination with NGS for virus discovery has been demonstrated in veterinary samples as well as in clinical samples[92,93]. Additionally, Shaukat et al[92] modified the VIDISCA method at the reverse transcription step by using a specially designed mix of random hexamers that do not anneal to ribosomal RNA, further increasing the specificity of the assay. Apart from SISPA and VIDISCA, multiply-primed RCA has also been demonstrated to enrich circular viral genomes, making them suitable for sequencing on NGS platforms[94-96]. A diagrammatic representation of SISPA, VIDISCA and RCA is shown in Figure 2. Recently, Kohl et al[97] reported an ultracentrifugation and DNA digestion-based enrichment protocol followed by SISPA for the detection of known and new viruses in human tissue samples. This technique, termed tissue-based universal virus detection for viral metagenomics, was shown to be completed within 28 h, making it suitable for the discovery of zoonotic and biothreat agents of viral origin during outbreaks[97].

Figure 2
Different virus nucleic acid enrichment techniques. A: Sequence-independent single-primer amplification. Initially, viral RNA and ssDNA are converted into complementary DNA (cDNA) using reverse transcriptase (RT) and DNA Pol I, respectively, with the help of tagged primers having a defined sequence at the 5’ end and random nucleotides at the 3’ end. Subsequently, second strand synthesis is performed using DNA Pol I (Klenow) to make the cDNA double stranded (dsDNA). At this point, all the nucleic acids present in the reaction are dsDNA fragments carrying the tag sequence at their ends. Finally, the tagged dsDNA is amplified with primers annealing to the tag-specific sequences; the PCR products are checked and are ready for analysis through cloning-sequencing or direct sequencing through next-generation sequencers (NGS); B: Virus discovery based on cDNA-AFLP. Initially, viral RNA is reverse transcribed into cDNA using RT and random primers. Subsequently, second strand synthesis is performed using DNA Pol I (Klenow) to make the cDNA double stranded. In this step, viral single stranded DNA (ssDNA) is also converted to dsDNA, so that all the nucleic acids present in the reaction are dsDNA. In the next step, the dsDNA is digested with a set of frequent cutter restriction endonucleases, which produce asymmetric cuts. Specially designed matching anchor-adapters are then ligated to the ends of the restriction fragments using DNA ligase. Finally, the anchored dsDNA is amplified with primers annealing to the adapter-specific sequences; the PCR products are checked and are ready for analysis through cloning-sequencing or direct sequencing through NGS; C: Rolling circle amplification. Amplification of multiply primed single stranded circular viral genomes. 3’-exonuclease resistant primers randomly bind the genome and are elongated by the Phi29 polymerase. The growing strand subsequently displaces the preceding strand of the DNA, making the displaced strand available for binding of random primers and further elongation. This cyclic displacement and elongation leads to a highly branched structure of growing DNA, which is linear in topology. Rolling circle amplification can specifically enrich circular ssDNA genomes in an environment of other genetic material; the enriched genomes can then be characterized by NGS.
Alternatively, in another study, the authors used a barcoding strategy to carry out unbiased deep sequencing of multiple clinical samples, removed human and low-quality sequences through a bioinformatic filtering pipeline, and identified viruses belonging to the Herpesviridae, Flaviviridae, Circoviridae, Anelloviridae, Asfarviridae, and Parvoviridae families in serum samples from patients with tropical febrile illness[2].
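The two steps described above, assigning pooled reads to samples by barcode and discarding host-derived reads, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the pipeline used in the cited study: the barcodes, sample names, reads and host reference are all hypothetical, barcodes are matched exactly at the 5’ end, and host reads are recognized by shared k-mers, whereas real pipelines typically align reads to a host genome.

```python
def demultiplex(reads, barcodes):
    """Assign reads to samples by exact 5' barcode match and trim the barcode."""
    by_sample = {sample: [] for sample in barcodes.values()}
    unassigned = []
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            unassigned.append(read)
    return by_sample, unassigned


def remove_host_reads(reads, host_reference, k=8):
    """Drop reads that share at least one k-mer with the host reference."""
    host_kmers = {host_reference[i:i + k]
                  for i in range(len(host_reference) - k + 1)}
    return [read for read in reads
            if not any(read[i:i + k] in host_kmers
                       for i in range(len(read) - k + 1))]


# Hypothetical barcodes, reads and host sequence, for illustration only.
barcodes = {"ACGT": "patient_A", "TGCA": "patient_B"}
host_reference = "TTTTGGGGCCCCAAAATTTT"
pooled_reads = [
    "ACGTGGGGCCCCAAAA",  # patient_A, host-like
    "ACGTATCGGATCCGTA",  # patient_A, non-host
    "TGCACTAGCTAGGATC",  # patient_B, non-host
    "GGGGAAAACCCC",      # no recognizable barcode
]

by_sample, unassigned = demultiplex(pooled_reads, barcodes)
candidates = {sample: remove_host_reads(reads, host_reference)
              for sample, reads in by_sample.items()}
print(candidates)
print("unassigned:", unassigned)
```

Only the reads that survive both steps would be passed on to assembly and database searches.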
Apart from virus discovery and detection in clinical samples, analysis of quasispecies and drug-resistant viral variants, and monitoring of the genetic consistency of live viral vaccines, NGS has numerous other applications that are directly associated with human viral diseases. NGS-based virus detection has also been shown to be useful in surveillance of vector-borne and zoonotic viruses[23]. The possibility of detecting arthropod-borne viruses was demonstrated using Dengue virus-infected mosquito pools (Aedes aegypti), where the use of NGS resulted in highly sensitive detection of pools containing infected vectors[98]. Similarly, in a surveillance study focused on the discovery of bat-transmitted pathogens, a new coronavirus related to SARS-CoV was documented using coronavirus consensus PCR and unbiased NGS[99].
BIOINFORMATICS CHALLENGES ASSOCIATED WITH NGS
Regardless of the field of application and the platform used, the ever-increasing capacities of NGS platforms and their wide usage have resulted in unprecedented volumes of data. This is commonly referred to as the “data deluge”, and is reflected in the huge NGS datasets deposited in specialized data archives such as the Sequence Read Archive (SRA), a primary archive of the NIH dedicated to the submission and storage of raw data and alignment information generated by all major NGS platforms. The SRA at the National Center for Biotechnology Information is part of the International Nucleotide Sequence Database Collaboration, so data submitted to any of the three member archives, namely the SRA, the ENA (European Nucleotide Archive of the European Bioinformatics Institute, EBI) and the DDBJ (DNA Data Bank of Japan), are shared amongst them. The SRA serves as a starting point for downstream analysis of NGS data and also provides access to data from human clinical samples to authorized users. According to a recent comparison of GenBank statistics (Release 197, 8/2013 vs Release 203, 8/2014), total nucleotide entries in GenBank grew by more than 43% annually, and virus sequence entries alone grew by 21%[100]. This data deluge has posed significant hardware, software and bioinformatics challenges for the storage, transfer, analysis and interpretation of the data[101].
All NGS platforms are advancing towards the capability to sequence longer DNA fragments and to generate even larger datasets[53]. Analyzing such volumes of data requires large-scale computational facilities, a demand that has transformed the field of bioinformatics[60,102]. Once the sequence data have been generated, the biggest challenge is the computational requirement for storage and analysis of the massive datasets. Although a detailed description of the bioinformatic processes involved in metagenomic data analysis is beyond the scope of this review, the key processes involved in NGS data analysis are quality assessment, sequence assembly and annotation of the dataset against a database of nucleotide or protein sequences[34]. Quality assessment and data cleaning involve filtering out low-quality sequences from the dataset, followed by alignment and error correction to separate true variation from experimental noise[23]. After sequencing and quality assessment, there are two approaches for assembly of the reads: the reads are either mapped to an available reference genome, or assembled de novo using different assembly servers[34,103]. The de novo approach is generally followed for discovery of viruses, because reference genomes or related sequences may not be available in the databases. To determine the relatedness of the assembled reads or contigs to known sequences, the Basic Local Alignment Search Tool (BLAST) is used, which computes regions of similarity and the statistical significance of possible matches between a query sequence and GenBank submissions[104]. Despite the availability of BLAST, analyzing a viral metagenome may still be challenging in the case of highly divergent or novel viral families that are not represented in the database.
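As an illustration of the quality assessment step, the Python sketch below parses four-line FASTQ records and keeps only reads that pass simple thresholds. It is a minimal example under stated assumptions, not a recommended pipeline: the thresholds (mean Phred score of at least 20, minimum length of 5 bases, no ambiguous N bases), the Phred+33 encoding and the example reads are all chosen for illustration.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def quality_filter(records, min_mean_q=20, min_length=5):
    """Keep reads that are long enough, free of N bases and of good mean quality."""
    kept = []
    for read_id, seq, qual in records:
        if len(seq) < min_length or "N" in seq.upper():
            continue
        if mean_phred(qual) >= min_mean_q:
            kept.append((read_id, seq, qual))
    return kept


# Hypothetical reads: one good, one with very low quality, one too short.
EXAMPLE_FASTQ = """\
@good
ACGTACGTAC
+
IIIIIIIIII
@noisy
ACGTACGTAC
+
##########
@short
ACG
+
III
"""

passed = quality_filter(parse_fastq(EXAMPLE_FASTQ))
print([read_id for read_id, _, _ in passed])
```

Reads that pass such filters would then go on to reference mapping or de novo assembly, followed by database searches of the resulting contigs.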
In Table 2, we have summarized the challenges associated with handling and analysis of NGS-generated data, together with the solutions presently available or suggested.
Table 2
Important bioinformatics challenges associated with application of next-generation sequencers in viral diagnostics, and actions taken or proposed to overcome them
| Bioinformatics challenges associated with application of NGS in viral diagnostics | Action taken or proposed to overcome challenges |
| Generation of huge volumes of data by NGS platforms (the “data deluge”) | Advancement in storage and computation facilities, availability of computers with greater storage and more powerful processors, cluster/grid computing and cloud computing. Computation facilities need to be updated as newer platforms delivering larger datasets emerge |
| Challenges in uploading data for submission to databases and supercomputing servers for analysis | Requirement of uninterrupted and extremely fast networks |
| Challenges in storage, public archival and ease of access | Creation of specialized data archives such as the Sequence Read Archive by NIH and the ENA (European Nucleotide Archive) by EBI. Sharing of data among the three major databases (NIH, EBI and DDBJ) for public accessibility |
| Challenges in analysis and visualization of large volumes of data, beyond the scope of computation facilities available in molecular biology laboratories | Creation of metagenomic or NGS data analysis pipelines and integrated tool kits, such as those available at NIH-NCBI, EMBL-EBI, MG-RAST, CASAVA, MetaVir, MEGAN, UCSC Genome Browser, BioLinux, etc., and availability of cloud computing based servers such as Galaxy |
| Challenges in alignment, de novo assembly, gene prediction and phylogenetic analyses of NGS datasets, especially short read datasets | Availability of alignment, assembly and gene prediction algorithms/programs such as ABySS, ELAND, SOAP, Bowtie, CloudBurst, ZOOM, BWA, SHRiMP, MOM, SeqMap, MetaGene, Velvet, QSRA, ALLPATHS, EDENA, VCAKE, FragGeneScan, BLAST, GLIMMER, EULER-SR, Avadis, EagleView, etc. |
| Interpretation of huge amount of data generated in metagenomic analyses by NGS platforms | Proper interpretation of analyzed data is of utmost importance to identify newer pathogens as well as their clinical significance |
NGS: Next-generation sequencers.
CONCLUSION
During the last decade, numerous innovations in virus enrichment techniques, sequencing chemistry and signal detection technologies, together with the availability of high-end dedicated bioinformatic servers for analysis of NGS data, have greatly accelerated the discovery of viral pathogens in clinical samples. Apart from its increasing applications in virus discovery, NGS has been successfully used in monitoring of antiviral drug resistance, investigation of viral evolution, diversity and quasispecies, and evaluation of the human virome. A major advantage of NGS platforms is their ability to simultaneously characterize hundreds of different pathogens, including those that cannot be cultivated using conventional approaches. Nevertheless, a number of challenges need to be overcome for these technologies to become routine in clinical settings. The initial cost of set-up, the turnaround time, and the requirement for powerful computational facilities and highly skilled personnel are the major barriers to their wide application in resource-limited countries, where the number of cases of emerging viruses is the highest.
Despite the broad utility of NGS in virus discovery, the extremely high sensitivity of this technique also makes it prone to unintentional contamination. The use of random primers for enrichment, combined with deep sequencing, creates significant potential for carryover contamination from laboratory reagents. Simultaneous analysis of blinded controls may be one approach towards excluding such possibilities, but it would also double the cost of sequencing. Another outcome of NGS is the rapid rate of discovery of viruses. However, the absence of appropriate cell culture systems or animal models limits the possibility of experimental studies on these new viruses, so their clinical significance remains to be properly understood.
ACKNOWLEDGMENTS
We thankfully acknowledge the Defence Research and Development Organization (DRDO), Ministry of Defence, Government of India for funding and support. We also thank the editor and three anonymous reviewers for their constructive comments, which helped us immensely to improve this manuscript.
Footnotes
P- Reviewer: Chen YD, Demonacos C, Qiu HJ S- Editor: Song XX L- Editor: A E- Editor: Yan JL
Supported by the Defence Research and Development Organization (DRDO), Ministry of Defence, Government of India.
Conflict-of-interest statement: The authors declare no conflict of interest related to the submitted manuscript.
Open-Access: This article is an open-access article which was selected by an in-house editor and fully peer-reviewed by external reviewers. It is distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
Peer-review started: January 27, 2015
First decision: March 6, 2015
Article in press: May 8, 2015
