• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of cmbMary Ann Liebert, Inc.Mary Ann Liebert, Inc.JournalsSearchAlerts
Journal of Computational Biology
J Comput Biol. Jun 2012; 19(6): 796–813.
PMCID: PMC3375638

Mapping Reads on a Genomic Sequence: An Algorithmic Overview and a Practical Comparative Analysis

Abstract

Mapping short reads against a reference genome is classically the first step of many next-generation sequencing data analyses, and it should be as accurate as possible. Because of the large number of reads to handle, numerous sophisticated algorithms have been developped in the last 3 years to tackle this problem. In this article, we first review the underlying algorithms used in most of the existing mapping tools, and then we compare the performance of nine of these tools on a well controled benchmark built for this purpose. We built a set of reads that exist in single or multiple copies in a reference genome and for which there is no mismatch, and a set of reads with three mismatches. We considered as reference genome both the human genome and a concatenation of all complete bacterial genomes. On each dataset, we quantified the capacity of the different tools to retrieve all the occurrences of the reads in the reference genome. Special attention was paid to reads uniquely reported and to reads with multiple hits.

Key words: NGS, benchmarking, short read alignment, Burrows-Wheeler Transform, suffix tree, suffix array, hashing, spaced seeds

1. Introduction

Next-generation sequencing data are now the standard to produce genomic and transcriptomic knowledge about an organism, and they are massively produced due to an affordable cost. Mapping short reads against a reference genome is typically the first step to analyze such next-generation sequencing data, and it should be as accurate as possible. Because of the high number of reads to handle, numerous sophisticated algorithms have been developped in the last 3 years to tackle this problem and many mapping tools exist now. These tools usually have their own specificities, and the question of which one to use for a given application is a very vexing question. A very recent article (Ruffalo et al., 2011) presents a comparative analysis of six mapping tools run on the human genome. Their criteria to compare the performances of the mapping tools are based on the quality score, computed by the different tools, of the retrieved mappings. For a given set of reads, they define the accuracy of a mapping tool as the proportion of “good” mapping at the correct location. They globally analyze the accurracy with respect to various parameters such as the error rate, the size and the frequency of the indels in the reads. Our aim in this article is also to compare some mapping tools (four tools in common with the previous reference and five additional ones), but we choose to build a more controlled benchmark in order to study quantitatively some detailed aspects of the mapping task. By more controlled benchmark, we mean that we know how the reads really map to the reference genome and that all reads have the same amount of errors. We thus address the following questions: Are the tools capable to systematically map a read occurring exactly (with no mismatch) in the reference genome? Can they always do it for a read having as many errors as the maximum number of mismatches allowed in the alignments? For reads occurring at several positions, do they retrieve all the occurrences or only a subset? Do the reads reported as unique really occur only once along the genome? As we will see, the answer will not always be positive, so it is important to know the limitation of each tools.

We have evaluated the performance of the nine following mapping tools: BWA (Li and Durbin 2009), Novoalign (Novocraft, 2010), Bowtie (Langmead et al., 2009), SOAP2 (Li et al., 2009), BFAST (Homer et al., 2009), SSAHA2 (Ning et al., 2001), MPscan (Rivals et al., 2009), GASSST (Rizk and Lavenier, 2010), and PerM (Chen et al., 2009).

They can be divided into two main categories according to the type of algorithm they are based on: hash table based algorithms (indexing either the reads or the reference genome) and Burrows-Wheeler Transform based algorithms (see Table 2 below). MPscan uses a different approach based on suffix trees. In Section 2, a description of these algorithms will be presented. Section 3 will then describe the benchmarks we have built to compare these nine mapping tools. Results are given in Section 4 and are discussed in Section 5.

Table 2.
Global characteristics of the mapping tools

2. Algorithmic Overview

In this section, our aim is to describe the algorithms on which mapping tools are based. Although these algorithms are complex, we tried to make them as clear as possible. The beginning of each section broadly describe the methods and the structures, then a more in-depth description follows.

2.1. Hashing

2.1.1. Basic algorithms

Let us suppose, for the sake of convenience, that all the reads have the same size, say 36 nucleotides.

The most straightforward way of finding all the occurrences of a read, if no gap is allowed, consists in “sliding” the read along the genome sequence and noting the positions where there exists a perfect match. Unfortunately, although conceptually simple, this algorithm has a complexity O(LGLrNr) where LG is the size of the genome sequence, Lr the size of the read and Nr is the number of reads. When gaps are allowed, one has to resort to the popular dynamic programming algorithm, such as the Needleman and Wunsch algorithm, whose complexity is also O(LGLrNr). Algorithms with this complexity are far too slow for aligning several hundreds millions reads on the human genome as the case may happen nowadays.

Therefore, to be efficient, all the methods must rely on some sort of pre-computing. For instance, it is theoretically conceivable to compile a list of all the words of length 36 (36-mers) that are found in the genome and determine once for all their positions. Then, we can use a hashing algorithm to transform a string into a key that allows a fast search. Consequently, finding the positions of a read would be straightforward using this list. However, this method does not work in practice since it requires too much computer memory or disk space. Indeed, in the worst case, assuming that all the 36-mers generated from the genome are different, there are about LG such words. In addition, this method does not allow us to consider mismatches easily.

A possible workaround to this problem of space, is to store all k-mers in a list, using a value of k significantly less than the read size, say k = 9 (Fig. 1A). Some tools chose to select a k-mer for each read (a good choice is the leftmost part, because the quality is better) and map it to the genome using the hashing procedure. In terms of space, the problem is now more tractable since there is, at most, 49 = 262144 different 9-mers in the genome. Then, for each possible hit, the procedure would then try to map the rest of the read to the genome (possibly allowing errors, in a Needleman–Wunsch-like algorithm). This two-steps strategy, called seed and extend, is implemented in SSAHA and Stampy (Lunter and Goodson, 2010).

FIG. 1.
The hashing algorithm. (A) The genome is cut into overlapping 3-mers, and their respective positions in the genome are stored. (B) The read is cut into 3-mers. The 3-mers from the reads are compared to 3-mers from the genome using a hashing procedure. ...

A drawback of the previous method is that seeds may be highly repeated in the genome. As a consequence, many hits should be verified in the “extend” phase, which is time consuming. Another approach is to divide each read into four 9-nucleotide-long substrings. Then, as previously, every substring of a read can be matched using the 9-mer list (Fig. 1B). If the four substrings of a read are found in the list, in the correct order and adjacent to each other, the read exists in the genome (Fig. 1C, D). However, this algorithm does not permit to consider mismatches.

Chopping reads into smaller pieces can also be used to find reads with errors (mismatches and indels). Suppose that one allows for two errors when mapping reads, then one can be sure that at least two out of the four 9-mers will map exactly (in the worst case, the two errors are located in two different 9-mers, thus leaving two 9-mers without error) (Fig. 2). This is known as the pigeon hole principle. The two 9-mers that exactly match the genome constitute an anchor. A more finely-tuned search, in the immediate vicinity of this anchor, is able to recover the read containing errors. This two-steps strategy is implemented for instance in MAQ (Li et al., 2008), SOAP, RMAP (Ning et al., 2001), and SeqMap (Jiang and Wong, 2008).

FIG. 2.
The seed and extend algorithm. (A) A read ATCA is sought for in GATTACA, using seeds of size 2, with one error. Each seed maps once (left part). After extension of each seed (right part), it turns out that only one mapping contains only one error. (B) ...

A similar method, called q-gram filtering, is an extension of the pigeon hole principle. The main difference is that the reads are not chopped into non-overlapping k-mers, but every k-mer is extracted and mapped to the genome (in the read ACGT, the 2-mers would be AC, CG, and GT). If a sufficient number of k-mers map in a small region, the hit is then chosen for a more careful alignment. This method has been used for instance in SHRiMP2 (David et al., 2011) and RazerS (Weese et al., 2009).

Still, for real instances the list is usually too large to be kept in memory. Characters are usually stored using a byte (8 bits). DNA alphabet only consists of the four letters A, C, G, and T. To gain memory space, some methods choose to encode each nucleotide with 2 bits (e.g., A is 00, C is 01, G is 10, and T is 11). Notice that such an encoding does not allow the handling of ambiguous nucleotides (N), which sometimes exist in the reads or the reference genome. To get around this problem, different methods adopt different strategies. For instance, BWA randomly transforms all ambiguous nucleotides into a regular nucleotide. SOAP2, arbitrarily, transforms them into Gs in the reads.

Let us note that the reads too can be hashed, in which case one searches to match regions of the genome, having the size of the reads, to the reads. Some of the first developed tools used this strategy (e.g., MAQ, RMAP, and SeqMap). However, since nowadays sequencers are capable of producing huge amounts of reads in a single run, up to one billion reads, hashing the reads is now too memory demanding. As a consequence, current mapping tools now hash the genome.

2.1.2. Improvements

A major drawback of the seed and extend algorithm is that the reads need to be split into smaller and smaller substrings when the number of allowed errors increases. However, small substrings (say, of four nucleotides) are not effective during the seed phase, since they can match many regions of the genome. For this reason, seeds of less than 10 nucleotides are seldom used. This implies that hashing-based algorithms cannot map reads with many errors.

To circumvent this problem, it is possible to use spaced seeds (as it is implemented in ZOOM [Lin et al., 2008], BFAST, GASSST, SHRiMP2, or PerM, for instance), i.e., seeds with so-called “don't care” positions. A “don't care” position, “x”, is a position in the read where the algorithm does not check the type of nucleotide present. For instance, the simple spaced seed ACGxACG is able to match ACGAACG, ACGCACG, etc. It is clear that using a set of spaced seeds instead of a single regular seed increases the sensitivity of the method, although it has also the side-effect of increasing the computation time. For instance, in Figure 2B, we can observe that each spaced seed yields more hits than normal seeds.

In fact, most of the time spent by a seed and extend algorithm is generally spent in the “extend” part (this part solves the problem: “Given a seed, do we have a match around it?”). This part usually uses some form of the dynamic programming algorithm (e.g., Needleman & Wunsch if insertions and deletions are considered) to do so. Most tools use widely known optimizations but some, like SHRiMP, uses also special CPU commands to parallelize the job.

Other methods, like GASSST, add an intermediate step in the pipeline, and use a seed, filter and extend algorithm. Each time GASSST finds a seed, it quickly compares the neighboring region with the rest of the read, using an algorithm which is much faster than the dynamic programming algorithm. The filter consists in computing the so-called Euler distance. This distance just computes the number of letters of each type in the compared regions. For instance, if one tries to map a read containing 3 G with a region that only contains 1 G, there will be at least 2 errors in the alignment. Regions that fail this filter are thus quickly discarded.

Authors of PerM observed that, in current mapping tools, several seeds are usually used to achieve good specificity. As a consequence, several hashing structures are used, which increases the required space. PerM simply uses one atomic small gapped seed which is repeated to cover the reads. For instance, for a 36-nucleotide long read, four seeds 1111xx1xx of size 9 are used (where x is a “don't care” sign). Every atomic seed is used to map parts of the reads and promising hits are then investigated for full mapping. Since PerM uses only one hashing structure, it can use seeds with higher weights (i.e., with less don't care) for the same specificity, thus saving time.

2.2. Burrows–Wheeler transform based algorithms

2.2.1. Suffix trees

Hashing based algorithms give poor results when the reads fall into repeated regions, since many seeds should be checked. Some clever data structures have been recently used to solve this problem: the suffix trees and the suffix arrays. Remember that a suffix of a string is a substring that starts at any position and whose end coincides with that of the string. For instance, TTACA is a suffix of GATTACA, whereas TTAC is not. A suffix tree is a tree in which there is a one-to-one correspondance between the paths from the root to the leaves and the suffixes existing in a string, in other words all the suffixes of the string exist as a path joining the root to a leave in the tree (Fig. 3). Notice that some space is saved since all the suffixes are not explicitely written. Actually, current algorithms build suffix trees whose size is proportional to the size of the genome, in a time proportional to the size of the genome, too.

FIG. 3.
The suffix tree of the genome GATTACA. Dotted arrows indicate that the tree continues there. Double circle indicate that a suffix ends there.

Now, with a suffix tree, looking for a read in a genome is very simple, provided no error is allowed. Suppose we look for GAT in the genome. We start from the root and take the branch which is labelled G. We arrive to a new node, and follow the branch A, then at the next node the branch labelled T (Fig. 3). If we can successfully spell the read, then it is present in the genome. Notice, for instance, that we cannot spell the word GAG. The search, if successful, is O(Lr). We will show below how to solve the problem with errors.

Notice that if the word we are looking for is in a repeated region, it will not change the time spent to retrieve it. This structure somehow “collapses” the genome and squeezes the repeated regions into only one path.

The MPscan program uses a slightly modified version of a suffix tree to store the reads (instead of the genome). Using a dedicated algorithm, it scans the whole genome and checks if one of the reads stored in the suffix tree matches some part of the genome. While the method is very fast and uses no heuristics that may give rise to false positives, it only supports exact alignment.

2.2.2. Suffix arrays

Despite all the memory optimizations, it seems that suffix trees are still too big to fit in RAM for large genomes such as the human genome. Suffix arrays have thus been proposed to overcome this difficulty. Even though using a suffix array is a bit more tricky than using a suffix tree, its definition is much simpler: a suffix array is the set of suffixes of the genome sorted lexicographically (Fig. 4). Of course, if we had to store all the suffixes, it would be untractable. So, a first trick is to store the position of the beginning of the suffixes. The rationale is that the whole suffix can be recovered at once using its start position and the genome. Finally, a suffix array is a mere list of numbers: the start positions of the lexicographically sorted suffixes.

FIG. 4.
Suffix array of the genome GATTACA. Last column is the Burrows-Wheeler transform.

Although this may not be immediately apparent, there is a strong connection between suffix trees and suffix arrays. For instance, let us consider the first column of the suffix array. It consists of the set of letters labeling the edges that connect the root of the suffix tree to the first level of nodes. Similarly, the first two-letter words of the suffix array correspond to the paths from the root to the second level nodes, and so on. This suggests that it is possible to traverse a suffix array in a way similar to what is done with a suffix tree.

There are several ways to do so. The most popular solution is to use the Burrows–Wheeler transform. This transform has been described for text compression but it has many interesting properties in bioinformatics.

In this transform, it is useful to consider the suffixes as being written on cylinders, meaning that the first letter of the genome comes after the last letter of the suffix, and so on (Fig. 4). In other words, we write down all the circular permutations of the genome string. To mark the end of the genome, we append a sigil $ to it. By convention, the sigil is the last character in the lexicographic order. The Burrows-Wheeler transform of a suffix array is simply the last column of this cylinder suffix array. In fact, a Burrows-Wheeler transform merely is a list of characters (shown in the last column of Fig. 4).

Notice that suffix arrays, together with their Burrows-Wheeler transforms, are very compact. It is just a set of two vectors: a vector of numbers (the suffix array) and a vector of characters (the Burrows-Wheeler transform). The size of the arrays are equal to the size of the genome.

It is possible to retrieve the whole suffix array from the Burrows-Wheeler transform only (Fig. 5). For this, we begin by sorting lexicographically the list of characters (see column 2). We can observe that this corresponds exactly to the first column of the cylinder suffix array (since suffixes are sorted lexicographically). We now have the first and the last columns of the cylinder suffix array (the last column being the Burrows-Wheeler transform). Since we work on a cylinder, we know that the last column is also the one which is right before the first column. So, when we concatenate the two columns, we have the set of all two-letter words of the genome (column 3). Again, if we sort them lexicographically (column 4), they form the first two columns of the cylinder suffix array. If we prepend the Burrows–Wheeler transform to this set, and we sort it, we have the three first columns of the cylinder suffix array (column 5). Proceeding likewise, the whole suffix array is retrieved. The original genome corresponds to the line ending with the sigil.

FIG. 5.
Reconstructing the suffix array.

2.2.3. Finding a word with the Burrows-Wheeler algorithm

The algorithm previously described is not used to find the mappings of a read, it builds iteratively all the two-letter words of the genome, then three-letter words, etc. This is useless and time consuming for our problem. What we actually need is a way to traverse the array, just as we did for the suffix tree: letter by letter. Restricting the search to a specific read has been proposed in the frame of the Ferragina–Manzini algorithm (Ferragina and Manzini, 2000), and it is actually much more complicated than the algorithm previously described. First, it proceeds from right to left, meaning that if we look for the read GAT in the genome GATTACA, we will first look for the T, then the A, and finally the G. Second, it mainly uses two data structures: the transform itself and the sorted transform (which is the first column of the array, and can be computed from the transform by sorting it).

We first look for all the occurrences of T in the sorted transform (this search can be done in constant time using an additional data structure). There are two Ts, at rows 6 and 7 (Fig. 6a). The corresponding characters in the transform are the letters T and A respectively (Fig. 6b). This means that the only letters which are before a T in the genome are a A and a T. Here, we are interested in the A (Fig. 6c). We now would like to know what are the letters which are before this AT. Here comes a magic trick, which will be explained later. The A of the row 7, that we have selected, is the second A when we read the Burrows–Wheeler transform from top to bottom. So we jump to the second A of the sorted column (again, this can be done in constant time with yet some additional data structures). It corresponds to the A at row 2 (Fig. 6d). The search recursively resumes here. The letter before the AT is found in row 2 of the transform. It is a G, which shows that the read GAT exists in the genome (Fig. 6e).

FIG. 6.
Looking for the read GAT.

Suffix arrays behave equally well as suffix trees when handling repetitions. Since the array is sorted, if a word is repeated twice, the occurrences will appear in two consecutive rows. As a consequence, the algorithm actually works with intervals of rows. As in Figure 6a, b, the search algorithm considers as putative match all the solutions which are between rows 6 and 7.

An explanation of the magic trick used above is the following. The whole problem is to look for the letters which are before a given word in the genome. If we want to know which letters are before a T, we can look for the Ts in the sorted transform (rows 6 and 7) and get the corresponding characters in the transform (T and A). By construction, these letters are right before the Ts. To proceed, we need to know which letters are before AT. Suppose we are able to generate the sorted list of 2-letter words of the genome (see column 4 of Fig. 5). We then can use the Burrows–Wheeler transform to find which nucleotides are before AT (here, it is row 2). But since we do not have the second column, but only the first one, all that we know is that we should focus on the rows such that the sorted transform is an A, namely rows 1, 2 or 3 (Fig. 7a). The problem is to find to which row AT corresponds to.

FIG. 7.
The magic trick explained.

In the transform, these As are located at rows 4, 7, and 8 (Fig. 7b). Using the sorted transform, we can see that they correspond to the two letter words AC, AT, and A$ (Fig. 7c). Notice that these words are sorted according to the second letter, when we read from top to bottom. This is a direct consequence of the fact that the sorted transform is the first column of the sorted suffix array, and that the Burrows–Wheeler is the set of nucleotides that are before this column. This is why AT is after AC and A$ is after AT.

Now, we know that AT is the second 2-letter word starting with a A. It is easy to find it in the sorted transform, since it is also sorted: it is simply the second A, read from top to bottom. This is row 2 (Fig. 7d). We can finally retrieve the G which is before this A, and the word GAT is found.

2.2.4. Search with errors

While the Ferragina–Manzini algorithm permits to retrieve efficiently a read in a genome, the algorithm described above does not allow for errors. Adapting the Burrows–Wheeler algorithm to a search with errors is currently an active field of research. SHRiMP, for instance, when no exact match is found, restarts the search and accepts errors, randomly. Suppose that the read GCT is to be found in GATTACA. The letter T is matched first (remember that the reads are searched for T from right to left), then the algorithm finds that the letters before T in the genome are either T or A, but not C. SHRiMP randomly picks up one of these two letters and records one error. This procedure is carried out several times, and either a match is found, or the number of allowed errors is exceeded and the search is brought to a close. Of course, this algorithm may not provide the best solution. In recent developments, BWA proposed an interesting adaptation to suffix arrays of the well known Smith–Waterman algorithm. However, it seems that the algorithm is too slow when too many errors are allowed and BWA first aligns the leftmost and the rightmost parts of the read with a fixed small number of errors, hoping that at least one of these regions contains few errors (which is grounded by the fact that the 5′ ends of the reads usually have a better quality).

2.3. Remarks

Two categories of methods exist to map reads on genomes: those based on hashing or the ones using Burrows–Wheeler transform. Although it is difficult to foresee what will be the trend in the future, it seems that most recent tools are based on the Burrows–Wheeler transform. Each method has strong and weak points. Burrows–Wheeler-based methods are fast even when a read has many possible locations in the genome because the latter is somehow “collapsed” and repeated regions are scanned only once. However, there is still no established algorithm to handle errors, and for the existing heuristics, the computation time does not scale well when the number of errors increases. On the other hand, hashing methods can handle errors, as long as they are not spread uniformly along the read (otherwise no seed may be found). However, seeds of regions which are highly repeated are not specific and mapping may be slow for repeated sequences. Maybe a good option is, as mentionned by the authors of Stampy, to use both methods one after the other to avoid the problems of each single method.

Another interesting trend involves the quality of the mapping. What is the probability that a read comes from a predicted region? Given the number of mappings for a read, the number of mismatches and the sequencing quality, some tools like MAQ, BWA, and Stampy try to infer a probability of correctness of the mapping. As a consequence, the user may want to control the probability of the reads to be misaligned instead of the maximum number of errors, which may make more sense.

Last but not least, the way the algorithm are implemented has a great impact on the speed of the mapping tools. For instance, reading or writing to a hard disk is extremely slow and should be avoided for the frequent operations. As a consequence, most tools use several strategies to accelerate the search. They include: parallelization of the process (like bit–wise operations to compare two strings), caching results (such as Smith–Waterman results, should the same comparison be computed once again afterwards), and scaling data structure so that they can fit on RAM (by removing data which can be computed on the fly). Undoubtedly, the successful modern mapping tool designers consider equally important both the algorithm and the implementation details to deliver a time effective tool.

3. Benchmark Description

3.1. Datasets

The mapping tools are evaluated by two similar experiments. The first experiment (named equation M1 in the following) is run on the human genome (25 chromosomes for 2.7 Gbp). The second experiment (named equation M2) is run on bacterial genomes (904 genomic sequences for 1.7 Gbp).

In the experiment run on the human genome (equation M3), the reference genome equation M4 is taken from the assembly 37.1 made by the NCBI. We have built two sets of reads, all of length 40. The first set of reads (equation M5) is composed of 10 millions reads drawn uniformly from the reference genome equation M6. The drawing is done with wgsim.1 Human chromosomes sometime contain a large proportion, as high as 30%, of the letter N. Mapping reads with long runs of N is of little information to assess the efficiency of mapping tools because these reads should map in numerous locations. We thus decided, beforehand, to remove runs longer than 10 Ns from the reference genome.2 The majority of the reads (8,877,107) from equation M7 occurs only once,3 but some reads can be repeated many times along the reference genome as shown on Figure 8. For reads occurring more than once, the mean number of occurrences is 722.81 with a standard deviation of 2424.86. Moreover, the most frequent read occurs 53,162 times. The second set of reads (equation M8) is built from equation M9 by adding exactly three mismatches to each read. Therefore equation M10 contains also 10 millions reads. We are aware that modern sequencers are very unlikely to produce reads with such an error rate, but the rationale of this dataset is that many projects now produce resequencing and metagenomics data, which may diverge greatly from already sequenced genomes. The positions for the three mismatches are drawn uniformly within the 40 positions.4 A nucleotide A, C, G, or T is mutated to any of the three other nucleotide with equal probability 1/3, whereas an N is mutated into A, C, G, or T with probability 1/4. Among the 10 millions reads from equation M11 and equation M12, only 49 reads contain some Ns; the number of Ns per read is given in Table 1.

FIG. 8.
Histogram of the logarithm of the number of occurrences of the 1,122,893 (respectively 2,620,394) reads from equation M13 (resp. equation M14) occurring more than once in the reference genome (left) (resp. (right)).
Table 1.
Number of reads with a given number of Ns, from each of the four data set equation M15, equation M16, equation M17, equation M18

In the second experiment (equation M23), the reference genome equation M24 consists of 904 bacterial genomes found in Genome Reviews release 111.0 (Kersey et al., 2005). We also built two sets of 40-bps reads. The first set of reads (equation M25) is composed of 10 millions reads drawn uniformly from equation M26.5 There are 7,379,606 reads with a unique occurrence and 2,620,394 reads occurring more than once (mean 8.82 and SD 39.03). The most frequent read from equation M27 occurs 1685 times. The second set of reads (equation M28) is built from equation M29 by adding uniformly three mismatches in each read as described for equation M30. 231 reads from equation M31 and 219 reads from equation M32 contain some Ns (Table 1).

3.2. Mapping tools

We evaluated the performance of the following nine mapping tools: BWA_v0.5.8, Novoalign_v2.06.09, Bowtie_v0.12.7, SOAP_v2.20, BFAST_v0.6.5a, SSAHA2_v2.5.2, MPscan, GASSST_v1.28, and PerM_v0.3.9.

Table 2 gathers global characteristics of the tools, namely the type of algorithms they are based on, their output format, their ability to allow mismatches and/or indels in the alignments, and if they can use multiple threads.

Additional information is now given separately for each tools, in particular on the way to perform the comparisons. For each tool, our aim is indeed to retrieve all the alignments (so-called hits hereafter), either with no mismatch or with at most 3 mismatches, of our read datasets (see Section 3.1).

3.2.1. BWA

Running BWA consists in using successively three commands: the first one (bwa index) indexes the reference genome, the second one (bwa aln) finds the coordinates of the hits of each individual read in the suffix array, and the last one (bwa samse) converts the suffix array coordinates to reference genome coordinates and generate the alignments in the SAM format. By default, a non exhaustive search is done in the second step to reduce the computation time; we then used the −N option to disable this behavior and to search for all possible hits. Using this option or not has a dramatic effect on the results when mismatches are allowed, as we will see in the next section. It is possible to set the maximum number of mismatches per hit (option −n in the second step) and also per seed (option −k); we used the same value for both parameters. Moreover, one can specify the maximum number of hits to output (option −n in the third step). If a read has more hits in the reference, outputted hits are randomly chosen. The only way to get all the hits per read is to set the maximum number of hits to output to a value larger than the maximum number of occurrences of the reads in each read set. We then took the bounds 54,000 for equation M33 and equation M34,6 and 2,000 for equation M35 and equation M367 BWA randomly changes Ns in the reference genome to regular nucleotides.

3.2.2. Novoalign

Running Novoalign consists in running two successive comands: the first one (novoindex) indexes the reference genome and the second one (novoalign) aligns the reads against the indexed reference. Novoalign (at least in its academic version) does not allow the user to set the maximum (or exact) number of mismatches between the read and the reference genome. We then post-processed the results to retrieved exact matches (equation M37 and equation M38) or matches with at most three mismatches (equation M39 and equation M40). For reads with multiple hits, it is possible to report all hits (option −r A) or at most a fixed number of randomly chosen hits.

3.2.3. Bowtie

Running Bowtie consists in using successively two commands, bowtie-build which indexes the reference genome and bowtie which takes an index and a set of reads as input and outputs a list of alignments. Bowtie allows the user to set the maximum number of mismatches per hit (option −v). By default, Bowtie returns only one hit per reads; if one wants to retrieve more, or all, hits per read, one needs to specify a maximum number of hits to report (option −k). As for BWA, this maximum number should be set to the maximum number of occurrences of the read set8 to retrieve all the hits. Alignments involving one or more ambiguous characters, such as Ns, in the reference are considered invalid by Bowtie, whereas they account for mismatches if they belong to the reads.

3.2.4. SOAP2

Running SOAP2 consists in using successively two commands: the first one (2bwt-builder) creates the Burrows-Wheeler index of the reference genome, and the second one (soap) performs the alignments. SOAP2 allows the user to set the maximum number of mismatches per hit (option −v) but this maximum number is limited to 2. SOAP2 outputs systematically all the hits (no limitation is allowed). Unmapped reads can be obtained in a FASTA file. SOAP2 seems to replace all the Ns in the reads by a G.

3.2.5. BFAST

Running BFAST requires five steps: (1) the reference genome is first rewritten in a special format (bfast fasta2brg), (2) bfast index indexes the reference genome by using the spaced seeds set by the user (this step should be done with several seeds, leading then to several indexes; we used the 10 seeds proposed in Homer et al., [2009]), (3) then the bfast match command takes a set of reads and searches a set of indexes to find candidate alignment locations (or CALs) for each read, (4) the bfast localalign command takes the CALs for each read and performs a local alignment to the reference, and (5) finally an output file is created (bfast postprocess). As for Novoalign, the user cannot set the maximum (or exact) number of mismatches, so we post-processed the outputted hits. BFAST can output all the hits (option −a).

3.2.6. SSAHA2

Running SSAHA2 consists in two steps: indexing the reference genome (command ssaha2Build) and mapping the reads (ssaha2). It is possible to specify the number of mismatchs allowed, or equivalently the percentage of identity (option −identity). The number of reported hits per read is limited to 500 and cannot be changed. We asked for the “best” (Smith-Waterman score) mapping for each read (−best 1), which seems appropriate for exact mapping, but probably not for equation M41 and equation M42 (actually we also used −best 0 in the mismatches case).

3.2.7. MPscan

To run MPscan, there is only one command (mpscan) but it must be used twice, one for mapping on the direct strand and the second one for the reverse strand (options −rev −ac). No mismatch is allowed in the alignments and all alignments are reported in the output file (not in SAM format).

3.2.8. GASSST

Indexing and mapping steps are performed by issuing the Gassst command. It is possible to specify the number of mismatchs allowed, or equivalently the percentage of identity (option −p). To retrieve exhaustively all the hits for each read, we disabled the filtering process used by default to reduce the computation time (option −l 0) and we set the sensitivity to its maximal value (option −s 5). Alignments involving ambiguous characters necessarily account for mismatches. GASSST reports the alignments in a specific format; the output file can be converted in SAM format by using the command gassst_to_sam which appeared to be quite time consuming.

3.2.9. PerM

Indexing and mapping steps are done by running the perm command. It is possible to set the maximum number of mismatches per hit (option −s) and to specify a maximum number of hits to find (option −k). To report all the hits, we set the previous option to the highest number of occurrences found in our read sets (as for Bowtie and BWA) and we also activated the “all” option −A. Finally, since some of the reads contain some Ns (Table 1), we used option ––includeReadsWN followed by 40, 37, or 10 depending on the read set.

4. Results

We present here the results obtained when mapping the human datasets equation M43 and equation M44 to the human reference genome equation M45. Results for the bacterial datasets are presented in appendix because the tools' results are globally similar to those obtained on the human datasets and genome even though their performances are slightly worse.

4.1. Exact mapping from equation M46 reads

We firstly performed an exact mapping (i.e., no mismatch is allowed) of the read set equation M47 onto the human genome with each of the nine mapping tools presented in Section 3.2. Note that, by construction of the read set equation M48, all reads from equation M49 do exactly occur, at least once, in the reference genome. All the tools should then map all the reads with no mismatch, or at least the reads with no Ns because alignments involving some Ns account for mismatches for many tools.

4.1.1. Computation time and memory usage

Computation times have been obtained by running the mapping tools in single-thread mode, on the same computer9 and without any competition. However, as we can see in Table 3, SOAP2 and GASSST require more than 50 Gb of memory, so they have been run on a more powerful computer10 but without avoiding competition with other jobs (computation times could then be slightly over-estimated for both tools). For tools which are run through several command lines, we have reported the maximum of the memory required by each step. The most memory consuming step for Bowtie and BFAST is the indexing step, whereas it is the mapping step for SOAP2. Indexing times and mapping times have been reported separately whenever these steps corresponded to distinct command lines (Table 3). The general trend is as follows: BWA, Bowtie, SOAP2, MPscan, and GASSST take several hours to map the 10 millions reads (MPscan is the fastest tool), Novoalign and PerM use half a day, and one day or more is required for SSAHA2 and BFAST.

Table 3.
Global characteristics of the run of each software on the exact human dataset (equation M50)

4.1.2. Unmapped reads

None of the nine mapping tools we have analyzed maps all the reads (Table 3) but for some tools, it is just a question of Ns. Indeed, BWA, Bowtie, SOAP2, and GASSST correctly map all the reads except the 49 reads with some Ns. MPscan only fails to map 26 reads having some Ns and occurring only on the reverse strand. The other tools do not map some of the regular reads. Novoalign could not map 632 reads but succeeded in mapping 22 reads with Ns at their original11 position. More dramatically, SSAHA2, PerM and BFAST do not map many reads (35,875, 115,871, and 726,332, respectively). BFAST did not map any of the 49 reads with Ns, whereas SSAHA2 mapped 36 reads with Ns but, most of the time, not at their original location. Despite the option −k 54000, PerM seems to discard reads with more than 2000 occurrences: the highest number of hits reported by PerM is 1999 and there are exactly 115,822 reads in equation M51 which occur 2000 or more times in equation M52.

4.1.3. Is the original position retrieved?

For all the mapped reads, we have checked if their original position, (i.e., the one randomly drawn to generate the read) belongs to the complete list of returned hits. Indeed, we did not impose constraints on the number of hits per reads. BWA, Novoalign, Bowtie, and MPscan retrieve the original position of all their mapped reads (Table 3). PerM misses the original position for only four reads and GASSST for 54 reads. However, SOAP2, BFAST and SSAHA2 fail for many reads; we will point out some possible explanations in the next paragraph.

4.1.4. Unique and multiple reads

We can now get in more details and distinguish between unique and non unique reads. By unique read we mean that the read has a unique occurrence into the reference genome wheareas a non unique read will have more than one occurrence (repeats). This distinction is important when one knows that many post-analyses only consider reads mapping at a unique location. We know that equation M53 is composed of 8,877,107 unique reads (including 46 reads with Ns) and 1,122,893 multiple reads; the later have on average 722.81 occurrences. These reference values are indicated at the bottom of Table 4 (“Reference” row).

Table 4.
Mapping results for each software run on the exact human dataset (equation M54)

We can see in Table 4 that BWA, Bowtie, SOAP2, and GASSST report 8,877,061 reads as unique and at their original position which is correct if one discards the 46 unique reads with Ns. MPscan also correctly reports the unique reads. Novoalign reports the correct number of unique reads, but this is a coincidence: indeed, Novoalign only maps 22 reads with Ns, meaning that some uniquely reported reads are in fact multiple (some hits have been missed). PerM seems to be quite reliable regarding the uniquely reported reads even if it misses some occurrences for three of them. SSAHA2 reports more unique reads than in reality, meaning that it misses some hits for some of these reads; This explains also why 9,847 of these reads are not mapped at their original location.

Regarding the non unique reads, BWA, Bowtie and MPscan seem to provide the correct numbers of hits (722.81 hits on average with a standard deviation of 2424). GASSST, Novoalign and SOAP2 are also quite good. Note that SSAHA2 returns much less hits per read (around 80 on average), leading to original positions not retrieved, but this could be explained by the fact that SSAHA2 limits the number of hits per read to 500. This phenomenon is more drastic for BFAST which only returns three hits on average per read. It looks like these tools do not explore all the possible occurrences leading to many unmapped reads, to many reads not mapped at their original location and to many reads wrongly reported as unique. The case is different for perM: its small mean number of hits (126) per mapped read can be explained by the fact that PerM does not map very frequent reads (more than 2000 occurrences).

4.2. Mapping reads from equation M55 with three mismatches

We then performed a mapping, allowing up to three mismatches, of the read set equation M56 onto the human genome. Since SOAP2 and MPscan do not allow three mismatches (SOAP2 is limited to two mismatches, whereas MPscan only performs an exact mapping), we only used the seven other tools. By construction of the read set equation M57, all the reads do occur at least once with exactly three mismatches, meaning that (i) all reads should be mapped, (ii) each read is expected to have more hits than in the previous experiment without mismatch (Section 4.1), which leads to (iii) less reads uniquely retrieved.

4.2.1. Computation time and memory usage

The amount of memory required by each tool is similar to the one required in the previous experiment for Novoalign, Bowtie, BFAST, SSAHA2, and PerM, it increases for BWA (due to the mapping step) while it decreases for GASSST. GASSST still needs more than 16 Gb of memory, so it has been run on a different computer (see Section 4.1.1).

Since the indexes are built on the reference genome, indexing times are identical to the previous experiment, only mapping times may change due to allowed mismatches. The running times of BFAST and Perm decrease, respectively, by 30% and 5%. The mapping times of other methods increase globally by a factor 4, except BWA which becomes 10 times slower (Table 5). To our surprise, BFAST is significantly faster when run on the equation M58 instance. The reason for this behavior is not clear to us.

Table 5.
Global characteristics of the run of each software on the 3 mismatches human dataset (equation M59)

4.2.2. Unmapped reads and reads mapped at their original position

Again, none of the tools maps all the reads even when three mismatches are allowed (Table 6). The read set equation M60 still contains 49 reads with Ns (Table 1). Bowtie and BWA12 map all the reads with no Ns and their original position is always retrieved. Novoalign maps more reads (47 unmapped reads) but globally the returned list of hits is incomplete for many reads (666,456 reads). SSAHA2 maps a reasonable number of reads but half of them are not mapped at their original position indicating a very partial list of reported hits; this is not really due to the −best option set to 1 because we still obtain 4,325,039 reads not mapped at their orginal position when −best is set at 0. BFAST, PerM and GASSST do not map many reads but the diagnostic is different: PerM does not map frequent reads (2000 occurrences or more) but maps the other reads at their original location, whereas GASSST and BFAST seem to produce incomplete lists of hits. PerM seems then to be the most reliable tool after Bowtie and BWA.

Table 6.
Mapping results for each software run on the 3 mismatches human dataset (equation M61)

4.2.3. Unique and multiple reads

The mean number of hits per read, for non unique reads, should be greater than 722.81 (the value for the exact mapping case) because one knows that all exact hits of equation M62 are hits with three mismatches of equation M63. Unlike the results obtained for the exact mapping case, Novoalign retrieves much less hits than expected (15 on average) indicating that many hits are missed. BFAST and SSAHA213 report also very few hits per read leading to a high number of reads not mapped at their original location. Bowtie, BWA, and GASSST provides a mean number of hits greater than 1100 which is more satisfactory. As already mentionned, PerM discards reads occurring 2000 or more times, which leads to less reads with multiple hits and to a smaller mean number of hits per mapped reads.

Regarding the reads uniquely retrieved by the different tools, they are less frequent than in the exact mapping case, as expected. However, Novoalign (which performed well in the exact case) returns too many unique hits, many of them at a position different from the orginal one. Unfortunately, this behavior could have a very negative impact on further analyses that only consider uniquely mapped reads. Concerning this point, BFAST is not good, SSAHA214 is catastrophic, Bowtie, BWA, and PerM are excellent, and GASSST provides intermediate results.

5. Conclusion

We performed a quantitative comparison of nine mapping tools on well-controlled benchmarks built for this purpose. We designed a first setting in which all reads do occur exactly in the reference, in one or several copies, and a second setting in which all reads occur with three or fewer mismatches. The reads were uniformly drawn from the reference genome and the original position of these reads was expected to belong to the list of hits returned by each mapping tool. A special attention has been given to reads uniquely reported because many further analyses only consider reads with a single hit.

Regarding the mapping of reads with no mismatch, we can discern three groups. BWA, Bowtie, MPscan, and GASSST belong to the first group. Programs in this group map all the reads without Ns and the original position is returned in the list (with few exceptions for GASSST). MPscan is faster than BWA, Bowtie, and GASSST, but the latter programs can be threaded. The second group contains PerM, Novoalign, and SOAP2. The first two programs do not align many reads but they retrieve the original position of the mapped reads. SOAP2 maps all the reads without Ns, but the original position of the reads, in case of multiple copies in the genome, is not always returned. The last group consists of BFAST and SSAHA2. They fail to align a large number of reads and they do not return the original position for a number of reads (the list of returned reads is incomplete). As a consequence, reads reported uniquely by these tools are not necessarily unique and could map elsewhere.

Regarding the mapping of reads with three mismatches, we can cluster the programs in four groups. The first group consists of Bowtie and BWA. These two programs map all the reads without Ns and, for the human dataset, return their original position, even for unique reads. However, we note that both tools miss the original position of a few reads for the bacterial dataset, with a slight advantage for Bowtie. For comparable results, BWA is about 40% slower than Bowtie. The next group contains only perM that, although it does not map a large number of reads, usually finds the original position of the reads. The third group consists of GASSST and BFAST that do not map a large number of reads and that also do not return the original position of many reads (BFAST being much worst than GASSST in this respect). The last group consists of Novoalign and SSAHA2 that align most reads without Ns but that miss the original position of many reads (SSAHA2 being much worst than Novoalign).

It should be emphasized that the good results of BWA and, to a lesser extent, GASSST in our study rely on the fact that we did not use the default parameters, generally set to favor a rapid but incomplete search. We indeed forced both tools to perform an exhaustive search (options −N −k 3 for BWA and options −l 0 −s 5 for GASSST) despite a higher computation time. This trade-off between sensitivity and speed remains one of the difficulties for tuning the numerous parameters of such tools. In this respect, Bowtie appears to be the tool that is the most finely tuned to keep an excellent sensivity and a reasonable execution time with the default parameters.

Reads with Ns are discarded by most of the tools unless the number of allowed mismatches is at least equal to the number of Ns. Although this may be a concern, reads with Ns usually have a poor quality and most would be unalignable anyway.

We reported the computation times using a single thread in order to compare the different tools in similar settings. Of course, several tools can use multi-threads and this advantage should be taken into account for real usage. Moreover, some tools, like GASSST and Bowtie, allow to reduce the computation time by decreasing the sensitivity, ie. by reporting only a subset of hits.

In this work, we did not consider indels mainly because only four tools can cope with indels (BWA, Novoalign, BFAST, and GASSST), but clearly we will now investigate this aspect. Finally, we left the evaluation of pair-end or mate-pair sequencing data—a work worth a paper per se—as future direction for another study.

6. Availability

The four sets of reads (equation M64, equation M65, equation M66, and equation M67), together with both reference genomes (equation M68 and equation M69), are available at http://genome.jouy.inra.fr/ngs/mapping/. This website also describes the commands and all the options that have been used to run the nine mapping tools producing the results presented in this article. Our goal is to further enrich this benchmark, by adding both tools and new read sets (typically longer reads, reads with indels, and paired-end reads). Additional results will then be posted on this website.

7. Appendix

Here are the results when mapping the read sets equation M70 and equation M71 on the bacterial genome reference equation M72. We did not report the computation times but they are smaller than in the human dataset because the genome reference is shorter and contains less repeats.

A.1. Results for equation M73

Globally on this bacterial dataset, all mapping tools give very similar and good results (Table 7), except BFAST which fails to map 300,194 reads and returns on average few hits per reads. BWA, Bowtie and GASSST mapped all the reads without any Ns but BWA missed the original position for 36 mapped reads; GASSST could not retrieve the original position for only one read, whereas Bowtie succeeded for all the mapped reads. Novoalign, SOAP2, SSAHA2, MPscan and PerM map more reads (even some with Ns) but a few reads are not retrieved at the original position with SOAP2, PerM and MPscan. In this matter, SSAHA2 is the worst program for this dataset. Compared to the human dataset, BWA, SOAP2 and MPscan display here some minor difficulties. Finally, Novoalign seems to give the best results, but the 67 unmapped reads do not contain any Ns.

Table 7.
Mapping results for each program run on the exact bacterial dataset (equation M74)

A.2. Results for equation M75

As for the human dataset equation M76, SSAHA2 shows very bad results: whatever the −best option, it misses the original position of about 3.7 millions reads (Table 8). BFAST fares only slightly better: it does not map 172,360 reads and among the mapped reads, 134,281 are not retrieved at their original location. Novoalign does better by failing to map only 89 reads, but the number of reads not retrieved at their original position is still very high (20,154). GASSST produces intermediate results: it fails to map a very large number of reads (363,984) but, comparatively to the above mentioned tools, the number of reads for which it cannot retrieve the original position is smaller (8774). Finally, Bowtie, BWA and PerM perform very well: respectively only 214, 212, and 185 reads are not mapped (note that equation M77 contains 219 reads with Ns) but these tools miss the original position of several reads (17 for Bowtie, 68 for BWA, and 36 for PerM). Contrarily to our expectation based on the test with the human dataset, the best tools failed to retrieve all the hits of all the reads, even those with no N.

Table 8.
Mapping results for each software run on the 3 mismatches bacterial dataset (equation M78)

Footnotes

1Available from https://github.com/lh3/wgsim or SAMtools and developed by Heng Li.

2One run of 40 Ns is nevertheless kept in the experiment and this read is included in the equation M79 set.

3The true number of occurrences of each read from equation M80 have been computed thanks to a dedicated naive algorithm.

4We want each read to contain exactly three mimatches, hence a mismatch position cannot be drawn twice. Uniform drawing means that the first position is drawn uniformly within the 40 positions, the second position is drawn uniformly within the 39 remaining positions, and the third position is drawn uniformly within the 38 remaining positions.

5Unlike human chromosomes there are few Ns in bacterial genomes, so we did not need sequence curation.

6The highest number of hits retrieved by BWA for equation M81 is 30,265, so far less than the 54,000 bound.

7The highest number of hits retrieved by BWA for equation M82 is 1,749, so far less than the 2,000 bound.

8We took the same maximal values than for BWA, except for equation M83 for which we took −k 400,000 because the maximum number of hits retrieved by Bowtie appeared to be 305,424.

9Intel Quad Core 2.33 GHz 16 Gb RAM.

10Four Intel Six Core 2.40 GHz 132 Gb RAM

11The original position of a read is defined as the position randomly drawn to generate the read.

12If we omit the −N option, BWA has catastrophic results because it would only map half of the reads, and 500,000 of them would not be mapped at their original positions.

13With −best 0, the mean number of hits increases to 198.16 but is still low. Moreover, surprinsingly, 9 689 113 reads are reported with multiple hits.

14With −best 0, SSAHA2's results are worst because it reports only 310,674 reads as unique and half of them are not mapped at their original position

Acknowledgments

We thank Pierre Nicolas for hepful comments on these results. We are grateful to the INRA MIGALE bioinformatics platform (http://migale.jouy.inra.fr) for providing computational resources. This work is supported by the French Agence Nationale de la Recherche (project CBME, grant ANR-08-GENM-028-01).

Disclosure Statement

No competing financial interests exist.

References

  • Chen Y. Souaiaia T. Chen T. PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics. 2009;5:2514–2521. [PMC free article] [PubMed]
  • David M. Dzamba M. Lister D., et al. SHRiMP2: sensitive yet practical short read mapping. Bioinformatics. 2011;27:1011–1012. [PubMed]
  • Ferragina P. Manzini G. Opportunistic data structures with applications. Proc. 41st Symp. Found. Comput. Sci. 2000:390–398.
  • Homer N. Merriman B. Nelson SF. BFAST: an alignment tool for large scale genome resequencing. PLoS ONE. 2009;4:e7767. [PMC free article] [PubMed]
  • Jiang H. Wong WH. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics. 2008;24:2395–2396. [PMC free article] [PubMed]
  • Kersey P. Bower L. Morris L., et al. Integr8 and genome reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res. 2005;33:D297–D302. [PMC free article] [PubMed]
  • Langmead B. Trapnell C. Pop M., et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. [PMC free article] [PubMed]
  • Li H. Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. [PMC free article] [PubMed]
  • Li R. Li Y. Kristiansen K., et al. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008;24:713–714. [PubMed]
  • Li R. Yu C. Li Y., et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25:1966–1967. [PubMed]
  • Li H. Ruan J. Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851–1858. [PMC free article] [PubMed]
  • Lin H. Zhang Z. Zhang M.Q., et al. ZOOM! Zillions of oligos mapped. Bioinformatics. 2008;24:2431–2437. [PMC free article] [PubMed]
  • Lunter G. Goodson M. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2010;21:936–939. [PMC free article] [PubMed]
  • Ning Z. Cox AJ. Mullikin J.C. SSAHA: a fast search method for large DNA databases. Genome Res. 2001;11:1725–1729. [PMC free article] [PubMed]
  • Novocraft. 2010. www.novocraft.com. [Feb 25;2012 ]. www.novocraft.com
  • Rivals E. Salmela L. Kiiskinen P., et al. MPscan: fast localisation of multiple reads in genomes. Lect. Notes Comput. Sci. 2009;5724:246–260.
  • Rizk G. Lavenier D. GASSST: global alignment short sequence search tool. Bioinformatics. 2010;26:2534–2540. [PMC free article] [PubMed]
  • Ruffalo M. LaFramboise T. Koyutrk M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics. 2011;27:2790–2796. [PubMed]
  • Weese D. Emde A.-K. Rausch T., et al. RazerS–fast read mapping with sensitivity control. Genome Res. 2009;19:1646–1654. [PMC free article] [PubMed]

Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

  • MedGen
    MedGen
    Related information in MedGen
  • PubMed
    PubMed
    PubMed citations for these articles

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...