Accelerating the neighbor-joining algorithm using the adaptive bucket data structure

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA

Leonid Zaslavsky (Email: zaslavsk/at/ncbi.nlm.nih.gov)
Tatiana A. Tatusova (Email: tatiana/at/ncbi.nlm.nih.gov)

Abstract

The complexity of the neighbor-joining method is determined by the complexity of the search for an optimal pair ("neighbors to join") performed globally at each iteration. Accelerating the neighbor-joining method requires a smarter search for an optimal pair of neighbors, one that avoids re-evaluating all possible pairs of points at each iteration. We developed an acceleration technique for the neighbor-joining method that significantly decreases complexity for important applications without any change in the neighbor-joining method. This technique utilizes the bucket data structure. The pairs of nodes are arranged in buckets according to values of the goal function δij = ui + uj − dij. Buckets are adaptively re-arranged after each neighbor-joining step. While the pairs of nodes in the top bucket are re-evaluated at every iteration, pairs in lower buckets are accessed less often, only when the algorithm determines that the elements of the bucket need to be re-evaluated based on new values of δij. As a result, only a small portion of candidate pairs of nodes is examined at each iteration. The algorithm is cache-efficient, since the bucket data structures are able to exploit locality and adjust to cache properties.

Keywords: neighbor-joining algorithm, bucket data structure, adaptive, cache-efficient

1 Introduction

The neighbor-joining algorithm [1], [2] is one of the most popular distance methods for the creation of phylogenetic trees. It is a greedy agglomerative algorithm that constructs a tree in steps [3].
The algorithm is based on the minimum-evolution criterion for phylogenetic trees. It is well-tested and studied theoretically, provides good results and is statistically consistent under many models of evolution [4], [5], [6], [7], [8]. Several algorithms have been developed as improvements to the classical neighbor-joining method [9], [10]. Since the neighbor-joining method is much more efficient than other algorithms of comparable quality, it is widely used for phylogenetic analysis as the tool of choice for preliminary analysis, with results being verified and refined by maximum likelihood and Bayesian methods [11]. However, the use of the neighbor-joining method within interactive exploratory analysis tools makes it desirable to further accelerate the algorithm for large datasets. This is especially true if bootstrap analysis is performed and multiple trees need to be calculated [12], [3]. Since the O(N^3) complexity of the neighbor-joining method is determined by the number of operations per search step performed globally at each iteration to find an optimal pair ("neighbors to join"), accelerating the neighbor-joining method requires a smarter search methodology that avoids brute-force re-evaluation of all possible pairs of points.

Our interest in accelerating the neighbor-joining method is motivated by our ongoing efforts to develop and improve NCBI interactive analysis web tools, such as the NCBI Influenza Virus Resource [13], [14], where the neighbor-joining method is the default tree method. The goal is to enable bootstrap analysis for meaningful datasets in a timeframe acceptable for interactive web tools. This paper describes an ongoing effort toward this goal.

Accelerating strategies for the neighbor-joining method have been proposed by several authors.
The QuickJoin algorithm [15], [16] uses the quad-tree data structure to accelerate the search for the optimal value of the goal function in the neighbor-joining algorithm, while still constructing the same tree as the original algorithm. The ClearCut algorithm [17], [18] implements the relaxed neighbor-joining approach. The algorithm does not search for a globally optimal pair, but selects a locally optimal pair (i, j) at each step, such that δij = ui + uj − dij is the maximum among all δik and δjk over the other nodes k. In the Fast Neighbor Joining method [19] the goal function is not optimized globally, but is rather optimized over a set called the "visible set". The algorithm is guaranteed to produce the same results as the neighbor-joining method for an additive input.

In this paper we pursue the same goal as [16]: to accelerate the search for a pair of nodes to be joined while constructing the same tree as the classical neighbor-joining algorithm. We arrange the candidate pairs of nodes in buckets according to the value of the NJ goal function (see Figure 1).
2 Methodology

Below we first describe the classical neighbor-joining algorithm and then show how to use the bucket data structure to perform an efficient search for a pair of nodes to be joined.

2.1 The neighbor-joining method

Classical NJ algorithm. At each iteration m = 0, …, N − 2 of the classical neighbor-joining method [2], [3], average distances ui are calculated for the current nodes, and the pair of nodes (i*, j*) that maximizes the goal function δij = ui + uj − dij (3) is found.
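Nodes i* and j* are joined into a new node k*. The branch lengths υi* and υj* are calculated as

$$\upsilon_{i^*} = \tfrac{1}{2}\left(d_{i^*j^*} + u_{i^*} - u_{j^*}\right), \qquad \upsilon_{j^*} = \tfrac{1}{2}\left(d_{i^*j^*} + u_{j^*} - u_{i^*}\right).$$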
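
For reference, one classical neighbor-joining step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard Studier and Keppler formulas [2] (average distances normalized by n − 2 for n current nodes, goal function δij = ui + uj − dij, and the usual update of distances to the new node), and the function name `nj_step` and the nested-dictionary layout of the distance matrix are hypothetical choices made for the example.

```python
from itertools import combinations


def nj_step(d, nodes, new_node):
    """Perform one classical neighbor-joining step (brute-force search).

    d        : dict of dicts, d[i][j] == d[j][i] for all active nodes i != j
    nodes    : list of active node labels (at least three)
    new_node : label given to the node created by the join
    Returns (i, j, v_i, v_j); d and nodes are updated in place.
    """
    n = len(nodes)
    # Average distances u_i = sum_k d_ik / (n - 2) (assumed normalization).
    u = {i: sum(d[i][k] for k in nodes if k != i) / (n - 2) for i in nodes}

    # Global search: evaluate delta_ij = u_i + u_j - d_ij for every pair.
    i, j = max(combinations(nodes, 2),
               key=lambda p: u[p[0]] + u[p[1]] - d[p[0]][p[1]])

    # Branch lengths from the joined nodes to the new node.
    v_i = 0.5 * (d[i][j] + u[i] - u[j])
    v_j = 0.5 * (d[i][j] + u[j] - u[i])

    # Distances from the new node to each remaining node.
    rest = [k for k in nodes if k not in (i, j)]
    d[new_node] = {}
    for k in rest:
        d[new_node][k] = d[k][new_node] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        del d[k][i], d[k][j]
    del d[i], d[j]
    nodes[:] = rest + [new_node]
    return i, j, v_i, v_j
```

The `max` over all pairs is the global search whose cost dominates each iteration; it is this brute-force evaluation that the bucket arrangement is meant to avoid.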
Preserving non-negativity of branch lengths and distances. For the implementations used in our web analysis tools [13], we chose to keep branch lengths and distances non-negative. We modify formulas (4) and (5) as follows. Define
A non-negative analogue of equation (6) is
Recursive formulas

2.2 Upper bound for change in the goal function value for a pair of points over a neighbor-joining step

Estimates for growth of average distances ui

Estimates for growth of δij

From (3), it is easy to see that

Finally, we obtain two growth estimates:
Note. These estimates show that when the number of nodes is large, value

2.3 Construction of buckets and operating them

The arrangement of pairs (i, j) in groups is performed according to the values of the neighbor-joining goal function δij defined by formula (3). Our purpose is to limit evaluation of individual pairs to those that were close to optimal in the previous iteration and may become optimal at the current step.

First, we consider the treatment of pairs with zero or near-zero distances between nodes. Let us introduce a special bucket for pairs (i, j) such that

Let us consider regular pairs. Define the bucket intervals as follows:
Our initial idea was to use intervals with a constant step Δm:
If parameter
At each neighbor-joining iteration, a new collection of N buckets is constructed according to (19). New pairs appearing at the current iteration are placed in the appropriate buckets. The contents of most existing buckets are moved into new buckets without being evaluated. First, the new bucket index k_new is determined by the formula
Values

However, this simple construction (19) would not be efficient if the values

In the adaptive construction, the intervals for the initial step (m = 0) are defined as follows:
In the subsequent iterations (m = 1, 2, 3, …) the buckets are operated as follows:
Data structures

Below we briefly describe data structures for a record, a bucket, and a bucket collection.

Record. For a pair of nodes i and j (i < j), we keep a record consisting of the two indices and the value of the distance between the nodes: R = (i, j, Dij). There is no reason to save the actual value of the goal function δij, since it changes at each algorithm step and cannot be reused. However, keeping the Dij value in the record avoids gathering these values from a large two-dimensional array and makes the algorithm more cache-efficient [20], [21].

Bucket. Each bucket contains a linked list of records. In our initial implementation, we use the C++ STL list class. A standard constant-time splice operation [28] is used to combine linked lists. Records referring to nodes that have already been eliminated are erased using the constant-time erase() function of the C++ STL list. Special memory allocation and reallocation can be used to provide cache-efficient placement of bucket elements [29].

Bucket collection. Each bucket collection contains two arrays: the first contains real numbers in decreasing order and describes the bucket intervals, while the second contains pointers to buckets. In our initial implementation, we use the C++ STL vector class for these arrays. As described above, N buckets are allocated at each neighbor-joining step. In addition to bucket-based data structures we use arrays implemented as C++ STL vector objects for

3 Test results

To evaluate the algorithm, we used the following four data sets containing full-length Influenza A H3N2 hemagglutinin protein coding sequences obtained from the NCBI Influenza Virus Resource [13]:
Figure 2
Figure 3
Figure 4
Figure 5
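
The bucket idea described in Section 2.3 and under "Data structures" can be illustrated with a small Python sketch. It is a simplified stand-in, not the authors' C++ implementation or their adaptive construction: records are grouped with a constant step, each bucket carries an upper bound on δij for its records, and the bound is raised after a join by a growth estimate that is simply passed in (standing in for the estimates of Section 2.2). The names `Bucket`, `build_buckets`, `grow` and `find_best` are hypothetical, and a `deque` plays the role of the STL list.

```python
from collections import deque


class Bucket:
    """A group of candidate pairs sharing one upper bound on delta."""

    def __init__(self):
        self.upper = float("-inf")   # upper bound on delta of every record
        self.records = deque()       # records (i, j, d_ij), delta is not stored

    def add(self, i, j, d_ij, delta):
        self.records.append((i, j, d_ij))
        self.upper = max(self.upper, delta)


def build_buckets(pairs, u, n_buckets):
    """Group records by delta = u_i + u_j - d_ij using a constant step."""
    deltas = [u[i] + u[j] - d for i, j, d in pairs]
    hi, lo = max(deltas), min(deltas)
    step = (hi - lo) / n_buckets or 1.0            # avoid a zero step
    buckets = [Bucket() for _ in range(n_buckets)]
    for (i, j, d), delta in zip(pairs, deltas):
        k = min(int((hi - delta) / step), n_buckets - 1)
        buckets[k].add(i, j, d, delta)
    return buckets


def grow(buckets, bound):
    """After a join, raise every bucket bound by an assumed growth estimate."""
    for b in buckets:
        b.upper += bound


def find_best(buckets, u, alive):
    """Return ((delta, i, j), evaluated) for the live pair with maximal delta."""
    best, evaluated = None, 0
    for b in sorted(buckets, key=lambda b: b.upper, reverse=True):
        if best is not None and best[0] >= b.upper:
            break                                  # remaining buckets cannot win
        # Drop records that refer to nodes eliminated by earlier joins.
        b.records = deque(r for r in b.records
                          if r[0] in alive and r[1] in alive)
        b.upper = float("-inf")
        for i, j, d in b.records:
            delta = u[i] + u[j] - d                # re-evaluate with current u
            evaluated += 1
            b.upper = max(b.upper, delta)          # tighten the bound
            if best is None or delta > best[0]:
                best = (delta, i, j)
    return best, evaluated
```

Because buckets are visited in decreasing order of their bounds, the scan can stop as soon as the best value found is at least the bound of the next bucket, so lower buckets are left untouched until their bound makes them competitive. The result is only correct if the value passed to `grow` really bounds the increase of δij over the step.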
4 Discussion

Accelerating the neighbor-joining method is important for enhancing the performance of online web analysis tools, where users expect to perform initial exploratory analysis of the datasets in real time and perform bootstrapping as fast as possible. In this paper we present an adaptive bucket algorithm able to significantly reduce the number of evaluations in search steps by distributing candidate pairs in buckets and evaluating a small portion of all pairs in each iteration. The proposed construction helps to avoid empty buckets and allows the algorithm to handle values of the neighbor-joining goal function that are distributed non-homogeneously, including cases when outliers are present. The algorithm uses simple data structures that can be further optimized, including for cache efficiency. Our preliminary test results are shown above; we continue optimizing the code and plan to perform comprehensive comparisons. While the proposed algorithm is designed to produce the same results as classical neighbor-joining, the degree of acceleration it provides is determined by the distribution of the values of the neighbor-joining goal function, which, in turn, depends on the structure of the dataset.

5 Acknowledgements

This research was supported by the Intramural Research Program of the NIH, National Library of Medicine. The authors are thankful to David J. Lipman, Alejandro Schaffer, Stacy Ciufo, Vahan Grigoryan and Yuri Kapustin for productive discussions.

References

1. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987;4:406–425.
2. Studier JA, Keppler KJ. A note on the neighbor-joining algorithm of Saitou and Nei. Mol. Biol. Evol. 1988;5:729–731.
3. Felsenstein J. Inferring Phylogenies. Cambridge University Press; 2003.
4. Atteson K. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica. 1999;25:251–278.
5. Tamura K, Nei M, Kumar S. Prospects for inferring very large phylogenies by using the neighbor-joining method. PNAS. 2004;101:11030–11035.
6. Bryant D. On the uniqueness of the selection criterion in neighbor-joining. Journal of Classification. 2005;22.
7. Desper R, Gascuel O. The minimum-evolution distance-based approach to phylogenetic inference. In: Gascuel O, editor. Mathematics of evolution and phylogeny. Oxford University Press; 2005. pp. 1–32.
8. Gascuel O, Steel M. Neighbor-joining revealed. Mol. Biol. Evol. 2006;23:1997–2000.
9. Gascuel O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol. 1997;14:685–695.
10. Bruno WJ, Socci N, Halpern AL. Weighted neighbor-joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol. Evol. 2000;17:189–197.
11. Yang Z. Computational Molecular Evolution. Oxford University Press; 2006.
12. Bryant D. A classification of consensus methods for phylogenies. In: Janowitz M, Lapointe FJ, McMorris F, Mirkin B, Roberts F, editors. BioConsensus. DIMACS, American Mathematical Society; 2003. pp. 163–184.
13. Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, Ostell J, Lipman D. The Influenza Virus Resource at the National Center for Biotechnology Information. Journal of Virology. 2008;82:596–601.
14. Zaslavsky L, Bao Y, Tatusova TA. An adaptive resolution tree visualization of large influenza virus sequence datasets. In: Mandoiu I, Zelikovsky A, editors. Bioinformatics Research and Applications, Proc. of ISBRA 2007. Volume 4463 of Lecture Notes in Bioinformatics. Springer-Verlag; 2007. pp. 192–202.
15. Mailund T, Pedersen CN. Quickjoin – fast neighbor-joining tree reconstruction. Bioinformatics. 2004;20:3261–3262.
16. Mailund T, Brodal GS, Fagerberg R, Pedersen CNS, Phillips D. Recrafting the neighbor-joining method. BMC Bioinformatics. 2006;7:29.
17. Sheneman L, Evans J, Foster JA. Clearcut: fast implementation of relaxed neighbor joining. Bioinformatics. 2006;22:2823–2824.
18. Evans J, Sheneman L, Foster J. Relaxed Neighbor-Joining: A Fast Distance-Based Phylogenetic Tree Construction Method. J. Mol. Evol. 2006;62:785–792.
19. Elias I, Lagergren J. Fast neighbor joining. In: Caires L, et al., editors. ICALP. Volume 3580 of Lect. Notes Comp. Sci. Springer-Verlag; 2005. pp. 1263–1274.
20. LaMarca A, Ladner RE. The influence of caches on the performance of sorting. Journal of Algorithms. 1999;31:66–104.
21. Brodal GS, Fagerberg R, Vinther K. Engineering a cache-oblivious sorting algorithm. Journal of Experimental Algorithmics. 2007;12:2.1.
22. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms. Second Edition. MIT Press and McGraw-Hill; 2001.
23. Dial RB. Algorithm 360: Shortest path forest with topological ordering. Comm. ACM. 1969;12:632–633.
24. Wagner RA. A shortest path algorithm for edge-sparse graphs. J. Assoc. Comput. Mach. 1976;23:50–57.
25. Dinic EA. Economical algorithms for finding shortest path in network. In: Popkov YS, Shmulyan BL, editors. Transportation Modeling Systems, The Institute for System Studies. 1978. pp. 36–44. (In Russian)
26. Denardo EV, Fox BL. Shortest-route methods: 1. reaching, pruning, and buckets. Oper. Res. 1979;27:161–186.
27. Cherkassky BV, Goldberg AV, Silverstein C. Buckets, heaps, lists, and monotone priority queues. SIAM Journal on Computing. 1999;28:1326–1346.
28. Musser DR, Derge GJ, Saini A. STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library. 2nd edn. Addison-Wesley Professional; 2001.
29. Meyers S. Effective STL. Addison-Wesley Professional; 2001.