Copyright Zhang, Townsend. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Maximum-Likelihood Model Averaging To Profile Clustering of Site Types across Discrete Linear Sequences
Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America
Wen-Hsiung Li, Editor
University of Chicago, United States of America
* E-mail: Jeffrey.Townsend/at/Yale.edu
Co-designed and programmed this new method, carried out computer simulations, generated sequence datasets, analyzed the data and wrote the manuscript: ZZ. Designed this new method, supervised the research, and revised the manuscript: JPT.
Received January 12, 2009; Accepted May 21, 2009.
Abstract A major analytical challenge in computational biology is the detection and description of clusters of specified site types, such as polymorphic or substituted sites within DNA or protein sequences. Progress has been stymied by a lack of suitable methods to detect clusters and to estimate the extent of clustering in discrete linear sequences, particularly when there is no a priori specification of cluster size or cluster count. Here we derive and demonstrate a maximum likelihood method of hierarchical clustering. Our method incorporates a tripartite divide-and-conquer strategy that models sequence heterogeneity, delineates clusters, and yields a profile of the level of clustering associated with each site. The clustering model may be evaluated via model selection using the Akaike Information Criterion, the corrected Akaike Information Criterion, and the Bayesian Information Criterion. Furthermore, model averaging using weighted model likelihoods may be applied to incorporate model uncertainty into the profile of heterogeneity across sites.
We evaluated our method by examining its performance on a number of simulated datasets as well as on empirical polymorphism data from diverse natural alleles of the Drosophila alcohol dehydrogenase gene. Our method yielded greater power for the detection of clustered sites across a breadth of parameter ranges, and achieved better accuracy and precision of estimation of clusters, than did the existing empirical cumulative distribution function statistics.
Author Summary The invention and application of high-throughput technologies for DNA sequencing have resulted in an increasing abundance of biological sequence data. DNA or protein sequence data are naturally arranged as discrete linear sequences, and one of the fundamental challenges of analysis of sequence data is the description of how those sequences are arranged. Individual sites may be very sequentially heterogeneous or highly clustered into more homogeneous regions. However, progress in addressing this challenge has been hampered by a lack of suitable methods to accurately identify clustering of similar sites when there is no a priori specification of anticipated cluster size or count. Here, we present an algorithm that addresses this challenge, demonstrate its effectiveness with simulated data, and apply it to an example of genetic polymorphism data. Our algorithm requires no a priori knowledge and exhibits greater power than any other unsupervised algorithm. Furthermore, we apply model averaging methodology to overcome the natural and extensive uncertainty in cluster borders, facilitating estimation of a realistic profile of sequence heterogeneity and clustering. These profiles are of broad utility for computational analyses or visualizations of heterogeneity in discrete linear sequences, an enterprise of rapidly increasing importance given the diminishing costs of nucleic acid sequencing.
Introduction Analysis of discrete linear sequences has played an increasingly important role in biology.
In particular, the detection of heterogeneous regions among sequences can aid in understanding the heterogeneous processes that act upon those regions [1],[2]. Therefore, determining whether specified types or categories of sites, such as polymorphic [3] or substituted sites [4], are concentrated in specific regions within DNA or protein sequences has become a key component of these analyses [5]–[8]. For instance, detecting regions that feature heterogeneity in substitutions may provide valuable information on the structure and function of DNAs or proteins [9]–[13]. Several parametric and nonparametric methods have been proposed and historically applied to sequence data. Parametric methods include applications of a Fisher's exact test to tallies of site types between regions, or of a likelihood ratio test to identify heterogeneous regions [14],[15]. Alternatively, several heuristic methods may be applied for this clustering [16]. For example, UPGMA (Unweighted Pair Group Method with Arithmetic Mean) and NN (Nearest Neighbor) are hierarchical methods that at each step combine the two nearest clusters into one new cluster. Iteration of this step is continued until the number of clusters is one. One of NN's variants, K-NN (K-Nearest Neighbor), differs in its termination condition, stopping the iteration once K clusters are identified, where K needs to be defined in advance. Another heuristic approach, K-means, uses a partitioning algorithm to break data into K clusters, and also requires the number of clusters K as prior knowledge. When regions of a sequence that are expected to have heterogeneous frequencies of a site type may be specified in advance, or when the number of clusters to be identified is known a priori, these methods have high power to detect clustering [17]. However, they require a priori assignment of partitions.
When no a priori expectation of cluster size or cluster number may be specified, extant studies have usually relied on “sliding window” methods [18]–[23]. For example, Pesole et al. (1992) labeled invariable sites as ‘1’ and variable sites as ‘0’, and applied a sliding window to identify whether ‘1’s are significantly clustered [24]. Pesole et al. calculated a heuristic score based on the presence or absence of site types within a window that progresses serially across the sequence of interest. Advantages of sliding window methods include their intuitive conceptual basis and their striking output: an autocorrelated plot of the score that may be superimposed upon the sequence, providing a visual appraisal of the level of clustering at every site. However, sliding window methods have two related major disadvantages [25]. First, they generally offer only crude non-parametric means for statistical significance testing. The autocorrelation of serial scores severely complicates attempts to develop more insightful parametric approaches to sliding window significance testing, making parameter estimation with confidence intervals either challenging or impossible. Second, the need to specify a window size presents a user with a procedural ambiguity. Without a unified statistical framework, there is no strong justification for selection of one window size over another. In such a situation, it may even be tempting to invert the procedure of statistical inference and select a window size that produces an autocorrelated score plot consistent with a particular scientific hypothesis, as opposed to the valid procedure of selecting a window size by an objective statistical optimality criterion. Because of these disadvantages of the sliding window methods, several nonparametric statistical methods that do not assume prior knowledge have been suggested or implemented to detect clustering in discrete linear sequences.
These methods include runs tests [26]–[28] and empirical cumulative distribution function (ECDF) statistics [29],[30]. Runs tests use the “longest unbroken run” between sites of interest as a test statistic for clustering, where a run is defined as consecutive length between events [26]. This test statistic provides very weak power, because it uses very little of the relevant information about the phenomenon of interest, ignoring all runs other than the longest. Statistics based on the longest two runs, longest three runs, or even on a summary of the full distribution of run lengths have been discussed, but remain weak tests. For instance, the variance in distance between site types of interest may be calculated and used as a test statistic for the detection of clusters of sites, where a high variance is indicative of clustering [29]. This test statistic incorporates information about the length of all the runs, but does not capture all of the relevant information: it discards all information about the relative position of runs of different lengths. A sequence with all of its shorter runs in one region would be more clustered than one with short runs distributed evenly. Currently, the most powerful nonparametric method is the ECDF. It features the cumulative difference between the observed and expected proportion of variant sites to identify regions that differ from other regions in number of substitutions. Under a null model that assumes no heterogeneous region(s) within sequences, this difference remains close to zero. Its significant departure from zero is an indicator for rejecting the null model [29],[30]. Although ECDF has been used to detect heterogeneity in several studies [31]–[35], its power can be affected by the location of the heterogeneous region [30]. Moreover, a parametric method may perform even better across a wide range of datasets. 
Most extant methods that have been proposed to detect heterogeneous clusters among sequences suffer from poor power to detect clustering when it is present. The problem is made especially challenging by a tradeoff wherein increasing power to detect clustering also increases overparameterization or false positive rates. Methods that have high power are prone to identify clustering even in random sequences, because even in short sequences, there are so many potential patterns of clustering to evaluate. In this paper, we propose a hierarchical clustering method, model averaged clustering by maximum likelihood (MACML), requiring no a priori knowledge of cluster size or cluster count, that provides greater statistical power in detecting heterogeneous regions. MACML adopts a divide-and-conquer approach to hierarchically detect heterogeneous regions and repeats a similar analysis for each identified region, unlike most hierarchical methods, which do not revisit clusters once they are constructed [17],[36],[37]. To address issues of overparameterization, MACML employs model selection and model averaging techniques that lead to intuitively appealing profiles of sequence heterogeneity and that facilitate description of clustered sites in discrete linear sequences. We describe MACML in detail and provide comparative results in the form of an in-depth evaluation of simulated datasets and an empirical sequence data set on polymorphisms in the Drosophila alcohol dehydrogenase gene.
Materials and Methods
Algorithm Applying MACML to locate regional clusters with different specified site types requires a general input sequence X with N sites, denoted as $X = (x_0, x_1, \ldots, x_{N-1})$, in which each $x_i$ is coded as 1 or 0 according to its site type; for example, variant = 1 and invariant = 0, or G/C = 1 and A/T = 0. Notations used to describe our algorithm are summarized in Table 1.
Null model In a sequence with N sites, we denote the number of variant sites as $n$. Under a null model, rates of appearance of variants across all sites are the same, equaling $p = n/N$. Consequently, the likelihood of the null model is
$$L_0 = p^{n}(1-p)^{N-n} = \left(\frac{n}{N}\right)^{n}\left(1-\frac{n}{N}\right)^{N-n}.$$
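As a minimal sketch of the null model, the following Python computes the log-likelihood of a 0/1-coded sequence when every site shares one variant rate. It assumes independent sites and the Bernoulli form of the likelihood without a combinatorial constant; the function name `null_log_likelihood` and the example sequence are illustrative and are not taken from the MACML package.

```python
import math

def null_log_likelihood(x):
    """Log-likelihood of a 0/1 sequence under one shared variant rate p = n/N.

    Assumes independent sites; x is a non-empty list of 0s and 1s.
    """
    N = len(x)
    n = sum(x)
    if n == 0 or n == N:
        # p is exactly 0 or 1, so the observed data have probability 1
        return 0.0
    p = n / N
    return n * math.log(p) + (N - n) * math.log(1.0 - p)

# Hypothetical example: 3 variant sites among 10
example = [0, 1, 0, 0, 1, 1, 0, 0, 0, 0]
print(null_log_likelihood(example))
```

Working on the log scale avoids numerical underflow for long sequences, which matters once many candidate models have to be compared.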
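Clustering model To derive a model incorporating heterogeneity (regional clustering of sites with different variant rates in each region), the entire sequence may first be partitioned into three regions. A central region is bounded by regional endpoints cs and ce (0≤cs<ce≤N-1) (see Figure 1). For given endpoints, the model is characterized by the number of sites within the cluster, $N_c = c_e - c_s + 1$; the number of variant sites within the cluster, $n_c$; the variant rate within the cluster, $p_c = n_c/N_c$; the number of variant sites outside the cluster, $n - n_c$; and the variant rate outside the cluster, $p_0 = (n - n_c)/(N - N_c)$.
Based on these determinate measures associated with the model, we define the likelihood of the clustering model as
$$L_c = p_c^{\,n_c}(1-p_c)^{N_c-n_c}\;p_0^{\,n-n_c}(1-p_0)^{(N-N_c)-(n-n_c)}.$$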
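Note that if cs = 0, or if ce = N–1, then there are only two putative regions. The formulation nevertheless applies unchanged.
Model selection Different regional endpoints cs and ce lead to a set of diverse, divergently parameterized candidate models (Equation 3) with a range of likelihood values. To decide which model best fits the data and to examine whether a cluster deviates significantly from neighboring sequence, we incorporate several model selection criteria [38]: the Akaike Information Criterion (AIC) [39], the corrected Akaike Information Criterion (AICc) [40], and the Bayesian Information Criterion (BIC) [41],
$$\mathrm{AIC} = -2\ln L + 2k,$$
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{N-k-1},$$
$$\mathrm{BIC} = -2\ln L + k\ln N,$$
where $L$ is the maximized likelihood of the model, $k$ is its number of parameters, and $N$ is the sample size (the number of sites in the sequence or sub-sequence being analyzed).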
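The search over regional endpoints and the comparison against the null model can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: it assumes a two-rate Bernoulli likelihood (one rate inside the cluster, one outside), an exhaustive scan of all endpoint pairs, the standard definitions of AIC, AICc and BIC, and k = 2 for the clustering model versus k = 0 for the null model. The names `seg_loglik`, `cluster_loglik`, `best_cluster` and `criteria` are hypothetical.

```python
import math

def seg_loglik(n, N):
    """Log-likelihood of N sites holding n variants at the ML rate n/N."""
    if n == 0 or n == N:
        return 0.0
    p = n / N
    return n * math.log(p) + (N - n) * math.log(1.0 - p)

def cluster_loglik(x, cs, ce):
    """Two-rate log-likelihood: sites cs..ce (inclusive) versus the rest."""
    N, n = len(x), sum(x)
    Nc = ce - cs + 1
    nc = sum(x[cs:ce + 1])
    return seg_loglik(nc, Nc) + seg_loglik(n - nc, N - Nc)

def best_cluster(x):
    """Exhaustively scan 0 <= cs < ce <= N-1 for the highest likelihood."""
    N = len(x)
    best = None
    for cs in range(N):
        for ce in range(cs + 1, N):
            if cs == 0 and ce == N - 1:
                continue  # the whole sequence is just the null model
            lnL = cluster_loglik(x, cs, ce)
            if best is None or lnL > best[0]:
                best = (lnL, cs, ce)
    return best

def criteria(lnL, k, N):
    """Return (AIC, AICc, BIC) for log-likelihood lnL, k parameters, N sites."""
    aic = -2.0 * lnL + 2.0 * k
    aicc = aic + 2.0 * k * (k + 1) / (N - k - 1)
    bic = -2.0 * lnL + k * math.log(N)
    return aic, aicc, bic

# Hypothetical sequence: a variant-rich stretch flanked by invariant sites
seq = [0] * 20 + [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] + [0] * 20
lnL_c, cs, ce = best_cluster(seq)
print("cluster endpoints:", cs, ce)
print("null      (AIC, AICc, BIC):", criteria(seg_loglik(sum(seq), len(seq)), 0, len(seq)))
print("clustered (AIC, AICc, BIC):", criteria(lnL_c, 2, len(seq)))
```

A lower criterion value for the clustered model than for the null model is what would count as support for a cluster in this sketch. The scan is quadratic in sequence length, which is acceptable for gene-length inputs.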
Model averaging Parameter estimation based on model selection depends upon a single “best” model selected from a set of candidate models [42]. However, because sites may not be variant even when their probability of heterogeneity is high, regional endpoints will rarely be exactly correct. Ideally, the inferred probability of heterogeneity of a site would be influenced in a weighted manner by suboptimal models. To allow all models to contribute to estimation, we make use of model averaging, which accounts for model uncertainty [43]–[45]. To average over models, we assign a weight to each model, and then infer measures of interest across all weighted models. For instance, within the AIC framework, we compute the Akaike weight (wi, i = 1, 2…m) for each model,
$$w_i = \frac{\exp(-\Delta_i/2)}{\sum_{j=1}^{m}\exp(-\Delta_j/2)}, \qquad \Delta_i = \mathrm{AIC}_i - \min_{j}\mathrm{AIC}_j,$$
and then average the measure of interest (here, the variant rate at each site) across the m models using these weights.
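The weighting step can be illustrated with a short Python sketch that turns the AIC of every candidate model into an Akaike weight and averages the per-site variant rate over all models. It assumes the standard Akaike weight formula, a candidate set consisting of the null model plus every single-cluster model with endpoints cs < ce, and k = 2 for each clustering model; the function name `averaged_profile` is hypothetical, and the candidate set used by the published software may differ.

```python
import math

def seg_loglik(n, N):
    """Log-likelihood of N sites holding n variants at the ML rate n/N."""
    if n == 0 or n == N:
        return 0.0
    p = n / N
    return n * math.log(p) + (N - n) * math.log(1.0 - p)

def averaged_profile(x):
    """Model-averaged variant rate at each site, using Akaike weights."""
    N, n = len(x), sum(x)
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)  # prefix[i] = number of variants in x[:i]

    # Each model: (AIC, cs, ce, rate inside, rate outside); null has no cluster
    models = [(-2.0 * seg_loglik(n, N), None, None, n / N, n / N)]
    for cs in range(N):
        for ce in range(cs + 1, N):
            Nc = ce - cs + 1
            if Nc == N:
                continue  # identical to the null model
            nc = prefix[ce + 1] - prefix[cs]
            lnL = seg_loglik(nc, Nc) + seg_loglik(n - nc, N - Nc)
            models.append((-2.0 * lnL + 4.0, cs, ce, nc / Nc, (n - nc) / (N - Nc)))

    best_aic = min(m[0] for m in models)
    raw = [math.exp(-(m[0] - best_aic) / 2.0) for m in models]
    total = sum(raw)

    profile = [0.0] * N
    for (aic, cs, ce, pc, p0), r in zip(models, raw):
        w = r / total  # Akaike weight of this model
        for i in range(N):
            inside = cs is not None and cs <= i <= ce
            profile[i] += w * (pc if inside else p0)
    return profile

# Hypothetical sequence with one variant-rich stretch
seq = [0] * 20 + [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] + [0] * 20
profile = averaged_profile(seq)
print([round(v, 3) for v in profile])
```

Because every candidate model distributes the same total number of variants across the sequence, the averaged profile sums to the observed number of variant sites, which gives a convenient sanity check.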
Implementation MACML applies a divide-and-conquer approach to hierarchically detect clusters within sequences. After determining the likelihood of all possible models, MACML locates the first cluster, partitions sequences into the three most likely segments, and then repeats a similar analysis for these three segments. The process is iterated on each segment, until all segments and sub-segments of the sequence have failed to demonstrate clustering (see Figure 2).
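The recursion described here can be sketched in a few lines of Python. The sketch makes several assumptions that are not taken from the MACML source code: the best split is chosen greedily by maximum likelihood, BIC (with k = 2 versus k = 0) is the stopping criterion, and each accepted split yields up to three sub-segments that are analyzed independently. The function name `split` and its return format are hypothetical.

```python
import math

def seg_loglik(n, N):
    """Log-likelihood of N sites holding n variants at the ML rate n/N."""
    if n == 0 or n == N:
        return 0.0
    p = n / N
    return n * math.log(p) + (N - n) * math.log(1.0 - p)

def split(x, start=0, out=None):
    """Recursively partition a non-empty 0/1 list x.

    Returns a list of (first_site, last_site, variant_rate) tuples in
    sequence order; `start` is the coordinate of x[0] in the full sequence.
    """
    out = [] if out is None else out
    N, n = len(x), sum(x)
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)

    best = None
    for cs in range(N):
        for ce in range(cs + 1, N):
            Nc = ce - cs + 1
            if Nc == N:
                continue  # a cluster spanning everything is the null model
            nc = prefix[ce + 1] - prefix[cs]
            lnL = seg_loglik(nc, Nc) + seg_loglik(n - nc, N - Nc)
            if best is None or lnL > best[0]:
                best = (lnL, cs, ce)

    if best is not None:
        lnL_c, cs, ce = best
        bic_null = -2.0 * seg_loglik(n, N)               # k = 0
        bic_clus = -2.0 * lnL_c + 2.0 * math.log(N)      # k = 2
        if bic_clus < bic_null:
            # Accept the split and analyze left flank, cluster, right flank
            for a, b in ((0, cs), (cs, ce + 1), (ce + 1, N)):
                if b > a:
                    split(x[a:b], start + a, out)
            return out

    out.append((start, start + N - 1, n / N))  # no supported cluster here
    return out

# Hypothetical sequence with one variant-rich stretch
seq = [0] * 20 + [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] + [0] * 20
for first, last, rate in split(seq):
    print(first, last, rate)
```

Every sub-segment is strictly shorter than its parent, so the recursion always terminates. Swapping BIC for AIC or AICc in the acceptance test changes only the two criterion lines.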
Availability MACML is written in standard C++, and its software package, including compiled executables for Linux/Mac/Windows, example data, documentation, and source code, is freely available for academic use only at http://www.yale.edu/townsend/software.html.
Simulations To test the performance of MACML and compare it to the most powerful extant method, ECDF, we simulated sequences for analysis for which the rates of variant sites were known a priori. For each simulated sequence, we randomly generated the start and end positions of the cluster, positions of variant sites within the cluster region, and positions of variant sites within the non-cluster region (see Figure 1). Simulations were replicated M = 10000 times for each parameter combination. Thus, each performance measure was determined from M replicates.
Power analysis For each replicate, the expected start position and end position of the cluster were denoted as cs and ce, respectively. Denoting the corresponding estimated values as $\hat{c}_s$ and $\hat{c}_e$, we defined the power to detect clusters within sequences as the proportion of all replicates that satisfy $|\hat{c}_s - c_s| \le \delta$ and $|\hat{c}_e - c_e| \le \delta$, where $\delta$ is the permissive zone parameter. The permissive zone allows each algorithm to just slightly misidentify the start and end of the cluster, improving the scope of the results of our simulations. Without a permissive zone, any algorithm misidentifies the start and end sites of the cluster with such a high frequency that computation becomes burdensome.
Accuracy & precision An alternative assessment criterion, the Kullback-Leibler (KL) divergence [46], requires no permissive zone and provides a more technically satisfactory assessment of the accuracy and precision of the method. The KL divergence calculates how divergent two probability distributions are; in this case, it is used to compare the probabilities of variant sites determined from MACML to probabilities that are known because they were simulated.
M replicates with N sites were simulated for each parameter combination, so that replicates may be indexed by $j = 1, \ldots, M$ and sites may be indexed by $i = 0, \ldots, N-1$. We denote $p_j(i)$ and $\hat{p}_j(i)$ as the expected and estimated values of variant rate at site i of replicate j, respectively. The KL divergence measures the difference between the two distributions $\hat{p}_j(i)$ and $p_j(i)$, and is defined as
$$D_j = \sum_{i=0}^{N-1} p_j(i)\,\ln\frac{p_j(i)}{\hat{p}_j(i)}.$$
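For illustration, a per-replicate divergence of this kind could be computed as below. The sketch assumes the textbook Kullback-Leibler form, the sum over sites of p ln(p / p̂), applied directly to per-site variant rates without normalization; the exact expression and any averaging over replicates used in the original analysis are not reproduced here. The name `kl_divergence` and the small `eps` guard against zero estimates are assumptions introduced for this example.

```python
import math

def kl_divergence(p, p_hat, eps=1e-12):
    """Sum over sites of p * ln(p / p_hat).

    Sites with p == 0 contribute nothing; estimates below eps are clipped
    to eps so that the logarithm stays finite (an assumption of this sketch).
    """
    total = 0.0
    for expected, estimated in zip(p, p_hat):
        if expected > 0.0:
            total += expected * math.log(expected / max(estimated, eps))
    return total

# Hypothetical per-site rates for a 6-site replicate
expected = [0.05, 0.05, 0.40, 0.40, 0.05, 0.05]
too_low = [0.5 * v for v in expected]    # rates underestimated everywhere
too_high = [1.5 * v for v in expected]   # rates overestimated everywhere

print(kl_divergence(expected, expected))
print(kl_divergence(expected, too_low))
print(kl_divergence(expected, too_high))
```

Because per-site rates do not sum to one, this quantity can take either sign: it is positive when rates are underestimated and negative when they are overestimated, and zero when the estimate matches the expectation.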
Simulation parameters The power to detect heterogeneous clusters is a function of the number of variant sites (n), the sequence length (N), the percentage of variant sites within the cluster (q), the ratio (r = pc/p0) of variant rates within cluster (pc) to outside of cluster region (p0), and the number of clusters. We systematically varied parameters of the simulations to obtain a thorough description of algorithm performance.
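One way such a single-cluster sequence could be generated is sketched below. This is an illustration under stated assumptions rather than the simulation code used for the reported results: the number of variant sites inside the cluster is taken as round(q·n), the cluster length and position are drawn uniformly from the values that leave room for the required variants, and variant positions are sampled without replacement. The function name `simulate_sequence` and the seeded generator are hypothetical.

```python
import random

def simulate_sequence(N, n, q, rng):
    """Simulate a 0/1 sequence of length N with n variants and one cluster.

    q is the fraction of variant sites placed inside the cluster.
    Returns (x, cs, ce) with the cluster occupying sites cs..ce inclusive.
    """
    nc = round(q * n)   # variants inside the cluster
    n0 = n - nc         # variants outside the cluster
    min_len = max(nc, 2)            # cluster must hold nc variants, cs < ce
    max_len = N - max(n0, 1)        # flanks must hold n0 variants
    if min_len > max_len:
        raise ValueError("N is too small for the requested n and q")

    Nc = rng.randint(min_len, max_len)
    cs = rng.randint(0, N - Nc)
    ce = cs + Nc - 1

    inside = list(range(cs, ce + 1))
    outside = [i for i in range(N) if i < cs or i > ce]

    x = [0] * N
    for i in rng.sample(inside, nc):
        x[i] = 1
    for i in rng.sample(outside, n0):
        x[i] = 1
    return x, cs, ce

rng = random.Random(2009)  # fixed seed so the example is repeatable
x, cs, ce = simulate_sequence(N=200, n=50, q=0.8, rng=rng)
Nc = ce - cs + 1
pc = sum(x[cs:ce + 1]) / Nc
p0 = (sum(x) - sum(x[cs:ce + 1])) / (len(x) - Nc)
print("cluster:", cs, ce, "pc:", pc, "p0:", p0)
```

In this sketch the ratio r = pc/p0 is an outcome of the drawn cluster length rather than an input; fixing r instead would require choosing the cluster length deterministically from n, q and r.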
Empirical data We retrieved the Drosophila alcohol dehydrogenase (Adh) gene from five species of the Drosophila melanogaster species subgroup (D. melanogaster, D. sechellia, D. simulans, D. yakuba and D. erecta) from FlyBase [47]. The aligned sequences of the Drosophila Adh gene are available at http://www.yale.edu/townsend/datasets.html.
Results
Effects of the number of variant sites and the percentage of variant sites within the cluster The powers of MACML and ECDF were plotted against the percentage of variant sites within the cluster (q) under different numbers of variant sites (n) in Figure 3 (n = 10 in Figure 3A).
The power of MACML and ECDF to detect cold spots was also low when n was small (n = 10 in Figure 3E; the panels for n = 100 and n = 200 are Figure 3G and Figure 3H, respectively).
The accuracy and precision of MACML and ECDF were estimated by the Kullback-Leibler (KL) divergence, which is a measure of the difference between the expected and estimated distributions of variant rates. In assessing the accuracy based on the KL divergence, therefore, there are three potential scenarios: a good match between the estimated and expected variant rates when the KL divergence is near zero, an underestimation of variant rates when the KL divergence is positive, and an overestimation of variant rates when the KL divergence is negative. The precision based on the KL divergence is also better when it is closer to zero. Unlike the accuracy, precision based on the KL divergence cannot be negative (Equation 12). Evaluating the accuracy and precision based on the KL divergence, MACML performed better than ECDF for most of the cases examined (Figure 4; n = 10 in Figure 4E and n = 200 in Figure 4H).
Effect of the ratio of variant rates within cluster to outside of cluster The powers of MACML and ECDF were plotted against the ratio of variant rates within cluster to outside of cluster in Figure 5.
MACML provided good accuracy and precision (near zero) for detecting cold spots, whereas the accuracy of ECDF diverged negatively and the precision of ECDF diverged from the ideal as well (Figure 6A).
According to their definitions, a ratio of variant rates within cluster to outside of cluster of r = 1, q = 0%, or q = 100% represents sequences with entirely randomly located substitutions under the null model. Therefore, we compared three criteria adopted by MACML and examined their errors of overparameterizing the clustering model when no clustering was imposed during the sequence generation. MACML and ECDF demonstrated high overparameterization and false positive rates, respectively (Table 2). The overparameterization rate of MACML markedly exceeded the false positive rate of ECDF for n = 10, n = 100 and n = 200. Implementing the AIC and AICc did little to moderate overparameterization, whereas implementing BIC significantly moderated overparameterization. Implementing the BIC did not bring overparameterization down to the false positive rate of ECDF for n = 10, 100, and 200, but did limit the overparameterization rate to approximately the false positive rate of ECDF for sequences with n = 50.
Effect of sequence length The powers of MACML and ECDF were plotted against sequence length in Figure 7.
The accuracy and precision of MACML and ECDF varied little across all values of sequence length. With increasing sequence length, the accuracy of ECDF diverged from zero positively for hot spots and diverged slightly negatively for cold spots. The precision of ECDF diverged from the ideal positively for both hot spots and cold spots (Figure 8A).
Effect of the number of clusters The powers of MACML and ECDF were plotted against the number of clusters in Figure 9 (n = 10 in Figure 9A, n = 50 in Figure 9B, n = 100 in Figure 9C, and n = 200 in Figure 9D).
Applied example We applied MACML to detect heterogeneous clusters of polymorphisms within the Drosophila Adh gene and to profile potential for polymorphism for each site based on model selection and model averaging, respectively. Identified clusters as well as profiles of the potential for polymorphism were plotted against sequence coordinate (Figure 10).
Detailed clustering results for the Adh gene are summarized in Table 3. For the AIC or AICc, the four detected clusters all deviate significantly from the null model (ΔAIC<0 and ΔAICc<0 in Table 3). When sample size is large, as for the sequence from sites 0 to 253, the ΔAICc asymptotically approaches ΔAIC, and thus their values are nearly the same. However, for a smaller sample size, for example, when analyzing the sub-sequence from sites 71 to 97, ΔAICc is much larger than ΔAIC. By contrast, BIC incorporates a heavier penalty than AIC or AICc, and ΔBIC>0 indicated no significant cluster among sub-sequences from sites 71 to 97 or from 190 to 253, whereas AIC and AICc identified two clusters along these two sub-sequences.
Discussion
Comparative analysis of simulated results The power to detect heterogeneous clustered sites within sequences depended in moderately complex ways on the parameters we examined in this report. Consistent with expectations, our results show that the power of MACML to detect hot spots and cold spots increased with increasing percentage of variant sites within the cluster (Figure 3).
Across a range of ratios of variant rates within the cluster to outside of the cluster, MACML and ECDF exhibit similar trends in power, but different trends in accuracy and precision. With both methods, a significant difference between variant rates within the cluster and outside of the cluster leads to greater power, and nearly equal rates for all sites result in lower power (Figure 5).
Both MACML and ECDF exhibit decreasing power with increasing sequence length (Figure 7).
False positive rates and overparameterization for clustering models were high, as expected as a consequence of the large number of potential cluster boundary sets that are possible. Powerful methods for this class of problem are expected to display high false positive rates, a tradeoff that is natural in statistical inference. Although ECDF presents lower false positive rates, MACML achieves more power than ECDF to reject the null hypothesis when it is not true (see, for example, Figure 3).
Differences of the adopted criteria Unlike ECDF, which is not integrated into a model selection framework, MACML adopts AIC, AICc and BIC for model selection. To clarify the differences observed implementing these diverse criteria, the different penalties for additional parameterization that they entail may be compared. Based on the clustering model, two parameters (cs and ce) are evaluated (from which p0 and pc can be calculated). Therefore, the number of parameters under the clustering model is two, whereas the number under the null model is zero. From equations 4–6, then,
$$\Delta\mathrm{AIC} = \mathrm{AIC}_c - \mathrm{AIC}_0 = -2(\ln L_c - \ln L_0) + 4,$$
$$\Delta\mathrm{AICc} = \mathrm{AICc}_c - \mathrm{AICc}_0 = -2(\ln L_c - \ln L_0) + 4 + \frac{12}{N-3},$$
$$\Delta\mathrm{BIC} = \mathrm{BIC}_c - \mathrm{BIC}_0 = -2(\ln L_c - \ln L_0) + 2\ln N.$$
The values of lnLc–lnL0 may be plotted against sample size (Equations 11–13, Figure 11).
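To make the comparison of penalties concrete, the following Python sketch computes the smallest gain in log-likelihood, lnLc − lnL0, at which each criterion would favor the clustering model over the null model. It assumes the standard definitions of AIC, AICc and BIC with two parameters for the clustering model and none for the null model, so that the thresholds are 2, 2 + 6/(N − 3) and ln N, respectively. The function name `min_gain` and the sample sizes shown are illustrative.

```python
import math

def min_gain(N):
    """Log-likelihood gain (lnLc - lnL0) needed to prefer the clustering
    model (k = 2) over the null model (k = 0) for a sample of N sites.

    Derived by setting each criterion difference to zero; requires N > 3.
    """
    if N <= 3:
        raise ValueError("AICc with k = 2 needs more than 3 sites")
    return {
        "AIC": 2.0,                       # from -2*gain + 4 < 0
        "AICc": 2.0 + 6.0 / (N - 3),      # from -2*gain + 4 + 12/(N-3) < 0
        "BIC": math.log(N),               # from -2*gain + 2*ln(N) < 0
    }

# Illustrative sample sizes: a short sub-sequence and a longer one
for N in (27, 254):
    print(N, min_gain(N))
```

The AICc threshold approaches the AIC threshold as N grows, while the BIC threshold keeps rising with ln N, so BIC is the most conservative of the three whenever N exceeds roughly 7 sites.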
For a given value of lnLc–lnL0, the three criteria are likely to give different results with regard to rejection of the null model. The three lines plotted in Figure 11 correspond to the three different criteria.
Significance of profiling heterogeneity The Drosophila Adh is the most studied enzyme that catalyzes the oxidation of alcohols to aldehydes/ketones [48]. It has been extensively reported that several functionally important residues reside in the Adh gene: tyrosine-152, lysine-156 and serine-139 are conserved in homologous dehydrogenases and have important roles in catalysis [49]–[53]; glycine-130, glycine-133 and glycine-184 contribute substantially to the structure of the active form [50]; and aspartic acid-64 lies within a coenzyme-binding domain [51]. The clusters and profiles inferred for the Adh gene are shown in Figure 10.
Heterogeneity of variant rates among specified site types is thought to commonly occur [56]–[59] and may derive from many sources, including functional constraint, gene structure, 3D protein structure, composition bias, mutation bias or recombination [1], [18], [34], [60]–[62]. As indicated by our results based on the simulated data and real data, MACML, equipped with model selection and model averaging, features smooth and continuous profiles of variant rates for each site, and is more accurate and more informative for the detection of multiple clusters among sequences. Therefore, MACML furnishes broad utility for any computational analyses of heterogeneous discrete linear sequences and provides valuable information to aid understanding of the structure and function of DNAs or proteins. In addition, MACML can be applied to a broad range of applications. For example, MACML would be appropriate for determining whether components of any multicomponent polymer have a clustered structure [33],[63]. It can also be used to detect compositional heterogeneity within sequences [64]–[66] (e.g., heterogeneous GC content by setting G/C = 1 and A/T = 0).
Moreover, MACML may provide a framework upon which future modeling of the substitution process may be overlain, assessing heterogeneity in selective pressure acting on different coding sequence regions [60], [67]–[70] and detecting fast-evolving regions in noncoding sequences [71],[72].
Conclusion Here we have presented a method, MACML, to detect clustering of a site type in discrete linear sequences. MACML features maximum likelihood estimation, model selection criteria (AIC, AICc, and BIC) and model averaging to profile sequence heterogeneity. It employs a divide-and-conquer approach to hierarchically detect multiple clusters within sequences, without requiring a priori knowledge of cluster size or number. We compared MACML with the most powerful competing method, the ECDF, by exploring a full range of parameter space using computer simulations, and by performing an analysis of empirical data. Our comparative results show that across a wide range of parameter combinations, MACML outperforms ECDF not only by exhibiting greater power to detect hot spots and cold spots, but also by achieving better accuracy and precision in the estimation of clusters. Thus, it represents a powerful exploratory tool for profiling clustering in discrete linear sequences. Although discoveries using MACML should be considered tentative, it yields greater resolution than any other method, providing a significant advance for the analysis of clustering of sites within discrete linear sequences.
Table S1 Power to detect heterogeneous clusters, evaluating a range of percentages of variant sites within the cluster (q) (0.02 MB XLS)
Table S2 Power to detect heterogeneous clusters, evaluating a range of ratios of variant rates within the cluster to outside of the cluster (r) (0.02 MB XLS)
Table S3 Power to detect heterogeneous clusters, evaluating a range of sequence lengths (0.02 MB XLS)
Table S4 Power to detect multiple heterogeneous clusters (0.02 MB XLS)
Acknowledgments We thank three anonymous reviewers for their critical comments and constructive suggestions on this manuscript. We also thank Zheng Wang, Francesc López, Aleksandra Adomas, Gina Wilpiszeski and Andrea Hodgins-Davis for valuable discussions of this manuscript.
Footnotes The authors have declared that no competing interests exist. This work was supported by funding from Yale University and by computational resources at the Yale University Biomedical High Performance Computing Center, whose instrumentation was funded by NIH grant RR19895. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
1. Stephens JC. Statistical methods of DNA sequence analysis: detection of intragenic recombination or gene conversion. Mol Biol Evol. 1985;2:539–556.
2. Nekrutenko A, Li WH. Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 2000;10:1986–1995.
3. Nachman MW. Single nucleotide polymorphisms and recombination rate in humans. Trends Genet. 2001;17:481–485.
4. Wolfe KH, Sharp PM, Li WH. Mutation rates differ among regions of the mammalian genome. Nature. 1989;337:283–285.
5. Huelsenbeck JP, Nielsen R. Variation in the pattern of nucleotide substitution across sites. J Mol Evol. 1999;48:86–93.
6. Nei M. Molecular Evolutionary Genetics. New York, USA: Columbia University Press; 1987.
7. Nielsen R. Molecular signatures of natural selection. Annu Rev Genet. 2005;39:197–218.
8. Yang ZH. Among-site rate variation and its impact on phylogenetic analyses. Trends Ecol Evol. 1996;11:367–372.
9. Attimonelli M, Lanave C, Sbisa E, Preparata G, Saccone C. Multisequence comparisons in protein coding genes. Search for functional constraints. Cell Biophys. 1985;7:239–250.
10. Reeves JH. Heterogeneity in the substitution process of amino acid sites of proteins coded for by mitochondrial DNA. J Mol Evol. 1992;35:17–31.
11. Zheng Y, Roberts RJ, Kasif S. Segmentally variable genes: a new perspective on adaptation. PLoS Biol. 2004;2:e81. doi:10.1371/journal.pbio.0020081.
12. Marin I, Fares MA, Gonzalez-Candelas F, Barrio E, Moya A. Detecting changes in the functional constraints of paralogous genes. J Mol Evol. 2001;52:17–28.
13. Andres AM, de Hemptinne C, Bertranpetit J. Heterogeneous rate of protein evolution in serotonin genes. Mol Biol Evol. 2007;24:2707–2715.
14. Gaut BS, Weir BS. Detecting substitution-rate heterogeneity among regions of a nucleotide sequence. Mol Biol Evol. 1994;11:620–629.
15. Hartmann M, Golding GB. Searching for substitution rate heterogeneity. Mol Phylogenet Evol. 1998;9:64–71.
16. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Computing Surveys. 1999;31:264–323.
17. Berkhin P. A Survey of Clustering Data Mining Techniques. In: Kogan J, Nicholas C, Teboulle M, editors. Grouping Multidimensional Data: Recent Advances in Clustering. Berlin, Heidelberg: Springer-Verlag Berlin Heidelberg; 2006. pp. 25–71.
18. Mrazek J, Karlin S. Strand compositional asymmetry in bacterial and large viral genomes. Proc Natl Acad Sci U S A. 1998;95:3720–3725.
19. Ponger L, Mouchiroud D. CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequences. Bioinformatics. 2002;18:631–633.
20. Zharkikh AA, Rzhetsky A. Quick assessment of similarity of two sequences by comparison of their L-tuple frequencies. Biosystems. 1993;30:93–111.
21. Liang H, Zhou W, Landweber LF. SWAKK: a web server for detecting positive selection in proteins using a sliding window substitution rate analysis. Nucleic Acids Res. 2006;34:W382–W384.
22. Proutski V, Holmes E. SWAN: sliding window analysis of nucleotide sequence variability. Bioinformatics. 1998;14:467–468.
23. Fares MA, Elena SF, Ortiz J, Moya A, Barrio E. A sliding window-based method to detect selective constraints in protein-coding genes and its application to RNA viruses. J Mol Evol. 2002;55:509–521.
24. Pesole G, Attimonelli M, Preparata G, Saccone C. A statistical method for detecting regions with different evolutionary dynamics in multialigned sequences. Mol Phylogenet Evol. 1992;1:91–96.
25. Schmid K, Yang Z. The trouble with sliding windows and the selective pressure in BRCA1. PLoS ONE. 2008;3:e3746. doi:10.1371/journal.pone.0003746.
26. Karlin S, Brendel V. Chance and statistical significance in protein and DNA sequence analysis. Science. 1992;257:39–49.
27. Karlin S, Ladunga I, Blaisdell BE. Heterogeneity of genomes: measures and values. Proc Natl Acad Sci U S A. 1994;91:12837–12841.
28. Karlin S. Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin Microbiol. 1998;1:598–610.
29. Goss PJ, Lewontin RC. Detecting heterogeneity of substitution along DNA and protein sequences. Genetics. 1996;143:589–602.
30. Tang H, Lewontin RC. Locating regions of differential variability in DNA and protein sequences. Genetics. 1999;153:485–495.
31. Peng X, Karuturi RK, Miller LD, Lin K, Jia Y, et al. Identification of cell cycle-regulated genes in fission yeast. Mol Biol Cell. 2005;16:1026–1042.
32. Schaeffer SW, Walthour CS, Toleno DM, Olek AT, Miller EL. Protein variation in Adh and Adh-related in Drosophila pseudoobscura. Linkage disequilibrium between single nucleotide polymorphisms and protein alleles. Genetics. 2001;159:673–687.
33. Zheng Y, Roberts RJ, Kasif S. Identification of genes with fast-evolving regions in microbial genomes. Nucleic Acids Res. 2004;32:6347–6357.
34. Dermitzakis ET, Clark AG. Differential selection after duplication in mammalian developmental genes. Mol Biol Evol. 2001;18:557–562.
35. Schmid KJ, Nigro L, Aquadro CF, Tautz D. Large number of replacement polymorphisms in rapidly evolving genes of Drosophila. Implications for genome-wide surveys of DNA polymorphism. Genetics. 1999;153:1717–1729.
36. Levin MS. Towards hierarchical clustering. In: Diekert V, Volkov M, Voronkov A, editors. Computer Science - Theory and Applications. Heidelberg: Springer Berlin/Heidelberg; 2007. pp. 205–215.
37. Castro RM, Coates MJ, Nowak RD. Likelihood based hierarchical clustering. IEEE Trans Signal Process. 2004;52:2308–2321.
38. Sullivan J, Joyce P. Model selection in phylogenetics. Annu Rev Ecol Evol Syst. 2005;36:445–466.
39. Akaike H. New look at statistical-model identification. IEEE Trans Automat Contr. 1974;Ac19:716–723.
40. Hurvich CM, Tsai CL. Regression and time-series model selection in small samples. Biometrika. 1989;76:297–307.
41. Schwarz G. Estimating dimension of a model. Ann Stat. 1978;6:461–464.
42. Raftery AE, Madigan D, Hoeting JA. Bayesian model averaging for linear regression models. J Am Stat Assoc. 1997;92:179–191.
43. Posada D, Buckley TR. Model selection and model averaging in phylogenetics: advantages of akaike information criterion and bayesian approaches over likelihood ratio tests. Syst Biol. 2004;53:793–808.
44. Johnson JB, Omland KS. Model selection in ecology and evolution. Trends Ecol Evol. 2004;19:101–108.
45. Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, et al. KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinformatics. 2006;4:259–263.
46. Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22:79–86.
47. Wilson RJ, Goodman JL, Strelets VB. FlyBase: integration and improvements to query tools. Nucleic Acids Res. 2008;36:D588–D593.
48. Benach J, Winberg JO, Svendsen JS, Atrian S, Gonzalez-Duarte R, et al. Drosophila alcohol dehydrogenase: acetate-enzyme interactions and novel insights into the effects of electrostatics on catalysis. J Mol Biol. 2005;345:579–598.
49. Chen Z, Jiang JC, Lin ZG, Lee WR, Baker ME, et al. Site-specific mutagenesis of Drosophila alcohol dehydrogenase: evidence for involvement of tyrosine-152 and lysine-156 in catalysis. Biochemistry. 1993;32:3342–3346.
50. Cols N, Marfany G, Atrian S, Gonzalez-Duarte R. Effect of site-directed mutagenesis on conserved positions of Drosophila alcohol dehydrogenase. FEBS Lett. 1993;319:90–94.
51. Persson B, Krook M, Jornvall H. Characteristics of short-chain alcohol dehydrogenases and related enzymes. Eur J Biochem. 1991;200:537–543.
52. Albalat R, Gonzalez D, Atrian S. Protein engineering of Drosophila alcohol dehydrogenase. The hydroxyl group of Tyr152 is involved in the active site of the enzyme. FEBS Lett. 1992;308:235–239.
53. Cols N, Atrian S, Benach J, Ladenstein R, Gonzalez-Duarte R. Drosophila alcohol dehydrogenase: evaluation of Ser139 site-directed mutants. FEBS Lett. 1997;413:191–193.
54. Benyajati C, Place AR, Powers DA, Sofer W. Alcohol dehydrogenase gene of Drosophila melanogaster: relationship of intervening sequences to functional domains in the protein. Proc Natl Acad Sci U S A. 1981;78:2717–2721.
55. Bodmer M, Ashburner M.
Conservation and change in the DNA sequences coding for alcohol dehydrogenase in sibling species of Drosophila. Nature. 1984;309:425–430. [PubMed] 56. Gillespie JH. Variability of evolutionary rates of DNA. Genetics. 1986;113:1077–1091. [PubMed] 57. Gu X, Fu YX, Li WH. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol Biol Evol. 1995;12:546–557. [PubMed] 58. Arndt PF, Hwa T, Petrov DA. Substantial regional variation in substitution rates in the human genome: importance of GC content, gene density, and telomere-specific effects. J Mol Evol. 2005;60:748–763. [PubMed] 59. Takano TS. Rate variation of DNA sequence evolution in the Drosophila lineages. Genetics. 1998;149:959–970. [PubMed] 60. Wagner A. Rapid detection of positive selection in genes and genomes through variation clusters. Genetics. 2007;176:2451–2463. [PubMed] 61. Yu J, Thorne JL. Testing for spatial clustering of amino acid replacements within protein tertiary structure. J Mol Evol. 2006;62:682–692. [PubMed] 62. Choi SC, Hobolth A, Robinson DM, Kishino H, Thorne JL. Quantifying the impact of protein tertiary structure on molecular evolution. Mol Biol Evol. 2007;24:1769–1782. [PubMed] 63. Vawter L, Brown WM. Rates and patterns of base change in the small subunit ribosomal RNA gene. Genetics. 1993;134:597–608. [PubMed] 64. Foster PG. Modeling compositional heterogeneity. Syst Biol. 2004;53:485–495. [PubMed] 65. Gao F, Zhang CT. GC-Profile: a web-based tool for visualizing and analyzing the variation of GC content in genomic sequences. Nucleic Acids Res. 2006;34:W686–W691. [PubMed] 66. Carulli JP, Krane DE, Hartl DL, Ochman H. Compositional heterogeneity and patterns of molecular evolution in the Drosophila genome. Genetics. 1993;134:837–845. [PubMed] 67. Pond SK, Muse SV. Site-to-site variation of synonymous substitution rates. Mol Biol Evol. 2005;22:2375–2385. [PubMed] 68. Yang Z, Swanson WJ. 
Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol Biol Evol. 2002;19:49–57. [PubMed] 69. Bao L, Gu H, Dunn KA, Bielawski JP. Likelihood-based clustering (LiBaC) for codon models, a method for grouping sites according to similarities in the underlying process of evolution. Mol Biol Evol. 2008;25:1995–2007. [PubMed] 70. Yang Z, Nielsen R, Goldman N, Pedersen AM. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000;155:431–449. [PubMed] 71. Bird CP, Stranger BE, Liu M, Thomas DJ, Ingle CE, et al. Fast-evolving noncoding sequences in the human genome. Genome Biol. 2007;8:R118. [PubMed] 72. Stajich JE, Dietrich FS, Roy SW. Comparative genomic analysis of fungal genomes reveals intron-rich ancestors. Genome Biol. 2007;8:R223. [PubMed]