Int J Mol Sci. 2010; 11(3). PMCID: PMC2869232

# Proper Distance Metrics for Phylogenetic Analysis Using Complete Genomes without Sequence Alignment

Zu-Guo Yu ^{1,2,*}, Xiao-Wen Zhan ^{1}, Guo-Sheng Han ^{1}, Roger W. Wang ^{3}, Vo Anh ^{2} and Ka Hou Chu ^{4}

^{1}School of Mathematics and Computational Science, Xiangtan University, Hunan 411105, China; E-Mails: zhan031001140604@163.com (X.-W.Z.); korea10282003@163.com (G.-S.H.)

^{2}School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia; E-Mail: v.anh@qut.edu.au (V.A.)

^{3}Department of Mathematics, Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China; E-Mail: wwang_00@yahoo.com (R.W.W.)

^{4}Department of Biology, Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China; E-Mail: kahouchu@cuhk.edu.hk (K.H.C.)

^{*}Author to whom correspondence should be addressed; E-Mail: yuzg@hotmail.com; Tel.: +86-731-52377625; Fax: +86-731-58293934.

## Abstract

A shortcoming of most alignment-free correlation distance methods based on composition vectors, developed for phylogenetic analysis using complete genomes, is that the “distances” are not proper distance metrics in the strict mathematical sense. In this paper we propose two new correlation-related distance metrics to replace the old one in our dynamical language approach. Four genome datasets are employed to evaluate the effects of this replacement from a biological point of view. We find that the two proper distance metrics yield trees whose topologies are the same as, or similar to, those obtained with the old “distance”, and that these trees agree with the tree of life based on 16S rRNA in a majority of the basic branches. Hence the two proper correlation-related distance metrics proposed here improve our dynamical language approach for phylogenetic analysis.

**Keywords:** phylogenetic analysis, complete genome, composition vector, correlation-related distance metric

## 1. Introduction

Whole genome sequences are generally accepted as excellent tools for studying evolutionary relationships [1]. Traditional distance methods with multiple alignment or various sequence evolutionary models for phylogenetic analysis are not directly applicable to the analysis of complete genomes.

A number of methods without sequence alignment for deriving species phylogeny based on overall similarities of complete genomes have been developed. These include fractal analysis [2–4], dynamical language model [5], information-based analysis [6–8], log-correlation distance and Fourier transformation with Kullback-Leibler divergence distance [9], Markov model [10–15], principal component analysis [16] and singular value decomposition (SVD) [17–19]. The analyses based on the Markov model and dynamical language model without sequence alignment using 103 prokaryotes and 6 eukaryotes have yielded trees separating the three domains of life, Archaea, Eubacteria and Eukarya, with the relationships among the taxa consistent with those based on traditional analyses [5,11]. These two methods were also used to analyze the complete chloroplast genomes [5,12]. The SVD method was used to analyze mitochondrial genomes of 64 selected vertebrates [19]. A correlation-distance method without removing the random background (similar to [7]) was used to analyze rRNA gene sequences as DNA barcodes [20].

In the above approaches of SVD, Markov model and dynamical language model, there is a step to calculate the correlation-related distance between two genomes after removing the randomness or noise from the composition vectors. A drawback is that these correlation-related distances are not proper distance metrics in the strict mathematical sense (Professor Bailin Hao, personal communication, 2009; see also [21]). There are some ways to overcome this problem. One way is to change the concept of distance to that of dissimilarity proposed by Xu and Hao [15] in the Markov model approach. Another way is to replace a pseudo-distance by a proper distance metric, which requires that the results are not worsened from the biological point of view. In the first way, there is no widely accepted mathematical definition for the concept of dissimilarity or similarity. Chen *et al.* [22] defined a similarity metric, but unfortunately the sample correlation between two vectors in a vector space does not yield a proper similarity under their definition.

In this paper, we follow the second way and propose two proper correlation-related distance metrics to replace the pseudo-distance in the dynamical language approach used by Yu *et al*. [5]. We then evaluate the effects of this replacement on the analysis of a wide range of complete genomes from the biological point of view.

## 2. Dynamical Language Approach for Phylogenetic Analysis

Three kinds of data from the complete genomes can be analysed using the dynamical language approach proposed by Yu *et al*. [5]. They are the whole DNA sequences (including protein-coding and non-coding regions), all protein-coding DNA sequences and the amino acid sequences of all protein-coding genes. We outline this approach here.

There are a total of $N = 4^{K}$ (for DNA sequences) or $20^{K}$ (for protein sequences) possible types of $K$-strings, that is, strings of fixed length $K$. We denote the length of a DNA or protein sequence as $L$. A window of length $K$ is then slid along the sequence, shifting one position at a time, to determine the frequency of each of the $N$ kinds of $K$-strings in this sequence. We define $p(\alpha_1 \alpha_2 \dots \alpha_K) = n(\alpha_1 \alpha_2 \dots \alpha_K) / (L - K + 1)$ as the observed frequency of a $K$-string $\alpha_1 \alpha_2 \dots \alpha_K$, where $n(\alpha_1 \alpha_2 \dots \alpha_K)$ is the number of times that $\alpha_1 \alpha_2 \dots \alpha_K$ appears in this sequence. For the DNA or amino acid sequences of the protein-coding genes, denoting by $m$ the number of protein-coding genes from each complete genome, we define $({\sum}_{j=1}^{m}{n}_{j}({\alpha}_{1}{\alpha}_{2}\dots {\alpha}_{K}))/({\sum}_{j=1}^{m}({L}_{j}-K+1))$ as the observed frequency of a $K$-string $\alpha_1 \alpha_2 \dots \alpha_K$; here $n_j(\alpha_1 \alpha_2 \dots \alpha_K)$ means the number of times that $\alpha_1 \alpha_2 \dots \alpha_K$ appears in the $j$th protein-coding DNA sequence or protein sequence, and $L_j$ the length of the $j$th sequence in this complete genome. Then we can form a *composition vector* for a genome using $p(\alpha_1 \alpha_2 \dots \alpha_K)$ as components for all possible $K$-strings $\alpha_1 \alpha_2 \dots \alpha_K$. We use $p_i$ to denote the $i$-th component corresponding to the string type $i$, $i = 1, \dots, N$ (the $N$ strings are arranged in a fixed order, namely alphabetical order). In this way we construct a composition vector $p = (p_1, p_2, \dots, p_N)$ for a genome.

Yu *et al*. [5] considered an idea from the theory of dynamical language [23] that a $K$-string $s_1 s_2 \dots s_K$ is possibly constructed by adding a letter $s_K$ to the end of the $(K-1)$-string $s_1 s_2 \dots s_{K-1}$ or a letter $s_1$ to the beginning of the $(K-1)$-string $s_2 s_3 \dots s_K$. After counting the observed frequencies for all strings of length $(K-1)$ and the four or 20 kinds of letters, the expected frequency of appearance of $K$-strings is predicted by:

$$q(s_1 s_2 \dots s_K) = \frac{p(s_1 s_2 \dots s_{K-1})\, p(s_K) + p(s_1)\, p(s_2 s_3 \dots s_K)}{2},$$

where $p(s_1)$ and $p(s_K)$ are frequencies of nucleotides or amino acids $s_1$ and $s_K$ appearing in this genome. Then $q(s_1 s_2 \dots s_K)$ of all $4^{K}$ or $20^{K}$ kinds of $K$-strings is viewed as the noise background. We then subtract the noise background before performing a cross-correlation analysis through defining:

$$X(s_1 s_2 \dots s_K) = \begin{cases} p(s_1 s_2 \dots s_K) / q(s_1 s_2 \dots s_K) - 1, & \text{if } q(s_1 s_2 \dots s_K) \neq 0, \\ 0, & \text{if } q(s_1 s_2 \dots s_K) = 0. \end{cases}$$

The transformation *X* = (*p* / *q*) – 1 has the desired effect of subtraction of random background in *p* and rendering it a stationary time series suitable for subsequent cross-correlation analysis.
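The counting and background-subtraction steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kmer_freqs` and `denoised_vector` are hypothetical, the expected frequency *q* is taken to be the average of the two products *p*(*s*_{1}...*s*_{K–1})*p*(*s*_{K}) and *p*(*s*_{1})*p*(*s*_{2}...*s*_{K}), and *X* is set to 0 whenever *q* = 0. Frequencies of (*K* – 1)-strings and single letters are pooled over all sequences in the same way as the *K*-string frequencies, which is also an assumption.

```python
from collections import Counter
from itertools import product


def kmer_freqs(seqs, k):
    """Observed frequency of each k-string, pooled over all sequences.

    For one sequence this is n(w) / (L - k + 1); for several sequences it is
    sum_j n_j(w) / sum_j (L_j - k + 1).
    """
    counts = Counter()
    total = 0
    for s in seqs:
        n_windows = len(s) - k + 1
        if n_windows <= 0:
            continue  # sequence shorter than k contributes nothing
        total += n_windows
        for i in range(n_windows):
            counts[s[i:i + k]] += 1
    return {w: c / total for w, c in counts.items()} if total else {}


def denoised_vector(seqs, k, alphabet="ACGT"):
    """Return X = p/q - 1 for every k-string in alphabetical order (k >= 2)."""
    p_k = kmer_freqs(seqs, k)
    p_km1 = kmer_freqs(seqs, k - 1)
    p_1 = kmer_freqs(seqs, 1)
    x = []
    for letters in product(sorted(alphabet), repeat=k):
        w = "".join(letters)
        # Assumed form of the expected (background) frequency q.
        q = 0.5 * (p_km1.get(w[:-1], 0.0) * p_1.get(w[-1], 0.0)
                   + p_1.get(w[0], 0.0) * p_km1.get(w[1:], 0.0))
        # Assumed convention: X = 0 when the background frequency is 0.
        x.append(p_k.get(w, 0.0) / q - 1.0 if q > 0 else 0.0)
    return x
```

For a whole-genome analysis one would call `denoised_vector([genome], 11)`, and for protein sequences `denoised_vector(proteins, 6, alphabet="ACDEFGHIKLMNPQRSTVWY")`. The dense list has 4^{K} or 20^{K} entries, so a practical implementation for large *K* would probably need a sparse representation.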

Then we use $X(s_1 s_2 \dots s_K)$ for all possible $K$-strings $s_1 s_2 \dots s_K$ as components, arranged according to a fixed alphabetical order of the $K$-strings, to form a composition vector $X = (X_1, X_2, \dots, X_N)$ for genome $X$, and likewise $Y = (Y_1, Y_2, \dots, Y_N)$ for genome $Y$.

Then we view the $N$ components in the vectors $X$ and $Y$ as samples of two random variables respectively. The sample correlation $C(X, Y)$ between any two genomes $X$ and $Y$ is defined in the usual way in probability theory as:

$$C(X, Y) = \frac{\sum_{i=1}^{N} X_i Y_i}{\left( \sum_{i=1}^{N} X_i^2 \, \sum_{i=1}^{N} Y_i^2 \right)^{1/2}}.$$

The distance $D_r(X, Y)$ between the two genomes is then defined by $D_r(X, Y) = (1 - C(X, Y)) / 2$. A distance matrix for all the genomes under study is then generated for the construction of phylogenetic trees. This distance method for constructing phylogenetic trees is referred to as the *dynamical language model method* [5]. Finally, we construct all trees using the neighbour-joining (NJ) method [24] in the software *SplitsTree4* V4.10 [25] or in the *Molecular Evolutionary Genetics Analysis* software (MEGA 4) [26] based on the distance matrices.
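As a rough illustration of how the distance matrix might be assembled, the following Python sketch computes the correlation and the old “distance” *D*_{r} for a list of composition vectors. It assumes the uncentred form of the sample correlation (the cosine of the angle between the two vectors), which is the form consistent with the statement later in the paper that *C* equals cos *θ* for unit vectors. The function names are hypothetical, and the resulting matrix would still have to be exported to a tree-building program such as SplitsTree4 or MEGA.

```python
import math


def correlation(x, y):
    """Uncentred sample correlation: cosine of the angle between x and y.

    Raises ZeroDivisionError if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm


def pseudo_distance(x, y):
    """Old 'distance' D_r = (1 - C) / 2; it is not a proper metric."""
    return (1.0 - correlation(x, y)) / 2.0


def distance_matrix(vectors, dist=pseudo_distance):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(vectors)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = dist(vectors[i], vectors[j])
    return d
```

Passing a different function as `dist` lets the same loop build the matrix for any other pairwise distance.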

To determine the best string length (*K*) for our model, we plot the mean value of *X* over all *K*-strings from a genome (whole DNA sequences or protein sequences) as a function of *K* (see Figure 1 for examples from our data). The mean value of *X* starts to approach zero at *K* = 6 or 7 if we use the protein sequences from a genome and at *K* = 11 or 12 if we use the whole DNA sequence. A mean value of *X* close to zero means that the value of *p* (from the sequence) is almost equal to the value of *q* (from the model). Hence these *K* values are suitable for phylogeny reconstruction using our approach. This result is also confirmed later in this paper from a biological point of view.

## 3. Proper Distance Metrics in Vector Spaces

Each genome can be considered as a point in an $N$-dimensional space, with $N = 4^{K}$ (for DNA sequences) or $20^{K}$ (for protein sequences), represented by its composition vector $X = (X_1, X_2, \dots, X_N)$.

A function $D(X, Y)$ between two vectors $X$ and $Y$ is said to be a distance metric if it satisfies the following properties:

- (i) $D(X, Y) \ge 0$, and $D(X, Y) = 0$ if and only if $X = Y$;
- (ii) $D(X, Y) = D(Y, X)$;
- (iii) $D(X, Z) \le D(X, Y) + D(Y, Z)$ for any $X$, $Y$ and $Z$.

The inequality (iii) is called the *triangle inequality*. A distance metric *D*(*X*, *Y*) is said to be normalized if 0 ≤ *D*(*X*, *Y*) ≤ 1 for any *X* and *Y*.
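The three conditions can be checked numerically on any finite distance matrix. The sketch below is only an illustrative helper (the name `is_metric` and the tolerance `tol` are assumptions, not part of the paper), and it assumes that the rows of the matrix correspond to pairwise distinct points, so that every off-diagonal entry must be strictly positive.

```python
from itertools import permutations


def is_metric(d, tol=1e-12):
    """Check conditions (i)-(iii) on a square matrix of pairwise distances.

    Assumes the underlying points are pairwise distinct.
    """
    n = len(d)
    for i in range(n):
        if abs(d[i][i]) > tol:                 # (i): D(X, X) = 0
            return False
        for j in range(n):
            if i != j and d[i][j] <= tol:      # (i): distinct points, D > 0
                return False
            if abs(d[i][j] - d[j][i]) > tol:   # (ii): symmetry
                return False
    # (iii): triangle inequality for every ordered triple of distinct indices
    return all(d[i][k] <= d[i][j] + d[j][k] + tol
               for i, j, k in permutations(range(n), 3))
```

A passing check on one dataset does not prove that a function is a metric in general; it can only reveal violations.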

If we denote:

$$X_u = \frac{X}{|X|}, \qquad Y_u = \frac{Y}{|Y|},$$

where $|X|$ and $|Y|$ are the lengths of the vectors $X$ and $Y$ respectively, then $X_u$ and $Y_u$ are unit vectors (*i.e.*, have length 1). Let $\theta$ be the angle between the two vectors $X$ and $Y$. It is well known that $C(X_u, Y_u) = \cos\theta$.

The distance defined by $D_r(X, Y) = (1 - C(X, Y)) / 2$ is not a proper distance metric because it does not satisfy condition (i) (except for unit vectors) and the triangle inequality (iii) [21]. In the following we describe two proper distance metrics related to the sample correlation.
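A small two-dimensional example makes both failures concrete. The vectors below are chosen purely for illustration (they are not genome data), and the correlation is assumed to be the cosine of the angle between the vectors.

```python
import math


def cos_corr(x, y):
    """Cosine of the angle between x and y (assumed form of C)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def d_r(x, y):
    """Old 'distance' D_r = (1 - C) / 2."""
    return (1.0 - cos_corr(x, y)) / 2.0


x = (1.0, 0.0)
y = (1.0, 1.0)   # at 45 degrees to both x and z
z = (0.0, 1.0)

direct = d_r(x, z)              # C = 0, so D_r = 0.5
detour = d_r(x, y) + d_r(y, z)  # 2 * (1 - cos 45 deg) / 2, about 0.29

# Triangle inequality (iii) fails: the detour through y is shorter.
violates_triangle = direct > detour

# Condition (i) fails for non-unit vectors: distinct but parallel vectors
# are at "distance" zero.
zero_for_distinct = d_r((1.0, 0.0), (2.0, 0.0))
```

Here `direct` is 0.5 while `detour` is 1 − √0.5, so going through *Y* is “shorter” than going directly from *X* to *Z*, which a proper metric does not allow.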

### 3.1. Chord Distance

The chord distance is defined on the set of unit vectors in a vector space as the length of the chord constructed from two unit vectors. Mathematically, let $X_u = (X_{u1}, X_{u2}, \dots, X_{uN})$ and $Y_u = (Y_{u1}, Y_{u2}, \dots, Y_{uN})$ be two unit vectors; then the chord distance $D_{\text{chord}}(X_u, Y_u)$ is defined as:

$$D_{\text{chord}}(X_u, Y_u) = \sqrt{\sum_{i=1}^{N} (X_{ui} - Y_{ui})^2} = \sqrt{2\,(1 - C(X_u, Y_u))}.$$

It is seen that $D_{\text{chord}}(X_u, Y_u) = 0$ if and only if $C(X_u, Y_u) = 1$, *i.e.*, $\cos\theta(X_u, Y_u) = 1$, which implies that $\theta(X_u, Y_u) = 0$ because the angle $\theta(X_u, Y_u)$ between the two vectors $X_u$ and $Y_u$ is in $[0, \pi]$. This result means that the two vectors $X_u$ and $Y_u$ are identical. It is obvious that $D_{\text{chord}}(X_u, Y_u) = D_{\text{chord}}(Y_u, X_u)$. Because the three chords constructed by the pairs $X_u$ and $Y_u$, $X_u$ and $Z_u$, $Y_u$ and $Z_u$ are the three edges of a triangle, and the sum of the lengths of any two edges of a triangle is greater than or equal to the length of the third edge, the triangle inequality of the chord distance follows. Hence the chord distance is a proper distance metric in the strict mathematical sense. The chord distance $D_{\text{chord}}(X_u, Y_u)$ can be normalized by ${D}_{\mathit{\text{chord}}}^{\mathit{\text{norm}}}({X}_{u},\hspace{0.17em}{Y}_{u})={D}_{\mathit{\text{chord}}}({X}_{u},\hspace{0.17em}{Y}_{u})/2$. This distance is also called the Cavalli-Sforza chord distance [27] and is described on pp. 163–166 of [28]. It performed well in simulations of tree-building algorithms by Takezaki and Nei [29]. It has also been used to analyze microarray gene expression data [30].
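A minimal Python sketch of the chord distance follows. It assumes the chord length is the Euclidean distance between the two vectors after each has been scaled to unit length; the function names are hypothetical.

```python
import math


def unit(v):
    """Scale v to unit length (v must not be the zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def chord_distance(x, y, normalized=False):
    """Length of the chord between the unit vectors of x and y.

    The raw value lies in [0, 2]; normalized=True divides it by 2.
    """
    xu, yu = unit(x), unit(y)
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(xu, yu)))
    return d / 2.0 if normalized else d
```

Because both inputs are rescaled first, vectors that differ only in length are treated as the same point on the unit sphere and are at distance zero.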

### 3.2. Piecewise Distance

This distance metric is also defined on the set of unit vectors in a vector space. For any two unit vectors $X_u$ and $Y_u$, we define:

$$D_{\text{piecewise}}(X_u, Y_u) = \begin{cases} 0, & \text{if } C(X_u, Y_u) = 1, \\ \left[\rho - C(X_u, Y_u)\right] / \rho, & \text{if } C(X_u, Y_u) \neq 1, \end{cases}$$

where $\rho$ is any positive real number which is not smaller than 3. We call $D_{\text{piecewise}}(X_u, Y_u)$ the *piecewise distance*.

By definition, $D_{\text{piecewise}}(X_u, Y_u) = 0$ if and only if $C(X_u, Y_u) = 1$, which means that the two vectors $X_u$ and $Y_u$ are identical as shown above. It is also obvious that $D_{\text{piecewise}}(X_u, Y_u) = D_{\text{piecewise}}(Y_u, X_u)$. Using the facts $\rho \ge 3$ and $-1 \le C(X_u, Y_u) \le 1$ for any two unit vectors, we have, for three pairwise distinct unit vectors, $D_{\text{piecewise}}(X_u, Y_u) + D_{\text{piecewise}}(Y_u, Z_u) - D_{\text{piecewise}}(X_u, Z_u) = [\rho - C(X_u, Y_u) - C(Y_u, Z_u) + C(X_u, Z_u)] / \rho \ge (\rho - 3) / \rho \ge 0$; when two of the three vectors coincide the inequality holds trivially. We thus get the triangle inequality for the piecewise distance. Hence the piecewise distance is a proper distance metric in the strict mathematical sense. The piecewise distance $D_{\text{piecewise}}(X_u, Y_u)$ can be normalized by ${D}_{\mathit{\text{piecewise}}}^{\mathit{\text{norm}}}({X}_{u},\hspace{0.17em}{Y}_{u})={D}_{\mathit{\text{piecewise}}}({X}_{u},\hspace{0.17em}{Y}_{u})/2$. Usually we may take $\rho = 3$.
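The piecewise distance can be sketched in a few lines of Python. The sketch assumes the definition *D* = 0 when *C* = 1 and *D* = (*ρ* – *C*)/*ρ* otherwise, with *C* the cosine of the angle between the two vectors; the function name and the tolerance `tol` used to detect *C* = 1 in floating point are hypothetical choices.

```python
import math


def piecewise_distance(x, y, rho=3.0, normalized=False, tol=1e-12):
    """Piecewise distance between the unit vectors of x and y.

    Assumed definition: 0 if C = 1, otherwise (rho - C) / rho, where C is the
    cosine of the angle between x and y and rho >= 3.
    """
    if rho < 3:
        raise ValueError("rho must be at least 3")
    dot = sum(a * b for a, b in zip(x, y))
    c = dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    d = 0.0 if c >= 1.0 - tol else (rho - c) / rho
    return d / 2.0 if normalized else d
```

Note the jump in this definition: two vectors that are almost, but not exactly, parallel are at distance close to (*ρ* – 1)/*ρ*, not close to zero. That jump is what makes the triangle inequality hold for every *ρ* ≥ 3.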

## 4. Evaluation of the Proposed Distance Metrics from the Biological Point of View

We propose to replace the pseudo-distance in the dynamical language approach [5] by the chord distance or piecewise distance. We need to examine the effects of this replacement from the biological point of view. In order to do this, we evaluate the new distance metrics on four datasets, namely **Dataset 1** of 109 complete genomes of prokaryotes and eukaryotes used in [11], **Dataset 2** of 34 prokaryote and chloroplast genomes used in [12], **Dataset 3** of mitochondrial genomes of 64 selected vertebrates used in [19], and **Dataset 4** of 62 complete genomes of alpha-proteobacteria used in [31]. (*Note*: Chan *et al.* [21] recently tested the chord distance with different denoising formulas on Dataset 2).

We used the dynamical language approach for Datasets 1 and 2 in [5] and for Dataset 3 in [32]. Some biological comparisons of this approach with the Markov model approach on Datasets 1 and 2 were given in [5]. Recently we found that incorrect data for the crenarchaeon *Pyrobaculum aerophilum* (Pyrae) from Dataset 1 were used in [5]. Using the correct genome data, *Pyrobaculum aerophilum* (Pyrae) groups correctly with the other Crenarchaeota (when we use the amino acid sequences of all protein-coding genes from genomes and *K* = 6). After this correction, the resulting tree is better than the one in [11] from the biological point of view: all Firmicutes group together, and the other branches are similar. For Dataset 2, we obtained two trees with the same topology as those using the dynamical language approach in [5] and the Markov model approach in [12] (also using the amino acid sequences of all protein-coding genes from genomes and *K* = 6). For Dataset 3, we reported in [32] a good tree, obtained using the dynamical language approach (based on the whole DNA sequences of genomes and *K* = 11), that agrees with the current understanding of the phylogeny of vertebrates revealed by the traditional approaches. This tree is better than the one in [19] and the one obtained by the Markov model approach. Hence, for the first three datasets, we only need to compare the best trees obtained by the dynamical language approach using the two proper distance metrics with the best trees obtained from the pseudo-distance in [5]. In 2009, Guyon *et al*. [31] compared four alignment-free string distances for complete genome phylogeny using Dataset 4. We will compare our method in this paper with the results in [31] based on Dataset 4.

The whole DNA sequences (including protein-coding and non-coding regions), all protein-coding DNA sequences and the amino acid sequences of all protein-coding genes from genome data are used for phylogenetic analysis. For **Dataset 1**, we have seen that amino acid sequences of all protein-coding genes from genomes give better results than those given by the whole DNA sequences and all protein-coding DNA sequences. We evaluated the dynamical language approach with chord distance and piecewise distance on the amino acid sequences of all protein-coding genes from genomes for *K* = 3, 4, 5 and 6. We find the trees using the new distance metrics have the same topology as the trees using the old “distance” for the same value of *K*, and the trees for *K* = 6 are the best. Here we present the tree for *K* = 6 using dynamical language approach with chord distance in Figure 2. The phylogeny shown in Figure 2 supports the broad division into three domains and agrees with the tree of life based on 16S rRNA in a majority of basic branches. For further biological discussions, one can refer to [5] with the correction for the position of *Pyrobaculum aerophilum* (Pyrae).

[Figure 2: tree for Dataset 1 obtained with the chord distance for *K* = 6 based on all protein sequences.]

For **Dataset 2**, we have seen that the amino acid sequences of all protein-coding genes from genomes give better results than those given by the whole DNA sequences and all protein-coding DNA sequences. We evaluated the dynamical language approach with chord distance and piecewise distance on the amino acid sequences of all protein-coding genes from genomes for *K* = 3, 4, 5 and 6. We find that the tree using the piecewise distance has the same topology as the tree using the old “distance” for the same value of *K*, whereas the tree using the chord distance has a similar topology (slightly worse, because *Pinus thunbergii* is separated from its correct position). The trees for *K* = 6 are the best. Hence we present the tree for *K* = 6 using the dynamical language approach with piecewise distance (*ρ* = 3) in Figure 3. We also note that the topology of the tree in Figure 3 is the same as that of the tree obtained by the Markov model in [12]. The phylogeny of Figure 3 shows that the chloroplast genomes are separated into two major clades corresponding to chlorophytes *s.l.* and rhodophytes *s.l.* The interrelationships among the chloroplasts are largely in agreement with the current understanding of chloroplast evolution. For further biological discussions, one can refer to [12].

[Figure 3: tree for Dataset 2 obtained with the piecewise distance (*ρ* = 3) for *K* = 6 based on all protein sequences.]

For **Dataset 3**, after comparing all the trees with the traditional classification of the 64 vertebrates (the traditional classification from the KEGG database is available under “Complete Mitochondrial Genomes” on http://www.genome.jp/kegg/genes.html), we find that the whole DNA sequences give better results than those given by the amino acid sequences of all protein-coding genes from genomes and all protein-coding DNA sequences. We evaluated the dynamical language approach with the proposed distance metrics on the sequences of whole genomes for *K* = 6 to 13. We find that the tree using the piecewise distance has the same topology as the tree using the old “distance” for the same value of *K*, whereas the tree using the chord distance has a similar topology (slightly better, because *Dasypus novemcinctus* (Dnov) is close to, but no longer inside, a branch of primates). The trees for *K* = 11 are the best. Hence we present the tree for *K* = 11 using the dynamical language approach with chord distance in Figure 4. The tree in Figure 4 is similar in topology to the tree obtained using the SVD method in the case *K* = 4 [19], and is also similar to a recently generated tree of 69 species [33], placing a vast majority of species into well-accepted groupings. As shown in Figure 4, our distance-based analysis separates the mitochondrial genomes into three major clusters. One group corresponds to mammals; one group corresponds to the fish; and the third one represents Archosauria (including birds and reptiles). The interrelationships among the mitochondrial genomes are roughly in agreement with the current understanding of the phylogeny of vertebrates revealed by the traditional approaches. For further biological discussion, one can refer to [32].

[Figure 4: tree for Dataset 3 obtained with the chord distance for *K* = 11. In this tree the birds and reptiles group together as Archosauria.]

For **Dataset 4**, Guyon *et al*. [31] first reconstructed a reference tree using the Maximum Likelihood (ML) method based on the sequences of the large (LSU) and the small (SSU) ribosomal subunits (*i.e.*, the traditional alignment method). Then they compared the results using four alignment-free string distances for complete genome phylogeny. The four distances are the Maximum Significant Matches (MSM) distance, the *k*-word (KW) distance (*i.e.*, the Markov model in [11]), the Average Common Substring (ACS) distance and the Compression (ZL) distance. Guyon *et al*. [31] found that the MSM distance outperforms the other three distances and that the KW distance cannot give a good phylogenetic topology for the 62 alpha-proteobacteria (see Figure 3 in [31]). We tested our dynamical language approach with the pseudo-distance in [5] and the two proper distances in this paper on Dataset 4. We found that the amino acid sequences of all protein-coding genes from genomes give better results than those given by the whole DNA sequences and all protein-coding DNA sequences. We therefore evaluated the three distances on the amino acid sequences of all protein-coding genes from genomes for *K* = 3, 4, 5 and 6. We found that the trees using the new distance metrics have the same topology as the trees using the old “distance” for the same value of *K*, and that the topologies of the trees for *K* = 5 and 6 are the same and the best. Here we present the tree for *K* = 6 using the dynamical language approach with chord distance in Figure 5. As shown in Figure 5, Rhizobiales (Bartonellaceae, Brucellaceae, Rhizobiaceae and Phyllobacteriaceae) (A), Rhizobiales (Bradyrhizobiaceae) (B), Rickettsiales (Rickettsiaceae and Anaplasmataceae) (C), Rhodospirillales (D), Sphingomonadales (E) and Rhodobacterales (Rhodobacteraceae) (F) each group into the correct branch. Even inside each lineage (groups A to F), our phylogenetic topology is more similar to that of the ML reference tree (the right side tree in Figure 1 of [31]) than that obtained by the MSM distance (the best result in [31]). Comparing our Figure 5 with the tree obtained using the KW distance (*i.e.*, the Markov model in [11]) (the tree in Figure 3 of [31]), we find that our dynamical language model performs much better than the KW distance.

Neither the normalization of the distances nor the choice of *ρ* ≥ 3 has a noticeable effect. Using the proposed distance metrics, we compared the trees before and after normalization and found that the topology of the trees is the same. We then set *ρ* = 4, 6, 8 and 10 and obtained trees with the same topology as the tree for *ρ* = 3.
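One possible way to understand this observation is sketched below. It is not an argument made in the paper, and it rests on two assumptions: that the piecewise distance for distinct genomes is $(\rho - C)/\rho$, and that trees are built with the standard neighbour-joining selection criterion $Q$. Under these assumptions the piecewise distance is an affine function of the old “distance” $D_r$, and an affine change of all pairwise distances does not change which pair NJ joins first. Only the first joining step is worked out here; later steps need a slightly more general form of the same calculation.

```latex
\begin{align*}
% Piecewise distance as an affine function of D_r (for C \neq 1):
D_{\text{piecewise}} &= \frac{\rho - C}{\rho}
  = \frac{\rho - 1}{\rho} + \frac{2}{\rho}\, D_r,
  \qquad D_r = \frac{1 - C}{2}. \\
% NJ joins the pair (i, j) of the n taxa that minimises
Q(i, j) &= (n - 2)\, d_{ij} - \sum_{k \neq i} d_{ik} - \sum_{k \neq j} d_{jk}. \\
% If every off-diagonal distance is transformed as d' = a d + c with a > 0:
Q'(i, j) &= a\, Q(i, j) + (n - 2)\, c - 2 (n - 1)\, c
  = a\, Q(i, j) - n c.
\end{align*}
```

Since $a > 0$ and $-nc$ is the same for every pair, $Q'$ and $Q$ are minimised by the same pair. Normalization (division by 2) is the special case $c = 0$, $a = 1/2$, and changing $\rho$ only changes $a = 2/\rho$ and $c = (\rho - 1)/\rho$. The chord distance, by contrast, depends on $D_r$ through a square root, which may be why its trees are only similar to, and not always identical with, the old ones.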

## 5. Conclusions

We proposed two new mathematically proper distance metrics based on the lengths of the chords constructed from unit vectors and on proportions of the sample correlation function of unit vectors to replace the pseudo-distance in the dynamical language approach [5]. The results showed improvements with this replacement from a biological perspective. These results confirm their usefulness in phylogenetic analysis.

## Acknowledgments

The authors would like to thank Bailin Hao in T-Life Research Center of Fudan University for pointing out the distance problem and useful discussion. They also wish to thank the Editor and the Reviewers for their insights, comments and suggestions to improve the paper. This research was supported by the Chinese Program for New Century Excellent Talents in University grant NCET-08-0686 and the Fok Ying Tung Education Foundation grant 101004 (Z.-G. Yu), the Australian Research Council (grant no. DP0559807) (V. Anh).

## References

**Multidisciplinary Digital Publishing Institute (MDPI)**
