## 2 METHODS

The `phangorn` package interacts with several other R packages, especially the `ape` package (Paradis *et al.*, 2004). From `ape`, `phangorn` inherits the tree format (class `phylo`, which has become a standard), which allows use of the excellent plotting facilities within `ape`. `phangorn` defines its own data format to store character sequences, but offers functions to convert to and from the formats of other packages (`ape` and `seqinr`) and common R data structures (`data.frame` and `matrix`). The data format is kept very general, allowing nucleotides (DNA, RNA), amino acids and general character states defined by the user. For example, it is easy to define a format for nucleotide data where gaps are coded as a fifth state, or a format for binary data. All the ML and MP functions described below can handle these general character states.
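As a minimal sketch of the user-defined character states described above, the following lines code a gap as a fifth state. The toy matrix and the object names (`aln`, `dna`, `gap5`) are hypothetical. The use of `phyDat` with `type = "USER"` and a `levels` argument follows the `phangorn-specials` vignette and should be checked against the installed version.

```r
library(phangorn)

# Hypothetical toy alignment: three taxa, six sites, "-" marks a gap.
aln <- matrix(c("a", "c", "g", "t", "-", "a",
                "a", "c", "g", "-", "-", "a",
                "a", "t", "g", "t", "c", "a"),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("t1", "t2", "t3"), NULL))

# Default DNA coding: a gap is treated as an ambiguous (unknown) state.
dna <- phyDat(aln, type = "DNA")

# User-defined coding: the gap becomes a fifth character state.
gap5 <- phyDat(aln, type = "USER", levels = c("a", "c", "g", "t", "-"))

# Conversion to other common formats.
chr <- as.character(dna)  # plain character matrix
bin <- as.DNAbin(dna)     # DNAbin class of the ape package
```

Objects such as `gap5` can then be passed to the MP and ML functions in the same way as ordinary nucleotide data.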

MP is an optimality criterion under which the preferred tree is the one that requires the fewest changes to explain the data. In `phangorn`, the Fitch and Sankoff algorithms are available to compute the parsimony score. For heuristic tree searches the parsimony ratchet (Nixon, 1999) is implemented. Parsimony-based indices such as the consistency and retention indices, as well as the inference of ancestral sequences, are also provided.
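The parsimony functions mentioned above might be used as in the following sketch. It assumes the `Laurasiatherian` example alignment shipped with `phangorn`. The object names and the small iteration limits passed to `pratchet` are illustrative choices, not recommended settings.

```r
library(phangorn)

data(Laurasiatherian)                       # example alignment shipped with phangorn
start <- NJ(dist.hamming(Laurasiatherian))  # starting tree

# Parsimony score of the starting tree under both algorithms;
# with the default cost matrix the two scores should agree.
score_fitch   <- parsimony(start, Laurasiatherian, method = "fitch")
score_sankoff <- parsimony(start, Laurasiatherian, method = "sankoff")

# Heuristic search with the parsimony ratchet (iteration limits kept small here).
mp_tree <- pratchet(Laurasiatherian, start = start,
                    maxit = 50, minit = 5, k = 5, trace = 0)

# Consistency and retention indices of the starting tree.
ci <- CI(start, Laurasiatherian)
ri <- RI(start, Laurasiatherian)
```

Because the result of `pratchet` is an ordinary tree object, it can be plotted or scored with `parsimony` like any other tree.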

The ML function `pml` returns an object of class `pml` containing all the information about the model, the tree and the data. The function `optim.pml` optimizes the tree topology, the edge lengths and all model parameters (e.g. rate matrices or base frequencies). The speed and accuracy of phylogenetic reconstruction by ML are comparable to PhyML (Guindon and Gascuel, 2003) using nearest neighbor interchange (NNI) rearrangements (see Supplementary Materials). As the results are stored in memory, it is possible to further investigate, plot or summarize these objects. The following lines compute and display a phylogenetic tree based on the data of Rokas *et al.* (2003) using a *GTR* + Γ(4) + *I* model (Kelchner and Thomas, 2007):

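```r
library(phangorn)

data(yeast)                                 # alignment of Rokas et al. (2003)
tree <- NJ(dist.logDet(yeast))              # neighbor-joining starting tree
fit <- pml(tree, yeast, k = 4, inv = 0.2)   # likelihood object with Gamma(4) + I
# optimize topology (NNI), gamma shape, invariant sites and GTR parameters
fit <- optim.pml(fit, optNni = TRUE,
                 optGamma = TRUE, optInv = TRUE, model = "GTR")
BS <- bootstrap.pml(fit, optNni = TRUE)     # bootstrap replicates
plotBS(fit$tree, BS, type = "phylogram")    # tree annotated with bootstrap support
```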

For nucleotide data all models implemented in ModelTest (Posada, 2008) are available (e.g. `"JC"` or `"GTR"`). Moreover, any reversible model can be specified by the user for different character states. For amino acids, the most common rate matrices are provided, e.g. WAG (Whelan and Goldman, 2001) or LG (Le and Gascuel, 2008). Rate matrices can also be estimated; for instance, Mathews *et al.* (2010) used the function `optim.pml` to infer a phytochrome amino acid transition matrix. Several methods are implemented to compare ML models, for example likelihood ratio tests, AIC or BIC as in ModelTest, and the SH-test (Shimodaira and Hasegawa, 1999).
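A comparison of two nested models could look like the following sketch, which is modelled on the workflow of the `Trees` vignette. It assumes the `Laurasiatherian` example alignment shipped with `phangorn`. The object names and the reduced number of SH-test replicates (`B = 1000`) are illustrative.

```r
library(phangorn)

data(Laurasiatherian)
tree <- NJ(dist.ml(Laurasiatherian))
quiet <- pml.control(trace = 0)   # suppress progress output

# Simple model: Jukes-Cantor, edge lengths optimized.
fit_jc <- optim.pml(pml(tree, Laurasiatherian), control = quiet)

# Richer model on the same topology: GTR + Gamma(4) + I.
fit_gtr <- update(fit_jc, k = 4, inv = 0.2)
fit_gtr <- optim.pml(fit_gtr, model = "GTR", optGamma = TRUE, optInv = TRUE,
                     control = quiet)

lrt <- anova(fit_jc, fit_gtr)                 # likelihood ratio test
aic <- c(JC = AIC(fit_jc), GTR = AIC(fit_gtr))  # information criterion per model
sh  <- SH.test(fit_jc, fit_gtr, B = 1000)     # SH-test
```

The package also provides `modelTest`, which fits a set of candidate models in one call. Its output should be compared with the manual fits above before relying on it.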

As `phangorn` is implemented in the high-level language R it is easy to extend the general ML framework. `phangorn` also contains mixture models (Pagel and Meade, 2004) and partition models. The function `pmlPart` allows estimation of partitioned ML models and has a flexible yet simple formula interface. For example, the command `pmlPart(edge + Q ~ rate + bf, fit)` specifies which parameters are optimized in each partition individually (here the rate parameter and the base frequencies) or for all partitions together (the edge weights of the tree and rate matrix Q).
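A partitioned analysis might be set up as in the sketch below, adapted from the example in the `pmlPart` help page as we recall it. It assumes that the `yeast` data carry an `index` attribute mapping sites to genes, which should be verified against the installed package version. Only the first 10 genes and a simpler formula than the one quoted above are used, to keep the run short.

```r
library(phangorn)

data(yeast)
tree <- NJ(dist.logDet(yeast))
quiet <- pml.control(trace = 0)   # suppress progress output
fit <- optim.pml(pml(tree, yeast), control = quiet)

# Site-pattern weights per gene (assumption: attr(yeast, "index") maps
# each site to its pattern index and its gene). First 10 genes only.
weight <- xtabs(~ index + genes, attr(yeast, "index"))[, 1:10]

# Left-hand side: parameters shared by all partitions (edge lengths).
# Right-hand side: parameters estimated per partition (rate).
part <- pmlPart(edge ~ rate, fit, weight = weight, control = quiet)
part
```

Moving a term from one side of the formula to the other switches that parameter between being shared and being estimated per partition.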

`phangorn` eases the analysis of splits. For instance, the Hadamard conjugation (Hendy, 2005) is a helpful tool to analyze relations between observed sequence patterns (spectra) and edge weights. The edge weight spectra can be constructed from DNA or binary data or from a distance matrix. These spectra can be visualized using a Lento plot (Lento *et al.*, 1995) to present the supporting and conflicting signals for the splits of a dataset. Splits can easily be exported to SpectroNet (Huber *et al.*, 2002) or SplitsTree (Huson and Bryant, 2006) and visualized as a network.
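The distance-based route to a Lento plot can be sketched as follows. The choice of Hamming distances and the temporary output file are illustrative assumptions. The functions `distanceHadamard`, `lento` and `write.nexus.splits` are used as we understand them from the package documentation.

```r
library(phangorn)

data(yeast)

# Edge weight spectrum from a distance matrix via Hadamard conjugation.
dm  <- as.matrix(dist.hamming(yeast))
spl <- distanceHadamard(dm)

# Lento plot of supporting and conflicting signal for each split.
lento(spl)

# Export the splits in NEXUS format for split-network software.
out <- tempfile(fileext = ".nex")   # hypothetical output location
write.nexus.splits(spl, file = out)
```

For sequence-based spectra the package offers analogous functions that work directly on the alignment; their use follows the same pattern.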

Figure: Lento plot of the edge weights from sequence spectrum for the data of Rokas *et al.* (2003). On the *x*-axis the splits or edges are represented by the dots overlying the graph. The bars above the axis indicate the edge weights or the support of a split, bars ...

`phangorn` is distributed with two tutorials. The first explains how to perform phylogenetic analysis (in R type `vignette("Trees")`) and the second, `vignette("phangorn-specials")`, shows how to define data with general character states and to estimate rate matrices for those states. `phangorn` depends only on other R packages which are also available from the CRAN repository and is portable to run on different operating systems. Since `phangorn` is written in R, results can be easily extended and further processed using the graphical and statistical capabilities of R.