Syst Biol. 2013 May 1;62(3):456-66. doi: 10.1093/sysbio/syt011. Epub 2013 Feb 15.

Resolving ambiguity of species limits and concatenation in multilocus sequence data for the construction of phylogenetic supermatrices.

Author information

Department of Entomology, Natural History Museum, London SW7 5BD, UK.


Public DNA databases are becoming too large and too complex for manual methods to generate phylogenetic supermatrices from multiple gene sequences. Delineating the terminals based on taxonomic labels is no longer practical because species identifications are frequently incomplete and gene trees are incongruent with Linnaean binomials, which results in uncertainty about how to combine species units among unlinked loci. We developed a procedure that minimizes the problem of forming multilocus species units in a large phylogenetic data set using algorithms from graph theory. An initial step established sequence clusters for each locus that broadly correspond to the species level. These clusters frequently include sequences labeled with various binomials and specimen identifiers that create multiple alternatives for concatenation. To choose among these possibilities, we minimize taxonomic conflict among the species units globally in the data set using a multipartite heuristic algorithm. The procedure was applied to all available GenBank data for Coleoptera (beetles) including > 10 500 taxon labels and > 23 500 sequences of 4 loci, which were grouped into 11 241 clusters or divergent singletons by the BlastClust software. Within each cluster, unidentified sequences could be assigned to a species name through the association with fully identified sequences, resulting in 510 new identifications (13.9% of total unidentified sequences) of which nearly half were "trans-locus" identifications by clustering of sequences at a secondary locus. The limits of DNA-based clusters were inconsistent with the Linnaean binomials for 1518 clusters (13.5%) that contained more than one binomial or split a single binomial among multiple clusters. By applying a scoring scheme for full and partial name matches in pairs of clusters, a maximum weight set of 7366 global species units was produced. 
Varying the weights for partial matches had little effect on the number of units, but disallowing partial matches altogether increased the number greatly. The resulting supermatrices generally produced tree topologies in good agreement with the higher taxonomy of Coleoptera, with fewer terminals than trees generated by standard filtering of sequences using species labels. The study illustrates a strategy for assembling the tree-of-life from an ever more complex primary database.
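
The idea of scoring name matches between pairs of clusters and then choosing a high-weight set of species units can be illustrated with a minimal sketch. This is not the authors' multipartite heuristic: it is a simplified greedy pairing for two loci only, and the weights (1.0 for a full binomial match, 0.5 for a genus-only match), the function names, and the toy cluster labels are all assumptions made for illustration.

```python
from itertools import product

FULL_MATCH = 1.0     # assumed weight: identical genus and species
PARTIAL_MATCH = 0.5  # assumed weight: same genus, different species


def name_score(a, b, partial=PARTIAL_MATCH):
    """Score two binomials given as (genus, species) tuples."""
    if a == b:
        return FULL_MATCH
    if a[0] == b[0]:
        return partial
    return 0.0


def cluster_score(names_a, names_b, partial=PARTIAL_MATCH):
    """Best name match between the label sets of two clusters."""
    return max(
        (name_score(a, b, partial) for a, b in product(names_a, names_b)),
        default=0.0,
    )


def greedy_units(locus_a, locus_b, partial=PARTIAL_MATCH):
    """Greedily pair clusters of two loci, highest-scoring pairs first.

    Clusters left unpaired are returned as single-locus units.
    """
    scored = sorted(
        (
            (cluster_score(na, nb, partial), ia, ib)
            for (ia, na), (ib, nb) in product(locus_a.items(), locus_b.items())
        ),
        key=lambda t: (-t[0], t[1], t[2]),  # deterministic tie-breaking
    )
    used_a, used_b, units = set(), set(), []
    for score, ia, ib in scored:
        if score > 0 and ia not in used_a and ib not in used_b:
            units.append((ia, ib, score))
            used_a.add(ia)
            used_b.add(ib)
    units += [(ia, None, 0.0) for ia in locus_a if ia not in used_a]
    units += [(None, ib, 0.0) for ib in locus_b if ib not in used_b]
    return units


# Toy data with placeholder names: cluster id -> set of binomial labels.
locus_a = {"a1": {("Aus", "bus")}, "a2": {("Aus", "cus")}}
locus_b = {
    "b1": {("Aus", "bus"), ("Aus", "dus")},
    "b2": {("Aus", "eus")},
    "b3": {("Xus", "yus")},
}

with_partial = greedy_units(locus_a, locus_b)
without_partial = greedy_units(locus_a, locus_b, partial=0.0)
print(len(with_partial), len(without_partial))
```

On this toy input, allowing partial matches links cluster a2 to b2 through the shared genus, whereas setting the partial weight to zero leaves both as separate units. This mirrors, in miniature, the reported increase in the number of units when partial matches are disallowed.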

[Indexed for MEDLINE]
