
Computational Biology Branch

 

 


  

 

Teresa M. Przytycka’s research group

Algorithmic and Graph Theoretical methods in

Computational and Systems Biology

 

 

 

Hierarchical clustering of homology relations

 

Group members:

Raja Jothi

Elena Zotenko

Teresa M. Przytycka

 

Collaborators:

Asba Tasneem

 

Reference:

COCO-CL: hierarchical clustering of homology relations based on evolutionary correlations

Raja Jothi, Elena Zotenko, Asba Tasneem, and Teresa M. Przytycka

Bioinformatics. 2006 Apr 1;22(7):779-88.


 

Motivation: Determining orthology relations among genes across multiple genomes is an important problem in the post-genomic era. Identifying orthologous genes can not only help predict functional annotations for newly sequenced or poorly characterized genomes, but can also help predict new protein-protein interactions. Unfortunately, determining orthology relations through computational methods is not straightforward due to the presence of paralogs. Traditional approaches have relied on pairwise sequence comparisons to construct graphs, which were then partitioned into putative clusters of orthologous groups. These methods do not attempt to preserve the non-transitivity and hierarchic nature of the orthology relation.

Results: We propose a new method, COCO-CL, for hierarchical clustering of homology relations and identification of orthologous groups of genes. Unlike previous approaches, which are based on pairwise sequence comparisons, our method explores the correlation of evolutionary histories of individual genes in a more global context. COCO-CL can be used as a semi-independent method to delineate the orthology/paralogy relation for a refined set of homologous proteins obtained using a less-conservative clustering approach, or as a refiner that removes putative out-paralogs from clusters computed using a more inclusive approach. We analyze our clustering results manually, with support from literature and functional annotations. Since our orthology determination procedure does not employ a species tree to infer duplication events, it can be used in situations when the species tree is unknown or uncertain.

 

Data files

  • cococlOnCOGs.txt - This file contains the results from one iteration of COCO-CL on the 4,873 manually curated COGs. Each line in this file is of the format COG#, #Proteins in cluster 1 (#species represented in cluster 1), Proteins in cluster 2 (#species represented in cluster 2), #common set of species represented in cluster 1 and 2, clustering bootstrap score alpha, putative duplication confidence score sigma.
  • inclusiveCOGs.txt - This file contains the COGs that COCO-CL predicts to be inclusive (that is, to contain out-paralogs). COCO-CL predicts a COG to be inclusive if and only if the clustering bootstrap score (alpha) >= 0.75 and the confidence score (or split-score) >= 0.5. This file contains a total of 749 COGs.
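
The inclusiveness rule stated above can be sketched in a few lines of Python. This is only an illustration of the two thresholds, not the COCO-CL implementation. It assumes that the confidence score (split-score) is the sigma value reported in cococlOnCOGs.txt and that the scores have already been read into memory. The names CogSplit, is_inclusive and inclusive_cogs, and the COG identifiers in the example, are made up for this sketch.

```python
from dataclasses import dataclass

# Thresholds as stated in the description of inclusiveCOGs.txt.
ALPHA_MIN = 0.75  # clustering bootstrap score (alpha)
SIGMA_MIN = 0.5   # confidence score / split-score (assumed here to be sigma)


@dataclass
class CogSplit:
    """Scores for one COG split (hypothetical container, not a COCO-CL type)."""
    cog_id: str
    alpha: float
    sigma: float


def is_inclusive(split: CogSplit) -> bool:
    """Return True when both scores reach their thresholds."""
    return split.alpha >= ALPHA_MIN and split.sigma >= SIGMA_MIN


def inclusive_cogs(splits):
    """Return the IDs of the splits that satisfy the rule, in input order."""
    return [s.cog_id for s in splits if is_inclusive(s)]


if __name__ == "__main__":
    # Placeholder records, not taken from the data files.
    example = [
        CogSplit("COG_A", alpha=0.90, sigma=0.60),
        CogSplit("COG_B", alpha=0.80, sigma=0.40),
        CogSplit("COG_C", alpha=0.70, sigma=0.90),
    ]
    print(inclusive_cogs(example))
```

Both comparisons are inclusive (>=), so a COG whose scores sit exactly on the thresholds is still reported as inclusive.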

 

Download COCO-CL



