Copyright © 2006 Biomedical Informatics Publishing Group

Incorporation of biological knowledge into distance for clustering genes

1Clinical Proteomics Center, University of Louisville, Louisville, KY 40202
2Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202
*Grzegorz M Boratyn, E-mail: greg.boratyn/at/louisville.edu; Corresponding author

Received December 9, 2006; Accepted January 20, 2007.

This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited.

Abstract

In this paper we propose a data based algorithm to marry existing biological knowledge (e.g., functional annotations of
genes) with experimental data (gene expression profiles) in creating an overall dissimilarity that can be used with any
clustering algorithm that uses a general dissimilarity matrix. We explore this idea with two publicly available gene
expression data sets and functional annotations, and compare the results with clustering results that use only the
experimental data. Although more elaborate evaluations might be called for, the present paper makes a strong case for
utilizing existing biological information in the clustering process.

Availability

Supplement is available at www.somnathdatta.org/Supp/Bioinformation/appendix.pdf

Keywords: knowledge, distance, clustering, genes, expression

Background

Clustering is routinely used in the exploratory phase of a microarray experiment. Genes are clustered using the pairwise correlation coefficients between two sets of
expression profiles as a measure of similarity or closeness. With the growing annotation databases, it is perhaps wise to take advantage of the functional class information of the
annotated genes along with the experimental data in grouping genes. Unlike other approaches (e.g., [1]), the
purpose of this paper is to outline a general approach of modifying the distance itself. Thus, in a sense, we do not propose a new clustering algorithm: any classical
clustering algorithm can be applied based on the new distance (or dissimilarity) matrix. We illustrate this procedure using the agglomerative hierarchical clustering algorithm
UPGMA and the divisive hierarchical clustering algorithm DIANA applied to two gene expression data sets.

Methodology

Biological Information

The supplementary material is linked.

This new similarity metric (1) corresponds to distances used by semi-supervised clustering techniques in Machine Learning [2].
In our case the functional distance plays the role of the similarity-adapting function. Therefore the measurement distance is computed in the same fashion as
in the case of standard clustering of gene expressions, and the role of the functional distance is to alter similarities between the gene expressions so that the resulting clusters
are in agreement with the functional annotation of genes. Because one of the purposes of cluster analysis of gene expressions is prediction of previously unknown functions
of genes, it is desired that: 1) genes with similar functions appear in the same cluster, and 2) genes with unknown functions appear in clusters where the majority of genes have
known and similar functions. In order to satisfy these goals, the distance matrix needs to be altered so that: 1) distances between genes with similar functions are decreased, 2) distances between unannotated genes are increased, and 3) distances between genes with different functions are increased.
Thus the functional distance is composed of three parts that correspond to the above tasks:
The supplementary material is linked.

Decreasing the distance between genes with similar functions

An often overlooked difficulty in assessing gene similarity is the lack of clear structure in the public gene ontology databases. Some functions are more general than others
and some functions are sub-functions of others. As a solution to this problem we assume that genes are more similar if they share more functions. Let F be a binary
matrix that represents gene membership in the functional sets or lack of functional annotation:
The supplementary material is linked.
where α is a diagonal matrix of scaling coefficients for each functional set. The scaling coefficients α are introduced because distance variance may
vary across functional sets.

Increasing distance between unannotated genes

Because the purpose of gene expression clustering is to predict functions of genes not studied previously, it is desired that the unannotated genes are placed into clusters
composed primarily of genes with known functional annotations. Increasing the distances between unannotated genes changes the behavior of a clustering
algorithm: the annotated genes are clustered earlier and thus form the basis of functional clusters to which unannotated genes are later assigned. Because the only
accessible information about unannotated genes is their expression profiles, dF2 should not alter the relative distances between expressions of unannotated genes. Adding a
constant to the distance between each pair of unannotated genes will satisfy the goal without changing the properties of the data. The element modifying the distance matrix that
accomplishes this task is equal to:
The supplementary material is linked.
and β is a scaling factor that controls the magnitude of the increase of the distance between a pair of unannotated genes.

Increasing the distance between genes with different functions

The previous paragraph presents the modification of the distance matrix that increases distances between all pairs of unannotated genes so that clusters composed primarily of
genes with unknown functions are avoided. However, the set of unannotated genes may contain genes with similar functions that should be placed in the same clusters. The
modified distance matrix counteracts assigning two unannotated genes to one cluster. In the absence of functional information on unannotated genes, it is not
known how the distances between unannotated genes should be altered. Gene expressions and functional information of annotated genes provide the only accessible
knowledge about the unannotated genes. Thus the distances between annotated genes that have different functions will be increased so that unannotated genes with similar gene
expressions can be placed in the same cluster. The distance matrix is updated by:
The supplementary material is linked.

Selection of parameters

The updated distance matrix depends on the values of the parameters α, β, and γ. These are found by considering the constraints imposed on the distances between genes that lead to the formulation of
the functional distance. The mathematical details are provided in the supplement.

The supplementary material is linked.

The detailed derivation of the equations presented above is given in the supplemental document. Note that this constrained distance matrix controls the behavior of a clustering algorithm. The annotated genes that belong to the same functional sets are the closest to one another and thus
create the basis of clusters. Then genes with unknown functions are assigned to clusters created by annotated genes. Because (5) and (7) increase the distances between unannotated genes, and between genes
that do not share any function, by a constant, cluster assignment is delayed but still performed on the basis of gene expressions.

Results

Two illustrative examples of clustering gene expression data are included here. We compare the results of two distance-based clustering algorithms (UPGMA and DIANA) that utilize the
proposed distance matrix dM + dF, which includes prior functional information, with those of the respective clustering algorithms based on the distance matrix computed only from gene expressions,
dM. We measure the biological validity of clustering results and the distribution of functions in gene clusters. Two publicly available sets of gene expressions and functional annotations
obtained from public databases are used. The detailed description of the data sets and performance measures is presented below.

Data

Two publicly available sets of gene expressions, 1) Yeast time course cDNA and 2) Normal versus breast carcinoma, SAGE data, are utilized in our illustration.

Yeast time course cDNA data

This set of gene expressions was collected by Chu et al. and presented in [5]. This data set records expression profiles during
sporulation of Saccharomyces cerevisiae at seven time points. The original data set was filtered using the same criterion as in [5].
We consider a subset of 513 genes (ORFs, to be precise) that were overall positively expressed (i.e., Σ_time log(expression ratio) > 0). As in [4], the sets of functional classes were obtained using the web-based GO mining tool at
http://mips.gsf.de/proj/funcatDB/search_main_frame.html.
Overall, 342 of the 513 genes were annotated into the following sixteen functional classes: metabolism (121 genes), energy (25), cell cycle and DNA processing (140),
transcription (45), protein synthesis (9), protein fate (66), protein with binding function or cofactor requirement (73), protein activity regulation (15), transport (57), cell
communication (11), defense (36), interaction with environment (27), cell fate (11), development (11), biogenesis (69), cell differentiation (72).

Normal versus breast carcinoma, SAGE data

The second data set comes from the study presented in [6]. We illustrate our methods using the expression profiles of 258 genes (SAGE tags) that were judged to be significantly
differentially expressed at the 5% significance level between four normal and seven ductal carcinoma in situ (DCIS) samples. Abba et al. [6]
combined various normal and tumor SAGE libraries in the public domain with their own SAGE libraries and used a modified form of t-statistics to compute p-values. Further details can be obtained from
their paper and its supplementary website. The functional classes were constructed using a publicly available web tool called Amigo
(http://www.godatabase.org/cgi-bin/amigo/go.cgi). As in [4], a total of 113 SAGE tags were annotated into the following eleven classes of molecular
function based on their primary biological functions: cell organization and biogenesis (24), transport (7), cell communication (15), cellular metabolism
(48), cell cycle (6), cell motility (7), immune response (7), cell death (7), development (5), cell differentiation (5), cell proliferation (5).

Clustering algorithms and distance metric

The Unweighted Pair Group Method with Arithmetic mean (UPGMA) and Divisive Analysis (DIANA) were applied to the two illustrative data sets. The following expression
The supplementary material is linked.
The values of the parameters α, β, and γ utilized for
computation of the functional distance (2) for the yeast and SAGE data are presented in Tables 1 and 2, respectively. These were calculated using our algorithm stated in the
Methodology section (cf. (8)-(11)).
In order to demonstrate how the distances between genes change we selected six genes: YBL043W, YBR168W that belong to the biogenesis functional set, YGL210W that
belongs to both biogenesis and transport functional groups, YAL067C from transport group, and YAL018C, YBL010C both unannotated. The measurement distances between the listed genes are presented in Table 3. Note that YAL067C
that belongs to the transport group is, according to gene expressions, more similar to YBL043W and YBR168W that belong to biogenesis group than to YGL210W, which
also belongs to the transport group. Note also that the two unannotated genes YAL018C and YBL010C are more similar to each other than the genes from the biogenesis or the transport functional group. The functional dissimilarities between the selected six genes are shown in Table 4. The complete functional distance (2)
is not listed in Table 4 so that the change with
respect to dM can be observed. The parameter αk is equal to 0 for the biogenesis group and 0.0302 for the transport group. The similarities between the genes from the
biogenesis group are not changed. The distance between YGL210W and YAL067C (that belong to the transport group) is decreased and that between the unannotated genes
YAL018C and YBL010C is increased. The distances between annotated genes that do not share any function (YAL067C and YBL043W, as well as YAL067C and
YBR168W) are also increased. The resulting new distances between the selected genes are given in Table 5, where the distance between the unannotated genes YAL018C and YBL010C is larger than
that between genes that share a function (YBL043W, YBR168W, YGL210W), but smaller than that between annotated genes that do not share any function, such as YBL043W and YAL067C.

Performance measures

We compare the performance of the resulting clusterings with the following two measures: 1) distance from model profiles and 2) average proportion of functions in clusters. These
quantities are described below. A more extensive comparison along the lines of [7] or
[8] might be possible but is deemed to be beyond the scope of this paper.

Distance from model profiles

The distance from model profiles, proposed in [3], measures the biological validity of statistical clusters. Model profiles are
created from a small group of hand-selected genes that were available from the original studies and classified into biological classes as deemed appropriate by the biologists for
that particular experiment. The gene expressions averaged over each class create the model profiles. The averaged gene expressions are also calculated for each cluster, and the
distance between so created profiles and the model profiles is computed:
The supplementary material is linked.
The expression (12) for a dissimilarity was also used here. A smaller dist indicates that the resulting clusters are more similar to the model profiles and thus more biologically
valid. Datta and Datta [3] proposed using the model profiles as a benchmark for the results produced by a clustering algorithm.
In the original paper, Chu et al. [5] determined, on the basis of first induction of expression, that seven is the right number
of clusters to be used for grouping genes for this data set. In addition they created a model expression profile by using certain handpicked genes in each class. We use the same
number of clusters (K = 7) and the benchmark model profile. The genes used for construction of model profiles have no functional information assigned. The distance from model
profiles (13) was computed for the yeast data clustered with UPGMA and DIANA using dM and dM + dF as distance matrices. The resulting values of
dist are presented in Table 6.
The same performance measure was computed for the SAGE data set. The model profiles were composed of genes reported in [6],
whose deregulation is altered in the ductal carcinoma in situ stage of breast cancer. Three model clusters were created from the following functional classes: Cell cycle
(3 genes), Apoptosis (3), and Cytokines (4). The values of the distance from model profiles, computed for the SAGE data set clustered with UPGMA and DIANA are presented in Table 7.
Incorporation of the functional information into the distance (dissimilarity) matrix decreased the distance from model profiles in all but one case
(Table 6), indicating a closer
agreement with the selected profiles.

Average proportion of functions in clusters

The average proportion of functions in clusters assesses the ability of a clustering algorithm to group genes with similar biological functions into the same clusters. For a given
number of clusters K, the proportion of the largest group of genes with common biological function is found in each cluster. The performance measure is given by the average
proportions weighted by the number of elements in a cluster:
The supplementary material is linked.
A value of E(K) closer to 1 indicates that a majority of genes in the clusters belong to one functional set, and therefore denotes better clustering performance. Only the
genes with known biological functions are used for computation of (14): all genes are clustered, but only the annotated ones are used for performance
assessment. If a gene belongs to more than one functional set, it is counted when finding the proportion of each of those sets. Note that E(1) is the proportion of the largest functional group in the entire set of genes under consideration and is independent of the clustering algorithm.
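The verbal description of E(K) can be turned into a short computation. The sketch below follows the description in the text only; the exact expression (14) is in the supplement, so this should be read as one plausible implementation under stated assumptions. It assumes that the weight of a cluster is its number of annotated genes and that clusters without annotated genes are skipped. The function name `average_proportion` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def average_proportion(labels, annotations):
    """Weighted average, over clusters, of the share of the largest functional group.

    labels:      cluster label of each gene.
    annotations: set of functional classes of each gene (empty set = unannotated).
    """
    clusters = defaultdict(list)
    for label, funcs in zip(labels, annotations):
        if funcs:                          # only annotated genes are scored
            clusters[label].append(funcs)

    total = sum(len(members) for members in clusters.values())
    if total == 0:
        raise ValueError("no annotated genes to score")

    weighted_sum = 0.0
    for members in clusters.values():
        # A gene in several functional sets is counted once for each set.
        counts = Counter(f for funcs in members for f in funcs)
        proportion = max(counts.values()) / len(members)
        weighted_sum += len(members) * proportion   # weight = annotated genes in cluster
    return weighted_sum / total


# Toy example (made up): six genes, one of them unannotated.
annotations = [{"a"}, {"a", "b"}, {"b"}, {"c"}, set(), {"c"}]
e_two = average_proportion([0, 0, 0, 1, 1, 1], annotations)
e_one = average_proportion([0, 0, 0, 0, 0, 0], annotations)
print(e_two, e_one)
```

Evaluating this function for a range of K and plotting the values against K gives the kind of curve used to compare the clustering results.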
A plot of E(K) vs. K can be used to compare the effectiveness of clustering algorithms. Generally speaking, a rapidly increasing curve reaching values close to 1
would indicate better clustering results. The average proportion of functions in clusters (14), computed for the yeast gene expressions clustered with the UPGMA and DIANA algorithms, is presented in
Figs. 1
The average proportion of functions in clusters was also computed for the SAGE data set. The resulting E(K) for several numbers of clusters produced with UPGMA and
DIANA is presented in Figs. 3
Discussion

Although somewhat limited in nature, our studies make a strong case for using semi-supervised clustering, one that merges existing biological
knowledge with experimental data in grouping genes, whenever possible. A penultimate stage of this approach is available in [9]. The
present approach has at least two distinct advantages over previous approaches [1,9]:
it offers a one-step algorithm that determines the appropriate modifications for various categories of genes in an automatic and data-based fashion. In addition, since it just modifies the
distance (or dissimilarity) matrix, it can be used in conjunction with any dissimilarity-based clustering technique. Furthermore, unlike the approaches
presented in [1] and [9], we provide an analytical as well as
computationally inexpensive procedure for parameter selection.

Acknowledgments

This research was supported by grants from the National Science Foundation (MCB-0517135) and the National Security Agency (H98230-06-1-0062).

Footnotes

Citation: Boratyn et al., Bioinformation 1(10): 396-405 (2007)

References

1. Huang D, Pan W. Bioinformatics. 2006;22:1259.
2. Grira N, et al. A Review of Machine Learning Techniques for Processing Multimedia Content. 2005. Report of the MUSCLE European Network of Excellence.
3. Datta S, Datta S. Bioinformatics. 2003;19:459.
4. Datta S, Datta S. BMC Bioinformatics. 2006;7:397.
5. Chu S, et al. Science. 1998;282:699.
6. Abba MC, et al. Breast Cancer Res. 2004;6:R499.
7. Handl J, et al. Bioinformatics. 2005;21:3201.
8. Pihur V, et al. Weighted rank aggregation of cluster validation measures: A Monte Carlo cross-entropy approach. 2006 preprint.
9. Boratyn GM, et al. Proc of the 28th IEEE EMBS Annual International Conference; 2006. p. 5515.