HOX genes and topographic differentiation. (A) Hierarchical clustering of fibroblast cultures based solely on expression of genes encoding homeodomain proteins reproduces the clustering by site of origin. Of 88 homeodomain-containing genes on the array, 51 were considered well measured, as indicated by a reference channel intensity ≥ 1.5-fold over background and at least 80% informative data. Hierarchical clustering was performed with these 51 genes, and the result is displayed in the same format as in Fig. 1. Scale is the same as in Fig. 1C. (B) Statistical significance of topographic clustering by homeobox genes. The 45 arrays were clustered on the 51 homeobox genes identified above by using Partitioning Around Medoids (PAM) with k = 6 clusters (see Materials and Methods). The sites of origin of the fibroblast samples (abdominal skin, arm, fetal buttock/thigh, fetal lung, foreskin, toe, and gum) were taken as the reference grouping of six clusters. The similarity score comparing the PAM clustering to the known site of origin is 36 of a maximum of 45. To assess the statistical significance of the similarity score, 5,000 sets of 51 random genes from a data set of 19,081 genes filtered as in A were subjected to the same analysis, and the histogram of the similarity scores is shown. The median of the 5,000 similarity scores is shown in blue (21 of 45). None of the 5,000 trials achieved a score of 36; thus the empirical P value is 0/5,000 (P < 0.0002). (C) Robustness of topographic clustering. The same analysis as in B was carried out for 500 random subsets of 10, 20, 30, 40, or 50 homeobox genes. The distribution of the similarity scores is summarized by using boxplots. The central box in each plot represents the interquartile range (IQR), which is defined as the difference between the 75th and 25th percentiles. The line in the middle of the box represents the median.
Extreme values more than 1.5 IQR above the 75th percentile or more than 1.5 IQR below the 25th percentile were plotted individually. Site identity was reasonably recovered with as few as 10 homeobox genes, which is better than with random subsets of 51 genes (compare with the median score of 21 in B).
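The scoring and significance test described in B can be illustrated with a minimal sketch. The legend does not define the similarity score (it defers to Materials and Methods), so the definition used here is an assumption: the largest number of samples whose cluster agrees with the site of origin under any one-to-one relabeling of the clusters. The function names `similarity_score` and `empirical_p_value` are hypothetical, and the PAM step itself is not reproduced; the cluster labels are taken as given.

```python
from itertools import permutations


def similarity_score(predicted, reference):
    """Assumed definition: maximum number of samples whose predicted cluster
    matches the reference group, over all one-to-one cluster relabelings."""
    pred_labels = sorted(set(predicted))
    ref_labels = sorted(set(reference))
    # Pad with None so that surplus predicted clusters can map to "no group".
    size = max(len(pred_labels), len(ref_labels))
    ref_padded = ref_labels + [None] * (size - len(ref_labels))

    # Contingency counts: samples in predicted cluster p and reference group r.
    counts = {}
    for p, r in zip(predicted, reference):
        counts[(p, r)] = counts.get((p, r), 0) + 1

    best = 0
    # Brute force is fine for about six clusters (a few thousand mappings).
    for mapping in permutations(ref_padded, len(pred_labels)):
        agree = sum(counts.get((p, r), 0) for p, r in zip(pred_labels, mapping))
        best = max(best, agree)
    return best


def empirical_p_value(observed, null_scores):
    """Fraction of random-gene-set scores that reach the observed score."""
    return sum(s >= observed for s in null_scores) / len(null_scores)


# Toy data (not from the study): six samples, three clusters.
toy_clusters = [0, 0, 1, 1, 2, 2]
toy_sites = ["arm", "arm", "toe", "toe", "gum", "toe"]
toy_score = similarity_score(toy_clusters, toy_sites)
```

In this scheme, the reported result corresponds to `empirical_p_value(36, null_scores)` returning 0 when `null_scores` holds the 5,000 scores from random 51-gene sets. With no trial reaching the observed score, the estimate is bounded by the number of trials, that is, P < 1/5,000.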