Bldg. 38A, new NCBI library
Tuesday, March 30, 2004, 11 AM

What's important in life? Connections between measures of genome evolution and biological quantities

Eugene V. Koonin, NCBI/NLM/NIH

In the last few years, we have witnessed the advent of multiple complete genome sequences and, in parallel, of genome-wide data on biological properties of gene ensembles, such as knockout effect, expression level, protein-protein interactions, and more. Inevitably, several groups, including ours, have made numerous attempts to examine connections between these properties and quantitative measures of gene evolution. The questions asked are fundamental and interesting, for example: What determines the effect of the knockout of a given gene on the phenotype (in particular, whether the gene is essential or not) and the rate of the gene's evolution? And how are the phenotypic effect and the evolutionary rate connected? And more questions like that.

The best succinct description of the results of the genome-wide studies addressing these questions seems to be: it is a mess. Many tantalizing correlations have been detected and made into a big deal (or not). For example, a positive correlation was detected between the tendency of a gene to be lost during evolution and its sequence evolution rate, and negative correlations were noticed between each of these measures of evolutionary variability and phenotypic effect or expression level. Simply put, genes associated with a major phenotypic effect tend to be highly expressed, form many protein-protein interactions, and evolve slowly, however that is measured. These connections seem to meet the expectations based on common sense.

What is quite unexpected, however, is that, although many of the observed correlations are statistically highly significant (mostly thanks to the large number of data points), they are typically rather weak. Linear correlation coefficients in the range of 0.1-0.2, which explain only a small fraction of the scatter in the data, are typical.
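As a rough illustration of how a correlation can be weak and highly significant at the same time, the following minimal Python sketch computes the fraction of variance explained (the square of the linear correlation coefficient) and an approximate significance level for coefficients of 0.1 and 0.2. The sample size of 5,000 genes is an assumed, hypothetical figure, not a number from the studies discussed here; the function names are illustrative; and the p-value uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math


def r_squared(r):
    """Fraction of the variance explained by a linear correlation r."""
    return r * r


def t_statistic(r, n):
    """t statistic for testing a Pearson correlation r against zero with n points."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation (assumes large n)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


if __name__ == "__main__":
    n_genes = 5000  # hypothetical number of genes, chosen only for illustration
    for r in (0.1, 0.2):
        t = t_statistic(r, n_genes)
        p = approx_two_sided_p(t)
        print(f"r = {r:.1f}: variance explained = {r_squared(r):.0%}, "
              f"t = {t:.2f}, approximate p = {p:.1e}")
```

Under these assumptions, the coefficients account for only about 1% to 4% of the variance, yet the corresponding p-values are vanishingly small because of the large number of data points.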
Such weak correlations make detecting real causal relationships a serious challenge.

In this talk, I will give an overview of the reported genomic correlations; present in some detail a few examples explored in our group; describe preliminary attempts to separate the wheat from the chaff using multivariate statistical analysis; and outline a generalized approach, which is also very preliminary. I will introduce the notion of the "social status" (or, simply, "importance") of a gene in the genome-wide community and classify the variables with respect to their positive or negative correlation with this status. I will also argue that, despite the complexity of the emerging web of relationships, there is a clearly discernible pattern, and deviations from it point to either problems with the data or biologically interesting phenomena.

Joint work with Yuri Wolf, with thanks to Liran Carmel, I. King Jordan, Fyodor Kondrashov, Dmitry Krylov, and Igor Rogozin.