Copyright © 2008 Lesne and Benecke; licensee BioMed Central Ltd.

Feature context-dependency and complexity-reduction in probability landscapes for integrative genomics

1 Institut des Hautes Études Scientifiques, Bures-sur-Yvette, France
2 Institut de Recherche Interdisciplinaire – CNRS USR3078 – Université Lille I, France

Corresponding authors: Annick Lesne: lesne/at/ihes.fr; Arndt Benecke: arndt/at/ihes.fr

Received June 27, 2008; Accepted September 10, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

The question of how to integrate heterogeneous sources of biological information into a coherent framework that allows the gene regulatory code in eukaryotes to be systematically investigated is one of the major challenges faced by systems biology. Probability landscapes, which include as reference set the probabilistic representation of the genomic sequence, have been proposed as a possible approach to the systematic discovery and analysis of correlations amongst initially heterogeneous and un-relatable descriptions and genome-wide measurements. Much of the available experimental sequence and genome activity information is de facto, but not necessarily obviously, context dependent. Furthermore, the context dependency of the relevant information is itself dependent on the biological question addressed. It is hence necessary to develop a systematic way of discovering the context-dependency of functional genomics information in a flexible, question-dependent manner.

Results

We demonstrate here how feature context-dependency can be systematically investigated using probability landscapes.
Furthermore, we show how different feature probability profiles can be conditionally collapsed to reduce the computational and formal, mathematical complexity of probability landscapes. Interestingly, the possibility of complexity reduction can be linked directly to the analysis of context-dependency.

Conclusion

These two advances in our understanding of the properties of probability landscapes not only simplify subsequent cross-correlation analysis in hypothesis-driven model building and testing, but also provide additional insights into the biological gene regulatory problems studied. Furthermore, insights into the nature of individual features and a classification of features according to their minimal context-dependency are achieved. The formal structure proposed contributes to a concrete and tangible basis for attempting to formulate novel mathematical structures for describing gene regulation in eukaryotes on a genome-wide scale.

Background

The deciphering of the gene regulatory code of eukaryotic cells and the inference of gene regulatory programs belong to the computationally "hard" problems that are very probably insoluble without using very large collections of experimental genome activity recordings under many different biological conditions in conjunction with empirical gene structure and function annotations [1-4]. Genomic sequence, gene structure and function annotation, as well as functional genomics experimental data, are of heterogeneous nature. In order to conceive computationally efficient algorithms capable of statistical integration of these different types of information, transformations of the different types of data into a continuous and homogeneous data structure have to be developed. We have recently proposed such a concept, which we refer to as probability landscapes [5].
Briefly, we have shown on theoretical grounds how any type of observable quantity (which we shall refer to hereafter as "feature") can, without loss of information, be transformed into a local probability with nucleotide resolution along the genome (creating what we define as a probability profile). For any feature, such as the predicted alpha-helicity of an inferred amino-acid sequence or the transcriptome of a cell recorded under a particular biological condition, such a local probability can be calculated for all nucleotides of the genome under study, resulting in a profile. If this procedure is repeated for many different features, a stack of probability profiles ("landscape") is obtained. While it might, at first sight, seem awkward to calculate a probability for every nucleotide in a genome to be part of an alpha-helix provided this nucleotide were part of an expressed codon, the advantage of translating any type of relevant experimental information into a homogeneous structure that can be used directly for statistical correlation analysis by far outweighs the apparent absurdity of having executed a secondary protein structure prediction algorithm on sequences that a priori are never even transcribed into RNA, let alone translated into protein. Furthermore, our information on transcribed sequences, for instance, is still incomplete (consider the recent discoveries related to microRNAs), and hence a complete, unbiased probability annotation is more coherent [5]. Interestingly, a probabilistic framework also alleviates the problem of the formally undefined cause and effect relationship in the case of intrinsic stochasticity in the noisy experimental data by introducing the notion of fuzziness into the mapping, a process referred to as conditioning.

The nature of biological experimentation imposes two general constraints that need to be taken into account especially in the field of functional genomics.
First, obviously, experimental information is never complete in that it is either a snapshot of a dynamic reality, obtained as a mean measurement over large numbers of objects, biased by experimental or conceptual priors, or, most often, a combination of all the above, leading to context-dependency of the results. Second, the measurement itself introduces a non-negligible, albeit to some extent controllable, bias leading to further context-dependency of functional genomics data. Moreover, biological systems themselves display a strong context-dependency which is notably the object of study in functional genomics/systems biology: it is the combination of molecules in a cell that creates a biological function; hence the activity of a single molecule is context dependent. Thus, context-dependency of features is relevant for the comprehension of stimuli-responses and signals.

Finally, context-dependency is itself question dependent. Consider the following example: whether or not a given cell is differentiated to some defined state requires investigation of the presence of state-specific gene products and functionalities and the concomitant absence of molecules and functions specific to other cell-states. It does not, however, require any knowledge about the time dependency of the changes in gene expression and cellular physiology. A time series of experiments conducted on a differentiating cell, in this case, can therefore be simply projected, eliminating the time-dimension in addressing the question. The projection thereby has an important advantage over a simple end-point comparison, as (i) intermediate events are not omitted from the analysis, and (ii) statistical power is improved. However, when one tries to infer gene regulatory circuits, the time dimension of the experimental data is of utmost importance, whereas for instance the estimates of absolute molecular species quantities are far less important.
Furthermore, the available genomic information can often be analyzed in a hierarchical manner. For certain biological questions it will not be important to have a detailed knowledge of feature probability profiles themselves but rather a more integrated, coarse-grained, combination of individual features. Ideally, by combining different features the set-theoretic conditioning can be turned into an unambiguous and well-defined cause and effect mapping. As studying different biological questions requires concomitant investigation of correlation and non-correlation, context-dependency and independency are similarly important. In conclusion, the very same set of information displays different context-dependencies as a function of the biological problem studied. We shall refer to this phenomenon from here on as "circumstantial context". We develop here a mathematical approach to the quantification and statistical significance testing of context dependency in functional genomics data using our previously developed probability landscape framework. As context-dependency is not an absolute but a relative quantity, a flexible approach depending on the biological problem studied has to be realized. We furthermore demonstrate how according to the circumstantial context even very large numbers of individual landscapes stemming from experimental recordings can be merged into a single, collapsed profile with greatly improved statistical properties. This procedure can therefore be used in a systematic and controlled manner to reduce the computational and formal complexity of probability landscapes. Increased algorithmic efficiency and statistical power result jointly with heightened understanding of the biological mechanisms. 
Results

Circumstantial probability profiles

Circumstantial context-dependency of functional genomics information creates important constraints, which need to be taken into consideration during statistical analysis, and at the same time provides additional knowledge on the biological question studied. We have recently proposed probability landscapes as a means to integrate any relevant type of functional genomics information coherently and systematically into a structurally homogeneous object that can more easily be analyzed computationally. Here we asked whether or not the proposed structure of probability landscapes also permits systematic detection, analysis, and utilization of context-dependencies.

Let X be an observable quantity under investigation, taking either discrete, possibly symbolic, or continuous values. We have shown how experimental information on X can be expressed in a homogeneous and universal way as a genome-wide probability profile [5]. Given the biological nature of the information (see Background), probability profiles thus de facto involve conditional probabilities: P(Xn = x|B) in case of a discrete-valued feature X or ρ(Xn = x|B)dx in case of a continuous-valued feature X. We shall use […]
In all that follows, we shall consider a discrete-valued feature X for the sake of simplicity, without loss of generality. Considering a continuous-valued feature requires only replacing ∑_{x∈χ} by ∫_χ dx.

Eliminating spurious conditioning, detecting essential ones

Considering the set of all the conditions that can be controlled or at least identified during the experiment, each feature will depend on some of these conditions whereas it will be independent of others (cf. Background). We thus want to determine for each biological question and each feature the subset of factors actually conditioning its probability landscape, and hence its effective context C(X). If Ci does not add any information on X, it does not belong to the context C(X). Conversely, the proposed analysis allows features to be grouped in different subsets according to their circumstantial context.

Finding the effective, thus minimal, context C(X) among the full conditionings of X ('minimax' entity) is a well-posed issue only in a hierarchical formulation: we have to investigate whether an additional condition C decreases the indeterminacy of X knowing B, and conversely whether data obtained under different conditions (B∧Cj)j can be grouped into a single condition B∧C, where C is the union of the conditions (Cj)j, or even into the single condition B if (Cj)j form a complete family, so that C in fact adds no additional prescription on B. This dual process can be iterated in both directions. The issue is thus to compare P(X|B) and P(X|B∧C) to see whether the additional prescription C on the experimental conditions adds constraints and information on X (knowing B) or not (Figure 2).

Divergence of probability profiles

At each genome location n, the probabilities P(Xn|B∧C) and P(Xn|B) are compared using the Kullback-Leibler divergence (see Methods). Note that it is meaningless to compare […] In the case that the feature probability profiles […]
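The local comparison of P(Xn|B∧C) with P(Xn|B) can be sketched in a few lines of Python. This is a minimal illustration, assuming a discrete-valued feature and the Kullback-Leibler divergence of P(Xn|B∧C) relative to P(Xn|B); the function names `kl_divergence` and `kl_profile`, and the representation of a profile as a list of per-position distributions, are hypothetical choices made for this example and are not taken from the original work.

```python
import math

def kl_divergence(p_bc, p_b):
    """Kullback-Leibler divergence D(p_bc || p_b) for one genome position.

    p_bc and p_b are sequences of probabilities over the same discrete
    values of X (assumed layout, for illustration only).
    """
    total = 0.0
    for q, p in zip(p_bc, p_b):
        if q == 0.0:
            # z ln z tends to 0 as z -> 0, so this value contributes nothing.
            continue
        if p == 0.0:
            # Absolute continuity is violated: P(X|B∧C)(x) > 0 but P(X|B)(x) = 0.
            raise ValueError("p_bc is not absolutely continuous w.r.t. p_b")
        total += q * math.log(q / p)
    return total

def kl_profile(profile_bc, profile_b):
    """Local divergence at every position n of two probability profiles."""
    return [kl_divergence(q, p) for q, p in zip(profile_bc, profile_b)]

# Two positions, binary feature: C adds nothing at position 0 and fully
# determines X at position 1.
profile_b = [[0.5, 0.5], [0.5, 0.5]]
profile_bc = [[0.5, 0.5], [1.0, 0.0]]
divergences = kl_profile(profile_bc, profile_b)
```

In this toy case the divergence is zero where the additional condition C leaves the distribution unchanged and equals ln 2 where C fully determines a binary feature that was uniformly distributed under B alone.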
Statistical significance testing

The Kullback-Leibler divergence thus provides a tool for calculating the difference of the individual conditional feature probability profiles […] where V(Pn, ε) is the ball of radius ε centered on the distribution Pn (a distribution over the space χ); it is thus a neighborhood in a functional space, where the radius bounds the Kullback-Leibler divergence between an element and the center of the ball.

We have recently investigated, for a more general case, how conjoint statistical significance testing for similarity and distinctness can be achieved on such a measure. For a more detailed description of the methodology, please refer to [7]. Briefly, any experimentally obtained signal (such as the fluorescence/chemiluminescence signal of a spot on a microarray) is interpreted as a random independent sample of some random variable, assumed normally distributed and with unknown average. The mean and variance estimates can be used to construct an unbiased maximum likelihood estimator, which is itself a random variable of Gaussian form. In order to formulate quantitative statements concerning the relative differences between different biological conditions, we introduce a cone Cα over the first diagonal of a signal estimate under two different biological conditions with half-angle α. The rationale for considering such cones rather than homogeneous error margins is to control the relative error. Using the so-called ratio distribution for independent normal distributions, we can then determine a likelihood of the mean estimates being within a distance smaller than Cα or not of the actual mean of the random variable. This distance measure is symmetric in the sense that we can estimate both similarity and distinctness. Moreover, the measure is also amenable to testing for statistical significance using serialized two-sided T-tests.
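The geometric idea behind the cone Cα can be illustrated with a short Python sketch. It is not the procedure of [7]: the likelihood there is derived from the ratio distribution of independent normal variables, whereas the sketch below substitutes a plain Monte Carlo estimate. The function names, the seed, and the number of draws are hypothetical choices for this illustration, and signals are assumed to be positive.

```python
import math
import random

def angle_from_diagonal(s1, s2):
    """Angle in radians between the point (s1, s2) and the first diagonal."""
    return abs(math.atan2(s2, s1) - math.pi / 4.0)

def in_cone(s1, s2, alpha):
    """True if the pair of signal estimates lies in the cone of half-angle alpha."""
    return angle_from_diagonal(s1, s2) <= alpha

def prob_in_cone(mean1, sd1, mean2, sd2, alpha, n_draws=10000, seed=0):
    """Monte Carlo estimate of the probability that two independent Gaussian
    estimators fall inside the cone (a stand-in for the ratio distribution)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        if in_cone(rng.gauss(mean1, sd1), rng.gauss(mean2, sd2), alpha):
            hits += 1
    return hits / n_draws

# A 10% relative difference is judged identically at low and high signal.
low = in_cone(1.0, 1.1, 0.05)
high = in_cone(100.0, 110.0, 0.05)
```

Because membership in the cone depends only on the ratio of the two signals, a 10% difference is treated the same at low and at high intensity, which is the sense in which a cone controls the relative rather than the absolute error.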
By defining a single confidence interval on the above measure, the decision on whether or not to collapse feature probability profiles then becomes straightforward. Interestingly, the significance testing of distinctness and similarity, as we develop it in [7], takes into account the relative variance over the measure in the case of massively parallel data, such as functional genomics experimental observations, in the form of the half-angle α of the cone Cα. In this case the quality, or better the statistically perceived quality, of the measure on the observable under different biological conditions is directly taken into consideration when estimating the statistical significance of the Kullback-Leibler divergence.

Extending the divergence analysis over the genome

So far we have only discussed the context-dependency analysis locally, that is, at any genome position n. As feature probability profiles extend over the entire genomic sequence of the organism under study, a generalization is required, which as shown below is straightforward in our approach. Consider the case where a subset of feature probability profiles is known on biological grounds to reflect relevant measures on the biological and physical properties of a stretch I of the genome (e.g. the linear extension of a gene, possibly with gaps, such as transcriptome data, Figure 3). […] I […] a distance […]
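One way to carry the local divergence over a stretch I is sketched below in Python. The aggregation shown is an assumption made for illustration and is not necessarily the distance used in the original work: if the positions of I are treated as independent, the Kullback-Leibler divergence between the joint distributions over I equals the sum of the local divergences, and dividing by the number of positions gives a length-normalized variant. The names `interval_divergence` and `mean_interval_divergence` are hypothetical.

```python
import math

def local_kl(q, p):
    """Kullback-Leibler divergence D(q || p) between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def interval_divergence(profile_bc, profile_b, positions):
    """Sum of local divergences over the positions of a stretch I.

    positions may contain gaps (e.g. exons only); under the independence
    assumption this sum is the divergence of the joint distributions over I.
    """
    return sum(local_kl(profile_bc[n], profile_b[n]) for n in positions)

def mean_interval_divergence(profile_bc, profile_b, positions):
    """Length-normalized variant, comparable between stretches of different size."""
    positions = list(positions)
    return interval_divergence(profile_bc, profile_b, positions) / len(positions)

# Four positions with a gap at position 2 (assumed toy data).
profile_b = [[0.5, 0.5]] * 4
profile_bc = [[0.5, 0.5], [1.0, 0.0], [0.9, 0.1], [1.0, 0.0]]
stretch = [0, 1, 3]
total = interval_divergence(profile_bc, profile_b, stretch)
average = mean_interval_divergence(profile_bc, profile_b, stretch)
```

The position inside the gap does not contribute, so the stretch can follow the exon structure of a gene rather than a contiguous window.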
Circumstantial and hierarchical complexity reduction

As discussed throughout this work, context-dependency of features is itself dependent on the biological question addressed. Given a biological question or context, any set of context-dependent conditions can be tested against a cumulative biological condition calculated as an average measure over the set of sub-conditions for its relative contribution to the overall information. This can be achieved in parallel for as many different (sub-)conditions as available. The relevance of any feature probability profile with respect to the biological question addressed is hereby and importantly solely defined through a statistical significance measure in the information theoretical divergence from the pooled information when considering larger and larger joint sets of conditions. This procedure can be hierarchically repeated (using a single confidence interval) to conditionally collapse individual profiles further and further (Figure 5).
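A single level of this collapsing procedure might be sketched as follows in Python. The sketch assumes discrete distributions at one position, takes the pooled condition to be the arithmetic mean of the sub-condition distributions, and replaces the statistical significance test by a fixed divergence threshold `eps`; all three are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import math

def kl(q, p):
    """Kullback-Leibler divergence D(q || p) between two discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def pooled(dists):
    """Average distribution over a set of sub-conditions."""
    k = len(dists)
    return [sum(column) / k for column in zip(*dists)]

def try_collapse(dists, eps):
    """Return (pooled distribution or None, divergences from the pool).

    The sub-conditions are collapsed only if every one of them diverges
    from the pooled distribution by at most eps (assumed criterion).
    """
    pool = pooled(dists)
    divergences = [kl(d, pool) for d in dists]
    collapsed = pool if max(divergences) <= eps else None
    return collapsed, divergences

# Three replicate-like sub-conditions that agree closely are collapsed.
similar = [[0.50, 0.50], [0.52, 0.48], [0.48, 0.52]]
collapsed, _ = try_collapse(similar, eps=0.01)

# Two clearly distinct sub-conditions are kept separate.
distinct = [[0.9, 0.1], [0.1, 0.9]]
kept_apart, _ = try_collapse(distinct, eps=0.01)
```

Since the pooled distribution is positive wherever any sub-condition distribution is positive, each divergence from the pool is finite. Applying `try_collapse` again to the resulting pooled distributions of several groups gives the hierarchical repetition described above.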
Circumstantial context illustrated with a theoretical example

To illustrate the applicability of the methodology developed here, let us consider a theoretical example: an analysis of different T-cell populations from a plausible human patient study, showing how context-dependency analysis is performed in a manner motivated by the biological question (Figure 7).
Let Px (x = 1, 2, 3) be a subject from whom a blood sample has been drawn. The peripheral blood mononuclear cell (PBMC) population has subsequently been separated by fluorescence activated cell sorting (FACS) and the two T-cell subpopulations CD4+CD25+ and CD4+CD25- were enriched using the corresponding cell surface markers. Assume furthermore that the CD4+CD25+ (red) and CD4+CD25- (blue) cells, which are both involved for instance in the inflammatory response, have undergone brief exposure to an inflammation inducing agent such as an interleukin during ex vivo primary cell culture, before the cells were harvested and total RNA was extracted for transcriptome analysis using several technical replicates per subject (Figure 7A). Several biological questions might be addressed using such a dataset. The first set of questions could relate to the difference in the transcriptional responses of CD4+CD25+ and CD4+CD25- T-cells to stimulation using the interleukin (Figure 7B–D). […]

Circumstantial context analysis on actual transcriptome data

To demonstrate the practical applicability of our approach we present here an analysis of circumstantial context using a concrete example of transcriptome data. The dataset we used was recently generated in our laboratory and has been published [8]. All microarray experiments discussed hereafter are available from the GEO database using accession number GSE10795 (see also Methods). In [8] we present a transcriptome analysis of the apoptotic transcription program downstream of the delta splice-isoform of the TFIID associated factor TAF6δ in two human isogenic cell lines inactivated or not for the p53 gene. Briefly, we demonstrate that TAF6δ acts downstream and independently of p53 to control gene expression at the onset of apoptosis [8].
For the following demonstration we selected six experiments: GSM272658-60 (TAF6δ induction in the p53-/- background, hereafter referred to as biological condition B-, using three independent biological replicates referred to as C1-, C2-, and C3-), and GSM272664-6 (TAF6δ induction in the p53+/+ background, hereafter referred to as biological condition B+, using three independent biological replicates referred to as C1+, C2+, and C3+). The data were processed as described in the Methods section and in [5] in order to obtain probability profiles, and subsequently we calculated the Kullback-Leibler divergence at probe resolution for different contexts (Figures 8 […]).
As shown in Figure 8A […]

Discussion

We have introduced probability landscapes as a homogeneous and formally consistent representation of any type of functional genomics information in order to achieve a unique structure that can statistically be systematically interrogated using correlation measures [5]. To reduce unnecessary formal, mathematical and computational complexity we propose here to use the existing de facto context-dependency of features as a question-dependent measure for collapsing subsets of the landscapes. Consider the case where Ci refer to sub-conditions of the circumstantial context of the biological condition B in which the feature X has been recorded (Figure 1). […]

Note that since we are comparing the distributions of the same random variable under different conditions, it is only the distance (or divergence) between the two distributions that is meaningful. A measure based on a joint probability, such as mutual information, cannot be envisioned. This also holds for the case of two different variables because the joint probability distribution is inaccessible. Eventually, one could envision considering mutual information in the context of the comparison of two probability distributions (rather than individual variables), thereby rejoining the concept of probabilities of probabilities we have previously developed [5]. However, this seems impractical in concrete terms.

The methodology developed here represents a systematic and simple way of testing the statistical limits of complexity-reduction and hence explanatory power of the integrative genomics data in their respective contexts (see for instance Figures 7 […]). We also note that the Kullback-Leibler divergence calculation provides measures that can be used directly for clustering of probability profiles. Clustering of probability profiles might help to establish and analyze relatedness among data otherwise not compared directly.
Conclusion

Feature context-dependency can be systematically investigated using probability landscapes. Furthermore, different, independent feature probability profiles can be collapsed as a function of circumstantial context to reduce the computational and formal complexity of probability landscapes. Interestingly, the possibility of complexity reduction can be linked directly to the analysis of context-dependency. Furthermore, as the criteria for circumstantial complexity reduction are statistically controlled, an optimal probability landscape is created in a biological question dependent manner. These two advances in our understanding of the properties of probability landscapes not only simplify subsequent cross-correlation analysis in hypothesis-driven model building and testing, but also provide additional insights into the biological gene regulatory problems studied. The nature of individual features can be probed with respect to posed problems and a classification of features according to their respective contexts can be achieved. Therefore, increased algorithmic efficiency and statistical power result jointly with heightened understanding of the biological mechanisms. Obviously, other features of circumstantial context and probability landscapes in general still remain to be fully exploited.

Methods

Constructing […]

In cases where the feature X takes discrete values, the construction of […] Another option is to discretize the feature X, using e.g. thresholds or any biologically meaningful partition of the range of values of X so that […] Still another option to construct […]

Note that discretization procedures involve extra knowledge that is at the same time a flaw (introducing some subjectivity if not arbitrariness in the description and analysis) and an advantage (it reduces a wealth of information in an intractable high-dimensional space to a finite number of clear-cut and discrete, e.g.
binary, properties, and takes advantage of all the additional knowledge available, e.g. on biological grounds, on the system). To enhance the beneficial aspect while minimizing the drawback, it is then essential to perform a discretization for each specific question and setting, extracting the minimal information that is relevant for that question.

Collapse of conditional profiles

When the comparison of the profile […]

Proof of the absolute continuity of P(X|B∧C) with respect to P(X|B)

In cases where the feature X takes discrete values […]
and Prob([X = x]∧B) = Prob([X = x]|B)·Prob(B), which we denoted Prob([X = x]|B) = P(X|B)(x) in the main text. It shows that P(X|B∧C)(x) is proportional to P(X|B)(x) provided Prob(B∧C) does not vanish, which is obviously true since such a condition B∧C has been observed experimentally and data recorded that underlie the estimation of P(X|B∧C). Accordingly, P(X|B∧C)(x) vanishes as soon as P(X|B)(x) vanishes, demonstrating the claimed absolute continuity. The proof straightforwardly extends to the case where X takes continuous values in a metric space and P(X|B∧C)(x), P(X|B)(x) are distribution functions (i.e. densities).

Kullback-Leibler divergence

At each genome location n, the probabilities P(Xn|B∧C) and P(Xn|B) are compared using the Kullback-Leibler divergence

\[ D\big(P(X_n|B \wedge C) \,\|\, P(X_n|B)\big) = \sum_{x \in \chi} P(X_n = x|B \wedge C) \, \ln \frac{P(X_n = x|B \wedge C)}{P(X_n = x|B)} \]

in the discrete case, where ∑_{x∈χ} should be replaced by ∫_χ dx in the continuous case […] where the latter approximation holds when […]

The rationale for considering the Kullback-Leibler divergence rather than an L^p distance is to weight the elementary contributions of each value x of X to the distance between the probability distributions by the probability of this value x; the distributions could differ significantly in x without having a significant divergence provided the probability of observing this value is in any case negligible. In the same spirit, Rényi generalizations can also be considered, replacing z ln z by (q−1)^{-1} z^q, which will allow the contribution of the rare events in the distance to be weighted differentially.

Let us denote […] Its minimal value 0 is observed if C adds no new information. It has no a priori maximum: there is no other solution of the constrained variational equation than the case of equality of the two distributions. A maximal value is reached when B∧C fully conditions X, namely […]

Transcriptome data

The transcriptome data used in this study to illustrate the concept of circumstantial context are part of a study investigating the effect of the delta isoform of the general transcription factor TAF6 in apoptosis induction and its relationship to the transcription factor p53 [8].
The microarray data are accessible from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE10795.

Transcriptome data preprocessing

The median normalized relative signal intensities from the indicated transcriptome experiments, representing three biological replicates for either the B+ (p53+/+) or the B- (p53-/-) biological conditions [8], were transformed into probability landscapes as described in [5]. Here, as the scope of the demonstration is restricted, and the number of analyzed samples is very moderate, the following simplifications were introduced in the calculation of the probability landscapes:

(1) The resolution of the p-annotation is at probe- and not nucleotide-level, as this is the smallest common denominator between the different samples and higher resolution therefore has no bearing.
(2) As the data originate from the same transcriptome technology, and belong to a single series, we have omitted the calculation of the quality estimating […]
(3) Both biological conditions were treated independently and no global rescaling of the probability landscapes between the two biological conditions (B+, B-) was performed for reasons similar to those above. Rescaling in this particular case would have marginally impacted the Kullback-Leibler divergence by a constant.
(4) The estimated coefficient of variation associated with each signal was not taken into account as it affects only […]

It should be kept in mind that the analysis presented here serves only as a proof-of-principle for the circumstantial context analysis developed, and does not aspire to investigate the features of the analyzed data systematically to the full extent using the probability landscape concept. Furthermore, the analysis presented here is probe-centered and hence only approximately comparable to the data analysis in [8], which is gene-centered, and where the probe-to-gene correspondence has been established [11].
The initial raw signal values, the P-values, and the different divergence measures are all provided as additional files 1, 2, 3. They are also accessible through our website (http://seg.ihes.fr/; follow "web sources" -> "supplementary materials").

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

AL and AB have jointly investigated the mathematical, computational, and experimental aspects of the idea, initially proposed by AB, upon which this work is based. Both authors have written the manuscript together. Both authors have read and approved the final manuscript.

Additional file 1

File provides the probe IDs, the associated raw signal estimates of the two times three biological replicates (C1, C2, C3) of the two biological conditions (B+, B-), the probability profiles, and the Kullback-Leibler divergence estimates for the transcriptome data described in [8]. This first file contains the entire dataset of 31710 probes analyzed. (6.0M, txt)

Additional file 2

File provides the probe IDs, the associated raw signal estimates of the two times three biological replicates (C1, C2, C3) of the two biological conditions (B+, B-), the probability profiles, and the Kullback-Leibler divergence estimates for the 899 selected p53 modulated probes [8]. (177K, txt)

Additional file 3

File provides the probe IDs, the associated raw signal estimates of the two times three biological replicates (C1, C2, C3) of the two biological conditions (B+, B-), the probability profiles, and the Kullback-Leibler divergence estimates where the above mentioned 899 p53 modulated probes were swapped between the two biological conditions. (6.0M, txt)

Acknowledgements

We are particularly grateful to the editor Paul S. Agutter for his thoughtful suggestions for improving the language and style of the manuscript.
We are equally indebted to the anonymous referees for their helpful suggestions and comments, as well as to all members of our research team for stimulating discussions. This work has been supported by funds from the Institut des Hautes Études Scientifiques, the Centre National de la Recherche Scientifique (CNRS), the French Ministry of Research through the "Complexité du Vivant – Action STICS-Santé" program, the Génopole-Evry, the Agence Nationale de Recherche sur le SIDA, and the Agence Nationale de la Recherche (ISPA, 07-PHYSIO-013-02).

References