Learning directed acyclic graphs from large-scale genomics data

In this paper, we consider the problem of learning the genetic interaction map, i.e., the topology of a directed acyclic graph (DAG) of genetic interactions from noisy double-knockout (DK) data. Based on a set of well-established biological interaction models, we detect and classify the interactions between genes. We propose a novel linear integer optimization program called the Genetic-Interactions-Detector (GENIE) to identify the complex biological dependencies among genes and to compute the DAG topology that matches the DK measurements best. Furthermore, we extend the GENIE program by incorporating genetic interaction profile (GI-profile) data to further enhance the detection performance. In addition, we propose a sequential scalability technique for large sets of genes under study, in order to provide statistically significant results for real measurement data. Finally, we show via numeric simulations that the GENIE program and the GI-profile data extended GENIE (GI-GENIE) program clearly outperform the conventional techniques and present real data results for our proposed sequential scalability technique.


I. INTRODUCTION
Genetic interaction analysis aims at uncovering the interactions among a set of genes with respect to a specified cell function of a biological system, e.g., the fitness of a specific bacteria colony.The interactions among the genes under study can be characterized by a directed-acyclic-graph (DAG) [30] where the hierarchical relationship among two genes of a DAG describes their hierarchical interaction type [7].However DAGs cannot be observed directly but only the specified cell function under study which yields observable phenotypes.The term phenotype generally describes the specific manifestation of a biological attribute of an organism which can be observed, e.g., for bacteria a common biological attribute is the growth measured in colony size, where a specific size of the bacteria colony is a phenotype of this biological attribute.The role of the studied genes in the cell machinery, the hierarchical interaction types of the genes, as well as the DAG, which describes the latter ones, can only be learned by means of knock-out experiments where a gene or a set of genes is functionally switched off and the phenotype is observed.Traditionally, only single-knock-out (SK) experiments have been conducted but those mainly provide evidence on the importance of a single gene for the investigated cell process and do not convey much information about the interaction among the genes under study.
Recently, with the technological advances in micro arrays and the development of the synthetic-genetic-array technologies [14] new approaches have been taken that are based on large scale knock-out experiments of pairs of genes.Such double knock-out (DK) experiments are much more powerful for exploring genetic interactions since a DK phenotype of an arbitrary pair of genes generally differs considerably from the superposition of the corresponding SK phenotypes of this pair of genes.According to [7], the gene pairs can be classified into one out of five hierarchical relationship classes based on their SK and DK phenotypes.Further, based on the hierarchical relationship classes the DAG underlying the observed SK and DK phenotypes can be inferred which directly reflects the genetic interactions among the genes.In order to detect the DAG underlying the SK and DK phenotypes a variety of statistical methods based on scoring the measurements or on thresholding the genetic-interaction (GI)-profile data, which is commonly based on Pearson correlation of the SK and DK phenotypes, e.g.[1]- [6] respectively, have been developed.However, methods as presented in [1]- [6] have three considerable disadvantages: D1) they show poor performance in detecting the DAG underlying the observed SK and DK phenotypes D2) they have no ability to combine different types of side information, e.g., GI-profile data with SK and DK phenotypes, to enhance the detection quality D3) they cannot make use of prior knowledge in order to enhance the DAG detection quality.Especially the ability to overcome the disadvantage in D2) will become more important in the future, since there is a constantly increasing amount of different data types, e.g., SK and DK phenotypes, Pearson correlation based GI-profile data, other types of GI-profile data, freely available.Furthermore, the ability to overcome the deficit in D3), i.e., to incorporate a priori knowledge about existing results in genomics research into the DAG detection procedure, is also of great significance, since existing functional relationships among genes are increasingly better understood based on a variety of studies that constantly extend the knowledge on the cell machinery and molecular biology.Although exhibiting the above mentioned disadvantages D1) to D3), methods as those presented in [1]- [6] are the most commonly used methods to detect the DAG underlying the measured SK and DK data.Therefore, we propose the Genetic-Interactions-Detector (GENIE)-program, that is an approach based on the biological system model of [7] with which it is possible to overcome the above mentioned shortcomings of the most popular methods as those reported in [1]- [6].Since the hierarchical relationship classes are mutually dependent, classifying each pair of genes to a specific hierarchical relationship class corresponds to a multi-hypotheses test.Thus, we formulate this multihypotheses test as a linear integer optimization program, [8]- [13] in order to find the set of hierarchical relationship classes, best matching the observed SK and DK phenotypes.Based on the detected set of hierarchical relationship classes, the set of edges of the DAG which reflects the interactions among the genes can be computed.Furthermore, we propose the GI-GENIE-program where we advance the proposed GENIEprogram by incorporating GI-profile data, e.g.GI-profile data based on Pearson correlation of the observed SK and DK phenotypes, into the DAG detection procedure.Due to incomplete knowledge about the true dependencies among the very most sets of genes, i.e., the true DAG of a set of genes with respect to a specific cell function is unknown or only partially known for almost all sets of genes irrespectively of the cell function under study, there is a strong interest in the genomics research community in statistically reliable statements about the topology of the DAGs underlying large sets of genes, i.e., for the empirical probability of a pair of genes to interact with each other.Towards this aim we propose a sequential technique based on the GENIE/GI-GENIE algorithms that yields statistically stressable statements about the interactions among genes from a large set of genes under study.This paper is organized as follows.We first summarize the biological system model of [7] in Section 2, then we present in Section 3 the GENIE-program for detecting the set of hierarchical relationship classes, that represents a valid DAG and matches the DK measurements best.In Section 4 we extend the GENIE-program with GI-profile data (GI-GENIE).In Section 5, we present our scalability approach in order to obtain statistically stressable results for large sets of genes.Following Section 5, we present results for simulated data which demonstrate the performance of the GENIE and the GI-GENIE methods in Section 6.Furthermore, in Section 6 we display real data results for the scalability approach described in Section 5. Finally, we summarize in Section 7 the key parts of this paper and give a brief outlook on future work.

II. SYSTEM MODEL
In this section we provide a mathematical description of a DAG as well as its biological implications.Furthermore, we introduce the common biological terms and provide a compact description of the genetic interaction model of [7] including simple explanations on how to read and interpret a DAG of a genetic interaction map.According to [15] a graph A = V(A), E(A) is well defined by a set of nodes V(A) = {a 1 , a 2 , ..., a A } and a set of edges E(A) = {a 1 , a A , } , {a 2 , a A , } , ..., {a A , a 1 , } where {a i , a j , } for a i , a j ∈ V(A) denotes a directed edge from a i to a j and cardinality |V(A)| = A denotes the number of elements of set V(A).The operators V(•) and E(•) applied to graph A yield the set of nodes V(A) and the set of edges E(A) respectively.We mostly address sets V(A) and E(A) by Fig. 1: DAG D 0 of 13 genes and root node R and E A respectively for the sake of notational convenience, i.e., A = G A , E A .Assume that there is a path P from node a i ∈ G A to node a j ∈ G A in graph A, i.e., a directed connection from node a i ∈ G A to node a j ∈ G A .Then path P is described by the concatenation of nodes being passed through on the way from node a i ∈ G A to node a j ∈ G A , i.e., P = a i ...a j and V(P) = {a i , ..., a j } denotes the set of nodes of path P [15].The functional dependencies among a set of genes G = {g 1 , ..., g G }, with G = |G| elements, for a given cell process and specie can be characterized by a genetic-interaction-map (GI-map, [17]- [20]) which is essentially a DAG with a common root node, i.e., the reporter level R, [31].In particular, an arbitrary DAG D can be described as a graph D = (G D , E D ) with the set of nodes G D = {G ∪ R} and the set of directed edges E D = {g i , g j } , ..., {g j , g l } .As the genetic interactions can only be observed through the reporter, all edges are always orientated in such a way that each path parting from any arbitrary gene g i ∈ G always terminates in the root node R and any gene appears on the path at most once, i.e., there exist no cycles in the graph.Hence, the DAG D is always connected via its root node R. For the sake of notational convenience, in most cases we write gene i when addressing gene g i , [31].The reporter node R is an artificial node, i.e., not a gene, in the concept of a DAG and represents the measured phenotype of the specific cell process under study.To provide a better understanding of the information encoded in a DAG we state a simple example, which is similar to the one in [31], based on DAG D 0 displayed in Fig. 1.In D 0 there exists an direct edge from gene i 0 to gene j 0 , i.e. {i 0 , j 0 } ∈ E D0 , which indicates that the activity of gene i 0 controls the activity of gene j 0 .Hence, gene i 0 only affects the phenotype via gene j 0 and not directly.We emphasize that in this model the existence of edge {i 0 , j 0 } in the DAG only describes the hierarchical functional dependency between genes i 0 and j 0 and not the quantitative effect of gene i 0 on gene j 0 .Denote R(i) ∈ R as the phenotype for a single gene i ∈ G Fig. 2: Possible hierarchical relationship classes between two arbitrary genes i, j of DAG D according to [7] functionally switched off.In the same fashion we define the phenotype for the DK of genes i, j ∈ G: j > i as R(i, j) ∈ R.
Based on the DK phenotypes the GI-profile similarity of genes i, j ∈ G: j > i is computed as the Pearson correlation between all DK phenotypes which involve gene i and all DK phenotypes which involve gene j, i.e., the GI-profile similarity ρ(i, j) is given by with and R(i, l) = R(l, i) ∀i, l∈ G: l > i.We remark that the GIprofile similarity data ρ(i, j) does not have to be computed according to Eq. (1) necessarily.It can also be extracted from a data base where a priori knowledge about the set of genes under study, i.e., G, is stored.In genomics research it is a common assumption that if there is an edge between two genes i, j ∈ G: j > i in DAG D, i.e., there is an interaction between genes i, j ∈ G: j > i in DAG D, then the GI-profile similarity ρ(i, j) is very likely to be high.Furthermore according to [7] we assume that each pair of genes i, j ∈ G :j > i belongs to exactly one out of five hierarchical relationship classes that are characterized in Fig. 2. The hierarchical relationship classes k ∈ K = {1, ..., 5} are defined according to the model µ k (i, j) in which the single knock-out phenotypes R(i) and R(j) are related with the DK phenotype R(i, j).If the gene pair i, j ∈ G: j > i belongs to the hierarchical relationship class k then the observed DK phenotype R(i, j) is described by the model µ k (i, j) provided in Fig. 2. We remark that the five hierarchical dependency graphs in Fig. 2 do not reflect the absolute adjacency relations, but the hierarchical relations between genes i, j in DAG D. Hence, given that two genes i, j of DAG D are in class k, we cannot conclude that genes i, j are directly arranged in DAG D as displayed by the depiction of class k in Fig. 2.This follows from the fact that the description of the hierarchical relationship classes provided in Fig. 2 only contains relative topology information about two genes i, j in DAG D.
In the following and in addition to [7], we provide, for clarity of presentation, a formal description of the hierarchical relationship classes depicted in Fig. 2 using a graph theoretical representation.Assume that there are I paths P i,τ , for τ ∈ {1, ..., I}, from gene i to the reporter node R in DAG D and the set P i containing all such paths is defined as P i = {P i,1 , ..., P i,I }.Furthermore, set P j = {P j,1 , ..., P j,J } contains all J paths from gene j to the reporter node R in DAG D. Given gene pair i, j ∈ G : j > i in DAG D, then pair i, j belongs to the hierarchical relationship class k ∈ K if and only if condition C k as defined below is satisfied: C 2 : ∀ P j,τ ∈ P j : i ∈ V(P j,τ ) C 3 : C 4 : C 5 : ∃ P j,τ ∈ P j : i / ∈ V(P j,τ ) As stated in condition C 1 in (2a), two genes i, j ∈ G: j > i in DAG D belong to the hierarchical relationship class k = 1, if all paths from gene i to the reporter node R pass through gene j.Hence, gene j is always an element of the set of nodes of each path P i,τ ∈ P i from gene i to the reporter node R, i.e., j ∈ V(P i,τ ) for all paths P i,τ from gene i to the reporter node R. With the same line of argument as used in (2a), two genes i, j ∈ G: j > i in DAG D belong to the hierarchical relationship class k = 2 if condition C 2 in (2b) is satisfied.
Two genes i, j ∈ G: j > i in DAG D belong to the hierarchical relationship class k = 3 and are considered to be independent from each other if condition C 3 in (2c) is satisfied which states that there is no path P i,τ from gene i to the reporter node R that passes through gene j as well as there is no path P j,τ from gene j to the reporter node R that passes through gene i.As stated in (2d), two genes i, j ∈ G: j > i in DAG D belong to the hierarchical relationship class k = 4 if there is at least one path P i,τ from gene i to the reporter node R which does not pass through gene j as well as for all paths P j,τ ∈ P j there is always a path P i,τ ∈ P i that is a super-path of the respective P j,τ ∈ P j .With the same line of argument as used in (2d), two genes i, j ∈ G: To illustrate this, let us consider the example DAG D 0 of Fig. 1.All paths from gene i 0 to node R pass through gene j 0 , i.e., they are in a linear pathway with gene i 0 upwards of gene j 0 .Thus the pair of genes i 0 , j 0 belongs to class k = 1.Note that with the same line of argument, we conclude that also genes i 0 and l 0 belong to relationship class k = 1.Since all paths from gene i 0 to the reporter level R do not pass through gene t 0 and all paths from gene t 0 to the reporter level do not pass through gene i 0 , genes i 0 and t 0 belong to the hierarchical relationship class k = 3 as given in Fig. 2, which states that genes i 0 and t 0 are independent of each other and the DK phenotype amounts to R(i 0 , t 0 ) = µ 3 (i 0 , t 0 ).Finally, let us investigate the hierarchical relation between genes t 0 and n 0 in DAG D 0 .Obviously, gene t 0 has (at least) one path to node R which does not pass through gene n 0 , i.e., genes only having paths to R that do not pass through gene n 0 do not affect the activity of gene n 0 .Since there is (at least) one other path from gene t 0 to R passing through gene n 0 , we can reason that genes t 0 and n 0 belong to class k = 4. Generally, there are strong implications among the hierarchical relationship classes of [7], i.e., if some pairs belong to a specific class then this has strong implications for all other pairs.Let us consider the case that DAG D 0 was not known and only the hierarchical relationship classes for genes i 0 and j 0 , i.e., genes i 0 and j 0 belong to class k = 1, as well as the hierarchical relationship class for genes i 0 and g 0 , i.e., genes i 0 and g 0 belong to class k = 1 were available.By definition of the hierarchical dependency graphs in Fig. 2 and the assumptions, that genes i 0 and j 0 belong to class k = 1 as well as that genes i 0 and g 0 belong to class k = 1, we conclude that all paths from gene i 0 to R pass through genes j 0 and g 0 .Thus, either all paths from gene g 0 to R pass through gene j 0 , or all paths from gene j 0 to R pass through gene g 0 .Consequently, genes j 0 and g 0 either belong to the hierarchical relationship class k = 1, or k = 2.
As we have emphazised by the example above, generally if the hierarchical relationship class is known for two arbitrary genes i, j ∈ G :j > i as well as for another pair i, l ∈ G :l > i, then this has strong logical implications on the hierarchical relationship classes genes j, l ∈ G :l > j can belong to.Since we can interpret the classification of the pairs of genes i, j ∈ G : j > i, based on their observed SK and DK phenotypes R(i), R(j) and R(i, j), respectively, to exactly one out of the five hierarchical relationship classes as a coupled multi-hypotheses test, we address this problem in Section 3 by a linear integer optimization program.The proposed linear integer optimization program identifies the most consistent set of hierarchical relationship classes, i.e., the set of hierarchical relationship classes that represents a valid DAG and matches best the DK measurements with respect to the logical coupling between the classes.Furthermore, in Section 4 we extend the GENIE-program proposed in Section 3 by incorporating GIprofile data in order to jointly detect the most consistent set of hierarchical relationship classes and the corresponding DAG topology.

III. GENIE-ALGORITHM
In this section, we formulate the problem of classifying the gene pairs i, j ∈ G :j > i into the classes of hierarchical relationships based on the observed SK and DK phenotype values as a linear integer optimization program.Furthermore, we translate the logical implications among the hierarchical relationship classes into constraints that ensure that the detected set of hierarchical relationship classes represents a valid graph.That is the detected set of hierarchical relationship classes represents a graph which is a DAG as defined in Section 2. Finally, we propose a policy to derive an estimate ÊD of the true set of edges E D of DAG D based on the detected set of hierarchical relationship classes.

A. Hierarchical Relationship Class Detection
In order to quantify the mismatch between the measured DK phenotypes R(i, j) and the phenotype model µ k (i, j) of class k ∈ K according to Fig. 2, under the hypothesis that the gene pairs i, j ∈ G :j > i belong to class k given their respective SK values, we propose a simple quadratic score [7], [31], as given in Eq. (3): Let us define the following selection variables We remark that every DAG D can be represented by a set of hierarchical relationship classes which directly corresponds to a set of selection variables A D = ∀i,j∈G:j>i α D 1 (i, j), ..., α D 5 (i, j) .The GENIE-algorithm of classifying the gene pairs i, j ∈ G :j > i into the set of hierarchical relationship classes, that represents a valid DAG and matches the observed SK and DK phenotypes best can be formulated as L =⇒ additional topology constraints (5d) where In the following, we exemplarily derive topology constraints in set L. In order to identify the numerous logical dependencies among the selection variables α k (i, j), k ∈ K for all i, j ∈ G :j > i we proceed in the following way.We first fix the assumption that genes i, j ∈ G : j > i belong to class k = 1, i.e., α 1 (i, j) = 1.Further we assume that genes i, l ∈ G : l > i belong to class k , i.e. α k (i, l) = 1.Then we derive the set of classes K that genes j, l ∈ G : l > j can belong to under the assumptions made.In the following, we have formulated the logical dependencies among the selection variables for α 1 (i, j) = 1, i.e., the case that gene i is linearly upstream of gene j, as linear integer inequalities defined in constraints (6a)-(6e) and summarize them as set L 1 where constraints (6a)-(6e) are affine after the continuous relaxation of the selection variables α k (i, j), ∀i, j ∈ G : j > i.
To explain the origin and the functionality of the constraints in L 1 , let us further define a sub-genetic-interactions-map (SMAP) S, [31], as given in Fig. 3 according to the following definition where we adopt the graph notation of [15]: Definition: Given a non-empty set of edges E in and a non-empty set of edges E out .Graph S = G S , E S , with set of nodes G S and set of edges E S , is a SMAP if the following conditions are fulfilled: i) the graph S is acyclic and directed ii) there ∃e in ∈ E in , e out ∈ E out such that each path P through graph S incides S via egde e in and leaves graph S via edge e out .
e in,1 e in,2 e out S Fig. 3: Example SMAP S DAG D 1 , as displayed in Fig. 4, consists of genes i, j and SMAPs S 1 and S 2 .Obviously, genes i, j belong to class k = 1, i.e., α 1 (i, j) = 1.Furthermore, all genes l ∈ G D1 \ {R} :l > j > i for which α 1 (i, l) = 1 must be either located in SMAP S 1 or S 2 .Thus, it follows from DAG D 1 in Fig. 4 that the gene pair j, l is either in hierarchical relationship class k = 1 or k = 2, i.e., α 1 (j, l) = 1 or α 2 (j, l) = 1.This logical implication is directly reflected by constraint (6a).Given α 1 (i, j) = 1 and α 1 (i, l) = 1 the right-hand-side (RHS) of (6a) amounts to 1.In this case also the left-hand-side (LHS) of (6a) becomes 1 to fulfill the inequality (6a).Thus either α 1 (j, l) = 1 or α 2 (j, l) = 1.Reversely, assume that α 1 (i, j) = 1 and α 1 (i, l) = 1 does not hold, then the RHS of (6a) is less than 1, i.e., 0 or -1, while the LHS of (6a) is always greater than 0. Hence, constraint (6a) is fulfilled irrespectively of the choice of α k (j, l), i.e., constraint (6a) enforces no logical implications.Similarly for DAG D 2 in Fig. 4, it is obvious that genes i, j belong to the hierarchical relationship class k = 1, i.e., α 1 (i, j) = 1.All genes l ∈ G D2 \ {R} :l > j > i which are in a linear pathway upstream of gene i, i.e. α 2 (i, l) = 1, must be located in SMAP S 3 .Hence it directly follows from DAG D 2 that also gene l must be in a linear pathway upstream of gene j, i.e., α 2 (j, l) = 1.This logical implication is compactly represented in constraint (6b).Under the assumption that α 1 (i, j) = 1 and α 2 (i, l) = 1, the RHS of (6b) amounts to 1 enforcing α 2 (j, l) = 1, so that the LHS of (6b) equals the RHS and the inequality in (6b) is fulfilled.Reversely, assume that α 2 (i, l) = 0, then the RHS of (6b) is less than 1 and hence the LHS of (6b) is always bigger than or equal to the RHS irrespectively of the choice of α k (j, l), i.e., constraint (6a) enforces no logical implications.We can proceed in the same fashion to explain constraints (6c)-(6e) based on DAGs D 3 to D 5 as given in Fig. 4 respectively.Furthermore, with the same line of argument we can derive the sets L k for k ∈ K \ 1 which reflect the logical implications among the selection variables under the assumptions that α k (i, j) = 1 for k ∈ K \ 1.However, due to space limitations we omit the derivation of the full set of logical implications at this point and refer the interested reader to [32] where we will provide the full set of topology constraints L as well as further supplementary material.The full set of topology constraints L in (5d) can be computed as Finally, a considerable advantage of the presented algorithm is its ability to incorporate prior knowledge into the classification of the gene pairs i, j ∈ G: j > i to the most consistent hierarchical relationship classes.Assume that it is known from existing experimental results that two specific genes i 0 , j 0 ∈ G: j 0 > i 0 do not interact with each other.Then we can easily incorporate this prior knowledge into program O GENIE in (5) by adding Eq. ( 8) as defined below as a topology constraint to program O GENIE .This property is very valuable since it allows the GENIE-algorithm to take advantage of existing results in genetic interactions research to improve the reliability of the classification.

B. Edge Computation
Based on the detected set of selection variables A OGENIE which corresponds to the most consistent pattern of hierarchical relationship classes given the observed SK and DK phenotypes, an estimate E GENIE of the true set of edges E D of DAG D can be computed.It can be theoretically proven that the representation of an arbitrary DAG D by its corresponding set of hierarchical relationship classes is not unique.A D the set of selection variables which directly corresponds to the hierarchical relationship class pattern of DAG D represents not only the true DAG D, but also a set of similar DAGs which have minorly different sets of edges compared to the true DAG D. Assume we are only given that α D 4 (i, j) = 1 for two arbitrary genes i, j ∈G : j > i of DAG D, then we suffer an information loss on the number of paths from gene i to the reporter node R which are independent of gene j.Similarly, given that α D 5 (i, j) = 1 for two arbitrary genes i, j ∈G : j > i of DAG D, we suffer an information loss on the number of paths from gene j to the reporter node R which are independent of gene i.Hence, this information loss yields ambiguities in computing the set of edges E D of DAG D based on the A D .In order to clarify the origin of the ambiguities further, let us turn to a simple example.Given (1, 4) = 1 due to edge e 0 ∈ E Da and α Da 1 (1, 5) = 1.Now assume that we want to compute the topology of DAG D a , i.e., the set of edges E Da , based on A Da .DAG Da displayed on the RHS of Fig. 5 shows the estimated topology of DAG D a based on A Da .It can be shown that the black edges of DAG Da are necessary such that DAG Da is represented by A Da .Edges e 1 and e 2 in DAG Da are optional in a sense that their existence has no effect on set A Da .Edges e 1 and e 2 create two paths form gene g 1 to the reporter node R which are independent of gene g 2 and gene g 3 respectively.However, due to edge e 0 gene g 1 already has a path to the reporter node R which is independent of genes g 2 and g 3 .Since α Da 4 (1, 2) = α Da 4 (1, 3) = α Da 4 (1, 4) = 1 do not contain information on the number of paths from gene g 1 to R that are independent of g 2 , g 3 and g 4 , edges e 1 and e 2 do not affect the pattern of hierarchical relationship classes representing DAG D a , i.e., A Da and hence this yields ambiguities in computing the topology of DAG D a based on its corresponding set of selection variables A Da .Since it is a common assumption in genomics research, that GI-maps, i.e., DAGs, are not overly dense but rather sparse, we propose a policy which computes the sparsest DAG topology based on the detected pattern of hierarchical relationship classes.Given the detected pattern of hierarchical relationship classes of a DAG D, i.e., ÂD = ∀i,j∈G:j>i αD 1 (i, j), ..., αD 5 (i, j) , we compute an estimate ÊD of the true topology set E D of DAG D according to the policy depicted in Tab.I where we make use of the symmetry properties αD 1 (i, j) = αD 2 (j, i), αD 2 (i, j) = αD 1 (j, i), αD 3 (i, j) = αD 3 (j, i), αD 4 (i, j) = αD 5 (j, i) and αD 5 (i, j) = αD 4 (j, i) to redundantly expand the set of detected selection variables in order to obtain a compact formulation of the mutually exclusive conditions E 1 to E 4 as stated in Tab.I. Assume that either condition E 1 or condition E 2 is fulfilled, then we conclude that there is an edge from gene i to gene j in DAG D. Given that either condition E 3 or condition E 4 is fulfilled, we conclude that there exists an edge from gene j to gene i in DAG D. We remark that there cannot be an edge between two genes i, j ∈G : j > i if they are independent of each other, i.e., αD 3 (i, j) = 1.As described by E 1 there is an edge from gene i to gene j in DAG D, if gene i is linearly upstream of gene j, i.e., αD 1 (i, j) = 1, and there is no gene l in DAG D that is linearly downstream of gene i, i.e., αD 1 (i, l) = 1, and linearly upstream of gene j, i.e., αD 2 (j, l) = 1.According to condition E 2 there is an edge from gene i to gene j in DAG D, if gene i is upstream of gene j with at least one path from gene i to R which is independent of gene j, and furthermore there is no gene l in DAG D that is either linearly downstream of gene i or downstream of gene i with gene i having at least one path to R that is independent of l, and, gene l is neither linearly upstream of gene j nor is gene l upstream of gene j with an independent path to R. In order to elucidate the effect of condition E 2 onto the edge computation, we briefly turn to DAG Da in Fig. 5. Condition E 2 ensures that the optional edges e 1 and e 2 are Detection Policy: Compute sparsest DAG in line with ÂD • =⇒ there is an edge in DAG D from gene i to gene j, i.e {i, j} E 2 : • =⇒ there is an edge in DAG D from gene i to gene j, i.e {i, j} E 3 : • =⇒ there is an edge in DAG D from gene j to gene i, i.e {j, i} E 4 : • =⇒ there is an edge in DAG D from gene j to gene i, i.e {j, i} TABLE I: Proposed sparse edge detection policy not detected but only the necessary edges displayed in black color.We remark that conditions E 3 and E 4 can be elucidated by the same line of argument as used for conditions E 1 and E 2 , but due to space limitations we omit a detailed explanation at this point.Finally, we propose a condition from which all reporter node edges, i.e, all egdes that connect gene i ∈ G with reporter node R in DAG D can be computed.Based on the detected set of hierarchical relationship classes, i.e., ÂD , we follow our policy of computing the necessary edges only.For clarity of presentation and notational compactness, we define set M i as containing all genes l ∈ G which are in class k = 4 with gene i ∈ G, i.e., αD 4 (i, l) = 1.Furthermore, we define set M i as which contains all genes l of set M i that are independent of at least one other gene of set M i .Based on sets M i and M i , we formulate condition E R as stated in Tab.II.We conclude that there is an edge from gene i to reporter node R E R : Given that gene i is linearly upstream of at least a single gene l, i.e., αD 1 (i, l) = 1, there cannot exist an edge from gene i to reporter node R in DAG D, since all paths from gene i to R pass through at least one other gene l.Conversely, if there is no such gene l that αD 1 (i, l) = 1, then the LHS of E R as given in Tab.II is fulfilled.The RHS of E R accounts for our policy of detecting sparse DAGs only and is fulfilled if either M i , M i or M i and M i are empty.Note that given M i = ∅ it follows that M i = ∅ as well, whereas the opposite is not true.In order to

IV. GI-GENIE-ALGORITHM
In this section, we extend program O GENIE in ( 5) by incorporating GI-profile data, in order to jointly detect the most consistent pattern of hierarchical relationship classes and the corresponding set of edges E GI that is an estimate of the true graph topology of DAG D, i.e., the set of edges E D which underlies the SK, DK and GI-profile data.Let us define the following selection variables where β(i, j) = 1 denotes that there is an edge between genes i, j ∈ G : j > i in DAG D and β(i, j) = 0 denotes that there exists no edge between genes i and j.Note that unlike α k (i, j) = 1 for k ∈ K, β(i, j) = 1 does not capture directionality information about the graph topology, i.e., β(i, j) = 1 states that there is an edge between genes i, j in DAG D without specifying the hierarchy among both genes.The GI-GENIE program of jointly detecting the most consistent pattern of hierarchical relationship classes A OGI-GENIE and the corresponding DAG topology E GI based on SK, DK and GI-profile data can be formulated as linear integer program O GI-GENIE : where ρ(i, j) denotes a measure of the similarity between the GI-profiles of genes i, j ∈ G : j > i according to (1), the scalars λ d , λ s , λ c , λ p are non-negative weighting constants and z l (i, j)∀i, j, l ∈ G : j > i, l = i, l = j are binary slack variables.Note that λ s is assumed to be a very small nonnegative constant, i.e. 0 ≤ λ s 1.The GI-profile (GIP) term in the objective function (12a) of program O GI-GENIE seeks to detect an edge between genes i, j ∈ G : j > i if the GI-profile similarity ρ(i, j) is greater than the predefined threshold λp λc .The slack term is necessary for the correct functionality of constraints (12f) and (12g) as explained below.The slack variables z l (i, j)∀i, j, l ∈ G : j > i, l = i, l = j are generally necessary to ensure that the information about the topology of DAG D, which is encoded in the pattern of selection variables A OGI-GENIE detected by program O GI-GENIE , is not contradicting with the set of edge selection variables β(i, j) ∀i, j ∈ G : j > i detected by program O GI-GENIE .In particular, given that the detected pattern of selection variables A OGI-GENIE enforces that there is an edge between genes i, j in DAG D, then the slack variables ensure that the corresponding edge selection variable indicates that there is an edge between genes i, j, i.e., β(i, j) = 1.Furthermore, given that the detected pattern of selection variables A OGI-GENIE enforces that there is no edge between genes i, j in DAG D, then the slack variables ensure that the corresponding edge selection variable indicates that there is no edge between genes i, j, i.e., β(i, j) = 0. On the contrary, assume that the detected edge selection variables enforce that there is an edge between genes i, j in DAG D, i.e., β(i, j) = 1, then the z l (i, j)∀i, j, l ∈ G : j > i, l = i, l = j ensure that the detected pattern of selection variables A OGI-GENIE must fulfill one of the conditions stated in Tab.I. Consequently, given that the detected edge selection variables enforce that there is no edge between genes i, j in DAG D, i.e., β(i, j) = 0, then the z l (i, j)∀i, j, l ∈ G : j > i, l = i, l = j ensure that the detected pattern of selection variables A OGI-GENIE does not fulfill any of the conditions stated in Tab.I. Furthermore, let the auxiliary variables describe the detection of the edges of DAG D based on GIprofile data only, where q(i, j) = 1 denotes that there is an edge between genes i, j ∈ G : j > i and q(i, j) = 0 denotes that there is no edge between genes i, j ∈ G : j > i.Since any pattern of hierarchical relationship classes implies a specific set of edges and any set of edges implies a specific pattern of hierarchical relationship classes, there is a strong coupling of constraints, i.e. there are strong logical implications among the selection variables α k (i, j) ∀ i, j ∈ G : j > i, ∀ k ∈ K and the selection variables β(i, j) ∀ i, j ∈ G : j > i, the constraints in Eq. (12e) to Eq. (12g) ensure that the detected hierarchical relationship classes and the corresponding edges, i.e., the α k (i, j) and β(i, j), are not mutually contradicting.Given that two genes i, j ∈ G : j > i in DAG D are independent, i.e., α 3 (i, j) = 1, there cannot exist an edge between those genes in DAG D, i.e., β(i, j) = 0.This logical implication is reflected by (12e).Set L c in (12f) and the linear integer inequalities in (12g) model conditions E 1 to E 4 of our proposed edge detection policy as stated in Tab.I. Since we do not want to redundantly expand the set of selection variables α k (i, j) to all i, j ∈ G : j = i in order to not increase the complexity of program O GI-GENIE , we have to consider three cases when formulating conditions E 1 to E 4 of Tab.I as linear integer inequalities, i.e., i, j, l ∈ G : l > j > i, i, j, l ∈ G : j > i > l and i, j, l ∈ G : j > l > i.Then the constraints in set L c,1 2 − β(i, j) − q(i, j) ≥ α 4 (i, j) + α 1 (i, l) ∀i, j, l ∈ G : l > j > i model the logical implications among the selection variables α k (i, j), α k (i, l), α k (j, l) and β(i, j) for k, k , k ∈ K, ∀i, j, l ∈ G : l > j > i. Together with (12g) and the slack term in (14), constraints (16a)-(16c) model condition E 1 of our detection policy taking into account the GI-profile similarity information ρ(i, j) via selection variables β(i, j).
Assume that based on the SK and DK phenotypes it is most consistent that α 1 (i, j) = α 1 (i, l) = α 2 (j, l) = 1 for at least one gene l in DAG D which corresponds to condition E 1 being violated.Hence, there cannot exist an edge between genes i and j in DAG D. In this case the RHS of (16a) amounts to 1 which enforces the LHS of (16a) to amount to 1 as well, i.e., β(i, j) = 0. Note that for α 1 (i, j) = α 1 (i, l) = α 2 (j, l) = 1, (16c) makes no restrictions on z l (i, j) but by (16b) z l (i, j) is forced to 0, thus (12g) is fulfilled as well.Furthermore, assume for genes i, j ∈ G : j > i, based on the SK and DK phenotypes, it is most consistent that α 1 (i, j) = 1, but α 1 (i, l) and α 2 (j, l) are not jointly 1 for all other genes l ∈ G : l > j > i, i.e., α 1 (i, l) + α 1 (j, l) < 2, then there is an edge between genes i, j in DAG D according to condition E 1 .In this case it is obvious that (16a) and (16b) are always fulfilled, i.e., there are no restrictions on β(i, j) and z l (i, j) by (16a) and (16b).However, due to the slack term in (14), it follows that z l (i, j) = 1 ∀l ∈ G : l > j > i and β(i, j) is set to 1 in order to fulfill (12g).
Given that the GI-profile similarity data strongly supports that there is no edge between genes i, j in DAG D, i.e., β(i, j) = 0, and α 1 (i, j) = 1 is most consistent based on the SK and DK phenotypes measured, then it follows from (12g) that there must be at least one l ∈ G : l > j > i for which z l (i, j) = 0.In this case, with β(i, j) = 0, α 1 (i, j) = 1 and z l (i, j) = 0, the RHS of (16c) amounts to 1, forcing the LHS of (16c) to amount to 1 as well, i.e., α 1 (i, l) = 1 and α 2 (j, l) = 1, which is together with the assumption of α 1 (i, j) = 1 a combination that violates the existence of a direct edge between genes i and j.Furthermore, note that (16a) and (16b) do not have any implications on the selection variables α 1 (i, j), α 1 (i, l), α 2 (j, l) for the case that β(i, j) = 0 and z l (i, j) = 0.
Assume that the GI-profile similarity data strongly supports that there is an edge between genes i, j in DAG D, i.e., β(i, j) = 1, and α 1 (i, j) = 1 is most consistent based on the SK and DK phenotypes measured, then according to (16a) there cannot be any gene l ∈ G : l > j > i for which α 1 (i, l) = 1 and α 2 (j, l) = 1.Hence, the RHS of (16b) is smaller or equal to 0 which enforces z l (i, j) = 1 ∀l ∈ G : l > j > i due to the slack term in (14).Thus, (12g) is fulfilled with equality.Note that in this case (16c) is always fulfilled.Together with (12g) and the slack term in ( 14), the three inequalities in (16d) model condition E 2 where we can elucidate their functionality in the same fashion as before.Constraints (16e) to (16j) along with (12g) and the slack term in ( 14) model a minor modification of condition E 3 where we not only detect all necessary edges, but also optional edges given that their existence is strongly supported by the GIprofile.Given that the existence of an edge between genes i, j ∈ G : j > i in DAG D is not strongly supported by the GI-profile, i.e. q(i, j) = 0, constraints (16e) to (16g) along with (12g) and ( 14) model condition E 3 which only allows necessary edges to be detected and we can elucidate their functionality in the same fashion as in (16a) to (16c).Note that (16h) to (16j) are always fulfilled for q(i, j) = 0, i.e., no implications among the selection variables α k (i, j) and β(i, j) are posed.Assuming that the existence of an edge between genes i, j ∈ G : j > i in DAG D is strongly supported by the GI-profile, i.e. q(i, j) = 1, then the constraints in (16e) and (16g) are always fulfilled, i.e., no implications among the selection variables α k (i, j) and β(i, j) are posed by (16e) and (16g).However, constraints (16h) and (16j) pose relaxed logical implications among the selection variables α k (i, j) and β(i, j) compared to constraints (16e) to (16g).Hence given that q(i, j) = 1 and α 4 (i, j) = 1, an edge between genes i, j ∈ G : j > i in DAG D is detected if it is allowed by the pattern of hierarchical relationship classes.Constraints (16k) to (16l) along with (12g) and the slack term in ( 14) model a minor modification of condition E 4 where we not only detect all necessary edges, but also optional edges given that their existence is strongly supported by the GI-profile.Furthermore, the functionality of constraints (16k) to (16l) can be explained with the same line of argument as used to elucidate constraints (16e) to (16j).Denote L c,2 and L c,3 as the sets of topology constraints that model the logical coupling among the selection variables α k (i, j), α k (i, l), α k (j, l) and β(i, j) for k, k , k ∈ K and i, j, l ∈ G : j > i > l and i, j, l ∈ G : j > l > i respectively.Then the full set of coupled constraints of the selection variables α k (i, j) and β(i, j) is given by where we again refer the interested reader to [32] for a detailed description of L c .We obtain an estimate E GI of the true set of edges E D of DAG D based on the computed set of edge selection variables β(i, j) of program O GI-GENIE where we infer the directionality of the edges according to A OGI-GENIE that is the most consistent pattern of hierarchical relationship classes given the observed SK, DK and GI-profile data.Note that all reporter node edges are computed according to our proposed reporter node edge detection policy as given in Tab.II.Since the reporter node is an artificial node in the concept of a DAG, there is no GI-profile similarity data ρ(i, R) ∀i ∈ G and thus, no edge selection variable β(i, R) ∀i ∈ G according to 11.

V. SEQUENTIAL SCALABILITY TECHNIQUE
Due to the combinatorial nature of problems O GENIE and O GI-GENIE , the GENIE algorithm and GI-GENIE algorithm, respectively, cannot be applied to the data of large sets of genes G, since the complexity of problems O GENIE and O GI-GENIE grows exponentially with the number of genes and both problems become NP-hard.In order to nevertheless obtain statistically stressable statements about the interactions among genes in a large set of genes G, we propose the sequential-scalability (SEQSCA)-technique which is based on the GENIE-algorithm and the GI-GENIE algorithm, respectively.In particular, we create a sequence of S subsets {G s } S 1 of the full set of genes G, i.e., G s ⊂ G, ∀s ∈ {1, ..., S}, where we estimate the topology E D,s of each DAG D s , underlying the data of the subset of genes G s , by the GENIE or GI-GENIE-algorithm, respectively, in order to translate the estimated topology E D,s of DAG D s into the corresponding adjacency matrix M s for each s ∈ {1, ..., S}.Based on the computed sequence of adjacency matrices {M s } S 1 , we iteratively compute the reliability matrix M ∈ [0, 1] N ×N of the full set of genes G in such a way, that each entry [M ] i,j∈G describes the empirical probability of an edge to exist between genes i, j ∈ G, i.e., the empirical probability that genes i, j ∈ G directly interact with each other, where a value of 0 means that there is an interaction between the considered pair of genes with probability 0 and a value of 1 means that the considered pair of genes interacts with probability 1.In particular, in each iteration s we consider a subset G s of size N S |G| of the full set of genes G, where each gene of G s is selected from G without replacement with equal probability.Based on the selected subset G s , we compute in each iteration s an estimate E D,s of the true topology of DAG D s , underlying the observed data of the genes in subset G s , by the GENIE or the GI-GENIE-algorithm, respectively.Furthermore, the topology estimate E D,s of DAG D s is translated into the corresponding adjacency matrix M s .The update of the reliability matrix for iteration s is computed according to Eq. ( 18) (18) with M (s) being the N × N reliability matrix at iteration s, κ i ∈ {1, ..., N S } ∀i ∈ G s , ∪ i κ i = {1, ..., N S } and κ i < κ j for all i < j.Finally, we obtain the reliability matrix M of the full set of genes G by normalizing each entry M (S) i,j i, j ∈ G by n i,j that is the frequency of how often detecting an edge between genes i and j has been considered.

TABLE III: Summary of the proposed SEQSCA-algorithm
In Tab.III, we have summarized the SEQSCA-technique.Finally, by means of the SEQSCA-technique, we are able to provide statistically stressable statements about the interactions among the genes from a large set G by using the GENIE or GI-GENIE algorithm, respectively, in a sequential fashion.

VI. SIMULATION RESULTS
In this section, we first demonstrate the performance of the GENIE-algorithm and the GI-GENIE-algorithm with respect to conventional techniques for simulated data and further provide statistically stressable statements on the interactions among the genes from a large set of genes based on real data using the proposed SEQSCA-technique.

A. Synthetic Data Results
We have generated the ideal SK phenotypes R(i) ∈ R for all i ∈ G as well as the ideal DK phenotypes R(i, j) ∈ R for all i, j ∈ G :j > i according to the model of [7] as displayed in Fig. 2, where we have corrupted the ideal SK and DK phenotypes by independently and identically distributed zero-mean Gaussian noise with variance σ 2 .Furthermore, the GI-profile similarity data ρ(i, j)∀i, j ∈ G : j > i has been generated on the basis of the SK and DK phenotypes.We compare both the GENIE-algorithm and the GI-GENIE-algorithm with the well known GI-profile similarity approach ( [28], [7]), where the Pearson correlation between the GI-profiles of genes i and j is computed and an edge in the DAG is detected if the Pearson correlation is above a pre-defined threshold t corr , where the directionality is inferred from the selection variable α k (i, j) corresponding to the least mismatch model µ k (i, j).Furthermore, we compare our proposed methods with the solution of program O GENIE without considering set L as a constraint, which means simply classifying each pair i, j to the least mismatch scoring hierarchical relationship class based on the SK and DK phenotypes R(i) and R(i, j) respectively, without ensuring that the resulting pattern of hierarchical relationship classes represents a valid DAG.For the simulations, we consider a total of 10 genes in order to limit the Monte-Carlo simulation time.In Fig. 7 we display the false detection percentage of edges P ed in the detected DAG normalized to the true number of edges |E D | as defined in Eq. ( 19) versus the SNR.
In Fig. 8 we display the percentage of undetected edges P mis in the detected DAG normalized to the true number of edges |E D | as defined in Eq. ( 20) versus the SNR, i.e., In Fig. 7 we observe that in the low SNR regime, the Pearson correlation based method performs best in terms of false detection percentage of edges P ed , however it fails to improve performance with increasing SNR, because for correct directionality information of the edges this approach relies on the hierarchical relationship classes detected by method O GENIE without considering L. Especially in the high SNR regime, the proposed GENIE and GI-GENIE methods clearly outperform program O GENIE without the topology ruleset L and approach and respectively reach the performance of the Pearson correlation method.However, the very good performance of the Pearson correlation method in terms of false detection percentage of edges P ed according to Eq. ( 19) comes at the cost of a rather poor performance in terms of the percentage of undetected edges P mis according to Eq. (20) as can be seen in Fig. 8.In particular, in terms of the percentage of undetected edges P mis all of the proposed methods outperform the Pearson correlation method.Note that in the high SNR regime, the GI-GENIE of combining SK, DK and GI-profile similarity data yields the best of both worlds, i.e. it shows an equivalent performance as the Pearson correlation method in terms of false detection percentage of edges P ed , as well as an improvement of the strong performance of the GENIE method in terms of the percentage of undetected edges P mis .

B. Real Data Results
In this section, we apply the above mentioned SEQSCAalgorithm to the dataset of [29], in order to yield statistically stressable statements about the interactions among the genes considered in [29].In [29], the organism under study is yeast and the cell process measured is the colony size as a proxy of fitness.For the sake of computation time, we restrict the full set of genes under study in [29], i.e., G, to contain the first 200 genes of the query list of [29].Furthermore, the length of the sequence of subsets S is set to 50000 and the subset size N S of each selected set G s in the SEQSCA-algorithm is set to 10.

Fig. 5 :
Fig. 5: Left: Original DAG D a with corresponding set of hierarchical relationship classes A Da ; Right: Reconstruction Da of DAG D a based on A Da

Fig. 6 : 4 ( 3 , 4 ) = 1 .
Fig. 6: Example DAG D R to elucidate the functionality of the RHS of condition E R of Tab.II