![]() | ![]() |
Formats:
|
||||||||||||||||||||
Copyright © 2008 The Author(s) Identification of OBO nonalignments and its implications for OBO enrichment Department of Pharmacology, University of Colorado at Denver, MS 8303, RC-1 South, 12801 East 17th Avenue, L18-6400A, P.O. Box 6511, Aurora, CO 80045, USA Associate Editor: Alfonso Valencia *To whom correspondence should be addressed. Received May 15, 2007; Revised March 29, 2008; Accepted April 16, 2008. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract Motivation: Existing projects that focus on the semiautomatic addition of links between existing terms in the Open Biomedical Ontologies can take advantage of reasoners that can make new inferences between terms that are based on the added formal definitions and that reflect nonalignments between the linked terms. However, these projects require that these definitions be necessary and sufficient, a strong requirement that often does not hold. If such definitions cannot be added, the reasoners cannot point to the nonalignments through the suggestion of new inferences. Results: We describe a methodology by which we have identified over 1900 instances of nonredundant nonalignments between terms from the Gene Ontology (GO) biological process (BP), cellular component (CC) and molecular function (MF) ontologies, Chemical Entities of Biological Interest (ChEBI) and the Cell Type Ontology (CL). Many of the 39.8% of these nonalignments whose object terms are more atomic than the subject terms are not currently examined in other ontology-enrichment projects due to the fact that the necessary and sufficient conditions required for the inferences are not currently examined. Analysis of the ratios of nonalignments to assertions from which the nonalignments were identified suggests that BP–MF, BP–BP, BP–CL and CC–CC terms are relatively well-aligned, while ChEBI–MF, BP–ChEBI and CC–MF terms are relatively not aligned well. We propose four ways to resolve an identified nonalignment and recommend an analogous implementation of our methodology in ontology-enrichment tools to identify types of nonalignments that are currently not detected. Availability: The nonalignments discussed in this article may be viewed at http://compbio.uchsc.edu/Hunter_lab/Bada/nonalignments_2008_03_06.html. Code for the generation of these nonalignments is available upon request. Contact: mike.bada/at/uchsc.edu 1 INTRODUCTION Several efforts in recent years have focused on the semiautomatic addition of links between existing terms in the Open Biomedical Ontologies (OBOs) through the creation of formal definitions of these terms using more atomic terms, a process to which we refer as ontology enrichment. Of note, the Gene Ontology Next Generation (GONG) project first used the description-logic-based language DAML+OIL to formally define 250 Gene Ontology (GO) metabolism terms using MeSH terms (Wroe et al., 2003), and later OWL to formally define a much larger number of GO metabolism, binding and transport terms again using MeSH terms (Aranguren, 2004); this project has since evolved into the more general Biological Ontology Next Generation (BONG), which currently exists as a plugin to the Protege ontology editor. The Obol effort uses a series of Prolog production rules that can be used to decompose a given matching GO term into an Aristotelean genus (category) and one or more differentiae (necessary and sufficient conditions that differentiate the term from other terms of the same genus); the Gene Ontology Consortium is currently using Obol to generate Aristotelean definitions of OBO terms that refer to other OBO terms (Mungall, 2004). In our frame-based Protege ontology-enrichment effort, we have created over 9600 assertions linking terms in the GO (The Gene Ontology Consortium, 2000), Chemical Entities of Biological Interest (ChEBI) ontology (Degtyarenko, 2003), and the Cell Type Ontology (CL) (Bard et al., 2005); these base assertions have been integrated into this set of ontologies such that each assertion is consistent with all assertions made at more general levels (Bada and Hunter, 2007). Both GONG and Obol have been able to take advantage of associated reasoners; for the former, an OWL reasoner can be used, while for the latter, the Aristotelean definitions can be imported into OBO-Edit (www.oboedit.org), the primary tool in which OBOs are developed, and its associated reasoner invoked. A great advantage of using such a reasoner is its ability to make new inferences derived from the added formal term definitions. For example, in the second published GONG study, using the newly added formal defintions for the GO molecular function (MF) terms neurotransmitter binding and glutamate binding (which use the MeSH terms Neurotransmitters and Glutamates, respectively), the OWL reasoner inferred that neurotransmitter binding subsumes glutamate binding, a link absent at that point in GO. However, both GONG/BONG and Obol/OBO-Edit require that these definitions use necessary and sufficient conditions in order for these inferences to be made. This is a strong requirement that does not hold bidirectionally in many, if not most cases: it is necessary and sufficient that catecholamine transport is a transport that results in the directed movement of a catecholamine. However, the semantics of OWL or OBO say that, for an existential restriction expressed for a subject class A linking it to an object class B via property p, each instance of A must have at least one value from B for p. Since we cannot say that every catecholamine takes part in a catecholamine-transport process, it is not even possible to make this a necessary assertion. Consequently, using terms from these two terminologies that have been linked, these new subsumptive inferences can only be made between subject terms for which necessary and sufficient definitions can be created (e.g. substance-transport terms) and not with the object terms (e.g. the substances that are being transported) used in these definitions. The inferences that are made by these reasoners point to what we call nonalignments—subsets of terms that are linked (other than via is_a), but that are not aligned in that the terms of one side of the links are linked by subsumption while the terms of the other side are not. (The nonalignments we identify all consist of subject terms that are subsumptively linked and object terms that are not subsumptively linked.) For example, as can be seen in Figure 1
In this example, with necessary and sufficient definitions of chlorinated hydrocarbon metabolism and 1,3-dichloro-2-propanol metabolism in terms of chlorohydrocarbons and 1,3-dichloro-2-propanol, respectively, these reasoners would point to this nonalignment through the suggestion of an is_a link from 1,3-dichloro-2-propanol metabolism to chlorinated hydrocarbon metabolism. However, if instead 1,3-dichloro-2-propanol was not subsumed by chlorohydrocarbons and 1,3-dichloro-2-propanol metabolism was subsumed by chlorinated hydrocarbon metabolism, these reasoners would not be able to suggest an is_a link from 1,3-dichloro-2-propanol to chlorohydrocarbons, because the required necessary and sufficient definitions of 1,3-dichloro-2-propanol and chlorohydrocarbons in terms of 1,3-dichloro-2-propanol metabolism and chlorinated hydrocarbon metabolism, respectively, could not be created using these terms in an ontologically valid way. This is not a fault of OWL or of Aristotelean formalism; these representational systems have strict semantics, to which ontologists should adhere when making assertions. It is just that reasoners relying solely on necessary and sufficient definitions will likely miss many of these nonalignments because ontologically valid definitions cannot be created, and it is desirable that as many of these nonalignments as possible be rectified. We have implemented our ontology-enrichment project in Protege-Frames (mainly because this is part of a larger frame-based effort). There is no associated reasoner to Protege-Frames, so we implemented a simple reasoning system to ensure the global consistency of the added assertions in our set of integrated ontologies. It is this same reasoning system we use here to discover nonalignments in the constituent ontologies through structural analysis of the assertions we added in our previous work (Bada and Hunter, 2007). Reasoning over these assertions, we were able to discover nearly 1700 instances of nonredundant nonalignments, 39.8% of which likely could not be identified via suggested inferences by OWL or OBO-Edit reasoners due to the fact that the required necessary and sufficient definitions could not be created in an ontologically valid way using these terms of the linked ontologies. We propose that those nonalignments for which such inferences cannot be made by these reasoners also be examined to increase consistency among the linked ontologies. 2 METHODS The method by which we ensure the global consistency of the set of assertions to the ontologies is through an analysis of the object classes of the properties of the classes. Specifically, this analysis relies on the fact that the object expression (here, an object class or union of object classes) of a property at a given class level must be subsumed by the object expression of the property at higher (i.e. more general) class levels. Furthermore, the object expression of a given property must be subsumed by the object expression at higher property levels. Put more simply, object expressions should monotonically narrow as one descends to more specific classes and slots. In order for each assertion to be consistent with each assertion made at more general levels, any object class of a property at a given class level that was not subsumed by an object class at a higher class and/or property level such that these conditions were satisfied was appropriately propagated up the class and/or slot hierarchies. The full details of this procedure can be read in the initial publication of our OBO-enrichment work (Bada and Hunter, 2007). Our methodology for discovering ontology nonalignments follows from this global consistency enforcement. For each base assertion (represented as a triple of a subject class, property and object class), each of the class's direct superclasses is checked to see if it is within the domain of the property. If so, it is checked if at least one of the object classes of the property of the superclass subsumes the object class of the property of the base assertion. If there is no such subsuming class, this is a nonalignment between the subject and object classes of the two assertions. If there is such a subsuming class at the level of this direct superclass, the same examination is performed for each of its direct superclasses. This continues recursively until either all direct superclasses are outside of the domain of the given property or a root of the ontology is reached. This can be made clearer with a simple but real example. Consider the base assertion 1,3-dichloro-2-propanol is metabolized in 1,3-dichloro-2-propanol metabolism, which states that 1,3-dichloro-2-propanol can be metabolized in a 1,3-dichloro-2-propanol-metabolism process. The sole direct superclass of 1,3-dichloro-2-propanol-chlorohydrocarbons is obtained. It is checked that chlorohydrocarbons is within the domain of the slot is metabolized in, which is the case. The set of allowed classes of is metabolized in at the level of chlorohydrocarbons is then obtained, which is the single class chlorinated hydrocarbon metabolism (which indicates that a chlorohydrocarbon can be metabolized in a chlorinated-hydrocarbon-metabolism process). The set of allowed classes at the superclass level (the one-member set chlorinated hydrocarbon metabolism) should subsume the set of allowed classes at the base-assertion level (the one-member set 1,3-dichloro-2-propanol metabolism). However, it does not; this is thus a nonalignment. Figure 1 For each discovered nonalignment, we extracted four entities into which the nonalignment can be distilled: the subject class of the base assertion, the superclass of this subject class at the level of which the nonalignment was found, the object class of the base assertion (i.e. the allowed class of the assertion), and the set of object classes at the level of the superclass (i.e. the set of allowed classes for the slot at the level of the superclass). There is only one object class for each base assertion, while there can be more than one object class at the level of the superclass, since monotonicity as one travels down the class hierarchy is preserved as long as an object class of a property of a class is subsumed by at least one object class of the property of the superclass. Figure 2
Each stored nonalignment represented by the four summarizing entities was written out to a text file in the following format:
This neatly summarizes the nonalignment by stating that the subject class of the base assertion is subsumed by the superclass, but the object class of the base assertion is not subsumed by any of the object classes at the level of the superclass. Thus, the nonalignment illustrated in Figure 1
Such a representation makes clear the essence of the nonalignment—that 1,3-dichloro-2-propanol is subsumed by chlorohydrocarbons (in ChEBI), but 1,3-dichloro-2-propanol metabolism is not subsumed by chlorinated hydrocarbon metabolism (in the GO biological process (BP) ontology). Due to the extensive multiple inheritance of the component ontologies, it is possible to discover redundant nonalignments or even the same nonalignment more than once. Only nonredundant nonalignments were stored and exported, as examining redundant nonalignments to assess whether there are true semantic discrepancies entails additional, unnecessary effort and biases statistics. Two nonalignments are redundant if the resolution of the one also results in the resolution of the other. Consider the following two nonalignments:
These two nonalignments are redundant with respect to one another. If the first nonalignment was resolved by adding an is_a link from benzoate transport to anion transport, the second nonalignment would also be resolved since this link addition would result in the implication that benzoate transport is a type of ion transport; thus, the second nonalignment would also be resolved. In cases of redundancy, we have kept the more specific nonalignment; thus, for the example above, only the first nonalignment was stored. The relevant relationships between the terms of these two nonalignments are illustrated in Figure 3
The March 6, 2008 versions of GO, ChEBI and CL were used for this study. These base ontologies were previously enriched with 10 270 additional assertions linking the component terms using 50 specific relationships detailed in the initial publication of our OBO-enrichment work. It is important to note that although this study relies upon the links we created in our previously published ontology-enrichment work, our methodology for nonalignment identification is not limited by the specific relationships we chose to use. (The quality of the nonalignments, however, is dependent on the quality of the links that the methodology analyzes.) In fact, we have recently generated nonalignments based on links created by members of the OBO Consortium and have begun a discussion of ways of managing these nonalignments. 3 RESULTS Using this methodology resulted in a total of 1938 nonredundant nonalignments within the set of GO, ChEBI and CL; this set of nonalignments can be examined at http://compbio.uchsc.edu/Hunter_lab/Bada/nonalignments_2008_03_06.html. To better characterize their distribution, we clustered the nonalignments according to the ontologies that were the sources of the subject and object terms of the nonalignments. For example, the nonalignment illustrated in Figure 1 Table 1 lists the number of assertions and nonredundant nonalignments for each directed pairwise combination of ontologies for which there is at least one corresponding assertion. For example, there are 2710 total added assertions from a GO BP term to another GO BP term, and 94 nonredundant nonalignments were identified from these assertions. The numbers of nonalignments are largely symmetric. The biggest discrepancy is that between the 598 nonalignments identified from the BP-to-ChEBI assertions and the 1022 nonalignments identified from the ChEBI-to-BP assertions.
Table 2 lists the numbers of assertions and nonredundant nonalignments and the ratio of nonalignments to assertions for each undirected pairwise combination of ontologies for which there is at least one corresponding assertion. The lowest ratios of nonalignments to assertions are those between BP terms and MF terms (0.02), between BP terms and BP terms (0.034), between BP terms and CL terms (0.064) and between cellular component (CC) terms and CC terms (0.065). This suggests that terms within these pairs of ontologies are relatively well-aligned. The highest ratios of nonalignments to assertions are those between ChEBI terms and MF terms (0.306), between BP terms and ChEBI terms (0.2680) and between CC terms and MF terms (0.19). This suggests that these pairs of ontologies are relatively not aligned well, which agrees with our empirical observations in our ontology-enrichment work that ChEBI is relatively not aligned well with GO.
Another way to characterize the nonalignments is whether the subject terms of the nonalignments are the more complex terms or the more atomic terms. For example, in the example illustrated in Figure 1 4 DISCUSSION 4.1 Evaluation and management of nonalignments In this study, we have used the term nonalignment to refer to two analogous sets of entities such that one entity is subsumed by the other in the first pair while one entity is not subsumed by the other in the second pair. Upon examination of a given nonalignment, if it is determined that the pairs of entities should be aligned, we term this a discrepancy. Not all nonalignments are discrepancies; Figure 4
If a given nonalignment is assessed to be a discrepancy, there are two ways to resolve it. The first is to add an is_a link from the object term of the base assertion to the object term at the superclass level (or, in the case of multiple object terms at the superclass level, to at least one of the object terms). For example, we assert the nonalignment illustrated in Figure 1 The second way to resolve a discrepancy is the removal of the is_a link from the subject term of the base assertion to the subject term at the superclass level. In Figure 1 In the case of a nonalignment that is not a discrepancy, there is still a logical inconsistency, and action should be taken to rectify the inconsistency. A general, automatic solution to such an inconsistency is the propagation of the object class of the base assertion up to the superclass level; this is the type of upward propagation we previously extensively employed in our ontology-enrichment work so as to ensure the global consistency of the ontologies when adding enriching assertions. For example, in Figure 4
A more elegant solution in this example is to instead add the GO CC term extracellular matrix part as an allowed class of results in binding of at the level of extracellular matrix binding; the semantics of this are that an extracellular-matrix-binding-process results in the binding of an extracellular matrix or an extracellular-matrix part, which seems to be a valid definition for extracellular matrix binding. The original nonalignment would be resolved in that laminin-1 complex at the level of laminin-1 binding would be subsumed by extracellular matrix part at the level of extracellular matrix binding. Though this is semantically closer to the definition of extracellular matrix binding, it is also more manual and thus more labor-intensive (which is not to say that it should not be done). Our methodology could be used to either automatically upwardly propagate the specific classes so as to make the ontologies consistent, as described in the previous paragraph, or it could be used to automatically make suggestions to the ontology curators, who would decide to add either the specific terms or more general terms (such as extracellular matrix part). Of total of 1938, 100 nonredundant nonalignments were randomly selected for an evaluation. Out of these 100, 96 were assessed to be discrepancies; that is, we assert that they should be similarly aligned through the addition or removal of an is_a link, corresponding to the first two types of resolution. The remaining four nonalignments are analogous to the example seen in Figure 4 4.2 Comparison to other projects Both the GONG/BONG and Obol projects have been focusing on creating formal defintions of OBO terms using more atomic OBO terms in necessary and sufficient conditions. These definitions can then be reasoned over (by an OWL reasoner for the former and by the Obol reasoner or the OBO-Edit reasoner for the latter), which can make new inferences using the definitions. However, the reasoner can only make new inferences using the linked terms if ontologically valid necessary and sufficient definitions can be constructed. The type of inferences that can be made largely corresponds to the absent subsumptions in the type of nonalignments in which the subject terms are more atomic than the object terms. Figure 1 However, these projects likely could not predict new inferences for many if not all of the nonalignments in which the object terms are more atomic than the subject terms presented here, because the required necessary and sufficient definitions likely could not be made in an ontologically valid manner. Figure 6
It can be argued that a reasoner in one of these other projects can infer an is_a link between chemicals by creating ontologically valid necessary and sufficient definitions in terms of, for example, parts or functions of these chemicals. However, this presupposes that not only such a more basic ontology but the required specific object terms exist. Such an approach laboriously requires the creation of an entirely new set of assertions, and there may be recursion in that the more basic object terms may not exist in a hierarchical relationship, thus once again preventing the inference of the is_a link between the more composite subject terms. Our approach only requires one set of assertions and their automatically generated inverse assertions and relies on a different kind of reasoning than the deduction used by reasoners in the aforementioned projects. However, we assert that a functionally equivalent methodology could be implemented, e.g. using an OWL API, without the use of explicitly represented inverse assertions. We have found that 39.8% of the total nonredundant nonalignments identified in this study are those in which the subject terms of the nonalignments are built up from the object terms; these correspond to the instances in which it is difficult to produce the required ontologically valid necessary and sufficient conditions, in which case new inferences by the aforementioned reasoners cannot be made using the linked terms of the ontologies. Our methodology essentially uses subsumptive analysis of term attributes toward quality assurance of ontologies, a technique which has been used by others in the field. The BERNWARD system reconstructed sets of medical concepts into hierarchies based on five subsumptive principles, but it is different in that it takes into account partonomy in its subsumption without resolution of the type we perform as in Figures 4 We are not calling for the abolition of the use of the OWL, Obol or OBO-Edit reasoners. Rather, we assert that functionality that identifies the type of nonalignments for which inferences cannot be made (due to absence of required necessary and sufficient conditions) can and should be built into ontology-enrichment tools such as BONG. A methodology analogous to ours appears possible through the use of an OWL API through a subsumptive analysis of directly asserted and inherited property-value pairs. Consider Figure 7 5 SUMMARY We have described a methodology by which we have identified over 1900 instances of nonredundant nonalignments between terms from GO, ChEBI and CL. Analysis of the ratios of nonalignments to assertions from which the nonalignments were identified suggests that BP–MF, BP–BP, BP–CL and CC–CC terms are relatively well-aligned, while ChEBI–MF, BP–ChEBI and CCMF terms are relatively not aligned well. We propose that three ways to resolve an identified nonalignment are the addition of an is_a link between the object terms, the removal of an is_a link between the subject terms and the upward propagation of the object term to the superclass level. Many of the 39.8% of these nonalignments in which the object terms are more atomic than the subject terms likely are not currently examined in other ontology-enrichment projects due to the fact that the necessary and sufficient conditions required for the inferences likely could not be added, as they are semantically too strong. We assert that a methodology analogous to ours could be implemented using an OWL API in ontology-enrichment tools in order to identify such nonalignments that are currently not examined. ACKNOWLEDGEMENTS Both authors have been supported by NIH grants 5R01 LM008111-2 and 5R01 DE15191-04. Conflict of Interest: none declared. REFERENCES
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||
Comp Funct Genomics. 2004; 5(6-7):509-20.
[Comp Funct Genomics. 2004]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Genome Biol. 2005; 6(2):R21.
[Genome Biol. 2005]Artif Intell Med. 2007 Mar; 39(3):183-95.
[Artif Intell Med. 2007]