Biomedical Vocabulary Alignment at Scale in the UMLS Metathesaurus

With 214 source vocabularies, the construction and maintenance process of the UMLS (Unified Medical Language System) Metathesaurus terminology integration system is costly, time-consuming, and error-prone as it primarily relies on (1) lexical and semantic processing for suggesting groupings of synonymous terms, and (2) the expertise of UMLS editors for curating these synonymy predictions. This paper aims to improve the UMLS Metathesaurus construction process by developing a novel supervised learning approach for improving the task of suggesting synonymous pairs that can scale to the size and diversity of the UMLS source vocabularies. We evaluate this deep learning (DL) approach against a rule-based approach (RBA) that approximates the current UMLS Metathesaurus construction process. The key to the generalizability of our approach is the use of various degrees of lexical similarity in negative pairs during the training process. Our initial experiments demonstrate the strong performance across multiple datasets of our DL approach in terms of recall (91-92%), precision (88-99%), and F1 score (89-95%). Our DL approach largely outperforms the RBA method in recall (+23%), precision (+2.4%), and F1 score (+14.1%). This novel approach has great potential for improving the UMLS Metathesaurus construction process by providing better synonymy suggestions to the UMLS editors.


Motivation.
Developed by the National Library of Medicine, the UMLS (Unified Medical Language System) Metathesaurus [4] is a terminology integration system constructed by integrating biomedical terms from over 200 source vocabularies and organizing them into concepts consisting of clusters of synonymous terms from the source vocabularies. The basic building block of the Metathesaurus, also known as an "atom," is a term from a source vocabulary.
In practice, synonymous atoms are assigned the same concept unique identifier (CUI). Such concepts can be thought of as equivalent mappings from an ontology alignment perspective. In fact, a subset of three source vocabularies from the Metathesaurus (NCI, FMA, and SNOMED CT) have been used by the Ontology Alignment Evaluation Initiative (OAEI) since 2011 [2] and in related efforts [18,21]. The OAEI aims to compare ontology matching systems on defined test cases. The OAEI organizers have used UMLS synonymy information from the Metathesaurus concepts as reference mappings for biomedical ontologies integrated in the UMLS. Although the Metathesaurus construction process may have similarities to ontology alignment, not all source vocabularies in the Metathesaurus are well-defined ontologies formally represented in OWL. Therefore, in order to avoid any misunderstanding especially in the context of the Semantic Web, we will continue to use the term vocabulary instead of ontology when referring to source vocabularies in the Metathesaurus.
The Metathesaurus construction process is based on the assumption that specially trained human experts can determine synonymy among atoms with high accuracy from the candidates obtained from a lexical similarity model and semantic pre-processing. However, manual curation is error-prone as pointed out by [7,8,19,30,31]. Given the current size of the Metathesaurus with 15.5 million atoms from 214 source vocabularies grouped into 4.28 million concepts, its maintenance process is costly, time-consuming, and extremely demanding on the human expert editors. On the other hand, with the enormous knowledge accumulated over 30 years of manual curation, the existing Metathesaurus provides ample material for supervised learning.
Supervised learning approaches with word embeddings have shown promising results in previous Metathesaurus-related experiments, confirming that they perform reasonably well for the alignment of a selected subset of source vocabularies in the Metathesaurus [21,45,47,48]. In this work, we propose to use these techniques to predict synonymy from all source vocabularies in the Metathesaurus. Aligning all 214 vocabularies, with their large size and vast diversity, introduces new challenges compared to the OAEI task of aligning a few vocabularies.
In this work, we are mostly interested in assessing the feasibility of using deep learning (DL) techniques for terminology integration at scale in the UMLS Metathesaurus. Therefore, this investigation is not primarily technical and does not have the usual features of a DL benchmarking study. Instead, we investigate whether a simple DL approach can outperform the editorial rules established for building the UMLS Metathesaurus.

Objectives.
Our primary objective is to develop a scalable supervised learning approach to improve synonymy predictions compared to the current lexical and semantic processing in the Metathesaurus. While existing ontology alignment approaches [2,18,21,47] have been successful on small subsets of 3 to 8 source vocabularies, our goal is to develop an approach that scales not only to large numbers of source vocabularies, but also to diverse source vocabularies, such as those in the Metathesaurus. We expect such a supervised learning approach to outperform a rule-based approach (RBA) that approximates the lexical and semantic processing used in the current Metathesaurus construction process. We will explain the rule-based approximation in Section 3.2.
Our secondary objective is to investigate the extent to which lexical similarity between the atoms used for training influences the performance of our algorithm. Intuitively, it seems more difficult to predict the absence of synonymy between lexically-similar atoms than between lexically-different atoms. We hypothesize that learning from pairs with different degrees of lexical similarity will help improve the performance and generalization of the algorithm.

Contribution.
Our contributions include:

• The first attempt to define and address terminology integration at the full scale and diversity of the UMLS Metathesaurus using a learning-based approach.
• A reusable rule-based baseline approximating the current lexical and semantic processing used in the UMLS for comparing the performance of our algorithm against the current UMLS building process.
• A generalizable supervised learning approach that is shown to largely outperform the current lexical and semantic processing used in the UMLS Metathesaurus construction process.
• A confirmed hypothesis that the variety of degrees of lexical similarity in negative pairs from the training set is the key to the generalizability of the algorithm.
The remainder of the paper is organized as follows. Section 2 provides relevant background knowledge about the Metathesaurus. Section 3 describes the synonymy prediction and the rule-based approximation as a proxy to the current Metathesaurus construction process. Section 4 describes our supervised learning approach. In section 5, we present our experiments and discuss their results. In section 6, we discuss related work. Section 7 concludes the paper.
approximation of the Metathesaurus building process. We will use the examples in Table 1 to illustrate the concept structure in the Metathesaurus.
As mentioned earlier, key to the UMLS Metathesaurus are the notions of atom (a term from a specific source vocabulary, identified with a specific source concept identifier) and concept (grouping of synonymous atoms). While the Metathesaurus preserves source concept identifiers (SCUI), it also assigns its own identifiers to atoms (AUI), unique strings (SUI), normalized strings (LUI) and concepts (CUI). Table 1 shows examples of atoms and the various types of identifiers they were assigned. Additionally the Metathesaurus editors assign semantic types to each UMLS concept to denote the broad semantics of each concept. Of note, semantic types are not assigned to AUIs, but to CUIs instead. However, it is possible to approximate the semantics of an atom by inferring it from that of the source vocabulary (for semantically homogeneous vocabularies, such as anatomy ontologies), or the top-level subdivisions of a vocabulary (for broad-coverage vocabularies).
In the UMLS Metathesaurus, the information available to the construction process consists of input tuples of the form (str, src, scui, sg), where str is the original string from the source src, scui is the optional identifier of that string in the source src, and sg is a semantic group reflecting the semantics of the string in the source. Of note, for this experiment, we manually assigned one semantic group to each vocabulary and to the top-level subdivisions of heterogeneous vocabularies. Each atom inherits its semantic group from its source or from its high-level ancestor(s).
Let $T = (S_{STR}, S_{SRC}, S_{SCUI}, S_{SG})$ be the set of all input tuples in the Metathesaurus, where $S_{STR}$ is the set of all strings, $S_{SRC}$ is the set of all sources, $S_{SCUI}$ is the set of all source concept unique identifiers, and $S_{SG}$ is the set of all semantic groups. The tuples $t_1$ = ("Headache", "MSH", "M0009824", "Disorders") and $t_3$ = ("Cranial Pains", "MSH", "M0009824", "Disorders") are instances of $T$. Given the input tuple pairs (str, src, scui, sg) and (str′, src′, scui′, sg′) as instances of $T$ from source vocabularies, the Metathesaurus defines several identifier types for characterizing atoms during the integration process.

AUI and $m_a$ link mapping function.
The basic building blocks or "atoms" from which the Metathesaurus is constructed are the concept names or strings from each of the source vocabularies. Every occurrence of a string in each source vocabulary is assigned a unique atom identifier (AUI). When the same string appears in multiple source vocabularies, for example, "Cephalodynia" appearing in both MSH and SNOMEDCT_US, they are assigned different AUIs "A26628141" and "A2957278" as shown in Table 1.
(D1) Let $S_{AUI}$ be the set of all AUIs in the Metathesaurus. Let $m_a$ be the function that maps concept string $str \in S_{STR}$ from source vocabulary $src \in S_{SRC}$ to a new AUI $a \in S_{AUI}$ such that $a = m_a(str, src)$.

SUI and $m_s$.
These AUIs are then linked to a unique string identifier (SUI) to represent occurrences of the same string. Any lexical variation in character set, upper-lower case, or punctuation will result in a separate SUI. For example, the strings "Headache" and "Headaches" are linked to two different SUIs.
(D2) Let $S_{SUI}$ be the set of all SUIs in the Metathesaurus. Let $m_s$ be the function that maps an AUI $a \in S_{AUI}$ to a new SUI $s \in S_{SUI}$ such that $s = m_s(a)$.

LUI and $m_l$.
All the English lexical variants of a given string (detected using the Lexical Variant Generator tool [26]) are associated with a single normalized term (LUI). The LVG tool recognizes that the two strings "Headache" and "Headaches" only differ by minor lexical variation and associates them with the same LUI "L0018681".
(D3) Let $S_{LUI}$ be the set of all LUIs in the Metathesaurus. Let $m_l$ be the function that maps a SUI $s \in S_{SUI}$ to a new LUI $l \in S_{LUI}$ such that $l = m_l(s)$.

CUI.
Lexical similarity forms the basis for suggesting synonymy in the UMLS Metathesaurus. However, all atoms that share the same LUI are not necessarily synonymous. For example, the string "nail" can denote both an anatomical structure and a surgical device. Table 1 illustrates how synonymous terms are clustered into the same concept (CUI = "C0018681"). Note that we do not define the link mapping from AUI to CUI here because this link is unavailable to the task and cannot be used in the prediction function.

SCUI and $m_u$.
Each AUI is optionally associated with one identifier provided by its source (SCUI). Several strings including "Headache", "Headaches", "Cranial Pains", and "Cephalodynia" are associated with the same SCUI, "M0009824", from the source vocabulary MSH. SCUIs play an important role in the Metathesaurus construction process because source synonymy is very often conserved in the Metathesaurus.
(D4) Let $S_{SCUI}$ be the set of all SCUIs in the Metathesaurus. Let $m_u$ be the function that maps an AUI $a \in S_{AUI}$ to a new SCUI $u \in S_{SCUI}$ such that $u = m_u(a)$.

Semantic Group and $m_g$.
As mentioned earlier, semantic groups (or semantic types) are assigned to CUIs, not AUIs, by the Metathesaurus editors. For this reason, this information is unavailable to the task and cannot be used in the prediction function. Instead, we manually assigned semantic groups to source vocabularies or to their top-level subdivisions. All the atoms from a source vocabulary (or top-level subdivision thereof) inherit the semantic group of the source (or top-level subdivision). Most of the atoms have a single semantic group. Semantic group information is used to determine semantic compatibility among atoms defined as sharing one semantic group.
(D5) Let $S_{SG}$ be the set of all semantic groups in the Metathesaurus. Let $m_g$ be the function that maps an AUI $a \in S_{AUI}$ to a set of semantic groups $g \subset S_{SG}$ such that $g = m_g(a)$.
So far we have defined the constraint mappings for each AUI to be linked to other identifier types. Every AUI is linked to a single string STR, a single SCUI (optionally), a single SUI, a single LUI, and, most often, a single Semantic Group. Next we will show how these identifiers and mapping links can be leveraged in the rule-based approximation of the Metathesaurus construction process to derive synonymy predictions.
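The identifier links (D1) to (D5) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Registry` class, the sequential identifiers it hands out (which do not match real AUI, SUI, or LUI values), and `toy_norm`, a crude stand-in for the LVG normalization. The SCUI of the SNOMEDCT_US atom is left as `None` because it is not given in the text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class InputTuple:
    """One (str, src, scui, sg) tuple as supplied by a source vocabulary."""
    string: str
    src: str
    scui: Optional[str]
    sg: FrozenSet[str]


def toy_norm(string: str) -> str:
    """Hypothetical stand-in for LVG: lowercase, crude plural strip, sort words."""
    words = [w[:-1] if len(w) > 3 and w.endswith("s") else w
             for w in string.lower().split()]
    return " ".join(sorted(words))


class Registry:
    """Assigns illustrative identifiers and records the links D1 to D5."""

    def __init__(self) -> None:
        self.aui: Dict[Tuple[str, str], str] = {}   # (str, src) -> AUI   (D1)
        self.sui: Dict[str, str] = {}               # exact string -> SUI
        self.lui: Dict[str, str] = {}               # normalized string -> LUI
        self.m_s: Dict[str, str] = {}               # AUI -> SUI          (D2)
        self.m_l: Dict[str, str] = {}               # SUI -> LUI          (D3)
        self.m_u: Dict[str, Optional[str]] = {}     # AUI -> SCUI         (D4)
        self.m_g: Dict[str, FrozenSet[str]] = {}    # AUI -> groups       (D5)

    def m_a(self, t: InputTuple) -> str:
        """Every occurrence of a string in a source gets its own AUI."""
        key = (t.string, t.src)
        if key not in self.aui:
            aui_id = f"A{len(self.aui) + 1}"
            sui_id = self.sui.setdefault(t.string, f"S{len(self.sui) + 1}")
            lui_id = self.lui.setdefault(toy_norm(t.string),
                                         f"L{len(self.lui) + 1}")
            self.aui[key] = aui_id
            self.m_s[aui_id], self.m_l[sui_id] = sui_id, lui_id
            self.m_u[aui_id], self.m_g[aui_id] = t.scui, t.sg
        return self.aui[key]


reg = Registry()
disorders = frozenset({"Disorders"})
a1 = reg.m_a(InputTuple("Headache", "MSH", "M0009824", disorders))
a2 = reg.m_a(InputTuple("Headaches", "MSH", "M0009824", disorders))
a3 = reg.m_a(InputTuple("Cephalodynia", "MSH", "M0009824", disorders))
a4 = reg.m_a(InputTuple("Cephalodynia", "SNOMEDCT_US", None, disorders))
```

In this sketch, the two "Cephalodynia" atoms receive different AUIs but share one SUI, while "Headache" and "Headaches" receive different SUIs that map to the same LUI.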

Problem Formulation
We define the synonymy prediction task as follows. $T$ is the set of all input tuples $(S_{STR}, S_{SRC}, S_{SCUI}, S_{SG})$ from source vocabularies.
Note that here we consider the whole tuple for the prediction task instead of using the string str only. A string itself does not carry sufficient information for the task at hand; we need to know which source the string comes from and which semantics it has. This is especially useful for processing homonyms (e.g., depending on the source, "nail" can denote an anatomical structure or a surgical device, which will be indicated by the semantic group, "Anatomy" or "Device").
As ground truth for the prediction task, we use the groupings of strings into concepts in the Metathesaurus. If two strings from two different tuples are assigned the same CUI, they are synonymous. Otherwise, they are not.
A synonymy prediction task will decide if each of the tuple pairs is synonymous (or, more precisely, if the atoms in each pair are synonymous). Finding the prediction function p is the problem we address in this paper. We will describe the rule-based approach in Section 3.2 and the supervised learning approach in Section 4.

Rule-based Approximation of the Metathesaurus Construction Process
Here we formalize an approach that approximates the current Metathesaurus construction process that takes as input tuple pairs from source vocabularies. (We have confirmed with the UMLS Metathesaurus editors at the National Library of Medicine that this formalization of the Metathesaurus editorial guidelines accurately reflects the Metathesaurus construction process.) We use this approximation as a baseline in the evaluation of our supervised learning approach. We use the concepts/identifiers and functions/links described in Section 2 to show how the identifiers and links can be combined into rules for synonymy predictions.
We have defined $S_{STR}$, $S_{SRC}$, $S_{SCUI}$, $S_{AUI}$, $S_{SUI}$, $S_{LUI}$, $S_{CUI}$, and $S_{SG}$ to be the set of all strings, sources, SCUIs, AUIs, SUIs, LUIs, CUIs, and semantic groups in the Metathesaurus, respectively. We also have the link mapping functions $m_a$, $m_s$, $m_l$, $m_u$, and $m_g$ defined from (D1), (D2), (D3), (D4), and (D5) above. Next we will derive the editorial rules from the identifiers and mapping links in the Metathesaurus.
The rule-based approach reflects the following Metathesaurus construction principles:

• Synonymy asserted between atoms in a source vocabulary tends to be conserved in the Metathesaurus.
• Lexical similarity is used to identify candidates for synonymy.
• Atoms that do not share a common semantics are prevented from being recognized as synonymous and grouped into the same concept.

These principles are formalized into two rules, "source synonymy" and "lexical similarity and semantic compatibility". These rules can be combined into a disjunction and amplified through transitivity.

Source synonymy (SS) rule.-The two input tuples are synonymous if they have the same identifier in a given source (SCUI). Formally, given a tuple pair $t = (str, src, scui, sg) \in T$ and $t' = (str', src', scui', sg') \in T$, let $p_{ss}$ be the prediction function for the source synonymy rule: if $scui = scui'$ then $p_{ss}(t, t') = 1$.
Lexical similarity and semantic compatibility (LS_SC) rule.-The two input tuples are synonymous if they have the same lexical terms and semantic groups derived from the input tuples using the set of identifiers and links in the Metathesaurus. In practice, given the input as a pair of tuples, included in the lexical similarity and semantic compatibility rule are: (1) a set of axioms to derive the lexical term (lui, lui′) and semantic groups (sg, sg′) for each input tuple, and (2) the assertions that they have the same lexical term and a common semantic group. We formalize this rule using the Metathesaurus notions as follows.
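One way to write this rule with the notation of (D1) to (D5) is sketched below. It is a reconstruction from the prose description, not a quotation: the name $p_{lssc}$ is taken from the transitivity rule, and reading "a common semantic group" as a non-empty intersection of the two sets of semantic groups is an assumption.

```latex
\begin{aligned}
&\text{Given } t = (str, src, scui, sg) \in T \text{ and } t' = (str', src', scui', sg') \in T: \\
&a = m_a(str, src), \quad s = m_s(a), \quad l = m_l(s), \quad g = m_g(a), \\
&a' = m_a(str', src'), \quad s' = m_s(a'), \quad l' = m_l(s'), \quad g' = m_g(a'), \\
&\text{if } l = l' \text{ and } g \cap g' \neq \emptyset \text{ then } p_{lssc}(t, t') = 1.
\end{aligned}
```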
Rule combination (SS_LS_SC).-For the three tuple pairs at hand, the two pairs $(t_1, t_3)$ and $(t_4, t_5)$ are predicted to be synonymous by the source synonymy rule and the lexical and semantic similarity rule. The last pair $(t_1, t_5)$ is predicted to be non-synonymous by both rules. However, all these pairs share the same CUI and are considered synonymous in the Metathesaurus (ground truth). Therefore, the rule-based approach can only correctly predict two out of the three pairs above.
Since both source synonymy preservation and lexical and semantic similarity are principles used in the Metathesaurus construction process, it is legitimate to create a disjunction of the corresponding rules (i.e., SS or LS_SC).
Transitivity.-The combination rule SS_LS_SC can be further amplified by considering its transitive closure. Let $P = \{p_{ss}, p_{lssc}, p_{sslssc}, p_{trans}\}$ be the set of prediction functions. Given $t_1, t_2, t_3 \in T$, let $p_{trans}$ be the prediction function for the transitivity rule: $p_{trans}(t_1, t_3) = 1$ if $\exists\, p_1, p_2 \in P$ such that $p_1(t_1, t_2) = 1$ and $p_2(t_2, t_3) = 1$.
Note that all prediction functions in P are commutative. Changing the order of the parameters does not change the results. Section 5 will describe our experiments and evaluate this approach against the supervised learning approach described in Section 4.
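The rules of this section can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation used in the experiments: each atom is assumed to carry a precomputed LUI, the sources "SRC_X", "SRC_Y", "SRC_Z" and the LUIs "L_HYP_1", "L_HYP_2" are hypothetical, and the transitive closure is computed with a union-find over all pairs, which is quadratic and only suitable for small examples.

```python
from itertools import combinations
from typing import FrozenSet, List, NamedTuple, Optional


class Atom(NamedTuple):
    string: str
    src: str
    scui: Optional[str]
    sg: FrozenSet[str]
    lui: str  # assumed to be precomputed (e.g., by LVG normalization)


def p_ss(t: Atom, u: Atom) -> bool:
    """Source synonymy: both atoms carry the same (non-missing) SCUI."""
    return t.scui is not None and t.scui == u.scui


def p_lssc(t: Atom, u: Atom) -> bool:
    """Lexical similarity and semantic compatibility: same LUI, shared group."""
    return t.lui == u.lui and bool(t.sg & u.sg)


def p_sslssc(t: Atom, u: Atom) -> bool:
    """Disjunction of the two rules (SS OR LS_SC)."""
    return p_ss(t, u) or p_lssc(t, u)


def closure_groups(atoms: List[Atom]) -> List[int]:
    """Union-find over pairs accepted by p_sslssc; one group id per atom."""
    parent = list(range(len(atoms)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(atoms)), 2):
        if p_sslssc(atoms[i], atoms[j]):
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(atoms))]


D = frozenset({"Disorders"})
atoms = [
    Atom("Headache", "MSH", "M0009824", D, "L0018681"),
    Atom("Cranial Pains", "MSH", "M0009824", D, "L_HYP_1"),
    Atom("Headaches", "SRC_X", None, D, "L0018681"),
    Atom("nail", "SRC_Y", None, frozenset({"Anatomy"}), "L_HYP_2"),
    Atom("nail", "SRC_Z", None, frozenset({"Device"}), "L_HYP_2"),
]
groups = closure_groups(atoms)


def p_trans(i: int, j: int) -> bool:
    """Transitivity: two atoms are linked if they end up in the same group."""
    return groups[i] == groups[j]
```

Here "Cranial Pains" and "Headaches" are not linked by either rule directly, but both are linked to "Headache", so the transitive closure groups them. The two "nail" atoms share a LUI but no semantic group and stay apart.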

SUPERVISED LEARNING APPROACH
This section introduces our supervised approach for learning and predicting synonymy among Metathesaurus atoms. The general idea is to learn similarities between pairs of atoms within a concept and dissimilarities between pairs of atoms across concepts. We present the model formulation, dataset generation and neural network architecture. Table 2 provides a list of abbreviations used in the paper for a quick reference.

Problem Formulation
Supervised deep learning (DL) learns a function that maps an input to an output from examples of input-output pairs, using layers of dense networks [39]. The Metathesaurus comprises approximately 10 million English atoms, each of which is associated with a concept. One could simply train a supervised classifier to predict which concept should be assigned to a given atom. However, this would be an extreme classification task [3] because of the very large prediction space of 4.28 million concepts. Moreover, the concept is simply a "mechanism" to cluster synonymous atoms together. We are primarily interested in assessing whether two atoms are synonymous and should be labeled with the same concept, regardless of whether this concept already exists in the Metathesaurus. Hence, we formulate this problem as a similarity task. Ideally, we would like to assess similarity based not only on the lexical features of an atom, but also on its context (e.g., represented by neighboring concepts in its source vocabulary). However, in this preliminary investigation, we only rely on the term itself to determine synonymy among atoms. In practice, a fully trained model should identify and learn from the following kinds of pairs:

• Atoms that are lexically similar in nature but are not synonymous, e.g., "Lung disease and disorder" versus "Head disease and disorder", and
• Atoms that are lexically dissimilar but are synonymous, e.g., "Addison's disease" versus "Primary adrenal deficiency".
Moreover, such a model should outperform the current Metathesaurus building process, approximated by the rule-based approach described earlier.

Dataset generation
The input data for supervised learning is the same as for the rule-based approach, with the difference that supervised learning only relies on the terms, while the rule-based approach also uses some elements of context (source synonymy and semantic group). In both cases, we use the active subset of the 2020AA UMLS. Only atoms from English source vocabularies are used, excluding atoms marked as suppressible synonyms. The final dataset consists of 8.7M strings from 168 sources grouped into 4.2M concepts.
Ground truth.-Labeled data are taken from the pairs of atoms that are linked to the same (positive) or different (negative) concepts. Let POS be the set of positive pairs and NEG be the set of negative pairs. Given a pair of tuples $t = (str, src, scui, sg)$ and $t' = (str', src', scui', sg')$, with $aui = m_a(str, src)$ and $aui' = m_a(str', src')$, let $m_c$ be the mapping function linking $aui, aui' \in S_{AUI}$ to $cui, cui' \in S_{CUI}$, respectively, such that $cui = m_c(aui)$ and $cui' = m_c(aui')$. If $cui = cui'$ then $(aui, aui') \in POS$; otherwise $(aui, aui') \in NEG$.
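The labeling step can be sketched as follows, assuming the AUI-to-CUI mapping $m_c$ is available as a dictionary. The function name `label_pairs` and the identifiers "A1" to "A4", "C1", "C2" are hypothetical, and enumerating all pairs is only feasible for toy inputs.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def label_pairs(m_c: Dict[str, str]) -> Tuple[Set[Pair], Set[Pair]]:
    """Split all unordered AUI pairs into POS (same CUI) and NEG (different CUI)."""
    pos: Set[Pair] = set()
    neg: Set[Pair] = set()
    for a, b in combinations(sorted(m_c), 2):
        (pos if m_c[a] == m_c[b] else neg).add((a, b))
    return pos, neg


# Hypothetical toy mapping: three atoms in one concept, one atom in another.
m_c = {"A1": "C1", "A2": "C1", "A3": "C1", "A4": "C2"}
pos, neg = label_pairs(m_c)
```

A concept with k atoms contributes k(k-1)/2 positive pairs, so the concept "C1" above yields three positive pairs and the singleton concept "C2" yields none.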
The number of positive pairs in POS is approximately 27.9M, and the number of negative pairs in NEG is approximately $10^{14}$ since most atoms do not share a CUI. It is computationally impossible for us to generate all of the negative pairs in NEG. Even if we could overcome resource limitations, training with extreme class imbalance towards negative is unlikely to yield accurate predictions. Therefore, we drastically reduce the negative sample space so that the datasets have a better class balance.
Data generation principles.-We follow two principles to generate the experimental datasets: (1) provide different degrees of lexical similarity in the negative pairs, and (2) maximize the coverage of AUIs in the training datasets.
We hypothesize that neural networks can predict more efficiently if they can learn from interesting negative pairs that are lexically similar. However, most negative pairs have no (or low) lexical similarity, so it is particularly important to deliberately include lexically-similar negative pairs for the algorithm to learn from. Therefore, we created various negative sets with different levels of lexical similarity so that we can assess how lexical similarity influences performance.
We also hypothesize that neural networks can generalize better if they can learn from both positive and negative pairs for every string in the Metathesaurus. We would also like to maintain the class balance (i.e., keep the maximum ratio between positive and negative pairs at about 1:3). Therefore, every atom in the Metathesaurus will have n positive pairs and approximately ≤ 3n negative pairs.
We use the Jaccard index (1) as a measure for the similarity between atoms. To ignore minor variation among atoms (e.g., singular/plural differences), we assess the lexical similarity of normalized strings rather than original strings. Let norm be the normalizing function that maps a sui to its normalized string, and m s be the function mapping an AUI to its SUI. The JACC score assessing the similarity between two AUIs is computed as follows.
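For two sets $A$ and $B$, the standard Jaccard index is $|A \cap B| / |A \cup B|$. The sketch below applies it to the word sets of two already-normalized strings. Treating words as the set elements is an assumption, and the function name `jacc` and the lowercased example strings are illustrative.

```python
def jacc(norm_a: str, norm_b: str) -> float:
    """Jaccard index over the word sets of two normalized strings."""
    words_a, words_b = set(norm_a.split()), set(norm_b.split())
    union = words_a | words_b
    # Two empty strings share nothing; return 0.0 rather than divide by zero.
    return len(words_a & words_b) / len(union) if union else 0.0


similar = jacc("lung disease and disorder", "head disease and disorder")
dissimilar = jacc("addison's disease", "primary adrenal deficiency")
```

The first pair shares three of five distinct words, so its score is 0.6; the second pair shares no word, so its score is 0.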
Degrees of similarity in negative pairs.-We can divide all of the negative pairs in the Metathesaurus into two mutually exclusive sets: (1) SIM, the negative pairs with some similarity (JACC > 0) between the two atoms, and (2) NOSIM, the negative pairs that have no similarity (JACC = 0) between the two atoms. We can formally define these sets as follows.
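Using the JACC score and the set NEG introduced above, the two sets can be written as follows. This is a reconstruction from the prose definitions in this paragraph.

```latex
\begin{aligned}
SIM &= \{(a, a') \in NEG \mid JACC(a, a') > 0\}, \\
NOSIM &= \{(a, a') \in NEG \mid JACC(a, a') = 0\}, \\
SIM \cup NOSIM &= NEG, \qquad SIM \cap NOSIM = \emptyset .
\end{aligned}
```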
In practice, the size of the SIM set is significantly smaller than that of the NOSIM set. If there is a single atom in a concept, no positive pairs can be created ($k = 0$). In such cases, we will add a negative pair for this atom to $NEG_{TOPN}(SIM)$ and $NEG_{RAN}(SIM)$ if this atom shares at least some similarity with other atoms. Note that we select twice as many negative pairs as needed for training purposes in each set so that we can split each set of negative pairs equally between learning and generalization experiments.
Learning vs. generalization datasets.-We create two types of datasets: (1) learning datasets for training and validating the neural network models, and (2) generalization datasets for testing the generalization of the neural network models. The datasets of the two types are mutually exclusive.
In summary, as shown in Table 3, we create 4 dataset variants (TOPN_SIM, RAN_SIM, RAN_NOSIM, and ALL) for each dataset type. We split the set of positive pairs, POS, randomly into the learning and generalization datasets (80:20 ratio). The positive learning pairs (80% of POS) are combined with one half of the negative dataset for a given variant. Similarly, the positive generalization pairs (20% of POS) are combined with the other half of the negative dataset for that variant. Therefore, the learning datasets are larger than the generalization datasets because they contain more positive pairs. In total, we have 8 datasets for the experiments in Section 5, as shown in Table 3.
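The split described above can be sketched as follows. The function name `build_datasets`, the fixed random seed, and the toy pair identifiers are assumptions made for illustration. The sketch handles one negative variant at a time.

```python
import random
from typing import Hashable, List, Sequence, Tuple

Labeled = Tuple[Hashable, int]


def build_datasets(pos: Sequence[Hashable], neg_variant: Sequence[Hashable],
                   learn_frac: float = 0.8,
                   seed: int = 0) -> Tuple[List[Labeled], List[Labeled]]:
    """80:20 split of positives, 50:50 split of one variant's negatives."""
    rng = random.Random(seed)
    pos, neg = list(pos), list(neg_variant)
    rng.shuffle(pos)
    rng.shuffle(neg)
    cut = int(len(pos) * learn_frac)   # positives going to the learning set
    half = len(neg) // 2               # negatives going to the learning set
    learning = [(p, 1) for p in pos[:cut]] + [(n, 0) for n in neg[:half]]
    generalization = [(p, 1) for p in pos[cut:]] + [(n, 0) for n in neg[half:]]
    return learning, generalization


# Toy input: 10 positive pairs and 12 negative pairs of one variant.
toy_pos = [("pos", i) for i in range(10)]
toy_neg = [("neg", i) for i in range(12)]
learning, generalization = build_datasets(toy_pos, toy_neg)
```

With this toy input, the learning set holds 8 positive and 6 negative pairs and the generalization set holds 2 positive and 6 negative pairs, with no pair appearing in both.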

Neural Network Architecture
Our model adopts the Siamese structure from [32] with BioWordVec embeddings as shown in Figure 1.
Word embeddings.-A pair of atoms are first transformed into their respective numerical word representations, i.e., word vectors. A word embedding is a language modeling and feature learning technique in NLP where words are mapped to vectors of real numbers with varying dimensions. These word vectors are positioned in the vector space such that words that share similar contexts in the corpus are situated close to one another in the space [28]. Word embeddings are often used to calculate sentence pair similarity. In the general domain, the SemEval Semantic Textual Similarity (SemEval STS) challenge has been organized for over five years, which calls for effective models to measure sentence similarity [20]. Averaged word embeddings are used as a baseline to measure sentence pair similarity in the challenges: each sentence is transformed into a vector by averaging the word vectors for each word in the sentence, and sentence pair similarity is effectively measured by the similarity between the averaged vectors using common measures such as Cosine and Euclidean similarity.
Instead of training the word vectors from scratch, we leverage the pre-trained biomedical word embeddings (BioWordVec-intrinsic) that are trained on a PubMed text corpus and MeSH data [50]. The rationale is to "precondition" the Siamese network with prior knowledge of the inherent similarity between words in the UMLS vocabulary. Prior to generating the positive and negative pairs, we preprocess the lexical features of UMLS atoms similar to how the authors in [50] preprocessed their dataset (i.e., we removed all punctuation except hyphen, lowercased, and tokenized on space) to ensure conformity as we leverage their pre-trained BioWordVec embeddings in our downstream network.
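A minimal sketch of the preprocessing steps listed above (remove all punctuation except the hyphen, lowercase, tokenize on space) could look like this. The function name `preprocess` is hypothetical, the second example string is illustrative, and the exact handling of edge cases in [50] (for instance, whether removed punctuation is replaced by a space) is an assumption.

```python
import string

# Translation table that deletes every ASCII punctuation mark except "-".
_DROP = str.maketrans("", "", string.punctuation.replace("-", ""))


def preprocess(atom: str) -> list:
    """Strip punctuation (keeping hyphens), lowercase, and split on whitespace."""
    return atom.translate(_DROP).lower().split()


tokens_a = preprocess("Addison's disease")
tokens_b = preprocess("Beta-blocker (substance)")
```

In this sketch, the apostrophe and the parentheses are deleted while the hyphen is kept, giving `["addisons", "disease"]` and `["beta-blocker", "substance"]`.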
A plot of the word length distribution shows that 97% of atoms in the UMLS have a word length of 30 or less. Hence, we apply padding or truncation to restrict the word length of each atom to a maximum of 30 and to ensure uniform dimensions, which speeds up the training process. The embeddings of the pair of atoms are fed to two LSTMs, each of which processes one of the atoms in the pair and consists of 50 hidden learning units. These units learn the specific semantic and syntactic features based on word order of each individual atom through time.
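The padding and truncation step can be sketched as below, assuming each atom has already been converted into a list of integer word indices. The function name, the padding index 0, and the choice to pad at the end of the sequence are assumptions.

```python
from typing import List

MAX_LEN = 30  # maximum word length per atom, as stated above


def pad_or_truncate(ids: List[int], max_len: int = MAX_LEN,
                    pad_id: int = 0) -> List[int]:
    """Return exactly max_len indices: cut long inputs, right-pad short ones."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


short = pad_or_truncate([5, 6, 7])
long_ = pad_or_truncate(list(range(1, 41)))
```

Both results have length 30: the short input keeps its three indices followed by padding, and the long input keeps only its first 30 indices.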
Siamese-LSTM network.-Unlike traditional neural networks, which accept one input at a time, the Siamese network is an architecture that takes a pair of inputs and learns representations based on explicit similarity and dissimilarity information (i.e., the pairs of similar and dissimilar inputs) [5]. It was originally used for signature verification [5] and has since been applied to various applications such as face verification [6], unsupervised acoustic modeling [43], and learning semantic entailment [32], as well as text similarity [34].
A series of deep learning (DL) models can be incorporated within the Siamese architecture. RNNs (Recurrent Neural Networks) are a type of DL model that excels at processing sequential information due to the presence of memory cells that store and "remember" data read over time [40]. A particular variant of the RNN is the Long Short-Term Memory (LSTM). It enhances the standard RNN to handle long-term dependencies and to mitigate the inherent vanishing gradient problem of RNNs by introducing "gates" (input, output, and forget gates) that control the flow of information and retain it better through time. It is more accurate in handling long sequences. However, this comes at the cost of higher memory consumption and longer training times compared to a standard RNN, which is faster but less accurate. Nonetheless, combinations of a Siamese network with RNNs and LSTMs have been successfully applied to various NLP tasks including similarity assessment [12,32,44]. On the other hand, CNNs (Convolutional Neural Networks) have also performed well in NLP due to their ability to extract distinctive features at a higher granularity [20]. A Siamese CNN model learns sentence embeddings and predicts sentence similarity with features from various convolution and pooling operations [15].
The output of the model is a Manhattan distance similarity function, $\exp(-\lVert \mathrm{LSTM}_A - \mathrm{LSTM}_B \rVert_1) \in [0, 1]$, a function that is well-suited for high dimensional spaces [1]. We will use the Siamese neural network architecture with LSTM and the datasets described above to train our models. Next, we describe our design for evaluating the supervised learning approach and comparing it with the rule-based approach.
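The similarity function itself is easy to illustrate in isolation. The sketch below takes two plain lists of floats standing in for the final LSTM states of the two atoms; the function name and the example vectors are hypothetical, and no network is trained here.

```python
import math
from typing import Sequence


def manhattan_similarity(h_a: Sequence[float], h_b: Sequence[float]) -> float:
    """exp(-L1 distance) between two equally sized hidden-state vectors."""
    if len(h_a) != len(h_b):
        raise ValueError("hidden states must have the same dimension")
    l1 = sum(abs(x - y) for x, y in zip(h_a, h_b))
    return math.exp(-l1)


same = manhattan_similarity([0.1, -0.4, 0.7], [0.1, -0.4, 0.7])
far = manhattan_similarity([0.9, -0.8, 0.7], [-0.9, 0.8, -0.7])
```

Identical vectors give a similarity of 1, and the score decays towards 0 as the L1 distance grows, so a threshold on this score can separate predicted synonymous pairs from non-synonymous ones.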

EVALUATION
This section presents the experiments to evaluate the proposed supervised learning approach against the baseline from the rule-based approach.
The experiments are reproducible and the baselines are also reusable. The materials for reproducing the experiments are publicly available. A no-cost UMLS license 1 is required to access and download the materials in this page.

Experimental Setup
We conducted two types of experiments on the same datasets and evaluated the performance of (1) the rule-based approximation baseline, and (2) the proposed supervised learning approach. The editorial rules are defined in Section 3.2 and the neural networks are described in Section 4. We implemented our approaches using Python 3.8 and TensorFlow 2.0.
Both experiment types are executed by deploying batches of parallel jobs to the Biowulf high-performance computing cluster 2 at the National Institutes of Health (NIH). We use the norm and gpu partitions for the corresponding CPU and GPU servers in this cluster, with a limit of 10,000 CPU cores, 60 TB of RAM, and 56 GPUs per user. Our evaluation includes several steps organized into different pipelines. The execution of each step maximizes the resources allocated in Biowulf to reduce the runtime. Our settings for deployment are: (1) using multiple nodes, usually 500-625 nodes, (2) using multithreading with 16-20 threads per node, (3) using about 95-125 GB of RAM per node, and (4) using Tesla V100 GPUs for the training and testing tasks.
The implementation is highly configurable, reusable, and reproducible with scripts. However, note that these experiments make extensive use of computational resources. We reportedly used over 1.6 million CPU hours over 3 months for developing and deploying the models.

Data Generation
We used the active source vocabularies restricted to English terms (excluding suppressible synonyms) in the UMLS 2020AA release, which can be downloaded 3 with a no-cost UMLS license 1 .

Rule-based Approximation Baseline
We implemented the editorial rules defined in Section 3.2. For evaluating how individual and combined rules influence the performance, we created four variants of the RBA baseline: (1) SS for the source synonymy rule, (2) LS_SC for the lexical similarity and semantic compatibility rule, (3) SS_LS_SC for the disjunction of the two SS and LS_SC rules (SS OR LS_SC), and (4) SS_LS_SC_TRANS for the transitive closure of the SS_LS_SC variant.
We evaluate and compare the four RBA variants using the 4 variants of the generalization dataset. We will select the best RBA variant as our baseline for comparison against the supervised learning approach.
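The relationship among the four variants can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the atom identifiers and the two input pair sets are hypothetical, the helper names are ours, and the union-find structure is one common way to compute a transitive closure, not necessarily the one used for the baseline.

```python
def normalize(pairs):
    """Store each unordered pair as a sorted tuple so (a, b) == (b, a)."""
    return {tuple(sorted(pair)) for pair in pairs}


def transitive_closure(pairs):
    """Return all pairs of atoms that are connected through the given pairs."""
    parent = {}

    def find(atom):
        parent.setdefault(atom, atom)
        while parent[atom] != atom:
            parent[atom] = parent[parent[atom]]  # path halving
            atom = parent[atom]
        return atom

    # Merge the groups of the two atoms in every predicted synonymous pair
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the atoms of each connected group
    groups = {}
    for atom in list(parent):
        groups.setdefault(find(atom), []).append(atom)

    # Every pair of atoms inside a group is predicted synonymous
    closed = set()
    for members in groups.values():
        members.sort()
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                closed.add((a, b))
    return closed


# Hypothetical outputs of the two individual rules
ss_pairs = normalize([("A1", "A2")])                    # SS
ls_sc_pairs = normalize([("A2", "A3"), ("A4", "A5")])   # LS_SC
ss_ls_sc = ss_pairs | ls_sc_pairs                       # SS OR LS_SC
ss_ls_sc_trans = transitive_closure(ss_ls_sc)           # SS_LS_SC_TRANS
```

In this toy example the closure adds the pair (A1, A3), which neither rule predicts on its own. This shows how the transitive closure can raise recall, and also how a single wrong link can propagate and lower precision.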
Results.- Table 4 shows the results of the evaluation. All the RBA variants consistently share the same pattern across all the generalization datasets, namely very high precision (0.8631 to 1), but very low recall (0.2026 to 0.6871). Comparing the performance of these RBA variants against the 4 variants of the generalization dataset, each RBA variant shares the same recall for all the generalization datasets, while precision and F1 score improve among ALL, TOPN_SIM, RAN_SIM, and RAN_NOSIM.
The SS_LS_SC_TRANS variant performed best in terms of accuracy, recall, and F1 score, but had the lowest precision among all the RBA variants across all the generalization datasets. Adding the transitive closure (SS_LS_SC_TRANS variant) significantly increased the performance with a 16% increase in recall and 19-23% in F1 score across all the generalization datasets. The SS rule yields higher precision and recall compared to the LS_SC rule. Combining the two rules with OR (SS_LS_SC variant) also brings significant improvements with an 18% increase in recall and 19% in F1 score. We will compare this SS_LS_SC_TRANS variant with the deep learning approach in Section 5.5.

Training
Training parameters.-For training the neural networks, we ran various experiments to select the most suitable hyper-parameters that can balance performance and speed for our models. We tried batch sizes from 64 to 65356 and learning rates from 0.00001 to 0.01. While a batch size of 64 can take at least 16 hours of training for an epoch with a single V100 GPU, a batch size of 8192 can finish an epoch in less than 10 minutes. Also, the experiments in [49] suggest fitting as many data samples as possible into the GPU memory, but with a batch size no higher than 8192. This was consistent with our preliminary findings. Therefore, we used a batch size of 8192 in our experiments.
We trained and evaluated each variant for 100 epochs and report the results in Table 5 with the usual metrics (accuracy, precision, recall, and F1 score). As shown in Table 5, all the trained models learn very effectively. Accuracy, precision, recall, and F1 score exceed 93% for training and validation. We observed that, compared to the other models, the TRAINED_RAN_NOSIM model learned especially well, with all the metrics near or above 99% and low loss. This was expected because its input pairs are highly dissimilar lexically and mostly non-synonymous. Training seems less effective when the negative input pairs are more lexically similar but non-synonymous, like the ones in TRAINED_ALL and TRAINED_TOPN_SIM. Of note, the excellent training scores of the TRAINED_RAN_NOSIM model do not guarantee good generalization, as we show in the next section.
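For readers less familiar with these metrics, the following minimal Python sketch shows how accuracy, precision, recall, and F1 score are derived from binary synonymy labels (1 for synonymous, 0 for non-synonymous). The function name and the small label lists are hypothetical and serve only as an illustration; they are not taken from our datasets.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = synonymous)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn

    # Guard against empty denominators (e.g., no positive predictions)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical gold labels and predictions for eight atom pairs
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Because negative pairs vastly outnumber positive pairs in this task, accuracy alone can look high even when many synonymous pairs are missed, which is why precision, recall, and F1 score are reported alongside it.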

Generalization Test Results
This section provides a comprehensive performance comparison between the trained models (TRAINED_ALL, TRAINED_TOPN_SIM, TRAINED_RAN_SIM, and TRAINED_RAN_NOSIM), and the rule-based approximation baseline (SS_LS_SC_TRANS) using the same generalization datasets. Since each model is trained with a dataset corresponding to a specific variant in terms of lexical similarity between atoms in the negative pairs, we perform a generalization test by evaluating the model performance on generalization datasets for other variants of lexical similarity in negative pairs. Table 6 shows the results of the performance comparison. Here we compare the trained models with each other and against the rule-based approximation SS_LS_SC_TRANS.
Comparing DL-trained models.-As shown in Table 6, the TRAINED_RAN_NOSIM variant performed very well on its own generalization variant RAN_NOSIM, with all of the metric scores above 97.9%. However, it did not generalize well to the other test variants, especially ALL and TOPN_SIM, where precision was very low (20-22%). The TRAINED_RAN_SIM model had a performance pattern similar to the TRAINED_RAN_NOSIM model, but with a 20-23% improvement in F1 score for the ALL and TOPN_SIM generalization variants.
In contrast, compared to the two RAN models above, the two models TRAINED_ALL and TRAINED_TOPN_SIM had exceptionally good performance in every measure across all the generalization variants. Of the two, the TRAINED_ALL model had consistently better results than the TRAINED_TOPN_SIM in every measure. Overall, the performance for the trained models ranked as follows from worst to best: TRAINED_RAN_NOSIM, TRAINED_RAN_SIM, TRAINED_TOPN_SIM, and TRAINED_ALL.
These experiments show that the degrees of lexical similarity (ALL, TOPN_SIM, RAN_SIM, RAN_NOSIM) between strings in negative pairs actually influence performance, thus confirming our hypothesis. Learning from one of the lexical similarity variants is necessary, but insufficient. The trained models without TOPN_SIM pairs perform worse than the trained models with those pairs, which demonstrates the importance of the highest lexical similarity variant. The TRAINED_TOPN_SIM model, which lacks RAN_SIM and RAN_NOSIM pairs, performs worse than the TRAINED_ALL model, which includes them; this demonstrates the importance of the RAN_SIM and RAN_NOSIM pairs. The TRAINED_ALL model combining all three degrees yields the best performance. Next, we will compare the TRAINED_ALL model with the best RBA variant.
Comparing the best trained model TRAINED_ALL with the best RBA variant SS_LS_SC_TRANS.-Overall, the TRAINED_ALL model consistently outperforms the rule-based SS_LS_SC_TRANS variant by a large margin in every measure. The best RBA variant has high precision and low recall, while the best DL-trained model has both high precision and high recall across all the generalization variants. While their accuracy and precision are quite close (1-3%), there are significant differences in their recall (21-22%) and F1 score (23-24% for ALL and TOPN_SIM, 11-14% for RAN_SIM and RAN_NOSIM).
Comparing prediction differences.-Here we analyze those cases where the DL and RBA approaches make different predictions in the ALL generalization dataset. Table 7 shows the distribution of correct and incorrect predictions in the SIM and NOSIM sets.
Overall, while the RBA approach makes a larger number of wrong predictions than the DL approach, both approaches tend to have more difficulty making accurate predictions for pairs with some lexical similarity (SIM) compared to pairs with no lexical similarity (NOSIM). This is consistent with our assumption that highly similar but non-synonymous pairs are more difficult to predict.

Overall Discussion
Findings.-The experimental evaluation presented above has shown that a relatively simple DL approach largely outperformed the best variant of the rule-based approximation approach. It has also validated our hypothesis that lexical similarity degrees among negative pairs strongly influence the performance of the trained models. However, the DL approach took longer to make predictions than the RBA approach. In particular, the DL models took about an hour to predict the generalization test sets with a single V100 GPU, while the best RBA variant took 15-20 minutes with a CPU server.
Significance.-Compared to the rule-based approximation, the excellent performance of the TRAINED_ALL model is even more remarkable given that it only uses lexical information (e.g., terms) from the source vocabularies, while the rule-based approach uses both lexical information and contextual information (i.e., source synonymy and semantic group). These results suggest that the DL approach could be further improved by incorporating contextual information. Furthermore, the good performance of the DL approach on pairs with no lexical similarity (above 95% for F1 and 99% for accuracy) encourages us to perform more extensive experiments on the UMLS, where most pairs exhibit no lexical similarity.
Limitations and Future Work.-There are several limitations to this preliminary investigation, which we plan to address in future work. As mentioned earlier, we have not yet incorporated contextual information into the neural networks, which we could do by using additional vectors for the terms of neighboring concepts or by using Graph Neural Networks for representing relations among atoms, such as source synonymy and hierarchical relations. Also, we have not yet evaluated the approaches at the full scale of the UMLS Metathesaurus. While a full-scale evaluation is extremely expensive computationally ($10^{14}$ pairs), we plan to perform larger evaluations in the future. We also need to perform an error analysis to better understand how learning could be improved. Finally, we deliberately used fairly simple and established DL techniques in this work. In the future, we plan to experiment with recent techniques, such as transformers (e.g., BioBERT), which we briefly discuss in the next section.
Generalization.-Beyond the confines of the UMLS project, our approach can be used in a variety of terminology integration and ontology alignment applications in biomedicine and healthcare. For example, BioPortal [37] is "the world's most comprehensive repository of biomedical ontologies". It uses lexical similarity to find equivalent terms among ontologies. It would be interesting to test our DL approach on this vast repository. Along the same lines, we plan to test our approach on biomedical ontologies in the ontology alignment evaluation organized by OAEI. We also expect that other researchers will be encouraged to try similar approaches for ontology alignment outside the biomedical domain, provided sufficient material is available for learning purposes.
Applications.-This research is directly applicable to improve the UMLS construction process. Two applications come to mind, which we will be exploring shortly. The first one is the insertion of new source vocabularies (or new terms from updated source vocabularies) into the Metathesaurus as part of the bi-annual Metathesaurus update process. Predictions from our DL approach could replace the rule-based predictions and be presented to human editors, hopefully saving them time compared to the current editing environment. Another, more ambitious application is to "rebuild the Metathesaurus from scratch". What we envision is to use our pairwise synonymy prediction to cluster atoms in a manner to recreate the Metathesaurus concepts. The analysis of differences with the existing Metathesaurus could open interesting avenues for quality assurance.

RELATED WORK
The OAEI has been driving ontology matching research in the biomedical domain since 2005. The largebio track uses the datasets extracted from a subset of source vocabularies in the UMLS Metathesaurus. A variety of matching techniques including rule-based and statistical methods have been developed. Among the top general-purpose matchers are AgreementMakerLight (AML) [10], YAM++ [35], and LogMap [17]. AML [10] uses a combination of different matchers, such as the lexical matcher, mediating matcher, and word-based string similarity matcher. YAM++ [35] implemented a decision tree learning model over many string similarity metrics but leaves the challenges of finding suitable training data to the user, defaulting to information retrieval-based similarity metrics for its decision-making when no training data is provided. LogMap [17] is designed to efficiently align large ontologies, generating logical output alignments.
Similarity assessment between words and sentences, also known as Semantic Text Similarity (STS) task, is an active research area in Natural Language Processing (NLP) due to its crucial role in various downstream tasks such as information retrieval, machine translation, and in our case, synonym clustering. The STS task can be expressed as follows: given two sentences, a system returns a probability score of 0 to 1 indicating their degree of similarity. STS is a challenging task due to the inherent complexity in language expressions, word ambiguity, and variable sentence lengths. Traditional approaches rely on hand-engineering lexical features (e.g., word overlap and subwords [22], syntactic relationship [51], structural representations [42]), linguistic resources (e.g., corpora), bag-of-words and term frequency inverse document frequency (TF-IDF) models that incorporate a variety of similarity measures [11] for example string-based [13] and term-based [41]. However, most are syntactically and semantically constrained.
Recent successes in STS [29] in predicting sentence similarity and relatedness have been obtained by using corpus-based [23] and knowledge-based similarity, e.g., word embeddings for feature representation [27], with supervised DL approaches, e.g., Siamese networks with Recurrent Neural Networks (RNN) [32] and Convolutional Neural Networks (CNN) [15], as well as hybrid approaches [16], to perform deep analysis of words and sentences to learn the necessary semantics and structure. Unsupervised attention and transformer-based mechanisms that were pioneered by Google Research [46] have also been widely applied to STS with a great degree of success [38]. The (self-)attention mechanism adds attention, weights keywords, learns contextual relations between words (or sub-words) in a text, and finds the connections within the sequence of words [14]. One such transformer-based model is Bidirectional Encoder Representations from Transformers (BERT), which has consistently excelled in most NLP tasks including STS [9]. Variants trained on different corpora include BioBERT, which was pre-trained on the PubMed text corpus and has outperformed previous models on many biomedical NLP tasks [24]. This form of two-step learning (pre-training and fine-tuning), termed transfer learning, is a popular method in which a model trained on a general domain with large-scale, well-annotated datasets is re-purposed as the starting point for a model on a second (related) task. In our DL approach, we employed this form of learning by using pre-trained biomedical word embeddings (from BioWordVec-intrinsic) and subsequently fine-tuned the network with Bi-LSTM(s). Since this is the first contribution (to the best of our knowledge) in applying DL to the biomedical vocabulary alignment task at scale, we adopted a knowledge-based similarity approach (Siamese-BioWordVec-BiLSTM network) for its simplicity and effectiveness.
We aimed to evaluate this approach on real-world data and against a rule-based approximation of the current Metathesaurus construction process, rather than benchmarking it against more resource-intensive DL techniques, such as attention and transformer-based mechanisms, which we leave for future work.
Reminiscent of the UMLS are two projects that aim to discover and organize links among large knowledge resources, BabelNet [33] and LIMES [36]. Closest to our work is a recently published paper in which the authors used DL techniques to measure semantic relatedness in the UMLS Metathesaurus [25]. There are, however, several major differences with our work, including the fact that they assessed semantic relatedness among concepts, while we assess synonymy among atoms. In addition, the scale of their work is limited to a few thousands of UMLS concept pairs, while the number of atom pairs involved in our experiments is several orders of magnitude larger.

CONCLUSION
We have presented our supervised approach for learning synonymy between biomedical terms in the UMLS Metathesaurus. The excellent performance of the supervised learning model compared to the rule-based approximation of the UMLS Metathesaurus construction process used as our baseline shows the great potential of this learning approach, especially because the learning approach only makes use of the lexical features (terms) from the source vocabularies, while the rule-based approach additionally uses contextual information (source synonymy and semantics). This approach has great potential for improving the UMLS Metathesaurus construction process by providing better synonymy suggestions to the UMLS editors.

Table caption: Comparing prediction differences from the best variants of Deep Learning models (TRAINED_ALL) and Rule-based Approximation baseline (SS_LS_SC_TRANS) on the same ALL generalization dataset.