
BIOSPIDA: A Relational Database Translator for NCBI
Matthew S. Hagen
1Center for Operations Research in Medicine and HealthCare,
2NSF I/UCRC Center for Health Organization Transformation,
3Cemter for Bioinformatics and Computational Genomics,
5College of Computing, Georgia Institute of Technology, Atlanta, GA
Eva K. Lee
1Center for Operations Research in Medicine and HealthCare,
2NSF I/UCRC Center for Health Organization Transformation,
3Cemter for Bioinformatics and Computational Genomics,
4School of Industrial and Systems Engineering,
Abstract
As the volume and availability of biological databases continue widespread growth, it has become increasingly difficult for research scientists to identify all relevant information for biological entities of interest. Details of nucleotide sequences, gene expression, molecular interactions, and three-dimensional structures are maintained across many different databases. To retrieve all necessary information requires an integrated system that can query multiple databases with minimized overhead. This paper introduces a universal parser and relational schema translator that can be utilized for all NCBI databases in Abstract Syntax Notation (ASN.1). The data models for OMIM, Entrez-Gene, Pubmed, MMDB and GenBank have been successfully converted into relational databases and all are easily linkable helping to answer complex biological questions. These tools facilitate research scientists to locally integrate databases from NCBI without significant workload or development time.
Introduction
Biologists have the unique challenge to classify a wealth of information that is constantly growing and evolving. Each new scientific discovery and laboratory advancement can yield new concepts and revise data representation. Resource providers have the potential to focus on their own sub-discipline and philosophy of information structure. The annual Nucleic Acids Research journal database issue has listed over 1,200 databases in 2010 covering molecular and cell biology1.
To meet the demands of the complexity and diversity of the biological community, NCBI has chosen to use Abstract Syntax Notation (ASN.1)2,3. It is a hierarchical data model that contains records, collections and variant types4. Its flexibility can handle needs for constant database restructuring where a relational system would be unsuccessful. ASN.1 is equally expressive as it is flexible. The semantics and biological meaning of the data can be interpreted without additional constraints of understanding a relational model. The standardized format allows exchange of information without dependence on a specific hardware or software environment5.
As ASN.1 gives many advantages to the scientific community, it creates several limitations for procedures that require a broad range of investigation. Full text records need post-processing and parsing of their results. Scientists may spend 80% of their time retrieving, manipulating and preparing their data for analysis6. In relational databases, SQL-level queries give fine granular access increasing the control, speed and global reach of analysis.
Pertinent biological data is frequently stored across many different databases, hindering investigators that require information from multiple experiment types. It is unrealistic in today’s climate to ask researchers to visit numerous databases with a “point and click” interface7. They are often confronted with this environment to answer common biological questions. For example, to validate whether encoding genes for protein-protein interactions are expressed in the same cell type would require visiting several databases8.
There are multiple proposed regimes towards data integration. Distributed systems build software to access heterogeneous databases over the internet. Users do not need to invest in hardware and are provided with the most recent versions of the data. However, they are limited to the speed of the internet and can only view the information at the service-level of the data provider. Few external databases accept complex queries that may be resource intensive.
Data warehousing extracts and integrates information into a locally housed unified repository. They are a popular approach and have met with successful results, facilitating studies that normally could not be performed under other conditions. Mootha et al was able to identify one of the disease genes implicated in Leigh syndrome using integrated sequence and annotation data9. Shah et al helped determine yeast orthologs for human genes implicated in disease with sequence, taxonomy and genetic disorder databases10.
The main disadvantage of data warehousing is the considerable amount of setup-time and installation11. A detailed amount of knowledge is required of both the source and target data. Databases become frequently out of date and strategies for updating do not regularly meet the needs for a constantly shifting environment. Changes to the source data often require software code revisions and retesting.
Biological Search Propagate Integrate Demonstrate Analyze, BIOSPIDA, is a utility that helps alleviate the shortcomings of local integration. It contains a universal parser and relational translator written in Python and MySQL that can be used for all NCBI ASN.1 databases. The software does not need to be updated upon changes to the source data structure. Researchers are not required to develop a separate parser for each individual database they wish to integrate from NCBI.
Abstract Syntax Notation
ASN.1 objects can have many attributes that all have different data types. These include primitive, hierarchical, nested, enumerated, variant, and collection types. Primitive types hold basic values, such as integers, real numbers, booleans, and strings of characters. Enumerated types restrict attributes to a list of possible values. Variant types allow attributes to be many different data types. Attribute types can also be ASN.1 objects themselves creating a hierarchical structure. Keywords are used to identify the data type of each attribute, such as Integer, VisibleString, REAL, BOOLEAN, and ENUMERATED. CHOICE is used for variant data types.
Optional modifiers can be used for attributes that further change their data types. The SEQUENCE OF or SET OF keyword treats the attribute as a collection. A sequence is an ordered collection, while a set does not enforce the order. Authors of a publication are an example of a sequenced collection, and mesh terms are an example of a set collection data type. OPTIONAL and DEFAULT keywords flag attributes specifying if they are required or have a default value.
An ASN.1 object is not always a simple tuple of attributes with primitive data types. The data type of an attribute can be an ASN.1 object linking itself to another tuple of attributes. Attributes are not limited to having one value. They can be collections of all available data types. Lists of primitive values or nested ASN.1 objects are possible. Inside ASN.1 objects, there can be variants permitting multiple data types for one attribute. Attributes also have the ability to be a tupled nest of child attributes. There is not a large set of restrictions for ASN.1 objects, and they are not limited to a tuple of attributes. ASN.1 objects have the availability to be primitive, enumerated or variant data types themselves (Fig 1).
ASN.1 object with collection attribute of object type that derives into enumerated type. (Na-strand is an object that is an enumerated type)
The flexibility in the ASN.1 data structure leads to elaborate scenarios yielding many distinct cases. The following are examples illustrating the possible complexity. An ASN.1 object can be a variant type that is itself a collection. An attribute can be a collection of ASN.1 objects, where the ASN.1 object derives itself into an enumerated data type (Fig 8). One of the data types of a variant attribute can be a tupled list of child attributes.
Universal Parser
To build a universal parser for the NCBI databases, a context-free grammar must be constructed that successfully handles all data types provided by NCBI. Since each database uses the ASN.1 object model, the grammar must cover all possible types that are supported by the notation. This leads to an extra layer of complexity, because data types allow nested combinations that develop into numerous unique cases. The context-free grammar for ASN.1 notation used by BIOSPIDA can be seen in Figure 2. The advantage of using a context-free grammar is that NCBI databases can be selected for download without having to create an individual parser. The universal parser in BIOSPIDA uses the grammar to read ASN.1 specification files, automatically learning how to read each database.
Minimal human intervention is required to create ASN.1 specification files that describe the data model for targeted NCBI databases. The specification file is in text format and must contain the data structure for all ASN.1 objects used by the database. This can be created by knowing the parent ASN.1 object and recursively placing the data structures for all referenced child objects. For example, <Biostruc> is the parent object for the MMDB database, NCBI provides the data structure for all ASN.1 objects used across all of their databases15. Building a specification file for an NCBI database can be accomplished in under a few hours, and current specifications files for each NCBI database can be provided. If the data structure changes for a given NCBI database, it is only necessary to make appropriate revisions to the specification file. There is no obligation for software code changes. When initially selecting a database for conversion, researchers must specify whether the uncompressed ASN.1 xml data is locally stored or requires remote retrieval. If remote retrieval is necessary, BIOSPIDA requires a text file of database records ids to be sent to the NCBI Efetch utility13.
NCBI provides access to each of their databases either by their ftp site12 or their efetch utility13. The data for each record is available in xml format and it is structured according to the document type definition file14. The XML tags identify each possible attribute and item given for each record.
The software code in BIOSPIDA does not require identifying each individual xml tag to search for when parsing database records. All that is required is the ASN.1 specification file describing the data model of the target database. BIOSPIDA has the ability to learn how to parse the xml from each NCBI database without human intervention, because NCBI follows a strict set of guidelines when converting their ASN.1 records to xml format. NCBI presents XML for each database record that follows the exact structure given by their ASN.1 data model15.
Figures 3 and and44 compare the ASN data model for Entrezgene with example xml retrieved from a database record. The parent xml tag is <Entrezgene> which is the parent ASN.1 object in the definition file. An xml tag is created for each attribute of the ASN.1 object. <Entrezgene_type> conveys that the “Entrezgene” object has an attribute named “type.” Since the attribute is an INTEGER, it is a primitive type. Therefore, its value is placed immediately after its identifier.
The data type for the track-info attribute is another ASN.1 object, Gene-track. The value for the track-info attribute cannot be placed immediately after its xml tag. It must identify it is an ASN.1 object. The <Gene-track> xml tag follows the <Entrezgene_track-info> tag. The same logic is used to recursively list all of the subsequent attributes and child objects.
If an attribute type is CHOICE, it can have many possible data types. An additional xml tag distinguishes the variant attribute. Below the <Orgname> xml tag, <Orgname_name> and <Orgname_name_binomial> are listed to identify the attribute and the chosen data type.
NCBI follows a consistent list of guidelines for generating xml of collection attributes in ASN.1 object records. Collections of primitive and variant attributes add an “_E” in the xml tag name to identify each entry. Variant types additionally add the chosen attribute name after the “_E” suffix. Collections of object types do not add the “_E” suffix in their xml tags. Instead, they have an open and closing tag of the object name for each item in the list. Figure 5 gives an example for these cases.
GB-block contains the keywords attribute that is a collection of VisibleStrings. Seq-align has a collection of Score objects. The names attribute in the Auth-list object is a variant with one of its permitted types being a collection of VisibleStrings. Title is a collection of variant types. The generated xml for these objects all adhere to the same conventions.
The universal parser utilizes the exact rules that NCBI implements to deliver xml data of their database records. No software revisions are necessary for each additional NCBI database that must be loaded and parsed. Scientists are only required to create a new template file that contains all of the objects used by the selected database.
Relational Translator
Parsers only facilitate retrieval of the data. The information must still be inserted, stored, and analyzed. .BIOSPIDA contains a relational translator that automatically converts the ASN.1 data model from the given template file into a MySQL relational database.
ASN.1 has a broad range of data types, and the translator must also follow a set of rules to create all tables and subsequent fields. Tables are created for every ASN.1 object specified in the template file. Attributes that are primitive data types are directly converted into fields. Enumerated types are also converted into fields as an integer data type. Separate reference tables are generated to list all of the values for each enumerated type. For variant types, individual fields are created for each possible data type. All of the variant attribute’s fields will be null for the cases not chosen.
A primary column is added to every table to uniquely identify each row inserted into the database. The value of the primary key is incremented for each record insertion. These identifying fields also exist in child tables as foreign keys facilitating table linkages. Separate linking tables are produced for collection types. The linking table includes a sequence field for SEQUENCE collections, but not for SET collections. Example tables generated from the relational translator for the Entrezgene database can be viewed in Figure 6.
In Figure 6, the Prot-ref object has attributes that are collection types. A separate linking table is generated for the “name” attribute labeled Prot-ref_name. Since, the attribute is a VisibleString, only its value and Prot-ref identifier are needed in the linking table. The db attribute in the Prot-ref object is a collection of Dbtag objects. Therefore, the Prot-ref_db linking table contains identifier fields for the Prot-ref table and the Dbtag table. The tables are linked with the Prot-ref_id and Dbtag_id columns.
The universal parser inserts all of the data it retrieves into loading files. A separate loading file is designated for each table created from the Relational Translator. The Relational translator also produces the MySQL load and create table statements. After the loading files are fully populated, the only human intervention required is to execute the create and load statements to generate a fully converted relational database.
Results
BIOSPIDA has been tested against the Entrezgene, OMIM, Genbank, MMDB, and Pubmed NCBI databases. Relational models were successfully generated for each resource center from the given ASN.1 specification file.
It took approximately 7 days to parse 170 gigabytes of all 6.6 million records in Entrezgene. The generated loading files from BIOSPIDA totaled 35 gigabytes, because all xml tags can be discarded. This achieved a 5 to 1 ratio in saving storage space. The relational database for Entrezgene produced 224 tables. The amount of data per table varied, but millions of records could be loaded very quickly. For example, 6.5 million records were inserted into the Orgname table in approximately 3 minutes.
Converting the Pubmed database was especially advantageous for a study with the Atlanta Clinical Translation & Science Institute (ACTSI). After being provided a list of pubmed identifiers, BIOSPIDA was able to perform a social network analysis for all publications generated from the center. 6,813 records were parsed and inserted into a relational Pubmed database. Reports could easily be established with filtering criteria, such as Affiliation, Authors, Title, Publication date, Mesh Terms, and Abstract keywords.
All of the entries in OMIM were fully converted into a relationship database, facilitating further analysis procedures for ACTSI. Due to local integration, the information was easily linkable to the publication data. It was found that their literature refers to a total of 103 genetic disorders from OMIM.
Conclusion
As more methods are implemented for multi-database queries across heterogeneous systems, scientists can widen the range of their field of study. The common bottleneck for creating a local repository of information is development and installation. Other data warehousing solutions have been successful, but all develop a separate parser for each selected database6,8,10,11. BIOSPIDA allows researchers to download multiple databases from NCBI without iteratively customizing software.
In the event that NCBI changes their data model, only revisions are required to the template files for each database. The modifications will then automatically be reflected when the data is refreshed.
Many of the NCBI databases are linkable with numeric identifiers. The OMIM database alone links to Pubmed, Entrez-gene, Entrez-nucleotide, Entrez-protein, Entrez-genome and MMDB. BIOSPIDA will further be tested against other databases, allowing retrieval of records from many sub-disciplines and expanding the scope of analytical tools.
Acknowledgments
We acknowledge funding from the National Institutes of Health for translational biomedical informatics advances.
References
Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association





