15 StrainInfo: Reducing Microbial Data Entropy

Dawyndt P.

In a keynote speech [5] given at the 2002 O’Reilly Open Bioinformatics Conference, Lincoln Stein of Cold Spring Harbor Laboratory compared today’s bioinformatics landscape with the old Italian city-state model:

“During the Middle Ages and the early Renaissance, Italy was fragmented into dozens of rival city-states that were formed by legendary families, such as the Estes, Viscontis and Medicis. Although this era had some positive aspects, the political fragmentation was ultimately damaging to science and commerce because of the lack of standardization in everything from weights and measures to the tax code, the currency, and even the dialects the people spoke. And because a fragmented and technologically weak society was vulnerable to conquest, Italy was dominated by invading powers from the 17th to the 19th centuries.

The old city-states of Italy are an apt metaphor for bioinformatics today, as this field is dominated by rival groups that each promote their own Web sites and Web services and data formats. And while this environment has led to some creative chaos that has greatly enriched the field, it has also created a significant hindrance to researchers who wish to fully explore the wealth of genome data.

Eventually, the nation of Italy was forged through a combination of violent and diplomatic efforts, so that it is now, despite its shaky beginnings, a strong and stable country. It is also a component of a larger economic entity, the European Union, whose countries share a common currency, a common set of weights and measures, and a common set of rules for national and international commerce. The hope is that one day bioinformatics will achieve the same degree of strength and stability by adopting the same universal code of conduct.”

If you look at the consortium of culture collections that are members of the World Federation of Culture Collections from a bioinformatics point of view, one striking observation is that fully integrated information about microorganisms is not immediately available. The data organization mechanisms put in place by the National Library of Medicine (NLM) make it easy to follow the links between genes, the diseases they cause, and the various publications on these genes and diseases. You cannot easily do the same sort of research on the strains in the culture collections. Suppose, for example, you would like to find all 16S rRNA genes of Pseudomonas strains that have been isolated from soil and get information on their taxonomy, ecology, genomes, and so on. There is no easy way to collect this information without time-consuming searches.

At Ghent University, we decided to perform an experiment to see how far we could get with the information as it is made available from the culture collections today, in order to try to build something like the infrastructure constructed at the NLM. The idea was to build a software platform that accesses information from the culture collections, so that somebody could direct questions to a single point of access instead of having to visit the catalogues of the culture collections one by one to get an answer. Collecting online information coming from autonomous and heterogeneous data providers is the sort of job a web spider does, so we decided to look into building this kind of infrastructure. We also decided not to build the software platform as a monolithic structure, but make it flexible in the sense that it could take into account regional projects that had already established portals for a number of culture collections, in various countries or in regions like Asia.

The idea is that if a researcher has a question about a microorganism, the system would consult the online catalogues of the individual culture collections on the researcher’s behalf. The Internet is conceived as a collection of data that is linked together by hyperlinks. These hyperlinks indicate connection points within and between various datasets. But the Internet does not lend itself very well to discovering new links and new ways of finding compatibilities between different datasets.

The approach we took to link microorganisms with all their downstream information is inspired by the “knuckles-and-nodes” model described by Lincoln Stein [6]. The idea is to organize nodes of information into a number of thematic networks, each with its own hub, or knuckle, that interconnects with all the other networks through the knuckles. Some of these knuckles have already been established, so we simply needed to integrate them. The Bergey’s Manual, as the previous speaker described, could serve as a taxonomy knuckle, and a variety of bioinformatics knuckles are also available (e.g., public sequence databases bundled into the International Nucleotide Sequence Database Collaboration; INSDC). But what was missing was the organism knuckle, which would provide access to all the bacterial, archaeal and fungal resources that are in the culture collections and, by extension, in all public and private research collections.

So here’s another look at Lincoln Stein’s idea: a number of people put their data online in databases or simply as text documents. To bundle all this information together in what he calls knuckles requires the construction of some sort of integration network that helps discovery across these disparate data sources. One possible approach to accomplish this could be to build an infrastructure on top of the disparate data sources where globally unique identifiers are assigned in an ongoing discovery process of pointers between autonomous and heterogeneous data sources.
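
A minimal sketch of this idea, and not the actual StrainInfo implementation: each discovered pointer between two strain numbers is treated as an equivalence, equivalences are merged with a union-find structure, and every resulting group receives one global identifier. The class name, the "GID-" identifier scheme and the collection acronyms and numbers are all illustrative assumptions.

```python
class StrainGroups:
    """Union-find over strain numbers; each connected group is one organism."""

    def __init__(self):
        self.parent = {}

    def find(self, strain):
        """Return the representative strain number of the group."""
        self.parent.setdefault(strain, strain)
        while self.parent[strain] != strain:
            # Path halving keeps later lookups short.
            self.parent[strain] = self.parent[self.parent[strain]]
            strain = self.parent[strain]
        return strain

    def link(self, a, b):
        """Record a discovered pointer: a and b denote the same organism."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            # The smaller string becomes the root, for reproducible output.
            self.parent[max(root_a, root_b)] = min(root_a, root_b)

    def global_ids(self):
        """Assign one (hypothetical) global identifier per group."""
        groups = {}
        for strain in sorted(self.parent):
            groups.setdefault(self.find(strain), []).append(strain)
        return {
            f"GID-{i}": members
            for i, (_, members) in enumerate(sorted(groups.items()), start=1)
        }


# Made-up pointers found in three hypothetical catalogues.
groups = StrainGroups()
groups.link("A 1", "B 7")
groups.link("B 7", "C 3")
groups.link("D 4", "E 9")
for gid, members in groups.global_ids().items():
    print(gid, members)
```

Because pointers are merged transitively, a link found in one catalogue and a link found in another end up in the same group even if no single source lists all three numbers together.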

By following this approach, you can test various hypotheses and answer different questions about the data. One particular question we focused on was how many of the organisms whose complete genome sequence is available from public databases are also available from public culture collections. To answer this question, we took the integrated information from the culture collections and simply linked it with the Genomes OnLine Database (GOLD; www.genomesonline.org) [2]. What we found was a tremendous gap between the availability of genomic information and the availability of the sequenced organisms in public culture collections.
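
Once both sides share an identifier, a linkage of this kind reduces to a set comparison. The sketch below is only an illustration under that assumption: the function name is hypothetical and the strain identifiers are toy data, not values taken from GOLD or from any catalogue.

```python
def coverage_gap(sequenced_strains, collection_strains):
    """Return the fraction of sequenced strains absent from all collections."""
    sequenced = set(sequenced_strains)
    if not sequenced:
        return 0.0
    missing = sequenced - set(collection_strains)
    return len(missing) / len(sequenced)


# Toy data: identifiers of sequenced organisms and of deposited strains.
toy_sequenced = ["S1", "S2", "S3", "S4"]
toy_collections = ["S1", "S3", "S9"]

gap = coverage_gap(toy_sequenced, toy_collections)
print(f"{gap:.0%} of the sequenced strains have no public deposit")
```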

In bacterial taxonomy there is a rule stating that if you want to describe a novel species, you have to deposit its type strain in at least two culture collections in two different countries. This rule safeguards that the species remains available for further research. No similar rule applies when depositing and publishing the complete genome sequence of an organism. It seems natural that researchers would make the biological material available in order to add value to their publication of a whole-genome sequence. However, the results of our investigation show that more than 50 percent of the complete genome sequences that have been deposited in the public sequence databases do not have a publicly accessible organism.

Let me now give a more detailed description of the StrainInfo bioportal (www.straininfo.net) that we developed [1]. One major purpose of the bioportal is to increase discoverability of the biological material preserved within a global network of culture collections. One possible way to access the bioportal is to enter the scientific name of an organism. You can also enter an accession number assigned to a sequence record by the INSDC, or a strain number assigned to an organism by any culture collection. The smart search feature of the system will figure out what your search terms mean, so there is no need to specify what type of identifier was used. For example, if you enter a strain number like LMG 6923, the system will display all available information about that particular microorganism by collecting strain information from all culture collections. The system collects all this information on the fly, and tries to associate it with related information from taxonomic databases, sequence databases, publication repositories and so forth. From the resulting information one can easily find the different collections that have a copy of this organism, simply by looking at their geographic distribution on a world map. In addition, the resulting information includes all genome sequences from this organism and pointers to the published literature that made use of this organism.
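
How a search box might guess the type of a query can be sketched with a few regular expressions. This is an assumption-laden illustration and not the smart search of the bioportal itself: the three patterns are deliberately simplified (one or two letters followed by digits for an accession number, an upper-case collection acronym followed by a number for a strain number, a capitalized genus with an optional epithet for a scientific name), and the accession number in the example is made up.

```python
import re

# Simplified, hypothetical patterns; real identifiers have more variants.
ACCESSION = re.compile(r"^[A-Z]{1,2}\d{5,8}(\.\d+)?$")
STRAIN_NUMBER = re.compile(r"^[A-Z]{2,10}\s+[A-Z]?\d+[A-Za-z]?$")
SCIENTIFIC_NAME = re.compile(r"^[A-Z][a-z]+(\s[a-z]+)?$")


def classify(query):
    """Guess what kind of identifier the user typed."""
    normalized = " ".join(query.split())  # collapse stray whitespace
    if ACCESSION.match(normalized):
        return "sequence accession"
    if STRAIN_NUMBER.match(normalized):
        return "strain number"
    if SCIENTIFIC_NAME.match(normalized):
        return "scientific name"
    return "free text"


for query in ["LMG 6923", "AB123456", "Pseudomonas fluorescens", "soil isolate 12"]:
    print(f"{query!r}: {classify(query)}")
```

Trying the most restrictive pattern first keeps ambiguous input from being claimed by a looser rule; anything unrecognized falls back to a plain text search.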

If you want to drill down to more detailed information, you can, for example, visit the individual online catalogues of the culture collections. Deep links to related information in these online catalogues are provided, so users do not need to know all the strain numbers that the different collections have assigned to the same organism. Say that you know the strain number assigned to an organism by the American Type Culture Collection (ATCC). In the public sequence databases you may find sequences that are linked to that particular ATCC number. What we add are links to all sequences, regardless of the strain number used when the sequence was deposited.

The StrainInfo bioportal offers a way to discover information by fetching all related data and trying to make sense of it using different integration strategies. For example, the bioportal provides the entire genealogy of a strain, from the initial isolate down to its distribution from one culture collection to another or from one researcher to another. Integrating this genealogical information is extremely difficult. First, the way it is encoded in the catalogues of the culture collections is completely unstandardized. Second, quite a lot of implicit distribution information is missing from the catalogues of the culture collections. For example, collections change names as they move from one funding agency to another, as two or more collections merge, or for other reasons. Whereas people who are intimately familiar with the history of culture collections might know that Collection A has changed its name to Collection B and then to Collection C, the broader community is not aware of such reorganizations. This might introduce uncertainties in the distribution history of reference strains.

As a countermeasure, we developed an algorithm that can automatically reconstruct the strain distribution history from unstandardized and incomplete textual descriptions [8]. My original idea was to put a Ph.D. student to work on building an editor so that end users could reconstruct strain distribution histories by manually gluing bits and pieces of information together. But during the development process the student came to me and said, “I think I found a way to build these histories automatically”. I did not believe him at first, and made him convince me that the automatic predictions actually were correct.
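
The published algorithm [8] is not reproduced here. As a much simpler illustration of the underlying task, the sketch below assumes that each catalogue records a partial history as a chain of the form "recipient <- source <- earlier source", turns every chain into source-to-recipient edges, and merges the edges from several catalogues into one graph. The chain notation, the function names and the toy strain numbers are assumptions made for this example.

```python
from collections import defaultdict


def parse_chain(history):
    """Split 'A <- B <- C' into (source, recipient) edges: (B, A) and (C, B)."""
    holders = [part.strip() for part in history.split("<-") if part.strip()]
    return [(holders[i + 1], holders[i]) for i in range(len(holders) - 1)]


def merge_histories(histories):
    """Combine partial chains from several catalogues into one edge set."""
    received_from = defaultdict(set)
    for history in histories:
        for source, recipient in parse_chain(history):
            received_from[recipient].add(source)
    return received_from


def original_sources(received_from):
    """Holders that passed the strain on but never received it themselves."""
    sources = set().union(*received_from.values())
    return sources - set(received_from)


# Two toy catalogue entries that each know only part of the history.
toy_histories = [
    "COLL-A 10 <- COLL-B 5 <- isolator",
    "COLL-C 7 <- COLL-B 5",
]
graph = merge_histories(toy_histories)
for recipient in sorted(graph):
    print(recipient, "received from", sorted(graph[recipient]))
print("origin:", sorted(original_sources(graph)))
```

Even in this toy form, merging shows why using all catalogues at once helps: the entry for COLL-C 7 says nothing about the isolator, yet the combined graph traces it back to the same origin.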

To prove the success rate of his automatic reconstruction algorithm, the student first undertook a fully manual curation experiment. He took the collection of all 8,000 bacterial type strains as a working data set for which the strain distribution history needed to be reconstructed, and went to the Laboratory of Microbiology at Ghent University, trying to convince the local microbiologists to reconstruct all 8,000 strain histories by hand using the information that was made available from the StrainInfo bioportal. At the same time, he processed the data with his reconstruction algorithm to construct the strain distribution histories in a completely automatic fashion. After three or four weeks, enough histories had been manually reconstructed (about 60 percent) to evaluate the success rate of the automated predictions.

A comparison of the manual and automatic history reconstructions showed that 98 percent of the histories that were manually created could be rebuilt automatically, with only a small number of inconsistencies. After inspecting those inconsistencies, it often turned out that the manual curation was wrong. The automated reconstruction algorithm uses all available strain distribution information at the same time, which overall makes it more robust than the manual reconstruction. Only with respect to the completeness of the reconstructed strain histories was the automated reconstruction algorithm outperformed by manual reconstruction. The reason for this observation is that some of the strain distribution information is not explicitly available from the online catalogues of the culture collections. Manual curators can compensate for this missing implicit information using their background knowledge about the problem domain. For example, they may know some of the relationships between the culture collections or the people working in those culture collections.

The bottom line is that we were able to motivate some experts to make a manually created data set and use it as a benchmark to prove that we could automate the whole process. As such, we could automatically reconstruct the strain distribution history for more than 700,000 strains of microorganisms that are available from a global network of culture collections. In order to counterbalance errors made in predicting the strain distribution history, the StrainInfo bioportal allows its end users to make corrections if they find mistakes and to make updates whenever they have additional information.

This is one example that shows how we were able to use a semi-automatic approach to conquer a problem that seemed impossible to automate at first sight. We found that we could approximate human curation with automatic prediction while allowing end users to make annotations and corrections to further enhance the quality of the information.

As a second example, I will demonstrate the ontogrator experiment (tools.envotestsite.org/ontogrator) that makes use of information extracted from the StrainInfo bioportal. Usually, ontologies are used when autonomous and heterogeneous data sets need to be integrated into a single portal. This is the general approach taken by the ontogrator, which automates the integration pipeline for a given set of data sources and a given set of ontologies. As an experiment, the researchers who developed and implemented the ontogrator used the following data sources: CAMERA [4], PubMed, GOLD [2], SILVA [3], and StrainInfo [1]. In addition, they used a series of controlled vocabularies, or ontologies, related to ecology, geographical locations, habitats, and so forth. Next, they integrated the different data sources based on the fact that they can be linked through a common vocabulary. The integrated interface resulting from the ontogrator approach allows one, for example, to search for all entities that relate to dairy products. The underlying knowledge base knows that yogurt, cheese and ice cream are all dairy products, so the user does not need to specify this. And yes, it is also possible to specify exactly what kind of dairy product you are looking for. Searching StrainInfo this way produces hits on organisms in the public culture collections that were isolated from dairy products, along with their descriptions. From those organisms it is possible to jump directly to related information in CAMERA or the published literature, or to get their complete genome sequences. Simply by using a shared vocabulary, we can allow users to use faceted browsing as a way to relate pieces of information that were not explicitly related to one another.
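
The dairy example can be sketched as a search over an is-a hierarchy: a record matches a query when the term it is annotated with is the query term itself or something narrower. This is a toy illustration and not the ontogrator code. Only the three dairy terms come from the description above; the other vocabulary entries, the record identifiers and the function names are hypothetical.

```python
# Toy vocabulary: each term points to its broader term (child -> parent).
IS_A = {
    "yogurt": "dairy product",
    "cheese": "dairy product",
    "ice cream": "dairy product",
    "dairy product": "food product",    # hypothetical broader term
    "soil": "terrestrial habitat",      # hypothetical unrelated branch
}


def ancestors(term):
    """Return the term followed by every broader term above it."""
    chain = []
    while term is not None and term not in chain:
        chain.append(term)
        term = IS_A.get(term)
    return chain


def search(records, query):
    """Return ids of records annotated with the query term or a narrower one."""
    return [record_id for record_id, term in records if query in ancestors(term)]


# Made-up records: (identifier, isolation source as annotated).
records = [
    ("strain-1", "cheese"),
    ("strain-2", "soil"),
    ("strain-3", "yogurt"),
    ("strain-4", "dairy product"),
]

print(search(records, "dairy product"))
print(search(records, "cheese"))
```

The same lookup serves both the broad query and the narrow one, which is what lets a faceted interface offer "dairy product" and "cheese" as filters over a single set of annotations.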

As a final comment, when we were building our initial prototype of StrainInfo, we deliberately decided not to put any extra burden on culture collections and their staff. Knowing that culture collections overall have limited information technology resources, we took up the challenge of working with the information as it was available and seeing how far we could go in our integration experiment. We simply screen scraped the data from the online catalogues of the culture collections and indexed it in somewhat the same way Google does, as common data exchange formats have simply not been adopted in the field of culture collections. This approach worked initially, until we came to the point that we were indexing more than 60 culture collections. By that time, however, it had become clear to the culture collections that we had built a software platform that gave them more visibility. This increased their willingness to make some additional effort in helping us to scale up the integration process. Instead of simply screen scraping the HTML-formatted data from the online catalogues, we now offer the culture collections the option of exporting their data in a standardized exchange format called the Microbiological Common Language (MCL) [7]. This allows us to index the culture collections more frequently, extract and integrate more detailed information, and scale up the number of culture collections being indexed.
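
To illustrate why a structured export is easier to index than scraped HTML, the sketch below reads a small XML document with the Python standard library. The element and attribute names are invented for this example and do not reproduce the actual MCL schema; the catalogue content is toy data.

```python
import xml.etree.ElementTree as ET

# Hypothetical export; element names are NOT taken from the MCL standard.
SAMPLE_EXPORT = """
<catalogue collection="COLL-A">
  <strain number="COLL-A 10">
    <name>Pseudomonas fluorescens</name>
    <isolatedFrom>soil</isolatedFrom>
    <otherNumber>COLL-B 5</otherNumber>
  </strain>
  <strain number="COLL-A 11">
    <name>Lactobacillus delbrueckii</name>
    <isolatedFrom>yogurt</isolatedFrom>
  </strain>
</catalogue>
"""


def read_export(xml_text):
    """Turn one catalogue export into a list of plain strain records."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for strain in root.findall("strain"):
        records.append({
            "collection": root.get("collection"),
            "number": strain.get("number"),
            "name": strain.findtext("name"),
            "isolated_from": strain.findtext("isolatedFrom"),
            # Numbers the same organism carries in other collections.
            "other_numbers": [e.text for e in strain.findall("otherNumber")],
        })
    return records


for record in read_export(SAMPLE_EXPORT):
    print(record["number"], "|", record["name"], "|", record["isolated_from"])
```

With explicit fields, the indexer no longer has to guess which part of a web page holds the strain number or the isolation source, so one reader works for every collection that provides such an export.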

We were aware that it would be extremely difficult to convince culture collections to export their data in a standardized XML format, even though this might seem quite straightforward to a computer expert. But because the culture collections could directly see the added value of the initial prototype, more and more of them started to provide us their data in the MCL format, and more culture collections wanted to become members of StrainInfo as soon as possible. Gradually introducing a data exchange standard will thus allow us to scale up the integration experiment behind the StrainInfo bioportal from the five culture collections we initially had in mind to the more than 500 culture collections that are members of the World Federation of Culture Collections.


1. Dawyndt P, Vancanneyt M, De Meyer H, Swings J. Knowledge Accumulation and Resolution of Data Inconsistencies during the Integration of Microbial Information Sources. IEEE Transactions on Knowledge and Data Engineering. 2005;17(8):1111–1126.
2. Kyrpides N. Genomes OnLine Database (GOLD): a monitor of complete and ongoing genome projects worldwide. Bioinformatics. 1999;15:773–774. [PubMed: 10498782]
3. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Research. 2007;35(21):7188–7196. [PMC free article: PMC2175337] [PubMed: 17947321]
4. Seshadri R, Kravitz SA, Smarr L, Gilna P, Frazier M. CAMERA: A Community Resource for Metagenomics. PLoS Biol. 2007;5(3):e75. [PMC free article: PMC1821059] [PubMed: 17355175]
5. Stein L. Creating a bioinformatics nation. Nature. 2002;417:119–120. [PubMed: 12000935]
6. Stein L. Integrating biological databases. Nature Reviews Genetics. 2003;4:337–345. [PubMed: 12728276]
7. Verslyppe B, Kottmann R, De Smet W, De Baets B, De Vos P, Dawyndt P. Microbiological Common Language (MCL): a standard for electronic information exchange in the Microbial Commons. Res Microbiol. 2010;161(6):439–45. [PubMed: 20211251]
8. Verslyppe B, De Smet W, De Baets B, De Vos P, Dawyndt P. Make Histri: Reconstructing the exchange history of bacterial and archaeal type strains. Systematic and Applied Microbiology. 2011 [PubMed: 21514082]