Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.

TaxPub: An Extension of the NLM/NCBI Journal Publishing DTD for Taxonomic Descriptions

TaxPub is an extension of the NLM/NCBI Journal Publishing DTD (Version 3.0) for encoding the literature of biological taxonomy. A key feature of this literature is the taxonomic description: publications or sections of publications that name and describe species and present other taxonomic information. It is estimated that the majority of all species have yet to be described and that some 15,000-20,000 new species are described each year. Because markup might be applied prior to publication at less expense than to existing publications, TaxPub aims to provide a tagset for encoding new taxonomic literature. TaxPub extends the Publishing ("Blue") DTD parsimoniously. A few phrase-level elements are available at the relevant places throughout the entire DTD. Most of the extension, however, occurs in a single section-level element, <tp:taxon-treatment>. The development of the extension proceeded smoothly, but several challenges have been encountered: lack of consensus on the components of taxonomic descriptions; the relationship and alignment of TaxPub to related schemas in the field; decisions on whether to create new elements or use existing NLM DTD elements, and how to document and validate the usages; resistance to DTD as the XML schema language; and the efficiency of creating a superset extension rather than using simpler profiling mechanisms.

Introduction

TaxPub is an extension of the NLM/NCBI Journal Publishing DTD (Version 3.0) for encoding the literature of biological taxonomy. A key feature of this literature is the taxonomic “treatment”: publications or (more frequently) sections of publications documenting the features or distribution of a related group of organisms (called a “taxon”, plural “taxa”) in ways that adhere to highly formalized conventions. Some of these conventions are over a century old and are maintained by scientific commissions accepted by the profession. Two of the most significant are the international standard for naming animals, the International Code of Zoological Nomenclature (ICZN), and the corresponding code for plants, the International Code of Botanical Nomenclature (ICBN).

The features and structure of treatments have varied across time as well as across and within publications. Despite the variation, however, a few key features are commonly found. First, and most important, is a section usually displayed as a heading presenting information related to the naming of the described species or other taxon of higher rank in one or another standard hierarchy. This “nomenclature” section contains at minimum the name of the taxon. Often following the name is an indication of whether the taxon is new to science and the name or names of the persons responsible for the naming. Citations of earlier treatments are also very common for taxa that are not new. Other information, such as standard identifiers and references to physical specimens, may also be found.

A number of other sections may follow the nomenclature section. Perhaps the most significant is a section frequently titled “Materials Examined”, citing the specimens or other materials (e.g., DNA sequences) used as the basis of the treatment. This section often includes the circumstances of collection and/or deposition at a museum or other institution. Historically, these details allowed scientists to visit the holding institution (or seek a loan) for further scientific investigation of the very material that was described by the treatment. Also common is a “Description” section providing information (often in highly structured language, and sometimes in tabular form) on the distinctive features of the collected organisms, with the aim of characterizing the entire taxonomic class that such material represents.

Similar to a Description section is the “Diagnosis”, which contains descriptions of only those features “that distinguish that species from others, in the same way that the disease identification you receive when you visit the doctor is called the diagnosis because the doctor has distinguished your illness from all other possibilities based on the basis of your symptoms and tests.” [Winston, 189]. Other treatment sections may include an “Etymology” section explaining the origins of the taxon name, sections summarizing the spatial and temporal distribution of the taxon, and an “Ecology” section discussing behavior and relationships to habitat. For higher level taxa (such as genera and families) a “Key” presenting a set of instructions for distinguishing lower level taxa from one another is also very common.

TaxPub

Background

The development of TaxPub is an outgrowth of an earlier effort to digitize the taxonomic literature of ants in order to develop data mining techniques for extracting species data from taxonomic literature. The work was originally performed as part of a joint U.S. National Science Foundation (NSF) and Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grant awarded to the American Museum of Natural History (AMNH) and the University of Magdeburg (later to Karlsruher Institut für Technologie/Karlsruhe Institute of Technology).

Development of TaxonX, an XML Schema for markup of treatments, had begun at AMNH prior to the NSF/DFG grant and continued through its duration. As the project was concluding, participants established Plazi Verein, a Switzerland-based independent not-for-profit organization aiming to help remove technological, social, and legal barriers to the creation of and access to taxonomic literature. Among its many activities, Plazi maintains the TaxonX schema and a repository of XML-encoded publications, develops the semi-automatic markup tool GoldenGATE [Sautter et al., 2007], and strenuously advocates for open access to scientific literature [Agosti and Egloff, 2009]. As part of these efforts, Plazi has encoded approximately 500 publications containing roughly 11,000 treatments using the TaxonX schema [Sautter et al., 2009]. This experience greatly informed both the rationale and design of the TaxPub extension.

Rationale

It is estimated that the majority of species on earth have yet to be described* and that each year some 15,000-20,000 new species are described [Polaszek et al., 2005]. Yet many efforts to digitize taxonomic literature, including Plazi's, have predominantly focused on the minority of species already described. It is time consuming and costly to convert the legacy literature to XML. Challenges are faced at the most basic level, that of the accuracy of the transcription of source texts. OCR may yield good results for clean copies of modern documents, but for older publications accuracy suffers. Costs are then incurred either in correcting OCR errors or in double-keying, the latter not scaling well for massive digitization efforts. Even with 100% accurate texts, encoding remains a challenge. Particularly because of variant editorial practices, a wide range of styles and text structures are present in the existing literature. TaxonX, for example, became a very loose schema in order to accommodate the variation. The TaxonX treatment element was eventually made available almost everywhere in the schema after treatments were encountered in a variety of locations in source documents, even in footnotes. The laxity of the schema, however, confers little benefit on the processing of valid instances, making it difficult and expensive to program against. In addition to the problems of stylistic and formal variability, encoding information implicit in untagged text is a major task. For example, interpreting and expanding abbreviations or parsing the components of a bibliographic reference, a scientific name, or a geographic reference can be time consuming and prone to error.**

Given the complexity and difficulty of digitizing existing taxonomic literature, and that it covers a minority of all species, greater benefit at less cost might be found in the encoding of new, born-digital taxonomic literature. Increasingly, treatments are derived from data maintained in databases, whether for names, specimens, or bibliographic references. This information could be exported into XML directly, saving an enormous amount of time and ensuring accuracy. The idea of generating publishable natural language treatments from databases arose in the early 1970s and was unambiguously in place by 1980 [Dallwitz, 1980]. The rise of XML has provided more tools to produce and exploit structured treatments, but often these tools are used backwards, with experts' time wasted on applying markup to already published literature. Indeed, in the case of recently published literature, information originating in parsed form in citation managers and databases is converted by an author to unstructured text for publication, only to be parsed out once again during the markup process.

Consensus on an XML schema often fosters development of tools, services, and applications utilizing suitably encoded data. TaxPub is an attempt to catalyze this process in the hope that the community will be intrigued, and find it useful enough to adopt and sustain.

Design and Development

In the second half of 2008, with the assistance of Jeff Beck and Laura Kelly of NCBI, Plazi developed the first draft of the extension now called TaxPub. Since then development has been assisted by Donat Agosti, an ant systematist, President of Plazi and research scientist at the American Museum of Natural History, and by Robert Morris, Emeritus Professor of Computer Science at University of Massachusetts at Boston and an Information Technology Associate of the Harvard University Herbaria. The project is hosted on SourceForge (http://sourceforge.net/projects/taxpub/) with the first release in December, 2008.

The first version release of TaxPub is scheduled for March 2011. A call for comments will be sent in December 2010 soliciting feedback and requests for new features. Subsequent releases will be backwards compatible until the next version release.

Rather than adapting TaxonX for publishing applications, it was more efficient to extend the NLM/NCBI DTD. The Journal Publishing DTD already included elements for general document features, so it was necessary only to add elements and attributes relevant to taxonomic descriptions. TaxPub extends the Publishing (“Blue”) DTD parsimoniously.

To better distinguish TaxPub elements from those of the base DTD, elements from the extension have been put into their own namespace, with element names starting with the prefix "tp:". A few phrase-level elements are made available at relevant places throughout the DTD. There are elements for scientific names, <tp:taxon-name>, citations of specimens and other materials, <tp:material-citation>, and descriptions of organisms’ physical characteristics, <tp:descriptive-statement>.
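As a rough illustration of how the namespace separates extension elements from base DTD elements, the following sketch lists the TaxPub elements found in a paragraph. The namespace URI bound to "tp:" is an assumption (the paper does not state it), and the sample paragraph and the name "Aus bus" are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "tp:" prefix; not given in the paper.
TP_NS = "http://www.plazi.org/taxpub"

# Invented sample paragraph mixing base DTD and extension elements.
SAMPLE = f"""
<p xmlns:tp="{TP_NS}">Workers of <tp:taxon-name>Aus bus</tp:taxon-name>
were examined (<tp:material-citation>1 worker</tp:material-citation>) and are
<tp:descriptive-statement>reddish brown</tp:descriptive-statement>.</p>
""".strip()


def extension_elements(root):
    """Return the local names of all elements in the TaxPub namespace."""
    prefix = "{" + TP_NS + "}"
    return [el.tag[len(prefix):] for el in root.iter() if el.tag.startswith(prefix)]


root = ET.fromstring(SAMPLE)
print(extension_elements(root))
```

Because the parser expands prefixes to full namespace names, the check does not depend on the document using the literal prefix "tp".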

The <tp:taxon-name> and <tp:descriptive-statement> elements have simple content models, each permitting any number of optional “part” elements for tagging the element's components. Required “-part-type” attributes provide further semantics. Because the field of biodiversity has many published vocabularies, URIs are available for many concepts and entities of interest. The addition of “-type-uri” attributes to all TaxPub elements with “-type” attributes is under consideration so that, if available, semantics may be provided through use of a URI as a value instead of, or in addition to, a string value.

Of course an additional attribute is not strictly necessary as users may already use URIs in the existing “-type” attributes. We encourage that usage.

Additionally, as in many TaxPub elements, the <object-id> element from the base DTD is available, again with the intention of allowing semantic enhancement through linkage to standard identifiers. <tp:taxon-name> also has special attributes: “auth-code” to report the nomenclatural code to which the tagged name conforms; “rank” to explicitly indicate the taxonomic rank (e.g., genus, species, etc.) of the named taxon; and “reg” (shared by <tp:taxon-name-part>) to contain a regularized form of an element's contents.
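A minimal sketch of how a consumer might use the “part” elements and the “reg” attribute to recover a full name from an abbreviated one. The namespace URI, the attribute name "taxon-name-part-type", the "auth-code" value, and the name itself are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TP_NS = "http://www.plazi.org/taxpub"  # assumed namespace URI
NS = {"tp": TP_NS}

# Invented sample: an abbreviated genus carries its full form in "reg".
SAMPLE = f"""
<tp:taxon-name xmlns:tp="{TP_NS}" rank="species" auth-code="ICZN">
  <tp:taxon-name-part taxon-name-part-type="genus" reg="Aus">A.</tp:taxon-name-part>
  <tp:taxon-name-part taxon-name-part-type="species">bus</tp:taxon-name-part>
</tp:taxon-name>
""".strip()


def regularized_name(name_el):
    """Join the name parts, preferring the regularized form when present."""
    parts = []
    for part in name_el.findall("tp:taxon-name-part", NS):
        parts.append(part.get("reg") or (part.text or "").strip())
    return " ".join(parts)


name = ET.fromstring(SAMPLE)
print(name.get("rank"), regularized_name(name))
```

Falling back to the element text when “reg” is absent keeps the function usable on names that were tagged without regularization.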

The other element available throughout the DTD, <tp:material-citation>, has a richer content model. Like bibliographic citations, specimen citations can be complex, with many pieces of information. To accommodate granular encoding, <tp:material-citation> allows #PCDATA, the Publishing DTD elements <named-content>, <xref>, and <object-id>, and the TaxPub elements <tp:taxon-name>, <tp:material-location> for information on the institution currently housing the referenced material, and <tp:collecting-event> for information on where, when, and by whom the specimen was found. The <tp:collecting-event> element has a number of sub-elements: <named-content>, <object-id>, and <date> from the base DTD, and the extension elements <tp:taxon-name> and <tp:collecting-location>. <tp:collecting-location> itself permits zero or more <object-id> elements, an optional <comment> element, and zero or more <tp:location> elements, each of which has a “location-type” attribute to specify whether the tagged location is a country, city, province, etc.
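The nesting described above can be sketched with a small reader that flattens a specimen citation into a dictionary. The namespace URI, the element spelling <tp:collecting-event>/<tp:collecting-location>, and all values in the sample (specimen count, places, year, institution) are illustrative assumptions rather than data from a real article.

```python
import xml.etree.ElementTree as ET

TP_NS = "http://www.plazi.org/taxpub"  # assumed namespace URI
NS = {"tp": TP_NS}

# Invented sample citation following the content model described in the text.
SAMPLE = f"""
<tp:material-citation xmlns:tp="{TP_NS}">1 worker,
  <tp:collecting-event>
    <tp:collecting-location>
      <tp:location location-type="country">Madagascar</tp:location>
      <tp:location location-type="province">Toamasina</tp:location>
    </tp:collecting-location>
    <date><year>2005</year></date>
  </tp:collecting-event>
  <tp:material-location>Example Museum</tp:material-location>
</tp:material-citation>
""".strip()


def summarize(citation):
    """Collect locations by type, the collecting year, and the holding institution."""
    locations = {
        loc.get("location-type"): (loc.text or "").strip()
        for loc in citation.iterfind(".//tp:location", NS)
    }
    year = citation.findtext(".//date/year")
    holder = citation.findtext("tp:material-location", default="", namespaces=NS)
    return {"locations": locations, "year": year, "held_by": holder.strip()}


print(summarize(ET.fromstring(SAMPLE)))
```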

Most of the extension occurs in a single section-level element, <tp:taxon-treatment>, available in the body of an NLM document. The <tp:taxon-treatment> element contains elements for metadata about the treatment itself, <tp:treatment-meta>, and for its component sub-sections: a required <tp:nomenclature> section and zero or more <tp:treatment-sec> elements. Originally, two other named treatment sections were included in the extension, <tp:description> and <tp:materials-examined>, but as their content models did not differ from that of <tp:treatment-sec>, they were removed. A “treatment-sec-type” attribute is available to provide specific semantics for <tp:treatment-sec>, but aside from the inclusion of the other TaxPub elements available throughout the DTD, the content model of <tp:treatment-sec> is essentially the same as that of a generic section.
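Because treatments are explicit elements, a program can address them and their sub-sections directly. The sketch below outlines one treatment and enforces the required nomenclature section; the namespace URI and the sample content are assumptions, and the check is a hand-written stand-in for what DTD validation would normally enforce.

```python
import xml.etree.ElementTree as ET

TP_NS = "http://www.plazi.org/taxpub"  # assumed namespace URI
NS = {"tp": TP_NS}

# Invented sample treatment with a nomenclature section and two sub-sections.
SAMPLE = f"""
<tp:taxon-treatment xmlns:tp="{TP_NS}">
  <tp:nomenclature><tp:taxon-name>Aus bus</tp:taxon-name></tp:nomenclature>
  <tp:treatment-sec treatment-sec-type="description">
    <title>Description</title><p>Body reddish brown.</p>
  </tp:treatment-sec>
  <tp:treatment-sec treatment-sec-type="materials-examined">
    <title>Materials Examined</title><p>1 worker.</p>
  </tp:treatment-sec>
</tp:taxon-treatment>
""".strip()


def outline(treatment):
    """Return (taxon name, list of section types) for one treatment."""
    nomenclature = treatment.find("tp:nomenclature", NS)
    if nomenclature is None:
        raise ValueError("treatment has no tp:nomenclature section")
    name = nomenclature.findtext("tp:taxon-name", default="", namespaces=NS)
    sections = [
        sec.get("treatment-sec-type")
        for sec in treatment.findall("tp:treatment-sec", NS)
    ]
    return name.strip(), sections


print(outline(ET.fromstring(SAMPLE)))
```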

The only required element in the TaxPub extension is <tp:nomenclature>. Its content model is more complicated than other extension elements because it must model and conform to the very formal structure required by the aforementioned nomenclatural codes. <tp:nomenclature> must contain a <tp:taxon-name>, which includes the name of the organism being described by the treatment. Indication that a taxon is a new species or genus is handled by a <tp:taxon-status> element. A <tp:taxon-authority> element may be used for a “brief bibliographic reference to the original publication of the [taxon] name” [Winston, 130] required by nomenclatural codes and typically in the form of an author’s last name followed by the year of publication. For more granular markup, <tp:taxon-authority-part> elements with “tp:taxon-authority-part-type” attributes are available.

The codes address other complexities of citations (e.g., multiple authors, a species being moved to a different genus since the original publication, etc.), but the current <tp:taxon-authority> model ought to be sufficient. Following the citation of taxon authorship there will frequently be a series of citations “of all the names that have been used in published references to [the described] taxon” [Winston, 136]. TaxPub provides a <tp:nomenclature-citation-list> element to group <tp:nomenclature-citation> elements for these citations. The citations may consist of several parts. First is a reference to a name, consisting of a required <tp:taxon-name>, followed by zero or more <tp:taxon-author> elements. Next is a bibliographic reference to the publication in which the taxon was named, for which <mixed-citation> (for an inline citation) or <xref> (for a link to an entry in a reference list) may be used. A reference to specimens may be present, for which <tp:material-citation> is available. Other information may be included in an optional <comment> element. As it models perhaps the most complex, least standardized component of taxonomic descriptions, <tp:nomenclature-citation> will no doubt be subject to further review and criticism, and will likely be revised frequently until a stable element definition is achieved.
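To make the shape of a nomenclature section more concrete, here is a small sketch that lists each cited name together with the kind of bibliographic reference attached to it. The namespace URI, the names, the authority string, and the reference id "B1" are invented for illustration.

```python
import xml.etree.ElementTree as ET

TP_NS = "http://www.plazi.org/taxpub"  # assumed namespace URI
NS = {"tp": TP_NS}

# Invented sample: one citation links to a reference list entry, one is inline.
SAMPLE = f"""
<tp:nomenclature xmlns:tp="{TP_NS}">
  <tp:taxon-name>Aus bus</tp:taxon-name>
  <tp:taxon-authority>(Author, 1900)</tp:taxon-authority>
  <tp:nomenclature-citation-list>
    <tp:nomenclature-citation>
      <tp:taxon-name>Xus bus</tp:taxon-name>
      <xref ref-type="bibr" rid="B1">Author 1900</xref>
    </tp:nomenclature-citation>
    <tp:nomenclature-citation>
      <tp:taxon-name>Aus bus</tp:taxon-name>
      <mixed-citation>Other 1950: 12.</mixed-citation>
      <comment>Combination in Aus.</comment>
    </tp:nomenclature-citation>
  </tp:nomenclature-citation-list>
</tp:nomenclature>
""".strip()


def citation_summary(nomenclature):
    """Return (cited name, (reference kind, target id)) for every citation."""
    rows = []
    path = "tp:nomenclature-citation-list/tp:nomenclature-citation"
    for cit in nomenclature.iterfind(path, NS):
        name = cit.findtext("tp:taxon-name", default="", namespaces=NS).strip()
        xref = cit.find("xref")
        if xref is not None:
            ref = ("xref", xref.get("rid"))
        elif cit.find("mixed-citation") is not None:
            ref = ("mixed-citation", None)
        else:
            ref = (None, None)
        rows.append((name, ref))
    return rows


for row in citation_summary(ET.fromstring(SAMPLE)):
    print(row)
```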

Implementations

In 2009 initial tests using TaxPub were performed. Norman Johnson, of Ohio State University (OSU) and a Plazi member, produced treatments from a database tracking morphological features of wasp species described as part of the NSF-sponsored Planetary Biodiversity Inventories program***. The resulting TaxPub-encoded treatments contain nomenclature sections, a description section containing standardized descriptions of morphological features, a listing of specimens used as the basis of the treatment, including the locations of collection and of deposition, and a link to a map showing the distribution of the specimens. Significantly, the marked-up text was generated by software directly from the database. The OSU implementation realized one of the primary objectives of TaxPub: database-driven publication of species descriptions in order to enable less lossy, more rapid publication of data-rich descriptions.
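The database-driven approach can be sketched as a function that turns one database record into a treatment element. This is not the OSU software; the record layout, the namespace URI, and the "sp. n." status text are hypothetical choices made for the example.

```python
import xml.etree.ElementTree as ET

TP_NS = "http://www.plazi.org/taxpub"  # assumed namespace URI
ET.register_namespace("tp", TP_NS)     # serialize with the "tp:" prefix


def tp(local_name):
    """Qualify a local element name with the TaxPub namespace."""
    return f"{{{TP_NS}}}{local_name}"


def build_treatment(record):
    """Build a tp:taxon-treatment element from a dict-like database record."""
    treatment = ET.Element(tp("taxon-treatment"))
    nomenclature = ET.SubElement(treatment, tp("nomenclature"))
    ET.SubElement(nomenclature, tp("taxon-name")).text = record["name"]
    if record.get("is_new"):
        ET.SubElement(nomenclature, tp("taxon-status")).text = "sp. n."
    for sec_type, text in record["sections"]:
        sec = ET.SubElement(
            treatment, tp("treatment-sec"), {"treatment-sec-type": sec_type}
        )
        ET.SubElement(sec, "title").text = sec_type.capitalize()
        ET.SubElement(sec, "p").text = text
    return treatment


# Hypothetical record as it might come out of a specimen database.
record = {
    "name": "Aus bus",
    "is_new": True,
    "sections": [("description", "Body reddish brown.")],
}
print(ET.tostring(build_treatment(record), encoding="unicode"))
```

Generating the markup from structured fields avoids the round trip of flattening data to prose and parsing it back out later.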

Soon after the initial release of TaxPub, Plazi was joined by Pensoft, the publisher of the online open access taxonomy journal ZooKeys, in a collaboration to integrate TaxPub into its publication workflow. The approach differed from OSU’s in applying markup to submitted manuscripts. Pensoft therefore faced a set of challenges similar to those of retrospective conversion, among them the identification and encoding of treatments, scientific names, and bibliographic references. Having developed its own software tools, ZooKeys began in 2010 to publish TaxPub versions of its articles.**** Although lacking a very fine level of markup granularity (for example, <tp:material-citation> is not used), the ZooKeys articles accomplish many of the goals of the TaxPub extension. Treatments are identified, and thus are directly and easily machine addressable, as are treatment sub-sections. All scientific names and name parts are tagged with <tp:taxon-name> elements. <tp:nomenclature-citation> elements include <tp:taxon-name> and link to full bibliographic entries, themselves marked up with <mixed-citation>. Significantly, because TaxPub motivated and enabled its use of the NLM DTD, ZooKeys articles will be archived in PubMed Central.

Problems Encountered and Lessons Learned

The development of the extension has proceeded smoothly but with some challenges. While taxonomic treatments do appear to follow conventional patterns, there is in fact no consensus on what the structural components of a treatment are, nor even what they are named. This is a problem even within the domain of zoology (the primary focus of TaxPub to date), but more so if one seeks consensus that simultaneously encompasses the domains of zoology, botany, and bacteriology. More discussion needs to take place beyond the limited circle of TaxPub if there is any hope that the extension will be useful for any purposes beyond Plazi’s own.

A number of XML schemas (e.g., for names, specimens, descriptive data, and phylogenetics) are in use or under development in biological taxonomy and related fields. There is significant interest in integrating or harmonizing TaxPub with these related schemas. Care must be taken in the design of TaxPub to align with external schemas without compromising its integrity or complicating maintenance. One such schema is Darwin Core (http://rs.tdwg.org/dwc/index.htm), a representation-free controlled vocabulary, with implementations in XML Schema and in RDF, used for the exchange of specimen data. One option for alignment with Darwin Core was simply to incorporate its elements in TaxPub at the appropriate points in the DTD as, for instance, MathML elements are included. Ultimately this approach was rejected because of the maintenance burden of synchronizing TaxPub with Darwin Core, a schema not under Plazi’s control. Also, the likelihood that valid instances of TaxPub would become invalid if changes to Darwin Core were incorporated would complicate maintenance of applications developed against TaxPub. The approach decided upon was not to include Darwin Core at all, but eventually to document the use of URIs of Darwin Core terms as values for type attributes of relevant TaxPub and Publishing DTD elements, e.g., <named-content> and <tp:location>. Further testing and implementation of TaxPub will reveal whether this approach is effective or expressive enough.
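One way the URI-as-type-value approach might look in practice is sketched below: elements whose type attribute holds a Darwin Core term URI are collected into a flat record keyed by term name. The TaxPub namespace URI, the choice of "content-type" and "location-type" as the attributes to inspect, the term URI prefix, and the sample values are assumptions, since the paper notes that this usage had yet to be documented.

```python
import xml.etree.ElementTree as ET

TP_NS = "http://www.plazi.org/taxpub"   # assumed TaxPub namespace URI
DWC = "http://rs.tdwg.org/dwc/terms/"   # assumed Darwin Core term URI prefix
TYPE_ATTRIBUTES = ("content-type", "location-type")

# Invented sample: two elements typed with Darwin Core URIs, one with a local string.
SAMPLE = f"""
<tp:material-citation xmlns:tp="{TP_NS}">
  <named-content content-type="{DWC}catalogNumber">ABC-0001</named-content>
  <tp:collecting-event>
    <tp:collecting-location>
      <tp:location location-type="{DWC}country">Madagascar</tp:location>
    </tp:collecting-location>
  </tp:collecting-event>
  <named-content content-type="local-note">dry forest</named-content>
</tp:material-citation>
""".strip()


def darwin_core_record(root):
    """Map Darwin Core term names to the text of elements typed with their URIs."""
    record = {}
    for el in root.iter():
        for attr in TYPE_ATTRIBUTES:
            value = el.get(attr, "")
            if value.startswith(DWC):
                record[value[len(DWC):]] = "".join(el.itertext()).strip()
    return record


print(darwin_core_record(ET.fromstring(SAMPLE)))
```

Elements typed with plain strings are simply ignored, so the same instance can mix vocabulary-backed and ad hoc values.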

The choice of whether to create a new element or to expect use of an existing Publishing DTD element arose often in the design of the extension. In fact, most TaxPub elements (<tp:taxon-treatment>, <tp:nomenclature> and <tp:nomenclature-citation> excepted) closely resemble some Publishing DTD element that could be used instead. In modeling keys to taxa, for example, it was decided to rely on generic Publishing DTD elements, particularly those for tables, rather than create new special purpose elements. When use of existing elements is preferred, however, it becomes necessary to somehow express intentions regarding usage of the Publishing DTD. In implementing TaxPub for ZooKeys, Pensoft, for example, found nothing regarding keys in the extension, but was eventually provided by the TaxPub editors with instructions and examples on tagging keys using generic table elements. Extension of the DTD thus requires more than simply editing DTD and entity files. Those planning to create an extension should plan on producing, at minimum, written documentation not just on the extension but on use of the base DTD as well. Going beyond this, it is advantageous to express the intended or endorsed usages in a Schematron schema, if only to provide an example for implementers to do so themselves for their applications. Defining such profiling mechanisms places further burdens on those developing and maintaining an extension.

The issues of both documentation and alignment with other schemas were complicated by working with DTD as the XML schema language. The lack of robust namespace support in DTD removed the option of importing external schemas into TaxPub. Such importing would have made synchronization less onerous had it been decided, for example, to include Darwin Core elements in TaxPub. It would also have enabled the inclusion of XML data in TaxPub instances themselves, rather than linking to them as external documents. It may be argued that there is little practical difference between including XML data and linking to that data in external documents, but it is nevertheless a limitation to which extension editors must adjust. It must also be explained to users who have XML data they wish to include in articles, and to consumers who would prefer to access such data directly.

Though the Publishing DTD itself has excellent documentation, the means to include notes on usage in the DTD (through comments) is limited and ad hoc. Far more useful would be built-in mechanisms for inline documentation such as those available in W3C XML Schema and RELAX NG. Especially helpful is the possibility of XML-encoded documentation in schema annotations. This would allow for far richer, more readily processed, and more easily maintained documentation than is currently the case. Given the importance of documentation, it is frustrating and burdensome to have to work around the limitations on annotation in DTD.

In the biodiversity informatics community we have also encountered resistance to the use of DTD as the schema language for TaxPub on several other grounds. There is general unfamiliarity with DTD and a perception that it is “old fashioned” and complicated (e.g., “do I change *models.ent or *modules.ent?”). While not technical hindrances, such perceptions nevertheless impede acceptance of TaxPub by its target community. As a result there has been a need to educate users on XML schema languages, something not envisioned as a task at the outset.

Finally, at a more fundamental level, the question of whether to superset the Publishing DTD is a major consideration. TaxPub, as an extension, provides semantics beyond what is available in the base DTD by creating newly named elements, thus lending itself to domain-specific application. However, TaxPub instances may not be easily processed by applications already familiar with the Publishing DTD. TaxPub does not add many new elements with content models that could not be modeled using ordinary Publishing DTD elements. So why create a superset at all? Much could be accomplished through other methods of profiling. Most important for profiling is written documentation detailing usage, which is already a necessary task when creating an extension. A controlled vocabulary for the “-type” attributes of <named-content>, <sec> and similar generic elements (perhaps published in a machine-readable form such as SKOS) can effectively provide semantics for document features not already addressed in the Publishing DTD. Rules on usage and checking of controlled values of type attributes can be expressed in a Schematron schema and provide validation. While its customizability is a well-developed feature of the Publishing DTD, customization may not ultimately be the most effective or efficient approach for TaxPub.
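The profiling alternative can be sketched without any extension elements at all: treatments are plain <sec> elements, and a rule checker plays the role a Schematron schema would play. The vocabulary, the "taxon-treatment" value, and the two rules below are hypothetical; they stand in for whatever a community-agreed profile would specify.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary for sec-type values inside a treatment.
ALLOWED_SEC_TYPES = {
    "nomenclature", "description", "diagnosis", "materials-examined",
    "etymology", "distribution", "ecology", "key",
}

# Invented sample using only base DTD elements; "habits" is not in the vocabulary.
SAMPLE = """
<body>
  <sec sec-type="taxon-treatment">
    <sec sec-type="nomenclature"><title>Aus bus</title></sec>
    <sec sec-type="description"><title>Description</title></sec>
    <sec sec-type="habits"><title>Habits</title></sec>
  </sec>
</body>
""".strip()


def check_profile(root):
    """Return a list of profile violations found in treatment sections."""
    problems = []
    for treatment in root.iter("sec"):
        if treatment.get("sec-type") != "taxon-treatment":
            continue
        child_types = [child.get("sec-type") for child in treatment.findall("sec")]
        if "nomenclature" not in child_types:
            problems.append("treatment lacks a nomenclature section")
        for sec_type in child_types:
            if sec_type not in ALLOWED_SEC_TYPES:
                problems.append(f"unknown sec-type: {sec_type}")
    return problems


print(check_profile(ET.fromstring(SAMPLE)))
```

Instances tagged this way remain valid against the unmodified Publishing DTD, at the cost of moving the domain rules out of the DTD and into a separate check.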

Acknowledgments

The author wishes to thank Robert Morris, Norman Johnson, Donat Agosti, Lyubomir Penev, and Guido Sautter for their help in the writing of this paper.

  1. Agosti D, Egloff W. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes. 2009;2(1):53. DOI: 10.1186/1756-0500-2-53. [PMC free article: PMC2673227] [PubMed: 19331688]
  2. Dallwitz MJ. A general system for coding taxonomic descriptions. Taxon. 1980. 291-43. Also available at http://delta-intkey.com.
  3. Polaszek A, et al. A universal register for animal names. Nature. 2005;437:477. DOI: 10.1038/437477a. [PubMed: 16177765]
  4. Sautter G, Böhm K, Padberg F, Tichy W. Empirical Evaluation of Semi-Automated XML Annotation of Text Documents with the GoldenGATE Editor. Proceedings of European Conference on Research and Advances in Digital Libraries; 2007; Budapest, Hungary.
  5. Sautter G, Böhm K, Agosti D, Klingenberg C. Creating digital resources from legacy documents: An experience report from the biosystematics domain. Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications; 2009; Heraklion, Crete.
  6. “What is the Problem?: The Taxonomic Impediment”. Secretariat of the Convention on Biological Diversity; 2007. Available from: http://www.cbd.int/gti/problem.shtml.
  7. Willis A, King D, Morse D, Dil A, Lyal C, Roberts D. From XML to XML: The why and how of making the biodiversity literature accessible to researchers. Language Resources and Evaluation Conference (LREC); 19-21 May 2010; Malta.
  8. Winston J. Describing Species. New York: Columbia University Press; 1999.

Footnotes

*
** See Sautter et al., 2009 and Willis et al., 2010 for reports from projects involved in the retrospective conversion of legacy taxonomic literature to XML.

*** See vSysLab: a virtual Systematics Laboratory at http://vsyslab.osu.edu/index.html

**** See the ZooKeys 50 Special Issue for discussion of XML-enhanced scientific publications, the Pensoft workflow, and examples of TaxPub-encoded articles at http://pensoftonline.net/zookeys/index.php/journal/issue/view/52

Copyright 2010 by Terry Catapano.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License

Bookshelf ID: NBK47081
