NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2012 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2012.


Implementation of TaxPub, an NLM DTD extension for domain-specific markup in taxonomy, from the experience of a biodiversity publisher

Lyubomir Penev, Terence Catapano, Donat Agosti, Teodor Georgiev, Guido Sautter, and Pavel Stoev.


TaxPub was created as an XML extension to the general JATS to provide domain-specific markup for prospective publishing in the area of biological systematics. The core idea of the schema is to delimit descriptions of taxa, or treatments, within an article and also several sub-elements within a treatment, and to use these individual portions of information for various purposes. TaxPub was developed in a close cooperation between the author (Terence Catapano), a community interested in such markup (Plazi), the NLM JATS group and a journal publisher (Pensoft). Since July 2009, TaxPub has been routinely implemented in the everyday publishing practice of Pensoft, to provide: (1) Semantically enhanced, domain-specific XML versions of articles for archiving in PubMedCentral (PMC); (2) Visualization of taxon treatments on PMC; (3) Export of taxon treatments to various aggregators, such as Encyclopedia of Life, Plazi Treatment Repository, and the Wiki


Introduction

You can only protect what you know. The Earth Summit in 1992 and the subsequent Convention on Biological Diversity (CBD, 1992) were the first official global acknowledgement of an ongoing biodiversity crisis, with species assumed to be disappearing at an unprecedented rate. This assumption was based on inferences, mainly derived from the loss of tropical forests, and hardly at all on direct observation of the decline and extinction of species. It became clear that even the knowledge of the diversity of species, not to speak of its dynamics, is very limited. A comprehensive list of all scientifically known species did not exist, and still does not, nor did tools to identify species. This became known as the “Taxonomic Impediment” (GTI, 1998). Limited access to taxonomic information and the agonizingly slow process of scientific description of new species have been identified as core features of the Impediment. The problem became even more vexing with the growing reach and potential of the Internet and online publishing, which lend themselves to creating a seamless biodiversity knowledge system. In such a system, publications would be the tool for validating and importing new data and information. Though more than 17,000 new species are described every year (Polaszek et al., 2005), along with several times as many updated re-descriptions, they are scattered across more than 2,000 scientific journals, which are at best accessible in electronic PDF format. Without discouragingly large conversion efforts, such publications cannot serve as an import vehicle into a biodiversity knowledge system.

Though not domain specific, the US National Library of Medicine Journal Archiving and Interchange Tag Suite (NLM DTD, 2008) offers a basic structure that can serve as the starting point for building tools that transfer content from various journals at a finer granularity than the article level. NLM DTD based journals have the additional advantage that they can be more readily submitted to PubMed, and several of them are displayed in open access on PubMedCentral. The TaxPub extension of the NLM DTD furthermore adds domain-specific elements that semantically enhance the content and link it to dedicated databases. This allows machines in particular to harvest content and contribute to initiatives like the recently launched Intergovernmental Platform on Biodiversity and Ecosystem Services (IPBES, 2012), a follow-up of the CBD, thus reducing the Taxonomic Impediment. With the extraction barrier dramatically lowered, the remaining issue is the creation of new semantically enhanced and linked articles and journals, for which a dedicated journal production workflow is the solution.

This paper is a follow-up to an earlier presentation at JATS-Con (Catapano, 2010) on the creation of the domain-specific TaxPub NLM DTD, in which the concepts and techniques are described. Here we describe a professional journal production workflow for creating semantically enhanced and linked taxonomic publications encoded using TaxPub. Together with tools to disseminate the discovery of new taxa, as well as treatments of taxa in a particular area or habitat, such publications will reduce the Taxonomic Impediment considerably and help taxonomy catch up with the increased speed of discovery brought by additional molecular methods.


As discussed in an earlier paper presented at JATS-Con in 2010 (Catapano, 2010), the TaxPub extension was developed with an eye towards new taxonomic literature. Prior experience with retrospective conversion showed that any schema attempting to model the broad range of stylistic, editorial, and formal variation of legacy taxonomic literature would be so loose as to greatly challenge interchange as well as the development of consuming applications. TaxPub, conversely, was designed to be adequately constrained to facilitate interchange and application development. It was hoped that such constrained markup could be applied more easily during the authorial or editorial stages, as Pensoft has done in its implementation, rather than after publication.

The “taxon treatment” is the focus of TaxPub. Following publishing traditions in taxonomy, a taxon treatment is a formal description of a taxon, including sections on nomenclature, morphological characteristics, behaviour, ecology, distribution, and specimens examined. TaxPub primarily models these taxon treatment features, providing (within a namespace with the prefix “tp”) a <tp:taxon-treatment> element with a required <tp:nomenclature> element, which is highly structured and contains the essential data about the named species, and a <tp:treatment-sec> element for other sub-sections, for which a treatment-sec-type attribute provides specific semantics. In addition, TaxPub provides within most “Blue” DTD block level elements the elements <tp:taxon-name> to encode scientific names of organisms, <tp:material-citation> for references to specimens and other material, and <tp:descriptive-statement> for descriptions of physical characteristics of organisms. Beyond these elements, TaxPub relies on the elements in the “Blue” DTD for all other features. In particular, the <named-content> element is intended to be used for the wide range of phrase level data which may be of interest in taxonomy (e.g., locality information such as latitude, longitude, elevation, etc.). The intention is that users will employ terms from external controlled vocabularies (e.g., DarwinCore) to supply specific semantics. Indeed, un-extended NLM DTD/JATS could have been used in a similar way, with domain-specific terms supplying semantics in -type attribute values. This would have had the advantage of eliminating the need to produce the various modified entity files, but would have shifted the burden to usage documentation and development of non-DTD based validation (e.g., Schematron). As it turned out, the NLM/JATS DTD proved to be exceptionally well designed for extension. There is relatively little to be gained by not taking advantage of those mechanisms in favor of documenting and developing validation tools for a profile of the un-extended DTD, particularly if the extension is easily converted to generic NLM/JATS DTD, as TaxPub documents are in the process of ingest into PubMed Central.
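
As a rough illustration of how a consuming application might pull treatments out of an article, the following Python sketch parses a small, invented TaxPub-like fragment with the standard library. The namespace URI, the `sec-type` attribute spelling, the sample content, and the function name `extract_treatments` are assumptions made for this example; real TaxPub nomenclature markup is considerably richer.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "tp" prefix; check the TaxPub DTD actually in use.
TP_NS = "http://www.plazi.org/taxpub"
NS = {"tp": TP_NS}

# Hypothetical, heavily simplified treatment (not taken from a published article).
SAMPLE = """<sec xmlns:tp="http://www.plazi.org/taxpub">
  <tp:taxon-treatment>
    <tp:nomenclature>
      <tp:taxon-name>Aus bus</tp:taxon-name>
    </tp:nomenclature>
    <tp:treatment-sec sec-type="description">
      <title>Description</title>
      <p>Body length 5 mm.</p>
    </tp:treatment-sec>
    <tp:treatment-sec sec-type="distribution">
      <title>Distribution</title>
      <p>Known only from the type locality.</p>
    </tp:treatment-sec>
  </tp:taxon-treatment>
</sec>"""


def extract_treatments(xml_text):
    """Return one dict per taxon treatment: its name and its typed sub-sections."""
    root = ET.fromstring(xml_text)
    treatments = []
    for treatment in root.iter(f"{{{TP_NS}}}taxon-treatment"):
        name_el = treatment.find("tp:nomenclature/tp:taxon-name", NS)
        name = "".join(name_el.itertext()).strip() if name_el is not None else None
        sections = {}
        for sec in treatment.findall("tp:treatment-sec", NS):
            # Join the text of all paragraphs in this sub-section.
            text = " ".join("".join(p.itertext()).strip() for p in sec.findall("p"))
            sections[sec.get("sec-type", "unknown")] = text
        treatments.append({"name": name, "sections": sections})
    return treatments


if __name__ == "__main__":
    for t in extract_treatments(SAMPLE):
        print(t["name"], sorted(t["sections"]))
```

Because each treatment is returned independently of the surrounding article text, the same structure could feed any of the export routes described in this paper.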

A TaxPub encoded taxon treatment can be used for several purposes, the most important of which is retrieval of the taxon treatment and its sections separately from the article text. Figure 1 presents an example of how the sub-elements of a taxon treatment are either harvested by or exported to various aggregators of biodiversity information.

Fig. 1

Traditional taxonomy publication (left) and export of its content to different aggregators, performed with XML markup based on the TaxPub extension to the NLM JATS DTD

Minimal as it is, TaxPub has proven effective for encoding taxonomic articles in Pensoft journals over the past two years, now including additional journals (PhytoKeys, MycoKeys, Journal of Hymenoptera Research, International Journal of Myriapodology, Biodiversity Data Journal). Some minor changes have been made to the extension (e.g., the addition of the <x> element to the <tp:nomenclature> element for encoding interstitial punctuation), but it has remained quite stable. The latest version may be found on the TaxPub SourceForge site, and development is performed within the Subversion repository. While the extension itself is stable, work on the TaxPub documentation has been slow, thus delaying the 1.0 release. To date TaxPub has been an unfunded, volunteer effort by members of the organization Plazi. As previously mentioned, developing the extension per se was a discrete task that was accomplished relatively easily. What was learned, however, is that any schema is more than the schema files themselves. There is also a need to maintain the schema through versions, feature requests, bug tracking, support, etc., while preserving previous versions. What eventually follows is a need for documentation, examples, profiling tools, presentation and crosswalking stylesheets, etc., which themselves must be continuously revised as new versions emerge. These are open-ended tasks that place demands of time, resources, and funds on maintainers and developers, however dedicated. Starting in September 2012, further development and maintenance of TaxPub, its documentation, and related tools will be done by Plazi as part of the European Union funded project Pro-iBiosphere, which aims to plan for an integrated system for the management of biodiversity knowledge. The support and renewed focus should help to ensure that TaxPub will stabilize and be useful to more users for a longer time.

Pensoft’s Implementation

History and workflow

The implementation of the TaxPub XML schema in a routine editorial process started in 2009, when Pensoft’s journal ZooKeys launched several pilot projects and created the software tools needed to provide semantic tagging of taxonomy articles (Penev et al., 2010a; Catapano, 2010). While the workflow was being developed, the schema was tested against different types of manuscripts, and its structure was discussed and improved in close collaboration with the Plazi team. The testing and implementation phase was completed more than a year later with the publication of ZooKeys special issue No 50, ‘Taxonomy shifts up a gear: New publishing tools to accelerate biodiversity research’, which marked the journal’s innovative publishing model, based on an XML editorial workflow and on the domain-specific XML schema TaxPub (Penev et al., 2010a).

The process happened simultaneously with the development of the Pensoft Mark Up Tool (PMT), a program specially designed for XML tagging and exporting a TaxPub version of the published articles, compliant with the NLM JATS specifications. The PMT workflow is described by Penev et al. (2010b), and thus we only briefly present its key elements here:

  1. After a manuscript’s acceptance, the text is formatted in InDesign.
  2. A plug-in to InDesign identifies the article structure – namely sections and taxon treatments.
  3. A dedicated algorithm identifies all inline elements (taxon names, geographical coordinates, etc.) within the text.
  4. The marked proof is verified by a semi-automated method.
  5. The TaxPub XML is verified and exported into a semantically enhanced HTML version.
  6. The paper is published in three electronic versions: PDF, HTML and XML.
  7. The XML version is submitted to PubMedCentral for archiving and display.
  8. Taxon treatments are extracted and exported to various aggregators.
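
Step 3 of the workflow above, the automatic identification of inline elements, can be sketched for one element type. The source does not describe the algorithm PMT actually uses, so the following Python example is only a minimal stand-in: it detects decimal-degree coordinate pairs with a regular expression and wraps them in a <named-content> element. The pattern, the `content-type` value, and the function names are assumptions for illustration.

```python
import re

# Illustrative pattern for pairs such as "42.6977°N, 23.3219°E".
# Real manuscripts use many more notations (degrees/minutes/seconds, signed values, ...).
COORD_RE = re.compile(
    r"(?P<lat>\d{1,2}(?:\.\d+)?)°\s*(?P<lat_h>[NS])[,;\s]+"
    r"(?P<lon>\d{1,3}(?:\.\d+)?)°\s*(?P<lon_h>[EW])"
)


def find_coordinates(text):
    """Return (latitude, longitude) tuples as signed decimal degrees."""
    pairs = []
    for m in COORD_RE.finditer(text):
        lat = float(m.group("lat")) * (1 if m.group("lat_h") == "N" else -1)
        lon = float(m.group("lon")) * (1 if m.group("lon_h") == "E" else -1)
        pairs.append((lat, lon))
    return pairs


def tag_coordinates(text):
    """Wrap each detected coordinate pair in a <named-content> element.

    The input is assumed to be plain text that is already XML-escaped;
    the content-type value is a hypothetical choice borrowing a Darwin Core term.
    """
    return COORD_RE.sub(
        lambda m: '<named-content content-type="dwc:verbatimCoordinates">'
                  f"{m.group(0)}</named-content>",
        text,
    )


if __name__ == "__main__":
    sentence = "Collected at 42.6977°N, 23.3219°E in 2009."
    print(find_coordinates(sentence))
    print(tag_coordinates(sentence))
```

A pattern-based pass like this will miss or mis-tag some cases, which is one reason a verification step such as step 4 is needed.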

Export to Aggregators

The delimitation of taxon treatments within articles, and the possibility of exporting them in various XML formats independently of the rest of the article’s text, has established the basis for several further useful applications. A key advantage of TaxPub is that it extends the usage of the regular NLM JATS DTD into a domain-specific “atomization” of the article content, while at the same time providing the output format for papers to be archived and visualized on PubMedCentral. Soon after the launch of the TaxPub-based publication workflow, the NLM team configured the visualization of treatments within articles on PMC in a special section called “Supplementary Material” (Fig. 2).

Fig. 2

XML taxon treatments are visualized separately in a “Supplementary Material” section of PubMedCentral’s HTML version of each article

One of the first use cases of TaxPub has been implemented with the Encyclopedia of Life (EOL). On the day of publication, all newly described taxa are exported as XML to the EOL website, from where they are harvested on a daily basis and visualized on the EOL species pages (Fig. 3).

Fig. 3

Taxon treatments are extracted from the main text and exported into the Encyclopedia of Life (EOL) server, from where they are harvested and visualized on a daily basis

A slightly different workflow has been established with Plazi’s treatment repository. Plazi accesses the XML sitemap of the publisher’s website and harvests the XML versions of the published articles. Thereafter, treatments are extracted and converted into another XML schema, TaxonX (TaxonX, 2009), and then stored and visualized on Plazi’s website. From there, they can be exported to various aggregators, such as EOL and others (Fig. 4).

Fig. 4

Taxon treatments are harvested daily by Plazi’s server and stored at the Plazi Treatment Repository directly from the published XML version of each article
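
The first step of the Plazi workflow, reading the publisher’s XML sitemap to find article XML files, might look roughly like the Python sketch below. It uses the standard sitemap namespace but an invented sitemap with placeholder URLs, and it assumes that article XML files can be recognized by an `.xml` suffix; neither the real sitemap layout nor Plazi’s harvester is described in this paper. No network access is performed.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap; the host and paths are placeholders, not real publisher URLs.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://publisher.example/articles/101.xml</loc><lastmod>2012-09-01</lastmod></url>
  <url><loc>https://publisher.example/articles/101.pdf</loc><lastmod>2012-09-01</lastmod></url>
  <url><loc>https://publisher.example/articles/102.xml</loc><lastmod>2012-09-03</lastmod></url>
</urlset>"""


def xml_urls_since(sitemap_text, since):
    """List article XML URLs whose <lastmod> date is on or after `since`."""
    root = ET.fromstring(sitemap_text)
    urls = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc.endswith(".xml"):
            continue  # skip PDF/HTML renditions (assumed naming convention)
        if lastmod and date.fromisoformat(lastmod[:10]) < since:
            continue  # already harvested in an earlier run
        urls.append(loc)
    return urls


if __name__ == "__main__":
    print(xml_urls_since(SAMPLE_SITEMAP, date(2012, 9, 2)))
```

A daily harvester would then fetch each returned URL, extract the treatments, and convert them to the target schema.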

Realizing the importance of Wiki environments for the popularization and dissemination of biodiversity data, in April 2011 ZooKeys took another major step towards increased usage and dissemination of taxon treatments. Another tool, the Pensoft Wiki Convertor (PWC), was created to transform TaxPub treatments into MediaWiki files and to upload these to the Wiki treatment repository Species-ID (Penev et al., 2011) (Fig. 5).

Fig. 5. Visualization of a taxon treatment on the Wiki treatment repository Species-ID.
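
The kind of transformation performed by a treatment-to-wiki converter can be sketched as follows. This is not the PWC itself, whose internals are not described here; it is a minimal Python example that turns an invented, simplified treatment into MediaWiki headings and paragraphs. The namespace URI, the sample markup, and the chosen heading levels are assumptions.

```python
import xml.etree.ElementTree as ET

TP_NS = "http://www.plazi.org/taxpub"  # assumed namespace URI for the "tp" prefix

# Hypothetical, simplified treatment used only for this example.
SAMPLE_TREATMENT = """<tp:taxon-treatment xmlns:tp="http://www.plazi.org/taxpub">
  <tp:nomenclature><tp:taxon-name>Aus bus</tp:taxon-name></tp:nomenclature>
  <tp:treatment-sec sec-type="description">
    <title>Description</title>
    <p>Body length 5 mm.</p>
  </tp:treatment-sec>
</tp:taxon-treatment>"""


def treatment_to_wikitext(xml_text):
    """Render a treatment as MediaWiki markup: italic name heading plus sub-sections."""
    ns = {"tp": TP_NS}
    treatment = ET.fromstring(xml_text)
    name_el = treatment.find("tp:nomenclature/tp:taxon-name", ns)
    name = "".join(name_el.itertext()).strip()
    # Two apostrophes on each side produce italics in MediaWiki markup.
    lines = [f"== ''{name}'' =="]
    for sec in treatment.findall("tp:treatment-sec", ns):
        title = (sec.findtext("title") or "Untitled").strip()
        lines.append(f"=== {title} ===")
        for p in sec.findall("p"):
            lines.append("".join(p.itertext()).strip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(treatment_to_wikitext(SAMPLE_TREATMENT))
```

Uploading the resulting text to a wiki would be a separate step and is left out of this sketch.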

Some aggregators collect metadata about objects and link from their platforms to the original source. Such an approach is applied by KeyCentral, a global database of taxonomic keys and other identification resources for living organisms. Because treatments can be delimited within the text, metadata about identification keys are automatically exported to KeyCentral.

Semantically Enhanced Publications

TaxPub XML files are also used to create a semantically enhanced HTML version of the publication. Semantic enhancement of scientific texts can be defined as “anything that enhances the meaning of a published journal article, facilitates its automated discovery, enables its linking to semantically related articles, provides access to data within the article in actionable form, or facilitates integration of data between articles” (Shotton et al., 2009).

TaxPub elements have been exploited to create semantic enhancements to taxonomic texts. The process was described and exemplified in issue 50 of ZooKeys (Penev et al., 2010a,b) and has since become routine practice for Pensoft’s journals.

The most important uses of semantic enhancements implemented through the TaxPub XML files are:

  1. Taxon treatments in a newly published taxonomic revision can be searched and retrieved.
  2. Treatments, taxon names, and citations can be highlighted throughout the papers in different colors so users can easily identify them while reading.
  3. Georeferenced localities can be plotted on Google Maps for individual treatments, or collated for groups of treatments (e.g., for all species in a genus treated in the papers).
  4. Occurrence data can be published as a supplementary KML file and visualized on Google Earth.
  5. Citations in the text are cross-linked with the reference lists; each citation can be viewed as a full text reference by hovering the cursor over it.
  6. Figure citations are cross-linked with the figures themselves; each figure can be viewed just by hovering the cursor over its citation.
  7. Each taxon name published in the paper, independently of its rank, is linked to its dynamic online profile (Pensoft Taxon Profile, PTP), created on the fly. PTP links the taxon name to a number of other biodiversity resources, for example Global Biodiversity Information Facility (GBIF), Encyclopedia of Life (EOL), National Center for Biotechnology Information (NCBI), Biodiversity Heritage Library (BHL), International Plant Name Index (IPNI), Index Fungorum, ZooBank, Tropicos, PLANTS database, Morphbank, Wikipedia, Wikispecies, Yahoo images, and others (Fig. 6).

Fig. 6

Pensoft Taxon Profile (PTP) generated in real time by clicking on a taxon name mentioned in a journal article
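
Items 3 and 4 in the list above rely on georeferenced occurrence data. As a minimal sketch of how such data could be written out as KML for viewing in Google Earth, the Python example below builds one Placemark per occurrence record. The record layout (`taxon`, `lat`, `lon`), the sample values, and the function name are assumptions for illustration; only the KML 2.2 namespace and element names are standard.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"
# Serialize KML elements in the default namespace rather than with an "ns0:" prefix.
ET.register_namespace("", KML_NS)


def occurrences_to_kml(occurrences):
    """Build a KML document with one Placemark per occurrence record."""
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for occ in occurrences:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = occ["taxon"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML expects longitude first, then latitude.
        coords = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coords.text = f'{occ["lon"]},{occ["lat"]}'
    return ET.tostring(kml, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical occurrence records, e.g. gathered from tagged coordinates in treatments.
    records = [
        {"taxon": "Aus bus", "lat": 42.6977, "lon": 23.3219},
        {"taxon": "Aus cus", "lat": -12.5, "lon": 45.25},
    ]
    print(occurrences_to_kml(records))
```

Grouping records by genus before calling the function would give the collated view mentioned in item 3.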

Future Developments

Pensoft Writing Tool (PWT)

The Pensoft Writing Tool (PWT) is for collaborative online article authoring; it provides templates for different kinds of biodiversity articles, with upfront markup, links to external resources, and various options for data publishing.

The tool is designed to solve one of the main difficulties with the implementation of TaxPub, namely the need to mark up texts that have already been structured by the authors in text editors (usually MS Word or OpenOffice). Authors using the PWT have at their disposal a set of pre-defined, yet flexible templates, which they fill in using sophisticated editing software. Taxon treatments are a core element in the PWT. The different types of taxon treatments, such as (re-)descriptions and nomenclatural acts (new synonymies, re-validations of names, designations of type specimens, etc.), are modelled in accordance with the slightly different requirements of the Biological Codes, e.g., for animals, plants, and fungi.

The PWT will serve as a gateway for the Biodiversity Data Journal (BDJ) and, in the future, for other journals. BDJ is the first journal ever to complete the cycle from writing a manuscript, through submission, community peer review and editing, to publication and dissemination within a single, fully XML-based, online collaborative platform, called the Pensoft Journal System (PJS) (Fig. 7). Publication in the PJS environment is intended to be very low-cost, which is largely achieved through properly structured submissions that minimise handling by human editors.

Fig. 7. A generalized workflow of the Pensoft Journal System (PJS), consisting of collaborative article authoring and editing software (Pensoft Writing Tool, PWT) and an online peer-review and editorial manager that allows opting for conventional, community, and public peer review.

In addition, PWT will provide:

  • A collaborative environment for authors to create and work on an online document (manuscript);
  • Invitation of additional contributors (e.g., mentors, potential reviewers, linguistic and copy editors, etc.) to watch, comment on, and edit the text during the writing process;
  • Email and chat communication tools within the group of co-authors and contributors associated with a manuscript;
  • Automated import of data-structured manuscripts generated in various platforms (e.g., Scratchpads, authors’ databases);
  • Revision history, version control, and version comparison;
  • Various modes of data publishing (supplementary files, multimedia, import of data tables, linking to external data repositories, etc.), in accordance with internationally accepted standards (e.g., species occurrence data in Darwin Core);
  • Semantic markup of text and data during the writing process, with no additional effort for the authors;
  • Rich-text editing, smart management of citation and placement for references/figures/tables;
  • Import of references from external bibliographic databases (CrossRef, PubMed and others);
  • Pre-submission validation of the manuscript.

PWT is being developed within the EU-funded project ViBRANT and, at the time of writing this manuscript, is available for beta-testing.


Eco-TaxPub

While taxonomy and nomenclature focus on taxon treatments, a huge amount of related literature is structured differently. For example, floristic and faunistic papers usually describe a region or locality and then list the species that have been encountered there, based on data from the literature and/or newly performed surveys. Species listings may in turn contain information allocated to each species, such as distribution, ecological traits, conservation status, etc.

Ecological papers may have a similar structure, but the focus is usually on habitats or ecosystems instead of regions or localities. The description of a habitat contains most of the geographical details of a locality, but in addition it should also provide information on the ecological features that characterize the habitat, often according to existing habitat classifications, such as EUNIS at the European scale.

The two aforementioned cases could be modeled in the present version of TaxPub through the taxon treatment element, but at the expense of complicated and inefficient markup of the text. To solve the problem we have to turn the focus around, from taxon treatments to “locality-, region- or habitat-treatments”.

“Eco-TaxPub” will be developed as a new extension based on the present TaxPub, containing a new core element at the hierarchical level of the taxon treatment. The element, called “Locality- or Habitat-Treatment”, will model a set of sub-elements, such as locality name, geographical coordinates, habitat name, and several more available in DarwinCore. The species encountered at the locality or in the habitat will be listed in different categories, for example: (1) species proved to be present; (2) wrongly recorded species; (3) species proved to have been present but now extinct, and so on.

The main purpose of Eco-TaxPub is to facilitate the structuring and markup of papers in the huge domain of biodiversity inventorying at different spatial and temporal scales. Good examples are the countless inventories of nature reserves that either remain “hidden” in project reports or, in the best case, are published on separate and isolated websites. Eco-TaxPub will make it possible to publish such inventories in a highly structured format that supports data mining and re-use of the accumulated data.


Conclusions

The application of the TaxPub NLM DTD, the first domain-specific extension of the NLM/JATS DTD, in a rapidly developing professional publishing environment, together with adherence to the principle of Open Access, is removing a great barrier to the knowledge of biodiversity, which is one of the main elements of the Taxonomic Impediment to its conservation. The journal production workflow and dissemination mechanisms built around it remove almost the entire publishing impediment and thus make it easier to disseminate new knowledge with a fast turnover. Linking the semantically enhanced elements, the core elements in the corpus of biodiversity knowledge, to the original sources creates rich documentation of an increasing number of species. Biodiversity conservation thus has, for the first time, the technical tools to popularise what scientists discover and what needs to be protected, immediately after publication and in a highly automated way. In other words, the TaxPub NLM DTD is at the base of a technical revolution that will substantially change the way we study and protect our biodiversity.


Acknowledgements

The work on the present paper was supported in part by the European Union’s Framework Program 7 (FP7) project Pro-iBiosphere - Coordination and policy development in preparation for a European Open Biodiversity Knowledge Management System, addressing Acquisition, Curation, Synthesis, Interoperability and Dissemination.


References

  1. Catapano T. (2010) TaxPub: An Extension of the NLM/NCBI Journal Publishing DTD for Taxonomic Descriptions. Proceedings of the Journal Article Tag Suite Conference 2010 (http://www.ncbi.nlm.nih.gov/books/NBK47081/)
  2. CBD (1992) Convention on Biological Diversity. http://www​
  3. GTI (1998) The Taxonomic Impediment. http://www​
  4. IPBES (2012) Intergovernmental Platform on Biodiversity and Ecosystem Services. http://www​ [PubMed: 22958167]
  5. NLM DTD (2008) NLM Journal Archiving and Interchange Tag Suite. http://dtd​
  6. Penev L, Roberts D, Smith VS, Erwin T. (2010a) Taxonomy shifts up a gear: New publishing tools to accelerate biodiversity research. ZooKeys 50: i-iv. doi: 10.3897/zookeys.50.543.
  7. Penev L, Agosti D, Georgiev T, Catapano T, Miller J, Blagoderov V, Roberts D, Smith VS, Brake I, Rycroft S, Scott B, Johnson NF, Morris RA, Sautter G, Chavan V, Robertson T, Remsen D, Stoev P, Parr C, Knapp S, Kress WJ, Thompson FC, Erwin T. (2010b) Semantic tagging of and semantic enhancements to systematics papers: ZooKeys working examples. ZooKeys 50: 1-16. doi: 10.3897/zookeys.50.538. [PMC free article: PMC3088020] [PubMed: 21594113]
  8. Penev L, Hagedorn G, Mietchen D, Georgiev T, Stoev P, Sautter G, Agosti D, Plank A, Balke M, Hendrich L, Erwin T. (2011) Interlinking journal and wiki publications through joint citation: Working examples from ZooKeys and Plazi on Species-ID. ZooKeys 90: 1-12. doi: 10​.3897/zookeys.90.1369. [PMC free article: PMC3084489] [PubMed: 21594104]
  9. Polaszek A, et al. (2005) A universal register for animal names. Nature 437, 477. doi: 10​.1038/437477a. [PubMed: 16177765]
  10. Shotton D, Portwin K, Klyne G, Miles A. (2009) Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article. PLoS Comput Biol 5(4): e1000361. doi: 10​.1371/journal.pcbi.1000361. [PMC free article: PMC2663789] [PubMed: 19381256]
Lyubomir Penev, Terence Catapano, Donat Agosti, Teodor Georgiev, Guido Sautter, Pavel Stoev.

This is an open access article distributed under the terms of the Creative Commons Attribution License 3.0 (CC-BY), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Bookshelf ID: NBK100351

