Background
The development of TaxPub is an outgrowth of an earlier effort to digitize the
taxonomic literature of ants for purposes of developing data mining techniques
for the extraction of species data from taxonomic literature. The work was
originally performed as part of a joint U.S. National Science Foundation (NSF) and
Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grant awarded to
the American Museum of Natural History (AMNH) and the University of Magdeburg
(later to Karlsruher Institut für Technologie/Karlsruhe Institute of Technology).
Development of TaxonX, an XML Schema for markup of treatments, had begun at
AMNH prior to the NSF/DFG grant and continued through its duration. As the
project was concluding, participants established Plazi Verein, a Switzerland-based
independent not-for-profit organization aiming to help remove
technological, social, and legal barriers to the creation of and access to
taxonomic literature. Among its many activities, Plazi maintains the TaxonX
schema and a repository of XML-encoded publications, develops the semi-automatic
markup tool GoldenGATE [Sautter et al., 2007], and
strenuously advocates for open access to scientific literature [Agosti and Egloff, 2009]. As part of these efforts, Plazi has encoded
approximately 500 publications containing roughly 11,000 treatments using the TaxonX
schema [Sautter et al., 2009]. This experience greatly
informed both the rationale and design of the TaxPub extension.
Rationale
It is estimated that the majority of species on earth have yet to be described* and that each year some 15,000-20,000 new species are described [Polaszek et al., 2005]. Yet many efforts to
digitize taxonomic literature, including Plazi's, have predominantly focused on
the minority of species already described. It is time-consuming and costly to
convert the legacy literature to XML. Challenges arise at the most basic
level, that of accurately transcribing source texts. OCR may yield
good results for clean copies of modern documents, but accuracy suffers for older
publications. Costs are then incurred either in correcting OCR errors or
in double-keying, the latter scaling poorly for massive digitization
efforts. Even with 100% accurate texts, encoding remains a challenge.
Owing particularly to variant editorial practices, a wide range of styles and text
structures is present in the existing literature. TaxonX, for example, became a
very loose schema in order to accommodate the variation. The TaxonX treatment
element was eventually made available almost everywhere in the schema after
treatments were encountered in a variety of locations in source documents, even
in footnotes. The laxity of the schema, however, confers little benefit on the
processing of valid instances and makes the schema difficult and expensive to program
against. In addition to the problems of stylistic and formal variability,
encoding information implicit in untagged text is a major task.
For example, interpreting and expanding abbreviations or parsing the components of a
bibliographic reference, a scientific name, or a geographic reference can be
time-consuming and prone to error. **
Given the complexity and difficulty of digitizing existing taxonomic literature,
and given that it covers a minority of all species, greater benefit at less cost might
be found in the encoding of new, born-digital taxonomic literature.
Increasingly, treatments are derived from data maintained in databases, whether
for names, specimens, or bibliographic references. This information could be
exported into XML directly, saving an enormous amount of time and ensuring
accuracy. The idea of generating publishable natural-language treatments from
databases arose in the early 1970s and was unambiguously in place by 1980 [Dallwitz, 1980]. The rise of XML has
provided more tools to produce and exploit structured treatments, but often
these tools are used backwards, with experts' time spent adding markup to
already published literature. Indeed, for recently published literature,
information that originates in parsed form in citation managers and databases
is converted by an author to unstructured text for publication, only to be
parsed out once again during the markup process.
Consensus on an XML schema often fosters development of tools, services, and
applications utilizing suitably encoded data. TaxPub is an attempt to catalyze
this process in the hope that the community will find it intriguing and
useful enough to adopt and sustain.
Design and Development
In the second half of 2008, with the assistance of Jeff Beck and Laura Kelly of
NCBI, Plazi developed the first draft of the extension now called TaxPub. Since
then, development has been assisted by Donat Agosti, an ant systematist,
President of Plazi and research scientist at the American Museum of Natural
History, and by Robert Morris, Emeritus Professor of Computer Science at
University of Massachusetts at Boston and an Information Technology Associate of
the Harvard University Herbaria. The project is hosted on SourceForge
(http://sourceforge.net/projects/taxpub/), with the first release in
December 2008.
The first version release of TaxPub is scheduled for March 2011. A call for
comments will be sent in December 2010 soliciting feedback and requests for new
features. Subsequent releases will be backwards compatible until the next
version release.
Rather than adapting TaxonX for publishing applications, it was more efficient to
extend the NLM/NCBI DTD. The Journal Publishing DTD already included elements for
document features, so it was necessary only to add elements and attributes
relevant to taxonomic descriptions. TaxPub extends the Journal Publishing (“Blue”) DTD
parsimoniously.
To better distinguish TaxPub elements from those of the base
DTD, elements from the extension have been put into their own namespace, with
element names starting with the prefix "tp:". A few phrase-level elements are
made available at relevant places throughout the DTD. There are elements
for scientific names, <tp:taxon-name>, citations of specimens and other
materials, <tp:material-citation>, and descriptions of organisms’ physical
characteristics, <tp:descriptive-statement>.
The <tp:taxon-name> and
<tp:descriptive-statement> elements have simple content models, each
allowing any number of optional “part” elements for tagging the
element's components. Required “-part-type” attributes provide further
semantics. Because the field of biodiversity has many published vocabularies,
URIs are available for many concepts and entities of interest. The addition of
“-type-uri” attributes to all TaxPub elements with “-type” attributes is under
consideration so that, if available, semantics may be provided through use of a
URI as a value instead of, or in addition to, a string value.
Of course an
additional attribute is not strictly necessary as users may already use URIs in
the existing “-type” attributes. We encourage that usage.
Additionally, as in
many TaxPub elements, the <object-id> element from the base DTD is
available, again with the intention of allowing semantic enhancement through
linkage to standard identifiers. <tp:taxon-name> also has additional
special attributes: “auth-code” to report the
nomenclatural code to which the tagged name conforms; “rank” to
explicitly indicate the taxonomic rank (e.g., genus or species) of the
named taxon; and “reg” (shared by <tp:taxon-name-part>) to
contain a regularized form of the element's contents.
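As a rough illustration of how the “reg”, “rank”, and “auth-code” attributes might be consumed, the following Python sketch parses a small hand-written fragment with the standard library. The namespace URI, the attribute name “taxon-name-part-type”, the attribute values, and the placeholder name “Aus bus” are assumptions made for this example and are not taken from the DTD itself.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "tp:" prefix; the text above names only the prefix.
TP_NS = "http://www.plazi.org/taxpub"
NS = {"tp": TP_NS}

# Hand-written fragment; names and attribute values are illustrative placeholders.
SAMPLE = f"""
<p xmlns:tp="{TP_NS}">Workers of
  <tp:taxon-name rank="species" auth-code="ICZN">
    <tp:taxon-name-part taxon-name-part-type="genus" reg="Aus">A.</tp:taxon-name-part>
    <tp:taxon-name-part taxon-name-part-type="species">bus</tp:taxon-name-part>
  </tp:taxon-name> were examined.</p>
"""

def regularized_name(taxon_name):
    """Join the parts of a tp:taxon-name, preferring 'reg' over the printed text."""
    parts = taxon_name.findall("tp:taxon-name-part", NS)
    return " ".join(p.get("reg") or (p.text or "").strip() for p in parts)

root = ET.fromstring(SAMPLE)
name = root.find("tp:taxon-name", NS)
print(name.get("rank"), regularized_name(name))  # species Aus bus
```

Here the abbreviated genus “A.” is expanded through its “reg” value, while the species epithet falls back to the printed text because no regularized form was supplied.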
The other element available throughout the DTD, <tp:material-citation>, has
a richer content model. Like bibliographic citations, specimen citations can be
complex, with many pieces of information. To accommodate granular encoding,
<tp:material-citation> allows #PCDATA, the Publishing DTD elements
<named-content>, <xref>, and <object-id>, and TaxPub elements
<tp:taxon-name>, <tp:material-location> for information on the
institution currently housing the referenced material, and
<tp:collecting-event> for information on where, when, and by whom the
specimen was found. The <tp:collecting-event> element has a number of
sub-elements: the base DTD elements <named-content>, <object-id>, and <date>,
and the extension elements <tp:taxon-name> and <tp:collecting-location>.
<tp:collecting-location> itself permits zero or more <object-id> elements,
an optional <comment> element, and zero or more <tp:location>
elements, each with a “location-type” attribute to specify whether the tagged location
is a country, city, province, etc.
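The nesting described above can be pictured with a short Python sketch that reads the locations out of a single material citation. The namespace URI, the order of the child elements, the “location-type” values, and all specimen data are invented for illustration; only the element and attribute names come from the description above.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "tp:" prefix.
TP_NS = "http://www.plazi.org/taxpub"
NS = {"tp": TP_NS}

# Invented specimen data; child order within each element is an assumption.
SAMPLE = f"""
<tp:material-citation xmlns:tp="{TP_NS}">1 worker,
  <tp:collecting-event>
    <tp:collecting-location>
      <tp:location location-type="country">Switzerland</tp:location>
      <tp:location location-type="city">Bern</tp:location>
    </tp:collecting-location>
    <date><day>12</day><month>3</month><year>2005</year></date>
  </tp:collecting-event>
  <tp:material-location>AMNH</tp:material-location>
</tp:material-citation>
"""

def locations(citation):
    """Map each location-type to its text for every tp:location in the citation."""
    path = "tp:collecting-event/tp:collecting-location/tp:location"
    return {loc.get("location-type"): (loc.text or "").strip()
            for loc in citation.findall(path, NS)}

citation = ET.fromstring(SAMPLE)
year = citation.findtext("tp:collecting-event/date/year", namespaces=NS)
holder = citation.findtext("tp:material-location", namespaces=NS).strip()
print(locations(citation), year, holder)
```

Because each location carries its own “location-type”, a consumer can select the country or city directly rather than parsing a free-text locality string.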
Most of the extension occurs in a single section-level element
<tp:taxon-treatment>, available in the body of an NLM document. The
<tp:taxon-treatment> element contains elements for metadata about the
treatment itself, <tp:treatment-meta>, and its component sub-sections: a
required <tp:nomenclature> section and zero or more
<tp:treatment-sec> elements. Originally, two other named treatment
sections were included in the extension, <tp:description> and
<tp:materials-examined>, but as their content models did not differ from
that of <tp:treatment-sec>, they were removed. A “treatment-sec-type”
attribute is available to provide specific semantics for <tp:treatment-sec>,
but aside from the inclusion of the other TaxPub elements available throughout
the DTD, the content model of <tp:treatment-sec> is essentially the same as that of a generic
section.
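A minimal Python sketch of this structure follows: it checks that a treatment carries the required nomenclature section and lists the types of its remaining sections. The namespace URI and the “treatment-sec-type” values are assumptions for the example, the metadata element is omitted for brevity, and the check is a simple stand-in for DTD validation rather than a replacement for it.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "tp:" prefix.
TP_NS = "http://www.plazi.org/taxpub"
NS = {"tp": TP_NS}

# Illustrative treatment; the section-type values are hypothetical.
SAMPLE = f"""
<tp:taxon-treatment xmlns:tp="{TP_NS}">
  <tp:nomenclature>
    <tp:taxon-name>Aus bus</tp:taxon-name>
  </tp:nomenclature>
  <tp:treatment-sec treatment-sec-type="description">
    <title>Description</title>
    <p>Placeholder text.</p>
  </tp:treatment-sec>
  <tp:treatment-sec treatment-sec-type="materials_examined">
    <title>Materials examined</title>
    <p>Placeholder text.</p>
  </tp:treatment-sec>
</tp:taxon-treatment>
"""

def section_types(treatment):
    """Return the treatment-sec-type of each section, insisting on a nomenclature child."""
    if treatment.find("tp:nomenclature", NS) is None:
        raise ValueError("tp:taxon-treatment requires a tp:nomenclature child")
    return [sec.get("treatment-sec-type")
            for sec in treatment.findall("tp:treatment-sec", NS)]

treatment = ET.fromstring(SAMPLE)
print(section_types(treatment))  # ['description', 'materials_examined']
```

Since description and materials-examined sections share one content model, the attribute value is the only thing a consumer needs to tell them apart.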
The only required element in the TaxPub extension is <tp:nomenclature>. Its
content model is more complicated than other extension elements because it must
model and conform to the very formal structure required by the aforementioned
nomenclatural codes. <tp:nomenclature> must contain a
<tp:taxon-name>, which holds the name of the organism described by
the treatment. A <tp:taxon-status> element indicates that a taxon is a new
species or genus. A <tp:taxon-authority> element may be
used for a “brief bibliographic reference to the original publication of the
[taxon] name” [Winston, 130] required by
nomenclatural codes and typically in the form of an author’s last name followed
by the year of publication. For more granular markup,
<tp:taxon-authority-part> elements with “tp:taxon-authority-part-type”
attributes are available.
The codes address other complexities of citations
(e.g., multiple authors or a species being moved to a different genus since the
original publication), but the current <tp:taxon-authority> model
ought to be sufficient. The citation of taxon authorship is
frequently followed by a series of citations “of all the names that have been used in
published references to [the described] taxon” [Winston, 136]. TaxPub provides a
<tp:nomenclature-citation-list> element to group
<tp:nomenclature-citation> elements for these citations. The citations may
consist of several parts. First is a reference to a name, consisting of a
required <tp:taxon-name>, followed by zero or more <tp:taxon-author>
elements. Next is a bibliographic reference to the publication in which the
taxon was named, for which <mixed-citation> (for an inline citation) or
<xref> (for a link to an entry in a reference list) may be used. A
reference to specimens may be present, for which <tp:material-citation> is
available. Other information may be included in an optional <comment>
element. As it models perhaps the most complex, least standardized component of
taxonomic descriptions, <tp:nomenclature-citation> will no doubt be
subject to further review and criticism, and will likely be revised frequently
until a stable element definition is achieved.
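To make the shape of a nomenclature section more concrete, the Python sketch below walks a citation list and pairs each cited name with its bibliographic source, whether that source is given inline or as a cross-reference. The namespace URI, the order of the children of <tp:nomenclature>, the <xref> attribute values, and the placeholder names, authors, and dates are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "tp:" prefix.
TP_NS = "http://www.plazi.org/taxpub"
NS = {"tp": TP_NS}

# Placeholder names, authors, and dates; child order is an assumption.
SAMPLE = f"""
<tp:nomenclature xmlns:tp="{TP_NS}">
  <tp:taxon-name>Aus bus</tp:taxon-name>
  <tp:taxon-authority>Smith, 1900</tp:taxon-authority>
  <tp:nomenclature-citation-list>
    <tp:nomenclature-citation>
      <tp:taxon-name>Xus bus</tp:taxon-name>
      <tp:taxon-author>Smith</tp:taxon-author>
      <mixed-citation>Smith 1900: 12</mixed-citation>
    </tp:nomenclature-citation>
    <tp:nomenclature-citation>
      <tp:taxon-name>Aus bus</tp:taxon-name>
      <xref ref-type="bibr" rid="B2">Jones 1950</xref>
    </tp:nomenclature-citation>
  </tp:nomenclature-citation-list>
</tp:nomenclature>
"""

def cited_names(nomenclature):
    """Pair each cited name with the text of its inline citation or cross-reference."""
    path = "tp:nomenclature-citation-list/tp:nomenclature-citation"
    pairs = []
    for cit in nomenclature.findall(path, NS):
        name = cit.findtext("tp:taxon-name", default="", namespaces=NS).strip()
        ref = cit.find("mixed-citation")
        if ref is None:  # fall back to a link into the reference list
            ref = cit.find("xref")
        source = "".join(ref.itertext()).strip() if ref is not None else None
        pairs.append((name, source))
    return pairs

nomenclature = ET.fromstring(SAMPLE)
print(cited_names(nomenclature))
```

The fallback from <mixed-citation> to <xref> mirrors the two citation styles described above; a fuller consumer would also resolve the “rid” target in the reference list.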