Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.

NLM Journal Publishing DTD Flexibility: How and Why Applications of the NLM DTD Vary Based on Publisher-Specific Requirements

The NLM Journal Publishing DTD was designed for full-text encoding of journal articles for current publication. More restrictive than the NLM Archiving and Interchange DTD, the Journal Publishing DTD nevertheless allows for wide latitude in the application of many elements and attributes. The interpretation of the DTD in a given context may be based on forethought and specific business requirements or may be somewhat arbitrary, depending on the experience level of the DTD user. On the basis of a review of more than 20 implementations of the DTD, this paper will discuss various interpretations chosen by a range of publishers as well as the business or technical requirements that led to those decisions. The implications, pro and con, of this flexibility will be examined. The paper concludes with the suggestion that this flexibility is one factor that has led to wide adoption of the NLM DTD Suite.

Introduction

The NLM DTD Suite was first released in March 2003 [1] and included two versions, the Archiving and Interchange DTD and the Journal Publishing DTD. The original development goals of the project were to provide a DTD to meet the growing needs of PubMed Central [2,3] and to meet the discipline-independent needs of an eJournal archive that would be designed by the Harvard University Library [4] and funded by the Mellon Foundation [5].

A primary consideration of the Mellon-funded archive, which later was set up as Portico, was ease of interchange from other DTDs used by scholarly publishers. To facilitate such conversions to the Archiving and Interchange DTD, the Archiving and Interchange Tag Set was designed to permit a wide range of tagging styles, as reported in the E-Journal Archive DTD Feasibility Study [6].

Routine production use by journal publishers was anticipated during the creation of the NLM DTD Suite, but it was not the primary consideration. In the almost eight years since its initial release, the Journal Publishing DTD has become the de facto standard for journal publishers to mark up XML versions of their articles [7]. Part of the reason behind the broad adoption of the NLM DTD Suite has been its flexibility, which allows for wide latitude in the application of many elements and attributes. This paper will discuss some variations in the application of the DTD by different publishers and some of the reasons behind the choices they have made.

Terminology

This paper will use the terms NLM DTD Tag Suite (NLM DTD) and Journal Article Tag Suite (JATS) to describe the complete family of DTDs available from NLM at http://dtd.nlm.nih.gov/. Preference is given to the term “NLM DTD” because the NCBI Book Tag Set is also discussed in this paper. The term “Tag Suite” will always refer to the entire family of DTDs.

Specific tag sets will be referenced within this paper by these names:

Table 1

Tag Set | Location | Nickname | Color
Archiving and Interchange Tag Set | http://dtd.nlm.nih.gov/archiving/ | Archive | Green
Journal Publishing Tag Set | http://dtd.nlm.nih.gov/publishing/ | Publishing | Blue
Article Authoring Tag Set | http://dtd.nlm.nih.gov/articleauthoring/ | Authoring | Pumpkin
NCBI Book Tag Set | http://dtd.nlm.nih.gov/book/ | Book | (none)

This paper will generally use the full Tag Set name, but may occasionally use the Nickname or reference a DTD by its Color.

The term “Tag Set” will always refer to a specific version within the Tag Suite.

Methodology

The information in this paper was drawn from more than twenty implementations of the NLM DTD developed by Inera Incorporated within the eXtyles and eXtyles refXpress products since 2003 [8]. As such, it does not serve as a scientific survey of the various applications of the NLM DTD but rather provides an overview of how a range of publishers have incorporated the NLM DTD into their workflows and highlights some of the decisions made by those publishers about the application of the NLM DTD.

Results

Publishers

Table 2 shows a matrix of publishers that have implemented the NLM DTD within Inera’s eXtyles products. Rather than showing a common implementation for all publishers, Table 2 shows a fair degree of variability from one implementation to the next. This is a reflection of the differing requirements of each publisher and how various publishers take advantage of the Tag Suite’s flexibility.

Table 2. Implementations of the NLM DTD within Inera’s eXtyles products for markup of full text content in journals, books, and newsletters.

Suppliers

Table 3 shows a matrix of suppliers that have implemented the NLM DTD within Inera’s eXtyles products. Suppliers, in this case, are defined as organizations that provide typesetting and other services to multiple publishers but are not publishers themselves. Table 3 includes a subset of the columns of Table 2 because suppliers are outsource providers, by definition, and the online hosting will be done by a variety of sources.

Table 3. Implementations of the NLM DTD within Inera’s eXtyles products for markup of full-text content in journals and books by service providers.

Year of Adoption

Although the Journal Article Tag Suite first became available in 2003, most implementations went into production in 2006 or later. We attribute the surge of implementations starting in 2006 to a number of factors including a) the growing maturity of the Tag Suite with the 1.1 and 2.0 revisions, which added a number of features that facilitated use of the Tag Suite in journal publication; b) greater public awareness of the DTD’s availability and characteristics among key decision makers at scholarly publishing organizations; and c) wider availability of customizable off-the-shelf tools that could be used with the NLM DTD.

The majority of publishers in Table 2 who adopted the NLM DTD over the period from 2003 to 2010 had not previously worked with any full-text tagging. These publishers who were new to full-text XML started XML workflows for a variety of reasons, including a) PubMed Central deposit or other open access publishing mandates and b) a desire to have an XML workflow to meet current publishing needs and future archive requirements. For the publishers who were new to XML, the NLM DTD Suite, which was freely available and modifiable without royalties or copyright issues, provided a good starting point that could accelerate development and deployment of a new workflow and lower the overall costs of creating an XML workflow.

DTD Selection

The majority of journal publishers have elected to use the Journal Publishing Tag Set. However, some journal publishers have opted to use the Archiving and Interchange Tag Set. Key reasons for use of the Archiving and Interchange Tag Set include:

  • In the early years (2003 through 2005), many publishers found the Journal Publishing Tag Set too restrictive to cover all their needs. Rather than develop custom versions of the Journal Publishing Tag Set, these publishers opted to use an unmodified version of the Archiving and Interchange Tag Set.*
  • The Journal Publishing Tag Set requires an ISSN element in each document instance. Some publishers (notably Publisher 4) have used the DTD for both serial and non-serial content. Rather than create a customized version of the Journal Publishing Tag Set, they opted to use a single uncustomized version of the Archiving and Interchange Tag Set for all publications.
  • Publisher 1 recently upgraded from version 1.0 to version 3.0. Because they had started with a customized version of the 1.0 Archiving and Interchange Tag Set, it was easier to move those customizations to the 3.0 Archiving and Interchange Tag Set than to the Journal Publishing Tag Set.

In recent years, fewer publishers have opted to use the Archiving and Interchange Tag Set because the Journal Publishing Tag Set met all of their requirements without modification.
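
The ISSN requirement mentioned above is the kind of constraint a publisher with mixed serial and non-serial content could test for before choosing a tag set. The following is a minimal sketch of such a pre-check, not a substitute for DTD validation. The function name has_issn, the sample fragments, and the ISSN value are hypothetical, and the fragments are deliberately abbreviated, so they are not valid instances of any tag set.

```python
import xml.etree.ElementTree as ET


def has_issn(article_xml):
    """Return True if <journal-meta> holds at least one non-empty <issn>."""
    root = ET.fromstring(article_xml)
    return any(
        (issn.text or "").strip()
        for issn in root.iterfind("front/journal-meta/issn")
    )


# Abbreviated, hypothetical fragments; the ISSN is a placeholder value.
SERIAL = (
    '<article><front><journal-meta>'
    '<issn pub-type="ppub">1234-5678</issn>'
    '</journal-meta></front></article>'
)
NON_SERIAL = "<article><front><journal-meta/></front></article>"

serial_ok = has_issn(SERIAL)          # serial content carries an ISSN
non_serial_ok = has_issn(NON_SERIAL)  # non-serial content has none to supply
```

Content that fails a check like this is content for which a single uncustomized Archiving and Interchange Tag Set may be the simpler choice, as it was for Publisher 4.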

The past three years have seen growth in use of the NCBI Book Tag Set. Though it is not as mature as the Journal Tag Sets,** it is nevertheless quite functional for some books, and it has become a logical choice for publishers with both journal and book publishing programs where a common tag set permits easier repurposing of content from journals into books.

Implementation Characteristics

Whether a publisher opts for the Archiving and Interchange, Journal Publishing, or NCBI Book Tag Set, there are a number of choices that publishers can make in how to apply the selected tag set that result in different, but valid, XML.

Special Character Encoding

XML allows several forms of special character encoding. Regardless of the value of the encoding attribute in the XML declaration, special characters can be represented as either Unicode entities (e.g., “&#x03B2;”) or ISO entities (e.g., “&beta;”). Special characters can also be represented in the native encoding (e.g., UTF-8), though native encoding has not been used in any of the implementations shown in Tables 2 and 3. Each of these representations has advantages and disadvantages:

  • UTF-8 is the most compact encoding and is fully compatible with modern web browsers (which avoids extra transforms for conversion to HTML for the web), but is not “human readable” when XML files are viewed in a text editor. UTF-8 encoding also means that a file is binary rather than text format, which can make it more difficult to use standard text differencing applications as part of quality assurance.
  • Unicode entities are fully compatible with modern web browsers and permit the file to be text format rather than binary. However, like UTF-8, Unicode entities are not “human readable” when XML files are viewed in a text editor.
  • ISO entities are “human-readable” in a text editor (relatively speaking). However, they are not compatible with all browsers.

Most of the implementations completed by Inera use Unicode entities. Interestingly, most of the users of ISO entities started using the NLM DTD in earlier years, although we are not aware of any specific reason for the shift toward Unicode entities in later years.
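
To make the three representations concrete, the sketch below produces each form for the Greek letter beta using only the Python standard library. It is an illustration, not a description of how eXtyles generates its output. Python's html.entities table is used here as a stand-in for the ISO entity sets, which is an assumption: the two overlap for common names such as beta, but they are not identical.

```python
import html.entities

beta = "\u03b2"  # GREEK SMALL LETTER BETA

# Native encoding: the character is stored directly as two UTF-8 bytes.
utf8_bytes = beta.encode("utf-8")

# Unicode entity (numeric character reference), decimal form. The
# "xmlcharrefreplace" handler keeps the whole file within ASCII.
ncr_decimal = beta.encode("ascii", "xmlcharrefreplace").decode("ascii")

# The same reference in the hexadecimal form used in the text above.
ncr_hex = "&#x{:04X};".format(ord(beta))

# Named entity, looked up by code point (stand-in for the ISO sets).
named = "&{};".format(html.entities.codepoint2name[ord(beta)])
```

Only the named form tells a human reader which character is meant, and only the named form depends on entity declarations being available to whatever processes the file.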

Table Tagging

XHTML is the default model for tagging tables in the NLM DTD. However, the OASIS CALS model is also supported and can be added to the standard tag set releases by changing only about six lines in the selected DTD.

If the CALS model requires modifying the DTD, why do some users of the NLM DTD prefer it to the XHTML model? There are several reasons:

  • The CALS model supports features not supported in XHTML, specifically tagging of table and cell border information and table groups where the number of columns changes from one part of a table to another.
  • Adobe InDesign (CS3 and later) includes native support to import and export CALS tables but not XHTML tables. The same is true of FrameMaker.
  • Most users of the 3B2 composition system appear to prefer the CALS table model to XHTML.

CALS tables, of course, have to be converted to XHTML for web rendering.
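
The conversion step can be sketched in a few lines. The example below is a deliberately minimal illustration that assumes a single tgroup with plain-text cells; it ignores column specifications, spans, and the border attributes that make CALS attractive in the first place. The function name cals_to_xhtml and the sample table are hypothetical.

```python
import xml.etree.ElementTree as ET

# A small CALS table: <tgroup> wraps <thead>/<tbody>, rows are <row>,
# and cells are <entry>.
CALS = """<table frame="all">
  <tgroup cols="2">
    <thead><row><entry>Tag Set</entry><entry>Nickname</entry></row></thead>
    <tbody><row><entry>Journal Publishing</entry><entry>Publishing</entry></row></tbody>
  </tgroup>
</table>"""


def cals_to_xhtml(cals_table):
    """Map a simple CALS table element onto XHTML-style table markup."""
    xhtml = ET.Element("table")
    tgroup = cals_table.find("tgroup")  # assumption: exactly one tgroup
    for section in ("thead", "tbody"):
        cals_section = tgroup.find(section)
        if cals_section is None:
            continue
        out_section = ET.SubElement(xhtml, section)
        cell_tag = "th" if section == "thead" else "td"
        for row in cals_section.findall("row"):
            tr = ET.SubElement(out_section, "tr")
            for entry in row.findall("entry"):
                # Plain-text cells only; inline markup would need copying.
                ET.SubElement(tr, cell_tag).text = entry.text
    return xhtml


xhtml_string = ET.tostring(cals_to_xhtml(ET.fromstring(CALS)), encoding="unicode")
```

A production conversion would also have to translate CALS spans (namest, nameend, morerows) into colspan and rowspan, and decide how border information maps to CSS.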

Though a key goal of XML is to have content tagged such that it is independent of the rendering application, it appears that many publishers have opted for CALS tables to allow for simpler and/or more flexible PDF creation through traditional composition applications that have internal biases towards the CALS table model.

Math Handling

The NLM DTD permits math to be tagged in a variety of ways, including MathML, TeX, and inclusion of graphic files rather than tagged math.

MathML has the advantage that it is native XML and can be used to render math in a variety of environments. However, native browser support is limited, with good support in Firefox, limited support in Safari, and no support in Microsoft Internet Explorer [9].

Because of limited browser support for MathML, many publishers, especially those that have only infrequent display equations, have opted to handle all display math as images. When math is infrequent, graphics are certainly the path of least resistance, as a single format that will work for print/PDF composition and web delivery.

Even for math-intensive publishers, the selection of a composition engine sometimes drives the selection of a math model. For example, virtually all publishers that use InDesign handle math as graphics because InDesign does not have native support to render MathML.

One supplier that uses InDesign has opted to include both graphics and MathML for all equations, using graphics for InDesign composition, and MathML for customer deliveries of final XML. This combination can also aid in delivery requirements for some publishers. For example, Elsevier requires all display math in both MathML and graphic format [10].

Those organizations that use MathML tend to typeset with applications such as 3B2, AntennaHouse, or FrameMaker, or they create PDF files from Word and do not typeset from the MathML.

A few organizations that use 3B2 prefer TeX instead of MathML. This may be because TeX is the native rendering system for math in 3B2, so TeX markup avoids an extra conversion. However, TeX must be marked as CDATA within <tex-math>.
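
The practical reason for the CDATA wrapper is that TeX relies on characters, the ampersand in particular, that are markup-significant in XML. The sketch below shows that a TeX array parses cleanly when wrapped in CDATA and fails to parse when inserted bare. The TeX string and the abbreviated disp-formula fragment are illustrative only.

```python
import xml.etree.ElementTree as ET

# TeX uses "&" as a column separator, which XML reserves for entities.
tex = r"\begin{array}{cc} a & b \\ c & d \end{array}"

# Wrapped in a CDATA section, the TeX passes through the parser untouched.
fragment = (
    "<disp-formula><tex-math><![CDATA["
    + tex
    + "]]></tex-math></disp-formula>"
)
recovered = ET.fromstring(fragment).find("tex-math").text

# Without CDATA (or escaping), the bare "&" makes the file not well-formed.
try:
    ET.fromstring("<tex-math>" + tex + "</tex-math>")
    bare_tex_parses = True
except ET.ParseError:
    bare_tex_parses = False
```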

So, as with tables, the selection of a math model appears to be driven largely by the requirements of specific composition applications.

Generated and Boilerplate Text

We use the term Generated Text to mean inconsequential, formulaic, or stereotypical text, punctuation, and formatting omitted from an XML file, which is applied to content by a style sheet when an XML file is rendered. The style sheet generates this text and visual formatting based on the structural information provided by the markup elements and attributes.

We use the term Boilerplate Text for the opposite scenario, i.e., inconsequential, formulaic, or stereotypical text, punctuation, and formatting that could have been omitted but which the publisher has chosen to keep in the XML file rather than to generate with a style sheet.

SGML and XML have always been about structure rather than formatting. Steve DeRose commented, “Strong separation of formatting from structure is the hallmark of good SGML use” [11], and many people followed this reasoning by keeping any such formatting out of their tagged content. However, others decided to rely less on style sheets and more on boilerplate text.***

The NLM DTD is flexible and permits users to work with Generated or Boilerplate Text. The degree to which this is allowed varies from one tag set to the next, with the Archiving and Interchange Tag Set allowing the greatest degree of Boilerplate Text, especially when using the <x> element, which is not available in the Journal Publishing Tag Set.

Flexibility around the use of Generated versus Boilerplate Text may well be one reason the NLM DTD has been so widely adopted. As we will see in the next two subsections, there is wide variation in how publishers have chosen to approach this issue.

Reference Tagging and PCDATA

The NLM DTD has had several models for tagging references. Versions 1.0 through 2.3 had the <citation> and <nlm-citation> elements, where the former allowed tags in any order and permitted Parsed Character Data (PCDATA) such as punctuation and text (e.g., “pp.” before page ranges) between elements, and the latter had a prescribed element order and did not permit PCDATA.

Few were happy with this model, in part because there was not a way to have elements in any order while restricting the use of PCDATA unless the DTD was modified for local use. Version 3 dealt with this matter by eliminating <citation>, deprecating <nlm-citation>, and adding two new elements, <mixed-citation> and <element-citation>, that better addressed the needs of users.

Three-fourths of the users shown in Table 2 and Table 3 have opted to keep PCDATA in references, including all of the suppliers, using <citation> or <mixed-citation>, while only one quarter of users drop the PCDATA.
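
The difference between the two approaches is easiest to see by rendering the same reference both ways. In the sketch below, the <mixed-citation> carries its punctuation as Boilerplate Text, so rendering is a matter of reading the text back out; the <element-citation> carries none, so the renderer must supply the punctuation as Generated Text. The rendering functions and the editorial style they implement are hypothetical, and attributes such as publication-type are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Boilerplate Text: punctuation and spacing are stored in the XML.
MIXED = (
    "<mixed-citation><string-name><surname>DeRose</surname> "
    "<given-names>S</given-names></string-name>. "
    "<source>The SGML FAQ Book</source>. <year>1997</year>.</mixed-citation>"
)

# Generated Text: only the tagged data is stored; no PCDATA between elements.
ELEMENT = (
    "<element-citation><person-group><name><surname>DeRose</surname>"
    "<given-names>S</given-names></name></person-group>"
    "<source>The SGML FAQ Book</source><year>1997</year></element-citation>"
)


def render_mixed(citation):
    """Rendering is trivial: concatenate the text exactly as tagged."""
    return "".join(citation.itertext())


def render_element(citation):
    """The 'style sheet' supplies order, spacing, and punctuation itself."""
    surname = citation.findtext(".//surname")
    given = citation.findtext(".//given-names")
    source = citation.findtext("source")
    year = citation.findtext("year")
    return "{} {}. {}. {}.".format(surname, given, source, year)


mixed_text = render_mixed(ET.fromstring(MIXED))
element_text = render_element(ET.fromstring(ELEMENT))
```

Both calls yield the same string here, but only render_element has to be rewritten for each editorial style, and it has to be written once per output format.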

For suppliers, this is a logical choice because they typically service multiple publishers, each with their own reference editorial style. By keeping the PCDATA and element order intact in the XML, less template development work is necessary in their composition systems.

For publishers, the choice to retain or drop PCDATA could go either way. However, it is possible that because many of the publishers in Table 2 do both composition and online hosting in-house, they may have decided that it’s easier to keep the PCDATA than develop two different rendering templates, one for PDF creation and one for online presentation. More research would be necessary to determine if this is a reason that publishers have opted to retain PCDATA in references.

List Labels

The NLM DTD uses the list-type attribute to encode whether a list is bulleted, ordered (Arabic numbered), alphabetic, or Roman numbered. In most applications this value, combined with a style sheet, should permit appropriate rendering of list item labels. However, almost half of the publishers in Table 2 keep the content of the list labels in a <label> element at the start of each list-item.

One place where keeping a <label> element is helpful is when using the NCBI Book Tag Set. Occasionally, books (at least more frequently than journals) will have discontinuous numbered lists — e.g., a list with items 1 through 4, several paragraphs of text that are not part of the list, and then a continuation with items 5 through 7. In this situation, where the second list starts with item 5, a simple ordered attribute is insufficient to correctly present the list.

In other cases, publishers have opted to keep the <label> element, regardless of the DTD used, to make the style sheet simpler for print. If the label is included, no style information need be set up.
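
The trade-off can be sketched as a labeling routine that prefers a tagged <label> and falls back to generating one from the list-type attribute. The function name item_labels and the fallback rules are assumptions made for illustration; only the "order" value and a bullet default are handled.

```python
import xml.etree.ElementTree as ET


def item_labels(list_elem):
    """Return the label for each list-item, tagged or generated."""
    labels = []
    items = list_elem.findall("list-item")
    for position, item in enumerate(items, start=1):
        label = item.findtext("label")
        if label is None:
            # No <label> in the XML: generate one from list-type.
            if list_elem.get("list-type") == "order":
                label = "{}.".format(position)
            else:
                label = "\u2022"  # bullet
        labels.append(label)
    return labels


# Labels omitted: numbering is generated and always restarts at 1.
GENERATED = (
    '<list list-type="order">'
    "<list-item><p>First</p></list-item>"
    "<list-item><p>Second</p></list-item>"
    "</list>"
)

# Labels retained: the second half of a discontinuous list keeps 5 and 6.
CONTINUED = (
    '<list list-type="order">'
    "<list-item><label>5.</label><p>Fifth</p></list-item>"
    "<list-item><label>6.</label><p>Sixth</p></list-item>"
    "</list>"
)

generated_labels = item_labels(ET.fromstring(GENERATED))
continued_labels = item_labels(ET.fromstring(CONTINUED))
```

Without the tagged labels, the continued list would be numbered 1 and 2, which is exactly the discontinuous-list problem described above.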

Interestingly, while there is a high correlation between users who drop reference section PCDATA and those who drop list labels, there is less correlation between those who keep reference section PCDATA and those who keep list labels. From this distinction, it is clear that publishers are treating generated text differently for different elements rather than taking an all-or-nothing approach.

Production Implications

Composition

As noted above, more than a half dozen different composition systems are used to create PDF files.**** In most cases (those using Typefi being the largest exception), the composition system was selected before the NLM DTD was selected, and then the NLM DTD XML workflow was retrofitted to the composition system. In other words, the workflow with XML was set up to match an existing composition engine, instead of a composition engine being chosen because it would most effectively work with an XML workflow.

Additionally, it can be noted that almost all users of 3B2 are either publishers that outsource composition or suppliers that provide composition services to a number of publishers. On the other hand, InDesign, InDesign with Typefi, or simple PDF creation from Word are the most common systems to create PDFs for publishers that do their composition in-house.

Regardless of whether the composition system was selected prior to the NLM DTD selection, the composition system in use has some bearing on how the XML is created. As noted above, selections related to table and math models, in particular, are often driven by the limitations of specific composition applications.

Online Hosting

Most of the publishers shown in Table 2 host their own content rather than using a third-party hosting service. This may be an indication that those who are willing to create XML in-house are also more likely to self-host; in other words, they tend to be more hands-on publishers.

About half of the publishers shown in Table 2 also deliver content to PubMed Central (PMC), and in some cases, the decision to move to an XML workflow was driven specifically by a desire or requirement of the publisher to deliver content to PMC.

While PMC does have both requirements and preferences for acceptance of XML, clearly some flexibility is granted to publishers in how they set up their XML. For example, though most publishers submitting to PMC drop list labels, a few retain the labels. Similarly, most publishers use HTML tables, but a few submitting to PMC use CALS tables.

Conclusions

The NLM DTD has become the de facto XML standard for full-text markup of journal content. The acceptance of the Tag Suite by a large number of publishers has been driven by several factors. However, the most important may well be that the NLM DTD allows for flexibility in the tagging of XML content to match each publisher’s specific business and production needs rather than requiring that all publishers follow a one-size-fits-all model of XML tagging.

Although the requirement for flexibility was originally driven by the needs of the Archiving and Interchange Tag Set, the same structures have also allowed for flexibility in the use of all Tag Sets within the Tag Suite, and this flexibility has been a significant factor in the Tag Suite's wide adoption by publishers and suppliers.

As an additional benefit, the broad acceptance of the freely available and easily modifiable Tag Suite, and the growing availability of tools tailored to the Tag Suite, has lowered the cost of entry to XML workflows for all publishers and enabled the use of XML by many small and medium-sized publishers that could not have afforded such workflows as recently as five years ago.

Acknowledgements

The author extends his thanks to his colleagues at Inera, Igor Kleschevich, Liz Blake, and Nathan Day, for helping to create so many unique configurations using the Journal Article Tag Suite, and to Evan Owens and Irina Golfman for their roles in engaging the author in this project so many years ago.

References

1. Tag Suite Versions. http://dtd.nlm.nih.gov/#id48886. Accessed on October 1, 2010.
2. PMC Overview. http://www.ncbi.nlm.nih.gov/pmc/about/intro.html. Accessed on October 1, 2010.
3. Beck J. Report from the Field: PubMed Central, an XML-based Archive of Life Sciences Journal Articles. Presented at International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Montréal, Canada, August 2, 2010. In Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML. Balisage Ser Markup Tech 2010; 6: doi:10.4242/BalisageVol6.Beck01.
4. Harvard University Library. Report on the Planning Year Grant for the Design of an E-journal Archive. http://www.diglib.org/preserve/harvardfinal.pdf. Accessed on October 1, 2010.
5. Cantara L, ed. Archiving Electronic Journals. http://www.diglib.org/preserve/ejp.htm. Accessed on October 1, 2010.
6. Rosenblum B, Golfman I. E-Journal Archive DTD Feasibility Study. 2001. http://www.diglib.org/preserve/hadtdfs.pdf. Accessed on October 1, 2010.
7. Lindberg DAB, Humphreys BL. Rising Expectations: Access to Biomedical Information. Yearb Med Inform 2008; 3:165-172. [PMC free article: PMC2441483] [PubMed: 18587496]
8. eXtyles Product Information. http://www.inera.com/extylesinfo.shtml. Accessed on October 1, 2010.
9. MathML in HTML5: Internet Explorer 9 is Broken. http://www.dessci.com/en/products/mathplayer/tech/MathMLinHTML5.htm. Accessed on October 1, 2010.
10. Bernickus B, et al. Tag by Tag: The Elsevier DTD 5 Family of XML DTDs. http://www.elsevier.com/framework_authors/DTDs/ja50_tagbytag5-v1.1.pdf, page 421. Accessed on October 1, 2010.
11. DeRose S. The SGML FAQ Book. 1997. Kluwer Academic Publishers: Norwell, MA.

Footnotes

* Interestingly, the Publishing (Blue) Tag Set has become somewhat less restrictive since 2003, such that several publishers that opted to use the Archive (Green) Tag Set in earlier years could now work effectively with the Blue Tag Set if they wanted to switch, but following the old adage “if it ain’t broke, don’t fix it,” they have opted not to switch DTDs. In NLM DTD Working Group meetings, this tendency to make the Publishing Tag Set less restrictive has become known as the “greenification of blue.”

** The NCBI Book Tag Set was developed specifically to tag books published by NCBI, and it has not seen the same degree of document analysis go into its development as the Journal Tag Sets have.

*** The reasons for using boilerplate text are explored in the E-Journal Archive DTD Feasibility Study [6] and will not be repeated here.

**** Note that XyEnterprise XPP is not included in this table, but only because none of Inera’s customers use XPP with the NLM DTD, although Inera is familiar with a number of organizations that use XPP to typeset XML from 12083-derived DTDs. QuarkXPress is notably absent from the table both because none of Inera’s customers use it to typeset XML according to the NLM DTD and because Inera is not familiar with any scholarly publishers that are typesetting XML directly with QuarkXPress.

Copyright 2010 by Bruce D. Rosenblum.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License

Bookshelf ID: NBK47101
