Whether a publisher opts for the Archiving and Interchange, Journal Publishing, or NCBI Book Tag Set, there are a number of choices that publishers can make in applying the selected tag set that result in different, but valid, XML.
Special Character Encoding
XML allows several forms of special character encoding. Regardless of the value of the encoding attribute in the XML declaration, special characters can be represented as either Unicode entities (e.g., “&#x03B2;”) or ISO entities (e.g., “&beta;”). Special characters can also be represented in the native encoding (e.g., UTF-8), though native encoding has not been used in any of the implementations shown in Tables 1 and 2. Each of these representations has advantages and disadvantages:
UTF-8 is the most compact encoding and is fully compatible with modern web browsers (which avoids extra transforms for conversion to HTML for the web), but is not “human readable” when XML files are viewed in a text editor. UTF-8 encoding also means that a file is binary rather than text format, which can make it more difficult to use standard text differencing applications as part of quality assurance.
Unicode entities are fully compatible with modern web browsers and permit the file to be text format rather than binary. However, like UTF-8, Unicode entities are not “human readable” when XML files are viewed in a text editor.
ISO entities are “human-readable” in a text editor (relatively speaking). However, they are not compatible with all browsers.
Most of the implementations completed by Inera use Unicode entities. Interestingly, most of the users of ISO entities started using the NLM DTD in earlier years, although we are not aware of any specific reason for the shift toward Unicode entities in later years.
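The equivalence of the three representations can be checked with a short script. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the sample text is invented, and the ISO entity is declared in an inline DOCTYPE here only so that the example is self-contained (in practice it would come from the entity sets referenced by the DTD).

```python
import xml.etree.ElementTree as ET

# Three ways to write the same paragraph (the sample text is made up).
numeric_ref = '<p>&#x03B2;-catenin</p>'          # Unicode (numeric) entity
iso_entity = ('<!DOCTYPE p [<!ENTITY beta "&#x03B2;">]>'
              '<p>&beta;-catenin</p>')           # ISO entity, declared inline
native_char = '<p>\u03b2-catenin</p>'            # character in native encoding

# After parsing, all three yield the same text content.
parsed = [ET.fromstring(src).text for src in (numeric_ref, iso_entity, native_char)]
all_equal = len(set(parsed)) == 1

# On output, the chosen encoding decides which representation is written.
elem = ET.fromstring(native_char)
as_ascii = ET.tostring(elem, encoding="us-ascii")  # non-ASCII becomes a numeric reference
as_utf8 = ET.tostring(elem, encoding="utf-8")      # non-ASCII is written as UTF-8 bytes
```

Because a parser resolves all three forms to the same character, the choice mainly affects file size, readability in a text editor, and downstream tools that read the file without parsing it.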
Table Tagging
XHTML is the default model for tagging tables in the NLM DTD. However, the OASIS CALS model is also supported and can be added to the standard tag set releases by changing only about six lines in the selected DTD.
If the CALS model requires modifying the DTD, why do some users of the NLM DTD prefer it to the XHTML model? There are several reasons:
The CALS model supports features not supported in XHTML, specifically tagging of table and cell border information and table groups where the number of columns changes from one part of a table to another.
Adobe InDesign (CS3 and later) includes native support to import and export CALS tables but not XHTML tables. The same is true of FrameMaker.
Most users of the 3B2 composition system appear to prefer the CALS table model to XHTML.
CALS tables, of course, have to be converted to XHTML for web rendering.
Though a key goal of XML is to have content tagged such that it is independent of the rendering application, it appears that many publishers have opted for CALS tables to allow for simpler and/or more flexible PDF creation through traditional composition applications that have internal biases towards the CALS table model.
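To illustrate the conversion step that CALS tables require for web rendering, here is a minimal sketch in Python. It assumes a single tgroup with plain-text cells and ignores colspec, spans, and border attributes; the function name cals_to_xhtml and the sample table are hypothetical and not taken from any implementation discussed here.

```python
import xml.etree.ElementTree as ET

# Invented sample table in the CALS model.
CALS = """<table frame="all">
  <tgroup cols="2">
    <thead><row><entry>Gene</entry><entry>Count</entry></row></thead>
    <tbody><row><entry>BRCA1</entry><entry>12</entry></row></tbody>
  </tgroup>
</table>"""

def cals_to_xhtml(cals_table):
    """Map tgroup/row/entry to XHTML thead/tbody/tr/th/td.

    Only the first tgroup is handled; spans and borders are dropped.
    """
    html = ET.Element("table")
    tgroup = cals_table.find("tgroup")
    if tgroup is None:
        return html
    for section in ("thead", "tbody"):
        src = tgroup.find(section)
        if src is None:
            continue
        dst = ET.SubElement(html, section)
        cell_tag = "th" if section == "thead" else "td"
        for row in src.findall("row"):
            tr = ET.SubElement(dst, "tr")
            for entry in row.findall("entry"):
                ET.SubElement(tr, cell_tag).text = entry.text
    return html

xhtml_text = ET.tostring(cals_to_xhtml(ET.fromstring(CALS)), encoding="unicode")
```

A production conversion would also need to translate column spans, border information, and multiple tgroups, which are the CALS features that have no direct XHTML equivalent.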
Math Handling
The NLM DTD permits math to be tagged in a variety of ways, including MathML, TeX, and inclusion of graphic files rather than tagged math.
MathML has the advantage that it is native XML and can be used to render math in a variety of environments. However, native browser support is limited, with good support in Firefox, limited support in Safari, and no support in Microsoft Internet Explorer [9].
Because of limited browser support for MathML, many publishers, especially those that have only infrequent display equations, have opted to handle all display math as images. When math is infrequent, graphics are certainly the path of least resistance, as a single format that will work for print/PDF composition and web delivery.
Even for math-intensive publishers, the selection of a composition engine sometimes drives the selection of a math model. For example, virtually all publishers that use InDesign handle math as graphics because InDesign does not have native support to render MathML.
One supplier that uses InDesign has opted to include both graphics and MathML for all equations, using graphics for InDesign composition, and MathML for customer deliveries of final XML. This combination can also aid in delivery requirements for some publishers. For example, Elsevier requires all display math in both MathML and graphic format [10].
Those organizations that use MathML tend to typeset with applications such as 3B2, AntennaHouse, or FrameMaker, or they create PDF files from Word and do not typeset from the MathML.
A few organizations that use 3B2 prefer TeX instead of MathML. This may be because TeX is the native rendering system for math in 3B2, so TeX markup avoids an extra conversion. However, TeX must be marked as CDATA within <tex-math>.
So, as with tables, the selection of a math model appears to be driven largely by the requirements of specific composition applications.
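The CDATA requirement for TeX noted above can be illustrated with a short sketch. It uses Python's standard xml.dom.minidom module, which can write CDATA sections; the formula is invented, and wrapping <tex-math> in <disp-formula> is an assumption about typical usage.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Invented TeX source; "<" and "&" would be markup characters in XML.
tex_source = r"x < y \& z"

doc = minidom.Document()
formula = doc.createElement("disp-formula")
tex_math = doc.createElement("tex-math")
# A CDATA section carries the TeX through without escaping it.
tex_math.appendChild(doc.createCDATASection(tex_source))
formula.appendChild(tex_math)
doc.appendChild(formula)

xml_text = formula.toxml()

# Parsing the result returns the TeX unchanged.
round_trip = ET.fromstring(xml_text).findtext("tex-math")
```

The TeX string survives serialization and parsing character for character, which is what a composition system that reads the TeX directly needs.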
Generated and Boilerplate Text
We use the term Generated Text to mean inconsequential, formulaic, or stereotypical text, punctuation, and formatting omitted from an XML file, which is applied to content by a style sheet when an XML file is rendered. The style sheet generates this text and visual formatting based on the structural information provided by the markup elements and attributes.
We use the term Boilerplate Text for the opposite scenario, i.e., inconsequential, formulaic, or stereotypical text, punctuation, and formatting that could have been omitted but which the publisher has chosen to keep in the XML file rather than to generate with a style sheet.
SGML and XML have always been about structure rather than formatting. Steve DeRose commented, “Strong separation of formatting from structure is the hallmark of good SGML use” [11], and many people followed this reasoning by keeping any such formatting out of their tagged content. However, others decided to rely less on style sheets and more on boilerplate text.
The NLM DTD is flexible and permits users to work with Generated or Boilerplate Text. The degree to which this is allowed varies from one tag set to the next, with the Archiving and Interchange Tag Set allowing the greatest degree of Boilerplate Text, especially when using the <x> element, which is not available in the Journal Publishing Tag Set.
Flexibility around the use of Generated versus Boilerplate Text may well be one reason the NLM DTD has been so widely adopted. As we will see in the next two subsections, there is wide variation in how publishers have chosen to approach this issue.
Reference Tagging and PCDATA
The NLM DTD has had several models for tagging references. Versions 1.0 through 2.3 had the <citation> and <nlm-citation> elements, where the former allowed tags in any order and permitted Parsed Character Data (PCDATA) such as punctuation and text (e.g., “pp.” before page ranges) between elements, and the latter had a prescribed element order and did not permit PCDATA.
Few were happy with this model, in part because there was not a way to have elements in any order while restricting the use of PCDATA unless the DTD was modified for local use. Version 3 dealt with this matter by eliminating <citation>, deprecating <nlm-citation>, and adding two new elements, <mixed-citation> and <element-citation>, that better addressed the needs of users.
Three-fourths of the users shown in Tables 1 and 2 have opted to keep PCDATA in references, including all of the suppliers, using <citation> or <mixed-citation>, while only one-fourth of users drop the PCDATA.
For suppliers, this is a logical choice because they typically service multiple publishers, each with its own reference editorial style. By keeping the PCDATA and element order intact in the XML, less template development work is necessary in their composition systems.
For publishers, the choice to retain or drop PCDATA could go either way. However, because many of the publishers surveyed do both composition and online hosting in-house, they may have decided that it is easier to keep the PCDATA than to develop two different rendering templates, one for PDF creation and one for online presentation. More research would be necessary to determine whether this is a reason that publishers have opted to retain PCDATA in references.
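The trade-off between retaining and dropping PCDATA can be sketched in a few lines of Python. The reference is invented, and the function names and the punctuation pattern in render_element are hypothetical stand-ins for one publisher's editorial style.

```python
import xml.etree.ElementTree as ET

# Boilerplate approach: punctuation is kept as PCDATA between elements.
MIXED = ("<mixed-citation><source>Journal of Examples</source>. "
         "<year>2008</year>;<volume>205</volume>:"
         "<fpage>11</fpage>-<lpage>19</lpage>.</mixed-citation>")

# Generated approach: elements only; punctuation comes from the style sheet.
ELEMENT = ("<element-citation><source>Journal of Examples</source>"
           "<year>2008</year><volume>205</volume>"
           "<fpage>11</fpage><lpage>19</lpage></element-citation>")

def render_mixed(ref):
    """Rendering is trivial: concatenate text and tails in document order."""
    return "".join(ref.itertext())

def render_element(ref):
    """Each output format needs a template like this per editorial style."""
    get = ref.findtext
    return (f"{get('source')}. {get('year')};{get('volume')}:"
            f"{get('fpage')}-{get('lpage')}.")

mixed_out = render_mixed(ET.fromstring(MIXED))
element_out = render_element(ET.fromstring(ELEMENT))
```

Both calls produce the same string, but only render_element has to know the house style. A supplier serving many publishers would need one such template per style and per output format, which is consistent with the preference for retained PCDATA described above.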
List Labels
The NLM DTD uses the list-type attribute to encode whether a list is bulleted, ordered (Arabic numbered), alphabetic, or Roman numbered. In most applications this value, combined with a style sheet, should permit appropriate rendering of list item labels. However, almost half of the publishers surveyed keep the content of the list labels in a <label> element at the start of each list-item.
One place where keeping a <label> element is helpful is when using the NCBI Book Tag Set. Occasionally, books (at least more frequently than journals) will have discontinuous numbered lists — e.g., a list with items 1 through 4, several paragraphs of text that are not part of the list, and then a continuation with items 5 through 7. In this situation, where the second list starts with item 5, a simple ordered attribute is insufficient to correctly present the list.
In other cases, publishers have opted to keep the <label> element, regardless of the DTD used, to make the style sheet simpler for print. If the label is included, no style information need be set up.
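The difference between generated and retained list labels can be sketched as follows. This is a minimal Python illustration; the function name item_labels is hypothetical, only two list-type values are handled, and the sample list represents the continuation of a discontinuous list that resumes at item 5.

```python
import xml.etree.ElementTree as ET

# Continuation of a discontinuous list: explicit labels carry the numbering.
LABELED = ('<list list-type="order">'
           '<list-item><label>5.</label><p>Fifth step</p></list-item>'
           '<list-item><label>6.</label><p>Sixth step</p></list-item>'
           '</list>')

# The same list without labels: numbering must be generated.
UNLABELED = ('<list list-type="order">'
             '<list-item><p>Fifth step</p></list-item>'
             '<list-item><p>Sixth step</p></list-item>'
             '</list>')

def item_labels(list_elem):
    """Use <label> when present; otherwise generate one from list-type."""
    generators = {
        "order": lambda n: f"{n}.",
        "bullet": lambda n: "\u2022",
    }
    generate = generators.get(list_elem.get("list-type"), lambda n: "")
    labels = []
    for n, item in enumerate(list_elem.findall("list-item"), start=1):
        explicit = item.findtext("label")
        labels.append(explicit if explicit is not None else generate(n))
    return labels

kept = item_labels(ET.fromstring(LABELED))
generated = item_labels(ET.fromstring(UNLABELED))
```

With labels retained, the list renders as items 5 and 6; with generated labels, the style sheet restarts at 1 because list-type alone carries no starting number.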
Interestingly, while there is a high correlation between users who drop reference section PCDATA and those who drop list labels, there is less correlation between those who keep reference section PCDATA and those who keep list labels. This distinction suggests that publishers are deciding on generated text element by element rather than taking an all-or-nothing approach.